LLMs in Production, Part 1 — The Honest Stack
The conference-talk version of an LLM product is a single call to GPT-4 with a clever prompt. The production version is fifteen moving pieces of infrastructure, three providers, two databases, and a retry policy that took six weeks to tune. This series is about the second one.
The pieces nobody mentions
Building anything past a demo means owning, in some form:
- A prompt registry. Prompts are code. They get versions, they get reviewed, they get rolled back. If your prompts live as string literals in your application code, you have a junior-engineer problem dressed up as a technical decision.
- An evaluation suite. Without it, every prompt change is a guess. With it, you can ship 20 prompt iterations a week and know which ones regressed accuracy.
- A response cache. Same question, same answer, at a fraction of the cost of a fresh call. In our experience this is the number-one cost-reduction lever for almost every production LLM app.
- A rate limiter on the inbound side. Not because users will abuse you. Because OpenAI's rate limits don't care about your traffic shape; you need to be the one shaping it.
- A fallback policy. Provider A goes down once a quarter. Provider B exists. Your code needs to know which prompt template works on each of them, because a prompt tuned for one provider rarely transfers unchanged.
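To make the cache and fallback bullets concrete, here is a minimal sketch of how the two can sit in one client. Everything in it is an assumption for illustration: the `Provider` callable type, the `CachedFallbackClient` class, the `{question}` placeholder, and the in-memory dict standing in for a real cache store. The cache key includes the prompt version so that a prompt change never serves stale answers.

```python
import hashlib
import json
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a provider is any callable that takes a prompt string
# and returns a completion string, raising an exception on failure.
Provider = Callable[[str], str]


def cache_key(prompt_version: str, user_input: str) -> str:
    """Build a stable key from the prompt version and the normalized input."""
    # Lowercase and collapse whitespace so trivially different phrasings hit.
    normalized = " ".join(user_input.lower().split())
    payload = json.dumps({"v": prompt_version, "q": normalized}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedFallbackClient:
    """In-memory response cache in front of an ordered list of providers."""

    def __init__(self, providers: List[Tuple[str, Provider, str]]):
        # Each entry: (provider name, callable, prompt template for that provider).
        # Templates are assumed to contain a {question} placeholder.
        self.providers = providers
        self.cache: Dict[str, str] = {}
        self.calls = 0  # provider calls attempted, including failed ones

    def complete(self, prompt_version: str, user_input: str) -> str:
        key = cache_key(prompt_version, user_input)
        if key in self.cache:
            return self.cache[key]

        last_error = None
        for _name, call, template in self.providers:
            self.calls += 1
            try:
                answer = call(template.format(question=user_input))
            except Exception as exc:  # provider down, timeout, rate limit...
                last_error = exc
                continue
            self.cache[key] = answer
            return answer
        raise RuntimeError("all providers failed") from last_error
```

A production version would likely add a TTL, an external store, and per-provider retry limits; the point here is only that the template travels with the provider and the prompt version travels with the cache key.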
Where most teams underspend
In our experience, two areas:
Eval coverage
Teams ship a dozen prompts to production with single-digit eval examples each. Then a model upgrade breaks half of them silently because the eval set didn't include the edge cases that mattered. The minimum bar is 50 examples per prompt, half adversarial.
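The minimum bar above is easy to enforce mechanically. The following is a sketch under assumptions, not a full eval framework: `EvalExample`, `coverage_problems`, and `accuracy` are hypothetical names, and exact string match stands in for whatever scoring your prompts actually need.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalExample:
    text: str                  # input sent to the prompt
    expected: str              # expected output (exact match, by assumption)
    adversarial: bool = False  # True for edge cases and hostile inputs


def coverage_problems(
    examples: List[EvalExample],
    min_examples: int = 50,
    min_adversarial_share: float = 0.5,
) -> List[str]:
    """Return human-readable reasons the eval set is below the bar."""
    problems = []
    if len(examples) < min_examples:
        problems.append(f"only {len(examples)} examples, need {min_examples}")
    adversarial = sum(1 for e in examples if e.adversarial)
    if examples and adversarial / len(examples) < min_adversarial_share:
        problems.append(
            f"{adversarial} of {len(examples)} examples are adversarial, "
            f"need a share of at least {min_adversarial_share:.0%}"
        )
    return problems


def accuracy(examples: List[EvalExample], run_prompt: Callable[[str], str]) -> float:
    """Fraction of examples where the prompt's output matches exactly."""
    if not examples:
        return 0.0
    passed = sum(1 for e in examples if run_prompt(e.text).strip() == e.expected)
    return passed / len(examples)
```

Running `coverage_problems` in CI for every prompt, and failing the build on a non-empty result, is one way to keep single-digit eval sets from reaching production.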
Observability of inputs, not just outputs
You instrument latency, token cost, and response status. You don't instrument the input distribution — and that's where drift hides. Six months in, your users' phrasing changes; your prompt was tuned for the original distribution; quality degrades; you have no signal.
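One cheap way to get a signal on input drift is to compare a simple feature of recent inputs against a baseline captured when the prompt was tuned. The sketch below uses input length in words as that feature and the Population Stability Index (PSI) as the distance. Both are swapped-in choices for illustration, not something this series prescribes: the bucket edges are arbitrary, and length is only a crude proxy for phrasing (embedding clusters or topic labels would catch more).

```python
import math
from collections import Counter
from typing import Iterable, List, Sequence

# Assumed bucket edges for input length, in whitespace-separated words.
BUCKET_EDGES = (5, 15, 40, 100)


def bucket(text: str) -> int:
    """Map an input to the index of its length bucket."""
    n_words = len(text.split())
    for i, edge in enumerate(BUCKET_EDGES):
        if n_words <= edge:
            return i
    return len(BUCKET_EDGES)


def distribution(texts: Iterable[str], smoothing: float = 1.0) -> List[float]:
    """Smoothed share of inputs per bucket, so no bucket is ever zero."""
    counts = Counter(bucket(t) for t in texts)
    n_buckets = len(BUCKET_EDGES) + 1
    total = sum(counts.values()) + smoothing * n_buckets
    return [(counts.get(i, 0) + smoothing) / total for i in range(n_buckets)]


def psi(baseline: Sequence[float], current: Sequence[float]) -> float:
    """Population Stability Index between two bucketed distributions."""
    return sum((c - b) * math.log(c / b) for b, c in zip(baseline, current))
```

In use, you would store `distribution(...)` for the inputs the prompt was tuned on, recompute it over a rolling window of production inputs, and alert when `psi` crosses a threshold. A value around 0.25 is a commonly quoted rule of thumb for a significant shift, but treat it as a starting point to calibrate against your own traffic.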
What we're going to cover
Part 2 of this series gets into the trade-offs between RAG, fine-tuning, and prompt-only approaches, and when each is the right choice. A third piece, published as part of a separate series, will get into multi-agent orchestration, but that's a different mess.
The goal of this series isn't to be exhaustive. It's to short-circuit some of the expensive lessons we paid for so you don't have to.


