LLMs in Production, Part 1 — The Honest Stack

What an LLM-powered product actually looks like once you stop reading announcements and start shipping. Architecture, costs, failure modes, and the trade-offs nobody markets.

Anouar, Founder & lead writer
2026-05-24 · 9 min read

The conference-talk version of an LLM product is a single call to GPT-4 with a clever prompt. The production version is fifteen pieces of moving infrastructure, three providers, two databases, and a retry policy that took six weeks to tune. This series is about the second one.

The pieces nobody mentions

Building anything past a demo means owning, in some form:

  • A prompt registry. Prompts are code. They get versions, they get reviewed, they get rolled back. If your prompts live as string literals in your application code, you have a junior-engineer problem dressed up as a technical decision.
  • An evaluation suite. Without it, every prompt change is a guess. With it, you can ship 20 prompt iterations a week and know which ones regressed accuracy.
  • A response cache. Same question, same answer, at a fraction of the cost of a fresh call. The number-one cost-reduction lever for almost every production LLM app.
  • A rate limiter on the inbound side. Not because users will abuse you. Because OpenAI's rate limits don't care about your traffic shape; you need to be the one shaping it.
  • A fallback policy. Provider A goes down once a quarter. Provider B exists. Your code needs to know which prompt template works for both.
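
To make three of these pieces concrete, here is a minimal sketch of a versioned prompt lookup, an exact-match response cache, and an ordered provider fallback. Everything named here is an assumption for illustration: `PROMPTS`, `CachedFallbackClient`, and the `Provider` callable signature are hypothetical, the registry and cache are in-memory dicts, and real provider SDK calls would be wrapped behind the callables.

```python
import hashlib
import json
from typing import Callable, Dict, List, Tuple

# Hypothetical provider interface: takes a rendered prompt, returns text.
Provider = Callable[[str], str]

# Hypothetical in-memory prompt registry keyed by (name, version).
# A real one would live in version control or a database.
PROMPTS: Dict[Tuple[str, int], str] = {
    ("summarize", 1): "Summarize in one sentence: {text}",
    ("summarize", 2): "Summarize in one plain sentence, no preamble: {text}",
}


class CachedFallbackClient:
    """Exact-match cache in front of an ordered list of providers."""

    def __init__(self, providers: List[Tuple[str, Provider]]):
        self.providers = providers          # tried in order
        self.cache: Dict[str, str] = {}     # cache key -> response text
        self.provider_calls = 0             # counts attempts, for inspection

    @staticmethod
    def _key(name: str, version: int, variables: Dict[str, str]) -> str:
        # The prompt version is part of the key, so a new prompt version
        # never serves answers generated by an old one.
        payload = json.dumps([name, version, variables], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def complete(self, name: str, version: int, variables: Dict[str, str]) -> str:
        key = self._key(name, version, variables)
        if key in self.cache:
            return self.cache[key]

        prompt = PROMPTS[(name, version)].format(**variables)
        errors = []
        for provider_name, call in self.providers:
            self.provider_calls += 1
            try:
                result = call(prompt)
            except Exception as exc:  # sketch only: narrow this in real code
                errors.append(f"{provider_name}: {exc}")
                continue
            self.cache[key] = result
            return result
        raise RuntimeError("all providers failed: " + "; ".join(errors))
```

This sketch sends the same template to every provider. If a template only works on one model, the registry key would need a provider dimension as well, and the fallback loop would render the prompt per provider.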

Where most teams underspend

In our experience, two areas:

Eval coverage

Teams ship a dozen prompts to production with single-digit eval examples each. Then a model upgrade breaks half of them silently because the eval set didn't include the edge cases that mattered. The minimum bar is 50 examples per prompt, half adversarial.
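
A rough sketch of what enforcing that bar could look like: a coverage check for the 50-example, half-adversarial minimum, a pass-rate calculation, and a regression gate against a stored baseline. The names (`EvalCase`, `coverage_problems`, `pass_rate`, `regressed`) and the 2-point tolerance are assumptions for illustration, and `model` stands in for whatever function runs your prompt.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class EvalCase:
    input_text: str
    check: Callable[[str], bool]  # True if the model output is acceptable
    adversarial: bool = False


def coverage_problems(
    cases: List[EvalCase],
    min_cases: int = 50,
    min_adversarial_share: float = 0.5,
) -> List[str]:
    """Return human-readable reasons the eval set is below the bar."""
    problems = []
    if len(cases) < min_cases:
        problems.append(f"only {len(cases)} cases, need {min_cases}")
    adversarial = sum(1 for case in cases if case.adversarial)
    if cases and adversarial / len(cases) < min_adversarial_share:
        problems.append(
            f"only {adversarial} of {len(cases)} cases are adversarial"
        )
    return problems


def pass_rate(model: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Fraction of cases whose output passes its check."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for case in cases if case.check(model(case.input_text)))
    return passed / len(cases)


def regressed(candidate: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True if the candidate pass rate fell below baseline minus tolerance."""
    return candidate < baseline - tolerance
```

Run in CI on every prompt change and every model upgrade, a gate like this turns the silent breakage described above into a failed build.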

Observability of inputs, not just outputs

You instrument latency, token cost, and response status. You don't instrument the input distribution — and that's where drift hides. Six months in, your users' phrasing changes; your prompt was tuned for the original distribution; quality degrades; you have no signal.
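
One way to get a signal is to log a cheap feature of every input and compare its distribution across time windows. The sketch below uses input length in words and the population stability index (PSI), a common drift score. The article does not prescribe a metric, so the feature, the bucket edges in `EDGES`, and the function names are all assumptions.

```python
import math
from bisect import bisect_right
from typing import List, Sequence

# Assumed bucket edges for input length, in whitespace-separated words.
EDGES = (5, 10, 20, 40, 80)


def length_histogram(texts: Sequence[str], edges: Sequence[int] = EDGES) -> List[float]:
    """Smoothed share of inputs falling into each length bucket."""
    counts = [0] * (len(edges) + 1)
    for text in texts:
        counts[bisect_right(edges, len(text.split()))] += 1
    total = len(texts)
    # Add-one smoothing keeps every share above zero so the log is defined.
    return [(count + 1) / (total + len(counts)) for count in counts]


def psi(baseline: Sequence[float], current: Sequence[float]) -> float:
    """Population stability index between two histograms of equal length."""
    return sum((c - b) * math.log(c / b) for b, c in zip(baseline, current))
```

A frequently quoted rule of thumb treats a PSI above 0.25 as a significant shift, but the threshold should be tuned on your own traffic. Length is only a proxy: phrasing can change while length stays flat, so in practice you would track several features, for example language, topic labels, or embedding clusters.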

What we're going to cover

Part 2 of this series gets into the trade-offs between RAG, fine-tuning, and prompt-only approaches, and when each is the right choice. Multi-agent orchestration is a different mess, so it gets its own piece in a separate series.

The goal of this series isn't to be exhaustive. It's to short-circuit some of the expensive lessons we paid for so you don't have to.
