LLMs in Production, Part 1: The Honest Stack

What an LLM-powered product actually looks like once you stop reading announcements and start shipping. The real stack, per layer, with the tools we have used and the cost numbers we have measured.

The conference-talk version of an LLM product is a single call to GPT-4 with a clever prompt. The production version is fifteen pieces of moving infrastructure, three providers, two databases, and a retry policy that took six weeks to tune. This series is about the second one.

In this piece: the seven layers every real production LLM system has, the tools we have used at each layer, and the numbers that matter for choosing between them. Part 2 (already published) walks through the RAG-vs-fine-tune-vs-prompt decision; Part 3 will cover evals. Read them in any order.

The pieces nobody mentions

A demo can be one API call. A production system needs, in some form:

A prompt registry
An evaluation suite with golden examples
A response cache
A model gateway with fallback
A rate limiter on the inbound side
An observability pipeline that captures inputs, not just outputs
A retry-and-timeout policy that does not amplify provider incidents

Each of these is a layer of the stack. Each has well-known tools you can buy, build, or get for free. Each has a wrong default that most teams reach for first.

Prompt registry

Prompts are code. They get versioned, reviewed, rolled back. They sit in your repo (or in a managed registry that gives you read-write APIs), and a change to one needs an eval re-run before it merges. If your prompts are string literals scattered across React components, you have a junior-engineer problem dressed up as a technical decision.

Free option: a YAML or Markdown directory in your repo, loaded at build time, versioned by Git. This is what we run for cybercloud.ma's own publishing pipeline. Eight prompts, all in one folder.
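A minimal sketch of the free option, assuming one Markdown file per prompt in a single folder. The Prompt class, the load_prompts function, and the idea of tagging each prompt with a short content hash are illustrative assumptions, not our pipeline's actual code.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Prompt:
    name: str     # file stem, e.g. "summarise" for summarise.md
    text: str     # full prompt body
    version: str  # short content hash, handy for tagging traces and eval runs


def load_prompts(directory: str) -> dict[str, Prompt]:
    """Load every *.md file in `directory` into a name -> Prompt mapping."""
    prompts = {}
    for path in sorted(Path(directory).glob("*.md")):
        text = path.read_text(encoding="utf-8").strip()
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        prompts[path.stem] = Prompt(name=path.stem, text=text, version=digest)
    return prompts
```

Git still owns the history; the hash only gives you something cheap to attach to a trace so you can tell which prompt text produced which output.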

Managed options that are worth knowing: Langfuse (open source, self-hostable, free tier), PromptLayer, Pezzo, Braintrust. The managed pitch is the live-editing UI where a non-engineer can iterate prompts; if no non-engineers will ever touch your prompts, save the money.

Evaluation suite

Without it, every prompt change is a guess. With it, you can ship 20 prompt iterations a week and know which ones regressed.

The minimum bar we enforce on serious projects: 50 examples per prompt, half adversarial. Adversarial means examples that broke an earlier version of the prompt or that mimic the actual failure modes we have seen in logs. We count a synthetic adversarial example (generated by another LLM) as worth about 30 percent of a real one; weight your set accordingly.

The eval-set rule of thumb that has saved us most often: any new edge case from production becomes a permanent eval example by the end of the week. The set never shrinks.
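One way to make "the set never shrinks" mechanical is an append-only file. The sketch below assumes a JSONL file with one case per line; the field names (prompt, input, criteria, source) and the add_eval_case helper are hypothetical.

```python
import json
from pathlib import Path


def add_eval_case(path: str, prompt_name: str, user_input: str,
                  criteria: str, source: str = "production") -> bool:
    """Append one case to a JSONL eval file; return False if it is already there."""
    file = Path(path)
    existing = set()
    if file.exists():
        for line in file.read_text(encoding="utf-8").splitlines():
            if line.strip():
                case = json.loads(line)
                existing.add((case["prompt"], case["input"]))
    if (prompt_name, user_input) in existing:
        return False  # duplicates are skipped, nothing is ever deleted
    record = {"prompt": prompt_name, "input": user_input,
              "criteria": criteria, "source": source}
    with file.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return True
```

Because the function only appends, the Git diff for the eval file doubles as a log of which production incidents made it into the set.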

Tools we have used and recommend for evals: Braintrust (paid, eval-first), Langfuse evals (open source), Arize Phoenix (open source, observability-adjacent), LangSmith (paid, LangChain ecosystem). For very small projects, a Python script with pytest parametrize and assert llm_output_passes(expected_criteria) is enough; the managed tools earn their keep above roughly 200 eval cases.
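For the small-project case, here is a dependency-free sketch of the same idea. It assumes a deliberately crude pass criterion (a required substring) and applies the 30 percent weight for synthetic examples mentioned earlier; EvalCase, weighted_pass_rate, and the model callable are illustrative names. Under pytest, each case would become one pytest.mark.parametrize entry instead.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    user_input: str
    must_contain: str        # hypothetical criterion: a substring the answer needs
    synthetic: bool = False  # True if the case was generated by another LLM


SYNTHETIC_WEIGHT = 0.3  # the rule of thumb from the text, not a measured constant


def weighted_pass_rate(cases: list[EvalCase], model: Callable[[str], str]) -> float:
    """Score a model over the cases, counting synthetic cases at reduced weight."""
    total = 0.0
    passed = 0.0
    for case in cases:
        weight = SYNTHETIC_WEIGHT if case.synthetic else 1.0
        total += weight
        if case.must_contain.lower() in model(case.user_input).lower():
            passed += weight
    return passed / total if total else 0.0
```

Compare the score before and after a prompt change; a drop is the regression signal, whatever the absolute number is.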

Response cache

Same question, same answer, 10 to 90 percent cheaper. The number-one cost-reduction lever for almost every production LLM app.

Two flavours to know:

Provider-side prompt caching. Anthropic gives you about 90 percent off the cached portion of a prompt once the first request has written it to the cache, for as long as it keeps being read within a 5-minute window. OpenAI gives you 50 percent off cached input. If you reuse the same system prompt or the same retrieved chunks across many calls, this is free money you should already be taking.
Your own semantic cache. A request comes in. You embed it, look up the closest match in a cache index, and if it is similar enough (cosine > 0.95 or so) you return the prior answer. Saves on full inference cost but introduces a quality risk. We have shipped this in production once and now have second thoughts; the failure mode is users seeing slightly-wrong cached answers and losing trust. Use it for high-volume, low-stakes endpoints (autocomplete, "did you mean") not for primary answers.
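The semantic-cache lookup described above can be sketched in a few lines. This version assumes you supply your own embed function (any callable returning a vector) and does a linear scan, which is only reasonable for a small cache; SemanticCache and its method names are hypothetical, and a real deployment would use a vector index.

```python
import math
from typing import Callable, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Sequence[float]],
                 threshold: float = 0.95):
        self.embed = embed          # assumption: caller provides the embedding model
        self.threshold = threshold  # the "cosine > 0.95 or so" cut-off from the text
        self.entries: list[tuple[Sequence[float], str]] = []

    def get(self, query: str) -> Optional[str]:
        """Return the closest cached answer, or None if nothing is similar enough."""
        vector = self.embed(query)
        best_score, best_answer = 0.0, None
        for cached_vector, answer in self.entries:
            score = cosine(vector, cached_vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score > self.threshold else None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))
```

The threshold is where the quality risk lives: lower it and the hit rate goes up along with the share of slightly-wrong answers.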

Model gateway with fallback

Provider A goes down approximately once per quarter. Sometimes for 20 minutes, sometimes for 4 hours. Provider B exists. Your prompt templates need to work on both, and your traffic needs to be able to flip without a deploy.

Tools we have used:

LiteLLM (open source proxy, very popular in 2026). You write your code against the OpenAI API shape and LiteLLM translates to Anthropic, Gemini, Bedrock, Together, and 90 other providers. Add a fallback policy in YAML.
Portkey (managed gateway). LiteLLM-style routing plus caching, retries, virtual keys, and a UI for ops.
Helicone (gateway + observability). Tied to its observability product.

For most teams new to multi-provider, LiteLLM in a Kubernetes pod (or as a library imported into your app) is the right starting point. Portkey and Helicone become worth it when the operational interface matters more than the cost.
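Stripped of the gateway product, the fallback behaviour is a loop over providers. The sketch below is not LiteLLM's API; it assumes each provider is wrapped in a plain callable that takes a prompt and returns text, and the names complete_with_fallback and AllProvidersFailed are made up for illustration.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has errored."""


def complete_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in order; return (provider_name, text)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # real code should catch the SDK's own error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

If the provider order is read from configuration rather than hard-coded, flipping traffic becomes a config change instead of a deploy.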

Inbound rate limiter

OpenAI's tier-based rate limits do not care about the shape of your traffic. You need to be the one shaping it. Otherwise the spiky 11am traffic from a marketing campaign rate-limits the steady traffic from your customer support queue.

The right pattern: queue requests with priority. Priority 1 (paying customer doing primary action) goes first. Priority 2 (background batch) gets deferred. Priority 3 (experimental endpoint nobody is watching) gets dropped above a threshold.

Tools: a Redis-backed queue you write yourself takes about 80 lines of Python and works at most scales. BullMQ for Node. RQ or Celery for Python if you already run one. Gateway-side: LiteLLM and Portkey both ship with queue + rate-limit features.
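The three-priority pattern can be shown with an in-memory heap. This is a single-process sketch, not the Redis-backed version; the class name, the drop_p3_above parameter, and its default are assumptions chosen for illustration.

```python
import heapq
import itertools
from typing import Any, Optional


class PriorityRequestQueue:
    def __init__(self, drop_p3_above: int = 100):
        self._heap: list[tuple[int, int, Any]] = []
        self._counter = itertools.count()   # tie-breaker: FIFO within a priority
        self.drop_p3_above = drop_p3_above  # queue depth at which priority 3 is shed

    def submit(self, priority: int, request: Any) -> bool:
        """Enqueue a request; return False if it was dropped."""
        if priority >= 3 and len(self._heap) >= self.drop_p3_above:
            return False
        heapq.heappush(self._heap, (priority, next(self._counter), request))
        return True

    def next_request(self) -> Optional[Any]:
        """Pop the most urgent request, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

A worker that calls next_request at a rate below your provider limit is what actually shapes the traffic; the queue only decides who goes first.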

Observability that captures inputs

You instrument latency, token cost, and response status. You do not instrument the input distribution, and that is where drift hides. Six months in, your users' phrasing changes, your prompt was tuned for the original distribution, quality degrades, and you have no signal.

Concrete example: we shipped an internal assistant calibrated on questions of the form "How do I do X?". Two months in, users had figured out the assistant was reliable and started asking "Should we do X or Y?". Open-ended deliberation questions were a different distribution, the prompt was not built for them, answers got mealy. We only noticed because a sceptical user complained; by then 8 percent of traffic was this new shape. With proper input observability we would have known in two days.

Tools: Langfuse (open source, our default), Helicone, LangSmith, Arize Phoenix. All of them can capture full input traces with metadata. The work you have to do is to decide which fields to attach: user tier, calling endpoint, session ID, retrieved chunk count, model version, prompt version, latency budget, retry count. A trace without metadata is just a log line you cannot query.
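Whichever tool you pick, it helps to pin the metadata down as a type so that no call site forgets a field. A sketch, assuming a plain JSON-lines sink rather than any particular vendor SDK; the field names and units are our guesses at a reasonable schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TraceMetadata:
    user_tier: str
    endpoint: str
    session_id: str
    retrieved_chunk_count: int
    model_version: str
    prompt_version: str
    latency_budget_ms: int  # assumption: budget expressed in milliseconds
    retry_count: int = 0


def trace_record(user_input: str, output: str, meta: TraceMetadata) -> str:
    """Serialise one call as a JSON line: the input, the output, queryable fields."""
    payload = {"input": user_input, "output": output, **asdict(meta)}
    return json.dumps(payload, ensure_ascii=False)
```

Storing the input next to the metadata is the point: it is what lets you ask, months later, how the question mix on one endpoint has shifted.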

Retry and timeout policy that does not amplify incidents

When a provider goes degraded (slow but not down), naive retry logic turns one slow call into three slow calls and a thundering herd. We have learned this twice the hard way.

Rules we now run with:

Per-call timeout: 30 seconds. Higher and you are wasting users' patience and your own latency budget.
Retry count: 2. Three or more retries on a degraded provider turn a 10-percent error rate into a 30-percent retry storm.
Exponential backoff with jitter. Base delays of 1s, 2s, 4s, each randomised by plus-or-minus 30 percent; with a retry count of 2, only the first two are ever used. The jitter is what stops you from synchronising your retries with everyone else's.
Circuit breaker on provider. After 20 percent error rate for 60 seconds, stop calling the provider for 2 minutes and flip to fallback. This is the policy that has saved us from accidentally taking ourselves down with a provider outage.
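The retry and backoff rules translate into very little code. The sketch below assumes the wrapped call enforces its own 30-second timeout (for example through the HTTP client's timeout setting) and retries on any exception, which real code should narrow to retryable errors; backoff_delays and call_with_retries are hypothetical names.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int = 2, base: float = 1.0, jitter: float = 0.3,
                   rng: Optional[random.Random] = None) -> list[float]:
    """Delays of base * 2**n seconds, each scaled by a random factor in 1 +/- jitter."""
    rng = rng or random.Random()
    return [base * (2 ** attempt) * rng.uniform(1 - jitter, 1 + jitter)
            for attempt in range(retries)]


def call_with_retries(call: Callable[[], T], retries: int = 2,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, retrying up to `retries` times with jittered backoff."""
    delays = backoff_delays(retries)
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(delays[attempt])
    raise AssertionError("unreachable")  # the loop always returns or raises
```

Passing sleep in as a parameter is a small design choice that keeps the function testable without waiting on real time.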

Most HTTP clients and provider SDKs (axios, httpx, the Anthropic SDK's built-in retry) cover timeouts and, natively or through a plugin, retries with backoff; the circuit breaker is the part you usually need to add yourself.
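Since the circuit breaker is the piece you are most likely to write by hand, here is a minimal sketch. It reads the rule as "error rate of at least 20 percent over a rolling 60-second window, then stay open for 2 minutes"; the min_calls guard against tripping on a tiny sample is our addition, and the class and method names are hypothetical.

```python
import time
from collections import deque
from typing import Callable


class CircuitBreaker:
    def __init__(self, error_threshold: float = 0.20, window_s: float = 60.0,
                 cooldown_s: float = 120.0, min_calls: int = 10,
                 clock: Callable[[], float] = time.monotonic):
        self.error_threshold = error_threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.min_calls = min_calls  # assumption: need this many calls before judging
        self.clock = clock
        self._events: deque[tuple[float, bool]] = deque()  # (timestamp, failed)
        self._open_until = 0.0

    def allow(self) -> bool:
        """True if calls to this provider are currently permitted."""
        return self.clock() >= self._open_until

    def record(self, failed: bool) -> None:
        """Record one call outcome and open the breaker if the window looks bad."""
        now = self.clock()
        self._events.append((now, failed))
        while self._events and self._events[0][0] < now - self.window_s:
            self._events.popleft()  # forget outcomes older than the window
        failures = sum(1 for _, was_failure in self._events if was_failure)
        if (len(self._events) >= self.min_calls
                and failures / len(self._events) >= self.error_threshold):
            self._open_until = now + self.cooldown_s
            self._events.clear()
```

Check allow() before each provider call and route to the fallback when it returns False; call record() after every attempt, successful or not.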

Where most teams underspend

In our experience, two areas.

Eval coverage

Teams ship a dozen prompts to production with single-digit eval examples each. Then a model upgrade (or, more painfully, a provider deprecation that forces a model swap) breaks half of them silently because the eval set did not include the edge cases that mattered.

The minimum bar is the 50-examples-per-prompt rule above. The better bar is: every production failure becomes an eval. Every reported bug becomes an eval. Within six months of the first deploy, your eval set is large enough that you trust it more than your gut about whether a change is safe.

Treating prompts as documentation

Prompts written for the model often double as the canonical statement of how the feature should behave. If your product manager wants to know "what does the system do when a customer asks X?", the prompt should be the authoritative answer.

This works if the prompt is readable, structured, and lives in version control with sensible diffs. It breaks down completely if the prompt is a 4,000-character string in a TypeScript file with no comments. Treat the prompt as a spec, write it like one, and review prompt changes as carefully as you review schema migrations.

What we are going to cover

Part 2, "RAG, Fine-tune, or Just Prompt?", goes into how to ground an LLM in your own data and the trade-offs between the three honest approaches. Part 3 (separate piece, not yet published) will cover evals: the three techniques we trust, why ROC/precision/recall lie about LLM quality, and how to run 1,000 evals in 90 seconds.

The goal of this series is not to be exhaustive. It is to short-circuit some of the expensive lessons we paid for, so you do not have to.