LLMs in Production, Part 2 — RAG, Fine-tune, or Just Prompt?
There are three honest ways to make an LLM useful with your data. Most teams pick the wrong one because the choice gets framed by tool vendors, not engineering trade-offs. Here's the actual decision tree.
The three approaches in one paragraph each
Prompt-only
You pass everything the model needs to know inside the prompt itself. Few-shot examples, instructions, context. No external retrieval, no model training.
Best for: stable knowledge that fits in context, low-volume use cases, prototypes, anything where the "knowledge" is really just behavior shaping.
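To make the prompt-only approach concrete, here is a minimal sketch of assembling instructions, few-shot examples, and the user's question into one prompt string. The function name `build_prompt`, the `Input:`/`Output:` layout, and the ticket examples are illustrative assumptions, not a format any particular model requires.

```python
def build_prompt(instructions, examples, question):
    """Assemble a prompt-only request: instructions, few-shot examples, then the question."""
    parts = [instructions.strip(), ""]
    for user_input, ideal_output in examples:
        # Each few-shot example shows the model one input and the answer we want for it.
        parts.append(f"Input: {user_input}")
        parts.append(f"Output: {ideal_output}")
        parts.append("")
    # The real question goes last, with the output left open for the model to complete.
    parts.append(f"Input: {question}")
    parts.append("Output:")
    return "\n".join(parts)


# Hypothetical example data, for illustration only.
prompt = build_prompt(
    "Classify each support ticket as 'billing', 'bug', or 'other'.",
    [
        ("I was charged twice this month.", "billing"),
        ("The export button crashes the app.", "bug"),
    ],
    "How do I change my invoice address?",
)
print(prompt)
```

Everything the model knows about the task lives in that one string, which is why this approach stops scaling once the knowledge no longer fits in context.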
RAG (Retrieval-Augmented Generation)
You keep your data in a vector store (or full-text index, or both). On each query, you retrieve relevant chunks and stuff them into the prompt before the model answers.
Best for: large, changing knowledge bases; cases that need citations or an audit trail; situations where the cost of retraining outweighs the cost of retrieval.
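The retrieve-then-stuff loop can be sketched in a few lines. This toy version scores chunks with bag-of-words cosine similarity from the standard library; a real system would more likely use embeddings and a vector store, so treat the scoring as a stand-in. The names `retrieve`, `build_rag_prompt`, and the sample chunks are assumptions made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    """Return up to k chunks most similar to the query, dropping zero-score chunks."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(c))), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:k] if score > 0]


def build_rag_prompt(query, chunks, k=2):
    """Put the retrieved chunks, numbered for citation, ahead of the question."""
    hits = retrieve(query, chunks, k)
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(hits))
    return (
        "Answer using only the numbered context and cite the numbers you used.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical knowledge base, for illustration only.
CHUNKS = [
    "Refunds are issued within 14 days of purchase.",
    "The API rate limit is 100 requests per minute.",
    "Support is available on weekdays from 9 to 5.",
]

print(build_rag_prompt("What is the API rate limit?", CHUNKS, k=1))
```

Numbering the chunks is what makes citation possible: the model can point at `[1]`, and you can trace that back to a source document.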
Fine-tuning
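You take a base model and continue training it on examples of input → output for your domain, often by training only a small set of added weights (as in LoRA-style adapters) rather than the whole model. The model learns the patterns rather than reading them at inference time.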
Best for: consistent output formats, domain-specific jargon, tone-of-voice replication, situations where you have ≥1000 high-quality example pairs.
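As a rough sketch of what "example pairs" means in practice, the snippet below checks a dataset against the 1000-pair threshold and serializes it as JSON Lines. The `input`/`output` keys and both function names are assumptions for illustration; each fine-tuning provider defines its own training file schema, so check theirs before uploading anything.

```python
import json

MIN_PAIRS = 1000  # the rule-of-thumb threshold from the text above


def ready_for_fine_tune(pairs, min_pairs=MIN_PAIRS):
    """True if there are at least min_pairs distinct, non-empty (input, output) pairs."""
    unique = {
        (inp.strip(), out.strip())
        for inp, out in pairs
        if inp.strip() and out.strip()
    }
    return len(unique) >= min_pairs


def to_jsonl(pairs):
    """Serialize pairs as one JSON object per line (generic keys, not a provider schema)."""
    return "\n".join(
        json.dumps({"input": inp, "output": out}, ensure_ascii=False)
        for inp, out in pairs
    )


# Hypothetical pairs, for illustration only.
pairs = [
    ("Summarize: invoice overdue by 30 days", "STATUS: OVERDUE | DAYS: 30"),
    ("Summarize: invoice paid in full", "STATUS: PAID | DAYS: 0"),
]
print(ready_for_fine_tune(pairs))
print(to_jsonl(pairs))
```

Deduplicating before counting matters: a thousand near-identical rows teach the model far less than the count suggests.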
The actual decision tree
Is the knowledge stable + small? → Prompt-only.
Is the knowledge large or changes often? → RAG.
Is the output FORMAT or TONE the problem (not knowledge)? → Fine-tune.
None of the above? → Combine. RAG + fine-tuned formatter is the most common production stack.
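The questions above can be written down as a small function. This is a literal sketch that assumes the questions are asked top to bottom and the first "yes" wins; the `UseCase` fields and the `choose_approach` name are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    # Hypothetical yes/no answers to the three questions in the decision tree.
    knowledge_is_small_and_stable: bool
    knowledge_is_large_or_changing: bool
    problem_is_format_or_tone: bool


def choose_approach(uc):
    """Walk the questions in order; the first 'yes' decides, otherwise combine."""
    if uc.knowledge_is_small_and_stable:
        return "prompt-only"
    if uc.knowledge_is_large_or_changing:
        return "rag"
    if uc.problem_is_format_or_tone:
        return "fine-tune"
    return "combine (e.g. RAG + fine-tuned formatter)"


print(choose_approach(UseCase(False, True, False)))
```

Because the first match wins, a use case with both changing knowledge and a format problem lands on RAG here. If that is your situation, the combined stack is probably the better reading of the tree.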
The cost reality
| Approach | One-time | Per-query | Update cost |
|---|---|---|---|
| Prompt-only | $0 | High (long prompts = more tokens) | Editing the prompt |
| RAG | Vector store + chunking pipeline | Moderate (retrieval is cheap, but context is still big) | Reindex on data change |
| Fine-tune | $500–$10k training run | Lower per query (shorter prompts) | New training run every meaningful update |
For most products, RAG wins on total cost of ownership because knowledge changes faster than fine-tunes can keep up.
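A back-of-the-envelope version of that comparison is sketched below. Every number is a made-up placeholder (only the retraining cost is chosen to sit inside the $500–$10k range from the table), so plug in your own volumes and prices before drawing conclusions.

```python
def total_cost(one_time, per_query, queries, updates, cost_per_update):
    """Total cost of ownership over a period: setup + usage + knowledge updates."""
    return one_time + per_query * queries + updates * cost_per_update


# Placeholder assumptions for one year, in dollars. None of these are measured figures.
QUERIES = 100_000   # queries served in the period
UPDATES = 12        # meaningful knowledge updates in the period (monthly)

rag = total_cost(
    one_time=2_000,        # vector store + chunking pipeline setup
    per_query=0.010,       # retrieval plus a larger context
    queries=QUERIES,
    updates=UPDATES,
    cost_per_update=50,    # reindex on data change
)
fine_tune = total_cost(
    one_time=5_000,        # initial training run
    per_query=0.004,       # shorter prompts
    queries=QUERIES,
    updates=UPDATES,
    cost_per_update=5_000, # a new training run per meaningful update
)

print(f"RAG:       ${rag:,.0f}")
print(f"Fine-tune: ${fine_tune:,.0f}")
```

Under these assumptions the update term dominates: the cheaper per-query price of the fine-tuned model never makes up for retraining on every meaningful change. With rarely changing knowledge and very high volume, the comparison can flip.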
What teams get wrong
The single biggest mistake: fine-tuning when the actual problem is missing or stale knowledge, which is a retrieval problem. You can fine-tune a model to memorize 200 facts about your product, but you can't fine-tune it to know about the policy your team published yesterday. If your data changes faster than your training cadence, you need RAG, not a fine-tune.
The second biggest mistake: shipping vanilla RAG without thinking about chunk size, retrieval count, or reranking. Default chunking plus top-3 retrieval is almost never the right configuration. Spend a week tuning these and your RAG quality will usually improve more than it would from switching models.
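Here is a minimal sketch of what "tuning these" can look like: a word-window chunker with configurable size and overlap, plus a grid of configurations to evaluate. The function names and the candidate values are assumptions for illustration. Scoring each configuration needs your own labeled question set, which is not shown.

```python
from itertools import product


def chunk_words(text, chunk_size, overlap):
    """Split text into windows of chunk_size words, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def config_grid(chunk_sizes, top_ks, rerank_options):
    """Enumerate every combination of the three knobs named in the text."""
    return [
        {"chunk_size": c, "top_k": k, "rerank": r}
        for c, k, r in product(chunk_sizes, top_ks, rerank_options)
    ]


# Hypothetical candidate values; pick ranges that suit your documents.
grid = config_grid(chunk_sizes=[200, 500], top_ks=[3, 5, 10], rerank_options=[False, True])
print(len(grid), "configurations to evaluate")
print(chunk_words("a b c d e f g", chunk_size=3, overlap=1))
```

Running each configuration against the same evaluation questions and comparing answer quality is the part that takes the week.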
What we covered, what's next
This piece is opinionated, deliberately. The trade-offs are real and the cost of choosing wrong is high. If you want a more comprehensive treatment, the Anthropic and OpenAI cookbooks both have good worked examples, but neither will tell you when not to use their tool.
Next in the production series: how we run evals at speed, why ROC/precision/recall lie about LLM quality, and the three eval techniques we trust.


