LLMs in Production, Part 2 — RAG, Fine-tune, or Just Prompt?

A decision framework for the three main ways to ground an LLM in your data. Choose wrong and you spend six months building the wrong thing.

Anouar, Founder & lead writer
2026-05-23 · 11 min read


There are three honest ways to make an LLM useful with your data. Most teams pick the wrong one because the choice gets framed by tool vendors, not engineering trade-offs. Here's the actual decision tree.

The three approaches in one paragraph each

Prompt-only

You pass everything the model needs to know inside the prompt itself. Few-shot examples, instructions, context. No external retrieval, no model training.

Best for: stable knowledge that fits in context, low-volume use cases, prototypes, anything where the "knowledge" is really just behavior shaping.
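As a minimal sketch of what "everything inside the prompt" can look like, the snippet below assembles instructions, few-shot pairs, and context into one string. The function names, the Q:/A: layout, and the 4-characters-per-token estimate are assumptions made for illustration, not a required format or a real tokenizer.

```python
def build_prompt(instructions, examples, context, question):
    """Assemble a prompt-only request from its four parts.

    `examples` is a list of (question, ideal_answer) pairs used as few-shot
    demonstrations. The Q:/A: layout is a hypothetical convention.
    """
    parts = [instructions.strip(), ""]
    for example_question, ideal_answer in examples:
        parts.append(f"Q: {example_question}")
        parts.append(f"A: {ideal_answer}")
        parts.append("")
    parts.append("Context:")
    parts.append(context.strip())
    parts.append("")
    parts.append(f"Q: {question}")
    parts.append("A:")
    return "\n".join(parts)


def rough_token_count(text):
    """Very rough size check: assumes about 4 characters per token."""
    return max(1, len(text) // 4)


prompt = build_prompt(
    "Answer briefly and only from the context.",
    [("What is the return window?", "30 days.")],
    "Store hours: 9 to 5, Monday to Friday.",
    "When do you open?",
)
print(prompt)
print("approx tokens:", rough_token_count(prompt))
```

Because the whole prompt is sent on every call, a size estimate like this is a quick way to see when the knowledge has stopped fitting comfortably in context.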

RAG (Retrieval-Augmented Generation)

You keep your data in a vector store (or full-text index, or both). On each query, you retrieve relevant chunks and stuff them into the prompt before the model answers.

Best for: large, changing knowledge bases; a need for citations or auditability; situations where the cost of retraining outweighs the cost of retrieval.
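The retrieve-then-stuff loop can be sketched in a few lines. This toy version swaps the vector store for bag-of-words cosine similarity over an in-memory list so it runs with the standard library only; a real system would use embeddings, a full-text index, or both. All names here are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    """Return up to k chunks most similar to the query, best first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(c))), i) for i, c in enumerate(chunks)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    # Drop chunks that share no words with the query at all.
    return [chunks[i] for score, i in scored[:k] if score > 0]


def build_rag_prompt(query, chunks, k=2):
    """Number the retrieved chunks so the answer can cite them."""
    hits = retrieve(query, chunks, k)
    sources = "\n".join(f"[{n}] {c}" for n, c in enumerate(hits, 1))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"{sources}\n\nQuestion: {query}"
    )


docs = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping takes 3 to 5 business days.",
]
print(build_rag_prompt("How many days do refunds take?", docs, k=2))
```

Numbering the sources in the prompt is what makes the citation and auditability benefit possible: the answer can point back at the chunk it relied on.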

Fine-tuning

You take a base model and continue training it on input → output examples from your domain, often by training a small set of adapter weights (for example, LoRA) rather than the whole model. The model learns the patterns rather than reading them at inference time.

Best for: consistent output formats, domain-specific jargon, tone-of-voice replication, situations where you have ≥1000 high-quality example pairs.
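Most of the work in a fine-tune is preparing the example pairs. The sketch below cleans, deduplicates, and splits pairs, then serializes them as JSONL. The `input`/`output` field names are a hypothetical schema (each training provider defines its own), and `MIN_PAIRS` simply encodes the ≥1000 rule of thumb above.

```python
import json
import random

MIN_PAIRS = 1000  # rule of thumb from the text, not a hard limit


def prepare_pairs(pairs, seed=0, val_fraction=0.1):
    """Clean and deduplicate (input, output) pairs, then split train/validation."""
    cleaned, seen = [], set()
    for inp, out in pairs:
        inp, out = inp.strip(), out.strip()
        if not inp or not out or (inp, out) in seen:
            continue  # skip empty or duplicate examples
        seen.add((inp, out))
        cleaned.append({"input": inp, "output": out})
    random.Random(seed).shuffle(cleaned)  # seeded so the split is reproducible
    n_val = int(len(cleaned) * val_fraction)
    return cleaned[n_val:], cleaned[:n_val]


def to_jsonl(records):
    """One JSON object per line; the field names are an assumed schema."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def enough_data(train):
    return len(train) >= MIN_PAIRS


train, val = prepare_pairs(
    [("Summarize: order delayed", "Status: DELAYED"),
     ("Summarize: order shipped", "Status: SHIPPED")],
    val_fraction=0.5,
)
print(to_jsonl(train))
print("enough data to fine-tune?", enough_data(train))
```

Holding out a validation split matters here because the goal is a consistent format or tone, and that is only measurable on examples the model did not train on.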

The actual decision tree

Is the knowledge stable + small? → Prompt-only.
Is the knowledge large or changes often? → RAG.
Is the output FORMAT or TONE the problem (not knowledge)? → Fine-tune.
More than one of the above? → Combine. RAG + fine-tuned formatter is the most common production stack.
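The tree is small enough to write down as a function. This is one possible encoding, not the only one: the field names are hypothetical, and it reads the final "combine" branch as "both a knowledge problem and a format or tone problem apply", which is an interpretation.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    knowledge_fits_in_context: bool      # small enough to paste into the prompt
    knowledge_changes_often: bool        # updates outpace any retraining cadence
    format_or_tone_is_the_problem: bool  # the issue is behavior, not knowledge


def choose_approach(uc):
    """Map the decision tree's questions to a recommendation."""
    needs_retrieval = uc.knowledge_changes_often or not uc.knowledge_fits_in_context
    needs_fine_tune = uc.format_or_tone_is_the_problem
    if needs_retrieval and needs_fine_tune:
        return "RAG + fine-tuned formatter"
    if needs_fine_tune:
        return "fine-tune"
    if needs_retrieval:
        return "RAG"
    return "prompt-only"


print(choose_approach(UseCase(True, False, False)))
print(choose_approach(UseCase(False, True, True)))
```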

The cost reality

| Approach | One-time | Per-query | Update cost |
| --- | --- | --- | --- |
| Prompt-only | $0 | High (long prompts = more tokens) | Editing the prompt |
| RAG | Vector store + chunking pipeline | Moderate (retrieval is cheap, but context is still big) | Reindex on data change |
| Fine-tune | $500–$10k training run | Lower per query (shorter prompts) | New training run every meaningful update |

For most products, RAG wins on total cost of ownership because knowledge changes faster than retraining can keep up with it.
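To compare the three rows of the table for a specific product, plug your own numbers into a simple yearly total. Every figure below is invented for illustration (token price, prompt sizes, query volume, update frequency, and per-update costs), and the sketch ignores that fine-tuned models may be priced differently per token.

```python
def total_cost(one_time, prompt_tokens, price_per_million_tokens,
               queries, updates, cost_per_update):
    """Yearly cost = one-time setup + per-query token cost + update cost."""
    per_query = prompt_tokens * price_per_million_tokens / 1_000_000
    return one_time + per_query * queries + updates * cost_per_update


# Hypothetical scenario: $3 per million input tokens, 1M queries per year,
# knowledge updated monthly. None of these numbers come from measurements.
PRICE = 3.0
QUERIES = 1_000_000
UPDATES = 12

scenarios = {
    # (one-time $, prompt tokens per query, $ per update)
    "prompt-only": (0, 20_000, 0),
    "RAG": (5_000, 4_000, 50),
    "fine-tune": (2_000, 500, 2_000),
}

for name, (one_time, tokens, per_update) in scenarios.items():
    cost = total_cost(one_time, tokens, PRICE, QUERIES, UPDATES, per_update)
    print(f"{name:12s} ${cost:,.0f} per year")
```

Under these made-up inputs the repeated training runs dominate the fine-tune total and the long prompt dominates the prompt-only total. Lower the update frequency or the query volume and the ranking can change, which is why it is worth running with real figures.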

What teams get wrong

The single biggest mistake: fine-tuning when the actual problem is bad retrieval. You can fine-tune a model to memorize 200 facts about your product, but you can't fine-tune it to know about the policy your team published yesterday. If your data changes faster than your training cadence, you need RAG, not a fine-tune.

The second biggest mistake: shipping vanilla RAG without thinking about chunk size, retrieval count, or reranking. Default chunking + top-3 retrieval is almost never the right configuration. Spend a week tuning these and your RAG quality will often improve more than it would from switching models.
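Tuning these settings is mostly a matter of sweeping them against a small labeled set. The sketch below is a toy version: it chunks by word count with overlap, ranks chunks by shared words (a stand-in for a real retriever), and reports how often the expected answer text lands in the top k. The names and the hit-rate metric are assumptions for illustration.

```python
import re


def chunk_words(text, size, overlap):
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_k(question, chunks, k):
    """Rank chunks by number of words shared with the question (toy retriever)."""
    q = tokens(question)
    ranked = sorted(range(len(chunks)),
                    key=lambda i: (-len(q & tokens(chunks[i])), i))
    return [chunks[i] for i in ranked[:k]]


def hit_rate(text, labeled, size, overlap, k):
    """Fraction of (question, answer_text) pairs whose answer appears in the top k chunks."""
    chunks = chunk_words(text, size, overlap)
    hits = sum(any(answer in c for c in top_k(q, chunks, k)) for q, answer in labeled)
    return hits / len(labeled)


corpus = ("Refunds are issued within 14 days. Shipping takes 5 business days. "
          "Support is open on weekdays.")
labeled = [("when are refunds issued", "14 days")]

for size, overlap, k in [(5, 0, 1), (6, 0, 1), (5, 2, 2)]:
    rate = hit_rate(corpus, labeled, size, overlap, k)
    print(f"size={size} overlap={overlap} k={k} hit_rate={rate}")
```

Even on this tiny corpus, a chunk boundary that splits "14 days" across two chunks makes the answer unretrievable at that setting, which is the kind of failure a sweep surfaces before users do.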

What we covered, what's next

This piece is opinionated, deliberately. The trade-offs are real and the cost of choosing wrong is high. If you want a more comprehensive treatment, the Anthropic and OpenAI cookbooks both have good worked examples, but neither will tell you when not to use their tools.

Next in the production series: how we run evals at speed, why ROC/precision/recall lie about LLM quality, and the three eval techniques we trust.
