LLMs in Production, Part 2: RAG, Fine-tune, or Just Prompt?

A decision framework for the three ways to ground an LLM in your data, with real vendor pricing, the embedding model trade-off table, and the worked example we hand to every team that asks.

There are three honest ways to make an LLM useful with your data. Most teams pick the wrong one because the choice gets framed by tool vendors, not engineering trade-offs. We have watched at least six teams over the last year spend three to nine months on fine-tunes that should have been a RAG pipeline, or build a 60,000-token "context window" prompt that should have been a fine-tune. Here is the actual decision tree, with the numbers that drive it.

The three approaches in one paragraph each

Prompt-only

You pass everything the model needs to know inside the prompt itself. Few-shot examples, instructions, context. No external retrieval, no model training.

Best for: stable knowledge that fits in context, low-volume use cases, prototypes, anything where the "knowledge" is really just behaviour shaping (tone, output format, allowed verbs).

RAG (Retrieval-Augmented Generation)

You keep your data in a vector store, or a full-text index, or both. On each query, you retrieve relevant chunks and inject them into the prompt before the model answers.

Best for: large or changing knowledge bases, citation and auditability requirements, situations where the cost of retraining outweighs the cost of retrieval. Most production "AI assistant" use cases land here.

Fine-tuning

You take a base model and continue training it, often through a thin adapter layer rather than a full retrain, on hundreds to thousands of input-and-expected-output pairs. The model learns the patterns in its weights rather than reading them at inference time.

Best for: consistent output formats, domain-specific jargon, tone-of-voice replication, situations where you have enough high-quality example pairs (a few hundred for format work, 1,000 or more for tone) and the underlying patterns are stable for months at a time.

The actual decision tree

Is the knowledge stable + small enough to fit in context? -> Prompt-only.
Is the knowledge large, fast-changing, or auditable? -> RAG.
Is FORMAT or TONE the problem (not knowledge)? -> Fine-tune.
None of the above? -> Combine. RAG + fine-tuned formatter is the
                     most common production stack in 2026.
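
The tree is small enough to write down as a function. Here is a minimal sketch in Python; the UseCase field names and the returned labels are our own illustrative choices, and the branches are checked in the same order as the tree.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    """Answers to the questions in the decision tree (illustrative field names)."""
    knowledge_is_stable: bool
    fits_in_context: bool
    large_or_fast_changing: bool
    needs_audit_trail: bool
    format_or_tone_is_the_problem: bool


def choose_approach(uc: UseCase) -> str:
    # First matching branch wins, mirroring the order of the tree.
    if uc.knowledge_is_stable and uc.fits_in_context:
        return "prompt-only"
    if uc.large_or_fast_changing or uc.needs_audit_trail:
        return "rag"
    if uc.format_or_tone_is_the_problem:
        return "fine-tune"
    return "combine: rag + fine-tuned formatter"


# Example: a weekly-changing policy corpus that must be citable.
policies = UseCase(False, False, True, True, False)
print(choose_approach(policies))
```

A first-match function is deliberately crude: it names the dominant constraint, and says nothing about which second component you may still want alongside it.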

The trap most teams fall into is treating these as exclusive. They are not. The real production stack in serious AI products is almost always RAG for knowledge, plus a small fine-tune for output formatting or domain register. Anthropic, OpenAI, and Cohere all sell components for both halves of that stack.

The cost reality, with current numbers

Numbers below are mid-2026 list prices, rounded for legibility. Your contracted rate is probably 10-30 percent lower at any meaningful volume.

Inference cost per million tokens (input / output)

Model	Input	Output	Cached input
GPT-4o	$2.50	$10.00	$1.25
GPT-4o-mini	$0.15	$0.60	$0.075
o1 (reasoning)	$15.00	$60.00	$7.50
o1-mini	$3.00	$12.00	$1.50
Claude Sonnet 4	$3.00	$15.00	$0.30 (90% off)
Claude Haiku 4.5	$1.00	$5.00	$0.10 (90% off)
Gemini 1.5 Pro	$1.25	$5.00	varies by region
Gemini 1.5 Flash	$0.075	$0.30	n/a

Two things matter from this table for your architecture choice:

Cached input is much cheaper than fresh input. Anthropic's prompt caching bills cache reads at about 90 percent off the normal input rate once a prefix has been written to the cache (the initial write carries a small surcharge). OpenAI's cached input is half the fresh input rate. If you are sending the same system prompt or the same retrieved documents to the model many times in a window, caching is the single biggest cost lever you have.
Output tokens are priced at four to five times the input rate in every row above, so long answers run up the bill much faster than long prompts. A 4,000-token answer at the GPT-4o output rate costs the same as 16,000 input tokens. Designs that "save tokens" by compressing the prompt but accept longer answers are usually net-negative.
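
Both points are easy to check with arithmetic. The sketch below uses the GPT-4o row of the table; the token counts in the example (a 5,000-token shared prefix, 1,000 fresh tokens, a 500-token answer) are made-up inputs for illustration, not measurements.

```python
# USD per 1M tokens, copied from the GPT-4o row of the table above.
GPT4O = {"input": 2.50, "cached_input": 1.25, "output": 10.00}


def request_cost(prices, fresh_in, cached_in, out):
    """Cost in USD of one request, given token counts per billing class."""
    total = (
        fresh_in * prices["input"]
        + cached_in * prices["cached_input"]
        + out * prices["output"]
    )
    return total / 1_000_000


# A 4,000-token answer versus 16,000 input tokens: same price.
answer_only = request_cost(GPT4O, fresh_in=0, cached_in=0, out=4_000)
prompt_only = request_cost(GPT4O, fresh_in=16_000, cached_in=0, out=0)
print(answer_only, prompt_only)

# Hypothetical request: 6,000 prompt tokens, 5,000 of them a reusable prefix.
uncached = request_cost(GPT4O, fresh_in=6_000, cached_in=0, out=500)
cached = request_cost(GPT4O, fresh_in=1_000, cached_in=5_000, out=500)
print(uncached, cached)
```

Swap in another row of the table to see how far the 90 percent cache discount moves the same request.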

One-time cost to get each approach into production

Approach	One-time	Per query	Update cost
Prompt-only	Effectively zero	High (long prompts every call)	Edit the prompt, redeploy
RAG	Vector store provisioning, chunking pipeline, eval set: 1-3 engineer-weeks	Moderate (retrieval is cheap, context is still big)	Reindex on data change
Fine-tune	Data labelling: 1-4 engineer-weeks. Training run: $500-$10k	Lower per query (shorter prompts work)	New training run every meaningful update, ~$500-$10k each time

For most products, RAG wins on total cost of ownership because the knowledge changes faster than fine-tunes can keep up, and the per-query saving of fine-tuning rarely makes up for the per-update cost.

RAG, the part the cookbooks gloss over

A "vanilla" RAG setup, the one most blog posts show, is: chunk every document into 1,000-character pieces, embed each chunk, store them in a vector database, retrieve the top three by cosine similarity, stuff them into the prompt. This works for a demo. It does not work for production. Six things matter:

Chunk size and overlap

Chunks that are too small lose context (you retrieve "we recommend X" without the conditions that follow). Chunks that are too large include irrelevant material that drags down retrieval precision. For prose documents we typically end up between 400 and 800 tokens with 50-100 token overlap. For structured data (FAQs, policy clauses, code), chunk by semantic unit (one clause = one chunk) not by character count.
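
For the prose case, fixed-size windows with overlap look roughly like this. It is a sketch under one loud assumption: "tokens" here are whitespace-separated words, which undercounts compared with a real model tokenizer, so treat the sizes as approximate. The function name and defaults are ours.

```python
def chunk_tokens(tokens, chunk_size=600, overlap=100):
    """Split a token list into windows of chunk_size sharing `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Assumption: whitespace splitting stands in for a real tokenizer.
text = "word " * 1500
pieces = chunk_tokens(text.split(), chunk_size=600, overlap=100)
print(len(pieces), [len(p) for p in pieces])
```

For structured data you would skip this function entirely and split on the clause or FAQ boundaries instead.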

Embedding model choice

Model	Dimensions	Price per 1M tokens	Notes
OpenAI text-embedding-3-small	1,536	$0.02	Default starting point. Good.
OpenAI text-embedding-3-large	3,072	$0.13	Better recall, 6x cost, 2x storage
Voyage voyage-3-large	1,024	$0.18	Often top of MTEB. Strong on code + technical.
Cohere embed-v4	up to 1,536	$0.10	Best-in-class for multilingual incl. Arabic
Open weights (BGE, E5)	384-1,024	self-hosted	Free, ~2x your team's time to operate

If your corpus is English-only technical text, OpenAI-small is the right default until you have a measured reason to change. If you serve Arabic or French + English from the same index, Cohere embed-v4 is the noticeably better choice. If you are storing source code, Voyage-3-large is what we have ended up using.
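
The storage and price columns translate into numbers quickly. A rough sketch, assuming vectors are stored as 4-byte floats and ignoring index overhead (the HNSW graph, metadata columns); the corpus sizes in the example are hypothetical.

```python
BYTES_PER_FLOAT32 = 4  # assumption: uncompressed float32 storage


def raw_vector_bytes(num_vectors, dimensions):
    """Bytes needed for the vectors alone, before any index overhead."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32


def embedding_cost_usd(total_tokens, price_per_million):
    """One-off cost of embedding a corpus at a given list price."""
    return total_tokens / 1_000_000 * price_per_million


# Hypothetical corpus: 1M chunks, 50M tokens in total.
small = raw_vector_bytes(1_000_000, 1_536)
large = raw_vector_bytes(1_000_000, 3_072)
print(small / 1e9, "GB vs", large / 1e9, "GB")
print(embedding_cost_usd(50_000_000, 0.02), "USD vs",
      embedding_cost_usd(50_000_000, 0.13), "USD")
```

At these prices the embedding bill is rarely the deciding factor; the storage and the retrieval quality are.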

Hybrid search beats pure vector

Pure semantic similarity misses exact-match queries. A user searching for "error code E_TIMEOUT_5023" wants the document that contains that literal string, not the one that is semantically closest. Production RAG runs vector search and BM25 (or your favourite full-text index) in parallel, then merges the two result lists. Postgres handles both in one database: pgvector for the vectors and the built-in full-text search for keywords (its native ranking is ts_rank rather than true BM25, which needs an extension). Weaviate has hybrid built in. If you are on Pinecone, you bolt BM25 on yourself.
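
The merge step is where most of the choice lies. One common option is reciprocal rank fusion, which only needs the rank of each document in each list and so sidesteps the problem that vector distances and keyword scores are on different scales. The sketch below is that technique, not necessarily what any particular vector store does internally; k=60 is a commonly used default, and the document ids are invented.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document ids (each ordered best first)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more the higher it sits in each list.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]   # hypothetical vector-search result
keyword_hits = ["c", "a", "d"]  # hypothetical full-text result
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

Documents that appear in both lists float to the top, which is usually the behaviour you want before handing candidates to a reranker.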

Reranking, the cheapest 20-point quality jump

After you retrieve the top 20-50 candidates, a reranker scores each against the query and picks the top 3-5. Cohere Rerank 3.5 costs about $2 per 1,000 searches and reliably improves answer relevance by 15-30 percent on our internal benchmarks. Voyage and Jina sell competitive rerankers. There is also bge-reranker on open weights. The reranker is the cheapest single thing you can add to a RAG pipeline that meaningfully improves quality.
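
Structurally, the rerank stage is just "score every candidate against the query, keep the best few". The sketch below shows that shape with a pluggable scoring function. The token_overlap scorer is a toy stand-in so the example runs on its own; it is not how Cohere Rerank or bge-reranker score passages, and in production you would pass a function that calls one of those models instead.

```python
from typing import Callable, List, Tuple


def rerank(query: str,
           candidates: List[str],
           score: Callable[[str, str], float],
           top_k: int = 5) -> List[Tuple[float, str]]:
    """Score each candidate passage against the query and keep the top_k."""
    scored = [(score(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def token_overlap(query: str, passage: str) -> float:
    """Toy scorer: fraction of query words that appear in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


candidates = [
    "office opening hours",
    "annual plans renew automatically",
    "refund policy for annual plans",
]
for s, passage in rerank("refund annual plans", candidates, token_overlap, top_k=2):
    print(round(s, 2), passage)
```

Keeping the scorer as a parameter also makes it cheap to A/B two rerankers against the same eval set.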

Vector store choice

Store	Hosted?	Cost shape	When right
pgvector	Self or Supabase/Neon	Free if you already run Postgres	Up to ~10M vectors with HNSW. The pragmatic default.
Qdrant	Self or cloud	Free self-host, $20+/mo cloud	When pgvector starts hurting; great for filtering.
Weaviate	Self or cloud	Free self-host, ~$25+/mo cloud	Hybrid search is best-in-class out of the box.
Pinecone	Hosted only	$70+/mo starter, grows fast	When you want zero infra, are okay paying.
LanceDB	Embedded or server	Free	When you want vectors as files (versionable in Git LFS).

The default for new projects in 2026 should be pgvector unless you have measured that it is not enough. The promise of "managed vector databases" is mostly that they save you the operational learning curve; that is a one-time cost. The bill is forever.

Where the metadata lives

Every chunk needs metadata: source document, author, last-updated, access permissions, language, document type. Without this, you cannot filter ("only retrieve from documents the current user is allowed to see"), you cannot evict stale entries, and you cannot debug "why did this answer use a document from 2019". Build this into your chunking pipeline from day one, not as a retrofit.
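
A sketch of what that looks like as a record, with the permission and staleness filter applied before retrieval results reach the prompt. The field names and the group-based permission model are assumptions for illustration; your access model may look different.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, List


@dataclass(frozen=True)
class Chunk:
    text: str
    source_document: str
    author: str
    last_updated: date
    allowed_groups: FrozenSet[str]  # assumption: group-based permissions
    language: str
    document_type: str


def visible_chunks(chunks: List[Chunk],
                   user_groups: FrozenSet[str],
                   not_older_than: date) -> List[Chunk]:
    """Keep chunks the user may see and that are not stale."""
    return [
        c for c in chunks
        if c.allowed_groups & user_groups and c.last_updated >= not_older_than
    ]


chunks = [
    Chunk("Refunds within 30 days.", "refunds.md", "legal", date(2025, 11, 3),
          frozenset({"support", "legal"}), "en", "policy"),
    Chunk("Old refund terms.", "refunds-2019.md", "legal", date(2019, 4, 1),
          frozenset({"support"}), "en", "policy"),
    Chunk("Salary bands.", "hr-bands.md", "hr", date(2025, 9, 1),
          frozenset({"hr"}), "en", "internal"),
]
for c in visible_chunks(chunks, frozenset({"support"}), date(2024, 1, 1)):
    print(c.source_document)
```

In a real store the same filter runs as a WHERE clause or a metadata filter inside the vector query, so that forbidden chunks never leave the database.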

When to actually fine-tune

Fine-tuning earns its keep in three scenarios. Outside these, prefer RAG with a sharper prompt.

1. Consistent output format. You need every answer in a specific JSON shape, or in a specific markdown structure, or as a specific markup language. Prompt-only solutions for this leak edge cases that fine-tunes do not. The cost: maybe 2 engineer-days to assemble 200-500 examples, plus $50-$200 for a small fine-tune.

2. Domain register and tone. Your audience is, say, Moroccan-French legal practitioners and you need text that lands in that register. Prompt instructions get most of the way there; a fine-tune nails the last 15 percent. You need 1,000+ real-world examples for this to work, not synthetic ones.

3. Latency or cost reduction at high volume. A fine-tuned smaller model can replace a larger model with prompt instructions, with 5-10x cost reduction and 2-3x latency reduction. This only pencils out above roughly 100,000 queries per day per use case. Below that, the engineering time costs more than the inference saving.
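
The volume threshold in the third scenario is a break-even calculation. A sketch of the arithmetic follows; every number in the example (the one-time cost and both per-query costs) is a hypothetical input chosen to show the shape of the curve, not a quoted price.

```python
def breakeven_days(one_time_cost_usd, queries_per_day,
                   cost_per_query_before, cost_per_query_after):
    """Days until the inference saving repays the one-time fine-tune cost."""
    daily_saving = queries_per_day * (cost_per_query_before - cost_per_query_after)
    if daily_saving <= 0:
        return float("inf")  # the fine-tune never pays for itself
    return one_time_cost_usd / daily_saving


# Hypothetical: $10k of engineering plus training, 5x cheaper per query.
for volume in (1_000, 10_000, 100_000):
    days = breakeven_days(10_000, volume, 0.002, 0.0004)
    print(volume, "queries/day ->", round(days), "days to break even")
```

Plug in your own engineering cost and measured per-query prices; the point is that the payback period scales inversely with volume.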

The OpenAI fine-tune API charges about $25 per 1M training tokens for GPT-4o-mini (training data, not inference). Anthropic's Claude fine-tuning is available via AWS Bedrock for Haiku. For open weights, Together AI, Fireworks, and Anyscale offer fine-tuning of Llama, Mistral, and Qwen models at competitive rates with deployable endpoints.

The worked example we keep coming back to

A team comes to us with: "We have 800 internal product policy documents. Our support team wants to ask plain-language questions and get back grounded, citable answers. The documents update weekly. The team has 30 daily users."

The wrong answer: "Let us fine-tune a model on your policies."

The right answer:

pgvector on the existing Postgres instance, one HNSW index
Cohere embed-v4 for embeddings (the corpus has French and English mixed)
Chunks of 600 tokens with 100-token overlap, one chunk per logical clause where possible
Hybrid retrieval (vector + BM25), top 40 candidates
Cohere Rerank 3.5 down to top 5
Claude Haiku 4.5 for the answer, with the 5 chunks injected and a structured citation requirement
Prompt caching on the system prompt and the policy-rendering instructions

Total infrastructure cost: about $40 per month at this scale (one Postgres instance, a few thousand reranks, a few thousand answers). Build time: 6 working days to a usable v1, another two weeks of eval-driven iteration to get the answer quality past 85 percent on the team's golden set.

A fine-tune for the same problem would have cost between $2,000 and $8,000, taken eight weeks, and would have started going stale the day after each weekly policy update. That comparison is not unusual.

What teams get wrong, again

The single biggest mistake is fine-tuning when the actual problem is bad retrieval. You can fine-tune a model to memorise 200 facts about your product, but you cannot fine-tune it to know about the policy your team published yesterday. If your data changes faster than your training cadence, you need RAG.

The second biggest mistake is shipping vanilla RAG (1,000-character chunks, top-3 vector retrieval, no rerank, no filter) and concluding that "RAG does not work" when the answers are mediocre. Default chunking is almost never the right configuration. Spend a focused week on chunk size, hybrid search, and reranking, and your RAG quality jumps more than switching from GPT-4o-mini to GPT-4o would.

The third mistake is building all of this without an eval set. With no measurements, every change feels like an improvement when it is shipped on Monday and like a regression by Thursday. The eval set is non-negotiable.
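
The smallest useful version of an eval set is a list of questions with expected results and a function that reports the pass rate. A minimal sketch: the golden-set shape, answer_fn, and is_correct are hypothetical names, and the lookup table stands in for a real pipeline call.

```python
def pass_rate(golden_set, answer_fn, is_correct):
    """Fraction of golden-set cases the pipeline answers correctly."""
    if not golden_set:
        raise ValueError("golden set is empty")
    passed = sum(
        1 for case in golden_set
        if is_correct(answer_fn(case["question"]), case["expected"])
    )
    return passed / len(golden_set)


# Hypothetical golden set: each case expects a specific source to be cited.
golden = [
    {"question": "How long is the refund window?", "expected": "refunds.md"},
    {"question": "Who approves discounts?", "expected": "pricing.md"},
]

# Stand-in for the real pipeline: returns the document it would cite.
fake_pipeline = {
    "How long is the refund window?": "refunds.md",
    "Who approves discounts?": "onboarding.md",
}.get

print(pass_rate(golden, fake_pipeline, lambda got, want: got == want))
```

Run it before and after every change to chunking, retrieval, or prompts, and keep the number in the pull request.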

What we covered, what is next

This piece is opinionated, deliberately. The trade-offs are real and the cost of choosing wrong is high. If you want a deeper treatment, the Anthropic and OpenAI cookbooks both have good worked examples, but neither will tell you when not to use their tool, and neither includes a direct cost-of-ownership analysis.

Next in the production series: how we run evals at speed, why ROC/precision/recall lie about LLM quality, and the three eval techniques we trust. If you want to be told when it ships, the newsletter signup is at the bottom of the homepage.