The gap both methods close

A pretrained model knows an enormous amount about the world and nothing about your world. It has never seen your contracts, your product catalog, your internal policies, or last week's incident report. Retrieval-augmented generation and fine-tuning are the two standard ways to close that gap, and they do it through completely different mechanisms. RAG supplies knowledge at the moment of the question. Fine-tuning changes the model itself. Understanding which mechanism fits your problem is one of the highest-leverage decisions in any LLM project.

How RAG actually works

Retrieval-augmented generation does not touch the model's weights. Instead, at query time, it finds the most relevant pieces of your documents and places them into the prompt, so the model answers from material it can see directly.

[Figure: The RAG pipeline at query time. Query → embed to vector → vector store (doc chunks) → top-k chunks → LLM → answer.]

The setup, done once, is to split your documents into chunks, pass each through an embedding model that turns text into a vector capturing its meaning, and store those vectors in a vector database. At query time the question is embedded the same way, the database returns the chunks whose vectors are closest to the question's vector, and those chunks are pasted into the prompt ahead of the question. The model never memorized anything; it looks the relevant material up live, every time.
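The two phases described above can be sketched in a few lines. This is a minimal illustration, not a production design: `embed` here is a toy word-count vector standing in for a real embedding model, the "vector database" is a plain list, and all function names and sample documents are made up for the example.

```python
import math
import re
from collections import Counter

def chunk_text(text, max_words=40):
    """Split a document into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text):
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(documents):
    """Setup phase: chunk every document and store (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for doc in documents for chunk in chunk_text(doc)]

def retrieve(index, question, k=2):
    """Query phase: embed the question and return the k closest chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(question, chunks):
    """Place the retrieved chunks ahead of the question."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

documents = [
    "Refunds are issued within 14 days of purchase when the item is unused.",
    "Support hours are 9am to 5pm on weekdays, excluding public holidays.",
    "Enterprise plans include single sign-on and a dedicated account manager.",
]
index = build_index(documents)
question = "When are refunds issued?"
top_chunks = retrieve(index, question, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

The resulting `prompt` string is what would be sent to the language model. A real system would swap `embed` for a learned embedding model, since word counts match only on shared vocabulary rather than on meaning.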

This is why RAG is the default for knowledge that is private or changing. Update a document, re-embed its chunks, and the next query uses the new version with no retraining. It also lets the system cite its sources, because it knows exactly which chunks it retrieved.
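A small sketch of both properties, updating without retraining and citing sources. `TinyStore`, its methods, and the chunk ids are hypothetical names for illustration, and `embed` is again a toy word-count stand-in for a real embedding model.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy stand-in for a real embedding model (word counts, not learned vectors).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class TinyStore:
    """Hypothetical in-memory store holding one entry per chunk id."""

    def __init__(self):
        self.entries = {}  # chunk_id -> (text, vector)

    def upsert(self, chunk_id, text):
        # Re-embedding on write is the entire "update": no training step.
        self.entries[chunk_id] = (text, embed(text))

    def search(self, question, k=1):
        # Return (chunk_id, text) pairs so the caller can cite what was used.
        q = embed(question)
        scored = [(cosine(q, vec), cid, text) for cid, (text, vec) in self.entries.items()]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [(cid, text) for _, cid, text in scored[:k]]

store = TinyStore()
store.upsert("policy/refunds#0", "Refunds are issued within 14 days of purchase.")
store.upsert("policy/hours#0", "Support hours are 9am to 5pm on weekdays.")

before = store.search("How many days do refunds take?")
store.upsert("policy/refunds#0", "Refunds are issued within 30 days of purchase.")
after = store.search("How many days do refunds take?")

print(before)
print(after)
```

The same chunk id comes back both times, but the second search returns the revised text, and the id itself is what a citation would point to.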

The retrieval step, in math

"Closest" has a precise meaning. Each chunk and the query are vectors, and relevance is most commonly measured by cosine similarity, the cosine of the angle between them:

\[ \text{sim}(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert} = \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2}\,\sqrt{\sum_i d_i^2}} \]

A score near 1 means the vectors point the same way, so the texts mean similar things; a score near 0 means they are unrelated. Retrieval ranks the chunks by this score and returns the top \(k\). The quality of a RAG system lives almost entirely in this step: how documents are chunked, which embedding model is used, how many chunks are retrieved, and how conflicting or duplicate chunks are handled. The language model at the end is rarely the bottleneck.
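The formula and the top-\(k\) ranking translate directly into code. The sketch below uses hand-picked three-dimensional vectors purely for illustration; real embeddings have hundreds or thousands of dimensions, and the names `cosine_similarity` and `top_k` are chosen for this example.

```python
import math

def cosine_similarity(q, d):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)

def top_k(query, chunks, k):
    """Score every chunk against the query and keep the k highest."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in chunks.items()]
    scored.sort(reverse=True)
    return scored[:k]

query = [1.0, 0.0, 1.0]
chunks = {
    "same direction": [2.0, 0.0, 2.0],   # a scaled copy of the query
    "partly related": [1.0, 1.0, 0.0],   # shares one component with the query
    "orthogonal": [0.0, 3.0, 0.0],       # shares nothing with the query
}

for score, name in top_k(query, chunks, k=2):
    print(f"{name}: {score:.3f}")
```

Because the score depends only on direction, the chunk that is a scaled copy of the query scores essentially 1 despite being twice as long, and the orthogonal chunk scores 0 and is cut by the top-\(k\) limit.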

How fine-tuning actually works

Fine-tuning takes the pretrained model and continues training it on your own examples, adjusting its weights so the knowledge or behavior is baked in. Where RAG adds information to the prompt, fine-tuning changes the function the model computes.

It is the right tool when you need to change how the model behaves rather than what facts it can reach: matching a specific output format or house voice, handling specialized domain language, following a particular instruction style, or sharpening reliability on a narrow, high-volume task. These are things that are hard to specify in a prompt but easy to demonstrate with examples.

LoRA and the cost of fine-tuning

Full fine-tuning updates every weight, which for a large model is expensive in memory and compute. The dominant modern approach, LoRA (low-rank adaptation) [3], makes it cheap. The insight is that the change to a weight matrix during fine-tuning tends to be low rank, so instead of learning a full update \(\Delta W\), LoRA learns two small matrices whose product approximates it:

\[ W' = W + \Delta W = W + B A, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d,k) \]

Only \(A\) and \(B\) are trained; the original \(W\) is frozen. The parameter saving is dramatic. For a \(4096 \times 4096\) weight matrix, a full update is about 16.8 million parameters. A LoRA update at rank \(r = 8\) is just \(r(d+k) = 8 \times 8192 = 65{,}536\) parameters, about 0.39% of the full count. That is a large part of why a big model can often be fine-tuned on a single GPU, and why you can keep many small task-specific adapters instead of many full model copies.
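A toy sketch of the update \(W' = W + BA\), using tiny matrices in plain Python so the shapes are easy to follow. The sizes, the helper names, and the hand-edited "training step" are assumptions for illustration. Starting \(B\) at zero follows the initialization described in the LoRA paper [3], which makes the adapter a no-op before training.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def add(X, Y):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(row_x, row_y)] for row_x, row_y in zip(X, Y)]

random.seed(0)
d, k, r = 6, 4, 2  # toy sizes; real layers are far larger

W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]      # frozen, d x k
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]   # trainable, r x k
B = [[0.0] * r for _ in range(d)]                                   # trainable, d x r, zero at start

W_before = add(W, matmul(B, A))   # B is zero, so the adapted weights equal W

# Stand-in for a training step: in real training an optimizer updates A and B.
B[0][0] = 1.0
W_after = add(W, matmul(B, A))    # row 0 now differs; W itself was never modified

print(W_before == W)
print(W_after[0] != W[0], W_after[1:] == W[1:])
```

Since \(W\) is never written to, removing or swapping the adapter is just a matter of dropping or replacing \(A\) and \(B\).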

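LoRA parameters for one 4096×4096 matrix

| Method         | Trainable params | Share of full |
|----------------|------------------|---------------|
| Full fine-tune | 16,777,216       | 100%          |
| LoRA r = 8     | 65,536           | 0.39%         |
| LoRA r = 16    | 131,072          | 0.78%         |
| LoRA r = 64    | 524,288          | 3.12%         |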
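The table's figures follow from the count \(r(d+k)\) and can be recomputed for any layer shape. A short sketch, with `lora_params` as a helper name chosen for this example:

```python
def lora_params(d, k, r):
    """Trainable parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)

d = k = 4096
full = d * k  # parameters in a full update of one d x k matrix

print(f"Full fine-tune: {full:,} (100%)")
for r in (8, 16, 64):
    n = lora_params(d, k, r)
    print(f"LoRA r = {r}: {n:,} ({100 * n / full:.2f}%)")
```

The count grows linearly in \(r\), so doubling the rank doubles the adapter size while the frozen matrix stays the same.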

The decision framework

The cleanest way to choose is to name the kind of gap you are closing.

Which tool fits which gap

| Your need                         | Use         | Why                                     |
|-----------------------------------|-------------|-----------------------------------------|
| Knowledge that changes often      | RAG         | Update documents, not weights           |
| Answers that must cite sources    | RAG         | You know which chunks were used         |
| A specific format, tone, or voice | Fine-tuning | Behavior is hard to prompt, easy to show |
| A narrow, high-volume task        | Fine-tuning | Bakes in reliability, shortens prompts  |
| Specialized domain language       | Fine-tuning | Teaches vocabulary and patterns         |
| Fresh knowledge and a fixed style | Both        | They are complementary, not exclusive   |
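The table can be read as a simple rule: knowledge needs point to RAG, behavior needs point to fine-tuning, and a mix points to both. The sketch below encodes that reading. The `Need` fields and `choose_tool` are illustrative names, and the "prompting" fallback when nothing is flagged is an assumption for the example, not a row from the table.

```python
from dataclasses import dataclass

@dataclass
class Need:
    """Flags mirroring the rows of the table (field names are illustrative)."""
    knowledge_changes_often: bool = False
    must_cite_sources: bool = False
    specific_format_or_voice: bool = False
    narrow_high_volume_task: bool = False
    specialized_domain_language: bool = False

def choose_tool(need):
    """Map a set of needs to RAG, fine-tuning, both, or plain prompting."""
    wants_rag = need.knowledge_changes_often or need.must_cite_sources
    wants_finetune = (
        need.specific_format_or_voice
        or need.narrow_high_volume_task
        or need.specialized_domain_language
    )
    if wants_rag and wants_finetune:
        return "both"
    if wants_rag:
        return "RAG"
    if wants_finetune:
        return "fine-tuning"
    return "prompting"  # assumed fallback when no gap is flagged

print(choose_tool(Need(knowledge_changes_often=True)))
print(choose_tool(Need(specific_format_or_voice=True)))
print(choose_tool(Need(must_cite_sources=True, specific_format_or_voice=True)))
```

Real projects rarely reduce to five booleans, so treat this as a checklist in code form rather than a decision procedure.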

A worked cost comparison

Imagine a support assistant that must answer from a knowledge base of a few thousand articles that change weekly.

  • RAG path. One-time: embed the articles (cents to a few dollars). Ongoing: a vector database, plus the extra prompt tokens for retrieved chunks on each query. Updating an article means re-embedding only that article's chunks. No training, ever.
  • Fine-tuning path. Prepare a labeled dataset, run a training job, evaluate, deploy the adapter. Then, because the articles change weekly and the facts are now baked into weights, repeat the whole cycle to stay current. The recurring training cost never goes away.

For changing knowledge, RAG is both cheaper to start and far cheaper to maintain. Fine-tuning's upfront cost only pays off when what you are teaching is stable behavior, not moving facts.
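The shape of the two cost curves can be made concrete with a small model. Every number below is a placeholder chosen for illustration, not a measurement or a quote, and the function names and cost categories are assumptions. The fine-tuning side also leaves out serving costs, so the sketch compares only the parts the two paths do not share.

```python
def rag_total(weeks, setup, weekly_infra, queries_per_week,
              extra_cost_per_query, weekly_reembed):
    """One-time embedding plus recurring infra, prompt-token, and re-embedding cost."""
    weekly = weekly_infra + queries_per_week * extra_cost_per_query + weekly_reembed
    return setup + weeks * weekly

def finetune_total(weeks, cost_per_cycle, retrain_every_weeks=1):
    """Initial training cycle plus one retraining cycle per refresh interval."""
    cycles = 1 + weeks // retrain_every_weeks
    return cycles * cost_per_cycle

# Placeholder inputs, NOT measurements: substitute your own prices and traffic.
weeks = 12
rag = rag_total(weeks, setup=2.0, weekly_infra=10.0, queries_per_week=1000,
                extra_cost_per_query=0.001, weekly_reembed=0.10)
ft = finetune_total(weeks, cost_per_cycle=50.0, retrain_every_weeks=1)

print(f"RAG over {weeks} weeks: {rag:.2f}")
print(f"Fine-tuning over {weeks} weeks: {ft:.2f}")
```

The structural point is that the fine-tuning total scales with the number of refresh cycles, so lengthening `retrain_every_weeks` lowers cost only by letting the model's facts go stale.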

When to use both

The strongest systems often combine them. Fine-tune the model for the behavior you want, such as your support team's tone and escalation rules, and use RAG to supply the facts, such as the current state of a customer's account or the latest policy. Fine-tuning sets how it answers; RAG sets what it knows. They sit at different layers and do not conflict.

Common failure modes

  • Fine-tuning to inject facts that change. One of the most expensive mistakes in the field. If the information moves, you will retrain forever. That is a RAG problem wearing a fine-tuning costume.
  • Bad chunking in RAG. Chunks too large dilute the signal; too small lose context. Retrieval quality, not the model, is usually what makes a RAG system feel dumb.
  • Skipping evaluation. Both approaches need a test set of real questions with known good answers. Without it you are tuning blind.
  • Reaching for fine-tuning first. Start with good prompting. Add RAG if the model lacks knowledge. Fine-tune only when behavior cannot be fixed by either. Most projects never need the third step.
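For the evaluation point above, a minimal sketch of what a retrieval test set can look like: each question is paired with the id of the chunk that should be retrieved, and the metric is the fraction of questions whose expected chunk appears in the top \(k\). The chunk ids, questions, and keyword-overlap retriever are all made up for the example; a real harness would call the system's actual retriever and would also score the generated answers.

```python
import re

CHUNKS = {
    "refunds": "Refunds are issued within 14 days of purchase.",
    "hours": "Support hours are 9am to 5pm on weekdays.",
    "sso": "Enterprise plans include single sign-on.",
}

def tokens(text):
    """Lowercased word set used by the toy retriever."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, k):
    """Toy retriever: rank chunk ids by word overlap with the question."""
    q = tokens(question)
    ranked = sorted(CHUNKS, key=lambda cid: len(q & tokens(CHUNKS[cid])), reverse=True)
    return ranked[:k]

def hit_rate_at_k(test_set, retrieve_fn, k):
    """Fraction of questions whose expected chunk id is among the top k."""
    hits = sum(1 for question, expected in test_set if expected in retrieve_fn(question, k))
    return hits / len(test_set)

TEST_SET = [
    ("When are refunds issued?", "refunds"),
    ("What are the support hours?", "hours"),
    ("Do enterprise plans include single sign-on?", "sso"),
    ("Can I log in with my company account?", "sso"),  # paraphrase, no shared words
]

for k in (1, 3):
    print(f"hit rate @ {k}: {hit_rate_at_k(TEST_SET, retrieve, k):.2f}")
```

The paraphrased last question shares no words with its target chunk, so the keyword retriever misses it at \(k = 1\). A test set surfaces exactly this kind of failure before users do.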

References

  1. Lewis et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS, 2020.
  2. Karpukhin et al. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP, 2020.
  3. Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685, 2021.
