A language model generates text one token at a time. To produce each new token, it runs attention over everything that came before. The KV cache is what stops that from being catastrophically wasteful, and understanding it explains a surprising amount about why LLMs cost what they do to run.

The problem: generation repeats itself

Recall how attention works: every token attends to every previous token by comparing a query against keys, then blending values. During generation, the model produces token one, then token two, then token three, appending each to the sequence and running the model again.

Here is the waste. When the model generates token 100, the keys and values for tokens 1 through 99 are exactly the same as they were when it generated token 99. Nothing about the past changed. Recomputing them every step would mean redoing almost all the work on every single token: each new token would cost about as much as reprocessing the entire sequence so far, with attention work that grows with the square of the sequence length at every step.

The fix: cache the keys and values

The insight is simple. The key and value vectors for past tokens never change once computed, so store them. When generating a new token, compute the key and value for just that one new token, append them to the stored set, and reuse everything else.
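To make the idea concrete, here is a minimal sketch of a single attention head decoding with and without a cache. It is a toy under stated assumptions, not a real model: the projection matrices `Wq`, `Wk`, `Wv` are random, the token inputs `xs` are fixed in advance rather than sampled from the model's output, and every name is made up for illustration. The point is only that both paths should produce the same outputs while the cached path computes one new key and value per step.

```python
import math
import random

def matvec(x, W):
    """Project vector x (length d_in) with matrix W (d_in rows, d_out columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def attend(q, keys, values):
    """Scaled dot-product attention for one query over lists of keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def decode_without_cache(xs, Wq, Wk, Wv):
    outputs = []
    for t in range(len(xs)):
        # Recompute every key and value for positions 0..t at every step.
        keys = [matvec(x, Wk) for x in xs[: t + 1]]
        values = [matvec(x, Wv) for x in xs[: t + 1]]
        outputs.append(attend(matvec(xs[t], Wq), keys, values))
    return outputs

def decode_with_cache(xs, Wq, Wk, Wv):
    k_cache, v_cache = [], []
    outputs = []
    for x in xs:
        # Only the new token's key and value are computed; the rest are reused.
        k_cache.append(matvec(x, Wk))
        v_cache.append(matvec(x, Wv))
        outputs.append(attend(matvec(x, Wq), k_cache, v_cache))
    return outputs, k_cache, v_cache

def rand_matrix(d):
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

random.seed(0)
d, n_tokens = 4, 6  # toy sizes, chosen arbitrarily
Wq, Wk, Wv = rand_matrix(d), rand_matrix(d), rand_matrix(d)
xs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_tokens)]

slow = decode_without_cache(xs, Wq, Wk, Wv)
fast, k_cache, v_cache = decode_with_cache(xs, Wq, Wk, Wv)
max_diff = max(abs(a - b) for ra, rb in zip(slow, fast) for a, b in zip(ra, rb))
print("largest difference between the two paths:", max_diff)
print("cached keys:", len(k_cache), "cached values:", len(v_cache))
```

After the loop, the cache holds exactly one key and one value per token seen so far, which is the memory cost discussed later.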

Figure: The KV cache, one card from the deck. A hand-drawn diagram showing key and value tensors for past tokens being stored during autoregressive generation so they are not recomputed each step.

With the cache, generating each new token only requires attention work proportional to the current sequence length, not to its square. This is the difference between generation being practical and being unusable. Virtually every production LLM serving system relies on it.
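A rough way to see the gap is to count query-key dot products. The sketch below assumes one particular accounting: a single head in a single layer, where the uncached baseline reruns causal attention over the whole sequence at every step. The function names are hypothetical, and real systems have many other costs, so treat the numbers as an illustration of the growth rates only.

```python
def per_step_scores(t, cached):
    """Query-key dot products at the step where the sequence holds t tokens."""
    if cached:
        return t                  # one new query against t stored keys
    return t * (t + 1) // 2       # every position re-attends to its own prefix

def total_scores(n, cached):
    """Dot products summed over generating n tokens one at a time."""
    return sum(per_step_scores(t, cached) for t in range(1, n + 1))

for n in (10, 100, 1000):
    print(n, total_scores(n, cached=True), total_scores(n, cached=False))
# 10 55 220
# 100 5050 171700
# 1000 500500 167167000
```

Under this accounting the cached path does hundreds of times less attention work by the time the sequence reaches a thousand tokens, and the gap keeps widening with length.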

The catch: memory

The cache solves the compute problem by spending memory. And that memory is large. For every token in the context, the model stores a key and a value vector in every layer, for every attention head. The total grows linearly with several things at once: the sequence length, the number of layers, the number of key/value heads and their dimension, and the batch size (how many requests are served at once).
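The arithmetic is easy to sketch. The function below just multiplies the factors together; the configuration (32 layers, 32 key/value heads, head dimension 128, 2 bytes per stored number as in 16-bit floats) is an illustrative assumption, not a description of any particular model, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size in bytes.

    The leading 2 accounts for storing one key vector and one value vector
    per token, per layer, per key/value head.
    """
    return (2 * batch_size * seq_len * n_layers
            * n_kv_heads * head_dim * bytes_per_value)

# Illustrative configuration (an assumption, not a specific model).
config = dict(n_layers=32, n_kv_heads=32, head_dim=128)

per_token = kv_cache_bytes(seq_len=1, **config)
one_request = kv_cache_bytes(seq_len=4096, **config)
batch_of_16 = kv_cache_bytes(seq_len=4096, batch_size=16, **config)

print(per_token)               # 524288 bytes, i.e. 0.5 MiB per token
print(one_request / 2**30)     # 2.0 (GiB for one 4096-token request)
print(batch_of_16 / 2**30)     # 32.0 (GiB for sixteen such requests)
```

Doubling any one factor doubles the total, which is why long contexts and large batches compound so quickly.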

This is why long context is expensive and why serving many users simultaneously is hard. For long contexts or large batches, the KV cache can exceed the size of the model's own weights. At that point, the cache, not the model, is what limits how many requests a GPU can handle and how long a context it can support.

Why this drives so much engineering

Because the KV cache is the bottleneck, a large share of inference research targets it directly:

  • Grouped-query and multi-query attention let multiple query heads share keys and values, shrinking the cache.
  • Paged attention manages cache memory in blocks, like virtual memory, to avoid waste.
  • Cache compression and eviction reduce or prune what is stored, trading a little quality for a lot of memory.
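As a small illustration of the first item, the sketch below shows how query heads can be mapped onto a smaller set of shared key/value heads, and what that does to the number of values cached per token in one layer. The head counts and the contiguous grouping are assumptions chosen for the example, and the helper names are hypothetical.

```python
def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Index of the shared key/value head used by a given query head."""
    group_size = n_query_heads // n_kv_heads  # query heads per key/value head
    return query_head // group_size

def cached_values_per_token(n_kv_heads, head_dim):
    """Numbers stored per token in one layer: a key and a value per KV head."""
    return 2 * n_kv_heads * head_dim

n_query_heads, head_dim = 32, 128  # illustrative sizes
variants = [("multi-head", 32), ("grouped-query", 8), ("multi-query", 1)]

for name, n_kv_heads in variants:
    mapping = [kv_head_for(h, n_query_heads, n_kv_heads)
               for h in range(n_query_heads)]
    print(name, cached_values_per_token(n_kv_heads, head_dim), mapping[:8])
# multi-head 8192 [0, 1, 2, 3, 4, 5, 6, 7]
# grouped-query 2048 [0, 0, 0, 0, 1, 1, 1, 1]
# multi-query 256 [0, 0, 0, 0, 0, 0, 0, 0]
```

With these numbers, sharing each key/value head across four query heads cuts the per-token cache to a quarter, and sharing one head across all queries cuts it by a factor of 32.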

This last area is where our lab works. Our KV cache compression implementations explore how far the cache can be shrunk before quality suffers, which directly affects how cheaply long-context models can be served.

Inference internals, one card at a time

The KV cache, attention variants, speculative decoding, and the rest of the inference stack are in the LLM Flashcards: 180 hand-drawn cards.
