What is KV cache in LLM inference
KV cache is a memory mechanism in LLM inference that stores the keys and values computed for previous tokens so they do not have to be recomputed during autoregressive generation. By reusing these past attention computations, it makes token-by-token generation faster and more efficient.
How it works
During autoregressive inference, a large language model (LLM) generates text one token at a time. Each new token's prediction depends on all previously generated tokens. The model uses self-attention, which computes attention scores using keys, values, and queries derived from token embeddings.
The KV cache stores the keys and values computed for all previous tokens so the model doesn't recompute them for every new token. Instead, it only computes the query for the new token and attends over the cached keys and values. This is like remembering all previous conversation points so you don’t have to repeat or reprocess them every time you add a new sentence.
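The decode step described above can be sketched with NumPy. This is a toy single-head attention loop; the dimension `d`, the random projection matrices, and the `softmax` helper are illustrative stand-ins, not any specific model's implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d = 8  # toy head dimension
rng = np.random.default_rng(0)
# Stand-ins for the learned query/key/value projections.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

k_cache, v_cache = [], []  # grows by one entry per generated token

def decode_step(x_new):
    """Compute q, k, v for the new token only, append k and v to the
    cache, and attend the new query over ALL cached keys/values."""
    q, k, v = x_new @ Wq, x_new @ Wk, x_new @ Wv
    k_cache.append(k)
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = softmax(q @ K.T / np.sqrt(d))
    return scores @ V

for _ in range(3):  # three decode steps
    out = decode_step(rng.normal(size=d))
print(len(k_cache))  # 3: one cached key per generated token
```

Note that each step computes projections only for the new token; without the cache, every step would recompute keys and values for the entire prefix.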
Concrete example
Imagine generating the sentence "Hello world" token by token. Without KV cache, the model recomputes attention keys and values for "Hello" when generating "world". With KV cache, it reuses the stored keys and values for "Hello" and only computes new ones for "world".
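A quick way to see that caching changes the cost but not the result: compute attention for the second token over the full two-token sequence in one shot, then again using cached keys and values from the first token, and confirm the outputs match. This is a toy NumPy sketch; the embeddings and projection matrices are random stand-ins.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
d = 4
X = rng.normal(size=(2, d))  # toy embeddings for "Hello", "world"
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

# Full recompute: attention for "world" using K, V built from scratch.
full = softmax(Q[1] @ K.T / np.sqrt(d)) @ V

# Cached path: "Hello"'s key/value were stored earlier; only
# "world"'s key/value are computed now and appended.
k_cache, v_cache = [K[0]], [V[0]]
k_cache.append(K[1])
v_cache.append(V[1])
cached = softmax(Q[1] @ np.stack(k_cache).T / np.sqrt(d)) @ np.stack(v_cache)

print(np.allclose(full, cached))  # True: same output, less recomputation
```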
In practice, hosted LLM APIs manage the KV cache inside the serving stack; an ordinary completion call benefits from it without any caller-side configuration:

```python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
messages = [{"role": "user", "content": "Write a poem about AI."}]

# The KV cache is handled internally by the model during inference;
# this call simply triggers autoregressive generation that uses it.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    max_tokens=50,
)
print(response.choices[0].message.content)
# Sample output:
# AI is a spark of light, / Guiding us through the night,
# Learning, growing day by day, / In endless, wondrous ways.
```
When to use it
Use KV cache during autoregressive generation tasks where tokens are generated sequentially, such as chatbots, text completion, and code generation. It significantly reduces latency and computational cost for long outputs.
Do not rely on KV cache for non-autoregressive models or tasks that require full sequence reprocessing, like masked language modeling or bidirectional attention.
Key terms
| Term | Definition |
|---|---|
| KV cache | Storage of keys and values from previous tokens to speed up attention computation during inference. |
| Key | Vector representing token features used to compute attention weights. |
| Value | Vector containing token information combined with attention weights to produce output. |
| Query | Vector derived from the current token used to attend over keys. |
| Self-attention | Mechanism allowing tokens to attend to other tokens in the sequence. |
Key Takeaways
- KV cache stores intermediate attention states to avoid redundant computation during token generation.
- It accelerates autoregressive inference by reusing keys and values from previous tokens.
- Use KV cache for efficient long-sequence generation in chatbots, code completion, and text generation.
- The cache is managed internally by LLM APIs and is transparent to most users.
- It does not apply to non-autoregressive or bidirectional transformer models.