What is KV cache in LLM inference
KV cache is a memory mechanism in LLM inference that stores the keys and values computed for previous tokens so they do not have to be recomputed during autoregressive generation. By reusing these past attention computations, it makes token-by-token generation faster and more efficient.
How it works
During autoregressive inference, a large language model (LLM) generates text one token at a time. Each new token's prediction depends on all previously generated tokens. The model uses self-attention, which computes attention scores using keys, values, and queries derived from token embeddings.
The KV cache stores the keys and values computed for all previous tokens so the model doesn't recompute them for every new token. Instead, it only computes the query for the new token and attends over the cached keys and values. This is like remembering all previous conversation points so you don’t have to repeat or reprocess them every time you add a new sentence.
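The decode step described above can be sketched with NumPy. This is a toy single-head attention loop; the dimension `d`, the random projection matrices, and the `softmax` helper are illustrative stand-ins, not any specific model's implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d = 8  # toy head dimension
rng = np.random.default_rng(0)
# Stand-ins for the learned query/key/value projections.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

k_cache, v_cache = [], []  # grows by one entry per generated token

def decode_step(x_new):
    """Compute q, k, v for the new token only, append k and v to the
    cache, and attend the new query over ALL cached keys/values."""
    q, k, v = x_new @ Wq, x_new @ Wk, x_new @ Wv
    k_cache.append(k)
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = softmax(q @ K.T / np.sqrt(d))
    return scores @ V

for _ in range(3):  # three decode steps
    out = decode_step(rng.normal(size=d))
print(len(k_cache))  # 3: one cached key per generated token
```

Note that each step computes projections only for the new token; without the cache, every step would recompute keys and values for the entire prefix.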
Concrete example
Imagine generating the sentence "Hello world" token by token. Without KV cache, the model recomputes attention keys and values for "Hello" when generating "world". With KV cache, it reuses the stored keys and values for "Hello" and only computes new ones for "world".
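A quick way to see that caching changes the cost but not the result: compute attention for the second token over the full two-token sequence in one shot, then again using cached keys and values from the first token, and confirm the outputs match. This is a toy NumPy sketch; the embeddings and projection matrices are random stand-ins.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
d = 4
X = rng.normal(size=(2, d))  # toy embeddings for "Hello", "world"
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

# Full recompute: attention for "world" using K, V built from scratch.
full = softmax(Q[1] @ K.T / np.sqrt(d)) @ V

# Cached path: "Hello"'s key/value were stored earlier; only
# "world"'s key/value are computed now and appended.
k_cache, v_cache = [K[0]], [V[0]]
k_cache.append(K[1])
v_cache.append(V[1])
cached = softmax(Q[1] @ np.stack(k_cache).T / np.sqrt(d)) @ np.stack(v_cache)

print(np.allclose(full, cached))  # True: same output, less recomputation
```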
In practice, hosted LLM APIs manage the KV cache inside the serving stack; an ordinary completion call benefits from it without any caller-side configuration:

```python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
messages = [{"role": "user", "content": "Write a poem about AI."}]

# The KV cache is handled internally by the model during inference;
# this call simply triggers autoregressive generation that uses it.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    max_tokens=50,
)
print(response.choices[0].message.content)
# Sample output:
# AI is a spark of light, / Guiding us through the night,
# Learning, growing day by day, / In endless, wondrous ways.
```
When to use it
Use KV cache during autoregressive generation tasks where tokens are generated sequentially, such as chatbots, text completion, and code generation. It significantly reduces latency and computational cost for long outputs.
Do not rely on KV cache for non-autoregressive models or tasks that require full sequence reprocessing, like masked language modeling or bidirectional attention.
Key terms
| Term | Definition |
|---|---|
| KV cache | Storage of keys and values from previous tokens to speed up attention computation during inference. |
| Key | Vector representing token features used to compute attention weights. |
| Value | Vector containing token information combined with attention weights to produce output. |
| Query | Vector derived from the current token used to attend over keys. |
| Self-attention | Mechanism allowing tokens to attend to other tokens in the sequence. |
Key Takeaways
- KV cache stores intermediate attention states to avoid redundant computation during token generation.
- It accelerates autoregressive inference by reusing keys and values from previous tokens.
- Use KV cache for efficient long-sequence generation in chatbots, code completion, and text generation.
- The cache is managed internally by LLM APIs and is transparent to most users.
- It does not apply to non-autoregressive or bidirectional transformer models.