Code Intermediate medium · 7 min

Decoder models: GPT-2, Llama

What you will learn

Decoder-only transformers generate text by predicting the next token sequentially, using causal masking to prevent looking ahead.

Why this matters

Most modern language models (GPT, Llama, Mistral) are decoder-only architectures. You need to understand how they generate text, manage memory efficiently, and avoid the pitfalls that cause OOM errors in production.

Skip if: Use encoder-only models (BERT, RoBERTa) if you need bidirectional context for classification tasks, not generation. Use encoder-decoder (T5, BART) if your task requires reading an input then writing a different output (translation, summarization).

Explanation

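A decoder-only model is a transformer built from a stack of decoder blocks with causal self-attention: it predicts each token using only the tokens that come before it. During training, and when reading a prompt, all positions are processed in one parallel pass under the causal mask; during generation, new tokens are produced one at a time. Causal masking prevents attention from looking forward in the sequence, which makes the architecture naturally suited for left-to-right generation.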
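To make causal masking concrete, here is a minimal sketch in plain Python (no framework). The function name `causal_attention_weights` and the toy scores are illustrative assumptions, not part of any library; real models do the same thing with tensor operations.

```python
import math

def causal_attention_weights(scores):
    """Apply a causal mask and a row-wise softmax to a square score matrix.

    scores[i][j] is the raw attention score of query position i on key position j.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may only attend to positions 0..i; later ones are masked out.
        visible = scores[i][: i + 1]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Masked (future) positions receive exactly zero weight.
        row = [e / total for e in exps] + [0.0] * (n - i - 1)
        weights.append(row)
    return weights

# Toy scores for a 3-token sequence (all equal, so visible weights are uniform).
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.33, 0.33, 0.33]
```

Each row sums to 1, and everything above the diagonal is zero: the first token sees only itself, the last token sees the whole sequence.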

Mechanically, when you call model.generate(), the model starts with your input tokens and iteratively:

1. Runs a forward pass (embedding → transformer layers → logits); with KV caching, only the newest token needs to be processed after the first step
2. Samples or greedy-selects the next token from the output logits
3. Appends that token to the sequence
4. Repeats until reaching max_length or an end-of-sequence token
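The loop can be sketched without any model at all. In the example below, `toy_logits` is a hypothetical stand-in for the forward pass and `EOS = 0` is an assumed end-of-sequence ID; only the control flow mirrors what `generate()` does with greedy decoding.

```python
EOS = 0  # hypothetical end-of-sequence token id

def toy_logits(tokens):
    """Hypothetical stand-in for a forward pass: favors (last token + 1) mod 5."""
    vocab_size = 5
    target = (tokens[-1] + 1) % vocab_size
    return [1.0 if t == target else 0.0 for t in range(vocab_size)]

def greedy_generate(prompt, max_length):
    tokens = list(prompt)
    while len(tokens) < max_length:
        logits = toy_logits(tokens)             # 1. forward pass
        next_token = logits.index(max(logits))  # 2. greedy selection (argmax)
        tokens.append(next_token)               # 3. append to the sequence
        if next_token == EOS:                   # 4. stop early at EOS...
            break
    return tokens                               # ...or when max_length is reached

print(greedy_generate([1, 2], max_length=10))  # [1, 2, 3, 4, 0]  (stopped at EOS)
print(greedy_generate([1, 2], max_length=4))   # [1, 2, 3, 4]     (stopped at max_length)
```

Note that `max_length` here counts the prompt too, which is the same convention `generate()` uses.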

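The key difference from encoder-only models shows up at generation time: new tokens cannot be produced in parallel, because each one depends on the tokens generated before it. (The prompt itself, like a training batch, is still processed in one parallel pass under the causal mask.) Memory also needs attention: KV caching avoids recomputing keys and values for earlier tokens, but the cache grows linearly with sequence length and batch size, so long generations on large models can exhaust GPU memory. Options such as device_map='auto' help place large models across the available devices.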
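A rough back-of-the-envelope estimate shows how the KV cache scales. The helper `kv_cache_bytes` is an illustrative assumption, not a library function; it assumes standard multi-head attention (one key and one value vector of size `hidden_size` per layer per token) and ignores weights and activations.

```python
def kv_cache_bytes(n_layers, hidden_size, seq_len, batch_size=1, bytes_per_value=2):
    """Estimate KV cache size: one key and one value vector per layer per token."""
    return 2 * n_layers * hidden_size * seq_len * batch_size * bytes_per_value

# GPT-2 small: 12 layers, hidden size 768; float16 uses 2 bytes per value.
for seq_len in (128, 512, 1024):
    size = kv_cache_bytes(n_layers=12, hidden_size=768, seq_len=seq_len)
    print(f"{seq_len:>5} tokens: {size / 2**20:.1f} MiB")
#   128 tokens: 4.5 MiB
#   512 tokens: 18.0 MiB
#  1024 tokens: 36.0 MiB
```

For GPT-2 small the cache is modest, but the same formula with dozens of layers, a hidden size in the thousands, and a batch of long prompts reaches gigabytes quickly.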

Analogy

Like writing a sentence word-by-word: you look at what you've written so far and predict what word comes next. You can't look at the end of the sentence to help choose the first word: only backward.

Code

python

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

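device = 'cuda' if torch.cuda.is_available() else 'cpu'
# float16 halves memory on GPU; fall back to float32 on CPU,
# where half precision is slow or unsupported for some ops
dtype = torch.float16 if device == 'cuda' else torch.float32

model_name = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',  # requires the accelerate package
    torch_dtype=dtype
)

prompt = "The future of AI is"
# Move the inputs to wherever the model's weights were placed
input_ids = tokenizer.encode(prompt, return_tensors='pt').to(model.device)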

output = model.generate(
    input_ids,
    max_length=25,
    num_return_sequences=1,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)

print("\n--- Logits shape from forward pass ---")
with torch.no_grad():
    outputs = model(input_ids)
    print(f"Logits shape: {outputs.logits.shape}")
    print(f"Next token prediction shape: {outputs.logits[0, -1, :].shape}")
    next_token_logits = outputs.logits[0, -1, :]
    probabilities = torch.softmax(next_token_logits, dim=-1)
    top_k_values, top_k_indices = torch.topk(probabilities, 5)
    print("\nTop 5 predicted next tokens:")
    for val, idx in zip(top_k_values, top_k_indices):
        # .item() converts the 0-d tensor to a plain int token ID before decoding
        print(f"  {tokenizer.decode([idx.item()])}: {val.item():.4f}")

Output

The future of AI is to develop better tools that help people solve complex problems and improve their quality of life.

--- Logits shape from forward pass ---
Logits shape: torch.Size([1, 6, 50257])
Next token prediction shape: torch.Size([50257])

Top 5 predicted next tokens:
  uncertain: 0.0847
  a: 0.0623
  crucial: 0.0521
  important: 0.0428
   bright: 0.0342

What just happened?

We loaded GPT-2 with `device_map='auto'`, which places the weights on the available device(s) automatically. We encoded the prompt into token IDs, then called `generate()`, which predicted one token at a time until the sequence reached `max_length=25` total tokens (prompt included) or an end-of-sequence token. Because `do_sample=True`, the generated text differs from run to run. The separate forward pass returned logits of shape [batch, sequence_length, vocab_size]; we took the logits at the last position (index -1), applied softmax, and decoded the five most probable next tokens.

Common gotcha

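A common misreading is that generate() keeps going until memory runs out unless you cap it. In practice the length is bounded by max_length or max_new_tokens (taken from the model's generation config when you omit them, and the default is often short), so the more frequent surprise is truncated output rather than OOM. Memory still grows with every token, because the KV cache keeps keys and values for the whole sequence, so set a limit deliberately. Also, tokenizer.encode() returns a plain Python list of IDs by default; pass return_tensors='pt' to get a tensor the model accepts. For batches, call the tokenizer directly with padding=True (after setting a pad token for models like GPT-2 that lack one) so the sequences share one length and you get an attention_mask.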

Error recovery

RuntimeError: Expected all tensors to be on the same device

Your input_ids are on CPU but the model is on GPU. `device_map='auto'` places the model's weights but does not move your inputs. Fix: move the inputs to the model's device with `input_ids = input_ids.to(model.device)`.

OutOfMemoryError during generate()

Sequence length or batch size is too large for your GPU memory. Fix: reduce `max_length` or `max_new_tokens`, generate for fewer prompts at once, or load the model quantized (`BitsAndBytesConfig`) with `device_map='auto'`. KV caching (`use_cache=True`) is already the default; it speeds up generation, but the cache itself consumes memory that grows with sequence length.

ValueError: Input length of input_ids is X, but max_length is Y

`max_length` counts the prompt plus the generated tokens, so it must be greater than the length of input_ids. Fix: set `max_new_tokens=N` instead of `max_length`; it counts only the generated tokens.
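The difference between the two settings is simple arithmetic. The helper below, `new_token_budget`, is a hypothetical function written for illustration; it is a simplified model of the length limit, not the library's actual logic.

```python
def new_token_budget(prompt_len, max_length=None, max_new_tokens=None):
    """Upper bound on generated tokens under each setting (EOS can stop earlier)."""
    if max_new_tokens is not None:
        # Counts only generated tokens, independent of the prompt length.
        return max_new_tokens
    if max_length is None:
        raise ValueError("set max_length or max_new_tokens")
    # max_length counts prompt + generated tokens, so the prompt eats into it.
    return max(max_length - prompt_len, 0)

print(new_token_budget(6, max_length=25))       # 19
print(new_token_budget(30, max_length=25))      # 0  (prompt already exceeds the limit)
print(new_token_budget(30, max_new_tokens=50))  # 50
```

With `max_length`, a longer prompt silently shrinks the room left for output; with `max_new_tokens`, the budget stays fixed.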

TypeError: encode() got an unexpected keyword argument 'return_tensors'

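`return_tensors` has been accepted by transformers tokenizers for many releases, so this error usually points to a very old install, or to an object that is not a transformers tokenizer (for example, a `Tokenizer` from the standalone `tokenizers` library). Check `transformers.__version__` and upgrade with `pip install --upgrade transformers` if needed.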

Experienced dev note

In transformers 5.5.x, `device_map='auto'` is the norm rather than an optimization, so use it by default. Also, `generate()` with `do_sample=True` is non-deterministic: each token is drawn at random from the predicted probability distribution, so outputs vary between runs. If you need reproducibility, call `torch.manual_seed()` before `generate()`. For production, measure tokens per second: the cost comes mostly from running one forward pass per generated token, not from the sampling step itself. If you need deterministic output, use `do_sample=False` (greedy); if throughput matters, consider vLLM for batched inference. One more thing: `.generate()` does not update the model, so nothing is learned between calls. If your output is repetitive or nonsensical, check the prompt and the decoding settings (for example `temperature`, `top_p`, `repetition_penalty`) before blaming the model.
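To see why sampling varies and why a seed fixes it, here is a small plain-Python sketch of temperature scaling plus top-p (nucleus) filtering. The function `sample_top_p` and the logits are made up for illustration; the library's implementation differs in detail, and with PyTorch the seed would be set through `torch.manual_seed()` rather than `random.Random`.

```python
import math
import random

def sample_top_p(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample a token index using temperature scaling and nucleus (top-p) filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-likely tokens whose cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # choices() normalizes the kept weights internally before drawing.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

logits = [2.0, 1.5, 0.5, -1.0]  # made-up logits for a 4-token vocabulary

def draw(seed, n=8):
    rng = random.Random(seed)  # a dedicated, seeded generator
    return [sample_top_p(logits, rng=rng) for _ in range(n)]

run_a = draw(seed=42)
run_b = draw(seed=42)
print(run_a == run_b)  # True: same seed, same samples
```

With these logits only the two most likely tokens survive the top-p cut, so the low-probability tail is never sampled; lowering `temperature` sharpens the distribution further.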

Check your understanding

Why does increasing `max_length` in `generate()` not guarantee longer output, and what parameter would you use to allow up to 50 new tokens regardless of input length?

Show answer hint

The answer must mention that `generate()` stops early if it hits the end-of-sequence token (like `</s>` or `<|endoftext|>`), and explain that `max_new_tokens` (not `max_length`) controls how many tokens are generated after the input.

VERSION transformers 5.5.x removed support for `model.generate()` without `pad_token_id` specified: it now raises ValueError if using batched generation. In 4.x, this was a warning. Always set `pad_token_id=tokenizer.eos_token_id` for any multi-sample generation. Also, in 5.0.0+, `AutoModelForCausalLM.from_pretrained()` with `device_map='auto'` changed to require explicit `torch_dtype` for quantized inference: omitting it causes dtype mismatches.

Next, learn how to optimize decoder inference with KV caching and batching, and how `attention_mask` controls which tokens the model should attend to.
