Decoder models: GPT-2, Llama
Why this matters
Most modern language models (GPT, Llama, Mistral) are decoder-only architectures. You need to understand how they generate text, manage memory efficiently, and avoid the pitfalls that cause OOM errors in production.
Explanation
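A decoder-only model is a transformer built from a single stack of blocks that use causal self-attention: each position can attend only to itself and to earlier positions, so the model predicts the next token from previous tokens alone. During training, all positions of a sequence are processed in parallel under the causal mask; during generation, new tokens are produced one at a time. Because the mask prevents attention from looking forward in the sequence, the architecture is naturally suited to left-to-right generation.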
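To make causal masking concrete, here is a minimal, dependency-free sketch. It assumes a square matrix of raw attention scores (the `scores` values are made up for illustration) and is not how any particular library implements attention; it only shows that row i ends up with zero weight on every position after i.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j] that ignores future positions j > i."""
    weights = []
    for i, row in enumerate(scores):
        # Position i may only attend to positions 0..i (itself and the past).
        visible = row[: i + 1]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Future positions get weight 0.0, as if their score were -infinity.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# Hypothetical raw scores for a 4-token sequence (all equal, for readability).
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row sums to 1, and the upper-right triangle is all zeros: the first token sees only itself, the last token sees the whole sequence.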
Mechanically, when you call model.generate(), the model starts with your input tokens and iteratively:
1. Runs a forward pass (embedding → transformer layers → logits); with KV caching enabled, steps after the first only need to process the newest token
2. Samples or greedy-selects the next token from the output logits
3. Appends that token to the sequence
4. Repeats until reaching max_length or an end-of-sequence token
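The four steps above can be sketched as a plain loop. This is a toy illustration, not the library's `generate()`: `toy_next_token_logits` is a hypothetical stand-in for the real forward pass, and the vocabulary of 5 token IDs is invented for the example.

```python
def toy_next_token_logits(token_ids, vocab_size=5):
    """Hypothetical stand-in for a forward pass: favours (last token + 1) % vocab_size."""
    logits = [0.0] * vocab_size
    logits[(token_ids[-1] + 1) % vocab_size] = 1.0
    return logits

def greedy_generate(prompt_ids, max_length, eos_token_id):
    """Greedy autoregressive decoding; max_length counts the prompt tokens too."""
    token_ids = list(prompt_ids)
    while len(token_ids) < max_length:
        logits = toy_next_token_logits(token_ids)  # step 1: forward pass
        next_id = max(range(len(logits)), key=logits.__getitem__)  # step 2: greedy pick
        token_ids.append(next_id)  # step 3: append to the sequence
        if next_id == eos_token_id:  # step 4: stop early at end-of-sequence
            break
    return token_ids

print(greedy_generate([0, 1], max_length=10, eos_token_id=4))
```

Here generation stops as soon as token 4 (the assumed end-of-sequence ID) is produced, well before `max_length` is reached. With an ID that never appears, the loop runs until the sequence is `max_length` tokens long.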
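The key difference from encoder-only models: generation cannot be parallelized across output positions, because each new token depends on the ones before it. KV caching avoids recomputing attention keys and values for earlier tokens, but the cache itself grows linearly with sequence length (and batch size), so long sequences still consume a lot of memory. Note that device_map='auto' does not address this: it only decides where the model weights are placed (GPU, CPU, or disk), so careful memory management is still up to you.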
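A back-of-the-envelope estimate shows how the KV cache scales with sequence length. The sketch below assumes standard multi-head attention (one key and one value vector of size `hidden_size` per layer per token, no grouped-query sharing) and 2 bytes per value for float16; the GPT-2 small figures (12 layers, hidden size 768) are used as the example. The function name is illustrative, not a library API.

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, batch_size=1, bytes_per_value=2):
    """Approximate KV cache size: keys + values for every layer and every token."""
    return 2 * num_layers * hidden_size * seq_len * batch_size * bytes_per_value

# GPT-2 small: 12 transformer layers, hidden size 768.
GPT2_LAYERS, GPT2_HIDDEN = 12, 768

for seq_len in (128, 512, 1024):
    mib = kv_cache_bytes(GPT2_LAYERS, GPT2_HIDDEN, seq_len) / 2**20
    print(f"{seq_len:>5} tokens: {mib:.1f} MiB")
```

The size is linear in both `seq_len` and `batch_size`, so doubling either doubles the cache. For larger models the same formula gives far bigger numbers, which is where long sequences start to hurt.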
Analogy
Like writing a sentence word by word: you look at what you've written so far and predict what word comes next. You can't look at the end of the sentence to help choose the first word; you can only look backward.
Code
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_name = 'gpt2'

tokenizer = AutoTokenizer.from_pretrained(model_name)
# device_map='auto' relies on the accelerate package for weight placement.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    torch_dtype=torch.float16
)

prompt = "The future of AI is"
# Move the prompt to the model's device to avoid a device-mismatch error.
input_ids = tokenizer.encode(prompt, return_tensors='pt').to(model.device)

# Sample a continuation; max_length counts prompt tokens plus new tokens.
output = model.generate(
    input_ids,
    max_length=25,
    num_return_sequences=1,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)

# A single forward pass, to inspect the raw next-token distribution.
print("\n--- Logits shape from forward pass ---")
with torch.no_grad():
    outputs = model(input_ids)
print(f"Logits shape: {outputs.logits.shape}")
print(f"Next token prediction shape: {outputs.logits[0, -1, :].shape}")

# Logits at the last position score every vocabulary entry as the next token.
next_token_logits = outputs.logits[0, -1, :]
probabilities = torch.softmax(next_token_logits, dim=-1)
top_k_values, top_k_indices = torch.topk(probabilities, 5)
print(f"\nTop 5 predicted next tokens:")
for val, idx in zip(top_k_values, top_k_indices):
    print(f" {tokenizer.decode([idx.item()])}: {val.item():.4f}")

Example output (the generated text is sampled, so it varies between runs):

The future of AI is to develop better tools that help people solve complex problems and improve their quality of life.

--- Logits shape from forward pass ---
Logits shape: torch.Size([1, 6, 50257])
Next token prediction shape: torch.Size([50257])

Top 5 predicted next tokens:
  uncertain: 0.0847
  a: 0.0623
  crucial: 0.0521
  important: 0.0428
  bright: 0.0342
What just happened?
We loaded GPT-2 with `device_map='auto'` (handles placement of the model weights automatically). We encoded a prompt into token IDs. We called `generate()`, which predicted one token at a time until the sequence reached `max_length=25` (prompt included) or an end-of-sequence token appeared: at most 19 new tokens for the 6 prompt tokens shown in the logits shape. The separate forward pass returned logits of shape [batch, sequence_length, vocab_size]; we took the logits at the last position (index -1), applied softmax, and decoded the five most probable next tokens. The generated text shows the model completing a coherent sentence through autoregressive generation.
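The `temperature=0.7` and `top_p=0.9` arguments passed to `generate()` shape how the next token is drawn from the logits. The following is a simplified, dependency-free sketch of temperature scaling followed by nucleus (top-p) sampling. It illustrates the idea only and should not be read as the library's exact implementation; the function name and the example logits are made up.

```python
import math
import random

def sample_top_p(logits, temperature=0.7, top_p=0.9, rng=random):
    """Draw one token index using temperature scaling and nucleus (top-p) filtering."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-probable tokens whose total mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Sample among the kept tokens; choices() treats the weights as relative.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

# One dominant logit: the nucleus shrinks to a single token.
print(sample_top_p([5.0, 1.0, 0.0, -1.0], rng=random.Random(0)))
# Equal logits: every token stays in the nucleus, so any index can come out.
print(sample_top_p([0.0, 0.0, 0.0, 0.0], rng=random.Random(0)))
```

With a confident distribution, top-p sampling behaves almost like greedy decoding; with a flat one, it keeps many candidates and the output varies more.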
Common gotcha
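Developers often forget that the sequence (and its KV cache) grows by one token on every generate() iteration, so memory use rises with each token produced. Always set an explicit limit with max_new_tokens or max_length: without one, a default from the generation config applies, which is rarely what you want, and an overly generous limit on long or batched inputs can end in OOM errors. Also, tokenizer.encode() returns a plain Python list by default; pass return_tensors='pt' to get a tensor the model accepts. For batches, call the tokenizer directly with padding enabled so every row has the same length.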
Error recovery
RuntimeError: Expected all tensors to be on the same device
OutOfMemoryError during generate()
ValueError: Input length of input_ids is X, but max_length is Y
TypeError: encode() got an unexpected keyword argument 'return_tensors'
Experienced dev note
In transformers 5.5.x, `device_map='auto'` is the norm, not an optimization; use it unless you need explicit placement. Also: `generate()` with `do_sample=True` is non-deterministic, because each token is drawn at random from the predicted distribution (the probabilities sum to 1.0, but the draw varies between runs). If you need reproducibility, call `torch.manual_seed()` before calling `generate()`. For production, measure token generation speed instead of assuming it. If latency matters, use `do_sample=False` (greedy) to skip the sampling step, or switch to vLLM for batched inference. One more thing: never rely on `.generate()` alone to fix output quality. The model learns nothing new at inference time; it only recombines patterns from its training data. If your output is repetitive or nonsensical, look at your prompt and decoding settings first.
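The seeding advice can be illustrated without loading a model. The sketch below uses Python's `random.Random` as a stand-in for the torch random number generator, so it demonstrates the principle only: the same seed gives the same sequence of draws. The probabilities are invented.

```python
import random

def sample_tokens(probs, n, seed):
    """Draw n token indices from probs with a seeded generator."""
    rng = random.Random(seed)  # stand-in for seeding torch before generate()
    return rng.choices(range(len(probs)), weights=probs, k=n)

probs = [0.5, 0.3, 0.2]  # hypothetical next-token distribution
run_a = sample_tokens(probs, 10, seed=0)
run_b = sample_tokens(probs, 10, seed=0)
print(run_a == run_b)  # identical seeds reproduce identical draws
```

In the real setting, the seed has to be set again before every call you want to reproduce, because each call advances the generator state.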
Check your understanding
Why does increasing `max_length` in `generate()` not guarantee longer output, and what parameter would you use if you wanted exactly 50 new tokens regardless of input length?
Answer hint
The answer must mention that `generate()` stops early if it hits the end-of-sequence token (like `</s>` or `<|endoftext|>`), and explain that `max_new_tokens` (not `max_length`) controls how many tokens are generated after the input.
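The difference between the two limits is simple arithmetic, sketched below. `new_token_budget` is a hypothetical helper, not a library function: it returns the upper bound on generated tokens, and an end-of-sequence token can still end generation earlier. Where the prompt is already longer than `max_length`, the sketch returns 0, whereas the library reports the input-length error listed under Error recovery.

```python
def new_token_budget(prompt_len, max_length=None, max_new_tokens=None):
    """Upper bound on generated tokens; EOS can still stop generation sooner."""
    if max_new_tokens is not None:
        return max_new_tokens  # independent of the prompt length
    if max_length is None:
        raise ValueError("set max_length or max_new_tokens")
    return max(max_length - prompt_len, 0)  # the prompt eats into the budget

for prompt_len in (6, 20, 30):
    print(prompt_len,
          new_token_budget(prompt_len, max_length=25),
          new_token_budget(prompt_len, max_new_tokens=50))
```

With `max_length=25` the budget shrinks as the prompt grows, while `max_new_tokens=50` stays at 50 for every prompt. To also rule out an early stop at end-of-sequence, `generate()` accepts a `min_new_tokens` argument that can be set to the same value.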