Token Indices Out of Range
Why this matters
This is one of the most common errors when starting with transformers. Your text tokenizes fine, but the model fails at the forward pass because the tokenizer's vocabulary and the model's embedding table don't match. Understanding this prevents hours of debugging.
Explanation
What it is: Your tokenizer converts text into integer token IDs. Each model has a fixed vocabulary size: the number of token IDs its embedding table covers, so valid IDs run from 0 to vocab_size-1. If your tokenizer produces an ID equal to or larger than the model's vocabulary size, the embedding lookup fails because no row exists for that ID.
How it works mechanically: When you call model(input_ids=...), the model looks up each integer ID in its embedding table. That table has exactly config.vocab_size rows (indices 0 to vocab_size-1). If you pass ID 50257 but vocab_size is only 50000, you're asking for row 50257 in a 50000-row table: out of bounds.
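The lookup failure can be reproduced without downloading any model. The sketch below is a minimal illustration, assuming PyTorch running on CPU; the table size (10 rows, 4 dimensions) is arbitrary and stands in for config.vocab_size and the hidden size.

```python
import torch
import torch.nn as nn

# A toy embedding table: 10 rows (valid IDs 0..9), 4 dimensions per row.
# The sizes are arbitrary stand-ins chosen only for this demo.
vocab_size = 10
embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=4)

valid_ids = torch.tensor([[0, 3, 9]])     # every ID is < vocab_size
invalid_ids = torch.tensor([[0, 3, 10]])  # 10 is one past the last row

# An in-range lookup returns one vector per ID.
vectors = embedding(valid_ids)
print(vectors.shape)

# An out-of-range lookup raises an IndexError on CPU.
lookup_failed = False
try:
    embedding(invalid_ids)
except IndexError as err:
    lookup_failed = True
    print(f"Lookup failed: {err}")
```

On a GPU the same mistake usually surfaces as a less readable device-side assertion, so rerunning the failing batch on CPU is often the quickest way to confirm an out-of-range ID.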
When to use what: Always match tokenizer and model from the same source. Use AutoTokenizer and AutoModelForCausalLM together with the same model name. Never use a tokenizer from one model with a model from another.
Analogy
It's like having a library with books numbered 1-5000, but your checkout system tries to retrieve book #5001. The checkout system works fine, but the library only has 5000 books.
Code
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer and model from the same checkpoint so their vocabularies match.
model_name = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

print(f'Model vocabulary size: {model.config.vocab_size}')
print(f'Tokenizer vocabulary size: {len(tokenizer)}')

text = 'Hello, world!'
encoded = tokenizer(text, return_tensors='pt')
print(f'Encoded token IDs: {encoded["input_ids"]}')
print(f'Max token ID in input: {encoded["input_ids"].max().item()}')

# Forward pass without gradient tracking; every ID must be below vocab_size.
with torch.no_grad():
    output = model(**encoded)
print(f'Model output shape: {output.logits.shape}')
print('Success: No token index error')

Output:
Model vocabulary size: 50257
Tokenizer vocabulary size: 50257
Encoded token IDs: tensor([[15496, 11, 995, 0]])
Max token ID in input: 15496
Model output shape: torch.Size([1, 4, 50257])
Success: No token index error
What just happened?
We loaded the GPT-2 tokenizer and model together (both have vocab size 50257). We tokenized the text, which produced token IDs all within the valid range (0-50256). When we passed those IDs to the model, the embedding lookup succeeded because every ID was valid. The model produced logits with shape [batch_size=1, sequence_length=4, vocab_size=50257].
Common gotcha
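The most common mistake is pairing a tokenizer from one model with a different model entirely, for example loading the GPT-2 tokenizer with a BERT model. GPT-2's tokenizer emits IDs up to 50256, but bert-base-uncased has only 30522 embedding rows, so any ID at or above 30522 is out of range. The reverse pairing (BERT tokenizer, GPT-2 model) does not crash, because every ID is in range, but the IDs map to the wrong tokens and the output is meaningless. Always load the tokenizer and the model with the same from_pretrained() model name.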
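A cheap guard against this is to check the IDs before the forward pass. The helper below is a hypothetical sketch, not part of transformers: find_out_of_range_ids is an illustrative name, and the sample IDs and the 30522-row model are assumptions chosen to mimic a 50257-entry tokenizer paired with a smaller model.

```python
def find_out_of_range_ids(token_ids, vocab_size):
    """Return the token IDs that fall outside the valid range [0, vocab_size)."""
    return [tid for tid in token_ids if tid < 0 or tid >= vocab_size]


# Hypothetical example: IDs from a 50257-entry tokenizer checked against
# a model whose embedding table has only 30522 rows.
sample_ids = [15496, 11, 995, 0, 50256]
bad_ids = find_out_of_range_ids(sample_ids, vocab_size=30522)
print(bad_ids)  # [50256]
```

With a real batch, the flat list could come from encoded["input_ids"].flatten().tolist() and the limit from model.config.vocab_size. An empty result only means the IDs are in range; it does not prove the tokenizer and model belong together.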
Error recovery
IndexError: index out of range in self
CUDA RuntimeError: invalid index
RuntimeError at /aten/src/ATen/native/embedding.cpp
Experienced dev note
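This error often masks a silent mismatch. Sometimes you'll add special tokens to a tokenizer with tokenizer.add_tokens([...]) and forget to resize the model's embeddings. The tokenizer then emits IDs at or above the model's original vocab_size, and the forward pass fails only when one of the new tokens actually appears in the input.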
Check your understanding
You load a model and tokenizer from the same source, add 5 custom tokens with tokenizer.add_tokens(), then tokenize text. The text tokenizes fine, but the model forward pass crashes with 'index out of range.' What is the one-line fix, and why does it work?
A correct answer identifies that model.resize_token_embeddings(len(tokenizer)) must be called after adding tokens, because the model's embedding layer is still sized for the original vocab while the tokenizer now produces IDs up to the original vocab_size + 4 (five new IDs, starting at the original vocab_size).
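The resize step can be tried out without downloading a checkpoint. The sketch below assumes a tiny, randomly initialized GPT-2 built from a config; every size in it is illustrative, and the integers num_added and new_vocab_size stand in for the return value of tokenizer.add_tokens([...]) and for len(tokenizer), since no real tokenizer is loaded.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny randomly initialized model (no download). All sizes are illustrative.
config = GPT2Config(vocab_size=100, n_positions=16, n_embd=8, n_layer=1, n_head=2)
model = GPT2LMHeadModel(config)
model.eval()

num_added = 5                                    # stand-in for tokenizer.add_tokens([...])
new_vocab_size = config.vocab_size + num_added   # stand-in for len(tokenizer), here 105

# An ID the extended tokenizer could now emit: the last of the 5 new IDs (104).
input_ids = torch.tensor([[1, 2, new_vocab_size - 1]])

# Before resizing, the embedding table still has 100 rows, so ID 104 fails.
# On CPU this is an IndexError; RuntimeError is caught as well to be safe.
failed_before_resize = False
try:
    with torch.no_grad():
        model(input_ids=input_ids)
except (IndexError, RuntimeError):
    failed_before_resize = True

# The one-line fix: grow the embedding table to match the tokenizer.
model.resize_token_embeddings(new_vocab_size)

with torch.no_grad():
    logits = model(input_ids=input_ids).logits

rows_after = model.get_input_embeddings().weight.shape[0]
print(failed_before_resize, rows_after, logits.shape)
```

The rows added by the resize are freshly initialized, so the model still needs fine-tuning before the new tokens carry any meaning; resizing only removes the crash.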