Code Beginner easy · 4 min

Token Indices Out of Range

What you will learn

When your tokenizer produces token IDs that don't exist in your model's vocabulary, the model crashes: here's why and how to fix it.

Why this matters

This is one of the most common errors when starting with transformers. Your text tokenizes fine, but the model fails at forward pass because the token vocabulary doesn't match. Understanding this prevents hours of debugging.

Skip if: You always use the exact tokenizer and model that were trained together (e.g., AutoTokenizer.from_pretrained('gpt2') paired with AutoModelForCausalLM.from_pretrained('gpt2')) and never add tokens to the tokenizer. The error only occurs when the vocabularies mismatch.

Explanation

What it is: Your tokenizer converts text into integer token IDs. Each model has a fixed vocabulary size, so the largest token ID it knows about is vocab_size - 1. If your tokenizer produces an ID equal to or larger than the model's vocabulary size, the embedding lookup fails because that ID has no row in the table.

How it works mechanically: When you call model(input_ids=...), the model looks up each integer ID in its embedding table. That table has exactly config.vocab_size rows (indices 0 to vocab_size-1). If you pass ID 50257 but vocab_size is only 50000, you're asking for row 50257 in a 50000-row table: out of bounds.
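The lookup described above can be reproduced with a bare torch.nn.Embedding, without loading any model. This is a minimal sketch: the 5-row table and the 3-dimensional vectors are toy values chosen for illustration, standing in for config.vocab_size and the hidden size.

```python
import torch
import torch.nn as nn

vocab_size = 5  # toy stand-in for config.vocab_size
embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=3)

# Valid IDs run from 0 to vocab_size - 1, so 0 and 4 both have a row.
valid_ids = torch.tensor([[0, 4]])
valid_shape = tuple(embedding(valid_ids).shape)
print(valid_shape)  # (1, 2, 3)

# ID 5 asks for a sixth row that the table does not have.
bad_ids = torch.tensor([[0, 5]])
try:
    embedding(bad_ids)
    lookup_failed = False
except (IndexError, RuntimeError) as err:
    # Recent CPU builds of PyTorch raise IndexError here;
    # RuntimeError is caught as well in case a build reports it differently.
    lookup_failed = True
    print(f"{type(err).__name__}: {err}")
```

The same failure happens inside a full model, just several layers deeper in the stack trace.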

When to use what: Always match tokenizer and model from the same source. Use AutoTokenizer and AutoModelForCausalLM together with the same model name. Never use a tokenizer from one model with a model from another.

Analogy

It's like having a library with books numbered 1-5000, but your checkout system tries to retrieve book #5001. The checkout system works fine, but the library only has 5000 books.

Code

python

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

print(f'Model vocabulary size: {model.config.vocab_size}')
print(f'Tokenizer vocabulary size: {len(tokenizer)}')

text = 'Hello, world!'
encoded = tokenizer(text, return_tensors='pt')
print(f'Encoded token IDs: {encoded["input_ids"]}')
print(f'Max token ID in input: {encoded["input_ids"].max().item()}')

with torch.no_grad():
    output = model(**encoded)
    print(f'Model output shape: {output.logits.shape}')
    print('Success: No token index error')

Output

Model vocabulary size: 50257
Tokenizer vocabulary size: 50257
Encoded token IDs: tensor([[15496,    11,   995,     0]])
Max token ID in input: 15496
Model output shape: torch.Size([1, 4, 50257])
Success: No token index error

What just happened?

We loaded the GPT-2 tokenizer and model together (both have a vocab size of 50257). We tokenized the text, which produced token IDs that all fall within the valid range (0-50256). When we passed those IDs to the model, the embedding lookup succeeded because every ID was valid. The model produced logits with shape [batch_size=1, sequence_length=4, vocab_size=50257].

Common gotcha

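The most common mistake: pairing a tokenizer from one model with a different model entirely. For example, loading the GPT-2 tokenizer (50,257 tokens) with bert-base-uncased (30,522 tokens): any ID of 30522 or above is out of range for the model. The reverse pairing (BERT tokenizer, GPT-2 model) does not crash, because every ID fits, but the IDs map to the wrong embeddings and the output is meaningless. Always load the tokenizer and model with the same from_pretrained() model name.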
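One way to catch a mismatch before the forward pass is a small range check on the token IDs. The sketch below is plain Python; find_out_of_range_ids is a hypothetical helper written for this lesson, not part of transformers, and the sample IDs are made up to show both outcomes.

```python
def find_out_of_range_ids(batch, vocab_size):
    """Return the sorted unique IDs in `batch` that fall outside [0, vocab_size - 1].

    `batch` is a list of sequences, each a list of integer token IDs.
    """
    bad = {tid for seq in batch for tid in seq if tid < 0 or tid >= vocab_size}
    return sorted(bad)


# Illustrative IDs: the first row is 'Hello, world!' under GPT-2,
# the second row is made up to sit right at the boundary.
ids = [[15496, 11, 995, 0], [50256, 50257]]

print(find_out_of_range_ids(ids, 50257))  # [50257]
print(find_out_of_range_ids(ids, 30522))  # [50256, 50257]
```

With real tokenizer output, something like encoded["input_ids"].tolist() could be passed as the batch, with model.config.vocab_size as the limit.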

Error recovery

IndexError: index out of range in self

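Cause: a token ID is >= the model's vocab_size. Fix: verify the tokenizer and model match. Print len(tokenizer) and model.config.vocab_size: len(tokenizer) must not exceed model.config.vocab_size. Prefer len(tokenizer) over tokenizer.vocab_size, which does not count tokens added after loading.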

CUDA RuntimeError: invalid index

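Same root cause as above, but on GPU, where it often surfaces as the less specific "CUDA error: device-side assert triggered". Fix: rerun the same code on CPU first to get the clearer IndexError, then confirm the vocabularies match before moving back to GPU.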

RuntimeError at /aten/src/ATen/native/embedding.cpp

Cause: PyTorch embedding table doesn't have the requested index. Fix: check if you accidentally added custom tokens to tokenizer without resizing model with model.resize_token_embeddings(len(tokenizer)).

Experienced dev note

This error often masks a silent mismatch. Sometimes you'll add special tokens to a tokenizer with tokenizer.add_tokens(['']) but forget to call model.resize_token_embeddings(len(tokenizer)). The tokenizer now has more tokens than the model knows about. The fix is one line, but the debugging path is confusing: many developers think the problem is their input data, not the vocabulary mismatch. Compare len(tokenizer) with model.config.vocab_size right after loading, and again after every change to the tokenizer.
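The effect of the missing resize call can be sketched without downloading anything. The example below assumes a tiny, randomly initialised GPT-2 built from GPT2Config (all sizes are arbitrary), and uses a plain integer in place of len(tokenizer) after five tokens have been added.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny random model so nothing is downloaded; the sizes are arbitrary.
config = GPT2Config(vocab_size=100, n_positions=32, n_embd=16, n_layer=1, n_head=2)
model = GPT2LMHeadModel(config)
model.eval()

old_size = model.get_input_embeddings().num_embeddings  # 100
new_size = old_size + 5  # stands in for len(tokenizer) after adding 5 tokens

# The first added token receives ID old_size, one past the last valid row.
new_token_id = torch.tensor([[old_size]])

try:
    with torch.no_grad():
        model(input_ids=new_token_id)
    failed_before_resize = False
except (IndexError, RuntimeError):
    failed_before_resize = True

# The one-line fix: grow the embedding table to cover the new IDs.
model.resize_token_embeddings(new_size)

with torch.no_grad():
    logits = model(input_ids=new_token_id).logits

print(failed_before_resize)
print(model.config.vocab_size, tuple(logits.shape))
```

The rows added by the resize are freshly initialised, so the new tokens carry no learned meaning until the model is fine-tuned on data that uses them.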

Check your understanding

You load a model and tokenizer from the same source, add 5 custom tokens with tokenizer.add_tokens(), then tokenize text. The text tokenizes fine, but the model forward pass crashes with 'index out of range.' What is the one-line fix, and why does it work?

Answer hint

A correct answer identifies that model.resize_token_embeddings(len(tokenizer)) must be called after adding tokens, because the model's embedding layer is still sized for the original vocabulary while the tokenizer now produces IDs from vocab_size up to vocab_size + 4. Resizing adds rows to the embedding table so those IDs become valid.

VERSION In transformers < 5.0, some tokenizers had implicit vocab size padding that masked this error. In 5.5.x, all tokenizers strictly enforce vocab_size matching. Always call resize_token_embeddings() explicitly when adding tokens.

Now that you understand token vocabularies, learn how tokenizers actually split text into subwords and why 'hello world' might tokenize into unexpected pieces.
