Code Beginner easy · 4 min

Tokenizing the prompt

What you will learn

Convert raw text into the numeric token IDs that transformer models actually understand.

Why this matters

Models don't read words: they read numbers. You must tokenize your prompt before feeding it to any transformer, or the model will reject it. This is the mandatory first step in every inference pipeline.

Skip if: You don't need to tokenize manually if you're using <code>pipeline()</code> with automatic preprocessing, but you should still understand what's happening underneath. You also don't need to tokenize if you're loading pre-tokenized input IDs or precomputed embeddings from a database.

Explanation

Tokenization is the process of breaking text into small units (tokens) and converting each to a numeric ID that the model's vocabulary knows. A token is typically a word, subword, or character: it depends on the tokenizer. A transformer model has a fixed vocabulary, often tens of thousands of tokens (about 30,000 for bert-base-uncased, about 50,000 for GPT-2), each with a unique integer ID starting from 0.

Mechanically, the tokenizer works in three steps: normalize the text (for example, lowercase it and strip accents for an uncased model), split it into raw tokens (words or subwords), then map each token to its ID from the model's vocabulary using a lookup table. With return_tensors="pt", the result is a tensor of integers, usually of shape [1, sequence_length] for a single prompt. You also get special tokens, such as BERT's [CLS] at the start and [SEP] at the end, depending on the model's training convention.
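The three steps can be sketched in plain Python. This is a toy illustration, not how the transformers library is implemented: TOY_VOCAB and the helper functions are hypothetical names, and a real tokenizer also splits punctuation and subwords.

```python
import unicodedata

# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "the": 4, "cafe": 5, "is": 6, "open": 7}

def normalize(text):
    """Step 1: lowercase and strip accents, as an 'uncased' tokenizer would."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def split(text):
    """Step 2: split on whitespace (a simplification of real pre-tokenization)."""
    return text.split()

def to_ids(tokens, vocab):
    """Step 3: look up each token, fall back to [UNK], and add special tokens."""
    body = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    return [vocab["[CLS]"]] + body + [vocab["[SEP]"]]

def toy_tokenize(text):
    return to_ids(split(normalize(text)), TOY_VOCAB)

print(toy_tokenize("The Café is OPEN"))  # [2, 4, 5, 6, 7, 3]
print(toy_tokenize("The zebra"))         # [2, 4, 1, 3]  ('zebra' is unknown)
```

Note that the second prompt does not fail: the unknown word simply maps to the [UNK] ID.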

Use tokenization whenever you have raw text and a model that expects numeric input. This is true for every transformer use case: classification, generation, embedding, or ranking. The tokenizer must match the model: a BERT tokenizer won't work correctly with a GPT-2 model.

Analogy

Think of it like scanning a barcode. The barcode scanner (tokenizer) takes a product name and converts it to a numeric barcode ID. The cashier's register (the model) only understands barcode IDs, not product names. You can't skip the barcode: you have to scan it first.

Code

python

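from transformers import AutoTokenizer
import torch  # not called directly, but return_tensors="pt" needs PyTorch installed

# Load the tokenizer that matches the model you plan to use
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "The quick brown fox jumps over the lazy dog"

# Returns a dict-like object with input_ids and attention_mask as PyTorch tensors
encoded = tokenizer(prompt, return_tensors="pt")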

print("Input IDs:")
print(encoded["input_ids"])
print("\nToken count:")
print(encoded["input_ids"].shape[1])
print("\nDecoded back:")
print(tokenizer.decode(encoded["input_ids"][0]))
print("\nAttention mask (which tokens are real vs padding):")
print(encoded["attention_mask"])

Output

Input IDs:
tensor([[  101,  1996,  4248,  2829,  4419, 14523,  2058,  1996, 13971,  3899,   102]])

Token count:
11

Decoded back:
[CLS] the quick brown fox jumps over the lazy dog [SEP]

Attention mask (which tokens are real vs padding):
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

What just happened?

The tokenizer loaded BERT's vocabulary, lowercased the prompt, and converted it into numeric IDs (101 is [CLS], 1996 is 'the', 102 is [SEP]). It added the special tokens at the start and end and wrapped the result in a PyTorch tensor. The attention mask is all 1s because there's no padding: every token is real. When you decode the IDs back, you get the lowercased text plus the special tokens that BERT uses.

Common gotcha

Many developers assume tokenization is one-to-one: one word = one token. Wrong. Subword tokenizers can split a word like 'unbelievable' into several pieces such as ['un', '##believ', '##able']; the exact split depends on the vocabulary. Also, if your prompt is longer than the model's max sequence length (512 for BERT), the tokenizer does not truncate it by default: it only logs a warning, and the model fails later when it receives the over-long input. Always check the sequence length and set truncation=True and max_length explicitly.
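To see why one word can become several tokens, here is a minimal sketch of greedy longest-match-first splitting, the idea behind WordPiece-style tokenizers. The vocabulary and the wordpiece() helper are hypothetical; a real tokenizer uses its own learned vocabulary and may split the same words differently.

```python
# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
TOY_VOCAB = {"un", "##believ", "##able", "##happy", "happy", "[UNK]"}

def wordpiece(word, vocab):
    """Greedily take the longest vocabulary piece that matches at each position."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

print(wordpiece("unbelievable", TOY_VOCAB))  # ['un', '##believ', '##able']
print(wordpiece("happy", TOY_VOCAB))         # ['happy']
print(wordpiece("unhappy", TOY_VOCAB))       # ['un', '##happy']
```

Common words that appear whole in the vocabulary stay as one token; rarer words are assembled from pieces, which is why token counts usually exceed word counts.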

Error recovery

AttributeError: 'NoneType' object has no attribute 'encode'

You tried to use a tokenizer that wasn't loaded. Make sure you called <code>AutoTokenizer.from_pretrained(model_name)</code> first and didn't accidentally set tokenizer to None.

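Warning: Token indices sequence length is longer than the specified maximum sequence length for this model

Your text is too long for the model. The tokenizer reports this as a warning rather than an exception, but passing the result to the model will usually fail with an indexing error. Add <code>truncation=True, max_length=512</code> to the tokenizer call, or split the text into chunks before tokenizing.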
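If truncation would throw away text you need, chunking is the alternative. The sketch below splits an already-tokenized list of IDs into windows that each fit the limit once special tokens are added. chunk_ids() is a hypothetical helper; the defaults 101 and 102 are BERT's [CLS] and [SEP] IDs from the output above, and the input is assumed to contain no special tokens (for example, IDs produced with add_special_tokens=False).

```python
def chunk_ids(ids, max_length, cls_id=101, sep_id=102):
    """Split token IDs into chunks of at most max_length, special tokens included."""
    body = max_length - 2  # reserve room for [CLS] and [SEP]
    if body <= 0:
        raise ValueError("max_length must leave room for at least one real token")
    return [
        [cls_id] + ids[i:i + body] + [sep_id]
        for i in range(0, len(ids), body)
    ]

# Ten made-up token IDs standing in for a long prompt
ids = list(range(1000, 1010))
for chunk in chunk_ids(ids, max_length=6):
    print(chunk)
# [101, 1000, 1001, 1002, 1003, 102]
# [101, 1004, 1005, 1006, 1007, 102]
# [101, 1008, 1009, 102]
```

This version cuts at fixed positions, so a chunk boundary can fall mid-sentence; overlapping windows are a common refinement when context across the boundary matters.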

KeyError or token out of vocabulary

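Your text contains characters the tokenizer doesn't recognize. This is rare with modern subword tokenizers, which usually map unknown pieces to a special unknown token such as [UNK] instead of raising an error. If it happens, or if you see many unknown tokens, check that the model_name matches the text language (e.g., don't use an English BERT for Chinese text).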

Experienced dev note

In transformers 5.5.x, prefer the tokenizer call syntax tokenizer(text, return_tensors='pt') over tokenizer.encode(). The call syntax returns a dict-like BatchEncoding with both input_ids and attention_mask, and batching is built in. Tokenizers are also deterministic: the same text always produces the same IDs. Padding behavior varies by model (for example, which side is padded and which pad token is used), so if you're batching multiple prompts, set padding=True to pad them all to the length of the longest prompt, in addition to truncating to max_length.
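To make the effect of padding concrete, here is a plain-Python sketch of what padding a batch to its longest sequence produces. pad_batch() is a hypothetical helper, not a library function; it assumes right-side padding with pad ID 0, which matches BERT's [PAD] token but differs for other models.

```python
def pad_batch(batch, pad_id=0):
    """Right-pad every sequence to the longest one and build attention masks."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two already-tokenized prompts of different lengths (101 = [CLS], 102 = [SEP])
batch = [[101, 1996, 3899, 102], [101, 3899, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])       # [[101, 1996, 3899, 102], [101, 3899, 102, 0]]
print(padded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The attention mask is what lets the model tell the trailing 0 apart from a real token, which is why the two outputs should always travel together.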

Check your understanding

If you tokenize the phrase 'unhappy' and it produces 5 tokens instead of 1, why did that happen, and how would you verify what those 5 tokens actually are?

Show answer hint

A correct answer explains subword tokenization (the tokenizer breaks words into pieces for vocabulary efficiency) and shows how to use <code>tokenizer.tokenize(text)</code> to see the actual token strings, or <code>tokenizer.convert_ids_to_tokens(ids)</code> to decode the numeric IDs back to their string representations.

VERSION tokenizer.encode() returns a plain Python list of IDs with no attention mask. tokenizer() returns a BatchEncoding dict, supports batching, and is the recommended approach in 5.5.x. The .encode() method still works, but it processes one input at a time.

Now that you can tokenize, learn how to load a model and pass these token IDs to it for inference.
