Code Beginner easy · 4 min

What a tokenizer does: text to numbers

What you will learn

A tokenizer converts raw text into numerical tokens that transformer models can understand and process.

Why this matters

You cannot pass raw text directly to a transformer model: every model expects numerical input. Understanding tokenization prevents silent bugs where your model receives corrupted input or produces nonsensical outputs. It's also where text length limits, special tokens, and encoding mismatches originate.

Skip if: You don't need to build your own tokenizer: Hugging Face hosts a pre-trained tokenizer alongside virtually every published model. Build a custom tokenizer only if you're training a model from scratch on a completely new domain with a non-standard vocabulary.

Explanation

What it is: A tokenizer is a converter that takes human-readable text and breaks it into discrete units called tokens, then maps each token to a unique integer ID. That integer sequence is what the model actually processes.

How it works mechanically: The tokenizer first splits text (using word boundaries, subword rules, or character-level splits), then looks up each token in a vocabulary dictionary to get its integer ID. Different tokenizers use different splitting strategies: GPT-2 uses byte-level byte-pair encoding (BPE), BERT uses WordPiece, and models such as T5 use SentencePiece. The output is a sequence of integers (a tensor if you request one, plain Python lists otherwise) plus metadata such as an attention mask that tells the model which positions are real tokens and which are padding.

When to use it: Always use the official tokenizer for your model: it's built specifically for that model's vocabulary and was used during training. Mismatched tokenizers cause dramatic performance drops because the model's embeddings no longer align with the token IDs.
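The split-then-look-up mechanics can be sketched without any library. This is a toy illustration only: the five-entry vocabulary and the toy_encode function are hypothetical, and real tokenizers use subword rules rather than a plain whitespace split.

```python
# Hypothetical toy vocabulary; a real model's vocabulary has tens of thousands of entries.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "text": 2, "to": 3, "numbers": 4}


def toy_encode(text, max_length):
    """Split on whitespace, map tokens to IDs, then pad to max_length."""
    tokens = text.lower().split()
    # Unknown words fall back to the [UNK] ID instead of failing.
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    ids = ids[:max_length]
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [TOY_VOCAB["[PAD]"]] * padding,
        # 1 marks a real token, 0 marks padding.
        "attention_mask": [1] * len(ids) + [0] * padding,
    }


print(toy_encode("Text to numbers", max_length=5))
# {'input_ids': [2, 3, 4, 0, 0], 'attention_mask': [1, 1, 1, 0, 0]}
```

The attention mask mirrors the padding: the model is meant to ignore every position where the mask is 0.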

Analogy

Think of a tokenizer like a postal system. Your message (text) is broken into individual letters and words (tokens), then each one gets stamped with a unique zip code (integer ID). The mail truck (model) doesn't read English: it only reads zip codes. If you use the wrong postal system's codes, the truck delivers your message to the wrong place.

Code

python

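import torch  # not used directly, but return_tensors="pt" requires PyTorch
from transformers import AutoTokenizer

text = "Machine learning transforms text into numbers."

# Load the tokenizer that was used to train bert-base-uncased
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encode the text; "pt" asks for PyTorch tensors instead of Python lists
encoded = tokenizer(text, return_tensors="pt")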

print("Input text:")
print(repr(text))
print("\nTokenizer output:")
print(encoded)
print("\nToken IDs as list:")
print(encoded["input_ids"].tolist())
print("\nDecoded back to text:")
print(tokenizer.decode(encoded["input_ids"][0]))
print("\nIndividual token strings:")
tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])
print(tokens)

Output

Input text:
'Machine learning transforms text into numbers.'

Tokenizer output:
{'input_ids': tensor([[  101,  3698,  4083, 21743,  3793,  2046,  3616,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Token IDs as list:
[[101, 3698, 4083, 21743, 3793, 2046, 3616, 1012, 102]]

Decoded back to text:
[CLS] machine learning transforms text into numbers. [SEP]

Individual token strings:
['[CLS]', 'machine', 'learning', 'transforms', 'text', 'into', 'numbers', '.', '[SEP]']

What just happened?

The code loaded BERT's pre-trained tokenizer, passed raw text through it, and got back a dictionary containing three tensors: input_ids (the integer token sequence), token_type_ids (segment indicators, all zeros for a single sentence), and attention_mask (all ones, meaning all tokens are real, not padding). The text was automatically converted to lowercase (this is the uncased BERT checkpoint) and split into 9 tokens: six words, the final period as its own token, and the special markers [CLS] at the start and [SEP] at the end. Each token was then mapped to an integer. Converting the IDs back to tokens shows exactly how BERT sees the text internally.

Common gotcha

Developers often assume tokenization is perfectly reversible. It mostly is, but decoding a tokenized string may not restore the original whitespace or capitalization (bert-base-uncased, for example, lowercases everything). More critically, using the wrong tokenizer for a model (e.g., the GPT-2 tokenizer with a BERT model) produces completely wrong integer sequences, and usually nothing raises an error: the model just produces garbage outputs that can look plausible. Always verify you're using the tokenizer from the same model card.
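A minimal sketch of why the round trip can lose information, assuming a toy uncased tokenizer that splits on whitespace. The toy_tokenize and toy_detokenize helpers are hypothetical and stand in for a real tokenizer's encode and decode steps.

```python
def toy_tokenize(text):
    # Lowercase and split on any run of whitespace, as an uncased tokenizer might.
    return text.lower().split()


def toy_detokenize(tokens):
    # Rejoin with single spaces; the original spacing and casing are already gone.
    return " ".join(tokens)


original = "Hello   World"
round_trip = toy_detokenize(toy_tokenize(original))

print(repr(original))          # 'Hello   World'
print(repr(round_trip))        # 'hello world'
print(round_trip == original)  # False
```

If an exact copy of the input matters downstream, keep the original string next to the token IDs rather than relying on decoding.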

Error recovery

KeyError: 'input_ids'

The model did not receive an input_ids entry, typically because raw text (or an incomplete dictionary) was passed instead of the tokenizer's output. Encode first with encoded = tokenizer(text, return_tensors='pt'), then pass the result with model(**encoded).

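Warning: Token indices sequence length is longer than the specified maximum sequence length for this model

The tokenizer logs this as a warning rather than raising an exception; the actual failure comes later, when the model indexes past its position embeddings. Your text is longer than the model's maximum sequence length (512 tokens for bert-base-uncased). Either truncate with tokenizer(text, max_length=512, truncation=True) or split long documents into chunks.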

RuntimeError: Expected all tensors to be on the same device

Your tokenizer output is on CPU but your model is on GPU (or vice versa). Move the encoded tensor to the model's device: encoded = tokenizer(text, return_tensors='pt').to(model.device).
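For the sequence-length case, splitting a long document into chunks can be sketched on a plain list of token IDs. The chunk_ids helper and its stride parameter are hypothetical names for illustration. The sketch also assumes special tokens such as [CLS] and [SEP] are added to each chunk afterwards, so max_length should leave room for them.

```python
def chunk_ids(token_ids, max_length, stride=0):
    """Split token_ids into windows of at most max_length, overlapping by stride."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_length])
        # Stop once a window reaches the end, to avoid redundant tail chunks.
        if start + max_length >= len(token_ids):
            break
    return chunks


ids = list(range(10))
print(chunk_ids(ids, max_length=4))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(chunk_ids(ids, max_length=4, stride=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

A non-zero stride repeats some tokens across neighbouring chunks, which can help when a relevant sentence would otherwise be cut at a chunk boundary.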

Experienced dev note

Tokenization is where a large share of NLP bugs hide. A mismatched or misconfigured tokenizer silently produces wrong embeddings that downstream tasks amplify into massive errors. Always inspect the actual token IDs early: run print(tokenizer.convert_ids_to_tokens(input_ids[0])) to verify the tokenizer is splitting text the way you expect. In production, log the first few batches of token sequences to catch misconfigurations before they cascade. Also, different tokenizers have different special tokens ([CLS], [MASK], etc.): always check the model's documentation for which ones to expect.

Check your understanding

If you tokenize the same sentence twice with two different tokenizers (say, BERT's and GPT-2's), the integer sequences will be completely different. Explain why a transformer model trained with BERT's tokenizer would fail catastrophically if you switched to GPT-2's tokenizer at inference time, even though the text input is identical.

Show answer hint

The answer must mention that the model's embedding layer has learned associations between specific integer IDs and semantic meanings during training. Those associations are only valid for the tokenizer that was used during training. A different tokenizer assigns different integer IDs to the same text, so the model looks up embeddings that were learned for unrelated tokens, or for IDs that fall outside its vocabulary entirely.
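The misalignment can be made concrete with a toy example. Both vocabularies, the hand-written embedding table, and the what_the_model_sees helper are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical vocabularies: the same words get different IDs in each.
TRAIN_VOCAB = {"cat": 0, "dog": 1, "car": 2}
OTHER_VOCAB = {"car": 0, "cat": 1, "dog": 2, "bus": 3}

# Row i holds the vector the toy "model" learned for training ID i.
EMBEDDINGS = [
    [1.0, 0.0],  # learned for "cat"
    [0.9, 0.1],  # learned for "dog"
    [0.0, 1.0],  # learned for "car"
]
ID_TO_TRAIN_TOKEN = {i: tok for tok, i in TRAIN_VOCAB.items()}


def what_the_model_sees(word, vocab):
    """Return the training-time token whose embedding row this word's ID selects."""
    token_id = vocab[word]
    if token_id >= len(EMBEDDINGS):
        return None  # no such row; a real embedding layer would raise an index error
    return ID_TO_TRAIN_TOKEN[token_id]


print(what_the_model_sees("cat", TRAIN_VOCAB))  # cat
print(what_the_model_sees("cat", OTHER_VOCAB))  # dog
print(what_the_model_sees("bus", OTHER_VOCAB))  # None
```

With the wrong vocabulary, "cat" silently selects the vector learned for "dog", and "bus" points at a row that does not exist.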

VERSION The return_tensors argument is not new: transformers 4.x releases already accept return_tensors='pt'. Without it, tokenizer() returns plain Python lists that you would have to convert to tensors yourself. Pass return_tensors='pt' whenever you feed the output to a PyTorch model.

Now that you understand how text becomes numbers, you need to learn how to prepare batches of multiple sentences using the tokenizer's padding and attention_mask features: this is essential before feeding data to a real model.
