What a tokenizer does: text to numbers
Why this matters
You cannot pass raw text directly to a transformer model: every model expects numerical input. Understanding tokenization prevents silent bugs where your model receives corrupted input or produces nonsensical outputs. It's also where text length limits, special tokens, and encoding mismatches originate.
Explanation
What it is: A tokenizer converts human-readable text into discrete units called tokens, then maps each token to a unique integer ID. That integer sequence is what the model actually processes.
How it works mechanically: The tokenizer first splits text (using word boundaries, subword rules, or character-level splits), then looks up each token in a vocabulary dictionary to get its integer ID. Different tokenizers use different splitting strategies: GPT-family models use byte-pair encoding (BPE), BERT uses WordPiece, and many other models use tokenizers built with the SentencePiece library. The output is a sequence of integer IDs (returned as a tensor when you request one, for example with return_tensors="pt") plus metadata such as an attention mask that tells the model which positions are real tokens and which are padding.
When to use it: Always use the official tokenizer for your model: it's built specifically for that model's vocabulary and was used during training. Mismatched tokenizers cause dramatic performance drops because the token IDs no longer align with the embeddings the model learned.
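The split-then-look-up mechanics described above can be sketched in plain Python. This is a toy illustration, not the code any real library runs: the vocabulary, the split_word and encode helpers, and the example sentence are all hypothetical. It uses a greedy longest-match-first subword split in the style of WordPiece, where "##" marks a piece that continues a word.

```python
# Toy vocabulary (hypothetical); real vocabularies hold tens of thousands of entries.
VOCAB = {"[UNK]": 0, "token": 1, "##ize": 2, "##r": 3, "##s": 4,
         "text": 5, "to": 6, "numbers": 7}

def split_word(word, vocab):
    """Greedy longest-match-first subword split, in the style of WordPiece."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab):
    """Step 1: split into tokens. Step 2: look up each token's integer ID."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(split_word(word, vocab))
    ids = [vocab[t] for t in tokens]
    return tokens, ids

tokens, ids = encode("Tokenizers text to numbers", VOCAB)
print(tokens)  # ['token', '##ize', '##r', '##s', 'text', 'to', 'numbers']
print(ids)     # [1, 2, 3, 4, 5, 6, 7]
```

A word with no matching pieces, such as "xyz" here, collapses to the single [UNK] token, which is one reason subword vocabularies are built to cover as much text as possible.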
Analogy
Think of a tokenizer like a postal system. Your message (text) is broken into individual letters and words (tokens), then each one gets stamped with a unique zip code (integer ID). The mail truck (model) doesn't read English: it only reads zip codes. If you use the wrong postal system's codes, the truck delivers your message to the wrong place.
Code
import torch  # not used directly, but return_tensors="pt" requires PyTorch to be installed
from transformers import AutoTokenizer

text = "Machine learning transforms text into numbers."

# Load the tokenizer that ships with the bert-base-uncased checkpoint
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encode the text; return_tensors="pt" asks for PyTorch tensors
encoded = tokenizer(text, return_tensors="pt")

print("Input text:")
print(repr(text))
print("\nTokenizer output:")
print(encoded)
print("\nToken IDs as list:")
print(encoded["input_ids"].tolist())
print("\nDecoded back to text:")
print(tokenizer.decode(encoded["input_ids"][0]))
print("\nIndividual token strings:")
# Map each ID back to its token string to see how the text was split
tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])
print(tokens)

Output
Input text:
'Machine learning transforms text into numbers.'
Tokenizer output:
{'input_ids': tensor([[ 101, 3298, 2500, 14840, 2487, 1157, 3546, 102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}
Token IDs as list:
[[101, 3298, 2500, 14840, 2487, 1157, 3546, 102]]
Decoded back to text:
[CLS] machine learning transforms text into numbers. [SEP]
Individual token strings:
['[CLS]', 'machine', 'learning', 'transforms', 'text', 'into', 'numbers', '[SEP]']
What just happened?
The code loaded BERT's pre-trained tokenizer, passed raw text through it, and got back a dictionary containing three tensors: <code>input_ids</code> (the integer token sequence), <code>token_type_ids</code> (segment indicators, all zeros for a single sentence), and <code>attention_mask</code> (all ones, meaning all tokens are real, not padding). The text was automatically converted to lowercase (a property of the uncased checkpoint, not of BERT in general), split into 8 tokens including special markers <code>[CLS]</code> at the start and <code>[SEP]</code> at the end, and each mapped to an integer. Converting back to tokens shows exactly how BERT sees the text internally.
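The attention mask only becomes interesting when sequences of different lengths are batched together. The following is a minimal pure-Python sketch of that idea, not the library's implementation: pad_batch is a hypothetical helper, the token IDs are placeholders, and 0 is assumed as the padding ID.

```python
PAD_ID = 0  # assumed padding ID for this sketch

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two already-encoded sequences of different lengths (placeholder IDs)
batch = pad_batch([[101, 7, 8, 9, 102], [101, 7, 102]])
print(batch["input_ids"])       # [[101, 7, 8, 9, 102], [101, 7, 102, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

The shorter sequence gains two padding positions, and its mask is the only place the model can learn that those positions carry no content.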
Common gotcha
Developers often assume tokenization is perfectly reversible. It mostly is, but not always: if you tokenize and then decode, whitespace and capitalization may not match the original. More critically, using the wrong tokenizer for a model (e.g., the GPT-2 tokenizer with a BERT model) produces completely wrong integer sequences, and Python usually won't raise an error: the model just produces garbage outputs that can look plausible. Always verify you're using the tokenizer from the same model card.
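To see why a round trip can lose information, here is a toy sketch under stated assumptions: toy_encode and toy_decode are hypothetical helpers and the three-entry vocabulary is invented for illustration. Lowercasing, whitespace splitting, and the unknown-token fallback each discard something that decoding cannot restore.

```python
def toy_encode(text, vocab):
    # Lowercasing and whitespace splitting both throw information away.
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]

def toy_decode(ids, vocab):
    id_to_token = {i: t for t, i in vocab.items()}
    return " ".join(id_to_token[i] for i in ids)

vocab = {"[UNK]": 0, "hello": 1, "world": 2}  # hypothetical vocabulary
original = "Hello   World  Zyzzyva"
round_trip = toy_decode(toy_encode(original, vocab), vocab)

print(repr(original))    # 'Hello   World  Zyzzyva'
print(repr(round_trip))  # 'hello world [UNK]'
```

Capitalization, the extra spaces, and the out-of-vocabulary word are all gone after decoding, even though no step raised an error.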
Error recovery
KeyError: 'input_ids'
ValueError: Token indices sequence length is longer than the maximum
RuntimeError: Expected all tensors to be on the same device
Experienced dev note
Tokenization is where 80% of NLP bugs hide. A mismatched or misconfigured tokenizer silently produces wrong embeddings that downstream tasks amplify into massive errors. Always inspect the actual token IDs early: print tokenizer.convert_ids_to_tokens(input_ids[0]) to verify the tokenizer is splitting text the way you expect. In production, log the first few batches of token sequences to catch misconfigurations before they cascade. Also: different tokenizers have different special tokens ([CLS], [MASK], etc.): always check the model's documentation for which ones to expect.
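One way to make that early inspection routine is a small report per sequence. The sketch below is an assumption-laden illustration rather than an established API: inspect_sequence and the five-entry id_to_token mapping are hypothetical, and in practice the mapping would come from the tokenizer that belongs to your model.

```python
def inspect_sequence(input_ids, id_to_token, unk_token="[UNK]"):
    """Summarize one encoded sequence: token strings, unknown rate, invalid IDs."""
    invalid_ids = [i for i in input_ids if i not in id_to_token]
    tokens = [id_to_token.get(i, "<invalid id>") for i in input_ids]
    unk_rate = tokens.count(unk_token) / max(len(tokens), 1)
    return {"tokens": tokens, "unk_rate": unk_rate, "invalid_ids": invalid_ids}

# Hypothetical ID-to-token mapping for illustration only
id_to_token = {0: "[UNK]", 1: "[CLS]", 2: "[SEP]", 3: "machine", 4: "learning"}

report = inspect_sequence([1, 3, 4, 0, 2, 99], id_to_token)
print(report["tokens"])
print(report["unk_rate"])
print(report["invalid_ids"])
```

A high unknown-token rate or any ID missing from the vocabulary is a cheap signal that the tokenizer and the model do not belong together, so it is worth logging for the first few batches.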
Check your understanding
If you tokenize the same sentence twice with two different tokenizers (say, BERT's and GPT-2's), the integer sequences will be completely different. Explain why a transformer model trained with BERT's tokenizer would fail catastrophically if you switched to GPT-2's tokenizer at inference time, even though the text input is identical.
Answer hint
The answer must mention that the model's embedding layer learned associations between specific integer IDs and semantic meanings during training. Those associations are only valid for the tokenizer that was used during training. A different tokenizer assigns different integer IDs to the same text, so the model looks up embedding rows that were learned for unrelated tokens, or IDs that fall outside its vocabulary entirely.
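The mismatch can be made concrete with a toy sketch. Everything here is hypothetical: two invented three-word vocabularies stand in for two tokenizers, and a dictionary of labeled strings stands in for an embedding table trained alongside the first vocabulary.

```python
# Two hypothetical vocabularies that assign different IDs to the same words
vocab_a = {"cats": 0, "chase": 1, "mice": 2}
vocab_b = {"mice": 0, "cats": 1, "chase": 2}

# Stand-in embedding table "trained" with vocab_a: row i belongs to vocab_a's token i
embeddings = {0: "vector-for-cats", 1: "vector-for-chase", 2: "vector-for-mice"}

def embed(text, vocab):
    """Tokenize by whitespace, map to IDs, then look up each embedding row."""
    return [embeddings[vocab[word]] for word in text.split()]

text = "cats chase mice"
matched = embed(text, vocab_a)
mismatched = embed(text, vocab_b)

print(matched)     # ['vector-for-cats', 'vector-for-chase', 'vector-for-mice']
print(mismatched)  # ['vector-for-chase', 'vector-for-mice', 'vector-for-cats']
```

No exception is raised in the mismatched case: every lookup succeeds, but each word receives the vector learned for a different word, which is why the failure shows up as poor outputs rather than as a crash.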