tokenizer.decode(): converting tokens to text
Why this matters
You need to understand the full round trip: text → tokens → model → tokens → text. Without decoding, model output is just a list of integers that you can't read. Decoding is therefore a required step in any LLM inference pipeline.
Explanation
What it is: The tokenizer's decode() method reverses the tokenization process, converting a list of token IDs back into the original (or near-original) text format that humans can read.
How it works mechanically: Each token ID is an integer that maps to a specific piece of text: a word, subword, or character. The tokenizer maintains a vocabulary dictionary (ID → text). When you call decode(), it looks up each token ID in that dictionary and concatenates the results. Llama tokenizers use byte-pair encoding (BPE), so some tokens are word fragments; decoding merges them automatically. In SentencePiece-based Llama vocabularies, that includes converting the ▁ marker, which stands for a preceding space, back into an ordinary space.
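The lookup-and-concatenate step described above can be sketched with a toy vocabulary. The IDs and pieces in TOY_VOCAB are made up for illustration and are not taken from any real Llama vocabulary; real tokenizers also handle byte fallback and special tokens, which this sketch ignores.

```python
# Toy illustration of decode(): look up each ID, concatenate, restore spaces.
# TOY_VOCAB is hypothetical; real vocabularies hold tens of thousands of entries.
TOY_VOCAB = {
    0: "▁Hey",
    1: "▁there",
    2: "▁token",
    3: "izer",
}

def toy_decode(token_ids):
    """Map IDs to text pieces, join them, and turn the ▁ marker into spaces."""
    pieces = [TOY_VOCAB[i] for i in token_ids]
    return "".join(pieces).replace("▁", " ")

print(repr(toy_decode([0, 1])))  # ' Hey there'
print(repr(toy_decode([2, 3])))  # ' tokenizer'
```

Note how the fragment "izer" has no marker, so it attaches directly to the piece before it, while every word-initial piece contributes a space, including the very first one.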
When to use it: Call decode() after inference to turn generated token IDs into readable output. decode() takes token IDs, not logits, so raw logits must first be reduced to IDs (by argmax or sampling). It's your primary way to inspect what the model actually generated.
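Because decode() works on token IDs, model logits need one extra step before they can be decoded. Here is a minimal sketch of greedy selection (argmax) using plain Python lists in place of real model output. The logit values and the four-entry toy_vocab are invented for illustration, and real pipelines often sample instead of taking the argmax.

```python
# Hypothetical logits: one row per generated position, one score per vocab entry.
logits = [
    [0.1, 2.3, 0.4, -1.0],   # position 0: highest score at index 1
    [1.7, 0.2, 0.3, 0.9],    # position 1: highest score at index 0
    [-0.5, 0.0, 0.1, 3.2],   # position 2: highest score at index 3
]

def greedy_ids(rows):
    """Pick the index of the largest logit in each row (greedy decoding)."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]

# Hypothetical ID -> text table standing in for tokenizer.decode().
toy_vocab = {0: " world", 1: "Hello", 2: "?", 3: "!"}

token_ids = greedy_ids(logits)
text = "".join(toy_vocab[i] for i in token_ids)
print(token_ids)  # [1, 0, 3]
print(text)       # Hello world!
```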
Analogy
Think of tokens as ZIP codes and the tokenizer as the postal service. encode() converts your address (text) into ZIP codes (tokens). decode() reverses it: given ZIP codes, it reconstructs the addresses. The postal service knows the mapping; you don't need to.
Code
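from transformers import AutoTokenizer

# Token IDs are tokenizer-specific: the IDs and sample output here assume a
# vocabulary whose BOS token is '<s>' (ID 1). Other checkpoints will differ,
# so swap in the checkpoint you actually have access to.
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-3.2-8B-Instruct')

# Example 1: decode a hand-written list of IDs, starting with BOS.
token_ids = [1, 733, 16621, 28747]
text = tokenizer.decode(token_ids)
print(f"Token IDs: {token_ids}")
print(f"Decoded text: '{text}'")
print()

# Example 2: the same IDs plus two trailing tokens (a space and a newline).
token_ids_with_special = [1, 733, 16621, 28747, 28705, 13]
text_with_special = tokenizer.decode(token_ids_with_special)
print(f"Token IDs (with special): {token_ids_with_special}")
print(f"Decoded text (with special): '{text_with_special}'")
print()

# Example 3: round trip, encode a string and then decode the result.
multi_token_example = tokenizer.encode('Hello world')
print(f"Encoded 'Hello world': {multi_token_example}")
roundtrip = tokenizer.decode(multi_token_example)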
print(f"Decoded back: '{roundtrip}'")
Output
Token IDs: [1, 733, 16621, 28747]
Decoded text: '<s> Hey there'

Token IDs (with special): [1, 733, 16621, 28747, 28705, 13]
Decoded text (with special): '<s> Hey there \n'

Encoded 'Hello world': [1, 15043, 1687]
Decoded back: ' Hello world'
What just happened?
You decoded two hand-written lists of token IDs, then encoded a string and decoded it back. The first example showed that token ID 1 is the BOS (beginning-of-sequence) special token, which appears in the decoded output as '<s>'. The second example showed that token ID 13 decodes to a newline character. The round-trip example showed that encode → decode preserves the content but may add or strip whitespace because of how the tokenizer marks word boundaries.
Common gotcha
Many developers expect decode() to perfectly reverse encode(), but it doesn't always: whitespace handling is lossy. Encoding 'Hello world' and decoding it produces ' Hello world' (with a leading space added), not the original string. Additionally, special tokens such as '<s>' and '</s>' appear in the decoded output by default; pass skip_special_tokens=True to hide them, which is common for user-facing output.
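One way to catch lossy round trips early is to compare the decoded string with the input. The helper below is only a sketch: round_trip_report is a hypothetical name, and StubTokenizer is a stand-in that mimics just the encode() / decode(skip_special_tokens=...) interface so the example runs offline. With a real Hugging Face tokenizer you would pass that object instead.

```python
class StubTokenizer:
    """Hypothetical stand-in that imitates a Llama-style tokenizer's quirks."""
    BOS_ID = 1
    _vocab = {1: "<s>", 2: " Hello", 3: " world"}
    _ids = {"Hello": 2, "world": 3}

    def encode(self, text):
        # Prepend BOS, then map whitespace-separated words to IDs.
        return [self.BOS_ID] + [self._ids[w] for w in text.split()]

    def decode(self, token_ids, skip_special_tokens=False):
        # Optionally drop the BOS token before joining the text pieces.
        if skip_special_tokens:
            token_ids = [i for i in token_ids if i != self.BOS_ID]
        return "".join(self._vocab[i] for i in token_ids)


def round_trip_report(tokenizer, text):
    """Encode then decode, and report whether the text survived unchanged."""
    ids = tokenizer.encode(text)
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    return {
        "ids": ids,
        "decoded": decoded,
        "exact": decoded == text,
        "exact_after_strip": decoded.strip() == text.strip(),
    }


report = round_trip_report(StubTokenizer(), "Hello world")
print(report)
```

If "exact" is false but "exact_after_strip" is true, the difference is only surrounding whitespace, which is the case this gotcha describes.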
Error recovery
IndexError: list index out of range
AttributeError: 'NoneType' object has no attribute 'decode'
Experienced dev note
In production pipelines, always decode with skip_special_tokens=True unless you're debugging tokenizer behavior. Special tokens like '<s>' confuse end users. Also, cache your tokenizer in memory: don't reload it for every inference. And know that decode() keeps no state between calls, so sharing one tokenizer instance across async workers is generally fine; the expensive part is the forward pass, not decoding.
Check your understanding
If you encode 'Hello', get back token IDs [1, 15043], and then decode only [15043] (without the BOS token), what do you expect to see? Why might the output differ from 'Hello'?
Answer hint
The output will likely have a leading space (' Hello'), because the tokenizer marks tokens that begin a word with a space indicator (▁), and decode() turns that marker back into a space. Decoded on its own, token 15043 still carries the marker, so the space can appear even though your input had none.