How-to · Beginner · 3 min read

How to decode tokens with Hugging Face tokenizer

Quick answer
Use the Hugging Face transformers library's tokenizer decode() method to convert token IDs back into human-readable text. First, load a pretrained tokenizer, then call tokenizer.decode(token_ids) to get the decoded string.

Prerequisites

  • Python 3.8+
  • pip install transformers
  • Basic knowledge of tokenization

Setup

Install the transformers library from Hugging Face. Python 3.8 or higher is required.

bash
pip install transformers

Step by step

This example shows how to load a pretrained tokenizer, encode text into tokens, and then decode those tokens back to text.

python
from transformers import AutoTokenizer

# Load a pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Example text
text = "Hello, Hugging Face!"

# Encode text to token IDs
token_ids = tokenizer.encode(text, add_special_tokens=False)
print("Token IDs:", token_ids)

# Decode token IDs back to string
decoded_text = tokenizer.decode(token_ids)
print("Decoded text:", decoded_text)
output
Token IDs: [7592, 1010, 17662, 2224, 999]
Decoded text: hello, hugging face!
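
To see which subword each ID maps to, rather than the joined string, the tokenizer's convert_ids_to_tokens() method is handy. A short sketch reusing the same bert-base-uncased tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

token_ids = tokenizer.encode("Hello, Hugging Face!", add_special_tokens=False)

# convert_ids_to_tokens returns one subword string per ID,
# which helps when debugging unexpected decode() output
tokens = tokenizer.convert_ids_to_tokens(token_ids)
print(list(zip(token_ids, tokens)))
```

Note that bert-base-uncased lowercases its input during tokenization, which is why the decoded text above comes back in lowercase.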

Common variations

  • Use decode(token_ids, skip_special_tokens=True) to remove special tokens like [CLS] or [SEP].
  • For batch decoding, use tokenizer.batch_decode(list_of_token_id_lists).
  • Different models have different tokenizers, e.g., gpt2, roberta-base.

python
decoded_skip_special = tokenizer.decode(token_ids, skip_special_tokens=True)
print("Decoded without special tokens:", decoded_skip_special)

batch_tokens = [token_ids, token_ids]
batch_decoded = tokenizer.batch_decode(batch_tokens)
print("Batch decoded:", batch_decoded)
output
Decoded without special tokens: hello, hugging face!
Batch decoded: ['hello, hugging face!', 'hello, hugging face!']
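
As a sketch of the last point, here is the same round trip with the gpt2 tokenizer. Its byte-level BPE is lossless, so decode() reproduces the original text exactly, including capitalization (unlike bert-base-uncased, which lowercases):

```python
from transformers import AutoTokenizer

gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')

text = "Hello, Hugging Face!"
ids = gpt2_tokenizer.encode(text)

# Byte-level BPE encodes raw bytes, so decoding recovers the input verbatim
decoded = gpt2_tokenizer.decode(ids)
print(decoded)  # Hello, Hugging Face!
```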

Troubleshooting

  • If decoding returns unexpected characters, make sure the tokenizer was loaded from the same checkpoint as the model that produced the token IDs.
  • Check whether add_special_tokens was set during encoding; if so, special tokens like [CLS] and [SEP] will appear in the decoded output unless you pass skip_special_tokens=True.
  • If [UNK] tokens appear, the input contained text outside the tokenizer's vocabulary; verify your tokenizer vocabulary and model compatibility.
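
A quick way to catch the mismatched-tokenizer problem is a round-trip check. The roundtrip_ok helper below is hypothetical, not part of the transformers API; it normalizes case and whitespace so it also works with lowercasing tokenizers like bert-base-uncased:

```python
from transformers import AutoTokenizer

def roundtrip_ok(tokenizer, text):
    """Encode then decode, comparing with case and whitespace normalized.
    (Hypothetical helper -- not part of the transformers library.)"""
    ids = tokenizer.encode(text, add_special_tokens=False)
    decoded = tokenizer.decode(ids)
    normalize = lambda s: "".join(s.lower().split())
    return normalize(decoded) == normalize(text)

bert = AutoTokenizer.from_pretrained('bert-base-uncased')
gpt2 = AutoTokenizer.from_pretrained('gpt2')

text = "Hello, Hugging Face!"
print(roundtrip_ok(bert, text))  # the matching tokenizer round-trips

# Decoding BERT IDs with the GPT-2 tokenizer produces unrelated text
bert_ids = bert.encode(text, add_special_tokens=False)
print(gpt2.decode(bert_ids))
```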

Key takeaways

  • Use tokenizer.decode() to convert token IDs back to readable text.
  • Pass skip_special_tokens=True to remove model-specific special tokens during decoding.
  • Batch decode multiple token sequences efficiently with tokenizer.batch_decode().
  • Always load the tokenizer that matches your model to avoid decoding errors.
  • The Hugging Face transformers library provides a simple API for tokenization and decoding.
Verified 2026-04 · bert-base-uncased, gpt2, roberta-base