How-to · Beginner · 3 min read

How to decode tokens with Hugging Face tokenizer

Quick answer
Use the Hugging Face transformers library's tokenizer decode() method to convert token IDs back into human-readable text. First, load a pretrained tokenizer, then call tokenizer.decode(token_ids) to get the decoded string.

Prerequisites

  • Python 3.8+
  • pip install transformers
  • Basic knowledge of tokenization

Setup

Install the transformers library from Hugging Face. Python 3.8 or higher is required.

bash
pip install transformers

Step by step

This example shows how to load a pretrained tokenizer, encode text into tokens, and then decode those tokens back to text.

python
from transformers import AutoTokenizer

# Load a pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Example text
text = "Hello, Hugging Face!"

# Encode text to token IDs
token_ids = tokenizer.encode(text, add_special_tokens=False)
print("Token IDs:", token_ids)

# Decode token IDs back to string
decoded_text = tokenizer.decode(token_ids)
print("Decoded text:", decoded_text)
output
Token IDs: [7592, 1010, 17662, 2224, 999]
Decoded text: hello, hugging face!
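
To see which subword each ID maps to, rather than the joined string, the tokenizer's convert_ids_to_tokens() method is handy. A short sketch reusing the same bert-base-uncased tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

token_ids = tokenizer.encode("Hello, Hugging Face!", add_special_tokens=False)

# convert_ids_to_tokens returns one subword string per ID,
# which helps when debugging unexpected decode() output
tokens = tokenizer.convert_ids_to_tokens(token_ids)
print(list(zip(token_ids, tokens)))
```

Note that bert-base-uncased lowercases its input during tokenization, which is why the decoded text above comes back in lowercase.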

Common variations

  • Use decode(token_ids, skip_special_tokens=True) to remove special tokens like [CLS] or [SEP].
  • For batch decoding, use tokenizer.batch_decode(list_of_token_id_lists).
  • Different models have different tokenizers, e.g., gpt2, roberta-base.

python
decoded_skip_special = tokenizer.decode(token_ids, skip_special_tokens=True)
print("Decoded without special tokens:", decoded_skip_special)

batch_tokens = [token_ids, token_ids]
batch_decoded = tokenizer.batch_decode(batch_tokens)
print("Batch decoded:", batch_decoded)
output
Decoded without special tokens: hello, hugging face!
Batch decoded: ['hello, hugging face!', 'hello, hugging face!']
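
As a sketch of the last point, here is the same round trip with the gpt2 tokenizer. Its byte-level BPE is lossless, so decode() reproduces the original text exactly, including capitalization (unlike bert-base-uncased, which lowercases):

```python
from transformers import AutoTokenizer

gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')

text = "Hello, Hugging Face!"
ids = gpt2_tokenizer.encode(text)

# Byte-level BPE encodes raw bytes, so decoding recovers the input verbatim
decoded = gpt2_tokenizer.decode(ids)
print(decoded)  # Hello, Hugging Face!
```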

Troubleshooting

  • If decoding returns unexpected characters, make sure the tokenizer was loaded from the same checkpoint as the model that produced the token IDs.
  • Check whether add_special_tokens was set during encoding; if so, special tokens like [CLS] and [SEP] will appear in the decoded output unless you pass skip_special_tokens=True.
  • If [UNK] tokens appear, the input contained text outside the tokenizer's vocabulary; verify your tokenizer vocabulary and model compatibility.
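
A quick way to catch the mismatched-tokenizer problem is a round-trip check. The roundtrip_ok helper below is hypothetical, not part of the transformers API; it normalizes case and whitespace so it also works with lowercasing tokenizers like bert-base-uncased:

```python
from transformers import AutoTokenizer

def roundtrip_ok(tokenizer, text):
    """Encode then decode, comparing with case and whitespace normalized.
    (Hypothetical helper -- not part of the transformers library.)"""
    ids = tokenizer.encode(text, add_special_tokens=False)
    decoded = tokenizer.decode(ids)
    normalize = lambda s: "".join(s.lower().split())
    return normalize(decoded) == normalize(text)

bert = AutoTokenizer.from_pretrained('bert-base-uncased')
gpt2 = AutoTokenizer.from_pretrained('gpt2')

text = "Hello, Hugging Face!"
print(roundtrip_ok(bert, text))  # the matching tokenizer round-trips

# Decoding BERT IDs with the GPT-2 tokenizer produces unrelated text
bert_ids = bert.encode(text, add_special_tokens=False)
print(gpt2.decode(bert_ids))
```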

Key takeaways

  • Use tokenizer.decode() to convert token IDs back to readable text.
  • Pass skip_special_tokens=True to remove model-specific special tokens during decoding.
  • Batch decode multiple token sequences efficiently with tokenizer.batch_decode().
  • Always load the tokenizer that matches your model to avoid decoding errors.
  • The Hugging Face transformers library provides a simple API for tokenization and decoding.
Verified 2026-04 · bert-base-uncased, gpt2, roberta-base