How to use AutoTokenizer in Python
Direct answer
Use AutoTokenizer.from_pretrained() to load a tokenizer by model name, then call tokenizer(text) to tokenize input text in Python.

Setup
Install
pip install transformers

Imports
from transformers import AutoTokenizer

Examples
in: Hello, how are you?
out: {'input_ids': [101, 7592, 1010, 2129, 2024, 2017, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
in: Transformers are amazing for NLP tasks.
out: {'input_ids': [101, 19081, 2024, 6429, 2005, 17953, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
in: (empty string)
out: {'input_ids': [101, 102], 'attention_mask': [1, 1]}
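The pairs above can be reproduced with a short script. The ids shown assume the bert-base-uncased vocabulary; other checkpoints produce different ids.

```python
from transformers import AutoTokenizer

# bert-base-uncased matches the ids shown above; downloads on first use
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for text in ["Hello, how are you?", "Transformers are amazing for NLP tasks.", ""]:
    enc = tokenizer(text)
    # Every encoding is wrapped in [CLS] (id 101) and [SEP] (id 102);
    # an empty string yields only those two special tokens.
    print(repr(text), "->", enc["input_ids"])
```

Note that BERT checkpoints also return a token_type_ids list alongside input_ids and attention_mask.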
Integration steps
- Import AutoTokenizer from transformers.
- Load a pretrained tokenizer using AutoTokenizer.from_pretrained with the model name.
- Call the tokenizer on your input text to get tokenized output.
- Use the tokenized output for model input or further processing.
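For the last step, "further processing" often means mapping ids back to subword tokens or readable text. A minimal sketch (bert-base-uncased is just an example checkpoint):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("Hello, how are you?")

# Map ids back to the subword tokens the model actually sees
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])
print(tokens)  # includes the [CLS] and [SEP] special tokens

# Or reconstruct a readable string, dropping the special tokens
text = tokenizer.decode(enc["input_ids"], skip_special_tokens=True)
print(text)
```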
Full code
from transformers import AutoTokenizer
# Load tokenizer for a pretrained model, e.g., bert-base-uncased
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# Example input text
text = "Hello, how are you?"
# Tokenize the input text
encoded_input = tokenizer(text)
# Print the tokenized output
print(encoded_input)

output
{'input_ids': [101, 7592, 1010, 2129, 2024, 2017, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

API trace
Request
{"model_name_or_path": "bert-base-uncased", "text": "Hello, how are you?"} Response
{"input_ids": [101, 7592, 1010, 2129, 2024, 2017, 102], "token_type_ids": [0, 0, 0, 0, 0, 0, 0], "attention_mask": [1, 1, 1, 1, 1, 1, 1]} Extract
encoded_input = tokenizer(text); use encoded_input['input_ids'] or pass encoded_input directly

Variants
Batch Tokenization ›
Use when tokenizing multiple texts at once for efficient batch processing.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
texts = ["Hello, how are you?", "Transformers are great!"]
encoded_batch = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
print(encoded_batch)

Tokenization with Padding and Truncation ›
Use when you need fixed-length inputs with padding and truncation for model compatibility.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
text = "This is a longer text that might need truncation."
encoded = tokenizer(text, padding='max_length', truncation=True, max_length=10)
print(encoded)

Tokenization with Return Tensors ›
Use when you want the output as PyTorch tensors directly for model input.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
text = "Hello, how are you?"
encoded = tokenizer(text, return_tensors='pt')
print(encoded)

Performance
Latency: ~10-50 ms per single text, depending on text length
Cost: Free; runs locally without API calls
Rate limits: None; local library usage
- Use batch tokenization to reduce overhead when processing multiple texts.
- Enable truncation to limit token length and reduce memory usage.
- Use fast tokenizers (default in transformers) for better performance.
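The batching advice can be checked with a quick, unscientific timing sketch; absolute numbers depend on hardware, text length, and whether a fast (Rust-backed) tokenizer is in use:

```python
import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
texts = ["Hello, how are you?"] * 1000

# One tokenizer call per text
start = time.perf_counter()
for t in texts:
    tokenizer(t)
per_text = time.perf_counter() - start

# A single batched call over the same texts
start = time.perf_counter()
batch = tokenizer(texts)
batched = time.perf_counter() - start

print(f"loop: {per_text:.3f}s  batch: {batched:.3f}s")
```

Fast tokenizers process batched inputs in Rust, so the batched call usually pulls ahead as input size grows.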
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Single text tokenization | ~10-50ms | Free | Quick tokenization of individual texts |
| Batch tokenization | ~20-100ms | Free | Efficient processing of multiple texts |
| Tokenization with return_tensors | ~15-60ms | Free | Direct input to PyTorch or TensorFlow models |
Quick tip
Always specify `padding` and `truncation` parameters when tokenizing batches to ensure consistent input sizes.
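A quick way to see why: without padding, the sequences in a batch come back with different lengths and cannot be stacked into a single tensor (sketch assuming bert-base-uncased):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
texts = ["Hi!", "A much longer sentence that produces more tokens."]

no_pad = tokenizer(texts)
print([len(ids) for ids in no_pad["input_ids"]])  # unequal lengths

padded = tokenizer(texts, padding=True, truncation=True)
lengths = [len(ids) for ids in padded["input_ids"]]
print(lengths)  # equal: shorter sequences are filled with the pad token id
```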
Common mistake
Forgetting to call `from_pretrained()` and trying to instantiate `AutoTokenizer` directly causes errors.
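For reference, the failure mode looks like this: AutoTokenizer is a factory for concrete tokenizer classes, not a tokenizer itself, and raises an error on direct instantiation.

```python
from transformers import AutoTokenizer

try:
    tokenizer = AutoTokenizer()  # wrong: AutoTokenizer cannot be instantiated directly
except OSError as err:
    # transformers raises EnvironmentError (an alias of OSError) that
    # points you to AutoTokenizer.from_pretrained(...)
    print(err)

# correct: let from_pretrained select and load the right tokenizer class
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
```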