Concept · Beginner · 3 min read

What is tokenization in LLMs?

Quick answer
Tokenization in large language models (LLMs) is the process of breaking text into smaller units called tokens: words, subwords, or characters. These tokens are the basic input units an LLM uses to understand and generate language.

How it works

Tokenization splits text into manageable pieces called tokens. Think of it like cutting a sentence into puzzle pieces so the model can understand and predict the next piece. Instead of processing entire sentences or paragraphs at once, LLMs work with these tokens to analyze context and generate coherent responses.

For example, the sentence "I love AI" might be tokenized into ["I", " love", " AI"]. Some tokenizers break words further into subwords or characters, especially for rare or new words, improving the model's flexibility.
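The simplest possible tokenizer just splits on spaces. The sketch below shows that naive approach for contrast; real LLM tokenizers instead keep leading spaces attached to tokens (as in the example above) and fall back to subwords for unfamiliar words.

```python
def whitespace_tokenize(text: str) -> list[str]:
    # Naive word-level tokenization: split on single spaces.
    # Real LLM tokenizers use learned subword vocabularies instead.
    return text.split(" ")

print(whitespace_tokenize("I love AI"))  # ['I', 'love', 'AI']
```

This approach breaks down on rare words, typos, and other languages, which is exactly why subword tokenizers exist.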

Concrete example

Here’s a Python example using the tiktoken library (OpenAI’s tokenizer) to tokenize a sentence:

python
import tiktoken

# Initialize tokenizer for gpt-4o
enc = tiktoken.encoding_for_model("gpt-4o")

text = "Tokenization breaks text into tokens."
tokens = enc.encode(text)
print("Tokens:", tokens)
print("Decoded tokens:", [enc.decode([t]) for t in tokens])
output (illustrative; exact token IDs depend on the encoding)
Tokens: [a short list of integer IDs, one per token]
Decoded tokens: something like ['Token', 'ization', ' breaks', ' text', ' into', ' tokens', '.']

When to use it

Tokenization happens whenever you input text to an LLM for tasks like text generation, summarization, or translation; it converts raw text into the numerical format the model understands. LLMs cannot process raw strings directly, so every input passes through a tokenizer, whether you call one explicitly or the API does it for you.

Tokenization also helps optimize model efficiency by controlling token length, which affects cost and speed.
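Token count can be estimated even without a tokenizer. The heuristic below (roughly 4 characters per token for English prose) and the per-token price are illustrative assumptions, not real pricing; for exact counts, use the model's own tokenizer, such as tiktoken from the example above.

```python
def approx_token_count(text: str) -> int:
    # Rough rule of thumb: ~4 characters per token for English prose.
    # Exact counts require the model's own tokenizer (e.g., tiktoken).
    return max(1, len(text) // 4)

prompt = "Summarize the following article in three bullet points."
n = approx_token_count(prompt)
# Hypothetical price of $5 per million input tokens (check real pricing)
print(f"~{n} tokens, est. input cost ${n * 5 / 1_000_000:.6f}")
```

Estimates like this are useful for budgeting and for checking that a prompt fits within a model's context window.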

Key terms

  • Token: A unit of text (word, subword, or character) used as input for LLMs.
  • Tokenizer: A tool or algorithm that converts raw text into tokens.
  • Subword: A fragment of a word used as a token to handle rare or complex words.
  • Encoding: The process of converting tokens into numerical IDs for model input.
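The encoding step can be sketched with a toy vocabulary. The five-entry vocab below is made up for illustration; real tokenizers learn vocabularies of tens or hundreds of thousands of entries from data.

```python
# Toy vocabulary mapping token strings to integer IDs (illustrative only)
vocab = {"Token": 0, "ization": 1, " breaks": 2, " text": 3, ".": 4}
inv_vocab = {i: s for s, i in vocab.items()}

def toy_encode(text: str) -> list[int]:
    # Greedy longest-match: repeatedly take the longest vocab entry
    # that matches the start of the remaining text.
    ids = []
    while text:
        for piece in sorted(vocab, key=len, reverse=True):
            if text.startswith(piece):
                ids.append(vocab[piece])
                text = text[len(piece):]
                break
        else:
            raise ValueError("no matching token")
    return ids

ids = toy_encode("Tokenization breaks text.")
print(ids)                                  # [0, 1, 2, 3, 4]
print("".join(inv_vocab[i] for i in ids))   # Tokenization breaks text.
```

Note the round trip: decoding the IDs and concatenating the pieces reconstructs the original text exactly, which is a core property of real tokenizers too.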

Key takeaways

  • Tokenization converts raw text into tokens, the fundamental units LLMs process.
  • Different tokenizers split text differently, often into subwords for flexibility.
  • Always tokenize text before feeding it to an LLM to ensure proper understanding.
  • Token count affects model input size, cost, and performance.
  • Understanding tokenization helps optimize prompt design and model usage.
Verified 2026-04 · gpt-4o, gpt-4o-mini