Concept · Beginner · 3 min read

What is tokenization in LLMs?

Quick answer
Tokenization in large language models (LLMs) is the process of breaking text into smaller units called tokens: words, subwords, or characters. These tokens are the basic input units an LLM uses to understand and generate language.

How it works

Tokenization splits text into manageable pieces called tokens. Think of it like cutting a sentence into puzzle pieces so the model can understand and predict the next piece. Instead of processing entire sentences or paragraphs at once, LLMs work with these tokens to analyze context and generate coherent responses.

For example, the sentence "I love AI" might be tokenized into ["I", " love", " AI"]. Some tokenizers break words further into subwords or characters, especially for rare or new words, improving the model's flexibility.
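The simplest possible tokenizer just splits on spaces. The sketch below shows that naive approach for contrast; real LLM tokenizers instead keep leading spaces attached to tokens (as in the example above) and fall back to subwords for unfamiliar words.

```python
def whitespace_tokenize(text: str) -> list[str]:
    # Naive word-level tokenization: split on single spaces.
    # Real LLM tokenizers use learned subword vocabularies instead.
    return text.split(" ")

print(whitespace_tokenize("I love AI"))  # ['I', 'love', 'AI']
```

This approach breaks down on rare words, typos, and other languages, which is exactly why subword tokenizers exist.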

Concrete example

Here’s a Python example using the tiktoken library (OpenAI’s tokenizer) to tokenize a sentence:

python
import tiktoken

# Initialize tokenizer for gpt-4o
enc = tiktoken.encoding_for_model("gpt-4o")

text = "Tokenization breaks text into tokens."
tokens = enc.encode(text)
print("Tokens:", tokens)
print("Decoded tokens:", [enc.decode([t]) for t in tokens])
output (illustrative; exact token IDs depend on the encoding)
Tokens: [a short list of integer IDs, one per token]
Decoded tokens: something like ['Token', 'ization', ' breaks', ' text', ' into', ' tokens', '.']

When to use it

Tokenization happens whenever you input text to an LLM for tasks like text generation, summarization, or translation; it converts raw text into the numerical format the model understands. LLMs cannot process raw strings directly, so every input passes through a tokenizer, whether you call one explicitly or the API does it for you.

Tokenization also helps optimize model efficiency by controlling token length, which affects cost and speed.
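Token count can be estimated even without a tokenizer. The heuristic below (roughly 4 characters per token for English prose) and the per-token price are illustrative assumptions, not real pricing; for exact counts, use the model's own tokenizer, such as tiktoken from the example above.

```python
def approx_token_count(text: str) -> int:
    # Rough rule of thumb: ~4 characters per token for English prose.
    # Exact counts require the model's own tokenizer (e.g., tiktoken).
    return max(1, len(text) // 4)

prompt = "Summarize the following article in three bullet points."
n = approx_token_count(prompt)
# Hypothetical price of $5 per million input tokens (check real pricing)
print(f"~{n} tokens, est. input cost ${n * 5 / 1_000_000:.6f}")
```

Estimates like this are useful for budgeting and for checking that a prompt fits within a model's context window.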

Key terms

  • Token: A unit of text (word, subword, or character) used as input for LLMs.
  • Tokenizer: A tool or algorithm that converts raw text into tokens.
  • Subword: A fragment of a word used as a token to handle rare or complex words.
  • Encoding: The process of converting tokens into numerical IDs for model input.
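The encoding step can be sketched with a toy vocabulary. The five-entry vocab below is made up for illustration; real tokenizers learn vocabularies of tens or hundreds of thousands of entries from data.

```python
# Toy vocabulary mapping token strings to integer IDs (illustrative only)
vocab = {"Token": 0, "ization": 1, " breaks": 2, " text": 3, ".": 4}
inv_vocab = {i: s for s, i in vocab.items()}

def toy_encode(text: str) -> list[int]:
    # Greedy longest-match: repeatedly take the longest vocab entry
    # that matches the start of the remaining text.
    ids = []
    while text:
        for piece in sorted(vocab, key=len, reverse=True):
            if text.startswith(piece):
                ids.append(vocab[piece])
                text = text[len(piece):]
                break
        else:
            raise ValueError("no matching token")
    return ids

ids = toy_encode("Tokenization breaks text.")
print(ids)                                  # [0, 1, 2, 3, 4]
print("".join(inv_vocab[i] for i in ids))   # Tokenization breaks text.
```

Note the round trip: decoding the IDs and concatenating the pieces reconstructs the original text exactly, which is a core property of real tokenizers too.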

Key takeaways

  • Tokenization converts raw text into tokens, the fundamental units LLMs process.
  • Different tokenizers split text differently, often into subwords for flexibility.
  • Always tokenize text before feeding it to an LLM to ensure proper understanding.
  • Token count affects model input size, cost, and performance.
  • Understanding tokenization helps optimize prompt design and model usage.
Verified 2026-04 · gpt-4o, gpt-4o-mini