Concept beginner · 3 min read

What is a token in AI?

Quick answer
A token is the unit of text that language models process: a word, part of a word, or even a single character. Tokens are the building blocks models like gpt-4o use to understand and generate language, predicting one token at a time.

How it works

Tokens are like puzzle pieces of language. Instead of reading whole sentences at once, AI models break text into smaller pieces called tokens. These can be whole words, parts of words, or even characters depending on the tokenizer. The model then predicts the next token based on the previous ones, building sentences piece by piece. This is similar to how you might guess the next word in a sentence by looking at the words before it.
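To make next-token prediction concrete, here is a toy sketch: a counting-based predictor that learns which word most often follows another in a tiny made-up corpus. Real language models use neural networks over subword tokens, not word counts, so this is only an illustration of the "predict the next piece from the previous pieces" idea.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, split into word-level "tokens" for simplicity
corpus = "the cat sat on the mat the cat ran".split()

# Count which token follows which
followers = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    followers[current][nxt] += 1

def predict_next(token):
    """Return the token most often seen after `token` in the corpus."""
    return followers[token].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" twice, "mat" once -> "cat"
```

A real model does the same kind of prediction, but over tens of thousands of subword tokens and with learned probabilities instead of raw counts.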

Concrete example

Here is a simple example using the tiktoken library (installable with pip install tiktoken) to see how the tokenizer used by gpt-4o splits a short string into tokens:

python
import tiktoken

text = "Hello, world!"

# Load the encoding that gpt-4o uses (o200k_base)
encoding = tiktoken.encoding_for_model("gpt-4o")
token_ids = encoding.encode(text)

print(f"Input text: {text}")
print(f"Number of tokens: {len(token_ids)}")
print(f"Tokens: {[encoding.decode([t]) for t in token_ids]}")

output (exact splits can vary between tokenizer versions)
Input text: Hello, world!
Number of tokens: 4
Tokens: ['Hello', ',', ' world', '!']

Note that the API also reports exact token usage for each request in the response's usage field, so you can confirm counts without a local tokenizer.

When to use it

Think in tokens whenever you work with language models: context-window limits, API pricing, and model behavior are all measured in tokens. Tokenization is essential for preprocessing text before sending it to models like gpt-4o or claude-3-5-sonnet-20241022. Avoid treating tokens as whole words: tokenizers often split words into subwords or characters, which affects token counts and model outputs.
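When a precise tokenizer is not available, a common rule of thumb is that English text averages roughly four characters per token. The sketch below uses that heuristic to estimate token count and cost; the chars-per-token ratio and the $2.50 per million input tokens price are illustrative assumptions, not exact figures, so use a real tokenizer and your provider's current pricing for anything that matters.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Not exact -- use a real tokenizer (e.g. tiktoken) for precise counts.
    return max(1, len(text) // 4)

prompt = "Summarize the following article in three bullet points."
est = estimate_tokens(prompt)
print(f"Estimated tokens: ~{est}")

# Hypothetical price of $2.50 per million input tokens
cost = est * 2.50 / 1_000_000
print(f"Estimated input cost: ~${cost:.8f}")
```

This kind of back-of-the-envelope estimate is handy for budgeting or checking whether text will fit a context window before making an API call.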

Key terms

Token: A unit of text (word, subword, or character) processed by AI models.
Tokenizer: A tool or algorithm that splits text into tokens.
Subword: A fragment of a word used as a token to handle rare or complex words.
Language model: An AI model that predicts the next token in a sequence to generate text.

Key Takeaways

  • Tokens are the fundamental units AI models use to read and generate text.
  • Tokenization breaks text into pieces smaller than words, often subwords or characters.
  • Understanding tokens helps manage input size and cost when using language models.
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022