
BOS and EOS token handling

What you will learn

BOS (beginning-of-sequence) and EOS (end-of-sequence) tokens mark where a sequence starts and stops for LLaMA, which affects both model behavior and token accounting.

Why this matters

Mishandling these tokens causes the model to behave unexpectedly, waste tokens on malformed prompts, or fail to understand conversation boundaries: issues that aren't obvious until you see degraded output quality or unexpected token counts.

Skip if: You don't need manual BOS/EOS handling when using high-level APIs like Ollama's chat endpoint or transformers' pipeline() for simple inference: the library adds them automatically. You only need to handle them manually when using raw tokenization and token_ids for fine-tuning, batch processing, or building custom generation loops.

Explanation

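What it is: LLaMA models use special tokens to mark sequence boundaries. The BOS (beginning-of-sequence) token tells the model "a new sequence starts here," and the EOS (end-of-sequence) token is what the model emits to say "this sequence is finished," which the generation loop treats as the cue to stop. In the LLaMA 3.x tokenizer, BOS is <|begin_of_text|> (ID 128000) and EOS is <|end_of_text|> (ID 128001). Instruct variants additionally end each chat turn with <|eot_id|> (ID 128009), and their tokenizer may report that token as eos_token_id, so check rather than assume.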

How it works mechanically: When you tokenize text with the LLaMA 3.x Hugging Face tokenizer, add_special_tokens=True (the default) prepends BOS. It does not append EOS, so add EOS yourself where you need it, for example at the end of each fine-tuning example. During inference, if you omit BOS, the model lacks the signal that a fresh sequence is beginning, and output quality can degrade because the model was trained on sequences that start with BOS. EOS works in the other direction: the model emits it when it considers the sequence complete, and the generation loop must check for it. If nothing checks for EOS, generation runs until the token limit, which either truncates the answer or wastes computation on text past the natural end. In the LLaMA 3.x vocabulary, token IDs below 128000 are regular vocabulary; IDs from 128000 up are reserved for special tokens.
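The stopping behavior described above can be sketched without loading a model. This is a minimal illustration, not how transformers implements generate(): generate_until_eos and scripted_model are hypothetical names, the "model" simply replays a fixed list of token IDs, and the BOS/EOS constants assume the LLaMA 3.x tokenizer.

```python
from typing import Callable, List

BOS_ID = 128000  # <|begin_of_text|> in the LLaMA 3.x tokenizer (assumed here)
EOS_ID = 128001  # <|end_of_text|> in the LLaMA 3.x tokenizer (assumed here)


def generate_until_eos(
    prompt_ids: List[int],
    next_token_fn: Callable[[List[int]], int],
    max_new_tokens: int = 20,
) -> List[int]:
    """Append tokens until EOS appears or the token budget runs out."""
    ids = list(prompt_ids)
    if not ids or ids[0] != BOS_ID:
        ids.insert(0, BOS_ID)  # mark the start of a fresh sequence
    for _ in range(max_new_tokens):
        token = next_token_fn(ids)
        ids.append(token)
        if token == EOS_ID:  # the loop, not the model, decides to stop
            break
    return ids


def scripted_model(script: List[int]) -> Callable[[List[int]], int]:
    """Hypothetical stand-in for a model: replays a fixed list of token IDs."""
    remaining = iter(script)
    return lambda ids: next(remaining)


# The scripted model emits EOS as its third token, so the fourth is never requested.
out = generate_until_eos([3923, 374], scripted_model([17, 10, EOS_ID, 99]))
print(out)
```

If the EOS check is removed from the loop, the same call keeps pulling tokens until max_new_tokens is reached, which is the runaway case the paragraph above describes.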

When to use it: Use explicit BOS/EOS handling when writing raw generation loops, building batch inference pipelines, or fine-tuning custom datasets. Skip it when using Ollama's chat API or HuggingFace transformers pipelines: they handle it transparently.

Analogy

Think of BOS and EOS like envelope markers on paper mail. BOS is the envelope that says 'this is a new letter': without it, the postal service (model) treats your message as a continuation of the last one. EOS is the seal that says 'end of letter': without it, the sorter doesn't know when to stop reading.

Code

Illustrative only - not runnable without Hugging Face access to the gated meta-llama checkpoint and enough memory to load an 8B model

python

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Base (non-Instruct) checkpoint; its tokenizer reports <|end_of_text|> as EOS
model_id = 'meta-llama/Llama-3.1-8B'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map='auto')

# Inspect the special token IDs instead of hardcoding them
print(f'BOS token ID: {tokenizer.bos_token_id}')
print(f'EOS token ID: {tokenizer.eos_token_id}')
print(f'BOS token: {tokenizer.decode([tokenizer.bos_token_id])}')
print(f'EOS token: {tokenizer.decode([tokenizer.eos_token_id])}')
print()

text = 'What is 2+2?'
print(f'Input text: {text}')
print()

# Content tokens only, no special tokens
tokens_without_special = tokenizer.encode(text, add_special_tokens=False)
print(f'Tokens WITHOUT special tokens: {tokens_without_special}')
print(f'Token count: {len(tokens_without_special)}')
print()

# add_special_tokens=True is the default; this tokenizer prepends BOS and does not append EOS
tokens_with_special = tokenizer.encode(text, add_special_tokens=True)
print(f'Tokens WITH special tokens: {tokens_with_special}')
print(f'Token count: {len(tokens_with_special)}')
print(f'First token is BOS: {tokens_with_special[0] == tokenizer.bos_token_id}')
print(f'Last token is EOS: {tokens_with_special[-1] == tokenizer.eos_token_id}')
print()

# Move the prompt to the same device as the model before generating
input_ids = torch.tensor([tokens_with_special]).to(model.device)
with torch.no_grad():
    output = model.generate(
        input_ids,
        attention_mask=torch.ones_like(input_ids),
        max_new_tokens=20,
        eos_token_id=tokenizer.eos_token_id,  # stop as soon as the model emits EOS
        pad_token_id=tokenizer.eos_token_id,  # LLaMA defines no pad token; reuse EOS
        do_sample=False  # greedy decoding, so no temperature is needed
    )

# output[0] contains the prompt followed by the generated tokens
response = tokenizer.decode(output[0], skip_special_tokens=True)
print(f'Model response: {response}')

Output

BOS token ID: 128000
EOS token ID: 128001
BOS token: <|begin_of_text|>
EOS token: <|end_of_text|>

Input text: What is 2+2?

Tokens WITHOUT special tokens: [3923, 374, 220, 17, 10, 17, 30]
Token count: 7

Tokens WITH special tokens: [128000, 3923, 374, 220, 17, 10, 17, 30]
Token count: 8
First token is BOS: True
Last token is EOS: False

Model response: What is 2+2?

2 + 2 = 4

What just happened?

The code retrieved the BOS and EOS token IDs (128000 and 128001) and their string representations. It then tokenized the same input text twice: once without special tokens (7 tokens) and once with them (8 tokens: BOS is prepended, but the LLaMA 3 tokenizer does not append EOS). That is what you want for a prompt, because an EOS at the end of the input would tell the model the sequence is already finished. The generate() call received input_ids that start with BOS and an eos_token_id to stop on, so generation ends when the model emits EOS or after max_new_tokens, whichever comes first. The continuation shown is representative; the exact text varies by checkpoint.

Common gotcha

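The most common mistake is assuming you have to switch special tokens on. add_special_tokens=True is already the default in tokenizer.encode() and tokenizer(), so BOS is prepended unless you pass add_special_tokens=False. The real traps run the other way. If your text already contains <|begin_of_text|> (for example, a string produced by apply_chat_template(..., tokenize=False)) and you encode it with the default settings, you get two BOS tokens and no error message. And because the LLaMA 3 tokenizer does not append EOS, fine-tuning data that relies on the tokenizer alone has no EOS for the model to learn to stop on. Second gotcha: assuming generate() always needs eos_token_id spelled out. It falls back to the model's generation config, so stopping usually works; problems appear when that config is missing or when you write your own loop and forget the EOS check, in which case generation runs to max_new_tokens, wasting tokens and computation.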
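One way to guard against a missing or duplicated BOS is to normalize token lists just before they reach the model. The sketch below assumes inputs are plain Python lists of token IDs and that BOS is 128000, as in the LLaMA 3.x tokenizer; normalize_bos is a hypothetical helper, not part of transformers.

```python
from typing import List


def normalize_bos(ids: List[int], bos_id: int = 128000) -> List[int]:
    """Return a copy of ids that starts with exactly one BOS token."""
    start = 0
    # Skip every leading BOS, whether there are zero, one, or several
    while start < len(ids) and ids[start] == bos_id:
        start += 1
    return [bos_id] + list(ids[start:])


print(normalize_bos([3923, 374]))                   # BOS missing: it is added
print(normalize_bos([128000, 128000, 3923, 374]))   # BOS doubled: collapsed to one
```

Only leading BOS tokens are touched; a BOS that appears mid-sequence is left alone, since whether that is an error depends on how sequences are packed.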

Error recovery

ValueError: `eos_token_id` not found

The generate() call can't find the EOS token. Fix: ensure tokenizer.eos_token_id is not None. If you are using a custom tokenizer that doesn't define EOS, pass eos_token_id explicitly; for LLaMA 3.x that is 128001 (<|end_of_text|>), but confirm it against the model's generation config before hardcoding.

Nonsensical output quality

The model may be missing the BOS marker it was trained to expect. Fix: check that no code path passes add_special_tokens=False or builds input_ids from raw token lists, since tokenizer.encode(text) prepends BOS by default. Alternatively, manually prepend [tokenizer.bos_token_id] to input_ids before passing them to generate(), and confirm that input_ids[0] is the BOS ID.

AttributeError: 'NoneType' object

The tokenizer returns None for bos_token_id or eos_token_id. Fix: verify you're using the correct model_id and that the tokenizer loaded properly. LLaMA 3.x tokenizers always define both; if either is None, you have most likely loaded a different or incomplete tokenizer.

Experienced dev note

In production batch inference, many engineers add BOS manually in a preprocessing step but forget to add it consistently across all code paths: dev uses add_special_tokens=True, but the inference service uses raw token IDs without it. This causes silent quality degradation in production that doesn't show up in local testing. Create a wrapper function that always adds BOS/EOS consistently, or standardize on using the tokenizer everywhere. Also: token counting for billing/quotas must account for special-token overhead: each sequence adds one token for BOS, plus one more if you append EOS. Finally, when fine-tuning, the training data format must match the inference format: if you add BOS during training but not during inference, the model runs on a different distribution than it learned on.
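A wrapper along these lines keeps framing and token counting in one place. It is a sketch under stated assumptions: SequenceFramer is a hypothetical class, content_ids are assumed to be encoded with add_special_tokens=False, and the default IDs are the LLaMA 3.x values, which real code should read from the tokenizer.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SequenceFramer:
    """Adds BOS (and optionally EOS) the same way on every code path."""
    bos_id: int = 128000       # assumed LLaMA 3.x BOS; read from the tokenizer in real code
    eos_id: int = 128001       # assumed LLaMA 3.x EOS; read from the tokenizer in real code
    append_eos: bool = False   # True for training examples, False for inference prompts

    def frame(self, content_ids: List[int]) -> List[int]:
        """Wrap content tokens that were encoded WITHOUT special tokens."""
        framed = [self.bos_id] + list(content_ids)
        if self.append_eos:
            framed.append(self.eos_id)
        return framed

    def overhead(self) -> int:
        """Special tokens added per sequence, for billing and quota math."""
        return 1 + int(self.append_eos)


content = [3923, 374, 220, 17, 10, 17, 30]
inference = SequenceFramer()                # prompt: BOS + content
training = SequenceFramer(append_eos=True)  # training example: BOS + content + EOS

print(len(inference.frame(content)), inference.overhead())  # 8 1
print(len(training.frame(content)), training.overhead())    # 9 2
```

Both framers treat BOS identically, which is the part that must match between training and inference; EOS is appended only to complete training examples so the model can learn where to stop.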

Check your understanding

You're building a batch inference pipeline that processes 1,000 user queries. You find that 50 queries return garbled output while 950 are fine. A colleague suggests 'just add BOS to everything,' but you're skeptical because BOS should be automatic. What's the actual problem, and how would you verify it?

Show answer hint

A correct answer identifies that some code path is inconsistently adding BOS: probably some queries go through one tokenizer call (with add_special_tokens=True) and others through a different one (without it, or using raw token lists). You'd verify by logging the actual token sequences before and after tokenization, checking that all have token 128000 at position 0.
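The verification step in the hint can be sketched as a small audit over the token sequences that are about to be sent to the model. The function name find_missing_bos and the default BOS ID of 128000 are assumptions for illustration; in a real pipeline the ID should come from tokenizer.bos_token_id.

```python
from typing import List, Sequence


def find_missing_bos(batch: Sequence[Sequence[int]], bos_id: int = 128000) -> List[int]:
    """Return the indices of sequences that do not start with BOS."""
    return [i for i, ids in enumerate(batch) if len(ids) == 0 or ids[0] != bos_id]


batch = [
    [128000, 3923, 374],  # framed correctly
    [3923, 374],          # came through a path that skipped BOS
    [128000, 9906],       # framed correctly
    [],                   # empty input, also flagged
]
print(find_missing_bos(batch))  # [1, 3]
```

Grouping the flagged indices by the code path that produced them usually points to the tokenizer call that behaves differently.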

VERSION LLaMA 1 and 2 used different special token IDs (BOS 1 and EOS 2); the LLaMA 3 family (3.0, 3.1, 3.2) uses 128000/128001. Always verify with tokenizer.bos_token_id and tokenizer.eos_token_id rather than hardcoding: do not assume token IDs across versions.
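A small lookup helper makes "verify rather than hardcode" concrete. This is a sketch: resolve_special_ids is a hypothetical function, and the SimpleNamespace objects below only stand in for real tokenizers so the example runs without downloading anything. The only attributes it relies on are bos_token_id and eos_token_id, which Hugging Face tokenizers expose.

```python
from types import SimpleNamespace
from typing import Tuple


def resolve_special_ids(tokenizer) -> Tuple[int, int]:
    """Read BOS/EOS IDs from the tokenizer and fail loudly if either is missing."""
    bos_id = tokenizer.bos_token_id
    eos_id = tokenizer.eos_token_id
    if bos_id is None or eos_id is None:
        raise ValueError(
            f"tokenizer is missing special tokens: bos={bos_id!r}, eos={eos_id!r}"
        )
    return bos_id, eos_id


# Stand-ins for real tokenizers (assumption: only these two attributes matter here)
older_style = SimpleNamespace(bos_token_id=1, eos_token_id=2)
llama3_style = SimpleNamespace(bos_token_id=128000, eos_token_id=128001)

print(resolve_special_ids(older_style))   # (1, 2)
print(resolve_special_ids(llama3_style))  # (128000, 128001)
```

Raising early turns a silent None into an explicit error at load time instead of a confusing failure inside the generation call.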

Next, learn how to properly format multi-turn conversations (system/user/assistant roles) before tokenization, because BOS alone doesn't tell the model whether it's answering a user or continuing a system instruction.
