BOS and EOS token handling
Why this matters
Mishandled BOS and EOS tokens make the model behave unexpectedly, waste tokens on malformed prompts, or miss sequence boundaries. None of this raises an error: the first symptom is usually degraded output quality or an unexpected token count.
Explanation
What it is: LLaMA models use special tokens to mark sequence boundaries. The BOS (beginning-of-sequence) token tells the model "a new sequence starts here," and the EOS (end-of-sequence) token is what the model emits to signal "stop generating." In LLaMA 3.x, BOS is token ID 128000 (<|begin_of_text|>) and EOS is 128001 (<|end_of_text|>). Instruct-tuned checkpoints additionally end each chat turn with <|eot_id|> (128009), and their tokenizers typically report that token as EOS.
How it works mechanically: When you tokenize text, the tokenizer can add special tokens for you. With the Hugging Face LLaMA 3 tokenizers, add_special_tokens=True prepends BOS but does not append EOS: EOS is something the model produces (or that you append to training examples), not part of a prompt. During inference, if you don't include BOS, the model lacks the signal that a fresh sequence is beginning: it may treat your input as the continuation of earlier text instead of as a new prompt. Similarly, the generation loop has to watch for EOS: if it doesn't, the model keeps producing tokens until it hits the max-token limit, which wastes computation and usually cuts the output off mid-thought. Token IDs below 128000 are regular vocabulary; IDs 128000 and above are reserved for special tokens such as these.
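The stopping behaviour described above can be sketched without loading a model. This is a minimal sketch under stated assumptions: greedy_generate and scripted_model are hypothetical names introduced for illustration, and the scripted function stands in for a real forward pass. Only the token IDs 128000 and 128001 come from the text.

```python
from typing import Callable, Iterable, List, Tuple

BOS_ID = 128000  # <|begin_of_text|>, as stated in the text
EOS_ID = 128001  # <|end_of_text|>, as stated in the text


def greedy_generate(
    prompt_ids: List[int],
    next_token: Callable[[List[int]], int],
    max_new_tokens: int,
) -> Tuple[List[int], str]:
    """Generate until the model emits EOS or the token budget runs out.

    `next_token` is a hypothetical stand-in for "run the model, take the argmax".
    Returns the newly generated tokens and the reason generation stopped.
    """
    ids = list(prompt_ids)
    if not ids or ids[0] != BOS_ID:
        ids.insert(0, BOS_ID)  # mark the start of a fresh sequence

    new_tokens: List[int] = []
    for _ in range(max_new_tokens):
        token = next_token(ids)
        if token == EOS_ID:
            return new_tokens, "eos"  # the model signalled that it is done
        ids.append(token)
        new_tokens.append(token)
    return new_tokens, "max_new_tokens"  # budget hit: output may be cut off


def scripted_model(script: Iterable[int]) -> Callable[[List[int]], int]:
    """Toy 'model' (illustration only): replays `script`, then emits EOS."""
    remaining = iter(script)

    def next_token(_ids: List[int]) -> int:
        return next(remaining, EOS_ID)

    return next_token


if __name__ == "__main__":
    print(greedy_generate([11, 22], scripted_model([17, 10, 17]), max_new_tokens=10))
    print(greedy_generate([11, 22], scripted_model([17, 10, 17]), max_new_tokens=2))
```

Returning the stop reason alongside the tokens makes it easy to tell a natural finish from a truncated one, which is the distinction the paragraph above draws.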
When to use it: Handle BOS/EOS explicitly when writing raw generation loops, building batch inference pipelines, or fine-tuning on custom datasets. Skip it when using Ollama's chat API or Hugging Face transformers pipelines: they handle it transparently.
Analogy
Think of BOS and EOS like envelope markers on paper mail. BOS is the envelope that says 'this is a new letter': without it, the postal service (model) treats your message as a continuation of the last one. EOS is the seal that says 'end of letter': without it, the sorter doesn't know when to stop reading.
Code
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Base (non-instruct) checkpoint: the prompt below is raw text with no chat template,
# and this tokenizer reports <|end_of_text|> as its EOS token.
model_id = 'meta-llama/Llama-3.1-8B'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map='auto')

# Inspect the special tokens the tokenizer knows about
print(f'BOS token ID: {tokenizer.bos_token_id}')
print(f'EOS token ID: {tokenizer.eos_token_id}')
print(f'BOS token: {tokenizer.decode([tokenizer.bos_token_id])}')
print(f'EOS token: {tokenizer.decode([tokenizer.eos_token_id])}')
print()

text = 'What is 2+2?'
print(f'Input text: {text}')
print()

# Tokenize without any special tokens
tokens_without_special = tokenizer.encode(text, add_special_tokens=False)
print(f'Tokens WITHOUT special tokens: {tokens_without_special}')
print(f'Token count: {len(tokens_without_special)}')
print()

# Tokenize with special tokens: the LLaMA 3 tokenizer prepends BOS and does not append EOS
tokens_with_special = tokenizer.encode(text, add_special_tokens=True)
print(f'Tokens WITH special tokens: {tokens_with_special}')
print(f'Token count: {len(tokens_with_special)}')
print(f'First token is BOS: {tokens_with_special[0] == tokenizer.bos_token_id}')
print(f'Last token is EOS: {tokens_with_special[-1] == tokenizer.eos_token_id}')
print()

# Move the prompt to the same device as the model before generating
input_ids = torch.tensor([tokens_with_special]).to(model.device)
with torch.no_grad():
    output = model.generate(
        input_ids,
        attention_mask=torch.ones_like(input_ids),
        max_new_tokens=20,
        eos_token_id=tokenizer.eos_token_id,  # stop as soon as the model emits EOS
        pad_token_id=tokenizer.eos_token_id,  # reuse EOS for padding
        do_sample=False,  # greedy decoding, so no temperature is needed
    )

# output[0] holds the prompt followed by the generated tokens
response = tokenizer.decode(output[0], skip_special_tokens=True)
print(f'Model response: {response}')
Example output (token IDs and the generated text are illustrative and vary with tokenizer and model version):
BOS token ID: 128000
EOS token ID: 128001
BOS token: <|begin_of_text|>
EOS token: <|end_of_text|>

Input text: What is 2+2?

Tokens WITHOUT special tokens: [1264, 365, 220, 17, 10, 17, 29]
Token count: 7

Tokens WITH special tokens: [128000, 1264, 365, 220, 17, 10, 17, 29]
Token count: 8
First token is BOS: True
Last token is EOS: False

Model response: What is 2+2? 2 + 2 = 4
What just happened?
The code retrieved the BOS and EOS token IDs (128000 and 128001) and their string representations. It then tokenized the same input text twice: once without special tokens (7 tokens) and once with them (8 tokens). The tokenizer prepended BOS but did not append EOS, which is what you want for a prompt: an EOS at the end of the input would tell the model the sequence is already finished. The generate() call received input_ids that start with BOS, and eos_token_id told it which token ends generation, so it stops when the model emits EOS or after max_new_tokens, whichever comes first. The decoded response covers the whole output sequence, which is why it repeats the prompt before the answer.
Common gotcha
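The most common mistake is assuming how add_special_tokens behaves without checking. In Hugging Face tokenizers it defaults to True in both tokenizer.encode() and tokenizer(), and for LLaMA 3 tokenizers that means BOS is prepended but EOS is not appended. Problems appear when one code path passes add_special_tokens=False (no BOS at all), or when text that already contains <|begin_of_text|> (for example, the output of a chat template) is tokenized again with special tokens enabled, which produces a doubled BOS. Neither case raises an error, so print the token IDs and confirm there is exactly one BOS at position 0. Second gotcha: relying on defaults for stopping. generate() falls back to the eos_token_id in the model's generation config when you don't pass one, but a hand-written loop has no such fallback: unless you check for EOS yourself, generation runs to the max-token limit, wasting tokens and computation.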
Error recovery
ValueError: `eos_token_id` not found
Nonsensical output quality
AttributeError: 'NoneType' object
Experienced dev note
In production batch inference, many engineers add BOS manually in a preprocessing step but forget to do so consistently across all code paths: dev uses add_special_tokens=True, but the inference service sends raw token IDs without it. This causes silent quality degradation in production that doesn't show up in local testing. Create a wrapper function that always applies BOS the same way, or standardize on using the tokenizer everywhere. Also: token counting for billing/quotas must account for special-token overhead: BOS adds one token to every prompt, and EOS adds one more to every completion that ends naturally, so budget up to 2 extra tokens per sequence. Finally, when fine-tuning, the training data format must match the inference format: if you add BOS during training but not during inference, the model learns on a different distribution than it runs on.
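One way to apply the wrapper advice above, sketched under stated assumptions: with_single_bos and count_tokens are hypothetical helper names, and whether a trailing EOS counts toward a quota is an assumption to check against your own billing rules.

```python
from typing import Dict, List

BOS_ID = 128000  # <|begin_of_text|>, as stated in the text
EOS_ID = 128001  # <|end_of_text|>, as stated in the text


def with_single_bos(token_ids: List[int]) -> List[int]:
    """Return a copy of `token_ids` with exactly one BOS at position 0.

    Idempotent: safe to call on every code path, whether or not an
    upstream step has already added BOS (once or several times).
    """
    body = list(token_ids)
    while body and body[0] == BOS_ID:
        body.pop(0)  # drop any leading BOS tokens that are already present
    return [BOS_ID] + body


def count_tokens(prompt_ids: List[int], completion_ids: List[int]) -> Dict[str, int]:
    """Count tokens for quota purposes, including special-token overhead.

    Assumption: the completion list still contains EOS if the model emitted it.
    """
    prompt_len = len(with_single_bos(prompt_ids))
    completion_len = len(completion_ids)
    return {
        "prompt": prompt_len,
        "completion": completion_len,
        "total": prompt_len + completion_len,
    }


if __name__ == "__main__":
    print(with_single_bos([11, 22, 33]))
    print(count_tokens([11, 22, 33], [44, 55, EOS_ID]))
```

Because the helper is idempotent, calling it in both the dev path and the inference service cannot produce a missing or doubled BOS, whichever path ran first.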
Check your understanding
You're building a batch inference pipeline that processes 1,000 user queries. You find that 50 queries return garbled output while 950 are fine. A colleague suggests 'just add BOS to everything,' but you're skeptical because BOS should be automatic. What's the actual problem, and how would you verify it?
Answer hint
A correct answer identifies that some code path adds BOS inconsistently: probably some queries go through one tokenizer call (with add_special_tokens=True) and others through a different one (with add_special_tokens=False, or using raw token lists). You'd verify by logging the token IDs that each code path actually sends to the model and checking that every sequence has token 128000 at position 0.
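The verification step can be sketched as a small audit over logged token sequences. This is an illustration under assumptions, not a definitive tool: bos_status and audit_batch are hypothetical names, and the batch is assumed to be a mapping from query ID to the token IDs that were sent to the model.

```python
from typing import Dict, List

BOS_ID = 128000  # <|begin_of_text|>, as stated in the text


def bos_status(token_ids: List[int]) -> str:
    """Classify one sequence as 'ok', 'missing', or 'duplicated'."""
    if not token_ids or token_ids[0] != BOS_ID:
        return "missing"
    if len(token_ids) > 1 and token_ids[1] == BOS_ID:
        return "duplicated"
    return "ok"


def audit_batch(batch: Dict[str, List[int]]) -> Dict[str, List[str]]:
    """Group query IDs by BOS status so the odd ones out are easy to trace."""
    report: Dict[str, List[str]] = {"ok": [], "missing": [], "duplicated": []}
    for query_id, token_ids in batch.items():
        report[bos_status(token_ids)].append(query_id)
    return report


if __name__ == "__main__":
    # Placeholder token IDs (11, 22): not real tokenizer output.
    logged = {
        "q1": [BOS_ID, 11, 22],
        "q2": [11, 22],
        "q3": [BOS_ID, BOS_ID, 11],
    }
    print(audit_batch(logged))
```

If the query IDs under "missing" or "duplicated" all trace back to the same service or preprocessing step, that step is the inconsistent code path.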