What instruction tuning is and why it differs from base pre-training

What you will learn

Instruction tuning teaches a pre-trained model to follow user directions by training on (instruction, response) pairs. The objective is still next-token prediction; what changes is the training data and which tokens count toward the loss.

Why this matters

You need to understand this distinction to know which dataset format to prepare, which tokens should contribute to the loss you monitor, and why a base model performs poorly on tasks until it is tuned: it was trained to continue text, not to take instructions.

Skip if: You are building a domain-specific text generator (such as style-conditioned poetry generation) where the model should continue a style rather than follow explicit instructions. Also skip it, at least initially, if your domain is so specialized that continued pre-training on domain text would pay off more than instruction tuning.

Explanation

What it is: Instruction tuning is a supervised fine-tuning approach where you show a pre-trained language model examples of (instruction, expected response) pairs and train it to predict the correct response given an instruction. The model learns to be a task executor, not just a text predictor.

How it works mechanically: During pre-training, the model learns P(next_token | previous_tokens): pure autoregressive prediction. During instruction tuning, you feed formatted examples like "Classify: Is this sentiment positive? Text: Great product! Answer: positive" and train the model to minimize loss only on the answer portion, not the instruction. The model learns that when it sees certain instruction patterns, it should produce specific output patterns. This is still causal language modeling, but the data and training target are different.
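The "loss only on the answer portion" idea can be sketched without any ML library. The snippet below is a minimal illustration, not a training recipe: the logits are made-up numbers, `masked_nll` is a hypothetical helper, and the usual one-position shift between inputs and targets is assumed to have been applied already. The label value -100 follows the convention PyTorch's cross-entropy uses for ignored positions.

```python
import math

IGNORE_INDEX = -100  # label meaning "do not score this position" (PyTorch's default ignore_index)

def masked_nll(logits, labels):
    """Average negative log-likelihood over positions whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue
        # Numerically stable log-softmax for a single position.
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else 0.0

# Toy setup: 4 positions, vocabulary of 3 tokens (made-up numbers).
logits = [
    [2.0, 0.5, 0.1],
    [0.2, 1.5, 0.3],
    [0.1, 0.2, 3.0],
    [1.0, 1.0, 1.0],
]
target_ids = [0, 1, 2, 1]

pretraining_labels = target_ids                     # every position is a target
tuning_labels = [IGNORE_INDEX, IGNORE_INDEX, 2, 1]  # only the "answer" positions are targets

pretraining_loss = masked_nll(logits, pretraining_labels)
tuning_loss = masked_nll(logits, tuning_labels)
print(f"all-token loss:   {pretraining_loss:.4f}")
print(f"answer-only loss: {tuning_loss:.4f}")
```

The model and the loss formula are identical in both calls; only the label mask differs, which is the mechanical core of the distinction described above.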

When to use it: Use instruction tuning when you have a pre-trained model and want it to follow user directions reliably. This is the standard path for creating usable assistants from base models like Llama or Mistral.

Analogy

Pre-training is like reading billions of books to learn language patterns. Instruction tuning is like a mentor giving a student thousands of worked examples saying 'when someone asks you to do X, here's how you should respond': the student already knows the language and now learns the expected behavior.

Code

python

# Only the names used below are imported; no model is loaded and no training is run.
from transformers import AutoTokenizer
from trl import SFTConfig

# Step 1: Create instruction tuning data (input-output pairs)
instruction_data = [
    {
        "instruction": "Classify the sentiment of this review.",
        "input": "The product broke after one week. Total waste of money.",
        "output": "Negative"
    },
    {
        "instruction": "Classify the sentiment of this review.",
        "input": "Arrived on time, works perfectly, great value!",
        "output": "Positive"
    },
    {
        "instruction": "Summarize this text in one sentence.",
        "input": "Machine learning models require large amounts of labeled data. The quality of this data directly impacts model performance. Preprocessing and validation are critical steps.",
        "output": "Data quality and preprocessing are essential for machine learning model performance."
    },
    {
        "instruction": "Translate to French.",
        "input": "Hello, how are you?",
        "output": "Bonjour, comment allez-vous?"
    }
]

# Step 2: Format data for instruction tuning (instruction + input → output)
def format_instruction_example(example):
    if example["input"]:
        return f"Instruction: {example['instruction']}\nInput: {example['input']}\nOutput: {example['output']}"
    else:
        return f"Instruction: {example['instruction']}\nOutput: {example['output']}"

formatted_data = [
    {"text": format_instruction_example(ex)}
    for ex in instruction_data
]

print("Example formatted instruction:")
print(formatted_data[0]["text"])
print("\n" + "="*60 + "\n")

# Step 3: Compare pre-training vs instruction tuning loss targets
print("PRE-TRAINING LOSS TARGET:")
print("Given: 'The quick brown fox jumps'")
print("Predict: 'over' (next token)")
print("Loss computed on: predicting the next word unconditionally\n")

print("INSTRUCTION TUNING LOSS TARGET:")
print("Given: 'Classify sentiment. Input: Great product! Output:'")
print("Predict: 'Positive' (conditional on instruction)")
print("Loss computed ONLY on: the output portion, not instruction/input\n")

# Step 4: Show the tokenization difference
tokenizer = AutoTokenizer.from_pretrained("gpt2")

inst_example = "Classify sentiment. Input: Good movie! Output: Positive"
tokens = tokenizer.encode(inst_example)
print(f"Full example tokenized ({len(tokens)} tokens):")
print(f"{inst_example}")
print(f"Token IDs: {tokens}\n")

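# In instruction tuning, loss is computed only on the tokens that follow "Output:".
# Tokenize the prompt on its own to find where the answer starts; encoding "Positive"
# in isolation would drop the leading space and yield different token IDs.
prompt_text = inst_example[: inst_example.find("Output:") + len("Output:")]
prompt_len = len(tokenizer.encode(prompt_text))
output_tokens = tokens[prompt_len:]
print(f"Loss computed ONLY on these tokens: {output_tokens}")
print("(tokens for the answer 'Positive')")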
print("\n" + "="*60 + "\n")

# Step 5: Demonstrate with a minimal trainer config (no actual training)
config = SFTConfig(
    output_dir="./instruction_tuned",
    max_steps=2,
    per_device_train_batch_size=1,
    learning_rate=2e-4,
    logging_steps=1,
)

print("Instruction Tuning Config Summary:")
print("- Loss function: Causal Language Modeling (still)")
print("- Data format: (instruction, input, output) tuples")
print("- Loss target: output tokens only (vs. all tokens in pre-training)")
print(f"- Learning rate: {config.learning_rate} (typical for LoRA; full fine-tuning usually uses a lower rate)")
print("- Typical duration: 1-3 epochs (vs. pre-training: roughly one pass over a far larger corpus)")

Output

Example formatted instruction:
Instruction: Classify the sentiment of this review.
Input: The product broke after one week. Total waste of money.
Output: Negative

============================================================

PRE-TRAINING LOSS TARGET:
Given: 'The quick brown fox jumps'
Predict: 'over' (next token)
Loss computed on: predicting the next word unconditionally

INSTRUCTION TUNING LOSS TARGET:
Given: 'Classify sentiment. Input: Great product! Output:'
Predict: 'Positive' (conditional on instruction)
Loss computed ONLY on: the output portion, not instruction/input

Full example tokenized (11 tokens):
Classify sentiment. Input: Good movie! Output: Positive
Token IDs: [47066, 21942, 13, 20704, 25, 4599, 3297, 0, 18934, 25, 43352]

Loss computed ONLY on these tokens: [43352]
(tokens for the answer 'Positive')

============================================================

Instruction Tuning Config Summary:
- Loss function: Causal Language Modeling (still)
- Data format: (instruction, input, output) tuples
- Loss target: output tokens only (vs. all tokens in pre-training)
- Learning rate: 0.0002 (typical for LoRA; full fine-tuning usually uses a lower rate)
- Typical duration: 1-3 epochs (vs. pre-training: roughly one pass over a far larger corpus)

What just happened?

The code demonstrated the structural difference between pre-training and instruction tuning: pre-training optimizes the model to predict the next token at every position of raw text, while instruction tuning formats examples as (instruction + input → output) and counts only the output tokens toward the loss. The config is a summary, not a training run. Instruction tuning is short (typically 1-3 epochs over a small dataset) because the model already has language knowledge and only needs to learn the response behavior. The learning rate of 2e-4 is in the range commonly used with LoRA adapters; full fine-tuning generally calls for a lower rate, often around 1e-5 to 5e-5.

Common gotcha

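Developers often compute loss on the entire formatted string, including the instruction and input. This does not reverse the task, but it spends part of the training signal on reproducing prompts, which dilutes what the model learns about answers, especially when prompts are much longer than outputs. To restrict the loss to the answer, mask the labels of the instruction/input tokens (conventionally by setting them to -100, the value PyTorch's cross-entropy ignores). The attention mask is a different thing: it controls which tokens are attended to, not which ones are scored. SFTTrainer does not infer the split from a single text field; in recent trl releases it can restrict the loss to the answer when the dataset provides separate prompt and completion fields, so check the documentation for your installed version. In custom training loops, this masking is where errors creep in.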
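For a custom training loop, the label masking described above can be sketched as follows. This is an illustrative sketch under stated assumptions: `build_labels` and `trained_positions` are hypothetical helpers, and the token IDs are made-up stand-ins for a real tokenizer's output.

```python
IGNORE_INDEX = -100  # assumed convention; PyTorch's cross-entropy ignores this label by default

def build_labels(prompt_ids, completion_ids):
    """Concatenate prompt and completion IDs, masking every prompt position out of the loss."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels

def trained_positions(labels):
    """Indices that will actually contribute to the loss."""
    return [i for i, label in enumerate(labels) if label != IGNORE_INDEX]

# Made-up token IDs standing in for a real tokenizer's output.
prompt_ids = [101, 102, 103, 104]  # e.g. "Instruction: ... Output:"
completion_ids = [201, 202]        # e.g. " Positive" plus an end-of-sequence token

input_ids, labels = build_labels(prompt_ids, completion_ids)
print(input_ids)                  # [101, 102, 103, 104, 201, 202]
print(labels)                     # [-100, -100, -100, -100, 201, 202]
print(trained_positions(labels))  # [4, 5]
```

Printing the trained positions for a few real examples is a cheap sanity check: if the list covers the whole sequence, the prompt is being scored too.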

Error recovery

Model learns to output instruction-like text instead of answers

You likely computed loss on instruction tokens as well as output tokens. Verify that your data formatting includes a clear delimiter (e.g., 'Output:') and that the labels for every token before the answer are masked, whether by your trainer's prompt/completion handling or by your own loss code.

Training loss plateaus but model doesn't follow instructions

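Your learning rate may be wrong for your setup. As rough guides, full fine-tuning typically uses about 1e-5 to 5e-5, while LoRA adapters tolerate about 1e-4 to 3e-4. If you are fully fine-tuning at a LoRA-style rate, lower it by 5-10x and retrain. Also check that your instruction format is consistent across all examples.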

Model outputs reasonable answers but only in training set format

Your test prompts don't match the exact instruction format used in training. Test with prompts matching your training format exactly, then gradually test variants.
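One way to rule out format drift is to build training strings and inference prompts from the same function. The sketch below assumes the "Instruction / Input / Output" layout used in this lesson; `build_prompt` and `build_training_text` are hypothetical helper names.

```python
def build_prompt(instruction, input_text=""):
    """Everything the model sees before it starts generating."""
    prompt = f"Instruction: {instruction}\n"
    if input_text:
        prompt += f"Input: {input_text}\n"
    return prompt + "Output:"

def build_training_text(instruction, input_text, output):
    """Training string = the inference prompt followed by the expected answer."""
    return f"{build_prompt(instruction, input_text)} {output}"

train_text = build_training_text(
    "Translate to French.", "Hello, how are you?", "Bonjour, comment allez-vous?"
)
inference_prompt = build_prompt("Translate to French.", "Good morning.")

print(train_text)
print("---")
print(inference_prompt)
```

Because the training text is defined as the prompt plus the answer, a test prompt built with `build_prompt` cannot differ from the training format in whitespace, casing, or delimiters.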

Experienced dev note

Pre-training and instruction tuning use the same objective but teach different behaviors. Pre-training says 'continue whatever text comes next.' Instruction tuning says 'when you see instruction pattern X, respond with Y.' This is why base models (before instruction tuning) look broken on tasks: they were not trained to be task-solvers. Instruction tuning also converges quickly (often within 1-3 epochs), so it is easy to overfit. Monitor validation loss on held-out instructions closely; if it starts rising while training loss keeps dropping, the model is memorizing your training set.
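The validation-loss check can be automated with a simple patience rule. The following is a minimal sketch, assuming you record one validation loss per evaluation; `should_stop`, `patience`, and `min_delta` are hypothetical names, and the loss values are invented for illustration.

```python
def should_stop(val_losses, patience=2, min_delta=0.0):
    """True once the last `patience` evaluations failed to beat the earlier best by `min_delta`."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta

# Invented validation losses: improvement stalls after the third evaluation.
history = []
for step, val_loss in enumerate([1.20, 0.90, 0.80, 0.82, 0.85], start=1):
    history.append(val_loss)
    if should_stop(history):
        print(f"stop after evaluation {step}")  # prints: stop after evaluation 5
        break
```

With small instruction datasets, evaluating several times per epoch gives this rule enough data points to act before the model has memorized the training set.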

Check your understanding

If you instruction-tune a base model on 100 sentiment classification examples, why is it still likely to do poorly on a new task like 'summarize this text', even though it has the same token vocabulary and was instruction tuned?

Answer hint

Instruction tuning trains the model on the instruction patterns it sees (how to respond when it sees 'Classify sentiment:'). A new task requires an instruction pattern that was never in the training data. With 100 examples of a single task, the model mostly learns conditional behavior on that one instruction, not generalized instruction-following. To handle new tasks, you need either in-context examples (few-shot) or training on diverse instruction types.
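A quick way to see whether a dataset covers diverse instruction types is to count examples per instruction. This is a rough sketch: `instruction_coverage` is a hypothetical helper, both datasets are invented, and exact string matching is only a crude proxy for task diversity.

```python
from collections import Counter

def instruction_coverage(examples):
    """Count how many training examples exist per distinct instruction string."""
    return Counter(ex["instruction"] for ex in examples)

# Hypothetical dataset: 100 sentiment examples and nothing else.
narrow = [{"instruction": "Classify the sentiment of this review."}] * 100

# Hypothetical mixed dataset with three task types.
mixed = (
    narrow[:40]
    + [{"instruction": "Summarize this text in one sentence."}] * 30
    + [{"instruction": "Translate to French."}] * 30
)

for name, data in [("narrow", narrow), ("mixed", mixed)]:
    coverage = instruction_coverage(data)
    print(f"{name}: {len(coverage)} distinct instruction(s)")
    for instruction, count in coverage.most_common():
        print(f"  {count:3d}  {instruction}")
```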

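VERSION The trl API around SFTTrainer and SFTConfig has changed substantially across releases. Older releases relied on a separate completion-only data collator with a response template to mask prompt tokens; recent releases instead restrict the loss to the answer when the dataset has separate prompt and completion fields. A plain 'text' field is trained on in full, with or without delimiters. Check the documentation for the version you have installed.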

Next, learn how to format diverse instruction datasets and use LoRA to make instruction tuning computationally efficient on consumer hardware.
