What instruction tuning is and why it differs from base pre-training
Why this matters
You need to understand this distinction to know which dataset format to prepare, which loss to monitor, and why a base model performs poorly on tasks until it is tuned: it was trained only to predict text, not to follow instructions.
Explanation
What it is: Instruction tuning is a supervised fine-tuning approach where you show a pre-trained language model examples of (instruction, expected response) pairs and train it to predict the correct response given an instruction. The model learns to be a task executor, not just a text predictor.
How it works mechanically: During pre-training, the model learns P(next_token | previous_tokens): pure autoregressive prediction. During instruction tuning, you feed formatted examples like "Classify: Is this sentiment positive? Text: Great product! Answer: positive" and train the model to minimize loss only on the answer portion, not the instruction. The model learns that when it sees certain instruction patterns, it should produce specific output patterns. This is still causal language modeling, but the data and training target are different.
When to use it: Use instruction tuning when you have a pre-trained model and want it to follow user directions reliably. This is the standard path for creating usable assistants from base models like Llama or Mistral.
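The change in training target described above can be written as two losses. This is a sketch in notation the original does not use: $x$ is a token sequence of length $T$, $c$ is the formatted instruction (plus any input), and $y$ is the response of length $M$.

```latex
\mathcal{L}_{\text{pretrain}}(\theta) = -\sum_{t=1}^{T} \log P_\theta\left(x_t \mid x_{<t}\right)
\qquad
\mathcal{L}_{\text{instruct}}(\theta) = -\sum_{t=1}^{M} \log P_\theta\left(y_t \mid c,\, y_{<t}\right)
```

Both are next-token cross-entropy. The second sums only over response tokens, while the instruction tokens serve purely as context.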
Analogy
Pre-training is like reading billions of books to learn language patterns. Instruction tuning is like a mentor giving a student thousands of worked examples that say 'when someone asks you to do X, here's how you should respond': the student already knows the language and now learns the expected behavior.
Code
# Only the names used below are imported
from transformers import AutoTokenizer
from trl import SFTConfig
# Step 1: Create instruction tuning data (input-output pairs)
instruction_data = [
    {
        "instruction": "Classify the sentiment of this review.",
        "input": "The product broke after one week. Total waste of money.",
        "output": "Negative"
    },
    {
        "instruction": "Classify the sentiment of this review.",
        "input": "Arrived on time, works perfectly, great value!",
        "output": "Positive"
    },
    {
        "instruction": "Summarize this text in one sentence.",
        "input": "Machine learning models require large amounts of labeled data. The quality of this data directly impacts model performance. Preprocessing and validation are critical steps.",
        "output": "Data quality and preprocessing are essential for machine learning model performance."
    },
    {
        "instruction": "Translate to French.",
        "input": "Hello, how are you?",
        "output": "Bonjour, comment allez-vous?"
    }
]
# Step 2: Format data for instruction tuning (instruction + input → output)
def format_instruction_example(example):
    # Include the Input line only when the example has an input
    if example["input"]:
        return f"Instruction: {example['instruction']}\nInput: {example['input']}\nOutput: {example['output']}"
    else:
        return f"Instruction: {example['instruction']}\nOutput: {example['output']}"

formatted_data = [
    {"text": format_instruction_example(ex)}
    for ex in instruction_data
]
print("Example formatted instruction:")
print(formatted_data[0]["text"])
print("\n" + "="*60 + "\n")
# Step 3: Compare pre-training vs instruction tuning loss targets
print("PRE-TRAINING LOSS TARGET:")
print("Given: 'The quick brown fox jumps'")
print("Predict: 'over' (next token)")
print("Loss computed on: every token in the sequence\n")
print("INSTRUCTION TUNING LOSS TARGET:")
print("Given: 'Classify sentiment. Input: Great product! Output:'")
print("Predict: 'Positive' (conditional on instruction)")
print("Loss computed ONLY on: the output portion, not instruction/input\n")
# Step 4: Show which tokens the loss covers
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inst_example = "Classify sentiment. Input: Good movie! Output: Positive"
tokens = tokenizer.encode(inst_example)
print(f"Full example tokenized ({len(tokens)} tokens):")
print(inst_example)
print(f"Token IDs: {tokens}\n")
# In instruction tuning, loss is computed only on the tokens that follow the prompt.
# Tokenize the prompt alone to find where the answer starts. Encoding "Positive"
# by itself would drop the leading space and can yield different token IDs.
output_start_idx = inst_example.find("Output: ") + len("Output: ")
prompt_len = len(tokenizer.encode(inst_example[:output_start_idx].rstrip()))
output_tokens = tokens[prompt_len:]
print(f"Loss computed ONLY on these tokens: {output_tokens}")
print("(tokens for the answer 'Positive')")
print("\n" + "="*60 + "\n")
# Step 5: Demonstrate with a minimal trainer config (no actual training)
config = SFTConfig(
    output_dir="./instruction_tuned",
    max_steps=2,
    per_device_train_batch_size=1,
    learning_rate=2e-4,
    logging_steps=1,
)
print("Instruction Tuning Config Summary:")
print("- Loss function: Causal Language Modeling (still)")
print("- Data format: (instruction, input, output) tuples")
print("- Loss target: output tokens only (vs. all tokens in pre-training)")
print(f"- Learning rate: {config.learning_rate} (a common LoRA setting; full fine-tuning usually uses a smaller rate)")
print("- Typical duration: 1-3 epochs (vs. pre-training: roughly one pass over a far larger corpus)")

Output (token counts and IDs depend on the tokenizer, so they are shown as placeholders):

Example formatted instruction:
Instruction: Classify the sentiment of this review.
Input: The product broke after one week. Total waste of money.
Output: Negative

============================================================

PRE-TRAINING LOSS TARGET:
Given: 'The quick brown fox jumps'
Predict: 'over' (next token)
Loss computed on: every token in the sequence

INSTRUCTION TUNING LOSS TARGET:
Given: 'Classify sentiment. Input: Great product! Output:'
Predict: 'Positive' (conditional on instruction)
Loss computed ONLY on: the output portion, not instruction/input

Full example tokenized (<token count> tokens):
Classify sentiment. Input: Good movie! Output: Positive
Token IDs: <list of token IDs>

Loss computed ONLY on these tokens: <token IDs for ' Positive'>
(tokens for the answer 'Positive')

============================================================

Instruction Tuning Config Summary:
- Loss function: Causal Language Modeling (still)
- Data format: (instruction, input, output) tuples
- Loss target: output tokens only (vs. all tokens in pre-training)
- Learning rate: 0.0002 (a common LoRA setting; full fine-tuning usually uses a smaller rate)
- Typical duration: 1-3 epochs (vs. pre-training: roughly one pass over a far larger corpus)
What just happened?
The code demonstrated the structural difference between pre-training and instruction tuning: pre-training optimizes the model to predict every next token in any sequence, while instruction tuning formats examples as (instruction + input → output) and trains the model to predict only the output portion. Instruction tuning also needs far less data and training time than pre-training because the model already has language knowledge: it only needs to learn task-specific behavior.
Common gotcha
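Developers often compute loss on the entire formatted string, including the instruction and input. That spends training signal on reproducing prompts rather than producing answers, and when prompts are long and answers short, the prompt tokens dominate the loss. To restrict the loss to the response, mask the labels of the instruction/input tokens (in PyTorch and Hugging Face models, set them to -100, the ignored index). Do not zero the attention mask for those tokens: the model still has to attend to the prompt. Do not assume SFTTrainer does this for you, either: with a plain text field it typically trains on every token, and completion-only loss needs a prompt/completion dataset format or a completion-only collator, depending on the TRL version. If you build custom training loops, this is where errors creep in.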
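The label masking described above can be sketched without any ML library. This is a minimal illustration, not how any particular trainer implements it: `toy_tokenize` and `build_masked_example` are hypothetical helpers invented here, and the word-level "tokenizer" stands in for a real one. The only real convention used is -100, the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss skips by default


def toy_tokenize(text):
    """Hypothetical stand-in for a real tokenizer: one integer ID per whitespace-separated word."""
    return [sum(ord(ch) for ch in word) for word in text.split()]


def build_masked_example(prompt, completion):
    """Return input_ids and labels where only completion tokens carry loss."""
    prompt_ids = toy_tokenize(prompt)
    completion_ids = toy_tokenize(completion)
    input_ids = prompt_ids + completion_ids
    # Prompt positions get IGNORE_INDEX so they contribute nothing to the loss;
    # completion positions keep their real token IDs as targets.
    labels = [IGNORE_INDEX] * len(prompt_ids) + completion_ids
    return {"input_ids": input_ids, "labels": labels}


example = build_masked_example(
    prompt="Instruction: Classify the sentiment of this review.\nInput: Great product!\nOutput:",
    completion="Positive",
)
n_loss_tokens = sum(1 for label in example["labels"] if label != IGNORE_INDEX)
print(f"{n_loss_tokens} of {len(example['labels'])} tokens contribute to the loss")
```

The model still receives every token in `input_ids`; only the targets are masked. With a real tokenizer, the boundary between prompt and completion has to be found in token space, which is where off-by-one mistakes tend to appear.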
Error recovery
- Model learns to output instruction-like text instead of answers
- Training loss plateaus but model doesn't follow instructions
- Model outputs reasonable answers but only in training set format
Experienced dev note
Pre-training and instruction tuning use the same next-token objective, but they teach different behaviors. Pre-training says 'predict anything that comes next.' Instruction tuning says 'when you see instruction pattern X, output Y.' This is why base models (before instruction tuning) look broken on tasks: they were not trained to be task-solvers. Also, instruction tuning converges fast (often within 1-3 epochs), but it's easy to overfit. Monitor validation loss on held-out instructions closely; if it plateaus or rises while training loss keeps dropping, the model is memorizing your training set.
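The monitoring advice above can be turned into a simple stopping rule. The following is a sketch under stated assumptions: `overfitting_epoch`, its `patience` and `min_delta` parameters, and the loss values are all made up for illustration and do not come from any real training run or library.

```python
def overfitting_epoch(train_losses, val_losses, patience=1, min_delta=0.0):
    """Return the 1-based epoch at which to stop, or None if no overfitting is seen.

    Stops once validation loss has failed to improve by more than min_delta
    for `patience` epochs in which training loss was still falling.
    """
    best_val = float("inf")
    stale = 0
    for epoch, (train, val) in enumerate(zip(train_losses, val_losses), start=1):
        if val < best_val - min_delta:
            # Validation improved: remember it and reset the counter.
            best_val = val
            stale = 0
        else:
            train_still_falling = epoch > 1 and train < train_losses[epoch - 2]
            if train_still_falling:
                stale += 1
            if stale >= patience:
                return epoch
    return None


# Made-up loss curves for illustration only.
train = [1.90, 1.20, 0.80, 0.50, 0.30]
val = [1.95, 1.40, 1.25, 1.27, 1.35]
print(overfitting_epoch(train, val))  # prints 4
```

A larger `patience` tolerates noisy validation curves at the cost of training a little longer past the best checkpoint.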
Check your understanding
If you instruction-tune a base model on 100 sentiment classification examples, why will it still fail on a new task like 'summarize this text' even though it has the same token vocabulary and was instruction tuned?
Answer hint
Instruction tuning trains the model on specific instruction patterns (how to respond when it sees 'Classify sentiment:'). A new task requires a new instruction pattern that was never in the training data. The model learned conditional behavior on seen instructions, not generalized instruction-following. To handle new tasks, you need either in-context examples (few-shot) or training on diverse instruction types.
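The few-shot option mentioned in the hint can be sketched as a prompt builder that reuses the Instruction/Input/Output layout from the code section. `build_few_shot_prompt` is a hypothetical helper written for this illustration, and whether a narrowly tuned model actually benefits from such a prompt is an assumption that has to be tested on your own model.

```python
def build_few_shot_prompt(demonstrations, instruction, input_text=""):
    """Concatenate worked examples, then the new query with an empty Output slot."""
    blocks = []
    for demo in demonstrations:
        blocks.append(
            f"Instruction: {demo['instruction']}\n"
            f"Input: {demo['input']}\n"
            f"Output: {demo['output']}"
        )
    # The final block repeats the same layout but leaves Output for the model to fill in.
    query = f"Instruction: {instruction}\n"
    if input_text:
        query += f"Input: {input_text}\n"
    query += "Output:"
    blocks.append(query)
    return "\n\n".join(blocks)


demos = [
    {
        "instruction": "Summarize this text in one sentence.",
        "input": "Machine learning models require large amounts of labeled data. "
                 "The quality of this data directly impacts model performance.",
        "output": "Data quality drives machine learning model performance.",
    }
]
prompt = build_few_shot_prompt(
    demos,
    instruction="Summarize this text in one sentence.",
    input_text="The package arrived late, but the support team refunded shipping within a day.",
)
print(prompt)
```

Ending the prompt at `Output:` matches the point where the training examples switch from context to target, so the model's continuation is the answer.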