Minimum viable dataset
Why this matters
Before you download massive datasets or worry about data cleaning, you need to know the bare minimum format that makes fine-tuning work. This unblocks experimentation and builds confidence that your pipeline is correct before scaling.
Explanation
A minimum viable dataset for fine-tuning is just a list of dictionaries, each with a text key holding one prompt-response pair as a single string. That's it. No labels, no splits, no complex structure: just raw text examples.
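The format can be checked mechanically before any model is loaded. The helper below is a minimal sketch: `validate_mvd` is a hypothetical name (not part of TRL or datasets), and the rule it enforces, a non-empty string under a `text` key, is an assumption about what "minimum viable" means here.

```python
def validate_mvd(examples):
    """Return a list of problems found; an empty list means the format looks fine."""
    if not isinstance(examples, list) or not examples:
        return ["dataset must be a non-empty list of dicts"]
    problems = []
    for i, ex in enumerate(examples):
        if not isinstance(ex, dict):
            problems.append(f"example {i}: expected dict, got {type(ex).__name__}")
        elif "text" not in ex:
            problems.append(f"example {i}: missing 'text' key")
        elif not isinstance(ex["text"], str) or not ex["text"].strip():
            problems.append(f"example {i}: 'text' must be a non-empty string")
    return problems


good = [{"text": "Question: What is 2+2? Answer: 4"}]
bad = [{"text": ""}, {"prompt": "no text key"}, "just a string"]

print(validate_mvd(good))  # prints []
for problem in validate_mvd(bad):
    print(problem)
```

Running a check like this on hand-written examples takes milliseconds, so format mistakes surface before a multi-gigabyte model download does.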
Mechanically, SFTTrainer expects a Dataset object from the datasets library rather than a plain Python list, so convert your list of dictionaries first. When you pass it a dataset with a text column, it tokenizes each example and trains the model to predict the next token in that sequence. The trainer handles batching, tokenization, and masking automatically. You don't need to write custom collate functions or preprocessing: the SFTConfig defaults cover them.
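To make "predict the next token" concrete, the sketch below builds the (context, target) pairs for one example. It splits on whitespace purely for illustration; a real tokenizer produces subword IDs, and `next_token_pairs` is a hypothetical helper, not something SFTTrainer exposes.

```python
def next_token_pairs(text):
    """Pair each prefix of the sequence with the token that follows it."""
    tokens = text.split()  # stand-in for a real subword tokenizer
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


pairs = next_token_pairs("Question: What is 2+2? Answer: 4")
for context, target in pairs:
    print(f"{' '.join(context)!r} -> {target!r}")
```

Each pair is one prediction the model is scored on. In practice a causal language model computes all of them in a single forward pass rather than one at a time.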
Start with 5–10 examples in a single file. This teaches you whether your environment works, your model loads, and your format is correct. Then scale to real data.
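One way to keep those starter examples in a single file is JSON Lines, one object per line. This is a sketch under that assumption: the text above does not prescribe a file format, and the file name `mvd.jsonl` is made up for the example.

```python
import json
import os
import tempfile

examples = [
    {"text": "Question: What is 2+2? Answer: 4"},
    {"text": "Question: What is the capital of France? Answer: Paris"},
]

# Write one JSON object per line into a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "mvd.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Read the file back into the same list-of-dicts shape.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(loaded)} examples from {path}")
```

If the datasets library is installed, `load_dataset("json", data_files=path, split="train")` should read the same file directly into a Dataset.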
Analogy
Like writing a unit test before building the whole application: you're not testing the real scenario yet, you're just confirming the plumbing works.
Code
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import SFTTrainer, SFTConfig
import torch
# Create a minimum viable dataset: list of dicts with 'text' key
raw_data = [
    {"text": "Question: What is 2+2? Answer: 4"},
    {"text": "Question: What is the capital of France? Answer: Paris"},
    {"text": "Question: What is 10-3? Answer: 7"},
    {"text": "Question: What is the capital of Japan? Answer: Tokyo"},
    {"text": "Question: What is 5*6? Answer: 30"},
]
# Convert to Hugging Face Dataset
dataset = Dataset.from_dict({"text": [ex["text"] for ex in raw_data]})
print(f"Dataset loaded: {len(dataset)} examples")
print(f"First example: {dataset[0]['text'][:60]}...")
# Load a small model for demo (2.7B parameters)
model_name = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    device_map="cpu",
    trust_remote_code=True,
)
# Configure SFT (Supervised Fine-Tuning)
training_args = SFTConfig(
    output_dir="./mvd_output",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    max_seq_length=256,
    learning_rate=5e-5,
    logging_steps=1,
)
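# Create trainer with minimal dataset
# Note: argument names vary by TRL version. Newer releases are understood to use
# processing_class= instead of tokenizer= and max_length instead of max_seq_length;
# check the version you have installed.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)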
print("\nTrainer initialized successfully")
print(f"Steps per epoch: {len(trainer.get_train_dataloader())}")
Output
Dataset loaded: 5 examples
First example: Question: What is 2+2? Answer: 4
Trainer initialized successfully
Steps per epoch: 3
What just happened?
We created a list of 5 text examples, converted it to a Hugging Face Dataset object (which SFTTrainer expects), loaded a tokenizer and model, configured SFTConfig with basic hyperparameters, and instantiated the trainer. The trainer examined the dataset and calculated that 3 batches would be needed per epoch (5 examples ÷ batch size 2 = 2.5 batches, rounded up).
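The batch arithmetic can be reproduced in a few lines. This sketch assumes a single device, no gradient accumulation, and that the final partial batch is kept; `steps_per_epoch` is a hypothetical helper, not a TRL function.

```python
import math


def steps_per_epoch(num_examples, batch_size, drop_last=False):
    """Number of batches produced by one pass over the data."""
    if drop_last:
        # Discard the final batch if it is smaller than batch_size.
        return num_examples // batch_size
    return math.ceil(num_examples / batch_size)


print(steps_per_epoch(5, 2))                  # 3
print(steps_per_epoch(5, 2, drop_last=True))  # 2
```

The third batch here holds a single example. If the trainer were configured to drop incomplete batches, the count would fall to 2 and one of the five examples would go unused in that epoch.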
Common gotcha
Developers often try to pass raw Python lists directly to SFTTrainer instead of converting to a Dataset object first. SFTTrainer expects datasets.Dataset, not a list. Also, many assume they need to pre-tokenize: you don't. SFTTrainer tokenizes for you and truncates each example to max_seq_length.
Error recovery
TypeError: 'list' object is not subscriptable
AssertionError: tokenizer.pad_token is None
CUDA out of memory
Experienced dev note
A common time sink for new fine-tuners is chasing dataset format issues that don't exist. Start with 5 hand-written examples and train for 1 epoch. If that works, your pipeline is solid: scale the data, not the complexity. Also: you won't see meaningful learning with 5 examples and 1 epoch (the loss won't meaningfully decrease), but you'll know your code is correct, which is the entire point.
Check your understanding
Why shouldn't you reduce max_seq_length to 32 tokens if your actual text examples are 50 tokens long, and what will happen if you try?
Answer hint
A correct answer explains that SFTTrainer will truncate examples longer than max_seq_length, losing the end of your training data. Set max_seq_length to at least the token length of your longest example, plus some buffer.
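The effect of that truncation can be seen with a toy sequence. This is a sketch that assumes truncation drops tokens from the end of the sequence, as the hint describes; the token names are placeholders and `truncate` is a hypothetical helper, not the trainer's own code.

```python
def truncate(tokens, max_seq_length):
    """Keep only the first max_seq_length tokens."""
    return tokens[:max_seq_length]


tokens = [f"tok{i}" for i in range(50)]  # a 50-token example
kept = truncate(tokens, 32)
lost = tokens[len(kept):]

print(f"kept {len(kept)} tokens, lost {len(lost)}")  # kept 32 tokens, lost 18
print(f"first lost token: {lost[0]}")                # first lost token: tok32
```

In the "Question: ... Answer: ..." format used here, the answer sits at the end of each string, so the tokens that get cut are likely to be the ones the model most needs to learn from.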