Llama fine-tuning dataset preparation
Quick answer
To prepare a dataset for fine-tuning Llama models, format your data as JSONL with prompt and completion fields for supervised learning. Ensure the text is clean, tokenized properly, and split into training and validation sets to optimize fine-tuning performance.

Prerequisites
- Python 3.8+
- pip install datasets transformers sentencepiece
- Basic knowledge of JSON and text preprocessing
Setup
Install essential Python packages for dataset preparation and tokenization. Use datasets for handling data and transformers for tokenizer support.
pip install datasets transformers sentencepiece

Step by step
Prepare your fine-tuning dataset as a JSONL file where each line contains a JSON object with prompt and completion keys. Clean and tokenize text, then split into training and validation sets.
import json
from datasets import Dataset, DatasetDict
from transformers import LlamaTokenizer

# Load the Llama tokenizer
# Replace with your Llama tokenizer path or model name; the Hugging Face
# format checkpoint is "meta-llama/Llama-2-7b-hf" (access is gated)
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Example raw data: a list of dicts with prompt and completion
raw_data = [
    {"prompt": "Translate English to French: Hello, how are you?", "completion": "Bonjour, comment ça va ?"},
    {"prompt": "Summarize: The quick brown fox jumps over the lazy dog.", "completion": "A fox jumps over a dog."}
]

# Save raw data to JSONL (one JSON object per line)
jsonl_path = "llama_finetune_data.jsonl"
with open(jsonl_path, "w", encoding="utf-8") as f:
    for entry in raw_data:
        json.dump(entry, f, ensure_ascii=False)
        f.write("\n")

# Load dataset from JSONL
dataset = Dataset.from_json(jsonl_path)

# Tokenize: concatenate prompt and completion with a separator
max_length = 512

def tokenize_function(example):
    text = example["prompt"] + "\n" + example["completion"]
    return tokenizer(text, truncation=True, max_length=max_length)

# Apply tokenization and drop the raw text columns
tokenized_dataset = dataset.map(tokenize_function, remove_columns=["prompt", "completion"])

# Split into train and validation
train_test = tokenized_dataset.train_test_split(test_size=0.1, seed=42)

# Prepare DatasetDict
final_dataset = DatasetDict({
    "train": train_test["train"],
    "validation": train_test["test"]
})

print(f"Training samples: {len(final_dataset['train'])}")
print(f"Validation samples: {len(final_dataset['validation'])}")

Output

Training samples: 1
Validation samples: 1
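Note that tokenize_function feeds prompt and completion to the model as one string, so during training the loss would also be computed on the prompt tokens. Many supervised fine-tuning setups instead mask the prompt in the labels so that only the completion contributes to the loss. A minimal sketch of that masking with plain token-id lists (the -100 ignore index is the convention PyTorch's cross-entropy loss uses; build_labels is an illustrative helper, not part of any library):

```python
def build_labels(prompt_ids, completion_ids, ignore_index=-100):
    """Concatenate prompt and completion ids; mask the prompt in the labels.

    Positions set to ignore_index (-100 by default) are skipped by
    PyTorch's cross-entropy loss, so only completion tokens are trained on.
    """
    input_ids = prompt_ids + completion_ids
    labels = [ignore_index] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels

# Toy example with made-up token ids
input_ids, labels = build_labels([101, 42, 7], [55, 9])
print(input_ids)  # [101, 42, 7, 55, 9]
print(labels)     # [-100, -100, -100, 55, 9]
```

In practice you would tokenize the prompt and completion separately, apply this to the resulting input_ids, and still truncate the combined sequence to max_length.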
Common variations
- Use different tokenizer versions or custom tokenizers compatible with your Llama model.
- Prepare datasets with additional metadata fields if your fine-tuning framework supports them.
- For instruction tuning, format prompts with clear instructions and expected completions.
- Use streaming dataset loading for large datasets with datasets.load_dataset(..., streaming=True).
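For the instruction-tuning variation, one widely used convention is an Alpaca-style template with "### Instruction", "### Input", and "### Response" sections. A sketch of rendering raw fields into the prompt/completion format used above (the exact template wording is a choice, not something the tooling mandates, and format_instruction is a hypothetical helper):

```python
def format_instruction(instruction, context, response):
    """Render one example into an Alpaca-style prompt/completion pair.

    The section headers below are a common convention; adjust them to
    match whatever template your fine-tuning framework expects.
    """
    if context:
        prompt = (
            "Below is an instruction that describes a task, paired with an "
            "input that provides further context.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            "### Response:\n"
        )
    else:
        prompt = (
            "Below is an instruction that describes a task.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            "### Response:\n"
        )
    return {"prompt": prompt, "completion": response}

example = format_instruction(
    "Summarize the text.",
    "The quick brown fox jumps over the lazy dog.",
    "A fox jumps over a dog.",
)
```

Each rendered dict can be written straight to the JSONL file exactly as in the step-by-step code.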
Troubleshooting
- If tokenization truncates important content, increase max_length or split long examples.
- Ensure each JSONL line is a valid JSON object; malformed lines cause loading errors.
- Check tokenizer compatibility with your Llama model version to avoid token mismatch.
- Validate dataset splits to prevent data leakage between training and validation.
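The JSONL checks above can be automated before the file ever reaches Dataset.from_json. A small stdlib-only sketch that reports malformed lines and missing fields (validate_jsonl is an illustrative helper, not a library function):

```python
import json

REQUIRED_KEYS = {"prompt", "completion"}

def validate_jsonl(path):
    """Return a list of error messages; an empty list means the file is valid."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(obj, dict):
                errors.append(f"line {lineno}: not a JSON object")
                continue
            missing = REQUIRED_KEYS - obj.keys()
            if missing:
                errors.append(f"line {lineno}: missing keys {sorted(missing)}")
    return errors
```

Running it on your dataset file, e.g. print(validate_jsonl("llama_finetune_data.jsonl")), surfaces bad lines with their line numbers so they can be fixed before loading.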
Key takeaways
- Format fine-tuning data as JSONL with clear prompt and completion fields.
- Use the official Llama tokenizer to ensure token compatibility.
- Split data into training and validation sets to monitor fine-tuning quality.
- Clean and truncate text properly to fit model input limits.
- Validate JSONL formatting to avoid dataset loading errors.