Llama fine-tuning dataset preparation
Quick answer
To prepare a dataset for fine-tuning Llama models, format your data as JSONL with prompt and completion fields for supervised learning. Ensure the text is clean, tokenized properly, and split into training and validation sets to optimize fine-tuning performance.

Prerequisites
- Python 3.8+
- pip install datasets transformers sentencepiece
- Basic knowledge of JSON and text preprocessing
Setup
Install essential Python packages for dataset preparation and tokenization. Use datasets for handling data and transformers for tokenizer support.
pip install datasets transformers sentencepiece

Step by step
Prepare your fine-tuning dataset as a JSONL file where each line contains a JSON object with prompt and completion keys. Clean and tokenize text, then split into training and validation sets.
import json
from datasets import Dataset, DatasetDict
from transformers import LlamaTokenizer

# Load the Llama tokenizer
# Replace with your Llama tokenizer path or model name; the Hugging Face
# format checkpoint is "meta-llama/Llama-2-7b-hf" (access is gated)
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Example raw data: a list of dicts with prompt and completion
raw_data = [
    {"prompt": "Translate English to French: Hello, how are you?", "completion": "Bonjour, comment ça va ?"},
    {"prompt": "Summarize: The quick brown fox jumps over the lazy dog.", "completion": "A fox jumps over a dog."}
]

# Save raw data to JSONL (one JSON object per line)
jsonl_path = "llama_finetune_data.jsonl"
with open(jsonl_path, "w", encoding="utf-8") as f:
    for entry in raw_data:
        json.dump(entry, f, ensure_ascii=False)
        f.write("\n")

# Load dataset from JSONL
dataset = Dataset.from_json(jsonl_path)

# Tokenize: concatenate prompt and completion with a separator
max_length = 512

def tokenize_function(example):
    text = example["prompt"] + "\n" + example["completion"]
    return tokenizer(text, truncation=True, max_length=max_length)

# Apply tokenization and drop the raw text columns
tokenized_dataset = dataset.map(tokenize_function, remove_columns=["prompt", "completion"])

# Split into train and validation
train_test = tokenized_dataset.train_test_split(test_size=0.1, seed=42)

# Prepare DatasetDict
final_dataset = DatasetDict({
    "train": train_test["train"],
    "validation": train_test["test"]
})

print(f"Training samples: {len(final_dataset['train'])}")
print(f"Validation samples: {len(final_dataset['validation'])}")

Output

Training samples: 1
Validation samples: 1
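Note that tokenize_function feeds prompt and completion to the model as one string, so during training the loss would also be computed on the prompt tokens. Many supervised fine-tuning setups instead mask the prompt in the labels so that only the completion contributes to the loss. A minimal sketch of that masking with plain token-id lists (the -100 ignore index is the convention PyTorch's cross-entropy loss uses; build_labels is an illustrative helper, not part of any library):

```python
def build_labels(prompt_ids, completion_ids, ignore_index=-100):
    """Concatenate prompt and completion ids; mask the prompt in the labels.

    Positions set to ignore_index (-100 by default) are skipped by
    PyTorch's cross-entropy loss, so only completion tokens are trained on.
    """
    input_ids = prompt_ids + completion_ids
    labels = [ignore_index] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels

# Toy example with made-up token ids
input_ids, labels = build_labels([101, 42, 7], [55, 9])
print(input_ids)  # [101, 42, 7, 55, 9]
print(labels)     # [-100, -100, -100, 55, 9]
```

In practice you would tokenize the prompt and completion separately, apply this to the resulting input_ids, and still truncate the combined sequence to max_length.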
Common variations
- Use different tokenizer versions or custom tokenizers compatible with your Llama model.
- Prepare datasets with additional metadata fields if your fine-tuning framework supports them.
- For instruction tuning, format prompts with clear instructions and expected completions.
- Use streaming dataset loading for large datasets with datasets.load_dataset(..., streaming=True).
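For the instruction-tuning variation, one widely used convention is an Alpaca-style template with "### Instruction", "### Input", and "### Response" sections. A sketch of rendering raw fields into the prompt/completion format used above (the exact template wording is a choice, not something the tooling mandates, and format_instruction is a hypothetical helper):

```python
def format_instruction(instruction, context, response):
    """Render one example into an Alpaca-style prompt/completion pair.

    The section headers below are a common convention; adjust them to
    match whatever template your fine-tuning framework expects.
    """
    if context:
        prompt = (
            "Below is an instruction that describes a task, paired with an "
            "input that provides further context.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            "### Response:\n"
        )
    else:
        prompt = (
            "Below is an instruction that describes a task.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            "### Response:\n"
        )
    return {"prompt": prompt, "completion": response}

example = format_instruction(
    "Summarize the text.",
    "The quick brown fox jumps over the lazy dog.",
    "A fox jumps over a dog.",
)
```

Each rendered dict can be written straight to the JSONL file exactly as in the step-by-step code.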
Troubleshooting
- If tokenization truncates important content, increase max_length or split long examples.
- Ensure each JSONL line is a valid JSON object; malformed lines cause loading errors.
- Check tokenizer compatibility with your Llama model version to avoid token mismatch.
- Validate dataset splits to prevent data leakage between training and validation.
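The JSONL checks above can be automated before the file ever reaches Dataset.from_json. A small stdlib-only sketch that reports malformed lines and missing fields (validate_jsonl is an illustrative helper, not a library function):

```python
import json

REQUIRED_KEYS = {"prompt", "completion"}

def validate_jsonl(path):
    """Return a list of error messages; an empty list means the file is valid."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(obj, dict):
                errors.append(f"line {lineno}: not a JSON object")
                continue
            missing = REQUIRED_KEYS - obj.keys()
            if missing:
                errors.append(f"line {lineno}: missing keys {sorted(missing)}")
    return errors
```

Running it on your dataset file, e.g. print(validate_jsonl("llama_finetune_data.jsonl")), surfaces bad lines with their line numbers so they can be fixed before loading.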
Key takeaways
- Format fine-tuning data as JSONL with clear prompt and completion fields.
- Use the official Llama tokenizer to ensure token compatibility.
- Split data into training and validation sets to monitor fine-tuning quality.
- Clean and truncate text properly to fit model input limits.
- Validate JSONL formatting to avoid dataset loading errors.