How-to · Intermediate · 4 min read

How to prepare a dataset for instruction fine-tuning

Quick answer
To prepare a dataset for instruction fine-tuning, format your data as instruction-response pairs in JSON Lines (JSONL), with clear, concise prompts and high-quality answers. Use consistent keys such as "instruction" and "output", and clean the text so the model learns from well-formed examples.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quote the requirement so the shell does not treat > as a redirect)
  • pip install aiofiles (only needed for the async loading example)

Setup

Install the openai Python package and set your API key as an environment variable for secure access.

bash
pip install "openai>=1.0"
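Once the package is installed, it helps to confirm the key is actually visible to Python before making any API calls. A minimal startup check, assuming the standard OPENAI_API_KEY variable name:

```python
import os

# The OpenAI SDK reads OPENAI_API_KEY from the environment by default,
# so checking it up front gives a clearer error than a failed API call.
api_key = os.environ.get("OPENAI_API_KEY")
if api_key is None:
    print("OPENAI_API_KEY is not set; export it before calling the API.")
else:
    print("OPENAI_API_KEY found.")
```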

Step by step

Prepare your dataset as a list of JSON objects, each containing an instruction and its corresponding output. Save this as a JSONL file for fine-tuning.

python
import json

# Example dataset for instruction fine-tuning
training_data = [
    {
        "instruction": "Translate the following English sentence to French: 'Hello, how are you?'",
        "output": "Bonjour, comment ça va ?"
    },
    {
        "instruction": "Summarize the following text: 'Artificial intelligence is transforming technology.'",
        "output": "AI is changing technology."
    }
]

# Save dataset to JSONL file
with open("instruction_finetune_dataset.jsonl", "w", encoding="utf-8") as f:
    for entry in training_data:
        # ensure_ascii=False keeps accented characters (e.g. "ça") readable in the UTF-8 file
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

print("Dataset saved as instruction_finetune_dataset.jsonl")
output
Dataset saved as instruction_finetune_dataset.jsonl

Common variations

You can add optional fields such as "context", or include worked examples for few-shot fine-tuning. For large datasets, stream records asynchronously instead of loading the whole file into memory. Different model providers may also expect slightly different formats.

python
import json
import os

import aiofiles  # third-party: pip install aiofiles
from openai import OpenAI

# Stream a large JSONL dataset lazily instead of reading it all at once.
# Consume with "async for" inside an asyncio event loop.
async def async_load_dataset(file_path):
    async with aiofiles.open(file_path, mode="r", encoding="utf-8") as f:
        async for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Example of adding context field
training_data_with_context = [
    {
        "instruction": "Explain photosynthesis.",
        "context": "Biology textbook excerpt",
        "output": "Photosynthesis is the process by which plants convert sunlight into energy."
    }
]

# Different model usage example
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain photosynthesis."}]
)
print(response.choices[0].message.content)
output
Photosynthesis is the process by which plants convert sunlight into energy.
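OpenAI's fine-tuning endpoint expects each training record in a chat-style "messages" format rather than bare instruction/output keys, so a conversion step is usually needed. Below is a sketch of such a conversion; `to_chat_format` is a hypothetical helper name, and mapping "context" to a system message is one reasonable convention, not a requirement:

```python
import json

def to_chat_format(entry):
    """Convert an instruction/output pair into a chat-format training record."""
    messages = [{"role": "user", "content": entry["instruction"]}]
    # Optional context can be passed as a system message (one possible convention).
    if "context" in entry:
        messages.insert(0, {"role": "system", "content": entry["context"]})
    messages.append({"role": "assistant", "content": entry["output"]})
    return {"messages": messages}

pair = {
    "instruction": "Explain photosynthesis.",
    "context": "Biology textbook excerpt",
    "output": "Photosynthesis is the process by which plants convert sunlight into energy.",
}
print(json.dumps(to_chat_format(pair), indent=2))
```

Writing one converted record per line produces a JSONL file in the shape the fine-tuning API accepts.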

Troubleshooting

  • If your fine-tuning job fails, check that your JSONL file is properly formatted with one JSON object per line.
  • Ensure no trailing commas or invalid characters are present.
  • Verify that the instruction and output fields are strings and not empty.
  • Use UTF-8 encoding to avoid character errors.
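The checks above can be automated before submitting a fine-tuning job. A minimal validator sketch (the `validate_jsonl` name and the exact checks are this article's convention, not a standard API):

```python
import json
import tempfile

def validate_jsonl(path):
    """Return (line_number, problem) pairs found in a JSONL instruction dataset."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            stripped = line.strip()
            if not stripped:
                problems.append((i, "blank line"))
                continue
            try:
                entry = json.loads(stripped)
            except json.JSONDecodeError as exc:
                problems.append((i, f"invalid JSON: {exc}"))
                continue
            # instruction and output must be non-empty strings
            for key in ("instruction", "output"):
                value = entry.get(key)
                if not isinstance(value, str) or not value.strip():
                    problems.append((i, f"missing or empty '{key}'"))
    return problems

# Quick demo: one good record, one with a trailing comma (invalid JSON).
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(json.dumps({"instruction": "Say hi.", "output": "Hi!"}) + "\n")
    tmp.write('{"instruction": "Broken",}\n')
    demo_path = tmp.name

for line_no, problem in validate_jsonl(demo_path):
    print(f"line {line_no}: {problem}")
```

Running the validator on your real dataset file before upload catches most formatting failures early.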

Key Takeaways

  • Format your dataset as JSONL with clear 'instruction' and 'output' keys for best fine-tuning results.
  • Clean, concise, and high-quality text improves model instruction-following ability.
  • Validate JSONL formatting strictly to avoid fine-tuning errors.
  • Add optional fields like 'context' for richer instruction tuning when needed.
  • Use environment variables for API keys and the latest SDK patterns for secure, maintainable code.
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022