How to Beginner · 3 min read

How to format training data for fine-tuning LLMs

Quick answer

To format training data for fine-tuning LLMs, use a JSON Lines (JSONL) file where each line is a JSON object containing prompt and completion fields. The prompt is the input text, and the completion is the desired model output, both formatted as strings.

PREREQUISITES

Python 3.8+
OpenAI API key (free tier works)
pip install openai>=1.0

Setup

Install the OpenAI Python SDK and set your API key as an environment variable to prepare for fine-tuning.

bash

pip install openai>=1.0

Step by step

Prepare your training data as a JSONL file where each line is a JSON object with prompt and completion keys. The prompt is the input text you want the model to learn from, and the completion is the expected output. Both should be strings, and the completion usually ends with a stop token or newline to signal the end.

Example format:

{"prompt": "Translate English to French: Hello, how are you?\n", "completion": " Bonjour, comment ça va?\n"}

Use the OpenAI CLI or API to upload and fine-tune your model with this data.

python

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Example training data line
training_example = {
    "prompt": "Translate English to French: Hello, how are you?\n",
    "completion": " Bonjour, comment ça va?\n"
}

# Save to JSONL file
import json
with open("training_data.jsonl", "w") as f:
    f.write(json.dumps(training_example) + "\n")

print("Training data saved in JSONL format.")

output

Training data saved in JSONL format.

Common variations

You can format data for different fine-tuning tasks by adjusting the prompt and completion structure. For chat models, include role tokens like User: and Assistant: in the prompt and completion. Async and streaming fine-tuning workflows depend on the provider's SDK capabilities.

Troubleshooting

If your fine-tuning job fails, verify your JSONL file is valid JSON and each line contains both prompt and completion keys.
Ensure the completion ends with a newline or stop token to prevent the model from generating unwanted text.
Check for consistent formatting and avoid very long prompts or completions that exceed token limits.

✅

Key Takeaways

Use JSONL format with one JSON object per line containing 'prompt' and 'completion' keys.
Both 'prompt' and 'completion' must be strings; 'completion' should end with a newline or stop token.
Validate your JSONL file before fine-tuning to avoid errors.
Adjust prompt-completion formatting for different tasks like translation, chat, or summarization.
Keep prompts and completions within token limits to ensure successful fine-tuning.

Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022

Verify ↗