How to Beginner · 3 min read

How to prepare training data for OpenAI fine-tuning

Quick answer
To prepare training data for OpenAI fine-tuning, create a JSONL file where each line is a JSON object containing a prompt and a completion field. Ensure the data is clean, representative, and formatted with clear input-output pairs to guide the model during fine-tuning.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install openai>=1.0

Setup

Install the OpenAI Python SDK and set your API key as an environment variable to authenticate requests.

bash
pip install openai

Step by step

Create a JSONL file where each line is a JSON object with prompt and completion keys. The prompt is the input text, and the completion is the desired output the model should learn to generate.

Example format:

{"prompt": "Translate English to French: 'Hello'", "completion": " Bonjour"}

Note the space before the completion text, which helps the model distinguish prompt from completion.

python
import json

# Example training data as a list of dicts
training_data = [
    {"prompt": "Translate English to French: 'Hello'", "completion": " Bonjour"},
    {"prompt": "Translate English to French: 'Goodbye'", "completion": " Au revoir"}
]

# Write to JSONL file
with open('fine_tune_data.jsonl', 'w', encoding='utf-8') as f:
    for entry in training_data:
        json.dump(entry, f)
        f.write('\n')

print("Training data saved to fine_tune_data.jsonl")
output
Training data saved to fine_tune_data.jsonl

Common variations

You can prepare data for different tasks by adjusting the prompt and completion format, such as question-answering, summarization, or code generation. For example, for Q&A:

{"prompt": "Q: What is AI?\nA:", "completion": " Artificial Intelligence is..."}

Use the same JSONL format but tailor prompts to your use case.

Troubleshooting

If your fine-tuning job fails, check that your JSONL file is valid JSON with no trailing commas or syntax errors. Also, ensure each completion starts with a space and ends with a stop token (like a newline or punctuation) to help the model learn boundaries.

Use JSON validators or json.load() in Python to verify file integrity before uploading.

python
import json

try:
    with open('fine_tune_data.jsonl', 'r', encoding='utf-8') as f:
        for line in f:
            json.loads(line)
    print("JSONL file is valid.")
except json.JSONDecodeError as e:
    print(f"JSON error: {e}")
output
JSONL file is valid.

Key Takeaways

  • Use a JSONL file with prompt-completion pairs for OpenAI fine-tuning data.
  • Ensure completions start with a space and end with a stop token for clarity.
  • Validate JSONL syntax before uploading to avoid fine-tuning errors.
Verified 2026-04 · gpt-4o, gpt-4o-mini
Verify ↗