How to prepare a fine-tuning dataset for OpenAI
Quick answer
Prepare your fine-tuning dataset for OpenAI by creating a JSONL file where each line is a JSON object with a
messages array containing role and content fields. The dataset must follow the chat format, with roles such as system, user, and assistant, to train the model effectively.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the official openai Python package and set your API key as an environment variable for secure access.
pip install openai
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl
Installing collected packages: openai
Successfully installed openai-1.x.x
Step by step
Create a JSONL file where each line is a JSON object with a messages key. The value is an array of message objects with role ("system", "user", "assistant") and content fields. This format mimics chat conversations for fine-tuning chat models.
Example entry:
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is AI?"}, {"role": "assistant", "content": "AI stands for artificial intelligence."}]}
Save multiple such lines in a file named fine_tune_data.jsonl. Then upload it and create a fine-tuning job using the OpenAI Python SDK.
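One way to build such a file programmatically is to serialize one JSON object per line with json.dumps. This is a minimal sketch; the training examples below are placeholders you would replace with your own conversations.

```python
import json

# Placeholder training examples -- replace with your own conversations.
examples = [
    {"messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is AI?"},
        {"role": "assistant", "content": "AI stands for artificial intelligence."},
    ]},
    {"messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is ML?"},
        {"role": "assistant", "content": "ML stands for machine learning."},
    ]},
]

# JSONL: exactly one JSON object per line, newline-terminated.
with open("fine_tune_data.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```

Serializing with json.dumps (rather than writing the strings by hand) guarantees each line is valid JSON, which is the most common formatting error in hand-built JSONL files.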
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Upload the fine-tuning dataset
with open("fine_tune_data.jsonl", "rb") as f:
    training_file = client.files.create(file=f, purpose="fine-tune")
print(f"Uploaded file ID: {training_file.id}")

# Create the fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18"
)
print(f"Fine-tuning job ID: {job.id}")
output
Uploaded file ID: file-abc123xyz
Fine-tuning job ID: ftjob-xyz789abc
Common variations
- You can prepare datasets for different models by changing the model parameter when creating the fine-tuning job.
- Use asynchronous code or streaming for large datasets; the JSONL format remains the same.
- Ensure your dataset messages follow the chat format exactly; avoid mixing formats.
Troubleshooting
- If you get errors about invalid JSON, verify each line in your JSONL file is a valid JSON object.
- Ensure the messages array contains at least one user and one assistant message.
- Check that your API key environment variable is set correctly to avoid authentication errors.
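The checks above can be scripted before uploading. Here is a minimal validator sketch; the function name validate_jsonl is an illustrative choice, not part of the OpenAI SDK.

```python
import json

def validate_jsonl(path):
    """Return a list of (line_number, problem) pairs; empty means the file looks valid."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            if not line.strip():
                continue  # allow blank lines at the end of the file
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append((i, f"invalid JSON: {e}"))
                continue
            messages = obj.get("messages")
            if not isinstance(messages, list):
                problems.append((i, "missing 'messages' array"))
                continue
            roles = {m.get("role") for m in messages if isinstance(m, dict)}
            if "user" not in roles or "assistant" not in roles:
                problems.append((i, "needs at least one user and one assistant message"))
    return problems
```

Running validate_jsonl("fine_tune_data.jsonl") before uploading catches the invalid-JSON and missing-role errors described above, with line numbers pointing at the offending entries.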
Key Takeaways
- Fine-tuning datasets must be JSONL files where each line has a messages array of chat messages.
- Each message requires a role and a content field to mimic chat conversations.
- Upload the dataset via client.files.create with purpose "fine-tune" before creating a fine-tuning job.
- Use the correct model name for fine-tuning, e.g., gpt-4o-mini-2024-07-18.
- Validate JSONL formatting and environment variables to avoid common errors.