How-to · Beginner · 3 min read

How to prepare a fine-tuning dataset for OpenAI

Quick answer
Prepare your fine-tuning dataset for OpenAI as a JSONL file in which each line is one training example: a JSON object with a messages array of chat messages, each carrying role and content fields. The roles system, user, and assistant mirror the chat format the model sees at inference time, which is what lets fine-tuning shape its behavior effectively.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key with billing enabled (fine-tuning incurs training charges)
  • pip install "openai>=1.0" (quotes keep the shell from treating >= as a redirect)

Setup

Install the official openai Python package and set your API key as an environment variable for secure access.

bash
pip install openai
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl
Installing collected packages: openai
Successfully installed openai-1.x.x
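The SDK reads the key from the OPENAI_API_KEY environment variable automatically. A quick way to set it (the key value below is a placeholder, not a real key):

```shell
# Store the key in the environment; the openai SDK picks up OPENAI_API_KEY by default.
export OPENAI_API_KEY="sk-your-key-here"

# Confirm it is set without printing the full secret.
echo "${OPENAI_API_KEY:0:3}..."
```

Add the export line to your shell profile if you want it to persist across sessions.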

Step by step

Create a JSONL file where each line is a JSON object with a messages key. The value is an array of message objects with role ("system", "user", "assistant") and content fields. This format mimics chat conversations for fine-tuning chat models.

Example entry:

{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is AI?"}, {"role": "assistant", "content": "AI stands for artificial intelligence."}]}

Save multiple such lines in a file named fine_tune_data.jsonl. Then upload and create a fine-tuning job using the OpenAI Python SDK.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Upload the fine-tuning dataset
with open("fine_tune_data.jsonl", "rb") as f:
    training_file = client.files.create(file=f, purpose="fine-tune")

print(f"Uploaded file ID: {training_file.id}")

# Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18"
)

print(f"Fine-tuning job ID: {job.id}")
output
Uploaded file ID: file-abc123xyz
Fine-tuning job ID: ftjob-xyz789abc
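Fine-tuning runs asynchronously, so the job ID returned above is your handle for checking progress. A minimal polling sketch, assuming a job ID from the previous step (the placeholder ID and the 30-second interval are illustrative choices):

```python
import os
import time

# A fine-tuning job passes through states such as "validating_files" and
# "running" before reaching one of these terminal states.
TERMINAL_STATES = {"succeeded", "failed", "cancelled"}

def is_done(status):
    """Return True once a job status is terminal."""
    return status in TERMINAL_STATES

# Only poll when an API key is configured (skipped otherwise, e.g. in CI).
if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI

    client = OpenAI()
    while True:
        # Substitute your own job ID from the previous snippet.
        job = client.fine_tuning.jobs.retrieve("ftjob-xyz789abc")
        print(f"status: {job.status}")
        if is_done(job.status):
            break
        time.sleep(30)
```

When the status reaches succeeded, the job object carries the name of your fine-tuned model, which you can then pass to the chat completions API.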

Common variations

  • You can prepare datasets for different models by changing the model parameter in the fine-tuning job creation.
  • For large datasets, you can switch to the SDK's async client (AsyncOpenAI) for uploads, but the JSONL format itself stays the same.
  • Ensure your dataset messages follow the chat format exactly; avoid mixing formats.

Troubleshooting

  • If you get errors about invalid JSON, verify each line in your JSONL file is a valid JSON object.
  • Ensure each messages array contains at least one user and one assistant message, and supply at least 10 examples — OpenAI rejects smaller training files.
  • Check your API key environment variable is set correctly to avoid authentication errors.
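The first two checks above can be automated before you upload. A small validator sketch (the function name and error messages are illustrative, not part of the OpenAI SDK):

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_line(line):
    """Return a list of problems found in one JSONL line (empty list = valid)."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    messages = obj.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' array"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not a JSON object")
            continue
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: unknown role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: 'content' must be a string")
    if not any(isinstance(m, dict) and m.get("role") == "assistant" for m in messages):
        problems.append("no assistant message for the model to learn from")
    return problems

good = '{"messages": [{"role": "user", "content": "What is AI?"}, {"role": "assistant", "content": "Artificial intelligence."}]}'
bad = '{"messages": [{"role": "user", "content": "What is AI?"}]}'
print(validate_line(good))  # []
print(validate_line(bad))   # ['no assistant message for the model to learn from']
```

Run it over every line of fine_tune_data.jsonl and fix any reported problems before calling client.files.create.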

Key Takeaways

  • Fine-tuning datasets must be JSONL files with a messages array of chat messages.
  • Each message requires a role and content field to mimic chat conversations.
  • Upload the dataset via client.files.create with purpose="fine-tune" before creating a fine-tuning job.
  • Use the correct model name for fine-tuning, e.g., gpt-4o-mini-2024-07-18.
  • Validate JSONL formatting and environment variables to avoid common errors.
Verified 2026-04 · gpt-4o-mini-2024-07-18