How to prepare a fine-tuning dataset for OpenAI
Quick answer
Prepare your fine-tuning dataset for OpenAI by creating a JSONL file where each line is a JSON object with a
messages array containing role and content fields. The dataset must follow the chat format, with roles such as system, user, and assistant, to train the model effectively.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the official openai Python package and set your API key as an environment variable for secure access.
pip install openai
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl
Installing collected packages: openai
Successfully installed openai-1.x.x
Step by step
Create a JSONL file where each line is a JSON object with a messages key. The value is an array of message objects with role ("system", "user", "assistant") and content fields. This format mimics chat conversations for fine-tuning chat models.
Example entry:
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is AI?"}, {"role": "assistant", "content": "AI stands for artificial intelligence."}]}
Save multiple such lines in a file named fine_tune_data.jsonl. Then upload it and create a fine-tuning job using the OpenAI Python SDK.
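One way to build such a file programmatically is to serialize one JSON object per line with json.dumps. This is a minimal sketch; the training examples below are placeholders you would replace with your own conversations.

```python
import json

# Placeholder training examples -- replace with your own conversations.
examples = [
    {"messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is AI?"},
        {"role": "assistant", "content": "AI stands for artificial intelligence."},
    ]},
    {"messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is ML?"},
        {"role": "assistant", "content": "ML stands for machine learning."},
    ]},
]

# JSONL: exactly one JSON object per line, newline-terminated.
with open("fine_tune_data.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```

Serializing with json.dumps (rather than writing the strings by hand) guarantees each line is valid JSON, which is the most common formatting error in hand-built JSONL files.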
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Upload the fine-tuning dataset
with open("fine_tune_data.jsonl", "rb") as f:
    training_file = client.files.create(file=f, purpose="fine-tune")
print(f"Uploaded file ID: {training_file.id}")

# Create the fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18"
)
print(f"Fine-tuning job ID: {job.id}")
output
Uploaded file ID: file-abc123xyz
Fine-tuning job ID: ftjob-xyz789abc
Common variations
- You can prepare datasets for different models by changing the model parameter when creating the fine-tuning job.
- Use asynchronous code or streaming for large datasets; the JSONL format remains the same.
- Ensure your dataset messages follow the chat format exactly; avoid mixing formats.
Troubleshooting
- If you get errors about invalid JSON, verify each line in your JSONL file is a valid JSON object.
- Ensure the messages array contains at least one user and one assistant message.
- Check that your API key environment variable is set correctly to avoid authentication errors.
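The checks above can be scripted before uploading. Here is a minimal validator sketch; the function name validate_jsonl is an illustrative choice, not part of the OpenAI SDK.

```python
import json

def validate_jsonl(path):
    """Return a list of (line_number, problem) pairs; empty means the file looks valid."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            if not line.strip():
                continue  # allow blank lines at the end of the file
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append((i, f"invalid JSON: {e}"))
                continue
            messages = obj.get("messages")
            if not isinstance(messages, list):
                problems.append((i, "missing 'messages' array"))
                continue
            roles = {m.get("role") for m in messages if isinstance(m, dict)}
            if "user" not in roles or "assistant" not in roles:
                problems.append((i, "needs at least one user and one assistant message"))
    return problems
```

Running validate_jsonl("fine_tune_data.jsonl") before uploading catches the invalid-JSON and missing-role errors described above, with line numbers pointing at the offending entries.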
Key Takeaways
- Fine-tuning datasets must be JSONL files where each line has a messages array of chat messages.
- Each message requires a role and a content field to mimic chat conversations.
- Upload the dataset via client.files.create with purpose "fine-tune" before creating a fine-tuning job.
- Use the correct model name for fine-tuning, e.g., gpt-4o-mini-2024-07-18.
- Validate JSONL formatting and environment variables to avoid common errors.