Fine-tuning dataset quality tips
Quick answer
Use a clean, well-structured dataset with diverse, relevant examples formatted as
messages in JSONL. Ensure consistent labeling, remove noise, and include enough samples to cover your use case for effective fine-tuning.
Prerequisites
- Python 3.8+
- OpenAI API key
- pip install openai>=1.0
Setup
Install the openai Python package and set your API key as an environment variable for secure access.
pip install openai>=1.0
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x
Step by step
Build a high-quality fine-tuning dataset by focusing on four things: clean data, consistent formatting, diverse examples, and sufficient size. Use the JSONL format, where each line is a JSON object whose messages array represents one conversation's turns.
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Example of a clean fine-tuning JSONL entry
fine_tuning_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Translate 'Hello' to French."},
            {"role": "assistant", "content": "Bonjour"}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Summarize the following text."},
            {"role": "assistant", "content": "This text explains fine-tuning best practices."}
        ]
    }
]
# Save dataset to JSONL file
import json
with open("fine_tuning_data.jsonl", "w", encoding="utf-8") as f:
    for entry in fine_tuning_data:
        json.dump(entry, f)
        f.write("\n")
print("Dataset saved as fine_tuning_data.jsonl")
output
Dataset saved as fine_tuning_data.jsonl
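Before uploading, it also helps to catch exact duplicate entries, which add noise without improving coverage. A minimal sketch (the helper name and sample data are illustrative):

```python
import json

def find_duplicates(entries):
    """Return indices of entries that exactly repeat an earlier entry."""
    seen = set()
    dupes = []
    for i, entry in enumerate(entries):
        # Serialize with sorted keys so key order cannot hide a duplicate
        key = json.dumps(entry, sort_keys=True)
        if key in seen:
            dupes.append(i)
        else:
            seen.add(key)
    return dupes

sample = [
    {"messages": [{"role": "user", "content": "Hi"}]},
    {"messages": [{"role": "user", "content": "Hi"}]},  # exact duplicate
    {"messages": [{"role": "user", "content": "Hello"}]},
]
print(find_duplicates(sample))  # → [1]
```

For near-duplicates (same prompt with trivial wording changes), you would need fuzzier matching, but an exact-match pass is a cheap first filter.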
Common variations
You can fine-tune different models like gpt-4o-mini or gpt-4o by specifying the model parameter. Async fine-tuning workflows and streaming completions are also supported with the OpenAI SDK.
import asyncio
import os
from openai import AsyncOpenAI

# Use the async client so file upload and job creation can be awaited
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def create_fine_tune():
    # Upload training file
    with open("fine_tuning_data.jsonl", "rb") as f:
        training_file = await client.files.create(
            file=f,
            purpose="fine-tune"
        )
    # Create fine-tuning job
    job = await client.fine_tuning.jobs.create(
        training_file=training_file.id,
        model="gpt-4o-mini"
    )
    print(f"Fine-tuning job created: {job.id}")

asyncio.run(create_fine_tune())
output
Fine-tuning job created: ftjob-xxxxxxxxxxxxxxxxxxxx
Troubleshooting
- If your fine-tuned model performs poorly, check for inconsistent or noisy labels in your dataset.
- Ensure your JSONL file is properly formatted with no trailing commas or syntax errors.
- Include enough diverse examples to cover your use case and avoid overfitting.
- Use the OpenAI API logs to debug errors during fine-tuning job creation.
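The formatting checks above can be scripted before upload. A minimal per-line validator, assuming the chat-format rules shown earlier (a non-empty messages array, known roles, string content); the function name is illustrative:

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_jsonl_line(line):
    """Return a list of problems found in one JSONL line (empty if it looks fine)."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    messages = entry.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' array"]
    problems = []
    for i, msg in enumerate(messages):
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: unknown role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: 'content' must be a string")
    return problems

good = '{"messages": [{"role": "user", "content": "Hi"}]}'
bad = '{"messages": [{"role": "narrator", "content": "Hi"}]}'
print(validate_jsonl_line(good))  # → []
print(validate_jsonl_line(bad))   # → ["message 0: unknown role 'narrator'"]
```

Running every line through a check like this before uploading catches most formatting errors that would otherwise fail the fine-tuning job.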
Key takeaways
- Use clean, consistent JSONL datasets with properly formatted messages arrays for fine-tuning.
- Include diverse and relevant examples to improve model generalization and avoid overfitting.
- Validate dataset formatting and remove noise before uploading to prevent fine-tuning errors.