How to validate fine-tuning data
Quick answer
Validate fine-tuning data by ensuring it is in the correct JSONL format with properly structured messages arrays containing role and content fields. Use Python scripts to parse and check data consistency before uploading with the OpenAI SDK.
PREREQUISITES
Python 3.8+OpenAI API key (free tier works)pip install openai>=1.0
Setup
Install the latest openai Python package and set your API key as an environment variable.
pip install --upgrade openai output
Requirement already satisfied: openai in /usr/local/lib/python3.10/site-packages (x.y.z) Successfully installed openai-x.y.z
Step by step
Use this Python script to load your fine-tuning JSONL file, validate each record's structure, and check for common errors before uploading.
import os
import json
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Path to your fine-tuning data file
file_path = "fine_tuning_data.jsonl"
# Validation function
def validate_fine_tuning_data(path):
with open(path, "r", encoding="utf-8") as f:
line_num = 0
for line in f:
line_num += 1
try:
record = json.loads(line)
except json.JSONDecodeError as e:
raise ValueError(f"Line {line_num}: Invalid JSON - {e}")
# Check for messages key
if "messages" not in record:
raise ValueError(f"Line {line_num}: Missing 'messages' field")
messages = record["messages"]
if not isinstance(messages, list) or len(messages) == 0:
raise ValueError(f"Line {line_num}: 'messages' must be a non-empty list")
for i, message in enumerate(messages):
if not isinstance(message, dict):
raise ValueError(f"Line {line_num}, message {i+1}: Message must be a dict")
if "role" not in message or "content" not in message:
raise ValueError(f"Line {line_num}, message {i+1}: Missing 'role' or 'content'")
if message["role"] not in ["system", "user", "assistant"]:
raise ValueError(f"Line {line_num}, message {i+1}: Invalid role '{message["role"]}'")
if not isinstance(message["content"], str):
raise ValueError(f"Line {line_num}, message {i+1}: 'content' must be a string")
print(f"Validation passed for {line_num} records.")
# Run validation
validate_fine_tuning_data(file_path)
# Optional: Upload file to OpenAI for fine-tuning
# with open(file_path, "rb") as f:
# upload_response = client.files.create(file=f, purpose="fine-tune")
# print(f"Uploaded file ID: {upload_response.id}") output
Validation passed for 100 records.
Common variations
You can validate asynchronously or use different models for fine-tuning. The validation logic remains the same but adapt file paths and model names accordingly.
import asyncio
import json
import os
import aiofiles
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
async def validate_async(path):
async with aiofiles.open(path, mode='r', encoding='utf-8') as f:
line_num = 0
async for line in f:
line_num += 1
try:
record = json.loads(line)
except json.JSONDecodeError as e:
raise ValueError(f"Line {line_num}: Invalid JSON - {e}")
# Same validation checks as sync version
# ...
print(f"Async validation passed for {line_num} records.")
# asyncio.run(validate_async("fine_tuning_data.jsonl")) output
Async validation passed for 100 records.
Troubleshooting
- If you see
Invalid JSONerrors, check for trailing commas or missing quotes in your JSONL file. - Missing
messagesor incorrect roles cause validation failures; ensure each record matches the fine-tuning schema. - Use JSON validators or linters to pre-check your file before running the script.
Key Takeaways
- Always validate your fine-tuning data format before uploading to avoid training errors.
- Each JSONL record must have a non-empty 'messages' list with valid 'role' and 'content' fields.
- Use Python scripts to automate validation and catch common JSON or schema issues early.