How to beginner · 3 min read

How to validate fine-tuning data

Quick answer
Validate fine-tuning data by ensuring it is in the correct JSONL format with properly structured messages arrays containing role and content fields. Use Python scripts to parse and check data consistency before uploading with the OpenAI SDK.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install openai>=1.0

Setup

Install the latest openai Python package and set your API key as an environment variable.

bash
pip install --upgrade openai
output
Requirement already satisfied: openai in /usr/local/lib/python3.10/site-packages (x.y.z)
Successfully installed openai-x.y.z

Step by step

Use this Python script to load your fine-tuning JSONL file, validate each record's structure, and check for common errors before uploading.

python
import os
import json
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Path to your fine-tuning data file
file_path = "fine_tuning_data.jsonl"

# Validation function

def validate_fine_tuning_data(path):
    with open(path, "r", encoding="utf-8") as f:
        line_num = 0
        for line in f:
            line_num += 1
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                raise ValueError(f"Line {line_num}: Invalid JSON - {e}")

            # Check for messages key
            if "messages" not in record:
                raise ValueError(f"Line {line_num}: Missing 'messages' field")

            messages = record["messages"]
            if not isinstance(messages, list) or len(messages) == 0:
                raise ValueError(f"Line {line_num}: 'messages' must be a non-empty list")

            for i, message in enumerate(messages):
                if not isinstance(message, dict):
                    raise ValueError(f"Line {line_num}, message {i+1}: Message must be a dict")
                if "role" not in message or "content" not in message:
                    raise ValueError(f"Line {line_num}, message {i+1}: Missing 'role' or 'content'")
                if message["role"] not in ["system", "user", "assistant"]:
                    raise ValueError(f"Line {line_num}, message {i+1}: Invalid role '{message["role"]}'")
                if not isinstance(message["content"], str):
                    raise ValueError(f"Line {line_num}, message {i+1}: 'content' must be a string")

    print(f"Validation passed for {line_num} records.")

# Run validation
validate_fine_tuning_data(file_path)

# Optional: Upload file to OpenAI for fine-tuning
# with open(file_path, "rb") as f:
#     upload_response = client.files.create(file=f, purpose="fine-tune")
#     print(f"Uploaded file ID: {upload_response.id}")
output
Validation passed for 100 records.

Common variations

You can validate asynchronously or use different models for fine-tuning. The validation logic remains the same but adapt file paths and model names accordingly.

python
import asyncio
import json
import os
import aiofiles
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def validate_async(path):
    async with aiofiles.open(path, mode='r', encoding='utf-8') as f:
        line_num = 0
        async for line in f:
            line_num += 1
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                raise ValueError(f"Line {line_num}: Invalid JSON - {e}")
            # Same validation checks as sync version
            # ...
    print(f"Async validation passed for {line_num} records.")

# asyncio.run(validate_async("fine_tuning_data.jsonl"))
output
Async validation passed for 100 records.

Troubleshooting

  • If you see Invalid JSON errors, check for trailing commas or missing quotes in your JSONL file.
  • Missing messages or incorrect roles cause validation failures; ensure each record matches the fine-tuning schema.
  • Use JSON validators or linters to pre-check your file before running the script.

Key Takeaways

  • Always validate your fine-tuning data format before uploading to avoid training errors.
  • Each JSONL record must have a non-empty 'messages' list with valid 'role' and 'content' fields.
  • Use Python scripts to automate validation and catch common JSON or schema issues early.
Verified 2026-04 · gpt-4o-mini-2024-07-18
Verify ↗