How to Intermediate · 3 min read

How to use existing datasets for fine-tuning

Quick answer
To use an existing dataset for fine-tuning, first preprocess and format your data into the structure the provider requires (for OpenAI chat models, JSONL where each line holds a messages array of chat turns). Then upload the dataset through the provider's platform or API and start a fine-tuning job against your chosen base model. The resulting model is customized to your specific data and task.

Prerequisites

  • Python 3.8+
  • OpenAI API key with billing enabled (fine-tuning is a paid feature)
  • pip install "openai>=1.0"

Setup

Install the OpenAI Python SDK and set your API key as an environment variable to authenticate requests.

bash
pip install "openai>=1.0"
export OPENAI_API_KEY="your-key-here"

Step by step

Prepare your dataset as a JSONL file where each line is a JSON object with a messages key holding a list of chat turns (role/content pairs). Then use the OpenAI API to upload the file and create a fine-tuning job.

python
import os
import time

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Step 1: Prepare dataset file (example: data.jsonl). Each line holds one
# training example in chat format, e.g.:
# {"messages": [{"role": "user", "content": "Translate English to French: Hello"},
#               {"role": "assistant", "content": "Bonjour"}]}

# Step 2: Upload the file
upload_response = client.files.create(
    file=open("data.jsonl", "rb"),
    purpose="fine-tune"
)
file_id = upload_response.id
print(f"Uploaded file ID: {file_id}")

# Step 3: Create the fine-tuning job (fine-tuning requires a dated
# model snapshot, e.g. gpt-4o-2024-08-06)
job = client.fine_tuning.jobs.create(
    training_file=file_id,
    model="gpt-4o-2024-08-06"
)
print(f"Fine-tune job ID: {job.id}")

# Step 4: Poll the job until it reaches a terminal state
while True:
    job = client.fine_tuning.jobs.retrieve(job.id)
    print(f"Status: {job.status}")
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(30)

print("Fine-tuning complete.")
output
Uploaded file ID: file-abc123xyz
Fine-tune job ID: ftjob-xyz789abc
Status: validating_files
Status: running
Status: succeeded
Fine-tuning complete.
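
The data.jsonl file from Step 1 can also be generated programmatically. A minimal sketch in the chat format current OpenAI models expect; the translation pairs are illustrative:

```python
import json

# Illustrative (user, assistant) pairs; replace with your own data.
pairs = [
    ("Translate English to French: Hello", "Bonjour"),
    ("Translate English to French: Goodbye", "Au revoir"),
]

with open("data.jsonl", "w", encoding="utf-8") as f:
    for user_text, assistant_text in pairs:
        record = {
            "messages": [
                {"role": "user", "content": user_text},
                {"role": "assistant", "content": assistant_text},
            ]
        }
        # One JSON object per line, no trailing commas
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Writing each example with json.dumps guarantees every line is valid JSON, which is the most common source of upload rejections.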

Common variations

You can use a smaller base model such as gpt-4o-mini to reduce cost, or call your fine-tuned model through integrations such as LangChain. Note that Anthropic's Claude models do not currently offer self-serve fine-tuning.

python
import os

import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# Anthropic does not offer self-serve fine-tuning as of 2026-04;
# customize behavior through system prompts and few-shot examples instead.

# LangChain example: point ChatOpenAI at a fine-tuned model
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)
# After fine-tuning, pass the returned model ID instead,
# e.g. model="ft:gpt-4o-2024-08-06:your-org::abc123"

Troubleshooting

  • If you see Invalid file format, ensure every JSONL line is valid JSON with a messages list of role/content objects.
  • If fine-tuning fails, check your dataset size and quality; very small or noisy datasets cause failures.
  • API rate limits can cause errors; implement retries with exponential backoff.
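
A quick pre-flight check catches most file-format errors before you upload. A minimal sketch assuming the chat messages format; check_jsonl is a name chosen here, not an SDK helper:

```python
import json

def check_jsonl(path):
    """Return a list of (line_number, problem) for malformed training lines."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                problems.append((n, "empty line"))
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((n, f"invalid JSON: {exc}"))
                continue
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                problems.append((n, "missing or empty 'messages' list"))
            elif not all("role" in m and "content" in m for m in messages):
                problems.append((n, "each message needs 'role' and 'content'"))
    return problems
```

Run it against data.jsonl before uploading; an empty result means every line parsed and has the expected shape.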
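
For rate-limit errors, a small retry wrapper with exponential backoff usually suffices. A minimal sketch; the delays are illustrative, and retry_on is kept generic so you can pass the provider's exception type (e.g. openai.RateLimitError):

```python
import random
import time

def with_backoff(fn, retries=5, base_delay=1.0, retry_on=(Exception,)):
    """Call fn(), retrying on retry_on with exponential backoff plus jitter."""
    for attempt in range(retries):
        try:
            return fn()
        except retry_on:
            if attempt == retries - 1:
                raise  # out of retries, surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```

For example, wrap the upload call as with_backoff(lambda: client.files.create(file=open("data.jsonl", "rb"), purpose="fine-tune"), retry_on=(openai.RateLimitError,)).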

Key Takeaways

  • Format your dataset as JSONL with chat-format messages (role/content pairs) before fine-tuning.
  • Upload your dataset file with purpose 'fine-tune' using the OpenAI API.
  • Monitor the fine-tuning job status to know when your custom model is ready.
  • Use different base models or SDKs depending on your use case and provider.
  • Validate dataset quality and format to avoid common fine-tuning errors.
Verified 2026-04 · gpt-4o, gpt-4o-mini, claude-3-5-sonnet-20241022