How to Intermediate · 3 min read

How to use existing datasets for fine-tuning

Quick answer
To use an existing dataset for fine-tuning, first preprocess and format your data into the structure the provider requires (for OpenAI chat models, JSONL where each line holds a messages array of chat turns). Then upload the dataset through the provider's platform or API and start a fine-tuning job against your chosen base model. The resulting model is customized to your specific data and task.

Prerequisites

  • Python 3.8+
  • OpenAI API key with billing enabled (fine-tuning is a paid feature)
  • pip install "openai>=1.0"

Setup

Install the OpenAI Python SDK and set your API key as an environment variable to authenticate requests.

bash
pip install "openai>=1.0"
export OPENAI_API_KEY="your-key-here"

Step by step

Prepare your dataset as a JSONL file where each line is a JSON object with a messages key holding a list of chat turns (role/content pairs). Then use the OpenAI API to upload the file and create a fine-tuning job.

python
import os
import time

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Step 1: Prepare dataset file (example: data.jsonl). Each line holds one
# training example in chat format, e.g.:
# {"messages": [{"role": "user", "content": "Translate English to French: Hello"},
#               {"role": "assistant", "content": "Bonjour"}]}

# Step 2: Upload the file
upload_response = client.files.create(
    file=open("data.jsonl", "rb"),
    purpose="fine-tune"
)
file_id = upload_response.id
print(f"Uploaded file ID: {file_id}")

# Step 3: Create the fine-tuning job (fine-tuning requires a dated
# model snapshot, e.g. gpt-4o-2024-08-06)
job = client.fine_tuning.jobs.create(
    training_file=file_id,
    model="gpt-4o-2024-08-06"
)
print(f"Fine-tune job ID: {job.id}")

# Step 4: Poll the job until it reaches a terminal state
while True:
    job = client.fine_tuning.jobs.retrieve(job.id)
    print(f"Status: {job.status}")
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(30)

print("Fine-tuning complete.")
output
Uploaded file ID: file-abc123xyz
Fine-tune job ID: ftjob-xyz789abc
Status: validating_files
Status: running
Status: succeeded
Fine-tuning complete.
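
The data.jsonl file from Step 1 can also be generated programmatically. A minimal sketch in the chat format current OpenAI models expect; the translation pairs are illustrative:

```python
import json

# Illustrative (user, assistant) pairs; replace with your own data.
pairs = [
    ("Translate English to French: Hello", "Bonjour"),
    ("Translate English to French: Goodbye", "Au revoir"),
]

with open("data.jsonl", "w", encoding="utf-8") as f:
    for user_text, assistant_text in pairs:
        record = {
            "messages": [
                {"role": "user", "content": user_text},
                {"role": "assistant", "content": assistant_text},
            ]
        }
        # One JSON object per line, no trailing commas
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Writing each example with json.dumps guarantees every line is valid JSON, which is the most common source of upload rejections.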

Common variations

You can use a smaller base model such as gpt-4o-mini to reduce cost, or call your fine-tuned model through integrations such as LangChain. Note that Anthropic's Claude models do not currently offer self-serve fine-tuning.

python
import os

import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# Anthropic does not offer self-serve fine-tuning as of 2026-04;
# customize behavior through system prompts and few-shot examples instead.

# LangChain example: point ChatOpenAI at a fine-tuned model
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)
# After fine-tuning, pass the returned model ID instead,
# e.g. model="ft:gpt-4o-2024-08-06:your-org::abc123"

Troubleshooting

  • If you see Invalid file format, ensure every JSONL line is valid JSON with a messages list of role/content objects.
  • If fine-tuning fails, check your dataset size and quality; very small or noisy datasets cause failures.
  • API rate limits can cause errors; implement retries with exponential backoff.
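
A quick pre-flight check catches most file-format errors before you upload. A minimal sketch assuming the chat messages format; check_jsonl is a name chosen here, not an SDK helper:

```python
import json

def check_jsonl(path):
    """Return a list of (line_number, problem) for malformed training lines."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                problems.append((n, "empty line"))
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((n, f"invalid JSON: {exc}"))
                continue
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                problems.append((n, "missing or empty 'messages' list"))
            elif not all("role" in m and "content" in m for m in messages):
                problems.append((n, "each message needs 'role' and 'content'"))
    return problems
```

Run it against data.jsonl before uploading; an empty result means every line parsed and has the expected shape.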
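
For rate-limit errors, a small retry wrapper with exponential backoff usually suffices. A minimal sketch; the delays are illustrative, and retry_on is kept generic so you can pass the provider's exception type (e.g. openai.RateLimitError):

```python
import random
import time

def with_backoff(fn, retries=5, base_delay=1.0, retry_on=(Exception,)):
    """Call fn(), retrying on retry_on with exponential backoff plus jitter."""
    for attempt in range(retries):
        try:
            return fn()
        except retry_on:
            if attempt == retries - 1:
                raise  # out of retries, surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```

For example, wrap the upload call as with_backoff(lambda: client.files.create(file=open("data.jsonl", "rb"), purpose="fine-tune"), retry_on=(openai.RateLimitError,)).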

Key Takeaways

  • Format your dataset as JSONL with chat-format messages (role/content pairs) before fine-tuning.
  • Upload your dataset file with purpose 'fine-tune' using the OpenAI API.
  • Monitor the fine-tuning job status to know when your custom model is ready.
  • Use different base models or SDKs depending on your use case and provider.
  • Validate dataset quality and format to avoid common fine-tuning errors.
Verified 2026-04 · gpt-4o, gpt-4o-mini, claude-3-5-sonnet-20241022