How to use existing datasets for fine-tuning
Quick answer
To use existing datasets for fine-tuning, first preprocess and format your data into the required structure (for OpenAI's current chat models, JSONL where each line holds a "messages" array; older completions models use prompt-completion pairs). Then upload the dataset to the provider's platform or use their API to start the fine-tuning job with your chosen model. This customizes the model to your specific data and task.

Prerequisites
- Python 3.8+
- OpenAI API key with billing enabled (fine-tuning is a paid feature)
- pip install openai>=1.0
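Converting an existing dataset of input/output pairs into the JSONL structure expected for chat-model fine-tuning can be sketched as follows. The (input, output) pair format and the `to_chat_jsonl` helper name are assumptions for illustration; adapt them to your source data.

```python
import json

def to_chat_jsonl(pairs):
    """Convert (user_text, assistant_text) pairs into chat fine-tuning JSONL lines."""
    lines = []
    for user_text, assistant_text in pairs:
        record = {
            "messages": [
                {"role": "user", "content": user_text},
                {"role": "assistant", "content": assistant_text},
            ]
        }
        # One JSON object per line, as JSONL requires
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)

# Usage:
# with open("data.jsonl", "w", encoding="utf-8") as f:
#     f.write(to_chat_jsonl([("Translate English to French: Hello", "Bonjour")]))
```

Writing one record per line (rather than one big JSON array) is what makes the file valid JSONL, which is what the upload step below expects.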
Setup
Install the OpenAI Python SDK and set your API key as an environment variable to authenticate requests.
pip install "openai>=1.0"

Step by step
Prepare your dataset as a JSONL file. For chat models such as gpt-4o, each line is a JSON object with a messages key (a list of role/content objects); the legacy prompt/completion format applies only to older completions models. Then use the OpenAI API to upload the file and create the fine-tuning job.
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Step 1: Prepare dataset file (example: data.jsonl)
# Each line example:
# {"messages": [{"role": "user", "content": "Translate English to French: Hello"}, {"role": "assistant", "content": "Bonjour"}]}
# Step 2: Upload the file
upload_response = client.files.create(
    file=open("data.jsonl", "rb"),
    purpose="fine-tune",
)
file_id = upload_response.id
print(f"Uploaded file ID: {file_id}")
# Step 3: Create fine-tune job
fine_tune_response = client.fine_tuning.jobs.create(
    training_file=file_id,
    model="gpt-4o-2024-08-06",  # fine-tuning requires a dated model snapshot
)
print(f"Fine-tune job ID: {fine_tune_response.id}")
# Step 4: Monitor fine-tune job status
import time
while True:
    job = client.fine_tuning.jobs.retrieve(fine_tune_response.id)
    print(f"Status: {job.status}")
    if job.status in ["succeeded", "failed", "cancelled"]:
        break
    time.sleep(30)
print("Fine-tuning complete.")

Output
Uploaded file ID: file-abc123xyz
Fine-tune job ID: ftjob-xyz789abc
Status: validating_files
Status: running
Status: succeeded
Fine-tuning complete.
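The polling loop in Step 4 can be factored into a small reusable helper, which also makes it easy to test without calling the API. The `wait_for_job` name and the injectable `sleep` parameter are illustrative choices, not part of the OpenAI SDK.

```python
import time

def wait_for_job(get_status, poll_seconds=30, sleep=time.sleep):
    """Poll get_status() until it returns a terminal state, then return that state."""
    terminal = {"succeeded", "failed", "cancelled"}
    while True:
        status = get_status()
        if status in terminal:
            return status
        sleep(poll_seconds)  # wait before polling again

# Usage with the real API (client and job object from the steps above):
# final = wait_for_job(lambda: client.fine_tuning.jobs.retrieve(fine_tune_response.id).status)
```

Passing `get_status` as a callable keeps the polling logic separate from the API client, so the same loop works for any long-running job.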
Common variations
You can monitor jobs asynchronously, use different base models such as gpt-4o-mini, or work through other SDKs such as Anthropic's or LangChain's; note that not every provider supports fine-tuning.
import anthropic
import os
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
# Anthropic does not support user fine-tuning as of 2026-04, but you can customize prompts.
# LangChain example for OpenAI fine-tuning
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o", temperature=0)
# Use your fine-tuned model by passing its ID after fine-tuning, e.g. model="ft:..."

Troubleshooting
- If you see an "Invalid file format" error, make sure every JSONL line is valid JSON with the expected messages (or prompt/completion) keys.
- If fine-tuning fails, check your dataset size and quality; very small or noisy datasets commonly cause failures.
- API rate limits can cause errors; implement retries with exponential backoff.
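The retry-with-backoff advice above can be sketched as a small wrapper. This is a generic pattern, not an OpenAI SDK feature; the `with_backoff` name, the jitter amount, and the injectable `sleep` parameter are assumptions for illustration.

```python
import random
import time

def with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on exception with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt; jitter avoids synchronized retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)

# Usage (wrapping an API call from the steps above):
# job = with_backoff(lambda: client.fine_tuning.jobs.retrieve(fine_tune_response.id))
```

In production you would typically retry only on rate-limit or transient network errors rather than on every exception.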
Key Takeaways
- Format your dataset as JSONL before fine-tuning: chat-style messages for current models, prompt-completion pairs for legacy completions models.
- Upload your dataset file with purpose 'fine-tune' using the OpenAI API.
- Monitor the fine-tuning job status to know when your custom model is ready.
- Use different base models or SDKs depending on your use case and provider.
- Validate dataset quality and format to avoid common fine-tuning errors.