How to fine-tune embedding models
Quick answer
To fine-tune embedding models, you typically start with a pre-trained base embedding model and train it further on your domain-specific data using contrastive or supervised learning objectives. This process adjusts the model weights to produce embeddings better aligned with your task, improving similarity search or classification performance.
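To make the contrastive objective concrete, here is a minimal sketch of a triplet margin loss, the objective commonly used when fine-tuning embedding models on (anchor, positive, negative) examples. This is plain Python for illustration only; a real training loop would compute this loss over model outputs and backpropagate through the model.

```python
import math

def cosine_distance(u, v):
    # 1 - cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return 1.0 - dot / (nu * nv)

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Contrastive objective: pull the positive toward the anchor and
    # push the negative away, until they differ by at least the margin.
    return max(0.0, cosine_distance(anchor, positive)
                    - cosine_distance(anchor, negative) + margin)

# Toy 3-d "embeddings": the positive already points the same way as the
# anchor and the negative points elsewhere, so the loss is zero.
anchor = [1.0, 0.0, 0.0]
positive = [0.9, 0.1, 0.0]
negative = [0.0, 1.0, 0.0]
print(triplet_loss(anchor, positive, negative))
```

Minimizing this loss over many triplets is what nudges the model weights so that semantically related texts end up close together in the embedding space.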
PREREQUISITES
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
- Basic knowledge of embeddings and vector similarity
Setup
Install the OpenAI Python SDK and set your API key as an environment variable to access embedding models and fine-tuning endpoints.
pip install "openai>=1.0"
Step by step
This example shows how to prepare data and initiate a fine-tuning job using the OpenAI Python SDK. The process involves creating labeled pairs or triplets of text for supervised training. Note that which models support fine-tuning changes over time, so check the current OpenAI documentation before relying on the endpoint and data schema shown below.
import json
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Example training data: list of dicts with 'input' and 'label' fields
training_data = [
    {"input": "apple fruit", "label": "fruit"},
    {"input": "banana fruit", "label": "fruit"},
    {"input": "car vehicle", "label": "vehicle"},
    {"input": "truck vehicle", "label": "vehicle"}
]

# Convert the training data to a JSONL file, one JSON object per line.
# The exact schema expected for supervised embedding fine-tuning may vary;
# check the current API documentation.
with open("embedding_finetune_data.jsonl", "w") as f:
    for item in training_data:
        json_line = json.dumps({
            "text": item["input"],
            "metadata": {"label": item["label"]}
        })
        f.write(json_line + "\n")

# Upload the training file
training_file = client.files.create(
    file=open("embedding_finetune_data.jsonl", "rb"),
    purpose="fine-tune"
)

# Create a fine-tuning job. In the openai>=1.0 SDK the endpoint is
# client.fine_tuning.jobs.create, and hyperparameters are passed as a dict.
fine_tune_response = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="text-embedding-3-small",
    hyperparameters={"n_epochs": 4, "learning_rate_multiplier": 0.1}
)

print("Fine-tuning job started:", fine_tune_response.id)
output
Fine-tuning job started: ft-abc123xyz
Common variations
You can fine-tune other embedding model variants such as text-embedding-3-large, or use contrastive learning with a triplet loss by preparing triplets of anchor, positive, and negative samples. Polling the job status via the API until it completes is also a common pattern.
import time

# Poll fine-tune job status; in openai>=1.0 this is
# client.fine_tuning.jobs.retrieve(job_id)
job_id = fine_tune_response.id
while True:
    job = client.fine_tuning.jobs.retrieve(job_id)
    print(f"Fine-tune job status: {job.status}")
    if job.status in ["succeeded", "failed"]:
        break
    time.sleep(30)

# Use the fine-tuned model: its ID is the fine_tuned_model field on the
# completed job object, not the job ID itself
embedding_response = client.embeddings.create(
    model=job.fine_tuned_model,
    input="sample text to embed"
)
print(embedding_response.data[0].embedding[:5])  # print first 5 dims
output
Fine-tune job status: succeeded
[0.0123, -0.0345, 0.0567, -0.0789, 0.0234]
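For the contrastive variation mentioned above, the training file holds triplets instead of labeled pairs. A minimal sketch of preparing such a file follows; the field names "anchor", "positive", and "negative" are illustrative, since the exact schema a fine-tuning endpoint expects may differ.

```python
import json

# Hypothetical triplet data for contrastive fine-tuning: each record has an
# anchor, a positive (similar meaning), and a negative (different meaning)
triplets = [
    {"anchor": "apple fruit", "positive": "banana fruit",
     "negative": "car vehicle"},
    {"anchor": "car vehicle", "positive": "truck vehicle",
     "negative": "apple fruit"},
]

# Write one JSON object per line, the same JSONL shape used earlier
with open("embedding_triplets.jsonl", "w") as f:
    for t in triplets:
        f.write(json.dumps(t) + "\n")

# Sanity-check that the file round-trips as valid JSONL
with open("embedding_triplets.jsonl") as f:
    lines = [json.loads(line) for line in f]
print(len(lines))
```

The round-trip check at the end mirrors the troubleshooting advice below: a malformed line in the JSONL file is the most common cause of upload failures.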
Troubleshooting
- If file uploads fail, verify the file path and that the file is valid JSONL with one JSON object per line.
- If fine-tuning fails, check your data labels and confirm that the chosen model currently supports fine-tuning.
- Monitor API rate limits and quota to avoid interruptions.
Key Takeaways
- Fine-tune embeddings by training on domain-specific labeled data to improve vector quality.
- Prepare training data as JSONL with text and metadata labels for supervised fine-tuning.
- Use OpenAI's fine-tuning API to create and monitor embedding fine-tune jobs.
- Choose model variants and training parameters based on your task and dataset size.
- Validate your fine-tuned model by generating embeddings and testing similarity or classification.
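As the last takeaway suggests, a simple validation is to check that a query embedding scores higher against in-domain text than against unrelated text. A sketch with toy vectors follows; in practice the vectors would come from client.embeddings.create calls against your fine-tuned model.

```python
import math

def cosine_similarity(u, v):
    # Standard cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy stand-ins for embeddings returned by the API
query_vec = [0.2, 0.8, 0.1]
in_domain = [0.25, 0.75, 0.05]    # expect a high score after fine-tuning
out_of_domain = [0.9, 0.05, 0.4]  # expect a low score

print(cosine_similarity(query_vec, in_domain)
      > cosine_similarity(query_vec, out_of_domain))
```

Running the same comparison with the base model and the fine-tuned model on a held-out set gives a quick before/after measure of whether fine-tuning actually improved retrieval quality.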