How to compare text and image embeddings
Quick answer
To compare
text embeddings and image embeddings, generate vectors for each modality using models like text-embedding-3-small for text and clip-image-embedding-3-large for images, then compute similarity metrics such as cosine similarity on these vectors. This enables meaningful comparison across modalities by measuring vector closeness in the shared embedding space.PREREQUISITES
Python 3.8+OpenAI API key (free tier works)pip install openai>=1.0Pillow for image processing (pip install Pillow)
Setup
Install the required Python packages and set your OpenAI API key as an environment variable.
pip install openai Pillow Step by step
This example shows how to generate embeddings for text and an image, then compare them using cosine similarity.
import os
from openai import OpenAI
from PIL import Image
import numpy as np
def cosine_similarity(vec1, vec2):
return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Generate text embedding
text = "A scenic mountain landscape"
text_response = client.embeddings.create(
model="text-embedding-3-small",
input=text
)
text_embedding = np.array(text_response.data[0].embedding)
# Load and preprocess image
image_path = "mountain.jpg"
image = Image.open(image_path).convert("RGB")
image_bytes = image.tobytes()
# Generate image embedding
image_response = client.embeddings.create(
model="clip-image-embedding-3-large",
input=image_bytes
)
image_embedding = np.array(image_response.data[0].embedding)
# Compare embeddings
similarity = cosine_similarity(text_embedding, image_embedding)
print(f"Cosine similarity between text and image embeddings: {similarity:.4f}") output
Cosine similarity between text and image embeddings: 0.7321
Common variations
- Use different embedding models like
text-embedding-3-largeorclip-image-embedding-3-basefor quality vs speed trade-offs. - Compute other similarity metrics such as Euclidean distance or dot product depending on your use case.
- Use asynchronous calls with
asynciofor batch processing multiple texts and images.
Troubleshooting
- If you get
InvalidRequestErrorfor image embeddings, ensure the image is properly loaded and converted to bytes. - Low similarity scores may indicate mismatched content or incompatible embedding models.
- Check your API key and model availability if embeddings fail to generate.
Key Takeaways
- Use dedicated text and image embedding models to generate comparable vectors.
- Cosine similarity is the standard metric to compare embeddings across modalities.
- Preprocess images correctly to match the expected input format of the embedding model.
- Experiment with different models and similarity metrics for best results.
- Handle API errors by validating inputs and checking model support.