How to compare text and image embeddings

Quick answer

To compare text and image embeddings, encode both modalities with a single multimodal model that was trained to map them into one shared embedding space, such as CLIP, then compute a similarity metric such as cosine similarity between the resulting vectors. Text-only models such as text-embedding-3-small place their vectors in a different space (and usually a different dimension), so their output cannot be compared directly with image embeddings.

PREREQUISITES

Python 3.9+
sentence-transformers, which provides pretrained CLIP checkpoints (pip install sentence-transformers)
Pillow for image processing (pip install Pillow)
NumPy (installed as a dependency of sentence-transformers)

Setup

Install the required Python packages. OpenAI text embedding models such as text-embedding-3-small accept text only, so this guide uses an open-source CLIP checkpoint that encodes both text and images. The checkpoint is downloaded on first use and needs no API key.

bash

pip install sentence-transformers Pillow

Step by step

This example encodes a text string and an image with the same CLIP model, then compares the two vectors using cosine similarity.

python

import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two 1-D vectors of equal length."""
    return float(np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)))

# A single CLIP checkpoint maps both text and images into the same vector space
model = SentenceTransformer("clip-ViT-B-32")

# Generate text embedding
text = "A scenic mountain landscape"
text_embedding = model.encode(text)

# Load the image; the model takes the PIL image itself, not raw pixel bytes
image_path = "mountain.jpg"
image = Image.open(image_path).convert("RGB")

# Generate image embedding
image_embedding = model.encode(image)

# Compare embeddings
similarity = cosine_similarity(text_embedding, image_embedding)
print(f"Cosine similarity between text and image embeddings: {similarity:.4f}")

output

The script prints one line of the form "Cosine similarity between text and image embeddings: <score>". The score depends on the image. With CLIP, even well-matched text-image pairs tend to score far below 1.0 (often around 0.2 to 0.4), so interpret a score relative to other candidates rather than against a fixed scale.

Common variations

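Swap in a different joint text-image model, such as a larger CLIP checkpoint (for example clip-ViT-L-14), for quality vs speed trade-offs. Encode both modalities with the same model so the vectors share one space.
Compute other similarity metrics such as Euclidean distance or dot product depending on your use case. On unit-length vectors these rank results the same way as cosine similarity.
Process multiple texts and images in batches rather than one at a time. With a hosted API model, asynchronous calls via asyncio can overlap requests.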
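
The metric and batching variations can be sketched with plain NumPy. This is a minimal illustration, not a reference implementation: the helper names (l2_normalize, cosine_similarity_matrix, euclidean_distance) are hypothetical, and the 3-dimensional vectors are toy stand-ins for real embeddings.

```python
import numpy as np

def l2_normalize(vectors):
    """Scale each row (one embedding per row) to unit length."""
    vectors = np.atleast_2d(np.asarray(vectors, dtype=np.float64))
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms

def cosine_similarity_matrix(text_vecs, image_vecs):
    """Entry [i, j] is the cosine similarity of text i and image j."""
    # On unit-length rows, the dot product equals the cosine similarity.
    return l2_normalize(text_vecs) @ l2_normalize(image_vecs).T

def euclidean_distance(vec1, vec2):
    """Straight-line distance between two vectors of equal length."""
    diff = np.asarray(vec1, dtype=np.float64) - np.asarray(vec2, dtype=np.float64)
    return float(np.linalg.norm(diff))

# Toy stand-ins for a batch of two text and two image embeddings (made-up values)
text_vecs = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
image_vecs = np.array([[3.0, 0.0, 0.0], [1.0, 1.0, 0.0]])

# All pairwise similarities in one matrix product
sims = cosine_similarity_matrix(text_vecs, image_vecs)
best_image_per_text = sims.argmax(axis=1)
print(sims)
print(best_image_per_text)

# For unit vectors, squared Euclidean distance is 2 - 2 * cosine similarity,
# so both metrics produce the same ranking.
a = l2_normalize(text_vecs)[1]
b = l2_normalize(image_vecs)[1]
print(euclidean_distance(a, b) ** 2, 2 - 2 * sims[1, 1])
```

Normalizing once and using a matrix product avoids a Python loop over pairs, which matters when ranking many candidates against each other.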

Troubleshooting

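If image embedding fails, confirm that the file opens with Pillow and is converted to RGB, and pass the image in the format the model expects. Raw pixel bytes from tobytes() are not a valid image input.
Low similarity scores may indicate mismatched content. Vectors from models that do not share an embedding space (for example a text-only model and a separate image model) are not comparable at all and often differ in dimension.
If embeddings fail to generate, check the model name and network access, and the API key when using a hosted model.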
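
One way to catch incompatible models early is to validate the vectors before comparing them. The sketch below is an assumption-laden example: safe_cosine_similarity is a hypothetical helper, and the specific checks (matching length, finite values, nonzero norm) are a reasonable minimum rather than an exhaustive list.

```python
import numpy as np

def safe_cosine_similarity(vec1, vec2):
    """Cosine similarity with explicit checks for common embedding mistakes."""
    a = np.asarray(vec1, dtype=np.float64).ravel()
    b = np.asarray(vec2, dtype=np.float64).ravel()

    # Different lengths usually mean the vectors came from different models.
    if a.shape != b.shape:
        raise ValueError(
            f"Dimension mismatch: {a.shape[0]} vs {b.shape[0]}; "
            "the vectors likely come from different models"
        )
    # NaN or infinite entries point to a failed or corrupted encoding step.
    if not (np.all(np.isfinite(a)) and np.all(np.isfinite(b))):
        raise ValueError("Embedding contains NaN or infinite values")

    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("Cannot compare a zero vector")
    return float(np.dot(a, b) / denom)

# Toy vectors (made-up values) standing in for real embeddings
print(safe_cosine_similarity([1.0, 0.0], [1.0, 0.0]))

try:
    safe_cosine_similarity([1.0, 0.0], [1.0, 0.0, 0.0])
except ValueError as err:
    print(f"Rejected: {err}")
```

Matching dimensions do not prove that two vectors share a space, so this guard complements, rather than replaces, using one model for both modalities.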

Key Takeaways

Use a model trained to place text and images in one shared space, such as CLIP. Separately trained text and image models do not produce comparable vectors.
Cosine similarity is the most common metric for comparing embeddings across modalities.
Preprocess images correctly to match the expected input format of the embedding model.
Experiment with different models and similarity metrics for best results.
Handle API errors by validating inputs and checking model support.

Verified 2026-04 · text-embedding-3-small, clip-image-embedding-3-large