How to use CLIP embeddings
Quick answer
Use CLIP embeddings to represent images and text in a shared vector space for similarity tasks. Generate embeddings by encoding inputs with a CLIP model, using either OpenAI's open-source clip package or Hugging Face Transformers, then compare the vectors with cosine similarity or nearest-neighbor search. Note that CLIP is not available through the OpenAI API; it runs locally.
Prerequisites
- Python 3.8+
- pip install torch torchvision transformers
- pip install ftfy regex tqdm git+https://github.com/openai/CLIP.git (for the official clip package)
Setup
Install the required Python packages. You can use either OpenAI's open-source clip package or Hugging Face Transformers for CLIP embeddings; both run locally, so no API key is needed.
pip install torch torchvision transformers
pip install ftfy regex tqdm git+https://github.com/openai/CLIP.git
Step by step
This example uses the official clip package to generate CLIP embeddings for text and images, then computes cosine similarity between them.
import numpy as np
import torch
import clip  # official package: pip install git+https://github.com/openai/CLIP.git
from PIL import Image
import requests
from io import BytesIO

# Load the CLIP model and its matching preprocessing pipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Function to get image embedding
def get_image_embedding(image_url):
    response = requests.get(image_url)
    image = Image.open(BytesIO(response.content)).convert("RGB")
    image_input = preprocess(image).unsqueeze(0).to(device)
    with torch.no_grad():
        features = model.encode_image(image_input)
    return features[0].cpu().numpy()

# Function to get text embedding
def get_text_embedding(text):
    tokens = clip.tokenize([text]).to(device)
    with torch.no_grad():
        features = model.encode_text(tokens)
    return features[0].cpu().numpy()

# Cosine similarity function
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Example usage
image_url = "https://images.unsplash.com/photo-1506744038136-46273834b3fb"
text = "A scenic mountain landscape"
image_emb = get_image_embedding(image_url)
text_emb = get_text_embedding(text)
similarity = cosine_similarity(image_emb, text_emb)
print(f"Cosine similarity between image and text: {similarity:.4f}")
Output (the exact value depends on the model version and image; raw CLIP similarities for well-matched pairs typically fall in the 0.2-0.35 range)
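A common next step is ranking several candidate captions against one image. The sketch below reuses only the cosine_similarity helper defined above and substitutes small placeholder vectors for real CLIP embeddings (the captions and numbers are illustrative, not model outputs), so the ranking logic is visible in isolation.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Placeholder vectors standing in for real CLIP embeddings; in practice
# these would come from get_image_embedding / get_text_embedding.
image_emb = np.array([0.9, 0.1, 0.2])
captions = {
    "a mountain landscape": np.array([0.8, 0.2, 0.1]),
    "a photo of a cat":     np.array([0.1, 0.9, 0.3]),
    "a city street":        np.array([0.2, 0.1, 0.9]),
}

# Rank captions from most to least similar to the image embedding
ranked = sorted(captions,
                key=lambda c: cosine_similarity(image_emb, captions[c]),
                reverse=True)
print(ranked[0])  # the best-matching caption
```

With real embeddings the same argmax-over-similarities pattern gives zero-shot image classification.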
Common variations
- Use Hugging Face Transformers with CLIPProcessor and CLIPModel as an alternative to the official clip package for local embedding generation.
- Generate embeddings for batches of images or texts for efficient similarity search.
- Use approximate nearest neighbor libraries like FAISS or Chroma to index and query CLIP embeddings at scale.
from transformers import CLIPProcessor, CLIPModel
import torch
from PIL import Image

# Load model and processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Prepare inputs
image = Image.open("path/to/image.jpg").convert("RGB")
text = ["a photo of a cat"]
inputs = processor(text=text, images=image, return_tensors="pt", padding=True)

# Get embeddings
with torch.no_grad():
    outputs = model(**inputs)
    image_emb = outputs.image_embeds
    text_emb = outputs.text_embeds

# Normalize embeddings
image_emb = image_emb / image_emb.norm(p=2, dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(p=2, dim=-1, keepdim=True)

# Cosine similarity
similarity = (image_emb @ text_emb.T).item()
print(f"Cosine similarity: {similarity:.4f}")
Output (value varies; as above, a well-matched pair typically scores around 0.2-0.35)
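For the FAISS variation listed above, the underlying operation is a top-k inner-product search over normalized vectors. As a minimal sketch, the same search can be written in plain NumPy with random stand-in vectors (not real CLIP outputs); FAISS's IndexFlatIP computes the same scores, just faster and at much larger scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in corpus: 1000 "image embeddings" of dimension 512 (CLIP ViT-B/32 width)
corpus = rng.standard_normal((1000, 512)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # L2-normalize each row

# Stand-in query "text embedding"
query = rng.standard_normal(512).astype(np.float32)
query /= np.linalg.norm(query)

# With normalized vectors, cosine similarity is a single matrix-vector product
scores = corpus @ query
top_k = np.argsort(scores)[::-1][:5]  # indices of the 5 most similar items
print(top_k, scores[top_k])
```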
Troubleshooting
- If clip.load fails, check the model name ("ViT-B/32") and your network connection; weights are downloaded on first use.
- For local Hugging Face usage, ensure torch and transformers are up to date.
- Image input must be RGB and properly preprocessed; convert grayscale or CMYK images to RGB.
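For the RGB point above, a small helper (a sketch, not part of any library) can normalize the image mode before preprocessing:

```python
from PIL import Image

def to_rgb(image: Image.Image) -> Image.Image:
    """Convert any PIL image (grayscale, CMYK, RGBA, ...) to RGB."""
    if image.mode != "RGB":
        image = image.convert("RGB")
    return image

# Example: a CMYK image becomes RGB before being fed to CLIP preprocessing
print(to_rgb(Image.new("CMYK", (4, 4))).mode)  # RGB
```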
Key takeaways
- Use CLIP embeddings to map images and text into a shared vector space for similarity tasks.
- The official clip package and Hugging Face Transformers both provide reliable ways to generate CLIP embeddings locally.
- Normalize embeddings before similarity calculations to get accurate cosine similarity scores.
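The normalization point can be checked directly: after L2 normalization, a plain dot product equals cosine similarity, which is why both examples above normalize before comparing.

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# Cosine similarity computed directly
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# The same value from a dot product of L2-normalized vectors
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
dot_of_normalized = np.dot(a_n, b_n)

print(np.isclose(cosine, dot_of_normalized))  # True
```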