How to use CLIP embeddings
Quick answer
Use CLIP embeddings to represent images and text in a shared vector space for similarity tasks. Generate embeddings by encoding inputs with a CLIP model, using either OpenAI's open-source clip package or Hugging Face Transformers, then compare the vectors with cosine similarity or nearest-neighbor search. Note that CLIP is not available through the OpenAI API; it runs locally.
Prerequisites
- Python 3.8+
- pip install torch torchvision transformers
- pip install ftfy regex tqdm git+https://github.com/openai/CLIP.git (for the official clip package)
Setup
Install the required Python packages. You can use either OpenAI's open-source clip package or Hugging Face Transformers for CLIP embeddings; both run locally, so no API key is needed.
pip install torch torchvision transformers
pip install ftfy regex tqdm git+https://github.com/openai/CLIP.git
Step by step
This example uses the official clip package to generate CLIP embeddings for text and images, then computes cosine similarity between them.
import numpy as np
import torch
import clip  # official package: pip install git+https://github.com/openai/CLIP.git
from PIL import Image
import requests
from io import BytesIO

# Load the CLIP model and its matching preprocessing pipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Function to get image embedding
def get_image_embedding(image_url):
    response = requests.get(image_url)
    image = Image.open(BytesIO(response.content)).convert("RGB")
    image_input = preprocess(image).unsqueeze(0).to(device)
    with torch.no_grad():
        features = model.encode_image(image_input)
    return features[0].cpu().numpy()

# Function to get text embedding
def get_text_embedding(text):
    tokens = clip.tokenize([text]).to(device)
    with torch.no_grad():
        features = model.encode_text(tokens)
    return features[0].cpu().numpy()

# Cosine similarity function
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Example usage
image_url = "https://images.unsplash.com/photo-1506744038136-46273834b3fb"
text = "A scenic mountain landscape"
image_emb = get_image_embedding(image_url)
text_emb = get_text_embedding(text)
similarity = cosine_similarity(image_emb, text_emb)
print(f"Cosine similarity between image and text: {similarity:.4f}")
Output (the exact value depends on the model version and image; raw CLIP similarities for well-matched pairs typically fall in the 0.2-0.35 range)
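A common next step is ranking several candidate captions against one image. The sketch below reuses only the cosine_similarity helper defined above and substitutes small placeholder vectors for real CLIP embeddings (the captions and numbers are illustrative, not model outputs), so the ranking logic is visible in isolation.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Placeholder vectors standing in for real CLIP embeddings; in practice
# these would come from get_image_embedding / get_text_embedding.
image_emb = np.array([0.9, 0.1, 0.2])
captions = {
    "a mountain landscape": np.array([0.8, 0.2, 0.1]),
    "a photo of a cat":     np.array([0.1, 0.9, 0.3]),
    "a city street":        np.array([0.2, 0.1, 0.9]),
}

# Rank captions from most to least similar to the image embedding
ranked = sorted(captions,
                key=lambda c: cosine_similarity(image_emb, captions[c]),
                reverse=True)
print(ranked[0])  # the best-matching caption
```

With real embeddings the same argmax-over-similarities pattern gives zero-shot image classification.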
Common variations
- Use Hugging Face Transformers with CLIPProcessor and CLIPModel as an alternative to the official clip package for local embedding generation.
- Generate embeddings for batches of images or texts for efficient similarity search.
- Use approximate nearest neighbor libraries like FAISS or Chroma to index and query CLIP embeddings at scale.
from transformers import CLIPProcessor, CLIPModel
import torch
from PIL import Image

# Load model and processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Prepare inputs
image = Image.open("path/to/image.jpg").convert("RGB")
text = ["a photo of a cat"]
inputs = processor(text=text, images=image, return_tensors="pt", padding=True)

# Get embeddings
with torch.no_grad():
    outputs = model(**inputs)
    image_emb = outputs.image_embeds
    text_emb = outputs.text_embeds

# Normalize embeddings
image_emb = image_emb / image_emb.norm(p=2, dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(p=2, dim=-1, keepdim=True)

# Cosine similarity
similarity = (image_emb @ text_emb.T).item()
print(f"Cosine similarity: {similarity:.4f}")
Output (value varies; as above, a well-matched pair typically scores around 0.2-0.35)
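For the FAISS variation listed above, the underlying operation is a top-k inner-product search over normalized vectors. As a minimal sketch, the same search can be written in plain NumPy with random stand-in vectors (not real CLIP outputs); FAISS's IndexFlatIP computes the same scores, just faster and at much larger scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in corpus: 1000 "image embeddings" of dimension 512 (CLIP ViT-B/32 width)
corpus = rng.standard_normal((1000, 512)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # L2-normalize each row

# Stand-in query "text embedding"
query = rng.standard_normal(512).astype(np.float32)
query /= np.linalg.norm(query)

# With normalized vectors, cosine similarity is a single matrix-vector product
scores = corpus @ query
top_k = np.argsort(scores)[::-1][:5]  # indices of the 5 most similar items
print(top_k, scores[top_k])
```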
Troubleshooting
- If clip.load fails, check the model name ("ViT-B/32") and your network connection; weights are downloaded on first use.
- For local Hugging Face usage, ensure torch and transformers are up to date.
- Image input must be RGB and properly preprocessed; convert grayscale or CMYK images to RGB.
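For the RGB point above, a small helper (a sketch, not part of any library) can normalize the image mode before preprocessing:

```python
from PIL import Image

def to_rgb(image: Image.Image) -> Image.Image:
    """Convert any PIL image (grayscale, CMYK, RGBA, ...) to RGB."""
    if image.mode != "RGB":
        image = image.convert("RGB")
    return image

# Example: a CMYK image becomes RGB before being fed to CLIP preprocessing
print(to_rgb(Image.new("CMYK", (4, 4))).mode)  # RGB
```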
Key takeaways
- Use CLIP embeddings to map images and text into a shared vector space for similarity tasks.
- The official clip package and Hugging Face Transformers both provide reliable ways to generate CLIP embeddings locally.
- Normalize embeddings before similarity calculations to get accurate cosine similarity scores.
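The normalization point can be checked directly: after L2 normalization, a plain dot product equals cosine similarity, which is why both examples above normalize before comparing.

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# Cosine similarity computed directly
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# The same value from a dot product of L2-normalized vectors
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
dot_of_normalized = np.dot(a_n, b_n)

print(np.isclose(cosine, dot_of_normalized))  # True
```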