
How to handle images in RAG documents

Quick answer
Handling images in RAG documents requires extracting visual features using image embedding models, then indexing those embeddings alongside text. Use multimodal models or separate vision encoders to convert images into vector representations that your retrieval system can query effectively.
⚡ QUICK FIX
Use an image embedding model to convert images into vectors before indexing them in your RAG pipeline.

Why this happens

RAG pipelines typically expect text inputs for embedding and retrieval. When images are included directly without preprocessing, the system cannot interpret or embed them, causing retrieval failures or irrelevant results. For example, passing raw image files to a text embedding model triggers errors or empty embeddings.

Broken code example:

python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Incorrect: passing image bytes as text
image_bytes = open("diagram.png", "rb").read()
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=image_bytes  # This is binary data, not text
)
print(response.data[0].embedding)
output
TypeError: Object of type bytes is not JSON serializable

The fix

Convert images to embeddings using a dedicated image embedding model or a multimodal model that supports images. Then store these embeddings in your vector database alongside text embeddings. During retrieval, query both text and image embeddings to find relevant documents.

This works because a joint image-text model such as CLIP maps images and text into a shared vector space, so a single similarity search can rank both kinds of content against the same query.

python
from PIL import Image
from sentence_transformers import SentenceTransformer

# Correct: use a dedicated image embedding model
# (CLIP, here loaded via sentence-transformers)
model = SentenceTransformer("clip-ViT-B-32")

image = Image.open("diagram.png")
image_embedding = model.encode(image)  # numpy vector

print(len(image_embedding))  # CLIP ViT-B/32 produces 512-dimensional vectors
output
512
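
At query time, the text query is embedded into the same vector space and compared against the stored image vectors. Here is a minimal cosine-similarity ranking sketch; it uses random stand-in vectors in place of real 512-dimensional embeddings, so only the ranking logic (not the vectors) reflects a real pipeline:

```python
import numpy as np

def cosine_top_k(query_vec, index_vecs, k=3):
    """Return the indices of the k most similar vectors by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    m = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    scores = m @ q  # cosine similarity of the query against every indexed vector
    return np.argsort(scores)[::-1][:k], scores

rng = np.random.default_rng(0)
index = rng.normal(size=(10, 512))               # stand-ins for stored image embeddings
query = index[4] + 0.01 * rng.normal(size=512)   # a query vector close to item 4
top, scores = cosine_top_k(query, index, k=3)
print(top[0])  # item 4 ranks first
```

In production the index would live in a vector database, but the similarity math is the same.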

Preventing it in production

  • Validate input types before embedding: separate text and images.
  • Use multimodal models or dedicated image embedding APIs for images.
  • Index image embeddings alongside text embeddings in your vector store.
  • Implement fallback logic if image embedding fails (e.g., OCR to extract text).
  • Test retrieval queries combining text and image vectors to ensure relevance.
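
The first bullet can be sketched as a simple router that decides which embedding pipeline an input should go through. The extension list and function name below are illustrative, not from any particular library:

```python
from pathlib import Path

# Extensions treated as images; extend this set for your corpus
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".webp"}

def route_input(path: str) -> str:
    """Return which embedding pipeline a file should go through."""
    ext = Path(path).suffix.lower()
    return "image" if ext in IMAGE_EXTS else "text"

print(route_input("diagram.png"))  # image
print(route_input("notes.md"))     # text
```

Routing on content type up front keeps raw image bytes out of text-only embedding calls entirely.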

Key Takeaways

  • Always convert images to embeddings using a dedicated image or multimodal model before indexing in RAG.
  • Store and query image embeddings alongside text embeddings for effective multimodal retrieval.
  • Validate and preprocess inputs to avoid passing raw images to text-only embedding models.