
How to fix poor embedding quality

Quick answer

To fix poor embedding quality, use a dedicated embedding model such as text-embedding-3-small, clean and normalize your input text before embedding it, and keep each input within the model's length limit by truncating or splitting long texts. If results are still weak, compare models (for example text-embedding-3-large) on your own data. Batch size affects throughput, not the vectors themselves.

PREREQUISITES

Python 3.8+
OpenAI API key (free tier works)
pip install "openai>=1.0"

Setup

Install the openai Python package and set your API key as an environment variable.

bash

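pip install "openai>=1.0"   # quote the spec so the shell does not treat > as a redirect
export OPENAI_API_KEY="your-api-key"   # placeholder: replace with your own key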

Step by step

This example shows how to create embeddings with the OpenAI API using the text-embedding-3-small model and preprocess input text for better quality.

python

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

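def preprocess_text(text: str) -> str:
    # Basic cleaning: lowercase, trim, and collapse runs of whitespace.
    # Lowercasing is optional: case can carry meaning (names, acronyms),
    # so compare results with and without it on your own data.
    cleaned = ' '.join(text.lower().strip().split())
    return cleaned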

input_text = "  Example input text with   irregular spacing and CAPITAL letters. "
clean_text = preprocess_text(input_text)

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=clean_text
)

embedding_vector = response.data[0].embedding
print(f"Embedding vector length: {len(embedding_vector)}")

output

Embedding vector length: 1536
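
To judge whether a change such as cleaning the input actually helps, it is useful to compare vectors rather than only inspect their length. The sketch below is a minimal, hypothetical cosine_similarity helper (not part of the openai package) that uses toy 3-dimensional vectors in place of real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (assumption for illustration).
query = [1.0, 0.0, 0.0]
related = [0.9, 0.1, 0.0]
unrelated = [0.0, 0.0, 1.0]

# A related text should score higher against the query than an unrelated one.
print(cosine_similarity(query, related) > cosine_similarity(query, unrelated))  # prints True
```

With real data, embed a few pairs of texts you know to be related or unrelated and check that the related pairs score higher. If they do not, revisit the preprocessing or the model choice.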

Common variations

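Pass a list as input to embed multiple texts in one request. This reduces the number of API calls; it does not meaningfully change the resulting vectors.
Try a larger model such as text-embedding-3-large when quality matters more than cost.
Use async calls when embedding large datasets so requests can run concurrently.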

python

import asyncio
import os
from openai import AsyncOpenAI

# openai>=1.0 has no `acreate` method; use the AsyncOpenAI client and await `create`.
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def embed_texts(texts):
    # Apply the same cleaning to every text: lowercase and collapse whitespace.
    cleaned_texts = [' '.join(t.lower().strip().split()) for t in texts]
    response = await client.embeddings.create(
        model="text-embedding-3-small",
        input=cleaned_texts
    )
    # One embedding is returned per input text.
    return [item.embedding for item in response.data]

texts = ["First example.", "Second example text."]
embeddings = asyncio.run(embed_texts(texts))
print(f"Got {len(embeddings)} embeddings.")

output

Got 2 embeddings.
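
For larger datasets, a single request cannot hold every text, so the list has to be sent in slices. The following is a rough sketch of one way to do that with the synchronous client. The helper names batched and embed_in_batches are hypothetical, and the default batch_size of 100 is an arbitrary assumption, not a documented limit.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_in_batches(client, texts: Sequence[str],
                     model: str = "text-embedding-3-small",
                     batch_size: int = 100) -> List[List[float]]:
    """Embed `texts` in several requests and return vectors in input order.

    `client` is expected to be an openai.OpenAI instance (or any object
    exposing the same `embeddings.create` call).
    """
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        response = client.embeddings.create(model=model, input=batch)
        # Each returned item carries the index of its input within the batch;
        # sorting by it is a cheap safeguard for keeping input order.
        ordered = sorted(response.data, key=lambda item: item.index)
        vectors.extend(item.embedding for item in ordered)
    return vectors
```

A call would look like embed_in_batches(OpenAI(), texts). Batching this way changes how many requests are made, not the vectors that come back.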

Troubleshooting

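If embeddings seem noisy or inconsistent, verify that input text is cleaned and normalized the same way for stored documents and for queries.
Check the model's input limit (roughly 8,000 tokens for the text-embedding-3 models); truncate or split longer texts.
Try switching models if quality is still poor.
Make sure the API key and model name are correct. The API returns an error for an invalid key or an unknown model rather than silently falling back, so handle exceptions instead of ignoring them.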
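
Splitting long texts, as suggested above, can be sketched as follows. This is a simplified illustration: the hypothetical split_into_chunks helper counts whitespace-separated words as a stand-in for tokens, and the max_words and overlap defaults are arbitrary assumptions. A real tokenizer (for example the tiktoken package) would give exact token counts.

```python
from typing import List


def split_into_chunks(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split `text` into word-based chunks that share `overlap` words with the next one."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in the range [0, max_words)")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in split_into_chunks(sample, max_words=4, overlap=1):
    print(chunk)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
```

Each chunk can then be embedded separately. The overlap keeps some shared context across chunk boundaries, at the cost of embedding a few words twice.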


Key Takeaways

Preprocess and normalize input text to improve embedding consistency.
Use a dedicated embedding model such as text-embedding-3-small, or text-embedding-3-large for higher quality.
Batching multiple inputs into one request reduces API calls; it does not change the vectors.
Keep inputs within the model's token limit and handle API errors explicitly so failures do not go unnoticed.

Verified 2026-04 · text-embedding-3-small, text-embedding-3-large
