
How to fix poor embedding quality

Quick answer
To fix poor embedding quality, use a dedicated embedding model such as text-embedding-3-small, clean and normalize your input text before embedding it, and keep inputs within the model's length limit by truncating or splitting long texts. If quality is still poor, compare results from a different model such as text-embedding-3-large.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0"

Setup

Install the openai Python package and set your API key as an environment variable.

bash
pip install "openai>=1.0"
export OPENAI_API_KEY="your-api-key"  # replace with your own key

Step by step

This example shows how to create embeddings with the OpenAI API using the text-embedding-3-small model and preprocess input text for better quality.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def preprocess_text(text: str) -> str:
    # Basic cleaning: lowercase, strip, remove extra spaces
    cleaned = ' '.join(text.lower().strip().split())
    return cleaned

input_text = "  Example input text with   irregular spacing and CAPITAL letters. "
clean_text = preprocess_text(input_text)

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=clean_text
)

embedding_vector = response.data[0].embedding
print(f"Embedding vector length: {len(embedding_vector)}")
output
Embedding vector length: 1536
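
A quick way to judge whether the vectors are any good is to check that related texts score higher than unrelated ones. The sketch below computes cosine similarity in plain Python. The helper name `cosine_similarity` and the toy vectors are illustrative assumptions, not part of the OpenAI SDK; in practice you would pass in vectors returned by `client.embeddings.create`.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (hypothetical values).
query = [1.0, 0.0, 1.0]
related = [0.9, 0.1, 0.8]
unrelated = [0.0, 1.0, 0.0]

# A healthy embedding space should rank the related text above the unrelated one.
print(cosine_similarity(query, related) > cosine_similarity(query, unrelated))
```

Running the same check on a handful of text pairs you already know to be related or unrelated gives a rough baseline for comparing preprocessing choices or models.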

Common variations

  • Pass a list of texts as input to embed several texts in one request, which reduces round trips without changing the vectors.
  • Try different embedding models like text-embedding-3-large for higher quality.
  • Use async calls if embedding large datasets.
python
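import asyncio
import os
from openai import AsyncOpenAI

# The async client mirrors the OpenAI client, but its methods are awaitable.
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def embed_texts(texts):
    # Same cleaning as before: lowercase, strip, collapse extra spaces
    cleaned_texts = [' '.join(t.lower().strip().split()) for t in texts]
    # Passing a list embeds every text in a single request
    response = await client.embeddings.create(
        model="text-embedding-3-small",
        input=cleaned_texts
    )
    # Each item in response.data holds the embedding for one input text
    return [item.embedding for item in response.data]

texts = ["First example.", "Second example text."]
embeddings = asyncio.run(embed_texts(texts))
print(f"Got {len(embeddings)} embeddings.")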
output
Got 2 embeddings.
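
For large datasets, one option is to split the texts into fixed-size batches and send one request per batch. The sketch below shows only the splitting step. The helper name `batched` and the batch size of 2 are illustrative assumptions; a suitable batch size depends on your rate limits and input sizes.

```python
from typing import Iterator, List, Sequence


def batched(texts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts` with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


texts = ["first", "second", "third", "fourth", "fifth"]
for batch in batched(texts, batch_size=2):
    # Each batch could be passed as `input=` to an embeddings request.
    print(batch)
```

Batching this way changes how many requests are sent, not the vectors that come back.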

Troubleshooting

  • If embeddings seem noisy or inconsistent, verify input text is clean and normalized.
  • Check for input length limits; truncate or split long texts.
  • Try switching models if quality is poor.
  • Make sure the API key and model name are correct. The API returns an error for an invalid key or an unknown model; it does not silently fall back to another model.
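
The advice to split long texts can be sketched as a simple word-based chunker. This is a minimal sketch, assuming word counts are an acceptable stand-in for tokens. The function name `split_into_chunks` and the `max_words` and `overlap` defaults are hypothetical and should be tuned against the model's actual token limit; a tokenizer gives exact counts.

```python
from typing import List


def split_into_chunks(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into word-based chunks that share `overlap` words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in the range [0, max_words)")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(split_into_chunks(sample, max_words=4, overlap=1))
```

The overlap keeps a few words of shared context between neighbouring chunks, so a sentence cut at a boundary still appears whole in at least one of them when the overlap is large enough.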

Key Takeaways

  • Preprocess and normalize input text to improve embedding consistency.
  • Use specialized embedding models like text-embedding-3-small or larger variants for better quality.
  • Batching multiple inputs into one request reduces API calls; it does not change the resulting vectors.
  • Keep inputs within the model's length limit, and truncate or split long texts deliberately so that no content is dropped unnoticed.
Verified 2026-04 · text-embedding-3-small, text-embedding-3-large