Code Intermediate medium · 6 min

Similarity search: how retrieval works

What you will learn

Similarity search finds the most relevant documents from a collection by measuring how semantically close they are to a query using vector embeddings.

Why this matters

Retrieval is the foundation of RAG (Retrieval-Augmented Generation) systems: without understanding how similarity search works, you can't debug why your LLM gets the wrong context, why latency is high, or why results feel irrelevant. Most production RAG systems depend on it.

Skip if: Don't use similarity search when your entire dataset is small enough to fit in context (< 5KB of text): just put it all in the prompt. Don't use it if you need exact keyword matching (use full-text search instead). Don't use it if your documents have no semantic meaning (random binary data).

Explanation

What it is: Similarity search is the process of finding documents most relevant to a query by converting both the query and documents into vectors (embeddings) and measuring the distance between them. The closer the vectors, the more similar the documents.

How it works mechanically: When you query a retrieval system, it embeds your query into a vector using an embedding model (like OpenAI's text-embedding-3-small). It then compares this vector against pre-embedded documents stored in a vector database using a metric such as cosine similarity or Euclidean distance. The documents closest to the query (highest cosine similarity, or smallest Euclidean distance) are returned as the most relevant. This works because embedding models are trained to place semantically similar text near each other in vector space.
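To make the comparison step concrete, here is a minimal sketch of cosine-similarity ranking in plain Python. The three-dimensional vectors and the names cosine_similarity and top_k are illustrative assumptions, not output from a real embedding model or part of any library; a real vector store does the same ranking over vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the k (doc_id, score) pairs with the highest similarity to the query."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hand-made "embeddings" (assumed values, chosen only for illustration).
doc_vecs = {
    "python_intro": [0.9, 0.1, 0.0],
    "ml_data": [0.2, 0.9, 0.1],
    "cats": [0.0, 0.1, 0.9],
    "python_ml": [0.7, 0.7, 0.0],
}
query_vec = [0.6, 0.8, 0.0]

results = top_k(query_vec, doc_vecs, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

With these toy vectors, python_ml ranks first and ml_data second, because their directions are closest to the query's.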

When to use it: Use similarity search whenever you need to find relevant documents from a large collection (100+ documents) to pass to an LLM. It's the standard approach in RAG pipelines, semantic search, and recommendation systems. The key insight is that it's semantic: it understands meaning, not just keywords.

Analogy

Imagine a massive library where every book is placed in a high-dimensional warehouse based on its meaning. When you ask a question, your question is also placed in that same warehouse. The closest books to your question's location are the most relevant. Unlike a card catalog (keyword search), this system understands that 'vehicle' and 'car' belong near each other even though they're different words.

Code

python

from langchain_openai import OpenAIEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_core.documents import Document

# Initialize the embedding model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create sample documents
docs = [
    Document(page_content="Python is a high-level programming language known for its simplicity and readability."),
    Document(page_content="Machine learning models require large amounts of training data to perform well."),
    Document(page_content="Cats are independent animals that enjoy sleeping and hunting."),
    Document(page_content="Python provides excellent libraries for data science and machine learning."),
]

# Create an in-memory vector store and add documents
vector_store = InMemoryVectorStore.from_documents(
    documents=docs,
    embedding=embeddings
)

# Perform similarity search
query = "What programming language is good for machine learning?"
results = vector_store.similarity_search(query, k=2)

# Display results
for i, doc in enumerate(results, 1):
    print(f"Result {i}: {doc.page_content}")
    print()

Output

Result 1: Python provides excellent libraries for data science and machine learning.

Result 2: Python is a high-level programming language known for its simplicity and readability.

What just happened?

The code created a vector store, embedded all four documents into vector space using OpenAI's embedding model, then took your query string, embedded it the same way, and returned the 2 documents whose vectors were closest to the query's vector. The documents about Python ranked higher than the one about cats or the general ML statement because they semantically match the query better.

Common gotcha

Developers often assume that k (here k=2) is a performance setting. It isn't: k only controls how many documents come back, and changing it has little effect on search time. Latency is dominated by the embedding call and the size of the collection being searched. Also, similarity search always returns the top k documents ranked by similarity, with no built-in relevance cutoff. If k is large enough, a barely related document is still returned, which can degrade LLM output. Inspect the actual similarity scores (for example with similarity_search_with_score) rather than trusting that the top k are all relevant.
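One way to guard against this is a score threshold applied after retrieval. The sketch below is plain Python; the scores, the filter_by_score helper, and the 0.30 cutoff are all assumptions made up for illustration, not values produced by the code above. Pick a threshold by inspecting scores on your own data.

```python
def filter_by_score(scored_docs, min_score):
    """Keep only (text, score) pairs whose similarity is at least min_score."""
    return [(text, score) for text, score in scored_docs if score >= min_score]

# Hypothetical top-4 results, sorted by similarity (higher means closer).
top_4 = [
    ("Python provides excellent libraries for data science and machine learning.", 0.62),
    ("Python is a high-level programming language known for its simplicity and readability.", 0.48),
    ("Machine learning models require large amounts of training data to perform well.", 0.41),
    ("Cats are independent animals that enjoy sleeping and hunting.", 0.07),
]

MIN_SCORE = 0.30  # assumed cutoff; tune it for your embedding model and corpus
kept = filter_by_score(top_4, MIN_SCORE)

for text, score in kept:
    print(f"{score:.2f}  {text}")
```

Here the cats document is returned by a k=4 search but dropped by the filter, so it never reaches the LLM prompt.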

Error recovery

AuthenticationError

You passed invalid or missing OpenAI API credentials. Set OPENAI_API_KEY environment variable or pass api_key='your-key' explicitly to OpenAIEmbeddings().
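A small pre-flight check can turn a confusing authentication failure into a clear message at startup. This is a sketch only: require_api_key is a hypothetical helper, not part of LangChain or the OpenAI SDK, and it confirms that the variable is set, not that the key is valid.

```python
import os

def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or fail early with a clear message."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before creating the embedding model.")
    return key

# Demo with a stand-in environment so the example runs without real credentials.
fake_env = {"OPENAI_API_KEY": "sk-example"}
print(require_api_key(fake_env))
```

In a real script you would call require_api_key() with no arguments before constructing OpenAIEmbeddings.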

RateLimitError

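The embedding API is rate-limited. Embed texts in batches rather than one request per document (embed_documents accepts a list), cache vectors you have already computed (for example with CacheBackedEmbeddings from the langchain package), and retry with exponential backoff in production.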
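Exponential backoff can be sketched in a few lines. Everything here is an assumption for illustration: TransientError stands in for whatever rate-limit exception your client raises, flaky_embed simulates an API that fails twice, and the delay settings are arbitrary. Many client libraries also ship their own retry options, which are usually preferable to hand-rolled loops.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a rate-limit error raised by an API client."""

def retry_with_backoff(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `call()`, retrying on TransientError with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles each attempt; random jitter spreads out simultaneous retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)

# Demo: a hypothetical embedding call that is rate-limited twice, then succeeds.
attempts = {"count": 0}

def flaky_embed():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("rate limited")
    return [0.1, 0.2, 0.3]

delays = []  # record delays instead of actually sleeping, so the demo runs instantly
vector = retry_with_backoff(flaky_embed, sleep=delays.append)
print(vector, attempts["count"], len(delays))
```

Passing sleep=delays.append is only a testing convenience; leave the default time.sleep in real code.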

ValueError: model not found

You specified an embedding model that doesn't exist (e.g., text-embedding-4). Use text-embedding-3-small, text-embedding-3-large, or check OpenAI's current model list. Depending on your SDK version, this may surface as openai.NotFoundError rather than a ValueError.

Experienced dev note

Caching embeddings is not optional in production: re-embedding the same documents on every request wastes API calls and money. Wrap your embedding model with a cache such as CacheBackedEmbeddings (shipped in the langchain package rather than langchain_core; the exact import path depends on your version). Also, InMemoryVectorStore is fine for prototyping, but it keeps every vector in RAM, scans all of them on each query, and doesn't persist between restarts, so expect it to become slow and fragile past roughly 10k documents. Move to Chroma, Pinecone, or Weaviate before going live. Finally, embedding quality directly determines retrieval quality: a mediocre embedding model will cause your RAG system to retrieve irrelevant documents, and no amount of prompt engineering will fix that.
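The idea behind an embedding cache fits in a short sketch. This is not how CacheBackedEmbeddings is implemented; CachedEmbedder and fake_embed are hypothetical names, and the fake function stands in for a paid API call so the example runs offline.

```python
import hashlib

class CachedEmbedder:
    """Wrap an embedding function so each distinct text is embedded only once."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._cache = {}
        self.calls = 0  # how many times the underlying model was actually invoked

    def embed(self, text):
        # Hash the text so the cache key has a fixed size regardless of document length.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._embed_fn(text)
        return self._cache[key]

def fake_embed(text):
    """Hypothetical stand-in for a real embedding API call."""
    return [float(len(text)), float(text.count(" "))]

embedder = CachedEmbedder(fake_embed)
first = embedder.embed("Python is a programming language.")
second = embedder.embed("Python is a programming language.")  # served from the cache
embedder.embed("Cats enjoy sleeping.")
print(embedder.calls)
```

Three embed requests result in only two calls to the underlying function. A production cache would also persist to disk or a key-value store and include the model name in the key, so that switching models does not return stale vectors.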

Check your understanding

If you increased k from 2 to 10 in the similarity_search call, would the query be re-embedded, and would the search take significantly longer?

Answer hint

The query is embedded once per search call regardless of k, so raising k triggers no extra embedding work. k only controls how many of the already-ranked results are returned, so the time increase is negligible. The bottleneck is the embedding step, not the value of k.

VERSION InMemoryVectorStore was stabilized in langchain-core 0.2.0. Earlier versions (< 0.2.0) required using Chroma or another persistent backend. If you're on langchain-core < 0.2.0, use ChromaDB instead.

Learn how to integrate similarity search into a RAG chain so the retrieved documents automatically feed into an LLM prompt: this is where retrieval becomes useful.
