Similarity search: how retrieval works
Why this matters
Retrieval is the foundation of RAG (Retrieval-Augmented Generation) systems: without understanding how similarity search works, you can't debug why your LLM gets the wrong context, why latency is high, or why results feel irrelevant. Most production RAG systems rely on it.
Explanation
What it is: Similarity search finds the documents most relevant to a query by converting both the query and the documents into vectors (embeddings) and measuring how close those vectors are. The closer two vectors are, the more similar the texts they represent.
How it works mechanically: When you query a retrieval system, it embeds your query into a vector using an embedding model (like OpenAI's text-embedding-3-small). It then compares this vector against pre-embedded documents stored in a vector database using a similarity or distance metric (cosine similarity, Euclidean distance, etc.). The documents with the highest similarity (equivalently, the smallest distance) are returned as the most relevant. This works because embedding models are trained to place semantically similar text close together in vector space.
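The ranking step described above can be sketched in plain Python. This is a minimal illustration, not how any particular vector database implements it: the 3-dimensional vectors are hand-made stand-ins for real embeddings, and the names cosine_similarity, top_k, doc_vecs, and query_vec are made up for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Score every document against the query, sort by score, keep the k best."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hand-made toy vectors standing in for real embeddings (assumption).
doc_vecs = {
    "python_ml": [0.9, 0.8, 0.1],
    "python_intro": [0.9, 0.2, 0.1],
    "cats": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.9, 0.0]

results = top_k(query_vec, doc_vecs, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

With these toy vectors, python_ml ranks first and python_intro second, while cats is left out because its vector points in a very different direction from the query's. Real embeddings have hundreds or thousands of dimensions, but the comparison works the same way.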
When to use it: Use similarity search whenever you need to pull the relevant documents out of a large collection (100+ documents) to pass to an LLM. It's the standard approach in RAG pipelines, semantic search, and recommendation systems. The key insight is that it's semantic: it matches on meaning, not just on shared keywords.
Analogy
Imagine a massive library where every book is placed in a high-dimensional warehouse based on its meaning. When you ask a question, your question is also placed in that same warehouse. The closest books to your question's location are the most relevant. Unlike a card catalog (keyword search), this system understands that 'vehicle' and 'car' belong near each other even though they're different words.
Code
from langchain_openai import OpenAIEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_core.documents import Document

# Initialize the embedding model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create sample documents
docs = [
    Document(page_content="Python is a high-level programming language known for its simplicity and readability."),
    Document(page_content="Machine learning models require large amounts of training data to perform well."),
    Document(page_content="Cats are independent animals that enjoy sleeping and hunting."),
    Document(page_content="Python provides excellent libraries for data science and machine learning."),
]

# Create an in-memory vector store and add documents
vector_store = InMemoryVectorStore.from_documents(
    documents=docs,
    embedding=embeddings
)

# Perform similarity search
query = "What programming language is good for machine learning?"
results = vector_store.similarity_search(query, k=2)

# Display results
for i, doc in enumerate(results, 1):
    print(f"Result {i}: {doc.page_content}")
    print()

Output:
Result 1: Python provides excellent libraries for data science and machine learning.

Result 2: Python is a high-level programming language known for its simplicity and readability.
What just happened?
The code created a vector store, embedded all four documents into vector space using OpenAI's embedding model, then took your query string, embedded it the same way, and returned the 2 documents whose vectors were closest to the query's vector. The documents about Python ranked higher than the one about cats or the general ML statement because they semantically match the query better.
Common gotcha
Developers often assume that the number of results returned (k=2) is a performance setting. It's not: changing k has little effect on search time; it only changes how many documents come back. The real costs are the embedding call and the size of the collection being searched. Also, similarity search always returns the top k results ranked by similarity, with no minimum-relevance cutoff. A low-similarity document can still be returned if k is large enough, which can degrade the LLM's answers. Always check the actual similarity scores rather than trusting that the top k are good enough.
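One way to act on this is to apply a minimum-score cutoff to the ranked results before they reach the LLM. The sketch below is a plain-Python illustration under stated assumptions: the scores are invented, filter_by_score and MIN_SCORE are hypothetical names, and higher scores are assumed to mean more similar. LangChain vector stores expose similarity_search_with_score, which returns (Document, score) pairs you could feed into a filter like this, but check your store's documentation for which direction its score runs.

```python
def filter_by_score(scored_results, min_score):
    """Keep only (text, score) pairs whose similarity is at least min_score."""
    return [(text, score) for text, score in scored_results if score >= min_score]

# Hypothetical top-4 results for a query about Python and machine learning.
# The scores are invented for illustration, not produced by a real model.
scored_results = [
    ("Python provides excellent libraries for data science and machine learning.", 0.71),
    ("Python is a high-level programming language known for its simplicity and readability.", 0.52),
    ("Machine learning models require large amounts of training data to perform well.", 0.44),
    ("Cats are independent animals that enjoy sleeping and hunting.", 0.08),
]

MIN_SCORE = 0.40  # assumed cutoff; tune it on your own data
kept = filter_by_score(scored_results, MIN_SCORE)
print(f"kept {len(kept)} of {len(scored_results)} results")
```

Here k=4 pulled in the document about cats, and the cutoff drops it again. A sensible threshold depends on the embedding model and the data, so treat 0.40 as a placeholder.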
Error recovery
AuthenticationError
RateLimitError
ValueError: model not found
Experienced dev note
Caching embeddings is not optional in production: re-embedding the same documents on every request wastes API calls and money. Wrap your embedding model with CacheBackedEmbeddings (it lives in the langchain package, for example langchain.embeddings.CacheBackedEmbeddings, not in langchain_core) so that repeated texts are served from a cache. Also, InMemoryVectorStore is fine for prototyping, but it keeps everything in process memory, compares the query against every stored vector, and doesn't persist between restarts, so expect trouble somewhere past roughly 10k documents. Move to Chroma, Pinecone, or Weaviate before going live. Finally, embedding quality directly determines retrieval quality: a mediocre embedding model will cause your RAG system to retrieve irrelevant documents, and no amount of prompt engineering will fix that.
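The caching idea itself is small enough to sketch without any library. The example below is an illustration only, not the LangChain implementation: CachedEmbedder and fake_embed are hypothetical names, and fake_embed stands in for a paid embedding API call so the number of calls can be counted.

```python
import hashlib

class CachedEmbedder:
    """Wraps an embedding function and remembers vectors for texts it has seen."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._cache = {}  # maps SHA-256 of the text -> vector

    def embed(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            # Cache miss: pay for one real embedding call and store the result.
            self._cache[key] = self._embed_fn(text)
        return self._cache[key]

calls = {"count": 0}

def fake_embed(text):
    """Hypothetical stand-in for a paid embedding API call."""
    calls["count"] += 1
    return [float(len(text)), float(text.count(" "))]

embedder = CachedEmbedder(fake_embed)
first = embedder.embed("Python is a high-level programming language.")
second = embedder.embed("Python is a high-level programming language.")
print(calls["count"])  # the second request is served from the cache
```

The dictionary here lives in process memory, so it is lost on restart. A production cache would sit on disk or in a key-value store, and should include the embedding model's name in the key so that switching models doesn't return stale vectors.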
Check your understanding
If you increased k from 2 to 10 in the similarity_search call, would the query be re-embedded, and would the search take significantly longer?
Answer hint
The correct answer explains that the query is embedded once per search call regardless of k, and that k only controls how many of the already-ranked results are returned, so the time increase is negligible. The dominant cost is the embedding step, not the k value.