EmbeddingsFilter: fast relevance filtering without LLM calls

What you will learn

Filter retrieval results by embedding similarity before reranking to reduce computational cost and latency.

Step 3 of Advanced RAG Pipeline: After initial dense retrieval (Step 2) and before reranking (Step 4)

Why this matters

Skipping this step forces the reranker to process all retrieved documents, wasting inference time and increasing costs. A poorly calibrated similarity threshold sends low-quality documents downstream, forcing the reranker to work harder and introducing noise into the context window.

Explanation

Purpose: After retrieving top-k documents with semantic search, many are only marginally relevant. An EmbeddingsFilter removes documents below a similarity threshold using only embedding arithmetic, with no LLM calls. This can shrink the document batch sent to the reranker by 40–70% and cut reranking latency by 2–4x.

How it works: Store the query embedding and document embeddings in the same vector space. Compute cosine similarity between query and each retrieved doc. Drop any doc with similarity below your threshold (e.g., 0.65). Pass the remaining docs to the reranker.
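The steps above can be sketched with plain NumPy. This is a minimal illustration, not the tutorial's class: the 3-dimensional vectors are toy stand-ins for real embeddings, and the helper names `cosine_scores` and `keep_above` are made up for this example.

```python
import numpy as np

def cosine_scores(query_vec, doc_vecs):
    """Cosine similarity between one query vector and each row of doc_vecs."""
    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_vecs, dtype=float)
    # Dot product divided by the product of the vector norms
    return (d @ q) / (np.linalg.norm(d, axis=1) * np.linalg.norm(q))

def keep_above(scores, threshold=0.65):
    """Indices of docs whose score meets the threshold, in input order."""
    return [i for i, s in enumerate(scores) if s >= threshold]

# Toy vectors standing in for real embeddings (assumption for illustration)
query_vec = [1.0, 0.0, 0.0]
doc_vecs = [
    [1.0, 0.0, 0.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # 45 degrees away from the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
]

scores = cosine_scores(query_vec, doc_vecs)
kept = keep_above(scores, threshold=0.65)
print(scores, kept)
```

The scores come out as 1.0, about 0.707, and 0.0, so the first two docs pass a 0.65 threshold and the orthogonal one is dropped.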

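What to watch: The threshold depends on the task and on the embedding model, because absolute cosine scores differ between models. Too high and you filter out valid context; too low and you waste reranker compute. Start at 0.65 for general knowledge tasks and 0.75 for strict matching (e.g., code search), then tune against your own score distribution. Monitor the percentage of docs filtered out; if it exceeds 80%, your initial retriever is weak, so fix that instead of loosening the threshold.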
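One way to monitor the filtered percentage is a small helper like the one below. It is a sketch only: the function names are hypothetical, and the 80% cutoff is the rule of thumb from this section rather than a library default.

```python
def filter_rate(n_retrieved: int, n_kept: int) -> float:
    """Fraction of retrieved docs that the filter removed."""
    if n_retrieved == 0:
        return 0.0
    return 1.0 - n_kept / n_retrieved

def check_filter_rate(n_retrieved: int, n_kept: int, max_rate: float = 0.80) -> str:
    """Return a short status string; warn when the filter removes too much."""
    rate = filter_rate(n_retrieved, n_kept)
    if rate > max_rate:
        return f"WARN: {rate:.0%} filtered; review the retriever before lowering the threshold"
    return f"OK: {rate:.0%} filtered"

print(check_filter_rate(20, 3))   # 17 of 20 docs removed
print(check_filter_rate(20, 10))  # 10 of 20 docs removed
```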

Code

python

# pip install langchain-openai langchain-core scikit-learn numpy

from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document  # langchain.schema is a deprecated import path
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

class EmbeddingsFilter:
    """Filter retrieved docs by embedding similarity before reranking."""
    
    def __init__(self, embeddings_model, similarity_threshold=0.68, min_docs=2):
        self.embeddings = embeddings_model
        self.threshold = similarity_threshold
        self.min_docs = min_docs
    
    def filter_documents(self, query: str, documents: list[Document]) -> list[Document]:
        """Filter docs by cosine similarity to query embedding."""
        if not documents:
            return []
        
        # Embed query
        query_embedding = self.embeddings.embed_query(query)
        
        # Reuse cached embeddings where present; batch-embed the rest in one call
        doc_embeddings = [doc.metadata.get('embedding') for doc in documents]
        missing = [i for i, emb in enumerate(doc_embeddings) if emb is None]
        if missing:
            # Embedding on the fly is slow (avoid this path in production)
            fresh = self.embeddings.embed_documents([documents[i].page_content for i in missing])
            for i, emb in zip(missing, fresh):
                doc_embeddings[i] = emb
        
        # Compute similarities
        similarities = cosine_similarity([query_embedding], doc_embeddings)[0]
        
        # Rank by similarity (best first), then keep docs at or above the threshold
        ranked = sorted(zip(documents, similarities), key=lambda pair: pair[1], reverse=True)
        filtered = [(doc, sim) for doc, sim in ranked if sim >= self.threshold]
        
        # Safety fallback: if too few docs pass, keep the top min_docs anyway
        if len(filtered) < self.min_docs:
            filtered = ranked[:self.min_docs]
        
        # Attach the score to each doc's metadata (mutates the doc in place) for debugging
        result = []
        for doc, sim in filtered:
            doc.metadata['filter_similarity'] = float(sim)
            result.append(doc)
        
        return result


# Example usage
if __name__ == '__main__':
    embeddings = OpenAIEmbeddings(model='text-embedding-3-small')
    
    # Mock retrieved documents (in real RAG, these come from vector DB)
    documents = [
        Document(
            page_content='Python is a programming language used for data science.',
            metadata={'embedding': embeddings.embed_query('Python is a programming language used for data science.'), 'source': 'doc1'}
        ),
        Document(
            page_content='Machine learning requires large datasets.',
            metadata={'embedding': embeddings.embed_query('Machine learning requires large datasets.'), 'source': 'doc2'}
        ),
        Document(
            page_content='The capital of France is Paris.',
            metadata={'embedding': embeddings.embed_query('The capital of France is Paris.'), 'source': 'doc3'}
        ),
        Document(
            page_content='Neural networks learn hierarchical features from data.',
            metadata={'embedding': embeddings.embed_query('Neural networks learn hierarchical features from data.'), 'source': 'doc4'}
        ),
    ]
    
    query = 'How do neural networks process data?'
    
    # Create filter and apply
    filter_obj = EmbeddingsFilter(embeddings, similarity_threshold=0.65, min_docs=2)
    filtered_docs = filter_obj.filter_documents(query, documents)
    
    print(f'Query: {query}')
    print(f'Retrieved: {len(documents)} docs | After filter: {len(filtered_docs)} docs\n')
    for doc in filtered_docs:
        print(f'Source: {doc.metadata["source"]} | Similarity: {doc.metadata["filter_similarity"]:.3f}')
        print(f'  Content: {doc.page_content}\n')

Output (illustrative; exact scores depend on the embedding model)

Query: How do neural networks process data?
Retrieved: 4 docs | After filter: 2 docs

Source: doc4 | Similarity: 0.823
  Content: Neural networks learn hierarchical features from data.

Source: doc2 | Similarity: 0.671
  Content: Machine learning requires large datasets.

Your options

Recommended

No filtering (baseline)

Very small retrieval sets (<10 docs) or when you need every possible signal for the reranker

Pros

Simplest to implement; no tuning needed; reranker sees all candidates

Cons

Reranker wastes tokens on obviously irrelevant docs; higher latency and cost at scale

# Skip filtering entirely
docs = retriever.invoke(query)  # get_relevant_documents() is deprecated in recent LangChain releases
reranked = reranker.compress_documents(docs, query)

Static similarity threshold (0.60–0.75)

Stable domain with consistent query-doc relationships; most production RAG systems

Pros

Fast, interpretable, easy to tune; minimal overhead; deterministic results

Cons

One-size-fits-all approach; may over-filter or under-filter depending on domain shift

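from langchain_openai import OpenAIEmbeddings
from sklearn.metrics.pairwise import cosine_similarity

embeddings = OpenAIEmbeddings(model='text-embedding-3-small')
query_embedding = embeddings.embed_query(query)
# One vectorized call instead of one cosine_similarity call per doc
similarities = cosine_similarity([query_embedding], [doc.metadata['embedding'] for doc in docs])[0]
docs_filtered = [doc for doc, sim in zip(docs, similarities) if sim >= 0.68]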

Adaptive threshold (percentile-based)

When query difficulty or doc quality varies; you want to keep top N% regardless of absolute score

Pros

Adapts to query complexity; avoids edge cases of all-low-similarity results

Cons

More complex logic; less interpretable; percentile changes with retrieval set size

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

query_embedding = embeddings.embed_query(query)
similarities = cosine_similarity([query_embedding], [doc.metadata['embedding'] for doc in docs])[0]
threshold = np.percentile(similarities, 40)  # 40th percentile: keeps roughly the top 60%
docs_filtered = [doc for doc, sim in zip(docs, similarities) if sim > threshold]

Hybrid: threshold + minimum count

Production systems where you need both quality (threshold) and safety (never filter all docs)

Pros

Balances relevance and fallback; handles edge cases; avoids empty result sets

Cons

Two parameters to tune; slightly more complex

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

threshold = 0.68
min_docs = 3

query_embedding = embeddings.embed_query(query)
similarities = cosine_similarity([query_embedding], [doc.metadata['embedding'] for doc in docs])[0]
docs_filtered = [doc for doc, sim in zip(docs, similarities) if sim >= threshold]

# Safety fallback: if the threshold filters out too much, keep the top min_docs
if len(docs_filtered) < min_docs:
    sorted_indices = np.argsort(similarities)[::-1]
    docs_filtered = [docs[i] for i in sorted_indices[:min_docs]]

Validation step

Run the filter on a test query with 10–20 retrieved docs. Check that: (1) the filter removes 30–70% of the input (not more than 80% or less than 10%); (2) the lowest-similarity doc that passed has similarity >= threshold, unless the min_docs fallback fired; (3) the similarity computation takes <10ms for 100 docs with cached embeddings (the query embedding call is separate; compare against retrieval timing); (4) the metadata field 'filter_similarity' is present on all output docs.
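The four checks can be bundled into one helper, sketched below. The `Doc` dataclass is only a stand-in for LangChain's Document so the example runs on its own, `validate_filter_output` is a hypothetical name, and the elapsed time is assumed to be measured by the caller (for example with `time.perf_counter()` around the filter call).

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for langchain's Document, used only in this sketch."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def validate_filter_output(n_input, kept_docs, threshold, elapsed_ms):
    """Return a dict mapping each validation check to True or False."""
    sims = [d.metadata.get("filter_similarity") for d in kept_docs]
    has_meta = bool(sims) and all(s is not None for s in sims)
    removed_share = 1 - len(kept_docs) / n_input if n_input else 0.0
    return {
        "removed_share_30_to_70_pct": 0.30 <= removed_share <= 0.70,
        # Fails by design when the min_docs fallback kept below-threshold docs
        "lowest_kept_meets_threshold": has_meta and min(sims) >= threshold,
        "under_10_ms": elapsed_ms < 10.0,
        "metadata_present": has_meta,
    }

# Example: 10 docs retrieved, 4 kept, similarity step measured at 3.2 ms
kept = [Doc("text", {"filter_similarity": s}) for s in (0.82, 0.74, 0.69, 0.66)]
checks = validate_filter_output(n_input=10, kept_docs=kept, threshold=0.65, elapsed_ms=3.2)
for name, ok in checks.items():
    print(f"{name}: {'pass' if ok else 'FAIL'}")
```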

At scale

At 100k+ documents per query, re-embedding documents on the fly (the branch in filter_documents that embeds docs with no cached 'embedding' in metadata) becomes a bottleneck. Production systems must pre-compute and cache embeddings in the vector store's metadata. Also, cosine_similarity scales O(n) with doc count; for >10k docs, consider approximate similarity search (faiss, hnswlib) instead of dense computation. Threshold choice shifts with domain: generic QA works at 0.65; legal/code search needs 0.75+. If filtering removes >80% of docs, your retriever is weak: improve Step 2 instead of loosening the threshold.
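As a rough sketch of the pre-computation advice above: if document vectors are normalized once at indexing time, each query needs only a single matrix-vector product. The random vectors below are synthetic stand-ins for stored embeddings, and `normalize_rows` and `filter_batch` are hypothetical helper names. This is still exact dense computation, not an approximate index.

```python
import numpy as np

def normalize_rows(matrix):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    m = np.asarray(matrix, dtype=np.float32)
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / np.clip(norms, 1e-12, None)

def filter_batch(query_vec, unit_doc_matrix, threshold=0.65):
    """Return (indices, scores) of docs at or above the threshold, best first."""
    q = np.asarray(query_vec, dtype=np.float32)
    q = q / max(np.linalg.norm(q), 1e-12)
    scores = unit_doc_matrix @ q               # one matrix-vector product
    idx = np.flatnonzero(scores >= threshold)  # docs that pass the threshold
    order = idx[np.argsort(-scores[idx])]      # sort survivors by score, descending
    return order, scores[order]

# Synthetic data: 1000 "documents" with 64-dimensional vectors
rng = np.random.default_rng(0)
docs = normalize_rows(rng.normal(size=(1000, 64)))  # done once, at indexing time
query = docs[42] + 0.05 * rng.normal(size=64)       # a query close to doc 42

kept_idx, kept_scores = filter_batch(query, docs, threshold=0.65)
print(kept_idx[:5], kept_scores[:5])
```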

Rollback plan

If the filter is too aggressive and removes valid docs: (1) lower the threshold in 0.05 steps and re-validate; (2) check that your retriever's embedding model matches the filter's model (a mismatch produces artificially low scores); (3) if a specific query type fails, add a domain-specific threshold override via a query classifier rather than loosening the threshold globally. If filtering fails entirely, disable it by setting the threshold to 0.0 (or -1.0 to be strict, since cosine similarity can be negative) and raising min_docs to match your reranker batch size.
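A per-query-type override, as in point (3), might look like the sketch below. Everything here is illustrative: `classify_query` is a toy keyword check standing in for a real classifier, and the override table reuses the 0.65 and 0.75 starting points from this guide.

```python
# Hypothetical per-query-type thresholds (values follow this guide's starting points)
THRESHOLD_OVERRIDES = {"general": 0.65, "code": 0.75}

def classify_query(query: str) -> str:
    """Toy stand-in for a real query classifier (an assumption, not a library API)."""
    code_markers = ("def ", "class ", "()", "import ", "error:")
    return "code" if any(marker in query for marker in code_markers) else "general"

def threshold_for(query: str, default: float = 0.65) -> float:
    """Look up the threshold for this query's type, falling back to the default."""
    return THRESHOLD_OVERRIDES.get(classify_query(query), default)

print(threshold_for("How do I call foo()?"))
print(threshold_for("What is the capital of France?"))
```

Keeping the overrides in one table makes a rollback a one-line change for the affected query type while leaving the global default alone.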

Debug symptoms

Reranker receives far fewer docs than expected; quality drops sharply

Diagnosis

Threshold is too high for the domain, or embedding model mismatch causing artificially low scores

Fix

Add logging: print min/max/mean similarity scores before filtering. If all scores are 0.4–0.5, model mismatch is likely. If scores are 0.8+ but still filtering, threshold is the issue. Lower threshold by 0.05–0.10 and re-test.
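A minimal version of that logging step could look like this. The logger name and the `similarity_stats` helper are hypothetical; call it with the similarity array just before the threshold is applied.

```python
import logging
import numpy as np

logger = logging.getLogger("embeddings_filter")  # hypothetical logger name

def similarity_stats(similarities, threshold):
    """Summarize pre-filter scores so threshold issues are visible in logs."""
    sims = np.asarray(similarities, dtype=float)
    if sims.size == 0:
        return {"n": 0}
    stats = {
        "n": int(sims.size),
        "min": float(sims.min()),
        "max": float(sims.max()),
        "mean": float(sims.mean()),
        "n_pass": int((sims >= threshold).sum()),
    }
    logger.info("pre-filter similarity stats: %s", stats)
    return stats

logging.basicConfig(level=logging.INFO)
stats = similarity_stats([0.41, 0.45, 0.48, 0.50], threshold=0.65)
```

In this example every score sits between 0.4 and 0.5 and nothing passes, which is the pattern that points to a model mismatch rather than a threshold that is slightly too high.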

Filter returns all documents unchanged; no filtering happening

Diagnosis

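Threshold is set too low (e.g., 0.0 or negative), or min_docs is at least as large as the retrieved set, so the fallback keeps everything

Fix

Verify the threshold is above 0.5 (a reasonable minimum) and that min_docs is smaller than the number of retrieved docs. Also check that documents passed to the filter have 'embedding' in metadata; if it is missing, the filter re-embeds them on the fly, which still filters but adds latency.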

Inconsistent filtering behavior across same query + docs

Diagnosis

Cached document embeddings were produced by a different model or model version than the live query embedding, or differ slightly between runs because of floating-point drift, so scores near the threshold flip

Fix

Recompute all document embeddings with the exact same model instance. Cache invalidation: set a model version tag in metadata and reject stale cache.
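The version-tag idea can be sketched as follows. The metadata key `embedding_model` and the tag value are assumptions made for this example; use whatever identifier your indexing job actually writes.

```python
# Assumed tag written at indexing time next to each cached vector
EXPECTED_MODEL_TAG = "text-embedding-3-small"

def cached_embedding_or_none(metadata: dict, expected_tag: str = EXPECTED_MODEL_TAG):
    """Return the cached vector only if it was produced by the expected model."""
    if metadata.get("embedding_model") != expected_tag:
        return None  # stale or untagged cache entry: the caller should re-embed
    return metadata.get("embedding")

fresh = {"embedding": [0.1, 0.2], "embedding_model": "text-embedding-3-small"}
stale = {"embedding": [0.3, 0.4], "embedding_model": "some-older-model"}
untagged = {"embedding": [0.5, 0.6]}

print(cached_embedding_or_none(fresh))     # cached vector is returned
print(cached_embedding_or_none(stale))     # None
print(cached_embedding_or_none(untagged))  # None
```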

Latency increased after adding filter (expected <10ms, actual 50–100ms)

Diagnosis

Filter is re-embedding documents on-the-fly instead of using cached embeddings

Fix

Ensure all retrieved documents have 'embedding' pre-computed in metadata. For production, disable the on-the-fly embedding path (the branch in filter_documents that embeds docs missing a cached 'embedding').

Production upgrade path

Tutorial version: static threshold; cached embeddings are reused only when present, with on-the-fly embedding as a fallback. Production version: (1) Pre-compute doc embeddings at indexing time, store them in vector DB metadata, and never re-embed during filtering. (2) Use an adaptive percentile threshold per query type (a classifier detects the question type and applies a different threshold). (3) Log the filter_similarity distribution per batch and alert if the mean drops >10% day-over-day (model drift detection). (4) A/B test threshold changes against a held-out query set with human relevance judgments before rolling out.
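The day-over-day alert in point (3) reduces to a small comparison of means. The sketch below assumes you already collect the logged filter_similarity values per day; the function names are hypothetical.

```python
from statistics import mean

def mean_similarity_drop(previous_day, current_day):
    """Relative drop in mean similarity from one day to the next (0.12 = 12%)."""
    prev, curr = mean(previous_day), mean(current_day)
    if prev <= 0:
        return 0.0  # no meaningful baseline to compare against
    return (prev - curr) / prev

def should_alert(previous_day, current_day, max_drop=0.10):
    """True when the mean similarity fell by more than max_drop."""
    return mean_similarity_drop(previous_day, current_day) > max_drop

yesterday = [0.70, 0.72, 0.74]
today = [0.60, 0.62, 0.64]
print(should_alert(yesterday, today))  # mean fell from 0.72 to 0.62
```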

Common gotcha

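Embedding model mismatch can be silent and catastrophic. If your retriever uses text-embedding-3-large but your filter uses text-embedding-3-small, similarity scores are computed across different embedding spaces. With default settings the vector sizes differ (3072 vs 1536), so cosine_similarity fails loudly with a dimension error. The dangerous case is when the sizes happen to match (for example, after shortening vectors with the dimensions parameter): the scores are then meaningless and the filter silently drops valid docs. Always assert that both the retriever and filter use the same embedding model and version.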
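A startup-time guard along these lines makes the mismatch fail fast. It is a sketch: how you obtain the model names and vector sizes depends on your retriever and vector store, so they are passed in as plain arguments here, and the function name is hypothetical.

```python
def assert_same_embedding_space(retriever_model: str, filter_model: str,
                                retriever_dim: int, filter_dim: int) -> None:
    """Raise early if retriever and filter do not share one embedding space."""
    if retriever_model != filter_model:
        raise ValueError(
            f"Embedding model mismatch: retriever={retriever_model!r}, filter={filter_model!r}"
        )
    if retriever_dim != filter_dim:
        raise ValueError(
            f"Embedding size mismatch: retriever={retriever_dim}, filter={filter_dim}"
        )

# Passes silently when both sides agree
assert_same_embedding_space("text-embedding-3-small", "text-embedding-3-small", 1536, 1536)

# Raises when the models differ, even if the sizes were made to match
try:
    assert_same_embedding_space("text-embedding-3-large", "text-embedding-3-small", 1536, 1536)
except ValueError as err:
    print(err)
```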

Experienced dev note

Most teams skip this step thinking rerankers are cheap. But rerankers cost 10–100x more per token than embeddings, and embedding filtering eliminates 40–70% of that processing. At scale (1000s of queries/day), this step pays for itself in days. Threshold tuning is domain-specific and time-consuming, but one good threshold pays off across every query. Invest in a tuning dataset of 50–100 queries with labeled relevance, then sweep thresholds from 0.50 to 0.80 to find the sweet spot (high recall, low reranker waste). Finally, monitor filter_similarity in production; if the distribution shifts, re-tune the threshold or re-embed the index.
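The threshold sweep can be sketched as below. The scores and relevance labels are made-up toy data for a single query; a real tuning run would pool (similarity, label) pairs across the 50–100 labeled queries. The helper names and the 0.95 recall target are assumptions.

```python
import numpy as np

def sweep_thresholds(similarities, relevant, thresholds):
    """For each threshold, report (threshold, recall of relevant docs, share of docs kept)."""
    sims = np.asarray(similarities, dtype=float)
    rel = np.asarray(relevant, dtype=bool)
    rows = []
    for t in thresholds:
        keep = sims >= t
        recall = (keep & rel).sum() / max(rel.sum(), 1)
        rows.append((t, float(recall), float(keep.mean())))
    return rows

def best_threshold(rows, min_recall=0.95):
    """Highest threshold that still meets the recall target, or None if none does."""
    ok = [t for t, recall, _ in rows if recall >= min_recall]
    return max(ok) if ok else None

# Toy labeled data (illustrative only): similarity score and relevance per doc
sims = [0.82, 0.74, 0.69, 0.66, 0.58, 0.56, 0.51, 0.43]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
grid = [0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80]

rows = sweep_thresholds(sims, labels, grid)
for t, recall, kept_share in rows:
    print(f"threshold={t:.2f}  recall={recall:.2f}  kept={kept_share:.0%}")
print("chosen:", best_threshold(rows))
```

On this toy data, recall stays at 1.0 up to a threshold of 0.55 and drops to 0.75 at 0.60, so 0.55 is the highest setting that meets the recall target.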

Check your understanding

You retrieve 20 documents for a query. The EmbeddingsFilter with threshold 0.70 keeps only 3 documents. Your reranker is now faster but your answer quality degraded. What are two possible root causes, and how would you diagnose each one differently?

Answer hint

Root cause 1: threshold is legitimately too high for this domain/query type (check min/max similarity scores in the batch). Root cause 2: embedding model mismatch or version drift (compare retriever model vs filter model in code/logs). Diagnosis differs: score distribution check vs model name/version check.
