
Multi-query generation

What you will learn

Transform a single user query into multiple perspectives to retrieve documents that standard single-query retrieval would miss.

Step 2 in the RAG pipeline: after the user submits a query, before vector retrieval.

Why this matters

A user's single query often uses wording that doesn't match the vocabulary or framing in your documents. Multi-query generation expands coverage, reducing false negatives where relevant documents exist but aren't retrieved. Skip this and you get incomplete context fed to the LLM, resulting in hallucinations or incomplete answers.

Explanation

What it does: Multi-query generation takes one user question and uses an LLM to generate 2–5 alternative phrasings of that same question from different angles. For example, 'How do I fix database latency?' becomes ['Database is slow, what's the cause?', 'Optimize query performance', 'Reduce database response time'].

How to do it: Use a dedicated LLM call (or the same LLM as your main chain) with a prompt that explicitly asks for query variations. LangChain's MultiQueryRetriever handles this automatically, or you can orchestrate it manually with an LCEL chain (prompt | llm | parser) plus custom parsing.

What to watch for: Generated queries must remain semantically similar to the original: you're expanding vocabulary coverage, not changing intent. Poor prompting produces nonsense variations that pollute retrieval. Also, each generated query hits your vector DB separately, so this costs more compute and latency than single-query retrieval.

Code

python

# pip install langchain langchain-openai langchain-chroma chromadb

import os

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

api_key = os.getenv('OPENAI_API_KEY')

llm = ChatOpenAI(model='gpt-4o-mini', temperature=0.3, api_key=api_key)

generation_prompt = PromptTemplate.from_template(
    'You are an AI that helps expand search queries. '
    'Given a user question, generate 3 alternative phrasings that capture the same intent '
    'but use different vocabulary and framing. Each phrasing should be a complete question.\n\n'
    'Original question: {query}\n\n'
    'Alternative phrasings (one per line):'
)

parser = StrOutputParser()
generation_chain = generation_prompt | llm | parser

user_query = 'How do I optimize database query performance?'

alternatives_text = generation_chain.invoke({'query': user_query})
alternative_queries = [line.strip() for line in alternatives_text.split('\n') if line.strip()]

all_queries = [user_query] + alternative_queries

print('Original query:')
print(f'  {user_query}')
print('\nGenerated alternatives:')
for i, alt in enumerate(alternative_queries, 1):
    print(f'  {i}. {alt}')

print('\nAll queries to be used for retrieval:')
for q in all_queries:
    print(f'  - {q}')

Output

Original query:
  How do I optimize database query performance?

Generated alternatives:
  1. What techniques can reduce database query execution time?
  2. How do I improve the speed of slow database queries?
  3. What are best practices for speeding up SQL queries?

All queries to be used for retrieval:
  - How do I optimize database query performance?
  - What techniques can reduce database query execution time?
  - How do I improve the speed of slow database queries?
  - What are best practices for speeding up SQL queries?
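
Models do not always return clean lines: they may number the alternatives, add bullets, or echo the original question. The sketch below is one way to harden the one-per-line parsing step. The helper name `parse_alternatives` and the `max_alternatives` parameter are hypothetical, not part of LangChain.

```python
import re

# Matches leading list markers such as "1.", "2)", "-", "*" or "•".
_LIST_MARKER = re.compile(r'^\s*(?:\d+[.)]|[-*•])\s+')

def parse_alternatives(raw_text, original_query, max_alternatives=3):
    """Return [original_query, *alternatives] with list markers stripped
    and case-insensitive duplicates removed."""
    queries = [original_query]
    seen = {original_query.strip().lower()}
    for line in raw_text.splitlines():
        cleaned = _LIST_MARKER.sub('', line).strip().strip('"\'')
        key = cleaned.lower()
        if not cleaned or key in seen:
            continue  # skip blank lines and repeats (including echoes of the original)
        seen.add(key)
        queries.append(cleaned)
        if len(queries) - 1 >= max_alternatives:
            break
    return queries

# Simulated LLM output with mixed list markers and an echoed original question.
raw = (
    '1. What techniques can reduce database query execution time?\n'
    '2) How do I improve the speed of slow database queries?\n'
    '\n'
    '- how do i optimize database query performance?\n'
    '* What are best practices for speeding up SQL queries?\n'
)
result = parse_alternatives(raw, 'How do I optimize database query performance?')
for q in result:
    print(q)
```

The original query is always placed first, so it cannot be dropped from the retrieval set by accident.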

Your options

Recommended

MultiQueryRetriever (built-in)

You want the simplest, most predictable behavior. Your LLM reliably returns one query per line (no function calling or structured output is required). You're not heavily customizing the variation strategy.

Pros

Handles prompt templating, parsing, and deduplication automatically. One line of code. Battle-tested in production.

Cons

Less visibility into what was generated: the queries are only surfaced through logging. The default prompt is generic, so injecting domain knowledge means passing your own prompt to from_llm. Harder to debug if generations go wrong.

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

# Assumes `vector_store` is an existing vector store (e.g. a Chroma instance)
# and `query` is the user's question string.
llm = ChatOpenAI(model='gpt-4o-mini', temperature=0)
retriever = MultiQueryRetriever.from_llm(
    retriever=vector_store.as_retriever(),
    llm=llm,
    include_original=True,  # also retrieve with the user's original wording
)
docs = retriever.invoke(query)

Manual orchestration with an LCEL chain

You need full control over the generation prompt. You want to inject examples, constraints, or domain-specific guidance. You're building a custom RAG system.

Pros

Complete transparency. You see every generated query. Can customize prompt for your domain. Easy to add validation or filtering logic.

Cons

More code. You handle parsing and error handling yourself. Harder to maintain than using a library.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model='gpt-4o-mini', temperature=0.3)
prompt = PromptTemplate.from_template(
    'Generate 3 alternative phrasings of this query:\nOriginal: {query}\nAlternatives (one per line):'
)
chain = prompt | llm | StrOutputParser()
alternatives = chain.invoke({'query': user_query})
generated_queries = [user_query] + [q.strip() for q in alternatives.split('\n') if q.strip()]

HyDE (Hypothetical Document Embeddings) instead

You want to generate documents that *answer* the query, not reformulate the query itself. You have a strong LLM and queries where answer-like text retrieves better than query reformulations.

Pros

Often retrieves more relevant documents than query variation. Semantically richer than rewording. Works well for factual, FAQ-style documents.

Cons

More complex. Generates hypothetical documents (which may contain factual errors) that must be embedded, adding an LLM call before retrieval. Requires a good LLM or you get nonsense pseudo-docs.

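# HyDE: generate a hypothetical answer, embed it, and retrieve by that embedding.
# Assumes `llm` is a chat model, `embedder` is an embeddings object
# (e.g. OpenAIEmbeddings), and `vector_store` is an existing vector store.
# A chat model returns a message object, so take .content to get the text.
hypothetical_doc = llm.invoke(f'Write a short document that answers: {user_query}').content
hypothetical_embedding = embedder.embed_query(hypothetical_doc)
docs = vector_store.similarity_search_by_vector(hypothetical_embedding, k=5)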

Validation step

Check that: (1) Each generated query is a valid, grammatical sentence. (2) Generated queries remain semantically close to the original: none introduce new topics or contradict the intent. (3) There are no exact duplicates: len(all_queries) == len(set(all_queries)). (4) Temperature is low enough (0.1–0.3) that outputs are consistent across runs. To test consistency, run the generation 2–3 times with the same input; you should get the same or very similar alternatives each time.
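
Checks (2) and (3) can be partly automated. The sketch below flags empty, duplicate, and possibly off-topic queries by comparing embeddings with cosine similarity. The function names, the `min_similarity` threshold of 0.5, and `toy_embed` are illustrative assumptions. In a real pipeline you would pass your embedding model's query-embedding function (for example `OpenAIEmbeddings().embed_query`) as `embed` and tune the threshold on your own data.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def validate_queries(original, alternatives, embed, min_similarity=0.5):
    """Return (query, problem) pairs; an empty list means every check passed."""
    problems = []
    seen = {original.strip().lower()}
    original_vec = embed(original)
    for query in alternatives:
        key = query.strip().lower()
        if not key:
            problems.append((query, 'empty'))
        elif key in seen:
            problems.append((query, 'duplicate'))
        elif cosine(original_vec, embed(query)) < min_similarity:
            problems.append((query, 'possible topic drift'))
        seen.add(key)
    return problems

def toy_embed(text):
    """Stand-in for a real embedding model: counts keyword hits for two topics."""
    text = text.lower()
    return [
        float(sum(word in text for word in ('database', 'query', 'queries', 'sql'))),
        float(sum(word in text for word in ('sky', 'color', 'weather'))),
    ]

original = 'How do I optimize database query performance?'
alternatives = [
    'How do I improve the speed of slow database queries?',
    'What color is the sky?',
    'how do i optimize database query performance?',
    '',
]
problems = validate_queries(original, alternatives, toy_embed)
for query, problem in problems:
    print(f'{problem}: {query!r}')
```

Grammar (check 1) and run-to-run consistency (check 4) still need a manual read or a separate comparison across runs.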

At scale

Each generated query hits your vector DB separately. With 4 queries × 100 QPS, you now have 400 retrieval calls/sec. Latency also grows: generating the alternatives is one extra LLM call per user query, which can add ~1–2 seconds before retrieval even starts. Embedding cost multiplies too: 4× the query-embedding operations. For large-scale production, use caching (same user query → reuse alternatives) or batch generation. Also, if your documents are poorly chunked or your embedding model is weak, query expansion retrieves more irrelevant documents, requiring downstream reranking (Step 4). Test retrieval quality (precision@k) before and after adding this step; sometimes it hurts if documents are already well-indexed.
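
A minimal sketch of the caching idea, assuming an in-process dictionary is acceptable. `make_cached_generator` and `fake_generate` are hypothetical names, and `fake_generate` stands in for the real LLM call. A production cache would more likely be bounded or external (an LRU, a TTL cache, or a shared store).

```python
def normalize(query):
    """Cache key: lowercase and collapse whitespace so trivial variants share an entry."""
    return ' '.join(query.lower().split())

def make_cached_generator(generate_alternatives):
    """Wrap a generation function so repeated queries skip the LLM call.

    Note: this cache is unbounded and lives only in this process.
    """
    cache = {}

    def cached_generate(query):
        key = normalize(query)
        if key not in cache:
            cache[key] = tuple(generate_alternatives(query))
        return list(cache[key])

    return cached_generate

calls = []

def fake_generate(query):
    """Stand-in for the LLM call; records each invocation."""
    calls.append(query)
    return [f'{query} (variant {i})' for i in range(1, 4)]

cached = make_cached_generator(fake_generate)
first = cached('How do I optimize database query performance?')
second = cached('  how do I optimize   database query performance? ')

print(len(calls))        # number of times the stand-in LLM call ran
print(first == second)   # both requests share one cache entry

# Retrieval load from the example above: queries per request × requests per second.
retrieval_calls_per_sec = 4 * 100
print(retrieval_calls_per_sec)
```

Caching removes the generation call for repeated queries, but each cached alternative still triggers its own vector DB lookup unless retrieval results are cached as well.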


Rollback plan

If generated queries are hallucinating or off-topic: (1) Lower temperature from 0.3 to 0.1 and re-run. (2) Add explicit constraints to the prompt: 'Do not introduce new topics or change the intent.' (3) Manually specify 2–3 exemplars in the prompt showing good variations. (4) If the issue persists, use the built-in MultiQueryRetriever with its default prompt instead of a custom one. (5) Last resort: disable multi-query and use HyDE instead, or single-query retrieval + reranking.
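
Steps (2) and (3) might look like the sketch below, which builds the prompt as a plain string with an explicit constraint and one exemplar. The helper name `build_generation_prompt` and the exemplar wording are illustrative assumptions; replace the exemplars with pairs drawn from your own domain.

```python
# Hypothetical exemplars: (original question, good alternative phrasings).
EXEMPLARS = [
    (
        'How do I fix database latency?',
        [
            'Why is my database slow to respond?',
            'How can I reduce database response time?',
        ],
    ),
]

def build_generation_prompt(query, n=3, exemplars=EXEMPLARS):
    """Build a constrained, few-shot prompt asking for n rephrasings of query."""
    lines = [
        f'Generate {n} alternative phrasings of the user question below.',
        'Keep the same intent. Do not introduce new topics or change the intent.',
        'Use different vocabulary and framing. Write one complete question per line.',
        '',
    ]
    for original, variations in exemplars:
        lines.append(f'Example question: {original}')
        lines.append('Example alternatives:')
        lines.extend(variations)
        lines.append('')
    lines.append(f'Original question: {query}')
    lines.append('Alternative phrasings (one per line):')
    return '\n'.join(lines)

prompt_text = build_generation_prompt('How do I optimize database query performance?')
print(prompt_text)
```

Because the result is already a complete string, it can be passed directly to the chat model (for example `llm.invoke(prompt_text)`) rather than through a prompt template.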

Debug symptoms

Generated queries are completely off-topic: 'How do I optimize database performance?' generates 'What color is the sky?'

Diagnosis

Temperature too high (0.5+) or prompt is too permissive without examples. LLM is being too creative and ignoring the instruction to preserve intent.

Fix

Reduce temperature to 0.1–0.2. Add a few explicit examples to the prompt showing good vs. bad alternatives. A larger model is rarely needed for this task: a small model like gpt-4o-mini at low temperature is usually enough, and it is cheaper.

Generated queries are near-duplicates of the original; no real variation

Diagnosis

Prompt is too constraining or LLM is playing it safe. Often happens with temperature 0 and a vague prompt.

Fix

Slightly increase temperature to 0.2–0.3. Rewrite prompt to explicitly ask for 'different vocabulary,' 'alternative angle,' or 'industry jargon.' Show examples of what good variation looks like.

Retrieval recall drops after adding multi-query; fewer relevant docs are found

Diagnosis

Generated queries are noisy. When the merged results are truncated to a fixed number of documents, hits from bad queries can crowd out good results in the union/aggregation step.

Fix

Add a reranking step (Step 4) after retrieval to filter. Or use a retriever that scores by relevance and deduplicates. Or reduce the number of generated alternatives from 3 to 2.

Production upgrade path

Tutorial version: Generate 3 queries, retrieve, concatenate results. Production version: (1) Generate 2–3 queries (not 5+). (2) Retrieve top-k per query. (3) Deduplicate by document ID (same doc from multiple queries = still 1 doc). (4) Apply reranking to sort by relevance. (5) Cache generated alternatives for common queries. (6) Monitor: track retrieval latency and precision@k to ensure multi-query is earning its cost. If precision drops, disable it.
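
Steps (2)–(4) can be sketched as follows. Deduplication by document ID comes from the list above. The ordering uses reciprocal rank fusion (RRF), which is a swapped-in, lightweight stand-in and not the reranking model the production version calls for. The function name, the `(doc_id, text)` result shape, and the `rrf_k=60` constant are assumptions.

```python
def fuse_results(results_per_query, rrf_k=60, top_n=5):
    """Merge ranked result lists from several queries.

    results_per_query: one ranked list per query, each item a (doc_id, text) tuple.
    A document returned by several queries is kept once, and its score is the
    sum of 1 / (rrf_k + rank) over every list it appears in.
    """
    scores = {}
    texts = {}
    for ranked in results_per_query:
        for rank, (doc_id, text) in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
            texts.setdefault(doc_id, text)  # keep the first copy of the text
    ordered = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    return [(doc_id, texts[doc_id]) for doc_id in ordered[:top_n]]

# Two queries whose results overlap on documents 'a' and 'b'.
results_q1 = [('a', 'Index tuning guide'), ('b', 'Query plan basics'), ('c', 'Caching layer')]
results_q2 = [('b', 'Query plan basics'), ('d', 'Connection pooling'), ('a', 'Index tuning guide')]

fused = fuse_results([results_q1, results_q2], top_n=3)
for doc_id, text in fused:
    print(doc_id, text)
```

Documents found by more than one query rise to the top, which is a reasonable default before a dedicated reranker is in place.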

Common gotcha

Developers often generate queries with too-high temperature (0.7+) to 'be creative,' which produces off-topic variations that pollute retrieval and hurt downstream ranking. The goal of multi-query is *vocabulary expansion*, not creativity. Keep temperature ≤ 0.3. Also, a common mistake is forgetting to include the original query in the final retrieval set: you generate alternatives and then only use those, discarding the original. This breaks cases where the user's exact wording happens to match documents well.

Experienced dev note

In real production systems, the marginal benefit of multi-query generation drops off after 2–3 alternatives. Generating 5+ queries adds latency and compute cost with diminishing returns on retrieval quality. Also, multi-query alone is often insufficient for hard queries: pair it with reranking (Cohere, proprietary, or LLM-based) to filter noise. Finally, if your vector embeddings are already good (you use a strong embedding model like text-embedding-3-large), the benefit of multi-query is smaller. Test on your own data; sometimes a single well-crafted reranking step beats multi-query for latency-critical systems.
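
To test on your own data, you need a small set of queries with known relevant document IDs. The sketch below compares single-query and multi-query retrieval with precision@k and recall@k. The document IDs and result lists are made-up illustrations, not measurements.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / k

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# Hypothetical labeled example: documents a, b and c are relevant to the query.
relevant = {'a', 'b', 'c'}
single_query_results = ['a', 'x', 'y', 'z', 'b']   # deduplicated, ranked IDs
multi_query_results = ['a', 'b', 'x', 'c', 'y']

for name, results in [('single', single_query_results), ('multi', multi_query_results)]:
    p = precision_at_k(results, relevant, k=3)
    r = recall_at_k(results, relevant, k=5)
    print(f'{name}: precision@3={p:.2f} recall@5={r:.2f}')
```

Average these metrics over a representative query set, and measure latency alongside them, before deciding whether multi-query earns its cost.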

Check your understanding

You generate 3 alternative queries from the original 'How do I optimize database performance?' Your alternatives are slightly off-topic variations. After retrieval, you're getting irrelevant documents mixed in. Is this a problem with multi-query generation itself, or what step comes next? What would you do to fix it?

Answer hint

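Slight drift in generated queries is expected and is not, by itself, a failure of multi-query generation. The problem is that without filtering, all retrieval results (good and bad) are passed downstream. You need reranking (the next step) to score and filter results by relevance before feeding them to the LLM. If the drift is more than slight, also tighten the prompt or lower the temperature.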
