How to evaluate search quality
Quick answer
Evaluate search quality by measuring metrics like
precision, recall, and mean reciprocal rank (MRR) against labeled relevance data. Use Python to compute these metrics on search results to quantify relevance and effectiveness.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup
Install the openai Python SDK and set your API key as an environment variable for secure authentication.
pip install "openai>=1.0"

Output:
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x

(Quote the requirement specifier so the shell does not interpret >= as a redirect.)
Step by step
Use Python to calculate search quality metrics such as precision, recall, and mean reciprocal rank (MRR) based on search results and ground truth relevance labels.
from typing import List

def precision_at_k(relevant: List[int], retrieved: List[int], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    retrieved_k = retrieved[:k]
    relevant_set = set(relevant)
    true_positives = sum(1 for doc_id in retrieved_k if doc_id in relevant_set)
    return true_positives / k

def recall_at_k(relevant: List[int], retrieved: List[int], k: int) -> float:
    """Fraction of all relevant documents found in the top-k results."""
    retrieved_k = retrieved[:k]
    relevant_set = set(relevant)
    true_positives = sum(1 for doc_id in retrieved_k if doc_id in relevant_set)
    return true_positives / len(relevant) if relevant else 0.0

def mean_reciprocal_rank(relevant: List[int], retrieved: List[int]) -> float:
    """Reciprocal rank of the first relevant result (0.0 if none is found)."""
    relevant_set = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant_set:
            return 1.0 / rank
    return 0.0
# Example usage
relevant_docs = [2, 5, 7] # Ground truth relevant document IDs
retrieved_docs = [1, 2, 3, 5, 8] # Search engine results
print(f"Precision@3: {precision_at_k(relevant_docs, retrieved_docs, 3):.2f}")
print(f"Recall@5: {recall_at_k(relevant_docs, retrieved_docs, 5):.2f}")
print(f"MRR: {mean_reciprocal_rank(relevant_docs, retrieved_docs):.2f}")

Output:
Precision@3: 0.33
Recall@5: 0.67
MRR: 0.50
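The mean_reciprocal_rank function above scores a single query; the "mean" in MRR comes from averaging reciprocal ranks across a whole query set. A minimal sketch of that aggregation (the query labels and results below are illustrative, not from a real engine):

```python
from typing import Dict, List

def reciprocal_rank(relevant: List[int], retrieved: List[int]) -> float:
    # Reciprocal rank of the first relevant result for one query
    relevant_set = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant_set:
            return 1.0 / rank
    return 0.0

def mrr(labels: Dict[str, List[int]], results: Dict[str, List[int]]) -> float:
    # Average the per-query reciprocal ranks over the full query set
    if not labels:
        return 0.0
    return sum(
        reciprocal_rank(rel, results.get(query, []))
        for query, rel in labels.items()
    ) / len(labels)

# Hypothetical ground truth and retrieved results for two queries
labels = {"q1": [2, 5, 7], "q2": [4]}
results = {"q1": [1, 2, 3], "q2": [4, 9]}
print(f"MRR: {mrr(labels, results):.2f}")  # (1/2 + 1/1) / 2 = 0.75
```

Queries with no relevant result retrieved contribute 0.0, which pulls the average down and makes MRR sensitive to complete misses.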
Common variations
You can extend evaluation by using AI APIs to generate relevance judgments or summaries for search results, or use streaming APIs for real-time evaluation. Models such as gpt-4.1-mini or claude-3-5-haiku-20241022 can assist with relevance scoring (note that Claude models require the Anthropic SDK rather than the OpenAI client shown below).
from openai import OpenAI
import os
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
query = "Benefits of renewable energy"
search_results = [
"Renewable energy reduces carbon emissions.",
"Fossil fuels are limited.",
"Solar power is sustainable.",
]
messages = [
{"role": "user", "content": f"Rate the relevance of these results to the query: '{query}'.\nResults:\n" + "\n".join(search_results)}
]
response = client.chat.completions.create(
model="gpt-4.1-mini",
messages=messages
)
print("AI relevance score:", response.choices[0].message.content)

Example output (model responses vary between runs):
AI relevance score: 1.0 (Highly relevant)
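Free-form model replies are hard to aggregate. One option is to instruct the model to return one numeric score per line (e.g. "1: 0.9") and parse the reply. A sketch, assuming that reply format; the parse_scores helper and the example reply string are hypothetical:

```python
import re
from typing import List

def parse_scores(model_output: str) -> List[float]:
    # Extract one numeric relevance score per line of the model's reply.
    # Assumes the prompt asked for lines like "1: 0.9" -- adjust the regex
    # to whatever format you instruct the model to use.
    scores = []
    for line in model_output.splitlines():
        match = re.search(r"(\d+(?:\.\d+)?)\s*$", line.strip())
        if match:
            scores.append(float(match.group(1)))
    return scores

reply = "1: 0.9\n2: 0.4\n3: 0.8"  # example reply in the requested format
print(parse_scores(reply))  # [0.9, 0.4, 0.8]
```

Requesting a rigid output format and validating it (e.g. checking the score count matches the number of results) makes AI-assisted judgments far easier to feed into the metric functions above.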
Troubleshooting
- If precision or recall is zero, verify your relevance labels and retrieved document IDs match in format.
- For API errors, ensure your OPENAI_API_KEY is set correctly and your network allows outbound HTTPS.
- If AI relevance scoring is inconsistent, try adjusting the prompt or using a different model.
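A common cause of zero precision and recall is mixed ID types, e.g. labels loaded from a CSV as strings while the engine returns integers, so "5" never matches 5. A small normalization pass before scoring avoids this (the helper below is a sketch, not part of the metric code above):

```python
from typing import Iterable, List

def normalize_ids(ids: Iterable) -> List[int]:
    # Coerce document IDs to int so "5" and 5 compare equal
    return [int(doc_id) for doc_id in ids]

relevant = normalize_ids(["2", "5", "7"])   # labels read from CSV as strings
retrieved = normalize_ids([1, 2, 3, 5, 8])  # engine returns ints
print(len(set(relevant) & set(retrieved)))  # 2 relevant docs retrieved
```

If your IDs are not numeric, normalize to stripped lowercase strings instead; the point is to apply one canonical form to both sides before computing any metric.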
Key Takeaways
- Use precision, recall, and MRR metrics to quantitatively evaluate search relevance.
- Leverage AI APIs like gpt-4.1-mini to assist in automated relevance scoring.
- Ensure consistent document ID formats between ground truth and retrieved results.
- Set environment variables securely for API keys to avoid authentication errors.