How-to · Beginner · 3 min read

How to evaluate search quality

Quick answer
Evaluate search quality by measuring metrics like precision, recall, and mean reciprocal rank (MRR) using labeled relevance data. Use Python to compute these metrics on search results to quantify relevance and effectiveness.

Prerequisites

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0"

Setup

Install the openai Python SDK and set your API key as an environment variable for secure authentication.

bash
pip install "openai>=1.0"
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x

Step by step

Use Python to calculate search quality metrics such as precision, recall, and mean reciprocal rank (MRR) based on search results and ground truth relevance labels.

python
from typing import List

def precision_at_k(relevant: List[int], retrieved: List[int], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    retrieved_k = retrieved[:k]
    relevant_set = set(relevant)
    true_positives = sum(1 for doc_id in retrieved_k if doc_id in relevant_set)
    return true_positives / k if k > 0 else 0.0

def recall_at_k(relevant: List[int], retrieved: List[int], k: int) -> float:
    """Fraction of all relevant documents found in the top k."""
    retrieved_k = retrieved[:k]
    relevant_set = set(relevant)
    true_positives = sum(1 for doc_id in retrieved_k if doc_id in relevant_set)
    return true_positives / len(relevant) if relevant else 0.0

def mean_reciprocal_rank(relevant: List[int], retrieved: List[int]) -> float:
    """Reciprocal rank of the first relevant hit for a single query.
    (MRR proper averages this value over a set of queries.)"""
    relevant_set = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant_set:
            return 1.0 / rank
    return 0.0

# Example usage
relevant_docs = [2, 5, 7]  # Ground truth relevant document IDs
retrieved_docs = [1, 2, 3, 5, 8]  # Search engine results

print(f"Precision@3: {precision_at_k(relevant_docs, retrieved_docs, 3):.2f}")
print(f"Recall@5: {recall_at_k(relevant_docs, retrieved_docs, 5):.2f}")
print(f"MRR: {mean_reciprocal_rank(relevant_docs, retrieved_docs):.2f}")
output
Precision@3: 0.33
Recall@5: 0.67
MRR: 0.50
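
The function above returns the reciprocal rank for one query; mean reciprocal rank proper is that value averaged across a whole query set. A minimal sketch of the multi-query version, using made-up judgment and result data for two queries:

```python
from typing import Dict, List

def reciprocal_rank(relevant: List[int], retrieved: List[int]) -> float:
    # 1/rank of the first relevant document, 0.0 if none is retrieved.
    relevant_set = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant_set:
            return 1.0 / rank
    return 0.0

def mrr_over_queries(judgments: Dict[str, List[int]],
                     results: Dict[str, List[int]]) -> float:
    # Average the per-query reciprocal rank over every judged query.
    ranks = [reciprocal_rank(judgments[q], results.get(q, [])) for q in judgments]
    return sum(ranks) / len(ranks) if ranks else 0.0

# Illustrative two-query evaluation set
judgments = {"q1": [2, 5, 7], "q2": [4]}
results = {"q1": [1, 2, 3, 5, 8], "q2": [4, 9]}
print(f"MRR: {mrr_over_queries(judgments, results):.2f}")  # q1 -> 0.50, q2 -> 1.00
```
output
MRR: 0.75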

Common variations

You can extend the evaluation by using an LLM as a judge: ask a model to rate each result's relevance to the query, supplementing or bootstrapping human labels. Models such as gpt-4.1-mini or claude-3-5-haiku-20241022 can assist with relevance scoring, and streaming responses are an option when you need judgments in near real time. The example below uses the OpenAI SDK.

python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

query = "Benefits of renewable energy"
search_results = [
    "Renewable energy reduces carbon emissions.",
    "Fossil fuels are limited.",
    "Solar power is sustainable.",
]

messages = [
    {"role": "user", "content": (
        f"Rate the relevance of each result to the query '{query}' "
        "on a scale from 0.0 (irrelevant) to 1.0 (highly relevant).\n"
        "Results:\n" + "\n".join(search_results)
    )}
]

response = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=messages
)

print("AI relevance judgment:", response.choices[0].message.content)
output
AI relevance judgment: <model-generated ratings; exact text varies between runs>
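
Another common variation is nDCG (normalized discounted cumulative gain), which supports graded relevance labels and rewards ranking highly relevant documents earlier. A sketch using the same document IDs as above, with graded labels we made up for illustration:

```python
import math
from typing import Dict, List

def dcg_at_k(gains: List[float], k: int) -> float:
    # Discounted cumulative gain: gain at rank r is discounted by log2(r + 1).
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

def ndcg_at_k(relevance: Dict[int, float], retrieved: List[int], k: int) -> float:
    # Actual DCG of the ranking, normalized by the DCG of an ideal ordering.
    gains = [relevance.get(doc_id, 0.0) for doc_id in retrieved]
    ideal = sorted(relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

# Graded labels: 2.0 = highly relevant, 1.0 = somewhat relevant
relevance = {2: 2.0, 5: 1.0, 7: 1.0}
retrieved = [1, 2, 3, 5, 8]
print(f"nDCG@5: {ndcg_at_k(relevance, retrieved, 5):.2f}")
```
output
nDCG@5: 0.54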

Troubleshooting

  • If precision or recall is zero, verify your relevance labels and retrieved document IDs match in format.
  • For API errors, ensure your OPENAI_API_KEY is set correctly and your network allows outbound HTTPS.
  • If AI relevance scoring is inconsistent, try adjusting the prompt or using a different model.
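
The first bullet's failure mode is worth seeing concretely: labels loaded from a CSV often arrive as strings ("2"), which never match integer document IDs (2) in set-membership checks, silently driving every metric to zero. A quick check using a normalization helper of our own naming:

```python
from typing import Iterable, List

def normalize_ids(ids: Iterable) -> List[str]:
    # Coerce every ID to str so "2" and 2 compare equal.
    return [str(i) for i in ids]

relevant = ["2", "5", "7"]   # labels read from a CSV come back as strings
retrieved = [1, 2, 3, 5, 8]  # the engine returns integers

# Raw comparison finds no overlap at all:
print(set(relevant) & set(retrieved))  # set()
# After normalizing both sides, the true overlap reappears ({'2', '5'}, order may vary):
print(set(normalize_ids(relevant)) & set(normalize_ids(retrieved)))
```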

Key takeaways

  • Use precision, recall, and MRR metrics to quantitatively evaluate search relevance.
  • Leverage LLMs such as gpt-4.1-mini as judges for automated relevance scoring.
  • Ensure consistent document ID formats between ground truth and retrieved results.
  • Set environment variables securely for API keys to avoid authentication errors.
Verified 2026-04 · gpt-4.1-mini, claude-3-5-haiku-20241022