
How to evaluate reranking quality

Quick answer
Evaluate reranking quality with metrics such as NDCG, MRR, and Precision@K, which capture both which items are relevant and how well they are ordered. Compute each metric by comparing the reranked list against ground-truth relevance labels.

PREREQUISITES

  • Python 3.8+
  • pip install numpy scikit-learn
  • Basic knowledge of ranking metrics

Setup

Install the required Python packages for the evaluation metrics, and set any environment variables (such as API keys) if your reranking model is behind an API.

bash
pip install numpy scikit-learn

Step by step

Use Python to compute common reranking metrics. NDCG compares the predicted ordering against the ideal ordering of the ground-truth labels, while Precision@K and MRR are computed on the true relevance of the items in the order the reranker returned them.

python
import numpy as np
from sklearn.metrics import ndcg_score

def precision_at_k(relevance, k):
    # Fraction of the top-k items that are relevant (relevance > 0 counts as relevant)
    relevance_at_k = np.asarray(relevance)[:k] > 0
    return np.sum(relevance_at_k) / k

def mean_reciprocal_rank(relevance):
    # Reciprocal rank of the first relevant item; 0 if nothing is relevant
    for i, rel in enumerate(relevance, start=1):
        if rel > 0:
            return 1 / i
    return 0.0

# Ground-truth graded relevance for 5 items, shape (1, n_items)
true_relevance = np.array([[3, 2, 3, 0, 1]])
# Scores the reranker assigned to the same 5 items (higher = ranked earlier)
predicted_scores = np.array([[0.9, 0.8, 0.7, 0.2, 0.1]])

# Reorder the true relevance by the predicted ranking
order = np.argsort(-predicted_scores[0])
ranked_relevance = true_relevance[0][order]

# NDCG@5 compares the predicted ordering to the ideal ordering
ndcg = ndcg_score(true_relevance, predicted_scores, k=5)
# Precision@3 and MRR are computed on the true relevance in predicted order
prec = precision_at_k(ranked_relevance, 3)
mrr = mean_reciprocal_rank(ranked_relevance)

print(f"NDCG@5: {ndcg:.4f}")
print(f"Precision@3: {prec:.4f}")
print(f"MRR: {mrr:.4f}")
output
NDCG@5: 0.9724
Precision@3: 1.0000
MRR: 1.0000
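To demystify what NDCG rewards, here is a minimal from-scratch sketch of the computation (linear gains, log2 position discounts, ties ignored). This is an illustrative reimplementation, not the scikit-learn code:

```python
import numpy as np

def dcg(relevance, k):
    # Discounted cumulative gain: linear gains, 1/log2(rank + 1) discount
    rel = np.asarray(relevance, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))
    return np.sum(rel / discounts)

def ndcg_at_k(true_rel, pred_scores, k):
    # Rank items by predicted score, then score the true relevance in that order
    order = np.argsort(-np.asarray(pred_scores))
    ranked = np.asarray(true_rel)[order]
    ideal = np.sort(np.asarray(true_rel))[::-1]
    return dcg(ranked, k) / dcg(ideal, k)

print(ndcg_at_k([3, 2, 3, 0, 1], [0.9, 0.8, 0.7, 0.2, 0.1], k=5))  # ≈ 0.9724
```

Note that sklearn's ndcg_score additionally averages gains across tied scores, so its result can differ from this sketch when predicted scores tie.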

Common variations

You can evaluate reranking quality asynchronously, or over streaming data by updating metric aggregates incrementally as queries arrive. Different models and APIs may also require adjusting input formats or using specialized evaluation libraries.

python
import asyncio
import numpy as np
from sklearn.metrics import ndcg_score

async def async_evaluate_reranking(true_rel, pred_scores):
    # Simulate an async step, e.g. awaiting a reranking API response
    await asyncio.sleep(0.1)
    return ndcg_score(true_rel, pred_scores, k=5)

async def main():
    true_relevance = np.array([[3, 2, 3, 0, 1]])
    predicted_scores = np.array([[0.9, 0.8, 0.7, 0.2, 0.1]])
    ndcg_async = await async_evaluate_reranking(true_relevance, predicted_scores)
    print(f"Async NDCG@5: {ndcg_async:.4f}")

asyncio.run(main())
output
Async NDCG@5: 0.9724
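For streaming evaluation, one common pattern (sketched here as an illustrative helper, not part of any specific library) is to keep a running mean of per-query scores as results arrive:

```python
import numpy as np
from sklearn.metrics import ndcg_score

class StreamingNDCG:
    # Maintains a running mean of per-query NDCG@k over a stream of queries
    def __init__(self, k=5):
        self.k = k
        self.total = 0.0
        self.count = 0

    def update(self, true_rel, pred_scores):
        # true_rel and pred_scores are shape (1, n_items) for one query
        self.total += ndcg_score(true_rel, pred_scores, k=self.k)
        self.count += 1

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0

tracker = StreamingNDCG(k=5)
tracker.update(np.array([[3, 2, 3, 0, 1]]), np.array([[0.9, 0.8, 0.7, 0.2, 0.1]]))
tracker.update(np.array([[1, 0, 2, 0, 0]]), np.array([[0.2, 0.1, 0.9, 0.0, 0.3]]))
print(f"Mean NDCG@5 over {tracker.count} queries: {tracker.mean:.4f}")
```

The same accumulator pattern works for Precision@K or MRR; only the per-query scoring function changes.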

Troubleshooting

  • If ndcg_score returns zero, verify that your relevance labels are correctly formatted and non-empty.
  • Ensure predicted rankings align in shape and order with ground truth relevance arrays.
  • For APIs, confirm your input data matches the expected schema to avoid silent failures.
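A small sanity check along these lines can catch shape and label problems before they produce confusing scores. This is an illustrative helper (the name and checks are this article's, not scikit-learn's):

```python
import numpy as np

def validate_ranking_inputs(true_rel, pred_scores):
    # Raise early on the common input problems listed above
    true_rel = np.asarray(true_rel)
    pred_scores = np.asarray(pred_scores)
    if true_rel.shape != pred_scores.shape:
        raise ValueError(f"Shape mismatch: {true_rel.shape} vs {pred_scores.shape}")
    if true_rel.ndim != 2:
        raise ValueError("Expected 2D arrays of shape (n_queries, n_items)")
    if not np.any(true_rel > 0):
        raise ValueError("No positive relevance labels; NDCG is not meaningful")
    return true_rel, pred_scores

validate_ranking_inputs([[3, 2, 3, 0, 1]], [[0.9, 0.8, 0.7, 0.2, 0.1]])  # passes silently
```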

Key Takeaways

  • Use NDCG, MRR, and Precision@K to quantify reranking quality effectively.
  • Compare predicted rankings against ground truth relevance labels for accurate evaluation.
  • Adapt evaluation methods for async or streaming scenarios when needed.
Verified 2026-04