
How to evaluate reranking quality

Quick answer
Evaluate reranking quality with metrics such as NDCG, MRR, and Precision@K, which capture both which items are relevant and how well they are ordered. Compute each metric by comparing the reranked list against ground-truth relevance labels.

PREREQUISITES

  • Python 3.8+
  • pip install numpy scikit-learn
  • Basic knowledge of ranking metrics

Setup

Install the required Python packages for the evaluation metrics, and set any environment variables (such as API keys) if your reranking model is behind an API.

bash
pip install numpy scikit-learn

Step by step

Use Python to compute common reranking metrics. NDCG compares the predicted ordering against the ideal ordering of the ground-truth labels, while Precision@K and MRR are computed on the true relevance of the items in the order the reranker returned them.

python
import numpy as np
from sklearn.metrics import ndcg_score

def precision_at_k(relevance, k):
    # Fraction of the top-k items that are relevant (relevance > 0 counts as relevant)
    relevance_at_k = np.asarray(relevance)[:k] > 0
    return np.sum(relevance_at_k) / k

def mean_reciprocal_rank(relevance):
    # Reciprocal rank of the first relevant item; 0 if nothing is relevant
    for i, rel in enumerate(relevance, start=1):
        if rel > 0:
            return 1 / i
    return 0.0

# Ground-truth graded relevance for 5 items, shape (1, n_items)
true_relevance = np.array([[3, 2, 3, 0, 1]])
# Scores the reranker assigned to the same 5 items (higher = ranked earlier)
predicted_scores = np.array([[0.9, 0.8, 0.7, 0.2, 0.1]])

# Reorder the true relevance by the predicted ranking
order = np.argsort(-predicted_scores[0])
ranked_relevance = true_relevance[0][order]

# NDCG@5 compares the predicted ordering to the ideal ordering
ndcg = ndcg_score(true_relevance, predicted_scores, k=5)
# Precision@3 and MRR are computed on the true relevance in predicted order
prec = precision_at_k(ranked_relevance, 3)
mrr = mean_reciprocal_rank(ranked_relevance)

print(f"NDCG@5: {ndcg:.4f}")
print(f"Precision@3: {prec:.4f}")
print(f"MRR: {mrr:.4f}")
output
NDCG@5: 0.9724
Precision@3: 1.0000
MRR: 1.0000
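To demystify what NDCG rewards, here is a minimal from-scratch sketch of the computation (linear gains, log2 position discounts, ties ignored). This is an illustrative reimplementation, not the scikit-learn code:

```python
import numpy as np

def dcg(relevance, k):
    # Discounted cumulative gain: linear gains, 1/log2(rank + 1) discount
    rel = np.asarray(relevance, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))
    return np.sum(rel / discounts)

def ndcg_at_k(true_rel, pred_scores, k):
    # Rank items by predicted score, then score the true relevance in that order
    order = np.argsort(-np.asarray(pred_scores))
    ranked = np.asarray(true_rel)[order]
    ideal = np.sort(np.asarray(true_rel))[::-1]
    return dcg(ranked, k) / dcg(ideal, k)

print(ndcg_at_k([3, 2, 3, 0, 1], [0.9, 0.8, 0.7, 0.2, 0.1], k=5))  # ≈ 0.9724
```

Note that sklearn's ndcg_score additionally averages gains across tied scores, so its result can differ from this sketch when predicted scores tie.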

Common variations

You can evaluate reranking quality asynchronously, or over streaming data by updating metric aggregates incrementally as queries arrive. Different models and APIs may also require adjusting input formats or using specialized evaluation libraries.

python
import asyncio
import numpy as np
from sklearn.metrics import ndcg_score

async def async_evaluate_reranking(true_rel, pred_scores):
    # Simulate an async step, e.g. awaiting a reranking API response
    await asyncio.sleep(0.1)
    return ndcg_score(true_rel, pred_scores, k=5)

async def main():
    true_relevance = np.array([[3, 2, 3, 0, 1]])
    predicted_scores = np.array([[0.9, 0.8, 0.7, 0.2, 0.1]])
    ndcg_async = await async_evaluate_reranking(true_relevance, predicted_scores)
    print(f"Async NDCG@5: {ndcg_async:.4f}")

asyncio.run(main())
output
Async NDCG@5: 0.9724
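For streaming evaluation, one common pattern (sketched here as an illustrative helper, not part of any specific library) is to keep a running mean of per-query scores as results arrive:

```python
import numpy as np
from sklearn.metrics import ndcg_score

class StreamingNDCG:
    # Maintains a running mean of per-query NDCG@k over a stream of queries
    def __init__(self, k=5):
        self.k = k
        self.total = 0.0
        self.count = 0

    def update(self, true_rel, pred_scores):
        # true_rel and pred_scores are shape (1, n_items) for one query
        self.total += ndcg_score(true_rel, pred_scores, k=self.k)
        self.count += 1

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0

tracker = StreamingNDCG(k=5)
tracker.update(np.array([[3, 2, 3, 0, 1]]), np.array([[0.9, 0.8, 0.7, 0.2, 0.1]]))
tracker.update(np.array([[1, 0, 2, 0, 0]]), np.array([[0.2, 0.1, 0.9, 0.0, 0.3]]))
print(f"Mean NDCG@5 over {tracker.count} queries: {tracker.mean:.4f}")
```

The same accumulator pattern works for Precision@K or MRR; only the per-query scoring function changes.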

Troubleshooting

  • If ndcg_score returns zero, verify that your relevance labels are correctly formatted and non-empty.
  • Ensure predicted rankings align in shape and order with ground truth relevance arrays.
  • For APIs, confirm your input data matches the expected schema to avoid silent failures.
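A small sanity check along these lines can catch shape and label problems before they produce confusing scores. This is an illustrative helper (the name and checks are this article's, not scikit-learn's):

```python
import numpy as np

def validate_ranking_inputs(true_rel, pred_scores):
    # Raise early on the common input problems listed above
    true_rel = np.asarray(true_rel)
    pred_scores = np.asarray(pred_scores)
    if true_rel.shape != pred_scores.shape:
        raise ValueError(f"Shape mismatch: {true_rel.shape} vs {pred_scores.shape}")
    if true_rel.ndim != 2:
        raise ValueError("Expected 2D arrays of shape (n_queries, n_items)")
    if not np.any(true_rel > 0):
        raise ValueError("No positive relevance labels; NDCG is not meaningful")
    return true_rel, pred_scores

validate_ranking_inputs([[3, 2, 3, 0, 1]], [[0.9, 0.8, 0.7, 0.2, 0.1]])  # passes silently
```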

Key Takeaways

  • Use NDCG, MRR, and Precision@K to quantify reranking quality effectively.
  • Compare predicted rankings against ground truth relevance labels for accurate evaluation.
  • Adapt evaluation methods for async or streaming scenarios when needed.
Verified 2026-04