How to · Beginner · 3 min read

NDCG metric for reranking evaluation

Quick answer
The NDCG (Normalized Discounted Cumulative Gain) metric evaluates the quality of reranking by measuring the usefulness of ranked results based on their positions and relevance scores. It normalizes the cumulative gain by the ideal ranking to provide a score between 0 and 1, where higher is better.
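Conceptually, DCG sums each result's relevance discounted by the log of its rank, and NDCG divides that by the DCG of the ideal (descending-relevance) ordering. A minimal hand-rolled sketch for intuition only (the function names `dcg` and `ndcg` here are illustrative, not a library API; in practice you would use scikit-learn as shown later):

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: each relevance divided by
    # log2(rank + 1), with ranks starting at 1
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    # Normalize by the DCG of the ideal (descending-relevance) ordering
    ideal = sorted(ranked_relevances, reverse=True)
    return dcg(ranked_relevances) / dcg(ideal)

# Relevances of six documents in the order a reranker returned them
print(f"{ndcg([3, 2, 3, 2, 1, 0]):.4f}")
```

This uses linear gain (the relevance value itself), which is also scikit-learn's default; some formulations use an exponential gain of 2^rel - 1 instead, which weights highly relevant documents more heavily.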

PREREQUISITES

  • Python 3.8+
  • pip install numpy scikit-learn

Setup

Install the required Python packages numpy and scikit-learn for numerical operations and metric calculation.

bash
pip install numpy scikit-learn

Step by step

This example demonstrates how to compute the NDCG metric for reranking evaluation using scikit-learn's sklearn.metrics.ndcg_score. It assumes you have ground-truth relevance scores for a list of documents and the scores your reranking model assigned to them.

python
import numpy as np
from sklearn.metrics import ndcg_score

# Ground truth relevance scores for documents (higher is more relevant)
true_relevance = np.asarray([[3, 2, 3, 0, 1, 2]])

# Predicted scores from your reranking model
predicted_scores = np.asarray([[0.9, 0.8, 0.75, 0.1, 0.4, 0.7]])

# Compute NDCG@6 (all documents)
ndcg = ndcg_score(true_relevance, predicted_scores, k=6)
print(f"NDCG@6: {ndcg:.4f}")
output
NDCG@6: 0.9817

Common variations

  • Adjust k in ndcg_score to evaluate top-k results (e.g., k=3 for NDCG@3).
  • Use batch inputs to evaluate multiple queries at once by passing arrays with shape (n_queries, n_docs).
  • Calculate DCG or IDCG separately if custom weighting or analysis is needed.
python
import numpy as np
from sklearn.metrics import ndcg_score

# Multiple queries example
true_relevance = np.array([
    [3, 2, 3, 0, 1, 2],
    [2, 1, 2, 3, 0, 1]
])
predicted_scores = np.array([
    [0.9, 0.8, 0.75, 0.1, 0.4, 0.7],
    [0.5, 0.6, 0.4, 0.9, 0.2, 0.3]
])

ndcg_batch = ndcg_score(true_relevance, predicted_scores, k=3)
print(f"Batch NDCG@3: {ndcg_batch:.4f}")
output
Batch NDCG@3: 0.9289
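For the third variation, scikit-learn also exposes sklearn.metrics.dcg_score, so you can compute DCG and the ideal DCG separately; scoring the true relevances against themselves yields IDCG. A short sketch using the single-query data from the main example:

```python
import numpy as np
from sklearn.metrics import dcg_score

true_relevance = np.asarray([[3, 2, 3, 0, 1, 2]])
predicted_scores = np.asarray([[0.9, 0.8, 0.75, 0.1, 0.4, 0.7]])

# DCG of the ranking induced by the predicted scores
dcg = dcg_score(true_relevance, predicted_scores, k=6)

# Ideal DCG: rank the documents by their true relevance
idcg = dcg_score(true_relevance, true_relevance, k=6)

print(f"DCG: {dcg:.4f}, IDCG: {idcg:.4f}, NDCG: {dcg / idcg:.4f}")
```

Having the two terms separately is useful when you want to report raw DCG alongside the normalized score, or to experiment with a custom discount via the log_base parameter.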

Troubleshooting

  • If ndcg_score returns 0, check whether the query's true_relevance values are all zero; scikit-learn scores such queries as 0 by convention. Negative relevance values raise a ValueError.
  • Ensure predicted scores are aligned with the documents' order in true_relevance.
  • A k larger than the number of documents is simply truncated; pass k=None (the default) to score all documents.

Key Takeaways

  • Use sklearn.metrics.ndcg_score to compute NDCG easily for reranking evaluation.
  • Adjust the k parameter to focus on top-k ranked results for more relevant evaluation.
  • Provide true relevance and predicted scores as arrays aligned by document order for accurate scoring.
Verified 2026-04