Measuring Search Quality: A Practical Guide
NDCG, MRR, Precision@K—what matters and how to measure it in production.
"Our search is bad" is not actionable. "Our NDCG@10 is 0.45, and we need it above 0.7 to match competitor benchmarks" is a goal you can work toward. Metrics transform search optimization from guesswork into engineering.
The Core Metrics
Precision@K
The simplest metric: what fraction of the top K results are relevant?
When to use: Quick sanity checks, comparing retrieval systems.
Limitation: Treats all positions equally. A relevant result at position 1 counts the same as one at position 10.
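A minimal sketch, assuming binary relevance labels and a ranked list of document IDs:

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k results that are relevant."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

# precision_at_k(["d3", "d7", "d1"], {"d1", "d9"}, k=3) -> 0.33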
Recall@K
What fraction of all relevant documents appear in the top K?
When to use: When missing relevant documents has high cost (legal discovery, medical research).
Limitation: Ignores ordering within the top K. For K = 100, finding all relevant docs at positions 51-100 scores the same as finding them at positions 1-50.
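A sketch under the same assumptions as above, measuring coverage of the full relevant set:

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)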
Mean Reciprocal Rank (MRR)
How quickly do users find the first relevant result?
When to use: When users typically need just one good result (navigation, fact lookup).
Example: If the first relevant result is at position 3, the reciprocal rank is 1/3 ≈ 0.33.
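A sketch over a batch of queries, assuming one ranked list and one set of relevant IDs per query:

def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant result across all queries."""
    total = 0.0
    for ranked_ids, relevant_ids in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                total += 1.0 / rank  # e.g. first hit at position 3 contributes 1/3
                break                # only the first relevant result counts
    return total / len(ranked_lists)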
Normalized Discounted Cumulative Gain (NDCG)
The gold standard for ranked retrieval. It rewards relevant documents near the top of the list and discounts lower positions logarithmically.
When to use: Almost always. It's the most comprehensive ranking metric.
Why it works: Position 1 matters more than position 2, which matters more than position 3. Highly relevant documents contribute more than marginally relevant ones.
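A sketch using the common exponential-gain formulation, assuming graded labels (e.g. 0-3) listed in ranked order:

import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain with a log2 position discount."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# ndcg_at_k([3, 2, 0, 1], k=4) scores the actual order against the ideal [3, 2, 1, 0]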
Building an Evaluation Dataset
Metrics are only as good as your labels. Here's how to build a quality evaluation set:
Option 1: Click Data (Implicit Feedback)
# Extract query-document pairs from clicks
evaluation_pairs = []
for session in click_logs:
    query = session.query
    clicked_docs = session.clicked_documents
    # Assume clicked docs are relevant (with noise)
    for doc in clicked_docs:
        evaluation_pairs.append((query, doc, 1))  # relevance = 1

Pros: Abundant, reflects real user behavior
Cons: Noisy (users click irrelevant results), biased toward what's shown
Option 2: Human Labels (Explicit Feedback)
# Label relevance on a graded scale
RELEVANCE_SCALE = {
    0: "Not relevant",
    1: "Marginally relevant",
    2: "Relevant",
    3: "Highly relevant",
}

Pros: High quality, supports graded relevance
Cons: Expensive, time-consuming, requires domain expertise
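If a metric needs binary judgments (Precision@K, Recall@K, MRR) but your labels are graded, a common convention is to threshold the scale; the cutoff below is an assumption, not a standard:

def to_binary(graded_label, threshold=2):
    """Treat 'Relevant' (2) and 'Highly relevant' (3) as relevant, lower grades as not."""
    return 1 if graded_label >= threshold else 0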
Option 3: LLM-as-Judge
prompt = f"""
Rate the relevance of this document to the query.
Query: {query}
Document: {document}
Relevance (0-3):
"""
relevance = llm.generate(prompt)

Pros: Scalable, consistent, cheap
Cons: May miss domain nuances, hallucination risk
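The judge returns free text, so you will want to coerce it to a score and handle malformed replies; a minimal sketch (parse_relevance is an illustrative helper, not part of any LLM SDK):

def parse_relevance(raw_output, default=0):
    """Pull the first 0-3 digit out of the model's reply; fall back if absent."""
    for char in raw_output.strip():
        if char in "0123":
            return int(char)
    return default  # treat unparseable judgments as not relevant, or re-ask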
Production Monitoring
Offline evaluation is necessary but not sufficient. You need online metrics too:
Click-Through Rate (CTR)
ctr = clicks / impressions

Track CTR at different positions. A healthy search has high CTR at position 1, declining gradually.
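A sketch of position-wise CTR, assuming impression logs that record each result's position and whether it was clicked (field names are assumptions):

from collections import defaultdict

def ctr_by_position(impression_logs):
    """Compute clicks / impressions separately for each result position."""
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    for record in impression_logs:  # e.g. {"position": 1, "clicked": True}
        impressions[record["position"]] += 1
        if record["clicked"]:
            clicks[record["position"]] += 1
    return {pos: clicks[pos] / impressions[pos] for pos in sorted(impressions)}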
Zero-Result Rate
zero_result_rate = queries_with_no_results / total_queries

Target: < 5%. A high zero-result rate means your index is incomplete or query understanding is failing.
Time to First Click
How long until users click something? Longer times suggest poor ranking.
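One way to compute it, assuming sessions that log a query timestamp and click timestamps in seconds (field names are assumptions):

import statistics

def median_time_to_first_click(sessions):
    """Median seconds between query submission and the first click, per session."""
    deltas = [
        min(s["click_times"]) - s["query_time"]
        for s in sessions
        if s["click_times"]  # ignore sessions with no click at all
    ]
    return statistics.median(deltas) if deltas else None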
Reformulation Rate
Do users rephrase their query after seeing results? High reformulation = poor initial results.
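A rough heuristic, assuming queries grouped by session in time order; the 60-second window and field names are assumptions:

def reformulation_rate(sessions, window_seconds=60):
    """Fraction of queries followed, within the window and without a click,
    by another query in the same session."""
    reformulated, total = 0, 0
    for queries in sessions:  # each query: {"time": ..., "clicked": bool}
        total += len(queries)
        for current, nxt in zip(queries, queries[1:]):
            if not current["clicked"] and nxt["time"] - current["time"] <= window_seconds:
                reformulated += 1
    return reformulated / total if total else 0.0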
A/B Testing Search Changes
Never ship search changes without A/B testing. Search is too important and too easy to break.
# Minimum detectable effect calculation
from scipy import stats

def sample_size_for_ab_test(baseline_conversion, mde, power=0.8, alpha=0.05):
    """
    Approximate sample size needed per variant for a two-proportion test.

    baseline_conversion: current metric value (e.g., 0.20 for a 20% CTR)
    mde: minimum detectable effect, relative (e.g., 0.05 for a 5% lift)
    """
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    p1 = baseline_conversion
    p2 = baseline_conversion * (1 + mde)
    n = (2 * ((z_alpha + z_beta) ** 2) * p1 * (1 - p1)) / ((p2 - p1) ** 2)
    return int(n)

Putting It Together
A complete evaluation strategy:
- Baseline: Measure current NDCG, MRR, Precision, Recall
- Offline evaluation: Test changes on a labeled dataset (a sketch follows this list)
- Online A/B test: Validate with real users
- Monitor: Track metrics in production continuously
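A minimal sketch of the offline half of that loop, reusing the ndcg_at_k helper from earlier; search_fn and the data shapes are assumptions:

def evaluate_offline(search_fn, labeled_queries, k=10):
    """Run each labeled query through the search system and average NDCG@k."""
    scores = []
    for query, relevance_by_doc in labeled_queries.items():  # {query: {doc_id: grade}}
        ranked_ids = search_fn(query)                        # your retrieval system
        relevances = [relevance_by_doc.get(doc_id, 0) for doc_id in ranked_ids]
        scores.append(ndcg_at_k(relevances, k))
    return sum(scores) / len(scores) if scores else 0.0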
Without measurement, you can't improve. With measurement, every change is an experiment with a clear outcome.
Need help measuring search quality? Start with a baseline audit.
Book a Discovery Call