Measuring Search Quality: A Practical Guide
NDCG, MRR, Precision@K—what matters and how to measure it in production.
"Our search is bad" is not actionable. "Our NDCG@10 is 0.45, and we need it above 0.7 to match competitor benchmarks" is a goal you can work toward. Metrics transform search optimization from guesswork into engineering.
The Core Metrics
Precision@K
The simplest metric: what fraction of the top K results are relevant?
When to use: Quick sanity checks, comparing retrieval systems.
Limitation: Treats all positions equally. A relevant result at position 1 counts the same as one at position 10.
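A minimal sketch, assuming binary relevance labels and a ranked list of document IDs:

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k results that are relevant."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

# precision_at_k(["d3", "d7", "d1"], {"d1", "d9"}, k=3) -> 0.33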
Recall@K
What fraction of all relevant documents appear in the top K?
When to use: When missing relevant documents has high cost (legal discovery, medical research).
Limitation: Ignores ordering within the top K. For K = 100, finding all relevant docs at positions 51-100 scores the same as finding them at positions 1-50.
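A sketch under the same assumptions as above, measuring coverage of the full relevant set:

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)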
Mean Reciprocal Rank (MRR)
How quickly do users find the first relevant result?
When to use: When users typically need just one good result (navigation, fact lookup).
Example: If the first relevant result is at position 3, the reciprocal rank is 1/3 ≈ 0.33.
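A sketch over a batch of queries, assuming one ranked list and one set of relevant IDs per query:

def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant result across all queries."""
    total = 0.0
    for ranked_ids, relevant_ids in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                total += 1.0 / rank  # e.g. first hit at position 3 contributes 1/3
                break                # only the first relevant result counts
    return total / len(ranked_lists)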
Normalized Discounted Cumulative Gain (NDCG)
The gold standard for ranked retrieval. It rewards relevant documents near the top of the list and discounts lower positions logarithmically.
When to use: Almost always. It's the most comprehensive ranking metric.
Why it works: Position 1 matters more than position 2, which matters more than position 3. Highly relevant documents contribute more than marginally relevant ones.
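A sketch using the common exponential-gain formulation, assuming graded labels (e.g. 0-3) listed in ranked order:

import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain with a log2 position discount."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# ndcg_at_k([3, 2, 0, 1], k=4) scores the actual order against the ideal [3, 2, 1, 0]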
Building an Evaluation Dataset
Metrics are only as good as your labels. Here's how to build a quality evaluation set:
Option 1: Click Data (Implicit Feedback)
# Extract query-document pairs from clicks
evaluation_pairs = []
for session in click_logs:
    query = session.query
    clicked_docs = session.clicked_documents
    # Assume clicked docs are relevant (with noise)
    for doc in clicked_docs:
        evaluation_pairs.append((query, doc, 1))  # relevance = 1

Pros: Abundant, reflects real user behavior
Cons: Noisy (users click irrelevant results), biased toward what's shown
Option 2: Human Labels (Explicit Feedback)
# Label relevance on a graded scale
RELEVANCE_SCALE = {
    0: "Not relevant",
    1: "Marginally relevant",
    2: "Relevant",
    3: "Highly relevant",
}

Pros: High quality, supports graded relevance
Cons: Expensive, time-consuming, requires domain expertise
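If a metric needs binary judgments (Precision@K, Recall@K, MRR) but your labels are graded, a common convention is to threshold the scale; the cutoff below is an assumption, not a standard:

def to_binary(graded_label, threshold=2):
    """Treat 'Relevant' (2) and 'Highly relevant' (3) as relevant, lower grades as not."""
    return 1 if graded_label >= threshold else 0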
Option 3: LLM-as-Judge
prompt = f"""
Rate the relevance of this document to the query.
Query: {query}
Document: {document}
Relevance (0-3):
"""
relevance = llm.generate(prompt)

Pros: Scalable, consistent, cheap
Cons: May miss domain nuances, hallucination risk
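The judge returns free text, so you will want to coerce it to a score and handle malformed replies; a minimal sketch (parse_relevance is an illustrative helper, not part of any LLM SDK):

def parse_relevance(raw_output, default=0):
    """Pull the first 0-3 digit out of the model's reply; fall back if absent."""
    for char in raw_output.strip():
        if char in "0123":
            return int(char)
    return default  # treat unparseable judgments as not relevant, or re-ask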
Production Monitoring
Offline evaluation is necessary but not sufficient. You need online metrics too:
Click-Through Rate (CTR)
ctr = clicks / impressions

Track CTR at different positions. A healthy search has high CTR at position 1, declining gradually.
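A sketch of position-wise CTR, assuming impression logs that record each result's position and whether it was clicked (field names are assumptions):

from collections import defaultdict

def ctr_by_position(impression_logs):
    """Compute clicks / impressions separately for each result position."""
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    for record in impression_logs:  # e.g. {"position": 1, "clicked": True}
        impressions[record["position"]] += 1
        if record["clicked"]:
            clicks[record["position"]] += 1
    return {pos: clicks[pos] / impressions[pos] for pos in sorted(impressions)}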
Zero-Result Rate
zero_result_rate = queries_with_no_results / total_queries

Target: < 5%. A high zero-result rate means your index is incomplete or query understanding is failing.
Time to First Click
How long until users click something? Longer times suggest poor ranking.
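One way to compute it, assuming sessions that log a query timestamp and click timestamps in seconds (field names are assumptions):

import statistics

def median_time_to_first_click(sessions):
    """Median seconds between query submission and the first click, per session."""
    deltas = [
        min(s["click_times"]) - s["query_time"]
        for s in sessions
        if s["click_times"]  # ignore sessions with no click at all
    ]
    return statistics.median(deltas) if deltas else None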
Reformulation Rate
Do users rephrase their query after seeing results? High reformulation = poor initial results.
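A rough heuristic, assuming queries grouped by session in time order; the 60-second window and field names are assumptions:

def reformulation_rate(sessions, window_seconds=60):
    """Fraction of queries followed, within the window and without a click,
    by another query in the same session."""
    reformulated, total = 0, 0
    for queries in sessions:  # each query: {"time": ..., "clicked": bool}
        total += len(queries)
        for current, nxt in zip(queries, queries[1:]):
            if not current["clicked"] and nxt["time"] - current["time"] <= window_seconds:
                reformulated += 1
    return reformulated / total if total else 0.0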
A/B Testing Search Changes
Never ship search changes without A/B testing. Search is too important and too easy to break.
# Minimum detectable effect calculation
from scipy import stats

def sample_size_for_ab_test(baseline_conversion, mde, power=0.8, alpha=0.05):
    """
    Approximate sample size needed per variant for a two-proportion test.

    baseline_conversion: current metric value (e.g., 0.20 for a 20% CTR)
    mde: minimum detectable effect, relative (e.g., 0.05 for a 5% lift)
    """
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    p1 = baseline_conversion
    p2 = baseline_conversion * (1 + mde)
    n = (2 * ((z_alpha + z_beta) ** 2) * p1 * (1 - p1)) / ((p2 - p1) ** 2)
    return int(n)

Putting It Together
A complete evaluation strategy:
- Baseline: Measure current NDCG, MRR, Precision, Recall
- Offline evaluation: Test changes on a labeled dataset (a sketch follows this list)
- Online A/B test: Validate with real users
- Monitor: Track metrics in production continuously
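A minimal sketch of the offline half of that loop, reusing the ndcg_at_k helper from earlier; search_fn and the data shapes are assumptions:

def evaluate_offline(search_fn, labeled_queries, k=10):
    """Run each labeled query through the search system and average NDCG@k."""
    scores = []
    for query, relevance_by_doc in labeled_queries.items():  # {query: {doc_id: grade}}
        ranked_ids = search_fn(query)                        # your retrieval system
        relevances = [relevance_by_doc.get(doc_id, 0) for doc_id in ranked_ids]
        scores.append(ndcg_at_k(relevances, k))
    return sum(scores) / len(scores) if scores else 0.0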
Without measurement, you can't improve. With measurement, every change is an experiment with a clear outcome.
Need help measuring search quality? Start with a baseline audit.
Book a Discovery Call