Evaluation · rag-evaluation, confidence-intervals, significance-testing, ndcg, eval-tooling

The Significance-Testing Gap: LLM-Era Eval Tools Inherited IR's Metrics, Not Its Statistics

A reranker moves NDCG@10 from 0.41 to 0.43 and the dashboard turns green. None of the major LLM-era eval platforms reports whether that 0.02 is signal or noise. The IR libraries settled this two decades ago.

June 25, 2026 · 8 min read

A team swaps in a new reranker. The evaluation harness reports NDCG@10 rising from 0.41 to 0.43. The dashboard turns green, the pull request merges, and the release notes claim a retrieval improvement. Nobody in the room can say whether that 0.02 is a real gain or the kind of wobble that reverses on the next sample of queries. The number moved, so the number is treated as truth.

This is the default state of retrieval evaluation across most of the tooling built in the LLM era, and the gap is not an accident of any one product. It is a category-wide inheritance failure. The platforms picked up the information retrieval community's metrics and left behind the machinery that establishes whether a difference in those metrics is real.

The metrics were inherited. The apparatus was not.

NDCG, MAP, MRR, Recall@k, and Precision@k did not originate with RAG. They are decades-old IR measures with precise definitions and known behavior; normalized discounted cumulative gain, for instance, has a published formulation that the field has used since the early 2000s (Järvelin & Kekäläinen, 2002). When an LLM-era evaluation platform reports NDCG@10, it is borrowing a number whose meaning was established in a research tradition that also established something the platform usually omits: a measured difference between two retrieval systems on a fixed query set is a sample statistic, and sample statistics carry sampling error.
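
To make "precise definition" concrete, here is a minimal sketch of nDCG@k in one widely used variant (linear gain, log2(rank + 1) discount). The function names and the toy relevance grades are illustrative assumptions, not taken from any platform, and other variants (exponential gain, for example) give different values, so scores are only comparable when the variant is held fixed.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k graded relevances.

    Assumes linear gain and a log2(rank + 1) discount, with ranks starting at 1.
    """
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(ranked_relevances, judged_relevances, k):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    ranked_relevances: grades of the returned documents, in ranked order.
    judged_relevances: grades of every judged document for the query.
    """
    ideal = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant documents judged for this query
    return dcg_at_k(ranked_relevances, k) / ideal


# Hypothetical query with four judged documents graded 0-3.
judged = [3, 2, 1, 0]
perfect = ndcg_at_k([3, 2, 1, 0], judged, k=10)
reversed_order = ndcg_at_k([0, 1, 2, 3], judged, k=10)
```

The point of the sketch is that the function returns one score per query. The headline NDCG@10 is the mean of those per-query scores, which is exactly why it has a sampling distribution.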

The reliability of a retrieval result depends on how many queries were evaluated and how much the per-query scores vary, not on the headline mean alone. A difference that looks decisive on fifty queries can be indistinguishable from noise once that variance is accounted for (Voorhees & Buckley, 2000). The standard response in IR has been routine for the better part of two decades: compute per-query scores for both systems, then run a paired significance test (paired t-test, Wilcoxon signed-rank, sign test, or a randomization test) to ask whether the observed difference would survive resampling (Smucker, Allan & Carterette, 2007). A point estimate with no interval and no test is, in that tradition, an incomplete result.
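
As a rough sketch of that routine, the snippet below runs a paired t-test and a Wilcoxon signed-rank test with SciPy on per-query NDCG@10 scores. The ten scores are invented for illustration, and `paired_tests` is a hypothetical helper name; a real evaluation would use the per-query scores from its own harness and far more than ten queries.

```python
import statistics

from scipy import stats

# Hypothetical per-query NDCG@10 for the same ten queries under two systems.
baseline = [0.41, 0.38, 0.52, 0.30, 0.47, 0.44, 0.35, 0.50, 0.39, 0.42]
candidate = [0.44, 0.37, 0.55, 0.33, 0.46, 0.49, 0.36, 0.53, 0.38, 0.45]


def paired_tests(scores_a, scores_b):
    """Compare two systems scored on the same queries, in the same order."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired tests need one score per query for each system")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    t_result = stats.ttest_rel(scores_b, scores_a)  # paired t-test
    w_result = stats.wilcoxon(scores_b, scores_a)   # Wilcoxon signed-rank
    return {
        "mean_diff": statistics.fmean(diffs),
        "t_pvalue": float(t_result.pvalue),
        "wilcoxon_pvalue": float(w_result.pvalue),
    }


report = paired_tests(baseline, candidate)
print(report)
```

Both tests operate on the per-query differences, not on the two means, which is why the pairing (same queries, same order) has to be preserved when scores are exported.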

What the LLM-era platforms actually ship

An audit of the public documentation and source code of the major RAG and LLM evaluation platforms, conducted in June 2026, found a consistent pattern. The tools surveyed (Ragas, Arize Phoenix and Arize AX, Galileo, Patronus AI, DeepEval and Confident AI, Braintrust, LangSmith, Comet Opik, promptfoo, Vectara's Open-RAG-Eval, and TruLens) default to point estimates on retrieval metrics. None of them was found to natively compute and display a bootstrap confidence interval on a retrieval metric, and none was found to natively run a paired significance test comparing two configurations on a retrieval metric. The dashboards report a mean and a delta. They do not report whether the delta is distinguishable from zero.

There is a subtler problem underneath that one. Many of these platforms do not ship deterministic retrieval metrics in the first place. What they label "context precision" or "contextual recall" is frequently an LLM-judge reframing, a graded score produced by a language model rather than a deterministic computation against human relevance judgments. Ragas and DeepEval both fall into this pattern, as do Galileo (whose retrieval-quality metrics are chunk and context relevance computed as model-based judge scores, alongside a documented Precision@K) and Patronus (whose retrieval evaluators are LLM-judge context relevance and sufficiency checks). The exception is Arize, which documents NDCG, MRR, and MAP as rank-aware metrics (Arize AX documentation, audited June 2026), though it too surfaces no interval or test around them.

The distinction matters because marketing language frequently outruns documentation. At least one platform's comparison pages advertise statistical significance testing as a feature, describing CI/CD pipelines that analyze statistical significance and block merges when quality degrades (Braintrust, audited June 2026). The technical documentation does not describe a test, a statistic, or a threshold, and the associated open-source metric library ships RAG-style judge metrics without NDCG, MRR, or Recall@k at all. The word "significance" appears in the sales copy and not in the spec.

The counterexamples prove the apparatus exists

If the universal claim were that the apparatus is hard or unavailable, it would be wrong. Tools that descend directly from the IR research tradition carry the significance-testing machinery as table stakes. The clearest case is ranx, a ranking-evaluation library whose comparison function takes a statistical test as a parameter and natively runs a paired Student's t-test, Fisher's randomization test, and Tukey's HSD across runs, annotating which differences are significant at a user-set threshold (Bassani, 2022; ranx statistical tests documentation, audited June 2026). Its metrics are validated against TREC Eval for correctness. A fork of pytrec_eval similarly ships a ttest function that returns a p-value comparing the NDCG of two runs (eXascaleInfolab/pytrec_eval, audited June 2026).
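
What a paired randomization test does can be sketched in a few lines of standard-library Python. This illustrates the general technique (random sign flips of the per-query differences); it is not ranx's implementation, and the function name, permutation count, and toy scores are assumptions made for the example.

```python
import random


def randomization_test(scores_a, scores_b, n_permutations=10_000, seed=0):
    """Two-sided paired randomization test on the mean per-query difference.

    Under the null hypothesis the two systems are interchangeable, so the sign
    of each per-query difference is arbitrary. The p-value estimates how often
    random sign flips give a mean difference at least as large as the observed one.
    """
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    extreme = 0
    for _ in range(n_permutations):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed - 1e-12:  # tolerance for float noise
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical scores: a consistent gain on every query vs. no change at all.
old = [0.40] * 20
consistent_gain = [0.45] * 20
p_gain = randomization_test(old, consistent_gain)
p_same = randomization_test(old, old)
```

The test makes no distributional assumption about the per-query scores, which is one reason it appears alongside the t-test in IR tooling.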

These are not edge cases to be explained away. They are the reason the gap is worth writing about. The apparatus is open source, well documented, and decades into refinement. A Python library from the IR research lineage does paired significance testing on NDCG by default. The platforms marketed specifically for evaluating retrieval in the LLM era, the products a startup is most likely to reach for when building a RAG pipeline, mostly do not. The capability was not lost to difficulty. It was simply not carried across the boundary between the IR research world and the LLM application world.

The landscape at a glance

The pattern is easier to see laid out in one place. The split runs along provenance, not licensing: the libraries born in the IR research tradition carry significance testing natively, and the platforms born in the LLM application era do not, whether they are open source or commercial. All rows reflect public documentation and source code audited in June 2026.

| Tool | Category | License | Retrieval metrics supported | CI on metrics | Paired sig. test | p-value / SE / effect size | Per-query scores | Audit basis |
|---|---|---|---|---|---|---|---|---|
| ranx | OSS library | MIT | NDCG, MAP, MRR, P@k, R@k, RBP, Bpref (validated vs TREC Eval) | No (not shown by default) | Yes (t-test, Fisher randomization, Tukey HSD) | Yes (p-values) | Yes | docs |
| pytrec_eval (eXascaleInfolab fork) | OSS library | MIT | NDCG, MAP, and other TREC measures | No | Yes (ttest) | Yes (p-values) | Yes | GitHub |
| pytrec_eval (cvangysel, SIGIR'18) | OSS library | MIT | NDCG, MAP, and other TREC measures | No | No | No | Yes | GitHub |
| ir_measures | OSS library | MIT-style | nDCG, AP, RR, P@k, R@k | No | Partial (significance via PyTerrier) | No | Yes | docs |
| Ragas | OSS + SaaS | Apache-2.0 | Partial (LLM-judge "context precision/recall"; no deterministic NDCG/MAP) | No | No | No | Yes | docs |
| Arize Phoenix / AX | OSS + SaaS | Elastic-2.0 (Phoenix) / proprietary (AX) | NDCG, MRR, MAP, Precision@k | No | No | No | Yes | docs |
| DeepEval / Confident AI | OSS + SaaS | Apache-2.0 | Partial (LLM-judge "contextual precision/recall"; no classical NDCG/MAP) | No | No | No | Yes | docs |
| Galileo | SaaS | Proprietary | Partial (Precision@K documented; chunk/context relevance are model-based judge scores; no NDCG/MRR/MAP) | No | No | No | Yes | docs |
| Patronus AI | SaaS | Proprietary | Partial (LLM-judge context relevance/sufficiency, answer relevance; BLEU/ROUGE; no NDCG/MRR/MAP) | No | No | No | Yes | docs |
| Braintrust | SaaS (autoevals MIT) | Proprietary | No classical retrieval metrics (RAGAS-style judge metrics only) | No | No (docs silent; marketing claims it) | No | Yes | autoevals |
| LangSmith | SaaS | Proprietary | No native classical retrieval metrics | No | No (pairwise = LLM-judge preference, not a test) | No | Yes | docs |
| Comet Opik | OSS + SaaS | Apache-2.0 | Partial (Ragas integration; heuristic + LLM-judge) | No | No | No | Yes | docs |
| promptfoo | OSS + SaaS | MIT | Partial (RAG context metrics; no classical NDCG) | No | No | No | Yes | docs |
| Vectara Open-RAG-Eval | OSS | Apache-2.0 | Partial (UMBRELA LLM-judge relevance) | No | No | No | Yes | GitHub |
| TruLens | OSS | MIT | Partial (groundedness / context-relevance feedback) | No | No | No | Yes | docs |

A further set of tools with overlapping positioning (FutureAGI, Langfuse, Humanloop, Weights & Biases Weave, Helicone, Athina, LangWatch, Maxim AI, BEIR, MTEB, Hugging Face evaluate, DSPy, LlamaIndex evaluation, LangChain evaluation, inspect_ai, OpenAI Evals, and Continuous-Eval) was not audited to primary source as of June 2026 and is omitted from the table rather than marked, since silence in this audit is not evidence of a missing feature.

Why a missing interval changes the decision

The cost of point-estimate evaluation is not abstract. It is a stream of shipping decisions made on differences that may be noise. Return to the reranker that moved NDCG@10 from 0.41 to 0.43. The mean improved. Whether to ship depends entirely on a number the dashboard does not show: the spread of the per-query differences. An illustrative 95 percent confidence interval on the mean per-query difference of, say, [-0.002, +0.042] shows the state of the evidence at a glance. The interval includes zero, which means the data cannot rule out that the new reranker is no better, or marginally worse, than the old one. The point estimate alone conveys none of that. It reports a win.
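
An interval of that kind can be approximated with a percentile bootstrap over the per-query differences. The sketch below uses only the standard library; the function names, the resample count, and the toy differences are assumptions for illustration, and it does not reproduce the illustrative interval quoted above.

```python
import random
import statistics


def bootstrap_mean_ci(diffs, n_resamples=10_000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for the mean per-query difference."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample queries with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    low = means[int(tail * n_resamples)]
    high = means[int((1.0 - tail) * n_resamples) - 1]
    return low, high


def interval_excludes_zero(low, high):
    """True when the whole interval sits on one side of zero."""
    return low > 0 or high < 0


# Hypothetical per-query differences (candidate minus baseline) for 50 queries:
# half the queries improve by 0.1 and half regress by 0.1.
noisy = [0.1, -0.1] * 25
low, high = bootstrap_mean_ci(noisy)
ship_signal = interval_excludes_zero(low, high)
```

The unit being resampled is the query, not the document or the run, because the query sample is the source of the uncertainty the dashboard leaves out.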

Run that decision loop across a quarter of retrieval experiments and the consequences compound. Configurations get locked in on the strength of differences that would not survive a resample. Genuine regressions hide inside the noise band of a metric that only ever reports its mean. Teams accumulate a changelog of improvements, some real and some illusory, with no principled way to tell which is which. The metric was supposed to be the guardrail against shipping on vibes. Stripped of its interval, it becomes a more authoritative-looking version of the same thing. A companion piece carries the same logic onto a number line, on a reranker change whose interval sits mostly above zero yet still crosses it (Paired Significance Tests for Retrieval Changes).

The independence dimension sharpens this further. When the platform reporting the metric is also the vendor selling the retrieval system, or is the team whose work the metric evaluates, a point estimate with no uncertainty band is the most flattering possible way to present a result. A measured improvement, however small, reads as progress. There is a reason financial results get audited by a party with no stake in the outcome rather than self-reported by the company. Retrieval evaluation has not yet built the equivalent discipline, and the tooling's default behavior actively works against it.

The workaround exists, but defaults are what get used

None of this is a claim that the platforms make significance testing impossible. Nearly every tool surveyed exports per-query or per-example scores through a CSV download, a JSON dump, or an SDK call. A practitioner with the right training can pull those scores into scipy.stats.bootstrap or a paired test and recover the interval the dashboard omitted. The workaround path is broadly available.
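
A minimal sketch of that workaround, assuming a CSV export with one row per query: the column names (`query_id`, `baseline`, `candidate`) and the inline data are hypothetical, since each platform's export format differs. The call to `scipy.stats.bootstrap` uses the percentile method on the per-query differences.

```python
import csv
import io

import numpy as np
from scipy import stats

# Hypothetical export: per-query NDCG@10 for two configurations.
EXPORT = """query_id,baseline,candidate
q1,0.41,0.44
q2,0.38,0.37
q3,0.52,0.55
q4,0.30,0.33
q5,0.47,0.46
q6,0.44,0.49
q7,0.35,0.36
q8,0.50,0.53
q9,0.39,0.38
q10,0.42,0.45
"""


def load_differences(csv_text):
    """Per-query differences (candidate minus baseline) from the export."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return np.array([float(r["candidate"]) - float(r["baseline"]) for r in rows])


def mean_difference_ci(diffs, confidence_level=0.95, seed=0):
    """Bootstrap confidence interval on the mean per-query difference."""
    result = stats.bootstrap(
        (diffs,),                 # data is passed as a sequence of samples
        np.mean,
        n_resamples=9999,
        confidence_level=confidence_level,
        method="percentile",
        random_state=seed,
    )
    return float(result.confidence_interval.low), float(result.confidence_interval.high)


diffs = load_differences(EXPORT)
ci_low, ci_high = mean_difference_ci(diffs)
print(f"mean diff = {diffs.mean():.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

Differencing first and then bootstrapping keeps the pairing intact; bootstrapping each system's scores separately would discard it and widen the interval.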

The trouble is that defaults are what teams actually use, especially under release pressure. A capability that requires leaving the tool, writing custom code, and knowing which test to run is a capability most teams will skip when the green number is sitting right there in the UI. The argument is not that the tooling cannot surface uncertainty. It is that the tooling does not surface uncertainty by default, and a discipline that lives only in the workaround path is a discipline most organizations will not practice.

Where this leaves a retrieval evaluation

A retrieval result without an interval and without a test is a partial result, no matter how precise the headline number looks. The LLM-era platforms inherited the scoreboard and left the rulebook behind.

If your team is making ship-or-hold decisions on retrieval metrics, the question worth asking is not whether the number moved. It is whether the number moved by more than the noise, and whether anyone has checked. TensorOpt's sample retrieval audit walks through exactly that analysis on a public benchmark: per-query scores, bootstrap confidence intervals on NDCG@10 and MRR, and paired significance tests between configurations, with the reasoning shown rather than asserted. It is the difference between a dashboard that says green and a report that says how confident you should be in the green.

You can request the sample report by email below. It is the artifact to hand the engineer who is about to merge a reranker on the strength of 0.02. The full measurement and statistical methodology behind it is documented in Designing Hybrid Search Systems.

Laszlo Csontos

Author of Designing Hybrid Search Systems (Leanpub, 2026). Practitioner background in production hybrid search, embeddings, cross-encoder reranking, and retrieval evaluation.
