Hybrid Search: What Makes It Work?
BM25 isn't dead. Embeddings aren't magic. Running both and fusing intelligently is what production systems actually do.
Abstract
The debate between keyword search and semantic search is a false dichotomy. BM25 excels at exact-match precision. Embedding models excel at semantic understanding. Neither alone covers the full range of queries a production system must handle. Hybrid search runs both in parallel and fuses the results, combining the strengths of each while compensating for their respective failure modes. The architecture is well-established and the engineering cost is marginal compared to running either system alone. The real differentiator is not which models you run, but how you combine their outputs. Most teams default to Reciprocal Rank Fusion with a parameter they never tune, leaving measurable relevance gains on the table.
BM25 Is Not Dead
A persistent take in search engineering circles: "BM25 is obsolete, embeddings are the future." This is wrong.
Run a simple test. Take a set of exact-match queries (product SKUs, specific error codes, technical terms, named entities) and compare BM25 against the best available embedding model. BM25 will win. Convincingly. Dense retrievers drastically underperform BM25 on entity-centric questions, retrieving passages about similar-sounding but entirely different entities (Sciavolino et al., 2021). A user searching for "SKU-12345" does not benefit from a model that understands meaning. They need a system that matches the exact string.
BM25 remains a strong baseline that many dense models fail to beat on out-of-domain tasks (Thakur et al., 2021). On benchmarks like Robust04 and TREC-COVID, BM25 outperforms neural passage retrieval (Chen et al., 2022). These are not obscure edge cases. Out-of-domain generalization is the default condition for most production systems, because the training data for dense models rarely matches the actual query distribution.
Embeddings solve a different problem. "Comfortable running shoes for flat feet" should surface relevant products even if those exact words don't appear in any product description. That requires understanding meaning, which is exactly what dense retrievers are designed for.
The answer is not choosing one or the other. It is running both.
How Hybrid Search Works
The mechanics are straightforward. Two retrieval systems run in parallel against the same query:
Keyword retrieval (BM25) finds documents containing the words the user typed. It is fast, precise, and literal. It does not know that "couch" and "sofa" are the same thing.
Semantic retrieval (dense embeddings) converts the query into a numerical vector and finds documents with similar vector representations. It knows "couch" and "sofa" are semantically equivalent. But it might also conclude that "couch" is similar to "bed," which may or may not be appropriate depending on the query context.
Each system returns a ranked list of results. The question becomes: how do you combine two ranked lists into one?
Naive Fusion
Merge and deduplicate. This throws away all ranking information and produces results that are effectively random with respect to quality. It is not a viable approach.
Reciprocal Rank Fusion (RRF)
RRF assigns each result a score based on its position in each list. Results that rank high in both lists score highest. The formula is simple:
RRF(d) = Σ_{r ∈ R} 1 / (k + r(d))

Where d is a document, R is the set of ranked lists being fused, r(d) is the position of document d in list r, and k is a constant that controls how much weight the top positions carry relative to lower positions.
RRF outperforms Condorcet voting and individual rank learning methods across diverse retrieval tasks (Cormack, Clarke & Büttcher, 2009). Hybrid retrieval using RRF outperforms both pure-lexical and pure-semantic retrieval on most benchmark datasets (Bruch, Gai & Ingber, 2023).
The appeal of RRF is that it requires almost no tuning. It operates on ranks, not scores, so it sidesteps the problem of normalizing scores from different retrieval systems. It is the correct default starting point.
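Because RRF operates only on ranks, a fused run takes a few lines of code. A minimal Python sketch, assuming each input is a best-first list of document IDs (the IDs below are illustrative):

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion over any number of best-first ranked lists.

    A document's fused score is the sum of 1 / (k + rank) across every
    list it appears in, with ranks starting at 1.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Best fused score first
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["d1", "d2", "d3"]   # illustrative keyword ranking
dense_results = ["d3", "d1", "d4"]  # illustrative semantic ranking
print(rrf_fuse([bm25_results, dense_results]))  # "d1" wins: ranked high in both lists
```

Because the function never looks at raw scores, it works unchanged whether the inputs come from BM25, a dense index, or more than two systems.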
Learned Fusion
A convex combination approach computes the final score as a weighted blend of the normalized scores from each retrieval system:

score(d) = α · s_sem(d) + (1 − α) · s_lex(d)

Where α controls the relative weight of semantic (s_sem) versus keyword (s_lex) retrieval, and scores are normalized to a common range. This approach outperforms RRF in both in-domain and out-of-domain settings (Bruch, Gai & Ingber, 2023). The advantage over RRF is that it uses actual relevance scores rather than rank positions, preserving information about the confidence of each retrieval system.
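A minimal sketch of the convex combination, assuming min-max normalization of each system's raw scores and α as the weight on the semantic side (the score values in the usage example are illustrative):

```python
def minmax(scores):
    """Min-max normalize a {doc_id: score} dict into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores tie
    return {d: (s - lo) / span for d, s in scores.items()}

def convex_fuse(lexical, semantic, alpha=0.5):
    """Blend normalized BM25 and dense scores; alpha weights the semantic side."""
    lex, sem = minmax(lexical), minmax(semantic)
    docs = set(lex) | set(sem)
    fused = {d: alpha * sem.get(d, 0.0) + (1 - alpha) * lex.get(d, 0.0)
             for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

bm25_scores = {"d1": 12.0, "d2": 9.0, "d3": 4.0}   # illustrative raw BM25 scores
dense_scores = {"d3": 0.91, "d1": 0.88, "d4": 0.70}  # illustrative cosine scores
print(convex_fuse(bm25_scores, dense_scores, alpha=0.5))  # "d1" first: strong in both
```

Min-max is the simplest normalization choice; any scheme that puts both systems on a common range works, and the normalization itself is a tunable decision.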
The most effective variant is not a global α but a per-query one. Different query types favor keyword retrieval versus semantic retrieval differently, and a learned model can determine the optimal blend for each query. Dynamic per-query weight tuning can yield 2 to 7.5 percentage point gains over static fusion weights (Hsu & Tzeng, 2025). This means the system adjusts its reliance on keyword versus semantic signals depending on the characteristics of each query, rather than applying one ratio to everything.
In practice, starting with RRF is correct. Moving to convex combination with a tuned global α is the first upgrade. Per-query dynamic weighting is the second upgrade, and requires query classification infrastructure to route queries to different fusion strategies.
The k Parameter Nobody Tunes
Most teams running hybrid search with RRF use the default k = 60, the constant from the original 2009 paper, and never revisit it. This is a missed opportunity.
The k parameter controls the score distribution. A low k (say, 10) means the top-ranked result in each list dominates the fused score. A high k (say, 100) means positions matter less; results from both lists contribute more equally regardless of their original rank.
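A quick numeric check makes the effect visible: under 1 / (k + rank), the rank-1 contribution is nearly double the rank-10 contribution when k = 10 (about 1.82x), but nearly equal when k = 100 (about 1.09x).

```python
# How much more does rank 1 contribute than rank 10 under 1 / (k + rank)?
for k in (10, 100):
    top, tenth = 1 / (k + 1), 1 / (k + 10)
    print(f"k={k:>3}: rank 1 -> {top:.4f}, rank 10 -> {tenth:.4f}, "
          f"ratio {top / tenth:.2f}x")
```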
This is not a theoretical distinction. Contrary to existing assumptions that RRF is parameter-insensitive, varying k from 1 to 100 causes several-point swings in key retrieval metrics (Bruch, Gai & Ingber, 2023).
The practical implication: if keyword retrieval is strong for a given query distribution, a lower k lets it dominate where it is confident. If semantic retrieval is strong, a higher k gives it more influence. Sweeping k from 10 to 100 against an evaluation set and picking the optimal value takes an afternoon. No model changes. No reindexing. One number. It is the highest-ROI optimization most hybrid search teams are not doing.
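The sweep itself is a short loop over candidate values. A sketch, assuming an evaluation set of (bm25_ranking, dense_ranking, relevant_ids) tuples and recall@10 as the target metric; both the tuple format and the metric are illustrative assumptions to swap for your own:

```python
def rrf(ranked_lists, k):
    """RRF with an explicit k, for use inside the sweep."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def sweep_k(eval_queries, k_values=(10, 20, 40, 60, 80, 100), depth=10):
    """Return the best k by mean recall@depth, plus the full results table.

    eval_queries: list of (bm25_ranking, dense_ranking, relevant_ids) tuples.
    """
    table = {}
    for k in k_values:
        recalls = []
        for bm25, dense, relevant in eval_queries:
            fused = rrf([bm25, dense], k)[:depth]
            recalls.append(len(set(fused) & relevant) / len(relevant))
        table[k] = sum(recalls) / len(recalls)
    return max(table, key=table.get), table

# Illustrative two-query evaluation set
eval_queries = [
    (["a", "b", "c"], ["c", "a", "d"], {"a", "c"}),
    (["x", "y"], ["y", "z"], {"y"}),
]
best_k, table = sweep_k(eval_queries, depth=2)
print(best_k, table)
```

On a real evaluation set the recall values will differ across k, and the winning value is the one number to ship.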
When Semantic Search Makes Things Worse
Semantic search is not always an upgrade. Sometimes it actively degrades results.
Consider a common pattern: a user searches "python error handling best practices." Semantic search returns results about "Java exception management patterns." The embedding model considers these close. Conceptually, they are related. But the user wanted Python, specifically Python.
This is semantic drift. The embedding space maps related concepts close together, which is usually the point. But when users have specific intent, "related" is not "relevant." Dense retrievers retrieve passages about similar-sounding but completely different entities when faced with entity-centric questions (Sciavolino et al., 2021).
The pattern shows up across domains. In e-commerce: search "red Nike Air Max 90" and get "blue Adidas Ultraboost" because the embeddings consider them similar products. In enterprise search: query a specific policy document and get a thematically related but factually different one. In code search: look for a specific error code and get documentation about a different but conceptually related error.
The severity of semantic drift depends on the query distribution. For head queries (common, general queries), semantic search usually helps. For tail queries (specific, entity-heavy queries), it frequently hurts. Most search systems handle head queries adequately with BM25 alone. The long tail is where semantic search is supposed to add value, but it is also where semantic drift is most likely to cause problems. This creates a paradox: the queries that benefit most from semantic understanding are also the queries most vulnerable to semantic drift.
The fix is not removing semantic search. It is knowing when to trust it and when to suppress it.
Query Classification: The Missing Layer
Per-query retrieval strategy selection outperforms uniform application of any single strategy. There is no single retrieval approach that effectively answers all queries (Arabzadeh, Yan & Clarke, 2021).
Most hybrid search systems treat every query identically: same fusion weights, same k parameter, same retrieval strategy. But a query like "SKU-12345" and a query like "comfortable shoes for standing all day" need fundamentally different handling.
Short, specific queries with named entities (product names, codes, exact terms) should lean heavily on keyword retrieval. Longer, descriptive queries expressing needs or concepts should lean on semantic retrieval (Chen et al., 2022).
A lightweight query classifier that adjusts fusion weights per query type is not complex to implement. A simple heuristic based on query length, entity detection, and vocabulary overlap already gets most of the way there. The classifier does not need to be perfect. It needs to be better than treating all queries the same, which is a low bar.
A practical starting point for query classification:
- Query length: Single-word and two-word queries are more likely to be entity lookups (lean keyword). Queries with five or more words are more likely to express needs or concepts (lean semantic).
- Entity detection: If the query contains a recognized product name, SKU, model number, or proper noun, increase keyword weight. These queries need exact matching.
- Vocabulary overlap: If the query terms appear frequently in the index vocabulary, BM25 will perform well. If they are rare or absent, semantic retrieval is more likely to bridge the gap.
- Query intent signals: Questions ("how to," "what is," "why does") typically benefit from semantic understanding. Navigational queries (specific page or product names) benefit from keyword precision.
None of these heuristics require a trained classifier. They can be implemented as rule-based logic that adjusts the α parameter in the fusion formula. A more sophisticated approach trains a small classifier on labeled query-type data, but the heuristic baseline is already a significant improvement over static weights.
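A rule-based sketch of the heuristics above, emitting a per-query α (the weight on the semantic side). The thresholds, increments, and regex patterns are illustrative starting points, not tuned values:

```python
import re

def fusion_alpha(query):
    """Heuristic per-query alpha: the weight on the semantic side, in [0, 1]."""
    terms = query.split()
    alpha = 0.5  # neutral default: equal weight to both systems
    if len(terms) <= 2:
        alpha -= 0.2  # short query: likely an entity lookup, lean keyword
    elif len(terms) >= 5:
        alpha += 0.2  # long descriptive query: lean semantic
    if re.search(r"\b[A-Z]{2,}[-_]?\d+\b", query):
        alpha -= 0.3  # SKU / model-number pattern: exact matching matters
    if re.match(r"(?i)(how|what|why|when|where)\b", query):
        alpha += 0.2  # question phrasing: semantic understanding helps
    return min(max(alpha, 0.0), 1.0)

print(fusion_alpha("SKU-12345"))                          # near 0: almost pure keyword
print(fusion_alpha("how to choose shoes for flat feet"))  # near 1: lean semantic
```

Entity detection here is just a regex; a production version would check the query against a known-entity dictionary or a lightweight NER model, but the shape of the logic stays the same.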
The Fusion Strategy Is the Product
The individual retrieval models are increasingly commoditized. BM25 implementations are mature. Embedding models are available off the shelf from multiple providers. The combination strategy, how you fuse results, when you lean on which signal, and how you adapt per query type, is where teams differentiate.
BM25 plus reranking matches or beats pure dense retrieval on out-of-domain tasks (Thakur et al., 2021). BERT-based dense retrievers require interpolation with BM25 to achieve strong results across diverse benchmarks (Zhuang, Wang & Zuccon, 2021). The evidence is consistent: hybrid approaches that combine sparse and dense signals outperform either in isolation.
The question for teams building search systems today is not "which model?" It is "how do you fuse?"
Conclusion
Hybrid search is the architecture that works in production because it acknowledges a fundamental reality: different queries need different retrieval strategies. BM25 handles exact matches and entity lookups. Embeddings handle semantic understanding and vocabulary gaps. Running both in parallel and fusing intelligently covers the full spectrum of user intent.
Start with RRF as the default fusion method. Tune the k parameter on your evaluation set. Measure the gap. Then consider learned fusion and per-query strategy selection as the next optimization. The models are commodities. The fusion strategy is the product.
References
- Arabzadeh, N., Yan, X., & Clarke, C. L. A. (2021). "Predicting Efficiency/Effectiveness Trade-offs for Dense vs. Sparse Retrieval Strategy Selection." CIKM 2021, pp. 2862-2866. https://arxiv.org/abs/2109.10739
- Bruch, S., Gai, S., & Ingber, A. (2023). "An Analysis of Fusion Functions for Hybrid Retrieval." ACM TOIS, 2023. https://arxiv.org/abs/2210.11934
- Chen, X., Zhang, N., Lu, K., Bendersky, M., & Najork, M. (2022). "Out-of-Domain Semantics to the Rescue! Zero-Shot Hybrid Retrieval Models." ECIR 2022, pp. 95-110. https://arxiv.org/abs/2201.10582
- Cormack, G. V., Clarke, C. L. A., & Büttcher, S. (2009). "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods." SIGIR 2009, pp. 758-759. https://dl.acm.org/doi/10.1145/1571941.1572114
- Hsu, C., & Tzeng, W. (2025). "DAT: Dynamic Alpha Tuning for Hybrid Retrieval in Retrieval-Augmented Generation." arXiv:2503.23013.
- Sciavolino, C., Zhong, Z., Lee, J., & Chen, D. (2021). "Simple Entity-Centric Questions Challenge Dense Retrievers." EMNLP 2021, pp. 6138-6148. https://aclanthology.org/2021.emnlp-main.496/
- Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., & Gurevych, I. (2021). "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models." NeurIPS 2021. https://arxiv.org/abs/2104.08663
- Zhuang, S., Wang, B., & Zuccon, G. (2021). "BERT-based Dense Retrievers Require Interpolation with BM25 for Effective Passage Retrieval." ICTIR 2021. https://dl.acm.org/doi/10.1145/3471158.3472233