The Hybrid Search Architecture We Use
How to combine BM25 and semantic search for the best of both worlds.
Pure keyword search fails on synonyms. Pure semantic search fails on exact matches. The solution is neither—it's both. Here's the hybrid search architecture we deploy in production.
Why Hybrid?
Consider these queries:
| Query | BM25 | Semantic | Ideal |
|-------|------|----------|-------|
| "iPhone 15 Pro Max 256GB" | ✅ Exact match | ❌ Too broad | BM25 wins |
| "phone with best camera" | ❌ No keyword match | ✅ Understands intent | Semantic wins |
| "smartphone photography" | ⚠️ Partial match | ⚠️ Too abstract | Need both |
Neither approach dominates. You need both retrieval paths, intelligently combined.
The Architecture
```
                ┌─────────────────┐
                │   User Query    │
                └────────┬────────┘
                         │
                ┌────────▼────────┐
                │ Query Analysis  │
                │  - Intent       │
                │  - Entities     │
                │  - Spelling     │
                └────────┬────────┘
                         │
       ┌─────────────────┼─────────────────┐
       │                 │                 │
┌──────▼──────┐   ┌──────▼──────┐   ┌──────▼──────┐
│    BM25     │   │  Semantic   │   │   Sparse    │
│  Retrieval  │   │  Retrieval  │   │  Retrieval  │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       │                 │                 │
       └─────────────────┼─────────────────┘
                         │
                ┌────────▼────────┐
                │  Fusion Layer   │
                │ (RRF / Linear)  │
                └────────┬────────┘
                         │
                ┌────────▼────────┐
                │  Cross-Encoder  │
                │    Reranking    │
                └────────┬────────┘
                         │
                ┌────────▼────────┐
                │  Final Results  │
                └─────────────────┘
```
Component Deep-Dive
1. Query Understanding
Before retrieval, understand what the user wants:
```python
class QueryAnalyzer:
    def analyze(self, query: str) -> QueryFeatures:
        return QueryFeatures(
            intent=self.classify_intent(query),        # navigational, informational, transactional
            entities=self.extract_entities(query),     # brands, sizes, colors
            corrected_query=self.spell_correct(query),
            expansion_terms=self.expand_query(query)   # synonyms
        )
```

2. BM25 Retrieval
The workhorse for exact matching. We use Elasticsearch with tuned parameters:
```json
{
  "query": {
    "multi_match": {
      "query": "iPhone 15 Pro",
      "fields": ["title^3", "description", "brand^2"],
      "type": "best_fields",
      "fuzziness": "AUTO"
    }
  }
}
```

Key tuning parameters:
- Field boosting: Title matches are worth more than description matches
- Fuzziness: Handle typos without over-matching
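For completeness, here is roughly how that query might be issued from application code. This is a sketch, not our exact setup: the index name, connection URL, and `bm25_retrieve` helper are illustrative, and it assumes the official `elasticsearch` Python client with 8.x-style keyword arguments.

```python
from elasticsearch import Elasticsearch

# Illustrative connection; point this at your own cluster.
es = Elasticsearch("http://localhost:9200")

def bm25_retrieve(query: str, k: int = 100) -> list:
    """Run the tuned multi_match query and return the top-k hits."""
    response = es.search(
        index="products",  # illustrative index name
        query={
            "multi_match": {
                "query": query,
                "fields": ["title^3", "description", "brand^2"],
                "type": "best_fields",
                "fuzziness": "AUTO",
            }
        },
        size=k,
    )
    return response["hits"]["hits"]
```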
3. Semantic Retrieval
Vector similarity search for conceptual matching:
```python
def semantic_retrieve(query: str, k: int = 100) -> List[Document]:
    query_embedding = embedding_model.encode(query)
    results = vector_store.search(
        query_embedding,
        k=k,
        filter=metadata_filter
    )
    return results
```

We use HNSW indexes for sub-millisecond retrieval at scale.
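As a rough sketch of what an HNSW setup involves (the library, dimensionality, and parameter values below are illustrative, not our production configuration), here is a minimal FAISS example:

```python
import faiss
import numpy as np

dim = 768   # embedding dimensionality (placeholder)
M = 32      # graph connectivity; higher = better recall, more memory

# HNSW over raw float vectors; a quantized variant saves memory at scale.
index = faiss.IndexHNSWFlat(dim, M)
index.hnsw.efConstruction = 200   # build-time search depth
index.hnsw.efSearch = 64          # query-time search depth (recall/latency knob)

embeddings = np.random.rand(10_000, dim).astype("float32")  # stand-in corpus
index.add(embeddings)

query_vec = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query_vec, k=100)
```

Raising `efSearch` improves recall at the cost of latency, which is the main knob when tuning against the latency budget discussed below.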
4. Fusion: Reciprocal Rank Fusion (RRF)
Combine ranked lists without needing score calibration:
```python
from collections import defaultdict

def reciprocal_rank_fusion(
    ranked_lists: List[List[Document]],
    k: int = 60
) -> List[Document]:
    """
    RRF score = sum(1 / (k + rank_i)) over each list the document appears in.
    """
    scores = defaultdict(float)
    docs_by_id = {}
    for ranked_list in ranked_lists:
        for rank, doc in enumerate(ranked_list, start=1):
            scores[doc.id] += 1 / (k + rank)
            docs_by_id[doc.id] = doc
    sorted_ids = sorted(scores, key=lambda doc_id: -scores[doc_id])
    return [docs_by_id[doc_id] for doc_id in sorted_ids]
```

RRF works because:
- No score normalization needed
- Naturally handles different list lengths
- Robust to outlier scores
5. Cross-Encoder Reranking
The final refinement. Cross-encoders see query and document together:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')

def rerank(query: str, documents: List[Document], top_k: int = 10) -> List[Document]:
    pairs = [(query, doc.text) for doc in documents]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(documents, scores), key=lambda x: -x[1])
    return [doc for doc, score in ranked[:top_k]]
```

Cross-encoders are slower (their scores can't be precomputed) but more accurate. Use them only on the top 50-100 candidates from fusion.
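Putting the pieces together, the back half of the pipeline looks roughly like this. It's a sketch that reuses the functions defined above (plus the illustrative `bm25_retrieve` helper from the BM25 section) and assumes both retrieval paths return the same `Document` type:

```python
def search(query: str, top_k: int = 10) -> List[Document]:
    # 1. Both retrieval paths (run these in parallel in production).
    bm25_hits = bm25_retrieve(query, k=100)
    semantic_hits = semantic_retrieve(query, k=100)

    # 2. Fuse the ranked lists and keep a manageable candidate pool.
    candidates = reciprocal_rank_fusion([bm25_hits, semantic_hits])[:50]

    # 3. Cross-encoder pass only on the fused top candidates.
    return rerank(query, candidates, top_k=top_k)
```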
The Latency Budget
In production, speed matters:
| Stage | Target Latency | Notes |
|-------|----------------|-------|
| Query Analysis | < 20ms | Can cache common patterns |
| BM25 Retrieval | < 30ms | Elasticsearch is fast |
| Semantic Retrieval | < 30ms | HNSW with good recall |
| Fusion | < 5ms | Pure computation |
| Reranking | < 50ms | Top 50 candidates |
| Total | < 135ms | P95 target |
Latency Optimization Tips
- Parallelize retrieval: BM25 and semantic run simultaneously (see the sketch after this list)
- Limit candidates: Don't rerank 1000 documents
- Quantize embeddings: int8 is often sufficient
- Cache embeddings: Popular queries hit cache
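To illustrate the first tip: the two retrieval calls can be dispatched concurrently. A minimal sketch, assuming the synchronous retrieval helpers from earlier and Python 3.9+ for `asyncio.to_thread`:

```python
import asyncio

async def hybrid_retrieve(query: str, k: int = 100):
    # Dispatch both retrieval paths at once so wall-clock latency is
    # roughly max(BM25, semantic) rather than their sum.
    bm25_hits, semantic_hits = await asyncio.gather(
        asyncio.to_thread(bm25_retrieve, query, k),
        asyncio.to_thread(semantic_retrieve, query, k),
    )
    return reciprocal_rank_fusion([bm25_hits, semantic_hits])

# Usage: results = asyncio.run(hybrid_retrieve("smartphone photography"))
```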
When Not to Use Hybrid
Hybrid search adds complexity. Skip it when:
- Pure catalog lookup: SKU search doesn't need semantics
- Extreme latency requirements: Every millisecond counts
- Limited engineering resources: Start with one approach, add the other when needed
Results
On a recent e-commerce deployment, hybrid search outperformed both individual approaches:
| Approach | NDCG@10 | P95 Latency |
|----------|---------|-------------|
| BM25 only | 0.58 | 45ms |
| Semantic only | 0.62 | 55ms |
| Hybrid + Rerank | 0.79 | 130ms |
The 36% relative improvement in NDCG@10 over BM25 alone translated to a 12% increase in click-through rate and an 8% improvement in conversion.
Hybrid search isn't just technically superior—it directly impacts business metrics.
Struggling with search relevance? Get an audit.
Book a Discovery Call