Hybrid RAG
Dense vectors meet sparse keywords.
When to use
When queries need both semantic understanding AND exact keyword/code/proper-noun matches (SKU, error codes, function names, version numbers, acronyms) — the DEFAULT upgrade when vector-only RAG keeps missing results that contain the exact term. ❌ Skip it if the corpus is small and queries are all natural language: plain vector search is enough.
Real-world examples
- Internal docs / company wiki search: understand the question AND match exact file names, process codes.
- Tech support: look up by exact error code ("ERR_2043") plus a natural-language symptom description.
- E-commerce: find products by SKU/model code as well as natural description ("waterproof jacket").
- Code/API docs lookup: match exact function/variable names while still grasping the described purpose.
Diagram
Illustrative pipeline diagram; see the step-by-step description in the Pipeline flow section below.
Pipeline flow
- 1. Query
- 2. Dense branch: Embedding Model → Vector DB → Dense Results
- 3. Sparse branch: BM25 Index → Sparse Results
- 4. Reciprocal Rank Fusion (RRF) merges both lists
- 5. Top-K Chunks → LLM → Answer
In plain words
Like searching a library with TWO librarians at once: one understands what you MEAN (semantics), the other has memorized the exact shelf codes (precise keywords). Each hands you their list, then you merge — a book ranked high by BOTH is almost certainly the right one.
Concept A–Z
Vanilla RAG uses only dense vector embeddings. They are great at MEANING ("car" ≈ "automobile") but tend to miss when a user types an exact error code, function name, or SKU that embeddings blur away. Hybrid RAG runs TWO retrievers in parallel: (1) dense: semantic vector distance; (2) sparse: BM25, the classic keyword-ranking algorithm (Elasticsearch-style) that matches exact tokens. It then merges the two ranked lists with Reciprocal Rank Fusion. The result covers both "gist" and "exact text", which usually gives better recall than vector-only retrieval on queries that contain exact terms. The extra cost is a second index to maintain and a small fusion step.
How it works
Two retrieval branches
The same query enters two independent branches, then they merge.
- Dense: query → embedding → ANN search (Approximate Nearest Neighbor — finds the closest vectors approximately but fast; via cosine/dot) in a Vector DB (pgvector, Qdrant, Pinecone…). Captures meaning, synonyms, paraphrases.
- Sparse: BM25 over an inverted index (a reverse lookup "term → docs containing it"). Captures exact rare/specific terms; the rarer the token the higher its score (IDF — inverse document frequency).
- Each branch returns its own top-N (e.g. N=50) — NOT yet cut to final K.
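To make the sparse branch concrete, here is a minimal sketch of BM25 over an in-memory inverted index. It is an illustration, not how Elasticsearch or any particular library implements it: the class name `BM25Index`, the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the sample documents are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase + whitespace split; real systems use an analyzer.
    return text.lower().split()


class BM25Index:
    """Toy BM25 index (hypothetical helper, for illustration only)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.n_docs = len(docs)
        doc_tokens = [tokenize(d) for d in docs]
        self.doc_len = [len(t) for t in doc_tokens]
        self.avg_len = sum(self.doc_len) / self.n_docs
        # Inverted index: term -> {doc_id: term frequency in that doc}
        self.postings = {}
        for doc_id, tokens in enumerate(doc_tokens):
            for term, tf in Counter(tokens).items():
                self.postings.setdefault(term, {})[doc_id] = tf

    def idf(self, term):
        # Rarer terms (small document frequency) get a larger weight.
        df = len(self.postings.get(term, {}))
        return math.log(1 + (self.n_docs - df + 0.5) / (df + 0.5))

    def search(self, query, top_n=50):
        scores = Counter()
        for term in tokenize(query):
            idf = self.idf(term)
            for doc_id, tf in self.postings.get(term, {}).items():
                # Length normalization: long docs are penalized slightly.
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / self.avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        # Return doc ids best-first; this is the branch's own top-N, not the final K.
        return [doc_id for doc_id, _ in scores.most_common(top_n)]


docs = [
    "restart the service when error ERR_2043 appears",
    "the service fails to start after an update",
    "update the billing service configuration",
]
index = BM25Index(docs)
sparse_results = index.search("ERR_2043 service")
```

The rare token `err_2043` appears in one document while `service` appears in all three, so the document containing the error code should dominate the ranking even though every document matches the query in some way.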
Merge with Reciprocal Rank Fusion (RRF)
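The two branches score on DIFFERENT scales (cosine similarity is bounded, in [-1, 1], while BM25 is unbounded), so you cannot add them directly. RRF uses only RANK: score(d) = Σ_i 1/(k + rank_i(d)), summed over the branches i that returned d, where rank_i(d) is the 1-based position of d in branch i and k is a smoothing constant (k = 60 is the commonly used default). A doc that a branch did not return contributes nothing for that branch. Docs ranked high in both branches rise to the top; no score normalization is needed, which makes the method robust.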
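The fusion formula can be sketched in a few lines. The function name `rrf_fuse` and the sample doc ids are hypothetical; the only inputs are ranked lists of ids, best first.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    """Merge ranked lists of doc ids with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based: the top hit of a branch contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; raw branch scores are never used.
    return sorted(scores, key=lambda d: scores[d], reverse=True)


dense_results = ["a", "b", "c"]   # assumed output of the dense branch
sparse_results = ["c", "a", "d"]  # assumed output of the sparse branch
fused = rrf_fuse([dense_results, sparse_results])
```

With these inputs, "a" scores 1/61 + 1/62 ≈ 0.0325 and "c" scores 1/63 + 1/61 ≈ 0.0323, while "b" (1/62) and "d" (1/63) each appear in only one list, so the fused order is a, c, b, d.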
Rerank + cut Top-K
After RRF, add a rerank step with a cross-encoder (cohere-rerank, bge-reranker): a model that reads the (query, doc) PAIR in one pass, so it judges relevance more accurately than comparing independently computed embeddings, but is slower per document. Because it only runs on the fused shortlist rather than the whole corpus, the added latency stays small. Then take the Top-K (3–8) into the prompt. Reranking usually lifts precision noticeably.
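The rerank-then-cut step can be sketched with a pluggable scoring function. In a real system `score_pair` would wrap a cross-encoder model; here `toy_overlap_score` is a deliberately crude stand-in (token overlap) so the example runs without any model, and both function names and the sample texts are assumptions for illustration.

```python
def rerank_top_k(query, candidates, score_pair, top_k=5):
    """Score each (query, doc) pair, then keep the top_k best documents."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    # Stable sort: documents with equal scores keep their fused (RRF) order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def toy_overlap_score(query, doc):
    # Placeholder for a cross-encoder: fraction of query tokens found in the doc.
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / max(len(q_tokens), 1)


candidates = [  # assumed shortlist coming out of the fusion step
    "reset your password from the login page",
    "error ERR_2043 means the cache is full",
    "billing cycles explained",
]
top_chunks = rerank_top_k(
    "what does error ERR_2043 mean", candidates, toy_overlap_score, top_k=2
)
```

Keeping the scorer as a parameter means the shortlist logic stays the same when the toy function is swapped for a real reranker; only the cost per pair changes.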