# Phase 2: Hybrid Retrieval Engine

**Status:** ✅ Complete | **Tests:** 33/33 passed | **Date:** March 2026

---
## Overview

Phase 2 builds the retrieval pipeline, VoiceVault's technical differentiator. Instead of the plain vector search that most RAG tutorials use, VoiceVault implements hybrid BM25 + dense vector retrieval with Reciprocal Rank Fusion (RRF), cross-encoder reranking, and diversity filtering.

**Why hybrid retrieval matters:** The 2026 MDPI systematic review of 63 enterprise RAG deployments found that 80.5% still use single-mode retrieval, missing the 20–30% recall improvement that hybrid search provides.

---
## Files Created

| File | Purpose |
|------|---------|
| `voicevault/retrieval/bm25_retriever.py` | `rank_bm25` keyword search against the persisted index |
| `voicevault/retrieval/vector_retriever.py` | ChromaDB cosine similarity search |
| `voicevault/retrieval/hybrid_retriever.py` | RRF merge + cross-encoder + diversity filter |
| `voicevault/retrieval/context_builder.py` | Formats chunks into the LLM prompt context string |
| `tests/test_phase2.py` | 33 tests: retrieval correctness, RRF math, diversity, context |

---
## Module Deep-Dives

### 1. BM25Retriever

Loads the serialized `bm25.pkl` index (built by IndexBuilder) and scores all chunks against the query using BM25Okapi.

**Key behaviors:**

- Zero-score results (no term overlap) are excluded, so only meaningful matches are returned
- Results are sorted by BM25 score, descending
- Returns an empty list gracefully if the index doesn't exist (no documents ingested)
- The `reload()` method forces a re-read from disk after a new ingest (used by the KB manager)
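
The filtering and ordering behaviors listed above can be sketched without the index itself. This is a minimal illustration that assumes the raw BM25 scores have already been computed; `rank_nonzero` and its parameters are hypothetical names, not part of VoiceVault's API.

```python
from typing import Sequence


def rank_nonzero(
    chunk_ids: Sequence[str],
    scores: Sequence[float],
    top_k: int = 20,
) -> list[tuple[str, float]]:
    """Drop zero-score chunks and return the rest, best score first.

    Hypothetical helper: the real retriever's signature may differ.
    """
    if not chunk_ids:
        # Mirrors the "no index / no documents" case: nothing to rank.
        return []
    # Keep only chunks that share at least one term with the query.
    matches = [(cid, s) for cid, s in zip(chunk_ids, scores) if s > 0.0]
    matches.sort(key=lambda pair: pair[1], reverse=True)
    return matches[:top_k]


ranked = rank_nonzero(["c1", "c2", "c3"], [0.0, 2.4, 0.7])
print(ranked)  # [('c2', 2.4), ('c3', 0.7)]
```

Dropping zero scores before sorting keeps unrelated chunks out of the later fusion step, where mere presence in a ranking would otherwise earn them a score.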
### 2. VectorRetriever

Encodes the query with `all-MiniLM-L6-v2` (the same model used at ingestion) and queries ChromaDB with cosine similarity.

**Score conversion:**

ChromaDB returns cosine *distance* (0 = identical, 2 = opposite). The retriever converts it to a similarity score: `vector_score = max(0.0, 1.0 - distance)`. Scores therefore fall in [0, 1], where 1 is a perfect match; any distance of 1 or more (orthogonal or opposing vectors) clamps to 0.
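
As a small worked example of the conversion above, the function below applies the documented formula to a few distances. The function name `distance_to_score` is an assumption made for illustration.

```python
def distance_to_score(distance: float) -> float:
    """Map a cosine distance in [0, 2] to a similarity score in [0, 1]."""
    # Distances above 1.0 would give a negative value, so clamp at 0.
    return max(0.0, 1.0 - distance)


scores = [distance_to_score(d) for d in (0.0, 0.25, 1.0, 1.6)]
print(scores)  # [1.0, 0.75, 0.0, 0.0]
```

The clamp means the score cannot distinguish an unrelated chunk (distance 1.0) from an opposing one (distance 1.6); both score 0.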
### 3. HybridRetriever (Core)

**Full pipeline:**

```
query → _expand_query() → [q1, q2, q3]
  → BM25 search × 3 queries → merge best scores per chunk_id
  → Vector search × 3 queries → merge best scores per chunk_id
  → _rrf_merge() → {chunk_id: rrf_score}
  → sort by rrf_score, take top-20
  → _rerank() with CrossEncoder → sort by rerank_score
  → _diversity_filter() → max 2 chunks per (source_file, page_number)
  → return top-5 as list[RetrievalResult]
```
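
The "merge best scores per chunk_id" step in the pipeline above could look roughly like the sketch below. It assumes each expanded query yields a `{chunk_id: score}` dict; `merge_best_scores` and `to_ranking` are hypothetical helper names, not the module's actual functions.

```python
def merge_best_scores(per_query_scores: list[dict[str, float]]) -> dict[str, float]:
    """Keep the highest score each chunk achieved across all query variants."""
    best: dict[str, float] = {}
    for scores in per_query_scores:
        for chunk_id, score in scores.items():
            if score > best.get(chunk_id, float("-inf")):
                best[chunk_id] = score
    return best


def to_ranking(scores: dict[str, float]) -> list[str]:
    """Order chunk ids by descending score, ready for rank-based fusion."""
    return sorted(scores, key=scores.get, reverse=True)


merged = merge_best_scores([
    {"a": 0.9, "b": 0.2},   # scores for q1
    {"b": 0.6, "c": 0.4},   # scores for q2
    {"a": 0.5},             # scores for q3
])
print(merged)              # {'a': 0.9, 'b': 0.6, 'c': 0.4}
print(to_ranking(merged))  # ['a', 'b', 'c']
```

Taking the maximum rather than the sum means a chunk is not rewarded merely for matching several near-duplicate query variants.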

**RRF Formula (verified in test):**

```
rrf_score(chunk) = Σ_method 1 / (60 + rank_in_method)
```

- k=60 is the standard value from the Cormack et al. (2009) RRF paper
- A chunk ranked #1 in both methods scores: 2/61 ≈ 0.0328
- A chunk ranked #5 in both methods scores: 2/65 ≈ 0.0308
- A chunk ranked #1 in BM25 and #5 in vector scores: 1/61 + 1/65 ≈ 0.0318

The tests `test_rrf_chunk_in_both_lists_gets_higher_score` and `test_rrf_score_formula` verify the mathematics exactly.
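
The formula and worked scores above can be reproduced with a few lines of Python. This is a sketch under the stated definition (1-based ranks, k=60); the function name `rrf_merge` echoes the pipeline's `_rrf_merge()`, but the signature here is an assumption.

```python
from collections import defaultdict

RRF_K = 60  # constant from the RRF formula above


def rrf_merge(rankings: list[list[str]], k: int = RRF_K) -> dict[str, float]:
    """Sum 1 / (k + rank) over every ranking in which a chunk appears."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based, so the top result contributes 1 / (k + 1).
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return dict(scores)


bm25_ranking = ["a", "b", "c", "d", "e"]
vector_ranking = ["a", "x", "y", "z", "e"]
fused = rrf_merge([bm25_ranking, vector_ranking])

print(round(fused["a"], 4))  # 0.0328  (rank 1 in both lists: 2/61)
print(round(fused["e"], 4))  # 0.0308  (rank 5 in both lists: 2/65)
print(round(fused["b"], 4))  # 0.0161  (rank 2 in one list only: 1/62)
```

Note that chunk `e`, ranked fifth by both methods, still outscores chunk `b`, ranked second by only one. Agreement between retrievers outweighs a strong position in a single list.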

**Cross-encoder reranking:**

The `ms-marco-MiniLM-L12-v2` model (33MB) reads each `(query, chunk_text)` pair jointly. Because attention spans both texts at once, it usually scores relevance more accurately than bi-encoder similarity, at a higher cost per pair. For speed, the cross-encoder runs only on the top-20 RRF candidates, not on every indexed chunk.
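
The control flow of the reranking step can be sketched independently of the model. In the example below the pair scorer is injected as a function, and `overlap_scorer` is a toy stand-in that counts shared words; it is not a cross-encoder. All names here are hypothetical. In a real setup the scorer would presumably wrap something like the `predict` method of a sentence-transformers `CrossEncoder`.

```python
from typing import Callable, Sequence

# A scorer takes (query, text) pairs and returns one relevance score per pair.
PairScorer = Callable[[Sequence[tuple[str, str]]], Sequence[float]]


def rerank(
    query: str,
    candidates: list[tuple[str, str]],  # (chunk_id, chunk_text), best RRF first
    score_pairs: PairScorer,
    pool_size: int = 20,
) -> list[tuple[str, float]]:
    """Score only the top `pool_size` candidates and sort by that score."""
    pool = candidates[:pool_size]
    pairs = [(query, text) for _, text in pool]
    scores = score_pairs(pairs)
    ranked = sorted(
        zip((cid for cid, _ in pool), scores),
        key=lambda item: item[1],
        reverse=True,
    )
    return [(cid, float(score)) for cid, score in ranked]


def overlap_scorer(pairs: Sequence[tuple[str, str]]) -> list[float]:
    """Toy stand-in scorer: number of lowercase words shared by query and text."""
    return [float(len(set(q.lower().split()) & set(t.lower().split()))) for q, t in pairs]


candidates = [
    ("c1", "the weather today is sunny"),
    ("c2", "model accuracy on the test set"),
    ("c3", "accuracy improved"),
]
reranked = rerank("model accuracy", candidates, overlap_scorer)
print(reranked)  # [('c2', 2.0), ('c3', 1.0), ('c1', 0.0)]
```

Limiting the pool bounds the number of model forward passes per query to `pool_size`, regardless of how large the index grows.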

**Diversity filter:**

Keeps at most `cfg.max_chunks_per_page = 2` chunks from the same `(source_file, page_number)` pair. This prevents a single dense page from dominating the final context.
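
A per-page cap like the one described above can be sketched as a single pass over the reranked list. The `Chunk` dataclass and the `diversity_filter` signature are assumptions made for this example; only the cap of 2 and the `(source_file, page_number)` key come from the description.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for a retrieval result."""
    chunk_id: str
    source_file: str
    page_number: int
    rerank_score: float


def diversity_filter(chunks: list[Chunk], max_per_page: int = 2, top_k: int = 5) -> list[Chunk]:
    """Walk chunks in score order, skipping any page that already hit its cap."""
    seen: Counter[tuple[str, int]] = Counter()
    kept: list[Chunk] = []
    for chunk in chunks:  # assumed already sorted by rerank_score, descending
        key = (chunk.source_file, chunk.page_number)
        if seen[key] >= max_per_page:
            continue
        seen[key] += 1
        kept.append(chunk)
        if len(kept) == top_k:
            break
    return kept


chunks = [
    Chunk("r1", "report.pdf", 3, 0.95),
    Chunk("r2", "report.pdf", 3, 0.90),
    Chunk("r3", "report.pdf", 3, 0.85),  # third chunk from the same page: dropped
    Chunk("m1", "methods.pdf", 7, 0.80),
    Chunk("r4", "report.pdf", 4, 0.70),  # same file, different page: kept
]
print([c.chunk_id for c in diversity_filter(chunks)])  # ['r1', 'r2', 'm1', 'r4']
```

Because the filter walks the list in score order, the chunks it keeps for any page are always that page's highest-scoring ones.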

**Multi-KB support:**

HybridRetriever accepts `kb_names: list[str]`. It runs BM25 + vector search against all selected KBs in a single `retrieve()` call and merges the results before RRF. This enables cross-KB queries in Phase 5.
### 4. ContextBuilder

Formats the top-k RetrievalResult objects into a structured context string:

```
[Source: report.pdf, p.3 | Section: Results]
The model achieved 94.2% accuracy...
[Source: methods.pdf, p.7 | Section: Setup]
We used a 10,000 sample dataset...
```

It also builds the `citation_map: list[Citation]`, where each Citation corresponds to one source block, ordered by citation index. The LLM is told to cite using `[Source: filename, p.N]` markers. The CitationInjector (Phase 4) will map these markers back to the Citation objects for the UI panel.

Conversation history (the last 5 turns) is prepended to the context string so that follow-up questions can be handled.
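
The formatting described in this section might be sketched as follows. The `Citation` fields, the dict-shaped results, the plain-string history turns, and the `build_context` name are all assumptions for illustration; only the `[Source: ... | Section: ...]` header format and the five-turn history window come from the description above.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    """Hypothetical citation record; the real class may carry more fields."""
    index: int
    source_file: str
    page_number: int
    section: str


def build_context(
    results: list[dict],
    history: list[str],
    max_turns: int = 5,
) -> tuple[str, list[Citation]]:
    """Return the prompt context string and the ordered citation list."""
    blocks: list[str] = []
    citation_map: list[Citation] = []
    for index, r in enumerate(results, start=1):
        header = f"[Source: {r['source_file']}, p.{r['page_number']} | Section: {r['section']}]"
        blocks.append(f"{header}\n{r['text']}")
        citation_map.append(Citation(index, r["source_file"], r["page_number"], r["section"]))
    # Only the most recent turns are kept, and they go before the source blocks.
    recent_turns = history[-max_turns:]
    return "\n".join(recent_turns + blocks), citation_map


results = [
    {"source_file": "report.pdf", "page_number": 3, "section": "Results",
     "text": "The model achieved 94.2% accuracy..."},
    {"source_file": "methods.pdf", "page_number": 7, "section": "Setup",
     "text": "We used a 10,000 sample dataset..."},
]
context, citation_map = build_context(results, ["User: What dataset was used?"])
print(context)
```

Building the citation list in the same loop as the source blocks keeps citation indices aligned with block order, which is what a later marker-to-citation lookup would rely on.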

---

## Test Highlights

**RRF Mathematics (`TestRRFMerge`):**

- `test_rrf_score_formula`: Verifies 1/61 + 1/61 = 2/61 to 9 decimal places
- `test_rrf_chunk_in_both_lists_gets_higher_score`: Core correctness property
- `test_rrf_higher_rank_gets_lower_score`: Monotonicity property (a larger rank number yields a lower score)

**Security (no dedicated security test, since retrieval is read-only):**

- BM25 pickle is loaded only from `cfg.kb_bm25_path(kb_name)`, never from user input
- ChromaDB is queried with pre-computed embeddings, so no raw query text is passed to the DB

---
## Progress Tracker Update

| Phase | Status | Tests | Docs |
|-------|--------|-------|------|
| Phase 0: Foundation | ✅ Done | ✅ 58/58 | ✅ Done |
| Phase 1: Ingestion | ✅ Done | ✅ 46/46 | ✅ Done |
| **Phase 2: Retrieval** | ✅ Done | ✅ 33/33 | ✅ Done |
| Phase 3: ASR | ⬜ Next | ⬜ | ⬜ |
| Phase 4: Generation | ⬜ | ⬜ | ⬜ |
| Phase 5: UI & Access | ⬜ | ⬜ | ⬜ |