
Phase 2: Hybrid Retrieval Engine

Status: ✅ Complete | Tests: 33/33 passed | Date: March 2026


Overview

Phase 2 builds the retrieval pipeline, VoiceVault's technical differentiator. Instead of the plain vector search used in most RAG tutorials, VoiceVault implements hybrid BM25 + dense vector retrieval with Reciprocal Rank Fusion (RRF), cross-encoder reranking, and diversity filtering.

Why hybrid retrieval matters: The 2026 MDPI systematic review of 63 enterprise RAG deployments found that 80.5% still use single-mode retrieval, missing the 20–30% recall improvement that hybrid search provides.


Files Created

| File | Purpose |
| --- | --- |
| voicevault/retrieval/bm25_retriever.py | rank_bm25 keyword search against the persisted index |
| voicevault/retrieval/vector_retriever.py | ChromaDB cosine similarity search |
| voicevault/retrieval/hybrid_retriever.py | RRF merge + cross-encoder + diversity filter |
| voicevault/retrieval/context_builder.py | Formats chunks into the LLM prompt context string |
| tests/test_phase2.py | 33 tests: retrieval correctness, RRF math, diversity, context |

Module Deep-Dives

1. BM25Retriever

Loads the bm25.pkl serialized index (built by IndexBuilder) and scores all chunks against the query using BM25Okapi.

Key behaviors:

  • Zero-score results (no term overlap) are excluded, so only meaningful matches are returned
  • Results sorted descending by BM25 score
  • Returns empty list gracefully if index doesn't exist (no documents ingested)
  • reload() method forces re-read from disk after a new ingest (used by the KB manager)
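
The filtering and ordering rules above can be sketched in plain Python. This is a minimal illustration, not the VoiceVault implementation: `rank_nonzero` and `top_k` are hypothetical names, and the scores are assumed to be the per-chunk values a BM25 scorer (for example rank_bm25's `BM25Okapi.get_scores`) would produce.

```python
from typing import Sequence


def rank_nonzero(
    chunk_ids: Sequence[str],
    scores: Sequence[float],
    top_k: int = 20,  # hypothetical cutoff, not taken from the source
) -> list[tuple[str, float]]:
    """Pair chunk ids with BM25 scores, drop non-matches, sort best-first."""
    if len(chunk_ids) != len(scores):
        raise ValueError("chunk_ids and scores must have the same length")
    # A zero score means the chunk shares no term with the query.
    hits = [(cid, float(s)) for cid, s in zip(chunk_ids, scores) if s > 0.0]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits[:top_k]


# An empty index simply yields an empty result list.
print(rank_nonzero(["a", "b", "c"], [0.0, 2.5, 1.0]))
print(rank_nonzero([], []))
```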

2. VectorRetriever

Encodes the query with all-MiniLM-L6-v2 (same model as ingestion) and queries ChromaDB with cosine similarity.

Score conversion: ChromaDB returns cosine distance (0 = identical, 2 = opposite). The retriever converts it to a similarity score: vector_score = max(0.0, 1.0 - distance). Clamping at zero keeps the score in [0, 1], where 1 is a perfect match; any distance of 1 or more (orthogonal or opposed vectors) scores 0.
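
A few sample values make the conversion concrete. The function name `distance_to_score` is hypothetical; only the formula comes from the text above.

```python
def distance_to_score(distance: float) -> float:
    """Map a cosine distance in [0, 2] to a similarity score in [0, 1]."""
    return max(0.0, 1.0 - distance)


for d in (0.0, 0.25, 1.0, 1.5, 2.0):
    print(d, distance_to_score(d))
```

Distances of 1.0 and above all collapse to 0.0, so unrelated and opposed chunks are indistinguishable after conversion.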

3. HybridRetriever (Core)

Full pipeline:

query → _expand_query() → [q1, q2, q3]
      → BM25 search × 3 queries → merge best scores per chunk_id
      → Vector search × 3 queries → merge best scores per chunk_id
      → _rrf_merge() → {chunk_id: rrf_score}
      → sort by rrf_score, take top-20
      → _rerank() with CrossEncoder → sort by rerank_score
      → _diversity_filter() → max 2 chunks per (source_file, page_number)
      → return top-5 as list[RetrievalResult]
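
The "merge best scores per chunk_id" step in the pipeline above could look roughly like this. It is a sketch under assumptions: the function names are hypothetical, and each expanded query is assumed to return a list of (chunk_id, score) pairs.

```python
def merge_best_scores(
    per_query_hits: list[list[tuple[str, float]]],
) -> dict[str, float]:
    """Keep the highest score each chunk achieved across the expanded queries."""
    best: dict[str, float] = {}
    for hits in per_query_hits:
        for chunk_id, score in hits:
            if score > best.get(chunk_id, float("-inf")):
                best[chunk_id] = score
    return best


def to_ranking(best: dict[str, float]) -> list[str]:
    """Turn merged scores into a best-first list of chunk ids for rank fusion."""
    ordered = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return [chunk_id for chunk_id, _ in ordered]


hits_q1 = [("a", 0.9), ("b", 0.2)]
hits_q2 = [("b", 0.4), ("c", 0.7)]
hits_q3 = [("a", 0.5)]
merged = merge_best_scores([hits_q1, hits_q2, hits_q3])
print(merged)
print(to_ranking(merged))
```

Taking the maximum rather than the sum means a chunk is not rewarded merely for matching several paraphrases of the same question.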

RRF formula (verified in tests):

rrf_score(chunk) = Σ_method  1 / (60 + rank_in_method)
  • k=60 is the standard value from the Cormack 2009 RRF paper
  • A chunk ranked #1 in both methods scores: 2/61 ≈ 0.0328
  • A chunk ranked #5 in both methods scores: 2/65 ≈ 0.0308
  • A chunk ranked #1 in BM25 and #5 in vector scores: 1/61 + 1/65 ≈ 0.0318

The tests test_rrf_chunk_in_both_lists_gets_higher_score and test_rrf_score_formula verify this arithmetic; the formula test checks the result to 9 decimal places.
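
The RRF arithmetic can be reproduced with a few lines of Python. This is a minimal sketch, assuming each retrieval method supplies a best-first list of chunk ids with ranks starting at 1; the signature is hypothetical and may differ from the project's `_rrf_merge()`.

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over every list a chunk appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return scores


bm25_ranking = ["a", "b", "c", "d", "e"]
vector_ranking = ["a", "f", "g", "h", "e"]
fused = rrf_merge([bm25_ranking, vector_ranking])

print(round(fused["a"], 4))  # ranked #1 in both lists: 2/61
print(round(fused["e"], 4))  # ranked #5 in both lists: 2/65
print(round(fused["b"], 4))  # ranked #2 in BM25 only: 1/62
```

Because RRF uses only ranks, the BM25 and cosine scores never need to be normalised onto a common scale before fusion.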

Cross-encoder reranking: The ms-marco-MiniLM-L12-v2 model (33MB) reads each (query, chunk_text) pair together. This joint attention lets the model weigh query terms against the chunk directly, which generally scores relevance more accurately than bi-encoder similarity. For speed, the cross-encoder runs only on the top-20 RRF candidates, not on every indexed chunk.
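
The shape of the rerank step can be sketched without loading a model by passing the scorer in as a function. Everything here is illustrative: `rerank`, `ScoreFn`, and `overlap_scorer` are hypothetical names, and the toy word-overlap scorer only stands in for a real cross-encoder (in practice something like sentence-transformers' `CrossEncoder.predict`, which accepts a list of text pairs, could be supplied instead).

```python
from typing import Callable, Sequence

# A scorer takes (query, chunk_text) pairs and returns one score per pair.
ScoreFn = Callable[[list[tuple[str, str]]], Sequence[float]]


def rerank(
    query: str,
    candidates: list[tuple[str, str]],  # (chunk_id, chunk_text), best RRF first
    score_fn: ScoreFn,
    top_n: int = 20,
) -> list[tuple[str, float]]:
    """Score only the top_n candidates jointly with the query, then re-sort."""
    shortlist = candidates[:top_n]
    if not shortlist:
        return []
    pairs = [(query, text) for _, text in shortlist]
    scores = score_fn(pairs)
    scored = [(cid, float(s)) for (cid, _), s in zip(shortlist, scores)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


def overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    """Toy stand-in for a cross-encoder: counts shared lowercase words."""
    return [
        float(len(set(q.lower().split()) & set(t.lower().split())))
        for q, t in pairs
    ]


candidates = [
    ("c1", "vector search"),
    ("c2", "bm25 ranking function"),
    ("c3", "ranking"),
]
print(rerank("bm25 ranking", candidates, overlap_scorer))
```

Limiting the scorer to the shortlist is what keeps the cost bounded: the expensive model sees at most `top_n` pairs per query regardless of index size.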

Diversity filter: Keeps at most cfg.max_chunks_per_page = 2 chunks from the same (source_file, page_number) pair. This prevents the final context from being dominated by a single dense page.
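
A per-page cap like this can be implemented with a single pass over the reranked list. The sketch below is an assumption about how such a filter might work, not the project's `_diversity_filter()`; the `Hit` dataclass is a hypothetical stand-in for RetrievalResult.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Hit:  # hypothetical stand-in for RetrievalResult
    chunk_id: str
    source_file: str
    page_number: int
    score: float


def diversity_filter(
    hits: list[Hit], max_per_page: int = 2, top_k: int = 5
) -> list[Hit]:
    """Walk hits best-first, skipping any page that already has max_per_page chunks."""
    seen: Counter = Counter()
    kept: list[Hit] = []
    for hit in hits:
        key = (hit.source_file, hit.page_number)
        if seen[key] >= max_per_page:
            continue
        seen[key] += 1
        kept.append(hit)
        if len(kept) == top_k:
            break
    return kept


hits = [
    Hit("c1", "report.pdf", 3, 0.9),
    Hit("c2", "report.pdf", 3, 0.8),
    Hit("c3", "report.pdf", 3, 0.7),  # third chunk from the same page: dropped
    Hit("c4", "report.pdf", 4, 0.6),
]
print([h.chunk_id for h in diversity_filter(hits)])
```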

Multi-KB support: HybridRetriever accepts kb_names: list[str]. It runs BM25 + vector search against all selected KBs in a single retrieve() call and merges results before RRF. This enables cross-KB queries in Phase 5.

4. ContextBuilder

Formats the top-k RetrievalResult objects into a structured context string:

[Source: report.pdf, p.3 | Section: Results]
The model achieved 94.2% accuracy...

[Source: methods.pdf, p.7 | Section: Setup]
We used a 10,000 sample dataset...

Also builds the citation_map: list[Citation]. Each Citation corresponds to one source block, ordered by citation index. The LLM is told to cite using [Source: filename, p.N] markers. The CitationInjector (Phase 4) will map these markers back to the Citation objects for the UI panel.
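
The block format and the parallel citation list can be sketched as follows. The `Chunk` and `Citation` dataclasses and their fields are hypothetical stand-ins inferred from the format shown above; the real classes may carry more fields.

```python
from dataclasses import dataclass


@dataclass
class Chunk:  # hypothetical stand-in for RetrievalResult
    source_file: str
    page_number: int
    section: str
    text: str


@dataclass
class Citation:  # hypothetical fields, inferred from the marker format
    index: int
    source_file: str
    page_number: int


def build_context(chunks: list[Chunk]) -> tuple[str, list[Citation]]:
    """Render one header + text block per chunk and a citation per block."""
    blocks: list[str] = []
    citations: list[Citation] = []
    for index, chunk in enumerate(chunks, start=1):
        header = (
            f"[Source: {chunk.source_file}, p.{chunk.page_number}"
            f" | Section: {chunk.section}]"
        )
        blocks.append(f"{header}\n{chunk.text}")
        citations.append(Citation(index, chunk.source_file, chunk.page_number))
    return "\n\n".join(blocks), citations


context, citation_map = build_context([
    Chunk("report.pdf", 3, "Results", "The model achieved 94.2% accuracy..."),
    Chunk("methods.pdf", 7, "Setup", "We used a 10,000 sample dataset..."),
])
print(context)
```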

Conversation history (last 5 turns) is prepended to the context string, enabling follow-up question handling.


Test Highlights

RRF Mathematics (TestRRFMerge):

  • test_rrf_score_formula: Verifies 1/61 + 1/61 = 2/61 to 9 decimal places
  • test_rrf_chunk_in_both_lists_gets_higher_score: Core correctness property
  • test_rrf_higher_rank_gets_lower_score: Monotonicity property (a larger rank number, meaning a worse position, yields a lower score)

Security (no dedicated security test, since retrieval is read-only):

  • BM25 pickle loaded only from cfg.kb_bm25_path(kb_name), never from user input
  • ChromaDB queried with pre-computed embeddings; no raw query text is passed to the DB

Progress Tracker Update

| Phase | Status | Tests | Docs |
| --- | --- | --- | --- |
| Phase 0: Foundation | ✅ Done | ✅ 58/58 | ✅ Done |
| Phase 1: Ingestion | ✅ Done | ✅ 46/46 | ✅ Done |
| Phase 2: Retrieval | ✅ Done | ✅ 33/33 | ✅ Done |
| Phase 3: ASR | ⬜ Next | ⬜ | ⬜ |
| Phase 4: Generation | ⬜ | ⬜ | ⬜ |
| Phase 5: UI & Access | ⬜ | ⬜ | ⬜ |