	# Phase 2 — Hybrid Retrieval Engine
	Status: ✅ Complete | Tests: 33/33 passed | Date: March 2026

	---

	## Overview
	Phase 2 builds the retrieval pipeline — VoiceVault's technical differentiator. Instead of simple vector search (which most RAG tutorials use), VoiceVault implements hybrid BM25 + dense vector retrieval with Reciprocal Rank Fusion, cross-encoder reranking, and diversity filtering.

	Why hybrid retrieval matters: The 2026 MDPI systematic review of 63 enterprise RAG deployments found that 80.5% still use single-mode retrieval, missing the 20–30% recall improvement that hybrid search provides.

	---

	## Files Created

	| File | Purpose |
	|------|---------|
	| `voicevault/retrieval/bm25_retriever.py` | rank_bm25 keyword search against persisted index |
	| `voicevault/retrieval/vector_retriever.py` | ChromaDB cosine similarity search |
	| `voicevault/retrieval/hybrid_retriever.py` | RRF merge + cross-encoder + diversity filter |
	| `voicevault/retrieval/context_builder.py` | Formats chunks into LLM prompt context string |
	| `tests/test_phase2.py` | 33 tests: retrieval correctness, RRF math, diversity, context |

	---

	## Module Deep-Dives

	### 1. BM25Retriever

	Loads the `bm25.pkl` serialized index (built by IndexBuilder) and scores all chunks against the query using BM25Okapi.

	Key behaviors:
	- Zero-score results (no term overlap) are excluded — returns only meaningful matches
	- Results sorted descending by BM25 score
	- Returns empty list gracefully if index doesn't exist (no documents ingested)
	- `reload()` method forces re-read from disk after a new ingest (used by the KB manager)
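
The filtering and ordering behaviors above can be sketched independently of the index. This is a minimal illustration, not the module's actual code: `rank_nonzero` and its parameters are hypothetical names, and the scores are supplied by hand. In the real retriever they would presumably come from `BM25Okapi.get_scores(...)` on the unpickled index.

```python
def rank_nonzero(chunk_ids, scores, top_k=20):
    """Pair chunk IDs with BM25 scores, drop zero scores, sort descending.

    `chunk_ids` and `scores` are parallel sequences (hypothetical shape).
    Two empty sequences, as when no index exists yet, yield an empty list.
    """
    matches = [(cid, float(s)) for cid, s in zip(chunk_ids, scores) if s > 0]
    matches.sort(key=lambda pair: pair[1], reverse=True)
    return matches[:top_k]


# "c1" has no term overlap with the query, so it is excluded.
print(rank_nonzero(["c1", "c2", "c3"], [0.0, 2.5, 1.0]))
# [('c2', 2.5), ('c3', 1.0)]
```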

	### 2. VectorRetriever

	Encodes the query with `all-MiniLM-L6-v2` (same model as ingestion) and queries ChromaDB with cosine similarity.

	Score conversion:
	ChromaDB returns cosine distance (0 = identical, 2 = opposite). The retriever converts it to a similarity score, `vector_score = max(0.0, 1.0 - distance)`, so scores fall in [0, 1] with 1 = perfect match. Any distance above 1 (negative cosine similarity) is clamped to 0.
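
As a quick check of the conversion, the snippet below applies the formula to a few sample distances. The function name `distance_to_score` is an assumption made for illustration; only the expression itself comes from the text above.

```python
def distance_to_score(distance: float) -> float:
    """Convert a cosine distance in [0, 2] to a similarity score in [0, 1]."""
    return max(0.0, 1.0 - distance)


# Sample distances: identical, close, orthogonal, opposite.
for d in (0.0, 0.25, 1.0, 2.0):
    print(d, "->", distance_to_score(d))
```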

	### 3. HybridRetriever (Core)

	Full pipeline:
	```
	query → _expand_query() → [q1, q2, q3]
	→ BM25 search × 3 queries → merge best scores per chunk_id
	→ Vector search × 3 queries → merge best scores per chunk_id
	→ _rrf_merge() → {chunk_id: rrf_score}
	→ sort by rrf_score, take top-20
	→ _rerank() with CrossEncoder → sort by rerank_score
	→ _diversity_filter() → max 2 chunks per (source_file, page_number)
	→ return top-5 as list[RetrievalResult]
	```
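
The "merge best scores per chunk_id" step in the pipeline can be sketched as follows. This assumes each expanded query yields a list of `(chunk_id, score)` pairs; the name `merge_best_scores` and that result shape are hypothetical, not taken from the VoiceVault source.

```python
def merge_best_scores(per_query_results):
    """Keep the highest score seen for each chunk across all query variants.

    `per_query_results` is a list with one entry per expanded query, each
    entry being a list of (chunk_id, score) pairs (assumed shape).
    Returns (chunk_id, score) pairs sorted best-first, ready for ranking.
    """
    best = {}
    for results in per_query_results:
        for chunk_id, score in results:
            if chunk_id not in best or score > best[chunk_id]:
                best[chunk_id] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


merged = merge_best_scores([
    [("a", 1.0), ("b", 0.5)],   # results for q1
    [("b", 0.9)],               # results for q2
    [("c", 0.2), ("a", 0.3)],   # results for q3
])
print(merged)
# [('a', 1.0), ('b', 0.9), ('c', 0.2)]
```

The position of each chunk in the merged list is what the RRF step would then use as its rank for that method.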

	RRF Formula (verified in test):
	```python
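	ranks = {"bm25": 1, "vector": 5}  # example: 1-based rank of one chunk in each method
	rrf_score = sum(1.0 / (60 + rank) for rank in ranks.values())  # k = 60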
	```
	- k=60 is the standard value from the Cormack 2009 RRF paper
	- A chunk ranked #1 in both methods scores: 2/61 ≈ 0.0328
	- A chunk ranked #5 in both methods scores: 2/65 ≈ 0.0308
	- A chunk ranked #1 in BM25 and #5 in vector scores: 1/61 + 1/65 ≈ 0.0318

	The tests `test_rrf_chunk_in_both_lists_gets_higher_score` and `test_rrf_score_formula` verify the mathematics exactly.
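
A minimal sketch of an RRF merge consistent with the formula above, assuming each method contributes a best-first list of chunk IDs. The function signature is illustrative and may differ from the real `_rrf_merge()`.

```python
def rrf_merge(ranked_lists, k=60):
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over every list a chunk appears in.

    `ranked_lists` holds one best-first list of chunk IDs per method
    (assumed shape). Ranks are 1-based, so the top hit contributes 1 / (k + 1).
    """
    scores = {}
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return scores


bm25_ranking = ["a", "b", "c"]
vector_ranking = ["a", "c", "d"]
scores = rrf_merge([bm25_ranking, vector_ranking])

# "a" is #1 in both lists: 1/61 + 1/61 = 2/61.
assert abs(scores["a"] - 2 / 61) < 1e-9
# "c" appears in both lists, so it outranks "b", which appears in only one.
assert scores["c"] > scores["b"]
```

A chunk missing from one list simply contributes nothing for that method, which is why appearing in both lists tends to dominate a single high rank.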

	Cross-encoder reranking:
	The `ms-marco-MiniLM-L12-v2` model (33MB) reads each `(query, chunk_text)` pair as a single input, so attention runs across both texts jointly. This typically scores relevance more accurately than comparing bi-encoder embeddings that were computed independently. For speed, the cross-encoder runs only on the top-20 RRF candidates, not on all indexed chunks.
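
The control flow of the reranking step can be sketched with the scorer passed in as a function. Everything here is an assumption for illustration: `rerank`, its parameters, and the toy word-overlap scorer are hypothetical stand-ins, with the toy scorer used only so the example runs without downloading a model.

```python
def rerank(query, candidates, score_pairs, top_n=20):
    """Re-score the top-N candidates jointly with the query and re-sort.

    `candidates` is a best-first list of (chunk_id, chunk_text) pairs.
    `score_pairs` takes a list of (query, text) pairs and returns one
    relevance score per pair (assumed interface).
    """
    shortlist = candidates[:top_n]  # never score more than top_n chunks
    pairs = [(query, text) for _, text in shortlist]
    scores = score_pairs(pairs)
    ranked = sorted(zip(shortlist, scores), key=lambda item: item[1], reverse=True)
    return [(chunk_id, float(score)) for (chunk_id, _), score in ranked]


def toy_overlap_scorer(pairs):
    """Stand-in scorer: number of lowercase words shared by query and text."""
    return [len(set(q.lower().split()) & set(t.lower().split())) for q, t in pairs]


candidates = [
    ("c1", "the weather is nice"),
    ("c2", "model accuracy on the test set"),
    ("c3", "accuracy"),
]
print(rerank("model accuracy", candidates, toy_overlap_scorer))
# [('c2', 2.0), ('c3', 1.0), ('c1', 0.0)]
```

With sentence-transformers installed, the `predict` method of a loaded `CrossEncoder` accepts a list of text pairs and could plausibly be passed as `score_pairs`.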

	Diversity filter:
	Caps at `cfg.max_chunks_per_page = 2` chunks from the same `(source_file, page_number)` pair. This prevents the final context from being dominated by a single dense page.
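
The cap can be sketched as a single pass over the reranked results. The dictionary keys and the function name `diversity_filter` below are assumptions for illustration; the real `RetrievalResult` type may look different.

```python
from collections import Counter


def diversity_filter(results, max_per_page=2, top_k=5):
    """Keep at most `max_per_page` results per (source_file, page_number).

    `results` is a best-first list of dicts with 'source_file' and
    'page_number' keys (assumed shape). Order is preserved.
    """
    seen = Counter()
    kept = []
    for result in results:
        key = (result["source_file"], result["page_number"])
        if seen[key] >= max_per_page:
            continue  # this page already contributed its quota
        seen[key] += 1
        kept.append(result)
        if len(kept) == top_k:
            break
    return kept


reranked = [
    {"id": 1, "source_file": "report.pdf", "page_number": 3},
    {"id": 2, "source_file": "report.pdf", "page_number": 3},
    {"id": 3, "source_file": "report.pdf", "page_number": 3},  # third hit on p.3: dropped
    {"id": 4, "source_file": "methods.pdf", "page_number": 7},
]
print([r["id"] for r in diversity_filter(reranked)])
# [1, 2, 4]
```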

	Multi-KB support:
	HybridRetriever accepts `kb_names: list[str]`. It runs BM25 + vector search against all selected KBs in a single `retrieve()` call and merges results before RRF. This enables cross-KB queries in Phase 5.

	### 4. ContextBuilder

	Formats the top-k RetrievalResult objects into a structured context string:
	```
	[Source: report.pdf, p.3 | Section: Results]
	The model achieved 94.2% accuracy...

	[Source: methods.pdf, p.7 | Section: Setup]
	We used a 10,000 sample dataset...
	```

	ContextBuilder also builds `citation_map: list[Citation]`, in which each Citation corresponds to one source block, ordered by citation index. The LLM is instructed to cite using `[Source: filename, p.N]` markers, and the CitationInjector (Phase 4) will map these markers back to the Citation objects for the UI panel.

	Conversation history (last 5 turns) is prepended to the context string, enabling follow-up question handling.
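
A rough sketch of how such a context string and citation list might be assembled. The `Chunk` dataclass, the tuple-based citations, and the plain-string history format are all assumptions made for this example, not the actual `RetrievalResult` or `Citation` types.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for a retrieval result."""
    source_file: str
    page_number: int
    section: str
    text: str


def build_context(chunks, history=(), max_turns=5):
    """Return (context_string, citations).

    Each chunk becomes a header line plus its text. The last `max_turns`
    history entries (plain strings here, an assumption) are prepended.
    Citations are (index, source_file, page_number) tuples, 1-based.
    """
    blocks = [
        f"[Source: {c.source_file}, p.{c.page_number} | Section: {c.section}]\n{c.text}"
        for c in chunks
    ]
    citations = [
        (i, c.source_file, c.page_number) for i, c in enumerate(chunks, start=1)
    ]
    recent_turns = list(history)[-max_turns:]
    return "\n\n".join(recent_turns + blocks), citations


context, citations = build_context(
    [
        Chunk("report.pdf", 3, "Results", "The model achieved 94.2% accuracy..."),
        Chunk("methods.pdf", 7, "Setup", "We used a 10,000 sample dataset..."),
    ],
    history=["User: What accuracy did the model reach?"],
)
print(context)
print(citations)
```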

	---

	## Test Highlights

	RRF Mathematics (`TestRRFMerge`):
	- `test_rrf_score_formula`: Verifies 1/61 + 1/61 = 2/61 to 9 decimal places
	- `test_rrf_chunk_in_both_lists_gets_higher_score`: Core correctness property
	- `test_rrf_higher_rank_gets_lower_score`: Monotonicity property (a larger rank number, i.e. a worse position, yields a lower score)

	Security (no dedicated security test — retrieval is read-only):
	- BM25 pickle loaded only from `cfg.kb_bm25_path(kb_name)` — never from user input
	- ChromaDB queried with pre-computed embeddings — no raw query text passed to the DB

	---

	## Progress Tracker Update

	| Phase | Status | Tests | Docs |
	|-------|--------|-------|------|
	| Phase 0 — Foundation | ✅ Done | ✅ 58/58 | ✅ Done |
	| Phase 1 — Ingestion | ✅ Done | ✅ 46/46 | ✅ Done |
	| Phase 2 — Retrieval | ✅ Done | ✅ 33/33 | ✅ Done |
	| Phase 3 — ASR | ⬜ Next | ⬜ | ⬜ |
	| Phase 4 — Generation | ⬜ | ⬜ | ⬜ |
	| Phase 5 — UI & Access | ⬜ | ⬜ | ⬜ |