# Phase 2: Hybrid Retrieval Engine
**Status:** ✅ Complete | **Tests:** 33/33 passed | **Date:** March 2026

---

## Overview
Phase 2 builds the retrieval pipeline, VoiceVault's technical differentiator. Instead of the plain vector search used in most RAG tutorials, VoiceVault implements hybrid BM25 + dense vector retrieval with Reciprocal Rank Fusion (RRF), cross-encoder reranking, and diversity filtering.

**Why hybrid retrieval matters:** The 2026 MDPI systematic review of 63 enterprise RAG deployments found that 80.5% still use single-mode retrieval, missing the 20-30% recall improvement that hybrid search provides.

---

## Files Created

| File | Purpose |
|------|---------|
| `voicevault/retrieval/bm25_retriever.py` | rank_bm25 keyword search against persisted index |
| `voicevault/retrieval/vector_retriever.py` | ChromaDB cosine similarity search |
| `voicevault/retrieval/hybrid_retriever.py` | RRF merge + cross-encoder + diversity filter |
| `voicevault/retrieval/context_builder.py` | Formats chunks into LLM prompt context string |
| `tests/test_phase2.py` | 33 tests: retrieval correctness, RRF math, diversity, context |

---

## Module Deep-Dives

### 1. BM25Retriever

Loads the `bm25.pkl` serialized index (built by IndexBuilder) and scores all chunks against the query using BM25Okapi.

**Key behaviors:**
- Zero-score results (no term overlap) are excluded, so only meaningful matches are returned
- Results are sorted by BM25 score, descending
- Returns an empty list if the index doesn't exist (no documents ingested)
- `reload()` forces a re-read from disk after a new ingest (used by the KB manager)
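
The filtering and ordering behaviors above can be sketched without the index itself. This is a minimal illustration, assuming the raw BM25 scores arrive as a list aligned with the chunk IDs; `rank_bm25_hits` and its parameters are hypothetical names, not the module's actual API.

```python
def rank_bm25_hits(chunk_ids, scores, top_k=10):
    """Drop zero-score chunks and return (chunk_id, score) pairs, best first.

    Hypothetical helper: `chunk_ids` and `scores` are parallel lists.
    An empty index (no documents ingested) yields an empty list.
    """
    if not chunk_ids:
        return []
    # Keep only chunks that share at least one term with the query.
    hits = [(cid, s) for cid, s in zip(chunk_ids, scores) if s > 0.0]
    # Highest BM25 score first.
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits[:top_k]


print(rank_bm25_hits(["c1", "c2", "c3"], [0.0, 2.4, 1.1]))
# → [('c2', 2.4), ('c3', 1.1)]
```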

### 2. VectorRetriever

Encodes the query with `all-MiniLM-L6-v2` (same model as ingestion) and queries ChromaDB with cosine similarity.

**Score conversion:**
ChromaDB returns cosine *distance* (0 = identical, 2 = opposite). The retriever converts it to a similarity score: `vector_score = max(0.0, 1.0 - distance)`. The clamp keeps scores in [0, 1], where 1 is a perfect match; any distance of 1 or more (orthogonal or opposed vectors) maps to 0.
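
A small sketch of that conversion, using the formula quoted above; the function name `distance_to_similarity` is an illustrative assumption.

```python
def distance_to_similarity(distance: float) -> float:
    """Convert a cosine distance in [0, 2] to a similarity score in [0, 1]."""
    # Distances of 1.0 or more would give a negative value, so clamp to 0.
    return max(0.0, 1.0 - distance)


for d in (0.0, 0.25, 1.0, 1.5):
    print(d, distance_to_similarity(d))
# 0.0 1.0
# 0.25 0.75
# 1.0 0.0
# 1.5 0.0
```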

### 3. HybridRetriever (Core)

**Full pipeline:**
```
query → _expand_query() → [q1, q2, q3]
      → BM25 search × 3 queries → merge best scores per chunk_id
      → Vector search × 3 queries → merge best scores per chunk_id
      → _rrf_merge() → {chunk_id: rrf_score}
      → sort by rrf_score, take top-20
      → _rerank() with CrossEncoder → sort by rerank_score
      → _diversity_filter() → max 2 chunks per (source_file, page_number)
      → return top-5 as list[RetrievalResult]
```
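
The "merge best scores per chunk_id" step in the pipeline can be sketched as follows. This is an illustration only: it assumes each expanded query yields a `{chunk_id: score}` mapping, and `merge_best_scores` and `to_ranking` are hypothetical helper names.

```python
def merge_best_scores(per_query_results):
    """Keep the highest score seen for each chunk_id across query variants.

    `per_query_results` is a list of {chunk_id: score} dicts, one per
    expanded query (an assumed shape for this sketch).
    """
    best = {}
    for results in per_query_results:
        for chunk_id, score in results.items():
            if score > best.get(chunk_id, float("-inf")):
                best[chunk_id] = score
    return best


def to_ranking(best):
    """Turn merged scores into a best-first list of chunk_ids for RRF."""
    return [cid for cid, _ in sorted(best.items(), key=lambda kv: kv[1], reverse=True)]


merged = merge_best_scores([
    {"a": 0.9, "b": 0.4},   # results for q1
    {"b": 0.7, "c": 0.2},   # results for q2
    {"a": 0.5},             # results for q3
])
print(merged)              # → {'a': 0.9, 'b': 0.7, 'c': 0.2}
print(to_ranking(merged))  # → ['a', 'b', 'c']
```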

**RRF Formula (verified in test):**
```
rrf_score(chunk) = Σ_method  1 / (60 + rank_in_method)
```
- k=60 is the standard value from the Cormack et al. (2009) RRF paper
- A chunk ranked #1 in both methods scores: 2/61 ≈ 0.0328
- A chunk ranked #5 in both methods scores: 2/65 ≈ 0.0308
- A chunk ranked #1 in BM25 and #5 in vector scores: 1/61 + 1/65 ≈ 0.0318

The tests `test_rrf_chunk_in_both_lists_gets_higher_score` and `test_rrf_score_formula` verify the mathematics exactly.
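
A minimal sketch of the RRF merge described above, assuming each retrieval method contributes a best-first list of chunk IDs with ranks starting at 1. The name `rrf_merge` mirrors the pipeline's `_rrf_merge()`, but the signature here is an assumption.

```python
def rrf_merge(rankings, k=60):
    """Reciprocal Rank Fusion over several best-first lists of chunk_ids.

    Each list contributes 1 / (k + rank) for every chunk it contains,
    with rank starting at 1. Returns {chunk_id: rrf_score}.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return scores


bm25_ranking = ["a", "b", "c"]
vector_ranking = ["a", "c", "d"]
scores = rrf_merge([bm25_ranking, vector_ranking])

# Sort best-first; "a" is #1 in both lists, so it scores 2/61.
ordered = sorted(scores, key=scores.get, reverse=True)
print(ordered)                 # → ['a', 'c', 'b', 'd']
print(round(scores["a"], 4))   # → 0.0328
```

Chunk `c` appears in both lists and outranks `b`, which appears in only one, even though `b` holds the better single rank.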

**Cross-encoder reranking:**
The `ms-marco-MiniLM-L12-v2` model (33MB) reads each `(query, chunk_text)` pair together. Because attention runs jointly over both texts, it scores relevance more accurately than bi-encoder similarity, which embeds the query and chunk separately. For speed, the cross-encoder runs only on the top-20 RRF candidates, not on all indexed chunks.
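
The reranking step can be sketched with the model abstracted behind a callable. Everything here is an assumption for illustration: `rerank`, `score_pairs`, and `toy_scorer` are hypothetical names, and the toy scorer merely counts shared words so the example runs without downloading a model. In sentence-transformers, `CrossEncoder.predict` takes a list of `(query, text)` pairs and could be passed as `score_pairs`.

```python
def rerank(query, candidates, score_pairs, top_n=20):
    """Score only the first `top_n` candidates and sort them by that score.

    `candidates` is a best-first list of (chunk_id, text) tuples.
    `score_pairs` maps a list of (query, text) pairs to a list of scores.
    """
    shortlist = candidates[:top_n]
    if not shortlist:
        return []
    pairs = [(query, text) for _, text in shortlist]
    scores = score_pairs(pairs)
    ranked = sorted(zip(shortlist, scores), key=lambda item: item[1], reverse=True)
    return [(chunk_id, float(score)) for (chunk_id, _), score in ranked]


def toy_scorer(pairs):
    """Stand-in for a cross-encoder: count query words found in the text."""
    return [
        sum(word in text.lower().split() for word in query.lower().split())
        for query, text in pairs
    ]


candidates = [
    ("c1", "setup details"),
    ("c2", "the model accuracy was high"),
    ("c3", "model size"),
]
print(rerank("model accuracy", candidates, toy_scorer))
# → [('c2', 2.0), ('c3', 1.0), ('c1', 0.0)]
```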

**Diversity filter:**
Caps at `cfg.max_chunks_per_page = 2` chunks from the same `(source_file, page_number)` pair. This prevents the final context from being dominated by a single dense page.
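
One way to sketch this cap, assuming results arrive best-first as dicts with `source_file` and `page_number` keys. The dict shape, the `id` field, and the function signature are illustrative assumptions, not the actual `RetrievalResult` type.

```python
from collections import Counter


def diversity_filter(results, max_per_page=2, top_k=5):
    """Keep at most `max_per_page` results per (source_file, page_number).

    `results` must already be sorted best-first; order is preserved.
    """
    seen = Counter()
    kept = []
    for result in results:
        key = (result["source_file"], result["page_number"])
        if seen[key] >= max_per_page:
            continue  # this page already has its quota
        seen[key] += 1
        kept.append(result)
        if len(kept) == top_k:
            break
    return kept


results = [
    {"id": "r1", "source_file": "report.pdf", "page_number": 3},
    {"id": "r2", "source_file": "report.pdf", "page_number": 3},
    {"id": "r3", "source_file": "report.pdf", "page_number": 3},
    {"id": "m1", "source_file": "methods.pdf", "page_number": 7},
]
print([r["id"] for r in diversity_filter(results)])
# → ['r1', 'r2', 'm1']
```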

**Multi-KB support:**
HybridRetriever accepts `kb_names: list[str]`. It runs BM25 + vector search against all selected KBs in a single `retrieve()` call and merges results before RRF. This enables cross-KB queries in Phase 5.

### 4. ContextBuilder

Formats the top-k RetrievalResult objects into a structured context string:
```
[Source: report.pdf, p.3 | Section: Results]
The model achieved 94.2% accuracy...

[Source: methods.pdf, p.7 | Section: Setup]
We used a 10,000 sample dataset...
```

Also builds the `citation_map: list[Citation]`, where each Citation corresponds to one source block, ordered by citation index. The LLM is told to cite using `[Source: filename, p.N]` markers. The CitationInjector (Phase 4) will map these markers back to the Citation objects for the UI panel.

Conversation history (last 5 turns) is prepended to the context string, enabling follow-up question handling.
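
A rough sketch of the formatting described in this section. The source block layout follows the example above; the history layout (`role: text` lines), the dict keys, and the name `build_context` are assumptions made for illustration.

```python
def build_context(results, history=(), max_turns=5):
    """Format retrieval results as source blocks, with recent history first.

    `results` is a list of dicts with source_file, page_number, section, text.
    `history` is a sequence of (role, text) turns; only the last
    `max_turns` are kept. Both shapes are assumed for this sketch.
    """
    blocks = []
    for r in results:
        header = (
            f"[Source: {r['source_file']}, p.{r['page_number']}"
            f" | Section: {r['section']}]"
        )
        blocks.append(f"{header}\n{r['text']}")
    context = "\n\n".join(blocks)

    recent = list(history)[-max_turns:]
    if recent:
        turns = "\n".join(f"{role}: {text}" for role, text in recent)
        context = f"{turns}\n\n{context}"
    return context


results = [
    {"source_file": "report.pdf", "page_number": 3,
     "section": "Results", "text": "The model achieved 94.2% accuracy..."},
]
print(build_context(results, history=[("user", "What was the accuracy?")]))
```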

---

## Test Highlights

**RRF Mathematics (`TestRRFMerge`):**
- `test_rrf_score_formula`: Verifies 1/61 + 1/61 = 2/61 to 9 decimal places
- `test_rrf_chunk_in_both_lists_gets_higher_score`: Core correctness property
- `test_rrf_higher_rank_gets_lower_score`: Monotonicity property

**Security (no dedicated security test; retrieval is read-only):**
- BM25 pickle loaded only from `cfg.kb_bm25_path(kb_name)`, never from user input
- ChromaDB queried with pre-computed embeddings, so no raw query text is passed to the DB

---

## Progress Tracker Update

| Phase | Status | Tests | Docs |
|-------|--------|-------|------|
| Phase 0: Foundation | ✅ Done | ✅ 58/58 | ✅ Done |
| Phase 1: Ingestion | ✅ Done | ✅ 46/46 | ✅ Done |
| **Phase 2: Retrieval** | ✅ Done | ✅ 33/33 | ✅ Done |
| Phase 3: ASR | ⬜ Next | ⬜ | ⬜ |
| Phase 4: Generation | ⬜ | ⬜ | ⬜ |
| Phase 5: UI & Access | ⬜ | ⬜ | ⬜ |