# VoiceVault – End-to-End Implementation Plan
**Author:** Navnit Amrutharaj
**Model:** VoiceVault v1.0 – Voice-First RAG Knowledge Agent
**Stack:** Whisper · LangChain · ChromaDB · Groq · Gradio
**Target:** $0/month · HuggingFace Spaces · 10 Weeks
**Plan Date:** March 2026

---

## Table of Contents
1. [Project Overview](#1-project-overview)
2. [Architecture Summary](#2-architecture-summary)
3. [Phase Map](#3-phase-map)
4. [Phase 0 – Project Foundation](#phase-0--project-foundation)
5. [Phase 1 – Document Ingestion Pipeline](#phase-1--document-ingestion-pipeline)
6. [Phase 2 – Hybrid Retrieval Engine](#phase-2--hybrid-retrieval-engine)
7. [Phase 3 – ASR & Voice Input](#phase-3--asr--voice-input)
8. [Phase 4 – Generation Chain & Citations](#phase-4--generation-chain--citations)
9. [Phase 5 – Full UI, TTS & Access Control](#phase-5--full-ui-tts--access-control)
10. [Quality Gates](#10-quality-gates)
11. [Security Audit Checklist](#11-security-audit-checklist)
12. [Progress Tracker](#12-progress-tracker)

---

## 1. Project Overview

VoiceVault is a **voice-first retrieval-augmented generation (RAG) knowledge agent** that enables users to:
- Speak questions into a browser microphone
- Have each question transcribed (Whisper), answered from retrieved context, and spoken back
- Reference private document collections (PDFs, Notion exports, Confluence, DOCX, MD)
- Receive fully cited answers anchored to source document + page + paragraph

**Core differentiator:** Hybrid BM25 + vector search with Reciprocal Rank Fusion (RRF) + cross-encoder reranking, a retrieval depth that most RAG tutorials skip.

---

## 2. Architecture Summary

```
INGESTION PATH (one-time per document set)
  User uploads PDFs / HTML / DOCX / MD
      ↓
  DocumentParser → text extraction (PyMuPDF, BS4, python-docx)
      ↓
  SemanticChunker → sentence-aware chunks (spaCy + cosine boundary)
      ↓
  IndexBuilder → ChromaDB (vectors) + BM25 (keywords) + SQLite (metadata)

QUERY PATH (real-time, per user question)
  Browser mic → Gradio Audio → Whisper Large-v3 (HuggingFace GPU)
      ↓
  QueryPreprocessor → cleanup + intent class + language detect
      ↓
  HybridRetriever → BM25 top-20 + Vector top-20 → RRF merge → CrossEncoder top-5
      ↓
  LangChain LCEL → Groq Llama-3.1-70B (stream) / Gemini Flash (fallback)
      ↓
  CitationInjector → [Source: filename, p.N] inline citations
      ↓
  Gradio UI (text + highlight citations) + Web Speech API (spoken answer)
```

---

## 3. Phase Map

| Phase | Name | Weeks | Core Deliverables |
|-------|------|-------|-------------------|
| **0** | Project Foundation | 0 | Scaffold, config, models, SQLite schema, Gradio skeleton |
| **1** | Document Ingestion | 1–2 | Parser, semantic chunker, ChromaDB + BM25 + SQLite indexer |
| **2** | Hybrid Retrieval | 3 | BM25 + vector + RRF + cross-encoder + diversity filter |
| **3** | ASR & Voice Input | 4 | Whisper Large-v3, Distil fallback, query preprocessor |
| **4** | Generation & Citations | 5 | LangChain LCEL, Groq, Gemini fallback, faithfulness guard |
| **5** | Full UI & Access Control | 6–8 | 4-tab Gradio UI, Web Speech TTS, multi-KB, bcrypt, audit log |

---

## Phase 0 – Project Foundation

### Goal
Establish the complete project skeleton (directory structure, dependencies, centralized config, Pydantic data models, SQLite schema, and a working 4-tab Gradio scaffold) before any business logic is written.

### Files Created
```
voicevault/
├── app.py                          # Gradio Blocks entry point
├── config.py                       # Pydantic-settings centralized config
├── requirements.txt                # All project dependencies (pinned)
├── .env.example                    # Environment variable template
├── voicevault/
│   ├── __init__.py                 # Package init + version
│   ├── models.py                   # Pydantic data models (all schemas)
│   ├── asr/__init__.py
│   ├── ingestion/__init__.py
│   ├── retrieval/__init__.py
│   ├── generation/__init__.py
│   ├── kb/__init__.py
│   ├── tts/__init__.py
│   └── storage/
│       ├── __init__.py
│       └── sqlite_store.py         # Schema creation + DB init
├── ui/
│   ├── __init__.py
│   ├── tabs/
│   │   ├── __init__.py
│   │   ├── ask_tab.py              # Placeholder: voice query tab
│   │   ├── kb_tab.py               # Placeholder: KB manager tab
│   │   ├── analytics_tab.py        # Placeholder: analytics tab
│   │   └── settings_tab.py         # Placeholder: settings tab
│   └── components/
│       ├── __init__.py
│       ├── citation_panel.py       # Placeholder: citation display
│       └── audio_controls.py       # Placeholder: TTS controls
├── tests/
│   ├── __init__.py
│   ├── conftest.py                 # Pytest fixtures
│   └── test_phase0.py              # Foundation smoke tests
├── data/                           # Runtime data (gitignored)
└── DOCS/
    └── phase0_foundation.md        # Phase 0 documentation
```

### Key Decisions
- **pydantic-settings** for type-safe env var loading (no raw `os.environ` calls)
- **pathlib.Path** throughout: cross-platform, no `os.path`
- **SQLite stdlib** for metadata: zero-dependency, portable, no server
- **Gradio 4.x Blocks** for UI: native HuggingFace Spaces support
- **`__version__` sentinel** in `voicevault/__init__.py` for release tracking
- **Data models locked early** to prevent schema drift across phases
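
As a rough sketch of the SQLite and `pathlib` decisions above, schema creation could look like the following. The three table names come from `test_db_init`; every column, the `init_db` and `list_tables` names, and the use of an in-memory database in the usage line are assumptions made for illustration, not the project's actual schema.

```python
from __future__ import annotations

import sqlite3
from pathlib import Path

# Table names follow test_db_init; all columns are illustrative assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS knowledge_bases (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS documents (
    id INTEGER PRIMARY KEY,
    kb_id INTEGER NOT NULL REFERENCES knowledge_bases(id),
    filename TEXT NOT NULL,
    sha256 TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS query_log (
    id INTEGER PRIMARY KEY,
    session_id TEXT NOT NULL,
    query_hash TEXT NOT NULL,
    latency_ms INTEGER,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def init_db(db_path: Path | str) -> sqlite3.Connection:
    """Create the schema if it is missing and return an open connection."""
    if str(db_path) != ":memory:":
        # Make sure the parent data directory exists before SQLite opens the file.
        Path(db_path).parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(str(db_path))
    conn.executescript(SCHEMA)
    return conn


def list_tables(conn: sqlite3.Connection) -> list[str]:
    """Return user table names in alphabetical order."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


conn = init_db(":memory:")
print(list_tables(conn))
```

Because every statement uses `CREATE TABLE IF NOT EXISTS`, calling `init_db` repeatedly on the same file is safe.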

### Tests
| Test | Description | Pass Criteria |
|------|-------------|---------------|
| `test_config_loads` | Config instantiates without exceptions | No exception |
| `test_env_defaults` | Default values are correct types | All fields pass type check |
| `test_db_init` | SQLite schema creates 3 tables | Tables `knowledge_bases`, `documents`, `query_log` exist |
| `test_data_dirs` | Data directory structure is created | Dirs exist after init |
| `test_models_instantiate` | All Pydantic models can be instantiated | No validation errors |
| `test_gradio_builds` | Gradio demo object builds without error | `gr.Blocks` object created |

### Documentation
See `DOCS/phase0_foundation.md`.

---

## Phase 1 – Document Ingestion Pipeline

### Goal
Build the complete document ingestion pipeline: parse any supported document format, semantically chunk the text, generate embeddings, build the BM25 index, store everything in ChromaDB + SQLite, and implement SHA-256-based deduplication.

### Files Created
```
voicevault/ingestion/
├── document_parser.py      # PDF, HTML, DOCX, MD, TXT, URL parsers
├── semantic_chunker.py     # spaCy + cosine-similarity boundary chunker
└── index_builder.py        # ChromaDB + BM25 + SQLite indexer + dedup

voicevault/storage/
├── sqlite_store.py         # Full CRUD: KB, document, chunk metadata
└── chroma_store.py         # ChromaDB collection management

tests/
└── test_phase1.py          # Ingestion unit + integration tests

DOCS/
└── phase1_ingestion.md
```

### Key Components

**DocumentParser** (multi-format dispatcher):
- PDF: `PyMuPDF` (fitz); preserves page numbers, extracts tables as text
- HTML: `BeautifulSoup4`; handles Notion/Confluence exports, preserves heading hierarchy
- DOCX: `python-docx`; heading-aware extraction
- Markdown: `markdown-it-py`; heading hierarchy → section metadata
- Plain text: paragraph-level splitting
- URL: `trafilatura`; clean article extraction from any public URL
- Scanned PDF fallback: `pytesseract` OCR when no text layer is found

**SemanticChunker** (boundary detection):
- `spaCy en_core_web_sm` sentence tokenization
- Cosine similarity between adjacent sentence embeddings
- New chunk when similarity < 0.5 (configurable threshold)
- Target: 400–600 tokens per chunk, 50-token overlap
- Special handling: tables and code blocks treated as atomic units, lists kept together
- Metadata per chunk: source_file, page_number, section_heading, chunk_index, timestamp
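
The boundary rule above (start a new chunk when adjacent-sentence similarity falls below 0.5) can be sketched with the standard library alone. This is a simplified illustration: `embed` is a toy bag-of-words stand-in for a real sentence encoder, the function names are hypothetical, and the token-size target, overlap, and atomic-block handling are deliberately left out.

```python
from __future__ import annotations

import math
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy bag-of-words 'embedding'; a stand-in for a real sentence encoder."""
    return Counter(sentence.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def chunk_sentences(sentences: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Start a new chunk whenever adjacent-sentence similarity drops below threshold."""
    chunks: list[list[str]] = []
    for sentence in sentences:
        previous = chunks[-1][-1] if chunks else None
        if previous is not None and cosine(embed(previous), embed(sentence)) >= threshold:
            chunks[-1].append(sentence)  # same topic: extend the current chunk
        else:
            chunks.append([sentence])  # similarity dropped: open a new chunk
    return chunks


sentences = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "quarterly revenue grew sharply",
]
print(chunk_sentences(sentences))
```

With real embeddings the threshold usually needs tuning per corpus, which is presumably why the plan keeps it configurable.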

**IndexBuilder** (dual-index construction):
- SHA-256 hash of chunk text → deduplication (unchanged content is not re-indexed)
- `sentence-transformers all-MiniLM-L6-v2` → 384-dim embeddings → ChromaDB
- `rank_bm25` BM25Okapi index → serialized to `bm25.pkl`
- SQLite metadata: `chunks` table linking every chunk to its source doc
- Incremental update: only new/changed chunks re-embedded
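
A minimal sketch of the hash-based deduplication step, assuming the set of already-indexed hashes is held in memory (in the plan it would presumably be read from SQLite). The function names are hypothetical.

```python
from __future__ import annotations

import hashlib


def chunk_hash(text: str) -> str:
    """SHA-256 hex digest of the chunk text (UTF-8 encoded)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_new_chunks(chunks: list[str], seen_hashes: set[str]) -> list[str]:
    """Return only chunks whose hash is not yet indexed, recording them as seen."""
    fresh: list[str] = []
    for text in chunks:
        digest = chunk_hash(text)
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            fresh.append(text)
    return fresh


seen: set[str] = set()
first_upload = select_new_chunks(["intro", "methods", "intro"], seen)
second_upload = select_new_chunks(["methods", "results"], seen)
print(first_upload, second_upload)
```

Only the chunks returned by `select_new_chunks` would then need embedding, which is what makes re-uploads cheap.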

### Tests
| Test | Pass Criteria |
|------|---------------|
| `test_pdf_parse` | Extracts text with correct page numbers |
| `test_html_parse` | Extracts headings and paragraphs from Notion HTML |
| `test_docx_parse` | Extracts text from DOCX with heading metadata |
| `test_semantic_chunker` | Chunks respect sentence boundaries, 100–600 tokens |
| `test_deduplication` | Same doc uploaded twice → chunks not duplicated |
| `test_bm25_build` | BM25 index serializes and reloads correctly |
| `test_chroma_store` | Vectors stored and queryable in ChromaDB |
| `test_sqlite_metadata` | All chunk metadata persisted to SQLite |
| `test_incremental_update` | Only new chunks indexed on re-upload |

---

## Phase 2 – Hybrid Retrieval Engine

### Goal
Implement the hybrid BM25 + dense vector retrieval pipeline with Reciprocal Rank Fusion merging, cross-encoder reranking, diversity filtering, query expansion, and context window assembly.

### Files Created
```
voicevault/retrieval/
├── bm25_retriever.py       # rank_bm25 keyword search
├── vector_retriever.py     # ChromaDB semantic search
├── hybrid_retriever.py     # RRF merge + cross-encoder + diversity filter
└── context_builder.py      # Formats top-k chunks for LLM prompt

tests/
└── test_phase2.py          # Retrieval unit + benchmark tests

DOCS/
└── phase2_retrieval.md
```

### Key Components

**BM25Retriever:**
- Loads pre-built BM25 index from disk
- Tokenizes query, scores all chunks, returns top-20

**VectorRetriever:**
- Encodes query with `all-MiniLM-L6-v2`
- ChromaDB cosine similarity query → top-20

**HybridRetriever (RRF core):**
```
query → [QueryExpander: 2 paraphrases]
     → BM25 top-20 + Vector top-20 (parallel)
     → RRF merge (k=60): score = Σ 1/(k + rank)
     → CrossEncoder ms-marco-MiniLM-L-12-v2 rescores top-20
     → DiversityFilter: max 2 chunks from same page
     → Final top-5 chunks
```
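
The RRF merge and the diversity filter in the pipeline above can be sketched in a few lines. This assumes each retriever returns a ranked list of chunk IDs with ranks starting at 1, and that a `page_of` mapping from chunk ID to a page key is available; both function names and the mapping are hypothetical, and the cross-encoder step between them is omitted.

```python
from __future__ import annotations

from collections import defaultdict


def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def diversity_filter(
    ranked_ids: list[str],
    page_of: dict[str, str],
    max_per_page: int = 2,
    top_n: int = 5,
) -> list[str]:
    """Keep ranked order but allow at most max_per_page chunks from any one page."""
    kept: list[str] = []
    per_page: dict[str, int] = defaultdict(int)
    for chunk_id in ranked_ids:
        page = page_of[chunk_id]
        if per_page[page] < max_per_page:
            per_page[page] += 1
            kept.append(chunk_id)
        if len(kept) == top_n:
            break
    return kept


bm25_hits = ["a", "b", "c"]
vector_hits = ["b", "d", "a"]
merged = rrf_merge([bm25_hits, vector_hits])
print([doc_id for doc_id, _ in merged])
```

A chunk that appears in both lists accumulates two reciprocal-rank terms, so agreement between BM25 and vector search is rewarded even when neither ranks the chunk first.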

**ContextBuilder:**
- Formats chunks as: `[Source: filename, p.N | Section: heading]\n{text}`
- Appends conversation history (last 5 turns)
- Returns context string ready for LLM prompt
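
A possible shape for the context assembly, using the header format given above. The `Chunk` dataclass, the `build_context` name, and the `User:`/`Assistant:` layout for history turns are assumptions for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical chunk record; field names mirror the Phase 1 chunk metadata."""

    source_file: str
    page_number: int
    section_heading: str
    text: str


def build_context(
    chunks: list[Chunk],
    history: list[tuple[str, str]],
    max_turns: int = 5,
) -> str:
    """Render chunks with source headers, then the most recent conversation turns."""
    blocks = [
        f"[Source: {c.source_file}, p.{c.page_number} | Section: {c.section_heading}]\n{c.text}"
        for c in chunks
    ]
    recent = history[-max_turns:] if max_turns > 0 else []
    turns = [f"User: {q}\nAssistant: {a}" for q, a in recent]
    parts = ["\n\n".join(blocks)]
    if turns:
        parts.append("Conversation history:\n" + "\n".join(turns))
    return "\n\n".join(parts)


context = build_context(
    [Chunk("handbook.pdf", 12, "Refunds", "Refunds are issued within 14 days.")],
    [("What is the refund window?", "14 days.")],
)
print(context)
```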

### Tests
| Test | Pass Criteria |
|------|---------------|
| `test_bm25_retriever` | Returns ranked results for keyword query |
| `test_vector_retriever` | Returns semantically relevant results |
| `test_rrf_merge` | RRF scores computed correctly for known ranks |
| `test_cross_encoder_rerank` | Re-ranked order differs from RRF order (improvement) |
| `test_diversity_filter` | Max 2 chunks per page in final results |
| `test_hybrid_recall` | Recall@5 ≥ 0.80 on 50-Q benchmark dataset |
| `test_context_builder` | Output is valid string with source citations |
| `test_query_expansion` | Returns 2 paraphrase variants |

---

## Phase 3 – ASR & Voice Input

### Goal
Integrate Whisper Large-v3 for high-quality speech-to-text transcription, with Distil-Whisper CPU fallback, browser microphone capture via Gradio Audio, and a query preprocessor that cleans transcripts and classifies query intent.

### Files Created
```
voicevault/asr/
├── whisper_transcriber.py  # Whisper Large-v3 + Distil-Whisper fallback
└── query_preprocessor.py   # Cleanup, intent classification, language detect

tests/
└── test_phase3.py          # ASR unit tests + WER evaluation

DOCS/
└── phase3_asr.md
```

### Key Components

**WhisperTranscriber:**
- Primary: `openai/whisper-large-v3` (HuggingFace GPU pipeline)
- Fallback: `distil-whisper/distil-large-v3` (CPU, 6× faster, <1% WER difference)
- VAD pre-check: reject audio < 1s or silent audio
- Returns: `transcript`, `language`, `confidence`, `model_used`, `latency_ms`

**QueryPreprocessor:**
- Lowercase normalization, punctuation repair
- Filler word removal: um, uh, like, you know
- Language detection: `langdetect` library
- Query type classification:
  - `factual`: "What is...", "Who...", "When..."
  - `summary`: "Summarise...", "Give me an overview..."
  - `compare`: "Compare...", "What's the difference..."
- Routes to different retrieval strategies per type
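
The cleanup and intent rules above could be approximated with regular expressions, as in the sketch below. The patterns, the function names, and the choice of `factual` as the default intent are assumptions; a naive filler list also removes legitimate uses of "like", so a real implementation would likely need something more careful.

```python
from __future__ import annotations

import re

# Multi-word fillers come first so "you know" is removed as a unit.
FILLER_RE = re.compile(r"\b(?:you know|um+|uh+|like)\b[,.]?\s*", re.IGNORECASE)

# Order matters: "what's the difference" must be tested before the generic "what".
INTENT_PATTERNS = [
    ("compare", re.compile(r"^(compare\b|what's the difference|what is the difference)")),
    ("summary", re.compile(r"^(summari[sz]e\b|give me an overview)")),
    ("factual", re.compile(r"^(what|who|when)\b")),
]


def clean_transcript(text: str) -> str:
    """Lowercase, drop filler words, and collapse whitespace."""
    text = FILLER_RE.sub("", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def classify_intent(query: str, default: str = "factual") -> str:
    """Return the first intent whose pattern matches the cleaned query."""
    cleaned = clean_transcript(query)
    for intent, pattern in INTENT_PATTERNS:
        if pattern.search(cleaned):
            return intent
    return default


print(clean_transcript("Um what is like the refund policy you know"))
print(classify_intent("Compare plan A and plan B"))
```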

### Tests
| Test | Pass Criteria |
|------|---------------|
| `test_preprocessor_cleanup` | Filler words removed, normalized |
| `test_intent_factual` | "What is X?" → type=factual |
| `test_intent_summary` | "Summarise the report" → type=summary |
| `test_intent_compare` | "Compare A and B" → type=compare |
| `test_language_detection` | English text → "en" |
| `test_vad_short_audio` | < 1s audio raises ValueError |
| `test_whisper_mock` | Transcriber returns correct schema with mocked model |

---

## Phase 4 – Generation Chain & Citations

### Goal
Build the full LangChain LCEL generation chain: Groq Llama-3.1-70B as primary LLM with streaming, Gemini 1.5 Flash as automatic fallback, citation injection with [Source: file, p.N] protocol, faithfulness guard for out-of-context detection, and conversation memory.

### Files Created
```
voicevault/generation/
├── answer_chain.py         # LangChain LCEL + Groq + Gemini fallback
├── citation_injector.py    # Maps [Doc:Page] citations to source chunks
└── faithfulness_guard.py   # Out-of-context detection

tests/
└── test_phase4.py          # Generation unit tests

DOCS/
└── phase4_generation.md
```

### Key Components

**AnswerChain (LCEL):**
```
context_string + query + history
    → PromptTemplate (system: citation protocol + faithfulness instructions)
    → ChatGroq (llama-3.1-70b-versatile, streaming, temp=0.1)
         on quota error → ChatGoogleGenerativeAI (gemini-1.5-flash)
    → StrOutputParser
    → CitationInjector (post-processing)
```
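
The "on quota error" branch in the chain above amounts to a try-primary-then-fallback control flow. The sketch below shows only that flow with plain callables; `QuotaExceededError` and `answer_with_fallback` are hypothetical names, and in the real chain this would more likely be expressed through LangChain's own fallback support than hand-written.

```python
from __future__ import annotations

from typing import Callable


class QuotaExceededError(Exception):
    """Hypothetical stand-in for the rate-limit error raised by the primary client."""


def answer_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> tuple[str, str]:
    """Return (answer, model_used); switch to the fallback only on a quota error."""
    try:
        return primary(prompt), "primary"
    except QuotaExceededError:
        # Other exceptions propagate so real bugs are not hidden by the fallback.
        return fallback(prompt), "fallback"


def fake_groq(prompt: str) -> str:
    """Simulated primary model that is always over quota."""
    raise QuotaExceededError("rate limit reached")


def fake_gemini(prompt: str) -> str:
    """Simulated fallback model."""
    return f"answer to: {prompt}"


print(answer_with_fallback("What is RRF?", fake_groq, fake_gemini))
```

Returning the model label alongside the answer makes it easy to record which model served each query.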

**CitationInjector:**
- Parses `[Doc:Page]` markers from LLM output
- Resolves each to the actual chunk's source_file + page_number + excerpt
- Builds `List[Citation]` object for UI display
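
A hedged sketch of the marker-resolution step. It assumes markers look like `[filename:page]` and that retrieved chunks are available in a dictionary keyed by `(filename, page)`; the exact marker grammar is not pinned down in this plan, so the regex, the `Citation` fields, and `extract_citations` are all illustrative.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Assumed marker shape: [<document name>:<page number>]
MARKER_RE = re.compile(r"\[([^\[\]:]+):(\d+)\]")


@dataclass(frozen=True)
class Citation:
    """Hypothetical citation record for UI display."""

    source_file: str
    page_number: int
    excerpt: str


def extract_citations(
    answer: str,
    chunks: dict[tuple[str, int], str],
    excerpt_len: int = 80,
) -> list[Citation]:
    """Resolve each marker to a Citation; skip markers with no matching chunk."""
    citations: list[Citation] = []
    for doc, page in MARKER_RE.findall(answer):
        key = (doc.strip(), int(page))
        if key not in chunks:
            continue  # the model cited something that was never retrieved
        citation = Citation(key[0], key[1], chunks[key][:excerpt_len])
        if citation not in citations:
            citations.append(citation)
    return citations


retrieved = {("report.pdf", 5): "Revenue grew 12% year over year."}
answer = "Revenue grew 12% [report.pdf:5], as also noted in [other.pdf:2]."
print(extract_citations(answer, retrieved))
```

Markers that do not resolve to a retrieved chunk are dropped here; they could equally be surfaced to the faithfulness check as a warning sign.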

**FaithfulnessGuard:**
- System prompt: "If the answer cannot be found in the provided context, respond with exactly: 'I could not find this in your documents.'"
- Post-generation check: if the answer references facts not in any retrieved chunk → flag
- Confidence scoring based on retrieval score distribution

### Tests
| Test | Pass Criteria |
|------|---------------|
| `test_citation_injector_parses` | `[Doc:5]` → correct Citation object |
| `test_faithfulness_guard_refusal` | Out-of-context Q → refusal message |
| `test_answer_chain_mock` | Chain runs end-to-end with mocked LLM |
| `test_groq_fallback` | Groq quota error → Gemini client used |
| `test_streaming_output` | Chain yields token-by-token |
| `test_conversation_memory` | Last 5 turns preserved across queries |

---

## Phase 5 – Full UI, TTS & Access Control

### Goal
Build the complete 4-tab Gradio UI, integrate Web Speech API for browser-native TTS, implement the multi-knowledge-base manager, add bcrypt password protection + HMAC share links, and build the analytics + audit log system.

### Files Created
```
voicevault/kb/
├── kb_manager.py           # Create/list/delete knowledge bases
├── access_control.py       # bcrypt password, HMAC share links
└── audit_log.py            # Query logging to SQLite

voicevault/tts/
└── web_speech.py           # Web Speech API JS bridge

voicevault/storage/
└── sqlite_store.py         # Complete CRUD (extended from Phase 0)

ui/tabs/
├── ask_tab.py              # Full voice query tab
├── kb_tab.py               # Full KB manager tab
├── analytics_tab.py        # Charts + metrics tab
└── settings_tab.py         # All configurable parameters

ui/components/
├── citation_panel.py       # Citation highlighting component
└── audio_controls.py       # TTS playback controls

tests/
├── test_phase5.py          # UI component + access control tests
└── test_e2e.py             # Full end-to-end pipeline test

DOCS/
└── phase5_ui_access.md
```

### Key Components

**KBManager:**
- Creates per-KB directory: `data/{kb_name}/chroma/`, `bm25.pkl`, `voicevault.db`
- Lists all KBs with metadata (doc count, chunk count, last updated)
- Delete KB: removes directory + SQLite row

**AccessControl:**
- Password hash: `bcrypt` with work factor 12
- Share link: `HMAC-SHA256` signed token with KB name + expiry
- Token validation on every query to password-protected KB
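
One way the HMAC-SHA256 share token could be built with the standard library is sketched below. The `payload.signature` layout, the `|` separator, and both function names are assumptions; the `now` parameter exists only so expiry can be tested deterministically.

```python
from __future__ import annotations

import base64
import hashlib
import hmac
import time


def _sign(payload: str, secret: bytes) -> str:
    """Hex HMAC-SHA256 of the payload under the server-side secret."""
    return hmac.new(secret, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def make_share_token(
    kb_name: str, secret: bytes, ttl_seconds: int, now: float | None = None
) -> str:
    """Return 'payload.signature', where the payload encodes KB name and expiry."""
    expires = int((time.time() if now is None else now) + ttl_seconds)
    payload = base64.urlsafe_b64encode(f"{kb_name}|{expires}".encode("utf-8")).decode("ascii")
    return f"{payload}.{_sign(payload, secret)}"


def verify_share_token(
    token: str, kb_name: str, secret: bytes, now: float | None = None
) -> bool:
    """Check the signature in constant time, then the KB name and the expiry."""
    payload, _, signature = token.rpartition(".")
    expected = _sign(payload, secret)
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return False
    try:
        decoded = base64.urlsafe_b64decode(payload).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return False
    name, _, expires = decoded.rpartition("|")
    current = time.time() if now is None else now
    return name == kb_name and expires.isdigit() and current < int(expires)


secret = b"replace-with-a-secret-loaded-from-config"
token = make_share_token("handbook", secret, ttl_seconds=7 * 24 * 3600)
print(verify_share_token(token, "handbook", secret))
```

The signature is verified before the payload is decoded, so tampered tokens are rejected without their contents being trusted.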

**AuditLog:**
- Every query logs: session_id, kb_names, voice_query (anonymized), latency, timestamp
- Viewable in Analytics tab

**Web Speech API Bridge:**
- JavaScript injected via `gr.HTML` component
- `window.speechSynthesis.speak()` triggered from Python via Gradio's JS bridge
- Voice selector, rate slider, pitch slider
- Pause/Resume/Restart controls

**UI Tabs:**
- **Ask tab:** Mic button → live transcript → KB selector → streaming answer → citation panel → speak button
- **KB tab:** Create KB form + document uploader (PDF/MD/HTML/DOCX) + progress bar + doc list
- **Analytics tab:** Query volume chart + latency breakdown + top documents + Groq quota gauge
- **Settings tab:** ASR model, voice settings, retrieval params, LLM params, chunking params

### Tests
| Test | Pass Criteria |
|------|---------------|
| `test_kb_create_delete` | KB directory created/removed correctly |
| `test_bcrypt_password` | Hash + verify round-trip |
| `test_hmac_share_link` | Token validates within expiry, fails after |
| `test_audit_log_write` | Query logged to SQLite correctly |
| `test_access_control_wrong_pw` | Wrong password → access denied |
| `test_e2e_pipeline` | PDF upload → query → cited answer (mocked LLM) |

---

## 10. Quality Gates

Every phase must pass ALL gates before moving to the next phase:

| Gate | Requirement |
|------|-------------|
| **Zero import errors** | `python -m pytest tests/ --co -q` exits 0 |
| **All tests pass** | `pytest tests/test_phaseN.py` is 100% green |
| **No bare except** | No `except:` or `except Exception:` without logging |
| **Type annotations** | Every public function has full type hints |
| **No unused imports** | `pylint --disable=all --enable=W0611` passes |
| **No secrets in code** | No API keys, passwords, or tokens hardcoded |
| **Pathlib throughout** | No `os.path` usage in any module |

---

## 11. Security Audit Checklist

- [ ] No API keys committed to git (enforced by .gitignore + .env.example)
- [ ] All file uploads validated: extension whitelist + MIME check + size limit
- [ ] SQLite queries use parameterized statements (no f-string SQL)
- [ ] bcrypt work factor ≥ 12 for password hashing
- [ ] HMAC share tokens have expiry (default: 7 days)
- [ ] `trafilatura` URL fetching: block private IP ranges to prevent SSRF
- [ ] ChromaDB stored in non-public path (never served as static file)
- [ ] BM25 pickle files: only loaded from trusted internal paths
- [ ] Gradio app: file upload restricted to `data/uploads/` sandbox directory
- [ ] Audit log: voice queries anonymized before storage (hash, not raw text)
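
For the SSRF item in the checklist, a pre-fetch check might look like the sketch below. It uses the standard `ipaddress` module and treats any address that is not globally routable as blocked. The function names are hypothetical, and the `resolve` parameter is injectable only so the check can be exercised without network access.

```python
from __future__ import annotations

import ipaddress
import socket
from urllib.parse import urlparse


def is_public_ip(address: str) -> bool:
    """True only for globally routable addresses (not private, loopback, link-local)."""
    try:
        return ipaddress.ip_address(address).is_global
    except ValueError:
        return False  # unparseable addresses are treated as unsafe


def is_url_allowed(url: str, resolve=socket.getaddrinfo) -> bool:
    """Allow http(s) URLs only when every resolved address is public."""
    parsed = urlparse(url)
    if parsed.scheme not in {"http", "https"} or not parsed.hostname:
        return False
    try:
        infos = resolve(parsed.hostname, None)
    except OSError:
        return False  # DNS failure: nothing safe to fetch
    addresses = {info[4][0] for info in infos}
    return bool(addresses) and all(is_public_ip(addr) for addr in addresses)


print(is_public_ip("10.0.0.5"), is_public_ip("8.8.8.8"))
```

A check like this only helps if the subsequent fetch connects to the same address that was validated; otherwise a DNS record that changes between check and fetch can bypass it.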

---

## 12. Progress Tracker

| Phase | Status | Tests | Docs |
|-------|--------|-------|------|
| Phase 0 – Foundation | ✅ Done | ✅ 58/58 | ✅ phase0_foundation.md |
| Phase 1 – Ingestion | ✅ Done | ✅ 46/46 | ✅ phase1_ingestion.md |
| Phase 2 – Retrieval | ✅ Done | ✅ 33/33 | ✅ phase2_retrieval.md |
| Phase 3 – ASR | ✅ Done | ✅ 45/47 (2 skipped) | ✅ phase3_asr.md |
| Phase 4 – Generation | ✅ Done | ✅ 72/72 | ✅ phase4_generation.md |
| Phase 5 – UI & Access | ✅ Done | ✅ 55/55 | ✅ phase5_ui_access.md |

---

*VoiceVault · Navnit Amrutharaj · navnita004@gmail.com · github.com/ninjacode911*