# VoiceVault — End-to-End Implementation Plan

**Author:** Navnit Amrutharaj  
**Model:** VoiceVault v1.0 — Voice-First RAG Knowledge Agent  
**Stack:** Whisper · LangChain · ChromaDB · Groq · Gradio  
**Target:** $0/month · HuggingFace Spaces · 10 Weeks  
**Plan Date:** March 2026

---

## Table of Contents

1. [Project Overview](#1-project-overview)
2. [Architecture Summary](#2-architecture-summary)
3. [Phase Map](#3-phase-map)
4. [Phase 0 — Project Foundation](#phase-0--project-foundation)
5. [Phase 1 — Document Ingestion Pipeline](#phase-1--document-ingestion-pipeline)
6. [Phase 2 — Hybrid Retrieval Engine](#phase-2--hybrid-retrieval-engine)
7. [Phase 3 — ASR & Voice Input](#phase-3--asr--voice-input)
8. [Phase 4 — Generation Chain & Citations](#phase-4--generation-chain--citations)
9. [Phase 5 — Full UI, TTS & Access Control](#phase-5--full-ui-tts--access-control)
10. [Quality Gates](#10-quality-gates)
11. [Security Audit Checklist](#11-security-audit-checklist)
12. [Progress Tracker](#12-progress-tracker)

---

## 1. Project Overview

VoiceVault is a **voice-first retrieval-augmented generation (RAG) knowledge agent** that enables users to:

- Speak questions into a browser microphone
- Have each question transcribed (Whisper), answered from retrieved passages, and spoken back
- Reference private document collections (PDFs, Notion exports, Confluence, DOCX, MD)
- Receive fully cited answers anchored to source document + page + paragraph

**Core differentiator:** Hybrid BM25 + vector search with Reciprocal Rank Fusion (RRF) + cross-encoder reranking — demonstrating enterprise-grade retrieval depth that most RAG tutorials skip.

---

## 2. Architecture Summary

```
INGESTION PATH (one-time per document set)
  User uploads PDFs / HTML / DOCX / MD
      ↓
  DocumentParser → text extraction (PyMuPDF, BS4, python-docx)
      ↓
  SemanticChunker → sentence-aware chunks (spaCy + cosine boundary)
      ↓
  IndexBuilder → ChromaDB (vectors) + BM25 (keywords) + SQLite (metadata)

QUERY PATH (real-time, per user question)
  Browser mic → Gradio Audio → Whisper Large-v3 (HuggingFace GPU)
      ↓
  QueryPreprocessor → cleanup + intent class + language detect
      ↓
  HybridRetriever → BM25 top-20 + Vector top-20 → RRF merge → CrossEncoder top-5
      ↓
  LangChain LCEL → Groq Llama-3.1-70B (stream) / Gemini Flash (fallback)
      ↓
  CitationInjector → [Source: filename, p.N] inline citations
      ↓
  Gradio UI (text + highlight citations) + Web Speech API (spoken answer)
```

---

## 3. Phase Map

| Phase | Name | Weeks | Core Deliverables |
|-------|------|-------|-------------------|
| **0** | Project Foundation | 0 | Scaffold, config, models, SQLite schema, Gradio skeleton |
| **1** | Document Ingestion | 1–2 | Parser, semantic chunker, ChromaDB + BM25 + SQLite indexer |
| **2** | Hybrid Retrieval | 3 | BM25 + vector + RRF + cross-encoder + diversity filter |
| **3** | ASR & Voice Input | 4 | Whisper Large-v3, Distil fallback, query preprocessor |
| **4** | Generation & Citations | 5 | LangChain LCEL, Groq, Gemini fallback, faithfulness guard |
| **5** | Full UI & Access Control | 6–8 | 4-tab Gradio UI, Web Speech TTS, multi-KB, bcrypt, audit log |

---

## Phase 0 — Project Foundation

### Goal

Establish the complete project skeleton — directory structure, dependencies, centralized config, Pydantic data models, SQLite schema, and a working 4-tab Gradio scaffold — before any business logic is written.
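
As a rough illustration of the SQLite schema step in this goal, the sketch below creates the three tables the plan names (`knowledge_bases`, `documents`, `query_log`) using only the standard library and `pathlib`. The column names, the `init_db` and `list_tables` helpers, and the database path are assumptions made for illustration; the plan does not specify them.

```python
import sqlite3
from pathlib import Path

# Only the three table names come from the plan; every column below is a
# hypothetical placeholder chosen for this sketch.
SCHEMA = """
CREATE TABLE IF NOT EXISTS knowledge_bases (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS documents (
    id INTEGER PRIMARY KEY,
    kb_id INTEGER NOT NULL REFERENCES knowledge_bases(id),
    filename TEXT NOT NULL,
    sha256 TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS query_log (
    id INTEGER PRIMARY KEY,
    session_id TEXT NOT NULL,
    query_hash TEXT NOT NULL,
    latency_ms REAL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def init_db(db_path: Path) -> sqlite3.Connection:
    """Create the parent directory and the schema, then return the connection."""
    db_path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    conn.commit()
    return conn


def list_tables(conn: sqlite3.Connection) -> list[str]:
    """Return the names of all tables in the database, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]
```

Because every statement uses `CREATE TABLE IF NOT EXISTS`, calling `init_db` again on an existing database is harmless, which keeps app startup and test fixtures simple.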
### Files Created

```
voicevault/
├── app.py                          # Gradio Blocks entry point
├── config.py                       # Pydantic-settings centralized config
├── requirements.txt                # All project dependencies (pinned)
├── .env.example                    # Environment variable template
├── voicevault/
│   ├── __init__.py                 # Package init + version
│   ├── models.py                   # Pydantic data models (all schemas)
│   ├── asr/__init__.py
│   ├── ingestion/__init__.py
│   ├── retrieval/__init__.py
│   ├── generation/__init__.py
│   ├── kb/__init__.py
│   ├── tts/__init__.py
│   └── storage/
│       ├── __init__.py
│       └── sqlite_store.py         # Schema creation + DB init
├── ui/
│   ├── __init__.py
│   ├── tabs/
│   │   ├── __init__.py
│   │   ├── ask_tab.py              # Placeholder — voice query tab
│   │   ├── kb_tab.py               # Placeholder — KB manager tab
│   │   ├── analytics_tab.py        # Placeholder — analytics tab
│   │   └── settings_tab.py         # Placeholder — settings tab
│   └── components/
│       ├── __init__.py
│       ├── citation_panel.py       # Placeholder — citation display
│       └── audio_controls.py       # Placeholder — TTS controls
├── tests/
│   ├── __init__.py
│   ├── conftest.py                 # Pytest fixtures
│   └── test_phase0.py              # Foundation smoke tests
├── data/                           # Runtime data (gitignored)
└── DOCS/
    └── phase0_foundation.md        # Phase 0 documentation
```

### Key Decisions

- **pydantic-settings** for type-safe env var loading (no raw `os.environ` calls)
- **pathlib.Path** throughout — cross-platform, no `os.path`
- **SQLite stdlib** for metadata — zero-dependency, portable, no server
- **Gradio 4.x Blocks** for UI — native HuggingFace Spaces support
- **`__version__` sentinel** in `voicevault/__init__.py` for release tracking
- **Data models locked early** — prevents schema drift across phases

### Tests

| Test | Description | Pass Criteria |
|------|-------------|---------------|
| `test_config_loads` | Config instantiates without exceptions | No exception |
| `test_env_defaults` | Default values are correct types | All fields pass type check |
| `test_db_init` | SQLite schema creates 3 tables | Tables `knowledge_bases`, `documents`, `query_log` exist |
| `test_data_dirs` | Data directory structure is created | Dirs exist after init |
| `test_models_instantiate` | All Pydantic models can be instantiated | No validation errors |
| `test_gradio_builds` | Gradio demo object builds without error | `gr.Blocks` object created |

### Documentation

→ See `DOCS/phase0_foundation.md`

---

## Phase 1 — Document Ingestion Pipeline

### Goal

Build the complete document ingestion pipeline: parse any supported document format, semantically chunk the text, generate embeddings, build the BM25 index, store everything in ChromaDB + SQLite, and implement SHA-256-based deduplication.

### Files Created

```
voicevault/ingestion/
├── document_parser.py      # PDF, HTML, DOCX, MD, TXT, URL parsers
├── semantic_chunker.py     # spaCy + cosine-similarity boundary chunker
└── index_builder.py        # ChromaDB + BM25 + SQLite indexer + dedup

voicevault/storage/
├── sqlite_store.py         # Full CRUD: KB, document, chunk metadata
└── chroma_store.py         # ChromaDB collection management

tests/
└── test_phase1.py          # Ingestion unit + integration tests

DOCS/
└── phase1_ingestion.md
```

### Key Components

**DocumentParser** — Multi-format dispatcher:

- PDF: `PyMuPDF` (fitz) — preserves page numbers, extracts tables as text
- HTML: `BeautifulSoup4` — Notion/Confluence exports, preserves heading hierarchy
- DOCX: `python-docx` — heading-aware extraction
- Markdown: `markdown-it-py` — heading hierarchy → section metadata
- Plain text: paragraph-level splitting
- URL: `trafilatura` — clean article extraction from any public URL
- Scanned PDF fallback: `pytesseract` OCR when no text layer found

**SemanticChunker** — Boundary detection:

- `spaCy en_core_web_sm` sentence tokenization
- Cosine similarity between adjacent sentence embeddings
- New chunk when similarity < 0.5 (configurable threshold)
- Target: 400–600 tokens per chunk, 50-token overlap
- Special handling: tables as atomic units, code blocks atomic, lists kept together
- Metadata per chunk: source_file, page_number, section_heading, chunk_index, timestamp

**IndexBuilder** — Dual-index construction:

- SHA-256 hash of chunk text → deduplication (skip re-indexed unchanged content)
- `sentence-transformers all-MiniLM-L6-v2` → 384-dim embeddings → ChromaDB
- `rank_bm25` BM25Okapi index → serialized to `bm25.pkl`
- SQLite metadata: `chunks` table linking every chunk to its source doc
- Incremental update: only new/changed chunks re-embedded

### Tests

| Test | Pass Criteria |
|------|---------------|
| `test_pdf_parse` | Extracts text with correct page numbers |
| `test_html_parse` | Extracts headings and paragraphs from Notion HTML |
| `test_docx_parse` | Extracts text from DOCX with heading metadata |
| `test_semantic_chunker` | Chunks respect sentence boundaries, 100–600 tokens |
| `test_deduplication` | Same doc uploaded twice → chunks not duplicated |
| `test_bm25_build` | BM25 index serializes and reloads correctly |
| `test_chroma_store` | Vectors stored and queryable in ChromaDB |
| `test_sqlite_metadata` | All chunk metadata persisted to SQLite |
| `test_incremental_update` | Only new chunks indexed on re-upload |

---

## Phase 2 — Hybrid Retrieval Engine

### Goal

Implement the hybrid BM25 + dense vector retrieval pipeline with Reciprocal Rank Fusion merging, cross-encoder reranking, diversity filtering, query expansion, and context window assembly.
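
Reciprocal Rank Fusion, the merge step named in this goal, scores each chunk as the sum of `1 / (k + rank)` over every ranked list it appears in, so a chunk ranked well by both BM25 and the vector index rises above one ranked well by only a single retriever. The following is a minimal sketch of that formula, assuming each retriever returns chunk IDs in rank order; the function name `rrf_merge` and the alphabetical tie-break are illustrative choices, not part of the plan.

```python
from collections import defaultdict


def rrf_merge(
    rankings: list[list[str]], k: int = 60, top_n: int = 20
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) to a chunk's score, with ranks
    starting at 1. Chunks missing from a list contribute nothing for it.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by chunk ID so the order is deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_n]


# Example: "b" is ranked 2nd by BM25 and 1st by the vector index, so it wins.
bm25_ranking = ["a", "b", "c"]
vector_ranking = ["b", "d", "a"]
merged = rrf_merge([bm25_ranking, vector_ranking])
print([chunk_id for chunk_id, _ in merged])  # ['b', 'a', 'd', 'c']
```

RRF uses only ranks, never raw scores, which is why it can combine BM25 scores and cosine similarities without any normalization step.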
### Files Created

```
voicevault/retrieval/
├── bm25_retriever.py       # rank_bm25 keyword search
├── vector_retriever.py     # ChromaDB semantic search
├── hybrid_retriever.py     # RRF merge + cross-encoder + diversity filter
└── context_builder.py      # Formats top-k chunks for LLM prompt

tests/
└── test_phase2.py          # Retrieval unit + benchmark tests

DOCS/
└── phase2_retrieval.md
```

### Key Components

**BM25Retriever:**

- Loads pre-built BM25 index from disk
- Tokenizes query, scores all chunks, returns top-20

**VectorRetriever:**

- Encodes query with `all-MiniLM-L6-v2`
- ChromaDB cosine similarity query → top-20

**HybridRetriever (RRF core):**

```
query
  → [QueryExpander: 2 paraphrases]
  → BM25 top-20 + Vector top-20 (parallel)
  → RRF merge (k=60): score = Σ 1/(k + rank)
  → CrossEncoder ms-marco-MiniLM-L-12-v2 rescores top-20
  → DiversityFilter: max 2 chunks from same page
  → Final top-5 chunks
```

**ContextBuilder:**

- Formats chunks as: `[Source: filename, p.N | Section: heading]\n{text}`
- Appends conversation history (last 5 turns)
- Returns context string ready for LLM prompt

### Tests

| Test | Pass Criteria |
|------|---------------|
| `test_bm25_retriever` | Returns ranked results for keyword query |
| `test_vector_retriever` | Returns semantically relevant results |
| `test_rrf_merge` | RRF scores computed correctly for known ranks |
| `test_cross_encoder_rerank` | Re-ranked order differs from RRF order (improvement) |
| `test_diversity_filter` | Max 2 chunks per page in final results |
| `test_hybrid_recall` | Recall@5 ≥ 0.80 on 50-Q benchmark dataset |
| `test_context_builder` | Output is valid string with source citations |
| `test_query_expansion` | Returns 2 paraphrase variants |

---

## Phase 3 — ASR & Voice Input

### Goal

Integrate Whisper Large-v3 for high-quality speech-to-text transcription, with Distil-Whisper CPU fallback, browser microphone capture via Gradio Audio, and a query preprocessor that cleans transcripts and classifies query intent.
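
The query preprocessor in this goal can be approximated with a few regular expressions. The sketch below is one possible rule-based version, assuming the filler list (um, uh, like, you know) and the three intent labels (`factual`, `summary`, `compare`) from this plan; the function names, the exact patterns, and the choice to default to `factual` are illustrative assumptions.

```python
import re

# Fillers from the plan, plus any commas hugging them in a transcript.
FILLER_PATTERN = re.compile(r",?\s*\b(?:um|uh|like|you know)\b,?", re.IGNORECASE)

# Checked in order: "what's the difference" must win over the generic "what".
INTENT_RULES = [
    ("compare", re.compile(r"^(?:compare\b|what(?:'s| is) the difference\b)")),
    ("summary", re.compile(r"^(?:summari[sz]e\b|give me an overview\b)")),
    ("factual", re.compile(r"^(?:what|who|when)\b")),
]


def clean_transcript(text: str) -> str:
    """Strip filler words, tidy whitespace and punctuation, and lowercase."""
    without_fillers = FILLER_PATTERN.sub(" ", text)
    collapsed = re.sub(r"\s+", " ", without_fillers)
    tidy = re.sub(r"\s+([?.!,])", r"\1", collapsed)
    return tidy.strip().lower()


def classify_intent(cleaned: str) -> str:
    """Return the first matching intent label; fall back to 'factual'."""
    for label, pattern in INTENT_RULES:
        if pattern.search(cleaned):
            return label
    return "factual"


cleaned = clean_transcript("Um, what is, like, the refund policy, you know?")
print(cleaned)                   # what is the refund policy?
print(classify_intent(cleaned))  # factual
```

One caveat of this approach: removing "like" unconditionally also removes legitimate uses ("what does the dashboard look like"), so a real implementation would likely need a more careful filler rule.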
### Files Created

```
voicevault/asr/
├── whisper_transcriber.py  # Whisper Large-v3 + Distil-Whisper fallback
└── query_preprocessor.py   # Cleanup, intent classification, language detect

tests/
└── test_phase3.py          # ASR unit tests + WER evaluation

DOCS/
└── phase3_asr.md
```

### Key Components

**WhisperTranscriber:**

- Primary: `openai/whisper-large-v3` (HuggingFace GPU pipeline)
- Fallback: `distil-whisper/distil-large-v3` (CPU, 6× faster, <1% WER diff)
- VAD pre-check: reject audio < 1s or silent audio
- Returns: `transcript`, `language`, `confidence`, `model_used`, `latency_ms`

**QueryPreprocessor:**

- Lowercase normalization, punctuation repair
- Filler word removal: um, uh, like, you know
- Language detection: `langdetect` library
- Query type classification:
  - `factual` — "What is...", "Who...", "When..."
  - `summary` — "Summarise...", "Give me an overview..."
  - `compare` — "Compare...", "What's the difference..."
- Routes to different retrieval strategies per type

### Tests

| Test | Pass Criteria |
|------|---------------|
| `test_preprocessor_cleanup` | Filler words removed, normalized |
| `test_intent_factual` | "What is X?" → type=factual |
| `test_intent_summary` | "Summarise the report" → type=summary |
| `test_intent_compare` | "Compare A and B" → type=compare |
| `test_language_detection` | English text → "en" |
| `test_vad_short_audio` | < 1s audio raises ValueError |
| `test_whisper_mock` | Transcriber returns correct schema with mocked model |

---

## Phase 4 — Generation Chain & Citations

### Goal

Build the full LangChain LCEL generation chain: Groq Llama-3.1-70B as primary LLM with streaming, Gemini 1.5 Flash as automatic fallback, citation injection with [Source: file, p.N] protocol, faithfulness guard for out-of-context detection, and conversation memory.
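
To make the `[Source: file, p.N]` protocol in this goal concrete, here is a small sketch that pulls those markers out of a generated answer. It covers only the parsing step and assumes filenames contain no commas or closing brackets; the `Citation` dataclass and `extract_citations` function are hypothetical names for this example, not the project's actual models.

```python
import re
from dataclasses import dataclass

# Matches markers such as "[Source: policy.pdf, p.3]".
CITATION_PATTERN = re.compile(
    r"\[Source:\s*(?P<file>[^,\]]+),\s*p\.(?P<page>\d+)\]"
)


@dataclass(frozen=True)
class Citation:
    """Illustrative citation record: source file plus page number."""

    source_file: str
    page_number: int


def extract_citations(answer: str) -> list[Citation]:
    """Return unique citations in the order they first appear in the answer."""
    seen: set[Citation] = set()
    citations: list[Citation] = []
    for match in CITATION_PATTERN.finditer(answer):
        citation = Citation(match.group("file").strip(), int(match.group("page")))
        if citation not in seen:
            seen.add(citation)
            citations.append(citation)
    return citations


answer = (
    "Refunds take 14 days [Source: policy.pdf, p.3]. "
    "Exceptions apply [Source: faq.md, p.1] and [Source: policy.pdf, p.3]."
)
print(extract_citations(answer))
```

Deduplicating while preserving first-appearance order lets a UI number the citations consistently even when the model cites the same page several times.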
### Files Created

```
voicevault/generation/
├── answer_chain.py         # LangChain LCEL + Groq + Gemini fallback
├── citation_injector.py    # Maps [Doc:Page] citations to source chunks
└── faithfulness_guard.py   # Out-of-context detection

tests/
└── test_phase4.py          # Generation unit tests

DOCS/
└── phase4_generation.md
```

### Key Components

**AnswerChain (LCEL):**

```
context_string + query + history
  → PromptTemplate (system: citation protocol + faithfulness instructions)
  → ChatGroq (llama-3.1-70b-versatile, streaming, temp=0.1)
      on quota error → ChatGoogleGenerativeAI (gemini-1.5-flash)
  → StrOutputParser
  → CitationInjector (post-processing)
```

**CitationInjector:**

- Parses `[Doc:Page]` markers from LLM output
- Resolves each to the actual chunk's source_file + page_number + excerpt
- Builds `List[Citation]` object for UI display

**FaithfulnessGuard:**

- System prompt: "If the answer cannot be found in the provided context, respond with exactly: 'I could not find this in your documents.'"
- Post-generation check: if answer references facts not in any retrieved chunk → flag
- Confidence scoring based on retrieval score distribution

### Tests

| Test | Pass Criteria |
|------|---------------|
| `test_citation_injector_parses` | `[Doc:5]` → correct Citation object |
| `test_faithfulness_guard_refusal` | Out-of-context Q → refusal message |
| `test_answer_chain_mock` | Chain runs end-to-end with mocked LLM |
| `test_groq_fallback` | Groq quota error → Gemini client used |
| `test_streaming_output` | Chain yields token-by-token |
| `test_conversation_memory` | Last 5 turns preserved across queries |

---

## Phase 5 — Full UI, TTS & Access Control

### Goal

Build the complete 4-tab Gradio UI, integrate Web Speech API for browser-native TTS, implement the multi-knowledge-base manager, add bcrypt password protection + HMAC share links, and build the analytics + audit log system.
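
The HMAC share links in this goal can be built from the standard library alone. The sketch below shows one way to sign a KB name plus an expiry timestamp with HMAC-SHA256 and verify it later; the token layout (`payload.signature`, both base64url-encoded), the function names, and the injectable `now` parameter are assumptions for illustration. Password hashing with `bcrypt` is a separate step and is not shown.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional


def make_share_token(
    kb_name: str, secret: bytes, ttl_seconds: int, now: Optional[float] = None
) -> str:
    """Sign 'kb_name|expiry' with HMAC-SHA256 and return 'payload.signature'."""
    issued = time.time() if now is None else now
    expires_at = int(issued) + ttl_seconds
    payload = f"{kb_name}|{expires_at}".encode()
    signature = hmac.new(secret, payload, hashlib.sha256).digest()
    return ".".join(
        base64.urlsafe_b64encode(part).decode() for part in (payload, signature)
    )


def verify_share_token(
    token: str, secret: bytes, now: Optional[float] = None
) -> Optional[str]:
    """Return the KB name if the token is authentic and unexpired, else None."""
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:  # wrong number of parts or invalid base64
        return None
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(signature, expected):
        return None
    kb_name, _, expires_raw = payload.decode().rpartition("|")
    current = time.time() if now is None else now
    if current >= int(expires_raw):
        return None
    return kb_name


SEVEN_DAYS = 7 * 24 * 3600
token = make_share_token("finance-kb", b"demo-secret", SEVEN_DAYS)
print(verify_share_token(token, b"demo-secret"))  # finance-kb
```

The signature is checked before the expiry is parsed, so a forged or corrupted payload is rejected without its contents ever being trusted.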
### Files Created

```
voicevault/kb/
├── kb_manager.py           # Create/list/delete knowledge bases
├── access_control.py       # bcrypt password, HMAC share links
└── audit_log.py            # Query logging to SQLite

voicevault/tts/
└── web_speech.py           # Web Speech API JS bridge

voicevault/storage/
└── sqlite_store.py         # Complete CRUD (extended from Phase 0)

ui/tabs/
├── ask_tab.py              # Full voice query tab
├── kb_tab.py               # Full KB manager tab
├── analytics_tab.py        # Charts + metrics tab
└── settings_tab.py         # All configurable parameters

ui/components/
├── citation_panel.py       # Citation highlighting component
└── audio_controls.py       # TTS playback controls

tests/
├── test_phase5.py          # UI component + access control tests
└── test_e2e.py             # Full end-to-end pipeline test

DOCS/
└── phase5_ui_access.md
```

### Key Components

**KBManager:**

- Creates per-KB directory: `data/{kb_name}/chroma/`, `bm25.pkl`, `voicevault.db`
- Lists all KBs with metadata (doc count, chunk count, last updated)
- Delete KB: removes directory + SQLite row

**AccessControl:**

- Password hash: `bcrypt` with work factor 12
- Share link: `HMAC-SHA256` signed token with KB name + expiry
- Token validation on every query to password-protected KB

**AuditLog:**

- Every query logs: session_id, kb_names, voice_query (anonymized), latency, timestamp
- Viewable in Analytics tab

**Web Speech API Bridge:**

- JavaScript injected via `gr.HTML` component
- `window.speechSynthesis.speak()` triggered from Python via Gradio's JS bridge
- Voice selector, rate slider, pitch slider
- Pause/Resume/Restart controls

**UI Tabs:**

- **Ask tab:** Mic button → live transcript → KB selector → streaming answer → citation panel → speak button
- **KB tab:** Create KB form + document uploader (PDF/MD/HTML/DOCX) + progress bar + doc list
- **Analytics tab:** Query volume chart + latency breakdown + top documents + Groq quota gauge
- **Settings tab:** ASR model, voice settings, retrieval params, LLM params, chunking params

### Tests

| Test | Pass Criteria |
|------|---------------|
| `test_kb_create_delete` | KB directory created/removed correctly |
| `test_bcrypt_password` | Hash + verify round-trip |
| `test_hmac_share_link` | Token validates within expiry, fails after |
| `test_audit_log_write` | Query logged to SQLite correctly |
| `test_access_control_wrong_pw` | Wrong password → access denied |
| `test_e2e_pipeline` | PDF upload → query → cited answer (mocked LLM) |

---

## 10. Quality Gates

Every phase must pass ALL gates before moving to the next phase:

| Gate | Requirement |
|------|-------------|
| **Zero import errors** | `python -m pytest tests/ --co -q` exits 0 |
| **All tests pass** | `pytest tests/test_phaseN.py` — 100% green |
| **No bare except** | No `except:` or `except Exception:` without logging |
| **Type annotations** | Every public function has full type hints |
| **No unused imports** | `pylint --disable=all --enable=W0611` passes |
| **No secrets in code** | No API keys, passwords, or tokens hardcoded |
| **Pathlib throughout** | No `os.path` usage in any module |

---

## 11. Security Audit Checklist

- [ ] No API keys committed to git (enforced by .gitignore + .env.example)
- [ ] All file uploads validated: extension whitelist + MIME check + size limit
- [ ] SQLite queries use parameterized statements (no f-string SQL)
- [ ] bcrypt work factor ≥ 12 for password hashing
- [ ] HMAC share tokens have expiry (default: 7 days)
- [ ] `trafilatura` URL fetching: no SSRF — block private IP ranges
- [ ] ChromaDB stored in non-public path (never served as static file)
- [ ] BM25 pickle files: only loaded from trusted internal paths
- [ ] Gradio app: file upload restricted to `data/uploads/` sandbox directory
- [ ] Audit log: voice queries anonymized before storage (hash, not raw text)

---

## 12. Progress Tracker

| Phase | Status | Tests | Docs |
|-------|--------|-------|------|
| Phase 0 — Foundation | ✅ Done | ✅ 58/58 | ✅ phase0_foundation.md |
| Phase 1 — Ingestion | ✅ Done | ✅ 46/46 | ✅ phase1_ingestion.md |
| Phase 2 — Retrieval | ✅ Done | ✅ 33/33 | ✅ phase2_retrieval.md |
| Phase 3 — ASR | ✅ Done | ✅ 45/47 (2 skipped) | ✅ phase3_asr.md |
| Phase 4 — Generation | ✅ Done | ✅ 72/72 | ✅ phase4_generation.md |
| Phase 5 — UI & Access | ✅ Done | ✅ 55/55 | ✅ phase5_ui_access.md |

---

*VoiceVault · Navnit Amrutharaj · navnita004@gmail.com · github.com/ninjacode911*