# VoiceVault – End-to-End Implementation Plan

**Author:** Navnit Amrutharaj
**Model:** VoiceVault v1.0 – Voice-First RAG Knowledge Agent
**Stack:** Whisper · LangChain · ChromaDB · Groq · Gradio
**Target:** $0/month · HuggingFace Spaces · 10 Weeks
**Plan Date:** March 2026

---
## Table of Contents

1. [Project Overview](#1-project-overview)
2. [Architecture Summary](#2-architecture-summary)
3. [Phase Map](#3-phase-map)
4. [Phase 0 – Project Foundation](#phase-0--project-foundation)
5. [Phase 1 – Document Ingestion Pipeline](#phase-1--document-ingestion-pipeline)
6. [Phase 2 – Hybrid Retrieval Engine](#phase-2--hybrid-retrieval-engine)
7. [Phase 3 – ASR & Voice Input](#phase-3--asr--voice-input)
8. [Phase 4 – Generation Chain & Citations](#phase-4--generation-chain--citations)
9. [Phase 5 – Full UI, TTS & Access Control](#phase-5--full-ui-tts--access-control)
10. [Quality Gates](#10-quality-gates)
11. [Security Audit Checklist](#11-security-audit-checklist)
12. [Progress Tracker](#12-progress-tracker)

---
## 1. Project Overview

VoiceVault is a **voice-first retrieval-augmented generation (RAG) knowledge agent** that enables users to:

- Speak questions into a browser microphone
- Have each question transcribed (Whisper), matched against retrieved passages, answered by an LLM, and read back aloud
- Reference private document collections (PDFs, Notion exports, Confluence, DOCX, MD)
- Receive fully cited answers anchored to source document + page + paragraph

**Core differentiator:** Hybrid BM25 + vector search with Reciprocal Rank Fusion (RRF) + cross-encoder reranking, demonstrating enterprise-grade retrieval depth that most RAG tutorials skip.

---
## 2. Architecture Summary

```
INGESTION PATH (one-time per document set)
User uploads PDFs / HTML / DOCX / MD
        ↓
DocumentParser → text extraction (PyMuPDF, BS4, python-docx)
        ↓
SemanticChunker → sentence-aware chunks (spaCy + cosine boundary)
        ↓
IndexBuilder → ChromaDB (vectors) + BM25 (keywords) + SQLite (metadata)

QUERY PATH (real-time, per user question)
Browser mic → Gradio Audio → Whisper Large-v3 (HuggingFace GPU)
        ↓
QueryPreprocessor → cleanup + intent class + language detect
        ↓
HybridRetriever → BM25 top-20 + Vector top-20 → RRF merge → CrossEncoder top-5
        ↓
LangChain LCEL → Groq Llama-3.1-70B (stream) / Gemini Flash (fallback)
        ↓
CitationInjector → [Source: filename, p.N] inline citations
        ↓
Gradio UI (text + highlight citations) + Web Speech API (spoken answer)
```

---
## 3. Phase Map

| Phase | Name | Weeks | Core Deliverables |
|-------|------|-------|-------------------|
| **0** | Project Foundation | 0 | Scaffold, config, models, SQLite schema, Gradio skeleton |
| **1** | Document Ingestion | 1–2 | Parser, semantic chunker, ChromaDB + BM25 + SQLite indexer |
| **2** | Hybrid Retrieval | 3 | BM25 + vector + RRF + cross-encoder + diversity filter |
| **3** | ASR & Voice Input | 4 | Whisper Large-v3, Distil fallback, query preprocessor |
| **4** | Generation & Citations | 5 | LangChain LCEL, Groq, Gemini fallback, faithfulness guard |
| **5** | Full UI & Access Control | 6–8 | 4-tab Gradio UI, Web Speech TTS, multi-KB, bcrypt, audit log |

---
## Phase 0 – Project Foundation

### Goal

Establish the complete project skeleton (directory structure, dependencies, centralized config, Pydantic data models, SQLite schema, and a working 4-tab Gradio scaffold) before any business logic is written.

### Files Created

```
voicevault/
├── app.py                      # Gradio Blocks entry point
├── config.py                   # Pydantic-settings centralized config
├── requirements.txt            # All project dependencies (pinned)
├── .env.example                # Environment variable template
├── voicevault/
│   ├── __init__.py             # Package init + version
│   ├── models.py               # Pydantic data models (all schemas)
│   ├── asr/__init__.py
│   ├── ingestion/__init__.py
│   ├── retrieval/__init__.py
│   ├── generation/__init__.py
│   ├── kb/__init__.py
│   ├── tts/__init__.py
│   └── storage/
│       ├── __init__.py
│       └── sqlite_store.py     # Schema creation + DB init
├── ui/
│   ├── __init__.py
│   ├── tabs/
│   │   ├── __init__.py
│   │   ├── ask_tab.py          # Placeholder – voice query tab
│   │   ├── kb_tab.py           # Placeholder – KB manager tab
│   │   ├── analytics_tab.py    # Placeholder – analytics tab
│   │   └── settings_tab.py     # Placeholder – settings tab
│   └── components/
│       ├── __init__.py
│       ├── citation_panel.py   # Placeholder – citation display
│       └── audio_controls.py   # Placeholder – TTS controls
├── tests/
│   ├── __init__.py
│   ├── conftest.py             # Pytest fixtures
│   └── test_phase0.py          # Foundation smoke tests
├── data/                       # Runtime data (gitignored)
└── DOCS/
    └── phase0_foundation.md    # Phase 0 documentation
```

### Key Decisions

- **pydantic-settings** for type-safe env var loading (no raw `os.environ` calls)
- **pathlib.Path** throughout – cross-platform, no `os.path`
- **SQLite stdlib** for metadata – zero-dependency, portable, no server
- **Gradio 4.x Blocks** for UI – native HuggingFace Spaces support
- **`__version__` sentinel** in `voicevault/__init__.py` for release tracking
- **Data models locked early** – prevents schema drift across phases
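
As a rough illustration of the "SQLite stdlib" decision, the sketch below creates the three tables that the Phase 0 tests expect, using only the standard library. Only the table names come from this plan; the column layout and the `init_db` / `list_tables` helper names are assumptions made for the example.

```python
from __future__ import annotations

import sqlite3
from pathlib import Path

# Hypothetical column layout; only the three table names come from the plan.
SCHEMA = """
CREATE TABLE IF NOT EXISTS knowledge_bases (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS documents (
    id INTEGER PRIMARY KEY,
    kb_id INTEGER NOT NULL REFERENCES knowledge_bases(id),
    filename TEXT NOT NULL,
    sha256 TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS query_log (
    id INTEGER PRIMARY KEY,
    session_id TEXT NOT NULL,
    query_hash TEXT NOT NULL,
    latency_ms REAL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def init_db(db_path: Path | str) -> sqlite3.Connection:
    """Create the schema if it is missing and return an open connection."""
    conn = sqlite3.connect(str(db_path))
    conn.executescript(SCHEMA)
    conn.commit()
    return conn


def list_tables(conn: sqlite3.Connection) -> set[str]:
    """Return the names of all tables currently defined in the database."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return {name for (name,) in rows}
```

Because every statement uses `IF NOT EXISTS`, calling `init_db` repeatedly on the same file is safe, which suits an app that initializes storage on each start.
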

### Tests

| Test | Description | Pass Criteria |
|------|-------------|---------------|
| `test_config_loads` | Config instantiates without exceptions | No exception |
| `test_env_defaults` | Default values are correct types | All fields pass type check |
| `test_db_init` | SQLite schema creates 3 tables | Tables `knowledge_bases`, `documents`, `query_log` exist |
| `test_data_dirs` | Data directory structure is created | Dirs exist after init |
| `test_models_instantiate` | All Pydantic models can be instantiated | No validation errors |
| `test_gradio_builds` | Gradio demo object builds without error | `gr.Blocks` object created |

### Documentation

→ See `DOCS/phase0_foundation.md`

---
## Phase 1 – Document Ingestion Pipeline

### Goal

Build the complete document ingestion pipeline: parse any supported document format, semantically chunk the text, generate embeddings, build the BM25 index, store everything in ChromaDB + SQLite, and implement SHA-256-based deduplication.

### Files Created

```
voicevault/ingestion/
├── document_parser.py          # PDF, HTML, DOCX, MD, TXT, URL parsers
├── semantic_chunker.py         # spaCy + cosine-similarity boundary chunker
└── index_builder.py            # ChromaDB + BM25 + SQLite indexer + dedup
voicevault/storage/
├── sqlite_store.py             # Full CRUD: KB, document, chunk metadata
└── chroma_store.py             # ChromaDB collection management
tests/
└── test_phase1.py              # Ingestion unit + integration tests
DOCS/
└── phase1_ingestion.md
```

### Key Components

**DocumentParser** – Multi-format dispatcher:

- PDF: `PyMuPDF` (fitz) – preserves page numbers, extracts tables as text
- HTML: `BeautifulSoup4` – Notion/Confluence exports, preserves heading hierarchy
- DOCX: `python-docx` – heading-aware extraction
- Markdown: `markdown-it-py` – heading hierarchy → section metadata
- Plain text: paragraph-level splitting
- URL: `trafilatura` – clean article extraction from any public URL
- Scanned PDF fallback: `pytesseract` OCR when no text layer found

**SemanticChunker** – Boundary detection:

- `spaCy en_core_web_sm` sentence tokenization
- Cosine similarity between adjacent sentence embeddings
- New chunk when similarity < 0.5 (configurable threshold)
- Target: 400–600 tokens per chunk, 50-token overlap
- Special handling: tables as atomic units, code blocks atomic, lists kept together
- Metadata per chunk: source_file, page_number, section_heading, chunk_index, timestamp
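
The boundary rule above can be sketched in plain Python. This is a simplified illustration, not the project's chunker: it takes pre-split sentences and an `embed` callable as inputs (both assumptions standing in for spaCy and the embedding model), counts whitespace-separated words as a crude token proxy, and leaves out the 50-token overlap and the atomic handling of tables and code blocks.

```python
from __future__ import annotations

import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def chunk_by_similarity(
    sentences: Sequence[str],
    embed: Callable[[str], Vector],  # hypothetical embedding function
    threshold: float = 0.5,
    max_tokens: int = 600,
) -> list[list[str]]:
    """Group sentences, starting a new chunk on a topic shift or a size overflow."""
    chunks: list[list[str]] = []
    current: list[str] = []
    current_tokens = 0
    prev_vec: Vector | None = None
    for sentence in sentences:
        vec = embed(sentence)
        n_tokens = len(sentence.split())  # crude proxy for model tokens
        topic_shift = prev_vec is not None and cosine(prev_vec, vec) < threshold
        too_long = current_tokens + n_tokens > max_tokens
        if current and (topic_shift or too_long):
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n_tokens
        prev_vec = vec
    if current:
        chunks.append(current)
    return chunks
```

Passing the embedder as a parameter keeps the boundary logic testable with toy vectors, without loading a real model.
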

**IndexBuilder** – Dual-index construction:

- SHA-256 hash of chunk text → deduplication (skip re-indexed unchanged content)
- `sentence-transformers all-MiniLM-L6-v2` → 384-dim embeddings → ChromaDB
- `rank_bm25` BM25Okapi index → serialized to `bm25.pkl`
- SQLite metadata: `chunks` table linking every chunk to its source doc
- Incremental update: only new/changed chunks re-embedded
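
A minimal sketch of the hash-based deduplication step, using only `hashlib`. The function names and the in-memory `known_hashes` set are illustrative assumptions; in the plan the hashes would presumably live in the SQLite `chunks` table.

```python
from __future__ import annotations

import hashlib


def chunk_hash(text: str) -> str:
    """SHA-256 hex digest of the chunk text, used as its identity."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_new_chunks(chunks: list[str], known_hashes: set[str]) -> list[tuple[str, str]]:
    """Return (hash, text) pairs for unseen chunks and record them in known_hashes."""
    fresh: list[tuple[str, str]] = []
    for text in chunks:
        digest = chunk_hash(text)
        if digest in known_hashes:
            continue  # unchanged content: skip re-embedding
        known_hashes.add(digest)
        fresh.append((digest, text))
    return fresh
```

Only the chunks returned by `select_new_chunks` would then need to be embedded and written to the indexes, which is what makes re-uploads incremental.
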

### Tests

| Test | Pass Criteria |
|------|---------------|
| `test_pdf_parse` | Extracts text with correct page numbers |
| `test_html_parse` | Extracts headings and paragraphs from Notion HTML |
| `test_docx_parse` | Extracts text from DOCX with heading metadata |
| `test_semantic_chunker` | Chunks respect sentence boundaries, 100–600 tokens |
| `test_deduplication` | Same doc uploaded twice → chunks not duplicated |
| `test_bm25_build` | BM25 index serializes and reloads correctly |
| `test_chroma_store` | Vectors stored and queryable in ChromaDB |
| `test_sqlite_metadata` | All chunk metadata persisted to SQLite |
| `test_incremental_update` | Only new chunks indexed on re-upload |

---
## Phase 2 – Hybrid Retrieval Engine

### Goal

Implement the hybrid BM25 + dense vector retrieval pipeline with Reciprocal Rank Fusion merging, cross-encoder reranking, diversity filtering, query expansion, and context window assembly.

### Files Created

```
voicevault/retrieval/
├── bm25_retriever.py           # rank_bm25 keyword search
├── vector_retriever.py         # ChromaDB semantic search
├── hybrid_retriever.py         # RRF merge + cross-encoder + diversity filter
└── context_builder.py          # Formats top-k chunks for LLM prompt
tests/
└── test_phase2.py              # Retrieval unit + benchmark tests
DOCS/
└── phase2_retrieval.md
```

### Key Components

**BM25Retriever:**

- Loads pre-built BM25 index from disk
- Tokenizes query, scores all chunks, returns top-20

**VectorRetriever:**

- Encodes query with `all-MiniLM-L6-v2`
- ChromaDB cosine similarity query → top-20

**HybridRetriever (RRF core):**

```
query → [QueryExpander: 2 paraphrases]
      → BM25 top-20 + Vector top-20 (parallel)
      → RRF merge (k=60): score = Σ 1/(k + rank)
      → CrossEncoder ms-marco-MiniLM-L12-v2 rescores top-20
      → DiversityFilter: max 2 chunks from same page
      → Final top-5 chunks
```
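
The RRF merge and the diversity filter in the pipeline above are small enough to sketch directly. The code below is an illustration under assumptions: chunk IDs are plain strings, ranks start at 1, ties are broken alphabetically for determinism, and `page_of` is a hypothetical lookup from chunk ID to `(source_file, page_number)`. The cross-encoder step between the two functions is omitted.

```python
from __future__ import annotations

from collections import defaultdict


def rrf_merge(rankings: list[list[str]], k: int = 60, top_n: int = 20) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank of d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; chunk ID breaks ties so output is deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_n]


def diversity_filter(
    chunk_ids: list[str],
    page_of: dict[str, tuple[str, int]],  # hypothetical: chunk_id -> (source_file, page)
    max_per_page: int = 2,
    top_n: int = 5,
) -> list[str]:
    """Keep ranked order but allow at most max_per_page chunks from any one page."""
    per_page: dict[tuple[str, int], int] = defaultdict(int)
    kept: list[str] = []
    for chunk_id in chunk_ids:
        page = page_of[chunk_id]
        if per_page[page] >= max_per_page:
            continue
        per_page[page] += 1
        kept.append(chunk_id)
        if len(kept) == top_n:
            break
    return kept
```

A chunk that appears in both the BM25 and the vector list collects two reciprocal-rank terms, so agreement between the retrievers is rewarded without having to put their raw scores on a common scale.
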

**ContextBuilder:**

- Formats chunks as: `[Source: filename, p.N | Section: heading]\n{text}`
- Appends conversation history (last 5 turns)
- Returns context string ready for LLM prompt
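
A small sketch of how such a context string could be assembled. The `Chunk` dataclass, the `build_context` name, and the "User:/Assistant:" history layout are assumptions for illustration; only the source-header format and the five-turn window come from the bullets above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical minimal view of a retrieved chunk."""
    source_file: str
    page_number: int
    section_heading: str
    text: str


def build_context(chunks: list[Chunk], history: list[tuple[str, str]], max_turns: int = 5) -> str:
    """Format chunks with source headers, then append the most recent turns."""
    blocks = [
        f"[Source: {c.source_file}, p.{c.page_number} | Section: {c.section_heading}]\n{c.text}"
        for c in chunks
    ]
    parts = ["\n\n".join(blocks)]
    recent = history[-max_turns:] if max_turns > 0 else []
    if recent:
        turns = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in recent)
        parts.append("Conversation history:\n" + turns)
    return "\n\n".join(parts)
```
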

### Tests

| Test | Pass Criteria |
|------|---------------|
| `test_bm25_retriever` | Returns ranked results for keyword query |
| `test_vector_retriever` | Returns semantically relevant results |
| `test_rrf_merge` | RRF scores computed correctly for known ranks |
| `test_cross_encoder_rerank` | Re-ranked order differs from RRF order (improvement) |
| `test_diversity_filter` | Max 2 chunks per page in final results |
| `test_hybrid_recall` | Recall@5 ≥ 0.80 on 50-Q benchmark dataset |
| `test_context_builder` | Output is valid string with source citations |
| `test_query_expansion` | Returns 2 paraphrase variants |

---
## Phase 3 – ASR & Voice Input

### Goal

Integrate Whisper Large-v3 for high-quality speech-to-text transcription, with Distil-Whisper CPU fallback, browser microphone capture via Gradio Audio, and a query preprocessor that cleans transcripts and classifies query intent.

### Files Created

```
voicevault/asr/
├── whisper_transcriber.py      # Whisper Large-v3 + Distil-Whisper fallback
└── query_preprocessor.py       # Cleanup, intent classification, language detect
tests/
└── test_phase3.py              # ASR unit tests + WER evaluation
DOCS/
└── phase3_asr.md
```

### Key Components

**WhisperTranscriber:**

- Primary: `openai/whisper-large-v3` (HuggingFace GPU pipeline)
- Fallback: `distil-whisper/distil-large-v3` (CPU, 6× faster, <1% WER diff)
- VAD pre-check: reject audio < 1s or silent audio
- Returns: `transcript`, `language`, `confidence`, `model_used`, `latency_ms`
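
The pre-check can be approximated without any audio library. The sketch below is an energy-based stand-in, not a true voice activity detector: it assumes mono samples as floats in [-1, 1], and the `check_audio` name and the `silence_rms` threshold of 0.01 are illustrative guesses rather than values from this plan.

```python
from __future__ import annotations

import math
from typing import Sequence


def check_audio(
    samples: Sequence[float],
    sample_rate: int,
    min_seconds: float = 1.0,
    silence_rms: float = 0.01,  # assumed threshold; tune on real recordings
) -> float:
    """Return the clip duration in seconds, or raise ValueError if it should be rejected."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    duration = len(samples) / sample_rate
    if duration < min_seconds:
        raise ValueError(f"audio too short: {duration:.2f}s < {min_seconds:.2f}s")
    # Root-mean-square energy over the whole clip as a cheap silence test.
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms < silence_rms:
        raise ValueError("audio appears to be silent")
    return duration
```

Raising `ValueError` matches the `test_vad_short_audio` criterion in the test table, and running the check before the model call avoids spending GPU time on empty recordings.
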

**QueryPreprocessor:**

- Lowercase normalization, punctuation repair
- Filler word removal: um, uh, like, you know
- Language detection: `langdetect` library
- Query type classification:
  - `factual` – "What is...", "Who...", "When..."
  - `summary` – "Summarise...", "Give me an overview..."
  - `compare` – "Compare...", "What's the difference..."
- Routes to different retrieval strategies per type
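
The cleanup and intent rules above lend themselves to a regex sketch. The patterns and function names below are assumptions, and the approach is deliberately naive: removing every standalone "like" also damages questions such as "what does it look like", so a real preprocessor would need a more careful rule. Language detection is left out.

```python
from __future__ import annotations

import re

# Filler words from the plan, together with any commas/spaces around them.
FILLER_RE = re.compile(r"[,\s]*\b(?:um+|uh+|you know|like)\b[,\s]*")

# Order matters: "what's the difference" must be tested before the generic "what".
INTENT_PATTERNS = [
    ("compare", re.compile(r"^(?:compare|what'?s the difference|what is the difference)\b")),
    ("summary", re.compile(r"^(?:summari[sz]e|give me an overview)\b")),
    ("factual", re.compile(r"^(?:what|who|when)\b")),
]


def clean_transcript(text: str) -> str:
    """Lowercase, drop filler words, and collapse whitespace."""
    text = FILLER_RE.sub(" ", text.lower())
    return re.sub(r"\s+", " ", text).strip(" ,")


def classify_intent(cleaned: str, default: str = "factual") -> str:
    """Return the first matching intent label for an already-cleaned query."""
    for label, pattern in INTENT_PATTERNS:
        if pattern.search(cleaned):
            return label
    return default
```
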

### Tests

| Test | Pass Criteria |
|------|---------------|
| `test_preprocessor_cleanup` | Filler words removed, normalized |
| `test_intent_factual` | "What is X?" → type=factual |
| `test_intent_summary` | "Summarise the report" → type=summary |
| `test_intent_compare` | "Compare A and B" → type=compare |
| `test_language_detection` | English text → "en" |
| `test_vad_short_audio` | < 1s audio raises ValueError |
| `test_whisper_mock` | Transcriber returns correct schema with mocked model |

---
## Phase 4 – Generation Chain & Citations

### Goal

Build the full LangChain LCEL generation chain: Groq Llama-3.1-70B as primary LLM with streaming, Gemini 1.5 Flash as automatic fallback, citation injection with the `[Source: file, p.N]` protocol, a faithfulness guard for out-of-context detection, and conversation memory.

### Files Created

```
voicevault/generation/
├── answer_chain.py             # LangChain LCEL + Groq + Gemini fallback
├── citation_injector.py        # Maps [Doc:Page] citations to source chunks
└── faithfulness_guard.py       # Out-of-context detection
tests/
└── test_phase4.py              # Generation unit tests
DOCS/
└── phase4_generation.md
```

### Key Components

**AnswerChain (LCEL):**

```
context_string + query + history
  → PromptTemplate (system: citation protocol + faithfulness instructions)
  → ChatGroq (llama-3.1-70b-versatile, streaming, temp=0.1)
      on quota error → ChatGoogleGenerativeAI (gemini-1.5-flash)
  → StrOutputParser
  → CitationInjector (post-processing)
```
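
The "on quota error" branch can be illustrated independently of any LLM SDK. In the sketch below, `primary` and `fallback` are hypothetical callables that stream tokens for a prompt, and `QuotaExceededError` is a stand-in for whatever rate-limit exception the provider client actually raises; none of these names come from LangChain or Groq.

```python
from __future__ import annotations

from typing import Callable, Iterable, Iterator


class QuotaExceededError(RuntimeError):
    """Hypothetical stand-in for a provider quota or rate-limit error."""


TokenStream = Callable[[str], Iterable[str]]


def stream_with_fallback(prompt: str, primary: TokenStream, fallback: TokenStream) -> Iterator[str]:
    """Stream from the primary model, switching to the fallback on a quota error."""
    emitted = False
    try:
        for token in primary(prompt):
            emitted = True
            yield token
    except QuotaExceededError:
        if emitted:
            # Part of an answer is already on screen; do not splice in a second model.
            raise
        yield from fallback(prompt)
```

Within LangChain itself, runnables offer a `with_fallbacks` method that serves a similar purpose; the hand-written version here only makes the control flow explicit, including the choice not to switch models mid-answer.
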

**CitationInjector:**

- Parses `[Doc:Page]` markers from LLM output
- Resolves each to the actual chunk's source_file + page_number + excerpt
- Builds `List[Citation]` object for UI display
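
One possible reading of the marker format is `[<document name>:<page number>]`, for example `[policy.pdf:3]`; that interpretation, the `Citation` fields, and the `chunks` lookup keyed by `(source_file, page_number)` are all assumptions in the sketch below. Markers that do not resolve to a retrieved chunk are dropped rather than shown.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical citation record for UI display."""
    source_file: str
    page_number: int
    excerpt: str


# Matches "[name:123]"; the name may not contain brackets or colons.
MARKER_RE = re.compile(r"\[([^\[\]:]+):(\d+)\]")


def extract_citations(
    answer: str,
    chunks: dict[tuple[str, int], str],  # assumed: (source_file, page) -> chunk text
    excerpt_chars: int = 120,
) -> list[Citation]:
    """Resolve each distinct marker in the answer to a retrieved chunk."""
    citations: list[Citation] = []
    seen: set[tuple[str, int]] = set()
    for match in MARKER_RE.finditer(answer):
        key = (match.group(1).strip(), int(match.group(2)))
        if key in seen or key not in chunks:
            continue  # duplicate marker, or a source the retriever never returned
        seen.add(key)
        citations.append(Citation(key[0], key[1], chunks[key][:excerpt_chars]))
    return citations
```
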

**FaithfulnessGuard:**

- System prompt: "If the answer cannot be found in the provided context, respond with exactly: 'I could not find this in your documents.'"
- Post-generation check: if answer references facts not in any retrieved chunk → flag
- Confidence scoring based on retrieval score distribution

### Tests

| Test | Pass Criteria |
|------|---------------|
| `test_citation_injector_parses` | `[Doc:5]` → correct Citation object |
| `test_faithfulness_guard_refusal` | Out-of-context Q → refusal message |
| `test_answer_chain_mock` | Chain runs end-to-end with mocked LLM |
| `test_groq_fallback` | Groq quota error → Gemini client used |
| `test_streaming_output` | Chain yields token-by-token |
| `test_conversation_memory` | Last 5 turns preserved across queries |

---
## Phase 5 – Full UI, TTS & Access Control

### Goal

Build the complete 4-tab Gradio UI, integrate Web Speech API for browser-native TTS, implement the multi-knowledge-base manager, add bcrypt password protection + HMAC share links, and build the analytics + audit log system.

### Files Created

```
voicevault/kb/
├── kb_manager.py               # Create/list/delete knowledge bases
├── access_control.py           # bcrypt password, HMAC share links
└── audit_log.py                # Query logging to SQLite
voicevault/tts/
└── web_speech.py               # Web Speech API JS bridge
voicevault/storage/
└── sqlite_store.py             # Complete CRUD (extended from Phase 0)
ui/tabs/
├── ask_tab.py                  # Full voice query tab
├── kb_tab.py                   # Full KB manager tab
├── analytics_tab.py            # Charts + metrics tab
└── settings_tab.py             # All configurable parameters
ui/components/
├── citation_panel.py           # Citation highlighting component
└── audio_controls.py           # TTS playback controls
tests/
├── test_phase5.py              # UI component + access control tests
└── test_e2e.py                 # Full end-to-end pipeline test
DOCS/
└── phase5_ui_access.md
```

### Key Components

**KBManager:**

- Creates per-KB directory: `data/{kb_name}/chroma/`, `bm25.pkl`, `voicevault.db`
- Lists all KBs with metadata (doc count, chunk count, last updated)
- Delete KB: removes directory + SQLite row

**AccessControl:**

- Password hash: `bcrypt` with work factor 12
- Share link: `HMAC-SHA256` signed token with KB name + expiry
- Token validation on every query to password-protected KB
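
A share token of the kind described above can be built with the standard library alone. The token layout below (base64 payload of `kb_name|expiry`, a dot, then the base64 signature) and both function names are assumptions for illustration; the 7-day default mirrors the expiry listed in the security checklist. The bcrypt side is not shown.

```python
from __future__ import annotations

import base64
import hashlib
import hmac
import time


def make_share_token(kb_name: str, secret: bytes, ttl_seconds: int = 7 * 24 * 3600,
                     now: float | None = None) -> str:
    """Sign 'kb_name|expiry' with HMAC-SHA256 and return payload.signature."""
    expires = int((time.time() if now is None else now) + ttl_seconds)
    payload = f"{kb_name}|{expires}".encode("utf-8")
    signature = hmac.new(secret, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode("ascii") + "."
            + base64.urlsafe_b64encode(signature).decode("ascii"))


def verify_share_token(token: str, secret: bytes, now: float | None = None) -> str | None:
    """Return the KB name if the token is authentic and unexpired, else None."""
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
        kb_name, expires = payload.decode("utf-8").rsplit("|", 1)
        expires_at = int(expires)
    except (ValueError, UnicodeDecodeError):
        return None  # malformed token
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None  # wrong secret or tampered payload
    current = time.time() if now is None else now
    return kb_name if current < expires_at else None
```

`hmac.compare_digest` is used instead of `==` so that the comparison time does not leak how many leading bytes of a forged signature were correct. The `now` parameter exists only to make expiry testable without sleeping.
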

**AuditLog:**

- Every query logs: session_id, kb_names, voice_query (anonymized), latency, timestamp
- Viewable in Analytics tab

**Web Speech API Bridge:**

- JavaScript injected via `gr.HTML` component
- `window.speechSynthesis.speak()` triggered from Python via Gradio's JS bridge
- Voice selector, rate slider, pitch slider
- Pause/Resume/Restart controls

**UI Tabs:**

- **Ask tab:** Mic button → live transcript → KB selector → streaming answer → citation panel → speak button
- **KB tab:** Create KB form + document uploader (PDF/MD/HTML/DOCX) + progress bar + doc list
- **Analytics tab:** Query volume chart + latency breakdown + top documents + Groq quota gauge
- **Settings tab:** ASR model, voice settings, retrieval params, LLM params, chunking params

### Tests

| Test | Pass Criteria |
|------|---------------|
| `test_kb_create_delete` | KB directory created/removed correctly |
| `test_bcrypt_password` | Hash + verify round-trip |
| `test_hmac_share_link` | Token validates within expiry, fails after |
| `test_audit_log_write` | Query logged to SQLite correctly |
| `test_access_control_wrong_pw` | Wrong password → access denied |
| `test_e2e_pipeline` | PDF upload → query → cited answer (mocked LLM) |

---
## 10. Quality Gates

Every phase must pass ALL gates before moving to the next phase:

| Gate | Requirement |
|------|-------------|
| **Zero import errors** | `python -m pytest tests/ --co -q` exits 0 |
| **All tests pass** | `pytest tests/test_phaseN.py` → 100% green |
| **No bare except** | No `except:` or `except Exception:` without logging |
| **Type annotations** | Every public function has full type hints |
| **No unused imports** | `pylint --disable=all --enable=W0611` passes |
| **No secrets in code** | No API keys, passwords, or tokens hardcoded |
| **Pathlib throughout** | No `os.path` usage in any module |

---
## 11. Security Audit Checklist

- [ ] No API keys committed to git (enforced by .gitignore + .env.example)
- [ ] All file uploads validated: extension whitelist + MIME check + size limit
- [ ] SQLite queries use parameterized statements (no f-string SQL)
- [ ] bcrypt work factor ≥ 12 for password hashing
- [ ] HMAC share tokens have expiry (default: 7 days)
- [ ] `trafilatura` URL fetching: no SSRF, block private IP ranges
- [ ] ChromaDB stored in non-public path (never served as static file)
- [ ] BM25 pickle files: only loaded from trusted internal paths
- [ ] Gradio app: file upload restricted to `data/uploads/` sandbox directory
- [ ] Audit log: voice queries anonymized before storage (hash, not raw text)
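
For the SSRF item, one way to screen a URL before handing it to a fetcher is to resolve the host and reject any non-public address. The sketch below uses only the standard library; `is_public_address`, `is_safe_url`, and the injectable `resolve` parameter are hypothetical names, and the check is not a complete defence on its own, since a hostname can resolve differently between the check and the actual fetch (DNS rebinding) and redirects would need the same screening.

```python
from __future__ import annotations

import ipaddress
import socket
from typing import Callable
from urllib.parse import urlparse


def is_public_address(host_ip: str) -> bool:
    """True only for addresses outside private, loopback, link-local and similar ranges."""
    ip = ipaddress.ip_address(host_ip)
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def is_safe_url(url: str, resolve: Callable = socket.getaddrinfo) -> bool:
    """Allow only http(s) URLs whose host resolves exclusively to public addresses."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = resolve(parsed.hostname, parsed.port or 443)
    except OSError:  # includes socket.gaierror for unresolvable hosts
        return False
    # Each getaddrinfo entry ends with a sockaddr tuple whose first item is the IP.
    return bool(infos) and all(is_public_address(info[4][0]) for info in infos)
```
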

---

## 12. Progress Tracker

| Phase | Status | Tests | Docs |
|-------|--------|-------|------|
| Phase 0 – Foundation | ✅ Done | ✅ 58/58 | ✅ phase0_foundation.md |
| Phase 1 – Ingestion | ✅ Done | ✅ 46/46 | ✅ phase1_ingestion.md |
| Phase 2 – Retrieval | ✅ Done | ✅ 33/33 | ✅ phase2_retrieval.md |
| Phase 3 – ASR | ✅ Done | ✅ 45/47 (2 skipped) | ✅ phase3_asr.md |
| Phase 4 – Generation | ✅ Done | ✅ 72/72 | ✅ phase4_generation.md |
| Phase 5 – UI & Access | ✅ Done | ✅ 55/55 | ✅ phase5_ui_access.md |

---

*VoiceVault · Navnit Amrutharaj · navnita004@gmail.com · github.com/ninjacode911*