VoiceVault: End-to-End Implementation Plan
Author: Navnit Amrutharaj
Model: VoiceVault v1.0, Voice-First RAG Knowledge Agent
Stack: Whisper · LangChain · ChromaDB · Groq · Gradio
Target: $0/month · HuggingFace Spaces · 10 Weeks
Plan Date: March 2026
Table of Contents
- Project Overview
- Architecture Summary
- Phase Map
- Phase 0: Project Foundation
- Phase 1: Document Ingestion Pipeline
- Phase 2: Hybrid Retrieval Engine
- Phase 3: ASR & Voice Input
- Phase 4: Generation Chain & Citations
- Phase 5: Full UI, TTS & Access Control
- Quality Gates
- Security Audit Checklist
- Progress Tracker
1. Project Overview
VoiceVault is a voice-first retrieval-augmented generation (RAG) knowledge agent that enables users to:
- Speak questions into a browser microphone
- Have each question transcribed (Whisper), answered from retrieved context, and read back aloud
- Reference private document collections (PDFs, Notion exports, Confluence, DOCX, MD)
- Receive fully cited answers anchored to source document + page + paragraph
Core differentiator: Hybrid BM25 + vector search with Reciprocal Rank Fusion (RRF) + cross-encoder reranking, demonstrating enterprise-grade retrieval depth that most RAG tutorials skip.
2. Architecture Summary
INGESTION PATH (one-time per document set)
User uploads PDFs / HTML / DOCX / MD
    ↓
DocumentParser → text extraction (PyMuPDF, BS4, python-docx)
    ↓
SemanticChunker → sentence-aware chunks (spaCy + cosine boundary)
    ↓
IndexBuilder → ChromaDB (vectors) + BM25 (keywords) + SQLite (metadata)

QUERY PATH (real-time, per user question)
Browser mic → Gradio Audio → Whisper Large-v3 (HuggingFace GPU)
    ↓
QueryPreprocessor → cleanup + intent class + language detect
    ↓
HybridRetriever → BM25 top-20 + Vector top-20 → RRF merge → CrossEncoder top-5
    ↓
LangChain LCEL → Groq Llama-3.1-70B (stream) / Gemini Flash (fallback)
    ↓
CitationInjector → [Source: filename, p.N] inline citations
    ↓
Gradio UI (text + highlight citations) + Web Speech API (spoken answer)
3. Phase Map
| Phase | Name | Weeks | Core Deliverables |
|---|---|---|---|
| 0 | Project Foundation | 0 | Scaffold, config, models, SQLite schema, Gradio skeleton |
| 1 | Document Ingestion | 1–2 | Parser, semantic chunker, ChromaDB + BM25 + SQLite indexer |
| 2 | Hybrid Retrieval | 3 | BM25 + vector + RRF + cross-encoder + diversity filter |
| 3 | ASR & Voice Input | 4 | Whisper Large-v3, Distil fallback, query preprocessor |
| 4 | Generation & Citations | 5 | LangChain LCEL, Groq, Gemini fallback, faithfulness guard |
| 5 | Full UI & Access Control | 6–8 | 4-tab Gradio UI, Web Speech TTS, multi-KB, bcrypt, audit log |
Phase 0: Project Foundation
Goal
Establish the complete project skeleton (directory structure, dependencies, centralized config, Pydantic data models, SQLite schema, and a working 4-tab Gradio scaffold) before any business logic is written.
Files Created
voicevault/
├── app.py                      # Gradio Blocks entry point
├── config.py                   # Pydantic-settings centralized config
├── requirements.txt            # All project dependencies (pinned)
├── .env.example                # Environment variable template
├── voicevault/
│   ├── __init__.py             # Package init + version
│   ├── models.py               # Pydantic data models (all schemas)
│   ├── asr/__init__.py
│   ├── ingestion/__init__.py
│   ├── retrieval/__init__.py
│   ├── generation/__init__.py
│   ├── kb/__init__.py
│   ├── tts/__init__.py
│   └── storage/
│       ├── __init__.py
│       └── sqlite_store.py     # Schema creation + DB init
├── ui/
│   ├── __init__.py
│   ├── tabs/
│   │   ├── __init__.py
│   │   ├── ask_tab.py          # Placeholder: voice query tab
│   │   ├── kb_tab.py           # Placeholder: KB manager tab
│   │   ├── analytics_tab.py    # Placeholder: analytics tab
│   │   └── settings_tab.py     # Placeholder: settings tab
│   └── components/
│       ├── __init__.py
│       ├── citation_panel.py   # Placeholder: citation display
│       └── audio_controls.py   # Placeholder: TTS controls
├── tests/
│   ├── __init__.py
│   ├── conftest.py             # Pytest fixtures
│   └── test_phase0.py          # Foundation smoke tests
├── data/                       # Runtime data (gitignored)
└── DOCS/
    └── phase0_foundation.md    # Phase 0 documentation
Key Decisions
- pydantic-settings for type-safe env var loading (no raw os.environ calls)
- pathlib.Path throughout: cross-platform, no os.path
- SQLite stdlib for metadata: zero-dependency, portable, no server
- Gradio 4.x Blocks for UI: native HuggingFace Spaces support
- __version__ sentinel in voicevault/__init__.py for release tracking
- Data models locked early: prevents schema drift across phases
Tests
| Test | Description | Pass Criteria |
|---|---|---|
| test_config_loads | Config instantiates without exceptions | No exception |
| test_env_defaults | Default values are correct types | All fields pass type check |
| test_db_init | SQLite schema creates 3 tables | Tables knowledge_bases, documents, query_log exist |
| test_data_dirs | Data directory structure is created | Dirs exist after init |
| test_models_instantiate | All Pydantic models can be instantiated | No validation errors |
| test_gradio_builds | Gradio demo object builds without error | gr.Blocks object created |
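The test_db_init criterion above can be illustrated with a minimal schema-initialization sketch using the standard-library sqlite3 module. Only the three table names come from this plan; every column name, and the helper names init_db and list_tables, are assumptions made for illustration.

```python
import sqlite3
from typing import List

# Only the table names (knowledge_bases, documents, query_log) come from the
# plan. All columns below are hypothetical placeholders.
SCHEMA = """
CREATE TABLE IF NOT EXISTS knowledge_bases (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS documents (
    id INTEGER PRIMARY KEY,
    kb_id INTEGER NOT NULL REFERENCES knowledge_bases(id),
    filename TEXT NOT NULL,
    sha256 TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS query_log (
    id INTEGER PRIMARY KEY,
    session_id TEXT NOT NULL,
    query_hash TEXT NOT NULL,
    latency_ms INTEGER,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def init_db(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and apply the schema idempotently."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    conn.commit()
    return conn


def list_tables(conn: sqlite3.Connection) -> List[str]:
    """Return the names of all user tables, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]
```

Because every statement uses CREATE TABLE IF NOT EXISTS, calling init_db repeatedly on the same file is safe, which suits an app that initializes storage at startup.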
Documentation
→ See DOCS/phase0_foundation.md
Phase 1: Document Ingestion Pipeline
Goal
Build the complete document ingestion pipeline: parse any supported document format, semantically chunk the text, generate embeddings, build the BM25 index, store everything in ChromaDB + SQLite, and implement SHA-256-based deduplication.
Files Created
voicevault/ingestion/
├── document_parser.py    # PDF, HTML, DOCX, MD, TXT, URL parsers
├── semantic_chunker.py   # spaCy + cosine-similarity boundary chunker
└── index_builder.py      # ChromaDB + BM25 + SQLite indexer + dedup
voicevault/storage/
├── sqlite_store.py       # Full CRUD: KB, document, chunk metadata
└── chroma_store.py       # ChromaDB collection management
tests/
└── test_phase1.py        # Ingestion unit + integration tests
DOCS/
└── phase1_ingestion.md
Key Components
DocumentParser (multi-format dispatcher):
- PDF: PyMuPDF (fitz); preserves page numbers, extracts tables as text
- HTML: BeautifulSoup4; Notion/Confluence exports, preserves heading hierarchy
- DOCX: python-docx; heading-aware extraction
- Markdown: markdown-it-py; heading hierarchy → section metadata
- Plain text: paragraph-level splitting
- URL: trafilatura; clean article extraction from any public URL
- Scanned PDF fallback: pytesseract OCR when no text layer found
SemanticChunker (boundary detection):
- spaCy en_core_web_sm sentence tokenization
- Cosine similarity between adjacent sentence embeddings
- New chunk when similarity < 0.5 (configurable threshold)
- Target: 400–600 tokens per chunk, 50-token overlap
- Special handling: tables as atomic units, code blocks atomic, lists kept together
- Metadata per chunk: source_file, page_number, section_heading, chunk_index, timestamp
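The boundary rule above (start a new chunk when adjacent-sentence similarity drops below the threshold) might be sketched as follows. This is a simplified illustration, not the project's chunker: it takes precomputed sentence embeddings as plain sequences of floats and leaves out the token-size target, the overlap, and the atomic handling of tables and code blocks. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_on_similarity(
    sentences: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    threshold: float = 0.5,
) -> List[List[str]]:
    """Group consecutive sentences into chunks.

    A new chunk starts whenever the similarity between a sentence and the
    one before it falls below `threshold`.
    """
    if not sentences:
        return []
    chunks: List[List[str]] = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append([])  # topic shift detected: open a new chunk
        chunks[-1].append(sentences[i])
    return chunks
```

With toy two-dimensional embeddings such as (1.0, 0.0), (0.9, 0.1), (0.0, 1.0), the first two sentences stay together and the third opens a new chunk, since only the second pair falls below 0.5.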
IndexBuilder (dual-index construction):
- SHA-256 hash of chunk text → deduplication (skip re-indexing unchanged content)
- sentence-transformers all-MiniLM-L6-v2 → 384-dim embeddings → ChromaDB
- rank_bm25 BM25Okapi index → serialized to bm25.pkl
- SQLite metadata: chunks table linking every chunk to its source doc
- Incremental update: only new/changed chunks re-embedded
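The hash-based deduplication and incremental update described above can be sketched with hashlib alone. The helper names are hypothetical, and the in-memory set stands in for whatever store the project uses to remember already-indexed hashes (presumably SQLite).

```python
import hashlib
from typing import Iterable, List, Set


def chunk_hash(text: str) -> str:
    """Return the hex SHA-256 digest of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_new_chunks(chunks: Iterable[str], seen_hashes: Set[str]) -> List[str]:
    """Return only chunks whose hash is not yet in `seen_hashes`.

    `seen_hashes` is updated in place, so a second call with the same
    content returns an empty list and nothing is re-embedded.
    """
    new_chunks: List[str] = []
    for text in chunks:
        digest = chunk_hash(text)
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            new_chunks.append(text)
    return new_chunks
```

Hashing the chunk text rather than the whole file means an edited document only pays the embedding cost for the chunks that actually changed.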
Tests
| Test | Pass Criteria |
|---|---|
| test_pdf_parse | Extracts text with correct page numbers |
| test_html_parse | Extracts headings and paragraphs from Notion HTML |
| test_docx_parse | Extracts text from DOCX with heading metadata |
| test_semantic_chunker | Chunks respect sentence boundaries, 100–600 tokens |
| test_deduplication | Same doc uploaded twice → chunks not duplicated |
| test_bm25_build | BM25 index serializes and reloads correctly |
| test_chroma_store | Vectors stored and queryable in ChromaDB |
| test_sqlite_metadata | All chunk metadata persisted to SQLite |
| test_incremental_update | Only new chunks indexed on re-upload |
Phase 2: Hybrid Retrieval Engine
Goal
Implement the hybrid BM25 + dense vector retrieval pipeline with Reciprocal Rank Fusion merging, cross-encoder reranking, diversity filtering, query expansion, and context window assembly.
Files Created
voicevault/retrieval/
├── bm25_retriever.py     # rank_bm25 keyword search
├── vector_retriever.py   # ChromaDB semantic search
├── hybrid_retriever.py   # RRF merge + cross-encoder + diversity filter
└── context_builder.py    # Formats top-k chunks for LLM prompt
tests/
└── test_phase2.py        # Retrieval unit + benchmark tests
DOCS/
└── phase2_retrieval.md
Key Components
BM25Retriever:
- Loads pre-built BM25 index from disk
- Tokenizes query, scores all chunks, returns top-20
VectorRetriever:
- Encodes query with all-MiniLM-L6-v2
- ChromaDB cosine similarity query → top-20
HybridRetriever (RRF core):
query → [QueryExpander: 2 paraphrases]
→ BM25 top-20 + Vector top-20 (parallel)
→ RRF merge (k=60): score = Σ 1/(k + rank)
→ CrossEncoder ms-marco-MiniLM-L12-v2 rescores top-20
→ DiversityFilter: max 2 chunks from same page
→ Final top-5 chunks
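Two of the steps above, the RRF merge and the diversity filter, can be sketched in plain Python. The cross-encoder rescoring that sits between them in the pipeline is deliberately left out, and the function names and the (file, page) key used to identify a page are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Mapping, Sequence, Tuple


def rrf_merge(rankings: Sequence[Sequence[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank).

    Ranks are 1-based. Ties are broken by chunk id so the output is stable.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def diversity_filter(
    ranked_ids: Sequence[str],
    page_of: Mapping[str, Hashable],
    max_per_page: int = 2,
    top_n: int = 5,
) -> List[str]:
    """Walk the ranking in order, keeping at most `max_per_page` chunks per page."""
    kept: List[str] = []
    per_page: Dict[Hashable, int] = defaultdict(int)
    for chunk_id in ranked_ids:
        page = page_of[chunk_id]
        if per_page[page] < max_per_page:
            kept.append(chunk_id)
            per_page[page] += 1
        if len(kept) == top_n:
            break
    return kept
```

RRF uses only rank positions, never raw scores, which is why it can combine BM25 scores and cosine similarities without any normalization step.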
ContextBuilder:
- Formats chunks as: [Source: filename, p.N | Section: heading]\n{text}
- Appends conversation history (last 5 turns)
- Returns context string ready for LLM prompt
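A minimal sketch of the formatting step above. The Chunk dataclass, the blank line between chunks, and the "Conversation history" label with User/Assistant prefixes are assumptions; only the bracketed source header and the five-turn limit come from the plan.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Chunk:
    """Hypothetical container for the chunk fields the header needs."""
    source_file: str
    page_number: int
    section_heading: str
    text: str


def build_context(
    chunks: Sequence[Chunk],
    history: Sequence[Tuple[str, str]],
    max_turns: int = 5,
) -> str:
    """Render chunks with source headers, then append the most recent turns."""
    blocks: List[str] = [
        f"[Source: {c.source_file}, p.{c.page_number} | Section: {c.section_heading}]\n{c.text}"
        for c in chunks
    ]
    context = "\n\n".join(blocks)
    recent = list(history)[-max_turns:]  # keep only the last `max_turns` turns
    if recent:
        turns = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in recent)
        context += "\n\nConversation history:\n" + turns
    return context
```

Keeping the source header on its own line directly above each chunk's text gives the LLM an unambiguous string to copy when it emits a citation.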
Tests
| Test | Pass Criteria |
|---|---|
| test_bm25_retriever | Returns ranked results for keyword query |
| test_vector_retriever | Returns semantically relevant results |
| test_rrf_merge | RRF scores computed correctly for known ranks |
| test_cross_encoder_rerank | Re-ranked order differs from RRF order (improvement) |
| test_diversity_filter | Max 2 chunks per page in final results |
| test_hybrid_recall | Recall@5 ≥ 0.80 on 50-Q benchmark dataset |
| test_context_builder | Output is valid string with source citations |
| test_query_expansion | Returns 2 paraphrase variants |
Phase 3: ASR & Voice Input
Goal
Integrate Whisper Large-v3 for high-quality speech-to-text transcription, with Distil-Whisper CPU fallback, browser microphone capture via Gradio Audio, and a query preprocessor that cleans transcripts and classifies query intent.
Files Created
voicevault/asr/
├── whisper_transcriber.py   # Whisper Large-v3 + Distil-Whisper fallback
└── query_preprocessor.py    # Cleanup, intent classification, language detect
tests/
└── test_phase3.py           # ASR unit tests + WER evaluation
DOCS/
└── phase3_asr.md
Key Components
WhisperTranscriber:
- Primary: openai/whisper-large-v3 (HuggingFace GPU pipeline)
- Fallback: distil-whisper/distil-large-v3 (CPU, 6× faster, <1% WER diff)
- VAD pre-check: reject audio < 1s or silent audio
- Returns: transcript, language, confidence, model_used, latency_ms
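The pre-check above could look roughly like the sketch below. It is a simple duration and energy (RMS) gate, not a true voice-activity detector, and it assumes audio arrives as float samples in the range [-1, 1]. The function name and the 0.01 silence threshold are illustrative assumptions; the one-second minimum and the ValueError behavior follow the plan.

```python
import math
from typing import Sequence


def check_audio(
    samples: Sequence[float],
    sample_rate: int,
    min_seconds: float = 1.0,
    silence_rms: float = 0.01,  # assumed threshold, tune on real recordings
) -> float:
    """Validate a recording before transcription and return its duration.

    Raises ValueError if the clip is shorter than `min_seconds` or if its
    root-mean-square amplitude is below `silence_rms`.
    """
    duration = len(samples) / sample_rate
    if duration < min_seconds:
        raise ValueError(f"audio too short: {duration:.2f}s < {min_seconds:.2f}s")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms < silence_rms:
        raise ValueError("audio appears to be silent")
    return duration
```

Rejecting unusable clips before they reach Whisper avoids spending GPU time on input that can only produce an empty or hallucinated transcript.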
QueryPreprocessor:
- Lowercase normalization, punctuation repair
- Filler word removal: um, uh, like, you know
- Language detection: langdetect library
- Query type classification:
  - factual: "What is...", "Who...", "When..."
  - summary: "Summarise...", "Give me an overview..."
  - compare: "Compare...", "What's the difference..."
- Routes to different retrieval strategies per type
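The cleanup and classification steps above can be approximated with a few regular expressions and prefix checks. This is a rule-based sketch under stated assumptions: the function names are hypothetical, punctuation repair and language detection are omitted, and any query that matches neither the summary nor the compare patterns is treated as factual.

```python
import re

# Filler words listed in the plan. Removing "like" unconditionally is crude
# (it also hits "what does it look like"), so a real version needs more care.
_FILLER_RE = re.compile(r"\b(?:um|uh|like|you know)\b,?", re.IGNORECASE)


def clean_transcript(text: str) -> str:
    """Lowercase the transcript, drop filler words, and collapse whitespace."""
    without_fillers = _FILLER_RE.sub(" ", text.lower())
    return re.sub(r"\s+", " ", without_fillers).strip()


def classify_intent(query: str) -> str:
    """Return 'summary', 'compare', or 'factual' based on simple patterns."""
    q = query.lower().strip()
    if q.startswith(("summarise", "summarize", "give me an overview")):
        return "summary"
    if q.startswith("compare") or "difference" in q:
        return "compare"
    return "factual"
```

For instance, clean_transcript("Um what is uh the refund policy you know") yields "what is the refund policy", which classify_intent then labels as factual.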
Tests
| Test | Pass Criteria |
|---|---|
| test_preprocessor_cleanup | Filler words removed, normalized |
| test_intent_factual | "What is X?" → type=factual |
| test_intent_summary | "Summarise the report" → type=summary |
| test_intent_compare | "Compare A and B" → type=compare |
| test_language_detection | English text → "en" |
| test_vad_short_audio | < 1s audio raises ValueError |
| test_whisper_mock | Transcriber returns correct schema with mocked model |
Phase 4: Generation Chain & Citations
Goal
Build the full LangChain LCEL generation chain: Groq Llama-3.1-70B as primary LLM with streaming, Gemini 1.5 Flash as automatic fallback, citation injection with [Source: file, p.N] protocol, faithfulness guard for out-of-context detection, and conversation memory.
Files Created
voicevault/generation/
├── answer_chain.py         # LangChain LCEL + Groq + Gemini fallback
├── citation_injector.py    # Maps [Doc:Page] citations to source chunks
└── faithfulness_guard.py   # Out-of-context detection
tests/
└── test_phase4.py          # Generation unit tests
DOCS/
└── phase4_generation.md
Key Components
AnswerChain (LCEL):
context_string + query + history
→ PromptTemplate (system: citation protocol + faithfulness instructions)
→ ChatGroq (llama-3.1-70b-versatile, streaming, temp=0.1)
    on quota error → ChatGoogleGenerativeAI (gemini-1.5-flash)
→ StrOutputParser
→ CitationInjector (post-processing)
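The "on quota error" branch above boils down to a try/except around the primary model. The sketch below shows that control flow in plain Python with stand-in callables; QuotaError and generate_with_fallback are hypothetical names, not part of any provider SDK. In LangChain itself the same idea is usually expressed with a runnable's with_fallbacks method rather than hand-written exception handling.

```python
from typing import Callable, Tuple


class QuotaError(Exception):
    """Hypothetical stand-in for a provider's rate-limit or quota exception."""


def generate_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> Tuple[str, str]:
    """Call `primary`; if it raises QuotaError, call `fallback` instead.

    Returns the answer together with a label naming which model produced it,
    so the caller can log or display the model actually used.
    """
    try:
        return primary(prompt), "primary"
    except QuotaError:
        return fallback(prompt), "fallback"
```

Catching only the quota exception matters: other failures (bad prompt, auth errors) should surface rather than silently shift traffic to the fallback model.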
CitationInjector:
- Parses [Doc:Page] markers from LLM output
- Resolves each to the actual chunk's source_file + page_number + excerpt
- Builds List[Citation] object for UI display
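Marker parsing and resolution as described above might be sketched like this. The exact marker grammar is an assumption: the regex accepts any bracketed "name:page" pair such as [policy.pdf:5] or [Doc:5]. The Citation fields mirror the three items named above, while the lookup dictionary and the 200-character excerpt limit are illustrative.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Citation:
    source_file: str
    page_number: int
    excerpt: str


# Assumed marker shape: [<document name>:<page number>]
_MARKER_RE = re.compile(r"\[([^\[\]:]+):(\d+)\]")


def extract_citations(
    answer: str,
    chunks_by_key: Dict[Tuple[str, int], str],
    excerpt_chars: int = 200,
) -> List[Citation]:
    """Resolve each marker in `answer` to a Citation, in order of appearance.

    Markers that match no retrieved chunk are skipped, and repeated markers
    produce a single Citation.
    """
    citations: List[Citation] = []
    for doc, page in _MARKER_RE.findall(answer):
        key = (doc.strip(), int(page))
        text = chunks_by_key.get(key)
        if text is None:
            continue  # the model cited something that was never retrieved
        citation = Citation(key[0], key[1], text[:excerpt_chars])
        if citation not in citations:
            citations.append(citation)
    return citations
```

Dropping unresolved markers keeps the UI from showing a citation the retriever cannot back up; a stricter variant could flag them for the faithfulness guard instead.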
FaithfulnessGuard:
- System prompt: "If the answer cannot be found in the provided context, respond with exactly: 'I could not find this in your documents.'"
- Post-generation check: if answer references facts not in any retrieved chunk → flag
- Confidence scoring based on retrieval score distribution
Tests
| Test | Pass Criteria |
|---|---|
| test_citation_injector_parses | [Doc:5] → correct Citation object |
| test_faithfulness_guard_refusal | Out-of-context Q → refusal message |
| test_answer_chain_mock | Chain runs end-to-end with mocked LLM |
| test_groq_fallback | Groq quota error → Gemini client used |
| test_streaming_output | Chain yields token-by-token |
| test_conversation_memory | Last 5 turns preserved across queries |
Phase 5: Full UI, TTS & Access Control
Goal
Build the complete 4-tab Gradio UI, integrate Web Speech API for browser-native TTS, implement the multi-knowledge-base manager, add bcrypt password protection + HMAC share links, and build the analytics + audit log system.
Files Created
voicevault/kb/
├── kb_manager.py        # Create/list/delete knowledge bases
├── access_control.py    # bcrypt password, HMAC share links
└── audit_log.py         # Query logging to SQLite
voicevault/tts/
└── web_speech.py        # Web Speech API JS bridge
voicevault/storage/
└── sqlite_store.py      # Complete CRUD (extended from Phase 0)
ui/tabs/
├── ask_tab.py           # Full voice query tab
├── kb_tab.py            # Full KB manager tab
├── analytics_tab.py     # Charts + metrics tab
└── settings_tab.py      # All configurable parameters
ui/components/
├── citation_panel.py    # Citation highlighting component
└── audio_controls.py    # TTS playback controls
tests/
├── test_phase5.py       # UI component + access control tests
└── test_e2e.py          # Full end-to-end pipeline test
DOCS/
└── phase5_ui_access.md
Key Components
KBManager:
- Creates per-KB directory: data/{kb_name}/chroma/, bm25.pkl, voicevault.db
- Lists all KBs with metadata (doc count, chunk count, last updated)
- Delete KB: removes directory + SQLite row
AccessControl:
- Password hash: bcrypt with work factor 12
- Share link: HMAC-SHA256 signed token with KB name + expiry
- Token validation on every query to password-protected KB
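A share token of the kind described above can be built with the standard-library hmac module. The token layout (base64 payload, a dot, then a hex signature), the "|" separator, and the function names are assumptions made for this sketch; the 7-day default matches the expiry listed in the security checklist. bcrypt hashing is not shown because it needs a third-party package.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional


def make_share_token(
    kb_name: str,
    secret: bytes,
    ttl_seconds: int = 7 * 24 * 3600,
    now: Optional[float] = None,
) -> str:
    """Return '<base64(kb_name|expiry)>.<hex HMAC-SHA256 of the payload>'."""
    issued = time.time() if now is None else now
    payload = f"{kb_name}|{int(issued) + ttl_seconds}".encode("utf-8")
    signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload).decode("ascii") + "." + signature


def verify_share_token(
    token: str, secret: bytes, now: Optional[float] = None
) -> Optional[str]:
    """Return the KB name if the token is authentic and unexpired, else None."""
    try:
        encoded, signature = token.split(".", 1)
        payload = base64.urlsafe_b64decode(encoded.encode("ascii"))
    except ValueError:  # malformed token or invalid base64
        return None
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # Constant-time comparison; check the signature before trusting the payload.
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return None
    try:
        kb_name, expires = payload.decode("utf-8").rsplit("|", 1)
        expires_at = int(expires)
    except ValueError:
        return None
    current = time.time() if now is None else now
    return kb_name if current < expires_at else None
```

The optional now parameter exists only to make expiry testable without sleeping; production callers would leave it unset.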
AuditLog:
- Every query logs: session_id, kb_names, voice_query (anonymized), latency, timestamp
- Viewable in Analytics tab
Web Speech API Bridge:
- JavaScript injected via gr.HTML component
- window.speechSynthesis.speak() triggered from Python via Gradio's JS bridge
- Voice selector, rate slider, pitch slider
- Pause/Resume/Restart controls
UI Tabs:
- Ask tab: Mic button → live transcript → KB selector → streaming answer → citation panel → speak button
- KB tab: Create KB form + document uploader (PDF/MD/HTML/DOCX) + progress bar + doc list
- Analytics tab: Query volume chart + latency breakdown + top documents + Groq quota gauge
- Settings tab: ASR model, voice settings, retrieval params, LLM params, chunking params
Tests
| Test | Pass Criteria |
|---|---|
| test_kb_create_delete | KB directory created/removed correctly |
| test_bcrypt_password | Hash + verify round-trip |
| test_hmac_share_link | Token validates within expiry, fails after |
| test_audit_log_write | Query logged to SQLite correctly |
| test_access_control_wrong_pw | Wrong password → access denied |
| test_e2e_pipeline | PDF upload → query → cited answer (mocked LLM) |
10. Quality Gates
Every phase must pass ALL gates before moving to the next phase:
| Gate | Requirement |
|---|---|
| Zero import errors | python -m pytest tests/ --co -q exits 0 |
| All tests pass | pytest tests/test_phaseN.py → 100% green |
| No bare except | No except: or except Exception: without logging |
| Type annotations | Every public function has full type hints |
| No unused imports | pylint --disable=all --enable=W0611 passes |
| No secrets in code | No API keys, passwords, or tokens hardcoded |
| Pathlib throughout | No os.path usage in any module |
11. Security Audit Checklist
- No API keys committed to git (enforced by .gitignore + .env.example)
- All file uploads validated: extension whitelist + MIME check + size limit
- SQLite queries use parameterized statements (no f-string SQL)
- bcrypt work factor ≥ 12 for password hashing
- HMAC share tokens have expiry (default: 7 days)
- trafilatura URL fetching: no SSRF; block private IP ranges
- ChromaDB stored in non-public path (never served as static file)
- BM25 pickle files: only loaded from trusted internal paths
- Gradio app: file upload restricted to data/uploads/ sandbox directory
- Audit log: voice queries anonymized before storage (hash, not raw text)
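The SSRF item in the checklist above can be illustrated with a pre-fetch check built on the standard-library ipaddress module. This is a sketch under stated assumptions, not the project's validator: the function names and the injectable resolve parameter are hypothetical. The fetch itself should also connect to the address that was validated, since a hostname could resolve differently between the check and the request.

```python
import ipaddress
import socket
from typing import Callable
from urllib.parse import urlparse


def is_public_ip(address: str) -> bool:
    """True only for addresses outside private, loopback, and similar ranges."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return False  # unparsable addresses are treated as unsafe
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def is_safe_url(url: str, resolve: Callable = socket.getaddrinfo) -> bool:
    """Allow only http(s) URLs whose host resolves exclusively to public IPs."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = resolve(parsed.hostname, None)
    except OSError:  # includes DNS resolution failures
        return False
    addresses = {info[4][0] for info in infos}
    return bool(addresses) and all(is_public_ip(addr) for addr in addresses)
```

Requiring every resolved address to be public (rather than any one of them) closes the case where a hostname returns both a public and an internal record.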
12. Progress Tracker
| Phase | Status | Tests | Docs |
|---|---|---|---|
| Phase 0: Foundation | ✅ Done | ✅ 58/58 | ✅ phase0_foundation.md |
| Phase 1: Ingestion | ✅ Done | ✅ 46/46 | ✅ phase1_ingestion.md |
| Phase 2: Retrieval | ✅ Done | ✅ 33/33 | ✅ phase2_retrieval.md |
| Phase 3: ASR | ✅ Done | ✅ 45/47 (2 skipped) | ✅ phase3_asr.md |
| Phase 4: Generation | ✅ Done | ✅ 72/72 | ✅ phase4_generation.md |
| Phase 5: UI & Access | ✅ Done | ✅ 55/55 | ✅ phase5_ui_access.md |
VoiceVault · Navnit Amrutharaj · navnita004@gmail.com · github.com/ninjacode911