# VoiceVault – End-to-End Implementation Plan
**Author:** Navnit Amrutharaj
**Model:** VoiceVault v1.0 – Voice-First RAG Knowledge Agent
**Stack:** Whisper · LangChain · ChromaDB · Groq · Gradio
**Target:** $0/month · HuggingFace Spaces · 10 Weeks
**Plan Date:** March 2026
---
## Table of Contents
1. [Project Overview](#1-project-overview)
2. [Architecture Summary](#2-architecture-summary)
3. [Phase Map](#3-phase-map)
4. [Phase 0 – Project Foundation](#phase-0--project-foundation)
5. [Phase 1 – Document Ingestion Pipeline](#phase-1--document-ingestion-pipeline)
6. [Phase 2 – Hybrid Retrieval Engine](#phase-2--hybrid-retrieval-engine)
7. [Phase 3 – ASR & Voice Input](#phase-3--asr--voice-input)
8. [Phase 4 – Generation Chain & Citations](#phase-4--generation-chain--citations)
9. [Phase 5 – Full UI, TTS & Access Control](#phase-5--full-ui-tts--access-control)
10. [Quality Gates](#10-quality-gates)
11. [Security Audit Checklist](#11-security-audit-checklist)
12. [Progress Tracker](#12-progress-tracker)
---
## 1. Project Overview
VoiceVault is a **voice-first retrieval-augmented generation (RAG) knowledge agent** that enables users to:
- Speak questions into a browser microphone
- Have each question transcribed (Whisper), answered from retrieved passages, and spoken back
- Reference private document collections (PDFs, Notion exports, Confluence, DOCX, MD)
- Receive fully cited answers anchored to source document + page + paragraph
**Core differentiator:** Hybrid BM25 + vector search with Reciprocal Rank Fusion (RRF) + cross-encoder reranking, demonstrating enterprise-grade retrieval depth that most RAG tutorials skip.
---
## 2. Architecture Summary
```
INGESTION PATH (one-time per document set)
User uploads PDFs / HTML / DOCX / MD
↓
DocumentParser → text extraction (PyMuPDF, BS4, python-docx)
↓
SemanticChunker → sentence-aware chunks (spaCy + cosine boundary)
↓
IndexBuilder → ChromaDB (vectors) + BM25 (keywords) + SQLite (metadata)

QUERY PATH (real-time, per user question)
Browser mic → Gradio Audio → Whisper Large-v3 (HuggingFace GPU)
↓
QueryPreprocessor → cleanup + intent class + language detect
↓
HybridRetriever → BM25 top-20 + Vector top-20 → RRF merge → CrossEncoder top-5
↓
LangChain LCEL → Groq Llama-3.1-70B (stream) / Gemini Flash (fallback)
↓
CitationInjector → [Source: filename, p.N] inline citations
↓
Gradio UI (text + highlight citations) + Web Speech API (spoken answer)
```
---
## 3. Phase Map
| Phase | Name | Weeks | Core Deliverables |
|-------|------|-------|-------------------|
| **0** | Project Foundation | 0 | Scaffold, config, models, SQLite schema, Gradio skeleton |
| **1** | Document Ingestion | 1–2 | Parser, semantic chunker, ChromaDB + BM25 + SQLite indexer |
| **2** | Hybrid Retrieval | 3 | BM25 + vector + RRF + cross-encoder + diversity filter |
| **3** | ASR & Voice Input | 4 | Whisper Large-v3, Distil fallback, query preprocessor |
| **4** | Generation & Citations | 5 | LangChain LCEL, Groq, Gemini fallback, faithfulness guard |
| **5** | Full UI & Access Control | 6–8 | 4-tab Gradio UI, Web Speech TTS, multi-KB, bcrypt, audit log |
---
## Phase 0 – Project Foundation
### Goal
Establish the complete project skeleton (directory structure, dependencies, centralized config, Pydantic data models, SQLite schema, and a working 4-tab Gradio scaffold) before any business logic is written.
### Files Created
```
voicevault/
├── app.py                      # Gradio Blocks entry point
├── config.py                   # Pydantic-settings centralized config
├── requirements.txt            # All project dependencies (pinned)
├── .env.example                # Environment variable template
├── voicevault/
│   ├── __init__.py             # Package init + version
│   ├── models.py               # Pydantic data models (all schemas)
│   ├── asr/__init__.py
│   ├── ingestion/__init__.py
│   ├── retrieval/__init__.py
│   ├── generation/__init__.py
│   ├── kb/__init__.py
│   ├── tts/__init__.py
│   └── storage/
│       ├── __init__.py
│       └── sqlite_store.py     # Schema creation + DB init
├── ui/
│   ├── __init__.py
│   ├── tabs/
│   │   ├── __init__.py
│   │   ├── ask_tab.py          # Placeholder: voice query tab
│   │   ├── kb_tab.py           # Placeholder: KB manager tab
│   │   ├── analytics_tab.py    # Placeholder: analytics tab
│   │   └── settings_tab.py     # Placeholder: settings tab
│   └── components/
│       ├── __init__.py
│       ├── citation_panel.py   # Placeholder: citation display
│       └── audio_controls.py   # Placeholder: TTS controls
├── tests/
│   ├── __init__.py
│   ├── conftest.py             # Pytest fixtures
│   └── test_phase0.py          # Foundation smoke tests
├── data/                       # Runtime data (gitignored)
└── DOCS/
    └── phase0_foundation.md    # Phase 0 documentation
```
### Key Decisions
- **pydantic-settings** for type-safe env var loading (no raw `os.environ` calls)
- **pathlib.Path** throughout: cross-platform, no `os.path`
- **SQLite stdlib** for metadata: zero-dependency, portable, no server
- **Gradio 4.x Blocks** for UI: native HuggingFace Spaces support
- **`__version__` sentinel** in `voicevault/__init__.py` for release tracking
- **Data models locked early** to prevent schema drift across phases
### Tests
| Test | Description | Pass Criteria |
|------|-------------|---------------|
| `test_config_loads` | Config instantiates without exceptions | No exception |
| `test_env_defaults` | Default values are correct types | All fields pass type check |
| `test_db_init` | SQLite schema creates 3 tables | Tables `knowledge_bases`, `documents`, `query_log` exist |
| `test_data_dirs` | Data directory structure is created | Dirs exist after init |
| `test_models_instantiate` | All Pydantic models can be instantiated | No validation errors |
| `test_gradio_builds` | Gradio demo object builds without error | `gr.Blocks` object created |
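
As a rough illustration of what `test_db_init` exercises, here is a minimal sketch of schema creation with the stdlib `sqlite3` module. Only the three table names come from this plan; every column, the `init_db` and `table_names` helpers, and the `SCHEMA` constant are hypothetical placeholders, not the project's actual schema.

```python
import sqlite3
from pathlib import Path
from typing import Set, Union

# Hypothetical columns: the plan only names the three tables.
SCHEMA = """
CREATE TABLE IF NOT EXISTS knowledge_bases (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS documents (
    id INTEGER PRIMARY KEY,
    kb_id INTEGER NOT NULL REFERENCES knowledge_bases(id),
    filename TEXT NOT NULL,
    sha256 TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS query_log (
    id INTEGER PRIMARY KEY,
    session_id TEXT NOT NULL,
    query_hash TEXT NOT NULL,
    latency_ms REAL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def init_db(db_path: Union[Path, str]) -> sqlite3.Connection:
    """Open (or create) the database and make sure all tables exist."""
    conn = sqlite3.connect(str(db_path))
    conn.executescript(SCHEMA)
    return conn


def table_names(conn: sqlite3.Connection) -> Set[str]:
    """Return the names of all tables, using a parameterized query."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = ?", ("table",)
    ).fetchall()
    return {row[0] for row in rows}


conn = init_db(":memory:")
print(sorted(table_names(conn)))
```

Using `CREATE TABLE IF NOT EXISTS` keeps initialization idempotent, so calling `init_db` on every app start is safe.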
### Documentation
→ See `DOCS/phase0_foundation.md`
---
## Phase 1 – Document Ingestion Pipeline
### Goal
Build the complete document ingestion pipeline: parse any supported document format, semantically chunk the text, generate embeddings, build the BM25 index, store everything in ChromaDB + SQLite, and implement SHA-256-based deduplication.
### Files Created
```
voicevault/ingestion/
├── document_parser.py      # PDF, HTML, DOCX, MD, TXT, URL parsers
├── semantic_chunker.py     # spaCy + cosine-similarity boundary chunker
└── index_builder.py        # ChromaDB + BM25 + SQLite indexer + dedup
voicevault/storage/
├── sqlite_store.py         # Full CRUD: KB, document, chunk metadata
└── chroma_store.py         # ChromaDB collection management
tests/
└── test_phase1.py          # Ingestion unit + integration tests
DOCS/
└── phase1_ingestion.md
```
### Key Components
**DocumentParser** – Multi-format dispatcher:
- PDF: `PyMuPDF` (fitz), preserves page numbers, extracts tables as text
- HTML: `BeautifulSoup4`, for Notion/Confluence exports, preserves heading hierarchy
- DOCX: `python-docx`, heading-aware extraction
- Markdown: `markdown-it-py`, heading hierarchy → section metadata
- Plain text: paragraph-level splitting
- URL: `trafilatura`, clean article extraction from any public URL
- Scanned PDF fallback: `pytesseract` OCR when no text layer found
**SemanticChunker** – Boundary detection:
- `spaCy en_core_web_sm` sentence tokenization
- Cosine similarity between adjacent sentence embeddings
- New chunk when similarity < 0.5 (configurable threshold)
- Target: 400–600 tokens per chunk, 50-token overlap
- Special handling: tables as atomic units, code blocks atomic, lists kept together
- Metadata per chunk: source_file, page_number, section_heading, chunk_index, timestamp
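
The boundary rule can be sketched in a few lines. This is a simplified illustration, not the project's chunker: it assumes sentences and their embeddings are already available (toy 2-d vectors stand in for real sentence embeddings), and it leaves out the token-size target, the overlap, and the atomic handling of tables and code blocks. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_on_boundaries(
    sentences: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    threshold: float = 0.5,
) -> List[List[str]]:
    """Start a new chunk whenever adjacent sentences fall below the threshold."""
    if len(sentences) != len(embeddings):
        raise ValueError("need exactly one embedding per sentence")
    if not sentences:
        return []
    chunks: List[List[str]] = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append([])  # similarity dropped: topic boundary
        chunks[-1].append(sentences[i])
    return chunks


sentences = ["Refunds take 14 days.", "Refunds go to the original card.", "Our office is in Pune."]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy vectors, not real embeddings
print(split_on_boundaries(sentences, toy_embeddings))
```

With these toy vectors the first two sentences stay together and the third starts a new chunk.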
**IndexBuilder** – Dual-index construction:
- SHA-256 hash of chunk text → deduplication (skip re-indexed unchanged content)
- `sentence-transformers all-MiniLM-L6-v2` → 384-dim embeddings → ChromaDB
- `rank_bm25` BM25Okapi index → serialized to `bm25.pkl`
- SQLite metadata: `chunks` table linking every chunk to its source doc
- Incremental update: only new/changed chunks re-embedded
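
A minimal sketch of hash-based deduplication, assuming the set of already-indexed hashes is loaded from the metadata store before ingestion. The helper names `chunk_hash` and `select_new_chunks` are hypothetical; in the real pipeline the returned chunks would then be embedded and written to ChromaDB and BM25.

```python
import hashlib
from typing import Iterable, List, Set, Tuple


def chunk_hash(text: str) -> str:
    """SHA-256 hex digest of the chunk text (UTF-8 encoded)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_new_chunks(
    chunks: Iterable[str], seen_hashes: Set[str]
) -> List[Tuple[str, str]]:
    """Return (hash, text) pairs for chunks not indexed before.

    `seen_hashes` is updated in place so repeated calls stay incremental.
    """
    new_chunks: List[Tuple[str, str]] = []
    for text in chunks:
        digest = chunk_hash(text)
        if digest in seen_hashes:
            continue  # unchanged content: skip re-embedding
        seen_hashes.add(digest)
        new_chunks.append((digest, text))
    return new_chunks


seen: Set[str] = set()
first = select_new_chunks(["alpha", "beta"], seen)
second = select_new_chunks(["alpha", "beta", "gamma"], seen)
print(len(first), [text for _, text in second])
```

On the second upload only `gamma` is returned, which is the behaviour `test_incremental_update` describes.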
### Tests
| Test | Pass Criteria |
|------|---------------|
| `test_pdf_parse` | Extracts text with correct page numbers |
| `test_html_parse` | Extracts headings and paragraphs from Notion HTML |
| `test_docx_parse` | Extracts text from DOCX with heading metadata |
| `test_semantic_chunker` | Chunks respect sentence boundaries, 100–600 tokens |
| `test_deduplication` | Same doc uploaded twice → chunks not duplicated |
| `test_bm25_build` | BM25 index serializes and reloads correctly |
| `test_chroma_store` | Vectors stored and queryable in ChromaDB |
| `test_sqlite_metadata` | All chunk metadata persisted to SQLite |
| `test_incremental_update` | Only new chunks indexed on re-upload |
---
## Phase 2 – Hybrid Retrieval Engine
### Goal
Implement the hybrid BM25 + dense vector retrieval pipeline with Reciprocal Rank Fusion merging, cross-encoder reranking, diversity filtering, query expansion, and context window assembly.
### Files Created
```
voicevault/retrieval/
β”œβ”€β”€ bm25_retriever.py # rank_bm25 keyword search
β”œβ”€β”€ vector_retriever.py # ChromaDB semantic search
β”œβ”€β”€ hybrid_retriever.py # RRF merge + cross-encoder + diversity filter
└── context_builder.py # Formats top-k chunks for LLM prompt
tests/
└── test_phase2.py # Retrieval unit + benchmark tests
DOCS/
└── phase2_retrieval.md
```
### Key Components
**BM25Retriever:**
- Loads pre-built BM25 index from disk
- Tokenizes query, scores all chunks, returns top-20
**VectorRetriever:**
- Encodes query with `all-MiniLM-L6-v2`
- ChromaDB cosine similarity query → top-20
**HybridRetriever (RRF core):**
```
query → [QueryExpander: 2 paraphrases]
  → BM25 top-20 + Vector top-20 (parallel)
  → RRF merge (k=60): score = Σ 1/(k + rank)
  → CrossEncoder ms-marco-MiniLM-L12-v2 rescores top-20
  → DiversityFilter: max 2 chunks from same page
  → Final top-5 chunks
```
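
The RRF merge and the diversity filter from the pipeline above can be sketched as follows. This is an illustrative version under simplifying assumptions: chunks are identified by string IDs, ranks start at 1, ties are broken alphabetically, and the cross-encoder step that sits between the two stages is omitted. The `page_of` mapping and both function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rrf_merge(rankings: Sequence[Sequence[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; chunk ID breaks ties deterministically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def diversity_filter(
    ranked_ids: Sequence[str],
    page_of: Dict[str, Tuple[str, int]],
    max_per_page: int = 2,
    top_k: int = 5,
) -> List[str]:
    """Keep at most `max_per_page` chunks per (file, page), up to `top_k` total."""
    kept: List[str] = []
    per_page: Dict[Tuple[str, int], int] = defaultdict(int)
    for chunk_id in ranked_ids:
        page = page_of[chunk_id]
        if per_page[page] >= max_per_page:
            continue
        per_page[page] += 1
        kept.append(chunk_id)
        if len(kept) == top_k:
            break
    return kept


bm25_top = ["a", "b", "c"]
vector_top = ["b", "d", "a"]
merged = rrf_merge([bm25_top, vector_top])
print([chunk_id for chunk_id, _ in merged])
```

Chunk `b` wins here because it ranks near the top of both lists, which is the property that makes RRF robust to the different score scales of BM25 and cosine similarity.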
**ContextBuilder:**
- Formats chunks as: `[Source: filename, p.N | Section: heading]\n{text}`
- Appends conversation history (last 5 turns)
- Returns context string ready for LLM prompt
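
A small sketch of the formatting step, assuming chunks carry the metadata fields listed in Phase 1. The `Chunk` dataclass, the `build_context` name, and the layout of the history section are illustrative assumptions; only the `[Source: ... | Section: ...]` header format and the five-turn limit come from this plan.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Chunk:
    source_file: str
    page_number: int
    section_heading: str
    text: str


def build_context(
    chunks: Sequence[Chunk],
    history: Sequence[Tuple[str, str]],
    max_turns: int = 5,
) -> str:
    """Format chunks with source headers, then append recent (question, answer) turns."""
    blocks: List[str] = [
        f"[Source: {c.source_file}, p.{c.page_number} | Section: {c.section_heading}]\n{c.text}"
        for c in chunks
    ]
    context = "\n\n".join(blocks)
    recent = list(history[-max_turns:]) if max_turns > 0 else []
    if recent:
        turns = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in recent)
        context += "\n\nConversation history:\n" + turns
    return context


demo = build_context(
    [Chunk("policy.pdf", 3, "Refunds", "Refunds are processed within 14 days.")],
    [("How long do refunds take?", "About 14 days.")],
)
print(demo)
```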
### Tests
| Test | Pass Criteria |
|------|---------------|
| `test_bm25_retriever` | Returns ranked results for keyword query |
| `test_vector_retriever` | Returns semantically relevant results |
| `test_rrf_merge` | RRF scores computed correctly for known ranks |
| `test_cross_encoder_rerank` | Re-ranked order differs from RRF order (improvement) |
| `test_diversity_filter` | Max 2 chunks per page in final results |
| `test_hybrid_recall` | Recall@5 ≥ 0.80 on 50-Q benchmark dataset |
| `test_context_builder` | Output is valid string with source citations |
| `test_query_expansion` | Returns 2 paraphrase variants |
---
## Phase 3 – ASR & Voice Input
### Goal
Integrate Whisper Large-v3 for high-quality speech-to-text transcription, with Distil-Whisper CPU fallback, browser microphone capture via Gradio Audio, and a query preprocessor that cleans transcripts and classifies query intent.
### Files Created
```
voicevault/asr/
β”œβ”€β”€ whisper_transcriber.py # Whisper Large-v3 + Distil-Whisper fallback
└── query_preprocessor.py # Cleanup, intent classification, language detect
tests/
└── test_phase3.py # ASR unit tests + WER evaluation
DOCS/
└── phase3_asr.md
```
### Key Components
**WhisperTranscriber:**
- Primary: `openai/whisper-large-v3` (HuggingFace GPU pipeline)
- Fallback: `distil-whisper/distil-large-v3` (CPU, 6× faster, <1% WER diff)
- VAD pre-check: reject audio < 1s or silent audio
- Returns: `transcript`, `language`, `confidence`, `model_used`, `latency_ms`
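
The pre-check could look roughly like this. It is a stand-in rather than a real voice activity detector: it assumes mono audio as floats in the range -1.0 to 1.0 and treats a clip as silent when its RMS energy falls below a threshold. The `precheck_audio` name and the `silence_rms` default are assumptions for illustration; only the 1-second minimum comes from this plan.

```python
import math
from typing import Sequence


def precheck_audio(
    samples: Sequence[float],
    sample_rate: int,
    min_seconds: float = 1.0,
    silence_rms: float = 0.01,
) -> float:
    """Return the clip duration in seconds, or raise ValueError if unusable."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    if not samples:
        raise ValueError("audio is empty")
    duration = len(samples) / sample_rate
    if duration < min_seconds:
        raise ValueError(f"audio too short: {duration:.2f}s < {min_seconds:.2f}s")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms < silence_rms:
        raise ValueError("audio is silent")
    return duration


rate = 16_000
tone = [0.5 * math.sin(2 * math.pi * 440 * i / rate) for i in range(rate)]
print(precheck_audio(tone, rate))
```

Rejecting clips before transcription avoids spending GPU time on recordings that cannot yield a usable query.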
**QueryPreprocessor:**
- Lowercase normalization, punctuation repair
- Filler word removal: um, uh, like, you know
- Language detection: `langdetect` library
- Query type classification:
  - `factual` – "What is...", "Who...", "When..."
  - `summary` – "Summarise...", "Give me an overview..."
  - `compare` – "Compare...", "What's the difference..."
- Routes to different retrieval strategies per type
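
A rule-based sketch of the cleanup and intent steps, using only the trigger phrases listed above. The regular expressions, the function names, and the choice of `factual` as the default label are assumptions. Note that stripping "like" unconditionally also removes legitimate uses of the word, so a real implementation would likely need a more careful rule.

```python
import re

# Fillers listed in the plan; optional surrounding commas are removed with them.
FILLER_RE = re.compile(r",?\s*\b(?:you know|um|uh|like)\b,?", re.IGNORECASE)

# Checked in order: "what's the difference" must win over the generic "what".
INTENT_PATTERNS = [
    ("compare", re.compile(r"^(?:compare\b|what(?:'s| is) the difference\b)")),
    ("summary", re.compile(r"^(?:summari[sz]e\b|give me an overview\b)")),
    ("factual", re.compile(r"^(?:what|who|when)\b")),
]


def clean_transcript(text: str) -> str:
    """Lowercase, drop filler words, and collapse whitespace."""
    without_fillers = FILLER_RE.sub(" ", text.lower())
    return re.sub(r"\s+", " ", without_fillers).strip()


def classify_intent(query: str) -> str:
    """Return 'factual', 'summary', or 'compare' for a raw transcript."""
    cleaned = clean_transcript(query)
    for label, pattern in INTENT_PATTERNS:
        if pattern.search(cleaned):
            return label
    return "factual"  # assumed default when no trigger phrase matches


print(clean_transcript("Um, what is, uh, the refund policy?"))
print(classify_intent("What's the difference between plan A and plan B?"))
```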
### Tests
| Test | Pass Criteria |
|------|---------------|
| `test_preprocessor_cleanup` | Filler words removed, normalized |
| `test_intent_factual` | "What is X?" → type=factual |
| `test_intent_summary` | "Summarise the report" → type=summary |
| `test_intent_compare` | "Compare A and B" → type=compare |
| `test_language_detection` | English text → "en" |
| `test_vad_short_audio` | < 1s audio raises ValueError |
| `test_whisper_mock` | Transcriber returns correct schema with mocked model |
---
## Phase 4 – Generation Chain & Citations
### Goal
Build the full LangChain LCEL generation chain: Groq Llama-3.1-70B as primary LLM with streaming, Gemini 1.5 Flash as automatic fallback, citation injection with [Source: file, p.N] protocol, faithfulness guard for out-of-context detection, and conversation memory.
### Files Created
```
voicevault/generation/
β”œβ”€β”€ answer_chain.py # LangChain LCEL + Groq + Gemini fallback
β”œβ”€β”€ citation_injector.py # Maps [Doc:Page] citations to source chunks
└── faithfulness_guard.py # Out-of-context detection
tests/
└── test_phase4.py # Generation unit tests
DOCS/
└── phase4_generation.md
```
### Key Components
**AnswerChain (LCEL):**
```
context_string + query + history
  → PromptTemplate (system: citation protocol + faithfulness instructions)
  → ChatGroq (llama-3.1-70b-versatile, streaming, temp=0.1)
      on quota error → ChatGoogleGenerativeAI (gemini-1.5-flash)
  → StrOutputParser
  → CitationInjector (post-processing)
```
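
The fallback step in the chain above amounts to the control flow sketched here in plain Python. LangChain runnables offer a `with_fallbacks` method that serves this purpose; the version below only illustrates the idea with stub callables so it runs without API keys. `QuotaExceededError` is a hypothetical stand-in for whatever rate-limit exception the Groq client raises.

```python
from typing import Callable, Tuple


class QuotaExceededError(RuntimeError):
    """Hypothetical stand-in for the primary provider's rate-limit error."""


def generate_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> Tuple[str, str]:
    """Try the primary model; switch to the fallback only on a quota error.

    Returns (model_used, answer) so the caller can log which path ran.
    """
    try:
        return "primary", primary(prompt)
    except QuotaExceededError:
        return "fallback", fallback(prompt)


def stub_groq(prompt: str) -> str:
    raise QuotaExceededError("daily token limit reached")


def stub_gemini(prompt: str) -> str:
    return f"(fallback answer to: {prompt})"


print(generate_with_fallback("What is the refund window?", stub_groq, stub_gemini))
```

Catching only the quota error, rather than every exception, keeps genuine bugs in the primary path visible instead of silently rerouting them.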
**CitationInjector:**
- Parses `[Doc:Page]` markers from LLM output
- Resolves each to the actual chunk's source_file + page_number + excerpt
- Builds `List[Citation]` object for UI display
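
Marker resolution could be sketched as below. This plan mentions both a `[Doc:Page]` marker and the `[Source: file, p.N]` protocol; the sketch assumes the latter, since it is the form spelled out in the Goal. The `Citation` fields follow the bullets above, while the regular expression, the `chunks_by_key` lookup, and the excerpt length are illustrative assumptions.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed marker form: [Source: filename, p.N]
CITATION_RE = re.compile(r"\[Source:\s*(?P<file>[^,\]]+),\s*p\.(?P<page>\d+)\]")


@dataclass(frozen=True)
class Citation:
    source_file: str
    page_number: int
    excerpt: str


def extract_citations(
    answer: str,
    chunks_by_key: Dict[Tuple[str, int], str],
    excerpt_chars: int = 120,
) -> List[Citation]:
    """Resolve each marker to a retrieved chunk; skip duplicates and unknown sources."""
    citations: List[Citation] = []
    seen = set()
    for match in CITATION_RE.finditer(answer):
        key = (match.group("file").strip(), int(match.group("page")))
        if key in seen or key not in chunks_by_key:
            continue  # repeated marker, or a source that was never retrieved
        seen.add(key)
        citations.append(Citation(key[0], key[1], chunks_by_key[key][:excerpt_chars]))
    return citations


retrieved = {("policy.pdf", 3): "Refunds are processed within 14 days of the request."}
answer = "Refunds take 14 days [Source: policy.pdf, p.3]. See [Source: missing.pdf, p.9]."
print(extract_citations(answer, retrieved))
```

Dropping markers that do not match a retrieved chunk means a hallucinated source never reaches the citation panel.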
**FaithfulnessGuard:**
- System prompt: "If the answer cannot be found in the provided context, respond with exactly: 'I could not find this in your documents.'"
- Post-generation check: if answer references facts not in any retrieved chunk → flag
- Confidence scoring based on retrieval score distribution
### Tests
| Test | Pass Criteria |
|------|---------------|
| `test_citation_injector_parses` | `[Doc:5]` → correct Citation object |
| `test_faithfulness_guard_refusal` | Out-of-context Q → refusal message |
| `test_answer_chain_mock` | Chain runs end-to-end with mocked LLM |
| `test_groq_fallback` | Groq quota error → Gemini client used |
| `test_streaming_output` | Chain yields token-by-token |
| `test_conversation_memory` | Last 5 turns preserved across queries |
---
## Phase 5 – Full UI, TTS & Access Control
### Goal
Build the complete 4-tab Gradio UI, integrate Web Speech API for browser-native TTS, implement the multi-knowledge-base manager, add bcrypt password protection + HMAC share links, and build the analytics + audit log system.
### Files Created
```
voicevault/kb/
├── kb_manager.py           # Create/list/delete knowledge bases
├── access_control.py       # bcrypt password, HMAC share links
└── audit_log.py            # Query logging to SQLite
voicevault/tts/
└── web_speech.py           # Web Speech API JS bridge
voicevault/storage/
└── sqlite_store.py         # Complete CRUD (extended from Phase 0)
ui/tabs/
├── ask_tab.py              # Full voice query tab
├── kb_tab.py               # Full KB manager tab
├── analytics_tab.py        # Charts + metrics tab
└── settings_tab.py         # All configurable parameters
ui/components/
├── citation_panel.py       # Citation highlighting component
└── audio_controls.py       # TTS playback controls
tests/
├── test_phase5.py          # UI component + access control tests
└── test_e2e.py             # Full end-to-end pipeline test
DOCS/
└── phase5_ui_access.md
```
### Key Components
**KBManager:**
- Creates per-KB directory: `data/{kb_name}/chroma/`, `bm25.pkl`, `voicevault.db`
- Lists all KBs with metadata (doc count, chunk count, last updated)
- Delete KB: removes directory + SQLite row
**AccessControl:**
- Password hash: `bcrypt` with work factor 12
- Share link: `HMAC-SHA256` signed token with KB name + expiry
- Token validation on every query to password-protected KB
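
A signed share token with expiry can be built from the stdlib alone. The sketch below is one possible layout, not the project's actual token format: the `kb_name|expiry` payload, the `payload.signature` encoding, and both function names are assumptions. The 7-day default mirrors the expiry listed in the security checklist.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional


def _b64encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).decode("ascii").rstrip("=")


def _b64decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def make_share_token(
    kb_name: str, secret: bytes, ttl_seconds: int = 7 * 24 * 3600,
    now: Optional[float] = None,
) -> str:
    """Sign 'kb_name|expiry' with HMAC-SHA256 and return 'payload.signature'."""
    current = time.time() if now is None else now
    payload = f"{kb_name}|{int(current + ttl_seconds)}".encode("utf-8")
    signature = hmac.new(secret, payload, hashlib.sha256).digest()
    return f"{_b64encode(payload)}.{_b64encode(signature)}"


def verify_share_token(
    token: str, secret: bytes, now: Optional[float] = None
) -> Optional[str]:
    """Return the KB name if the token is authentic and unexpired, else None."""
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = _b64decode(payload_b64)
        signature = _b64decode(signature_b64)
        kb_name, expiry_text = payload.decode("utf-8").rsplit("|", 1)
        expiry = int(expiry_text)
    except (ValueError, UnicodeDecodeError):
        return None  # malformed token
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None  # wrong secret or tampered payload
    current = time.time() if now is None else now
    return kb_name if current <= expiry else None


secret_key = b"demo-secret-loaded-from-env"  # placeholder; never hardcode in real code
token = make_share_token("handbook", secret_key, ttl_seconds=60, now=1_000)
print(verify_share_token(token, secret_key, now=1_030))
```

`hmac.compare_digest` is used for the signature comparison because it runs in constant time, which avoids leaking information through timing differences.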
**AuditLog:**
- Every query logs: session_id, kb_names, voice_query (anonymized), latency, timestamp
- Viewable in Analytics tab
**Web Speech API Bridge:**
- JavaScript injected via `gr.HTML` component
- `window.speechSynthesis.speak()` triggered from Python via Gradio's JS bridge
- Voice selector, rate slider, pitch slider
- Pause/Resume/Restart controls
**UI Tabs:**
- **Ask tab:** Mic button → live transcript → KB selector → streaming answer → citation panel → speak button
- **KB tab:** Create KB form + document uploader (PDF/MD/HTML/DOCX) + progress bar + doc list
- **Analytics tab:** Query volume chart + latency breakdown + top documents + Groq quota gauge
- **Settings tab:** ASR model, voice settings, retrieval params, LLM params, chunking params
### Tests
| Test | Pass Criteria |
|------|---------------|
| `test_kb_create_delete` | KB directory created/removed correctly |
| `test_bcrypt_password` | Hash + verify round-trip |
| `test_hmac_share_link` | Token validates within expiry, fails after |
| `test_audit_log_write` | Query logged to SQLite correctly |
| `test_access_control_wrong_pw` | Wrong password β†’ access denied |
| `test_e2e_pipeline` | PDF upload → query → cited answer (mocked LLM) |
---
## 10. Quality Gates
Every phase must pass ALL gates before moving to the next phase:
| Gate | Requirement |
|------|-------------|
| **Zero import errors** | `python -m pytest tests/ --co -q` exits 0 |
| **All tests pass** | `pytest tests/test_phaseN.py` is 100% green |
| **No bare except** | No `except:` or `except Exception:` without logging |
| **Type annotations** | Every public function has full type hints |
| **No unused imports** | `pylint --disable=all --enable=W0611` passes |
| **No secrets in code** | No API keys, passwords, or tokens hardcoded |
| **Pathlib throughout** | No `os.path` usage in any module |
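
Part of the "No bare except" gate can be checked mechanically. The sketch below uses the stdlib `ast` module to find handlers written as a plain `except:`. It does not cover the second half of the gate (an `except Exception:` that fails to log), which would need an inspection of each handler body. The function name is hypothetical.

```python
import ast
from typing import List


def find_bare_excepts(source: str) -> List[int]:
    """Return the line numbers of every `except:` clause with no exception type."""
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.ExceptHandler) and node.type is None
    )


sample = (
    "try:\n"
    "    risky()\n"
    "except:\n"
    "    pass\n"
    "try:\n"
    "    risky()\n"
    "except ValueError:\n"
    "    pass\n"
)
print(find_bare_excepts(sample))
```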
---
## 11. Security Audit Checklist
- [ ] No API keys committed to git (enforced by .gitignore + .env.example)
- [ ] All file uploads validated: extension whitelist + MIME check + size limit
- [ ] SQLite queries use parameterized statements (no f-string SQL)
- [ ] bcrypt work factor ≥ 12 for password hashing
- [ ] HMAC share tokens have expiry (default: 7 days)
- [ ] `trafilatura` URL fetching: no SSRF, block private IP ranges
- [ ] ChromaDB stored in non-public path (never served as static file)
- [ ] BM25 pickle files: only loaded from trusted internal paths
- [ ] Gradio app: file upload restricted to `data/uploads/` sandbox directory
- [ ] Audit log: voice queries anonymized before storage (hash, not raw text)
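
For the SSRF item, a pre-fetch check might look like the sketch below. It assumes the URL is validated before being handed to the fetcher, resolves the hostname, and rejects the URL unless every resolved address is globally routable. The function names and the injectable `resolver` parameter are illustrative. A check like this does not by itself prevent DNS rebinding, since the fetcher may resolve the name again later.

```python
import ipaddress
import socket
from typing import Callable, List
from urllib.parse import urlparse


def resolve_host(host: str) -> List[str]:
    """Resolve a hostname to its IP addresses using the system resolver."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def _is_public(address: str) -> bool:
    try:
        return ipaddress.ip_address(address).is_global
    except ValueError:
        return False  # unparseable address: treat as unsafe


def is_url_allowed(url: str, resolver: Callable[[str], List[str]] = resolve_host) -> bool:
    """Allow only http(s) URLs whose host resolves exclusively to public addresses."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        addresses = resolver(parsed.hostname)
    except OSError:
        return False  # resolution failed
    return bool(addresses) and all(_is_public(a) for a in addresses)


# IP literals resolve to themselves, so no network access is needed here.
print(is_url_allowed("http://169.254.169.254/latest/meta-data", resolver=lambda h: [h]))
print(is_url_allowed("http://10.0.0.5/admin", resolver=lambda h: [h]))
```

Both example URLs are rejected: the first is a link-local metadata address and the second is in a private range.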
---
## 12. Progress Tracker
| Phase | Status | Tests | Docs |
|-------|--------|-------|------|
| Phase 0 – Foundation | ✅ Done | ✅ 58/58 | ✅ phase0_foundation.md |
| Phase 1 – Ingestion | ✅ Done | ✅ 46/46 | ✅ phase1_ingestion.md |
| Phase 2 – Retrieval | ✅ Done | ✅ 33/33 | ✅ phase2_retrieval.md |
| Phase 3 – ASR | ✅ Done | ✅ 45/47 (2 skipped) | ✅ phase3_asr.md |
| Phase 4 – Generation | ✅ Done | ✅ 72/72 | ✅ phase4_generation.md |
| Phase 5 – UI & Access | ✅ Done | ✅ 55/55 | ✅ phase5_ui_access.md |
---
*VoiceVault · Navnit Amrutharaj · navnita004@gmail.com · github.com/ninjacode911*