	# VoiceVault — End-to-End Implementation Plan
	Author: Navnit Amrutharaj
	Model: VoiceVault v1.0 — Voice-First RAG Knowledge Agent
	Stack: Whisper · LangChain · ChromaDB · Groq · Gradio
	Target: $0/month · HuggingFace Spaces · 10 Weeks
	Plan Date: March 2026

	---

	## Table of Contents
	1. [Project Overview](#1-project-overview)
	2. [Architecture Summary](#2-architecture-summary)
	3. [Phase Map](#3-phase-map)
	4. [Phase 0 — Project Foundation](#phase-0--project-foundation)
	5. [Phase 1 — Document Ingestion Pipeline](#phase-1--document-ingestion-pipeline)
	6. [Phase 2 — Hybrid Retrieval Engine](#phase-2--hybrid-retrieval-engine)
	7. [Phase 3 — ASR & Voice Input](#phase-3--asr--voice-input)
	8. [Phase 4 — Generation Chain & Citations](#phase-4--generation-chain--citations)
	9. [Phase 5 — Full UI, TTS & Access Control](#phase-5--full-ui-tts--access-control)
	10. [Quality Gates](#10-quality-gates)
	11. [Security Audit Checklist](#11-security-audit-checklist)
	12. [Progress Tracker](#12-progress-tracker)

	---

	## 1. Project Overview

	VoiceVault is a voice-first retrieval-augmented generation (RAG) knowledge agent that enables users to:
	- Speak questions into a browser microphone
	- Have each question transcribed (Whisper), answered from retrieved passages, and spoken back
	- Reference private document collections (PDFs, Notion exports, Confluence, DOCX, MD)
	- Receive fully cited answers anchored to source document + page + paragraph

	Core differentiator: Hybrid BM25 + vector search with Reciprocal Rank Fusion (RRF) + cross-encoder reranking — demonstrating enterprise-grade retrieval depth that most RAG tutorials skip.

	---

	## 2. Architecture Summary

	```
	INGESTION PATH (one-time per document set)
	User uploads PDFs / HTML / DOCX / MD
	↓
	DocumentParser → text extraction (PyMuPDF, BS4, python-docx)
	↓
	SemanticChunker → sentence-aware chunks (spaCy + cosine boundary)
	↓
	IndexBuilder → ChromaDB (vectors) + BM25 (keywords) + SQLite (metadata)

	QUERY PATH (real-time, per user question)
	Browser mic → Gradio Audio → Whisper Large-v3 (HuggingFace GPU)
	↓
	QueryPreprocessor → cleanup + intent class + language detect
	↓
	HybridRetriever → BM25 top-20 + Vector top-20 → RRF merge → CrossEncoder top-5
	↓
	LangChain LCEL → Groq Llama-3.1-70B (stream) / Gemini Flash (fallback)
	↓
	CitationInjector → [Source: filename, p.N] inline citations
	↓
	Gradio UI (text + highlight citations) + Web Speech API (spoken answer)
	```

	---

	## 3. Phase Map

	| Phase | Name | Weeks | Core Deliverables |
	|-------|------|-------|-------------------|
	| 0 | Project Foundation | 0 | Scaffold, config, models, SQLite schema, Gradio skeleton |
	| 1 | Document Ingestion | 1–2 | Parser, semantic chunker, ChromaDB + BM25 + SQLite indexer |
	| 2 | Hybrid Retrieval | 3 | BM25 + vector + RRF + cross-encoder + diversity filter |
	| 3 | ASR & Voice Input | 4 | Whisper Large-v3, Distil fallback, query preprocessor |
	| 4 | Generation & Citations | 5 | LangChain LCEL, Groq, Gemini fallback, faithfulness guard |
	| 5 | Full UI & Access Control | 6–8 | 4-tab Gradio UI, Web Speech TTS, multi-KB, bcrypt, audit log |

	---

	## Phase 0 — Project Foundation

	### Goal
	Establish the complete project skeleton — directory structure, dependencies, centralized config, Pydantic data models, SQLite schema, and a working 4-tab Gradio scaffold — before any business logic is written.

	### Files Created
	```
	voicevault/
	├── app.py                      # Gradio Blocks entry point
	├── config.py                   # Pydantic-settings centralized config
	├── requirements.txt            # All project dependencies (pinned)
	├── .env.example                # Environment variable template
	├── voicevault/
	│   ├── __init__.py             # Package init + version
	│   ├── models.py               # Pydantic data models (all schemas)
	│   ├── asr/__init__.py
	│   ├── ingestion/__init__.py
	│   ├── retrieval/__init__.py
	│   ├── generation/__init__.py
	│   ├── kb/__init__.py
	│   ├── tts/__init__.py
	│   └── storage/
	│       ├── __init__.py
	│       └── sqlite_store.py     # Schema creation + DB init
	├── ui/
	│   ├── __init__.py
	│   ├── tabs/
	│   │   ├── __init__.py
	│   │   ├── ask_tab.py          # Placeholder — voice query tab
	│   │   ├── kb_tab.py           # Placeholder — KB manager tab
	│   │   ├── analytics_tab.py    # Placeholder — analytics tab
	│   │   └── settings_tab.py     # Placeholder — settings tab
	│   └── components/
	│       ├── __init__.py
	│       ├── citation_panel.py   # Placeholder — citation display
	│       └── audio_controls.py   # Placeholder — TTS controls
	├── tests/
	│   ├── __init__.py
	│   ├── conftest.py             # Pytest fixtures
	│   └── test_phase0.py          # Foundation smoke tests
	├── data/                       # Runtime data (gitignored)
	└── DOCS/
	    └── phase0_foundation.md    # Phase 0 documentation
	```

	### Key Decisions
	- pydantic-settings for type-safe env var loading (no raw `os.environ` calls)
	- pathlib.Path throughout — cross-platform, no `os.path`
	- SQLite stdlib for metadata — zero-dependency, portable, no server
	- Gradio 4.x Blocks for UI — native HuggingFace Spaces support
	- `__version__` sentinel in `voicevault/__init__.py` for release tracking
	- Data models locked early — prevents schema drift across phases

	### Tests
	| Test | Description | Pass Criteria |
	|------|-------------|---------------|
	| `test_config_loads` | Config instantiates without exceptions | No exception |
	| `test_env_defaults` | Default values are correct types | All fields pass type check |
	| `test_db_init` | SQLite schema creates 3 tables | Tables `knowledge_bases`, `documents`, `query_log` exist |
	| `test_data_dirs` | Data directory structure is created | Dirs exist after init |
	| `test_models_instantiate` | All Pydantic models can be instantiated | No validation errors |
	| `test_gradio_builds` | Gradio demo object builds without error | `gr.Blocks` object created |
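
As a rough illustration of what `test_db_init` checks, the schema step could look like the sketch below. Only the three table names come from this plan; every column, the `SCHEMA` constant, and the `init_db` / `table_names` helpers are assumptions made for the example.

```python
from __future__ import annotations

import sqlite3
from pathlib import Path

# Table names come from the plan; all columns are illustrative assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS knowledge_bases (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS documents (
    id INTEGER PRIMARY KEY,
    kb_id INTEGER NOT NULL REFERENCES knowledge_bases(id),
    filename TEXT NOT NULL,
    sha256 TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS query_log (
    id INTEGER PRIMARY KEY,
    session_id TEXT NOT NULL,
    kb_names TEXT NOT NULL,
    query_hash TEXT NOT NULL,
    latency_ms REAL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def init_db(db_path: Path | str) -> sqlite3.Connection:
    """Create the schema if it is missing and return an open connection."""
    if str(db_path) != ":memory:":
        # Make sure the parent directory exists before SQLite opens the file.
        Path(db_path).parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(str(db_path))
    conn.executescript(SCHEMA)
    return conn


def table_names(conn: sqlite3.Connection) -> set[str]:
    """Return the names of all tables currently defined in the database."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return {name for (name,) in rows}
```

Because every statement uses `CREATE TABLE IF NOT EXISTS`, calling `init_db` repeatedly on the same file is harmless, which keeps the smoke test idempotent.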

	### Documentation
	→ See `DOCS/phase0_foundation.md`

	---

	## Phase 1 — Document Ingestion Pipeline

	### Goal
	Build the complete document ingestion pipeline: parse any supported document format, semantically chunk the text, generate embeddings, build the BM25 index, store everything in ChromaDB + SQLite, and implement SHA-256-based deduplication.

	### Files Created
	```
	voicevault/ingestion/
	├── document_parser.py # PDF, HTML, DOCX, MD, TXT, URL parsers
	├── semantic_chunker.py # spaCy + cosine-similarity boundary chunker
	└── index_builder.py # ChromaDB + BM25 + SQLite indexer + dedup

	voicevault/storage/
	├── sqlite_store.py # Full CRUD: KB, document, chunk metadata
	└── chroma_store.py # ChromaDB collection management

	tests/
	└── test_phase1.py # Ingestion unit + integration tests

	DOCS/
	└── phase1_ingestion.md
	```

	### Key Components

	DocumentParser — Multi-format dispatcher:
	- PDF: `PyMuPDF` (fitz) — preserves page numbers, extracts tables as text
	- HTML: `BeautifulSoup4` — Notion/Confluence exports, preserves heading hierarchy
	- DOCX: `python-docx` — heading-aware extraction
	- Markdown: `markdown-it-py` — heading hierarchy → section metadata
	- Plain text: paragraph-level splitting
	- URL: `trafilatura` — clean article extraction from any public URL
	- Scanned PDF fallback: `pytesseract` OCR when no text layer found

	SemanticChunker — Boundary detection:
	- `spaCy en_core_web_sm` sentence tokenization
	- Cosine similarity between adjacent sentence embeddings
	- New chunk when similarity < 0.5 (configurable threshold)
	- Target: 400–600 tokens per chunk, 50-token overlap
	- Special handling: tables as atomic units, code blocks atomic, lists kept together
	- Metadata per chunk: source_file, page_number, section_heading, chunk_index, timestamp
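
The boundary rule above can be sketched without any ML dependency by passing sentence embeddings in as plain lists. This is a simplified illustration, not the project's chunker: `cosine` and `split_on_similarity` are hypothetical names, and the token-size target, overlap, and atomic handling of tables and code blocks are left out.

```python
from __future__ import annotations

import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_on_similarity(
    sentences: list[str],
    embeddings: list[list[float]],
    threshold: float = 0.5,
) -> list[list[str]]:
    """Start a new chunk wherever adjacent-sentence similarity drops below threshold."""
    if len(sentences) != len(embeddings):
        raise ValueError("one embedding per sentence is required")
    chunks: list[list[str]] = []
    current: list[str] = []
    for i, sentence in enumerate(sentences):
        # Compare each sentence with the one immediately before it.
        if current and cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append(current)
            current = []
        current.append(sentence)
    if current:
        chunks.append(current)
    return chunks
```

In a fuller version, a second pass would presumably merge or split these groups to reach the 400–600 token target and add the 50-token overlap.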

	IndexBuilder — Dual-index construction:
	- SHA-256 hash of chunk text → deduplication (skip re-indexed unchanged content)
	- `sentence-transformers all-MiniLM-L6-v2` → 384-dim embeddings → ChromaDB
	- `rank_bm25` BM25Okapi index → serialized to `bm25.pkl`
	- SQLite metadata: `chunks` table linking every chunk to its source doc
	- Incremental update: only new/changed chunks re-embedded
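
The deduplication and incremental-update steps reduce to a hash lookup. A minimal sketch using only `hashlib`, assuming previously indexed hashes are available as an in-memory set (in the plan they would live in SQLite); `chunk_hash` and `select_new_chunks` are hypothetical names.

```python
from __future__ import annotations

import hashlib


def chunk_hash(text: str) -> str:
    """SHA-256 hex digest of the chunk text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_new_chunks(chunks: list[str], seen_hashes: set[str]) -> list[str]:
    """Return chunks not yet indexed and record their hashes in seen_hashes."""
    new_chunks: list[str] = []
    for text in chunks:
        digest = chunk_hash(text)
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            new_chunks.append(text)
    return new_chunks
```

Only the chunks returned here would need to be embedded and added to ChromaDB and the BM25 corpus; unchanged content is skipped.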

	### Tests
	| Test | Pass Criteria |
	|------|---------------|
	| `test_pdf_parse` | Extracts text with correct page numbers |
	| `test_html_parse` | Extracts headings and paragraphs from Notion HTML |
	| `test_docx_parse` | Extracts text from DOCX with heading metadata |
	| `test_semantic_chunker` | Chunks respect sentence boundaries, 100–600 tokens |
	| `test_deduplication` | Same doc uploaded twice → chunks not duplicated |
	| `test_bm25_build` | BM25 index serializes and reloads correctly |
	| `test_chroma_store` | Vectors stored and queryable in ChromaDB |
	| `test_sqlite_metadata` | All chunk metadata persisted to SQLite |
	| `test_incremental_update` | Only new chunks indexed on re-upload |

	---

	## Phase 2 — Hybrid Retrieval Engine

	### Goal
	Implement the hybrid BM25 + dense vector retrieval pipeline with Reciprocal Rank Fusion merging, cross-encoder reranking, diversity filtering, query expansion, and context window assembly.

	### Files Created
	```
	voicevault/retrieval/
	├── bm25_retriever.py # rank_bm25 keyword search
	├── vector_retriever.py # ChromaDB semantic search
	├── hybrid_retriever.py # RRF merge + cross-encoder + diversity filter
	└── context_builder.py # Formats top-k chunks for LLM prompt

	tests/
	└── test_phase2.py # Retrieval unit + benchmark tests

	DOCS/
	└── phase2_retrieval.md
	```

	### Key Components

	BM25Retriever:
	- Loads pre-built BM25 index from disk
	- Tokenizes query, scores all chunks, returns top-20

	VectorRetriever:
	- Encodes query with `all-MiniLM-L6-v2`
	- ChromaDB cosine similarity query → top-20

	HybridRetriever (RRF core):
	```
	query → [QueryExpander: 2 paraphrases]
	→ BM25 top-20 + Vector top-20 (parallel)
	→ RRF merge (k=60): score = Σ 1/(k + rank)
	→ CrossEncoder cross-encoder/ms-marco-MiniLM-L-12-v2 rescores top-20
	→ DiversityFilter: max 2 chunks from same page
	→ Final top-5 chunks
	```
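
The RRF merge step in the pipeline above, score = Σ 1/(k + rank), can be written in a few lines. This sketch assumes ranks start at 1 and that each retriever returns chunk IDs ordered best first; `rrf_merge` is a hypothetical name.

```python
from __future__ import annotations

from collections import defaultdict


def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs with Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Assumption: ranks are 1-based, so the top hit contributes 1 / (k + 1).
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    bm25_hits = ["a", "b", "c"]
    vector_hits = ["b", "d", "a"]
    for chunk_id, score in rrf_merge([bm25_hits, vector_hits]):
        print(chunk_id, round(score, 5))
```

A chunk that appears in both lists accumulates two terms, which is why agreement between BM25 and vector search outranks a single high placement in one list.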

	ContextBuilder:
	- Formats chunks as: `[Source: filename, p.N | Section: heading]\n{text}`
	- Appends conversation history (last 5 turns)
	- Returns context string ready for LLM prompt
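
A minimal sketch of the formatting described above. The `Chunk` dataclass, the `build_context` function, and the "Conversation history:" label are assumptions for illustration; only the source header format and the five-turn window come from the plan.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical container for a retrieved chunk and its metadata."""
    source_file: str
    page_number: int
    section_heading: str
    text: str


def build_context(chunks: list[Chunk], history: list[str], max_turns: int = 5) -> str:
    """Format chunks with source headers, then append the most recent turns."""
    blocks = [
        f"[Source: {c.source_file}, p.{c.page_number} | Section: {c.section_heading}]\n{c.text}"
        for c in chunks
    ]
    parts = ["\n\n".join(blocks)]
    recent = history[-max_turns:] if max_turns > 0 else []
    if recent:
        # Assumption: history is passed as pre-rendered strings, oldest first.
        parts.append("Conversation history:\n" + "\n".join(recent))
    return "\n\n".join(parts)
```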

	### Tests
	| Test | Pass Criteria |
	|------|---------------|
	| `test_bm25_retriever` | Returns ranked results for keyword query |
	| `test_vector_retriever` | Returns semantically relevant results |
	| `test_rrf_merge` | RRF scores computed correctly for known ranks |
	| `test_cross_encoder_rerank` | Re-ranked order differs from RRF order (improvement) |
	| `test_diversity_filter` | Max 2 chunks per page in final results |
	| `test_hybrid_recall` | Recall@5 ≥ 0.80 on 50-Q benchmark dataset |
	| `test_context_builder` | Output is valid string with source citations |
	| `test_query_expansion` | Returns 2 paraphrase variants |
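
For `test_diversity_filter`, one plausible shape for the filter under test is sketched below. It assumes a "page" is identified by the pair (source file, page number) and that input is already ordered best first; `diversity_filter` is a hypothetical name.

```python
from __future__ import annotations

from collections import Counter


def diversity_filter(
    ranked: list[tuple[str, str, int]],
    max_per_page: int = 2,
    top_k: int = 5,
) -> list[str]:
    """Keep at most max_per_page chunks per (file, page), up to top_k in total.

    Each item in ranked is (chunk_id, source_file, page_number), best first.
    """
    per_page: Counter = Counter()
    kept: list[str] = []
    for chunk_id, source_file, page_number in ranked:
        key = (source_file, page_number)
        if per_page[key] >= max_per_page:
            continue  # this page already contributed its quota
        per_page[key] += 1
        kept.append(chunk_id)
        if len(kept) == top_k:
            break
    return kept
```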

	---

	## Phase 3 — ASR & Voice Input

	### Goal
	Integrate Whisper Large-v3 for high-quality speech-to-text transcription, with Distil-Whisper CPU fallback, browser microphone capture via Gradio Audio, and a query preprocessor that cleans transcripts and classifies query intent.

	### Files Created
	```
	voicevault/asr/
	├── whisper_transcriber.py # Whisper Large-v3 + Distil-Whisper fallback
	└── query_preprocessor.py # Cleanup, intent classification, language detect

	tests/
	└── test_phase3.py # ASR unit tests + WER evaluation

	DOCS/
	└── phase3_asr.md
	```

	### Key Components

	WhisperTranscriber:
	- Primary: `openai/whisper-large-v3` (HuggingFace GPU pipeline)
	- Fallback: `distil-whisper/distil-large-v3` (CPU, 6× faster, within 1% WER of Large-v3)
	- VAD pre-check: reject audio < 1s or silent audio
	- Returns: `transcript`, `language`, `confidence`, `model_used`, `latency_ms`
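
The pre-check can be approximated with a duration test plus a simple energy test. This is a stand-in for real voice activity detection, not the project's implementation: `check_audio` is a hypothetical name, samples are assumed to be floats normalized to [-1, 1], and the RMS threshold of 0.01 is an arbitrary illustrative value.

```python
from __future__ import annotations

import math


def check_audio(
    samples: list[float],
    sample_rate: int,
    min_seconds: float = 1.0,
    silence_rms: float = 0.01,
) -> float:
    """Raise ValueError for too-short or silent audio; return duration in seconds."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    duration = len(samples) / sample_rate
    if duration < min_seconds:
        raise ValueError(f"audio too short: {duration:.2f}s")
    # Root-mean-square energy as a crude silence detector.
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms < silence_rms:
        raise ValueError("audio appears to be silent")
    return duration
```

Rejecting bad input before the model is loaded or invoked avoids spending GPU time on clips that cannot produce a useful transcript.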

	QueryPreprocessor:
	- Lowercase normalization, punctuation repair
	- Filler word removal: um, uh, like, you know
	- Language detection: `langdetect` library
	- Query type classification:
	  - `factual` — "What is...", "Who...", "When..."
	  - `summary` — "Summarise...", "Give me an overview..."
	  - `compare` — "Compare...", "What's the difference..."
	- Routes to different retrieval strategies per type
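
A rule-based sketch of the cleanup and classification steps, using only `re`. The trigger phrases mirror the examples listed above; the function names, the regexes, and the choice of `factual` as the default type are assumptions. Punctuation repair and language detection are not covered here.

```python
from __future__ import annotations

import re

# Filler words from the plan; an optional trailing comma is removed with them.
_FILLER_RE = re.compile(r"\b(?:um|uh|like|you know)\b,?", re.IGNORECASE)

# Checked in order, so "what's the difference" wins over the generic "what".
_INTENT_PATTERNS = [
    ("compare", re.compile(r"^(?:compare\b|what's the difference|what is the difference)")),
    ("summary", re.compile(r"^(?:summari[sz]e\b|give me an overview)")),
    ("factual", re.compile(r"^(?:what|who|when)\b")),
]


def clean_transcript(text: str) -> str:
    """Lowercase, drop filler words, and collapse whitespace."""
    text = _FILLER_RE.sub(" ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def classify_intent(query: str, default: str = "factual") -> str:
    """Return 'factual', 'summary', or 'compare' based on the opening phrase."""
    cleaned = clean_transcript(query)
    for label, pattern in _INTENT_PATTERNS:
        if pattern.search(cleaned):
            return label
    return default
```

Removing "like" unconditionally will also strip legitimate uses of the word ("tools like Notion"), so a production version would likely need a more careful rule for that filler.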

	### Tests
	| Test | Pass Criteria |
	|------|---------------|
	| `test_preprocessor_cleanup` | Filler words removed, normalized |
	| `test_intent_factual` | "What is X?" → type=factual |
	| `test_intent_summary` | "Summarise the report" → type=summary |
	| `test_intent_compare` | "Compare A and B" → type=compare |
	| `test_language_detection` | English text → "en" |
	| `test_vad_short_audio` | < 1s audio raises ValueError |
	| `test_whisper_mock` | Transcriber returns correct schema with mocked model |

	---

	## Phase 4 — Generation Chain & Citations

	### Goal
	Build the full LangChain LCEL generation chain: Groq Llama-3.1-70B as primary LLM with streaming, Gemini 1.5 Flash as automatic fallback, citation injection with [Source: file, p.N] protocol, faithfulness guard for out-of-context detection, and conversation memory.

	### Files Created
	```
	voicevault/generation/
	├── answer_chain.py # LangChain LCEL + Groq + Gemini fallback
	├── citation_injector.py # Maps [Doc:Page] citations to source chunks
	└── faithfulness_guard.py # Out-of-context detection

	tests/
	└── test_phase4.py # Generation unit tests

	DOCS/
	└── phase4_generation.md
	```

	### Key Components

	AnswerChain (LCEL):
	```
	context_string + query + history
	→ PromptTemplate (system: citation protocol + faithfulness instructions)
	→ ChatGroq (llama-3.1-70b-versatile, streaming, temp=0.1)
	on quota error → ChatGoogleGenerativeAI (gemini-1.5-flash)
	→ StrOutputParser
	→ CitationInjector (post-processing)
	```
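
The "on quota error" branch in the chain above is a primary/fallback pattern. The sketch below shows the control flow with plain callables standing in for the two LLM clients; `QuotaExceededError` and `generate_with_fallback` are hypothetical names, and the real provider exception types will differ. In LangChain itself, runnables offer a `with_fallbacks` method for the same purpose.

```python
from __future__ import annotations

from typing import Callable


class QuotaExceededError(Exception):
    """Hypothetical stand-in for a provider's rate-limit or quota error."""


def generate_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> tuple[str, str]:
    """Try the primary model; on a quota error, retry once with the fallback.

    Returns (answer, model_used) so the caller can record which path ran.
    """
    try:
        return primary(prompt), "primary"
    except QuotaExceededError:
        return fallback(prompt), "fallback"
```

Catching only the quota error keeps other failures (bad prompt, network faults) visible instead of silently rerouting them to the second provider.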

	CitationInjector:
	- Parses `[Doc:Page]` markers from LLM output
	- Resolves each to the actual chunk's source_file + page_number + excerpt
	- Builds `List[Citation]` object for UI display
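
A regex-based sketch of the resolution step. The exact marker grammar is not spelled out in this plan, so the example assumes markers of the form `[<document>:<page>]` and a lookup table keyed by (document, page); the `Citation` fields, `extract_citations`, and the 80-character excerpt length are likewise assumptions.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Assumed marker grammar: [<document>:<page>], e.g. [handbook.pdf:5].
_MARKER_RE = re.compile(r"\[([^\[\]:]+):(\d+)\]")


@dataclass(frozen=True)
class Citation:
    """Hypothetical citation record for UI display."""
    source_file: str
    page_number: int
    excerpt: str


def extract_citations(
    answer: str,
    chunks: dict[tuple[str, int], str],
    excerpt_chars: int = 80,
) -> list[Citation]:
    """Resolve each marker in the answer to a Citation, in order of first use."""
    citations: list[Citation] = []
    seen: set[tuple[str, int]] = set()
    for doc, page_str in _MARKER_RE.findall(answer):
        key = (doc.strip(), int(page_str))
        if key in seen or key not in chunks:
            continue  # skip repeats and markers that match no retrieved chunk
        seen.add(key)
        citations.append(Citation(key[0], key[1], chunks[key][:excerpt_chars]))
    return citations
```

Markers that match no retrieved chunk are dropped here; they could instead be surfaced to the faithfulness guard as a signal that the model cited something outside the context.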

	FaithfulnessGuard:
	- System prompt: "If the answer cannot be found in the provided context, respond with exactly: 'I could not find this in your documents.'"
	- Post-generation check: if answer references facts not in any retrieved chunk → flag
	- Confidence scoring based on retrieval score distribution

	### Tests
	| Test | Pass Criteria |
	|------|---------------|
	| `test_citation_injector_parses` | `[Doc:5]` → correct Citation object |
	| `test_faithfulness_guard_refusal` | Out-of-context Q → refusal message |
	| `test_answer_chain_mock` | Chain runs end-to-end with mocked LLM |
	| `test_groq_fallback` | Groq quota error → Gemini client used |
	| `test_streaming_output` | Chain yields token-by-token |
	| `test_conversation_memory` | Last 5 turns preserved across queries |

	---

	## Phase 5 — Full UI, TTS & Access Control

	### Goal
	Build the complete 4-tab Gradio UI, integrate Web Speech API for browser-native TTS, implement the multi-knowledge-base manager, add bcrypt password protection + HMAC share links, and build the analytics + audit log system.

	### Files Created
	```
	voicevault/kb/
	├── kb_manager.py # Create/list/delete knowledge bases
	├── access_control.py # bcrypt password, HMAC share links
	└── audit_log.py # Query logging to SQLite

	voicevault/tts/
	└── web_speech.py # Web Speech API JS bridge

	voicevault/storage/
	└── sqlite_store.py # Complete CRUD (extended from Phase 0)

	ui/tabs/
	├── ask_tab.py # Full voice query tab
	├── kb_tab.py # Full KB manager tab
	├── analytics_tab.py # Charts + metrics tab
	└── settings_tab.py # All configurable parameters

	ui/components/
	├── citation_panel.py # Citation highlighting component
	└── audio_controls.py # TTS playback controls

	tests/
	├── test_phase5.py # UI component + access control tests
	└── test_e2e.py # Full end-to-end pipeline test

	DOCS/
	└── phase5_ui_access.md
	```

	### Key Components

	KBManager:
	- Creates per-KB directory: `data/{kb_name}/chroma/`, `bm25.pkl`, `voicevault.db`
	- Lists all KBs with metadata (doc count, chunk count, last updated)
	- Delete KB: removes directory + SQLite row

	AccessControl:
	- Password hash: `bcrypt` with work factor 12
	- Share link: `HMAC-SHA256` signed token with KB name + expiry
	- Token validation on every query to password-protected KB
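
A share link of this kind can be built from the standard library alone. The sketch below is one possible token layout, not the project's format: the `payload.signature` structure, the `|` separator, and both function names are assumptions. The 7-day default matches the expiry listed in the security checklist, and the optional `now` argument exists only to make expiry testable.

```python
from __future__ import annotations

import base64
import hashlib
import hmac
import time


def make_share_token(
    kb_name: str,
    secret: bytes,
    ttl_seconds: int = 7 * 24 * 3600,
    now: float | None = None,
) -> str:
    """Return 'payload.signature', both URL-safe base64, signed with HMAC-SHA256."""
    expires = int((time.time() if now is None else now) + ttl_seconds)
    payload = f"{kb_name}|{expires}".encode("utf-8")
    signature = hmac.new(secret, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode("ascii")
        + "."
        + base64.urlsafe_b64encode(signature).decode("ascii")
    )


def verify_share_token(
    token: str, kb_name: str, secret: bytes, now: float | None = None
) -> bool:
    """Check signature, KB name, and expiry; any malformed token is rejected."""
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
        name, expires_str = payload.decode("utf-8").rsplit("|", 1)
        expires = int(expires_str)
    except (ValueError, UnicodeDecodeError):
        return False
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking signature bytes through timing.
    if not hmac.compare_digest(signature, expected):
        return False
    current = time.time() if now is None else now
    return name == kb_name and current < expires
```

The signature is verified before the expiry and name are trusted, so a tampered payload never reaches the comparison logic.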

	AuditLog:
	- Every query logs: session_id, kb_names, voice_query (anonymized), latency, timestamp
	- Viewable in Analytics tab
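
One way to log a query without storing its raw text is to persist a SHA-256 digest instead. In this sketch the `query_log` columns and the `log_query` helper are assumptions; the table name, the logged fields, and the use of parameterized statements follow the plan.

```python
from __future__ import annotations

import hashlib
import sqlite3


def log_query(
    conn: sqlite3.Connection,
    session_id: str,
    kb_names: list[str],
    voice_query: str,
    latency_ms: float,
) -> str:
    """Insert one audit row with a hashed query and return the hash."""
    query_hash = hashlib.sha256(voice_query.encode("utf-8")).hexdigest()
    # Columns are illustrative; the timestamp is filled in by SQLite.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS query_log ("
        "id INTEGER PRIMARY KEY, session_id TEXT NOT NULL, kb_names TEXT NOT NULL, "
        "query_hash TEXT NOT NULL, latency_ms REAL, "
        "created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    # Parameterized statement: values are never interpolated into the SQL string.
    conn.execute(
        "INSERT INTO query_log (session_id, kb_names, query_hash, latency_ms) "
        "VALUES (?, ?, ?, ?)",
        (session_id, ",".join(kb_names), query_hash, latency_ms),
    )
    conn.commit()
    return query_hash
```

A plain hash of a short query can be guessed by hashing candidate questions, so a keyed hash (HMAC with a server-side secret) may be worth considering if that matters for the deployment.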

	Web Speech API Bridge:
	- JavaScript injected via `gr.HTML` component
	- `window.speechSynthesis.speak()` triggered from Python via Gradio's JS bridge
	- Voice selector, rate slider, pitch slider
	- Pause/Resume/Restart controls

	UI Tabs:
	- Ask tab: Mic button → live transcript → KB selector → streaming answer → citation panel → speak button
	- KB tab: Create KB form + document uploader (PDF/MD/HTML/DOCX) + progress bar + doc list
	- Analytics tab: Query volume chart + latency breakdown + top documents + Groq quota gauge
	- Settings tab: ASR model, voice settings, retrieval params, LLM params, chunking params

	### Tests
	| Test | Pass Criteria |
	|------|---------------|
	| `test_kb_create_delete` | KB directory created/removed correctly |
	| `test_bcrypt_password` | Hash + verify round-trip |
	| `test_hmac_share_link` | Token validates within expiry, fails after |
	| `test_audit_log_write` | Query logged to SQLite correctly |
	| `test_access_control_wrong_pw` | Wrong password → access denied |
	| `test_e2e_pipeline` | PDF upload → query → cited answer (mocked LLM) |

	---

	## 10. Quality Gates

	Every phase must pass ALL gates before moving to the next phase:

	| Gate | Requirement |
	|------|-------------|
	| Zero import errors | `python -m pytest tests/ --co -q` exits 0 |
	| All tests pass | `pytest tests/test_phaseN.py` — 100% green |
	| No bare except | No `except:` or `except Exception:` without logging |
	| Type annotations | Every public function has full type hints |
	| No unused imports | `pylint --disable=all --enable=W0611` passes |
	| No secrets in code | No API keys, passwords, or tokens hardcoded |
	| Pathlib throughout | No `os.path` usage in any module |

	---

	## 11. Security Audit Checklist

	- [ ] No API keys committed to git (enforced by .gitignore + .env.example)
	- [ ] All file uploads validated: extension whitelist + MIME check + size limit
	- [ ] SQLite queries use parameterized statements (no f-string SQL)
	- [ ] bcrypt work factor ≥ 12 for password hashing
	- [ ] HMAC share tokens have expiry (default: 7 days)
	- [ ] `trafilatura` URL fetching: no SSRF — block private IP ranges
	- [ ] ChromaDB stored in non-public path (never served as static file)
	- [ ] BM25 pickle files: only loaded from trusted internal paths
	- [ ] Gradio app: file upload restricted to `data/uploads/` sandbox directory
	- [ ] Audit log: voice queries anonymized before storage (hash, not raw text)
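
For the SSRF item, a pre-fetch guard might resolve the host and refuse anything that is not a globally routable address. This is a sketch with a hypothetical `is_public_url` helper, not a complete defense: it does not handle redirects, and the address is resolved separately from the actual fetch, so DNS rebinding remains possible unless the fetch is pinned to the checked address.

```python
from __future__ import annotations

import ipaddress
import socket
from urllib.parse import urlparse


def is_public_url(url: str) -> bool:
    """Return True only if the URL is http(s) and every resolved IP is global."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False  # unresolvable hosts are rejected
    for info in infos:
        # Drop any IPv6 zone suffix such as "%eth0" before parsing.
        address = info[4][0].split("%")[0]
        if not ipaddress.ip_address(address).is_global:
            return False  # loopback, private, link-local, reserved, etc.
    return bool(infos)
```

Rejecting when any resolved address is non-global, rather than accepting when one is global, closes the gap where a hostname maps to both a public and an internal IP.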

	---

	## 12. Progress Tracker

	| Phase | Status | Tests | Docs |
	|-------|--------|-------|------|
	| Phase 0 — Foundation | ✅ Done | ✅ 58/58 | ✅ phase0_foundation.md |
	| Phase 1 — Ingestion | ✅ Done | ✅ 46/46 | ✅ phase1_ingestion.md |
	| Phase 2 — Retrieval | ✅ Done | ✅ 33/33 | ✅ phase2_retrieval.md |
	| Phase 3 — ASR | ✅ Done | ✅ 45/47 (2 skipped) | ✅ phase3_asr.md |
	| Phase 4 — Generation | ✅ Done | ✅ 72/72 | ✅ phase4_generation.md |
	| Phase 5 — UI & Access | ✅ Done | ✅ 55/55 | ✅ phase5_ui_access.md |

	---

	VoiceVault · Navnit Amrutharaj · navnita004@gmail.com · github.com/ninjacode911