VoiceVault: End-to-End Implementation Plan

Author: Navnit Amrutharaj
Model: VoiceVault v1.0, Voice-First RAG Knowledge Agent
Stack: Whisper · LangChain · ChromaDB · Groq · Gradio
Target: $0/month · HuggingFace Spaces · 10 Weeks
Plan Date: March 2026


Table of Contents

  1. Project Overview
  2. Architecture Summary
  3. Phase Map
  4. Phase 0: Project Foundation
  5. Phase 1: Document Ingestion Pipeline
  6. Phase 2: Hybrid Retrieval Engine
  7. Phase 3: ASR & Voice Input
  8. Phase 4: Generation Chain & Citations
  9. Phase 5: Full UI, TTS & Access Control
  10. Quality Gates
  11. Security Audit Checklist
  12. Progress Tracker

1. Project Overview

VoiceVault is a voice-first retrieval-augmented generation (RAG) knowledge agent that enables users to:

  • Speak questions into a browser microphone
  • Have each question transcribed (Whisper), matched against retrieved passages, answered by an LLM, and spoken back
  • Reference private document collections (PDFs, Notion exports, Confluence, DOCX, MD)
  • Receive fully cited answers anchored to source document + page + paragraph

Core differentiator: hybrid BM25 + vector search merged with Reciprocal Rank Fusion (RRF) and refined by cross-encoder reranking, a level of retrieval depth that most RAG tutorials skip.


2. Architecture Summary

INGESTION PATH (one-time per document set)
  User uploads PDFs / HTML / DOCX / MD
      ↓
  DocumentParser → text extraction (PyMuPDF, BS4, python-docx)
      ↓
  SemanticChunker → sentence-aware chunks (spaCy + cosine boundary)
      ↓
  IndexBuilder → ChromaDB (vectors) + BM25 (keywords) + SQLite (metadata)

QUERY PATH (real-time, per user question)
  Browser mic → Gradio Audio → Whisper Large-v3 (HuggingFace GPU)
      ↓
  QueryPreprocessor → cleanup + intent class + language detect
      ↓
  HybridRetriever → BM25 top-20 + Vector top-20 → RRF merge → CrossEncoder top-5
      ↓
  LangChain LCEL → Groq Llama-3.1-70B (stream) / Gemini Flash (fallback)
      ↓
  CitationInjector → [Source: filename, p.N] inline citations
      ↓
  Gradio UI (text + highlight citations) + Web Speech API (spoken answer)

3. Phase Map

| Phase | Name | Weeks | Core Deliverables |
|---|---|---|---|
| 0 | Project Foundation | 0 | Scaffold, config, models, SQLite schema, Gradio skeleton |
| 1 | Document Ingestion | 1–2 | Parser, semantic chunker, ChromaDB + BM25 + SQLite indexer |
| 2 | Hybrid Retrieval | 3 | BM25 + vector + RRF + cross-encoder + diversity filter |
| 3 | ASR & Voice Input | 4 | Whisper Large-v3, Distil fallback, query preprocessor |
| 4 | Generation & Citations | 5 | LangChain LCEL, Groq, Gemini fallback, faithfulness guard |
| 5 | Full UI & Access Control | 6–8 | 4-tab Gradio UI, Web Speech TTS, multi-KB, bcrypt, audit log |

Phase 0: Project Foundation

Goal

Establish the complete project skeleton (directory structure, dependencies, centralized config, Pydantic data models, SQLite schema, and a working 4-tab Gradio scaffold) before any business logic is written.

Files Created

voicevault/
├── app.py                          # Gradio Blocks entry point
├── config.py                       # Pydantic-settings centralized config
├── requirements.txt                # All project dependencies (pinned)
├── .env.example                    # Environment variable template
├── voicevault/
│   ├── __init__.py                 # Package init + version
│   ├── models.py                   # Pydantic data models (all schemas)
│   ├── asr/__init__.py
│   ├── ingestion/__init__.py
│   ├── retrieval/__init__.py
│   ├── generation/__init__.py
│   ├── kb/__init__.py
│   ├── tts/__init__.py
│   └── storage/
│       ├── __init__.py
│       └── sqlite_store.py         # Schema creation + DB init
├── ui/
│   ├── __init__.py
│   ├── tabs/
│   │   ├── __init__.py
│   │   ├── ask_tab.py              # Placeholder: voice query tab
│   │   ├── kb_tab.py               # Placeholder: KB manager tab
│   │   ├── analytics_tab.py        # Placeholder: analytics tab
│   │   └── settings_tab.py         # Placeholder: settings tab
│   └── components/
│       ├── __init__.py
│       ├── citation_panel.py       # Placeholder: citation display
│       └── audio_controls.py       # Placeholder: TTS controls
├── tests/
│   ├── __init__.py
│   ├── conftest.py                 # Pytest fixtures
│   └── test_phase0.py              # Foundation smoke tests
├── data/                           # Runtime data (gitignored)
└── DOCS/
    └── phase0_foundation.md        # Phase 0 documentation

Key Decisions

  • pydantic-settings for type-safe env var loading (no raw os.environ calls)
  • pathlib.Path throughout: cross-platform, no os.path
  • SQLite stdlib for metadata: zero-dependency, portable, no server
  • Gradio 4.x Blocks for UI: native HuggingFace Spaces support
  • __version__ sentinel in voicevault/__init__.py for release tracking
  • Data models locked early to prevent schema drift across phases
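
The SQLite decision above can be sketched with the standard library alone. The three table names are taken from the test_db_init criterion in this plan; every column, and the helper names init_db and list_tables, are assumptions made for illustration and may not match sqlite_store.py.

```python
from __future__ import annotations

import sqlite3
from pathlib import Path

# Table names follow the plan; all columns are illustrative assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS knowledge_bases (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS documents (
    id INTEGER PRIMARY KEY,
    kb_id INTEGER NOT NULL REFERENCES knowledge_bases(id),
    filename TEXT NOT NULL,
    sha256 TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS query_log (
    id INTEGER PRIMARY KEY,
    session_id TEXT NOT NULL,
    query_hash TEXT NOT NULL,
    latency_ms REAL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def init_db(db_path: Path | str) -> sqlite3.Connection:
    """Create the schema if it is missing and return an open connection."""
    conn = sqlite3.connect(str(db_path))
    conn.executescript(SCHEMA)
    return conn


def list_tables(conn: sqlite3.Connection) -> set[str]:
    """Return the names of all tables currently in the database."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return {name for (name,) in rows}
```

Because every statement uses IF NOT EXISTS, calling init_db on an existing database is harmless, which suits an app that initializes storage on every start.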

Tests

| Test | Description | Pass Criteria |
|---|---|---|
| test_config_loads | Config instantiates without exceptions | No exception |
| test_env_defaults | Default values are correct types | All fields pass type check |
| test_db_init | SQLite schema creates 3 tables | Tables knowledge_bases, documents, query_log exist |
| test_data_dirs | Data directory structure is created | Dirs exist after init |
| test_models_instantiate | All Pydantic models can be instantiated | No validation errors |
| test_gradio_builds | Gradio demo object builds without error | gr.Blocks object created |

Documentation

See DOCS/phase0_foundation.md


Phase 1: Document Ingestion Pipeline

Goal

Build the complete document ingestion pipeline: parse any supported document format, semantically chunk the text, generate embeddings, build the BM25 index, store everything in ChromaDB + SQLite, and implement SHA-256-based deduplication.

Files Created

voicevault/ingestion/
├── document_parser.py      # PDF, HTML, DOCX, MD, TXT, URL parsers
├── semantic_chunker.py     # spaCy + cosine-similarity boundary chunker
└── index_builder.py        # ChromaDB + BM25 + SQLite indexer + dedup

voicevault/storage/
├── sqlite_store.py         # Full CRUD: KB, document, chunk metadata
└── chroma_store.py         # ChromaDB collection management

tests/
└── test_phase1.py          # Ingestion unit + integration tests

DOCS/
└── phase1_ingestion.md

Key Components

DocumentParser: multi-format dispatcher

  • PDF: PyMuPDF (fitz), preserves page numbers, extracts tables as text
  • HTML: BeautifulSoup4, handles Notion/Confluence exports, preserves heading hierarchy
  • DOCX: python-docx, heading-aware extraction
  • Markdown: markdown-it-py, heading hierarchy → section metadata
  • Plain text: paragraph-level splitting
  • URL: trafilatura, clean article extraction from any public URL
  • Scanned PDF fallback: pytesseract OCR when no text layer found

SemanticChunker: boundary detection

  • spaCy en_core_web_sm sentence tokenization
  • Cosine similarity between adjacent sentence embeddings
  • New chunk when similarity < 0.5 (configurable threshold)
  • Target: 400–600 tokens per chunk, 50-token overlap
  • Special handling: tables as atomic units, code blocks atomic, lists kept together
  • Metadata per chunk: source_file, page_number, section_heading, chunk_index, timestamp
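
The boundary rule above (start a new chunk when adjacent sentence embeddings fall below the similarity threshold) can be sketched as follows. This is a minimal illustration that assumes sentences and their embeddings are already computed; the function names are hypothetical, and the token-size target, overlap, and atomic handling of tables and code blocks are deliberately left out.

```python
from __future__ import annotations

import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_on_boundaries(
    sentences: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    threshold: float = 0.5,
) -> list[list[str]]:
    """Group consecutive sentences, opening a new chunk at each topic shift."""
    if len(sentences) != len(embeddings):
        raise ValueError("need exactly one embedding per sentence")
    chunks: list[list[str]] = []
    for i, sentence in enumerate(sentences):
        is_boundary = i == 0 or cosine(embeddings[i - 1], embeddings[i]) < threshold
        if is_boundary:
            chunks.append([sentence])
        else:
            chunks[-1].append(sentence)
    return chunks
```

In the real pipeline the embeddings would come from the sentence encoder and a second pass would presumably merge or split groups to reach the token target.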

IndexBuilder: dual-index construction

  • SHA-256 hash of chunk text → deduplication (skip re-indexed unchanged content)
  • sentence-transformers all-MiniLM-L6-v2 → 384-dim embeddings → ChromaDB
  • rank_bm25 BM25Okapi index → serialized to bm25.pkl
  • SQLite metadata: chunks table linking every chunk to its source doc
  • Incremental update: only new/changed chunks re-embedded
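
The hash-based deduplication described above might look like the sketch below. It assumes the set of already-indexed hashes is available in memory (in the plan it would presumably be read from SQLite); chunk_hash and select_new_chunks are hypothetical names.

```python
from __future__ import annotations

import hashlib


def chunk_hash(text: str) -> str:
    """Return the SHA-256 hex digest of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_new_chunks(chunks: list[str], seen_hashes: set[str]) -> list[str]:
    """Return only chunks whose hash is unseen, recording each new hash."""
    new_chunks: list[str] = []
    for text in chunks:
        digest = chunk_hash(text)
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            new_chunks.append(text)
    return new_chunks
```

Only the chunks returned here would need embedding, which is what makes a re-upload of an unchanged document cheap.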

Tests

| Test | Pass Criteria |
|---|---|
| test_pdf_parse | Extracts text with correct page numbers |
| test_html_parse | Extracts headings and paragraphs from Notion HTML |
| test_docx_parse | Extracts text from DOCX with heading metadata |
| test_semantic_chunker | Chunks respect sentence boundaries, 100–600 tokens |
| test_deduplication | Same doc uploaded twice → chunks not duplicated |
| test_bm25_build | BM25 index serializes and reloads correctly |
| test_chroma_store | Vectors stored and queryable in ChromaDB |
| test_sqlite_metadata | All chunk metadata persisted to SQLite |
| test_incremental_update | Only new chunks indexed on re-upload |

Phase 2: Hybrid Retrieval Engine

Goal

Implement the hybrid BM25 + dense vector retrieval pipeline with Reciprocal Rank Fusion merging, cross-encoder reranking, diversity filtering, query expansion, and context window assembly.

Files Created

voicevault/retrieval/
├── bm25_retriever.py       # rank_bm25 keyword search
├── vector_retriever.py     # ChromaDB semantic search
├── hybrid_retriever.py     # RRF merge + cross-encoder + diversity filter
└── context_builder.py      # Formats top-k chunks for LLM prompt

tests/
└── test_phase2.py          # Retrieval unit + benchmark tests

DOCS/
└── phase2_retrieval.md

Key Components

BM25Retriever:

  • Loads pre-built BM25 index from disk
  • Tokenizes query, scores all chunks, returns top-20

VectorRetriever:

  • Encodes query with all-MiniLM-L6-v2
  • ChromaDB cosine similarity query → top-20

HybridRetriever (RRF core):

query → [QueryExpander: 2 paraphrases]
     → BM25 top-20 + Vector top-20 (parallel)
     → RRF merge (k=60): score = Σ 1/(k + rank)
     → CrossEncoder ms-marco-MiniLM-L-12-v2 rescores top-20
     → DiversityFilter: max 2 chunks from same page
     → Final top-5 chunks
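
The RRF merge and diversity filter steps can be sketched in plain Python. The sketch assumes ranks are 1-based, that each retriever returns an ordered list of chunk ids, and that a lookup from chunk id to a page key exists; rrf_merge, diversity_filter, and page_of are hypothetical names, and the cross-encoder step between them is omitted.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Hashable, Mapping, Sequence


def rrf_merge(rankings: Sequence[Sequence[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked id lists: score(id) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by id so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def diversity_filter(
    ranked_ids: Sequence[str],
    page_of: Mapping[str, Hashable],
    max_per_page: int = 2,
    top_n: int = 5,
) -> list[str]:
    """Keep at most max_per_page chunks per page until top_n are collected."""
    kept: list[str] = []
    per_page: dict[Hashable, int] = defaultdict(int)
    for chunk_id in ranked_ids:
        page = page_of[chunk_id]
        if per_page[page] < max_per_page:
            per_page[page] += 1
            kept.append(chunk_id)
        if len(kept) == top_n:
            break
    return kept
```

A chunk that appears in both lists accumulates two terms, so agreement between BM25 and vector search is rewarded without having to normalize their incompatible raw scores. For multi-document knowledge bases the page key would likely need to be a (file, page) pair.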

ContextBuilder:

  • Formats chunks as: [Source: filename, p.N | Section: heading]\n{text}
  • Appends conversation history (last 5 turns)
  • Returns context string ready for LLM prompt
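
The context format above can be illustrated with a small sketch. The header format is the one stated in the list; the Chunk dataclass, the history representation as (question, answer) pairs, and the "Conversation history" label are assumptions for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Chunk:
    source_file: str
    page_number: int
    section_heading: str
    text: str


def format_chunk(chunk: Chunk) -> str:
    """Render one chunk with its source header on the first line."""
    header = (
        f"[Source: {chunk.source_file}, p.{chunk.page_number}"
        f" | Section: {chunk.section_heading}]"
    )
    return f"{header}\n{chunk.text}"


def build_context(
    chunks: Sequence[Chunk],
    history: Sequence[tuple[str, str]],
    max_turns: int = 5,
) -> str:
    """Join formatted chunks, then append the most recent conversation turns."""
    parts = [format_chunk(chunk) for chunk in chunks]
    recent = list(history)[-max_turns:]
    if recent:
        turns = [f"User: {q}\nAssistant: {a}" for q, a in recent]
        parts.append("Conversation history:\n" + "\n".join(turns))
    return "\n\n".join(parts)
```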

Tests

| Test | Pass Criteria |
|---|---|
| test_bm25_retriever | Returns ranked results for keyword query |
| test_vector_retriever | Returns semantically relevant results |
| test_rrf_merge | RRF scores computed correctly for known ranks |
| test_cross_encoder_rerank | Re-ranked order differs from RRF order (improvement) |
| test_diversity_filter | Max 2 chunks per page in final results |
| test_hybrid_recall | Recall@5 ≥ 0.80 on 50-Q benchmark dataset |
| test_context_builder | Output is valid string with source citations |
| test_query_expansion | Returns 2 paraphrase variants |

Phase 3: ASR & Voice Input

Goal

Integrate Whisper Large-v3 for high-quality speech-to-text transcription, with Distil-Whisper CPU fallback, browser microphone capture via Gradio Audio, and a query preprocessor that cleans transcripts and classifies query intent.

Files Created

voicevault/asr/
├── whisper_transcriber.py  # Whisper Large-v3 + Distil-Whisper fallback
└── query_preprocessor.py   # Cleanup, intent classification, language detect

tests/
└── test_phase3.py          # ASR unit tests + WER evaluation

DOCS/
└── phase3_asr.md

Key Components

WhisperTranscriber:

  • Primary: openai/whisper-large-v3 (HuggingFace GPU pipeline)
  • Fallback: distil-whisper/distil-large-v3 (CPU, 6× faster, <1% WER diff)
  • VAD pre-check: reject audio < 1s or silent audio
  • Returns: transcript, language, confidence, model_used, latency_ms
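
The pre-check above (reject clips shorter than 1 s or silent) could be approximated as below. This is a simple energy threshold and not a true voice activity detector; it assumes mono samples as floats in [-1, 1], and the silence_rms value and the function name check_audio are illustrative assumptions. Raising ValueError matches the test_vad_short_audio criterion.

```python
from __future__ import annotations

import math
from typing import Sequence


def check_audio(
    samples: Sequence[float],
    sample_rate: int,
    min_seconds: float = 1.0,
    silence_rms: float = 0.01,
) -> None:
    """Raise ValueError if the clip is too short or its RMS energy is too low."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    duration = len(samples) / sample_rate
    if duration < min_seconds:
        raise ValueError(f"audio too short: {duration:.2f}s")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms < silence_rms:
        raise ValueError("audio is silent")
```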

QueryPreprocessor:

  • Lowercase normalization, punctuation repair
  • Filler word removal: um, uh, like, you know
  • Language detection: langdetect library
  • Query type classification:
    • factual: "What is...", "Who...", "When..."
    • summary: "Summarise...", "Give me an overview..."
    • compare: "Compare...", "What's the difference..."
  • Routes to different retrieval strategies per type
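
A rule-based version of the cleanup and intent steps might look like this. The filler list and trigger phrases come from the bullets above; the extra spellings, the default of "factual" for unmatched queries, and the function names are assumptions. Punctuation repair and language detection are not shown.

```python
from __future__ import annotations

import re

# Fillers named in the plan. Note that blindly removing "like" also damages
# legitimate uses such as "what does it look like".
FILLERS = re.compile(r"\b(?:um|uh|like|you know)\b,?", re.IGNORECASE)

# Order matters: "what is the difference" must be tested before "what is".
INTENT_PREFIXES: dict[str, tuple[str, ...]] = {
    "summary": ("summarise", "summarize", "give me an overview"),
    "compare": ("compare", "what's the difference", "what is the difference"),
    "factual": ("what is", "who", "when"),
}


def clean_transcript(text: str) -> str:
    """Drop filler words, collapse whitespace, and lowercase the transcript."""
    without_fillers = FILLERS.sub(" ", text)
    return re.sub(r"\s+", " ", without_fillers).strip().lower()


def classify_intent(query: str) -> str:
    """Return 'factual', 'summary', or 'compare' based on the opening phrase."""
    normalized = query.strip().lower()
    for intent, prefixes in INTENT_PREFIXES.items():
        if normalized.startswith(prefixes):
            return intent
    return "factual"  # assumed default when no prefix matches
```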

Tests

| Test | Pass Criteria |
|---|---|
| test_preprocessor_cleanup | Filler words removed, normalized |
| test_intent_factual | "What is X?" → type=factual |
| test_intent_summary | "Summarise the report" → type=summary |
| test_intent_compare | "Compare A and B" → type=compare |
| test_language_detection | English text → "en" |
| test_vad_short_audio | < 1s audio raises ValueError |
| test_whisper_mock | Transcriber returns correct schema with mocked model |

Phase 4: Generation Chain & Citations

Goal

Build the full LangChain LCEL generation chain: Groq Llama-3.1-70B as primary LLM with streaming, Gemini 1.5 Flash as automatic fallback, citation injection with [Source: file, p.N] protocol, faithfulness guard for out-of-context detection, and conversation memory.

Files Created

voicevault/generation/
├── answer_chain.py         # LangChain LCEL + Groq + Gemini fallback
├── citation_injector.py    # Maps [Doc:Page] citations to source chunks
└── faithfulness_guard.py   # Out-of-context detection

tests/
└── test_phase4.py          # Generation unit tests

DOCS/
└── phase4_generation.md

Key Components

AnswerChain (LCEL):

context_string + query + history
    → PromptTemplate (system: citation protocol + faithfulness instructions)
    → ChatGroq (llama-3.1-70b-versatile, streaming, temp=0.1)
         on quota error → ChatGoogleGenerativeAI (gemini-1.5-flash)
    → StrOutputParser
    → CitationInjector (post-processing)
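
The "on quota error" branch is a plain try-primary-then-fallback pattern. The sketch below shows it without any LangChain import so it runs standalone; QuotaExceededError is a hypothetical stand-in for whatever rate-limit exception the Groq client raises, and the two callables stand in for the Groq and Gemini models. Inside LangChain itself, runnables expose a with_fallbacks method that serves the same purpose.

```python
from __future__ import annotations

from typing import Callable


class QuotaExceededError(RuntimeError):
    """Hypothetical stand-in for a provider's quota or rate-limit error."""


def generate_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> tuple[str, str]:
    """Return (answer, model_used), switching models only on a quota error."""
    try:
        return primary(prompt), "primary"
    except QuotaExceededError:
        return fallback(prompt), "fallback"
```

Catching only the quota error keeps genuine bugs (bad prompt, authentication failure) visible instead of silently routing them to the second model.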

CitationInjector:

  • Parses [Doc:Page] markers from LLM output
  • Resolves each to the actual chunk's source_file + page_number + excerpt
  • Builds List[Citation] object for UI display
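
Marker parsing can be sketched with a regular expression. The plan does not pin down the exact marker grammar, so this sketch assumes markers of the form [label:page] with an integer page, and a lookup keyed by (label, page); the Citation fields mirror the bullet above, while extract_citations, chunks_by_key, and the 80-character excerpt length are hypothetical.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Mapping

# Assumed marker grammar: "[<label>:<page>]", e.g. "[policy.pdf:3]".
MARKER = re.compile(r"\[([^\[\]:]+):(\d+)\]")


@dataclass(frozen=True)
class Citation:
    source_file: str
    page_number: int
    excerpt: str


def extract_citations(
    answer: str, chunks_by_key: Mapping[tuple[str, int], str]
) -> list[Citation]:
    """Resolve each distinct marker to a Citation; unknown markers are skipped."""
    citations: list[Citation] = []
    seen: set[tuple[str, int]] = set()
    for label, page in MARKER.findall(answer):
        key = (label.strip(), int(page))
        if key in seen or key not in chunks_by_key:
            continue
        seen.add(key)
        citations.append(Citation(key[0], key[1], chunks_by_key[key][:80]))
    return citations
```

Skipping markers that do not resolve to a retrieved chunk is one way to avoid displaying a citation the model invented; flagging them for the faithfulness guard would be another.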

FaithfulnessGuard:

  • System prompt: "If the answer cannot be found in the provided context, respond with exactly: 'I could not find this in your documents.'"
  • Post-generation check: if answer references facts not in any retrieved chunk → flag
  • Confidence scoring based on retrieval score distribution

Tests

| Test | Pass Criteria |
|---|---|
| test_citation_injector_parses | [Doc:5] → correct Citation object |
| test_faithfulness_guard_refusal | Out-of-context Q → refusal message |
| test_answer_chain_mock | Chain runs end-to-end with mocked LLM |
| test_groq_fallback | Groq quota error → Gemini client used |
| test_streaming_output | Chain yields token-by-token |
| test_conversation_memory | Last 5 turns preserved across queries |

Phase 5: Full UI, TTS & Access Control

Goal

Build the complete 4-tab Gradio UI, integrate Web Speech API for browser-native TTS, implement the multi-knowledge-base manager, add bcrypt password protection + HMAC share links, and build the analytics + audit log system.

Files Created

voicevault/kb/
├── kb_manager.py           # Create/list/delete knowledge bases
├── access_control.py       # bcrypt password, HMAC share links
└── audit_log.py            # Query logging to SQLite

voicevault/tts/
└── web_speech.py           # Web Speech API JS bridge

voicevault/storage/
└── sqlite_store.py         # Complete CRUD (extended from Phase 0)

ui/tabs/
├── ask_tab.py              # Full voice query tab
├── kb_tab.py               # Full KB manager tab
├── analytics_tab.py        # Charts + metrics tab
└── settings_tab.py         # All configurable parameters

ui/components/
├── citation_panel.py       # Citation highlighting component
└── audio_controls.py       # TTS playback controls

tests/
├── test_phase5.py          # UI component + access control tests
└── test_e2e.py             # Full end-to-end pipeline test

DOCS/
└── phase5_ui_access.md

Key Components

KBManager:

  • Creates per-KB directory: data/{kb_name}/chroma/, bm25.pkl, voicevault.db
  • Lists all KBs with metadata (doc count, chunk count, last updated)
  • Delete KB: removes directory + SQLite row

AccessControl:

  • Password hash: bcrypt with work factor 12
  • Share link: HMAC-SHA256 signed token with KB name + expiry
  • Token validation on every query to password-protected KB
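
The share-link scheme (HMAC-SHA256 over KB name plus expiry) can be sketched with the standard library. The token layout (base64 payload, a dot, then a hex signature), the "name|expiry" payload, and both function names are assumptions; the 7-day default follows the security checklist. bcrypt hashing is not shown because it needs a third-party package.

```python
from __future__ import annotations

import base64
import hashlib
import hmac
import time


def make_share_token(
    kb_name: str,
    secret: bytes,
    ttl_seconds: int = 7 * 24 * 3600,
    now: float | None = None,
) -> str:
    """Sign "<kb_name>|<expiry>" and return "<base64 payload>.<hex signature>"."""
    issued = time.time() if now is None else now
    payload = f"{kb_name}|{int(issued) + ttl_seconds}".encode("utf-8")
    signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload).decode("ascii") + "." + signature


def verify_share_token(
    token: str, kb_name: str, secret: bytes, now: float | None = None
) -> bool:
    """Accept only an untampered, unexpired token issued for this KB."""
    try:
        encoded, signature = token.split(".")
        payload = base64.urlsafe_b64decode(encoded.encode("ascii"))
    except ValueError:
        return False  # wrong shape, non-ASCII input, or invalid base64
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes avoids leaking how many chars matched.
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("ascii")):
        return False
    name, _, expires = payload.decode("utf-8").rpartition("|")
    current = time.time() if now is None else now
    return name == kb_name and current < int(expires)
```

The signature is checked before the payload is parsed, so the expiry and name are only ever read from data the server itself signed.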

AuditLog:

  • Every query logs: session_id, kb_names, voice_query (anonymized), latency, timestamp
  • Viewable in Analytics tab

Web Speech API Bridge:

  • JavaScript injected via gr.HTML component
  • window.speechSynthesis.speak() triggered from Python via Gradio's JS bridge
  • Voice selector, rate slider, pitch slider
  • Pause/Resume/Restart controls

UI Tabs:

  • Ask tab: Mic button → live transcript → KB selector → streaming answer → citation panel → speak button
  • KB tab: Create KB form + document uploader (PDF/MD/HTML/DOCX) + progress bar + doc list
  • Analytics tab: Query volume chart + latency breakdown + top documents + Groq quota gauge
  • Settings tab: ASR model, voice settings, retrieval params, LLM params, chunking params

Tests

| Test | Pass Criteria |
|---|---|
| test_kb_create_delete | KB directory created/removed correctly |
| test_bcrypt_password | Hash + verify round-trip |
| test_hmac_share_link | Token validates within expiry, fails after |
| test_audit_log_write | Query logged to SQLite correctly |
| test_access_control_wrong_pw | Wrong password → access denied |
| test_e2e_pipeline | PDF upload → query → cited answer (mocked LLM) |

10. Quality Gates

Every phase must pass ALL gates before moving to the next phase:

| Gate | Requirement |
|---|---|
| Zero import errors | python -m pytest tests/ --co -q exits 0 |
| All tests pass | pytest tests/test_phaseN.py, 100% green |
| No bare except | No except: or except Exception: without logging |
| Type annotations | Every public function has full type hints |
| No unused imports | pylint --disable=all --enable=W0611 passes |
| No secrets in code | No API keys, passwords, or tokens hardcoded |
| Pathlib throughout | No os.path usage in any module |

11. Security Audit Checklist

  • No API keys committed to git (enforced by .gitignore + .env.example)
  • All file uploads validated: extension whitelist + MIME check + size limit
  • SQLite queries use parameterized statements (no f-string SQL)
  • bcrypt work factor ≥ 12 for password hashing
  • HMAC share tokens have expiry (default: 7 days)
  • trafilatura URL fetching: no SSRF, block private IP ranges
  • ChromaDB stored in non-public path (never served as static file)
  • BM25 pickle files: only loaded from trusted internal paths
  • Gradio app: file upload restricted to data/uploads/ sandbox directory
  • Audit log: voice queries anonymized before storage (hash, not raw text)
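
The SSRF item in the checklist above can be illustrated with a pre-fetch check built on the standard ipaddress module. This is a sketch under assumptions: the function names are hypothetical, the resolver is injectable so the check can be tested without network access, and a check-then-fetch design like this does not by itself defend against DNS rebinding (the fetch would need to connect to the validated address).

```python
from __future__ import annotations

import ipaddress
import socket
from typing import Callable
from urllib.parse import urlparse


def is_public_ip(address: str) -> bool:
    """True only for addresses outside private, loopback, and similar ranges."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return False  # unparsable address: treat as unsafe
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def resolve_host(host: str) -> list[str]:
    """Resolve a hostname to all of its IP addresses via the system resolver."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def is_url_allowed(
    url: str, resolver: Callable[[str], list[str]] = resolve_host
) -> bool:
    """Allow http(s) URLs only when every resolved address is public."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        addresses = resolver(parsed.hostname)
    except OSError:
        return False  # resolution failed
    return bool(addresses) and all(is_public_ip(a) for a in addresses)
```

Requiring every resolved address to be public, instead of just one, closes the case where a hostname maps to both a public and an internal address.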

12. Progress Tracker

| Phase | Status | Tests | Docs |
|---|---|---|---|
| Phase 0: Foundation | ✅ Done | ✅ 58/58 | ✅ phase0_foundation.md |
| Phase 1: Ingestion | ✅ Done | ✅ 46/46 | ✅ phase1_ingestion.md |
| Phase 2: Retrieval | ✅ Done | ✅ 33/33 | ✅ phase2_retrieval.md |
| Phase 3: ASR | ✅ Done | ✅ 45/47 (2 skipped) | ✅ phase3_asr.md |
| Phase 4: Generation | ✅ Done | ✅ 72/72 | ✅ phase4_generation.md |
| Phase 5: UI & Access | ✅ Done | ✅ 55/55 | ✅ phase5_ui_access.md |

VoiceVault · Navnit Amrutharaj · navnita004@gmail.com · github.com/ninjacode911