VoiceVault — End-to-End Implementation Plan

Author: Navnit Amrutharaj
Model: VoiceVault v1.0 — Voice-First RAG Knowledge Agent
Stack: Whisper · LangChain · ChromaDB · Groq · Gradio
Target: $0/month · HuggingFace Spaces · 10 Weeks
Plan Date: March 2026

Project Overview
Architecture Summary
Phase Map
Phase 0 — Project Foundation
Phase 1 — Document Ingestion Pipeline
Phase 2 — Hybrid Retrieval Engine
Phase 3 — ASR & Voice Input
Phase 4 — Generation Chain & Citations
Phase 5 — Full UI, TTS & Access Control
Quality Gates
Security Audit Checklist
Progress Tracker

1. Project Overview

VoiceVault is a voice-first retrieval-augmented generation (RAG) knowledge agent that enables users to:

Speak questions into a browser microphone
Have each question transcribed (Whisper), answered from retrieved context, and spoken back
Reference private document collections (PDFs, Notion exports, Confluence, DOCX, MD)
Receive fully cited answers anchored to source document + page + paragraph

Core differentiator: Hybrid BM25 + vector search with Reciprocal Rank Fusion (RRF) + cross-encoder reranking — demonstrating enterprise-grade retrieval depth that most RAG tutorials skip.

2. Architecture Summary

INGESTION PATH (one-time per document set)
  User uploads PDFs / HTML / DOCX / MD
      ↓
  DocumentParser → text extraction (PyMuPDF, BS4, python-docx)
      ↓
  SemanticChunker → sentence-aware chunks (spaCy + cosine boundary)
      ↓
  IndexBuilder → ChromaDB (vectors) + BM25 (keywords) + SQLite (metadata)

QUERY PATH (real-time, per user question)
  Browser mic → Gradio Audio → Whisper Large-v3 (HuggingFace GPU)
      ↓
  QueryPreprocessor → cleanup + intent class + language detect
      ↓
  HybridRetriever → BM25 top-20 + Vector top-20 → RRF merge → CrossEncoder top-5
      ↓
  LangChain LCEL → Groq Llama-3.1-70B (stream) / Gemini Flash (fallback)
      ↓
  CitationInjector → [Source: filename, p.N] inline citations
      ↓
  Gradio UI (text + highlight citations) + Web Speech API (spoken answer)

3. Phase Map

Phase	Name	Weeks	Core Deliverables
0	Project Foundation	0	Scaffold, config, models, SQLite schema, Gradio skeleton
1	Document Ingestion	1–2	Parser, semantic chunker, ChromaDB + BM25 + SQLite indexer
2	Hybrid Retrieval	3	BM25 + vector + RRF + cross-encoder + diversity filter
3	ASR & Voice Input	4	Whisper Large-v3, Distil fallback, query preprocessor
4	Generation & Citations	5	LangChain LCEL, Groq, Gemini fallback, faithfulness guard
5	Full UI & Access Control	6–8	4-tab Gradio UI, Web Speech TTS, multi-KB, bcrypt, audit log

Phase 0 — Project Foundation

Goal

Establish the complete project skeleton — directory structure, dependencies, centralized config, Pydantic data models, SQLite schema, and a working 4-tab Gradio scaffold — before any business logic is written.

Files Created

voicevault/
├── app.py                          # Gradio Blocks entry point
├── config.py                       # Pydantic-settings centralized config
├── requirements.txt                # All project dependencies (pinned)
├── .env.example                    # Environment variable template
├── voicevault/
│   ├── __init__.py                 # Package init + version
│   ├── models.py                   # Pydantic data models (all schemas)
│   ├── asr/__init__.py
│   ├── ingestion/__init__.py
│   ├── retrieval/__init__.py
│   ├── generation/__init__.py
│   ├── kb/__init__.py
│   ├── tts/__init__.py
│   └── storage/
│       ├── __init__.py
│       └── sqlite_store.py         # Schema creation + DB init
├── ui/
│   ├── __init__.py
│   ├── tabs/
│   │   ├── __init__.py
│   │   ├── ask_tab.py              # Placeholder — voice query tab
│   │   ├── kb_tab.py               # Placeholder — KB manager tab
│   │   ├── analytics_tab.py        # Placeholder — analytics tab
│   │   └── settings_tab.py         # Placeholder — settings tab
│   └── components/
│       ├── __init__.py
│       ├── citation_panel.py       # Placeholder — citation display
│       └── audio_controls.py       # Placeholder — TTS controls
├── tests/
│   ├── __init__.py
│   ├── conftest.py                 # Pytest fixtures
│   └── test_phase0.py              # Foundation smoke tests
├── data/                           # Runtime data (gitignored)
└── DOCS/
    └── phase0_foundation.md        # Phase 0 documentation

Key Decisions

pydantic-settings for type-safe env var loading (no raw os.environ calls)
pathlib.Path throughout — cross-platform, no os.path
SQLite stdlib for metadata — zero-dependency, portable, no server
Gradio 4.x Blocks for UI — native HuggingFace Spaces support
__version__ sentinel in voicevault/__init__.py for release tracking
Data models locked early — prevents schema drift across phases

Tests

Test	Description	Pass Criteria
`test_config_loads`	Config instantiates without exceptions	No exception
`test_env_defaults`	Default values are correct types	All fields pass type check
`test_db_init`	SQLite schema creates 3 tables	Tables `knowledge_bases`, `documents`, `query_log` exist
`test_data_dirs`	Data directory structure is created	Dirs exist after init
`test_models_instantiate`	All Pydantic models can be instantiated	No validation errors
`test_gradio_builds`	Gradio demo object builds without error	`gr.Blocks` object created
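
As a rough illustration of what `test_db_init` checks, the sketch below creates the three tables with the stdlib `sqlite3` module and `pathlib`. Only the table names come from this plan; every column, and the `init_db` helper itself, is a hypothetical placeholder.

```python
import sqlite3
from pathlib import Path

# Column layouts are illustrative assumptions; only the table names are from the plan.
SCHEMA = """
CREATE TABLE IF NOT EXISTS knowledge_bases (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS documents (
    id INTEGER PRIMARY KEY,
    kb_id INTEGER NOT NULL REFERENCES knowledge_bases(id),
    filename TEXT NOT NULL,
    sha256 TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS query_log (
    id INTEGER PRIMARY KEY,
    session_id TEXT NOT NULL,
    query_hash TEXT NOT NULL,
    latency_ms REAL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def init_db(db_path: Path) -> list[str]:
    """Create the schema if it is missing and return the table names found."""
    db_path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    try:
        conn.executescript(SCHEMA)
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Because every statement uses `IF NOT EXISTS`, calling `init_db` repeatedly on the same file is harmless, which suits an app that initializes storage on each start.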

Documentation

→ See DOCS/phase0_foundation.md

Phase 1 — Document Ingestion Pipeline

Goal

Build the complete document ingestion pipeline: parse any supported document format, semantically chunk the text, generate embeddings, build the BM25 index, store everything in ChromaDB + SQLite, and implement SHA-256-based deduplication.

Files Created

voicevault/ingestion/
├── document_parser.py      # PDF, HTML, DOCX, MD, TXT, URL parsers
├── semantic_chunker.py     # spaCy + cosine-similarity boundary chunker
└── index_builder.py        # ChromaDB + BM25 + SQLite indexer + dedup

voicevault/storage/
├── sqlite_store.py         # Full CRUD: KB, document, chunk metadata
└── chroma_store.py         # ChromaDB collection management

tests/
└── test_phase1.py          # Ingestion unit + integration tests

DOCS/
└── phase1_ingestion.md

Key Components

DocumentParser — Multi-format dispatcher:

PDF: PyMuPDF (fitz) — preserves page numbers, extracts tables as text
HTML: BeautifulSoup4 — Notion/Confluence exports, preserves heading hierarchy
DOCX: python-docx — heading-aware extraction
Markdown: markdown-it-py — heading hierarchy → section metadata
Plain text: paragraph-level splitting
URL: trafilatura — clean article extraction from any public URL
Scanned PDF fallback: pytesseract OCR when no text layer found

SemanticChunker — Boundary detection:

spaCy en_core_web_sm sentence tokenization
Cosine similarity between adjacent sentence embeddings
New chunk when similarity < 0.5 (configurable threshold)
Target: 400–600 tokens per chunk, 50-token overlap
Special handling: tables as atomic units, code blocks atomic, lists kept together
Metadata per chunk: source_file, page_number, section_heading, chunk_index, timestamp
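
The boundary rule above can be sketched in a few lines. This is a minimal illustration that assumes sentence embeddings are already available as plain lists of floats; it leaves out the token targets, the overlap, and the atomic handling of tables and code blocks. The function names are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_on_similarity(
    sentences: list[str],
    embeddings: list[list[float]],
    threshold: float = 0.5,
) -> list[str]:
    """Group sentences, starting a new chunk when adjacent similarity < threshold."""
    if not sentences:
        return []
    chunks: list[list[str]] = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append([])  # topic shift detected: open a new chunk
        chunks[-1].append(sentences[i])
    return [" ".join(chunk) for chunk in chunks]
```

With toy embeddings `[[1, 0], [0.9, 0.1], [0, 1]]`, the first two sentences stay together and the third starts a new chunk, because only the second pair falls below the 0.5 threshold.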

IndexBuilder — Dual-index construction:

SHA-256 hash of chunk text → deduplication (skip re-indexing unchanged content)
sentence-transformers all-MiniLM-L6-v2 → 384-dim embeddings → ChromaDB
rank_bm25 BM25Okapi index → serialized to bm25.pkl
SQLite metadata: chunks table linking every chunk to its source doc
Incremental update: only new/changed chunks re-embedded
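
A minimal sketch of the hash-based deduplication, assuming the set of already-indexed hashes is held in memory (in the plan it would presumably come from the SQLite metadata). `chunk_hash` and `select_new_chunks` are illustrative names.

```python
import hashlib


def chunk_hash(text: str) -> str:
    """Hex SHA-256 digest of the chunk text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_new_chunks(chunks: list[str], seen_hashes: set[str]) -> list[str]:
    """Return chunks whose hash is not yet indexed and record them as seen."""
    fresh: list[str] = []
    for text in chunks:
        digest = chunk_hash(text)
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            fresh.append(text)
    return fresh
```

Only the chunks returned by `select_new_chunks` would need embedding, which is what makes a re-upload of an unchanged document cheap.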

Tests

Test	Pass Criteria
`test_pdf_parse`	Extracts text with correct page numbers
`test_html_parse`	Extracts headings and paragraphs from Notion HTML
`test_docx_parse`	Extracts text from DOCX with heading metadata
`test_semantic_chunker`	Chunks respect sentence boundaries, 100–600 tokens
`test_deduplication`	Same doc uploaded twice → chunks not duplicated
`test_bm25_build`	BM25 index serializes and reloads correctly
`test_chroma_store`	Vectors stored and queryable in ChromaDB
`test_sqlite_metadata`	All chunk metadata persisted to SQLite
`test_incremental_update`	Only new chunks indexed on re-upload

Phase 2 — Hybrid Retrieval Engine

Goal

Implement the hybrid BM25 + dense vector retrieval pipeline with Reciprocal Rank Fusion merging, cross-encoder reranking, diversity filtering, query expansion, and context window assembly.

Files Created

voicevault/retrieval/
├── bm25_retriever.py       # rank_bm25 keyword search
├── vector_retriever.py     # ChromaDB semantic search
├── hybrid_retriever.py     # RRF merge + cross-encoder + diversity filter
└── context_builder.py      # Formats top-k chunks for LLM prompt

tests/
└── test_phase2.py          # Retrieval unit + benchmark tests

DOCS/
└── phase2_retrieval.md

Key Components

BM25Retriever:

Loads pre-built BM25 index from disk
Tokenizes query, scores all chunks, returns top-20

VectorRetriever:

Encodes query with all-MiniLM-L6-v2
ChromaDB cosine similarity query → top-20

HybridRetriever (RRF core):

query → [QueryExpander: 2 paraphrases]
     → BM25 top-20 + Vector top-20 (parallel)
     → RRF merge (k=60): score = Σ 1/(k + rank)
     → CrossEncoder ms-marco-MiniLM-L-12-v2 rescores top-20
     → DiversityFilter: max 2 chunks from same page
     → Final top-5 chunks
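
The RRF merge and the diversity filter from the pipeline above can be sketched as follows. This assumes each retriever returns an ordered list of chunk IDs and that a page lookup is available as a dictionary; the cross-encoder step that sits between the two in the plan is omitted, and both function names are hypothetical.

```python
from collections import defaultdict


def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: score(id) = sum over lists of 1 / (k + rank), rank from 1."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def diversity_filter(
    ranked_ids: list[str],
    page_of: dict[str, tuple[str, int]],
    max_per_page: int = 2,
    top_n: int = 5,
) -> list[str]:
    """Keep at most max_per_page chunks per (file, page), up to top_n chunks."""
    kept: list[str] = []
    per_page: dict[tuple[str, int], int] = defaultdict(int)
    for chunk_id in ranked_ids:
        page = page_of[chunk_id]
        if per_page[page] < max_per_page:
            per_page[page] += 1
            kept.append(chunk_id)
        if len(kept) == top_n:
            break
    return kept
```

A chunk that appears in both lists accumulates two terms, so agreement between BM25 and vector search is rewarded even when neither ranks the chunk first.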

ContextBuilder:

Formats chunks as: [Source: filename, p.N | Section: heading]\n{text}
Appends conversation history (last 5 turns)
Returns context string ready for LLM prompt
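
One possible shape for the context formatting is sketched below. The chunk header follows the format given above and the field names match the chunk metadata from Phase 1; the `Chunk` dataclass, the `build_context` name, and the way history turns are rendered are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    source_file: str
    page_number: int
    section_heading: str
    text: str


def build_context(
    chunks: list[Chunk],
    history: list[tuple[str, str]],
    max_turns: int = 5,
) -> str:
    """Render chunks with source headers, then the last max_turns (question, answer) pairs."""
    blocks = [
        f"[Source: {c.source_file}, p.{c.page_number} | Section: {c.section_heading}]\n{c.text}"
        for c in chunks
    ]
    parts = ["\n\n".join(blocks)]
    turns = [f"User: {q}\nAssistant: {a}" for q, a in history[-max_turns:]]
    if turns:
        # The history layout is an assumption; the plan only fixes the chunk header.
        parts.append("Conversation history:\n" + "\n".join(turns))
    return "\n\n".join(parts)
```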

Tests

Test	Pass Criteria
`test_bm25_retriever`	Returns ranked results for keyword query
`test_vector_retriever`	Returns semantically relevant results
`test_rrf_merge`	RRF scores computed correctly for known ranks
`test_cross_encoder_rerank`	Re-ranked order differs from RRF order (improvement)
`test_diversity_filter`	Max 2 chunks per page in final results
`test_hybrid_recall`	Recall@5 ≥ 0.80 on 50-Q benchmark dataset
`test_context_builder`	Output is valid string with source citations
`test_query_expansion`	Returns 2 paraphrase variants

Phase 3 — ASR & Voice Input

Goal

Integrate Whisper Large-v3 for high-quality speech-to-text transcription, with Distil-Whisper CPU fallback, browser microphone capture via Gradio Audio, and a query preprocessor that cleans transcripts and classifies query intent.

Files Created

voicevault/asr/
├── whisper_transcriber.py  # Whisper Large-v3 + Distil-Whisper fallback
└── query_preprocessor.py   # Cleanup, intent classification, language detect

tests/
└── test_phase3.py          # ASR unit tests + WER evaluation

DOCS/
└── phase3_asr.md

Key Components

WhisperTranscriber:

Primary: openai/whisper-large-v3 (HuggingFace GPU pipeline)
Fallback: distil-whisper/distil-large-v3 (CPU, 6× faster, <1% WER diff)
VAD pre-check: reject audio < 1s or silent audio
Returns: transcript, language, confidence, model_used, latency_ms
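
The pre-check can be approximated without any ML model. The sketch below assumes mono audio as floats in the range -1 to 1 and uses a simple RMS energy threshold as a stand-in for real voice activity detection; `check_audio` and the `silence_rms` default are illustrative, not values from the plan.

```python
import math


def check_audio(
    samples: list[float],
    sample_rate: int,
    min_seconds: float = 1.0,
    silence_rms: float = 0.01,  # assumed threshold, tune against real recordings
) -> float:
    """Return the clip duration in seconds, or raise ValueError if too short or silent."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    duration = len(samples) / sample_rate
    if duration < min_seconds:
        raise ValueError(f"audio too short: {duration:.2f}s < {min_seconds:.2f}s")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms < silence_rms:
        raise ValueError("audio appears to be silent")
    return duration
```

Rejecting such clips before transcription avoids spending GPU time on input that cannot produce a usable query.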

QueryPreprocessor:

Lowercase normalization, punctuation repair
Filler word removal: um, uh, like, you know
Language detection: langdetect library
Query type classification:
- factual — "What is...", "Who...", "When..."
- summary — "Summarise...", "Give me an overview..."
- compare — "Compare...", "What's the difference..."
Routes to different retrieval strategies per type
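
A rule-based sketch of the cleanup and intent steps, using only regular expressions. The trigger phrases are the ones listed above; falling back to `factual` when nothing matches is an assumption, and stripping every "like" is deliberately naive (it would also damage a question such as "what is it like").

```python
import re

# Longest filler first so "you know" is matched as a unit.
_FILLER_RE = re.compile(r"\b(?:you know|um|uh|like)\b,?", re.IGNORECASE)

# Checked in order: "what's the difference" must win over the generic "what".
_INTENT_PATTERNS = [
    ("compare", re.compile(r"^(?:compare\b|what(?:'s| is) the difference\b)")),
    ("summary", re.compile(r"^(?:summari[sz]e\b|give me an overview\b)")),
    ("factual", re.compile(r"^(?:what|who|when)\b")),
]


def clean_transcript(text: str) -> str:
    """Lowercase, drop filler words, and collapse whitespace."""
    without_fillers = _FILLER_RE.sub(" ", text.lower())
    return re.sub(r"\s+", " ", without_fillers).strip()


def classify_intent(query: str) -> str:
    """Return 'factual', 'summary', or 'compare' for a raw transcript."""
    cleaned = clean_transcript(query)
    for label, pattern in _INTENT_PATTERNS:
        if pattern.search(cleaned):
            return label
    return "factual"  # assumed default when no pattern matches
```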

Tests

Test	Pass Criteria
`test_preprocessor_cleanup`	Filler words removed, normalized
`test_intent_factual`	"What is X?" → type=factual
`test_intent_summary`	"Summarise the report" → type=summary
`test_intent_compare`	"Compare A and B" → type=compare
`test_language_detection`	English text → "en"
`test_vad_short_audio`	< 1s audio raises ValueError
`test_whisper_mock`	Transcriber returns correct schema with mocked model

Phase 4 — Generation Chain & Citations

Goal

Build the full LangChain LCEL generation chain: Groq Llama-3.1-70B as primary LLM with streaming, Gemini 1.5 Flash as automatic fallback, citation injection with [Source: file, p.N] protocol, faithfulness guard for out-of-context detection, and conversation memory.

Files Created

voicevault/generation/
├── answer_chain.py         # LangChain LCEL + Groq + Gemini fallback
├── citation_injector.py    # Maps [Doc:Page] citations to source chunks
└── faithfulness_guard.py   # Out-of-context detection

tests/
└── test_phase4.py          # Generation unit tests

DOCS/
└── phase4_generation.md

Key Components

AnswerChain (LCEL):

context_string + query + history
    → PromptTemplate (system: citation protocol + faithfulness instructions)
    → ChatGroq (llama-3.1-70b-versatile, streaming, temp=0.1)
         on quota error → ChatGoogleGenerativeAI (gemini-1.5-flash)
    → StrOutputParser
    → CitationInjector (post-processing)
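
The fallback step in the chain above follows a simple pattern, sketched here with plain callables instead of real LLM clients. `QuotaExceededError` is a hypothetical stand-in for whatever rate-limit exception the primary provider raises. LangChain runnables also offer a `with_fallbacks` method that could fill this role inside an LCEL chain.

```python
from typing import Callable


class QuotaExceededError(Exception):
    """Hypothetical stand-in for a provider's quota or rate-limit error."""


def generate_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> tuple[str, str]:
    """Try the primary model; on a quota error, answer with the fallback instead.

    Returns the answer and a label saying which model produced it.
    """
    try:
        return primary(prompt), "primary"
    except QuotaExceededError:
        return fallback(prompt), "fallback"
```

Only the quota error is caught, so other failures (bad prompt, network bug) still surface instead of being silently rerouted.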

CitationInjector:

Parses [Doc:Page] markers from LLM output
Resolves each to the actual chunk's source_file + page_number + excerpt
Builds List[Citation] object for UI display
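
A minimal parsing sketch follows. This plan mentions both a `[Source: file, p.N]` form and a `[Doc:Page]` form; the code assumes the former. The `Citation` fields, the lookup keyed by `(source_file, page_number)`, and the `extract_citations` name are illustrative.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    source_file: str
    page_number: int
    excerpt: str


# Assumed marker shape: [Source: <filename>, p.<page>]
_MARKER = re.compile(r"\[Source:\s*([^,\]]+),\s*p\.(\d+)\]")


def extract_citations(
    answer: str,
    chunks_by_key: dict[tuple[str, int], str],
    excerpt_chars: int = 80,
) -> list[Citation]:
    """Resolve each distinct marker to a retrieved chunk; unknown markers are skipped."""
    citations: list[Citation] = []
    seen: set[tuple[str, int]] = set()
    for match in _MARKER.finditer(answer):
        key = (match.group(1).strip(), int(match.group(2)))
        if key in seen or key not in chunks_by_key:
            continue
        seen.add(key)
        citations.append(Citation(key[0], key[1], chunks_by_key[key][:excerpt_chars]))
    return citations
```

Skipping markers that do not map to a retrieved chunk keeps a hallucinated source from appearing in the citation panel.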

FaithfulnessGuard:

System prompt: "If the answer cannot be found in the provided context, respond with exactly: 'I could not find this in your documents.'"
Post-generation check: if answer references facts not in any retrieved chunk → flag
Confidence scoring based on retrieval score distribution

Tests

Test	Pass Criteria
`test_citation_injector_parses`	`[Doc:5]` → correct Citation object
`test_faithfulness_guard_refusal`	Out-of-context Q → refusal message
`test_answer_chain_mock`	Chain runs end-to-end with mocked LLM
`test_groq_fallback`	Groq quota error → Gemini client used
`test_streaming_output`	Chain yields token-by-token
`test_conversation_memory`	Last 5 turns preserved across queries

Phase 5 — Full UI, TTS & Access Control

Goal

Build the complete 4-tab Gradio UI, integrate Web Speech API for browser-native TTS, implement the multi-knowledge-base manager, add bcrypt password protection + HMAC share links, and build the analytics + audit log system.

Files Created

voicevault/kb/
├── kb_manager.py           # Create/list/delete knowledge bases
├── access_control.py       # bcrypt password, HMAC share links
└── audit_log.py            # Query logging to SQLite

voicevault/tts/
└── web_speech.py           # Web Speech API JS bridge

voicevault/storage/
└── sqlite_store.py         # Complete CRUD (extended from Phase 0)

ui/tabs/
├── ask_tab.py              # Full voice query tab
├── kb_tab.py               # Full KB manager tab
├── analytics_tab.py        # Charts + metrics tab
└── settings_tab.py         # All configurable parameters

ui/components/
├── citation_panel.py       # Citation highlighting component
└── audio_controls.py       # TTS playback controls

tests/
├── test_phase5.py          # UI component + access control tests
└── test_e2e.py             # Full end-to-end pipeline test

DOCS/
└── phase5_ui_access.md

Key Components

KBManager:

Creates per-KB directory: data/{kb_name}/chroma/, bm25.pkl, voicevault.db
Lists all KBs with metadata (doc count, chunk count, last updated)
Delete KB: removes directory + SQLite row

AccessControl:

Password hash: bcrypt with work factor 12
Share link: HMAC-SHA256 signed token with KB name + expiry
Token validation on every query to password-protected KB
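
The share-link scheme can be sketched with the stdlib `hmac` module. The token layout (base64 payload, a dot, base64 signature), the `|` separator, and both function names are assumptions; the 7-day default mirrors the expiry stated in the security checklist. bcrypt hashing is not shown because it needs a third-party package.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional


def make_share_token(
    kb_name: str,
    secret: bytes,
    ttl_seconds: int = 7 * 24 * 3600,
    now: Optional[float] = None,
) -> str:
    """Sign 'kb_name|expiry' with HMAC-SHA256 and return 'payload.signature'."""
    expires = int((time.time() if now is None else now) + ttl_seconds)
    payload = f"{kb_name}|{expires}".encode("utf-8")
    signature = hmac.new(secret, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode("ascii")
        + "."
        + base64.urlsafe_b64encode(signature).decode("ascii")
    )


def verify_share_token(
    token: str, secret: bytes, now: Optional[float] = None
) -> Optional[str]:
    """Return the KB name if the signature is valid and unexpired, else None."""
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:  # wrong number of parts or invalid base64
        return None
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):  # constant-time comparison
        return None
    kb_name, expires = payload.decode("utf-8").rsplit("|", 1)
    current = time.time() if now is None else now
    return kb_name if current < int(expires) else None
```

The payload is parsed only after the signature check passes, so untrusted input never reaches the parsing step.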

AuditLog:

Every query logs: session_id, kb_names, voice_query (anonymized), latency, timestamp
Viewable in Analytics tab
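
A sketch of an anonymized audit write, assuming a `query_log` table with the hypothetical columns shown. It stores a SHA-256 hash of the query instead of the raw text and uses a parameterized statement. A plain hash of a short query can still be guessed by brute force, so a keyed hash may be preferable in practice.

```python
import hashlib
import sqlite3
import time


def create_query_log(conn: sqlite3.Connection) -> None:
    """Create the audit table; the column names are illustrative assumptions."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS query_log (
               id INTEGER PRIMARY KEY,
               session_id TEXT NOT NULL,
               kb_names TEXT NOT NULL,
               query_hash TEXT NOT NULL,
               latency_ms REAL NOT NULL,
               created_at REAL NOT NULL
           )"""
    )


def log_query(
    conn: sqlite3.Connection,
    session_id: str,
    kb_names: list[str],
    voice_query: str,
    latency_ms: float,
) -> str:
    """Insert one audit row and return the stored hash of the query."""
    query_hash = hashlib.sha256(voice_query.encode("utf-8")).hexdigest()
    conn.execute(
        "INSERT INTO query_log (session_id, kb_names, query_hash, latency_ms, created_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (session_id, ",".join(kb_names), query_hash, latency_ms, time.time()),
    )
    conn.commit()
    return query_hash
```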

Web Speech API Bridge:

JavaScript injected via gr.HTML component
window.speechSynthesis.speak() triggered from Python via Gradio's JS bridge
Voice selector, rate slider, pitch slider
Pause/Resume/Restart controls

UI Tabs:

Ask tab: Mic button → live transcript → KB selector → streaming answer → citation panel → speak button
KB tab: Create KB form + document uploader (PDF/MD/HTML/DOCX) + progress bar + doc list
Analytics tab: Query volume chart + latency breakdown + top documents + Groq quota gauge
Settings tab: ASR model, voice settings, retrieval params, LLM params, chunking params

Tests

Test	Pass Criteria
`test_kb_create_delete`	KB directory created/removed correctly
`test_bcrypt_password`	Hash + verify round-trip
`test_hmac_share_link`	Token validates within expiry, fails after
`test_audit_log_write`	Query logged to SQLite correctly
`test_access_control_wrong_pw`	Wrong password → access denied
`test_e2e_pipeline`	PDF upload → query → cited answer (mocked LLM)

10. Quality Gates

Every phase must pass ALL gates before moving to the next phase:

Gate	Requirement
Zero import errors	`python -m pytest tests/ --co -q` exits 0
All tests pass	`pytest tests/test_phaseN.py` — 100% green
No bare except	No `except:` or `except Exception:` without logging
Type annotations	Every public function has full type hints
No unused imports	`pylint --disable=all --enable=W0611` passes
No secrets in code	No API keys, passwords, or tokens hardcoded
Pathlib throughout	No `os.path` usage in any module

11. Security Audit Checklist

No API keys committed to git (enforced by .gitignore + .env.example)
All file uploads validated: extension whitelist + MIME check + size limit
SQLite queries use parameterized statements (no f-string SQL)
bcrypt work factor ≥ 12 for password hashing
HMAC share tokens have expiry (default: 7 days)
trafilatura URL fetching: no SSRF — block private IP ranges
ChromaDB stored in non-public path (never served as static file)
BM25 pickle files: only loaded from trusted internal paths
Gradio app: file upload restricted to data/uploads/ sandbox directory
Audit log: voice queries anonymized before storage (hash, not raw text)
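
For the SSRF item, the address check itself can be done with the stdlib `ipaddress` module. The sketch below covers only that check: a real guard would first resolve the URL's hostname and test every resolved address before fetching. `is_blocked_address` is a hypothetical helper name.

```python
import ipaddress


def is_blocked_address(address: str) -> bool:
    """Return True for addresses a URL fetcher should refuse to contact.

    Input that is not a valid IP address is treated as blocked (fail closed).
    """
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return True
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )
```

Link-local addresses are included because cloud metadata endpoints such as `169.254.169.254` live in that range.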

12. Progress Tracker

Phase	Status	Tests	Docs
Phase 0 — Foundation	✅ Done	✅ 58/58	✅ phase0_foundation.md
Phase 1 — Ingestion	✅ Done	✅ 46/46	✅ phase1_ingestion.md
Phase 2 — Retrieval	✅ Done	✅ 33/33	✅ phase2_retrieval.md
Phase 3 — ASR	✅ Done	✅ 45/47 (2 skipped)	✅ phase3_asr.md
Phase 4 — Generation	✅ Done	✅ 72/72	✅ phase4_generation.md
Phase 5 — UI & Access	✅ Done	✅ 55/55	✅ phase5_ui_access.md

VoiceVault · Navnit Amrutharaj · navnita004@gmail.com · github.com/ninjacode911
