# VoiceVault — End-to-End Implementation Plan
**Author:** Navnit Amrutharaj
**Model:** VoiceVault v1.0 — Voice-First RAG Knowledge Agent
**Stack:** Whisper · LangChain · ChromaDB · Groq · Gradio
**Target:** $0/month · HuggingFace Spaces · 10 Weeks
**Plan Date:** March 2026

---

## Table of Contents
1. [Project Overview](#1-project-overview)
2. [Architecture Summary](#2-architecture-summary)
3. [Phase Map](#3-phase-map)
4. [Phase 0 — Project Foundation](#phase-0--project-foundation)
5. [Phase 1 — Document Ingestion Pipeline](#phase-1--document-ingestion-pipeline)
6. [Phase 2 — Hybrid Retrieval Engine](#phase-2--hybrid-retrieval-engine)
7. [Phase 3 — ASR & Voice Input](#phase-3--asr--voice-input)
8. [Phase 4 — Generation Chain & Citations](#phase-4--generation-chain--citations)
9. [Phase 5 — Full UI, TTS & Access Control](#phase-5--full-ui-tts--access-control)
10. [Quality Gates](#10-quality-gates)
11. [Security Audit Checklist](#11-security-audit-checklist)
12. [Progress Tracker](#12-progress-tracker)

---

## 1. Project Overview

VoiceVault is a **voice-first retrieval-augmented generation (RAG) knowledge agent** that enables users to:
- Speak questions into a browser microphone
- Have each question transcribed (Whisper), matched against retrieved passages, answered by an LLM, and read back aloud
- Reference private document collections (PDFs, Notion exports, Confluence, DOCX, MD)
- Receive fully cited answers anchored to source document + page + paragraph

**Core differentiator:** Hybrid BM25 + vector search with Reciprocal Rank Fusion (RRF) + cross-encoder reranking — demonstrating enterprise-grade retrieval depth that most RAG tutorials skip.

---

## 2. Architecture Summary

```
INGESTION PATH (one-time per document set)
  User uploads PDFs / HTML / DOCX / MD
      ↓
  DocumentParser → text extraction (PyMuPDF, BS4, python-docx)
      ↓
  SemanticChunker → sentence-aware chunks (spaCy + cosine boundary)
      ↓
  IndexBuilder → ChromaDB (vectors) + BM25 (keywords) + SQLite (metadata)

QUERY PATH (real-time, per user question)
  Browser mic → Gradio Audio → Whisper Large-v3 (HuggingFace GPU)
      ↓
  QueryPreprocessor → cleanup + intent class + language detect
      ↓
  HybridRetriever → BM25 top-20 + Vector top-20 → RRF merge → CrossEncoder top-5
      ↓
  LangChain LCEL → Groq Llama-3.1-70B (stream) / Gemini Flash (fallback)
      ↓
  CitationInjector → [Source: filename, p.N] inline citations
      ↓
  Gradio UI (text + highlight citations) + Web Speech API (spoken answer)
```

---

## 3. Phase Map

| Phase | Name | Weeks | Core Deliverables |
|-------|------|-------|-------------------|
| **0** | Project Foundation | 0 | Scaffold, config, models, SQLite schema, Gradio skeleton |
| **1** | Document Ingestion | 1–2 | Parser, semantic chunker, ChromaDB + BM25 + SQLite indexer |
| **2** | Hybrid Retrieval | 3 | BM25 + vector + RRF + cross-encoder + diversity filter |
| **3** | ASR & Voice Input | 4 | Whisper Large-v3, Distil fallback, query preprocessor |
| **4** | Generation & Citations | 5 | LangChain LCEL, Groq, Gemini fallback, faithfulness guard |
| **5** | Full UI & Access Control | 6–8 | 4-tab Gradio UI, Web Speech TTS, multi-KB, bcrypt, audit log |

---

## Phase 0 — Project Foundation

### Goal
Establish the complete project skeleton — directory structure, dependencies, centralized config, Pydantic data models, SQLite schema, and a working 4-tab Gradio scaffold — before any business logic is written.

### Files Created
```
voicevault/
├── app.py                          # Gradio Blocks entry point
├── config.py                       # Pydantic-settings centralized config
├── requirements.txt                # All project dependencies (pinned)
├── .env.example                    # Environment variable template
├── voicevault/
│   ├── __init__.py                 # Package init + version
│   ├── models.py                   # Pydantic data models (all schemas)
│   ├── asr/__init__.py
│   ├── ingestion/__init__.py
│   ├── retrieval/__init__.py
│   ├── generation/__init__.py
│   ├── kb/__init__.py
│   ├── tts/__init__.py
│   └── storage/
│       ├── __init__.py
│       └── sqlite_store.py         # Schema creation + DB init
├── ui/
│   ├── __init__.py
│   ├── tabs/
│   │   ├── __init__.py
│   │   ├── ask_tab.py              # Placeholder — voice query tab
│   │   ├── kb_tab.py               # Placeholder — KB manager tab
│   │   ├── analytics_tab.py        # Placeholder — analytics tab
│   │   └── settings_tab.py         # Placeholder — settings tab
│   └── components/
│       ├── __init__.py
│       ├── citation_panel.py       # Placeholder — citation display
│       └── audio_controls.py       # Placeholder — TTS controls
├── tests/
│   ├── __init__.py
│   ├── conftest.py                 # Pytest fixtures
│   └── test_phase0.py              # Foundation smoke tests
├── data/                           # Runtime data (gitignored)
└── DOCS/
    └── phase0_foundation.md        # Phase 0 documentation
```

### Key Decisions
- **pydantic-settings** for type-safe env var loading (no raw `os.environ` calls)
- **pathlib.Path** throughout — cross-platform, no `os.path`
- **SQLite stdlib** for metadata — zero-dependency, portable, no server
- **Gradio 4.x Blocks** for UI — native HuggingFace Spaces support
- **`__version__` sentinel** in `voicevault/__init__.py` for release tracking
- **Data models locked early** — prevents schema drift across phases
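
The stdlib SQLite decision can be illustrated with a minimal sketch. Only the three table names (`knowledge_bases`, `documents`, `query_log`) come from this plan's Phase 0 tests; every column, and the helpers `init_db` and `table_names`, are assumptions made for illustration rather than the project's actual schema.

```python
import sqlite3
from pathlib import Path
from typing import Set, Union

# Hypothetical column layout; only the table names are taken from the plan.
SCHEMA = """
CREATE TABLE IF NOT EXISTS knowledge_bases (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS documents (
    id INTEGER PRIMARY KEY,
    kb_id INTEGER NOT NULL REFERENCES knowledge_bases(id),
    filename TEXT NOT NULL,
    sha256 TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS query_log (
    id INTEGER PRIMARY KEY,
    session_id TEXT NOT NULL,
    query_hash TEXT NOT NULL,
    latency_ms REAL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def init_db(db_path: Union[str, Path]) -> sqlite3.Connection:
    """Create the schema if it is missing and return an open connection."""
    conn = sqlite3.connect(str(db_path))
    conn.executescript(SCHEMA)
    return conn


def table_names(conn: sqlite3.Connection) -> Set[str]:
    """List the user tables currently present in the database."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return {name for (name,) in rows}
```

Because every statement uses `IF NOT EXISTS`, calling `init_db` repeatedly on the same file is harmless, which suits an app that re-runs initialization on each start.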

### Tests
| Test | Description | Pass Criteria |
|------|-------------|---------------|
| `test_config_loads` | Config instantiates without exceptions | No exception |
| `test_env_defaults` | Default values are correct types | All fields pass type check |
| `test_db_init` | SQLite schema creates 3 tables | Tables `knowledge_bases`, `documents`, `query_log` exist |
| `test_data_dirs` | Data directory structure is created | Dirs exist after init |
| `test_models_instantiate` | All Pydantic models can be instantiated | No validation errors |
| `test_gradio_builds` | Gradio demo object builds without error | `gr.Blocks` object created |

### Documentation
→ See `DOCS/phase0_foundation.md`

---

## Phase 1 — Document Ingestion Pipeline

### Goal
Build the complete document ingestion pipeline: parse any supported document format, semantically chunk the text, generate embeddings, build the BM25 index, store everything in ChromaDB + SQLite, and implement SHA-256-based deduplication.

### Files Created
```
voicevault/ingestion/
├── document_parser.py      # PDF, HTML, DOCX, MD, TXT, URL parsers
├── semantic_chunker.py     # spaCy + cosine-similarity boundary chunker
└── index_builder.py        # ChromaDB + BM25 + SQLite indexer + dedup

voicevault/storage/
├── sqlite_store.py         # Full CRUD: KB, document, chunk metadata
└── chroma_store.py         # ChromaDB collection management

tests/
└── test_phase1.py          # Ingestion unit + integration tests

DOCS/
└── phase1_ingestion.md
```

### Key Components

**DocumentParser** — Multi-format dispatcher:
- PDF: `PyMuPDF` (fitz) — preserves page numbers, extracts tables as text
- HTML: `BeautifulSoup4` — Notion/Confluence exports, preserves heading hierarchy
- DOCX: `python-docx` — heading-aware extraction
- Markdown: `markdown-it-py` — heading hierarchy → section metadata
- Plain text: paragraph-level splitting
- URL: `trafilatura` — clean article extraction from any public URL
- Scanned PDF fallback: `pytesseract` OCR when no text layer found

**SemanticChunker** — Boundary detection:
- `spaCy en_core_web_sm` sentence tokenization
- Cosine similarity between adjacent sentence embeddings
- New chunk when similarity < 0.5 (configurable threshold)
- Target: 400–600 tokens per chunk, 50-token overlap
- Special handling: tables as atomic units, code blocks atomic, lists kept together
- Metadata per chunk: source_file, page_number, section_heading, chunk_index, timestamp
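
The boundary rule above (start a new chunk when adjacent-sentence similarity drops below the threshold) can be sketched without spaCy or a real embedding model. The function name `chunk_by_similarity`, the injected `embed` callable, and the whitespace-based token count are assumptions for illustration; the 50-token overlap and the atomic handling of tables and code blocks are left out.

```python
import math
from typing import Callable, List, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def chunk_by_similarity(
    sentences: Sequence[str],
    embed: Callable[[str], Vector],
    threshold: float = 0.5,
    max_tokens: int = 600,
) -> List[str]:
    """Group consecutive sentences, splitting on topic shifts or size limits."""
    chunks: List[str] = []
    current: List[str] = []
    prev_vec: Optional[Vector] = None
    for sentence in sentences:
        vec = embed(sentence)
        # Whitespace word count stands in for a real tokenizer here.
        size = sum(len(s.split()) for s in current)
        topic_shift = prev_vec is not None and cosine(prev_vec, vec) < threshold
        too_long = size + len(sentence.split()) > max_tokens
        if current and (topic_shift or too_long):
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
        prev_vec = vec
    if current:
        chunks.append(" ".join(current))
    return chunks
```

In the real pipeline `embed` would presumably wrap the sentence-embedding model and `sentences` would come from the spaCy sentence tokenizer; passing them in keeps the boundary logic testable on its own.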

**IndexBuilder** — Dual-index construction:
- SHA-256 hash of chunk text → deduplication (skip re-indexed unchanged content)
- `sentence-transformers all-MiniLM-L6-v2` → 384-dim embeddings → ChromaDB
- `rank_bm25` BM25Okapi index → serialized to `bm25.pkl`
- SQLite metadata: `chunks` table linking every chunk to its source doc
- Incremental update: only new/changed chunks re-embedded
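
A minimal sketch of the SHA-256 deduplication step follows. It assumes the set of already-indexed hashes is available in memory (in the plan it would presumably be read from the SQLite `chunks` table); `chunk_hash` and `select_new_chunks` are hypothetical names.

```python
import hashlib
from typing import Iterable, List, Set, Tuple


def chunk_hash(text: str) -> str:
    """Stable content fingerprint: hex SHA-256 of the UTF-8 chunk text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_new_chunks(
    chunks: Iterable[str], seen_hashes: Set[str]
) -> List[Tuple[str, str]]:
    """Return (hash, text) pairs for chunks not indexed yet.

    seen_hashes is updated in place so repeated calls stay consistent.
    """
    fresh: List[Tuple[str, str]] = []
    for text in chunks:
        digest = chunk_hash(text)
        if digest in seen_hashes:
            continue  # unchanged content, skip re-embedding
        seen_hashes.add(digest)
        fresh.append((digest, text))
    return fresh
```

Only the pairs returned by `select_new_chunks` would need embedding and indexing, which is what makes a re-upload of an unchanged document cheap.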

### Tests
| Test | Pass Criteria |
|------|---------------|
| `test_pdf_parse` | Extracts text with correct page numbers |
| `test_html_parse` | Extracts headings and paragraphs from Notion HTML |
| `test_docx_parse` | Extracts text from DOCX with heading metadata |
| `test_semantic_chunker` | Chunks respect sentence boundaries, 100–600 tokens |
| `test_deduplication` | Same doc uploaded twice → chunks not duplicated |
| `test_bm25_build` | BM25 index serializes and reloads correctly |
| `test_chroma_store` | Vectors stored and queryable in ChromaDB |
| `test_sqlite_metadata` | All chunk metadata persisted to SQLite |
| `test_incremental_update` | Only new chunks indexed on re-upload |

---

## Phase 2 — Hybrid Retrieval Engine

### Goal
Implement the hybrid BM25 + dense vector retrieval pipeline with Reciprocal Rank Fusion merging, cross-encoder reranking, diversity filtering, query expansion, and context window assembly.

### Files Created
```
voicevault/retrieval/
├── bm25_retriever.py       # rank_bm25 keyword search
├── vector_retriever.py     # ChromaDB semantic search
├── hybrid_retriever.py     # RRF merge + cross-encoder + diversity filter
└── context_builder.py      # Formats top-k chunks for LLM prompt

tests/
└── test_phase2.py          # Retrieval unit + benchmark tests

DOCS/
└── phase2_retrieval.md
```

### Key Components

**BM25Retriever:**
- Loads pre-built BM25 index from disk
- Tokenizes query, scores all chunks, returns top-20

**VectorRetriever:**
- Encodes query with `all-MiniLM-L6-v2`
- ChromaDB cosine similarity query → top-20

**HybridRetriever (RRF core):**
```
query → [QueryExpander: 2 paraphrases]
     → BM25 top-20 + Vector top-20 (parallel)
     → RRF merge (k=60): score = Σ 1/(k + rank)
     → CrossEncoder ms-marco-MiniLM-L12-v2 rescores top-20
     → DiversityFilter: max 2 chunks from same page
     → Final top-5 chunks
```
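
The RRF merge and the diversity filter from the pipeline above can be sketched in plain Python. Ranks are assumed to be 1-based, ties are broken by chunk id for determinism, and the names `rrf_merge`, `diversity_filter`, and `page_of` are illustrative; query expansion and cross-encoder rescoring are omitted.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def rrf_merge(rankings: Sequence[Sequence[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse ranked id lists with score = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):  # assumed 1-based ranks
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; chunk id breaks ties so the order is reproducible.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def diversity_filter(
    ranked_ids: Sequence[str],
    page_of: Dict[str, Hashable],
    max_per_page: int = 2,
    top_n: int = 5,
) -> List[str]:
    """Keep at most max_per_page chunks per page, preserving rank order."""
    kept: List[str] = []
    per_page: Dict[Hashable, int] = defaultdict(int)
    for chunk_id in ranked_ids:
        page = page_of[chunk_id]
        if per_page[page] >= max_per_page:
            continue
        per_page[page] += 1
        kept.append(chunk_id)
        if len(kept) == top_n:
            break
    return kept
```

A chunk that appears in both the BM25 and the vector list collects two reciprocal-rank terms, so agreement between retrievers outweighs a single high rank in one of them.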

**ContextBuilder:**
- Formats chunks as: `[Source: filename, p.N | Section: heading]\n{text}`
- Appends conversation history (last 5 turns)
- Returns context string ready for LLM prompt
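
The formatting rule above might be implemented roughly as follows. The `Chunk` dataclass, the `build_context` name, and the `User:`/`Assistant:` layout of the history section are assumptions; only the `[Source: ... | Section: ...]` header and the five-turn limit come from this plan.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Chunk:
    """Hypothetical minimal chunk record used only for this sketch."""
    source_file: str
    page_number: int
    section_heading: str
    text: str


def build_context(
    chunks: Sequence[Chunk],
    history: Sequence[Tuple[str, str]],
    max_turns: int = 5,
) -> str:
    """Render chunks with source headers, then append recent (question, answer) turns."""
    blocks: List[str] = [
        f"[Source: {c.source_file}, p.{c.page_number} | Section: {c.section_heading}]\n{c.text}"
        for c in chunks
    ]
    parts = ["\n\n".join(blocks)]
    recent = list(history)[-max_turns:]  # keep only the last max_turns turns
    if recent:
        turns = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in recent)
        parts.append("Conversation history:\n" + turns)
    return "\n\n".join(parts)
```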

### Tests
| Test | Pass Criteria |
|------|---------------|
| `test_bm25_retriever` | Returns ranked results for keyword query |
| `test_vector_retriever` | Returns semantically relevant results |
| `test_rrf_merge` | RRF scores computed correctly for known ranks |
| `test_cross_encoder_rerank` | Re-ranked order differs from RRF order (improvement) |
| `test_diversity_filter` | Max 2 chunks per page in final results |
| `test_hybrid_recall` | Recall@5 ≥ 0.80 on 50-Q benchmark dataset |
| `test_context_builder` | Output is valid string with source citations |
| `test_query_expansion` | Returns 2 paraphrase variants |

---

## Phase 3 — ASR & Voice Input

### Goal
Integrate Whisper Large-v3 for high-quality speech-to-text transcription, with Distil-Whisper CPU fallback, browser microphone capture via Gradio Audio, and a query preprocessor that cleans transcripts and classifies query intent.

### Files Created
```
voicevault/asr/
├── whisper_transcriber.py  # Whisper Large-v3 + Distil-Whisper fallback
└── query_preprocessor.py   # Cleanup, intent classification, language detect

tests/
└── test_phase3.py          # ASR unit tests + WER evaluation

DOCS/
└── phase3_asr.md
```

### Key Components

**WhisperTranscriber:**
- Primary: `openai/whisper-large-v3` (HuggingFace GPU pipeline)
- Fallback: `distil-whisper/distil-large-v3` (runs on CPU, roughly 6× faster, WER within 1% of Large-v3)
- VAD pre-check: reject audio < 1s or silent audio
- Returns: `transcript`, `language`, `confidence`, `model_used`, `latency_ms`
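
The pre-check that rejects short or silent audio could look like the sketch below. It is a simple energy threshold, not a trained VAD model; the `check_audio` name, the RMS threshold of 0.01, and the assumption that samples are floats in the range -1 to 1 are all illustrative. Raising `ValueError` matches the `test_vad_short_audio` criterion in this plan.

```python
import math
from typing import Sequence


def check_audio(
    samples: Sequence[float],
    sample_rate: int,
    min_seconds: float = 1.0,
    silence_rms: float = 0.01,  # assumed threshold for float samples in [-1, 1]
) -> float:
    """Return the clip duration in seconds, or raise ValueError if unusable."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    duration = len(samples) / sample_rate
    if not samples or duration < min_seconds:
        raise ValueError(f"audio too short: {duration:.2f}s < {min_seconds:.2f}s")
    # Root-mean-square energy as a crude silence detector.
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms < silence_rms:
        raise ValueError("audio appears to be silent")
    return duration
```

Running this before the transcriber avoids spending model time on clips that cannot contain a usable question.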

**QueryPreprocessor:**
- Lowercase normalization, punctuation repair
- Filler word removal: um, uh, like, you know
- Language detection: `langdetect` library
- Query type classification:
  - `factual` — "What is...", "Who...", "When..."
  - `summary` — "Summarise...", "Give me an overview..."
  - `compare` — "Compare...", "What's the difference..."
- Routes to different retrieval strategies per type
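
A rule-based sketch of the cleanup and classification steps is shown below. The regular expressions, the function names, and the choice to treat anything unmatched as `factual` are assumptions; punctuation repair and language detection are not covered, and removing "like" this bluntly would also strip legitimate uses of the word.

```python
import re

# Filler words from the plan, with an optional comma on either side.
FILLERS = re.compile(r"(?:,\s*)?\b(?:um+|uh+|like|you know)\b,?", re.IGNORECASE)

# Checked in order; the first matching pattern wins.
INTENT_PATTERNS = [
    ("compare", re.compile(r"^(?:compare\b|what'?s the difference|what is the difference)")),
    ("summary", re.compile(r"^(?:summari[sz]e\b|give me an overview)")),
]


def clean_transcript(text: str) -> str:
    """Lowercase, drop filler words, and collapse leftover whitespace."""
    text = FILLERS.sub(" ", text.lower())
    return re.sub(r"\s+", " ", text).strip(" ,")


def classify_intent(query: str) -> str:
    """Return 'compare', 'summary', or the assumed default 'factual'."""
    cleaned = clean_transcript(query)
    for label, pattern in INTENT_PATTERNS:
        if pattern.search(cleaned):
            return label
    return "factual"
```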

### Tests
| Test | Pass Criteria |
|------|---------------|
| `test_preprocessor_cleanup` | Filler words removed, normalized |
| `test_intent_factual` | "What is X?" → type=factual |
| `test_intent_summary` | "Summarise the report" → type=summary |
| `test_intent_compare` | "Compare A and B" → type=compare |
| `test_language_detection` | English text → "en" |
| `test_vad_short_audio` | < 1s audio raises ValueError |
| `test_whisper_mock` | Transcriber returns correct schema with mocked model |

---

## Phase 4 — Generation Chain & Citations

### Goal
Build the full LangChain LCEL generation chain: Groq Llama-3.1-70B as primary LLM with streaming, Gemini 1.5 Flash as automatic fallback, citation injection with [Source: file, p.N] protocol, faithfulness guard for out-of-context detection, and conversation memory.

### Files Created
```
voicevault/generation/
├── answer_chain.py         # LangChain LCEL + Groq + Gemini fallback
├── citation_injector.py    # Maps [Doc:Page] citations to source chunks
└── faithfulness_guard.py   # Out-of-context detection

tests/
└── test_phase4.py          # Generation unit tests

DOCS/
└── phase4_generation.md
```

### Key Components

**AnswerChain (LCEL):**
```
context_string + query + history
    → PromptTemplate (system: citation protocol + faithfulness instructions)
    → ChatGroq (llama-3.1-70b-versatile, streaming, temp=0.1)
         on quota error → ChatGoogleGenerativeAI (gemini-1.5-flash)
    → StrOutputParser
    → CitationInjector (post-processing)
```
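
The quota-error fallback in the chain above can be sketched independently of any LLM client. `QuotaExceededError` and `generate_with_fallback` are hypothetical stand-ins: the real Groq client raises its own rate-limit exception type, and in an LCEL chain the same behavior is commonly wired with the runnable `with_fallbacks` method rather than a hand-written `try`/`except`.

```python
from typing import Callable, Tuple


class QuotaExceededError(RuntimeError):
    """Stand-in for whatever rate-limit error the primary client raises."""


def generate_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> Tuple[str, str]:
    """Try the primary model; switch to the fallback only on a quota error.

    Returns (answer, which_model) so callers can log which path was taken.
    """
    try:
        return primary(prompt), "primary"
    except QuotaExceededError:
        return fallback(prompt), "fallback"
```

Catching only the quota error, rather than every exception, keeps genuine bugs in the primary path visible instead of silently rerouting them.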

**CitationInjector:**
- Parses `[Doc:Page]` markers from LLM output
- Resolves each to the actual chunk's source_file + page_number + excerpt
- Builds `List[Citation]` object for UI display
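
Marker resolution might work along the lines below. The exact marker grammar is not fully specified in this plan, so the sketch assumes the `[Source: filename, p.N]` form from the architecture summary; the `[Doc:Page]` shorthand would need a different regular expression. `RetrievedChunk`, `Citation`, and `extract_citations` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Sequence, Set, Tuple


@dataclass(frozen=True)
class RetrievedChunk:
    source_file: str
    page_number: int
    text: str


@dataclass(frozen=True)
class Citation:
    source_file: str
    page_number: int
    excerpt: str


# Assumed marker grammar: [Source: <filename>, p.<page>]
MARKER = re.compile(r"\[Source:\s*(?P<file>[^,\]]+),\s*p\.(?P<page>\d+)\]")


def extract_citations(
    answer: str, chunks: Sequence[RetrievedChunk], excerpt_chars: int = 120
) -> List[Citation]:
    """Resolve each marker to a retrieved chunk; skip unknown or repeated ones."""
    by_key: Dict[Tuple[str, int], RetrievedChunk] = {}
    for chunk in chunks:
        by_key.setdefault((chunk.source_file, chunk.page_number), chunk)
    citations: List[Citation] = []
    seen: Set[Tuple[str, int]] = set()
    for match in MARKER.finditer(answer):
        key = (match.group("file").strip(), int(match.group("page")))
        chunk = by_key.get(key)
        if chunk is None or key in seen:
            continue  # marker points at nothing retrieved, or was already cited
        seen.add(key)
        citations.append(Citation(chunk.source_file, chunk.page_number, chunk.text[:excerpt_chars]))
    return citations
```

Dropping markers that match no retrieved chunk means a citation invented by the model never reaches the UI as if it were real.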

**FaithfulnessGuard:**
- System prompt: "If the answer cannot be found in the provided context, respond with exactly: 'I could not find this in your documents.'"
- Post-generation check: if answer references facts not in any retrieved chunk → flag
- Confidence scoring based on retrieval score distribution

### Tests
| Test | Pass Criteria |
|------|---------------|
| `test_citation_injector_parses` | `[Doc:5]` → correct Citation object |
| `test_faithfulness_guard_refusal` | Out-of-context Q → refusal message |
| `test_answer_chain_mock` | Chain runs end-to-end with mocked LLM |
| `test_groq_fallback` | Groq quota error → Gemini client used |
| `test_streaming_output` | Chain yields token-by-token |
| `test_conversation_memory` | Last 5 turns preserved across queries |

---

## Phase 5 — Full UI, TTS & Access Control

### Goal
Build the complete 4-tab Gradio UI, integrate Web Speech API for browser-native TTS, implement the multi-knowledge-base manager, add bcrypt password protection + HMAC share links, and build the analytics + audit log system.

### Files Created
```
voicevault/kb/
├── kb_manager.py           # Create/list/delete knowledge bases
├── access_control.py       # bcrypt password, HMAC share links
└── audit_log.py            # Query logging to SQLite

voicevault/tts/
└── web_speech.py           # Web Speech API JS bridge

voicevault/storage/
└── sqlite_store.py         # Complete CRUD (extended from Phase 0)

ui/tabs/
├── ask_tab.py              # Full voice query tab
├── kb_tab.py               # Full KB manager tab
├── analytics_tab.py        # Charts + metrics tab
└── settings_tab.py         # All configurable parameters

ui/components/
├── citation_panel.py       # Citation highlighting component
└── audio_controls.py       # TTS playback controls

tests/
├── test_phase5.py          # UI component + access control tests
└── test_e2e.py             # Full end-to-end pipeline test

DOCS/
└── phase5_ui_access.md
```

### Key Components

**KBManager:**
- Creates per-KB directory: `data/{kb_name}/chroma/`, `bm25.pkl`, `voicevault.db`
- Lists all KBs with metadata (doc count, chunk count, last updated)
- Delete KB: removes directory + SQLite row

**AccessControl:**
- Password hash: `bcrypt` with work factor 12
- Share link: `HMAC-SHA256` signed token with KB name + expiry
- Token validation on every query to password-protected KB
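
A share-link token signed with HMAC-SHA256 can be built from the standard library alone. The token layout (`base64(payload).base64(signature)` with a `kb_name|expiry` payload) and both function names are assumptions for this sketch; the 7-day default follows the security checklist in this plan, and bcrypt hashing is not shown.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional


def make_share_token(
    kb_name: str, secret: bytes, ttl_seconds: int = 7 * 24 * 3600, now: Optional[float] = None
) -> str:
    """Sign 'kb_name|expiry' and return payload and signature, base64url-encoded."""
    expires = int((time.time() if now is None else now) + ttl_seconds)
    payload = f"{kb_name}|{expires}".encode("utf-8")
    signature = hmac.new(secret, payload, hashlib.sha256).digest()
    return ".".join(base64.urlsafe_b64encode(part).decode("ascii") for part in (payload, signature))


def verify_share_token(token: str, secret: bytes, now: Optional[float] = None) -> Optional[str]:
    """Return the KB name if the token is authentic and unexpired, else None."""
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:  # wrong part count or bad base64 padding
        return None
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):  # constant-time comparison
        return None
    try:
        kb_name, expires = payload.decode("utf-8").rsplit("|", 1)
        expires_at = int(expires)
    except ValueError:
        return None
    current = time.time() if now is None else now
    return kb_name if current < expires_at else None
```

The signature is checked before the payload is parsed, so unauthenticated input never reaches the parsing logic, and `hmac.compare_digest` avoids leaking timing information about the expected signature.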

**AuditLog:**
- Each query is logged with: session_id, kb_names, voice_query (anonymized), latency, timestamp
- Viewable in Analytics tab

**Web Speech API Bridge:**
- JavaScript injected via `gr.HTML` component
- `window.speechSynthesis.speak()` triggered from Python via Gradio's JS bridge
- Voice selector, rate slider, pitch slider
- Pause/Resume/Restart controls

**UI Tabs:**
- **Ask tab:** Mic button → live transcript → KB selector → streaming answer → citation panel → speak button
- **KB tab:** Create KB form + document uploader (PDF/MD/HTML/DOCX) + progress bar + doc list
- **Analytics tab:** Query volume chart + latency breakdown + top documents + Groq quota gauge
- **Settings tab:** ASR model, voice settings, retrieval params, LLM params, chunking params

### Tests
| Test | Pass Criteria |
|------|---------------|
| `test_kb_create_delete` | KB directory created/removed correctly |
| `test_bcrypt_password` | Hash + verify round-trip |
| `test_hmac_share_link` | Token validates within expiry, fails after |
| `test_audit_log_write` | Query logged to SQLite correctly |
| `test_access_control_wrong_pw` | Wrong password → access denied |
| `test_e2e_pipeline` | PDF upload → query → cited answer (mocked LLM) |

---

## 10. Quality Gates

Every phase must pass ALL gates before moving to the next phase:

| Gate | Requirement |
|------|-------------|
| **Zero import errors** | `python -m pytest tests/ --co -q` exits 0 |
| **All tests pass** | `pytest tests/test_phaseN.py` — 100% green |
| **No bare except** | No `except:` or `except Exception:` without logging |
| **Type annotations** | Every public function has full type hints |
| **No unused imports** | `pylint --disable=all --enable=W0611` passes |
| **No secrets in code** | No API keys, passwords, or tokens hardcoded |
| **Pathlib throughout** | No `os.path` usage in any module |

---

## 11. Security Audit Checklist

- [ ] No API keys committed to git (enforced by .gitignore + .env.example)
- [ ] All file uploads validated: extension whitelist + MIME check + size limit
- [ ] SQLite queries use parameterized statements (no f-string SQL)
- [ ] bcrypt work factor ≥ 12 for password hashing
- [ ] HMAC share tokens have expiry (default: 7 days)
- [ ] `trafilatura` URL fetching: no SSRF — block private IP ranges
- [ ] ChromaDB stored in non-public path (never served as static file)
- [ ] BM25 pickle files: only loaded from trusted internal paths
- [ ] Gradio app: file upload restricted to `data/uploads/` sandbox directory
- [ ] Audit log: voice queries anonymized before storage (hash, not raw text)
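
The SSRF item above (block private IP ranges before fetching a URL) can be approximated with the standard `ipaddress` module. This is a sketch under assumptions: `is_safe_url`, `is_public_ip`, and the injectable `resolver` are hypothetical names, and a complete defense would also have to connect to the exact address that was validated, since DNS answers can change between the check and the fetch.

```python
import ipaddress
import socket
from typing import Callable, List
from urllib.parse import urlparse


def is_public_ip(address: str) -> bool:
    """True only for addresses outside private, loopback, and similar ranges."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return False  # unparsable address: treat as unsafe
    return not (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_multicast or ip.is_reserved or ip.is_unspecified
    )


def resolve_host(host: str) -> List[str]:
    """Resolve a hostname to all of its IP addresses via the system resolver."""
    return sorted({info[4][0] for info in socket.getaddrinfo(host, None)})


def is_safe_url(url: str, resolver: Callable[[str], List[str]] = resolve_host) -> bool:
    """Allow only http(s) URLs whose host resolves exclusively to public IPs."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        addresses = resolver(parsed.hostname)
    except (OSError, KeyError):  # resolution failed or host unknown
        return False
    return bool(addresses) and all(is_public_ip(a) for a in addresses)
```

Requiring every resolved address to be public, rather than any one of them, closes the case where a hostname maps to both a public and an internal address.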

---

## 12. Progress Tracker

| Phase | Status | Tests | Docs |
|-------|--------|-------|------|
| Phase 0 — Foundation | ✅ Done | ✅ 58/58 | ✅ phase0_foundation.md |
| Phase 1 — Ingestion | ✅ Done | ✅ 46/46 | ✅ phase1_ingestion.md |
| Phase 2 — Retrieval | ✅ Done | ✅ 33/33 | ✅ phase2_retrieval.md |
| Phase 3 — ASR | ✅ Done | ✅ 45/47 (2 skipped) | ✅ phase3_asr.md |
| Phase 4 — Generation | ✅ Done | ✅ 72/72 | ✅ phase4_generation.md |
| Phase 5 — UI & Access | ✅ Done | ✅ 55/55 | ✅ phase5_ui_access.md |

---

*VoiceVault · Navnit Amrutharaj · navnita004@gmail.com · github.com/ninjacode911*