title: VoiceVault
emoji: 🎙️
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: other

VoiceVault

Voice-First RAG Knowledge Agent

Speak to your documents. Get cited answers back.

Live Demo → | Documentation → | Project Plan →

Overview

VoiceVault is a production-grade, voice-first Retrieval-Augmented Generation (RAG) system built entirely from scratch. It enables users to record or type questions and receive answers grounded in their own private document collections — with inline citations pointing back to the exact source, page, and paragraph.

The project was built in phases (Phase 0 through Phase 6) over several weeks. It includes a test suite of 328 tests, security practices such as bcrypt password hashing, parameterized SQL, SHA-256-hashed audit logs, and SSRF prevention, and deployment to Hugging Face Spaces via Docker.

What makes this different from typical RAG demos:

Hybrid retrieval — BM25 keyword search + semantic vector search, fused with Reciprocal Rank Fusion (RRF) + cross-encoder reranking. Most tutorials use only one retrieval method.
Voice-native pipeline — Groq Whisper API for ~300ms cloud transcription with local Whisper fallback; Web Speech API for TTS output.
Faithfulness guard — Detects when the LLM cannot answer from retrieved context and returns a grounded refusal instead of hallucinating.
Multi-KB support — Multiple independent knowledge bases, each optionally password-protected.
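
To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion in Python. It is an illustration only, not the VoiceVault implementation: the function name `rrf_fuse` and the chunk IDs are hypothetical, and k=60 is taken from the architecture diagram in this README. Each chunk scores 1 / (k + rank) in every list it appears in, and the scores are summed.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of chunk IDs with Reciprocal Rank Fusion.

    rankings: iterable of lists, each ordered best-first.
    Returns chunk IDs ordered by fused score, best-first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from the keyword and vector retrievers.
bm25_hits = ["c3", "c1", "c7"]
vector_hits = ["c1", "c9", "c3"]
fused = rrf_fuse([bm25_hits, vector_hits])
print(fused)
```

Chunks that appear in both lists (`c1`, `c3`) rise above chunks found by only one retriever, which is the point of fusing before the cross-encoder rerank.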

Screenshots

Ask VoiceVault — Voice Query Interface

Record your question via microphone or type it. The mic button pulses when recording.

Ask VoiceVault — main voice query interface with dark glassmorphism UI

Knowledge Base Management

Create named knowledge bases, upload documents (PDF, DOCX, HTML, MD, TXT), and manage them.

Knowledge Bases panel — empty state with New Knowledge Base button

Analytics Dashboard

Real-time query statistics: total queries, average latency, citation counts, and daily breakdowns.

Full App in Action

A populated knowledge base (358 chunks from 1 document) and a live conversation with the RAG pipeline.

Full VoiceVault app with a knowledge base and active conversation

Architecture

INGESTION PATH (one-time per document set)
──────────────────────────────────────────────────────
  User uploads PDF / HTML / DOCX / MD / TXT
      │
      ▼
  DocumentParser         →  text + metadata per page
      │                     (PyMuPDF, BS4, python-docx)
      ▼
  SemanticChunker        →  sentence-aware chunks
      │                     (spaCy sentences + cosine boundary)
      ▼
  IndexBuilder           →  ChromaDB (vector) + BM25 (keyword)
                             + SQLite (metadata)

QUERY PATH (real-time, per question)
──────────────────────────────────────────────────────
  Browser mic → WAV → POST /api/transcribe
      │
      ▼
  GroqTranscriber        →  Groq Whisper API (~300ms)
      │                     [fallback: local Whisper CPU]
      ▼
  QueryPreprocessor      →  filler removal, intent classification
      │                     (factual / summary / compare)
      ▼
  HybridRetriever        →  BM25 top-20 + Vector top-20
      │                     → RRF merge (k=60)
      │                     → CrossEncoder rerank (ms-marco-MiniLM-L12-v2)
      │                     → diversity filter (max 2 chunks/page)
      ▼
  ContextBuilder         →  formatted context with [Source:N] markers
      ▼
  LangChain LCEL         →  Groq Llama-3.1-70B (primary)
      │                     [fallback: Gemini 1.5 Flash]
      ▼
  FaithfulnessGuard      →  refusal detection, confidence scoring
      │
  CitationInjector       →  resolve [Source:N] → filename + page
      ▼
  JSON response          →  answer + citations + confidence + tts_text
      │
      ▼
  SPA Frontend           →  chat display + Web Speech API TTS
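
The diversity filter in the query path (max 2 chunks/page) can be sketched as follows. This is an assumption-laden illustration: the chunk dictionary fields (`id`, `source`, `page`) and the function name `diversity_filter` are hypothetical, and the `top_k` default of 5 simply mirrors the documented `FINAL_TOP_K` default.

```python
from collections import Counter


def diversity_filter(ranked_chunks, max_per_page=2, top_k=5):
    """Walk reranked chunks best-first, keeping at most max_per_page
    chunks from any single (source, page) pair, up to top_k in total."""
    seen = Counter()
    kept = []
    for chunk in ranked_chunks:
        key = (chunk["source"], chunk["page"])
        if seen[key] >= max_per_page:
            continue  # this page is already represented enough
        seen[key] += 1
        kept.append(chunk)
        if len(kept) == top_k:
            break
    return kept


# Hypothetical reranker output, best-first.
reranked = [
    {"id": "a", "source": "guide.pdf", "page": 3},
    {"id": "b", "source": "guide.pdf", "page": 3},
    {"id": "c", "source": "guide.pdf", "page": 3},
    {"id": "d", "source": "guide.pdf", "page": 4},
]
kept_ids = [c["id"] for c in diversity_filter(reranked)]
print(kept_ids)
```

The third chunk from page 3 is dropped so that a single dense page cannot crowd out other parts of the document in the context passed to the LLM.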

Features

Feature	Detail
Voice Input	Browser microphone → WAV conversion → Groq Whisper API (~300ms)
Hybrid Retrieval	BM25 + semantic vector search, RRF fusion, cross-encoder reranking
Multi-KB	Create multiple independent knowledge bases per session
KB Access Control	Optional bcrypt password protection (work factor 12) per KB
Document Formats	PDF, DOCX, HTML, Markdown, TXT (OCR fallback for scanned PDFs)
Source Citations	Every answer traceable to source file + page number
Faithfulness Guard	Detects hallucinations; returns grounded refusal when context is insufficient
Conversation Memory	Rolling 5-turn conversation window passed to the LLM
LLM Fallback	Groq Llama-3.1-70B → Gemini 1.5 Flash automatic fallback
TTS Output	Web Speech API reads answer aloud with citation markers stripped
Analytics	SQLite audit log: query counts, latency, citation rates (7-day window)
Privacy	Raw queries never stored — SHA-256 hash only in audit log
328 Tests	Integration + unit tests across Phases 0 through 6
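
As an illustration of the Source Citations and TTS Output rows, the sketch below resolves `[Source:N]` markers to a filename and page, and strips them for speech. The `[Source:N]` marker format comes from this README; the function names, the `sources` mapping, and the output format `[filename, p.N]` are assumptions made for the example.

```python
import re

SOURCE_MARKER = re.compile(r"\[Source:(\d+)\]")


def resolve_citations(answer, sources):
    """Replace [Source:N] with [filename, p.PAGE].

    Markers with no matching entry in `sources` are left untouched."""
    def _substitute(match):
        src = sources.get(int(match.group(1)))
        if src is None:
            return match.group(0)
        return f"[{src['filename']}, p.{src['page']}]"

    return SOURCE_MARKER.sub(_substitute, answer)


def strip_for_tts(answer):
    """Remove citation markers and tidy spacing so TTS reads plain prose."""
    text = SOURCE_MARKER.sub("", answer)
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)  # no space before punctuation
    return re.sub(r"\s{2,}", " ", text).strip()


# Hypothetical LLM answer and source table.
answer = "Backups run nightly [Source:1]. Restores take an hour [Source:2]."
sources = {
    1: {"filename": "ops.pdf", "page": 12},
    2: {"filename": "runbook.md", "page": 1},
}
cited = resolve_citations(answer, sources)
spoken = strip_for_tts(answer)
print(cited)
print(spoken)
```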

Tech Stack

Layer	Technology	Purpose
API	FastAPI + uvicorn	REST backend with async endpoints
Frontend	HTML5 / CSS3 / Vanilla JS	Premium dark SPA (no framework)
ASR	Groq Whisper API	Cloud transcription (~300ms)
ASR Fallback	OpenAI Whisper Large-v3	Local CPU transcription
Embeddings	sentence-transformers `all-MiniLM-L6-v2`	Dense vector representations
Reranking	`cross-encoder/ms-marco-MiniLM-L12-v2`	Semantic relevance scoring
Vector Store	ChromaDB	In-process vector database
Keyword Search	rank-bm25 (BM25Okapi)	Lexical keyword matching
Chunking	spaCy `en_core_web_sm`	Sentence boundary detection
LLM (primary)	Groq Llama-3.1-70B	Fast inference via Groq cloud
LLM (fallback)	Gemini 1.5 Flash	Google generative AI fallback
Orchestration	LangChain LCEL	LLM pipeline composition
Metadata	SQLite	KB registry, doc index, audit log
Security	bcrypt (work factor 12)	KB password hashing
Config	Pydantic-settings	Centralized, type-safe config
Deployment	Docker on Hugging Face Spaces	Container-based cloud hosting

Project Structure

Project-VoiceVault/
├── server.py                      # FastAPI entry point (run this)
├── app.py                         # Gradio entry point (legacy / tests)
├── config.py                      # Centralized Pydantic-settings config
├── requirements.txt               # All dependencies
├── Dockerfile                     # HF Spaces Docker deployment
├── .env.example                   # Environment variable template
│
├── api/                           # FastAPI REST API
│   ├── __init__.py
│   └── routes.py                  # All /api/* endpoints
│
├── static/                        # SPA frontend assets
│   ├── index.html                 # Single-page application shell
│   ├── style.css                  # Dark glassmorphism design system
│   └── app.js                     # Full SPA logic (recording, chat, KB CRUD)
│
├── voicevault/                    # Core package
│   ├── models.py                  # Pydantic data models
│   ├── asr/
│   │   ├── groq_transcriber.py    # Groq Whisper cloud ASR (~300ms)
│   │   ├── whisper_transcriber.py # Local Whisper CPU/GPU fallback
│   │   └── query_preprocessor.py  # Filler removal, intent classification
│   ├── ingestion/
│   │   ├── document_parser.py     # PDF/HTML/DOCX/MD/TXT → structured text
│   │   ├── semantic_chunker.py    # Sentence-aware chunking with topic boundaries
│   │   └── index_builder.py       # ChromaDB + BM25 + SQLite orchestration
│   ├── retrieval/
│   │   ├── hybrid_retriever.py    # BM25 + vector + RRF + cross-encoder
│   │   ├── bm25_retriever.py      # BM25Okapi keyword search
│   │   ├── vector_retriever.py    # ChromaDB semantic search
│   │   └── context_builder.py     # Context formatting + citation markers
│   ├── generation/
│   │   ├── answer_chain.py        # LangChain LCEL + Groq + Gemini fallback
│   │   ├── faithfulness_guard.py  # Hallucination detection + refusal
│   │   └── citation_injector.py   # [Source:N] → filename + page resolution
│   ├── kb/
│   │   └── kb_manager.py          # KB lifecycle, bcrypt auth, validation
│   ├── storage/
│   │   ├── sqlite_store.py        # Schema, CRUD, audit log queries
│   │   └── chroma_store.py        # ChromaDB wrapper
│   └── tts/
│       └── web_speech.py          # TTS text preparation
│
├── ui/                            # Gradio UI components (legacy / app.py)
│   ├── tabs/
│   │   ├── ask_tab.py
│   │   ├── kb_tab.py
│   │   ├── analytics_tab.py
│   │   └── settings_tab.py
│   └── components/
│       ├── citation_panel.py
│       └── audio_controls.py
│
├── tests/                         # Full test suite — 328 tests
│   ├── conftest.py
│   ├── test_api_routes.py         # Integration tests (FastAPI + real methods)
│   ├── test_phase0.py             # Foundation tests
│   ├── test_phase1.py             # Ingestion tests
│   ├── test_phase2.py             # Retrieval tests
│   ├── test_phase3.py             # ASR tests
│   ├── test_phase4.py             # Generation tests
│   └── test_phase5.py             # UI / access control tests
│
├── DOCS/                          # Detailed phase documentation
│   ├── phase0_foundation.md
│   ├── phase1_ingestion.md
│   ├── phase2_retrieval.md
│   ├── phase3_asr.md
│   ├── phase4_generation.md
│   ├── phase5_ui_access.md
│   └── phase6_deployment.md
│
└── Screenshots/
    ├── 1.png                      # Ask tab — voice query interface
    ├── 2.png                      # Knowledge Bases panel
    ├── 3.png                      # Analytics dashboard
    └── 4.png                      # Full app with KB and live conversation

Quick Start

Prerequisites

Python 3.11+
A Groq API key (free at console.groq.com)
Optionally a Gemini API key (free at aistudio.google.com)

1. Clone and install

git clone https://github.com/ninjacode911/Project-VoiceVault.git
cd Project-VoiceVault
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install torch --index-url https://download.pytorch.org/whl/cpu   # CPU-only (saves ~1.8GB)
pip install -r requirements.txt
python -m spacy download en_core_web_sm

2. Configure secrets

cp .env.example .env
# Edit .env and add:
# GROQ_API_KEY=gsk_...
# GEMINI_API_KEY=...   (optional)

3. Run

python server.py
# Open http://localhost:7860

4. Use it

Navigate to Knowledge Bases → click + New Knowledge Base
Name it using lowercase letters, digits, and hyphens (e.g. my-docs; the name must start and end with a letter or digit), then upload your PDFs or other documents
Go back to Ask VoiceVault → select your KB → record or type a question → click Ask

Running Tests

pytest tests/ -v
# Expected: 328 passed

The integration tests in tests/test_api_routes.py use a real KBManager backed by a temp SQLite DB and exercise the actual FastAPI routes and method signatures — not mocked pipelines. This is intentional: it catches runtime AttributeError bugs that pure-mock unit tests miss.

Deployment to Hugging Face Spaces

The project ships with a Dockerfile configured for HF Spaces. The Docker image:

Uses Python 3.11-slim base
Installs CPU-only PyTorch (~650MB vs 2.5GB GPU wheels)
Pre-downloads all-MiniLM-L6-v2 and cross-encoder/ms-marco-MiniLM-L12-v2 at build time (no cold-start model downloads)
Downloads en_core_web_sm spaCy model at build time
Binds to 0.0.0.0:7860 (HF Spaces default port)

To deploy your own copy:

Create a Hugging Face Space with Docker SDK
Push this repository to the Space's git remote
Add GROQ_API_KEY (and optionally GEMINI_API_KEY) as Space secrets

See DOCS/phase6_deployment.md for the full deployment walkthrough.

Configuration

All configuration is environment-driven via .env. See .env.example for the full reference.

Key variables:

Variable	Default	Description
`GROQ_API_KEY`	—	Required. Groq API key for Whisper + Llama
`GEMINI_API_KEY`	—	Optional Gemini fallback key
`HOST`	`0.0.0.0`	Server bind address
`PORT`	`7860`	Server port
`FINAL_TOP_K`	`5`	Number of chunks passed to LLM
`MAX_ANSWER_TOKENS`	`500`	LLM max output tokens
`CHUNK_SIZE_MAX`	`600`	Max tokens per document chunk
`BCRYPT_ROUNDS`	`12`	bcrypt work factor for KB passwords
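
The table above can be read as a typed settings object. The project uses Pydantic-settings for this; the sketch below is a standard-library stand-in (a frozen dataclass plus a loader) so that it runs without extra dependencies. The class name `Settings` and the function `load_settings` are hypothetical, while the variable names and defaults are the ones listed in the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    groq_api_key: str                      # required
    gemini_api_key: Optional[str] = None   # optional fallback key
    host: str = "0.0.0.0"
    port: int = 7860
    final_top_k: int = 5
    max_answer_tokens: int = 500
    chunk_size_max: int = 600
    bcrypt_rounds: int = 12


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables, applying documented defaults."""
    env = os.environ if env is None else env
    if not env.get("GROQ_API_KEY"):
        raise ValueError("GROQ_API_KEY is required")
    return Settings(
        groq_api_key=env["GROQ_API_KEY"],
        gemini_api_key=env.get("GEMINI_API_KEY") or None,
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", 7860)),
        final_top_k=int(env.get("FINAL_TOP_K", 5)),
        max_answer_tokens=int(env.get("MAX_ANSWER_TOKENS", 500)),
        chunk_size_max=int(env.get("CHUNK_SIZE_MAX", 600)),
        bcrypt_rounds=int(env.get("BCRYPT_ROUNDS", 12)),
    )


# Example with an explicit mapping instead of the real environment.
settings = load_settings({"GROQ_API_KEY": "gsk_example", "PORT": "8000"})
print(settings.port, settings.final_top_k)
```

Failing fast on a missing `GROQ_API_KEY` surfaces misconfiguration at startup rather than on the first transcription request.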

Security

Control	Implementation
No raw queries stored	Audit log stores SHA-256 hash only
KB access control	bcrypt-hashed passwords (work factor 12)
SQL injection prevention	100% parameterized queries — no f-string SQL
Path traversal prevention	KB names validated as slugs (`^[a-z0-9][a-z0-9\-]*[a-z0-9]$`)
SSRF prevention	URL ingestion via trafilatura with no internal-network access
Upload whitelist	Only `.pdf`, `.html`, `.docx`, `.md`, `.txt` accepted
File size limit	50MB max per upload
GPU isolation	`CUDA_VISIBLE_DEVICES=-1` prevents CUDA crashes on incompatible hardware
No secrets in git	`.env` gitignored; HF secrets via Space settings API
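
Three of the controls above (slug validation, hash-only audit logging, parameterized SQL) can be sketched with the standard library. The slug pattern is the one given in the table; the `audit_log` table, its columns, and the function names are hypothetical and do not reflect the actual VoiceVault schema.

```python
import hashlib
import re
import sqlite3

# Slug pattern from the Security table: lowercase alphanumerics and hyphens,
# starting and ending with an alphanumeric character.
KB_SLUG = re.compile(r"^[a-z0-9][a-z0-9\-]*[a-z0-9]$")


def is_valid_kb_name(name):
    """Reject anything that is not a plain slug (blocks '../' style names)."""
    return KB_SLUG.fullmatch(name) is not None


def hash_query(query):
    """Return the SHA-256 hex digest stored in place of the raw query."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


def log_query(conn, kb_name, query, latency_ms):
    """Insert an audit row using placeholders, never string formatting."""
    conn.execute(
        "INSERT INTO audit_log (kb_name, query_sha256, latency_ms) VALUES (?, ?, ?)",
        (kb_name, hash_query(query), latency_ms),
    )


# Demonstration against an in-memory database with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE audit_log (kb_name TEXT, query_sha256 TEXT, latency_ms REAL)"
)
log_query(conn, "my-docs", "What is the refund policy?", 412.0)
row = conn.execute("SELECT kb_name, query_sha256 FROM audit_log").fetchone()
print(row[0], len(row[1]))
```

Note that the pattern as written requires at least two characters, so a single-character KB name would be rejected.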

Phase Documentation

Each phase has a detailed write-up covering design decisions, key code sections, and test results:

Phase	Topic	Tests
Phase 0	Project Foundation (config, models, schema, scaffold)	58 ✅
Phase 1	Document Ingestion (parser, chunker, indexer)	46 ✅
Phase 2	Hybrid Retrieval (BM25 + vector + RRF + reranker)	33 ✅
Phase 3	ASR & Voice Input (Whisper, query preprocessor)	47 ✅
Phase 4	Generation & Citations (LangChain, faithfulness guard)	72 ✅
Phase 5	Full UI, TTS & Access Control	55 ✅
Phase 6	FastAPI Server, SPA Frontend & HF Deployment	17 ✅

Total: 328 tests — all passing.

License

The source code is publicly visible for viewing and educational purposes. Any use in personal, commercial, or academic projects requires explicit written permission from the author.

To request permission: navnitamrutharaj1234@gmail.com

Author: Navnit Amrutharaj