---
title: VoiceVault
emoji: 🎙️
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: other
---

<div align="center">

# VoiceVault

**Voice-First RAG Knowledge Agent**

*Speak to your documents. Get cited answers back.*

[![Python](https://img.shields.io/badge/Python-3.11+-3776AB?style=flat&logo=python&logoColor=white)](https://www.python.org/)
[![FastAPI](https://img.shields.io/badge/FastAPI-0.115+-009688?style=flat&logo=fastapi&logoColor=white)](https://fastapi.tiangolo.com/)
[![License](https://img.shields.io/badge/License-Source%20Available-blue.svg)](LICENSE)
[![Tests](https://img.shields.io/badge/Tests-328%20passing-brightgreen)](tests/)
[![HF Spaces](https://img.shields.io/badge/🤗%20Hugging%20Face-Live%20Demo-FFD21E)](https://huggingface.co/spaces/NinjainPJs/VoiceVault)

[**Live Demo →**](https://huggingface.co/spaces/NinjainPJs/VoiceVault)&nbsp;&nbsp;|&nbsp;&nbsp;[**Documentation →**](DOCS/)&nbsp;&nbsp;|&nbsp;&nbsp;[**Project Plan →**](PLAN.md)

</div>

---

## Overview

VoiceVault is a production-grade, voice-first Retrieval-Augmented Generation (RAG) system built entirely from scratch. It lets users record or type questions and receive answers grounded in their own private document collections, with inline citations pointing back to the source file and page.

The project was built in seven phases (Phase 0 through Phase 6) over several weeks. It includes a full test suite (328 tests), security practices such as bcrypt password hashing, parameterized SQL, SHA-256-hashed audit logs, and SSRF prevention, and deployment to Hugging Face Spaces via Docker.

**What makes this different from typical RAG demos:**
- **Hybrid retrieval** — BM25 keyword search + semantic vector search, fused with Reciprocal Rank Fusion (RRF) + cross-encoder reranking. Most tutorials use only one retrieval method.
- **Voice-native pipeline** — Groq Whisper API for ~300ms cloud transcription with local Whisper fallback; Web Speech API for TTS output.
- **Faithfulness guard** — Detects when the LLM cannot answer from retrieved context and returns a grounded refusal instead of hallucinating.
- **Multi-KB support** — Multiple independent knowledge bases, each optionally password-protected.
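
As a rough illustration of the fusion step, the sketch below implements Reciprocal Rank Fusion over two ranked lists of chunk IDs. The function name, the chunk IDs, and the tie-breaking rule are assumptions made for this example, and `k=60` is taken from the architecture diagram; the project's own code in `voicevault/retrieval/hybrid_retriever.py` may differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Each list contributes 1 / (k + rank) for every chunk it contains.
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties fall back to chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


bm25_hits = ["c3", "c1", "c7"]    # hypothetical keyword ranking
vector_hits = ["c1", "c9", "c3"]  # hypothetical semantic ranking
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

Chunks that appear in both lists accumulate two contributions, so they tend to outrank chunks found by only one retriever. The fused list would then be handed to the cross-encoder for reranking.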

---

## Screenshots

<div align="center">

### Ask VoiceVault — Voice Query Interface
*Record your question via microphone or type it. The mic button pulses when recording.*

<img src="Screenshots/1.png" alt="Ask VoiceVault — main voice query interface with dark glassmorphism UI" width="800"/>

---

### Knowledge Base Management
*Create named knowledge bases, upload documents (PDF, DOCX, HTML, MD, TXT), and manage them.*

<img src="Screenshots/2.png" alt="Knowledge Bases panel — empty state with New Knowledge Base button" width="800"/>

---

### Analytics Dashboard
*Real-time query statistics: total queries, average latency, citation counts, and daily breakdowns.*

<img src="Screenshots/3.png" alt="Analytics dashboard showing query statistics" width="800"/>

---

### Full App in Action
*A populated knowledge base (358 chunks from 1 document) and a live conversation with the RAG pipeline.*

<img src="Screenshots/4.png" alt="Full VoiceVault app with a knowledge base and active conversation" width="800"/>

</div>

---

## Architecture

```
INGESTION PATH (one-time per document set)
──────────────────────────────────────────────────────
  User uploads PDF / HTML / DOCX / MD / TXT
      │
      ▼
  DocumentParser         →  text + metadata per page
      │                     (PyMuPDF, BS4, python-docx)
      ▼
  SemanticChunker        →  sentence-aware chunks
      │                     (spaCy sentences + cosine boundary)
      ▼
  IndexBuilder           →  ChromaDB (vector) + BM25 (keyword)
                             + SQLite (metadata)

QUERY PATH (real-time, per question)
──────────────────────────────────────────────────────
  Browser mic → WAV → POST /api/transcribe
      │
      ▼
  GroqTranscriber        →  Groq Whisper API (~300ms)
      │                     [fallback: local Whisper CPU]
      ▼
  QueryPreprocessor      →  filler removal, intent classification
      │                     (factual / summary / compare)
      ▼
  HybridRetriever        →  BM25 top-20 + Vector top-20
      │                     → RRF merge (k=60)
      │                     → CrossEncoder rerank (ms-marco-MiniLM-L12-v2)
      │                     → diversity filter (max 2 chunks/page)
      ▼
  ContextBuilder         →  formatted context with [Source:N] markers
      ▼
  LangChain LCEL         →  Groq Llama-3.1-70B (primary)
      │                     [fallback: Gemini 1.5 Flash]
      ▼
  FaithfulnessGuard      →  refusal detection, confidence scoring
      │
  CitationInjector       →  resolve [Source:N] → filename + page
      ▼
  JSON response          →  answer + citations + confidence + tts_text
      │
      ▼
  SPA Frontend           →  chat display + Web Speech API TTS
```
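
The last retrieval and generation steps in the diagram can be sketched as follows: cap the number of chunks per page, number the survivors with `[Source:N]` markers, and map markers found in the answer back to filename and page. The `Chunk` dataclass, the function names, and the citation dictionary shape are all hypothetical; the real models live in `voicevault/models.py` and may look different.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:  # hypothetical shape, for illustration only
    text: str
    filename: str
    page: int


def diversity_filter(chunks, max_per_page=2):
    """Keep at most max_per_page chunks per (file, page), preserving rank order."""
    counts = {}
    kept = []
    for chunk in chunks:
        key = (chunk.filename, chunk.page)
        counts[key] = counts.get(key, 0) + 1
        if counts[key] <= max_per_page:
            kept.append(chunk)
    return kept


def build_context(chunks):
    """Prefix each chunk with a 1-based [Source:N] marker."""
    return "\n\n".join(
        f"[Source:{i}] {c.text}" for i, c in enumerate(chunks, start=1)
    )


def resolve_citations(answer, chunks):
    """Map each distinct, in-range [Source:N] marker to filename and page."""
    citations, seen = [], set()
    for match in re.finditer(r"\[Source:(\d+)\]", answer):
        n = int(match.group(1))
        if n in seen or not 1 <= n <= len(chunks):
            continue  # skip repeats and markers the model invented
        seen.add(n)
        chunk = chunks[n - 1]
        citations.append({"source": n, "filename": chunk.filename, "page": chunk.page})
    return citations


ranked = [
    Chunk("A", "guide.pdf", 3),
    Chunk("B", "guide.pdf", 3),
    Chunk("C", "guide.pdf", 3),  # third chunk from the same page is dropped
    Chunk("D", "notes.md", 1),
]
kept = diversity_filter(ranked)
context = build_context(kept)
citations = resolve_citations("X [Source:3] and Y [Source:1]. Z [Source:9]", kept)
print(context)
print(citations)
```

Ignoring out-of-range markers is one possible design choice: it keeps a citation that the model made up from pointing at a nonexistent chunk.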

---

## Features

| Feature | Detail |
|---------|--------|
| **Voice Input** | Browser microphone → WAV conversion → Groq Whisper API (~300ms) |
| **Hybrid Retrieval** | BM25 + semantic vector search, RRF fusion, cross-encoder reranking |
| **Multi-KB** | Create multiple independent knowledge bases per session |
| **KB Access Control** | Optional bcrypt password protection (work factor 12) per KB |
| **Document Formats** | PDF, DOCX, HTML, Markdown, TXT (OCR fallback for scanned PDFs) |
| **Source Citations** | Every answer traceable to source file + page number |
| **Faithfulness Guard** | Detects hallucinations; returns grounded refusal when context is insufficient |
| **Conversation Memory** | Rolling 5-turn conversation window passed to the LLM |
| **LLM Fallback** | Groq Llama-3.1-70B → Gemini 1.5 Flash automatic fallback |
| **TTS Output** | Web Speech API reads answer aloud with citation markers stripped |
| **Analytics** | SQLite audit log: query counts, latency, citation rates (7-day window) |
| **Privacy** | Raw queries never stored — SHA-256 hash only in audit log |
| **328 Tests** | Integration + unit tests across all seven phases (Phase 0 through Phase 6) |
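
Two of the rows above are simple enough to sketch with the standard library: the rolling 5-turn conversation window and the removal of citation markers before text-to-speech. The names `history`, `remember`, and `tts_text` are assumptions for this example, and the marker format `[Source:N]` is taken from the architecture diagram; the project's own logic in `voicevault/tts/web_speech.py` may handle more cases.

```python
import re
from collections import deque

# A deque with maxlen=5 silently discards the oldest turn when a sixth arrives.
history = deque(maxlen=5)


def remember(question, answer):
    """Append one question/answer turn to the rolling window."""
    history.append((question, answer))


def tts_text(answer):
    """Remove [Source:N] markers so the speech engine does not read them aloud."""
    without_markers = re.sub(r"\s*\[Source:\d+\]", "", answer)
    return re.sub(r"\s{2,}", " ", without_markers).strip()


for i in range(7):
    remember(f"q{i}", f"a{i}")

spoken = tts_text("Refunds take 5 days [Source:1]. Contact support [Source:2][Source:3].")
print(list(history))
print(spoken)
```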

---

## Tech Stack

| Layer | Technology | Purpose |
|-------|-----------|---------|
| **API** | FastAPI + uvicorn | REST backend with async endpoints |
| **Frontend** | HTML5 / CSS3 / Vanilla JS | Premium dark SPA (no framework) |
| **ASR** | Groq Whisper API | Cloud transcription (~300ms) |
| **ASR Fallback** | OpenAI Whisper Large-v3 | Local CPU transcription |
| **Embeddings** | sentence-transformers `all-MiniLM-L6-v2` | Dense vector representations |
| **Reranking** | `cross-encoder/ms-marco-MiniLM-L12-v2` | Semantic relevance scoring |
| **Vector Store** | ChromaDB | In-process vector database |
| **Keyword Search** | rank-bm25 (BM25Okapi) | Lexical keyword matching |
| **Chunking** | spaCy `en_core_web_sm` | Sentence boundary detection |
| **LLM (primary)** | Groq Llama-3.1-70B | Fast inference via Groq cloud |
| **LLM (fallback)** | Gemini 1.5 Flash | Google generative AI fallback |
| **Orchestration** | LangChain LCEL | LLM pipeline composition |
| **Metadata** | SQLite | KB registry, doc index, audit log |
| **Security** | bcrypt (work factor 12) | KB password hashing |
| **Config** | Pydantic-settings | Centralized, type-safe config |
| **Deployment** | Docker on Hugging Face Spaces | Container-based cloud hosting |

---

## Project Structure

```
Project-VoiceVault/
├── server.py                      # FastAPI entry point (run this)
├── app.py                         # Gradio entry point (legacy / tests)
├── config.py                      # Centralized Pydantic-settings config
├── requirements.txt               # All dependencies
├── Dockerfile                     # HF Spaces Docker deployment
├── .env.example                   # Environment variable template
│
├── api/                           # FastAPI REST API
│   ├── __init__.py
│   └── routes.py                  # All /api/* endpoints
│
├── static/                        # SPA frontend assets
│   ├── index.html                 # Single-page application shell
│   ├── style.css                  # Dark glassmorphism design system
│   └── app.js                     # Full SPA logic (recording, chat, KB CRUD)
│
├── voicevault/                    # Core package
│   ├── models.py                  # Pydantic data models
│   ├── asr/
│   │   ├── groq_transcriber.py    # Groq Whisper cloud ASR (~300ms)
│   │   ├── whisper_transcriber.py # Local Whisper CPU/GPU fallback
│   │   └── query_preprocessor.py  # Filler removal, intent classification
│   ├── ingestion/
│   │   ├── document_parser.py     # PDF/HTML/DOCX/MD/TXT → structured text
│   │   ├── semantic_chunker.py    # Sentence-aware chunking with topic boundaries
│   │   └── index_builder.py       # ChromaDB + BM25 + SQLite orchestration
│   ├── retrieval/
│   │   ├── hybrid_retriever.py    # BM25 + vector + RRF + cross-encoder
│   │   ├── bm25_retriever.py      # BM25Okapi keyword search
│   │   ├── vector_retriever.py    # ChromaDB semantic search
│   │   └── context_builder.py     # Context formatting + citation markers
│   ├── generation/
│   │   ├── answer_chain.py        # LangChain LCEL + Groq + Gemini fallback
│   │   ├── faithfulness_guard.py  # Hallucination detection + refusal
│   │   └── citation_injector.py   # [Source:N] → filename + page resolution
│   ├── kb/
│   │   └── kb_manager.py          # KB lifecycle, bcrypt auth, validation
│   ├── storage/
│   │   ├── sqlite_store.py        # Schema, CRUD, audit log queries
│   │   └── chroma_store.py        # ChromaDB wrapper
│   └── tts/
│       └── web_speech.py          # TTS text preparation
│
├── ui/                            # Gradio UI components (legacy / app.py)
│   ├── tabs/
│   │   ├── ask_tab.py
│   │   ├── kb_tab.py
│   │   ├── analytics_tab.py
│   │   └── settings_tab.py
│   └── components/
│       ├── citation_panel.py
│       └── audio_controls.py
│
├── tests/                         # Full test suite — 328 tests
│   ├── conftest.py
│   ├── test_api_routes.py         # Integration tests (FastAPI + real methods)
│   ├── test_phase0.py             # Foundation tests
│   ├── test_phase1.py             # Ingestion tests
│   ├── test_phase2.py             # Retrieval tests
│   ├── test_phase3.py             # ASR tests
│   ├── test_phase4.py             # Generation tests
│   └── test_phase5.py             # UI / access control tests
│
├── DOCS/                          # Detailed phase documentation
│   ├── phase0_foundation.md
│   ├── phase1_ingestion.md
│   ├── phase2_retrieval.md
│   ├── phase3_asr.md
│   ├── phase4_generation.md
│   ├── phase5_ui_access.md
│   └── phase6_deployment.md
│
└── Screenshots/
    ├── 1.png                      # Ask tab — voice query interface
    ├── 2.png                      # Knowledge Bases panel
    ├── 3.png                      # Analytics dashboard
    └── 4.png                      # Full app with KB and live conversation
```

---

## Quick Start

### Prerequisites

- Python 3.11+
- A Groq API key ([free at console.groq.com](https://console.groq.com))
- Optionally a Gemini API key ([free at aistudio.google.com](https://aistudio.google.com))

### 1. Clone and install

```bash
git clone https://github.com/ninjacode911/Project-VoiceVault.git
cd Project-VoiceVault
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install torch --index-url https://download.pytorch.org/whl/cpu   # CPU-only (saves ~1.8GB)
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```

### 2. Configure secrets

```bash
cp .env.example .env
# Edit .env and add:
# GROQ_API_KEY=gsk_...
# GEMINI_API_KEY=...   (optional)
```

### 3. Run

```bash
python server.py
# Open http://localhost:7860
```

### 4. Use it

1. Navigate to **Knowledge Bases** → click **+ New Knowledge Base**
2. Name it using lowercase letters, digits, and hyphens, starting and ending with a letter or digit (e.g. `my-docs`), then upload your PDFs or other documents
3. Go back to **Ask VoiceVault** → select your KB → record or type a question → click **Ask**

---

## Running Tests

```bash
pytest tests/ -v
# Expected: 328 passed
```

The integration tests in `tests/test_api_routes.py` use a real `KBManager` backed by a temp SQLite DB and exercise the actual FastAPI routes and method signatures — not mocked pipelines. This is intentional: it catches runtime `AttributeError` bugs that pure-mock unit tests miss.

---

## Deployment to Hugging Face Spaces

The project ships with a `Dockerfile` configured for HF Spaces. The Docker image:
- Uses Python 3.11-slim base
- Installs CPU-only PyTorch (~650MB vs 2.5GB GPU wheels)
- Pre-downloads `all-MiniLM-L6-v2` and `cross-encoder/ms-marco-MiniLM-L12-v2` at build time (no cold-start model downloads)
- Downloads `en_core_web_sm` spaCy model at build time
- Binds to `0.0.0.0:7860` (HF Spaces default port)

To deploy your own copy:

1. Create a [Hugging Face Space](https://huggingface.co/new-space) with **Docker** SDK
2. Push this repository to the Space's git remote
3. Add `GROQ_API_KEY` (and optionally `GEMINI_API_KEY`) as Space secrets

See [DOCS/phase6_deployment.md](DOCS/phase6_deployment.md) for the full deployment walkthrough.

---

## Configuration

All configuration is environment-driven via `.env`. See [`.env.example`](.env.example) for the full reference.

Key variables:

| Variable | Default | Description |
|----------|---------|-------------|
| `GROQ_API_KEY` | — | **Required.** Groq API key for Whisper + Llama |
| `GEMINI_API_KEY` | — | Optional Gemini fallback key |
| `HOST` | `0.0.0.0` | Server bind address |
| `PORT` | `7860` | Server port |
| `FINAL_TOP_K` | `5` | Number of chunks passed to LLM |
| `MAX_ANSWER_TOKENS` | `500` | LLM max output tokens |
| `CHUNK_SIZE_MAX` | `600` | Max tokens per document chunk |
| `BCRYPT_ROUNDS` | `12` | bcrypt work factor for KB passwords |

---

## Security

| Control | Implementation |
|---------|----------------|
| **No raw queries stored** | Audit log stores SHA-256 hash only |
| **KB access control** | bcrypt-hashed passwords (work factor 12) |
| **SQL injection prevention** | 100% parameterized queries — no f-string SQL |
| **Path traversal prevention** | KB names validated as slugs (`^[a-z0-9][a-z0-9\-]*[a-z0-9]$`) |
| **SSRF prevention** | URL ingestion via trafilatura with no internal-network access |
| **Upload whitelist** | Only `.pdf`, `.html`, `.docx`, `.md`, `.txt` accepted |
| **File size limit** | 50MB max per upload |
| **GPU isolation** | `CUDA_VISIBLE_DEVICES=-1` prevents CUDA crashes on incompatible hardware |
| **No secrets in git** | `.env` gitignored; HF secrets via Space settings API |

---

## Phase Documentation

Each phase has a detailed write-up covering design decisions, key code sections, and test results:

| Phase | Topic | Tests |
|-------|-------|-------|
| [Phase 0](DOCS/phase0_foundation.md) | Project Foundation (config, models, schema, scaffold) | 58 ✅ |
| [Phase 1](DOCS/phase1_ingestion.md) | Document Ingestion (parser, chunker, indexer) | 46 ✅ |
| [Phase 2](DOCS/phase2_retrieval.md) | Hybrid Retrieval (BM25 + vector + RRF + reranker) | 33 ✅ |
| [Phase 3](DOCS/phase3_asr.md) | ASR & Voice Input (Whisper, query preprocessor) | 47 ✅ |
| [Phase 4](DOCS/phase4_generation.md) | Generation & Citations (LangChain, faithfulness guard) | 72 ✅ |
| [Phase 5](DOCS/phase5_ui_access.md) | Full UI, TTS & Access Control | 55 ✅ |
| [Phase 6](DOCS/phase6_deployment.md) | FastAPI Server, SPA Frontend & HF Deployment | 17 ✅ |

**Total: 328 tests — all passing.**

---

## License

**Source Available — All Rights Reserved.** See [LICENSE](LICENSE) for full terms.

The source code is publicly visible for viewing and educational purposes. Any use in personal, commercial, or academic projects requires explicit written permission from the author.

To request permission: navnitamrutharaj1234@gmail.com

**Author:** Navnit Amrutharaj