Spaces: NinjainPJs/VoiceVault

NinjainPJs committed on Mar 21

Commit 05853ae · verified · 1 Parent(s): 85f900d

Update README.md

Files changed (1)

README.md +370 -33

README.md CHANGED

@@ -5,51 +5,388 @@ colorFrom: purple
 colorTo: blue
 sdk: docker
 pinned: false
-license: mit
 ---
-# VoiceVault — Voice-First RAG Knowledge Agent
-A production-grade, voice-first Retrieval-Augmented Generation (RAG) system that lets you speak questions and get answers grounded in your own documents.
-## Features
-- **Voice-to-Answer Pipeline** — Record or type a question, get a cited answer from your knowledge bases
-- **Multi-KB Support** — Create and query multiple independent knowledge bases
-- **Hybrid Retrieval** — BM25 + semantic vector search fused with RRF + cross-encoder reranking
-- **Fast Transcription** — Groq Whisper API (~300ms turnaround)
-- **Smart LLM Fallback** — Groq Llama-3.1-70B → Gemini 1.5 Flash
-- **Source Citations** — Every answer is traceable to source documents and page numbers
-- **Document Support** — PDF, DOCX, HTML, Markdown, TXT
-## Setup
-Add the following secrets in your Space settings:
-| Secret | Description |
-|--------|-------------|
-| `GROQ_API_KEY` | Required — powers Whisper transcription and Llama LLM |
-| `GEMINI_API_KEY` | Optional — Gemini 1.5 Flash fallback LLM |
-## Tech Stack
-- **FastAPI** — REST API backend
-- **ChromaDB** — In-process vector store
-- **sentence-transformers** — `all-MiniLM-L6-v2` embeddings
-- **LangChain** — LLM orchestration
-- **Groq** — Whisper ASR + Llama 3.1 70B
-- **spaCy** — Semantic chunking
 ## Architecture
 ```
-Voice Input → Groq Whisper ASR → Query Preprocessor
-                                        ↓
-                            HybridRetriever (BM25 + Vector + RRF)
-                                        ↓
-                            CrossEncoder Reranking → ContextBuilder
-                                        ↓
-                            LangChain → Groq LLM → Cited Answer
-                                        ↓
-                                 Web Speech API TTS
 ```

 colorTo: blue
 sdk: docker
 pinned: false
+license: other
 ---
+<div align="center">
+# VoiceVault
+**Voice-First RAG Knowledge Agent**
+*Speak to your documents. Get cited answers back.*
+[![Python](https://img.shields.io/badge/Python-3.11+-3776AB?style=flat&logo=python&logoColor=white)](https://www.python.org/)
+[![FastAPI](https://img.shields.io/badge/FastAPI-0.115+-009688?style=flat&logo=fastapi&logoColor=white)](https://fastapi.tiangolo.com/)
+[![License](https://img.shields.io/badge/License-Source%20Available-blue.svg)](LICENSE)
+[![Tests](https://img.shields.io/badge/Tests-328%20passing-brightgreen)](tests/)
+[![HF Spaces](https://img.shields.io/badge/🤗%20Hugging%20Face-Live%20Demo-FFD21E)](https://huggingface.co/spaces/NinjainPJs/VoiceVault)
+[**Live Demo →**](https://huggingface.co/spaces/NinjainPJs/VoiceVault)&nbsp;&nbsp;|&nbsp;&nbsp;[**Documentation →**](DOCS/)&nbsp;&nbsp;|&nbsp;&nbsp;[**Project Plan →**](PLAN.md)
+</div>
+---
+## Overview
+VoiceVault is a production-grade, voice-first Retrieval-Augmented Generation (RAG) system built entirely from scratch. It enables users to record or type questions and receive answers grounded in their own private document collections — with inline citations pointing back to the exact source, page, and paragraph.
+The project was built in seven phases (Phase 0 through Phase 6) over several weeks, with a full test suite (328 tests), enterprise-grade security practices (bcrypt, parameterized SQL, SHA-256 audit logs, SSRF prevention), and deployment to Hugging Face Spaces via Docker.
+**What makes this different from typical RAG demos:**
+- **Hybrid retrieval** — BM25 keyword search + semantic vector search, fused with Reciprocal Rank Fusion (RRF) + cross-encoder reranking. Most tutorials use only one retrieval method.
+- **Voice-native pipeline** — Groq Whisper API for ~300ms cloud transcription with local Whisper fallback; Web Speech API for TTS output.
+- **Faithfulness guard** — Detects when the LLM cannot answer from retrieved context and returns a grounded refusal instead of hallucinating.
+- **Multi-KB support** — Multiple independent knowledge bases, each optionally password-protected.
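
Reciprocal Rank Fusion, named in the hybrid retrieval bullet above, can be sketched in a few lines. This is a minimal illustration and not the project's `hybrid_retriever.py`: the function name `rrf_merge` and the chunk IDs are hypothetical, and `k=60` is taken from the architecture diagram in this README.

```python
from collections import defaultdict


def rrf_merge(rankings, k=60):
    """Fuse several ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) to a chunk's score, where rank
    starts at 1 for the top hit. Chunks found by several retrievers
    accumulate score from each of them.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties fall back to chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical result lists from the two retrievers.
bm25_hits = ["c3", "c1", "c7"]
vector_hits = ["c1", "c9", "c3"]
fused = rrf_merge([bm25_hits, vector_hits])
```

Because RRF uses only ranks, the BM25 and cosine-similarity scores never need to be normalized against each other, which is the usual reason for choosing it over a weighted score sum.
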
+---
+## Screenshots
+<div align="center">
+### Ask VoiceVault — Voice Query Interface
+*Record your question via microphone or type it. The mic button pulses when recording.*
+<img src="Screenshots/1.png" alt="Ask VoiceVault — main voice query interface with dark glassmorphism UI" width="800"/>
+---
+### Knowledge Base Management
+*Create named knowledge bases, upload documents (PDF, DOCX, HTML, MD, TXT), and manage them.*
+<img src="Screenshots/2.png" alt="Knowledge Bases panel — empty state with New Knowledge Base button" width="800"/>
+---
+### Analytics Dashboard
+*Real-time query statistics: total queries, average latency, citation counts, and daily breakdowns.*
+<img src="Screenshots/3.png" alt="Analytics dashboard showing query statistics" width="800"/>
+---
+### Full App in Action
+*A populated knowledge base (358 chunks from 1 document) and a live conversation with the RAG pipeline.*
+<img src="Screenshots/4.png" alt="Full VoiceVault app with a knowledge base and active conversation" width="800"/>
+</div>
+---
 ## Architecture
 ```
+INGESTION PATH (one-time per document set)
+──────────────────────────────────────────────────────
+  User uploads PDF / HTML / DOCX / MD / TXT
+      │
+      ▼
+  DocumentParser         →  text + metadata per page
+      │                     (PyMuPDF, BS4, python-docx)
+      ▼
+  SemanticChunker        →  sentence-aware chunks
+      │                     (spaCy sentences + cosine boundary)
+      ▼
+  IndexBuilder           →  ChromaDB (vector) + BM25 (keyword)
+                             + SQLite (metadata)
+QUERY PATH (real-time, per question)
+──────────────────────────────────────────────────────
+  Browser mic → WAV → POST /api/transcribe
+      │
+      ▼
+  GroqTranscriber        →  Groq Whisper API (~300ms)
+      │                     [fallback: local Whisper CPU]
+      ▼
+  QueryPreprocessor      →  filler removal, intent classification
+      │                     (factual / summary / compare)
+      ▼
+  HybridRetriever        →  BM25 top-20 + Vector top-20
+      │                     → RRF merge (k=60)
+      │                     → CrossEncoder rerank (ms-marco-MiniLM-L12-v2)
+      │                     → diversity filter (max 2 chunks/page)
+      ▼
+  ContextBuilder         →  formatted context with [Source:N] markers
+      ▼
+  LangChain LCEL         →  Groq Llama-3.1-70B (primary)
+      │                     [fallback: Gemini 1.5 Flash]
+      ▼
+  FaithfulnessGuard      →  refusal detection, confidence scoring
+      │
+  CitationInjector       →  resolve [Source:N] → filename + page
+      ▼
+  JSON response          →  answer + citations + confidence + tts_text
+      │
+      ▼
+  SPA Frontend           →  chat display + Web Speech API TTS
+```
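
The diversity filter step in the query path (max 2 chunks per page) could look roughly like the sketch below. The chunk dictionaries, the `diversity_filter` name, and the choice to key on `(source, page)` are assumptions made for illustration; the README does not show how the project represents chunks.

```python
from collections import Counter


def diversity_filter(ranked_chunks, max_per_page=2):
    """Keep rank order, but cap how many chunks one page may contribute."""
    seen = Counter()
    kept = []
    for chunk in ranked_chunks:
        # Assumption: a page is identified by its source file plus page number.
        key = (chunk["source"], chunk["page"])
        if seen[key] < max_per_page:
            seen[key] += 1
            kept.append(chunk)
    return kept


# Hypothetical reranker output, best first.
reranked = [
    {"id": "a", "source": "guide.pdf", "page": 4},
    {"id": "b", "source": "guide.pdf", "page": 4},
    {"id": "c", "source": "guide.pdf", "page": 4},
    {"id": "d", "source": "guide.pdf", "page": 9},
]
kept_ids = [c["id"] for c in diversity_filter(reranked)]
```

Here the third chunk from page 4 is dropped, so the lower-ranked chunk from page 9 still reaches the context window.
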
+---
+## Features
+| Feature | Detail |
+|---------|--------|
+| **Voice Input** | Browser microphone → WAV conversion → Groq Whisper API (~300ms) |
+| **Hybrid Retrieval** | BM25 + semantic vector search, RRF fusion, cross-encoder reranking |
+| **Multi-KB** | Create multiple independent knowledge bases per session |
+| **KB Access Control** | Optional bcrypt password protection (work factor 12) per KB |
+| **Document Formats** | PDF, DOCX, HTML, Markdown, TXT (OCR fallback for scanned PDFs) |
+| **Source Citations** | Every answer traceable to source file + page number |
+| **Faithfulness Guard** | Detects hallucinations; returns grounded refusal when context is insufficient |
+| **Conversation Memory** | Rolling 5-turn conversation window passed to the LLM |
+| **LLM Fallback** | Groq Llama-3.1-70B → Gemini 1.5 Flash automatic fallback |
+| **TTS Output** | Web Speech API reads answer aloud with citation markers stripped |
+| **Analytics** | SQLite audit log: query counts, latency, citation rates (7-day window) |
+| **Privacy** | Raw queries never stored — SHA-256 hash only in audit log |
+| **328 Tests** | Integration + unit tests across all seven phases (Phase 0 through Phase 6) |
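
The Source Citations and TTS Output rows describe two treatments of the same `[Source:N]` markers: resolving them for display and stripping them for speech. A minimal sketch, assuming a hypothetical mapping from marker number to `(filename, page)`; the function names are illustrative and are not taken from `citation_injector.py` or `web_speech.py`.

```python
import re

SOURCE_MARKER = re.compile(r"\[Source:(\d+)\]")


def resolve_citations(answer, sources):
    """Replace [Source:N] with '[filename, p. page]'; unknown N is left as-is."""
    def _swap(match):
        n = int(match.group(1))
        if n not in sources:
            return match.group(0)
        filename, page = sources[n]
        return f"[{filename}, p. {page}]"

    return SOURCE_MARKER.sub(_swap, answer)


def to_tts_text(answer):
    """Remove citation markers and tidy spacing so they are not read aloud."""
    stripped = SOURCE_MARKER.sub("", answer)
    # Close the gap left before punctuation, e.g. "nightly ." -> "nightly."
    stripped = re.sub(r"\s+([.,;:!?])", r"\1", stripped)
    return re.sub(r"\s{2,}", " ", stripped).strip()


# Hypothetical LLM output and source table.
answer = "Backups run nightly [Source:1]. Restores take minutes [Source:2]."
sources = {1: ("ops.pdf", 12), 2: ("runbook.md", 3)}
cited = resolve_citations(answer, sources)
spoken = to_tts_text(answer)
```
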
+---
+## Tech Stack
+| Layer | Technology | Purpose |
+|-------|-----------|---------|
+| **API** | FastAPI + uvicorn | REST backend with async endpoints |
+| **Frontend** | HTML5 / CSS3 / Vanilla JS | Premium dark SPA (no framework) |
+| **ASR** | Groq Whisper API | Cloud transcription (~300ms) |
+| **ASR Fallback** | OpenAI Whisper Large-v3 | Local CPU transcription |
+| **Embeddings** | sentence-transformers `all-MiniLM-L6-v2` | Dense vector representations |
+| **Reranking** | `cross-encoder/ms-marco-MiniLM-L12-v2` | Semantic relevance scoring |
+| **Vector Store** | ChromaDB | In-process vector database |
+| **Keyword Search** | rank-bm25 (BM25Okapi) | Lexical keyword matching |
+| **Chunking** | spaCy `en_core_web_sm` | Sentence boundary detection |
+| **LLM (primary)** | Groq Llama-3.1-70B | Fast inference via Groq cloud |
+| **LLM (fallback)** | Gemini 1.5 Flash | Google generative AI fallback |
+| **Orchestration** | LangChain LCEL | LLM pipeline composition |
+| **Metadata** | SQLite | KB registry, doc index, audit log |
+| **Security** | bcrypt (work factor 12) | KB password hashing |
+| **Config** | Pydantic-settings | Centralized, type-safe config |
+| **Deployment** | Docker on Hugging Face Spaces | Container-based cloud hosting |
+---
+## Project Structure
+```
+Project-VoiceVault/
+├── server.py                      # FastAPI entry point (run this)
+├── app.py                         # Gradio entry point (legacy / tests)
+├── config.py                      # Centralized Pydantic-settings config
+├── requirements.txt               # All dependencies
+├── Dockerfile                     # HF Spaces Docker deployment
+├── .env.example                   # Environment variable template
+│
+├── api/                           # FastAPI REST API
+│   ├── __init__.py
+│   └── routes.py                  # All /api/* endpoints
+│
+├── static/                        # SPA frontend assets
+│   ├── index.html                 # Single-page application shell
+│   ├── style.css                  # Dark glassmorphism design system
+│   └── app.js                     # Full SPA logic (recording, chat, KB CRUD)
+│
+├── voicevault/                    # Core package
+│   ├── models.py                  # Pydantic data models
+│   ├── asr/
+│   │   ├── groq_transcriber.py    # Groq Whisper cloud ASR (~300ms)
+│   │   ├── whisper_transcriber.py # Local Whisper CPU/GPU fallback
+│   │   └── query_preprocessor.py  # Filler removal, intent classification
+│   ├── ingestion/
+│   │   ├── document_parser.py     # PDF/HTML/DOCX/MD/TXT → structured text
+│   │   ├── semantic_chunker.py    # Sentence-aware chunking with topic boundaries
+│   │   └── index_builder.py       # ChromaDB + BM25 + SQLite orchestration
+│   ├── retrieval/
+│   │   ├── hybrid_retriever.py    # BM25 + vector + RRF + cross-encoder
+│   │   ├── bm25_retriever.py      # BM25Okapi keyword search
+│   │   ├── vector_retriever.py    # ChromaDB semantic search
+│   │   └── context_builder.py     # Context formatting + citation markers
+│   ├── generation/
+│   │   ├── answer_chain.py        # LangChain LCEL + Groq + Gemini fallback
+│   │   ├── faithfulness_guard.py  # Hallucination detection + refusal
+│   │   └── citation_injector.py   # [Source:N] → filename + page resolution
+│   ├── kb/
+│   │   └── kb_manager.py          # KB lifecycle, bcrypt auth, validation
+│   ├── storage/
+│   │   ├── sqlite_store.py        # Schema, CRUD, audit log queries
+│   │   └── chroma_store.py        # ChromaDB wrapper
+│   └── tts/
+│       └── web_speech.py          # TTS text preparation
+│
+├── ui/                            # Gradio UI components (legacy / app.py)
+│   ├── tabs/
+│   │   ├── ask_tab.py
+│   │   ├── kb_tab.py
+│   │   ├── analytics_tab.py
+│   │   └── settings_tab.py
+│   └── components/
+│       ├── citation_panel.py
+│       └── audio_controls.py
+│
+├── tests/                         # Full test suite — 328 tests
+│   ├── conftest.py
+│   ├── test_api_routes.py         # Integration tests (FastAPI + real methods)
+│   ├── test_phase0.py             # Foundation tests
+│   ├── test_phase1.py             # Ingestion tests
+│   ├── test_phase2.py             # Retrieval tests
+│   ├── test_phase3.py             # ASR tests
+│   ├── test_phase4.py             # Generation tests
+│   └── test_phase5.py             # UI / access control tests
+│
+├── DOCS/                          # Detailed phase documentation
+│   ├── phase0_foundation.md
+│   ├── phase1_ingestion.md
+│   ├── phase2_retrieval.md
+│   ├── phase3_asr.md
+│   ├── phase4_generation.md
+│   ├── phase5_ui_access.md
+│   └── phase6_deployment.md
+│
+└── Screenshots/
+    ├── 1.png                      # Ask tab — voice query interface
+    ├── 2.png                      # Knowledge Bases panel
+    ├── 3.png                      # Analytics dashboard
+    └── 4.png                      # Full app with KB and live conversation
+```
+---
+## Quick Start
+### Prerequisites
+- Python 3.11+
+- A Groq API key ([free at console.groq.com](https://console.groq.com))
+- Optionally a Gemini API key ([free at aistudio.google.com](https://aistudio.google.com))
+### 1. Clone and install
+```bash
+git clone https://github.com/ninjacode911/Project-VoiceVault.git
+cd Project-VoiceVault
+python -m venv .venv
+source .venv/bin/activate   # Windows: .venv\Scripts\activate
+pip install torch --index-url https://download.pytorch.org/whl/cpu   # CPU-only (saves ~1.8GB)
+pip install -r requirements.txt
+python -m spacy download en_core_web_sm
+```
+### 2. Configure secrets
+```bash
+cp .env.example .env
+# Edit .env and add:
+# GROQ_API_KEY=gsk_...
+# GEMINI_API_KEY=...   (optional)
+```
+### 3. Run
+```bash
+python server.py
+# Open http://localhost:7860
+```
+### 4. Use it
+1. Navigate to **Knowledge Bases** → click **+ New Knowledge Base**
+2. Name it (lowercase letters, digits, and hyphens only, e.g. `my-docs`) and upload your PDFs/documents
+3. Go back to **Ask VoiceVault** → select your KB → record or type a question → click **Ask**
+---
+## Running Tests
+```bash
+pytest tests/ -v
+# Expected: 328 passed
 ```
+The integration tests in `tests/test_api_routes.py` use a real `KBManager` backed by a temp SQLite DB and exercise the actual FastAPI routes and method signatures — not mocked pipelines. This is intentional: it catches runtime `AttributeError` bugs that pure-mock unit tests miss.
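
The point about mocks hiding `AttributeError` can be reproduced with the standard library alone. In this sketch the `KBManager` class, its `list_kbs` method, and `route_handler` are hypothetical stand-ins, not the project's real code; only the behavior of `unittest.mock.MagicMock` is factual.

```python
from unittest.mock import MagicMock


class KBManager:
    """Hypothetical stand-in for the real class in voicevault/kb/kb_manager.py."""

    def list_kbs(self):
        return ["my-docs"]


def route_handler(manager):
    # Deliberate bug: the method is called list_kbs, not list_knowledge_bases.
    return manager.list_knowledge_bases()


# A bare MagicMock creates any attribute on demand, so the bug goes unnoticed.
mock_result = route_handler(MagicMock())

# A real object fails immediately with AttributeError.
try:
    route_handler(KBManager())
    real_failed = False
except AttributeError:
    real_failed = True
```

When a mock is unavoidable, `MagicMock(spec=KBManager)` narrows the gap: it raises `AttributeError` for attributes the real class does not define.
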
+---
+## Deployment to Hugging Face Spaces
+The project ships with a `Dockerfile` configured for HF Spaces. The Docker image:
+- Uses Python 3.11-slim base
+- Installs CPU-only PyTorch (~650MB vs 2.5GB GPU wheels)
+- Pre-downloads `all-MiniLM-L6-v2` and `cross-encoder/ms-marco-MiniLM-L12-v2` at build time (no cold-start model downloads)
+- Downloads `en_core_web_sm` spaCy model at build time
+- Binds to `0.0.0.0:7860` (HF Spaces default port)
+To deploy your own copy:
+1. Create a [Hugging Face Space](https://huggingface.co/new-space) with **Docker** SDK
+2. Push this repository to the Space's git remote
+3. Add `GROQ_API_KEY` (and optionally `GEMINI_API_KEY`) as Space secrets
+See [DOCS/phase6_deployment.md](DOCS/phase6_deployment.md) for the full deployment walkthrough.
+---
+## Configuration
+All configuration is environment-driven via `.env`. See [`.env.example`](.env.example) for the full reference.
+Key variables:
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `GROQ_API_KEY` | — | **Required.** Groq API key for Whisper + Llama |
+| `GEMINI_API_KEY` | — | Optional Gemini fallback key |
+| `HOST` | `0.0.0.0` | Server bind address |
+| `PORT` | `7860` | Server port |
+| `FINAL_TOP_K` | `5` | Number of chunks passed to LLM |
+| `MAX_ANSWER_TOKENS` | `500` | LLM max output tokens |
+| `CHUNK_SIZE_MAX` | `600` | Max tokens per document chunk |
+| `BCRYPT_ROUNDS` | `12` | bcrypt work factor for KB passwords |
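
To make the defaults in the table concrete, here is a standard-library sketch of environment-driven loading. It is a stand-in for illustration only: the project uses Pydantic-settings in `config.py`, and the `Settings` and `load_settings` names below are hypothetical. Only the variable names and default values come from the table.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    groq_api_key: str
    gemini_api_key: Optional[str]
    host: str
    port: int
    final_top_k: int
    max_answer_tokens: int
    chunk_size_max: int
    bcrypt_rounds: int


def load_settings(env=None):
    """Build Settings from a mapping of environment variables."""
    env = os.environ if env is None else env
    if not env.get("GROQ_API_KEY"):
        raise ValueError("GROQ_API_KEY is required")
    return Settings(
        groq_api_key=env["GROQ_API_KEY"],
        gemini_api_key=env.get("GEMINI_API_KEY") or None,
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "7860")),
        final_top_k=int(env.get("FINAL_TOP_K", "5")),
        max_answer_tokens=int(env.get("MAX_ANSWER_TOKENS", "500")),
        chunk_size_max=int(env.get("CHUNK_SIZE_MAX", "600")),
        bcrypt_rounds=int(env.get("BCRYPT_ROUNDS", "12")),
    )


# Example: override only the port; everything else falls back to the defaults.
settings = load_settings({"GROQ_API_KEY": "gsk_example", "PORT": "8000"})
```
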
+---
+## Security
+| Control | Implementation |
+|---------|----------------|
+| **No raw queries stored** | Audit log stores SHA-256 hash only |
+| **KB access control** | bcrypt-hashed passwords (work factor 12) |
+| **SQL injection prevention** | 100% parameterized queries — no f-string SQL |
+| **Path traversal prevention** | KB names validated as slugs (`^[a-z0-9][a-z0-9\-]*[a-z0-9]$`) |
+| **SSRF prevention** | URL ingestion via trafilatura with no internal-network access |
+| **Upload whitelist** | Only `.pdf`, `.html`, `.docx`, `.md`, `.txt` accepted |
+| **File size limit** | 50MB max per upload |
+| **GPU isolation** | `CUDA_VISIBLE_DEVICES=-1` prevents CUDA crashes on incompatible hardware |
+| **No secrets in git** | `.env` gitignored; HF secrets via Space settings API |
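
Three of the controls above are simple enough to sketch with the standard library. The slug pattern and the extension list are copied from the table; the function names are hypothetical, and reading "50MB" as 50 × 1024 × 1024 bytes is an assumption.

```python
import hashlib
import re
from pathlib import Path

KB_NAME_PATTERN = re.compile(r"^[a-z0-9][a-z0-9\-]*[a-z0-9]$")
ALLOWED_EXTENSIONS = {".pdf", ".html", ".docx", ".md", ".txt"}
MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # assumption: "50MB" means 50 MiB


def is_valid_kb_name(name):
    """Accept only slugs, so a KB name can never contain '/' or '..'."""
    return KB_NAME_PATTERN.fullmatch(name) is not None


def hash_query(query):
    """Return the SHA-256 hex digest that stands in for the raw query text."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


def is_allowed_upload(filename, size_bytes):
    """Check the extension whitelist (case-insensitive) and the size limit."""
    suffix = Path(filename).suffix.lower()
    return suffix in ALLOWED_EXTENSIONS and size_bytes <= MAX_UPLOAD_BYTES
```

Note that the pattern as written requires at least two characters, so a single-character name such as `a` is rejected, as is any name that starts or ends with a hyphen.
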
+---
+## Phase Documentation
+Each phase has a detailed write-up covering design decisions, key code sections, and test results:
+| Phase | Topic | Tests |
+|-------|-------|-------|
+| [Phase 0](DOCS/phase0_foundation.md) | Project Foundation (config, models, schema, scaffold) | 58 ✅ |
+| [Phase 1](DOCS/phase1_ingestion.md) | Document Ingestion (parser, chunker, indexer) | 46 ✅ |
+| [Phase 2](DOCS/phase2_retrieval.md) | Hybrid Retrieval (BM25 + vector + RRF + reranker) | 33 ✅ |
+| [Phase 3](DOCS/phase3_asr.md) | ASR & Voice Input (Whisper, query preprocessor) | 47 ✅ |
+| [Phase 4](DOCS/phase4_generation.md) | Generation & Citations (LangChain, faithfulness guard) | 72 ✅ |
+| [Phase 5](DOCS/phase5_ui_access.md) | Full UI, TTS & Access Control | 55 ✅ |
+| [Phase 6](DOCS/phase6_deployment.md) | FastAPI Server, SPA Frontend & HF Deployment | 17 ✅ |
+**Total: 328 tests — all passing.**
+---
+## License
+**Source Available — All Rights Reserved.** See [LICENSE](LICENSE) for full terms.
+The source code is publicly visible for viewing and educational purposes. Any use in personal, commercial, or academic projects requires explicit written permission from the author.
+To request permission: navnitamrutharaj1234@gmail.com
+**Author:** Navnit Amrutharaj