---
title: VoiceVault
emoji: 🎙️
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: other
---

<div align="center">

# VoiceVault

**Voice-First RAG Knowledge Agent**

*Speak to your documents. Get cited answers back.*

[![Python](https://img.shields.io/badge/Python-3.11+-3776AB?style=flat&logo=python&logoColor=white)](https://www.python.org/)
[![FastAPI](https://img.shields.io/badge/FastAPI-0.115+-009688?style=flat&logo=fastapi&logoColor=white)](https://fastapi.tiangolo.com/)
[![License](https://img.shields.io/badge/License-Source%20Available-blue.svg)](LICENSE)
[![Tests](https://img.shields.io/badge/Tests-328%20passing-brightgreen)](tests/)
[![HF Spaces](https://img.shields.io/badge/🤗%20Hugging%20Face-Live%20Demo-FFD21E)](https://huggingface.co/spaces/NinjainPJs/VoiceVault)

[**Live Demo →**](https://huggingface.co/spaces/NinjainPJs/VoiceVault)&nbsp;&nbsp;|&nbsp;&nbsp;[**Documentation →**](DOCS/)&nbsp;&nbsp;|&nbsp;&nbsp;[**Project Plan →**](PLAN.md)

</div>

---

## Overview

VoiceVault is a production-grade, voice-first Retrieval-Augmented Generation (RAG) system built entirely from scratch. Users record or type questions and receive answers grounded in their own private document collections, with inline citations pointing back to the exact source, page, and paragraph.

The project was built in 6 phases over several weeks, with a full test suite (328 tests), enterprise-grade security practices (bcrypt, parameterized SQL, SHA-256 audit logs, SSRF prevention), and deployment to Hugging Face Spaces via Docker.

**What makes this different from typical RAG demos:**
- **Hybrid retrieval**: BM25 keyword search + semantic vector search, fused with Reciprocal Rank Fusion (RRF) + cross-encoder reranking. Most tutorials use only one retrieval method.
- **Voice-native pipeline**: Groq Whisper API for ~300ms cloud transcription with local Whisper fallback; Web Speech API for TTS output.
- **Faithfulness guard**: Detects when the LLM cannot answer from retrieved context and returns a grounded refusal instead of hallucinating.
- **Multi-KB support**: Multiple independent knowledge bases, each optionally password-protected.

---

## Screenshots

<div align="center">

### Ask VoiceVault: Voice Query Interface
*Record your question via microphone or type it. The mic button pulses when recording.*

<img src="Screenshots/1.png" alt="Ask VoiceVault: main voice query interface with dark glassmorphism UI" width="800"/>

---

### Knowledge Base Management
*Create named knowledge bases, upload documents (PDF, DOCX, HTML, MD, TXT), and manage them.*

<img src="Screenshots/2.png" alt="Knowledge Bases panel: empty state with New Knowledge Base button" width="800"/>

---

### Analytics Dashboard
*Real-time query statistics: total queries, average latency, citation counts, and daily breakdowns.*

<img src="Screenshots/3.png" alt="Analytics dashboard showing query statistics" width="800"/>

---

### Full App in Action
*A populated knowledge base (358 chunks from 1 document) and a live conversation with the RAG pipeline.*

<img src="Screenshots/4.png" alt="Full VoiceVault app with a knowledge base and active conversation" width="800"/>

</div>

---

## Architecture

```
INGESTION PATH (one-time per document set)
──────────────────────────────────────────────────────
  User uploads PDF / HTML / DOCX / MD / TXT
      │
      ▼
  DocumentParser         →  text + metadata per page
      │                     (PyMuPDF, BS4, python-docx)
      ▼
  SemanticChunker        →  sentence-aware chunks
      │                     (spaCy sentences + cosine boundary)
      ▼
  IndexBuilder           →  ChromaDB (vector) + BM25 (keyword)
                             + SQLite (metadata)

QUERY PATH (real-time, per question)
──────────────────────────────────────────────────────
  Browser mic → WAV → POST /api/transcribe
      │
      ▼
  GroqTranscriber        →  Groq Whisper API (~300ms)
      │                     [fallback: local Whisper CPU]
      ▼
  QueryPreprocessor      →  filler removal, intent classification
      │                     (factual / summary / compare)
      ▼
  HybridRetriever        →  BM25 top-20 + Vector top-20
      │                     → RRF merge (k=60)
      │                     → CrossEncoder rerank (ms-marco-MiniLM-L12-v2)
      │                     → diversity filter (max 2 chunks/page)
      ▼
  ContextBuilder         →  formatted context with [Source:N] markers
      ▼
  LangChain LCEL         →  Groq Llama-3.1-70B (primary)
      │                     [fallback: Gemini 1.5 Flash]
      ▼
  FaithfulnessGuard      →  refusal detection, confidence scoring
      │
  CitationInjector       →  resolve [Source:N] → filename + page
      ▼
  JSON response          →  answer + citations + confidence + tts_text
      │
      ▼
  SPA Frontend           →  chat display + Web Speech API TTS
```
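
The RRF merge step in the query path can be sketched in a few lines. This is a minimal illustration, assuming each retriever returns an ordered list of chunk IDs; the function name `rrf_merge` and the sample IDs are hypothetical and are not taken from `hybrid_retriever.py`. Only the constant `k=60` comes from the diagram above.

```python
from collections import defaultdict


def rrf_merge(rankings, k=60):
    """Fuse several ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


bm25_hits = ["c3", "c1", "c7"]     # hypothetical keyword ranking
vector_hits = ["c1", "c9", "c3"]   # hypothetical semantic ranking
fused = rrf_merge([bm25_hits, vector_hits])
print(fused)
```

Chunks that appear in both lists accumulate two terms, so they tend to rise above chunks that only one retriever found, without either retriever's raw scores needing to be comparable.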

---

## Features

| Feature | Detail |
|---------|--------|
| **Voice Input** | Browser microphone → WAV conversion → Groq Whisper API (~300ms) |
| **Hybrid Retrieval** | BM25 + semantic vector search, RRF fusion, cross-encoder reranking |
| **Multi-KB** | Create multiple independent knowledge bases per session |
| **KB Access Control** | Optional bcrypt password protection (work factor 12) per KB |
| **Document Formats** | PDF, DOCX, HTML, Markdown, TXT (OCR fallback for scanned PDFs) |
| **Source Citations** | Every answer traceable to source file + page number |
| **Faithfulness Guard** | Detects hallucinations; returns grounded refusal when context is insufficient |
| **Conversation Memory** | Rolling 5-turn conversation window passed to the LLM |
| **LLM Fallback** | Groq Llama-3.1-70B → Gemini 1.5 Flash automatic fallback |
| **TTS Output** | Web Speech API reads answer aloud with citation markers stripped |
| **Analytics** | SQLite audit log: query counts, latency, citation rates (7-day window) |
| **Privacy** | Raw queries never stored; the audit log keeps only a SHA-256 hash |
| **328 Tests** | Integration + unit tests across all 6 phases |
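
To illustrate the Source Citations and TTS Output rows, here is a minimal sketch of resolving `[Source:N]` markers and stripping them before speech. It assumes markers are 1-based indexes into the ordered list of retrieved chunks; the helper names and sample data are hypothetical and do not reproduce `citation_injector.py` or `web_speech.py`.

```python
import re

MARKER = re.compile(r"\[Source:(\d+)\]")


def resolve_citations(answer, sources):
    """Return one citation dict per distinct [Source:N] marker in the answer.

    `sources` is the ordered list of retrieved chunks; N is assumed 1-based.
    Markers that point outside the list are ignored.
    """
    citations = []
    seen = set()
    for match in MARKER.finditer(answer):
        n = int(match.group(1))
        if n in seen or not 1 <= n <= len(sources):
            continue
        seen.add(n)
        citations.append({"marker": n, **sources[n - 1]})
    return citations


def to_tts_text(answer):
    """Strip citation markers and tidy whitespace so TTS reads clean prose."""
    text = MARKER.sub("", answer)
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)  # no space before punctuation
    return re.sub(r"\s{2,}", " ", text).strip()


# Hypothetical retrieved chunks and model answer
sources = [
    {"filename": "handbook.pdf", "page": 12},
    {"filename": "policy.docx", "page": 3},
]
answer = "Leave requests need manager approval [Source:2]. Forms are online [Source:1]."
citations = resolve_citations(answer, sources)
tts_text = to_tts_text(answer)
```

Keeping the display text and the spoken text separate means the chat view can still show markers while the speech output skips them.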

---

## Tech Stack

| Layer | Technology | Purpose |
|-------|-----------|---------|
| **API** | FastAPI + uvicorn | REST backend with async endpoints |
| **Frontend** | HTML5 / CSS3 / Vanilla JS | Premium dark SPA (no framework) |
| **ASR** | Groq Whisper API | Cloud transcription (~300ms) |
| **ASR Fallback** | OpenAI Whisper Large-v3 | Local CPU transcription |
| **Embeddings** | sentence-transformers `all-MiniLM-L6-v2` | Dense vector representations |
| **Reranking** | `cross-encoder/ms-marco-MiniLM-L12-v2` | Semantic relevance scoring |
| **Vector Store** | ChromaDB | In-process vector database |
| **Keyword Search** | rank-bm25 (BM25Okapi) | Lexical keyword matching |
| **Chunking** | spaCy `en_core_web_sm` | Sentence boundary detection |
| **LLM (primary)** | Groq Llama-3.1-70B | Fast inference via Groq cloud |
| **LLM (fallback)** | Gemini 1.5 Flash | Google generative AI fallback |
| **Orchestration** | LangChain LCEL | LLM pipeline composition |
| **Metadata** | SQLite | KB registry, doc index, audit log |
| **Security** | bcrypt (work factor 12) | KB password hashing |
| **Config** | Pydantic-settings | Centralized, type-safe config |
| **Deployment** | Docker on Hugging Face Spaces | Container-based cloud hosting |

---

## Project Structure

```
Project-VoiceVault/
├── server.py                      # FastAPI entry point (run this)
├── app.py                         # Gradio entry point (legacy / tests)
├── config.py                      # Centralized Pydantic-settings config
├── requirements.txt               # All dependencies
├── Dockerfile                     # HF Spaces Docker deployment
├── .env.example                   # Environment variable template
│
├── api/                           # FastAPI REST API
│   ├── __init__.py
│   └── routes.py                  # All /api/* endpoints
│
├── static/                        # SPA frontend assets
│   ├── index.html                 # Single-page application shell
│   ├── style.css                  # Dark glassmorphism design system
│   └── app.js                     # Full SPA logic (recording, chat, KB CRUD)
│
├── voicevault/                    # Core package
│   ├── models.py                  # Pydantic data models
│   ├── asr/
│   │   ├── groq_transcriber.py    # Groq Whisper cloud ASR (~300ms)
│   │   ├── whisper_transcriber.py # Local Whisper CPU/GPU fallback
│   │   └── query_preprocessor.py  # Filler removal, intent classification
│   ├── ingestion/
│   │   ├── document_parser.py     # PDF/HTML/DOCX/MD/TXT → structured text
│   │   ├── semantic_chunker.py    # Sentence-aware chunking with topic boundaries
│   │   └── index_builder.py       # ChromaDB + BM25 + SQLite orchestration
│   ├── retrieval/
│   │   ├── hybrid_retriever.py    # BM25 + vector + RRF + cross-encoder
│   │   ├── bm25_retriever.py      # BM25Okapi keyword search
│   │   ├── vector_retriever.py    # ChromaDB semantic search
│   │   └── context_builder.py     # Context formatting + citation markers
│   ├── generation/
│   │   ├── answer_chain.py        # LangChain LCEL + Groq + Gemini fallback
│   │   ├── faithfulness_guard.py  # Hallucination detection + refusal
│   │   └── citation_injector.py   # [Source:N] → filename + page resolution
│   ├── kb/
│   │   └── kb_manager.py          # KB lifecycle, bcrypt auth, validation
│   ├── storage/
│   │   ├── sqlite_store.py        # Schema, CRUD, audit log queries
│   │   └── chroma_store.py        # ChromaDB wrapper
│   └── tts/
│       └── web_speech.py          # TTS text preparation
│
├── ui/                            # Gradio UI components (legacy / app.py)
│   ├── tabs/
│   │   ├── ask_tab.py
│   │   ├── kb_tab.py
│   │   ├── analytics_tab.py
│   │   └── settings_tab.py
│   └── components/
│       ├── citation_panel.py
│       └── audio_controls.py
│
├── tests/                         # Full test suite: 328 tests
│   ├── conftest.py
│   ├── test_api_routes.py         # Integration tests (FastAPI + real methods)
│   ├── test_phase0.py             # Foundation tests
│   ├── test_phase1.py             # Ingestion tests
│   ├── test_phase2.py             # Retrieval tests
│   ├── test_phase3.py             # ASR tests
│   ├── test_phase4.py             # Generation tests
│   └── test_phase5.py             # UI / access control tests
│
├── DOCS/                          # Detailed phase documentation
│   ├── phase0_foundation.md
│   ├── phase1_ingestion.md
│   ├── phase2_retrieval.md
│   ├── phase3_asr.md
│   ├── phase4_generation.md
│   ├── phase5_ui_access.md
│   └── phase6_deployment.md
│
└── Screenshots/
    ├── 1.png                      # Ask tab: voice query interface
    ├── 2.png                      # Knowledge Bases panel
    ├── 3.png                      # Analytics dashboard
    └── 4.png                      # Full app with KB and live conversation
```

---

## Quick Start

### Prerequisites

- Python 3.11+
- A Groq API key ([free at console.groq.com](https://console.groq.com))
- Optionally a Gemini API key ([free at aistudio.google.com](https://aistudio.google.com))

### 1. Clone and install

```bash
git clone https://github.com/ninjacode911/Project-VoiceVault.git
cd Project-VoiceVault
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install torch --index-url https://download.pytorch.org/whl/cpu   # CPU-only (saves ~1.8GB)
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```

### 2. Configure secrets

```bash
cp .env.example .env
# Edit .env and add:
# GROQ_API_KEY=gsk_...
# GEMINI_API_KEY=...   (optional)
```

### 3. Run

```bash
python server.py
# Open http://localhost:7860
```

### 4. Use it

1. Navigate to **Knowledge Bases** → click **+ New Knowledge Base**
2. Name it (lowercase letters, digits, and hyphens only, e.g. `my-docs`) and upload your PDFs/documents
3. Go back to **Ask VoiceVault** → select your KB → record or type a question → click **Ask**

---

## Running Tests

```bash
pytest tests/ -v
# Expected: 328 passed
```

The integration tests in `tests/test_api_routes.py` use a real `KBManager` backed by a temp SQLite DB and exercise the actual FastAPI routes and method signatures rather than mocked pipelines. This is intentional: it catches runtime `AttributeError` bugs that pure-mock unit tests miss.

---

## Deployment to Hugging Face Spaces

The project ships with a `Dockerfile` configured for HF Spaces. The Docker image:
- Uses Python 3.11-slim base
- Installs CPU-only PyTorch (~650MB vs 2.5GB GPU wheels)
- Pre-downloads `all-MiniLM-L6-v2` and `cross-encoder/ms-marco-MiniLM-L12-v2` at build time (no cold-start model downloads)
- Downloads `en_core_web_sm` spaCy model at build time
- Binds to `0.0.0.0:7860` (HF Spaces default port)

To deploy your own copy:

1. Create a [Hugging Face Space](https://huggingface.co/new-space) with **Docker** SDK
2. Push this repository to the Space's git remote
3. Add `GROQ_API_KEY` (and optionally `GEMINI_API_KEY`) as Space secrets

See [DOCS/phase6_deployment.md](DOCS/phase6_deployment.md) for the full deployment walkthrough.

---

## Configuration

All configuration is environment-driven via `.env`. See [`.env.example`](.env.example) for the full reference.

Key variables:

| Variable | Default | Description |
|----------|---------|-------------|
| `GROQ_API_KEY` | (none) | **Required.** Groq API key for Whisper + Llama |
| `GEMINI_API_KEY` | (none) | Optional Gemini fallback key |
| `HOST` | `0.0.0.0` | Server bind address |
| `PORT` | `7860` | Server port |
| `FINAL_TOP_K` | `5` | Number of chunks passed to LLM |
| `MAX_ANSWER_TOKENS` | `500` | LLM max output tokens |
| `CHUNK_SIZE_MAX` | `600` | Max tokens per document chunk |
| `BCRYPT_ROUNDS` | `12` | bcrypt work factor for KB passwords |
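
As a rough illustration of how these variables and defaults fit together, the sketch below loads them using only the standard library. The project itself uses Pydantic-settings in `config.py`; the `Settings` dataclass and `load_settings` helper here are hypothetical stand-ins, and only the variable names and defaults are taken from the table above.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    groq_api_key: str
    gemini_api_key: Optional[str]
    host: str
    port: int
    final_top_k: int
    max_answer_tokens: int
    chunk_size_max: int
    bcrypt_rounds: int


def load_settings(env=None):
    """Build Settings from an environment mapping, applying the documented defaults."""
    env = os.environ if env is None else env
    if not env.get("GROQ_API_KEY"):
        raise RuntimeError("GROQ_API_KEY is required")
    return Settings(
        groq_api_key=env["GROQ_API_KEY"],
        gemini_api_key=env.get("GEMINI_API_KEY") or None,  # optional fallback key
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "7860")),
        final_top_k=int(env.get("FINAL_TOP_K", "5")),
        max_answer_tokens=int(env.get("MAX_ANSWER_TOKENS", "500")),
        chunk_size_max=int(env.get("CHUNK_SIZE_MAX", "600")),
        bcrypt_rounds=int(env.get("BCRYPT_ROUNDS", "12")),
    )


# Placeholder key; any variable left unset falls back to its default.
settings = load_settings({"GROQ_API_KEY": "gsk_example", "PORT": "8000"})
```

Failing fast on a missing `GROQ_API_KEY` surfaces a misconfigured deployment at startup instead of on the first query.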

---

## Security

| Control | Implementation |
|---------|----------------|
| **No raw queries stored** | Audit log stores SHA-256 hash only |
| **KB access control** | bcrypt-hashed passwords (work factor 12) |
| **SQL injection prevention** | 100% parameterized queries; no f-string SQL |
| **Path traversal prevention** | KB names validated as slugs (`^[a-z0-9][a-z0-9\-]*[a-z0-9]$`) |
| **SSRF prevention** | URL ingestion via trafilatura with no internal-network access |
| **Upload whitelist** | Only `.pdf`, `.html`, `.docx`, `.md`, `.txt` accepted |
| **File size limit** | 50MB max per upload |
| **GPU isolation** | `CUDA_VISIBLE_DEVICES=-1` prevents CUDA crashes on incompatible hardware |
| **No secrets in git** | `.env` gitignored; HF secrets via Space settings API |
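
Three of the controls above (slug validation, hashed audit entries, and parameterized SQL) can be sketched with the standard library. The table name `audit_log`, its columns, and the helper names are assumptions made for illustration; only the slug pattern and the use of SHA-256 come from the table.

```python
import hashlib
import re
import sqlite3

# Slug pattern from the table above; note that it needs at least two characters.
SLUG_RE = re.compile(r"^[a-z0-9][a-z0-9\-]*[a-z0-9]$")


def is_valid_kb_name(name):
    """True if the name is a slug, which rules out '/', '.' and '..' segments."""
    return SLUG_RE.fullmatch(name) is not None


def hash_query(query):
    """Hex SHA-256 digest of the query text, stored in place of the raw query."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


def log_query(conn, kb_name, query, latency_ms):
    """Insert an audit row using ? placeholders, never string formatting."""
    if not is_valid_kb_name(kb_name):
        raise ValueError(f"invalid KB name: {kb_name!r}")
    conn.execute(
        "INSERT INTO audit_log (kb_name, query_hash, latency_ms) VALUES (?, ?, ?)",
        (kb_name, hash_query(query), latency_ms),
    )


# Hypothetical schema in an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE audit_log (kb_name TEXT, query_hash TEXT, latency_ms INTEGER)"
)
query = "What is the refund policy?"
log_query(conn, "my-docs", query, 412)
row = conn.execute(
    "SELECT kb_name, query_hash, latency_ms FROM audit_log"
).fetchone()
```

Because the values travel as bound parameters, a query string containing quotes or SQL keywords is stored (as a hash) without ever being interpreted as SQL.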

---

## Phase Documentation

Each phase has a detailed write-up covering design decisions, key code sections, and test results:

| Phase | Topic | Tests |
|-------|-------|-------|
| [Phase 0](DOCS/phase0_foundation.md) | Project Foundation (config, models, schema, scaffold) | 58 ✅ |
| [Phase 1](DOCS/phase1_ingestion.md) | Document Ingestion (parser, chunker, indexer) | 46 ✅ |
| [Phase 2](DOCS/phase2_retrieval.md) | Hybrid Retrieval (BM25 + vector + RRF + reranker) | 33 ✅ |
| [Phase 3](DOCS/phase3_asr.md) | ASR & Voice Input (Whisper, query preprocessor) | 47 ✅ |
| [Phase 4](DOCS/phase4_generation.md) | Generation & Citations (LangChain, faithfulness guard) | 72 ✅ |
| [Phase 5](DOCS/phase5_ui_access.md) | Full UI, TTS & Access Control | 55 ✅ |
| [Phase 6](DOCS/phase6_deployment.md) | FastAPI Server, SPA Frontend & HF Deployment | 17 ✅ |

**Total: 328 tests, all passing.**

---

## License

**Source Available, All Rights Reserved.** See [LICENSE](LICENSE) for full terms.

The source code is publicly visible for viewing and educational purposes. Any use in personal, commercial, or academic projects requires explicit written permission from the author.

To request permission: navnitamrutharaj1234@gmail.com

**Author:** Navnit Amrutharaj