# Phase 6: FastAPI Server, SPA Frontend & Hugging Face Deployment

**Status:** ✅ Complete | **Tests:** 17/17 passed (integration suite) | **Files:** 7 new files

---

## What Was Built

Phase 6 supersedes the Gradio UI with a production-quality FastAPI REST API + hand-crafted single-page application, then deploys the complete system to Hugging Face Spaces via Docker.

| Module | Responsibility |
|--------|----------------|
| `server.py` | FastAPI entry point with lifespan startup, singleton injection |
| `api/routes.py` | All REST endpoints (`/api/kbs`, `/api/ask`, `/api/transcribe`, `/api/analytics`) |
| `static/index.html` | SPA shell: sidebar nav, Ask/KB/Analytics views, modals, toast container |
| `static/style.css` | Dark glassmorphism design system (~600 lines) |
| `static/app.js` | Full SPA logic: recording, WAV conversion, API calls, chat, TTS (~500 lines) |
| `voicevault/asr/groq_transcriber.py` | Groq Whisper cloud transcription (~300ms) |
| `Dockerfile` | CPU-optimized Docker image for HF Spaces |
| `tests/test_api_routes.py` | Integration tests for all REST endpoints |

---

## Why Replace Gradio

The Gradio UI worked well for prototyping but had two production problems:

1. **Blocking event loop:** Whisper model loading (30–60s on first call) runs synchronously inside Gradio's event handler, freezing the entire UI.
2. **No control over UX:** Gradio's preset component library limits design to its own aesthetic.

FastAPI + a custom SPA solves both:
- All slow operations (model loading, retrieval, LLM calls) run inside FastAPI's async handlers on a proper ASGI server (uvicorn).
- The frontend is plain HTML/CSS/JS: no framework overhead, full design control.
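
An `async` handler keeps the server responsive only while blocking work stays off the event loop. The sketch below shows one standard-library way to do that with `asyncio.to_thread`. The names `slow_model_call` and `handle_request` are hypothetical stand-ins, and this illustrates the general approach rather than the code in `server.py`.

```python
import asyncio
import time


def slow_model_call(text: str) -> str:
    """Hypothetical stand-in for a blocking call (model load, retrieval, LLM)."""
    time.sleep(0.2)  # simulates CPU- or network-bound work
    return text.upper()


async def handle_request(text: str) -> str:
    # Run the blocking function in a worker thread so the event loop
    # stays free to serve other requests in the meantime.
    return await asyncio.to_thread(slow_model_call, text)


async def demo() -> list:
    # The three requests overlap instead of running back to back.
    results = await asyncio.gather(*(handle_request(t) for t in ("a", "b", "c")))
    return list(results)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

FastAPI also runs plain `def` handlers in a thread pool on its own, so declaring a handler without `async` is another way to get the same effect.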

---

## FastAPI Server (`server.py`)

### Key Decisions

**GPU forced off at the very top:**
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"   # Must be before any ML imports
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"
```
The RTX 5070 (sm_120 architecture) is incompatible with packaged PyTorch (max sm_90). Setting `CUDA_VISIBLE_DEVICES=-1` before any ML import prevents the crash. These lines must come first in `server.py`: any later, and `sentence_transformers` or `torch` may already have probed CUDA.
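
Because the failure mode is silent (the variable is simply set too late), a small guard can turn an import-order mistake into an immediate error. This is a minimal sketch, not part of the project: `force_cpu_only` is a hypothetical helper, and the default module list is an assumption about which libraries probe CUDA.

```python
import os
import sys


def force_cpu_only(ml_modules=("torch", "sentence_transformers")) -> None:
    """Hide all CUDA devices, refusing to run if an ML library is already loaded."""
    already_loaded = [name for name in ml_modules if name in sys.modules]
    if already_loaded:
        # Too late: the library may already have probed CUDA.
        raise RuntimeError(
            f"Set CUDA_VISIBLE_DEVICES before importing: {already_loaded}"
        )
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
```

Calling `force_cpu_only()` as the first statement of the entry point gives the same effect as the two `os.environ` lines, plus an early failure if the import order ever regresses.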

**Modern lifespan pattern (no deprecated `on_event`):**
```python
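from contextlib import asynccontextmanager

from fastapi import FastAPI


@asynccontextmanager
async def _lifespan(app: FastAPI):
    # Runs once before the server starts accepting requests.
    kb_manager, transcriber, answer_chain = _startup()
    init_routes(kb_manager, transcriber, answer_chain, _CENTRAL_DB_PATH)
    yield  # the application serves requests while suspended here

app = FastAPI(lifespan=_lifespan)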
```
FastAPI deprecated `@app.on_event("startup")` in favour of this context manager pattern.

**Smart transcriber selection:**
```python
if cfg.has_groq_key():
    transcriber = GroqTranscriber()   # ~300ms, cloud
else:
    transcriber = WhisperTranscriber()  # ~5–60s, local CPU
```

---

## REST API (`api/routes.py`)

### Singleton Injection

Rather than using FastAPI `Depends()` (which requires each endpoint to declare its dependencies), singletons are injected once at startup via `init_routes()`:

```python
_kb_manager = None
_transcriber = None
_answer_chain = None
_db_path: Optional[Path] = None

def init_routes(kb_manager, transcriber, answer_chain, db_path) -> None:
    global _kb_manager, _transcriber, _answer_chain, _db_path
    _kb_manager = kb_manager
    ...
```

This makes the module stateful but keeps route definitions clean and eliminates per-request dependency resolution overhead for these heavy singletons.

### Critical Bug Fixed: `.search()` vs `.retrieve()`

During the first end-to-end runtime test, `/api/ask` raised:
```
AttributeError: 'HybridRetriever' object has no attribute 'search'
```
The existing unit tests never caught this because `HybridRetriever` was fully mocked: the mock accepted any attribute access. The actual public method is `retrieve()`.

**Root cause:** Unit tests with `MagicMock()` replacing the entire retriever cannot catch wrong method names.

**Fix:** `retriever.retrieve(search_query)` in both `api/routes.py` and `ui/tabs/ask_tab.py`.

**Prevention:** The new integration tests in `test_api_routes.py` patch individual methods on the real `HybridRetriever` class instead of replacing the whole object, so all method call routing stays real:
```python
with patch("voicevault.retrieval.hybrid_retriever.HybridRetriever.retrieve",
           return_value=[]) as mock_retrieve:
    r = client.post("/api/ask", json={...})
mock_retrieve.assert_called_once()   # would fail if code calls .search()
```
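
The root cause can be reproduced in a few lines. The sketch below contrasts a bare `MagicMock()` with `unittest.mock.create_autospec`, which only exposes attributes that exist on the real class. The `HybridRetriever` defined here is a minimal stand-in for illustration, not the project's class.

```python
from unittest.mock import MagicMock, create_autospec


class HybridRetriever:
    """Minimal stand-in with the same public method name as the real class."""

    def retrieve(self, query: str) -> list:
        return []


# A bare MagicMock invents any attribute on demand, so the typo goes unnoticed.
loose = MagicMock()
loose.search("q")  # no error, even though the real class has no .search()

# An autospecced mock only exposes attributes that exist on the real class.
strict = create_autospec(HybridRetriever, instance=True)
strict.retrieve.return_value = []
strict.retrieve("q")  # fine: the method exists and the signature matches

try:
    strict.search("q")
except AttributeError as exc:
    print(f"caught: {exc}")
```

Using `spec` or autospec in the existing unit tests would be a cheaper complement to the integration suite, since it catches wrong method names without starting the API.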

### Endpoints

| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/api/kbs` | List all knowledge bases with doc/chunk counts |
| `POST` | `/api/kbs` | Create a KB (validates name, hashes password) |
| `DELETE` | `/api/kbs/{kb_name}` | Delete KB and all its indexed data |
| `POST` | `/api/kbs/{kb_name}/documents` | Upload + index documents |
| `POST` | `/api/transcribe` | Transcribe uploaded audio file |
| `POST` | `/api/ask` | Full RAG pipeline: retrieve → generate → cite |
| `GET` | `/api/analytics` | Query stats from SQLite audit log |

---

## Groq Transcriber (`voicevault/asr/groq_transcriber.py`)

### Why Groq for ASR

Local Whisper on CPU downloads a 1.5GB model on first use and takes 5+ minutes to transcribe a 5-second clip. This is unusable in a live demo setting.

Groq's Whisper API:
- No model download
- ~300–500ms round-trip for short voice queries
- Same `whisper-large-v3-turbo` model quality
- Free tier: 7,200 requests/day

```python
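from groq import Groq

client = Groq()  # assumes GROQ_API_KEY is set in the environment
with open(audio_path, "rb") as f:  # audio_path: pathlib.Path of the uploaded clip
    response = client.audio.transcriptions.create(
        file=(audio_path.name, f.read()),
        model="whisper-large-v3-turbo",
        language="en",
        response_format="text",
    )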
```

The `GroqTranscriber` returns the same `TranscriptResult` dataclass as `WhisperTranscriber`, so the rest of the pipeline is unaware of which was used.
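
The shared return type is what makes the two backends interchangeable. The sketch below expresses that contract with a `typing.Protocol`. The field names on `TranscriptResult`, the `transcribe` method name, and the two fake backends are assumptions for illustration; the real classes may differ.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass
class TranscriptResult:
    """Hypothetical fields; the real dataclass may carry more."""
    text: str
    backend: str


class Transcriber(Protocol):
    def transcribe(self, audio_path: Path) -> TranscriptResult: ...


class FakeCloudTranscriber:
    def transcribe(self, audio_path: Path) -> TranscriptResult:
        return TranscriptResult(text=f"cloud:{audio_path.name}", backend="cloud")


class FakeLocalTranscriber:
    def transcribe(self, audio_path: Path) -> TranscriptResult:
        return TranscriptResult(text=f"local:{audio_path.name}", backend="local")


def pick_transcriber(has_cloud_key: bool) -> Transcriber:
    """Mirror the selection rule: cloud when a key is configured, else local."""
    return FakeCloudTranscriber() if has_cloud_key else FakeLocalTranscriber()


def handle_audio(transcriber: Transcriber, audio_path: Path) -> str:
    # Downstream code depends only on TranscriptResult, not on the backend class.
    return transcriber.transcribe(audio_path).text
```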

---

## SPA Frontend

### Browser Audio → WAV Conversion

The browser's `MediaRecorder` API outputs `audio/webm` (Chrome) or `audio/ogg` (Firefox). The server's `soundfile` library only reads WAV/FLAC/OGG-Vorbis. Sending WebM directly to the server would require `ffmpeg`.

Solution: convert in-browser using `AudioContext` before sending:

```javascript
async function convertBlobToWav(blob) {
    const arrayBuffer = await blob.arrayBuffer();
    const audioCtx = new (window.AudioContext || window.webkitAudioContext)();
    const audioBuffer = await audioCtx.decodeAudioData(arrayBuffer);
    audioCtx.close();
    return audioBufferToWavBlob(audioBuffer);  // 16-bit PCM WAV
}
```

This eliminates the `ffmpeg` system dependency entirely. The server receives a standard PCM WAV file that `soundfile` reads natively.
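
The `audioBufferToWavBlob` helper is not shown here, so the following Python sketch illustrates the same encoding step from the other side: float samples in the range [-1.0, 1.0] are scaled to little-endian 16-bit integers and wrapped in a WAV header. The function name, the mono layout, and the 16 kHz sample rate are assumptions for illustration, not values taken from `app.js`.

```python
import io
import math
import struct
import wave


def floats_to_wav_bytes(samples, sample_rate=16000):
    """Encode mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    pcm = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, s))                # guard against overshoot
        pcm += struct.pack("<h", int(clipped * 32767))  # little-endian int16
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(pcm))
    return buf.getvalue()


# 0.1 s of a 440 Hz tone at 16 kHz
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
wav_bytes = floats_to_wav_bytes(tone)
```

A file produced this way starts with the `RIFF`/`WAVE` markers and needs no external decoder on the server.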

### Design System

The SPA uses a dark glassmorphism design:
- Background: near-black `#09090b` with ambient purple gradient orbs (CSS `@keyframes drift`)
- Cards: `rgba(255,255,255,0.03)` with `backdrop-filter: blur(12px)` and subtle border
- Primary: `#8b5cf6` (violet-500), used consistently across buttons, badges, and the microphone glow
- Chat bubbles: user messages right-aligned (violet), assistant messages left-aligned (dark card)
- Animations: `msgIn` slide-in for chat, `micPulse` glow during recording, `waveAnim` bars during processing

---

## Integration Tests (`tests/test_api_routes.py`)

### Philosophy

The existing 311 tests mock `HybridRetriever`, `AnswerChain`, and `WhisperTranscriber` at the class level using `MagicMock()`. This correctly validates internal logic within each module, but cannot catch:
- Wrong method names called across module boundaries
- Incorrect field names in Pydantic response models
- Route registration issues (missing prefix, typos)

The integration tests solve this by using:
- Real `KBManager` backed by a temp SQLite DB (schema initialized via `initialize_database()`)
- Real FastAPI `TestClient` routing (not mocked)
- Only the LLM and Whisper calls mocked (network/slow)

```python
@pytest.fixture(scope="module")
def client(kb_manager, mock_transcriber, mock_answer_chain, tmp_path_factory):
    db = tmp_path_factory.mktemp("db2") / "server.db"
    db_mod.initialize_database(db)          # real schema
    routes_mod.init_routes(kb_manager, mock_transcriber, mock_answer_chain, db)
    app = FastAPI()
    app.include_router(router)
    return TestClient(app)
```

### Test Coverage

| Class | Tests | What Is Validated |
|-------|-------|-------------------|
| `TestKBEndpoints` | 8 | CRUD, validation, duplicate detection, response fields |
| `TestAskEndpoint` | 6 | Method names (`retrieve`, `build`), response schema, history, empty/missing inputs |
| `TestAnalyticsEndpoint` | 2 | Stats structure, KB list type |
| `TestTranscribeEndpoint` | 1 | Real WAV file upload, mock transcriber called |

**Total: 17 tests, all passing.**

---

## Docker Deployment

### Image Strategy

```dockerfile
# 1. CPU-only PyTorch first (saves 1.8GB vs GPU wheel)
RUN pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cpu

# 2. All other requirements
RUN pip install -r requirements.txt

# 3. spaCy model (needed for document chunking)
RUN python -m spacy download en_core_web_sm

# 4. Pre-download ML models at build time (no cold-start delays)
RUN python -c "from sentence_transformers import SentenceTransformer, CrossEncoder; \
    SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2'); \
    CrossEncoder('cross-encoder/ms-marco-MiniLM-L12-v2')"
```

Pre-baking the embedding and reranking models into the Docker layer means the first document upload doesn't trigger a download. Whisper is intentionally not pre-baked, because the Groq cloud API is used when `GROQ_API_KEY` is present.

### Environment Variables in HF Spaces

API keys are injected as Space secrets (not committed to git):

```python
import os
from huggingface_hub import HfApi

api = HfApi()
# Read each key from the local environment so no secret value is hard-coded.
for name in ("GROQ_API_KEY", "GEMINI_API_KEY"):
    api.add_space_secret("NinjainPJs/VoiceVault", name, os.environ[name])
```

### Storage

HF Spaces free tier uses ephemeral storage, so knowledge bases created at runtime are lost on container restart. This is acceptable for a demo deployment. For production persistence, HF Spaces offers a persistent storage add-on, or the data layer can be pointed at an external object store.

---

## Lessons Learned

1. **Mock depth matters:** mocking at the class level (`MagicMock()`) cannot catch method-name bugs across modules. Integration tests that mock only I/O boundaries (LLM API, Whisper model) while keeping real routing are essential.

2. **GPU environment variables must be first:** `CUDA_VISIBLE_DEVICES=-1` must precede all Python ML imports. Any utility module that imports `torch` at module level will trigger CUDA detection before the variable is set if import order is not controlled.

3. **Browser audio formats:** `MediaRecorder` output (WebM/OGG) and server-side audio libraries (`soundfile`) don't share a format. Converting to 16-bit PCM WAV in the browser with `AudioContext` is the cleanest zero-dependency solution.

4. **Groq vs local Whisper:** For a live demo, cloud transcription is non-negotiable. A 5-minute wait on first recording kills the experience. The 300ms Groq round-trip feels instant.