# Phase 6: FastAPI Server, SPA Frontend & Hugging Face Deployment
**Status:** ✅ Complete | **Tests:** 17/17 passed (integration suite) | **Files:** 7 new files
---
## What Was Built
Phase 6 supersedes the Gradio UI with a production-quality FastAPI REST API + hand-crafted single-page application, then deploys the complete system to Hugging Face Spaces via Docker.
| Module | Responsibility |
|--------|----------------|
| `server.py` | FastAPI entry point with lifespan startup, singleton injection |
| `api/routes.py` | All REST endpoints (`/api/kbs`, `/api/ask`, `/api/transcribe`, `/api/analytics`) |
| `static/index.html` | SPA shell: sidebar nav, Ask/KB/Analytics views, modals, toast container |
| `static/style.css` | Dark glassmorphism design system (~600 lines) |
| `static/app.js` | Full SPA logic: recording, WAV conversion, API calls, chat, TTS (~500 lines) |
| `voicevault/asr/groq_transcriber.py` | Groq Whisper cloud transcription (~300ms) |
| `Dockerfile` | CPU-optimized Docker image for HF Spaces |
| `tests/test_api_routes.py` | Integration tests for all REST endpoints |
---
## Why Replace Gradio
The Gradio UI worked well for prototyping but had two production problems:
1. **Blocking event loop**: Whisper model loading (30–60s on first call) runs synchronously inside Gradio's event handler, freezing the entire UI.
2. **No control over UX**: Gradio's preset component library limits design to its own aesthetic.
FastAPI + a custom SPA solves both:
- All slow operations (model loading, retrieval, LLM calls) run inside FastAPI's async handlers on a proper ASGI server (uvicorn).
- The frontend is plain HTML/CSS/JS: no framework overhead, full design control.
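An `async` handler only stays responsive if blocking work is moved off the event loop. The sketch below shows one common way to do that with the standard library's `asyncio.to_thread`. `slow_model_call`, `handle_request` and `heartbeat` are hypothetical names for illustration, and whether `api/routes.py` offloads work this way is an assumption.

```python
import asyncio
import time


def slow_model_call(text: str) -> str:
    """Hypothetical stand-in for a blocking call (model load, retrieval, LLM)."""
    time.sleep(0.2)  # simulates work that would otherwise block the event loop
    return text.upper()


async def handle_request(text: str) -> str:
    # Run the blocking function in a worker thread so the event loop
    # can keep serving other requests in the meantime.
    return await asyncio.to_thread(slow_model_call, text)


async def heartbeat(ticks: list) -> None:
    # Records a tick every 50 ms; it only makes progress if the loop is free.
    for _ in range(3):
        await asyncio.sleep(0.05)
        ticks.append(time.monotonic())


async def main() -> tuple:
    ticks: list = []
    answer, _ = await asyncio.gather(handle_request("hello"), heartbeat(ticks))
    return answer, len(ticks)


result = asyncio.run(main())
```

The heartbeat completes all three ticks while the slow call is still sleeping, which is the behaviour the Gradio handler could not provide.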
---
## FastAPI Server (`server.py`)
### Key Decisions
**GPU forced off at the very top:**
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1" # Must be before any ML imports
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"
```
The RTX 5070 (sm_120 architecture) is incompatible with packaged PyTorch (max sm_90). Setting `CUDA_VISIBLE_DEVICES=-1` before any import prevents the crash. These must be the first lines of `server.py`: any later and `sentence_transformers` or `torch` may already have probed CUDA.
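A small guard can make an import-order mistake visible instead of silent. This is a sketch, not part of `server.py`: `force_cpu_only` is a hypothetical helper that sets the variable and reports any ML module that was already imported.

```python
import os
import sys

# Module names taken from the paragraph above; extend as needed.
_ML_MODULES = ("torch", "sentence_transformers")


def force_cpu_only() -> list:
    """Hide all GPUs and report any ML module that was imported too early."""
    imported_too_early = [name for name in _ML_MODULES if name in sys.modules]
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
    return imported_too_early


early = force_cpu_only()
if early:
    print(f"warning: {early} imported before CUDA_VISIBLE_DEVICES was set")
```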
**Modern lifespan pattern (no deprecated `on_event`):**
```python
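from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def _lifespan(app: FastAPI):
    kb_manager, transcriber, answer_chain = _startup()
    init_routes(kb_manager, transcriber, answer_chain, _CENTRAL_DB_PATH)
    yield  # requests are served while the context manager is suspended here

app = FastAPI(lifespan=_lifespan)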
```
FastAPI deprecated `@app.on_event("startup")` in favour of this context manager pattern.
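The ordering guarantee of the lifespan pattern can be seen without FastAPI at all. In this standard-library sketch a plain `async with` plays the role of the server. The `events` list and `simulate_server` are illustrative names only, and the `finally` branch shows where shutdown cleanup would go if the project needed any.

```python
import asyncio
from contextlib import asynccontextmanager

events = []


@asynccontextmanager
async def lifespan(app):
    events.append("startup")       # runs once, before requests are served
    try:
        yield
    finally:
        events.append("shutdown")  # runs once, when the server stops


async def simulate_server() -> None:
    # FastAPI enters the context manager around the app's lifetime;
    # here a plain `async with` stands in for that.
    async with lifespan(app=None):
        events.append("serving")


asyncio.run(simulate_server())
```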
**Smart transcriber selection:**
```python
if cfg.has_groq_key():
    transcriber = GroqTranscriber()     # ~300ms, cloud
else:
    transcriber = WhisperTranscriber()  # ~5–60s, local CPU
```
---
## REST API (`api/routes.py`)
### Singleton Injection
Rather than using FastAPI `Depends()` (which requires each endpoint to declare its dependencies), singletons are injected once at startup via `init_routes()`:
```python
from pathlib import Path
from typing import Optional

_kb_manager = None
_transcriber = None
_answer_chain = None
_db_path: Optional[Path] = None

def init_routes(kb_manager, transcriber, answer_chain, db_path) -> None:
    global _kb_manager, _transcriber, _answer_chain, _db_path
    _kb_manager = kb_manager
    ...
```
This makes the module stateful but keeps route definitions clean and eliminates per-request dependency resolution overhead for these heavy singletons.
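One risk of module-level singletons is a route running before `init_routes()` was called, which would surface as an opaque `NoneType` error. A guarded accessor is one way to fail loudly. This is a simplified sketch with a single singleton, and `get_kb_manager` is a hypothetical helper rather than something the module is known to contain.

```python
from typing import Any, Optional

_kb_manager: Optional[Any] = None


def init_routes(kb_manager: Any) -> None:
    """Simplified to one singleton for the example."""
    global _kb_manager
    _kb_manager = kb_manager


def get_kb_manager() -> Any:
    """Hypothetical accessor: fail loudly if startup never ran."""
    if _kb_manager is None:
        raise RuntimeError("init_routes() must be called before handling requests")
    return _kb_manager


# Before startup the accessor refuses to hand out a half-initialised object.
try:
    get_kb_manager()
    failed_before_init = False
except RuntimeError:
    failed_before_init = True

init_routes(kb_manager={"demo": "kb"})
manager = get_kb_manager()
```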
### Critical Bug Fixed: `.search()` vs `.retrieve()`
During the first end-to-end runtime test, `/api/ask` raised:
```
AttributeError: 'HybridRetriever' object has no attribute 'search'
```
The existing unit tests never caught this because `HybridRetriever` was fully mocked, and the mock accepted any attribute access. The actual public method is `retrieve()`.
**Root cause:** Unit tests with `MagicMock()` replacing the entire retriever cannot catch wrong method names.
**Fix:** `retriever.retrieve(search_query)` in both `api/routes.py` and `ui/tabs/ask_tab.py`.
**Prevention:** The new integration tests in `test_api_routes.py` patch only the model-loading step, leaving all method call routing real:
```python
from unittest.mock import patch

with patch("voicevault.retrieval.hybrid_retriever.HybridRetriever.retrieve",
           return_value=[]) as mock_retrieve:
    r = client.post("/api/ask", json={...})
    mock_retrieve.assert_called_once()  # would fail if code calls .search()
```
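The failure mode is easy to reproduce in isolation. The sketch below uses a minimal stand-in class, not the real `HybridRetriever`, to contrast a bare `MagicMock()` with a mock built by `unittest.mock.create_autospec`, which rejects attributes the real class does not have. Using autospec in the existing unit tests is a suggestion here, not something the project is stated to do.

```python
from unittest.mock import MagicMock, create_autospec


class HybridRetriever:
    """Minimal stand-in with the same public method name as the real class."""

    def retrieve(self, query: str) -> list:
        return []


# A bare MagicMock invents any attribute on demand, so the typo goes unnoticed.
loose = MagicMock()
loose.search("query")  # no error, although HybridRetriever has no .search()

# An autospecced mock only exposes attributes that exist on the real class.
strict = create_autospec(HybridRetriever, instance=True)
strict.retrieve.return_value = []
strict.retrieve("query")  # fine: the method exists

try:
    strict.search("query")
    caught = False
except AttributeError:
    caught = True
```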
### Endpoints
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/api/kbs` | List all knowledge bases with doc/chunk counts |
| `POST` | `/api/kbs` | Create a KB (validates name, hashes password) |
| `DELETE` | `/api/kbs/{kb_name}` | Delete KB and all its indexed data |
| `POST` | `/api/kbs/{kb_name}/documents` | Upload + index documents |
| `POST` | `/api/transcribe` | Transcribe uploaded audio file |
| `POST` | `/api/ask` | Full RAG pipeline: retrieve → generate → cite |
| `GET` | `/api/analytics` | Query stats from SQLite audit log |
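The `POST /api/kbs` row mentions name validation and password hashing without giving details. As a rough illustration only, the sketch below shows one standard-library way to do both. The naming rule, the iteration count and all function names are assumptions, not the project's actual implementation.

```python
import hashlib
import hmac
import os
import re
from typing import Optional, Tuple

# Assumed rule for illustration: 3-32 chars, letters, digits, "_" or "-".
_KB_NAME_RE = re.compile(r"[A-Za-z0-9_-]{3,32}")


def validate_kb_name(name: str) -> bool:
    return _KB_NAME_RE.fullmatch(name) is not None


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest) using PBKDF2-HMAC-SHA256 from the standard library."""
    salt = salt if salt is not None else os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 200_000)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)  # constant-time comparison
```

Storing the salt next to the digest lets the server re-derive and compare the hash without ever keeping the plain-text password.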
---
## Groq Transcriber (`voicevault/asr/groq_transcriber.py`)
### Why Groq for ASR
Local Whisper on CPU downloads a 1.5GB model on first use, so the first 5-second clip can take 5+ minutes to transcribe. This is unusable in a live demo setting.
Groq's Whisper API:
- No model download
- ~300–500ms round-trip for short voice queries
- Same `whisper-large-v3-turbo` model quality
- Free tier: 7,200 requests/day
```python
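from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

with open(audio_path, "rb") as f:
    response = client.audio.transcriptions.create(
        file=(audio_path.name, f.read()),
        model="whisper-large-v3-turbo",
        language="en",
        response_format="text",
    )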
```
The `GroqTranscriber` returns the same `TranscriptResult` dataclass as `WhisperTranscriber`, so the rest of the pipeline is unaware of which was used.
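The shared return type is what makes the two backends interchangeable. A minimal sketch of that contract using `typing.Protocol` follows. The `TranscriptResult` fields, the `transcribe` method name and the two fake classes are assumptions for illustration, and the real classes may differ.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass
class TranscriptResult:
    """Field names are assumptions; the real dataclass may carry more."""
    text: str
    engine: str


class Transcriber(Protocol):
    def transcribe(self, audio_path: Path) -> TranscriptResult: ...


class FakeCloudTranscriber:
    def transcribe(self, audio_path: Path) -> TranscriptResult:
        return TranscriptResult(text="hello from the cloud", engine="groq")


class FakeLocalTranscriber:
    def transcribe(self, audio_path: Path) -> TranscriptResult:
        return TranscriptResult(text="hello from the CPU", engine="whisper")


def pick_transcriber(has_groq_key: bool) -> Transcriber:
    # Mirrors the selection in server.py: prefer the cloud backend if a key exists.
    return FakeCloudTranscriber() if has_groq_key else FakeLocalTranscriber()


def run_pipeline(transcriber: Transcriber, audio_path: Path) -> str:
    # Downstream code depends only on TranscriptResult, not on the backend.
    return transcriber.transcribe(audio_path).text
```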
---
## SPA Frontend
### Browser Audio → WAV Conversion
The browser's `MediaRecorder` API outputs `audio/webm` (Chrome) or `audio/ogg` (Firefox). The server's `soundfile` library only reads WAV/FLAC/OGG-Vorbis. Sending WebM directly to the server would require `ffmpeg`.
Solution: convert in-browser using `AudioContext` before sending:
```javascript
async function convertBlobToWav(blob) {
  const arrayBuffer = await blob.arrayBuffer();
  const audioCtx = new (window.AudioContext || window.webkitAudioContext)();
  const audioBuffer = await audioCtx.decodeAudioData(arrayBuffer);
  audioCtx.close();
  return audioBufferToWavBlob(audioBuffer); // 16-bit PCM WAV
}
```
This eliminates the `ffmpeg` system dependency entirely. The server receives a standard PCM WAV file that `soundfile` reads natively.
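To make the target format concrete, here is a Python sketch of the same kind of 16-bit PCM encoding, written with the standard-library `wave` module. It is not the project's JavaScript `audioBufferToWavBlob` helper; `floats_to_wav_bytes` is a hypothetical function, and mono at 16 kHz is an assumed configuration.

```python
import io
import math
import struct
import wave


def floats_to_wav_bytes(samples: list, sample_rate: int = 16000) -> bytes:
    """Encode mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        clamped = max(-1.0, min(1.0, s))                   # avoid int16 overflow
        frames += struct.pack("<h", int(clamped * 32767))  # little-endian int16
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16 bit
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return buf.getvalue()


# 0.1 s of a 440 Hz tone at 16 kHz.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
wav_bytes = floats_to_wav_bytes(tone)

# Read the bytes back the way a server-side audio library would.
with wave.open(io.BytesIO(wav_bytes), "rb") as check:
    params = (check.getnchannels(), check.getsampwidth(),
              check.getframerate(), check.getnframes())
```

The same round trip is a cheap server-side sanity check: if the upload does not parse as PCM WAV, the request can be rejected before transcription starts.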
### Design System
The SPA uses a dark glassmorphism design:
- Background: near-black `#09090b` with ambient purple gradient orbs (CSS `@keyframes drift`)
- Cards: `rgba(255,255,255,0.03)` with `backdrop-filter: blur(12px)` and subtle border
- Primary: `#8b5cf6` (violet-500), consistent across buttons, badges, microphone glow
- Chat bubbles: user messages right-aligned (violet), assistant messages left-aligned (dark card)
- Animations: `msgIn` slide-in for chat, `micPulse` glow during recording, `waveAnim` bars during processing
---
## Integration Tests (`tests/test_api_routes.py`)
### Philosophy
The existing 311 tests mock `HybridRetriever`, `AnswerChain`, and `WhisperTranscriber` at the class level using `MagicMock()`. This correctly validates internal logic within each module, but cannot catch:
- Wrong method names called across module boundaries
- Incorrect field names in Pydantic response models
- Route registration issues (missing prefix, typos)
The integration tests solve this by using:
- Real `KBManager` backed by a temp SQLite DB (schema initialized via `initialize_database()`)
- Real FastAPI `TestClient` routing (not mocked)
- Only the LLM and Whisper calls mocked (network/slow)
```python
@pytest.fixture(scope="module")
def client(kb_manager, mock_transcriber, mock_answer_chain, tmp_path_factory):
    db = tmp_path_factory.mktemp("db2") / "server.db"
    db_mod.initialize_database(db)  # real schema
    routes_mod.init_routes(kb_manager, mock_transcriber, mock_answer_chain, db)
    app = FastAPI()
    app.include_router(router)
    return TestClient(app)
```
### Test Coverage
| Class | Tests | What Is Validated |
|-------|-------|-------------------|
| `TestKBEndpoints` | 8 | CRUD, validation, duplicate detection, response fields |
| `TestAskEndpoint` | 6 | Method names (`retrieve`, `build`), response schema, history, empty/missing inputs |
| `TestAnalyticsEndpoint` | 2 | Stats structure, KB list type |
| `TestTranscribeEndpoint` | 1 | Real WAV file upload, mock transcriber called |
**Total: 17 tests, all passing.**
---
## Docker Deployment
### Image Strategy
```dockerfile
# 1. CPU-only PyTorch first (saves 1.8GB vs GPU wheel)
RUN pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cpu
# 2. All other requirements
RUN pip install -r requirements.txt
# 3. spaCy model (needed for document chunking)
RUN python -m spacy download en_core_web_sm
# 4. Pre-download ML models at build time (no cold-start delays)
RUN python -c "\
from sentence_transformers import SentenceTransformer, CrossEncoder; \
SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2'); \
CrossEncoder('cross-encoder/ms-marco-MiniLM-L12-v2')"
```
Pre-baking the embedding and reranking models into the Docker layer means the first document upload doesn't trigger a download. Whisper is intentionally not pre-baked, because the Groq cloud API is used when `GROQ_API_KEY` is present.
### Environment Variables in HF Spaces
API keys are injected as Space secrets (not committed to git):
```python
from huggingface_hub import HfApi
api = HfApi()
api.add_space_secret('NinjainPJs/VoiceVault', 'GROQ_API_KEY', value)
api.add_space_secret('NinjainPJs/VoiceVault', 'GEMINI_API_KEY', value)
```
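At runtime, Space secrets are exposed to the container as environment variables. A sketch of how the server side might read them follows. `has_groq_key` mirrors the name used in `server.py`, but this body is an assumption, and `missing_secrets` is a hypothetical helper for a startup log line.

```python
import os


def has_groq_key(env=os.environ) -> bool:
    """True if a non-empty GROQ_API_KEY is present in the environment."""
    return bool(env.get("GROQ_API_KEY", "").strip())


def missing_secrets(required=("GROQ_API_KEY", "GEMINI_API_KEY"), env=os.environ) -> list:
    """List required secrets that are unset or blank."""
    return [name for name in required if not env.get(name, "").strip()]
```

Passing the environment as a parameter keeps both helpers testable without touching the real process environment.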
### Storage
HF Spaces free tier uses ephemeral storage, so knowledge bases created at runtime are lost on container restart. This is acceptable for a demo deployment. For production persistence, HF Spaces offers a persistent storage add-on, or the data layer can be pointed at an external object store.
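Switching between ephemeral and persistent storage is simplest when the data directory is resolved in one place. The sketch below assumes a hypothetical `VOICEVAULT_DATA_DIR` environment variable, which the project is not known to define. On a Space with the persistent storage add-on it could point at the mounted volume, assumed here to be `/data`.

```python
import os
import tempfile
from pathlib import Path


def resolve_data_dir(env=os.environ) -> Path:
    """Pick where knowledge bases are stored.

    VOICEVAULT_DATA_DIR is a hypothetical variable name for this sketch.
    """
    configured = env.get("VOICEVAULT_DATA_DIR")
    if configured:
        return Path(configured)
    # Fallback: ephemeral location that disappears with the container.
    return Path(tempfile.gettempdir()) / "voicevault"
```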
---
## Lessons Learned
1. **Mock depth matters**: mocking at the class level (`MagicMock()`) cannot catch method-name bugs across modules. Integration tests that mock only I/O boundaries (LLM API, Whisper model) while keeping real routing are essential.
2. **GPU environment variables must be first**: `CUDA_VISIBLE_DEVICES=-1` must precede all Python ML imports. Any utility module that imports `torch` at module level will trigger CUDA detection before the variable is set if import order is not controlled.
3. **Browser audio formats**: `MediaRecorder` output (WebM/OGG) and server-side audio libraries (`soundfile`) don't share a format. Converting to 16-bit PCM WAV in the browser with `AudioContext` is the cleanest zero-dependency solution.
4. **Groq vs local Whisper**: For a live demo, cloud transcription is non-negotiable. A 5-minute wait on first recording kills the experience. The 300ms Groq round-trip feels instant.