NinjainPJs committed on
Commit 1694352 · verified · 1 Parent(s): 0e4047c

Update DOCS/phase6_deployment.md

Files changed (1)
  1. DOCS/phase6_deployment.md +268 -0
DOCS/phase6_deployment.md ADDED
@@ -0,0 +1,268 @@
+ # Phase 6: FastAPI Server, SPA Frontend & Hugging Face Deployment
+
+ **Status:** ✅ Complete | **Tests:** 17/17 passed (integration suite) | **Files:** 7 new files
+
+ ---
+
+ ## What Was Built
+
+ Phase 6 replaces the Gradio UI with a production-quality FastAPI REST API and a hand-crafted single-page application (SPA), then deploys the complete system to Hugging Face Spaces via Docker.
+
+ | Module | Responsibility |
+ |--------|----------------|
+ | `server.py` | FastAPI entry point with lifespan startup, singleton injection |
+ | `api/routes.py` | All REST endpoints (`/api/kbs`, `/api/ask`, `/api/transcribe`, `/api/analytics`) |
+ | `static/index.html` | SPA shell: sidebar nav, Ask/KB/Analytics views, modals, toast container |
+ | `static/style.css` | Dark glassmorphism design system (~600 lines) |
+ | `static/app.js` | Full SPA logic: recording, WAV conversion, API calls, chat, TTS (~500 lines) |
+ | `voicevault/asr/groq_transcriber.py` | Groq Whisper cloud transcription (~300ms) |
+ | `Dockerfile` | CPU-optimized Docker image for HF Spaces |
+ | `tests/test_api_routes.py` | Integration tests for all REST endpoints |
+
+ ---
+
+ ## Why Replace Gradio
+
+ The Gradio UI worked well for prototyping but had two production problems:
+
+ 1. **Blocking event loop:** Whisper model loading (30–60s on first call) runs synchronously inside Gradio's event handler, freezing the entire UI.
+ 2. **No control over UX:** Gradio's preset component library limits the design to its own aesthetic.
+
+ FastAPI plus a custom SPA solves both:
+ - All slow operations (model loading, retrieval, LLM calls) run inside FastAPI's async handlers on a proper ASGI server (uvicorn).
+ - The frontend is plain HTML/CSS/JS: no framework overhead, full design control.
+
+ ---
+
+ ## FastAPI Server (`server.py`)
+
+ ### Key Decisions
+
+ **GPU forced off at the very top:**
+ ```python
+ import os
+ os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # Must be before any ML imports
+ os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"
+ ```
+ The RTX 5070 (sm_120 architecture) is incompatible with packaged PyTorch (max sm_90). Setting `CUDA_VISIBLE_DEVICES=-1` before any ML import prevents the crash. These assignments must sit at the very top of `server.py`: any later and `sentence_transformers` or `torch` may already have probed CUDA.
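
The ordering constraint can be made checkable rather than left to convention. The following is a minimal sketch, not code from `server.py`: `force_cpu` and `ML_MODULES` are hypothetical names, and the helper only reports modules that were already imported when the variable was set.

```python
import os
import sys

# Hypothetical list: the two libraries the text names as CUDA probers.
ML_MODULES = ("torch", "sentence_transformers")


def force_cpu(ml_modules=ML_MODULES):
    """Hide all GPUs and return the ML modules that were imported too early."""
    already_loaded = [name for name in ml_modules if name in sys.modules]
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
    return already_loaded


early = force_cpu()
if early:
    # The variable is set too late for these modules; fix the import order.
    print(f"warning: imported before CUDA was disabled: {early}")
```

Calling such a helper as the first statement of the entry point would surface an import-order regression as a warning instead of a crash at model load.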
+
+ **Modern lifespan pattern (no deprecated `on_event`):**
+ ```python
+ from contextlib import asynccontextmanager
+
+ from fastapi import FastAPI
+
+ @asynccontextmanager
+ async def _lifespan(app: FastAPI):
+     # Build the heavy singletons once, before the first request is served
+     kb_manager, transcriber, answer_chain = _startup()
+     init_routes(kb_manager, transcriber, answer_chain, _CENTRAL_DB_PATH)
+     yield  # the app serves requests while suspended here
+
+ app = FastAPI(lifespan=_lifespan)
+ ```
+ FastAPI deprecated `@app.on_event("startup")` in favour of this context manager pattern.
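
To see what the context manager pattern guarantees, here is a small standard-library sketch that mimics how a server wraps the app's lifetime. It assumes nothing about FastAPI internals; `serve` and `events` are illustrative names only.

```python
import asyncio
from contextlib import asynccontextmanager

events = []


@asynccontextmanager
async def lifespan(app):
    events.append("startup")        # runs once, before any request is handled
    try:
        yield
    finally:
        events.append("shutdown")   # runs once, when the server stops


async def serve(app):
    # Stand-in for what an ASGI server does around the app's lifetime.
    async with lifespan(app):
        events.append("handling requests")


asyncio.run(serve(app=None))
print(events)  # ['startup', 'handling requests', 'shutdown']
```

Code before `yield` plays the role of the old startup hook, and code after it (here in a `finally` block) plays the role of the shutdown hook.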
+
+ **Smart transcriber selection:**
+ ```python
+ if cfg.has_groq_key():
+     transcriber = GroqTranscriber()     # ~300ms, cloud
+ else:
+     transcriber = WhisperTranscriber()  # ~5–60s, local CPU
+ ```
+
+ ---
+
+ ## REST API (`api/routes.py`)
+
+ ### Singleton Injection
+
+ Rather than using FastAPI `Depends()` (which requires each endpoint to declare its dependencies), singletons are injected once at startup via `init_routes()`:
+
+ ```python
+ from pathlib import Path
+ from typing import Optional
+
+ # Module-level singletons, populated once by init_routes() at startup
+ _kb_manager = None
+ _transcriber = None
+ _answer_chain = None
+ _db_path: Optional[Path] = None
+
+ def init_routes(kb_manager, transcriber, answer_chain, db_path) -> None:
+     global _kb_manager, _transcriber, _answer_chain, _db_path
+     _kb_manager = kb_manager
+     ...
+ ```
+
+ This makes the module stateful, but it keeps route definitions clean and avoids per-request dependency resolution for these heavy singletons.
+
+ ### Critical Bug Fixed: `.search()` vs `.retrieve()`
+
+ During the first end-to-end runtime test, `/api/ask` raised:
+ ```
+ AttributeError: 'HybridRetriever' object has no attribute 'search'
+ ```
+ The existing unit tests never caught this because `HybridRetriever` was fully mocked, and the mock accepted any attribute access. The actual public method is `retrieve()`.
+
+ **Root cause:** Unit tests that replace the entire retriever with `MagicMock()` cannot catch wrong method names.
+
+ **Fix:** Call `retriever.retrieve(search_query)` in both `api/routes.py` and `ui/tabs/ask_tab.py`.
+
+ **Prevention:** The new integration tests in `test_api_routes.py` patch a single method on the real `HybridRetriever` class, so the model-loading work is skipped while all method call routing stays real:
+ ```python
+ from unittest.mock import patch
+
+ with patch("voicevault.retrieval.hybrid_retriever.HybridRetriever.retrieve",
+            return_value=[]) as mock_retrieve:
+     r = client.post("/api/ask", json={...})
+     mock_retrieve.assert_called_once()  # would fail if code calls .search()
+ ```
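
The failure mode is easy to reproduce in isolation. The sketch below uses a stand-in `HybridRetriever` and a hypothetical `ask` function containing the same bug. It also shows `MagicMock(spec=...)`, a related guard from the standard library; this is a different technique from the method-level patch used in the tests above.

```python
from unittest.mock import MagicMock


class HybridRetriever:
    """Stand-in exposing the same public method name as the real class."""

    def retrieve(self, query):
        return []


def ask(retriever, query):
    return retriever.search(query)  # the bug: the method is called retrieve()


# A bare MagicMock accepts any attribute, so the bug goes unnoticed.
ask(MagicMock(), "hello")

# A mock constrained by spec= rejects names the real class does not define.
try:
    ask(MagicMock(spec=HybridRetriever), "hello")
except AttributeError as exc:
    print(f"caught: {exc}")
```

Either approach turns a wrong method name into a test failure; the choice mostly depends on how much of the real class the test should exercise.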
+
+ ### Endpoints
+
+ | Method | Path | Description |
+ |--------|------|-------------|
+ | `GET` | `/api/kbs` | List all knowledge bases with doc/chunk counts |
+ | `POST` | `/api/kbs` | Create a KB (validates name, hashes password) |
+ | `DELETE` | `/api/kbs/{kb_name}` | Delete KB and all its indexed data |
+ | `POST` | `/api/kbs/{kb_name}/documents` | Upload + index documents |
+ | `POST` | `/api/transcribe` | Transcribe uploaded audio file |
+ | `POST` | `/api/ask` | Full RAG pipeline: retrieve → generate → cite |
+ | `GET` | `/api/analytics` | Query stats from SQLite audit log |
+
+ ---
+
+ ## Groq Transcriber (`voicevault/asr/groq_transcriber.py`)
+
+ ### Why Groq for ASR
+
+ Local Whisper on CPU downloads a 1.5GB model on first use, so the first 5-second clip can take 5+ minutes to transcribe. This is unusable in a live demo setting.
+
+ Groq's Whisper API:
+ - No model download
+ - ~300–500ms round-trip for short voice queries
+ - Same `whisper-large-v3-turbo` model quality
+ - Free tier: 7,200 requests/day
+
+ ```python
+ # Assumes client is a Groq SDK client and audio_path is a pathlib.Path
+ with open(audio_path, "rb") as f:
+     response = client.audio.transcriptions.create(
+         file=(audio_path.name, f.read()),
+         model="whisper-large-v3-turbo",
+         language="en",
+         response_format="text",
+     )
+ ```
+
+ The `GroqTranscriber` returns the same `TranscriptResult` dataclass as `WhisperTranscriber`, so the rest of the pipeline is unaware of which was used.
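
The shared return type is what makes the two backends interchangeable. A minimal sketch of that contract follows. The field names on `TranscriptResult`, the `transcribe` method name, and both fake classes are assumptions made for illustration, not the project's actual definitions.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass
class TranscriptResult:
    text: str            # assumed field name
    duration_ms: float   # assumed field name


class Transcriber(Protocol):
    def transcribe(self, audio_path: Path) -> TranscriptResult: ...


class FakeCloudTranscriber:
    def transcribe(self, audio_path: Path) -> TranscriptResult:
        return TranscriptResult(text=f"cloud:{audio_path.name}", duration_ms=300.0)


class FakeLocalTranscriber:
    def transcribe(self, audio_path: Path) -> TranscriptResult:
        return TranscriptResult(text=f"local:{audio_path.name}", duration_ms=5000.0)


def pick_transcriber(has_cloud_key: bool) -> Transcriber:
    return FakeCloudTranscriber() if has_cloud_key else FakeLocalTranscriber()


def handle_query(transcriber: Transcriber, audio_path: Path) -> str:
    # The caller depends only on the shared result type, not on the backend.
    return transcriber.transcribe(audio_path).text
```

Because `handle_query` is typed against the protocol, swapping backends needs no change downstream.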
+
+ ---
+
+ ## SPA Frontend
+
+ ### Browser Audio → WAV Conversion
+
+ The browser's `MediaRecorder` API outputs `audio/webm` (Chrome) or `audio/ogg` (Firefox). The server's `soundfile` library only reads WAV/FLAC/OGG-Vorbis. Sending WebM directly to the server would require `ffmpeg`.
+
+ Solution: convert in-browser using `AudioContext` before sending:
+
+ ```javascript
+ async function convertBlobToWav(blob) {
+   const arrayBuffer = await blob.arrayBuffer();
+   const audioCtx = new (window.AudioContext || window.webkitAudioContext)();
+   const audioBuffer = await audioCtx.decodeAudioData(arrayBuffer);
+   audioCtx.close();
+   return audioBufferToWavBlob(audioBuffer); // 16-bit PCM WAV
+ }
+ ```
+
+ This eliminates the `ffmpeg` system dependency entirely. The server receives a standard PCM WAV file that `soundfile` reads natively.
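
The `audioBufferToWavBlob` helper is not shown here. As a rough Python analogue of the encoding step it has to perform, the sketch below clamps float samples, packs them as little-endian 16-bit integers, and wraps them in a WAV container. The function name and the 16 kHz default are assumptions, and the browser code may scale or dither samples differently.

```python
import io
import struct
import wave


def floats_to_wav_bytes(samples, sample_rate=16000):
    """Encode mono float samples in [-1.0, 1.0] as 16-bit PCM WAV bytes."""
    pcm = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))                # clamp to avoid int16 overflow
        pcm += struct.pack("<h", int(s * 32767))  # little-endian signed 16-bit
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)                       # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(pcm))
    return buf.getvalue()


data = floats_to_wav_bytes([0.0, 0.5, -0.5, 1.0, -1.0])
print(data[:4])  # the RIFF magic that identifies a WAV container
```

The resulting bytes form a plain PCM WAV file, the format the server-side reader is said to handle natively.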
+
+ ### Design System
+
+ The SPA uses a dark glassmorphism design:
+ - Background: near-black `#09090b` with ambient purple gradient orbs (CSS `@keyframes drift`)
+ - Cards: `rgba(255,255,255,0.03)` with `backdrop-filter: blur(12px)` and subtle border
+ - Primary: `#8b5cf6` (violet-500), consistent across buttons, badges, microphone glow
+ - Chat bubbles: user messages right-aligned (violet), assistant messages left-aligned (dark card)
+ - Animations: `msgIn` slide-in for chat, `micPulse` glow during recording, `waveAnim` bars during processing
+
+ ---
+
+ ## Integration Tests (`tests/test_api_routes.py`)
+
+ ### Philosophy
+
+ The existing 311 tests mock `HybridRetriever`, `AnswerChain`, and `WhisperTranscriber` at the class level using `MagicMock()`. This correctly validates internal logic within each module, but cannot catch:
+ - Wrong method names called across module boundaries
+ - Incorrect field names in Pydantic response models
+ - Route registration issues (missing prefix, typos)
+
+ The integration tests solve this by using:
+ - Real `KBManager` backed by a temp SQLite DB (schema initialized via `initialize_database()`)
+ - Real FastAPI `TestClient` routing (not mocked)
+ - Only the LLM and Whisper calls mocked (network/slow)
+
+ ```python
+ @pytest.fixture(scope="module")
+ def client(kb_manager, mock_transcriber, mock_answer_chain, tmp_path_factory):
+     db = tmp_path_factory.mktemp("db2") / "server.db"
+     db_mod.initialize_database(db)  # real schema
+     routes_mod.init_routes(kb_manager, mock_transcriber, mock_answer_chain, db)
+     app = FastAPI()
+     app.include_router(router)
+     return TestClient(app)
+ ```
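
The same split, real storage with a mocked slow boundary, can be shown without FastAPI or pytest. In this sketch the one-table schema, the `ask` function, and the single-argument `build` call are hypothetical stand-ins; only the pattern is the point.

```python
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path
from unittest.mock import MagicMock


def initialize_database(db_path):
    # Hypothetical one-table schema standing in for the project's real one.
    with closing(sqlite3.connect(db_path)) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS queries (question TEXT, answer TEXT)")
        conn.commit()


def ask(db_path, answer_chain, question):
    answer = answer_chain.build(question)           # mocked: network / slow
    with closing(sqlite3.connect(db_path)) as conn:  # real: actual SQL runs
        conn.execute("INSERT INTO queries VALUES (?, ?)", (question, answer))
        conn.commit()
    return answer


with tempfile.TemporaryDirectory() as tmp:
    db = Path(tmp) / "server.db"
    initialize_database(db)
    chain = MagicMock()
    chain.build.return_value = "stub answer"
    result = ask(db, chain, "what is phase 6?")
    with closing(sqlite3.connect(db)) as conn:
        rows = conn.execute("SELECT question, answer FROM queries").fetchall()
```

A typo in the SQL or a wrong column count would fail here, while the expensive call stays stubbed.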
+
+ ### Test Coverage
+
+ | Class | Tests | What Is Validated |
+ |-------|-------|-------------------|
+ | `TestKBEndpoints` | 8 | CRUD, validation, duplicate detection, response fields |
+ | `TestAskEndpoint` | 6 | Method names (`retrieve`, `build`), response schema, history, empty/missing inputs |
+ | `TestAnalyticsEndpoint` | 2 | Stats structure, KB list type |
+ | `TestTranscribeEndpoint` | 1 | Real WAV file upload, mock transcriber called |
+
+ **Total: 17 tests, all passing.**
+
+ ---
+
+ ## Docker Deployment
+
+ ### Image Strategy
+
+ ```dockerfile
+ # 1. CPU-only PyTorch first (saves 1.8GB vs GPU wheel)
+ RUN pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cpu
+
+ # 2. All other requirements
+ RUN pip install -r requirements.txt
+
+ # 3. spaCy model (needed for document chunking)
+ RUN python -m spacy download en_core_web_sm
+
+ # 4. Pre-download ML models at build time (no cold-start delays)
+ #    Backslashes keep the multi-line command inside a single RUN instruction
+ RUN python -c "from sentence_transformers import SentenceTransformer, CrossEncoder; \
+     SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2'); \
+     CrossEncoder('cross-encoder/ms-marco-MiniLM-L12-v2')"
+ ```
+
+ Pre-baking the embedding and reranking models into the Docker layer means the first document upload doesn't trigger a download. Whisper is intentionally not pre-baked, because the Groq cloud API is used when `GROQ_API_KEY` is present.
+
+ ### Environment Variables in HF Spaces
+
+ API keys are injected as Space secrets (not committed to git):
+
+ ```python
+ from huggingface_hub import HfApi
+ api = HfApi()
+ # groq_key and gemini_key are placeholders for the secret strings, supplied locally
+ api.add_space_secret('NinjainPJs/VoiceVault', 'GROQ_API_KEY', groq_key)
+ api.add_space_secret('NinjainPJs/VoiceVault', 'GEMINI_API_KEY', gemini_key)
+ ```
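
At runtime, Space secrets reach the container as environment variables. The sketch below shows how a check like `cfg.has_groq_key()` could be written against them. The function bodies are assumptions, since the config module is not shown, and `pick_backend` is a hypothetical name.

```python
import os


def has_groq_key(env=None) -> bool:
    """Hypothetical check: True when a non-blank GROQ_API_KEY is set."""
    env = os.environ if env is None else env
    return bool(env.get("GROQ_API_KEY", "").strip())


def pick_backend(env=None) -> str:
    # Mirrors the transcriber selection: cloud when a key exists, else local.
    return "groq" if has_groq_key(env) else "local-whisper"


print(pick_backend({"GROQ_API_KEY": "example-key"}))  # groq
print(pick_backend({}))                                # local-whisper
```

Passing the environment in as a mapping keeps the check testable without touching the real process environment.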
+
+ ### Storage
+
+ HF Spaces free tier uses ephemeral storage, so knowledge bases created at runtime are lost on container restart. This is acceptable for a demo deployment. For production persistence, HF Spaces offers a persistent storage add-on, or the data layer can be pointed at an external object store.
+
+ ---
+
+ ## Lessons Learned
+
+ 1. **Mock depth matters:** mocking at the class level (`MagicMock()`) cannot catch method-name bugs across modules. Integration tests that mock only I/O boundaries (LLM API, Whisper model) while keeping real routing are essential.
+
+ 2. **GPU environment variables must be first:** `CUDA_VISIBLE_DEVICES=-1` must precede all Python ML imports. Any utility module that imports `torch` at module level will trigger CUDA detection before the variable is set if import order is not controlled.
+
+ 3. **Browser audio formats:** `MediaRecorder` output (WebM/OGG) and server-side audio libraries (`soundfile`) don't share a format. Converting to 16-bit PCM WAV in the browser with `AudioContext` is the cleanest zero-dependency solution.
+
+ 4. **Groq vs local Whisper:** For a live demo, cloud transcription is non-negotiable. A 5-minute wait on first recording kills the experience. The 300ms Groq round-trip feels instant.