# Phase 5: Full UI, TTS & Access Control
**Status:** ✅ Complete | **Tests:** 55/55 passed | **Files:** 7 modules (3 UI tabs, 2 backend, 1 TTS, 1 updated app.py)
---
## What Was Built
Phase 5 wires all previous phases into a working end-to-end application.
| Module | Responsibility |
|--------|----------------|
| `voicevault/kb/kb_manager.py` | KB lifecycle: create, list, delete, ingest, password auth |
| `voicevault/tts/web_speech.py` | TTS text prep: strip citation markers before speech |
| `ui/tabs/ask_tab.py` | Full voice query pipeline in Gradio |
| `ui/tabs/kb_tab.py` | KB creation, document upload, management |
| `ui/tabs/analytics_tab.py` | Query stats from SQLite audit log |
| `ui/tabs/settings_tab.py` | Configuration panels (display-only) |
| `app.py` | Startup orchestration, pipeline wiring |
---
## KBManager
**File:** [voicevault/kb/kb_manager.py](../voicevault/kb/kb_manager.py)
### Central Database
All KBs share **one** SQLite database at `cfg.data_dir / "voicevault.db"`. This enables cross-KB queries, global analytics, and efficient listing without per-KB filesystem scanning.
### KB Name Validation
```python
_VALID_KB_NAME = re.compile(r"^[a-z0-9][a-z0-9\-]{0,62}[a-z0-9]$|^[a-z0-9]$")
```
- Lowercase alphanumeric + hyphens only
- 1–64 characters
- Cannot start or end with a hyphen
- Prevents path traversal attacks (no `..`, `/`, `\`, spaces)
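A minimal sketch of how the pattern above could be applied. The helper name `is_valid_kb_name` is hypothetical and not taken from the source. It uses `fullmatch()` because in Python `$` also matches just before a trailing newline, so `match()` alone would accept a name such as `"notes\n"`.

```python
import re

# Pattern copied verbatim from the section above.
_VALID_KB_NAME = re.compile(r"^[a-z0-9][a-z0-9\-]{0,62}[a-z0-9]$|^[a-z0-9]$")


def is_valid_kb_name(name: str) -> bool:
    """Return True if `name` is an acceptable KB slug (hypothetical helper)."""
    # fullmatch() requires the whole string to match, which rejects a
    # trailing newline that a bare `$` would otherwise tolerate.
    return _VALID_KB_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["a", "legal-docs", "-bad", "bad-", "../etc", "My KB"]:
        print(f"{candidate!r}: {is_valid_kb_name(candidate)}")
```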
### Password Protection (bcrypt)
```python
password_hash = bcrypt.hashpw(
    password.encode(), bcrypt.gensalt(rounds=cfg.bcrypt_rounds)  # default: 12
).decode()
```
- Passwords are hashed at creation time; plaintext is never stored
- `verify_password()` uses `bcrypt.checkpw()` for constant-time comparison
- Public KBs (no password) return True for any password check
### verify_password Logic
```
KB has no hash (public)     → True   (always accessible)
KB has hash, no password    → False  (protected but no credentials)
KB has hash, with password  → bcrypt.checkpw(password, hash)
```
### ingest_documents Flow
```python
ingest_documents(kb_name, file_paths, password=None):
    1. Verify KB exists
    2. Verify password
    3. IndexBuilder(kb_name).ingest_file(path, db_path) per file
    4. Return list[IngestionReport]
```
Delegates entirely to `IndexBuilder` (Phase 1) which handles parsing, chunking, embedding, ChromaDB upsert, BM25 rebuild, and deduplication.
### delete_kb Flow
```python
delete_kb(kb_name):
    1. Verify KB exists (raises KBManagerError if not)
    2. db.delete_kb() → SQLite CASCADE deletes documents, chunks, query_log
    3. shutil.rmtree(cfg.kb_dir(kb_name)) → removes ChromaDB, BM25, files
```
Irreversible: the UI asks for confirmation before calling it.
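The delete flow can be sketched with the standard library alone. The table names (`kbs`, `documents`), the `KeyError`, and the function signature are assumptions made for illustration, since the real schema is not shown here. One detail worth noting: SQLite only honors `ON DELETE CASCADE` when `PRAGMA foreign_keys = ON` has been set on the connection, because foreign-key enforcement is off by default.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def delete_kb(conn: sqlite3.Connection, kb_root: Path, kb_name: str) -> None:
    """Delete the KB row (child rows cascade), then remove its directory."""
    row = conn.execute("SELECT 1 FROM kbs WHERE name = ?", (kb_name,)).fetchone()
    if row is None:
        raise KeyError(f"KB not found: {kb_name}")
    with conn:  # commits on success, rolls back on exception
        conn.execute("DELETE FROM kbs WHERE name = ?", (kb_name,))
    shutil.rmtree(kb_root / kb_name, ignore_errors=True)


def demo() -> tuple[int, bool]:
    """Build a throwaway KB, delete it, and report what is left."""
    root = Path(tempfile.mkdtemp())
    (root / "notes").mkdir()
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # required for CASCADE to fire
    conn.executescript(
        """
        CREATE TABLE kbs (name TEXT PRIMARY KEY);
        CREATE TABLE documents (
            id INTEGER PRIMARY KEY,
            kb_name TEXT NOT NULL REFERENCES kbs(name) ON DELETE CASCADE
        );
        INSERT INTO kbs VALUES ('notes');
        INSERT INTO documents (kb_name) VALUES ('notes'), ('notes');
        """
    )
    delete_kb(conn, root, "notes")
    remaining = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
    dir_exists = (root / "notes").exists()
    shutil.rmtree(root, ignore_errors=True)
    return remaining, dir_exists
```

Deleting the database row before the directory means a failure in the SQL step leaves the files untouched.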
---
## TTS: Web Speech API
**File:** [voicevault/tts/web_speech.py](../voicevault/tts/web_speech.py)
The TTS engine runs entirely in the browser via the `SpeechSynthesis` API: zero API cost, zero server load. Python's role is text preparation only.
### prepare_for_tts()
```python
def prepare_for_tts(answer: str, is_refusal: bool = False) -> str:
    if is_refusal or not answer:
        return ""
    text = _CITATION_MARKER_RE.sub("", answer)  # strip [Source: ...]
    text = re.sub(r"\s{2,}", " ", text).strip()
    return text
```
Removes `[Source: filename, p.N]` markers before passing the text to the browser, since reading "Source: paper dot pdf, p dot 3" aloud is poor UX. The JS bridge (`ui/components/audio_controls.py`) takes this cleaned text and calls `window._vv_tts.speak(text, rate, pitch)`.
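A self-contained version of the function, for experimentation. The definition of `_CITATION_MARKER_RE` is not shown in this document, so the pattern below is an assumption that matches any bracketed span beginning with `Source:`.

```python
import re

# Assumed pattern: the project's real _CITATION_MARKER_RE is not shown.
_CITATION_MARKER_RE = re.compile(r"\[Source:[^\]]*\]")


def prepare_for_tts(answer: str, is_refusal: bool = False) -> str:
    """Return speech-ready text, or "" when there is nothing to read."""
    if is_refusal or not answer:
        return ""
    text = _CITATION_MARKER_RE.sub("", answer)  # drop citation markers
    text = re.sub(r"\s{2,}", " ", text).strip()  # collapse leftover gaps
    return text


if __name__ == "__main__":
    print(prepare_for_tts("Attention [Source: paper.pdf, p.3] weighs tokens."))
```

With this assumed pattern, a marker placed directly before punctuation leaves a space in front of the punctuation mark. That is harmless for speech synthesis but visible if the text is also displayed.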
---
## Ask Tab (Full Pipeline)
**File:** [ui/tabs/ask_tab.py](../ui/tabs/ask_tab.py)
### End-to-End Query Flow
```
1. User records audio → stop_recording event fires
   → WhisperTranscriber.transcribe(audio_path) → transcript text
2. User selects KB(s) → clicks Ask
3. _query_fn():
   a. QueryPreprocessor.process(query) → pq (cleaned, typed)
   b. HybridRetriever(kb_names=selected).search(pq.processed_query) → results
   c. ContextBuilder().build(results) → (context_str, citation_map)
   d. AnswerChain.generate(query, context, citation_map, history, query_type) → generation
   e. db.log_query(...) (SHA-256 only, no raw text stored)
   f. format_citations_markdown(generation.citations) → citation panel
   g. prepare_for_tts(generation.answer, generation.is_refusal) → TTS text
   h. Update chatbot + citations + history state + TTS state
```
### State Management
- `gr.State([])`: conversation history as `list[tuple[str, str]]`
- `gr.State("")`: last answer text (for TTS playback)
Conversation history is passed to `AnswerChain._build_messages()` as `HumanMessage`/`AIMessage` pairs, the standard LangChain pattern for multi-turn conversation.
### Error Handling
Every failure path (no query, no KB selected, pipeline error) produces a user-visible error message in the chatbot rather than crashing. A query-logger failure is non-critical: it is caught and logged as a warning, never re-raised.
### Factory Functions
Event handlers are returned as closures from factory functions:
```python
def _make_transcribe_fn(transcriber):
    def _transcribe(audio_path): ...
    return _transcribe

def _make_query_fn(answer_chain, db_path):
    def _query(query, kb_names, history, chatbot): ...
    return _query
```
This enables dependency injection without globals: the `transcriber` and `answer_chain` objects are passed in from `app.py` and captured in the closure.
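A simplified sketch of the closure pattern combined with the error handling described earlier. The signature is deliberately reduced and is not the real `_query`. `AnswerChainLike`, `make_query_fn`, and the `log_query` callable are hypothetical names introduced for illustration.

```python
from typing import Callable, Protocol


class AnswerChainLike(Protocol):
    """Minimal interface assumed for the injected answer chain."""

    def generate(self, query: str) -> str: ...


def make_query_fn(answer_chain: AnswerChainLike, log_query: Callable[[str], None]):
    """Return a handler with its dependencies captured in the closure."""

    def _query(query: str, chatbot: list[tuple[str, str]]) -> list[tuple[str, str]]:
        if not query.strip():
            return chatbot + [("", "Please enter or record a question.")]
        try:
            answer = answer_chain.generate(query)
        except Exception as exc:  # surface pipeline errors in the chat
            return chatbot + [(query, f"Error: {exc}")]
        try:
            log_query(query)
        except Exception:
            pass  # logging is non-critical; a real handler would warn here
        # Build a new list instead of mutating the caller's state.
        return chatbot + [(query, answer)]

    return _query
```

Because the dependencies arrive as arguments, a test can pass in a fake chain or a failing logger without patching any module-level state.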
---
## KB Tab (Management UI)
**File:** [ui/tabs/kb_tab.py](../ui/tabs/kb_tab.py)
Three operations wired to Gradio event handlers:
| Button | Handler | Output |
|--------|---------|--------|
| ➕ Create KB | `_create_kb()` | Status message, refreshed dropdowns |
| 📤 Index Documents | `_upload_docs()` | Ingestion report per file |
| 🗑️ Delete KB | `_delete_kb()` | Status message, refreshed table + dropdowns |
After each create/delete, all dropdowns and the KB dataframe are updated via `gr.update(choices=...)`, so no page refresh is needed.
---
## Analytics Tab
**File:** [ui/tabs/analytics_tab.py](../ui/tabs/analytics_tab.py)
Pulls data from `sqlite_store.get_query_stats()` on refresh button click:
| Metric | Source |
|--------|--------|
| Total queries (7d) | `COUNT(*)` from `query_log` |
| Avg end-to-end latency | `AVG(total_latency_ms)` |
| Avg citations per answer | `AVG(citation_count)` |
| Queries by day | `GROUP BY DATE(timestamp)` |
| KB inventory | `KBManager.list_kbs()` |
Stats are not loaded on page load; the user clicks 🔄 Refresh to pull fresh data. This avoids unnecessary DB queries at startup.
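The aggregates in the table could be computed along these lines. This is a sketch, not the actual `sqlite_store.get_query_stats()`. The column names come from the table above, while the function signature, the returned dictionary keys, and the text-timestamp storage format are assumptions.

```python
import sqlite3


def get_query_stats(conn: sqlite3.Connection, days: int = 7) -> dict:
    """Aggregate query_log over the last `days` days (hypothetical shape)."""
    window = (f"-{days} days",)
    total, avg_latency, avg_citations = conn.execute(
        """
        SELECT COUNT(*), AVG(total_latency_ms), AVG(citation_count)
        FROM query_log
        WHERE timestamp >= DATETIME('now', ?)
        """,
        window,
    ).fetchone()
    by_day = conn.execute(
        """
        SELECT DATE(timestamp) AS day, COUNT(*)
        FROM query_log
        WHERE timestamp >= DATETIME('now', ?)
        GROUP BY day
        ORDER BY day
        """,
        window,
    ).fetchall()
    return {
        "total_queries": total,
        "avg_latency_ms": avg_latency or 0.0,  # AVG() is NULL on no rows
        "avg_citations": avg_citations or 0.0,
        "queries_by_day": by_day,
    }
```

`AVG()` returns `NULL` when no rows match, so the sketch falls back to `0.0` to keep the empty-database case displayable.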
---
## app.py: Startup Orchestration
**File:** [app.py](../app.py)
```python
_startup() → (kb_manager, transcriber, answer_chain):
    1. cfg.ensure_directories()
    2. KBManager(db_path=data_dir/voicevault.db) → initializes SQLite schema
    3. WhisperTranscriber() → lazy: no model loaded at startup
    4. AnswerChain() → lazy: LLM clients created per call
```
All three singletons are created once and passed to the UI tab builders. This avoids the model-loading overhead being repeated on every query.
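The "lazy: no model loaded at startup" idea can be illustrated with a small wrapper. `LazyModel` and its `loader` argument are hypothetical. How `WhisperTranscriber` actually defers loading is not shown in this document.

```python
from typing import Any, Callable, Optional


class LazyModel:
    """Defer an expensive load until first use, then reuse the result."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self._model: Optional[Any] = None
        self.load_count = 0  # exposed only so the demo can observe loads

    def get(self) -> Any:
        if self._model is None:
            self._model = self._loader()  # first call pays the load cost
            self.load_count += 1
        return self._model
```

Constructing the wrapper is cheap, so startup stays fast, and every later call to `get()` returns the already-loaded object. The sketch is not thread-safe: concurrent first calls would need a lock.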
---
## Security Decisions
### Password Storage
bcrypt with work factor 12 makes offline brute-force attacks expensive even if the SQLite file is exfiltrated. This is in line with common practice (OWASP recommends a bcrypt work factor ≥ 10).
### KB Name as Path Component
The KB name regex (`^[a-z0-9][a-z0-9\-]{0,62}[a-z0-9]$|^[a-z0-9]$`) prevents path traversal. All KB filesystem operations use `cfg.kb_dir(kb_name)`, which returns `data_dir / kb_name`, so a validated slug cannot escape `data_dir`.
### Query Audit Log: PII Protection
The raw query text is never stored in SQLite; only its SHA-256 hash is stored (`voice_query_hash`). This follows the GDPR "data minimization" principle: analytics work on aggregates, not raw user queries.
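Hashing a query before logging takes only a few lines with the standard library. The helper name `hash_query` is hypothetical. Only the use of SHA-256 and the `voice_query_hash` column are taken from the text above.

```python
import hashlib


def hash_query(query: str) -> str:
    """Return the hex SHA-256 digest to store in voice_query_hash."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(hash_query("what does section 3 say about retention?"))
```

An unsalted hash is deterministic, so identical queries produce identical digests. That allows counting repeated queries, but it also means short or common queries could in principle be recovered by hashing guesses.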
### No Globals in Event Handlers
All pipeline objects (transcriber, answer_chain, kb_manager) are passed via closures, not module-level globals. This makes the code testable (dependency injection) and prevents accidental shared state mutation.
---
## Test Coverage
**File:** [tests/test_phase5.py](../tests/test_phase5.py) | **55/55 passed**
| Class | Tests | What's verified |
|-------|-------|----------------|
| `TestKBManagerCreate` | 16 | Create, list, get, duplicate detection, 5 slug validation cases |
| `TestKBManagerDelete` | 3 | Delete removes from list, nonexistent raises, count decreases |
| `TestKBManagerPassword` | 7 | Public access, protected access, wrong pass, no pass, unknown KB, bcrypt format |
| `TestKBManagerStats` | 3 | Returns dict, has required keys, zeros on empty DB |
| `TestPreparForTTS` | 7 | Citation stripping, refusal → empty, normal text unchanged, no double spaces |
| `TestCitationPanel` | 8 | Filename, page, section, excerpt, multiple, numbered, empty, type |
| `TestUIHelpers` | 7 | KB choices, KB table format, protected lock icon, append_chat, no mutation |
| `TestAppStartup` | 4 | build_app returns Blocks, all three tab builders run without error |
### Fixture Design
The `manager` fixture creates a fresh KBManager backed by a temp SQLite path for each test, giving complete isolation with no shared state between tests.
---
## Full Project Test Summary
| Phase | Tests | Status |
|-------|-------|--------|
| Phase 0: Foundation | 58 passed | ✅ |
| Phase 1: Ingestion | 46 passed | ✅ |
| Phase 2: Retrieval | 33 passed, 0 errors | ✅ |
| Phase 3: ASR | 45 passed, 2 skipped (soundfile) | ✅ |
| Phase 4: Generation | 72 passed | ✅ |
| Phase 5: UI & Access | 55 passed | ✅ |
| **Total** | **309 passed, 2 skipped** | ✅ |
**Note on conftest.py CPU fix:** `CUDA_VISIBLE_DEVICES="-1"` is set in `tests/conftest.py` to force CPU for all tests. This prevents CUDA compatibility errors on RTX 5070 (sm_120 not supported by packaged PyTorch ≤ 2.x). Production deployment on HuggingFace Spaces uses NVIDIA T4 (sm_75) which is fully compatible.
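A minimal sketch of what that line in `tests/conftest.py` might look like. The exact contents of the file are not shown here, so the placement at module top level is an assumption.

```python
import os

# Hide all GPUs from CUDA. This must run before torch (or anything that
# imports it) initializes CUDA, so it sits at the top of conftest.py.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
```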