# Phase 5: Full UI, TTS & Access Control

**Status:** ✅ Complete | **Tests:** 55/55 passed | **Files:** 7 modules (3 UI tabs, 2 backend, 1 TTS, 1 updated app.py)

---

## What Was Built

Phase 5 wires all previous phases into a working end-to-end application.

| Module | Responsibility |
|--------|----------------|
| `voicevault/kb/kb_manager.py` | KB lifecycle: create, list, delete, ingest, password auth |
| `voicevault/tts/web_speech.py` | TTS text prep: strip citation markers before speech |
| `ui/tabs/ask_tab.py` | Full voice query pipeline in Gradio |
| `ui/tabs/kb_tab.py` | KB creation, document upload, management |
| `ui/tabs/analytics_tab.py` | Query stats from SQLite audit log |
| `ui/tabs/settings_tab.py` | Configuration panels (display-only) |
| `app.py` | Startup orchestration, pipeline wiring |

---

## KBManager

**File:** [voicevault/kb/kb_manager.py](../voicevault/kb/kb_manager.py)

### Central Database

All KBs share **one** SQLite database at `cfg.data_dir / "voicevault.db"`. This enables cross-KB queries, global analytics, and efficient listing without per-KB filesystem scanning.

### KB Name Validation

```python
_VALID_KB_NAME = re.compile(r"^[a-z0-9][a-z0-9\-]{0,62}[a-z0-9]$|^[a-z0-9]$")
```

- Lowercase alphanumeric + hyphens only
- 1–64 characters
- Cannot start or end with a hyphen
- Prevents path traversal attacks (no `..`, `/`, `\`, spaces)
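
As a quick check of the rules above, the sketch below runs the same pattern against a few candidate names. The helper name `is_valid_kb_name` is hypothetical and used only for illustration; the actual check in `kb_manager.py` may be structured differently.

```python
import re

# Same pattern as above: 2-64 chars with alphanumeric ends, or one alphanumeric char.
_VALID_KB_NAME = re.compile(r"^[a-z0-9][a-z0-9\-]{0,62}[a-z0-9]$|^[a-z0-9]$")


def is_valid_kb_name(name: str) -> bool:
    """Return True if `name` is an acceptable KB slug (hypothetical helper)."""
    # fullmatch also rejects a trailing newline, which `$` alone would accept.
    return _VALID_KB_NAME.fullmatch(name) is not None


candidates = ["research-2024", "a", "-draft", "draft-", "../etc", "My KB", "x" * 65]
results = {name: is_valid_kb_name(name) for name in candidates}
```

Only `research-2024` and `a` pass; the leading or trailing hyphen, the path separators, the uppercase letters and space, and the 65-character name are all rejected.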

### Password Protection (bcrypt)

```python
password_hash = bcrypt.hashpw(
    password.encode(), bcrypt.gensalt(rounds=cfg.bcrypt_rounds)  # default: 12
).decode()
```

- Passwords are hashed at creation time; plaintext is never stored
- `verify_password()` uses `bcrypt.checkpw()` for constant-time comparison
- Public KBs (no password) return True for any password check

### verify_password Logic

```
KB has no hash (public)    → True  (always accessible)
KB has hash, no password   → False (protected but no credentials)
KB has hash, with password → bcrypt.checkpw(password, hash)
```
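
The three branches can be sketched as a small function. This is a minimal illustration, not the project's implementation: the signature and the `check` parameter are assumptions, introduced so that the comparison function (for example `bcrypt.checkpw`) is injected rather than imported.

```python
from typing import Callable, Optional


def verify_password(
    stored_hash: Optional[str],
    password: Optional[str],
    check: Callable[[bytes, bytes], bool],
) -> bool:
    """Three-branch access check; `check` should behave like bcrypt.checkpw."""
    if not stored_hash:
        return True   # public KB: always accessible
    if not password:
        return False  # protected KB, but no credentials supplied
    # bcrypt.checkpw takes (password_bytes, hashed_bytes) and returns a bool.
    return check(password.encode(), stored_hash.encode())
```

With the real library this would be called as `verify_password(row_hash, user_input, bcrypt.checkpw)`; in tests, a trivial stand-in for `check` avoids the cost of real hashing.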

### ingest_documents Flow

```python
def ingest_documents(kb_name, file_paths, password=None):
    # 1. Verify KB exists
    # 2. Verify password
    # 3. IndexBuilder(kb_name).ingest_file(path, db_path) per file
    # 4. Return list[IngestionReport]
    ...
```

Delegates entirely to `IndexBuilder` (Phase 1) which handles parsing, chunking, embedding, ChromaDB upsert, BM25 rebuild, and deduplication.

### delete_kb Flow

```python
def delete_kb(kb_name):
    # 1. Verify KB exists (raises KBManagerError if not)
    # 2. db.delete_kb() → SQLite CASCADE deletes documents, chunks, query_log
    # 3. shutil.rmtree(cfg.kb_dir(kb_name)) → removes ChromaDB, BM25, files
    ...
```

Deletion is irreversible, so the UI asks for confirmation before calling `delete_kb`.
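
The delete flow can be sketched with the standard library alone. The table layout (`kbs`, `documents`), the function signature, and the use of `LookupError` in place of `KBManagerError` are assumptions for illustration. One detail worth noting is that SQLite leaves foreign key enforcement off by default, so `ON DELETE CASCADE` only takes effect after `PRAGMA foreign_keys = ON` on that connection.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def delete_kb(conn: sqlite3.Connection, kb_root: Path, kb_name: str) -> None:
    """Delete the KB row (child rows cascade), then remove its directory."""
    row = conn.execute("SELECT 1 FROM kbs WHERE name = ?", (kb_name,)).fetchone()
    if row is None:
        raise LookupError(f"KB not found: {kb_name}")  # stand-in for KBManagerError
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM kbs WHERE name = ?", (kb_name,))
    shutil.rmtree(kb_root / kb_name, ignore_errors=True)


# Demo with an in-memory database and a temporary directory.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite
conn.execute("CREATE TABLE kbs (name TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE documents ("
    " id INTEGER PRIMARY KEY,"
    " kb_name TEXT NOT NULL REFERENCES kbs(name) ON DELETE CASCADE)"
)
conn.execute("INSERT INTO kbs VALUES ('demo')")
conn.execute("INSERT INTO documents (kb_name) VALUES ('demo')")
conn.commit()

kb_root = Path(tempfile.mkdtemp())
(kb_root / "demo").mkdir()

delete_kb(conn, kb_root, "demo")
remaining_docs = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
dir_exists = (kb_root / "demo").exists()
shutil.rmtree(kb_root, ignore_errors=True)  # clean up the demo directory
```

Deleting the database row first means a failure in the filesystem step leaves orphaned files rather than a listed KB with missing data.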

---

## TTS: Web Speech API

**File:** [voicevault/tts/web_speech.py](../voicevault/tts/web_speech.py)

The TTS engine runs entirely in the browser via the `SpeechSynthesis` API, so it incurs no API cost and no server load. Python's role is text preparation only.

### prepare_for_tts()

```python
def prepare_for_tts(answer: str, is_refusal: bool = False) -> str:
    if is_refusal or not answer:
        return ""
    text = _CITATION_MARKER_RE.sub("", answer)  # strip [Source: ...]
    text = re.sub(r"\s{2,}", " ", text).strip()
    return text
```

Removes `[Source: filename, p.N]` markers before the text is passed to the browser, because reading "Source: paper dot pdf, p dot 3" aloud is poor UX. The JS bridge (`ui/components/audio_controls.py`) takes this cleaned text and calls `window._vv_tts.speak(text, rate, pitch)`.
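
A runnable version of the function needs a definition for `_CITATION_MARKER_RE`, which is not shown here. The pattern below is an assumption (any bracketed marker beginning with `Source:`); the real pattern in `web_speech.py` may differ.

```python
import re

# Assumed pattern: a bracketed marker that starts with "Source:".
# The real _CITATION_MARKER_RE in web_speech.py may differ.
_CITATION_MARKER_RE = re.compile(r"\[Source:[^\]]*\]")


def prepare_for_tts(answer: str, is_refusal: bool = False) -> str:
    if is_refusal or not answer:
        return ""  # nothing is spoken for refusals or empty answers
    text = _CITATION_MARKER_RE.sub("", answer)
    # Removing a marker leaves a double space behind; collapse it.
    return re.sub(r"\s{2,}", " ", text).strip()


spoken = prepare_for_tts("Mitochondria produce ATP [Source: paper.pdf, p.3] in most cells.")
silent = prepare_for_tts("I could not find that in the documents.", is_refusal=True)
```

Here `spoken` is `"Mitochondria produce ATP in most cells."` and `silent` is the empty string.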

---

## Ask Tab (Full Pipeline)

**File:** [ui/tabs/ask_tab.py](../ui/tabs/ask_tab.py)

### End-to-End Query Flow

```
1. User records audio → stop_recording event fires
   → WhisperTranscriber.transcribe(audio_path) → transcript text

2. User selects KB(s) → clicks Ask

3. _query_fn():
   a. QueryPreprocessor.process(query) → pq (cleaned, typed)
   b. HybridRetriever(kb_names=selected).search(pq.processed_query) → results
   c. ContextBuilder().build(results) → (context_str, citation_map)
   d. AnswerChain.generate(query, context, citation_map, history, query_type) → generation
   e. db.log_query(...)  ← SHA-256 only, no raw text stored
   f. format_citations_markdown(generation.citations) → citation panel
   g. prepare_for_tts(generation.answer, generation.is_refusal) → TTS text
   h. Update chatbot + citations + history state + TTS state
```

### State Management

- `gr.State([])`: conversation history as `list[tuple[str, str]]`
- `gr.State("")`: last answer text (for TTS playback)

Conversation history is passed to `AnswerChain._build_messages()` as proper `HumanMessage`/`AIMessage` pairs, which is the standard LangChain pattern for multi-turn conversation.

### Error Handling

Every failure path (no query, no KB selected, pipeline error) produces a user-visible error message in the chatbot rather than crashing. A query-logging failure is non-critical: it is caught and logged as a warning, and never propagates to the user.

### Factory Functions

Event handlers are returned as closures from factory functions:

```python
def _make_transcribe_fn(transcriber):
    def _transcribe(audio_path): ...
    return _transcribe

def _make_query_fn(answer_chain, db_path):
    def _query(query, kb_names, history, chatbot): ...
    return _query
```

This enables dependency injection without globals: the `transcriber` and `answer_chain` objects are passed in from `app.py` and captured in the closure.
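
The benefit for testing is easiest to see with a stand-in dependency. In the sketch below, `FakeTranscriber` and the empty-input guard are hypothetical; only the factory shape mirrors the pattern described above.

```python
class FakeTranscriber:
    """Stand-in for WhisperTranscriber; returns canned text (hypothetical)."""

    def transcribe(self, audio_path: str) -> str:
        return f"transcript of {audio_path}"


def make_transcribe_fn(transcriber):
    """Return an event handler that closes over the injected transcriber."""

    def _transcribe(audio_path):
        if not audio_path:
            return ""  # nothing recorded yet
        return transcriber.transcribe(audio_path)

    return _transcribe


handler = make_transcribe_fn(FakeTranscriber())
text = handler("clip.wav")
```

Because the handler only sees whatever object was passed to the factory, a test can exercise it without loading a speech model.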

---

## KB Tab (Management UI)

**File:** [ui/tabs/kb_tab.py](../ui/tabs/kb_tab.py)

Three operations wired to Gradio event handlers:

| Button | Handler | Output |
|--------|---------|--------|
| ➕ Create KB | `_create_kb()` | Status message, refreshed dropdowns |
| 📤 Index Documents | `_upload_docs()` | Ingestion report per file |
| 🗑️ Delete KB | `_delete_kb()` | Status message, refreshed table + dropdowns |

After each create or delete, all dropdowns and the KB dataframe are updated via `gr.update(choices=...)`, so no page refresh is needed.

---

## Analytics Tab

**File:** [ui/tabs/analytics_tab.py](../ui/tabs/analytics_tab.py)

Pulls data from `sqlite_store.get_query_stats()` on refresh button click:

| Metric | Source |
|--------|--------|
| Total queries (7d) | `COUNT(*)` from `query_log` |
| Avg end-to-end latency | `AVG(total_latency_ms)` |
| Avg citations per answer | `AVG(citation_count)` |
| Queries by day | `GROUP BY DATE(timestamp)` |
| KB inventory | `KBManager.list_kbs()` |

Stats are not loaded on page load; the user clicks 🔄 Refresh to pull fresh data. This avoids unnecessary DB queries at startup.
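
The aggregates in the table map onto two SQL statements. The sketch below is an assumption about how such a function could look: the column names come from the table above, but the schema, the function signature, and the returned keys are illustrative and not taken from `sqlite_store`.

```python
import sqlite3


def get_query_stats(conn: sqlite3.Connection, days: int = 7) -> dict:
    """Aggregate the audit log over the last `days` days."""
    window = (f"-{int(days)} days",)
    total, avg_latency, avg_citations = conn.execute(
        "SELECT COUNT(*), AVG(total_latency_ms), AVG(citation_count) "
        "FROM query_log WHERE timestamp >= datetime('now', ?)",
        window,
    ).fetchone()
    by_day = conn.execute(
        "SELECT DATE(timestamp), COUNT(*) FROM query_log "
        "WHERE timestamp >= datetime('now', ?) "
        "GROUP BY DATE(timestamp) ORDER BY DATE(timestamp)",
        window,
    ).fetchall()
    return {
        "total_queries": total,
        "avg_latency_ms": avg_latency or 0.0,  # AVG is NULL on an empty table
        "avg_citations": avg_citations or 0.0,
        "queries_by_day": by_day,
    }


# Demo: two recent rows and one row outside the 7-day window.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE query_log ("
    " timestamp TEXT, total_latency_ms REAL, citation_count INTEGER)"
)
conn.executemany(
    "INSERT INTO query_log VALUES (datetime('now', ?), ?, ?)",
    [("-1 hours", 800.0, 2), ("-2 hours", 1200.0, 4), ("-10 days", 5000.0, 9)],
)
stats = get_query_stats(conn)
```

The 10-day-old row is excluded, so the demo reports two queries with an average latency of 1000 ms and an average of three citations.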

---

## app.py: Startup Orchestration

**File:** [app.py](../app.py)

```python
def _startup():  # returns (kb_manager, transcriber, answer_chain)
    # 1. cfg.ensure_directories()
    # 2. KBManager(db_path=data_dir/voicevault.db)  ← initializes SQLite schema
    # 3. WhisperTranscriber()  ← lazy: no model loaded at startup
    # 4. AnswerChain()         ← lazy: LLM clients created per call
    ...
```

All three singletons are created once and passed to the UI tab builders, so model-loading overhead is not repeated on every query.
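
The combination of "created once" and "lazy" can be illustrated with a small wrapper that defers the expensive load until first use. `LazyTranscriber`, `fake_loader`, and the `load_count` counter are hypothetical names for this sketch; `WhisperTranscriber` may implement laziness differently.

```python
class LazyTranscriber:
    """Defers the expensive model load until the first transcription (hypothetical)."""

    def __init__(self, loader):
        self._loader = loader  # callable that builds the model
        self._model = None
        self.load_count = 0    # instrumentation for the demo only

    def _ensure_model(self):
        if self._model is None:
            self._model = self._loader()
            self.load_count += 1
        return self._model

    def transcribe(self, audio_path: str) -> str:
        return self._ensure_model()(audio_path)


def fake_loader():
    # Stands in for loading model weights; returns a callable "model".
    return lambda path: f"text from {path}"


transcriber = LazyTranscriber(fake_loader)  # cheap: nothing loaded yet
loads_at_startup = transcriber.load_count
first = transcriber.transcribe("a.wav")     # triggers the one-time load
second = transcriber.transcribe("b.wav")    # reuses the cached model
```

Startup stays fast because construction does no heavy work, and the single shared instance means the load happens at most once per process.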

---

## Security Decisions

### Password Storage
bcrypt with work factor 12 makes offline brute-force attacks far more expensive even if the SQLite file is exfiltrated. This is above the minimum bcrypt work factor of 10 recommended by OWASP.

### KB Name as Path Component
The KB name regex (`^[a-z0-9][a-z0-9\-]{0,62}[a-z0-9]$|^[a-z0-9]$`) prevents path traversal. All KB filesystem operations use `cfg.kb_dir(kb_name)`, which returns `data_dir / kb_name`, so a validated slug cannot escape the data directory.

### Query Audit Log: PII Protection
The raw query text is never stored in SQLite. Only the SHA-256 hash of the query is stored (`voice_query_hash`). This supports the GDPR data minimization principle: analytics work on aggregates, not raw user queries.
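
Hashing a query before logging takes one line with the standard library. The helper name `hash_query` is hypothetical, and whether the project normalizes the text (case, whitespace) before hashing is not stated here, so this sketch hashes the string as given.

```python
import hashlib


def hash_query(query: str) -> str:
    """Return the hex SHA-256 digest of the query text (hypothetical helper)."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


digest = hash_query("what is the refund policy?")
```

The digest is a fixed 64-character hex string, and identical queries produce identical digests, so repeat queries can still be counted without storing their text.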

### No Globals in Event Handlers
All pipeline objects (transcriber, answer_chain, kb_manager) are passed via closures, not module-level globals. This makes the code testable (dependency injection) and prevents accidental shared state mutation.

---

## Test Coverage

**File:** [tests/test_phase5.py](../tests/test_phase5.py) | **55/55 passed**

| Class | Tests | What's verified |
|-------|-------|----------------|
| `TestKBManagerCreate` | 16 | Create, list, get, duplicate detection, 5 slug validation cases |
| `TestKBManagerDelete` | 3 | Delete removes from list, nonexistent raises, count decreases |
| `TestKBManagerPassword` | 7 | Public access, protected access, wrong pass, no pass, unknown KB, bcrypt format |
| `TestKBManagerStats` | 3 | Returns dict, has required keys, zeros on empty DB |
| `TestPreparForTTS` | 7 | Citation stripping, refusal → empty, normal text unchanged, no double spaces |
| `TestCitationPanel` | 8 | Filename, page, section, excerpt, multiple, numbered, empty, type |
| `TestUIHelpers` | 7 | KB choices, KB table format, protected lock icon, append_chat, no mutation |
| `TestAppStartup` | 4 | build_app returns Blocks, all three tab builders run without error |

### Fixture Design

The `manager` fixture creates a fresh KBManager backed by a temp SQLite path for each test, giving complete isolation with no shared state between tests.

---

## Full Project Test Summary

| Phase | Tests | Status |
|-------|-------|--------|
| Phase 0: Foundation | 58 passed | ✅ |
| Phase 1: Ingestion | 46 passed | ✅ |
| Phase 2: Retrieval | 33 passed, 0 errors | ✅ |
| Phase 3: ASR | 45 passed, 2 skipped (soundfile) | ✅ |
| Phase 4: Generation | 72 passed | ✅ |
| Phase 5: UI & Access | 55 passed | ✅ |
| **Total** | **309 passed, 2 skipped** | ✅ |

**Note on conftest.py CPU fix:** `CUDA_VISIBLE_DEVICES="-1"` is set in `tests/conftest.py` to force CPU for all tests. This prevents CUDA compatibility errors on RTX 5070 (sm_120 not supported by packaged PyTorch ≤ 2.x). Production deployment on HuggingFace Spaces uses NVIDIA T4 (sm_75), which is fully compatible.
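
A minimal sketch of what that `conftest.py` setting could look like is shown below. The placement at the top of the file is an assumption, chosen because the variable is only read when CUDA initializes.

```python
# tests/conftest.py (sketch)
import os

# Set this before the process first initializes CUDA; the simplest way to
# guarantee that is to do it before any module imports torch.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
```

Because pytest loads `conftest.py` before collecting test modules, the setting is in place before any test imports GPU-dependent code.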