# Phase 5 — Full UI, TTS & Access Control

**Status:** ✅ Complete | **Tests:** 55/55 passed | **Files:** 7 modules (4 UI tabs, 1 KB backend, 1 TTS, 1 updated app.py)

---

## What Was Built

Phase 5 wires all previous phases into a working end-to-end application.

| Module | Responsibility |
|--------|----------------|
| `voicevault/kb/kb_manager.py` | KB lifecycle: create, list, delete, ingest, password auth |
| `voicevault/tts/web_speech.py` | TTS text prep: strip citation markers before speech |
| `ui/tabs/ask_tab.py` | Full voice query pipeline in Gradio |
| `ui/tabs/kb_tab.py` | KB creation, document upload, management |
| `ui/tabs/analytics_tab.py` | Query stats from SQLite audit log |
| `ui/tabs/settings_tab.py` | Configuration panels (display-only) |
| `app.py` | Startup orchestration, pipeline wiring |

---

## KBManager

**File:** [voicevault/kb/kb_manager.py](../voicevault/kb/kb_manager.py)

### Central Database

All KBs share **one** SQLite database at `cfg.data_dir / "voicevault.db"`. This enables cross-KB queries, global analytics, and efficient listing without per-KB filesystem scanning.

### KB Name Validation

```python
_VALID_KB_NAME = re.compile(r"^[a-z0-9][a-z0-9\-]{0,62}[a-z0-9]$|^[a-z0-9]$")
```

- Lowercase alphanumeric + hyphens only
- 1–64 characters
- Cannot start or end with a hyphen
- Prevents path traversal attacks (no `..`, `/`, `\`, spaces)
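
A minimal sketch of how the pattern above might be applied. The helper name `is_valid_kb_name` is hypothetical and not taken from the source. It uses `re.fullmatch` because, in Python, `$` also matches just before a trailing newline, so `match` alone would accept `"abc\n"`.

```python
import re

# Pattern copied from the section above.
_VALID_KB_NAME = re.compile(r"^[a-z0-9][a-z0-9\-]{0,62}[a-z0-9]$|^[a-z0-9]$")


def is_valid_kb_name(name: str) -> bool:
    """Return True if `name` is a safe KB slug (hypothetical helper)."""
    # fullmatch must consume the whole string, so "abc\n" is rejected even
    # though `$` on its own would match just before the trailing newline.
    return _VALID_KB_NAME.fullmatch(name) is not None


for candidate in ["my-kb", "a", "-bad", "bad-", "../etc", "Has Space"]:
    print(repr(candidate), is_valid_kb_name(candidate))
```

Only the first two candidates pass; everything containing a separator, an uppercase letter, a space, or a leading or trailing hyphen is rejected.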

### Password Protection (bcrypt)

```python
password_hash = bcrypt.hashpw(
    password.encode(), bcrypt.gensalt(rounds=cfg.bcrypt_rounds)  # default: 12
).decode()
```

- Passwords are hashed at creation time — plaintext never stored
- `verify_password()` uses `bcrypt.checkpw()` for constant-time comparison
- Public KBs (no password) return True for any password check

### verify_password Logic

```
KB has no hash (public)  → True  (always accessible)
KB has hash, no password → False (protected but no credentials)
KB has hash, with password → bcrypt.checkpw(password, hash)
```
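
The three branches above can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the function signature is hypothetical, and the checker is injected so the sketch runs without `bcrypt` installed. In the real module the checker would be `bcrypt.checkpw`, which takes bytes for both arguments.

```python
from typing import Callable, Optional


def verify_password(
    stored_hash: Optional[str],
    password: Optional[str],
    checkpw: Callable[[bytes, bytes], bool],
) -> bool:
    """Three-branch access check; `checkpw` stands in for bcrypt.checkpw."""
    if not stored_hash:
        return True   # public KB: always accessible
    if not password:
        return False  # protected KB, no credentials supplied
    # bcrypt.checkpw expects bytes for both the password and the hash.
    return checkpw(password.encode(), stored_hash.encode())


# Toy checker for demonstration only; NOT a secure comparison.
def fake_checkpw(password: bytes, hashed: bytes) -> bool:
    return hashed == b"hash-of:" + password


print(verify_password(None, None, fake_checkpw))            # public KB
print(verify_password("hash-of:pw", None, fake_checkpw))    # missing password
print(verify_password("hash-of:pw", "pw", fake_checkpw))    # correct password
```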

### ingest_documents Flow

```python
def ingest_documents(kb_name, file_paths, password=None):
    # 1. Verify KB exists
    # 2. Verify password
    # 3. IndexBuilder(kb_name).ingest_file(path, db_path) per file
    # 4. Return list[IngestionReport]
    ...
```

Ingestion is delegated entirely to `IndexBuilder` (Phase 1), which handles parsing, chunking, embedding, ChromaDB upsert, BM25 rebuild, and deduplication.

### delete_kb Flow

```python
def delete_kb(kb_name):
    # 1. Verify KB exists (raises KBManagerError if not)
    # 2. db.delete_kb() → SQLite CASCADE deletes documents, chunks, query_log
    # 3. shutil.rmtree(cfg.kb_dir(kb_name)) → removes ChromaDB, BM25, files
    ...
```

Irreversible — the UI confirms before calling.
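
The cascade in step 2 depends on SQLite foreign-key enforcement, which is off by default and must be enabled per connection. The sketch below shows the mechanism with a hypothetical two-table schema; the real VoiceVault schema is not shown in this document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript(
    """
    CREATE TABLE kbs (name TEXT PRIMARY KEY);
    CREATE TABLE documents (
        id INTEGER PRIMARY KEY,
        kb_name TEXT NOT NULL REFERENCES kbs(name) ON DELETE CASCADE
    );
    INSERT INTO kbs VALUES ('demo');
    INSERT INTO documents (kb_name) VALUES ('demo'), ('demo');
    """
)

# Deleting the parent row removes its child rows automatically.
conn.execute("DELETE FROM kbs WHERE name = ?", ("demo",))
remaining = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
print(remaining)  # 0
```

Without the `PRAGMA`, the `DELETE` would succeed but leave orphaned rows in `documents`.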

---

## TTS — Web Speech API

**File:** [voicevault/tts/web_speech.py](../voicevault/tts/web_speech.py)

The TTS engine runs entirely in the browser via the `SpeechSynthesis` API — zero API cost, zero server load. Python's role is text preparation only.

### prepare_for_tts()

```python
def prepare_for_tts(answer: str, is_refusal: bool = False) -> str:
    if is_refusal or not answer:
        return ""
    text = _CITATION_MARKER_RE.sub("", answer)  # strip [Source: ...]
    text = re.sub(r"\s{2,}", " ", text).strip()
    return text
```

Removes `[Source: filename, p.N]` markers before passing to the browser — reading "Source: paper dot pdf, p dot 3" aloud is poor UX. The JS bridge (`ui/components/audio_controls.py`) takes this cleaned text and calls `window._vv_tts.speak(text, rate, pitch)`.
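
For a runnable illustration, the snippet below repeats `prepare_for_tts()` with an assumed definition of `_CITATION_MARKER_RE`. The actual pattern is not shown in this document; the one used here also consumes whitespace in front of the marker so that no stray space is left before punctuation.

```python
import re

# Assumed pattern: matches "[Source: ...]" plus any whitespace before it.
_CITATION_MARKER_RE = re.compile(r"\s*\[Source:[^\]]*\]")


def prepare_for_tts(answer: str, is_refusal: bool = False) -> str:
    if is_refusal or not answer:
        return ""
    text = _CITATION_MARKER_RE.sub("", answer)
    return re.sub(r"\s{2,}", " ", text).strip()


spoken = prepare_for_tts(
    "Attention is quadratic [Source: paper.pdf, p.3]. "
    "It scales poorly. [Source: notes.md, p.1]"
)
print(spoken)  # Attention is quadratic. It scales poorly.
```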

---

## Ask Tab (Full Pipeline)

**File:** [ui/tabs/ask_tab.py](../ui/tabs/ask_tab.py)

### End-to-End Query Flow

```
1. User records audio → stop_recording event fires
   → WhisperTranscriber.transcribe(audio_path) → transcript text

2. User selects KB(s) → clicks Ask

3. _query_fn():
   a. QueryPreprocessor.process(query) → pq (cleaned, typed)
   b. HybridRetriever(kb_names=selected).search(pq.processed_query) → results
   c. ContextBuilder().build(results) → (context_str, citation_map)
   d. AnswerChain.generate(query, context, citation_map, history, query_type) → generation
   e. db.log_query(...)  ← SHA-256 only, no raw text stored
   f. format_citations_markdown(generation.citations) → citation panel
   g. prepare_for_tts(generation.answer, generation.is_refusal) → TTS text
   h. Update chatbot + citations + history state + TTS state
```

### State Management

- `gr.State([])` — conversation history as `list[tuple[str, str]]`
- `gr.State("")` — last answer text (for TTS playback)

Conversation history is passed to `AnswerChain._build_messages()` as proper `HumanMessage`/`AIMessage` pairs — the correct LangChain pattern for multi-turn conversation.

### Error Handling

Every failure path (no query, no KB selected, pipeline error) produces a user-visible error message in the chatbot rather than crashing. A query-logger failure is treated as non-critical: it is caught and logged as a warning, and never propagates to the user.
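
One way to keep audit logging off the critical path is a small wrapper like the one below. This is a sketch, not the project's code: `log_query_safely`, the logger name, and the keyword fields are all hypothetical.

```python
import logging

logger = logging.getLogger("voicevault.ask_tab")  # logger name is an assumption


def log_query_safely(log_fn, **fields) -> bool:
    """Call `log_fn(**fields)`; warn and return False on any failure."""
    try:
        log_fn(**fields)
        return True
    except Exception as exc:  # audit logging must never break the answer path
        logger.warning("Query logging failed: %s", exc)
        return False


def broken_logger(**fields):
    raise RuntimeError("database is locked")


ok = log_query_safely(broken_logger, kb_name="demo", total_latency_ms=120)
print(ok)  # False
```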

### Factory Functions

Event handlers are returned as closures from factory functions:

```python
def _make_transcribe_fn(transcriber):
    def _transcribe(audio_path): ...
    return _transcribe

def _make_query_fn(answer_chain, db_path):
    def _query(query, kb_names, history, chatbot): ...
    return _query
```

This enables dependency injection without globals — the `transcriber` and `answer_chain` objects are passed in from `app.py` and captured in the closure.
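
Because the dependency is injected, a handler can be exercised with a test double instead of a real Whisper model. The sketch below assumes only the `transcribe(audio_path)` call shape from the query flow above; `FakeTranscriber` and the empty-input behaviour are illustrative.

```python
class FakeTranscriber:
    """Test double; only the `transcribe(audio_path)` shape comes from the source."""

    def transcribe(self, audio_path: str) -> str:
        return f"transcript of {audio_path}"


def _make_transcribe_fn(transcriber):
    def _transcribe(audio_path):
        if not audio_path:
            return ""  # nothing recorded; assumed behaviour
        return transcriber.transcribe(audio_path)

    return _transcribe


handler = _make_transcribe_fn(FakeTranscriber())
print(handler("clip.wav"))  # transcript of clip.wav
```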

---

## KB Tab (Management UI)

**File:** [ui/tabs/kb_tab.py](../ui/tabs/kb_tab.py)

Three operations wired to Gradio event handlers:

| Button | Handler | Output |
|--------|---------|--------|
| ➕ Create KB | `_create_kb()` | Status message, refreshed dropdowns |
| 📤 Index Documents | `_upload_docs()` | Ingestion report per file |
| 🗑️ Delete KB | `_delete_kb()` | Status message, refreshed table + dropdowns |

After each create/delete, all dropdowns and the KB dataframe are updated via `gr.update(choices=...)` — no page refresh needed.

---

## Analytics Tab

**File:** [ui/tabs/analytics_tab.py](../ui/tabs/analytics_tab.py)

The tab pulls data from `sqlite_store.get_query_stats()` when the refresh button is clicked:

| Metric | Source |
|--------|--------|
| Total queries (7d) | `COUNT(*)` from `query_log` |
| Avg end-to-end latency | `AVG(total_latency_ms)` |
| Avg citations per answer | `AVG(citation_count)` |
| Queries by day | `GROUP BY DATE(timestamp)` |
| KB inventory | `KBManager.list_kbs()` |

Stats are not loaded on page load — the user clicks 🔄 Refresh to pull fresh data. This avoids unnecessary DB queries at startup.
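
The aggregates in the table could be computed with queries along these lines. The column names follow the table above, but the table layout and the 7-day `WHERE` clause are assumptions; the actual body of `get_query_stats()` is not shown in this document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE query_log (
        timestamp TEXT NOT NULL,
        total_latency_ms REAL,
        citation_count INTEGER
    );
    INSERT INTO query_log VALUES
        (datetime('now'), 100, 2),
        (datetime('now', '-1 day'), 300, 4),
        (datetime('now', '-30 days'), 900, 0);
    """
)

# Assumed window: only rows from the last 7 days are counted.
WINDOW = "timestamp >= datetime('now', '-7 days')"
total, avg_latency, avg_citations = conn.execute(
    "SELECT COUNT(*), AVG(total_latency_ms), AVG(citation_count) "
    f"FROM query_log WHERE {WINDOW}"
).fetchone()
by_day = conn.execute(
    "SELECT DATE(timestamp), COUNT(*) FROM query_log "
    f"WHERE {WINDOW} GROUP BY DATE(timestamp) ORDER BY 1"
).fetchall()

print(total, avg_latency, avg_citations)  # 2 200.0 3.0
```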

---

## app.py — Startup Orchestration

**File:** [app.py](../app.py)

```python
def _startup():
    """Return (kb_manager, transcriber, answer_chain)."""
    # 1. cfg.ensure_directories()
    # 2. KBManager(db_path=data_dir/voicevault.db)  ← initializes SQLite schema
    # 3. WhisperTranscriber()  ← lazy: no model loaded at startup
    # 4. AnswerChain()         ← lazy: LLM clients created per call
```

All three singletons are created once at startup and passed to the UI tab builders, so model loading is not repeated on every query.
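
The "lazy" behaviour noted for `WhisperTranscriber` can be pictured with the pattern below. `LazyTranscriber` and `fake_loader` are hypothetical stand-ins; the point is only that construction is cheap and the expensive load happens once, on first use.

```python
class LazyTranscriber:
    """Hypothetical stand-in showing lazy model loading."""

    def __init__(self, load_model):
        self._load_model = load_model  # callable that does the expensive work
        self._model = None             # nothing loaded at construction time

    def transcribe(self, audio_path: str) -> str:
        if self._model is None:        # first call pays the load cost once
            self._model = self._load_model()
        return self._model(audio_path)


load_calls = []


def fake_loader():
    load_calls.append(1)
    return lambda path: f"text from {path}"


transcriber = LazyTranscriber(fake_loader)  # no load yet
transcriber.transcribe("a.wav")
transcriber.transcribe("b.wav")
print(len(load_calls))  # 1
```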

---

## Security Decisions

### Password Storage
bcrypt with work factor 12 makes offline brute-force attacks costly even if the SQLite file is exfiltrated. This is above the minimum bcrypt work factor of 10 recommended by OWASP.

### KB Name as Path Component
The KB name regex (`^[a-z0-9][a-z0-9\-]{0,62}[a-z0-9]$|^[a-z0-9]$`) admits only lowercase alphanumerics and interior hyphens, which rules out path traversal sequences. All KB filesystem operations use `cfg.kb_dir(kb_name)`, which returns `data_dir / kb_name`, so a validated slug cannot escape `data_dir`.

### Query Audit Log — PII Protection
The raw query text is never stored in SQLite; only its SHA-256 hash is stored (`voice_query_hash`). This supports the GDPR data-minimization principle: analytics work on aggregates, not raw user queries.
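
A minimal sketch of the hashing step, using the standard library. The helper name `hash_query` is hypothetical, and any normalization applied before hashing in the real code is not shown in this document.

```python
import hashlib


def hash_query(query: str) -> str:
    """Return the hex SHA-256 digest of the query text (hypothetical helper)."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


digest = hash_query("what is the refund policy?")
print(len(digest))  # 64
```

The digest is deterministic, so repeated queries can still be counted in analytics without the original text being retained.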

### No Globals in Event Handlers
All pipeline objects (transcriber, answer_chain, kb_manager) are passed via closures, not module-level globals. This makes the code testable (dependency injection) and prevents accidental shared state mutation.

---

## Test Coverage

**File:** [tests/test_phase5.py](../tests/test_phase5.py) | **55/55 passed**

| Class | Tests | What's verified |
|-------|-------|----------------|
| `TestKBManagerCreate` | 16 | Create, list, get, duplicate detection, 5 slug validation cases |
| `TestKBManagerDelete` | 3 | Delete removes from list, nonexistent raises, count decreases |
| `TestKBManagerPassword` | 7 | Public access, protected access, wrong pass, no pass, unknown KB, bcrypt format |
| `TestKBManagerStats` | 3 | Returns dict, has required keys, zeros on empty DB |
| `TestPreparForTTS` | 7 | Citation stripping, refusal → empty, normal text unchanged, no double spaces |
| `TestCitationPanel` | 8 | Filename, page, section, excerpt, multiple, numbered, empty, type |
| `TestUIHelpers` | 7 | KB choices, KB table format, protected lock icon, append_chat, no mutation |
| `TestAppStartup` | 4 | build_app returns Blocks, all three tab builders run without error |

### Fixture Design

The `manager` fixture creates a fresh KBManager backed by a temp SQLite path for each test — complete isolation with no shared state between tests.

---

## Full Project Test Summary

| Phase | Tests | Status |
|-------|-------|--------|
| Phase 0 — Foundation | 58 passed | ✅ |
| Phase 1 — Ingestion | 46 passed | ✅ |
| Phase 2 — Retrieval | 33 passed, 0 errors | ✅ |
| Phase 3 — ASR | 45 passed, 2 skipped (soundfile) | ✅ |
| Phase 4 — Generation | 72 passed | ✅ |
| Phase 5 — UI & Access | 55 passed | ✅ |
| **Total** | **309 passed, 2 skipped** | ✅ |

**Note on conftest.py CPU fix:** `CUDA_VISIBLE_DEVICES="-1"` is set in `tests/conftest.py` to force CPU for all tests. This prevents CUDA compatibility errors on RTX 5070 (sm_120 not supported by packaged PyTorch ≤ 2.x). Production deployment on HuggingFace Spaces uses NVIDIA T4 (sm_75) which is fully compatible.
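
The CPU-only setting might look like this in `tests/conftest.py`. Placing it at module level, before any `torch` import, is an assumption about the file's layout; the variable needs to be set before CUDA is initialized to take effect.

```python
# tests/conftest.py (sketch)
import os

# Hide all CUDA devices so every test runs on CPU. This must be set
# before CUDA is initialized, so it sits above any torch import.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
```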