# Phase 0: Project Foundation
**Status:** ✅ Complete | **Tests:** 58/58 passed | **Date:** March 2026
---
## Overview
Phase 0 establishes the complete project skeleton before any business logic is written. Every subsequent phase builds on top of this foundation: the directory structure, dependency manifest, centralized config, data contracts (Pydantic models), SQLite schema, and the 4-tab Gradio scaffold are all locked in here.
**Why lock these first?** Schema drift between phases is one of the most common causes of bugs in ML pipelines. By defining the data models and database schema in Phase 0, every later module is guaranteed to produce and consume the same data shapes.
---
## Files Created
| File | Purpose |
|------|---------|
| `requirements.txt` | All project dependencies with rationale comments |
| `.env.example` | Environment variable template (no secrets) |
| `config.py` | Pydantic-settings centralized config singleton |
| `voicevault/__init__.py` | Package init with `__version__`, `__author__` |
| `voicevault/models.py` | All Pydantic data contracts (8 models) |
| `voicevault/asr/__init__.py` | ASR sub-package declaration |
| `voicevault/ingestion/__init__.py` | Ingestion sub-package declaration |
| `voicevault/retrieval/__init__.py` | Retrieval sub-package declaration |
| `voicevault/generation/__init__.py` | Generation sub-package declaration |
| `voicevault/kb/__init__.py` | KB management sub-package declaration |
| `voicevault/tts/__init__.py` | TTS sub-package declaration |
| `voicevault/storage/__init__.py` | Storage sub-package declaration |
| `voicevault/storage/sqlite_store.py` | Full SQLite schema + all CRUD operations |
| `ui/__init__.py` | UI package declaration |
| `ui/tabs/__init__.py` | Tabs sub-package declaration |
| `ui/tabs/ask_tab.py` | Ask tab placeholder (Phase 5 activates it) |
| `ui/tabs/kb_tab.py` | KB Manager tab placeholder |
| `ui/tabs/analytics_tab.py` | Analytics tab placeholder |
| `ui/tabs/settings_tab.py` | Settings tab placeholder |
| `ui/components/__init__.py` | Components sub-package declaration |
| `ui/components/citation_panel.py` | Citation formatter + placeholder |
| `ui/components/audio_controls.py` | Web Speech API JS bridge + placeholder |
| `app.py` | Gradio Blocks entry point (4-tab scaffold) |
| `tests/__init__.py` | Test package declaration |
| `tests/conftest.py` | Shared pytest fixtures |
| `tests/test_phase0.py` | 58 smoke tests covering all Phase 0 deliverables |
| `PLAN.md` | Master E2E implementation plan |
| `DOCS/phase0_foundation.md` | This document |
---
## Architecture Decisions
### 1. Pydantic-Settings for Config (`config.py`)
**What:** A single `VoiceVaultConfig` class inheriting from `pydantic_settings.BaseSettings`. One singleton `cfg` object imported everywhere.
**Why:** Raw `os.environ` calls scattered across modules create maintenance hell. With pydantic-settings:
- Every env var has a typed field with a documented default
- Missing required vars raise a clear `ValidationError` at startup, not a `KeyError` buried in a hot path
- `model_config = SettingsConfigDict(env_file=".env")` means local dev just needs a `.env` file, with no export commands
- The `ensure_directories()` method runs once at startup to create `data/`, `data/uploads/`, `models/`, so directory creation never fails midway through a request
**Key design choice (path helpers as properties/methods, not raw strings):**
```python
# Bad: scattered across modules
path = "data/" + kb_name + "/chroma"
# Good: single definition in config
path = cfg.kb_chroma_dir(kb_name)
```
If the directory layout ever changes, only `config.py` needs to be updated.
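The idea can be sketched with a plain frozen dataclass standing in for the pydantic-settings class, so that it runs on the standard library alone. The `PathConfig` name, the `data_dir` field, the `kb_dir` helper, and the `data/<kb_name>/chroma` layout are assumptions inferred from the snippet above, not the contents of the real `config.py`.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class PathConfig:
    """Hypothetical stand-in for the path-related part of VoiceVaultConfig."""

    data_dir: Path = Path("data")  # assumed root; the real default may differ

    def kb_dir(self, kb_name: str) -> Path:
        # The only place that knows where a KB lives on disk.
        return self.data_dir / kb_name

    def kb_chroma_dir(self, kb_name: str) -> Path:
        # Derived from kb_dir, so a layout change touches only this class.
        return self.kb_dir(kb_name) / "chroma"


cfg = PathConfig()
print(cfg.kb_chroma_dir("research"))
```

Callers only ever ask `cfg` for a path, so moving the vector store to a different folder would be a one-line change in the helper.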
**Security fields locked early:**
- `bcrypt_rounds: int = 12`: minimum safe work factor enforced at config level
- `share_link_expiry_days: int = 7`: default expiry for HMAC share tokens
- `allowed_extensions: frozenset`: immutable security whitelist at config level
---
### 2. Pydantic Data Models (`voicevault/models.py`)
**8 models defined:**
| Model | Role | Key Fields |
|-------|------|-----------|
| `DocumentChunk` | A single indexed text chunk | `chunk_id` (UUID), `text_hash` (SHA-256), `page_number`, `section` |
| `IngestionReport` | Result of indexing one document | `status` (success/error/skipped), `chunk_count`, `duration_ms` |
| `RetrievalResult` | A retrieved chunk with scores | `rrf_score`, `rerank_score` |
| `Citation` | One source reference in an answer | `source_file`, `page_number`, `excerpt`, `relevance_score` |
| `QuerySession` | Full query → answer audit record | All latencies, `groq_tokens_used`, `citations` list |
| `KnowledgeBase` | A named document collection | `kb_name` (slug), `password_hash`, `is_protected` property |
| `Document` | A source document in a KB | `file_hash` (SHA-256 for dedup), `is_private` |
| `TranscriptResult` | Whisper ASR output | `transcript`, `model_used`, `confidence`, `query_type` |
**Why lock models in Phase 0?**
Every module from Phase 1 onwards produces or consumes these types. If `DocumentChunk` were defined in `ingestion/` and `RetrievalResult` in `retrieval/`, those packages would have to import from each other, and circular imports would be hard to avoid. Centralizing the models in `models.py` removes that cycle, as long as `models.py` itself imports from none of the sub-packages.
**UUID auto-generation:**
```python
chunk_id: str = Field(default_factory=lambda: str(uuid.uuid4()))
```
Every entity gets a unique ID without any external ID generator. Safe for SQLite + ChromaDB + in-memory use.
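The per-instance behaviour of a default factory can be shown with a stdlib dataclass standing in for the Pydantic model (`dataclasses.field` accepts a `default_factory` in the same way). `ChunkSketch` is a hypothetical, trimmed-down class, not the real `DocumentChunk`.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ChunkSketch:
    """Hypothetical stand-in for DocumentChunk with only two fields."""

    text: str
    # The factory is called once per instance, so IDs are never shared.
    chunk_id: str = field(default_factory=lambda: str(uuid.uuid4()))


a = ChunkSketch(text="same text")
b = ChunkSketch(text="same text")
print(a.chunk_id != b.chunk_id)
```

Two chunks with identical text still receive different IDs, which is why a separate content hash (`text_hash`) is needed for deduplication.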
---
### 3. SQLite Metadata Store (`voicevault/storage/sqlite_store.py`)
**Schema (4 tables):**
```sql
knowledge_bases -- KB registry (name, password hash, owner, counts)
documents -- Per-KB document registry (file hash for dedup, page/chunk count)
chunks -- Chunk-level metadata (text hash, page, section, language)
query_log -- Append-only audit trail (anonymized query hash, all latencies)
```
**Critical security decision (query log anonymization):**
The `query_log` table stores `voice_query_hash` (SHA-256 of the query text), **not** the raw query text. This is enforced in the schema (`voice_query_hash TEXT` column, no `voice_query` column) and verified in `test_query_log_schema`. Raw voice queries could contain PII β€” they are never persisted.
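A minimal sketch of the hashing step, assuming UTF-8 encoding and a hex digest. The `hash_query` helper and the two-column table are hypothetical; the real `query_log` has more columns.

```python
import hashlib
import sqlite3


def hash_query(query_text: str) -> str:
    """Return the SHA-256 hex digest of a query (UTF-8 encoding assumed)."""
    return hashlib.sha256(query_text.encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
# Trimmed-down, hypothetical query_log: there is no raw-text column at all.
conn.execute(
    "CREATE TABLE query_log (id INTEGER PRIMARY KEY, voice_query_hash TEXT)"
)

query = "what is my account balance"
conn.execute(
    "INSERT INTO query_log (voice_query_hash) VALUES (?)", (hash_query(query),)
)

stored = conn.execute("SELECT voice_query_hash FROM query_log").fetchone()[0]
columns = [row[1] for row in conn.execute("PRAGMA table_info(query_log)")]
```

Because the schema has no column for the raw text, a later code change cannot start persisting queries without also changing the schema, which the schema test would flag.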
**WAL mode enabled on every connection:**
```python
conn.execute("PRAGMA journal_mode=WAL;")
```
WAL (Write-Ahead Logging) allows concurrent readers while a writer is active. This is essential for the Analytics tab, which reads query stats while the main thread is writing a new query log entry.
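A connection helper along these lines would apply the pragma on every connection. The `get_connection` name and the demo file name are assumptions, not the module's actual API.

```python
import os
import sqlite3
import tempfile


def get_connection(db_path: str) -> sqlite3.Connection:
    """Open a connection and switch the database to WAL journaling."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL;")
    return conn


# WAL needs a file-backed database; an in-memory one reports "memory".
with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = os.path.join(tmp_dir, "voicevault_demo.db")
    conn = get_connection(db_path)
    # Querying the pragma without a value returns the current mode.
    mode = conn.execute("PRAGMA journal_mode;").fetchone()[0]
    conn.close()
```

The journal mode is stored in the database file, so repeating the pragma on later connections is harmless.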
**Foreign keys with CASCADE:**
```sql
kb_name TEXT REFERENCES knowledge_bases(kb_name) ON DELETE CASCADE
```
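Deleting a KB automatically deletes all its documents and chunks, so no orphaned rows remain. Note that SQLite enforces foreign keys (and therefore the cascade) only on connections that have run `PRAGMA foreign_keys = ON`; enforcement is off by default.
**Parameterized queries everywhere (example):**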
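The cascade can be checked in isolation with an in-memory database. The two tables below are hypothetical, reduced versions of `knowledge_bases` and `documents`, and the `doc_id` column name is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite ignores foreign-key constraints unless this is set on the connection.
conn.execute("PRAGMA foreign_keys = ON;")

conn.execute("CREATE TABLE knowledge_bases (kb_name TEXT PRIMARY KEY)")
conn.execute(
    """
    CREATE TABLE documents (
        doc_id  TEXT PRIMARY KEY,
        kb_name TEXT REFERENCES knowledge_bases(kb_name) ON DELETE CASCADE
    )
    """
)

conn.execute("INSERT INTO knowledge_bases VALUES (?)", ("research",))
conn.execute("INSERT INTO documents VALUES (?, ?)", ("doc-1", "research"))

# Deleting the parent row removes the dependent document row as well.
conn.execute("DELETE FROM knowledge_bases WHERE kb_name = ?", ("research",))
remaining = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
```

Without the `PRAGMA foreign_keys = ON` line, the same delete would leave `doc-1` behind.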
```python
# CORRECT: parameterized
conn.execute("SELECT * FROM knowledge_bases WHERE kb_name = ?", (kb_name,))
# NEVER: f-string SQL (SQL injection vulnerability)
# conn.execute(f"SELECT * FROM knowledge_bases WHERE kb_name = '{kb_name}'")
```
This pattern is used throughout the module. The test suite verifies that the schema is correct; the absence of string-built SQL is confirmed by code review (all queries use `?` placeholders) rather than by an automated test.
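The difference is easy to demonstrate with a classic injection string. The sketch below uses a hypothetical one-column `knowledge_bases` table and runs the unsafe variant only to illustrate the failure mode.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE knowledge_bases (kb_name TEXT PRIMARY KEY)")
conn.execute("INSERT INTO knowledge_bases VALUES (?)", ("research",))

malicious = "x' OR '1'='1"

# Bound as a value: the whole string is compared against kb_name, so no match.
safe_rows = conn.execute(
    "SELECT kb_name FROM knowledge_bases WHERE kb_name = ?", (malicious,)
).fetchall()

# Interpolated into the SQL text: the quote closes the literal and the
# OR clause matches every row.
unsafe_rows = conn.execute(
    f"SELECT kb_name FROM knowledge_bases WHERE kb_name = '{malicious}'"
).fetchall()
```

Here `safe_rows` is empty, while `unsafe_rows` contains the `research` row that the caller never asked for.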
**Idempotent initialization:**
`initialize_database()` uses `CREATE TABLE IF NOT EXISTS`, so it is safe to call on every app startup. The application calls it in `_startup()` before accepting any requests.
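A reduced sketch of the idempotent pattern follows. The function here takes a connection argument and creates only two simplified tables; both choices are assumptions made for the example, and the real signature and schema may differ.

```python
import sqlite3


def initialize_database(conn: sqlite3.Connection) -> None:
    """Create the schema if it is missing; does nothing on later calls."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS knowledge_bases (kb_name TEXT PRIMARY KEY);
        CREATE TABLE IF NOT EXISTS query_log (
            id INTEGER PRIMARY KEY,
            voice_query_hash TEXT
        );
        """
    )


conn = sqlite3.connect(":memory:")
initialize_database(conn)
conn.execute("INSERT INTO knowledge_bases VALUES (?)", ("research",))
conn.commit()
initialize_database(conn)  # second call: no error, existing rows untouched

tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
kb_count = conn.execute("SELECT COUNT(*) FROM knowledge_bases").fetchone()[0]
```

The second call neither raises nor drops data, which is what makes it safe to run unconditionally at startup.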
---
### 4. Gradio App Scaffold (`app.py`)
**4-tab Blocks layout:**
```
gr.Blocks
└── gr.Tabs
    ├── Tab 1: 🎙️ Ask VoiceVault ← build_ask_tab()
    ├── Tab 2: 📂 Knowledge Bases ← build_kb_tab()
    ├── Tab 3: 📊 Analytics ← build_analytics_tab()
    └── Tab 4: ⚙️ Settings ← build_settings_tab()
```
Each tab is a separate function in its own module (`ui/tabs/`). This enables:
- Phase-by-phase activation: each tab becomes functional as its phase completes
- Independent testing of each tab builder
- Clear separation: the tab builder returns nothing and just renders into the active Blocks context
**Startup sequence:**
```python
_startup() # ensures directories, logs config summary (no secrets)
app = build_app() # constructs Gradio Blocks
app.launch(...) # binds to host:port
```
**Gradio version compatibility:**
Testing revealed that Gradio 6.x moved `theme` and `css` from `gr.Blocks(...)` to `launch(...)`. The test suite caught this immediately (`test_gradio_app_builds`), and the fix was isolated to `app.py`. This is why Phase 0 tests exist: they catch API drift before it causes runtime failures.
---
### 5. Web Speech API Bridge (`ui/components/audio_controls.py`)
The JavaScript that drives browser TTS is declared as a module constant (`WEB_SPEECH_JS`) in Phase 0. It will be injected via `gr.HTML` in Phase 5.
**Why declare it now?**
The JS bridge is a security-sensitive piece (it executes in the browser). By declaring it as a constant rather than building it dynamically, it is:
- Auditable as a static artifact (code review can inspect it)
- Testable (`test_tts_html_contains_js` verifies `speechSynthesis` and `_vv_tts` are present)
- Not constructable from user input (no injection surface)
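A static bridge of this kind might look like the sketch below. The JavaScript body and the `tts_html` helper are hypothetical; only the `WEB_SPEECH_JS` name and the `speechSynthesis` and `_vv_tts` identifiers come from the description above.

```python
# Hypothetical contents: the real WEB_SPEECH_JS in audio_controls.py may differ.
WEB_SPEECH_JS = """
<script>
window._vv_tts = function (text) {
  if (!("speechSynthesis" in window)) { return; }
  window.speechSynthesis.cancel();  // stop anything still being spoken
  window.speechSynthesis.speak(new SpeechSynthesisUtterance(text));
};
</script>
"""


def tts_html() -> str:
    """Return the static bridge markup unchanged.

    The function takes no arguments, so no user input can reach the script.
    """
    return WEB_SPEECH_JS


html = tts_html()
```

A test in the spirit of `test_tts_html_contains_js` then only needs to check that both identifiers appear in the returned string.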
---
## Test Results
```
58 passed, 0 failed - 18.91s
TestConfig (10 tests) - config loading, types, defaults, path helpers, security
TestModels (9 tests) - all 8 Pydantic models instantiate and validate correctly
TestSQLiteSchema (6 tests) - tables created, idempotent, schema columns verified
TestSQLiteCRUD (11 tests) - full CRUD round-trips for all tables
TestPackageImports (14 tests) - every __init__.py and public function importable
TestUIComponents (8 tests) - citation formatter, TTS HTML, Gradio build
```
**Warnings noted (not failures):**
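- `datetime.utcnow()` deprecation: raised while Pydantic v2 evaluates default factories. It was attributed to Pydantic internals rather than project code, although a model field declared with `default_factory=datetime.utcnow` would trigger the same warning and could be fixed by switching to `datetime.now(timezone.utc)`. Tracked for a future upgrade.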
---
## Security Audit β€” Phase 0
| Check | Status | Notes |
|-------|--------|-------|
| No API keys in code | ✅ Pass | `.env.example` has placeholders only |
| No hardcoded secrets | ✅ Pass | All sensitive values via env vars |
| Parameterized SQL | ✅ Pass | All queries use `?` placeholders |
| Query log anonymization | ✅ Pass | `voice_query_hash` only, no raw text |
| bcrypt rounds ≥ 12 | ✅ Pass | Enforced by config default + test |
| Extension whitelist defined | ✅ Pass | `frozenset` in config (immutable) |
| Data dir not git-tracked | ✅ Pass | `.gitignore` covers `data/` |
| `.env` not committed | ✅ Pass | `.gitignore` covers `.env` |
---
## Progress Tracker Update
| Phase | Status | Tests | Docs |
|-------|--------|-------|------|
| **Phase 0: Foundation** | ✅ Done | ✅ 58/58 | ✅ Done |
| Phase 1: Ingestion | ⬜ Next | ⬜ | ⬜ |
| Phase 2: Retrieval | ⬜ | ⬜ | ⬜ |
| Phase 3: ASR | ⬜ | ⬜ | ⬜ |
| Phase 4: Generation | ⬜ | ⬜ | ⬜ |
| Phase 5: UI & Access | ⬜ | ⬜ | ⬜ |