Phase 0: Project Foundation
Status: ✅ Complete | Tests: 58/58 passed | Date: March 2026
Overview
Phase 0 establishes the complete project skeleton before any business logic is written. Every subsequent phase builds on this foundation: the directory structure, dependency manifest, centralized config, data contracts (Pydantic models), SQLite schema, and the 4-tab Gradio scaffold are all locked in here.
Why lock these first? Schema drift between phases is one of the most common causes of bugs in ML pipelines. By defining the data models and database schema in Phase 0, every later module is guaranteed to produce and consume the same data shapes.
Files Created
| File | Purpose |
|---|---|
| requirements.txt | All project dependencies with rationale comments |
| .env.example | Environment variable template (no secrets) |
| config.py | Pydantic-settings centralized config singleton |
| voicevault/__init__.py | Package init with __version__, __author__ |
| voicevault/models.py | All Pydantic data contracts (8 models) |
| voicevault/asr/__init__.py | ASR sub-package declaration |
| voicevault/ingestion/__init__.py | Ingestion sub-package declaration |
| voicevault/retrieval/__init__.py | Retrieval sub-package declaration |
| voicevault/generation/__init__.py | Generation sub-package declaration |
| voicevault/kb/__init__.py | KB management sub-package declaration |
| voicevault/tts/__init__.py | TTS sub-package declaration |
| voicevault/storage/__init__.py | Storage sub-package declaration |
| voicevault/storage/sqlite_store.py | Full SQLite schema + all CRUD operations |
| ui/__init__.py | UI package declaration |
| ui/tabs/__init__.py | Tabs sub-package declaration |
| ui/tabs/ask_tab.py | Ask tab placeholder (Phase 5 activates it) |
| ui/tabs/kb_tab.py | KB Manager tab placeholder |
| ui/tabs/analytics_tab.py | Analytics tab placeholder |
| ui/tabs/settings_tab.py | Settings tab placeholder |
| ui/components/__init__.py | Components sub-package declaration |
| ui/components/citation_panel.py | Citation formatter + placeholder |
| ui/components/audio_controls.py | Web Speech API JS bridge + placeholder |
| app.py | Gradio Blocks entry point (4-tab scaffold) |
| tests/__init__.py | Test package declaration |
| tests/conftest.py | Shared pytest fixtures |
| tests/test_phase0.py | 58 smoke tests covering all Phase 0 deliverables |
| PLAN.md | Master E2E implementation plan |
| DOCS/phase0_foundation.md | This document |
Architecture Decisions
1. Pydantic-Settings for Config (config.py)
What: A single VoiceVaultConfig class inheriting from pydantic_settings.BaseSettings. One singleton cfg object imported everywhere.
Why: Raw os.environ calls scattered across modules create maintenance hell. With pydantic-settings:
- Every env var has a typed field with a documented default
- Missing required vars raise a clear ValidationError at startup, not a KeyError buried in a hot path
- model_config = SettingsConfigDict(env_file=".env") means local dev just needs a .env file, no export commands
- The ensure_directories() method runs once at startup to create data/, data/uploads/, models/, so directory creation never fails midway through a request
Key design choice (path helpers as properties/methods, not raw strings):
# Bad: scattered across modules
path = "data/" + kb_name + "/chroma"
# Good: single definition in config
path = cfg.kb_chroma_dir(kb_name)
If the directory layout ever changes, only config.py needs to be updated.
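The helper idea above can be sketched in a few lines. This is a minimal illustration, not the project's config.py: it uses a stdlib dataclass instead of pydantic-settings so it runs without extra packages, and PathConfig, data_dir and kb_dir are hypothetical names. The data/<kb_name>/chroma layout is assumed from the "Bad" example.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class PathConfig:
    """Stand-in for the path-related part of the config (illustrative names)."""

    data_dir: Path = Path("data")  # assumed root, taken from the "Bad" example

    def kb_dir(self, kb_name: str) -> Path:
        # Hypothetical helper: root directory for one knowledge base.
        return self.data_dir / kb_name

    def kb_chroma_dir(self, kb_name: str) -> Path:
        # The only place that knows where a KB's Chroma files live.
        return self.kb_dir(kb_name) / "chroma"


cfg = PathConfig()
chroma_path = cfg.kb_chroma_dir("handbook")
```

Moving the data root then means changing one default, and every caller of kb_chroma_dir() picks up the new location.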
Security fields locked early:
- bcrypt_rounds: int = 12 (minimum safe work factor enforced at config level)
- share_link_expiry_days: int = 7 (default expiry for HMAC share tokens)
- allowed_extensions: frozenset (immutable security whitelist at config level)
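As a rough illustration of how an immutable whitelist might be consulted at upload time, the sketch below checks a filename against a frozenset. The extensions listed and the is_allowed_upload name are assumptions made for the example; the real set is whatever config.py defines.

```python
from pathlib import Path

# Hypothetical whitelist; the real set lives in config.py and may differ.
ALLOWED_EXTENSIONS: frozenset[str] = frozenset({".pdf", ".txt", ".md"})


def is_allowed_upload(filename: str) -> bool:
    """Return True only if the file's final extension is on the whitelist."""
    # Path.suffix is the last extension only, so "report.pdf.exe" yields ".exe".
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS
```

Because a frozenset has no add() or remove(), code elsewhere in the application cannot widen the whitelist at runtime by mutating it.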
2. Pydantic Data Models (voicevault/models.py)
8 models defined:
| Model | Role | Key Fields |
|---|---|---|
| DocumentChunk | A single indexed text chunk | chunk_id (UUID), text_hash (SHA-256), page_number, section |
| IngestionReport | Result of indexing one document | status (success/error/skipped), chunk_count, duration_ms |
| RetrievalResult | A retrieved chunk with scores | rrf_score, rerank_score |
| Citation | One source reference in an answer | source_file, page_number, excerpt, relevance_score |
| QuerySession | Full query → answer audit record | All latencies, groq_tokens_used, citations list |
| KnowledgeBase | A named document collection | kb_name (slug), password_hash, is_protected property |
| Document | A source document in a KB | file_hash (SHA-256 for dedup), is_private |
| TranscriptResult | Whisper ASR output | transcript, model_used, confidence, query_type |
Why lock models in Phase 0?
Every module from Phase 1 onwards produces or consumes these types. If DocumentChunk were defined in ingestion/ and RetrievalResult in retrieval/, circular imports would be inevitable. Centralizing in models.py breaks all circular dependencies.
UUID auto-generation:
chunk_id: str = Field(default_factory=lambda: str(uuid.uuid4()))
Every entity gets a unique ID without any external ID generator. Safe for SQLite + ChromaDB + in-memory use.
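The per-instance behaviour of default_factory can be seen in a small sketch. It uses a stdlib dataclass as a stand-in for the Pydantic model so it runs without Pydantic installed; ChunkSketch and its text field are illustrative names, and the factory itself is the same lambda as above.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ChunkSketch:
    """Dataclass stand-in for DocumentChunk; only two fields are shown."""

    text: str
    # Called once per instance, so every chunk gets its own UUID4 string.
    chunk_id: str = field(default_factory=lambda: str(uuid.uuid4()))


a = ChunkSketch(text="first chunk")
b = ChunkSketch(text="second chunk")
```

A plain default such as chunk_id: str = str(uuid.uuid4()) would be evaluated once when the class is defined, so every instance would share the same ID; the factory avoids that.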
3. SQLite Metadata Store (voicevault/storage/sqlite_store.py)
Schema (4 tables):
knowledge_bases -- KB registry (name, password hash, owner, counts)
documents -- Per-KB document registry (file hash for dedup, page/chunk count)
chunks -- Chunk-level metadata (text hash, page, section, language)
query_log -- Append-only audit trail (anonymized query hash, all latencies)
Critical security decision (query log anonymization):
The query_log table stores voice_query_hash (SHA-256 of the query text), not the raw query text. This is enforced in the schema (a voice_query_hash TEXT column, no voice_query column) and verified in test_query_log_schema. Raw voice queries could contain PII, so they are never persisted.
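A minimal sketch of the hashing step, assuming the query is hashed as UTF-8 text with no normalization (the source does not say whether any is applied); hash_query is a hypothetical helper name.

```python
import hashlib


def hash_query(query_text: str) -> str:
    """Return the SHA-256 hex digest of a query (hypothetical helper name)."""
    # Assumption: the text is hashed exactly as received, encoded as UTF-8.
    return hashlib.sha256(query_text.encode("utf-8")).hexdigest()


digest = hash_query("abc")
```

The digest is what would be written to voice_query_hash. One caveat on the design: an unkeyed hash of a short or guessable query can be matched by hashing candidate strings, so a keyed hash (for example hmac with a server-side secret) would give stronger protection if that threat matters.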
WAL mode enabled on every connection:
conn.execute("PRAGMA journal_mode=WAL;")
WAL (Write-Ahead Logging) allows concurrent readers while a writer is active, which is essential for the Analytics tab reading query stats while the main thread is writing a new query log entry.
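Whether the pragma took effect can be checked by reading it back. The sketch below is an illustration with a throwaway database file; connect is a hypothetical helper name, not the function in sqlite_store.py.

```python
import sqlite3
import tempfile
from pathlib import Path


def connect(db_path: Path) -> sqlite3.Connection:
    """Open a connection and switch the database to WAL journaling."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL;")
    return conn


with tempfile.TemporaryDirectory() as tmp:
    conn = connect(Path(tmp) / "demo.db")
    # Reading the pragma back reports the journal mode now in effect.
    journal_mode = conn.execute("PRAGMA journal_mode;").fetchone()[0]
    conn.close()
```

WAL applies to file-backed databases only; an in-memory database reports the mode "memory" instead, which is worth remembering when tests use :memory: connections.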
Foreign keys with CASCADE:
kb_name TEXT REFERENCES knowledge_bases(kb_name) ON DELETE CASCADE
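Deleting a KB automatically deletes all its documents and chunks, so no orphaned rows are left behind. This relies on foreign-key enforcement being enabled on the connection (PRAGMA foreign_keys = ON); SQLite leaves it off by default.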
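The cascade can be exercised in isolation. This is a sketch against a trimmed-down two-table schema, not the project's full schema: doc_id is a hypothetical column, and the connection enables foreign-key enforcement explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite ignores foreign-key actions unless this pragma is set on the connection.
conn.execute("PRAGMA foreign_keys = ON;")
conn.executescript(
    """
    CREATE TABLE knowledge_bases (kb_name TEXT PRIMARY KEY);
    CREATE TABLE documents (
        doc_id  TEXT PRIMARY KEY,
        kb_name TEXT REFERENCES knowledge_bases(kb_name) ON DELETE CASCADE
    );
    """
)
conn.execute("INSERT INTO knowledge_bases VALUES (?)", ("handbook",))
conn.execute("INSERT INTO documents VALUES (?, ?)", ("doc-1", "handbook"))

# Deleting the parent row removes the dependent document row as well.
conn.execute("DELETE FROM knowledge_bases WHERE kb_name = ?", ("handbook",))
remaining = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
```

Without the pragma line, the DELETE would succeed and leave the documents row in place.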
Parameterized queries everywhere. Example:
# CORRECT: parameterized
conn.execute("SELECT * FROM knowledge_bases WHERE kb_name = ?", (kb_name,))
# NEVER: f-string SQL (SQL injection vulnerability)
# conn.execute(f"SELECT * FROM knowledge_bases WHERE kb_name = '{kb_name}'")
This pattern is enforced throughout the module. The test suite verifies that the schema is correct; that every query uses ? placeholders rather than string interpolation was confirmed by code review.
Idempotent initialization:
initialize_database() uses CREATE TABLE IF NOT EXISTS, so it is safe to call on every app startup. The application calls it in _startup() before accepting any requests.
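The idempotence claim can be illustrated with a cut-down version of the initializer. The two tables below are simplified stand-ins for the real schema, and passing the connection as an argument is an assumption made to keep the sketch self-contained.

```python
import sqlite3


def initialize_database(conn: sqlite3.Connection) -> None:
    """Create missing tables; existing tables and their rows are left alone."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS knowledge_bases (kb_name TEXT PRIMARY KEY);
        CREATE TABLE IF NOT EXISTS query_log (
            id               INTEGER PRIMARY KEY,
            voice_query_hash TEXT
        );
        """
    )


conn = sqlite3.connect(":memory:")
initialize_database(conn)
conn.execute("INSERT INTO knowledge_bases VALUES (?)", ("handbook",))
initialize_database(conn)  # second call: no error, no data loss

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
)]
rows = conn.execute("SELECT COUNT(*) FROM knowledge_bases").fetchone()[0]
```

Note that IF NOT EXISTS only skips creation; it does not alter an existing table whose columns have since changed, so schema changes in later phases would still need an explicit migration.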
4. Gradio App Scaffold (app.py)
4-tab Blocks layout:
gr.Blocks
└── gr.Tabs
    ├── Tab 1: Ask VoiceVault → build_ask_tab()
    ├── Tab 2: Knowledge Bases → build_kb_tab()
    ├── Tab 3: Analytics → build_analytics_tab()
    └── Tab 4: Settings → build_settings_tab()
Each tab is a separate function in its own module (ui/tabs/). This enables:
- Phase-by-phase activation: each tab becomes functional as its phase completes
- Independent testing of each tab builder
- Clear separation: the tab builder returns nothing and just renders into the active Blocks context
Startup sequence:
_startup() # ensures directories, logs config summary (no secrets)
app = build_app() # constructs Gradio Blocks
app.launch(...) # binds to host:port
Gradio version compatibility:
Testing revealed that Gradio 6.x moved theme and css from gr.Blocks(...) to launch(...). The test suite caught this immediately (test_gradio_app_builds), and the fix was isolated to app.py. This is exactly why Phase 0 tests exist: to catch API drift before it causes runtime failures.
5. Web Speech API Bridge (ui/components/audio_controls.py)
The JavaScript that drives browser TTS is declared as a module constant (WEB_SPEECH_JS) in Phase 0. It will be injected via gr.HTML in Phase 5.
Why declare it now? The JS bridge is a security-sensitive piece (it executes in the browser). By declaring it as a constant rather than building it dynamically, it is:
- Auditable as a static artifact (code review can inspect it)
- Testable (test_tts_html_contains_js verifies speechSynthesis and _vv_tts are present)
- Not constructable from user input (no injection surface)
Test Results
58 passed, 0 failed in 18.91s
TestConfig (10 tests): config loading, types, defaults, path helpers, security
TestModels (9 tests): all 8 Pydantic models instantiate and validate correctly
TestSQLiteSchema (6 tests): tables created, idempotent, schema columns verified
TestSQLiteCRUD (11 tests): full CRUD round-trips for all tables
TestPackageImports (14 tests): every __init__.py and public function importable
TestUIComponents (8 tests): citation formatter, TTS HTML, Gradio build
Warnings noted (not failures):
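- datetime.utcnow() deprecation: raised when Pydantic v2 invokes default factories. The call is attributed to Pydantic rather than to project code, so it is expected to clear with a Pydantic upgrade; it is still worth confirming that no field in models.py passes datetime.utcnow as its default_factory. Tracked for a future upgrade.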
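If the warning does turn out to come from a default factory in the project's own models, a timezone-aware factory avoids the deprecated call. The sketch below is hypothetical: it uses a stdlib dataclass as a stand-in for a Pydantic model, and utc_now, TimestampedSketch and created_at are illustrative names.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


def utc_now() -> datetime:
    """Timezone-aware replacement for the deprecated datetime.utcnow()."""
    return datetime.now(timezone.utc)


@dataclass
class TimestampedSketch:
    # Hypothetical field name; the factory runs once per instance.
    created_at: datetime = field(default_factory=utc_now)


stamp = TimestampedSketch().created_at
```

Unlike utcnow(), the result carries tzinfo, so it cannot be mistaken for local time when it is serialized or compared.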
Security Audit: Phase 0
| Check | Status | Notes |
|---|---|---|
| No API keys in code | ✅ Pass | .env.example has placeholders only |
| No hardcoded secrets | ✅ Pass | All sensitive values via env vars |
| Parameterized SQL | ✅ Pass | All queries use ? placeholders |
| Query log anonymization | ✅ Pass | voice_query_hash only, no raw text |
| bcrypt rounds ≥ 12 | ✅ Pass | Enforced by config default + test |
| Extension whitelist defined | ✅ Pass | frozenset in config (immutable) |
| Data dir not git-tracked | ✅ Pass | .gitignore covers data/ |
| .env not committed | ✅ Pass | .gitignore covers .env |
Progress Tracker Update
| Phase | Status | Tests | Docs |
|---|---|---|---|
| Phase 0: Foundation | ✅ Done | ✅ 58/58 | ✅ Done |
| Phase 1: Ingestion | ⬜ Next | ⬜ | ⬜ |
| Phase 2: Retrieval | ⬜ | ⬜ | ⬜ |
| Phase 3: ASR | ⬜ | ⬜ | ⬜ |
| Phase 4: Generation | ⬜ | ⬜ | ⬜ |
| Phase 5: UI & Access | ⬜ | ⬜ | ⬜ |