Phase 0: Project Foundation

Status: ✅ Complete | Tests: 58/58 passed | Date: March 2026


Overview

Phase 0 establishes the complete project skeleton before any business logic is written. Every subsequent phase builds on top of this foundation: the directory structure, dependency manifest, centralized config, data contracts (Pydantic models), SQLite schema, and the 4-tab Gradio scaffold are all locked in here.

Why lock these first? Schema drift between phases is a common source of bugs in ML pipelines. Defining the data models and database schema in Phase 0 means every later module produces and consumes the same data shapes.


Files Created

| File | Purpose |
| --- | --- |
| requirements.txt | All project dependencies with rationale comments |
| .env.example | Environment variable template (no secrets) |
| config.py | Pydantic-settings centralized config singleton |
| voicevault/__init__.py | Package init with __version__, __author__ |
| voicevault/models.py | All Pydantic data contracts (8 models) |
| voicevault/asr/__init__.py | ASR sub-package declaration |
| voicevault/ingestion/__init__.py | Ingestion sub-package declaration |
| voicevault/retrieval/__init__.py | Retrieval sub-package declaration |
| voicevault/generation/__init__.py | Generation sub-package declaration |
| voicevault/kb/__init__.py | KB management sub-package declaration |
| voicevault/tts/__init__.py | TTS sub-package declaration |
| voicevault/storage/__init__.py | Storage sub-package declaration |
| voicevault/storage/sqlite_store.py | Full SQLite schema + all CRUD operations |
| ui/__init__.py | UI package declaration |
| ui/tabs/__init__.py | Tabs sub-package declaration |
| ui/tabs/ask_tab.py | Ask tab placeholder (Phase 5 activates it) |
| ui/tabs/kb_tab.py | KB Manager tab placeholder |
| ui/tabs/analytics_tab.py | Analytics tab placeholder |
| ui/tabs/settings_tab.py | Settings tab placeholder |
| ui/components/__init__.py | Components sub-package declaration |
| ui/components/citation_panel.py | Citation formatter + placeholder |
| ui/components/audio_controls.py | Web Speech API JS bridge + placeholder |
| app.py | Gradio Blocks entry point (4-tab scaffold) |
| tests/__init__.py | Test package declaration |
| tests/conftest.py | Shared pytest fixtures |
| tests/test_phase0.py | 58 smoke tests covering all Phase 0 deliverables |
| PLAN.md | Master E2E implementation plan |
| DOCS/phase0_foundation.md | This document |

Architecture Decisions

1. Pydantic-Settings for Config (config.py)

What: A single VoiceVaultConfig class inheriting from pydantic_settings.BaseSettings. One singleton cfg object imported everywhere.

Why: Raw os.environ calls scattered across modules create maintenance hell. With pydantic-settings:

  • Every env var has a typed field with a documented default
  • Missing required vars raise a clear ValidationError at startup, not a KeyError buried in a hot path
  • model_config = SettingsConfigDict(env_file=".env") means local dev just needs a .env file, with no export commands
  • The ensure_directories() method runs once at startup to create data/, data/uploads/, models/, so directory creation never fails midway through a request

Key design choice β€” path helpers as properties/methods, not raw strings:

# Bad: scattered across modules
path = "data/" + kb_name + "/chroma"

# Good: single definition in config
path = cfg.kb_chroma_dir(kb_name)

If the directory layout ever changes, only config.py needs to be updated.

Security fields locked early:

  • bcrypt_rounds: int = 12 (minimum safe work factor, enforced at config level)
  • share_link_expiry_days: int = 7 (default expiry for HMAC share tokens)
  • allowed_extensions: frozenset (immutable security whitelist at config level)

2. Pydantic Data Models (voicevault/models.py)

8 models defined:

| Model | Role | Key Fields |
| --- | --- | --- |
| DocumentChunk | A single indexed text chunk | chunk_id (UUID), text_hash (SHA-256), page_number, section |
| IngestionReport | Result of indexing one document | status (success/error/skipped), chunk_count, duration_ms |
| RetrievalResult | A retrieved chunk with scores | rrf_score, rerank_score |
| Citation | One source reference in an answer | source_file, page_number, excerpt, relevance_score |
| QuerySession | Full query → answer audit record | All latencies, groq_tokens_used, citations list |
| KnowledgeBase | A named document collection | kb_name (slug), password_hash, is_protected property |
| Document | A source document in a KB | file_hash (SHA-256 for dedup), is_private |
| TranscriptResult | Whisper ASR output | transcript, model_used, confidence, query_type |

Why lock models in Phase 0? Every module from Phase 1 onwards produces or consumes these types. If DocumentChunk were defined in ingestion/ and RetrievalResult in retrieval/, modules that need both would have to import from each other, and circular imports would be hard to avoid. Centralizing the types in a models.py that depends on none of the sub-packages removes that cycle.

UUID auto-generation:

chunk_id: str = Field(default_factory=lambda: str(uuid.uuid4()))

Every entity gets a unique ID without any external ID generator. Safe for SQLite + ChromaDB + in-memory use.
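
As a rough sketch of what two of these contracts might look like, the example below defines DocumentChunk and KnowledgeBase with the fields named in the table. The field types, the optional defaults, and the from_text helper are assumptions for illustration; the real models.py may differ.

```python
import hashlib
import uuid
from typing import Optional

from pydantic import BaseModel, Field


class DocumentChunk(BaseModel):
    # A fresh UUID string is generated for every instance that does not supply one.
    chunk_id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    text: str
    text_hash: str
    page_number: Optional[int] = None
    section: Optional[str] = None

    @classmethod
    def from_text(cls, text: str, page_number: Optional[int] = None) -> "DocumentChunk":
        """Hypothetical helper: derive the SHA-256 text hash while building the chunk."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return cls(text=text, text_hash=digest, page_number=page_number)


class KnowledgeBase(BaseModel):
    kb_name: str
    password_hash: Optional[str] = None

    @property
    def is_protected(self) -> bool:
        """A KB counts as protected when a password hash is stored for it."""
        return self.password_hash is not None


first = DocumentChunk.from_text("Refunds are processed within 14 days.", page_number=3)
second = DocumentChunk.from_text("Refunds are processed within 14 days.", page_number=3)
```

Here first and second carry the same text_hash but different chunk_id values, which is the property that lets the hash serve for deduplication while the ID stays unique per row.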


3. SQLite Metadata Store (voicevault/storage/sqlite_store.py)

Schema (4 tables):

knowledge_bases  -- KB registry (name, password hash, owner, counts)
documents        -- Per-KB document registry (file hash for dedup, page/chunk count)
chunks           -- Chunk-level metadata (text hash, page, section, language)
query_log        -- Append-only audit trail (anonymized query hash, all latencies)

Critical security decision (query log anonymization): The query_log table stores voice_query_hash (SHA-256 of the query text), not the raw query text. This is enforced in the schema (voice_query_hash TEXT column, no voice_query column) and verified in test_query_log_schema. Raw voice queries could contain PII, so they are never persisted.
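
A minimal sketch of the hashing step, assuming a two-column query_log table: the hash_query and log_query helpers and the total_latency_ms column are hypothetical names, and only voice_query_hash comes from the schema described above.

```python
import hashlib
import sqlite3


def hash_query(query_text: str) -> str:
    """Return the SHA-256 hex digest of the query text."""
    return hashlib.sha256(query_text.encode("utf-8")).hexdigest()


def log_query(conn: sqlite3.Connection, query_text: str, total_latency_ms: float) -> None:
    """Persist only the digest; the raw text never reaches the database."""
    conn.execute(
        "INSERT INTO query_log (voice_query_hash, total_latency_ms) VALUES (?, ?)",
        (hash_query(query_text), total_latency_ms),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE query_log (voice_query_hash TEXT NOT NULL, total_latency_ms REAL)"
)
log_query(conn, "what is our refund policy?", 412.0)

columns = [row[1] for row in conn.execute("PRAGMA table_info(query_log)")]
stored_hash = conn.execute("SELECT voice_query_hash FROM query_log").fetchone()[0]
```

Since the table has no column that could hold the raw text, a later code change cannot start persisting queries without an explicit schema change.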

WAL mode enabled on every connection:

conn.execute("PRAGMA journal_mode=WAL;")

WAL (Write-Ahead Logging) allows concurrent readers while a writer is active, which is essential for the Analytics tab reading query stats while the main thread is writing a new query log entry.
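
The reader/writer behaviour can be illustrated with two connections to the same file. This is a sketch under assumptions: the connect helper, the temporary database path, and the single-column query_log table are made up for the example.

```python
import os
import sqlite3
import tempfile


def connect(db_path: str) -> sqlite3.Connection:
    """Open a connection and switch the database to WAL journaling."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL;").fetchone()
    return conn


# WAL needs a file-backed database; an in-memory database cannot use it.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = connect(db_path)
reader = connect(db_path)

writer.execute("CREATE TABLE query_log (voice_query_hash TEXT NOT NULL)")
writer.commit()

# The INSERT opens a write transaction that stays open until commit().
writer.execute("INSERT INTO query_log (voice_query_hash) VALUES (?)", ("abc123",))

# The reader is not blocked and sees only committed rows.
rows_during_write = reader.execute("SELECT COUNT(*) FROM query_log").fetchone()[0]

writer.commit()
rows_after_commit = reader.execute("SELECT COUNT(*) FROM query_log").fetchone()[0]
journal_mode = reader.execute("PRAGMA journal_mode;").fetchone()[0]
```

The reader sees the new row only after the writer commits, so an analytics query running mid-write gets a consistent snapshot.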

Foreign keys with CASCADE:

kb_name TEXT REFERENCES knowledge_bases(kb_name) ON DELETE CASCADE

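Deleting a KB automatically deletes all its documents and chunks, so no orphaned rows are left behind. SQLite only enforces foreign keys, including ON DELETE CASCADE, on connections where PRAGMA foreign_keys=ON has been set, because enforcement is off by default.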
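
The following sketch shows the cascade on a reduced two-table schema and contrasts it with a connection where foreign-key enforcement is switched off. The doc_id column and the helper function are assumptions; the REFERENCES clause is the one quoted above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE knowledge_bases (kb_name TEXT PRIMARY KEY);
CREATE TABLE documents (
    doc_id  TEXT PRIMARY KEY,
    kb_name TEXT REFERENCES knowledge_bases(kb_name) ON DELETE CASCADE
);
"""


def documents_left_after_kb_delete(enforce_foreign_keys: bool) -> int:
    """Create a KB with one document, delete the KB, and count surviving documents."""
    conn = sqlite3.connect(":memory:")
    # Foreign-key enforcement is a per-connection setting in SQLite.
    conn.execute(
        "PRAGMA foreign_keys=ON;" if enforce_foreign_keys else "PRAGMA foreign_keys=OFF;"
    )
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO knowledge_bases (kb_name) VALUES (?)", ("finance",))
    conn.execute(
        "INSERT INTO documents (doc_id, kb_name) VALUES (?, ?)", ("doc-1", "finance")
    )
    conn.execute("DELETE FROM knowledge_bases WHERE kb_name = ?", ("finance",))
    conn.commit()
    remaining = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
    conn.close()
    return remaining


with_enforcement = documents_left_after_kb_delete(True)
without_enforcement = documents_left_after_kb_delete(False)
```

With enforcement on, the document row disappears together with its KB. With it off, the row survives as an orphan, which is why the pragma belongs in the same connection helper that sets WAL mode.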

Parameterized queries everywhere, for example:

# CORRECT: parameterized
conn.execute("SELECT * FROM knowledge_bases WHERE kb_name = ?", (kb_name,))

# NEVER: f-string SQL (SQL injection vulnerability)
# conn.execute(f"SELECT * FROM knowledge_bases WHERE kb_name = '{kb_name}'")

This pattern is used throughout the module. The test suite verifies that the schema is correct; the absence of string-built SQL is confirmed by code review, which checks that every query uses ? placeholders.
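
To make the difference concrete, the sketch below feeds the same hostile input to a parameterized lookup and to a string-built one. The get_kb and get_kb_unsafe helpers and the sample KB names are hypothetical, and the unsafe variant exists only for contrast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE knowledge_bases (kb_name TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO knowledge_bases (kb_name) VALUES (?)", [("finance",), ("legal",)]
)


def get_kb(conn: sqlite3.Connection, kb_name: str) -> list:
    """Safe lookup: the value is bound as data and never parsed as SQL."""
    return conn.execute(
        "SELECT kb_name FROM knowledge_bases WHERE kb_name = ?", (kb_name,)
    ).fetchall()


def get_kb_unsafe(conn: sqlite3.Connection, kb_name: str) -> list:
    """Demonstration only: string-built SQL lets the input rewrite the query."""
    return conn.execute(
        f"SELECT kb_name FROM knowledge_bases WHERE kb_name = '{kb_name}'"
    ).fetchall()


malicious = "nope' OR '1'='1"
safe_rows = get_kb(conn, malicious)           # no KB has this literal name
leaked_rows = get_kb_unsafe(conn, malicious)  # the OR clause matches every row
```

The parameterized call returns nothing because it searches for a KB literally named with the payload, while the string-built call returns every KB in the table.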

Idempotent initialization: initialize_database() uses CREATE TABLE IF NOT EXISTS, so it is safe to call on every app startup. The application calls it in _startup() before accepting any requests.
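
A reduced sketch of that idempotent setup, assuming a connection is passed in: the four table names come from the schema above, while the columns shown are a small illustrative subset and the function signature is an assumption.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS knowledge_bases (
    kb_name       TEXT PRIMARY KEY,
    password_hash TEXT
);
CREATE TABLE IF NOT EXISTS documents (
    doc_id    TEXT PRIMARY KEY,
    kb_name   TEXT REFERENCES knowledge_bases(kb_name) ON DELETE CASCADE,
    file_hash TEXT
);
CREATE TABLE IF NOT EXISTS chunks (
    chunk_id  TEXT PRIMARY KEY,
    kb_name   TEXT REFERENCES knowledge_bases(kb_name) ON DELETE CASCADE,
    text_hash TEXT
);
CREATE TABLE IF NOT EXISTS query_log (
    voice_query_hash TEXT NOT NULL
);
"""


def initialize_database(conn: sqlite3.Connection) -> None:
    """Create any missing tables; existing tables and their rows are left untouched."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
initialize_database(conn)
conn.execute("INSERT INTO knowledge_bases (kb_name) VALUES (?)", ("finance",))
conn.commit()

initialize_database(conn)  # second call, as on a later app startup

tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
kb_count = conn.execute("SELECT COUNT(*) FROM knowledge_bases").fetchone()[0]
```

The second call neither raises nor discards the row inserted between the two calls.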


4. Gradio App Scaffold (app.py)

4-tab Blocks layout:

gr.Blocks
  └── gr.Tabs
        ├── Tab 1: 🎙️ Ask VoiceVault   ← build_ask_tab()
        ├── Tab 2: 📂 Knowledge Bases   ← build_kb_tab()
        ├── Tab 3: 📊 Analytics          ← build_analytics_tab()
        └── Tab 4: ⚙️ Settings           ← build_settings_tab()

Each tab is a separate function in its own module (ui/tabs/). This enables:

  • Phase-by-phase activation: each tab becomes functional as its phase completes
  • Independent testing of each tab builder
  • Clear separation: the tab builder returns nothing and just renders into the active Blocks context

Startup sequence:

_startup()         # ensures directories, initializes the database, logs config summary (no secrets)
app = build_app()  # constructs Gradio Blocks
app.launch(...)    # binds to host:port

Gradio version compatibility: Testing revealed that Gradio 6.x moved theme and css from gr.Blocks(...) to launch(...). The test suite caught this immediately (test_gradio_app_builds), and the fix was isolated to app.py. This is an example of why Phase 0 tests exist: they catch API drift before it causes runtime failures.


5. Web Speech API Bridge (ui/components/audio_controls.py)

The JavaScript that drives browser TTS is declared as a module constant (WEB_SPEECH_JS) in Phase 0. It will be injected via gr.HTML in Phase 5.

Why declare it now? The JS bridge is a security-sensitive piece (it executes in the browser). By declaring it as a constant rather than building it dynamically, it is:

  • Auditable as a static artifact (code review can inspect it)
  • Testable (test_tts_html_contains_js verifies speechSynthesis and _vv_tts are present)
  • Not constructable from user input (no injection surface)
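
A static bridge of this kind might look like the sketch below. The JavaScript body and the build_tts_html helper are assumptions written for illustration; only the WEB_SPEECH_JS constant name and the speechSynthesis and _vv_tts markers come from the text.

```python
# Hypothetical content for the module constant: a fixed script with no
# placeholders, so nothing from user input can end up inside it.
WEB_SPEECH_JS = """
window._vv_tts = function (text) {
  if (!("speechSynthesis" in window)) { return; }
  window.speechSynthesis.cancel();
  window.speechSynthesis.speak(new SpeechSynthesisUtterance(text));
};
"""


def build_tts_html() -> str:
    """Wrap the static script in a script tag; the output is the same on every call."""
    return "<script>" + WEB_SPEECH_JS + "</script>"


tts_html = build_tts_html()
```

Because the markup is built from a constant alone, a test in the spirit of test_tts_html_contains_js can assert on its contents directly.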

Test Results

58 passed, 0 failed in 18.91s

TestConfig          (10 tests): config loading, types, defaults, path helpers, security
TestModels           (9 tests): all 8 Pydantic models instantiate and validate correctly
TestSQLiteSchema     (6 tests): tables created, idempotent, schema columns verified
TestSQLiteCRUD      (11 tests): full CRUD round-trips for all tables
TestPackageImports  (14 tests): every __init__.py and public function importable
TestUIComponents     (8 tests): citation formatter, TTS HTML, Gradio build

Warnings noted (not failures):

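  • datetime.utcnow() deprecation: the warning is reported from inside Pydantic v2 when default factories run. datetime.utcnow() is deprecated as of Python 3.12 in favor of datetime.now(timezone.utc), so any default factory that references it should be checked before attributing the warning solely to Pydantic internals. Tracked for a future upgrade.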

Security Audit β€” Phase 0

| Check | Status | Notes |
| --- | --- | --- |
| No API keys in code | ✅ Pass | .env.example has placeholders only |
| No hardcoded secrets | ✅ Pass | All sensitive values via env vars |
| Parameterized SQL | ✅ Pass | All queries use ? placeholders |
| Query log anonymization | ✅ Pass | voice_query_hash only, no raw text |
| bcrypt rounds ≥ 12 | ✅ Pass | Enforced by config default + test |
| Extension whitelist defined | ✅ Pass | frozenset in config, immutable |
| Data dir not git-tracked | ✅ Pass | .gitignore covers data/ |
| .env not committed | ✅ Pass | .gitignore covers .env |

Progress Tracker Update

| Phase | Status | Tests | Docs |
| --- | --- | --- | --- |
| Phase 0: Foundation | ✅ Done | ✅ 58/58 | ✅ Done |
| Phase 1: Ingestion | ⬜ Next | ⬜ | ⬜ |
| Phase 2: Retrieval | ⬜ | ⬜ | ⬜ |
| Phase 3: ASR | ⬜ | ⬜ | ⬜ |
| Phase 4: Generation | ⬜ | ⬜ | ⬜ |
| Phase 5: UI & Access | ⬜ | ⬜ | ⬜ |