Phase 0 — Project Foundation

Status: ✅ Complete | Tests: 58/58 passed | Date: March 2026

Overview

Phase 0 establishes the complete project skeleton before any business logic is written. Every subsequent phase builds on top of this foundation — the directory structure, dependency manifest, centralized config, data contracts (Pydantic models), SQLite schema, and the 4-tab Gradio scaffold are all locked in here.

Why lock these first? Schema drift between phases is one of the most common causes of bugs in ML pipelines. By defining the data models and database schema in Phase 0, every later module is guaranteed to produce and consume the same data shapes.

Files Created

File	Purpose
`requirements.txt`	All project dependencies with rationale comments
`.env.example`	Environment variable template (no secrets)
`config.py`	Pydantic-settings centralized config singleton
`voicevault/__init__.py`	Package init with `__version__`, `__author__`
`voicevault/models.py`	All Pydantic data contracts (8 models)
`voicevault/asr/__init__.py`	ASR sub-package declaration
`voicevault/ingestion/__init__.py`	Ingestion sub-package declaration
`voicevault/retrieval/__init__.py`	Retrieval sub-package declaration
`voicevault/generation/__init__.py`	Generation sub-package declaration
`voicevault/kb/__init__.py`	KB management sub-package declaration
`voicevault/tts/__init__.py`	TTS sub-package declaration
`voicevault/storage/__init__.py`	Storage sub-package declaration
`voicevault/storage/sqlite_store.py`	Full SQLite schema + all CRUD operations
`ui/__init__.py`	UI package declaration
`ui/tabs/__init__.py`	Tabs sub-package declaration
`ui/tabs/ask_tab.py`	Ask tab placeholder (Phase 5 activates it)
`ui/tabs/kb_tab.py`	KB Manager tab placeholder
`ui/tabs/analytics_tab.py`	Analytics tab placeholder
`ui/tabs/settings_tab.py`	Settings tab placeholder
`ui/components/__init__.py`	Components sub-package declaration
`ui/components/citation_panel.py`	Citation formatter + placeholder
`ui/components/audio_controls.py`	Web Speech API JS bridge + placeholder
`app.py`	Gradio Blocks entry point (4-tab scaffold)
`tests/__init__.py`	Test package declaration
`tests/conftest.py`	Shared pytest fixtures
`tests/test_phase0.py`	58 smoke tests covering all Phase 0 deliverables
`PLAN.md`	Master E2E implementation plan
`DOCS/phase0_foundation.md`	This document

Architecture Decisions

1. Pydantic-Settings for Config (`config.py`)

What: A single VoiceVaultConfig class inheriting from pydantic_settings.BaseSettings. One singleton cfg object imported everywhere.

Why: Raw os.environ calls scattered across modules create maintenance hell. With pydantic-settings:

Every env var has a typed field with a documented default
Missing required vars raise a clear ValidationError at startup, not a KeyError buried in a hot path
model_config = SettingsConfigDict(env_file=".env") means local dev just needs a .env file — no export commands
The ensure_directories() method runs once at startup to create data/, data/uploads/, and models/, so a missing directory surfaces at launch rather than midway through a request

Key design choice — path helpers as properties/methods, not raw strings:

# Bad: scattered across modules
path = "data/" + kb_name + "/chroma"

# Good: single definition in config
path = cfg.kb_chroma_dir(kb_name)

If the directory layout ever changes, only config.py needs to be updated.
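A minimal sketch of what such helpers might look like. It swaps in a stdlib frozen dataclass for the pydantic-settings class so that it runs without dependencies. Only kb_chroma_dir() and ensure_directories() are named in the text above; the class name PathConfig, the fields data_dir and models_dir, and the exact directory layout are assumptions.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class PathConfig:
    """Stand-in for the path-related part of VoiceVaultConfig (hypothetical fields)."""

    data_dir: Path = Path("data")
    models_dir: Path = Path("models")

    @property
    def uploads_dir(self) -> Path:
        return self.data_dir / "uploads"

    def kb_chroma_dir(self, kb_name: str) -> Path:
        # Assumed layout: data/<kb_name>/chroma, mirroring the "Bad" example above.
        return self.data_dir / kb_name / "chroma"

    def ensure_directories(self) -> None:
        # exist_ok=True makes repeated startup calls safe.
        for directory in (self.data_dir, self.uploads_dir, self.models_dir):
            directory.mkdir(parents=True, exist_ok=True)
```

Callers only ever ask the config object for a path, so a layout change touches one method rather than every module that builds a path by string concatenation.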

Security fields locked early:

bcrypt_rounds: int = 12 — minimum safe work factor enforced at config level
share_link_expiry_days: int = 7 — default expiry for HMAC share tokens
allowed_extensions: frozenset — immutable security whitelist at config level
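To illustrate how an immutable whitelist might be consumed at upload time, here is a small sketch. The extensions listed and the helper name is_allowed_upload are hypothetical; this document does not list the project's actual whitelist.

```python
from pathlib import Path

# Hypothetical whitelist; the real set lives in config.py and is not shown here.
ALLOWED_EXTENSIONS = frozenset({".pdf", ".txt", ".md"})


def is_allowed_upload(filename: str) -> bool:
    """Return True only if the file's final suffix is whitelisted (case-insensitive)."""
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS
```

Because a frozenset has no add() or remove() methods, no code path can widen the whitelist at runtime; changing it requires editing the config.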

2. Pydantic Data Models (`voicevault/models.py`)

8 models defined:

Model	Role	Key Fields
`DocumentChunk`	A single indexed text chunk	`chunk_id` (UUID), `text_hash` (SHA-256), `page_number`, `section`
`IngestionReport`	Result of indexing one document	`status` (success/error/skipped), `chunk_count`, `duration_ms`
`RetrievalResult`	A retrieved chunk with scores	`rrf_score`, `rerank_score`
`Citation`	One source reference in an answer	`source_file`, `page_number`, `excerpt`, `relevance_score`
`QuerySession`	Full query → answer audit record	All latencies, `groq_tokens_used`, `citations` list
`KnowledgeBase`	A named document collection	`kb_name` (slug), `password_hash`, `is_protected` property
`Document`	A source document in a KB	`file_hash` (SHA-256 for dedup), `is_private`
`TranscriptResult`	Whisper ASR output	`transcript`, `model_used`, `confidence`, `query_type`

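Why lock models in Phase 0? Every module from Phase 1 onwards produces or consumes these types. If DocumentChunk were defined in ingestion/ and RetrievalResult in retrieval/, the packages would have to import from each other and circular imports would follow quickly. Centralizing the models in models.py gives every package a single shared module to import from, which removes that source of circular dependencies.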

UUID auto-generation:

chunk_id: str = Field(default_factory=lambda: str(uuid.uuid4()))

Every entity gets a unique ID without any external ID generator. Safe for SQLite + ChromaDB + in-memory use.
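The same default-factory pattern can be sketched with a stdlib dataclass standing in for the Pydantic model, so the example runs without dependencies. The class name ChunkSketch and the helper make_chunk are hypothetical; treating text_hash as a SHA-256 hex digest follows the model table above.

```python
import hashlib
import uuid
from dataclasses import dataclass, field


@dataclass
class ChunkSketch:
    """Dataclass stand-in for DocumentChunk; only three fields are shown."""

    text: str
    text_hash: str
    # Each instance gets its own UUID4 string, as with Field(default_factory=...).
    chunk_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def make_chunk(text: str) -> ChunkSketch:
    """Hypothetical helper: derive the SHA-256 content hash when building a chunk."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return ChunkSketch(text=text, text_hash=digest)
```

Two chunks built from identical text share a text_hash but receive different chunk_id values, so the hash can support deduplication while the ID stays unique per record.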

3. SQLite Metadata Store (`voicevault/storage/sqlite_store.py`)

Schema — 4 tables:

knowledge_bases  -- KB registry (name, password hash, owner, counts)
documents        -- Per-KB document registry (file hash for dedup, page/chunk count)
chunks           -- Chunk-level metadata (text hash, page, section, language)
query_log        -- Append-only audit trail (anonymized query hash, all latencies)

Critical security decision — query log anonymization: The query_log table stores voice_query_hash (SHA-256 of the query text), not the raw query text. This is enforced in the schema (voice_query_hash TEXT column, no voice_query column) and verified in test_query_log_schema. Raw voice queries could contain PII — they are never persisted.
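A sketch of the hash-only logging idea, assuming a reduced query_log table. The helper hash_query and the total_latency_ms column are hypothetical; only the voice_query_hash column and the use of SHA-256 come from the description above.

```python
import hashlib
import sqlite3


def hash_query(query_text: str) -> str:
    """SHA-256 hex digest of the query; the raw text is never stored."""
    return hashlib.sha256(query_text.encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
# Reduced, hypothetical version of query_log: the real table has more columns.
conn.execute(
    "CREATE TABLE IF NOT EXISTS query_log ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "voice_query_hash TEXT NOT NULL, "
    "total_latency_ms REAL)"
)
conn.execute(
    "INSERT INTO query_log (voice_query_hash, total_latency_ms) VALUES (?, ?)",
    (hash_query("what is my account balance?"), 412.0),
)
conn.commit()

# Column names as SQLite reports them; a schema test can assert on this list.
columns = [row[1] for row in conn.execute("PRAGMA table_info(query_log)")]
```

A schema test in the spirit of test_query_log_schema could then check that voice_query_hash is in columns and that no voice_query column exists.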

WAL mode enabled on every connection:

conn.execute("PRAGMA journal_mode=WAL;")

WAL (Write-Ahead Logging) allows concurrent readers while a writer is active — essential for the Analytics tab reading query stats while the main thread is writing a new query log entry.
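The reader/writer behaviour can be sketched with two connections to a temporary database file. The connect helper and the one-column table are hypothetical; the PRAGMA statement is the one shown above.

```python
import sqlite3
import tempfile
from pathlib import Path

# WAL needs a file-backed database; an in-memory database reports "memory" instead.
db_path = Path(tempfile.mkdtemp()) / "voicevault_demo.db"


def connect(path: Path) -> sqlite3.Connection:
    """Hypothetical connection helper that switches the database to WAL mode."""
    conn = sqlite3.connect(str(path))
    conn.execute("PRAGMA journal_mode=WAL;")
    return conn


def count_rows(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM query_log").fetchall()[0][0]


writer = connect(db_path)
reader = connect(db_path)

writer.execute("CREATE TABLE IF NOT EXISTS query_log (voice_query_hash TEXT)")
writer.commit()

# The INSERT opens a write transaction, which is left uncommitted for now.
writer.execute("INSERT INTO query_log VALUES (?)", ("abc123",))

# The reader is not blocked; it sees the last committed state of the table.
rows_during_write = count_rows(reader)

writer.commit()
rows_after_commit = count_rows(reader)
journal_mode = reader.execute("PRAGMA journal_mode;").fetchall()[0][0]
```

In this sketch the reader plays the role of the Analytics tab and the writer the role of the query logger.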

Foreign keys with CASCADE:

kb_name TEXT REFERENCES knowledge_bases(kb_name) ON DELETE CASCADE

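With foreign key enforcement enabled, deleting a KB automatically deletes all its documents and chunks, leaving no orphaned rows. SQLite leaves enforcement off by default, so each connection must run PRAGMA foreign_keys = ON for the cascade to take effect.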
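A small in-memory demonstration of the cascade, using reduced, hypothetical versions of the two tables (the doc_id column is an assumption). Note the PRAGMA foreign_keys line, which SQLite needs on every connection before it will act on REFERENCES clauses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Without this pragma SQLite parses the REFERENCES clause but does not enforce it.
conn.execute("PRAGMA foreign_keys = ON;")

# Reduced, hypothetical versions of the real tables.
conn.executescript(
    """
    CREATE TABLE knowledge_bases (kb_name TEXT PRIMARY KEY);
    CREATE TABLE documents (
        doc_id  TEXT PRIMARY KEY,
        kb_name TEXT REFERENCES knowledge_bases(kb_name) ON DELETE CASCADE
    );
    """
)
conn.execute("INSERT INTO knowledge_bases VALUES (?)", ("handbook",))
conn.execute("INSERT INTO documents VALUES (?, ?)", ("doc-1", "handbook"))
conn.commit()

# Deleting the parent row removes the dependent document row as well.
conn.execute("DELETE FROM knowledge_bases WHERE kb_name = ?", ("handbook",))
conn.commit()

remaining_documents = conn.execute("SELECT COUNT(*) FROM documents").fetchall()[0][0]
```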

Parameterized queries everywhere — example:

# CORRECT: parameterized
conn.execute("SELECT * FROM knowledge_bases WHERE kb_name = ?", (kb_name,))

# NEVER: f-string SQL (SQL injection vulnerability)
# conn.execute(f"SELECT * FROM knowledge_bases WHERE kb_name = '{kb_name}'")

This pattern is used throughout the module. The test suite verifies that the schema is correct; the absence of string-built SQL is confirmed by code review (every query uses ? placeholders).
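To make the difference concrete, the following sketch runs the same injection payload through both styles against a one-column, hypothetical version of knowledge_bases. The string-built query appears only to show the failure mode.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE knowledge_bases (kb_name TEXT PRIMARY KEY)")
conn.execute("INSERT INTO knowledge_bases VALUES (?)", ("handbook",))
conn.commit()

malicious = "x' OR '1'='1"

# Parameterized: the payload is bound as a plain string value and matches nothing.
safe_rows = conn.execute(
    "SELECT kb_name FROM knowledge_bases WHERE kb_name = ?", (malicious,)
).fetchall()

# String-built SQL: the payload rewrites the WHERE clause, so every row comes back.
unsafe_rows = conn.execute(
    f"SELECT kb_name FROM knowledge_bases WHERE kb_name = '{malicious}'"
).fetchall()
```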

Idempotent initialization: initialize_database() uses CREATE TABLE IF NOT EXISTS — safe to call on every app startup. The application calls it in _startup() before accepting any requests.
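A sketch of why repeated initialization is harmless, assuming a reduced two-table schema. The real initialize_database() signature is not shown in this document, so the conn parameter and the column lists here are assumptions.

```python
import sqlite3

# Reduced, hypothetical schema: the real module defines four tables with more columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS knowledge_bases (
    kb_name       TEXT PRIMARY KEY,
    password_hash TEXT
);
CREATE TABLE IF NOT EXISTS query_log (
    id               INTEGER PRIMARY KEY AUTOINCREMENT,
    voice_query_hash TEXT NOT NULL
);
"""


def initialize_database(conn: sqlite3.Connection) -> None:
    """Create any missing tables; existing tables and their rows are left untouched."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
initialize_database(conn)
conn.execute("INSERT INTO knowledge_bases (kb_name) VALUES (?)", ("handbook",))
conn.commit()
initialize_database(conn)  # second call, as on an app restart

table_names = {
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    )
}
kb_count = conn.execute("SELECT COUNT(*) FROM knowledge_bases").fetchall()[0][0]
```

The second call neither raises nor discards the row inserted between the two calls.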

4. Gradio App Scaffold (`app.py`)

4-tab Blocks layout:

gr.Blocks
  └── gr.Tabs
        ├── Tab 1: 🎙️ Ask VoiceVault   ← build_ask_tab()
        ├── Tab 2: 📂 Knowledge Bases   ← build_kb_tab()
        ├── Tab 3: 📊 Analytics          ← build_analytics_tab()
        └── Tab 4: ⚙️ Settings           ← build_settings_tab()

Each tab is a separate function in its own module (ui/tabs/). This enables:

Phase-by-phase activation: each tab becomes functional as its phase completes
Independent testing of each tab builder
Clear separation — the tab builder returns nothing, just renders into the active Blocks context

Startup sequence:

_startup()      # ensures directories, initializes the database, logs config summary (no secrets)
app = build_app()  # constructs Gradio Blocks
app.launch(...)    # binds to host:port

Gradio version compatibility: Testing revealed that Gradio 6.x moved theme and css from gr.Blocks(...) to launch(...). The test suite caught this immediately (test_gradio_app_builds), and the fix was isolated to app.py. This is why Phase 0 tests exist: they catch API drift before it causes runtime failures.

5. Web Speech API Bridge (`ui/components/audio_controls.py`)

The JavaScript that drives browser TTS is declared as a module constant (WEB_SPEECH_JS) in Phase 0. It will be injected via gr.HTML in Phase 5.

Why declare it now? The JS bridge is a security-sensitive piece (it executes in the browser). Declaring it as a constant rather than building it dynamically makes it:

Auditable as a static artifact (code review can inspect it)
Testable (test_tts_html_contains_js verifies speechSynthesis and _vv_tts are present)
Not constructable from user input (no injection surface)
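A sketch of the static-constant approach. The JavaScript body and the helper get_tts_html are hypothetical, since the project's actual WEB_SPEECH_JS is not reproduced in this document; only the names speechSynthesis and _vv_tts come from the test description above.

```python
# Hypothetical contents; the real constant lives in ui/components/audio_controls.py.
WEB_SPEECH_JS = """
<script>
window._vv_tts = function (text) {
    if (!("speechSynthesis" in window)) { return; }
    window.speechSynthesis.cancel();
    window.speechSynthesis.speak(new SpeechSynthesisUtterance(String(text)));
};
</script>
"""


def get_tts_html() -> str:
    """Return the static bridge unchanged; no user input is ever interpolated."""
    return WEB_SPEECH_JS
```

Since the helper takes no arguments and performs no string formatting, the text to be spoken has to reach the browser as data passed to _vv_tts at call time, never as part of the script source.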

Test Results

58 passed, 0 failed — 18.91s

TestConfig          (10 tests) — config loading, types, defaults, path helpers, security
TestModels           (9 tests) — all 8 Pydantic models instantiate and validate correctly
TestSQLiteSchema     (6 tests) — tables created, idempotent, schema columns verified
TestSQLiteCRUD      (11 tests) — full CRUD round-trips for all tables
TestPackageImports  (14 tests) — every __init__.py and public function importable
TestUIComponents     (8 tests) — citation formatter, TTS HTML, Gradio build

Warnings noted (not failures):

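datetime.utcnow() deprecation: the warning is emitted when Pydantic v2 invokes a default factory that calls datetime.utcnow(). No test fails because of it. If that factory is one declared in models.py, the call is in project code and can be replaced with the timezone-aware datetime.now(timezone.utc); otherwise it will resolve with a dependency upgrade. Tracked for future upgrade.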
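If the deprecated call turns out to sit in one of the project's own default factories, a timezone-aware factory is the usual replacement. The sketch below uses a stdlib dataclass as a stand-in for a Pydantic model; the names utc_now, TimestampedSketch, and created_at are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


def utc_now() -> datetime:
    """Timezone-aware replacement for the deprecated datetime.utcnow()."""
    return datetime.now(timezone.utc)


@dataclass
class TimestampedSketch:
    """Dataclass stand-in for any model with a creation-time default."""

    created_at: datetime = field(default_factory=utc_now)
```

The resulting value carries an explicit UTC offset, whereas datetime.utcnow() returns a naive datetime.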

Security Audit — Phase 0

Check	Status	Notes
No API keys in code	✅ Pass	`.env.example` has placeholders only
No hardcoded secrets	✅ Pass	All sensitive values via env vars
Parameterized SQL	✅ Pass	All queries use `?` placeholders
Query log anonymization	✅ Pass	`voice_query_hash` only, no raw text
bcrypt rounds ≥ 12	✅ Pass	Enforced by config default + test
Extension whitelist defined	✅ Pass	`frozenset` in config — immutable
Data dir not git-tracked	✅ Pass	`.gitignore` covers `data/`
`.env` not committed	✅ Pass	`.gitignore` covers `.env`

Progress Tracker Update

Phase	Status	Tests	Docs
Phase 0 — Foundation	✅ Done	✅ 58/58	✅ Done
Phase 1 — Ingestion	⬜ Next	⬜	⬜
Phase 2 — Retrieval	⬜	⬜	⬜
Phase 3 — ASR	⬜	⬜	⬜
Phase 4 — Generation	⬜	⬜	⬜
Phase 5 — UI & Access	⬜	⬜	⬜