
Phase 6: FastAPI Server, SPA Frontend & Hugging Face Deployment

Status: ✅ Complete | Tests: 17/17 passed (integration suite) | Files: 7 new files


What Was Built

Phase 6 replaces the Gradio UI with a production-quality FastAPI REST API and a hand-crafted single-page application (SPA), then deploys the complete system to Hugging Face Spaces via Docker.

| Module | Responsibility |
|---|---|
| server.py | FastAPI entry point with lifespan startup, singleton injection |
| api/routes.py | All REST endpoints (/api/kbs, /api/ask, /api/transcribe, /api/analytics) |
| static/index.html | SPA shell: sidebar nav, Ask/KB/Analytics views, modals, toast container |
| static/style.css | Dark glassmorphism design system (~600 lines) |
| static/app.js | Full SPA logic: recording, WAV conversion, API calls, chat, TTS (~500 lines) |
| voicevault/asr/groq_transcriber.py | Groq Whisper cloud transcription (~300ms) |
| Dockerfile | CPU-optimized Docker image for HF Spaces |
| tests/test_api_routes.py | Integration tests for all REST endpoints |

Why Replace Gradio

The Gradio UI worked well for prototyping but had two production problems:

  1. Blocking event loop: Whisper model loading (30–60s on first call) runs synchronously inside Gradio's event handler, freezing the entire UI.
  2. No control over UX: Gradio's preset component library limits design to its own aesthetic.

FastAPI + a custom SPA solves both:

  • All slow operations (model loading, retrieval, LLM calls) run inside FastAPI's async handlers on a proper ASGI server (uvicorn).
  • The frontend is plain HTML/CSS/JS: no framework overhead, full design control.
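
One caveat is worth making explicit: a blocking call placed directly inside an `async def` handler still stalls the event loop, so slow work has to be moved onto a worker thread. The sketch below illustrates that idea with the standard library only. `slow_model_call` is a hypothetical stand-in for model loading or an LLM request, and none of this is taken from api/routes.py.

```python
import asyncio
import time


def slow_model_call() -> str:
    """Hypothetical stand-in for a blocking step (model load, LLM call)."""
    time.sleep(0.2)
    return "answer"


async def heartbeat(ticks: list) -> None:
    """Record a tick every 10 ms for as long as the event loop is free."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def handle_request() -> tuple:
    ticks = []
    beat = asyncio.create_task(heartbeat(ticks))
    # to_thread runs the blocking function on a worker thread,
    # so the loop keeps scheduling other coroutines in the meantime.
    result = await asyncio.to_thread(slow_model_call)
    beat.cancel()
    return result, len(ticks)


result, tick_count = asyncio.run(handle_request())
```

If the heartbeat keeps ticking while the slow call runs, the loop was never blocked. FastAPI applies the same idea automatically to handlers declared with plain `def`, which it runs in a thread pool.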

FastAPI Server (server.py)

Key Decisions

GPU forced off at the very top:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"   # Must be before any ML imports
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"

The RTX 5070 (sm_120 architecture) is incompatible with packaged PyTorch (max sm_90). Setting CUDA_VISIBLE_DEVICES=-1 before any ML import prevents the crash. These lines must come first in server.py: any later, and sentence_transformers or torch may already have probed CUDA.
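
A small guard can make the ordering requirement visible instead of implicit. This is a minimal sketch, not the project's code: `force_cpu_only` and its `loaded_modules` parameter are hypothetical names introduced here for illustration.

```python
import os
import sys


def force_cpu_only(loaded_modules=None) -> bool:
    """Hide CUDA devices; return False if torch was already imported.

    loaded_modules defaults to sys.modules and exists only so the
    check can be exercised without importing torch.
    """
    loaded = sys.modules if loaded_modules is None else loaded_modules
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
    return "torch" not in loaded


if not force_cpu_only():
    print("warning: torch was imported before CUDA_VISIBLE_DEVICES was set")
# ML imports (torch, sentence_transformers, ...) would follow here.
```

Returning a flag instead of raising keeps the server bootable while still surfacing an import-order regression in the logs.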

Modern lifespan pattern (no deprecated on_event):

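from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def _lifespan(app: FastAPI):
    # Build the heavy singletons once, then hand them to the route module
    kb_manager, transcriber, answer_chain = _startup()
    init_routes(kb_manager, transcriber, answer_chain, _CENTRAL_DB_PATH)
    yield  # the app serves requests while suspended here

app = FastAPI(lifespan=_lifespan)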

FastAPI deprecated @app.on_event("startup") in favour of this context manager pattern.
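
The `_lifespan` function above has no code after `yield`, so it performs startup work only. For readers new to the pattern, the following standard-library sketch shows the ordering a lifespan context manager gives: code before `yield` runs at startup, code after it runs at shutdown. `simulate_server` is a hypothetical stand-in for what the ASGI server does, and `events` exists only to record the order.

```python
import asyncio
from contextlib import asynccontextmanager

events = []


@asynccontextmanager
async def lifespan(app):
    events.append("startup")       # runs once, before requests are served
    try:
        yield
    finally:
        events.append("shutdown")  # runs once, when the server stops


async def simulate_server() -> None:
    # Stand-in for the ASGI server entering and leaving the app's lifetime.
    async with lifespan(app=None):
        events.append("serving")


asyncio.run(simulate_server())
print(events)  # ['startup', 'serving', 'shutdown']
```

If the singletons ever hold resources that need releasing (open files, connections), the `finally` branch is where that cleanup would go.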

Smart transcriber selection:

if cfg.has_groq_key():
    transcriber = GroqTranscriber()   # ~300ms, cloud
else:
    transcriber = WhisperTranscriber()  # ~5–60s, local CPU

REST API (api/routes.py)

Singleton Injection

Rather than using FastAPI Depends() (which requires each endpoint to declare its dependencies), singletons are injected once at startup via init_routes():

_kb_manager = None
_transcriber = None
_answer_chain = None
_db_path: Optional[Path] = None

def init_routes(kb_manager, transcriber, answer_chain, db_path) -> None:
    global _kb_manager, _transcriber, _answer_chain, _db_path
    _kb_manager = kb_manager
    ...

This makes the module stateful but keeps route definitions clean and eliminates per-request dependency resolution overhead for these heavy singletons.
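
One risk of module-level singletons is a route running before `init_routes()` has been called, which would surface as an opaque `AttributeError` on `None`. The sketch below shows one possible guard. It is simplified to a single singleton, and `get_kb_manager` is a hypothetical accessor that does not appear in the source.

```python
from typing import Any, Optional

_kb_manager: Optional[Any] = None


def init_routes(kb_manager: Any) -> None:
    """Store the shared instance once, at startup (simplified signature)."""
    global _kb_manager
    _kb_manager = kb_manager


def get_kb_manager() -> Any:
    """Hypothetical accessor: fail loudly if startup injection was skipped."""
    if _kb_manager is None:
        raise RuntimeError("init_routes() must be called before handling requests")
    return _kb_manager
```

Route bodies would then call `get_kb_manager()` instead of reading the global directly, keeping the failure mode explicit without reintroducing per-endpoint `Depends()` declarations.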

Critical Bug Fixed: .search() vs .retrieve()

During the first end-to-end runtime test, /api/ask raised:

AttributeError: 'HybridRetriever' object has no attribute 'search'

The existing unit tests never caught this because HybridRetriever was fully mocked, and a bare MagicMock accepts any attribute access. The actual public method is retrieve().

Root cause: Unit tests with MagicMock() replacing the entire retriever cannot catch wrong method names.

Fix: retriever.retrieve(search_query) in both api/routes.py and ui/tabs/ask_tab.py.

Prevention: The new integration tests in test_api_routes.py patch the real HybridRetriever.retrieve method by name and leave all routing and call paths real, so a call to any other method name fails the test:

with patch("voicevault.retrieval.hybrid_retriever.HybridRetriever.retrieve",
           return_value=[]) as mock_retrieve:
    r = client.post("/api/ask", json={...})
mock_retrieve.assert_called_once()   # would fail if code calls .search()
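
A complementary guard, offered here as an option and not as something the project is known to use, is the `spec` argument of `unittest.mock.MagicMock`. A mock built with a spec rejects attributes that the real class does not define. The `HybridRetriever` class below is a minimal stand-in written for this example.

```python
from unittest.mock import MagicMock


class HybridRetriever:
    """Minimal stand-in: only the method name matters here."""

    def retrieve(self, query: str) -> list:
        return []


loose = MagicMock()
loose.search("q")  # silently succeeds, so the wrong name goes unnoticed

strict = MagicMock(spec=HybridRetriever)
strict.retrieve("q")  # fine: the method exists on the real class
try:
    strict.search("q")
except AttributeError as exc:
    print(f"caught: {exc}")
```

With a spec in place, even fully mocked unit tests would have flagged the `.search()` call.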

Endpoints

| Method | Path | Description |
|---|---|---|
| GET | /api/kbs | List all knowledge bases with doc/chunk counts |
| POST | /api/kbs | Create a KB (validates name, hashes password) |
| DELETE | /api/kbs/{kb_name} | Delete KB and all its indexed data |
| POST | /api/kbs/{kb_name}/documents | Upload + index documents |
| POST | /api/transcribe | Transcribe uploaded audio file |
| POST | /api/ask | Full RAG pipeline: retrieve → generate → cite |
| GET | /api/analytics | Query stats from SQLite audit log |

Groq Transcriber (voicevault/asr/groq_transcriber.py)

Why Groq for ASR

Local Whisper on CPU downloads a 1.5GB model on first use, so the first recording can take 5+ minutes before even a 5-second clip is transcribed. This is unusable in a live demo setting.

Groq's Whisper API:

  • No model download
  • ~300–500ms round-trip for short voice queries
  • Same whisper-large-v3-turbo model quality
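  • Free tier: 7,200 requests/day

from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment
with open(audio_path, "rb") as f:
    response = client.audio.transcriptions.create(
        file=(audio_path.name, f.read()),
        model="whisper-large-v3-turbo",
        language="en",
        response_format="text",
    )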

The GroqTranscriber returns the same TranscriptResult dataclass as WhisperTranscriber, so the rest of the pipeline is unaware of which was used.
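
The shared return type is what makes the two backends interchangeable. The sketch below illustrates that contract under stated assumptions: the `TranscriptResult` fields, the `transcribe` method name, and the `Fake*` classes are all hypothetical, since the source does not show the real definitions.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class TranscriptResult:
    text: str      # assumed field
    backend: str   # assumed field, for illustration only


class Transcriber(Protocol):
    def transcribe(self, audio_path: str) -> TranscriptResult: ...


class FakeGroqTranscriber:
    def transcribe(self, audio_path: str) -> TranscriptResult:
        return TranscriptResult(text="hello", backend="groq")


class FakeWhisperTranscriber:
    def transcribe(self, audio_path: str) -> TranscriptResult:
        return TranscriptResult(text="hello", backend="whisper")


def pick_transcriber(has_groq_key: bool) -> Transcriber:
    """Mirror the selection rule: cloud if a key exists, else local."""
    return FakeGroqTranscriber() if has_groq_key else FakeWhisperTranscriber()


# Callers only ever see TranscriptResult, whichever backend ran.
result = pick_transcriber(has_groq_key=True).transcribe("clip.wav")
```

Because callers depend only on the result type, swapping backends needs no changes downstream.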


SPA Frontend

Browser Audio → WAV Conversion

The browser's MediaRecorder API outputs audio/webm (Chrome) or audio/ogg (Firefox). The server's soundfile library only reads WAV/FLAC/OGG-Vorbis. Sending WebM directly to the server would require ffmpeg.

Solution: convert in-browser using AudioContext before sending:

async function convertBlobToWav(blob) {
    const arrayBuffer = await blob.arrayBuffer();
    const audioCtx = new (window.AudioContext || window.webkitAudioContext)();
    const audioBuffer = await audioCtx.decodeAudioData(arrayBuffer);
    audioCtx.close();
    return audioBufferToWavBlob(audioBuffer);  // 16-bit PCM WAV
}

This eliminates the ffmpeg system dependency entirely. The server receives a standard PCM WAV file that soundfile reads natively.
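
The body of `audioBufferToWavBlob` is not shown in this document. As a rough illustration of what "16-bit PCM WAV" means, here is a Python analogue using the standard `wave` module. It is a sketch of the encoding step only, assuming mono float samples in [-1.0, 1.0], and `floats_to_wav_bytes` is a name invented for this example.

```python
import io
import struct
import wave


def floats_to_wav_bytes(samples, sample_rate=16000):
    """Encode mono float samples in [-1.0, 1.0] as 16-bit PCM WAV bytes."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    # Scale to the signed 16-bit range and pack as little-endian integers.
    pcm = struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()


wav_bytes = floats_to_wav_bytes([0.0, 0.5, -0.5, 1.0, -1.0])
```

The browser version does the same job: a RIFF/WAVE header followed by clipped, scaled integer samples.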

Design System

The SPA uses a dark glassmorphism design:

  • Background: near-black #09090b with ambient purple gradient orbs (CSS @keyframes drift)
  • Cards: rgba(255,255,255,0.03) with backdrop-filter: blur(12px) and subtle border
  • Primary: #8b5cf6 (violet-500), consistent across buttons, badges, microphone glow
  • Chat bubbles: user messages right-aligned (violet), assistant messages left-aligned (dark card)
  • Animations: msgIn slide-in for chat, micPulse glow during recording, waveAnim bars during processing

Integration Tests (tests/test_api_routes.py)

Philosophy

The existing 311 tests mock HybridRetriever, AnswerChain, and WhisperTranscriber at the class level using MagicMock(). This correctly validates internal logic within each module, but cannot catch:

  • Wrong method names called across module boundaries
  • Incorrect field names in Pydantic response models
  • Route registration issues (missing prefix, typos)

The integration tests solve this by using:

  • Real KBManager backed by a temp SQLite DB (schema initialized via initialize_database())
  • Real FastAPI TestClient routing (not mocked)
  • Only the LLM and Whisper calls mocked (network/slow)

@pytest.fixture(scope="module")
def client(kb_manager, mock_transcriber, mock_answer_chain, tmp_path_factory):
    db = tmp_path_factory.mktemp("db2") / "server.db"
    db_mod.initialize_database(db)          # real schema
    routes_mod.init_routes(kb_manager, mock_transcriber, mock_answer_chain, db)
    app = FastAPI()
    app.include_router(router)
    return TestClient(app)

Test Coverage

| Class | Tests | What Is Validated |
|---|---|---|
| TestKBEndpoints | 8 | CRUD, validation, duplicate detection, response fields |
| TestAskEndpoint | 6 | Method names (retrieve, build), response schema, history, empty/missing inputs |
| TestAnalyticsEndpoint | 2 | Stats structure, KB list type |
| TestTranscribeEndpoint | 1 | Real WAV file upload, mock transcriber called |

Total: 17 tests, all passing.


Docker Deployment

Image Strategy

# 1. CPU-only PyTorch first (saves 1.8GB vs GPU wheel)
RUN pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cpu

# 2. All other requirements
RUN pip install -r requirements.txt

# 3. spaCy model (needed for document chunking)
RUN python -m spacy download en_core_web_sm

# 4. Pre-download ML models at build time (no cold-start delays)
RUN python -c "from sentence_transformers import SentenceTransformer, CrossEncoder; \
    SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2'); \
    CrossEncoder('cross-encoder/ms-marco-MiniLM-L12-v2')"

Pre-baking the embedding and reranking models into the Docker layer means the first document upload doesn't trigger a download. Whisper is intentionally not pre-baked, because the Groq cloud API is used when GROQ_API_KEY is present.

Environment Variables in HF Spaces

API keys are injected as Space secrets (not committed to git):

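import os
from huggingface_hub import HfApi

api = HfApi()
# Read each key from the local environment so no secret is hard-coded
api.add_space_secret('NinjainPJs/VoiceVault', 'GROQ_API_KEY', os.environ['GROQ_API_KEY'])
api.add_space_secret('NinjainPJs/VoiceVault', 'GEMINI_API_KEY', os.environ['GEMINI_API_KEY'])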
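
Inside the running container, Space secrets are exposed as environment variables. The check behind `cfg.has_groq_key()` is not shown in this document. The sketch below is one plausible shape for it, written as a standalone function with a hypothetical `env` parameter added for testability.

```python
import os


def has_groq_key(env=None) -> bool:
    """Return True when a non-empty GROQ_API_KEY is available.

    env defaults to os.environ; passing a dict is only for testing.
    """
    env = os.environ if env is None else env
    return bool(env.get("GROQ_API_KEY", "").strip())
```

Treating a blank value as missing avoids selecting the cloud transcriber when the secret exists but was saved empty.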

Storage

HF Spaces free tier uses ephemeral storage, so knowledge bases created at runtime are lost on container restart. This is acceptable for a demo deployment. For production persistence, HF Spaces offers a persistent storage add-on, or the data layer can be pointed at an external object store.
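
If persistence is added later, the data directory could be resolved at startup instead of being hard-coded. The following is a minimal sketch under stated assumptions: `VOICEVAULT_DATA_DIR` is a hypothetical variable name, `/data` is assumed to be where the persistent storage add-on is mounted, and `resolve_data_dir` is not part of the project.

```python
import os
import tempfile
from pathlib import Path


def resolve_data_dir(env=None) -> Path:
    """Pick a writable data directory, preferring persistent locations."""
    env = os.environ if env is None else env
    configured = env.get("VOICEVAULT_DATA_DIR")  # hypothetical variable name
    if configured:
        base = Path(configured)
    elif Path("/data").is_dir() and os.access("/data", os.W_OK):
        base = Path("/data") / "voicevault"  # assumed persistent mount
    else:
        # Ephemeral fallback: contents vanish on container restart.
        base = Path(tempfile.gettempdir()) / "voicevault"
    base.mkdir(parents=True, exist_ok=True)
    return base
```

An explicit override comes first so the same image can run locally, on the free tier, or with persistent storage without code changes.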


Lessons Learned

  1. Mock depth matters: mocking at the class level (MagicMock()) cannot catch method-name bugs across modules. Integration tests that mock only I/O boundaries (LLM API, Whisper model) while keeping real routing are essential.

  2. GPU environment variables must be first: CUDA_VISIBLE_DEVICES=-1 must precede all Python ML imports. Any utility module that imports torch at module level will trigger CUDA detection before the variable is set if import order is not controlled.

  3. Browser audio formats: MediaRecorder output (WebM/OGG) and server-side audio libraries (soundfile) don't share a format. Converting to 16-bit PCM WAV in the browser with AudioContext is the cleanest zero-dependency solution.

  4. Groq vs local Whisper: for a live demo, cloud transcription is non-negotiable. A 5-minute wait on first recording kills the experience. The 300ms Groq round-trip feels instant.