ai-time-machine / docs /implementation_plan.md
manikandanj's picture
Prepare AI Time Machine hackathon Space
5862322 verified
|
Raw
History Blame Contribute Delete
8.98 kB

A newer version of the Gradio SDK is available: 6.19.0

Upgrade

AI Time Machine Implementation Plan

This plan organizes the implementation of the "AI Time Machine" (Track 2: Thousand Token Wood) for the Hugging Face Build Small Hackathon.

Updated: 2026-06-09

Decisions Made

  • Architecture: Cloud-first. Together API for LLM, Modal for audio models (STT/TTS), Gradio on HF Spaces for UI.
  • LLM Choice: Qwen3-8B via Together AI API (structured JSON outputs with native JSON mode).
  • STT Choice: NVIDIA Nemotron 3.5 ASR Streaming 0.6B on Modal (NVIDIA sponsor prize requirement).
  • TTS Strategy: Qwen3-TTS 1.7B VoiceDesign on Modal for production. Windows SAPI is the low-latency Walk default on Windows; Kokoro 82M remains available locally for voice-quality comparison.
  • Dev Profile: Together API + local low-latency TTS + text input (no STT needed for dev).
  • Visual Style: Immersive Steampunk cockpit (brass, copper, glowing edison bulbs, deep wood textures, circular portal).
  • Division of Work: Track A (UI/Cockpit & Gradio Shell) vs Track B (Domain Models & AI Adapters).

Deployment Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   HF Spaces     β”‚     β”‚   Together AI     β”‚     β”‚     Modal       β”‚
β”‚   (Gradio UI)   │────>β”‚   (Qwen3-8B LLM)  β”‚     β”‚  (GPU Compute)  β”‚
β”‚                 β”‚     β”‚   JSON mode       β”‚     β”‚                 β”‚
β”‚                 β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚  Nemotron STT   β”‚
β”‚                 │────────────────────────────>β”‚  Qwen3-TTS      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key principle: Divide and conquer dependencies. Never mix NeMo (audio) with vLLM (LLM) in one environment. Use API providers for the LLM (structured JSON output built-in). Dedicate Modal exclusively to audio models.

Phased Execution: Walk / Run / Sprint

Walk (Tonight β€” Local Sync)

Goal: Validate logic end-to-end with real AI inference.

  • Text input -> Together API (Qwen3) -> local TTS -> Audio playback
  • Profile: dev
  • No STT needed β€” type in Gradio's text box
  • Validates: destination generation, persona generation, conversation, souvenir, TTS audio pipeline
  • On Windows, Walk defaults to TIME_MACHINE_DEV_TTS=sapi for low latency. Set TIME_MACHINE_DEV_TTS=kokoro to compare Kokoro voice quality.
  • TIME_MACHINE_MAX_RESPONSE_CHARS overrides the default short spoken reply cap of 260 characters.
$env:TIME_MACHINE_ADAPTER_PROFILE="dev"
$env:TIME_MACHINE_LLM_API_KEY="your-together-key"
python app.py

Local secret-file option for desktop development:

# .env (ignored by git)
TIME_MACHINE_ADAPTER_PROFILE=dev
TIME_MACHINE_LLM_API_KEY=your-together-key
# TOGETHER_API_KEY=your-together-key also works
TIME_MACHINE_LLM_MODEL=Qwen/Qwen2.5-7B-Instruct-Turbo

Optional Kokoro runtime for higher-quality local output (requires Python < 3.13; Python 3.13 uses the built-in WAV fallback):

.\.venv\Scripts\python.exe -m pip install -e ".[dev]"
$env:TIME_MACHINE_DEV_TTS="kokoro"
.\.venv\Scripts\python.exe scripts\walk_smoke.py

Note: Together currently reports Qwen/Qwen3-8B as requiring a dedicated endpoint on this account. The local Walk profile uses the serverless Qwen/Qwen2.5-7B-Instruct-Turbo model unless TIME_MACHINE_LLM_MODEL is set explicitly. Latency note from local Windows testing: Kokoro took 13.41s to synthesize a 6.35s clip, while Windows SAPI synthesized a similar 6.65s clip in 1.87s. The bottleneck was local Kokoro throughput, not the cloud LLM.

Run (Tomorrow β€” Cloud Sync)

Goal: Full voice-first loop with real models on Modal.

  • Push-to-talk microphone β†’ Nemotron STT (Modal) β†’ Together API (Qwen3) β†’ Qwen3-TTS (Modal) β†’ Audio playback
  • Profile: modal
  • Add microphone input component to Gradio UI
  • Create Modal functions for Nemotron and Qwen3-TTS in isolated environments

Sprint (Tomorrow Night β€” Streaming)

Goal: Real-time streaming for a polished demo.

  • Streaming audio β†’ Nemotron partial transcripts β†’ LLM token streaming β†’ TTS audio chunks β†’ Live playback
  • Swap HTTP requests for WebSocket streaming where possible
  • Only attempt if Run phase works flawlessly

Adapter Profiles

Profile LLM STT TTS Use Case
fixture Fixture data Fixture Fixture Tests, UI dev
dev Together API Whisper (local) SAPI on Windows, Kokoro optional Dev testing
local_models Qwen local (transformers) Nemotron local (NeMo) Kokoro (local) Full local (needs big GPU)
modal Together API Nemotron (Modal) Qwen3-TTS (Modal) Production / hackathon submission

Parameter Budget

Model Role Parameters Enabled
Qwen3-8B LLM (via Together API) 8.0B βœ…
Nemotron 3.5 ASR STT (on Modal) 0.6B βœ…
Qwen3-TTS 1.7B TTS (on Modal) 1.7B βœ…
Total 10.3B < 32B βœ…

Dev-only models (not counted for submission):

  • Kokoro 82M (dev TTS fallback)
  • Whisper base 74M (dev STT fallback)

Proposed Changes

Already Implemented βœ…

[NEW] cloud_completion.py

  • Cloud LLM completion function for any OpenAI-compatible API (Together, OpenRouter, etc.)
  • Zero new dependencies (uses stdlib urllib)
  • Injected into QwenStructuredLLMAdapter via existing completion_fn hook

[NEW] whisper_stt.py

  • Whisper STT adapter for dev/testing
  • Drop-in replacement for NemotronStreamingSTTAdapter
  • Same STTAdapter protocol

[MODIFY] container.py

  • Added dev profile wiring: Together API + Whisper STT + local TTS
  • Dev TTS defaults to Windows SAPI on Windows for Walk latency; TIME_MACHINE_DEV_TTS=kokoro preserves the Kokoro path.

Still Needed

[NEW] Modal Nemotron STT endpoint

  • Isolated Modal function running NeMo + Nemotron 3.5 ASR
  • HTTP webhook callable from the Gradio app
  • Accepts audio, returns transcript JSON

[NEW] Modal Qwen3-TTS endpoint

  • Isolated Modal function running Qwen3-TTS 1.7B
  • HTTP webhook callable from the Gradio app
  • Accepts text + voice profile, returns audio

[NEW] Modal STT/TTS adapter wrappers

  • ModalNemotronSTTAdapter β€” calls Modal webhook, returns Transcript
  • ModalQwenTTSAdapter β€” calls Modal webhook, returns AudioResult
  • Both implement existing port interfaces

[MODIFY] container.py

  • Add modal profile wiring the Modal adapters

[MODIFY] gradio_app.py

  • Add microphone input component for push-to-talk voice input

Verification Plan

Walk Phase

  • Launch with dev profile, type messages, verify real Qwen responses and audio playback
  • Run scripts\walk_smoke.py and inspect printed stage timings for launch, conversation, and TTS latency

Run Phase

  • Start the Modal STT and TTS endpoints from PowerShell. The UTF-8 settings avoid Windows console encoding failures from Modal CLI status glyphs.
$env:MODAL_PROFILE="manikandanj"
$env:PYTHONUTF8="1"
$env:PYTHONIOENCODING="utf-8"
$env:TTY_COMPATIBLE="0"
.\.venv\Scripts\modal.exe serve scripts\modal_audio.py
  • Keep Modal warm for the demo. The audio service now preloads Nemotron and Qwen3-TTS in class lifecycle hooks, keeps one GPU container warm by default, and runs a short Qwen3-TTS warmup during container startup. Override with:
$env:TIME_MACHINE_MODAL_MIN_CONTAINERS="1"
$env:TIME_MACHINE_MODAL_MAX_CONTAINERS="1"
$env:TIME_MACHINE_MODAL_SCALEDOWN_SECONDS="1800"
$env:TIME_MACHINE_MODAL_WARMUP_TTS="1"
  • After modal serve prints the STT and TTS endpoint URLs, set TIME_MACHINE_MODAL_STT_URL and TIME_MACHINE_MODAL_TTS_URL, then preflight both real-model endpoints before opening the demo:
.\.venv\Scripts\python.exe scripts\modal_warmup.py
  • Test voice input β†’ text β†’ voice output loop

Automated Tests

  • pytest tests/unit/ β€” domain models, JSON contract parsing, event stream ordering
  • Model budget compliance test (sum enabled params ≀ 32B)

Manual Verification

  • Launch the machine, type/speak to the generated character, bring back a souvenir
  • Verify audio playback works in the Gradio UI