| # AI Time Machine Implementation Plan |
|
|
| This plan organizes the implementation of the "AI Time Machine" (Track 2: Thousand Token Wood) for the Hugging Face Build Small Hackathon. |
|
|
| Updated: 2026-06-09 |
|
|
| ## Decisions Made |
|
|
| - **Architecture**: Cloud-first. Together API for LLM, Modal for audio models (STT/TTS), Gradio on HF Spaces for UI. |
| - **LLM Choice**: Qwen3-8B via Together AI API (structured JSON outputs with native JSON mode). |
| - **STT Choice**: NVIDIA Nemotron 3.5 ASR Streaming 0.6B on Modal (NVIDIA sponsor prize requirement). |
| - **TTS Strategy**: Qwen3-TTS 1.7B VoiceDesign on Modal for production. Windows SAPI is the low-latency Walk default on Windows; Kokoro 82M remains available locally for voice-quality comparison. |
| - **Dev Profile**: Together API + local low-latency TTS + text input (no STT needed for dev). |
| - **Visual Style**: Immersive Steampunk cockpit (brass, copper, glowing edison bulbs, deep wood textures, circular portal). |
| - **Division of Work**: Track A (UI/Cockpit & Gradio Shell) vs Track B (Domain Models & AI Adapters). |
|
|
| ## Deployment Architecture |
|
|
| ```text |
| βββββββββββββββββββ ββββββββββββββββββββ βββββββββββββββββββ |
| β HF Spaces β β Together AI β β Modal β |
| β (Gradio UI) βββββ>β (Qwen3-8B LLM) β β (GPU Compute) β |
| β β β JSON mode β β β |
| β β ββββββββββββββββββββ β Nemotron STT β |
| β βββββββββββββββββββββββββββββ>β Qwen3-TTS β |
| βββββββββββββββββββ βββββββββββββββββββ |
| ``` |
|
|
| Key principle: **Divide and conquer dependencies.** Never mix NeMo (audio) with vLLM (LLM) in one environment. Use API providers for the LLM (structured JSON output built-in). Dedicate Modal exclusively to audio models. |
|
|
| ## Phased Execution: Walk / Run / Sprint |
|
|
| ### Walk (Tonight β Local Sync) |
| **Goal**: Validate logic end-to-end with real AI inference. |
|
|
| - Text input -> Together API (Qwen3) -> local TTS -> Audio playback |
| - Profile: `dev` |
| - No STT needed β type in Gradio's text box |
| - Validates: destination generation, persona generation, conversation, souvenir, TTS audio pipeline |
| - On Windows, Walk defaults to `TIME_MACHINE_DEV_TTS=sapi` for low latency. Set `TIME_MACHINE_DEV_TTS=kokoro` to compare Kokoro voice quality. |
| - `TIME_MACHINE_MAX_RESPONSE_CHARS` overrides the default short spoken reply cap of 260 characters. |
|
|
| ```powershell |
| $env:TIME_MACHINE_ADAPTER_PROFILE="dev" |
| $env:TIME_MACHINE_LLM_API_KEY="your-together-key" |
| python app.py |
| ``` |
|
|
| Local secret-file option for desktop development: |
|
|
| ```text |
| # .env (ignored by git) |
| TIME_MACHINE_ADAPTER_PROFILE=dev |
| TIME_MACHINE_LLM_API_KEY=your-together-key |
| # TOGETHER_API_KEY=your-together-key also works |
| TIME_MACHINE_LLM_MODEL=Qwen/Qwen2.5-7B-Instruct-Turbo |
| ``` |
|
|
| Optional Kokoro runtime for higher-quality local output (requires Python < 3.13; Python 3.13 uses the built-in WAV fallback): |
|
|
| ```powershell |
| .\.venv\Scripts\python.exe -m pip install -e ".[dev]" |
| $env:TIME_MACHINE_DEV_TTS="kokoro" |
| .\.venv\Scripts\python.exe scripts\walk_smoke.py |
| ``` |
|
|
| Note: Together currently reports `Qwen/Qwen3-8B` as requiring a dedicated endpoint on this account. The local Walk profile uses the serverless `Qwen/Qwen2.5-7B-Instruct-Turbo` model unless `TIME_MACHINE_LLM_MODEL` is set explicitly. |
| Latency note from local Windows testing: Kokoro took 13.41s to synthesize a 6.35s clip, while Windows SAPI synthesized a similar 6.65s clip in 1.87s. The bottleneck was local Kokoro throughput, not the cloud LLM. |
|
|
| ### Run (Tomorrow β Cloud Sync) |
| **Goal**: Full voice-first loop with real models on Modal. |
|
|
| - Push-to-talk microphone β Nemotron STT (Modal) β Together API (Qwen3) β Qwen3-TTS (Modal) β Audio playback |
| - Profile: `modal` |
| - Add microphone input component to Gradio UI |
| - Create Modal functions for Nemotron and Qwen3-TTS in isolated environments |
|
|
| ### Sprint (Tomorrow Night β Streaming) |
| **Goal**: Real-time streaming for a polished demo. |
|
|
| - Streaming audio β Nemotron partial transcripts β LLM token streaming β TTS audio chunks β Live playback |
| - Swap HTTP requests for WebSocket streaming where possible |
| - Only attempt if Run phase works flawlessly |
|
|
| ## Adapter Profiles |
|
|
| | Profile | LLM | STT | TTS | Use Case | |
| |---------|-----|-----|-----|----------| |
| | `fixture` | Fixture data | Fixture | Fixture | Tests, UI dev | |
| | `dev` | Together API | Whisper (local) | SAPI on Windows, Kokoro optional | Dev testing | |
| | `local_models` | Qwen local (transformers) | Nemotron local (NeMo) | Kokoro (local) | Full local (needs big GPU) | |
| | `modal` | Together API | Nemotron (Modal) | Qwen3-TTS (Modal) | Production / hackathon submission | |
|
|
| ## Parameter Budget |
|
|
| | Model | Role | Parameters | Enabled | |
| |-------|------|------------|----------| |
| | Qwen3-8B | LLM (via Together API) | 8.0B | β
| |
| | Nemotron 3.5 ASR | STT (on Modal) | 0.6B | β
| |
| | Qwen3-TTS 1.7B | TTS (on Modal) | 1.7B | β
| |
| | **Total** | | **10.3B** | **< 32B β
** | |
|
|
| Dev-only models (not counted for submission): |
| - Kokoro 82M (dev TTS fallback) |
| - Whisper base 74M (dev STT fallback) |
|
|
| ## Proposed Changes |
|
|
| ### Already Implemented β
|
|
|
| #### [NEW] [cloud_completion.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/adapters/llm/cloud_completion.py) |
| - Cloud LLM completion function for any OpenAI-compatible API (Together, OpenRouter, etc.) |
| - Zero new dependencies (uses stdlib urllib) |
| - Injected into QwenStructuredLLMAdapter via existing `completion_fn` hook |
|
|
| #### [NEW] [whisper_stt.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/adapters/stt/whisper_stt.py) |
| - Whisper STT adapter for dev/testing |
| - Drop-in replacement for NemotronStreamingSTTAdapter |
| - Same STTAdapter protocol |
|
|
| #### [MODIFY] [container.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/application/container.py) |
| - Added `dev` profile wiring: Together API + Whisper STT + local TTS |
| - Dev TTS defaults to Windows SAPI on Windows for Walk latency; `TIME_MACHINE_DEV_TTS=kokoro` preserves the Kokoro path. |
|
|
| ### Still Needed |
|
|
| #### [NEW] Modal Nemotron STT endpoint |
| - Isolated Modal function running NeMo + Nemotron 3.5 ASR |
| - HTTP webhook callable from the Gradio app |
| - Accepts audio, returns transcript JSON |
|
|
| #### [NEW] Modal Qwen3-TTS endpoint |
| - Isolated Modal function running Qwen3-TTS 1.7B |
| - HTTP webhook callable from the Gradio app |
| - Accepts text + voice profile, returns audio |
|
|
| #### [NEW] Modal STT/TTS adapter wrappers |
| - `ModalNemotronSTTAdapter` β calls Modal webhook, returns `Transcript` |
| - `ModalQwenTTSAdapter` β calls Modal webhook, returns `AudioResult` |
| - Both implement existing port interfaces |
|
|
| #### [MODIFY] [container.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/application/container.py) |
| - Add `modal` profile wiring the Modal adapters |
|
|
| #### [MODIFY] [gradio_app.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/ui/gradio_app.py) |
| - Add microphone input component for push-to-talk voice input |
|
|
| ## Verification Plan |
|
|
| ### Walk Phase |
| - Launch with `dev` profile, type messages, verify real Qwen responses and audio playback |
| - Run `scripts\walk_smoke.py` and inspect printed stage timings for launch, conversation, and TTS latency |
|
|
| ### Run Phase |
| - Start the Modal STT and TTS endpoints from PowerShell. The UTF-8 settings avoid |
| Windows console encoding failures from Modal CLI status glyphs. |
|
|
| ```powershell |
| $env:MODAL_PROFILE="manikandanj" |
| $env:PYTHONUTF8="1" |
| $env:PYTHONIOENCODING="utf-8" |
| $env:TTY_COMPATIBLE="0" |
| .\.venv\Scripts\modal.exe serve scripts\modal_audio.py |
| ``` |
|
|
| - Keep Modal warm for the demo. The audio service now preloads Nemotron and |
| Qwen3-TTS in class lifecycle hooks, keeps one GPU container warm by default, |
| and runs a short Qwen3-TTS warmup during container startup. Override with: |
|
|
| ```powershell |
| $env:TIME_MACHINE_MODAL_MIN_CONTAINERS="1" |
| $env:TIME_MACHINE_MODAL_MAX_CONTAINERS="1" |
| $env:TIME_MACHINE_MODAL_SCALEDOWN_SECONDS="1800" |
| $env:TIME_MACHINE_MODAL_WARMUP_TTS="1" |
| ``` |
|
|
| - After `modal serve` prints the STT and TTS endpoint URLs, set |
| `TIME_MACHINE_MODAL_STT_URL` and `TIME_MACHINE_MODAL_TTS_URL`, then preflight |
| both real-model endpoints before opening the demo: |
|
|
| ```powershell |
| .\.venv\Scripts\python.exe scripts\modal_warmup.py |
| ``` |
|
|
| - Test voice input β text β voice output loop |
|
|
| ### Automated Tests |
| - `pytest tests/unit/` β domain models, JSON contract parsing, event stream ordering |
| - Model budget compliance test (sum enabled params β€ 32B) |
|
|
| ### Manual Verification |
| - Launch the machine, type/speak to the generated character, bring back a souvenir |
| - Verify audio playback works in the Gradio UI |
|
|