# AI Time Machine Implementation Plan

This plan organizes the implementation of the "AI Time Machine" (Track 2: Thousand Token Wood) for the Hugging Face Build Small Hackathon.

Updated: 2026-06-09

## Decisions Made

- **Architecture**: Cloud-first. Together API for LLM, Modal for audio models (STT/TTS), Gradio on HF Spaces for UI.
- **LLM Choice**: Qwen3-8B via Together AI API (structured JSON outputs with native JSON mode).
- **STT Choice**: NVIDIA Nemotron 3.5 ASR Streaming 0.6B on Modal (NVIDIA sponsor prize requirement).
- **TTS Strategy**: Qwen3-TTS 1.7B VoiceDesign on Modal for production. Windows SAPI is the low-latency Walk default on Windows; Kokoro 82M remains available locally for voice-quality comparison.
- **Dev Profile**: Together API + local low-latency TTS + text input (no STT needed for dev).
- **Visual Style**: Immersive Steampunk cockpit (brass, copper, glowing Edison bulbs, deep wood textures, circular portal).
- **Division of Work**: Track A (UI/Cockpit & Gradio Shell) vs Track B (Domain Models & AI Adapters).

## Deployment Architecture

```text
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   HF Spaces     │     │   Together AI    │     │     Modal       │
│   (Gradio UI)   │────>│  (Qwen3-8B LLM)  │     │  (GPU Compute)  │
│                 │     │    JSON mode     │     │                 │
│                 │     └──────────────────┘     │  Nemotron STT   │
│                 │─────────────────────────────>│  Qwen3-TTS      │
└─────────────────┘                              └─────────────────┘
```

Key principle: **Divide and conquer dependencies.** Never mix NeMo (audio) with vLLM (LLM) in one environment. Use API providers for the LLM (structured JSON output built-in). Dedicate Modal exclusively to audio models.

## Phased Execution: Walk / Run / Sprint

### Walk (Tonight, Local Sync)

**Goal**: Validate logic end-to-end with real AI inference.
- Text input → Together API (Qwen) → local TTS → Audio playback
- Profile: `dev`
- No STT needed: type in Gradio's text box
- Validates: destination generation, persona generation, conversation, souvenir, TTS audio pipeline
- On Windows, Walk defaults to `TIME_MACHINE_DEV_TTS=sapi` for low latency. Set `TIME_MACHINE_DEV_TTS=kokoro` to compare Kokoro voice quality.
- `TIME_MACHINE_MAX_RESPONSE_CHARS` overrides the default cap of 260 characters on short spoken replies.

```powershell
$env:TIME_MACHINE_ADAPTER_PROFILE="dev"
$env:TIME_MACHINE_LLM_API_KEY="your-together-key"
python app.py
```

Local secret-file option for desktop development:

```text
# .env (ignored by git)
TIME_MACHINE_ADAPTER_PROFILE=dev
TIME_MACHINE_LLM_API_KEY=your-together-key
# TOGETHER_API_KEY=your-together-key also works
TIME_MACHINE_LLM_MODEL=Qwen/Qwen2.5-7B-Instruct-Turbo
```

Optional Kokoro runtime for higher-quality local output (requires Python < 3.13; Python 3.13 uses the built-in WAV fallback):

```powershell
.\.venv\Scripts\python.exe -m pip install -e ".[dev]"
$env:TIME_MACHINE_DEV_TTS="kokoro"
.\.venv\Scripts\python.exe scripts\walk_smoke.py
```

Note: Together currently reports `Qwen/Qwen3-8B` as requiring a dedicated endpoint on this account. The local Walk profile uses the serverless `Qwen/Qwen2.5-7B-Instruct-Turbo` model unless `TIME_MACHINE_LLM_MODEL` is set explicitly.

Latency note from local Windows testing: Kokoro took 13.41s to synthesize a 6.35s clip (about 2.1x real time), while Windows SAPI synthesized a similar 6.65s clip in 1.87s (about 0.28x real time). The bottleneck was local Kokoro throughput, not the cloud LLM.

### Run (Tomorrow, Cloud Sync)

**Goal**: Full voice-first loop with real models on Modal.
- Push-to-talk microphone → Nemotron STT (Modal) → Together API (Qwen3) → Qwen3-TTS (Modal) → Audio playback
- Profile: `modal`
- Add microphone input component to Gradio UI
- Create Modal functions for Nemotron and Qwen3-TTS in isolated environments

### Sprint (Tomorrow Night, Streaming)

**Goal**: Real-time streaming for a polished demo.

- Streaming audio → Nemotron partial transcripts → LLM token streaming → TTS audio chunks → Live playback
- Swap HTTP requests for WebSocket streaming where possible
- Only attempt if Run phase works flawlessly

## Adapter Profiles

| Profile | LLM | STT | TTS | Use Case |
|---------|-----|-----|-----|----------|
| `fixture` | Fixture data | Fixture | Fixture | Tests, UI dev |
| `dev` | Together API | Whisper (local) | SAPI on Windows, Kokoro optional | Dev testing |
| `local_models` | Qwen local (transformers) | Nemotron local (NeMo) | Kokoro (local) | Full local (needs big GPU) |
| `modal` | Together API | Nemotron (Modal) | Qwen3-TTS (Modal) | Production / hackathon submission |

## Parameter Budget

| Model | Role | Parameters | Enabled |
|-------|------|------------|---------|
| Qwen3-8B | LLM (via Together API) | 8.0B | ✅ |
| Nemotron 3.5 ASR | STT (on Modal) | 0.6B | ✅ |
| Qwen3-TTS 1.7B | TTS (on Modal) | 1.7B | ✅ |
| **Total** | | **10.3B** | **< 32B ✅** |

Dev-only models (not counted for submission):

- Kokoro 82M (dev TTS fallback)
- Whisper base 74M (dev STT fallback)

## Proposed Changes

### Already Implemented ✅

#### [NEW] [cloud_completion.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/adapters/llm/cloud_completion.py)

- Cloud LLM completion function for any OpenAI-compatible API (Together, OpenRouter, etc.)
- Zero new dependencies (uses stdlib urllib)
- Injected into `QwenStructuredLLMAdapter` via existing `completion_fn` hook

#### [NEW] [whisper_stt.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/adapters/stt/whisper_stt.py)

- Whisper STT adapter for dev/testing
- Drop-in replacement for `NemotronStreamingSTTAdapter`
- Same `STTAdapter` protocol

#### [MODIFY] [container.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/application/container.py)

- Added `dev` profile wiring: Together API + Whisper STT + local TTS
- Dev TTS defaults to Windows SAPI on Windows for Walk latency; `TIME_MACHINE_DEV_TTS=kokoro` preserves the Kokoro path.

### Still Needed

#### [NEW] Modal Nemotron STT endpoint

- Isolated Modal function running NeMo + Nemotron 3.5 ASR
- HTTP webhook callable from the Gradio app
- Accepts audio, returns transcript JSON

#### [NEW] Modal Qwen3-TTS endpoint

- Isolated Modal function running Qwen3-TTS 1.7B
- HTTP webhook callable from the Gradio app
- Accepts text + voice profile, returns audio

#### [NEW] Modal STT/TTS adapter wrappers

- `ModalNemotronSTTAdapter`: calls Modal webhook, returns `Transcript`
- `ModalQwenTTSAdapter`: calls Modal webhook, returns `AudioResult`
- Both implement existing port interfaces

#### [MODIFY] [container.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/application/container.py)

- Add `modal` profile wiring the Modal adapters

#### [MODIFY] [gradio_app.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/ui/gradio_app.py)

- Add microphone input component for push-to-talk voice input

## Verification Plan

### Walk Phase

- Launch with `dev` profile, type messages, verify real Qwen responses and audio playback
- Run `scripts\walk_smoke.py` and inspect printed stage timings for launch, conversation, and TTS latency

### Run Phase

- Start the Modal STT and TTS endpoints from PowerShell.
The UTF-8 settings avoid Windows console encoding failures from Modal CLI status glyphs.

```powershell
$env:MODAL_PROFILE="manikandanj"
$env:PYTHONUTF8="1"
$env:PYTHONIOENCODING="utf-8"
$env:TTY_COMPATIBLE="0"
.\.venv\Scripts\modal.exe serve scripts\modal_audio.py
```

- Keep Modal warm for the demo. The audio service now preloads Nemotron and Qwen3-TTS in class lifecycle hooks, keeps one GPU container warm by default, and runs a short Qwen3-TTS warmup during container startup. Override with:

```powershell
$env:TIME_MACHINE_MODAL_MIN_CONTAINERS="1"
$env:TIME_MACHINE_MODAL_MAX_CONTAINERS="1"
$env:TIME_MACHINE_MODAL_SCALEDOWN_SECONDS="1800"
$env:TIME_MACHINE_MODAL_WARMUP_TTS="1"
```

- After `modal serve` prints the STT and TTS endpoint URLs, set `TIME_MACHINE_MODAL_STT_URL` and `TIME_MACHINE_MODAL_TTS_URL`, then preflight both real-model endpoints before opening the demo:

```powershell
.\.venv\Scripts\python.exe scripts\modal_warmup.py
```

- Test voice input → text → voice output loop

### Automated Tests

- `pytest tests/unit/`: domain models, JSON contract parsing, event stream ordering
- Model budget compliance test (sum enabled params ≤ 32B)

### Manual Verification

- Launch the machine, type/speak to the generated character, bring back a souvenir
- Verify audio playback works in the Gradio UI
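
The model budget compliance check listed under Automated Tests could look roughly like the sketch below. It is illustrative only: `ModelEntry`, `MODELS`, `enabled_total_b`, and `within_budget` are hypothetical names that do not come from the project, and the parameter counts are copied from the Parameter Budget table rather than read from real model configs.

```python
from dataclasses import dataclass

# Submission cap in billions of parameters, taken from the Parameter Budget table.
BUDGET_LIMIT_B = 32.0


@dataclass(frozen=True)
class ModelEntry:
    """Hypothetical registry row; the real project may store this differently."""
    name: str
    role: str
    params_b: float  # parameter count in billions
    enabled: bool    # True if the model counts toward the submission


# Values copied from the plan; dev-only fallbacks are marked as not enabled.
MODELS = [
    ModelEntry("Qwen3-8B", "llm", 8.0, True),
    ModelEntry("Nemotron 3.5 ASR", "stt", 0.6, True),
    ModelEntry("Qwen3-TTS 1.7B", "tts", 1.7, True),
    ModelEntry("Kokoro 82M", "tts-dev", 0.082, False),
    ModelEntry("Whisper base 74M", "stt-dev", 0.074, False),
]


def enabled_total_b(models):
    """Sum the parameter counts (in billions) of enabled models only."""
    return sum(m.params_b for m in models if m.enabled)


def within_budget(models, limit_b=BUDGET_LIMIT_B):
    """Return True when the enabled models fit under the cap."""
    return enabled_total_b(models) <= limit_b


def test_enabled_models_fit_budget():
    # pytest-style check: fails if a newly enabled model breaks the cap.
    assert within_budget(MODELS), f"{enabled_total_b(MODELS):.1f}B exceeds cap"
```

Keeping the dev-only fallbacks in the same list with `enabled=False` means that accidentally enabling one shows up in the total instead of being silently ignored.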