A newer version of the Gradio SDK is available: 6.19.0
AI Time Machine Implementation Plan
This plan organizes the implementation of the "AI Time Machine" (Track 2: Thousand Token Wood) for the Hugging Face Build Small Hackathon.
Updated: 2026-06-09
Decisions Made
- Architecture: Cloud-first. Together API for LLM, Modal for audio models (STT/TTS), Gradio on HF Spaces for UI.
- LLM Choice: Qwen3-8B via Together AI API (structured JSON outputs with native JSON mode).
- STT Choice: NVIDIA Nemotron 3.5 ASR Streaming 0.6B on Modal (NVIDIA sponsor prize requirement).
- TTS Strategy: Qwen3-TTS 1.7B VoiceDesign on Modal for production. Windows SAPI is the low-latency Walk default on Windows; Kokoro 82M remains available locally for voice-quality comparison.
- Dev Profile: Together API + local low-latency TTS + text input (no STT needed for dev).
- Visual Style: Immersive Steampunk cockpit (brass, copper, glowing edison bulbs, deep wood textures, circular portal).
- Division of Work: Track A (UI/Cockpit & Gradio Shell) vs Track B (Domain Models & AI Adapters).
Deployment Architecture
βββββββββββββββββββ ββββββββββββββββββββ βββββββββββββββββββ
β HF Spaces β β Together AI β β Modal β
β (Gradio UI) βββββ>β (Qwen3-8B LLM) β β (GPU Compute) β
β β β JSON mode β β β
β β ββββββββββββββββββββ β Nemotron STT β
β βββββββββββββββββββββββββββββ>β Qwen3-TTS β
βββββββββββββββββββ βββββββββββββββββββ
Key principle: Divide and conquer dependencies. Never mix NeMo (audio) with vLLM (LLM) in one environment. Use API providers for the LLM (structured JSON output built-in). Dedicate Modal exclusively to audio models.
Phased Execution: Walk / Run / Sprint
Walk (Tonight β Local Sync)
Goal: Validate logic end-to-end with real AI inference.
- Text input -> Together API (Qwen3) -> local TTS -> Audio playback
- Profile:
dev - No STT needed β type in Gradio's text box
- Validates: destination generation, persona generation, conversation, souvenir, TTS audio pipeline
- On Windows, Walk defaults to
TIME_MACHINE_DEV_TTS=sapifor low latency. SetTIME_MACHINE_DEV_TTS=kokoroto compare Kokoro voice quality. TIME_MACHINE_MAX_RESPONSE_CHARSoverrides the default short spoken reply cap of 260 characters.
$env:TIME_MACHINE_ADAPTER_PROFILE="dev"
$env:TIME_MACHINE_LLM_API_KEY="your-together-key"
python app.py
Local secret-file option for desktop development:
# .env (ignored by git)
TIME_MACHINE_ADAPTER_PROFILE=dev
TIME_MACHINE_LLM_API_KEY=your-together-key
# TOGETHER_API_KEY=your-together-key also works
TIME_MACHINE_LLM_MODEL=Qwen/Qwen2.5-7B-Instruct-Turbo
Optional Kokoro runtime for higher-quality local output (requires Python < 3.13; Python 3.13 uses the built-in WAV fallback):
.\.venv\Scripts\python.exe -m pip install -e ".[dev]"
$env:TIME_MACHINE_DEV_TTS="kokoro"
.\.venv\Scripts\python.exe scripts\walk_smoke.py
Note: Together currently reports Qwen/Qwen3-8B as requiring a dedicated endpoint on this account. The local Walk profile uses the serverless Qwen/Qwen2.5-7B-Instruct-Turbo model unless TIME_MACHINE_LLM_MODEL is set explicitly.
Latency note from local Windows testing: Kokoro took 13.41s to synthesize a 6.35s clip, while Windows SAPI synthesized a similar 6.65s clip in 1.87s. The bottleneck was local Kokoro throughput, not the cloud LLM.
Run (Tomorrow β Cloud Sync)
Goal: Full voice-first loop with real models on Modal.
- Push-to-talk microphone β Nemotron STT (Modal) β Together API (Qwen3) β Qwen3-TTS (Modal) β Audio playback
- Profile:
modal - Add microphone input component to Gradio UI
- Create Modal functions for Nemotron and Qwen3-TTS in isolated environments
Sprint (Tomorrow Night β Streaming)
Goal: Real-time streaming for a polished demo.
- Streaming audio β Nemotron partial transcripts β LLM token streaming β TTS audio chunks β Live playback
- Swap HTTP requests for WebSocket streaming where possible
- Only attempt if Run phase works flawlessly
Adapter Profiles
| Profile | LLM | STT | TTS | Use Case |
|---|---|---|---|---|
fixture |
Fixture data | Fixture | Fixture | Tests, UI dev |
dev |
Together API | Whisper (local) | SAPI on Windows, Kokoro optional | Dev testing |
local_models |
Qwen local (transformers) | Nemotron local (NeMo) | Kokoro (local) | Full local (needs big GPU) |
modal |
Together API | Nemotron (Modal) | Qwen3-TTS (Modal) | Production / hackathon submission |
Parameter Budget
| Model | Role | Parameters | Enabled |
|---|---|---|---|
| Qwen3-8B | LLM (via Together API) | 8.0B | β |
| Nemotron 3.5 ASR | STT (on Modal) | 0.6B | β |
| Qwen3-TTS 1.7B | TTS (on Modal) | 1.7B | β |
| Total | 10.3B | < 32B β |
Dev-only models (not counted for submission):
- Kokoro 82M (dev TTS fallback)
- Whisper base 74M (dev STT fallback)
Proposed Changes
Already Implemented β
[NEW] cloud_completion.py
- Cloud LLM completion function for any OpenAI-compatible API (Together, OpenRouter, etc.)
- Zero new dependencies (uses stdlib urllib)
- Injected into QwenStructuredLLMAdapter via existing
completion_fnhook
[NEW] whisper_stt.py
- Whisper STT adapter for dev/testing
- Drop-in replacement for NemotronStreamingSTTAdapter
- Same STTAdapter protocol
[MODIFY] container.py
- Added
devprofile wiring: Together API + Whisper STT + local TTS - Dev TTS defaults to Windows SAPI on Windows for Walk latency;
TIME_MACHINE_DEV_TTS=kokoropreserves the Kokoro path.
Still Needed
[NEW] Modal Nemotron STT endpoint
- Isolated Modal function running NeMo + Nemotron 3.5 ASR
- HTTP webhook callable from the Gradio app
- Accepts audio, returns transcript JSON
[NEW] Modal Qwen3-TTS endpoint
- Isolated Modal function running Qwen3-TTS 1.7B
- HTTP webhook callable from the Gradio app
- Accepts text + voice profile, returns audio
[NEW] Modal STT/TTS adapter wrappers
ModalNemotronSTTAdapterβ calls Modal webhook, returnsTranscriptModalQwenTTSAdapterβ calls Modal webhook, returnsAudioResult- Both implement existing port interfaces
[MODIFY] container.py
- Add
modalprofile wiring the Modal adapters
[MODIFY] gradio_app.py
- Add microphone input component for push-to-talk voice input
Verification Plan
Walk Phase
- Launch with
devprofile, type messages, verify real Qwen responses and audio playback - Run
scripts\walk_smoke.pyand inspect printed stage timings for launch, conversation, and TTS latency
Run Phase
- Start the Modal STT and TTS endpoints from PowerShell. The UTF-8 settings avoid Windows console encoding failures from Modal CLI status glyphs.
$env:MODAL_PROFILE="manikandanj"
$env:PYTHONUTF8="1"
$env:PYTHONIOENCODING="utf-8"
$env:TTY_COMPATIBLE="0"
.\.venv\Scripts\modal.exe serve scripts\modal_audio.py
- Keep Modal warm for the demo. The audio service now preloads Nemotron and Qwen3-TTS in class lifecycle hooks, keeps one GPU container warm by default, and runs a short Qwen3-TTS warmup during container startup. Override with:
$env:TIME_MACHINE_MODAL_MIN_CONTAINERS="1"
$env:TIME_MACHINE_MODAL_MAX_CONTAINERS="1"
$env:TIME_MACHINE_MODAL_SCALEDOWN_SECONDS="1800"
$env:TIME_MACHINE_MODAL_WARMUP_TTS="1"
- After
modal serveprints the STT and TTS endpoint URLs, setTIME_MACHINE_MODAL_STT_URLandTIME_MACHINE_MODAL_TTS_URL, then preflight both real-model endpoints before opening the demo:
.\.venv\Scripts\python.exe scripts\modal_warmup.py
- Test voice input β text β voice output loop
Automated Tests
pytest tests/unit/β domain models, JSON contract parsing, event stream ordering- Model budget compliance test (sum enabled params β€ 32B)
Manual Verification
- Launch the machine, type/speak to the generated character, bring back a souvenir
- Verify audio playback works in the Gradio UI