File size: 8,980 Bytes
5862322 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 | # AI Time Machine Implementation Plan
This plan organizes the implementation of the "AI Time Machine" (Track 2: Thousand Token Wood) for the Hugging Face Build Small Hackathon.
Updated: 2026-06-09
## Decisions Made
- **Architecture**: Cloud-first. Together API for LLM, Modal for audio models (STT/TTS), Gradio on HF Spaces for UI.
- **LLM Choice**: Qwen3-8B via Together AI API (structured JSON outputs with native JSON mode).
- **STT Choice**: NVIDIA Nemotron 3.5 ASR Streaming 0.6B on Modal (NVIDIA sponsor prize requirement).
- **TTS Strategy**: Qwen3-TTS 1.7B VoiceDesign on Modal for production. Windows SAPI is the low-latency Walk default on Windows; Kokoro 82M remains available locally for voice-quality comparison.
- **Dev Profile**: Together API + local low-latency TTS + text input (no STT needed for dev).
- **Visual Style**: Immersive Steampunk cockpit (brass, copper, glowing edison bulbs, deep wood textures, circular portal).
- **Division of Work**: Track A (UI/Cockpit & Gradio Shell) vs Track B (Domain Models & AI Adapters).
## Deployment Architecture
```text
βββββββββββββββββββ ββββββββββββββββββββ βββββββββββββββββββ
β HF Spaces β β Together AI β β Modal β
β (Gradio UI) βββββ>β (Qwen3-8B LLM) β β (GPU Compute) β
β β β JSON mode β β β
β β ββββββββββββββββββββ β Nemotron STT β
β βββββββββββββββββββββββββββββ>β Qwen3-TTS β
βββββββββββββββββββ βββββββββββββββββββ
```
Key principle: **Divide and conquer dependencies.** Never mix NeMo (audio) with vLLM (LLM) in one environment. Use API providers for the LLM (structured JSON output built-in). Dedicate Modal exclusively to audio models.
## Phased Execution: Walk / Run / Sprint
### Walk (Tonight β Local Sync)
**Goal**: Validate logic end-to-end with real AI inference.
- Text input -> Together API (Qwen3) -> local TTS -> Audio playback
- Profile: `dev`
- No STT needed β type in Gradio's text box
- Validates: destination generation, persona generation, conversation, souvenir, TTS audio pipeline
- On Windows, Walk defaults to `TIME_MACHINE_DEV_TTS=sapi` for low latency. Set `TIME_MACHINE_DEV_TTS=kokoro` to compare Kokoro voice quality.
- `TIME_MACHINE_MAX_RESPONSE_CHARS` overrides the default short spoken reply cap of 260 characters.
```powershell
$env:TIME_MACHINE_ADAPTER_PROFILE="dev"
$env:TIME_MACHINE_LLM_API_KEY="your-together-key"
python app.py
```
Local secret-file option for desktop development:
```text
# .env (ignored by git)
TIME_MACHINE_ADAPTER_PROFILE=dev
TIME_MACHINE_LLM_API_KEY=your-together-key
# TOGETHER_API_KEY=your-together-key also works
TIME_MACHINE_LLM_MODEL=Qwen/Qwen2.5-7B-Instruct-Turbo
```
Optional Kokoro runtime for higher-quality local output (requires Python < 3.13; Python 3.13 uses the built-in WAV fallback):
```powershell
.\.venv\Scripts\python.exe -m pip install -e ".[dev]"
$env:TIME_MACHINE_DEV_TTS="kokoro"
.\.venv\Scripts\python.exe scripts\walk_smoke.py
```
Note: Together currently reports `Qwen/Qwen3-8B` as requiring a dedicated endpoint on this account. The local Walk profile uses the serverless `Qwen/Qwen2.5-7B-Instruct-Turbo` model unless `TIME_MACHINE_LLM_MODEL` is set explicitly.
Latency note from local Windows testing: Kokoro took 13.41s to synthesize a 6.35s clip, while Windows SAPI synthesized a similar 6.65s clip in 1.87s. The bottleneck was local Kokoro throughput, not the cloud LLM.
### Run (Tomorrow β Cloud Sync)
**Goal**: Full voice-first loop with real models on Modal.
- Push-to-talk microphone β Nemotron STT (Modal) β Together API (Qwen3) β Qwen3-TTS (Modal) β Audio playback
- Profile: `modal`
- Add microphone input component to Gradio UI
- Create Modal functions for Nemotron and Qwen3-TTS in isolated environments
### Sprint (Tomorrow Night β Streaming)
**Goal**: Real-time streaming for a polished demo.
- Streaming audio β Nemotron partial transcripts β LLM token streaming β TTS audio chunks β Live playback
- Swap HTTP requests for WebSocket streaming where possible
- Only attempt if Run phase works flawlessly
## Adapter Profiles
| Profile | LLM | STT | TTS | Use Case |
|---------|-----|-----|-----|----------|
| `fixture` | Fixture data | Fixture | Fixture | Tests, UI dev |
| `dev` | Together API | Whisper (local) | SAPI on Windows, Kokoro optional | Dev testing |
| `local_models` | Qwen local (transformers) | Nemotron local (NeMo) | Kokoro (local) | Full local (needs big GPU) |
| `modal` | Together API | Nemotron (Modal) | Qwen3-TTS (Modal) | Production / hackathon submission |
## Parameter Budget
| Model | Role | Parameters | Enabled |
|-------|------|------------|----------|
| Qwen3-8B | LLM (via Together API) | 8.0B | β
|
| Nemotron 3.5 ASR | STT (on Modal) | 0.6B | β
|
| Qwen3-TTS 1.7B | TTS (on Modal) | 1.7B | β
|
| **Total** | | **10.3B** | **< 32B β
** |
Dev-only models (not counted for submission):
- Kokoro 82M (dev TTS fallback)
- Whisper base 74M (dev STT fallback)
## Proposed Changes
### Already Implemented β
#### [NEW] [cloud_completion.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/adapters/llm/cloud_completion.py)
- Cloud LLM completion function for any OpenAI-compatible API (Together, OpenRouter, etc.)
- Zero new dependencies (uses stdlib urllib)
- Injected into QwenStructuredLLMAdapter via existing `completion_fn` hook
#### [NEW] [whisper_stt.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/adapters/stt/whisper_stt.py)
- Whisper STT adapter for dev/testing
- Drop-in replacement for NemotronStreamingSTTAdapter
- Same STTAdapter protocol
#### [MODIFY] [container.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/application/container.py)
- Added `dev` profile wiring: Together API + Whisper STT + local TTS
- Dev TTS defaults to Windows SAPI on Windows for Walk latency; `TIME_MACHINE_DEV_TTS=kokoro` preserves the Kokoro path.
### Still Needed
#### [NEW] Modal Nemotron STT endpoint
- Isolated Modal function running NeMo + Nemotron 3.5 ASR
- HTTP webhook callable from the Gradio app
- Accepts audio, returns transcript JSON
#### [NEW] Modal Qwen3-TTS endpoint
- Isolated Modal function running Qwen3-TTS 1.7B
- HTTP webhook callable from the Gradio app
- Accepts text + voice profile, returns audio
#### [NEW] Modal STT/TTS adapter wrappers
- `ModalNemotronSTTAdapter` β calls Modal webhook, returns `Transcript`
- `ModalQwenTTSAdapter` β calls Modal webhook, returns `AudioResult`
- Both implement existing port interfaces
#### [MODIFY] [container.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/application/container.py)
- Add `modal` profile wiring the Modal adapters
#### [MODIFY] [gradio_app.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/ui/gradio_app.py)
- Add microphone input component for push-to-talk voice input
## Verification Plan
### Walk Phase
- Launch with `dev` profile, type messages, verify real Qwen responses and audio playback
- Run `scripts\walk_smoke.py` and inspect printed stage timings for launch, conversation, and TTS latency
### Run Phase
- Start the Modal STT and TTS endpoints from PowerShell. The UTF-8 settings avoid
Windows console encoding failures from Modal CLI status glyphs.
```powershell
$env:MODAL_PROFILE="manikandanj"
$env:PYTHONUTF8="1"
$env:PYTHONIOENCODING="utf-8"
$env:TTY_COMPATIBLE="0"
.\.venv\Scripts\modal.exe serve scripts\modal_audio.py
```
- Keep Modal warm for the demo. The audio service now preloads Nemotron and
Qwen3-TTS in class lifecycle hooks, keeps one GPU container warm by default,
and runs a short Qwen3-TTS warmup during container startup. Override with:
```powershell
$env:TIME_MACHINE_MODAL_MIN_CONTAINERS="1"
$env:TIME_MACHINE_MODAL_MAX_CONTAINERS="1"
$env:TIME_MACHINE_MODAL_SCALEDOWN_SECONDS="1800"
$env:TIME_MACHINE_MODAL_WARMUP_TTS="1"
```
- After `modal serve` prints the STT and TTS endpoint URLs, set
`TIME_MACHINE_MODAL_STT_URL` and `TIME_MACHINE_MODAL_TTS_URL`, then preflight
both real-model endpoints before opening the demo:
```powershell
.\.venv\Scripts\python.exe scripts\modal_warmup.py
```
- Test voice input β text β voice output loop
### Automated Tests
- `pytest tests/unit/` β domain models, JSON contract parsing, event stream ordering
- Model budget compliance test (sum enabled params β€ 32B)
### Manual Verification
- Launch the machine, type/speak to the generated character, bring back a souvenir
- Verify audio playback works in the Gradio UI
|