File size: 8,980 Bytes
5862322
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
# AI Time Machine Implementation Plan

This plan organizes the implementation of the "AI Time Machine" (Track 2: Thousand Token Wood) for the Hugging Face Build Small Hackathon.

Updated: 2026-06-09

## Decisions Made

- **Architecture**: Cloud-first. Together API for LLM, Modal for audio models (STT/TTS), Gradio on HF Spaces for UI.
- **LLM Choice**: Qwen3-8B via Together AI API (structured JSON outputs with native JSON mode).
- **STT Choice**: NVIDIA Nemotron 3.5 ASR Streaming 0.6B on Modal (NVIDIA sponsor prize requirement).
- **TTS Strategy**: Qwen3-TTS 1.7B VoiceDesign on Modal for production. Windows SAPI is the low-latency Walk default on Windows; Kokoro 82M remains available locally for voice-quality comparison.
- **Dev Profile**: Together API + local low-latency TTS + text input (no STT needed for dev).
- **Visual Style**: Immersive Steampunk cockpit (brass, copper, glowing edison bulbs, deep wood textures, circular portal).
- **Division of Work**: Track A (UI/Cockpit & Gradio Shell) vs Track B (Domain Models & AI Adapters).

## Deployment Architecture

```text
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   HF Spaces     β”‚     β”‚   Together AI     β”‚     β”‚     Modal       β”‚
β”‚   (Gradio UI)   │────>β”‚   (Qwen3-8B LLM)  β”‚     β”‚  (GPU Compute)  β”‚
β”‚                 β”‚     β”‚   JSON mode       β”‚     β”‚                 β”‚
β”‚                 β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚  Nemotron STT   β”‚
β”‚                 │────────────────────────────>β”‚  Qwen3-TTS      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

Key principle: **Divide and conquer dependencies.** Never mix NeMo (audio) with vLLM (LLM) in one environment. Use API providers for the LLM (structured JSON output built-in). Dedicate Modal exclusively to audio models.

## Phased Execution: Walk / Run / Sprint

### Walk (Tonight β€” Local Sync)
**Goal**: Validate logic end-to-end with real AI inference.

- Text input -> Together API (Qwen3) -> local TTS -> Audio playback
- Profile: `dev`
- No STT needed β€” type in Gradio's text box
- Validates: destination generation, persona generation, conversation, souvenir, TTS audio pipeline
- On Windows, Walk defaults to `TIME_MACHINE_DEV_TTS=sapi` for low latency. Set `TIME_MACHINE_DEV_TTS=kokoro` to compare Kokoro voice quality.
- `TIME_MACHINE_MAX_RESPONSE_CHARS` overrides the default short spoken reply cap of 260 characters.

```powershell
$env:TIME_MACHINE_ADAPTER_PROFILE="dev"
$env:TIME_MACHINE_LLM_API_KEY="your-together-key"
python app.py
```

Local secret-file option for desktop development:

```text
# .env (ignored by git)
TIME_MACHINE_ADAPTER_PROFILE=dev
TIME_MACHINE_LLM_API_KEY=your-together-key
# TOGETHER_API_KEY=your-together-key also works
TIME_MACHINE_LLM_MODEL=Qwen/Qwen2.5-7B-Instruct-Turbo
```

Optional Kokoro runtime for higher-quality local output (requires Python < 3.13; Python 3.13 uses the built-in WAV fallback):

```powershell
.\.venv\Scripts\python.exe -m pip install -e ".[dev]"
$env:TIME_MACHINE_DEV_TTS="kokoro"
.\.venv\Scripts\python.exe scripts\walk_smoke.py
```

Note: Together currently reports `Qwen/Qwen3-8B` as requiring a dedicated endpoint on this account. The local Walk profile uses the serverless `Qwen/Qwen2.5-7B-Instruct-Turbo` model unless `TIME_MACHINE_LLM_MODEL` is set explicitly.
Latency note from local Windows testing: Kokoro took 13.41s to synthesize a 6.35s clip, while Windows SAPI synthesized a similar 6.65s clip in 1.87s. The bottleneck was local Kokoro throughput, not the cloud LLM.

### Run (Tomorrow β€” Cloud Sync)
**Goal**: Full voice-first loop with real models on Modal.

- Push-to-talk microphone β†’ Nemotron STT (Modal) β†’ Together API (Qwen3) β†’ Qwen3-TTS (Modal) β†’ Audio playback
- Profile: `modal`
- Add microphone input component to Gradio UI
- Create Modal functions for Nemotron and Qwen3-TTS in isolated environments

### Sprint (Tomorrow Night β€” Streaming)
**Goal**: Real-time streaming for a polished demo.

- Streaming audio β†’ Nemotron partial transcripts β†’ LLM token streaming β†’ TTS audio chunks β†’ Live playback
- Swap HTTP requests for WebSocket streaming where possible
- Only attempt if Run phase works flawlessly

## Adapter Profiles

| Profile | LLM | STT | TTS | Use Case |
|---------|-----|-----|-----|----------|
| `fixture` | Fixture data | Fixture | Fixture | Tests, UI dev |
| `dev` | Together API | Whisper (local) | SAPI on Windows, Kokoro optional | Dev testing |
| `local_models` | Qwen local (transformers) | Nemotron local (NeMo) | Kokoro (local) | Full local (needs big GPU) |
| `modal` | Together API | Nemotron (Modal) | Qwen3-TTS (Modal) | Production / hackathon submission |

## Parameter Budget

| Model | Role | Parameters | Enabled |
|-------|------|------------|----------|
| Qwen3-8B | LLM (via Together API) | 8.0B | βœ… |
| Nemotron 3.5 ASR | STT (on Modal) | 0.6B | βœ… |
| Qwen3-TTS 1.7B | TTS (on Modal) | 1.7B | βœ… |
| **Total** | | **10.3B** | **< 32B βœ…** |

Dev-only models (not counted for submission):
- Kokoro 82M (dev TTS fallback)
- Whisper base 74M (dev STT fallback)

## Proposed Changes

### Already Implemented βœ…

#### [NEW] [cloud_completion.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/adapters/llm/cloud_completion.py)
- Cloud LLM completion function for any OpenAI-compatible API (Together, OpenRouter, etc.)
- Zero new dependencies (uses stdlib urllib)
- Injected into QwenStructuredLLMAdapter via existing `completion_fn` hook

#### [NEW] [whisper_stt.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/adapters/stt/whisper_stt.py)
- Whisper STT adapter for dev/testing
- Drop-in replacement for NemotronStreamingSTTAdapter
- Same STTAdapter protocol

#### [MODIFY] [container.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/application/container.py)
- Added `dev` profile wiring: Together API + Whisper STT + local TTS
- Dev TTS defaults to Windows SAPI on Windows for Walk latency; `TIME_MACHINE_DEV_TTS=kokoro` preserves the Kokoro path.

### Still Needed

#### [NEW] Modal Nemotron STT endpoint
- Isolated Modal function running NeMo + Nemotron 3.5 ASR
- HTTP webhook callable from the Gradio app
- Accepts audio, returns transcript JSON

#### [NEW] Modal Qwen3-TTS endpoint
- Isolated Modal function running Qwen3-TTS 1.7B
- HTTP webhook callable from the Gradio app
- Accepts text + voice profile, returns audio

#### [NEW] Modal STT/TTS adapter wrappers
- `ModalNemotronSTTAdapter` β€” calls Modal webhook, returns `Transcript`
- `ModalQwenTTSAdapter` β€” calls Modal webhook, returns `AudioResult`
- Both implement existing port interfaces

#### [MODIFY] [container.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/application/container.py)
- Add `modal` profile wiring the Modal adapters

#### [MODIFY] [gradio_app.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/ui/gradio_app.py)
- Add microphone input component for push-to-talk voice input

## Verification Plan

### Walk Phase
- Launch with `dev` profile, type messages, verify real Qwen responses and audio playback
- Run `scripts\walk_smoke.py` and inspect printed stage timings for launch, conversation, and TTS latency

### Run Phase
- Start the Modal STT and TTS endpoints from PowerShell. The UTF-8 settings avoid
  Windows console encoding failures from Modal CLI status glyphs.

```powershell
$env:MODAL_PROFILE="manikandanj"
$env:PYTHONUTF8="1"
$env:PYTHONIOENCODING="utf-8"
$env:TTY_COMPATIBLE="0"
.\.venv\Scripts\modal.exe serve scripts\modal_audio.py
```

- Keep Modal warm for the demo. The audio service now preloads Nemotron and
  Qwen3-TTS in class lifecycle hooks, keeps one GPU container warm by default,
  and runs a short Qwen3-TTS warmup during container startup. Override with:

```powershell
$env:TIME_MACHINE_MODAL_MIN_CONTAINERS="1"
$env:TIME_MACHINE_MODAL_MAX_CONTAINERS="1"
$env:TIME_MACHINE_MODAL_SCALEDOWN_SECONDS="1800"
$env:TIME_MACHINE_MODAL_WARMUP_TTS="1"
```

- After `modal serve` prints the STT and TTS endpoint URLs, set
  `TIME_MACHINE_MODAL_STT_URL` and `TIME_MACHINE_MODAL_TTS_URL`, then preflight
  both real-model endpoints before opening the demo:

```powershell
.\.venv\Scripts\python.exe scripts\modal_warmup.py
```

- Test voice input β†’ text β†’ voice output loop

### Automated Tests
- `pytest tests/unit/` β€” domain models, JSON contract parsing, event stream ordering
- Model budget compliance test (sum enabled params ≀ 32B)

### Manual Verification
- Launch the machine, type/speak to the generated character, bring back a souvenir
- Verify audio playback works in the Gradio UI