ai-time-machine / docs /implementation_plan.md
manikandanj's picture
Prepare AI Time Machine hackathon Space
5862322 verified
|
Raw
History Blame Contribute Delete
8.98 kB
# AI Time Machine Implementation Plan
This plan organizes the implementation of the "AI Time Machine" (Track 2: Thousand Token Wood) for the Hugging Face Build Small Hackathon.
Updated: 2026-06-09
## Decisions Made
- **Architecture**: Cloud-first. Together API for LLM, Modal for audio models (STT/TTS), Gradio on HF Spaces for UI.
- **LLM Choice**: Qwen3-8B via Together AI API (structured JSON outputs with native JSON mode).
- **STT Choice**: NVIDIA Nemotron 3.5 ASR Streaming 0.6B on Modal (NVIDIA sponsor prize requirement).
- **TTS Strategy**: Qwen3-TTS 1.7B VoiceDesign on Modal for production. Windows SAPI is the low-latency Walk default on Windows; Kokoro 82M remains available locally for voice-quality comparison.
- **Dev Profile**: Together API + local low-latency TTS + text input (no STT needed for dev).
- **Visual Style**: Immersive Steampunk cockpit (brass, copper, glowing edison bulbs, deep wood textures, circular portal).
- **Division of Work**: Track A (UI/Cockpit & Gradio Shell) vs Track B (Domain Models & AI Adapters).
## Deployment Architecture
```text
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ HF Spaces β”‚ β”‚ Together AI β”‚ β”‚ Modal β”‚
β”‚ (Gradio UI) │────>β”‚ (Qwen3-8B LLM) β”‚ β”‚ (GPU Compute) β”‚
β”‚ β”‚ β”‚ JSON mode β”‚ β”‚ β”‚
β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ Nemotron STT β”‚
β”‚ │────────────────────────────>β”‚ Qwen3-TTS β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
Key principle: **Divide and conquer dependencies.** Never mix NeMo (audio) with vLLM (LLM) in one environment. Use API providers for the LLM (structured JSON output built-in). Dedicate Modal exclusively to audio models.
## Phased Execution: Walk / Run / Sprint
### Walk (Tonight β€” Local Sync)
**Goal**: Validate logic end-to-end with real AI inference.
- Text input -> Together API (Qwen3) -> local TTS -> Audio playback
- Profile: `dev`
- No STT needed β€” type in Gradio's text box
- Validates: destination generation, persona generation, conversation, souvenir, TTS audio pipeline
- On Windows, Walk defaults to `TIME_MACHINE_DEV_TTS=sapi` for low latency. Set `TIME_MACHINE_DEV_TTS=kokoro` to compare Kokoro voice quality.
- `TIME_MACHINE_MAX_RESPONSE_CHARS` overrides the default short spoken reply cap of 260 characters.
```powershell
$env:TIME_MACHINE_ADAPTER_PROFILE="dev"
$env:TIME_MACHINE_LLM_API_KEY="your-together-key"
python app.py
```
Local secret-file option for desktop development:
```text
# .env (ignored by git)
TIME_MACHINE_ADAPTER_PROFILE=dev
TIME_MACHINE_LLM_API_KEY=your-together-key
# TOGETHER_API_KEY=your-together-key also works
TIME_MACHINE_LLM_MODEL=Qwen/Qwen2.5-7B-Instruct-Turbo
```
Optional Kokoro runtime for higher-quality local output (requires Python < 3.13; Python 3.13 uses the built-in WAV fallback):
```powershell
.\.venv\Scripts\python.exe -m pip install -e ".[dev]"
$env:TIME_MACHINE_DEV_TTS="kokoro"
.\.venv\Scripts\python.exe scripts\walk_smoke.py
```
Note: Together currently reports `Qwen/Qwen3-8B` as requiring a dedicated endpoint on this account. The local Walk profile uses the serverless `Qwen/Qwen2.5-7B-Instruct-Turbo` model unless `TIME_MACHINE_LLM_MODEL` is set explicitly.
Latency note from local Windows testing: Kokoro took 13.41s to synthesize a 6.35s clip, while Windows SAPI synthesized a similar 6.65s clip in 1.87s. The bottleneck was local Kokoro throughput, not the cloud LLM.
### Run (Tomorrow β€” Cloud Sync)
**Goal**: Full voice-first loop with real models on Modal.
- Push-to-talk microphone β†’ Nemotron STT (Modal) β†’ Together API (Qwen3) β†’ Qwen3-TTS (Modal) β†’ Audio playback
- Profile: `modal`
- Add microphone input component to Gradio UI
- Create Modal functions for Nemotron and Qwen3-TTS in isolated environments
### Sprint (Tomorrow Night β€” Streaming)
**Goal**: Real-time streaming for a polished demo.
- Streaming audio β†’ Nemotron partial transcripts β†’ LLM token streaming β†’ TTS audio chunks β†’ Live playback
- Swap HTTP requests for WebSocket streaming where possible
- Only attempt if Run phase works flawlessly
## Adapter Profiles
| Profile | LLM | STT | TTS | Use Case |
|---------|-----|-----|-----|----------|
| `fixture` | Fixture data | Fixture | Fixture | Tests, UI dev |
| `dev` | Together API | Whisper (local) | SAPI on Windows, Kokoro optional | Dev testing |
| `local_models` | Qwen local (transformers) | Nemotron local (NeMo) | Kokoro (local) | Full local (needs big GPU) |
| `modal` | Together API | Nemotron (Modal) | Qwen3-TTS (Modal) | Production / hackathon submission |
## Parameter Budget
| Model | Role | Parameters | Enabled |
|-------|------|------------|----------|
| Qwen3-8B | LLM (via Together API) | 8.0B | βœ… |
| Nemotron 3.5 ASR | STT (on Modal) | 0.6B | βœ… |
| Qwen3-TTS 1.7B | TTS (on Modal) | 1.7B | βœ… |
| **Total** | | **10.3B** | **< 32B βœ…** |
Dev-only models (not counted for submission):
- Kokoro 82M (dev TTS fallback)
- Whisper base 74M (dev STT fallback)
## Proposed Changes
### Already Implemented βœ…
#### [NEW] [cloud_completion.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/adapters/llm/cloud_completion.py)
- Cloud LLM completion function for any OpenAI-compatible API (Together, OpenRouter, etc.)
- Zero new dependencies (uses stdlib urllib)
- Injected into QwenStructuredLLMAdapter via existing `completion_fn` hook
#### [NEW] [whisper_stt.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/adapters/stt/whisper_stt.py)
- Whisper STT adapter for dev/testing
- Drop-in replacement for NemotronStreamingSTTAdapter
- Same STTAdapter protocol
#### [MODIFY] [container.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/application/container.py)
- Added `dev` profile wiring: Together API + Whisper STT + local TTS
- Dev TTS defaults to Windows SAPI on Windows for Walk latency; `TIME_MACHINE_DEV_TTS=kokoro` preserves the Kokoro path.
### Still Needed
#### [NEW] Modal Nemotron STT endpoint
- Isolated Modal function running NeMo + Nemotron 3.5 ASR
- HTTP webhook callable from the Gradio app
- Accepts audio, returns transcript JSON
#### [NEW] Modal Qwen3-TTS endpoint
- Isolated Modal function running Qwen3-TTS 1.7B
- HTTP webhook callable from the Gradio app
- Accepts text + voice profile, returns audio
#### [NEW] Modal STT/TTS adapter wrappers
- `ModalNemotronSTTAdapter` β€” calls Modal webhook, returns `Transcript`
- `ModalQwenTTSAdapter` β€” calls Modal webhook, returns `AudioResult`
- Both implement existing port interfaces
#### [MODIFY] [container.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/application/container.py)
- Add `modal` profile wiring the Modal adapters
#### [MODIFY] [gradio_app.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/ui/gradio_app.py)
- Add microphone input component for push-to-talk voice input
## Verification Plan
### Walk Phase
- Launch with `dev` profile, type messages, verify real Qwen responses and audio playback
- Run `scripts\walk_smoke.py` and inspect printed stage timings for launch, conversation, and TTS latency
### Run Phase
- Start the Modal STT and TTS endpoints from PowerShell. The UTF-8 settings avoid
Windows console encoding failures from Modal CLI status glyphs.
```powershell
$env:MODAL_PROFILE="manikandanj"
$env:PYTHONUTF8="1"
$env:PYTHONIOENCODING="utf-8"
$env:TTY_COMPATIBLE="0"
.\.venv\Scripts\modal.exe serve scripts\modal_audio.py
```
- Keep Modal warm for the demo. The audio service now preloads Nemotron and
Qwen3-TTS in class lifecycle hooks, keeps one GPU container warm by default,
and runs a short Qwen3-TTS warmup during container startup. Override with:
```powershell
$env:TIME_MACHINE_MODAL_MIN_CONTAINERS="1"
$env:TIME_MACHINE_MODAL_MAX_CONTAINERS="1"
$env:TIME_MACHINE_MODAL_SCALEDOWN_SECONDS="1800"
$env:TIME_MACHINE_MODAL_WARMUP_TTS="1"
```
- After `modal serve` prints the STT and TTS endpoint URLs, set
`TIME_MACHINE_MODAL_STT_URL` and `TIME_MACHINE_MODAL_TTS_URL`, then preflight
both real-model endpoints before opening the demo:
```powershell
.\.venv\Scripts\python.exe scripts\modal_warmup.py
```
- Test voice input β†’ text β†’ voice output loop
### Automated Tests
- `pytest tests/unit/` β€” domain models, JSON contract parsing, event stream ordering
- Model budget compliance test (sum enabled params ≀ 32B)
### Manual Verification
- Launch the machine, type/speak to the generated character, bring back a souvenir
- Verify audio playback works in the Gradio UI