Spaces:

build-small-hackathon
/

ai-time-machine

Running

File size: 8,980 Bytes
# AI Time Machine Implementation Plan

This plan organizes the implementation of the "AI Time Machine" (Track 2: Thousand Token Wood) for the Hugging Face Build Small Hackathon.

Updated: 2026-06-09

## Decisions Made

- **Architecture**: Cloud-first. Together API for LLM, Modal for audio models (STT/TTS), Gradio on HF Spaces for UI.
- **LLM Choice**: Qwen3-8B via Together AI API (structured JSON outputs with native JSON mode).
- **STT Choice**: NVIDIA Nemotron 3.5 ASR Streaming 0.6B on Modal (NVIDIA sponsor prize requirement).
- **TTS Strategy**: Qwen3-TTS 1.7B VoiceDesign on Modal for production. Windows SAPI is the low-latency Walk default on Windows; Kokoro 82M remains available locally for voice-quality comparison.
- **Dev Profile**: Together API + local low-latency TTS + text input (no STT needed for dev).
- **Visual Style**: Immersive Steampunk cockpit (brass, copper, glowing edison bulbs, deep wood textures, circular portal).
- **Division of Work**: Track A (UI/Cockpit & Gradio Shell) vs Track B (Domain Models & AI Adapters).

## Deployment Architecture

```text
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   HF Spaces     │     │   Together AI     │     │     Modal       │
│   (Gradio UI)   │────>│   (Qwen3-8B LLM)  │     │  (GPU Compute)  │
│                 │     │   JSON mode       │     │                 │
│                 │     └──────────────────┘     │  Nemotron STT   │
│                 │────────────────────────────>│  Qwen3-TTS      │
└─────────────────┘                              └─────────────────┘
```

Key principle: **Divide and conquer dependencies.** Never mix NeMo (audio) with vLLM (LLM) in one environment. Use API providers for the LLM (structured JSON output built-in). Dedicate Modal exclusively to audio models.

## Phased Execution: Walk / Run / Sprint

### Walk (Tonight — Local Sync)
**Goal**: Validate logic end-to-end with real AI inference.

- Text input -> Together API (Qwen3) -> local TTS -> Audio playback
- Profile: `dev`
- No STT needed — type in Gradio's text box
- Validates: destination generation, persona generation, conversation, souvenir, TTS audio pipeline
- On Windows, Walk defaults to `TIME_MACHINE_DEV_TTS=sapi` for low latency. Set `TIME_MACHINE_DEV_TTS=kokoro` to compare Kokoro voice quality.
- `TIME_MACHINE_MAX_RESPONSE_CHARS` overrides the default short spoken reply cap of 260 characters.

```powershell
$env:TIME_MACHINE_ADAPTER_PROFILE="dev"
$env:TIME_MACHINE_LLM_API_KEY="your-together-key"
python app.py
```

Local secret-file option for desktop development:

```text
# .env (ignored by git)
TIME_MACHINE_ADAPTER_PROFILE=dev
TIME_MACHINE_LLM_API_KEY=your-together-key
# TOGETHER_API_KEY=your-together-key also works
TIME_MACHINE_LLM_MODEL=Qwen/Qwen2.5-7B-Instruct-Turbo
```

Optional Kokoro runtime for higher-quality local output (requires Python < 3.13; Python 3.13 uses the built-in WAV fallback):

```powershell
.\.venv\Scripts\python.exe -m pip install -e ".[dev]"
$env:TIME_MACHINE_DEV_TTS="kokoro"
.\.venv\Scripts\python.exe scripts\walk_smoke.py
```

Note: Together currently reports `Qwen/Qwen3-8B` as requiring a dedicated endpoint on this account. The local Walk profile uses the serverless `Qwen/Qwen2.5-7B-Instruct-Turbo` model unless `TIME_MACHINE_LLM_MODEL` is set explicitly.
Latency note from local Windows testing: Kokoro took 13.41s to synthesize a 6.35s clip, while Windows SAPI synthesized a similar 6.65s clip in 1.87s. The bottleneck was local Kokoro throughput, not the cloud LLM.

### Run (Tomorrow — Cloud Sync)
**Goal**: Full voice-first loop with real models on Modal.

- Push-to-talk microphone → Nemotron STT (Modal) → Together API (Qwen3) → Qwen3-TTS (Modal) → Audio playback
- Profile: `modal`
- Add microphone input component to Gradio UI
- Create Modal functions for Nemotron and Qwen3-TTS in isolated environments

### Sprint (Tomorrow Night — Streaming)
**Goal**: Real-time streaming for a polished demo.

- Streaming audio → Nemotron partial transcripts → LLM token streaming → TTS audio chunks → Live playback
- Swap HTTP requests for WebSocket streaming where possible
- Only attempt if Run phase works flawlessly

## Adapter Profiles

| Profile | LLM | STT | TTS | Use Case |
|---------|-----|-----|-----|----------|
| `fixture` | Fixture data | Fixture | Fixture | Tests, UI dev |
| `dev` | Together API | Whisper (local) | SAPI on Windows, Kokoro optional | Dev testing |
| `local_models` | Qwen local (transformers) | Nemotron local (NeMo) | Kokoro (local) | Full local (needs big GPU) |
| `modal` | Together API | Nemotron (Modal) | Qwen3-TTS (Modal) | Production / hackathon submission |

## Parameter Budget

| Model | Role | Parameters | Enabled |
|-------|------|------------|----------|
| Qwen3-8B | LLM (via Together API) | 8.0B | ✅ |
| Nemotron 3.5 ASR | STT (on Modal) | 0.6B | ✅ |
| Qwen3-TTS 1.7B | TTS (on Modal) | 1.7B | ✅ |
| **Total** | | **10.3B** | **< 32B ✅** |

Dev-only models (not counted for submission):
- Kokoro 82M (dev TTS fallback)
- Whisper base 74M (dev STT fallback)

## Proposed Changes

### Already Implemented ✅

#### [NEW] [cloud_completion.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/adapters/llm/cloud_completion.py)
- Cloud LLM completion function for any OpenAI-compatible API (Together, OpenRouter, etc.)
- Zero new dependencies (uses stdlib urllib)
- Injected into QwenStructuredLLMAdapter via existing `completion_fn` hook

#### [NEW] [whisper_stt.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/adapters/stt/whisper_stt.py)
- Whisper STT adapter for dev/testing
- Drop-in replacement for NemotronStreamingSTTAdapter
- Same STTAdapter protocol

#### [MODIFY] [container.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/application/container.py)
- Added `dev` profile wiring: Together API + Whisper STT + local TTS
- Dev TTS defaults to Windows SAPI on Windows for Walk latency; `TIME_MACHINE_DEV_TTS=kokoro` preserves the Kokoro path.

### Still Needed

#### [NEW] Modal Nemotron STT endpoint
- Isolated Modal function running NeMo + Nemotron 3.5 ASR
- HTTP webhook callable from the Gradio app
- Accepts audio, returns transcript JSON

#### [NEW] Modal Qwen3-TTS endpoint
- Isolated Modal function running Qwen3-TTS 1.7B
- HTTP webhook callable from the Gradio app
- Accepts text + voice profile, returns audio

#### [NEW] Modal STT/TTS adapter wrappers
- `ModalNemotronSTTAdapter` — calls Modal webhook, returns `Transcript`
- `ModalQwenTTSAdapter` — calls Modal webhook, returns `AudioResult`
- Both implement existing port interfaces

#### [MODIFY] [container.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/application/container.py)
- Add `modal` profile wiring the Modal adapters

#### [MODIFY] [gradio_app.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/ui/gradio_app.py)
- Add microphone input component for push-to-talk voice input

## Verification Plan

### Walk Phase
- Launch with `dev` profile, type messages, verify real Qwen responses and audio playback
- Run `scripts\walk_smoke.py` and inspect printed stage timings for launch, conversation, and TTS latency

### Run Phase
- Start the Modal STT and TTS endpoints from PowerShell. The UTF-8 settings avoid
  Windows console encoding failures from Modal CLI status glyphs.

```powershell
$env:MODAL_PROFILE="manikandanj"
$env:PYTHONUTF8="1"
$env:PYTHONIOENCODING="utf-8"
$env:TTY_COMPATIBLE="0"
.\.venv\Scripts\modal.exe serve scripts\modal_audio.py
```

- Keep Modal warm for the demo. The audio service now preloads Nemotron and
  Qwen3-TTS in class lifecycle hooks, keeps one GPU container warm by default,
  and runs a short Qwen3-TTS warmup during container startup. Override with:

```powershell
$env:TIME_MACHINE_MODAL_MIN_CONTAINERS="1"
$env:TIME_MACHINE_MODAL_MAX_CONTAINERS="1"
$env:TIME_MACHINE_MODAL_SCALEDOWN_SECONDS="1800"
$env:TIME_MACHINE_MODAL_WARMUP_TTS="1"
```

- After `modal serve` prints the STT and TTS endpoint URLs, set
  `TIME_MACHINE_MODAL_STT_URL` and `TIME_MACHINE_MODAL_TTS_URL`, then preflight
  both real-model endpoints before opening the demo:

```powershell
.\.venv\Scripts\python.exe scripts\modal_warmup.py
```

- Test voice input → text → voice output loop

### Automated Tests
- `pytest tests/unit/` — domain models, JSON contract parsing, event stream ordering
- Model budget compliance test (sum enabled params ≤ 32B)

### Manual Verification
- Launch the machine, type/speak to the generated character, bring back a souvenir
- Verify audio playback works in the Gradio UI