ai-time-machine / docs /implementation_plan.md

manikandanj

Prepare AI Time Machine hackathon Space

5862322 verified 13 days ago

	# AI Time Machine Implementation Plan

	This plan organizes the implementation of the "AI Time Machine" (Track 2: Thousand Token Wood) for the Hugging Face Build Small Hackathon.

	Updated: 2026-06-09

	## Decisions Made

	- Architecture: Cloud-first. Together API for LLM, Modal for audio models (STT/TTS), Gradio on HF Spaces for UI.
	- LLM Choice: Qwen3-8B via Together AI API (structured JSON outputs with native JSON mode).
	- STT Choice: NVIDIA Nemotron 3.5 ASR Streaming 0.6B on Modal (NVIDIA sponsor prize requirement).
	- TTS Strategy: Qwen3-TTS 1.7B VoiceDesign on Modal for production. Windows SAPI is the low-latency Walk default on Windows; Kokoro 82M remains available locally for voice-quality comparison.
	- Dev Profile: Together API + local low-latency TTS + text input (no STT needed for dev).
	- Visual Style: Immersive Steampunk cockpit (brass, copper, glowing Edison bulbs, deep wood textures, circular portal).
	- Division of Work: Track A (UI/Cockpit & Gradio Shell) vs Track B (Domain Models & AI Adapters).

	## Deployment Architecture

	```text
	┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
	│    HF Spaces    │     │   Together AI    │     │      Modal      │
	│   (Gradio UI)   │────>│  (Qwen3-8B LLM)  │     │  (GPU Compute)  │
	│                 │     │    JSON mode     │     │                 │
	│                 │     └──────────────────┘     │  Nemotron STT   │
	│                 │─────────────────────────────>│  Qwen3-TTS      │
	└─────────────────┘                              └─────────────────┘
	```

	Key principle: Divide and conquer dependencies. Never mix NeMo (audio) with vLLM (LLM) in one environment. Use API providers for the LLM (structured JSON output built-in). Dedicate Modal exclusively to audio models.

	## Phased Execution: Walk / Run / Sprint

	### Walk (Tonight — Local Sync)
	Goal: Validate logic end-to-end with real AI inference.

	- Text input → Together API (Qwen3) → local TTS → Audio playback
	- Profile: `dev`
	- No STT needed — type in Gradio's text box
	- Validates: destination generation, persona generation, conversation, souvenir, TTS audio pipeline
	- On Windows, Walk defaults to `TIME_MACHINE_DEV_TTS=sapi` for low latency. Set `TIME_MACHINE_DEV_TTS=kokoro` to compare Kokoro voice quality.
	- `TIME_MACHINE_MAX_RESPONSE_CHARS` overrides the default short spoken reply cap of 260 characters.

	```powershell
	$env:TIME_MACHINE_ADAPTER_PROFILE="dev"
	$env:TIME_MACHINE_LLM_API_KEY="your-together-key"
	python app.py
	```
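The environment variables in the Walk bullets could be resolved along these lines. This is a minimal sketch, not the project's actual code: the helper names (`resolve_dev_tts`, `resolve_max_response_chars`, `cap_reply`) are hypothetical, and the non-Windows default is an assumption because the plan only specifies the Windows behavior.

```python
import os
import sys

DEFAULT_MAX_RESPONSE_CHARS = 260  # default spoken-reply cap stated in the plan


def resolve_dev_tts(env=None, platform=None):
    """Pick the dev TTS backend: an explicit env value wins, else a platform default."""
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform
    choice = env.get("TIME_MACHINE_DEV_TTS", "").strip().lower()
    if choice:
        return choice
    # Assumption: non-Windows hosts fall back to Kokoro. The plan only
    # documents the Windows default (SAPI).
    return "sapi" if platform.startswith("win") else "kokoro"


def resolve_max_response_chars(env=None):
    """Read the reply cap, falling back to the documented default of 260."""
    env = os.environ if env is None else env
    raw = env.get("TIME_MACHINE_MAX_RESPONSE_CHARS", "").strip()
    if not raw:
        return DEFAULT_MAX_RESPONSE_CHARS
    value = int(raw)
    if value <= 0:
        raise ValueError("TIME_MACHINE_MAX_RESPONSE_CHARS must be positive")
    return value


def cap_reply(text, limit):
    """Trim a reply to at most `limit` characters, preferring a word boundary."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    space = cut.rfind(" ")
    return (cut[:space] if space > 0 else cut).rstrip()
```

Passing `env` and `platform` explicitly keeps the helpers easy to unit test without touching the real process environment.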

	Local secret-file option for desktop development:

	```text
	# .env (ignored by git)
	TIME_MACHINE_ADAPTER_PROFILE=dev
	TIME_MACHINE_LLM_API_KEY=your-together-key
	# TOGETHER_API_KEY=your-together-key also works
	TIME_MACHINE_LLM_MODEL=Qwen/Qwen2.5-7B-Instruct-Turbo
	```
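For illustration, a `.env` file like the one above can be read with the standard library alone. The sketch below is hypothetical (the function names are not from the project) and handles only plain `KEY=VALUE` lines; it mirrors the documented behavior that `TOGETHER_API_KEY` is accepted as a fallback and that the serverless Qwen2.5 model is used unless `TIME_MACHINE_LLM_MODEL` is set.

```python
DEFAULT_WALK_MODEL = "Qwen/Qwen2.5-7B-Instruct-Turbo"  # Walk default named in the plan


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blanks and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_llm_settings(values):
    """Return the API key and model, honoring the TOGETHER_API_KEY fallback."""
    api_key = values.get("TIME_MACHINE_LLM_API_KEY") or values.get("TOGETHER_API_KEY")
    if not api_key:
        raise KeyError("Set TIME_MACHINE_LLM_API_KEY or TOGETHER_API_KEY")
    model = values.get("TIME_MACHINE_LLM_MODEL") or DEFAULT_WALK_MODEL
    return {"api_key": api_key, "model": model}
```

A real loader would typically merge these values with `os.environ`; which source takes precedence is a design choice the plan does not specify.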

	Optional Kokoro runtime for higher-quality local output (requires Python < 3.13; Python 3.13 uses the built-in WAV fallback):

	```powershell
	.\.venv\Scripts\python.exe -m pip install -e ".[dev]"
	$env:TIME_MACHINE_DEV_TTS="kokoro"
	.\.venv\Scripts\python.exe scripts\walk_smoke.py
	```

	Note: Together currently reports that `Qwen/Qwen3-8B` requires a dedicated endpoint on this account, so the local Walk profile uses the serverless `Qwen/Qwen2.5-7B-Instruct-Turbo` model unless `TIME_MACHINE_LLM_MODEL` is set explicitly.

	Latency note from local Windows testing: Kokoro took 13.41s to synthesize a 6.35s clip (about 2.1x real time), while Windows SAPI synthesized a similar 6.65s clip in 1.87s (about 0.28x real time). The bottleneck was local Kokoro throughput, not the cloud LLM.
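One way to compare the two TTS measurements is the real-time factor (synthesis time divided by clip duration), where values above 1.0 mean synthesis is slower than playback. The helper name below is illustrative; the numbers are the ones reported in the latency note.

```python
def real_time_factor(synthesis_seconds, clip_seconds):
    """Seconds of compute needed per second of generated audio."""
    if clip_seconds <= 0:
        raise ValueError("clip_seconds must be positive")
    return synthesis_seconds / clip_seconds


kokoro_rtf = real_time_factor(13.41, 6.35)  # local Kokoro measurement
sapi_rtf = real_time_factor(1.87, 6.65)     # Windows SAPI measurement

print(f"Kokoro RTF: {kokoro_rtf:.2f}")  # prints "Kokoro RTF: 2.11"
print(f"SAPI RTF:   {sapi_rtf:.2f}")    # prints "SAPI RTF:   0.28"
```

With a factor above 2, Kokoro cannot keep up with playback on this machine, which is consistent with SAPI being the Walk default on Windows.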

	### Run (Tomorrow — Cloud Sync)
	Goal: Full voice-first loop with real models on Modal.

	- Push-to-talk microphone → Nemotron STT (Modal) → Together API (Qwen3) → Qwen3-TTS (Modal) → Audio playback
	- Profile: `modal`
	- Add microphone input component to Gradio UI
	- Create Modal functions for Nemotron and Qwen3-TTS in isolated environments
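The Run-phase loop can be sketched as three injected callables run in sequence, with per-stage timing so that latency regressions are visible. This is an assumption-laden sketch: `run_voice_turn` and its parameters are hypothetical names, and the real adapters may expose richer types than plain `str` and `bytes`.

```python
import time
from typing import Callable, Dict, Tuple


def run_voice_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    complete: Callable[[str], str],
    synthesize: Callable[[str], bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[bytes, Dict[str, float]]:
    """Run one STT -> LLM -> TTS turn and record seconds spent per stage."""
    timings: Dict[str, float] = {}

    started = clock()
    user_text = transcribe(audio)        # e.g. Nemotron STT on Modal
    timings["stt"] = clock() - started

    started = clock()
    reply_text = complete(user_text)     # e.g. Together API chat completion
    timings["llm"] = clock() - started

    started = clock()
    reply_audio = synthesize(reply_text)  # e.g. Qwen3-TTS on Modal
    timings["tts"] = clock() - started

    return reply_audio, timings
```

Because each stage is just a callable, the same function can run against fixtures in tests and against the Modal endpoints in the demo.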

	### Sprint (Tomorrow Night — Streaming)
	Goal: Real-time streaming for a polished demo.

	- Streaming audio → Nemotron partial transcripts → LLM token streaming → TTS audio chunks → Live playback
	- Swap HTTP requests for WebSocket streaming where possible
	- Only attempt if Run phase works flawlessly
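For the LLM-to-TTS hop of the streaming path, one common approach is to buffer streamed tokens and hand complete sentences to TTS as soon as they close. The generator below is a naive sketch of that idea, not the project's implementation: the function name is hypothetical, and splitting on `.`, `!`, and `?` will mis-split abbreviations, decimals, and ellipses.

```python
from typing import Iterable, Iterator


def sentences_from_tokens(
    tokens: Iterable[str], terminators: str = ".!?"
) -> Iterator[str]:
    """Yield sentences as soon as a terminator arrives in the token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            # Index of the first terminator in the buffer, or -1 if none yet.
            cut = next((i for i, ch in enumerate(buffer) if ch in terminators), -1)
            if cut == -1:
                break
            sentence, buffer = buffer[: cut + 1].strip(), buffer[cut + 1 :]
            if sentence:
                yield sentence
    # Flush whatever remains when the stream ends without a terminator.
    tail = buffer.strip()
    if tail:
        yield tail
```

Each yielded sentence could be sent to TTS while the LLM is still generating the next one, which is where most of the perceived latency gain in this phase would come from.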

	## Adapter Profiles

	| Profile | LLM | STT | TTS | Use Case |
	|---------|-----|-----|-----|----------|
	| `fixture` | Fixture data | Fixture | Fixture | Tests, UI dev |
	| `dev` | Together API | Whisper (local) | SAPI on Windows, Kokoro optional | Dev testing |
	| `local_models` | Qwen local (transformers) | Nemotron local (NeMo) | Kokoro (local) | Full local (needs big GPU) |
	| `modal` | Together API | Nemotron (Modal) | Qwen3-TTS (Modal) | Production / hackathon submission |
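The profile table maps naturally onto a small lookup structure. The sketch below only illustrates that mapping: `AdapterProfile`, `PROFILES`, `select_profile`, and the backend identifier strings are hypothetical, and the actual wiring in `container.py` may look quite different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdapterProfile:
    llm: str
    stt: str
    tts: str


# Backend identifiers are illustrative labels, not names from the codebase.
PROFILES = {
    "fixture": AdapterProfile("fixture", "fixture", "fixture"),
    "dev": AdapterProfile("together_api", "whisper_local", "sapi_or_kokoro"),
    "local_models": AdapterProfile("qwen_local", "nemotron_local", "kokoro_local"),
    "modal": AdapterProfile("together_api", "nemotron_modal", "qwen3_tts_modal"),
}


def select_profile(name):
    """Look up a profile by name and fail loudly on typos."""
    try:
        return PROFILES[name]
    except KeyError:
        raise ValueError(
            f"Unknown adapter profile {name!r}; expected one of {sorted(PROFILES)}"
        ) from None
```

Failing on an unknown name, rather than silently falling back to a default, makes a mistyped `TIME_MACHINE_ADAPTER_PROFILE` obvious at startup.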

	## Parameter Budget

	| Model | Role | Parameters | Enabled |
	|-------|------|------------|----------|
	| Qwen3-8B | LLM (via Together API) | 8.0B | ✅ |
	| Nemotron 3.5 ASR | STT (on Modal) | 0.6B | ✅ |
	| Qwen3-TTS 1.7B | TTS (on Modal) | 1.7B | ✅ |
	| Total | | 10.3B | < 32B ✅ |

	Dev-only models (not counted for submission):
	- Kokoro 82M (dev TTS fallback)
	- Whisper base 74M (dev STT fallback)
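The budget arithmetic above is simple enough to encode as a check. The following is a minimal sketch with a hypothetical data layout (`MODELS`, `enabled_total_b`, `within_budget`); the parameter counts and the 32B limit come from the table and the dev-only list.

```python
BUDGET_B = 32.0  # submission limit, in billions of parameters

# Parameter counts in billions; dev-only models are marked as not enabled.
MODELS = [
    {"name": "Qwen3-8B", "params_b": 8.0, "enabled": True},
    {"name": "Nemotron 3.5 ASR", "params_b": 0.6, "enabled": True},
    {"name": "Qwen3-TTS 1.7B", "params_b": 1.7, "enabled": True},
    {"name": "Kokoro 82M", "params_b": 0.082, "enabled": False},
    {"name": "Whisper base 74M", "params_b": 0.074, "enabled": False},
]


def enabled_total_b(models):
    """Sum the parameters of models that count toward the submission."""
    return sum(m["params_b"] for m in models if m["enabled"])


def within_budget(models, budget_b=BUDGET_B):
    return enabled_total_b(models) <= budget_b
```

The three enabled models sum to 10.3B, which leaves roughly 21.7B of headroom under the limit.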

	## Proposed Changes

	### Already Implemented ✅

	#### [NEW] [cloud_completion.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/adapters/llm/cloud_completion.py)
	- Cloud LLM completion function for any OpenAI-compatible API (Together, OpenRouter, etc.)
	- Zero new dependencies (uses stdlib urllib)
	- Injected into QwenStructuredLLMAdapter via existing `completion_fn` hook
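A dependency-free completion call of the kind described above might look like the sketch below. It is not the contents of `cloud_completion.py`: the function names are hypothetical, and the endpoint URL and the `response_format` JSON-mode field are assumptions based on the OpenAI-compatible chat completions convention, so they should be checked against the provider's current documentation.

```python
import json
import urllib.request

# Assumed OpenAI-compatible chat endpoint for Together.
TOGETHER_CHAT_URL = "https://api.together.xyz/v1/chat/completions"


def build_chat_request(api_key, model, messages, url=TOGETHER_CHAT_URL, json_mode=True):
    """Build a POST request for an OpenAI-compatible chat completions API."""
    payload = {"model": model, "messages": messages}
    if json_mode:
        # Assumption: the provider honors the OpenAI-style JSON mode flag.
        payload["response_format"] = {"type": "json_object"}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def complete(request, timeout=30.0):
    """Send the request and return the first choice's message content."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Splitting request construction from the network call keeps the payload testable offline; only `complete` needs a live API key.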

	#### [NEW] [whisper_stt.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/adapters/stt/whisper_stt.py)
	- Whisper STT adapter for dev/testing
	- Drop-in replacement for NemotronStreamingSTTAdapter
	- Same STTAdapter protocol

	#### [MODIFY] [container.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/application/container.py)
	- Added `dev` profile wiring: Together API + Whisper STT + local TTS
	- Dev TTS defaults to Windows SAPI on Windows for Walk latency; `TIME_MACHINE_DEV_TTS=kokoro` preserves the Kokoro path.

	### Still Needed

	#### [NEW] Modal Nemotron STT endpoint
	- Isolated Modal function running NeMo + Nemotron 3.5 ASR
	- HTTP webhook callable from the Gradio app
	- Accepts audio, returns transcript JSON

	#### [NEW] Modal Qwen3-TTS endpoint
	- Isolated Modal function running Qwen3-TTS 1.7B
	- HTTP webhook callable from the Gradio app
	- Accepts text + voice profile, returns audio

	#### [NEW] Modal STT/TTS adapter wrappers
	- `ModalNemotronSTTAdapter` — calls Modal webhook, returns `Transcript`
	- `ModalQwenTTSAdapter` — calls Modal webhook, returns `AudioResult`
	- Both implement existing port interfaces
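As a rough sketch of the STT wrapper, the adapter can take the HTTP call as an injected function so that it is testable without a running Modal endpoint. Everything beyond the class name is an assumption here: the `Transcript` fields, the `transcribe` method name, the raw-bytes request body, and the `{"text": ...}` response shape are placeholders for whatever the existing port and the Modal function actually define.

```python
import json
import urllib.request
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Transcript:
    """Hypothetical stand-in for the project's Transcript type."""
    text: str


def http_post(url: str, body: bytes, timeout: float = 60.0) -> bytes:
    """POST raw bytes and return the raw response body."""
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


class ModalNemotronSTTAdapter:
    """Calls a Modal webhook and converts its JSON reply to a Transcript."""

    def __init__(self, url: str, post: Callable[[str, bytes], bytes] = http_post):
        self._url = url
        self._post = post

    def transcribe(self, audio_bytes: bytes) -> Transcript:
        payload = json.loads(self._post(self._url, audio_bytes))
        text = payload.get("text") if isinstance(payload, dict) else None
        if not isinstance(text, str):
            raise ValueError("STT response is missing a string 'text' field")
        return Transcript(text=text.strip())
```

A `ModalQwenTTSAdapter` could follow the same pattern in the other direction, sending text plus a voice profile and returning audio bytes.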

	#### [MODIFY] [container.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/application/container.py)
	- Add `modal` profile wiring the Modal adapters

	#### [MODIFY] [gradio_app.py](file:///c:/Mani/Projects/build_small_hackathon/src/time_machine/ui/gradio_app.py)
	- Add microphone input component for push-to-talk voice input

	## Verification Plan

	### Walk Phase
	- Launch with `dev` profile, type messages, verify real Qwen responses and audio playback
	- Run `scripts\walk_smoke.py` and inspect printed stage timings for launch, conversation, and TTS latency

	### Run Phase
	- Start the Modal STT and TTS endpoints from PowerShell. The UTF-8 settings avoid
	Windows console encoding failures from Modal CLI status glyphs.

	```powershell
	$env:MODAL_PROFILE="manikandanj"
	$env:PYTHONUTF8="1"
	$env:PYTHONIOENCODING="utf-8"
	$env:TTY_COMPATIBLE="0"
	.\.venv\Scripts\modal.exe serve scripts\modal_audio.py
	```

	- Keep Modal warm for the demo. The audio service now preloads Nemotron and
	Qwen3-TTS in class lifecycle hooks, keeps one GPU container warm by default,
	and runs a short Qwen3-TTS warmup during container startup. Override with:

	```powershell
	$env:TIME_MACHINE_MODAL_MIN_CONTAINERS="1"
	$env:TIME_MACHINE_MODAL_MAX_CONTAINERS="1"
	$env:TIME_MACHINE_MODAL_SCALEDOWN_SECONDS="1800"
	$env:TIME_MACHINE_MODAL_WARMUP_TTS="1"
	```
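The override variables above could be read into a validated settings object before they are handed to the Modal service. This sketch uses hypothetical names (`ModalWarmSettings`, `load_modal_settings`); the default of one warm container comes from the plan, while the other defaults simply reuse the example values and are assumptions.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ModalWarmSettings:
    min_containers: int
    max_containers: int
    scaledown_seconds: int
    warmup_tts: bool


def _flag(raw):
    """Interpret common truthy strings as True."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_modal_settings(env=None):
    env = os.environ if env is None else env
    settings = ModalWarmSettings(
        # Plan: one GPU container is kept warm by default.
        min_containers=int(env.get("TIME_MACHINE_MODAL_MIN_CONTAINERS", "1")),
        # Assumed defaults below, copied from the example override values.
        max_containers=int(env.get("TIME_MACHINE_MODAL_MAX_CONTAINERS", "1")),
        scaledown_seconds=int(env.get("TIME_MACHINE_MODAL_SCALEDOWN_SECONDS", "1800")),
        warmup_tts=_flag(env.get("TIME_MACHINE_MODAL_WARMUP_TTS", "1")),
    )
    if settings.min_containers < 0 or settings.max_containers < 1:
        raise ValueError("container counts must be non-negative, with max >= 1")
    if settings.min_containers > settings.max_containers:
        raise ValueError("min containers cannot exceed max containers")
    if settings.scaledown_seconds <= 0:
        raise ValueError("scaledown seconds must be positive")
    return settings
```

Validating at load time surfaces a contradictory configuration (for example, min 2 with max 1) before a GPU container is requested.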

	- After `modal serve` prints the STT and TTS endpoint URLs, set
	`TIME_MACHINE_MODAL_STT_URL` and `TIME_MACHINE_MODAL_TTS_URL`, then preflight
	both real-model endpoints before opening the demo:

	```powershell
	.\.venv\Scripts\python.exe scripts\modal_warmup.py
	```
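Before any network preflight, it can help to confirm that both endpoint variables are set and look like URLs. The check below is a hypothetical sketch and is not taken from `scripts\modal_warmup.py`; it only validates the configuration and does not contact the endpoints.

```python
import os
from urllib.parse import urlsplit

REQUIRED_URL_VARS = ("TIME_MACHINE_MODAL_STT_URL", "TIME_MACHINE_MODAL_TTS_URL")


def missing_or_invalid_endpoints(env=None):
    """Return the names of endpoint variables that are unset or not http(s) URLs."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_URL_VARS:
        parts = urlsplit(env.get(name, "").strip())
        if parts.scheme not in ("http", "https") or not parts.netloc:
            problems.append(name)
    return problems
```

An empty list means both variables are plausible; anything else names exactly which variable to fix.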

	- Test voice input → text → voice output loop

	### Automated Tests
	- `pytest tests/unit/` — domain models, JSON contract parsing, event stream ordering
	- Model budget compliance test (sum enabled params ≤ 32B)
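A JSON contract test typically checks that a structured LLM reply parses into a typed object and that malformed replies are rejected. The example below is illustrative only: `Destination` and its `place` and `year` fields are hypothetical, since the plan does not list the actual contract fields.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Destination:
    """Hypothetical contract shape for a generated destination."""
    place: str
    year: int


def parse_destination(raw):
    """Parse and validate a JSON reply, raising ValueError on any violation."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    place, year = data.get("place"), data.get("year")
    if not isinstance(place, str) or not place.strip():
        raise ValueError("'place' must be a non-empty string")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(year, bool) or not isinstance(year, int):
        raise ValueError("'year' must be an integer")
    return Destination(place=place.strip(), year=year)


def test_parse_destination_accepts_valid_reply():
    parsed = parse_destination('{"place": "Paris", "year": 1889}')
    assert parsed == Destination(place="Paris", year=1889)
```

Raising a single exception type for every violation keeps the caller's retry logic simple: any `ValueError` means the model should be asked again.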

	### Manual Verification
	- Launch the machine, type/speak to the generated character, bring back a souvenir
	- Verify audio playback works in the Gradio UI