manikandanj

Prepare AI Time Machine hackathon Space

5862322 verified 12 days ago

AI Time Machine Implementation Plan

This plan organizes the implementation of the "AI Time Machine" (Track 2: Thousand Token Wood) for the Hugging Face Build Small Hackathon.

Updated: 2026-06-09

Decisions Made

Architecture: Cloud-first. Together API for LLM, Modal for audio models (STT/TTS), Gradio on HF Spaces for UI.
LLM Choice: Qwen3-8B via Together AI API (structured JSON outputs with native JSON mode).
STT Choice: NVIDIA Nemotron 3.5 ASR Streaming 0.6B on Modal (NVIDIA sponsor prize requirement).
TTS Strategy: Qwen3-TTS 1.7B VoiceDesign on Modal for production. Windows SAPI is the low-latency Walk default on Windows; Kokoro 82M remains available locally for voice-quality comparison.
Dev Profile: Together API + local low-latency TTS + text input (no STT needed for dev).
Visual Style: Immersive Steampunk cockpit (brass, copper, glowing Edison bulbs, deep wood textures, circular portal).
Division of Work: Track A (UI/Cockpit & Gradio Shell) vs Track B (Domain Models & AI Adapters).

Deployment Architecture

┌─────────────────┐     ┌───────────────────┐     ┌─────────────────┐
│   HF Spaces     │     │   Together AI     │     │     Modal       │
│   (Gradio UI)   │────>│   (Qwen3-8B LLM)  │     │  (GPU Compute)  │
│                 │     │   JSON mode       │     │                 │
│                 │     └───────────────────┘     │  Nemotron STT   │
│                 │──────────────────────────────>│  Qwen3-TTS      │
└─────────────────┘                               └─────────────────┘

Key principle: Divide and conquer dependencies. Never mix NeMo (audio) with vLLM (LLM) in one environment. Use API providers for the LLM (structured JSON output built-in). Dedicate Modal exclusively to audio models.

Phased Execution: Walk / Run / Sprint

Walk (Tonight — Local Sync)

Goal: Validate logic end-to-end with real AI inference.

Text input → Together API (Qwen3) → local TTS → Audio playback
Profile: dev
No STT needed — type in Gradio's text box
Validates: destination generation, persona generation, conversation, souvenir, TTS audio pipeline
On Windows, Walk defaults to TIME_MACHINE_DEV_TTS=sapi for low latency. Set TIME_MACHINE_DEV_TTS=kokoro to compare Kokoro voice quality.
TIME_MACHINE_MAX_RESPONSE_CHARS overrides the default short spoken reply cap of 260 characters.

$env:TIME_MACHINE_ADAPTER_PROFILE="dev"
$env:TIME_MACHINE_LLM_API_KEY="your-together-key"
python app.py
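The spoken reply cap described in the Walk notes above could be applied roughly as follows. This is a minimal sketch, not the project's code: the helper names `max_response_chars` and `cap_reply` are hypothetical, and the choice to trim at a sentence boundary is an assumption. Only the variable name `TIME_MACHINE_MAX_RESPONSE_CHARS` and the default of 260 come from this plan.

```python
import os

DEFAULT_MAX_RESPONSE_CHARS = 260  # default cap stated in this plan


def max_response_chars(env=os.environ) -> int:
    """Read the cap from TIME_MACHINE_MAX_RESPONSE_CHARS, else use 260."""
    raw = env.get("TIME_MACHINE_MAX_RESPONSE_CHARS", "").strip()
    if raw.isdigit() and int(raw) > 0:
        return int(raw)
    return DEFAULT_MAX_RESPONSE_CHARS


def cap_reply(text: str, limit: int) -> str:
    """Trim a reply to at most `limit` characters (hypothetical helper)."""
    text = text.strip()
    if len(text) <= limit:
        return text
    clipped = text[:limit]
    # Assumption: prefer ending on the last sentence terminator in the window.
    cut = max(clipped.rfind(mark) for mark in ".!?")
    if cut > 0:
        return clipped[: cut + 1]
    # Otherwise fall back to the last whole word.
    return clipped.rsplit(" ", 1)[0].rstrip()
```

Keeping replies short matters here mainly because TTS synthesis time grows with text length, so a tighter cap directly reduces Walk latency.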

Local secret-file option for desktop development:

# .env (ignored by git)
TIME_MACHINE_ADAPTER_PROFILE=dev
TIME_MACHINE_LLM_API_KEY=your-together-key
# TOGETHER_API_KEY=your-together-key also works
TIME_MACHINE_LLM_MODEL=Qwen/Qwen2.5-7B-Instruct-Turbo
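One way the settings above might be resolved is sketched below. It assumes a hand-rolled parser and that real environment variables take precedence over `.env` values; both are illustrative choices, and `parse_env_file` and `resolve_llm_settings` are hypothetical names. The key fallback and the default model follow what this plan states.

```python
import os

# Default serverless model named in this plan for the Walk profile.
DEFAULT_MODEL = "Qwen/Qwen2.5-7B-Instruct-Turbo"


def parse_env_file(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_llm_settings(file_values: dict, env=os.environ) -> dict:
    """Merge .env values with the environment (assumed: environment wins)."""
    merged = {**file_values, **env}
    # Either key name is accepted, as noted in the .env comment above.
    api_key = merged.get("TIME_MACHINE_LLM_API_KEY") or merged.get("TOGETHER_API_KEY")
    if not api_key:
        raise RuntimeError("Set TIME_MACHINE_LLM_API_KEY or TOGETHER_API_KEY")
    return {
        "api_key": api_key,
        "model": merged.get("TIME_MACHINE_LLM_MODEL") or DEFAULT_MODEL,
    }
```

A caller would read the file with `pathlib.Path(".env").read_text()` when it exists and pass the parsed result in; a missing key fails loudly at startup instead of at the first LLM call.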

Optional Kokoro runtime for higher-quality local output (requires Python < 3.13; Python 3.13 uses the built-in WAV fallback):

.\.venv\Scripts\python.exe -m pip install -e ".[dev]"
$env:TIME_MACHINE_DEV_TTS="kokoro"
.\.venv\Scripts\python.exe scripts\walk_smoke.py

Note: Together currently reports Qwen/Qwen3-8B as requiring a dedicated endpoint on this account. The local Walk profile uses the serverless Qwen/Qwen2.5-7B-Instruct-Turbo model unless TIME_MACHINE_LLM_MODEL is set explicitly.

Latency note from local Windows testing: Kokoro took 13.41s to synthesize a 6.35s clip, while Windows SAPI synthesized a similar 6.65s clip in 1.87s. The bottleneck was local Kokoro throughput, not the cloud LLM.
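The latency figures above are easier to compare as a real-time factor (synthesis time divided by clip duration), where values above 1.0 mean synthesis is slower than playback. The small helper below is illustrative only; the four numbers are the measurements quoted in this plan.

```python
def real_time_factor(synthesis_seconds: float, clip_seconds: float) -> float:
    """Synthesis time per second of audio produced; below 1.0 is faster than real time."""
    return synthesis_seconds / clip_seconds


kokoro_rtf = real_time_factor(13.41, 6.35)
sapi_rtf = real_time_factor(1.87, 6.65)

print(f"Kokoro RTF: {kokoro_rtf:.2f}")  # 2.11
print(f"SAPI RTF:   {sapi_rtf:.2f}")    # 0.28
```

On these numbers Kokoro ran at roughly twice real time on the test machine, while SAPI finished in under a third of the clip length.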

Run (Tomorrow — Cloud Sync)

Goal: Full voice-first loop with real models on Modal.

Push-to-talk microphone → Nemotron STT (Modal) → Together API (Qwen3) → Qwen3-TTS (Modal) → Audio playback
Profile: modal
Add microphone input component to Gradio UI
Create Modal functions for Nemotron and Qwen3-TTS in isolated environments

Sprint (Tomorrow Night — Streaming)

Goal: Real-time streaming for a polished demo.

Streaming audio → Nemotron partial transcripts → LLM token streaming → TTS audio chunks → Live playback
Swap HTTP requests for WebSocket streaming where possible
Only attempt if Run phase works flawlessly
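The hand-off between LLM token streaming and TTS audio chunks could look something like the sketch below: tokens are buffered until a sentence boundary, and each complete sentence is handed to TTS while the LLM keeps generating. The function name `sentences_from_tokens` is hypothetical, and the boundary rule is a deliberately simple assumption (it would split on a decimal point if a token ends in one).

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ?, optionally followed by a closing quote or bracket.
_SENTENCE_END = re.compile(r"[.!?][\"')\]]?\s*$")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed LLM tokens into sentence-sized chunks for TTS."""
    buffer = ""
    for token in tokens:
        buffer += token
        if _SENTENCE_END.search(buffer):
            yield buffer.strip()
            buffer = ""
    # Flush any trailing text that never reached a terminator.
    if buffer.strip():
        yield buffer.strip()
```

For example, the token stream `["Wel", "come", " aboard", ".", " Mind", " the", " gears", "!"]` yields `"Welcome aboard."` and then `"Mind the gears!"`, so playback of the first sentence can begin before the second is fully generated.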

Adapter Profiles

| Profile | LLM | STT | TTS | Use Case |
| --- | --- | --- | --- | --- |
| `fixture` | Fixture data | Fixture | Fixture | Tests, UI dev |
| `dev` | Together API | Whisper (local) | SAPI on Windows, Kokoro optional | Dev testing |
| `local_models` | Qwen local (transformers) | Nemotron local (NeMo) | Kokoro (local) | Full local (needs big GPU) |
| `modal` | Together API | Nemotron (Modal) | Qwen3-TTS (Modal) | Production / hackathon submission |

Parameter Budget

| Model | Role | Parameters | Enabled |
| --- | --- | --- | --- |
| Qwen3-8B | LLM (via Together API) | 8.0B | ✅ |
| Nemotron 3.5 ASR | STT (on Modal) | 0.6B | ✅ |
| Qwen3-TTS 1.7B | TTS (on Modal) | 1.7B | ✅ |
| Total | | 10.3B | < 32B ✅ |

Dev-only models (not counted for submission):

Kokoro 82M (dev TTS fallback)
Whisper base 74M (dev STT fallback)

Proposed Changes

Already Implemented ✅

[NEW] cloud_completion.py

Cloud LLM completion function for any OpenAI-compatible API (Together, OpenRouter, etc.)
Zero new dependencies (uses stdlib urllib)
Injected into QwenStructuredLLMAdapter via existing completion_fn hook
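A stdlib-only completion function of the kind described above might look like the following sketch. It is not the contents of cloud_completion.py: the function names, the `opener` parameter, and the `{"type": "json_object"}` response format are assumptions based on the common OpenAI-compatible chat completions shape, and the base URL passed in (for Together, typically `https://api.together.xyz/v1`) should be checked against the provider's documentation.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, messages, json_mode=True):
    """Build a POST request for an OpenAI-compatible /chat/completions endpoint."""
    payload = {"model": model, "messages": messages}
    if json_mode:
        # Assumption: the provider accepts the OpenAI-style JSON mode flag.
        payload["response_format"] = {"type": "json_object"}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_content(response_body: bytes) -> str:
    """Pull the assistant message text out of a chat completion response."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]


def complete(base_url, api_key, model, messages, timeout=30.0,
             opener=urllib.request.urlopen):
    """Send the request and return the assistant text; `opener` is injectable for tests."""
    request = build_chat_request(base_url, api_key, model, messages)
    with opener(request, timeout=timeout) as response:
        return extract_content(response.read())
```

Splitting request building from response parsing keeps both halves testable without network access, which fits the plan's approach of injecting the function through a `completion_fn` hook.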

[NEW] whisper_stt.py

Whisper STT adapter for dev/testing
Drop-in replacement for NemotronStreamingSTTAdapter
Same STTAdapter protocol
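The "drop-in replacement" idea rests on both adapters satisfying one structural interface. The sketch below shows that pattern with `typing.Protocol`; the method signature, the `Transcript` fields, and the `FakeWhisperSTTAdapter` stand-in are all hypothetical, since this plan does not spell out the real STTAdapter protocol.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class Transcript:
    """Hypothetical transcript shape; the real domain model may differ."""
    text: str
    is_final: bool = True


@runtime_checkable
class STTAdapter(Protocol):
    """Assumed port: anything with a matching transcribe() qualifies."""

    def transcribe(self, audio: bytes, sample_rate: int) -> Transcript: ...


class FakeWhisperSTTAdapter:
    """Stand-in for a Whisper-backed adapter; returns canned text."""

    def __init__(self, canned_text: str) -> None:
        self._canned_text = canned_text

    def transcribe(self, audio: bytes, sample_rate: int) -> Transcript:
        return Transcript(text=self._canned_text)


def transcribe_turn(stt: STTAdapter, audio: bytes, sample_rate: int = 16000) -> str:
    """Caller code depends only on the protocol, not on a concrete adapter."""
    return stt.transcribe(audio, sample_rate).text
```

Because the protocol is structural, a Whisper adapter and a Nemotron adapter need no shared base class; the container can swap one for the other per profile.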

[MODIFY] container.py

Added dev profile wiring: Together API + Whisper STT + local TTS
Dev TTS defaults to Windows SAPI on Windows for Walk latency; TIME_MACHINE_DEV_TTS=kokoro preserves the Kokoro path.
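The dev TTS selection rule could be expressed as a small pure function, sketched below. The name `select_dev_tts` is hypothetical, and the behaviour on non-Windows platforms (falling back to Kokoro) is an assumption, since this plan only specifies the Windows default.

```python
import os
import sys


def select_dev_tts(env=os.environ, platform: str = sys.platform) -> str:
    """Pick the dev-profile TTS backend name.

    TIME_MACHINE_DEV_TTS wins when set to a known value; otherwise
    Windows defaults to SAPI for low latency.
    """
    override = env.get("TIME_MACHINE_DEV_TTS", "").strip().lower()
    if override in {"sapi", "kokoro"}:
        return override
    # Assumption: non-Windows platforms have no SAPI, so use Kokoro there.
    return "sapi" if platform == "win32" else "kokoro"
```

Passing `env` and `platform` as parameters keeps the rule testable on any machine, for example `select_dev_tts({}, "win32")` versus `select_dev_tts({"TIME_MACHINE_DEV_TTS": "kokoro"}, "win32")`.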

Still Needed

[NEW] Modal Nemotron STT endpoint

Isolated Modal function running NeMo + Nemotron 3.5 ASR
HTTP webhook callable from the Gradio app
Accepts audio, returns transcript JSON

[NEW] Modal Qwen3-TTS endpoint

Isolated Modal function running Qwen3-TTS 1.7B
HTTP webhook callable from the Gradio app
Accepts text + voice profile, returns audio

[NEW] Modal STT/TTS adapter wrappers

ModalNemotronSTTAdapter — calls Modal webhook, returns Transcript
ModalQwenTTSAdapter — calls Modal webhook, returns AudioResult
Both implement existing port interfaces
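The two wrappers might be thin HTTP clients along the lines of the sketch below. Everything about the wire format is assumed for illustration: WAV bytes in and a `{"text": ...}` JSON body out for STT, a `{"text", "voice"}` JSON body in and raw audio bytes out for TTS. The `Transcript` and `AudioResult` shapes and the method names are likewise hypothetical stand-ins for the project's real port types.

```python
import json
import urllib.request
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    text: str


@dataclass(frozen=True)
class AudioResult:
    audio: bytes
    mime_type: str = "audio/wav"


def post_bytes(url: str, body: bytes, content_type: str, timeout: float = 60.0) -> bytes:
    """POST a raw body and return the raw response body."""
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


class ModalNemotronSTTAdapter:
    """Calls the STT webhook; assumes a JSON reply with a "text" field."""

    def __init__(self, url: str, post=post_bytes) -> None:
        self._url = url
        self._post = post

    def transcribe(self, wav_bytes: bytes) -> Transcript:
        raw = self._post(self._url, wav_bytes, "audio/wav")
        return Transcript(text=json.loads(raw)["text"])


class ModalQwenTTSAdapter:
    """Calls the TTS webhook; assumes the reply body is the audio itself."""

    def __init__(self, url: str, post=post_bytes) -> None:
        self._url = url
        self._post = post

    def synthesize(self, text: str, voice: str) -> AudioResult:
        body = json.dumps({"text": text, "voice": voice}).encode("utf-8")
        return AudioResult(audio=self._post(self._url, body, "application/json"))
```

In the app, the URLs would come from `TIME_MACHINE_MODAL_STT_URL` and `TIME_MACHINE_MODAL_TTS_URL`. The injectable `post` argument lets unit tests exercise both adapters without a running Modal endpoint.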

[MODIFY] container.py

Add modal profile wiring the Modal adapters

[MODIFY] gradio_app.py

Add microphone input component for push-to-talk voice input

Verification Plan

Walk Phase

Launch with dev profile, type messages, verify real Qwen responses and audio playback
Run scripts\walk_smoke.py and inspect printed stage timings for launch, conversation, and TTS latency

Run Phase

Start the Modal STT and TTS endpoints from PowerShell. The UTF-8 settings avoid Windows console encoding failures from Modal CLI status glyphs.

$env:MODAL_PROFILE="manikandanj"
$env:PYTHONUTF8="1"
$env:PYTHONIOENCODING="utf-8"
$env:TTY_COMPATIBLE="0"
.\.venv\Scripts\modal.exe serve scripts\modal_audio.py

Keep Modal warm for the demo. The audio service now preloads Nemotron and Qwen3-TTS in class lifecycle hooks, keeps one GPU container warm by default, and runs a short Qwen3-TTS warmup during container startup. Override with:

$env:TIME_MACHINE_MODAL_MIN_CONTAINERS="1"
$env:TIME_MACHINE_MODAL_MAX_CONTAINERS="1"
$env:TIME_MACHINE_MODAL_SCALEDOWN_SECONDS="1800"
$env:TIME_MACHINE_MODAL_WARMUP_TTS="1"

After modal serve prints the STT and TTS endpoint URLs, set TIME_MACHINE_MODAL_STT_URL and TIME_MACHINE_MODAL_TTS_URL, then preflight both real-model endpoints before opening the demo:

.\.venv\Scripts\python.exe scripts\modal_warmup.py

Test voice input → text → voice output loop

Automated Tests

pytest tests/unit/ — domain models, JSON contract parsing, event stream ordering
Model budget compliance test (sum enabled params ≤ 32B)
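The budget compliance check could be as small as the sketch below. The `ModelEntry` type and the `MODELS` registry are hypothetical; the parameter counts and the 32B ceiling are the figures listed in the Parameter Budget section of this plan.

```python
from dataclasses import dataclass

BUDGET_BILLIONS = 32.0  # ceiling used in this plan


@dataclass(frozen=True)
class ModelEntry:
    name: str
    params_billions: float
    enabled: bool  # True when the model counts toward the submission


# Hypothetical registry mirroring the Parameter Budget section.
MODELS = [
    ModelEntry("Qwen3-8B", 8.0, True),
    ModelEntry("Nemotron 3.5 ASR", 0.6, True),
    ModelEntry("Qwen3-TTS 1.7B", 1.7, True),
    ModelEntry("Kokoro 82M", 0.082, False),    # dev-only
    ModelEntry("Whisper base", 0.074, False),  # dev-only
]


def enabled_total_billions(models) -> float:
    """Sum parameters over enabled models only."""
    return sum(m.params_billions for m in models if m.enabled)


def test_enabled_models_fit_budget():
    total = enabled_total_billions(MODELS)
    assert abs(total - 10.3) < 1e-9
    assert total <= BUDGET_BILLIONS
```

Written as a pytest function, this would fail the suite as soon as someone enables an extra model that pushes the submission over the limit.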

Manual Verification

Launch the machine, type/speak to the generated character, bring back a souvenir
Verify audio playback works in the Gradio UI