ai-time-machine / docs /tech_stack_decision.md
manikandanj's picture
Prepare AI Time Machine hackathon Space
5862322 verified
|
Raw
History Blame Contribute Delete
19.2 kB
# AI Time Machine Tech Stack Decision
Date: 2026-06-05 (Updated: 2026-06-09)
## Purpose
This document records the current technical direction for the AI Time Machine hackathon project. It is intended to be handoff-ready for another coding agent.
The project is a surreal, voice-first time machine experience for the Hugging Face Build Small Hackathon, Track 2: An Adventure in Thousand Token Wood. The user should feel like they launch a strange laboratory machine, arrive at an impossible coordinate in the past or future, and speak aloud with an ordinary person from that world.
## Primary Decision
Build a polished Gradio app hosted on a Hugging Face Space. Use a modular voice-agent architecture with custom frontend staging, streaming ASR, a small instruction LLM, character-oriented TTS, and a souvenir generator.
Main priority: the most polished recorded demo possible within hackathon constraints.
Secondary priority: keep the architecture modular so higher-risk voice/avatar experiments can be tried without endangering the MVP.
## Hard Constraints
Hackathon constraints:
- The app must be a Gradio application.
- The submitted app must be hosted as a Hugging Face Space.
- Total AI model parameters must stay at or below 32B.
- Core functionality must use small models naturally suited to the task.
- The result must be a working, demonstrable product.
Product constraints:
- Voice-first experience.
- Default flow is Surprise Me.
- Visual theme starts as strange laboratory.
- The theme must be easy to reskin later.
- The project should connect the user to ordinary people, not famous figures.
- The app should prioritize theatrical believability over strict historical accuracy.
Budget context:
- Hugging Face credit: USD 20.
- Modal.com credit: USD 250.
- We do not need to spend the credits, but we should use them if they make the demo materially better.
## Final MVP Stack
MVP target:
- App: Gradio Blocks.
- Host: Hugging Face Space with GPU.
- Language: Python.
- Frontend: custom HTML/CSS/JavaScript embedded in Gradio.
- LLM: Qwen3-8B via Together AI API (structured JSON mode built-in).
- STT: NVIDIA Nemotron 3.5 ASR Streaming 0.6B.
- TTS primary candidate: Qwen3-TTS 1.7B VoiceDesign/CustomVoice.
- TTS fallback: NVIDIA Magpie TTS Multilingual 357M.
- Emergency TTS fallback: Kokoro 82M.
- State/output style: structured JSON between generation steps.
Approximate MVP parameter budget:
- LLM: 4B to 8B.
- STT: 0.6B.
- TTS primary: 1.7B.
- Total: roughly 6.3B to 10.3B.
This is comfortably under the 32B cap and leaves room for experimentation.
## Hosting Strategy
The required Gradio application lives on Hugging Face Spaces. Inference is split across three services to keep dependencies clean and costs manageable.
Decided architecture:
- HF Spaces: Gradio UI (CPU tier is sufficient — no local model inference).
- Together AI: LLM inference (Qwen3-8B with JSON mode).
- Modal: Audio models only (Nemotron STT, Qwen3-TTS).
Budget:
- $250 Modal credits available for audio GPU compute.
- $5 Together AI free credits for dev testing.
Key insight: separating LLM and audio dependencies avoids Python dependency conflicts that arise from bundling everything in one environment.
Tradeoff:
- Calling external services means the app is not local-only and should not pursue the Off the Grid badge.
- This is acceptable because polish matters more than bonus badges for this submission.
All external services remain behind narrow Python interfaces so they can be swapped.
## Application Architecture
The app should be organized as five replaceable layers.
### 1. Cockpit UI
Technology:
- Gradio Blocks.
- Custom HTML/CSS/JavaScript.
- CSS variables for theme tokens.
- Small JavaScript state machine for visual transitions.
Responsibilities:
- Strange laboratory cockpit.
- Large windshield/portal.
- Launch button.
- Status/signal panel.
- Voice controls.
- Transcript/radio panel.
- Souvenir display.
- Animated states: dormant, launch, tunnel, turbulence, landing, smoke clear, destination reveal, signal lock, conversation, souvenir.
Implementation guidance:
- The strange laboratory look should be a skin, not hard-coded into app logic.
- Use CSS variables for color, glow, material, portal palette, warning states, and typography.
- Use `data-state` attributes or equivalent simple state flags for animations.
- Do not make the default UI look like a basic Gradio form.
### 2. World And Persona Engine
Technology:
- Python orchestration.
- Qwen3-class instruction LLM.
- Structured JSON outputs.
Responsibilities:
- Generate a destination profile.
- Generate visual motifs that map to frontend presets.
- Generate an ordinary-person persona card.
- Maintain tone, era/future context, character constraints, and safety.
Important output fields:
- destination year.
- destination place.
- destination mode: past, future, or strange.
- atmosphere.
- visual preset key.
- visual motif list.
- character name or local identifier.
- character role/occupation.
- immediate situation.
- daily concern.
- secret/fear/desire.
- worldview constraints.
- theory about the user's voice.
- speaking style.
### 3. Voice Input
Primary STT:
- NVIDIA Nemotron 3.5 ASR Streaming 0.6B.
Why:
- 600M parameters.
- Designed for streaming ASR.
- Supports multilingual transcription across many language-locales.
- Supports punctuation and capitalization.
- Has configurable chunk sizes from low-latency to higher-accuracy operation.
- Better fit than clip-only ASR for a voice-first cockpit experience.
Recommended MVP behavior:
- Start with microphone clip transcription if Gradio streaming wiring takes too long.
- Upgrade to streaming ASR once the basic loop works.
- Use streaming partials in the UI as signal decoding text when available.
Fallback STT:
- NVIDIA Nemotron Speech Streaming English 0.6B for English-only simplicity.
- Distil-Whisper if NeMo integration blocks progress.
### 4. Conversation And Souvenir Engine
Technology:
- Same Qwen3-class instruction LLM used for world/persona generation.
Responsibilities:
- Respond in character.
- Keep replies short enough for voice playback.
- Preserve the persona's worldview and misunderstanding of the time signal.
- Ask occasional questions back.
- Generate a final temporal souvenir.
Conversation prompt rules:
- The character is an ordinary person.
- The character should not know modern facts unless implied by the destination.
- The character may interpret the user as a spirit, official, dream, machine voice, ancestor, descendant, customer, omen, or strange weather.
- The response should contain sensory detail and personality.
- Avoid turning real suffering into spectacle.
- Prefer vivid, humane, surprising moments over encyclopedia-style history.
Souvenir output:
- Destination.
- Contact.
- Quote.
- Artifact.
- Stamp name.
- Short encounter summary.
### 5. Voice Output
Primary TTS candidate:
- Qwen3-TTS 1.7B VoiceDesign/CustomVoice.
Why:
- Best fit for theatrical character variety.
- Supports natural-language voice control.
- Supports emotion, prosody, timbre, and speaking-style instructions.
- Supports streaming generation.
- Supports custom voice and voice-design workflows.
- Apache 2.0 license.
- Lets the app create a more distinct voice per generated persona.
Recommended Qwen3-TTS workflow:
1. Generate persona.
2. Generate a short voice design instruction from the persona.
3. Use Qwen3-TTS VoiceDesign or CustomVoice to create the character voice.
4. Cache/reuse that voice setup for the encounter.
5. Generate each character line with short text and explicit emotion/prosody hints.
Fallback TTS:
- NVIDIA Magpie TTS Multilingual 357M.
Why:
- NVIDIA-backed.
- NeMo-compatible.
- Small.
- Commercial-ready.
- Multiple speaker options and multilingual support.
- Good reliability candidate if Qwen3-TTS integration is unstable.
Emergency fallback:
- Kokoro 82M.
Why:
- Tiny.
- Apache 2.0.
- Fast and inexpensive.
- Good enough to keep the demo working if stronger TTS options fail.
Do not make TTS a single hard dependency. Put it behind a small interface, for example:
- `synthesize(text, voice_profile) -> audio_path`
- `prepare_voice(persona) -> voice_profile`
## TTS Comparison
### Qwen3-TTS
Decision: primary TTS bet.
Strengths:
- Best character fit.
- VoiceDesign and CustomVoice match our generated-persona concept.
- Natural-language control over voice style.
- Streaming support.
- Clean Apache 2.0 license.
- 0.6B and 1.7B variants provide scaling options.
Risks:
- Newer stack.
- May require FlashAttention 2 and GPU/runtime tuning.
- Needs proof of latency and reliability in our deployment environment.
### NVIDIA Magpie TTS 357M
Decision: first fallback.
Strengths:
- Small and practical.
- NVIDIA-backed.
- Works well with the broader NVIDIA speech stack.
- Multiple voices and nine languages.
- Good voice-agent fit.
Risks:
- Less dynamic character control than Qwen3-TTS.
- Fewer English voices than ideal for many different ordinary people.
- NeMo dependency still needs validation in the Space.
### Parler-TTS Mini
Decision: optional fallback or comparison spike, not MVP default.
Strengths:
- Apache 2.0.
- Prompt-controllable voice features.
- 34 named speakers.
- Simple conceptual model for voice descriptions.
Risks:
- English-only.
- Older and less compelling than Qwen3-TTS for this project.
- Around 0.9B params, larger than Kokoro and less flexible than Qwen3-TTS.
### Kokoro 82M
Decision: emergency fallback.
Strengths:
- Very small.
- Fast.
- Apache 2.0.
- Easy to deploy and inexpensive to run.
Risks:
- Less theatrical.
- Less expressive character control.
- Better as a backup than the product-defining voice.
### Voxtral TTS 4B
Note: The user referred to this as Vostral; the model appears to be Mistral's Voxtral TTS.
Decision: stretch spike, not MVP default.
Strengths:
- Expressive, low-latency voice-agent TTS.
- 20 preset voices.
- Voice adaptation from reference audio.
- vLLM-Omni serving path.
- Runs on a single GPU with at least 16GB VRAM.
Risks:
- CC BY-NC 4.0 license.
- 4B params just for TTS.
- Heavier runtime than Qwen3-TTS, Magpie, or Kokoro.
- Better suited to a Modal experiment than the first Space implementation.
## Stretch Technologies
### AVTR-1 Realtime Avatar
Decision: design the UI with an avatar-ready portal slot, but do not put AVTR-1 on the MVP critical path.
Why it is compelling:
- Real-time talking-head avatar.
- Uses a portrait plus dual-stream audio.
- Can render speaking and active listening behavior.
- Could make the portal feel like a person is truly present.
Why it is risky:
- Gated model access.
- Requires CUDA 12.x, TensorRT 10.x, and Ampere+ GPU.
- Requires building TensorRT engines.
- Hugging Face model card reports L4 near the edge of real-time performance.
- Licensing includes noncommercial pieces and consent-sensitive avatar/deepfake restrictions.
- Parameter count needs verification before hackathon use.
How to prepare for it:
- Build the portal area as a replaceable component.
- MVP should show stylized silhouette, portrait card, waveform, static, and environmental animation.
- Later, AVTR-1 can replace or augment the portal visual.
Recommended spike:
- Run AVTR-1 separately on Modal.
- Use only generated or clearly consent-safe reference portraits.
- Confirm licensing and parameter count before integrating into the submitted app.
### PersonaPlex 7B
Decision: stretch experiment, not MVP.
Why it is compelling:
- Real-time spoken conversation.
- Persona-conditioned speech-to-speech interaction.
- Could make the app feel much more alive.
Why it is risky:
- Gated.
- More complex runtime.
- Likely high-end GPU expectations.
- Could absorb time needed for cockpit polish, launch sequence, reliable voice loop, and demo flow.
Recommended spike:
- Build an isolated proof of concept after the MVP voice loop works.
- Compare latency and character quality against modular STT + LLM + TTS.
- Promote only if it clearly improves the demo without destabilizing delivery.
### llama.cpp
Decision: future stretch for the Llama Champion badge, not MVP.
Rationale:
- The project should first optimize for a polished GPU-backed demo.
- GGUF/llama.cpp can be revisited after the app works end to end.
## Frontend Strategy
The frontend should feel like a cinematic ride plus magical radio.
Default UI direction:
- Strange laboratory.
- Full or near-full cockpit.
- Portal/windshield as the visual center.
- Control panel with dials, meters, switches, and warning lights.
- Signal/status text that feels like the machine is decoding a temporal contact.
- Transcript/radio panel for voice clarity.
Core states:
- Dormant cockpit.
- Launch charging.
- Temporal tunnel.
- Turbulence.
- Landing.
- Smoke clearing.
- Destination reveal.
- Signal lock.
- Conversation.
- Souvenir.
MVP visual approach:
- Use CSS animations and small JavaScript state transitions.
- Use stylized world presets instead of generated images.
- Let the LLM emit visual motifs, then map them to known preset keys.
- Keep generated image and avatar work as stretch.
Example visual presets:
- Rainy lantern district.
- Future flooded market.
- Medieval port.
- Orbital repair bay.
- Desert archive.
- Underground signal room.
- Snowbound radio station.
- Solar eclipse observatory.
## Build Order
Follow this order unless the user changes priorities.
1. Create Gradio app shell.
2. Build strange laboratory cockpit layout.
3. Add frontend state machine and launch/reveal animations.
4. Add static mock destination/persona data to validate UI flow.
5. Implement structured destination generation.
6. Implement structured persona generation.
7. Implement transcript-first conversation loop.
8. Add souvenir generation.
9. Add TTS interface with Kokoro or a stub so audio wiring exists early.
10. Integrate Qwen3-TTS as primary TTS candidate.
11. Add Magpie fallback if Qwen3-TTS is unstable.
12. Add STT interface with clip transcription first.
13. Integrate Nemotron 3.5 ASR Streaming.
14. Add voice-first UX polish: partial transcript, signal decoding, playback states.
15. Tune prompts, timing, and audio length for a crisp recorded demo.
16. Consider Modal offload for Qwen3-TTS or heavier components if Space performance is weak.
17. Run stretch spikes only after the core encounter works end to end.
## Required Interfaces
Keep these boundaries clean so models can be swapped.
### Destination Generator
Input:
- mode.
- optional user coordinate prompt.
- random seed.
Output:
- structured destination JSON.
### Persona Generator
Input:
- destination JSON.
Output:
- structured persona JSON.
### Conversation Engine
Input:
- destination JSON.
- persona JSON.
- conversation history.
- latest user message.
Output:
- short in-character response text.
- optional emotion/prosody hint for TTS.
- optional UI signal event.
### STT Adapter
Input:
- audio file or audio stream chunk.
Output:
- transcript text.
- optional partial transcript.
- optional confidence/timing metadata.
### TTS Adapter
Input:
- response text.
- persona voice profile.
- emotion/prosody hint.
Output:
- audio file path or audio bytes.
### Souvenir Generator
Input:
- destination JSON.
- persona JSON.
- conversation highlights.
Output:
- structured souvenir JSON.
## Risks And Mitigations
### Risk: Qwen3-TTS integration takes too long
Mitigation:
- Keep TTS behind an adapter.
- Start with Kokoro or stub audio for wiring.
- Keep Magpie as first serious fallback.
### Risk: Nemotron streaming ASR is hard to wire through Gradio
Mitigation:
- Start with clip-based microphone transcription.
- Add streaming partials after the main loop works.
- Use partial transcript display as polish, not as a hard MVP dependency.
### Risk: Hugging Face Space GPU is not enough
Mitigation:
- Use Qwen3-4B instead of Qwen3-8B.
- Move TTS or LLM to Modal.
- Avoid image generation and avatar generation in MVP.
### Risk: Custom frontend becomes fragile
Mitigation:
- Keep JavaScript small.
- Use explicit UI states.
- Keep business logic in Python.
- Avoid frontend-generated source-of-truth state.
### Risk: Stretch tech distracts from the demo
Mitigation:
- AVTR-1, PersonaPlex, Voxtral, and llama.cpp are separate spikes.
- Do not block MVP on gated models, TensorRT builds, or alternate runtimes.
- Build the cockpit so these can be added later without rewiring the app.
## Current Decisions To Preserve
- Optimize for the most polished demo.
- Use GPU if it helps quality.
- Make the app voice-first.
- Use Surprise Me as the default first action.
- Start with a strange laboratory visual theme.
- Make the visual theme easy to replace later.
- Use Nemotron 3.5 ASR Streaming for preferred STT.
- Use Qwen3-TTS as the primary TTS bet.
- Keep Magpie and Kokoro as fallbacks.
- Keep AVTR-1 and PersonaPlex as stretch experiments.
- Keep llama.cpp as a later badge-oriented stretch.
- Use Modal if it materially improves product quality or unlocks a hard component.
## Open Questions For Implementation
- ~~Which exact Qwen3 LLM checkpoint should be used first: 4B for reliability or 8B for quality?~~ DECIDED: Qwen3-8B via Together AI API.
- ~~Can Qwen3-TTS run reliably inside the selected Hugging Face Space GPU?~~ DECIDED: Qwen3-TTS runs on Modal, not in HF Space.
- ~~Should Qwen3-TTS run in the Space or on Modal?~~ DECIDED: Modal.
- ~~What Space GPU tier should be used for the demo?~~ DECIDED: CPU tier is sufficient (no local model inference).
- ~~How much true streaming does Gradio need in MVP versus polished clip-turn interaction?~~ DECIDED: Walk phase uses sync/text. Sprint phase adds streaming if time permits.
- Which generated voice attributes should be allowed for safety and consistency? (Still open.)
- ~~Should the MVP include multilingual character speech, or keep speech output English-first?~~ DECIDED: English-first for MVP.
## References
Local docs:
- `docs/hackathon_details.md`
- `docs/ai_time_machine_idea.md`
Model and platform references:
- Hugging Face Build Small Hackathon: https://huggingface.co/build-small-hackathon
- Gradio Spaces documentation: https://huggingface.co/docs/hub/spaces-sdks-gradio
- NVIDIA Nemotron Speech collection: https://huggingface.co/collections/nvidia/nemotron-speech
- NVIDIA Nemotron 3.5 ASR Streaming 0.6B: https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b
- NVIDIA Nemotron Speech Streaming English 0.6B: https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b
- NVIDIA Magpie TTS Multilingual 357M: https://huggingface.co/nvidia/magpie_tts_multilingual_357m
- Qwen3-TTS CustomVoice: https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
- Qwen3-TTS VoiceDesign: https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
- Kokoro 82M: https://huggingface.co/hexgrad/Kokoro-82M
- Parler-TTS Mini v1: https://huggingface.co/parler-tts/parler-tts-mini-v1
- Voxtral TTS 4B: https://huggingface.co/mistralai/Voxtral-4B-TTS-2603
- AVTR-1 realtime avatar: https://huggingface.co/avaturn-live/avtr-1
- NVIDIA PersonaPlex 7B: https://huggingface.co/nvidia/personaplex-7b-v1
- Modal: https://modal.com