ai-time-machine / docs /tech_stack_decision.md

manikandanj

Prepare AI Time Machine hackathon Space

5862322 verified 11 days ago


	# AI Time Machine Tech Stack Decision

	Date: 2026-06-05 (Updated: 2026-06-09)

	## Purpose

	This document records the current technical direction for the AI Time Machine hackathon project. It is intended to be handoff-ready for another coding agent.

	The project is a surreal, voice-first time machine experience for the Hugging Face Build Small Hackathon, Track 2: An Adventure in Thousand Token Wood. The user should feel like they launch a strange laboratory machine, arrive at an impossible coordinate in the past or future, and speak aloud with an ordinary person from that world.

	## Primary Decision

	Build a polished Gradio app hosted on a Hugging Face Space. Use a modular voice-agent architecture with custom frontend staging, streaming ASR, a small instruction LLM, character-oriented TTS, and a souvenir generator.

	Main priority: the most polished recorded demo possible within hackathon constraints.

	Secondary priority: keep the architecture modular so higher-risk voice/avatar experiments can be tried without endangering the MVP.

	## Hard Constraints

	Hackathon constraints:

	- The app must be a Gradio application.
	- The submitted app must be hosted as a Hugging Face Space.
	- Total AI model parameters must stay at or below 32B.
	- Core functionality must use small models naturally suited to the task.
	- The result must be a working, demonstrable product.

	Product constraints:

	- Voice-first experience.
	- Default flow is Surprise Me.
	- Visual theme starts as strange laboratory.
	- The theme must be easy to reskin later.
	- The project should connect the user to ordinary people, not famous figures.
	- The app should prioritize theatrical believability over strict historical accuracy.

	Budget context:

	- Hugging Face credit: USD 20.
	- Modal.com credit: USD 250.
	- We do not need to spend the credits, but we should use them if they make the demo materially better.

	## Final MVP Stack

	MVP target:

	- App: Gradio Blocks.
	- Host: Hugging Face Space (CPU tier is sufficient because no models run locally).
	- Language: Python.
	- Frontend: custom HTML/CSS/JavaScript embedded in Gradio.
	- LLM: Qwen3-8B via Together AI API (structured JSON mode built-in).
	- STT: NVIDIA Nemotron 3.5 ASR Streaming 0.6B.
	- TTS primary candidate: Qwen3-TTS 1.7B VoiceDesign/CustomVoice.
	- TTS fallback: NVIDIA Magpie TTS Multilingual 357M.
	- Emergency TTS fallback: Kokoro 82M.
	- State/output style: structured JSON between generation steps.

	Approximate MVP parameter budget:

	- LLM: 4B to 8B (8B for the selected Qwen3-8B).
	- STT: 0.6B.
	- TTS primary: 1.7B.
	- Total: roughly 6.3B to 10.3B (10.3B with Qwen3-8B).

	This is comfortably under the 32B cap and leaves room for experimentation.
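
The budget arithmetic above can be checked with a short sketch. It counts only the LLM, STT, and primary TTS, as the list above does; whether fallback models also count toward the cap is not stated here, so they are left out as an assumption.

```python
# Parameter counts in billions, taken from the MVP stack above.
PARAM_CAP_B = 32.0
STT_B = 0.6          # Nemotron 3.5 ASR Streaming
TTS_PRIMARY_B = 1.7  # Qwen3-TTS 1.7B


def total_params_b(llm_b: float) -> float:
    """Sum LLM, STT, and primary TTS parameters, in billions."""
    return round(llm_b + STT_B + TTS_PRIMARY_B, 1)


low = total_params_b(4.0)    # smaller Qwen3 checkpoint
high = total_params_b(8.0)   # Qwen3-8B
headroom = round(PARAM_CAP_B - high, 1)
print(f"total: {low}B to {high}B, headroom under cap: {headroom}B")
```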

	## Hosting Strategy

	The required Gradio application lives on Hugging Face Spaces. The system is split across three services to keep dependencies clean and costs manageable: the Space serves the UI, and inference runs on two external services.

	Decided architecture:

	- HF Spaces: Gradio UI (CPU tier is sufficient; no local model inference).
	- Together AI: LLM inference (Qwen3-8B with JSON mode).
	- Modal: Audio models only (Nemotron STT, Qwen3-TTS).

	Budget:

	- USD 250 in Modal credits available for audio GPU compute.
	- USD 5 in Together AI free credits for dev testing.

	Key insight: separating LLM and audio dependencies avoids Python dependency conflicts that arise from bundling everything in one environment.

	Tradeoff:

	- Calling external services means the app is not local-only and should not pursue the Off the Grid badge.
	- This is acceptable because polish matters more than bonus badges for this submission.

	All external services remain behind narrow Python interfaces so they can be swapped.
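
One way to keep a service behind a narrow interface is a `typing.Protocol` plus an offline stub. The sketch below is illustrative only: `LLMClient`, `complete_json`, `StubLLMClient`, and `describe_backend` are hypothetical names, not part of any vendor SDK or of this project.

```python
from typing import Protocol


class LLMClient(Protocol):
    """Hypothetical narrow interface; any backend with this method fits."""

    def complete_json(self, prompt: str) -> dict: ...


class StubLLMClient:
    """Offline stand-in for UI work and tests; returns canned data."""

    def complete_json(self, prompt: str) -> dict:
        return {"prompt_chars": len(prompt), "source": "stub"}


def describe_backend(client: LLMClient, prompt: str) -> str:
    """App code depends only on the interface, never on a vendor SDK."""
    result = client.complete_json(prompt)
    return f"{result['source']}:{result['prompt_chars']}"


print(describe_backend(StubLLMClient(), "Generate a destination."))
```

A real client for the hosted LLM would implement the same method, so swapping providers would touch one class rather than the app logic.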

	## Application Architecture

	The app should be organized as five replaceable layers.

	### 1. Cockpit UI

	Technology:

	- Gradio Blocks.
	- Custom HTML/CSS/JavaScript.
	- CSS variables for theme tokens.
	- Small JavaScript state machine for visual transitions.

	Responsibilities:

	- Strange laboratory cockpit.
	- Large windshield/portal.
	- Launch button.
	- Status/signal panel.
	- Voice controls.
	- Transcript/radio panel.
	- Souvenir display.
	- Animated states: dormant, launch, tunnel, turbulence, landing, smoke clear, destination reveal, signal lock, conversation, souvenir.

	Implementation guidance:

	- The strange laboratory look should be a skin, not hard-coded into app logic.
	- Use CSS variables for color, glow, material, portal palette, warning states, and typography.
	- Use `data-state` attributes or equivalent simple state flags for animations.
	- Do not make the default UI look like a basic Gradio form.
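
The skin guidance above could be sketched in Python as a token dictionary rendered to CSS custom properties, plus a container whose `data-state` attribute drives animations. Token names, values, and the `.cockpit` selector are assumptions made for illustration.

```python
from html import escape

# Hypothetical theme tokens; names and values are illustrative only.
LAB_THEME = {
    "color-bg": "#0b0f14",
    "color-glow": "#7cffb2",
    "color-warning": "#ff5a36",
    "font-display": "'Courier New', monospace",
}


def theme_to_css(tokens: dict[str, str], selector: str = ".cockpit") -> str:
    """Render theme tokens as CSS custom properties on one selector."""
    lines = [f"  --{name}: {value};" for name, value in tokens.items()]
    return selector + " {\n" + "\n".join(lines) + "\n}"


def cockpit_html(state: str) -> str:
    """Wrap the cockpit in a container whose data-state drives animations."""
    safe_state = escape(state, quote=True)
    return f'<div class="cockpit" data-state="{safe_state}"></div>'


print(theme_to_css(LAB_THEME))
print(cockpit_html("launch"))
```

Reskinning then means swapping the token dictionary; the markup string could be served through a Gradio HTML component.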

	### 2. World And Persona Engine

	Technology:

	- Python orchestration.
	- Qwen3-class instruction LLM.
	- Structured JSON outputs.

	Responsibilities:

	- Generate a destination profile.
	- Generate visual motifs that map to frontend presets.
	- Generate an ordinary-person persona card.
	- Maintain tone, era/future context, character constraints, and safety.

	Important output fields:

	- destination year.
	- destination place.
	- destination mode: past, future, or strange.
	- atmosphere.
	- visual preset key.
	- visual motif list.
	- character name or local identifier.
	- character role/occupation.
	- immediate situation.
	- daily concern.
	- secret/fear/desire.
	- worldview constraints.
	- theory about the user's voice.
	- speaking style.
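
Structured output is easier to trust when it is validated before it reaches the UI. The sketch below covers only the destination half of the fields above; the field names are hypothetical spellings of those bullets, and the checks are a minimal assumption rather than a full schema.

```python
import json
from dataclasses import dataclass, fields

ALLOWED_MODES = {"past", "future", "strange"}


@dataclass
class Destination:
    # Hypothetical field names derived from the output fields listed above.
    year: int
    place: str
    mode: str
    atmosphere: str
    visual_preset_key: str
    visual_motifs: list[str]


def parse_destination(raw: str) -> Destination:
    """Parse LLM output and reject missing fields or unknown modes."""
    data = json.loads(raw)
    known = [f.name for f in fields(Destination)]
    missing = [name for name in known if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if data["mode"] not in ALLOWED_MODES:
        raise ValueError(f"unknown mode: {data['mode']!r}")
    # Extra keys from the model are dropped rather than trusted.
    return Destination(**{name: data[name] for name in known})
```

A persona card could follow the same pattern with its own dataclass.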

	### 3. Voice Input

	Primary STT:

	- NVIDIA Nemotron 3.5 ASR Streaming 0.6B.

	Why:

	- 600M parameters.
	- Designed for streaming ASR.
	- Supports multilingual transcription across many language-locales.
	- Supports punctuation and capitalization.
	- Has configurable chunk sizes from low-latency to higher-accuracy operation.
	- Better fit than clip-only ASR for a voice-first cockpit experience.

	Recommended MVP behavior:

	- Start with microphone clip transcription if Gradio streaming wiring takes too long.
	- Upgrade to streaming ASR once the basic loop works.
	- Use streaming partials in the UI as signal decoding text when available.

	Fallback STT:

	- NVIDIA Nemotron Speech Streaming English 0.6B for English-only simplicity.
	- Distil-Whisper if NeMo integration blocks progress.

	### 4. Conversation And Souvenir Engine

	Technology:

	- Same Qwen3-class instruction LLM used for world/persona generation.

	Responsibilities:

	- Respond in character.
	- Keep replies short enough for voice playback.
	- Preserve the persona's worldview and misunderstanding of the time signal.
	- Ask occasional questions back.
	- Generate a final temporal souvenir.

	Conversation prompt rules:

	- The character is an ordinary person.
	- The character should not know modern facts unless implied by the destination.
	- The character may interpret the user as a spirit, official, dream, machine voice, ancestor, descendant, customer, omen, or strange weather.
	- The response should contain sensory detail and personality.
	- Avoid turning real suffering into spectacle.
	- Prefer vivid, humane, surprising moments over encyclopedia-style history.
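
The prompt rules above could be assembled into a system prompt along these lines. This is a sketch: the dictionary keys (`name`, `role`, `voice_theory`, `place`, `year`) and the exact rule wording are assumptions, and the sentence limit is an illustrative way to keep replies short for voice playback.

```python
def build_system_prompt(persona: dict, destination: dict, max_sentences: int = 3) -> str:
    """Assemble an in-character system prompt from persona and destination."""
    rules = [
        "You are an ordinary person, not a famous figure.",
        "You do not know modern facts unless your world implies them.",
        f"You believe the voice you hear is {persona['voice_theory']}.",
        "Include sensory detail and personality.",
        "Never turn real suffering into spectacle.",
        f"Reply in at most {max_sentences} short sentences, suitable for speech.",
        "Occasionally ask the visitor a question.",
    ]
    header = (
        f"You are {persona['name']}, a {persona['role']} in "
        f"{destination['place']}, year {destination['year']}."
    )
    return header + "\n" + "\n".join(f"- {rule}" for rule in rules)


example = build_system_prompt(
    {"name": "Ilse", "role": "lamp keeper", "voice_theory": "an omen"},
    {"place": "a rainy lantern district", "year": 1642},
)
print(example)
```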

	Souvenir output:

	- Destination.
	- Contact.
	- Quote.
	- Artifact.
	- Stamp name.
	- Short encounter summary.

	### 5. Voice Output

	Primary TTS candidate:

	- Qwen3-TTS 1.7B VoiceDesign/CustomVoice.

	Why:

	- Best fit for theatrical character variety.
	- Supports natural-language voice control.
	- Supports emotion, prosody, timbre, and speaking-style instructions.
	- Supports streaming generation.
	- Supports custom voice and voice-design workflows.
	- Apache 2.0 license.
	- Lets the app create a more distinct voice per generated persona.

	Recommended Qwen3-TTS workflow:

	1. Generate persona.
	2. Generate a short voice design instruction from the persona.
	3. Use Qwen3-TTS VoiceDesign or CustomVoice to create the character voice.
	4. Cache/reuse that voice setup for the encounter.
	5. Generate each character line with short text and explicit emotion/prosody hints.
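
The workflow above can be sketched without calling any TTS library. The code below only models the data flow (derive a voice instruction, cache it per persona, attach it to each line); `VoiceProfile`, `voice_instruction`, `line_request`, and the persona keys are hypothetical, and no Qwen3-TTS API is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceProfile:
    persona_id: str
    instruction: str  # natural-language voice description


_voice_cache: dict[str, VoiceProfile] = {}


def voice_instruction(persona: dict) -> str:
    """Step 2: derive a short voice design instruction from the persona."""
    return f"{persona['speaking_style']} voice of a {persona['role']}"


def prepare_voice(persona: dict) -> VoiceProfile:
    """Steps 3 and 4: build the profile once, then reuse it for the encounter."""
    persona_id = persona["id"]
    if persona_id not in _voice_cache:
        _voice_cache[persona_id] = VoiceProfile(persona_id, voice_instruction(persona))
    return _voice_cache[persona_id]


def line_request(profile: VoiceProfile, text: str, emotion: str) -> dict:
    """Step 5: one short line plus an explicit emotion/prosody hint."""
    return {"voice": profile.instruction, "text": text, "emotion": emotion}


persona = {"id": "p1", "role": "ferry pilot", "speaking_style": "Gravelly, unhurried"}
profile = prepare_voice(persona)
print(line_request(profile, "Who is calling through the fog?", "wary"))
```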

	Fallback TTS:

	- NVIDIA Magpie TTS Multilingual 357M.

	Why:

	- NVIDIA-backed.
	- NeMo-compatible.
	- Small.
	- Commercial-ready.
	- Multiple speaker options and multilingual support.
	- Good reliability candidate if Qwen3-TTS integration is unstable.

	Emergency fallback:

	- Kokoro 82M.

	Why:

	- Tiny.
	- Apache 2.0.
	- Fast and inexpensive.
	- Good enough to keep the demo working if stronger TTS options fail.

	Do not make TTS a single hard dependency. Put it behind a small interface, for example:

	- `synthesize(text, voice_profile) -> audio_path`
	- `prepare_voice(persona) -> voice_profile`
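
A minimal sketch of that interface, with a silent stub so audio playback can be wired before any real model is integrated. `TTSAdapter` and `SilentStubTTS` are hypothetical names, the voice profile is modelled as a plain dictionary, and the per-character duration is an arbitrary stand-in.

```python
import tempfile
import wave
from typing import Protocol


class TTSAdapter(Protocol):
    def prepare_voice(self, persona: dict) -> dict: ...

    def synthesize(self, text: str, voice_profile: dict) -> str: ...


class SilentStubTTS:
    """Writes silence so the audio path exists before a real backend does."""

    sample_rate = 16000
    frames_per_char = 960  # 60 ms of silence per character at 16 kHz

    def prepare_voice(self, persona: dict) -> dict:
        return {"name": persona.get("name", "unknown"), "backend": "stub"}

    def synthesize(self, text: str, voice_profile: dict) -> str:
        n_frames = self.frames_per_char * max(len(text), 1)
        # Reserve a temp file name, then write a mono 16-bit PCM WAV to it.
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
            path = tmp.name
        with wave.open(path, "wb") as wav:
            wav.setnchannels(1)
            wav.setsampwidth(2)
            wav.setframerate(self.sample_rate)
            wav.writeframes(b"\x00\x00" * n_frames)
        return path


tts: TTSAdapter = SilentStubTTS()
audio_path = tts.synthesize("hi", tts.prepare_voice({"name": "Ilse"}))
print(audio_path)
```

Real backends would implement the same two methods, so the rest of the app never needs to know which model produced the file.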

	## TTS Comparison

	### Qwen3-TTS

	Decision: primary TTS bet.

	Strengths:

	- Best character fit.
	- VoiceDesign and CustomVoice match our generated-persona concept.
	- Natural-language control over voice style.
	- Streaming support.
	- Clean Apache 2.0 license.
	- 0.6B and 1.7B variants provide scaling options.

	Risks:

	- Newer stack.
	- May require FlashAttention 2 and GPU/runtime tuning.
	- Needs proof of latency and reliability in our deployment environment.

	### NVIDIA Magpie TTS 357M

	Decision: first fallback.

	Strengths:

	- Small and practical.
	- NVIDIA-backed.
	- Works well with the broader NVIDIA speech stack.
	- Multiple voices and nine languages.
	- Good voice-agent fit.

	Risks:

	- Less dynamic character control than Qwen3-TTS.
	- Fewer English voices than ideal for many different ordinary people.
	- NeMo dependency still needs validation in the Space.

	### Parler-TTS Mini

	Decision: optional fallback or comparison spike, not MVP default.

	Strengths:

	- Apache 2.0.
	- Prompt-controllable voice features.
	- 34 named speakers.
	- Simple conceptual model for voice descriptions.

	Risks:

	- English-only.
	- Older and less compelling than Qwen3-TTS for this project.
	- Around 0.9B params, larger than Kokoro and less flexible than Qwen3-TTS.

	### Kokoro 82M

	Decision: emergency fallback.

	Strengths:

	- Very small.
	- Fast.
	- Apache 2.0.
	- Easy to deploy and inexpensive to run.

	Risks:

	- Less theatrical.
	- Less expressive character control.
	- Better as a backup than the product-defining voice.

	### Voxtral TTS 4B

	Note: The user referred to this as Vostral; the model appears to be Mistral's Voxtral TTS.

	Decision: stretch spike, not MVP default.

	Strengths:

	- Expressive, low-latency voice-agent TTS.
	- 20 preset voices.
	- Voice adaptation from reference audio.
	- vLLM-Omni serving path.
	- Runs on a single GPU with at least 16GB VRAM.

	Risks:

	- CC BY-NC 4.0 license.
	- 4B params just for TTS.
	- Heavier runtime than Qwen3-TTS, Magpie, or Kokoro.
	- Better suited to a Modal experiment than the first Space implementation.

	## Stretch Technologies

	### AVTR-1 Realtime Avatar

	Decision: design the UI with an avatar-ready portal slot, but do not put AVTR-1 on the MVP critical path.

	Why it is compelling:

	- Real-time talking-head avatar.
	- Uses a portrait plus dual-stream audio.
	- Can render speaking and active listening behavior.
	- Could make the portal feel like a person is truly present.

	Why it is risky:

	- Gated model access.
	- Requires CUDA 12.x, TensorRT 10.x, and Ampere+ GPU.
	- Requires building TensorRT engines.
	- The Hugging Face model card reports that an L4 GPU is near the edge of real-time performance.
	- Licensing includes noncommercial pieces and consent-sensitive avatar/deepfake restrictions.
	- Parameter count needs verification before hackathon use.

	How to prepare for it:

	- Build the portal area as a replaceable component.
	- MVP should show stylized silhouette, portrait card, waveform, static, and environmental animation.
	- Later, AVTR-1 can replace or augment the portal visual.

	Recommended spike:

	- Run AVTR-1 separately on Modal.
	- Use only generated or clearly consent-safe reference portraits.
	- Confirm licensing and parameter count before integrating into the submitted app.

	### PersonaPlex 7B

	Decision: stretch experiment, not MVP.

	Why it is compelling:

	- Real-time spoken conversation.
	- Persona-conditioned speech-to-speech interaction.
	- Could make the app feel much more alive.

	Why it is risky:

	- Gated.
	- More complex runtime.
	- Likely requires a high-end GPU.
	- Could absorb time needed for cockpit polish, launch sequence, reliable voice loop, and demo flow.

	Recommended spike:

	- Build an isolated proof of concept after the MVP voice loop works.
	- Compare latency and character quality against modular STT + LLM + TTS.
	- Promote only if it clearly improves the demo without destabilizing delivery.

	### llama.cpp

	Decision: future stretch for the Llama Champion badge, not MVP.

	Rationale:

	- The project should first optimize for a polished GPU-backed demo.
	- GGUF/llama.cpp can be revisited after the app works end to end.

	## Frontend Strategy

	The frontend should feel like a cinematic ride plus magical radio.

	Default UI direction:

	- Strange laboratory.
	- Full or near-full cockpit.
	- Portal/windshield as the visual center.
	- Control panel with dials, meters, switches, and warning lights.
	- Signal/status text that feels like the machine is decoding a temporal contact.
	- Transcript/radio panel for voice clarity.

	Core states:

	- Dormant cockpit.
	- Launch charging.
	- Temporal tunnel.
	- Turbulence.
	- Landing.
	- Smoke clearing.
	- Destination reveal.
	- Signal lock.
	- Conversation.
	- Souvenir.
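
If Python owns the source-of-truth state, the core states above could be modelled as a small ordered machine. This sketch assumes the states run strictly in the order listed and that the souvenir screen loops back to dormant; both are assumptions, and the state keys are hypothetical spellings.

```python
STATE_ORDER = [
    "dormant", "launch", "tunnel", "turbulence", "landing",
    "smoke-clear", "destination-reveal", "signal-lock",
    "conversation", "souvenir",
]


def advance(current: str) -> str:
    """Return the next state; the souvenir screen loops back to dormant."""
    index = STATE_ORDER.index(current)  # raises ValueError on unknown state
    return STATE_ORDER[(index + 1) % len(STATE_ORDER)]


def run_sequence(start: str = "dormant", stop: str = "conversation") -> list[str]:
    """List every state visited from start through stop, inclusive."""
    STATE_ORDER.index(stop)  # validate up front to avoid an endless loop
    path = [start]
    while path[-1] != stop:
        path.append(advance(path[-1]))
    return path


print(run_sequence())
```

Each returned key could be written to the cockpit's `data-state` attribute, leaving the animation itself to CSS.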

	MVP visual approach:

	- Use CSS animations and small JavaScript state transitions.
	- Use stylized world presets instead of generated images.
	- Let the LLM emit visual motifs, then map them to known preset keys.
	- Keep generated image and avatar work as stretch.

	Example visual presets:

	- Rainy lantern district.
	- Future flooded market.
	- Medieval port.
	- Orbital repair bay.
	- Desert archive.
	- Underground signal room.
	- Snowbound radio station.
	- Solar eclipse observatory.
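
Mapping free-form motifs from the LLM onto known preset keys could be as simple as keyword overlap. The keyword sets, key spellings, and default below are illustrative assumptions; only the preset names come from the list above.

```python
# Hypothetical keywords per preset; tune these against real LLM output.
PRESET_KEYWORDS = {
    "rainy_lantern_district": {"rain", "lantern", "alley"},
    "future_flooded_market": {"flooded", "market", "boats"},
    "medieval_port": {"harbor", "ships", "port"},
    "orbital_repair_bay": {"orbit", "station", "repair"},
    "desert_archive": {"desert", "sand", "archive"},
    "underground_signal_room": {"tunnel", "cables", "underground"},
    "snowbound_radio_station": {"snow", "radio", "ice"},
    "solar_eclipse_observatory": {"eclipse", "telescope", "observatory"},
}
DEFAULT_PRESET = "underground_signal_room"


def pick_preset(motifs: list[str]) -> str:
    """Choose the preset whose keywords overlap the motif words the most."""
    words = {word for motif in motifs for word in motif.lower().split()}
    scores = {key: len(words & kws) for key, kws in PRESET_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # Fall back to a known preset when nothing matches at all.
    return best if scores[best] > 0 else DEFAULT_PRESET


print(pick_preset(["Snow on the radio mast", "frozen antenna"]))
```

Because the result is always a known key, the frontend never has to render an unknown motif.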

	## Build Order

	Follow this order unless the user changes priorities.

	1. Create Gradio app shell.
	2. Build strange laboratory cockpit layout.
	3. Add frontend state machine and launch/reveal animations.
	4. Add static mock destination/persona data to validate UI flow.
	5. Implement structured destination generation.
	6. Implement structured persona generation.
	7. Implement transcript-first conversation loop.
	8. Add souvenir generation.
	9. Add TTS interface with Kokoro or a stub so audio wiring exists early.
	10. Integrate Qwen3-TTS as primary TTS candidate.
	11. Add Magpie fallback if Qwen3-TTS is unstable.
	12. Add STT interface with clip transcription first.
	13. Integrate Nemotron 3.5 ASR Streaming.
	14. Add voice-first UX polish: partial transcript, signal decoding, playback states.
	15. Tune prompts, timing, and audio length for a crisp recorded demo.
	16. Deploy Qwen3-TTS and the other audio models on Modal, per the Hosting Strategy, and check end-to-end latency from the Space.
	17. Run stretch spikes only after the core encounter works end to end.

	## Required Interfaces

	Keep these boundaries clean so models can be swapped.

	### Destination Generator

	Input:

	- mode.
	- optional user coordinate prompt.
	- random seed.

	Output:

	- structured destination JSON.
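
Passing a seed makes a Surprise Me launch repeatable, which helps when rehearsing a recorded demo. The sketch below covers only the request side; `DestinationRequest` and `surprise_me` are hypothetical names, and the generator that would consume the request is not shown.

```python
import random
from dataclasses import dataclass
from typing import Optional

MODES = ("past", "future", "strange")


@dataclass(frozen=True)
class DestinationRequest:
    mode: str
    seed: int
    coordinate_prompt: Optional[str] = None


def surprise_me(seed: int) -> DestinationRequest:
    """Pick a mode from a seeded RNG so a given seed replays identically."""
    rng = random.Random(seed)
    return DestinationRequest(mode=rng.choice(MODES), seed=seed)


request = surprise_me(42)
print(request)
```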

	### Persona Generator

	Input:

	- destination JSON.

	Output:

	- structured persona JSON.

	### Conversation Engine

	Input:

	- destination JSON.
	- persona JSON.
	- conversation history.
	- latest user message.

	Output:

	- short in-character response text.
	- optional emotion/prosody hint for TTS.
	- optional UI signal event.

	### STT Adapter

	Input:

	- audio file or audio stream chunk.

	Output:

	- transcript text.
	- optional partial transcript.
	- optional confidence/timing metadata.
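
An adapter that yields partial transcripts followed by one final transcript would serve both clip and streaming modes. The sketch below fakes recognition by treating each chunk as already-decoded text; `Transcript` and `fake_streaming_stt` are hypothetical, and no NeMo or Nemotron API is assumed.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Transcript:
    text: str
    is_final: bool
    confidence: Optional[float] = None


def fake_streaming_stt(chunks: Iterable[str]) -> Iterator[Transcript]:
    """Stand-in recognizer: each chunk is already text, not audio."""
    words: list[str] = []
    for chunk in chunks:
        words.append(chunk)
        # Partials can feed the signal-decoding text in the UI.
        yield Transcript(" ".join(words), is_final=False)
    yield Transcript(" ".join(words), is_final=True)


for result in fake_streaming_stt(["who", "is", "there"]):
    print(result.is_final, result.text)
```

Clip transcription fits the same shape by yielding a single final result, so the UI code would not change when streaming is added.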

	### TTS Adapter

	Input:

	- response text.
	- persona voice profile.
	- emotion/prosody hint.

	Output:

	- audio file path or audio bytes.

	### Souvenir Generator

	Input:

	- destination JSON.
	- persona JSON.
	- conversation highlights.

	Output:

	- structured souvenir JSON.

	## Risks And Mitigations

	### Risk: Qwen3-TTS integration takes too long

	Mitigation:

	- Keep TTS behind an adapter.
	- Start with Kokoro or stub audio for wiring.
	- Keep Magpie as first serious fallback.
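
The fallback order could be expressed as a chain that tries each backend in turn. In this sketch the backends are plain callables with made-up names and behaviour; real adapters for Qwen3-TTS, Magpie, and Kokoro would sit behind the same call shape.

```python
from typing import Callable


def synthesize_with_fallback(
    text: str, backends: list[tuple[str, Callable[[str], str]]]
) -> tuple[str, str]:
    """Try each backend in order; return (backend_name, audio_path)."""
    errors = []
    for name, synthesize in backends:
        try:
            return name, synthesize(text)
        except Exception as exc:  # any backend failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all TTS backends failed: " + "; ".join(errors))


def flaky_primary(text: str) -> str:
    # Simulates an unstable primary backend for illustration.
    raise TimeoutError("primary timed out")


def steady_fallback(text: str) -> str:
    return f"/tmp/fallback_{len(text)}.wav"  # illustrative path only


chain = [("primary", flaky_primary), ("fallback", steady_fallback)]
print(synthesize_with_fallback("hello", chain))
```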

	### Risk: Nemotron streaming ASR is hard to wire through Gradio

	Mitigation:

	- Start with clip-based microphone transcription.
	- Add streaming partials after the main loop works.
	- Use partial transcript display as polish, not as a hard MVP dependency.

	### Risk: Hugging Face Space GPU is not enough

	Mitigation:

	- Use Qwen3-4B instead of Qwen3-8B.
	- Move TTS or LLM to Modal.
	- Avoid image generation and avatar generation in MVP.

	### Risk: Custom frontend becomes fragile

	Mitigation:

	- Keep JavaScript small.
	- Use explicit UI states.
	- Keep business logic in Python.
	- Avoid frontend-generated source-of-truth state.

	### Risk: Stretch tech distracts from the demo

	Mitigation:

	- AVTR-1, PersonaPlex, Voxtral, and llama.cpp are separate spikes.
	- Do not block MVP on gated models, TensorRT builds, or alternate runtimes.
	- Build the cockpit so these can be added later without rewiring the app.

	## Current Decisions To Preserve

	- Optimize for the most polished demo.
	- Use GPU if it helps quality.
	- Make the app voice-first.
	- Use Surprise Me as the default first action.
	- Start with a strange laboratory visual theme.
	- Make the visual theme easy to replace later.
	- Use Nemotron 3.5 ASR Streaming for preferred STT.
	- Use Qwen3-TTS as the primary TTS bet.
	- Keep Magpie and Kokoro as fallbacks.
	- Keep AVTR-1 and PersonaPlex as stretch experiments.
	- Keep llama.cpp as a later badge-oriented stretch.
	- Use Modal if it materially improves product quality or unlocks a hard component.

	## Open Questions For Implementation

	- ~~Which exact Qwen3 LLM checkpoint should be used first: 4B for reliability or 8B for quality?~~ DECIDED: Qwen3-8B via Together AI API.
	- ~~Can Qwen3-TTS run reliably inside the selected Hugging Face Space GPU?~~ DECIDED: Qwen3-TTS runs on Modal, not in HF Space.
	- ~~Should Qwen3-TTS run in the Space or on Modal?~~ DECIDED: Modal.
	- ~~What Space GPU tier should be used for the demo?~~ DECIDED: CPU tier is sufficient (no local model inference).
	- ~~How much true streaming does Gradio need in MVP versus polished clip-turn interaction?~~ DECIDED: Walk phase uses sync/text. Sprint phase adds streaming if time permits.
	- Which generated voice attributes should be allowed for safety and consistency? (Still open.)
	- ~~Should the MVP include multilingual character speech, or keep speech output English-first?~~ DECIDED: English-first for MVP.

	## References

	Local docs:

	- `docs/hackathon_details.md`
	- `docs/ai_time_machine_idea.md`

	Model and platform references:

	- Hugging Face Build Small Hackathon: https://huggingface.co/build-small-hackathon
	- Gradio Spaces documentation: https://huggingface.co/docs/hub/spaces-sdks-gradio
	- NVIDIA Nemotron Speech collection: https://huggingface.co/collections/nvidia/nemotron-speech
	- NVIDIA Nemotron 3.5 ASR Streaming 0.6B: https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b
	- NVIDIA Nemotron Speech Streaming English 0.6B: https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b
	- NVIDIA Magpie TTS Multilingual 357M: https://huggingface.co/nvidia/magpie_tts_multilingual_357m
	- Qwen3-TTS CustomVoice: https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
	- Qwen3-TTS VoiceDesign: https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
	- Kokoro 82M: https://huggingface.co/hexgrad/Kokoro-82M
	- Parler-TTS Mini v1: https://huggingface.co/parler-tts/parler-tts-mini-v1
	- Voxtral TTS 4B: https://huggingface.co/mistralai/Voxtral-4B-TTS-2603
	- AVTR-1 realtime avatar: https://huggingface.co/avaturn-live/avtr-1
	- NVIDIA PersonaPlex 7B: https://huggingface.co/nvidia/personaplex-7b-v1
	- Modal: https://modal.com