| # AI Time Machine Tech Stack Decision |
|
|
| Date: 2026-06-05 (Updated: 2026-06-09) |
|
|
| ## Purpose |
|
|
| This document records the current technical direction for the AI Time Machine hackathon project. It is intended to be handoff-ready for another coding agent. |
|
|
| The project is a surreal, voice-first time machine experience for the Hugging Face Build Small Hackathon, Track 2: An Adventure in Thousand Token Wood. The user should feel like they launch a strange laboratory machine, arrive at an impossible coordinate in the past or future, and speak aloud with an ordinary person from that world. |
|
|
| ## Primary Decision |
|
|
| Build a polished Gradio app hosted on a Hugging Face Space. Use a modular voice-agent architecture with custom frontend staging, streaming ASR, a small instruction LLM, character-oriented TTS, and a souvenir generator. |
|
|
| Main priority: the most polished recorded demo possible within hackathon constraints. |
|
|
| Secondary priority: keep the architecture modular so higher-risk voice/avatar experiments can be tried without endangering the MVP. |
|
|
| ## Hard Constraints |
|
|
| Hackathon constraints: |
|
|
| - The app must be a Gradio application. |
| - The submitted app must be hosted as a Hugging Face Space. |
| - Total AI model parameters must stay at or below 32B. |
| - Core functionality must use small models naturally suited to the task. |
| - The result must be a working, demonstrable product. |
|
|
| Product constraints: |
|
|
| - Voice-first experience. |
| - Default flow is Surprise Me. |
| - Visual theme starts as strange laboratory. |
| - The theme must be easy to reskin later. |
| - The project should connect the user to ordinary people, not famous figures. |
| - The app should prioritize theatrical believability over strict historical accuracy. |
|
|
| Budget context: |
|
|
| - Hugging Face credit: USD 20. |
| - Modal.com credit: USD 250. |
| - We do not need to spend the credits, but we should use them if they make the demo materially better. |
|
|
| ## Final MVP Stack |
|
|
| MVP target: |
|
|
| - App: Gradio Blocks. |
| - Host: Hugging Face Space (CPU tier; model inference runs on external services). |
| - Language: Python. |
| - Frontend: custom HTML/CSS/JavaScript embedded in Gradio. |
| - LLM: Qwen3-8B via Together AI API (structured JSON mode built-in). |
| - STT: NVIDIA Nemotron 3.5 ASR Streaming 0.6B. |
| - TTS primary candidate: Qwen3-TTS 1.7B VoiceDesign/CustomVoice. |
| - TTS fallback: NVIDIA Magpie TTS Multilingual 357M. |
| - Emergency TTS fallback: Kokoro 82M. |
| - State/output style: structured JSON between generation steps. |
|
|
| Approximate MVP parameter budget: |
|
|
| - LLM: 8B (Qwen3-8B), or 4B if downsized. |
| - STT: 0.6B. |
| - TTS primary: 1.7B. |
| - Total: roughly 10.3B with the 8B LLM, or 6.3B with a 4B LLM. |
|
|
| This is comfortably under the 32B cap and leaves room for experimentation. |
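
The budget arithmetic above can be checked with a few lines of Python. This is a minimal sketch: it counts only the primary path (LLM, STT, primary TTS), as the list above does, and the function and constant names are illustrative.

```python
# Parameter counts in millions, taken from the MVP stack listed above.
PARAM_CAP_M = 32_000  # hackathon cap: 32B total


def total_params_m(llm_m: int, stt_m: int = 600, tts_m: int = 1_700) -> int:
    """Sum the LLM, STT, and primary TTS parameter counts (in millions)."""
    return llm_m + stt_m + tts_m


for label, llm_m in (("Qwen3-4B", 4_000), ("Qwen3-8B", 8_000)):
    total = total_params_m(llm_m)
    headroom = PARAM_CAP_M - total
    print(f"{label}: total {total / 1000:.1f}B, headroom {headroom / 1000:.1f}B")
```

Running it reports a total of 6.3B with the 4B LLM and 10.3B with the 8B LLM, leaving 25.7B and 21.7B of headroom respectively.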
|
|
| ## Hosting Strategy |
|
|
| The required Gradio application lives on Hugging Face Spaces. The system is split across three services to keep dependencies clean and costs manageable: the Space serves the UI, and inference runs on two external providers. |
|
|
| Decided architecture: |
|
|
| - HF Spaces: Gradio UI (CPU tier is sufficient because no model inference runs locally). |
| - Together AI: LLM inference (Qwen3-8B with JSON mode). |
| - Modal: Audio models only (Nemotron STT, Qwen3-TTS). |
|
|
| Budget: |
|
|
| - $250 Modal credits available for audio GPU compute. |
| - $5 Together AI free credits for dev testing. |
|
|
| Key insight: separating LLM and audio dependencies avoids Python dependency conflicts that arise from bundling everything in one environment. |
|
|
| Tradeoff: |
|
|
| - Calling external services means the app is not local-only and should not pursue the Off the Grid badge. |
| - This is acceptable because polish matters more than bonus badges for this submission. |
|
|
| All external services remain behind narrow Python interfaces so they can be swapped. |
|
|
| ## Application Architecture |
|
|
| The app should be organized as five replaceable layers. |
|
|
| ### 1. Cockpit UI |
|
|
| Technology: |
|
|
| - Gradio Blocks. |
| - Custom HTML/CSS/JavaScript. |
| - CSS variables for theme tokens. |
| - Small JavaScript state machine for visual transitions. |
|
|
| Responsibilities: |
|
|
| - Strange laboratory cockpit. |
| - Large windshield/portal. |
| - Launch button. |
| - Status/signal panel. |
| - Voice controls. |
| - Transcript/radio panel. |
| - Souvenir display. |
| - Animated states: dormant, launch, tunnel, turbulence, landing, smoke clear, destination reveal, signal lock, conversation, souvenir. |
|
|
| Implementation guidance: |
|
|
| - The strange laboratory look should be a skin, not hard-coded into app logic. |
| - Use CSS variables for color, glow, material, portal palette, warning states, and typography. |
| - Use `data-state` attributes or equivalent simple state flags for animations. |
| - Do not make the default UI look like a basic Gradio form. |
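
One way to keep the skin separate from app logic is to hold theme tokens in a Python dict and render them as CSS variables, with the cockpit root carrying a `data-state` attribute. A minimal sketch follows; the token names, values, and the `cockpit` element id are hypothetical, and only the token categories come from the guidance above.

```python
from html import escape

# Hypothetical tokens for the strange laboratory skin.
LAB_THEME = {
    "color-base": "#0b1d17",
    "glow": "#7dffb0",
    "portal-palette-start": "#2a0f4d",
    "warning": "#ff5a36",
    "font-display": "'Courier New', monospace",
}


def theme_to_css(tokens: dict, selector: str = ":root") -> str:
    """Render a token dict as a block of CSS custom properties."""
    lines = [f"  --{name}: {value};" for name, value in tokens.items()]
    return selector + " {\n" + "\n".join(lines) + "\n}"


def cockpit_html(state: str, inner_html: str = "") -> str:
    """Wrap cockpit markup in a root element whose data-state drives animations."""
    return f'<div id="cockpit" data-state="{escape(state, quote=True)}">{inner_html}</div>'
```

The CSS string can be supplied through Gradio's custom CSS option and the markup through an HTML component. Reskinning then means swapping the dict, not touching the app logic.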
|
|
| ### 2. World And Persona Engine |
|
|
| Technology: |
|
|
| - Python orchestration. |
| - Qwen3-class instruction LLM. |
| - Structured JSON outputs. |
|
|
| Responsibilities: |
|
|
| - Generate a destination profile. |
| - Generate visual motifs that map to frontend presets. |
| - Generate an ordinary-person persona card. |
| - Maintain tone, era/future context, character constraints, and safety. |
|
|
| Important output fields: |
|
|
| - destination year. |
| - destination place. |
| - destination mode: past, future, or strange. |
| - atmosphere. |
| - visual preset key. |
| - visual motif list. |
| - character name or local identifier. |
| - character role/occupation. |
| - immediate situation. |
| - daily concern. |
| - secret/fear/desire. |
| - worldview constraints. |
| - theory about the user's voice. |
| - speaking style. |
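
The fields above could be carried as one typed record so that malformed LLM output fails fast instead of reaching the UI. The sketch below assumes snake_case key names and a single combined record; both are illustrative choices, not a schema this document fixes.

```python
import json
from dataclasses import dataclass, fields

ALLOWED_MODES = {"past", "future", "strange"}


@dataclass
class EncounterProfile:
    destination_year: str  # kept as text so impossible coordinates still fit
    destination_place: str
    destination_mode: str
    atmosphere: str
    visual_preset_key: str
    visual_motifs: list
    character_name: str
    character_role: str
    immediate_situation: str
    daily_concern: str
    secret: str
    worldview_constraints: str
    theory_about_voice: str
    speaking_style: str


def parse_profile(raw: str) -> EncounterProfile:
    """Parse LLM output into a profile, raising ValueError on bad or partial JSON."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    names = [f.name for f in fields(EncounterProfile)]
    missing = [name for name in names if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if data["destination_mode"] not in ALLOWED_MODES:
        raise ValueError(f"unknown mode: {data['destination_mode']!r}")
    # Extra keys from the model are dropped rather than treated as errors.
    return EncounterProfile(**{name: data[name] for name in names})
```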
|
|
| ### 3. Voice Input |
|
|
| Primary STT: |
|
|
| - NVIDIA Nemotron 3.5 ASR Streaming 0.6B. |
|
|
| Why: |
|
|
| - 600M parameters. |
| - Designed for streaming ASR. |
| - Supports multilingual transcription across many language-locales. |
| - Supports punctuation and capitalization. |
| - Has configurable chunk sizes from low-latency to higher-accuracy operation. |
| - Better fit than clip-only ASR for a voice-first cockpit experience. |
|
|
| Recommended MVP behavior: |
|
|
| - Start with microphone clip transcription if Gradio streaming wiring takes too long. |
| - Upgrade to streaming ASR once the basic loop works. |
| - Use streaming partials in the UI as signal decoding text when available. |
|
|
| Fallback STT: |
|
|
| - NVIDIA Nemotron Speech Streaming English 0.6B for English-only simplicity. |
| - Distil-Whisper if NeMo integration blocks progress. |
|
|
| ### 4. Conversation And Souvenir Engine |
|
|
| Technology: |
|
|
| - Same Qwen3-class instruction LLM used for world/persona generation. |
|
|
| Responsibilities: |
|
|
| - Respond in character. |
| - Keep replies short enough for voice playback. |
| - Preserve the persona's worldview and misunderstanding of the time signal. |
| - Ask occasional questions back. |
| - Generate a final temporal souvenir. |
|
|
| Conversation prompt rules: |
|
|
| - The character is an ordinary person. |
| - The character should not know modern facts unless implied by the destination. |
| - The character may interpret the user as a spirit, official, dream, machine voice, ancestor, descendant, customer, omen, or strange weather. |
| - The response should contain sensory detail and personality. |
| - Avoid turning real suffering into spectacle. |
| - Prefer vivid, humane, surprising moments over encyclopedia-style history. |
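
These rules can be compiled into a system prompt once per encounter. The sketch below is one possible assembly; the dict keys (`name`, `role`, `place`, `year`, `theory_about_voice`) and the exact wording are assumptions.

```python
RULES = [
    "You are an ordinary person, not a famous figure.",
    "You do not know modern facts unless your own world implies them.",
    "Include sensory detail and personality in what you say.",
    "Do not turn real suffering into spectacle.",
    "Prefer vivid, humane, surprising moments over encyclopedia-style history.",
    "Now and then, ask the voice a question of your own.",
]


def build_system_prompt(persona: dict, destination: dict, max_sentences: int = 3) -> str:
    """Assemble a system prompt from persona and destination records."""
    header = (
        f"You are {persona['name']}, {persona['role']}, "
        f"in {destination['place']}, {destination['year']}."
    )
    theory = f"You believe the voice you hear is {persona['theory_about_voice']}."
    length = (
        f"Reply in at most {max_sentences} short sentences; "
        "your words will be spoken aloud."
    )
    rule_lines = [f"- {rule}" for rule in RULES]
    return "\n".join([header, theory, length, "Rules:", *rule_lines])
```

Keeping the rules in a list makes it easy to tune one rule at a time while recording the demo.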
|
|
| Souvenir output: |
|
|
| - Destination. |
| - Contact. |
| - Quote. |
| - Artifact. |
| - Stamp name. |
| - Short encounter summary. |
|
|
| ### 5. Voice Output |
|
|
| Primary TTS candidate: |
|
|
| - Qwen3-TTS 1.7B VoiceDesign/CustomVoice. |
|
|
| Why: |
|
|
| - Best fit for theatrical character variety. |
| - Supports natural-language voice control. |
| - Supports emotion, prosody, timbre, and speaking-style instructions. |
| - Supports streaming generation. |
| - Supports custom voice and voice-design workflows. |
| - Apache 2.0 license. |
| - Lets the app create a more distinct voice per generated persona. |
|
|
| Recommended Qwen3-TTS workflow: |
|
|
| 1. Generate persona. |
| 2. Generate a short voice design instruction from the persona. |
| 3. Use Qwen3-TTS VoiceDesign or CustomVoice to create the character voice. |
| 4. Cache/reuse that voice setup for the encounter. |
| 5. Generate each character line with short text and explicit emotion/prosody hints. |
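
Steps 2 through 5 of the workflow can be sketched without touching the TTS model itself. The code below builds a voice design instruction from persona fields and caches it per encounter. It makes no Qwen3-TTS calls, and the persona keys, the `VoiceProfile` type, and the request dict shape are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceProfile:
    persona_id: str
    instruction: str


_voice_cache: dict = {}


def voice_instruction(persona: dict) -> str:
    """Step 2: turn persona fields into a short natural-language voice description."""
    return (
        f"The voice of {persona['role']}. "
        f"Speaking style: {persona['speaking_style']}."
    )


def prepare_voice(persona: dict) -> VoiceProfile:
    """Steps 3-4: create the profile once, then reuse it for the whole encounter."""
    key = persona["name"]
    if key not in _voice_cache:
        _voice_cache[key] = VoiceProfile(key, voice_instruction(persona))
    return _voice_cache[key]


def line_request(text: str, profile: VoiceProfile, emotion: str = "neutral") -> dict:
    """Step 5: pair one short line with the cached voice and an explicit emotion hint."""
    return {"text": text, "voice_instruction": profile.instruction, "emotion": emotion}
```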
|
|
| Fallback TTS: |
|
|
| - NVIDIA Magpie TTS Multilingual 357M. |
|
|
| Why: |
|
|
| - NVIDIA-backed. |
| - NeMo-compatible. |
| - Small. |
| - Commercial-ready. |
| - Multiple speaker options and multilingual support. |
| - Good reliability candidate if Qwen3-TTS integration is unstable. |
|
|
| Emergency fallback: |
|
|
| - Kokoro 82M. |
|
|
| Why: |
|
|
| - Tiny. |
| - Apache 2.0. |
| - Fast and inexpensive. |
| - Good enough to keep the demo working if stronger TTS options fail. |
|
|
| Do not make TTS a single hard dependency. Put it behind a small interface, for example: |
|
|
| - `synthesize(text, voice_profile) -> audio_path` |
| - `prepare_voice(persona) -> voice_profile` |
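
The two functions above, combined with the fallback order (Qwen3-TTS, then Magpie, then Kokoro), suggest a small protocol plus a chain that tries each backend in turn. The sketch below uses stub backends rather than real model calls; `TTSBackend`, `StubTTS`, and `FallbackTTS` are hypothetical names, and the `name` attribute and dict-based voice profile are assumptions.

```python
from typing import Protocol, Sequence


class TTSBackend(Protocol):
    name: str

    def prepare_voice(self, persona: dict) -> dict: ...

    def synthesize(self, text: str, voice_profile: dict) -> str: ...


class StubTTS:
    """Placeholder backend that writes no audio; stands in for a real model."""

    def __init__(self, name: str, fail: bool = False):
        self.name = name
        self.fail = fail

    def prepare_voice(self, persona: dict) -> dict:
        return {"backend": self.name, "style": persona.get("speaking_style", "")}

    def synthesize(self, text: str, voice_profile: dict) -> str:
        if self.fail:
            raise RuntimeError(f"{self.name} unavailable")
        return f"/tmp/{self.name}.wav"  # path only; nothing is written in this stub


class FallbackTTS:
    """Try each backend in priority order and return the first successful result."""

    def __init__(self, backends: Sequence[TTSBackend]):
        self.backends = list(backends)

    def synthesize(self, text: str, persona: dict) -> str:
        errors = []
        for backend in self.backends:
            try:
                profile = backend.prepare_voice(persona)
                return backend.synthesize(text, profile)
            except Exception as exc:  # any backend failure triggers the next one
                errors.append(f"{backend.name}: {exc}")
        raise RuntimeError("all TTS backends failed: " + "; ".join(errors))
```

With this shape, a failing primary backend degrades voice quality but does not stop the demo.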
|
|
| ## TTS Comparison |
|
|
| ### Qwen3-TTS |
|
|
| Decision: primary TTS bet. |
|
|
| Strengths: |
|
|
| - Best character fit. |
| - VoiceDesign and CustomVoice match our generated-persona concept. |
| - Natural-language control over voice style. |
| - Streaming support. |
| - Clean Apache 2.0 license. |
| - 0.6B and 1.7B variants provide scaling options. |
|
|
| Risks: |
|
|
| - Newer stack. |
| - May require FlashAttention 2 and GPU/runtime tuning. |
| - Needs proof of latency and reliability in our deployment environment. |
|
|
| ### NVIDIA Magpie TTS 357M |
|
|
| Decision: first fallback. |
|
|
| Strengths: |
|
|
| - Small and practical. |
| - NVIDIA-backed. |
| - Works well with the broader NVIDIA speech stack. |
| - Multiple voices and nine languages. |
| - Good voice-agent fit. |
|
|
| Risks: |
|
|
| - Less dynamic character control than Qwen3-TTS. |
| - Fewer English voices than ideal for many different ordinary people. |
| - NeMo dependency still needs validation in the Space. |
|
|
| ### Parler-TTS Mini |
|
|
| Decision: optional fallback or comparison spike, not MVP default. |
|
|
| Strengths: |
|
|
| - Apache 2.0. |
| - Prompt-controllable voice features. |
| - 34 named speakers. |
| - Simple conceptual model for voice descriptions. |
|
|
| Risks: |
|
|
| - English-only. |
| - Older and less compelling than Qwen3-TTS for this project. |
| - Around 0.9B params, larger than Kokoro and less flexible than Qwen3-TTS. |
|
|
| ### Kokoro 82M |
|
|
| Decision: emergency fallback. |
|
|
| Strengths: |
|
|
| - Very small. |
| - Fast. |
| - Apache 2.0. |
| - Easy to deploy and inexpensive to run. |
|
|
| Risks: |
|
|
| - Less theatrical. |
| - Less expressive character control. |
| - Better as a backup than the product-defining voice. |
|
|
| ### Voxtral TTS 4B |
|
|
| Note: The user referred to this as Vostral; the model appears to be Mistral's Voxtral TTS. |
|
|
| Decision: stretch spike, not MVP default. |
|
|
| Strengths: |
|
|
| - Expressive, low-latency voice-agent TTS. |
| - 20 preset voices. |
| - Voice adaptation from reference audio. |
| - vLLM-Omni serving path. |
| - Runs on a single GPU with at least 16GB VRAM. |
|
|
| Risks: |
|
|
| - CC BY-NC 4.0 license. |
| - 4B params just for TTS. |
| - Heavier runtime than Qwen3-TTS, Magpie, or Kokoro. |
| - Better suited to a Modal experiment than the first Space implementation. |
|
|
| ## Stretch Technologies |
|
|
| ### AVTR-1 Realtime Avatar |
|
|
| Decision: design the UI with an avatar-ready portal slot, but do not put AVTR-1 on the MVP critical path. |
|
|
| Why it is compelling: |
|
|
| - Real-time talking-head avatar. |
| - Uses a portrait plus dual-stream audio. |
| - Can render speaking and active listening behavior. |
| - Could make the portal feel like a person is truly present. |
|
|
| Why it is risky: |
|
|
| - Gated model access. |
| - Requires CUDA 12.x, TensorRT 10.x, and Ampere+ GPU. |
| - Requires building TensorRT engines. |
| - Hugging Face model card reports L4 near the edge of real-time performance. |
| - Licensing includes noncommercial pieces and consent-sensitive avatar/deepfake restrictions. |
| - Parameter count needs verification before hackathon use. |
|
|
| How to prepare for it: |
|
|
| - Build the portal area as a replaceable component. |
| - MVP should show stylized silhouette, portrait card, waveform, static, and environmental animation. |
| - Later, AVTR-1 can replace or augment the portal visual. |
|
|
| Recommended spike: |
|
|
| - Run AVTR-1 separately on Modal. |
| - Use only generated or clearly consent-safe reference portraits. |
| - Confirm licensing and parameter count before integrating into the submitted app. |
|
|
| ### PersonaPlex 7B |
|
|
| Decision: stretch experiment, not MVP. |
|
|
| Why it is compelling: |
|
|
| - Real-time spoken conversation. |
| - Persona-conditioned speech-to-speech interaction. |
| - Could make the app feel much more alive. |
|
|
| Why it is risky: |
|
|
| - Gated. |
| - More complex runtime. |
| - Likely high-end GPU expectations. |
| - Could absorb time needed for cockpit polish, launch sequence, reliable voice loop, and demo flow. |
|
|
| Recommended spike: |
|
|
| - Build an isolated proof of concept after the MVP voice loop works. |
| - Compare latency and character quality against modular STT + LLM + TTS. |
| - Promote only if it clearly improves the demo without destabilizing delivery. |
|
|
| ### llama.cpp |
|
|
| Decision: future stretch for the Llama Champion badge, not MVP. |
|
|
| Rationale: |
|
|
| - The project should first optimize for a polished GPU-backed demo. |
| - GGUF/llama.cpp can be revisited after the app works end to end. |
|
|
| ## Frontend Strategy |
|
|
| The frontend should feel like a cinematic ride plus magical radio. |
|
|
| Default UI direction: |
|
|
| - Strange laboratory. |
| - Full or near-full cockpit. |
| - Portal/windshield as the visual center. |
| - Control panel with dials, meters, switches, and warning lights. |
| - Signal/status text that feels like the machine is decoding a temporal contact. |
| - Transcript/radio panel for voice clarity. |
|
|
| Core states: |
|
|
| - Dormant cockpit. |
| - Launch charging. |
| - Temporal tunnel. |
| - Turbulence. |
| - Landing. |
| - Smoke clearing. |
| - Destination reveal. |
| - Signal lock. |
| - Conversation. |
| - Souvenir. |
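
The ten states can be modeled as an ordered enum so that Python remains the source of truth and the frontend only reflects the current value. A minimal sketch, assuming a strictly linear sequence that loops back to dormant after the souvenir; the enum and function names are illustrative.

```python
from enum import Enum


class CockpitState(str, Enum):
    DORMANT = "dormant"
    LAUNCH = "launch"
    TUNNEL = "tunnel"
    TURBULENCE = "turbulence"
    LANDING = "landing"
    SMOKE_CLEAR = "smoke_clear"
    DESTINATION_REVEAL = "destination_reveal"
    SIGNAL_LOCK = "signal_lock"
    CONVERSATION = "conversation"
    SOUVENIR = "souvenir"


# Enum iteration follows definition order, which is the ride order here.
ORDER = list(CockpitState)


def next_state(current: CockpitState) -> CockpitState:
    """Advance one step, wrapping from souvenir back to dormant."""
    return ORDER[(ORDER.index(current) + 1) % len(ORDER)]
```

The `.value` of each member is a plain string, so it can be written directly into a `data-state` attribute.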
|
|
| MVP visual approach: |
|
|
| - Use CSS animations and small JavaScript state transitions. |
| - Use stylized world presets instead of generated images. |
| - Let the LLM emit visual motifs, then map them to known preset keys. |
| - Keep generated image and avatar work as stretch. |
|
|
| Example visual presets: |
|
|
| - Rainy lantern district. |
| - Future flooded market. |
| - Medieval port. |
| - Orbital repair bay. |
| - Desert archive. |
| - Underground signal room. |
| - Snowbound radio station. |
| - Solar eclipse observatory. |
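
Mapping LLM-emitted motifs onto known presets can be a simple keyword overlap with a safe default. The sketch below assumes snake_case preset keys derived from the example names above; the keyword sets and the default choice are invented for illustration.

```python
from typing import Iterable, Optional

# Hypothetical keywords per preset; tune these against real LLM output.
PRESET_KEYWORDS = {
    "rainy_lantern_district": {"rain", "lantern", "alley"},
    "future_flooded_market": {"flood", "market", "canal"},
    "medieval_port": {"harbor", "ship", "port"},
    "orbital_repair_bay": {"orbit", "station", "vacuum"},
    "desert_archive": {"sand", "archive", "dune"},
    "underground_signal_room": {"tunnel", "cable", "bunker"},
    "snowbound_radio_station": {"snow", "radio", "ice"},
    "solar_eclipse_observatory": {"eclipse", "telescope", "observatory"},
}
DEFAULT_PRESET = "underground_signal_room"


def pick_preset(motifs: Iterable[str], requested_key: Optional[str] = None) -> str:
    """Honor a valid key from the LLM; otherwise score presets by keyword overlap."""
    if requested_key in PRESET_KEYWORDS:
        return requested_key
    words = {word for motif in motifs for word in motif.lower().split()}
    best, best_score = DEFAULT_PRESET, 0
    for key, keywords in PRESET_KEYWORDS.items():
        score = len(words & keywords)
        if score > best_score:
            best, best_score = key, score
    return best
```

Because unknown keys and unmatched motifs both fall through to the default, the frontend never receives a preset it cannot render.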
|
|
| ## Build Order |
|
|
| Follow this order unless the user changes priorities. |
|
|
| 1. Create Gradio app shell. |
| 2. Build strange laboratory cockpit layout. |
| 3. Add frontend state machine and launch/reveal animations. |
| 4. Add static mock destination/persona data to validate UI flow. |
| 5. Implement structured destination generation. |
| 6. Implement structured persona generation. |
| 7. Implement transcript-first conversation loop. |
| 8. Add souvenir generation. |
| 9. Add TTS interface with Kokoro or a stub so audio wiring exists early. |
| 10. Integrate Qwen3-TTS as primary TTS candidate. |
| 11. Add Magpie fallback if Qwen3-TTS is unstable. |
| 12. Add STT interface with clip transcription first. |
| 13. Integrate Nemotron 3.5 ASR Streaming. |
| 14. Add voice-first UX polish: partial transcript, signal decoding, playback states. |
| 15. Tune prompts, timing, and audio length for a crisp recorded demo. |
| 16. Keep the audio models (Nemotron STT, Qwen3-TTS) on Modal as decided in Hosting Strategy, and offload any other heavy component there if Space performance is weak. |
| 17. Run stretch spikes only after the core encounter works end to end. |
|
|
| ## Required Interfaces |
|
|
| Keep these boundaries clean so models can be swapped. |
|
|
| ### Destination Generator |
|
|
| Input: |
|
|
| - mode. |
| - optional user coordinate prompt. |
| - random seed. |
|
|
| Output: |
|
|
| - structured destination JSON. |
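
A destination generator matching this contract can take the LLM as an injected callable, which keeps the hosted LLM client out of the interface and makes the function testable offline. The sketch below is an outline under assumptions: the key names, prompt wording, and retry count are illustrative, and `complete` stands for any function that sends a prompt and returns raw text.

```python
import json
import random
from typing import Callable, Optional

MODES = ("past", "future", "strange")
REQUIRED_KEYS = ("year", "place", "mode", "atmosphere")


def generate_destination(
    complete: Callable[[str], str],
    mode: Optional[str] = None,
    coordinate_prompt: Optional[str] = None,
    seed: Optional[int] = None,
    max_attempts: int = 3,
) -> dict:
    """Ask the LLM for a destination and retry until the JSON has the required keys."""
    rng = random.Random(seed)
    # With no mode given, pick one at random (the Surprise Me path).
    chosen_mode = mode if mode in MODES else rng.choice(MODES)
    prompt = (
        f"Invent a {chosen_mode} destination for a time machine. "
        f"Variation seed: {rng.randrange(10**6)}. "
        + (f"Traveler's hint: {coordinate_prompt}. " if coordinate_prompt else "")
        + "Answer with one JSON object containing the keys: "
        + ", ".join(REQUIRED_KEYS)
        + "."
    )
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        raw = complete(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc
            continue
        if isinstance(data, dict) and all(key in data for key in REQUIRED_KEYS):
            return data
        last_error = ValueError("missing required keys")
    raise ValueError(f"no valid destination after {max_attempts} attempts: {last_error}")
```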
|
|
| ### Persona Generator |
|
|
| Input: |
|
|
| - destination JSON. |
|
|
| Output: |
|
|
| - structured persona JSON. |
|
|
| ### Conversation Engine |
|
|
| Input: |
|
|
| - destination JSON. |
| - persona JSON. |
| - conversation history. |
| - latest user message. |
|
|
| Output: |
|
|
| - short in-character response text. |
| - optional emotion/prosody hint for TTS. |
| - optional UI signal event. |
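
The three outputs can travel as one small record parsed from the LLM's JSON reply. The sketch below assumes keys named `text`, `prosody_hint`, and `ui_event`, falls back to treating non-JSON output as plain speech, and trims long replies at a sentence boundary; the 280-character limit is an arbitrary placeholder.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class CharacterReply:
    text: str
    prosody_hint: Optional[str] = None
    ui_event: Optional[str] = None


def parse_reply(raw: str, max_chars: int = 280) -> CharacterReply:
    """Parse one LLM turn, tolerating plain-text output and over-long replies."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {"text": raw}  # not JSON: treat the whole output as speech
    if not isinstance(data, dict):
        data = {"text": str(data)}
    text = str(data.get("text", "")).strip()
    if len(text) > max_chars:
        cut = text[:max_chars]
        # Prefer ending on the last complete sentence inside the limit.
        end = max(cut.rfind("."), cut.rfind("!"), cut.rfind("?"))
        text = cut[: end + 1] if end > 0 else cut.rstrip()
    return CharacterReply(text, data.get("prosody_hint"), data.get("ui_event"))
```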
|
|
| ### STT Adapter |
|
|
| Input: |
|
|
| - audio file or audio stream chunk. |
|
|
| Output: |
|
|
| - transcript text. |
| - optional partial transcript. |
| - optional confidence/timing metadata. |
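
An adapter covering both the clip-first path and later streaming partials might look like the following. `Transcript`, `STTAdapter`, and the stub `EchoSTT` are hypothetical; the stub performs no recognition and exists only so UI wiring can be exercised before a real model is attached.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Protocol


@dataclass
class Transcript:
    text: str
    is_partial: bool = False
    confidence: Optional[float] = None


class STTAdapter(Protocol):
    def transcribe_clip(self, audio_path: str) -> Transcript: ...

    def transcribe_stream(self, chunks: Iterable[bytes]) -> Iterator[Transcript]: ...


class EchoSTT:
    """Stub for wiring tests: pretends each audio chunk decodes to one scripted word."""

    def __init__(self, words: Iterable[str]):
        self.words = list(words)

    def transcribe_clip(self, audio_path: str) -> Transcript:
        return Transcript(" ".join(self.words))

    def transcribe_stream(self, chunks: Iterable[bytes]) -> Iterator[Transcript]:
        heard = []
        for word, _chunk in zip(self.words, chunks):
            heard.append(word)
            yield Transcript(" ".join(heard), is_partial=True)
        # The final result repeats the text with is_partial cleared.
        yield Transcript(" ".join(heard), is_partial=False)
```

Partial results can feed the signal decoding text in the UI, while only the final transcript is passed to the conversation engine.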
|
|
| ### TTS Adapter |
|
|
| Input: |
|
|
| - response text. |
| - persona voice profile. |
| - emotion/prosody hint. |
|
|
| Output: |
|
|
| - audio file path or audio bytes. |
|
|
| ### Souvenir Generator |
|
|
| Input: |
|
|
| - destination JSON. |
| - persona JSON. |
| - conversation highlights. |
|
|
| Output: |
|
|
| - structured souvenir JSON. |
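
The conversation highlights input needs some selection step. One stand-in, sketched below, keeps the longest character lines; the heuristic, the `speaker`/`text` turn format, and the function names are assumptions, and a later version might let the LLM choose instead.

```python
def pick_highlights(history: list, limit: int = 3) -> list:
    """Return up to `limit` character lines, longest first (a stand-in heuristic)."""
    lines = [turn["text"] for turn in history if turn.get("speaker") == "character"]
    return sorted(lines, key=len, reverse=True)[:limit]


def souvenir_request(destination: dict, persona: dict, history: list) -> dict:
    """Bundle the three inputs the souvenir generator expects."""
    return {
        "destination": destination,
        "persona": persona,
        "highlights": pick_highlights(history),
    }
```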
|
|
| ## Risks And Mitigations |
|
|
| ### Risk: Qwen3-TTS integration takes too long |
|
|
| Mitigation: |
|
|
| - Keep TTS behind an adapter. |
| - Start with Kokoro or stub audio for wiring. |
| - Keep Magpie as first serious fallback. |
|
|
| ### Risk: Nemotron streaming ASR is hard to wire through Gradio |
|
|
| Mitigation: |
|
|
| - Start with clip-based microphone transcription. |
| - Add streaming partials after the main loop works. |
| - Use partial transcript display as polish, not as a hard MVP dependency. |
|
|
| ### Risk: Hugging Face Space GPU is not enough |
|
|
| Mitigation: |
|
|
| - Use Qwen3-4B instead of Qwen3-8B. |
| - Move TTS or LLM to Modal. |
| - Avoid image generation and avatar generation in MVP. |
|
|
| ### Risk: Custom frontend becomes fragile |
|
|
| Mitigation: |
|
|
| - Keep JavaScript small. |
| - Use explicit UI states. |
| - Keep business logic in Python. |
| - Avoid frontend-generated source-of-truth state. |
|
|
| ### Risk: Stretch tech distracts from the demo |
|
|
| Mitigation: |
|
|
| - AVTR-1, PersonaPlex, Voxtral, and llama.cpp are separate spikes. |
| - Do not block MVP on gated models, TensorRT builds, or alternate runtimes. |
| - Build the cockpit so these can be added later without rewiring the app. |
|
|
| ## Current Decisions To Preserve |
|
|
| - Optimize for the most polished demo. |
| - Use GPU if it helps quality. |
| - Make the app voice-first. |
| - Use Surprise Me as the default first action. |
| - Start with a strange laboratory visual theme. |
| - Make the visual theme easy to replace later. |
| - Use Nemotron 3.5 ASR Streaming for preferred STT. |
| - Use Qwen3-TTS as the primary TTS bet. |
| - Keep Magpie and Kokoro as fallbacks. |
| - Keep AVTR-1 and PersonaPlex as stretch experiments. |
| - Keep llama.cpp as a later badge-oriented stretch. |
| - Use Modal if it materially improves product quality or unlocks a hard component. |
|
|
| ## Open Questions For Implementation |
|
|
| - ~~Which exact Qwen3 LLM checkpoint should be used first: 4B for reliability or 8B for quality?~~ DECIDED: Qwen3-8B via Together AI API. |
| - ~~Can Qwen3-TTS run reliably inside the selected Hugging Face Space GPU?~~ DECIDED: Qwen3-TTS runs on Modal, not in HF Space. |
| - ~~Should Qwen3-TTS run in the Space or on Modal?~~ DECIDED: Modal. |
| - ~~What Space GPU tier should be used for the demo?~~ DECIDED: CPU tier is sufficient (no local model inference). |
| - ~~How much true streaming does Gradio need in MVP versus polished clip-turn interaction?~~ DECIDED: Walk phase uses sync/text. Sprint phase adds streaming if time permits. |
| - Which generated voice attributes should be allowed for safety and consistency? (Still open.) |
| - ~~Should the MVP include multilingual character speech, or keep speech output English-first?~~ DECIDED: English-first for MVP. |
|
|
| ## References |
|
|
| Local docs: |
|
|
| - `docs/hackathon_details.md` |
| - `docs/ai_time_machine_idea.md` |
|
|
| Model and platform references: |
|
|
| - Hugging Face Build Small Hackathon: https://huggingface.co/build-small-hackathon |
| - Gradio Spaces documentation: https://huggingface.co/docs/hub/spaces-sdks-gradio |
| - NVIDIA Nemotron Speech collection: https://huggingface.co/collections/nvidia/nemotron-speech |
| - NVIDIA Nemotron 3.5 ASR Streaming 0.6B: https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b |
| - NVIDIA Nemotron Speech Streaming English 0.6B: https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b |
| - NVIDIA Magpie TTS Multilingual 357M: https://huggingface.co/nvidia/magpie_tts_multilingual_357m |
| - Qwen3-TTS CustomVoice: https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice |
| - Qwen3-TTS VoiceDesign: https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign |
| - Kokoro 82M: https://huggingface.co/hexgrad/Kokoro-82M |
| - Parler-TTS Mini v1: https://huggingface.co/parler-tts/parler-tts-mini-v1 |
| - Voxtral TTS 4B: https://huggingface.co/mistralai/Voxtral-4B-TTS-2603 |
| - AVTR-1 realtime avatar: https://huggingface.co/avaturn-live/avtr-1 |
| - NVIDIA PersonaPlex 7B: https://huggingface.co/nvidia/personaplex-7b-v1 |
| - Modal: https://modal.com |
| |