# AI Time Machine Tech Stack Decision

Date: 2026-06-05 (Updated: 2026-06-09)

## Purpose

This document records the current technical direction for the AI Time Machine hackathon project. It is intended to be handoff-ready for another coding agent.

The project is a surreal, voice-first time machine experience for the Hugging Face Build Small Hackathon, Track 2: An Adventure in Thousand Token Wood. The user should feel like they launch a strange laboratory machine, arrive at an impossible coordinate in the past or future, and speak aloud with an ordinary person from that world.

## Primary Decision

Build a polished Gradio app hosted on a Hugging Face Space. Use a modular voice-agent architecture with custom frontend staging, streaming ASR, a small instruction LLM, character-oriented TTS, and a souvenir generator.

Main priority: the most polished recorded demo possible within hackathon constraints.

Secondary priority: keep the architecture modular so higher-risk voice/avatar experiments can be tried without endangering the MVP.

## Hard Constraints

Hackathon constraints:

- The app must be a Gradio application.
- The submitted app must be hosted as a Hugging Face Space.
- Total AI model parameters must stay at or below 32B.
- Core functionality must use small models naturally suited to the task.
- The result must be a working, demonstrable product.

Product constraints:

- Voice-first experience.
- Default flow is Surprise Me.
- Visual theme starts as strange laboratory.
- The theme must be easy to reskin later.
- The project should connect the user to ordinary people, not famous figures.
- The app should prioritize theatrical believability over strict historical accuracy.

Budget context:

- Hugging Face credit: USD 20.
- Modal.com credit: USD 250.
- We do not need to spend the credits, but we should use them if they make the demo materially better.

## Final MVP Stack

MVP target:

- App: Gradio Blocks.
- Host: Hugging Face Space (CPU tier; model inference runs on external services, see Hosting Strategy).
- Language: Python.
- Frontend: custom HTML/CSS/JavaScript embedded in Gradio.
- LLM: Qwen3-8B via Together AI API (structured JSON mode built-in).
- STT: NVIDIA Nemotron 3.5 ASR Streaming 0.6B.
- TTS primary candidate: Qwen3-TTS 1.7B VoiceDesign/CustomVoice.
- TTS fallback: NVIDIA Magpie TTS Multilingual 357M.
- Emergency TTS fallback: Kokoro 82M.
- State/output style: structured JSON between generation steps.

Approximate MVP parameter budget:

- LLM: 4B to 8B.
- STT: 0.6B.
- TTS primary: 1.7B.
- Total: roughly 6.3B to 10.3B.

This is comfortably under the 32B cap and leaves room for experimentation.

## Hosting Strategy

The required Gradio application lives on Hugging Face Spaces. Inference is split across three services to keep dependencies clean and costs manageable.

Decided architecture:

- HF Spaces: Gradio UI (CPU tier is sufficient; no local model inference).
- Together AI: LLM inference (Qwen3-8B with JSON mode).
- Modal: Audio models only (Nemotron STT, Qwen3-TTS).

Budget:

- $250 Modal credits available for audio GPU compute.
- $5 Together AI free credits for dev testing.

Key insight: separating LLM and audio dependencies avoids Python dependency conflicts that arise from bundling everything in one environment.

Tradeoff:

- Calling external services means the app is not local-only and should not pursue the Off the Grid badge.
- This is acceptable because polish matters more than bonus badges for this submission.

All external services remain behind narrow Python interfaces so they can be swapped.

## Application Architecture

The app should be organized as five replaceable layers.

### 1. Cockpit UI

Technology:

- Gradio Blocks.
- Custom HTML/CSS/JavaScript.
- CSS variables for theme tokens.
- Small JavaScript state machine for visual transitions.

Responsibilities:

- Strange laboratory cockpit.
- Large windshield/portal.
- Launch button.
- Status/signal panel.
- Voice controls.
- Transcript/radio panel.
- Souvenir display.
- Animated states: dormant, launch, tunnel, turbulence, landing, smoke clear, destination reveal, signal lock, conversation, souvenir.

Implementation guidance:

- The strange laboratory look should be a skin, not hard-coded into app logic.
- Use CSS variables for color, glow, material, portal palette, warning states, and typography.
- Use `data-state` attributes or equivalent simple state flags for animations.
- Do not make the default UI look like a basic Gradio form.

### 2. World And Persona Engine

Technology:

- Python orchestration.
- Qwen3-class instruction LLM.
- Structured JSON outputs.

Responsibilities:

- Generate a destination profile.
- Generate visual motifs that map to frontend presets.
- Generate an ordinary-person persona card.
- Maintain tone, era/future context, character constraints, and safety.

Important output fields:

- destination year.
- destination place.
- destination mode: past, future, or strange.
- atmosphere.
- visual preset key.
- visual motif list.
- character name or local identifier.
- character role/occupation.
- immediate situation.
- daily concern.
- secret/fear/desire.
- worldview constraints.
- theory about the user's voice.
- speaking style.

### 3. Voice Input

Primary STT:

- NVIDIA Nemotron 3.5 ASR Streaming 0.6B.

Why:

- 600M parameters.
- Designed for streaming ASR.
- Supports multilingual transcription across many language-locales.
- Supports punctuation and capitalization.
- Has configurable chunk sizes from low-latency to higher-accuracy operation.
- Better fit than clip-only ASR for a voice-first cockpit experience.

Recommended MVP behavior:

- Start with microphone clip transcription if Gradio streaming wiring takes too long.
- Upgrade to streaming ASR once the basic loop works.
- Use streaming partials in the UI as signal decoding text when available.

Fallback STT:

- NVIDIA Nemotron Speech Streaming English 0.6B for English-only simplicity.
- Distil-Whisper if NeMo integration blocks progress.

### 4. Conversation And Souvenir Engine

Technology:

- Same Qwen3-class instruction LLM used for world/persona generation.

Responsibilities:

- Respond in character.
- Keep replies short enough for voice playback.
- Preserve the persona's worldview and misunderstanding of the time signal.
- Ask occasional questions back.
- Generate a final temporal souvenir.

Conversation prompt rules:

- The character is an ordinary person.
- The character should not know modern facts unless implied by the destination.
- The character may interpret the user as a spirit, official, dream, machine voice, ancestor, descendant, customer, omen, or strange weather.
- The response should contain sensory detail and personality.
- Avoid turning real suffering into spectacle.
- Prefer vivid, humane, surprising moments over encyclopedia-style history.

Souvenir output:

- Destination.
- Contact.
- Quote.
- Artifact.
- Stamp name.
- Short encounter summary.

### 5. Voice Output

Primary TTS candidate:

- Qwen3-TTS 1.7B VoiceDesign/CustomVoice.

Why:

- Best fit for theatrical character variety.
- Supports natural-language voice control.
- Supports emotion, prosody, timbre, and speaking-style instructions.
- Supports streaming generation.
- Supports custom voice and voice-design workflows.
- Apache 2.0 license.
- Lets the app create a more distinct voice per generated persona.

Recommended Qwen3-TTS workflow:

1. Generate persona.
2. Generate a short voice design instruction from the persona.
3. Use Qwen3-TTS VoiceDesign or CustomVoice to create the character voice.
4. Cache/reuse that voice setup for the encounter.
5. Generate each character line with short text and explicit emotion/prosody hints.

Fallback TTS:

- NVIDIA Magpie TTS Multilingual 357M.

Why:

- NVIDIA-backed.
- NeMo-compatible.
- Small.
- Commercial-ready.
- Multiple speaker options and multilingual support.
- Good reliability candidate if Qwen3-TTS integration is unstable.

Emergency fallback:

- Kokoro 82M.

Why:

- Tiny.
- Apache 2.0.
- Fast and inexpensive.
- Good enough to keep the demo working if stronger TTS options fail.

Do not make TTS a single hard dependency. Put it behind a small interface, for example:

- `synthesize(text, voice_profile) -> audio_path`
- `prepare_voice(persona) -> voice_profile`

## TTS Comparison

### Qwen3-TTS

Decision: primary TTS bet.

Strengths:

- Best character fit.
- VoiceDesign and CustomVoice match our generated-persona concept.
- Natural-language control over voice style.
- Streaming support.
- Clean Apache 2.0 license.
- 0.6B and 1.7B variants provide scaling options.

Risks:

- Newer stack.
- May require FlashAttention 2 and GPU/runtime tuning.
- Needs proof of latency and reliability in our deployment environment.

### NVIDIA Magpie TTS 357M

Decision: first fallback.

Strengths:

- Small and practical.
- NVIDIA-backed.
- Works well with the broader NVIDIA speech stack.
- Multiple voices and nine languages.
- Good voice-agent fit.

Risks:

- Less dynamic character control than Qwen3-TTS.
- Fewer English voices than ideal for many different ordinary people.
- NeMo dependency still needs validation in the Space.

### Parler-TTS Mini

Decision: optional fallback or comparison spike, not MVP default.

Strengths:

- Apache 2.0.
- Prompt-controllable voice features.
- 34 named speakers.
- Simple conceptual model for voice descriptions.

Risks:

- English-only.
- Older and less compelling than Qwen3-TTS for this project.
- Around 0.9B params, larger than Kokoro and less flexible than Qwen3-TTS.

### Kokoro 82M

Decision: emergency fallback.

Strengths:

- Very small.
- Fast.
- Apache 2.0.
- Easy to deploy and inexpensive to run.

Risks:

- Less theatrical.
- Less expressive character control.
- Better as a backup than the product-defining voice.

### Voxtral TTS 4B

Note: The user referred to this as Vostral; the model appears to be Mistral's Voxtral TTS.

Decision: stretch spike, not MVP default.

Strengths:

- Expressive, low-latency voice-agent TTS.
- 20 preset voices.
- Voice adaptation from reference audio.
- vLLM-Omni serving path.
- Runs on a single GPU with at least 16GB VRAM.

Risks:

- CC BY-NC 4.0 license.
- 4B params just for TTS.
- Heavier runtime than Qwen3-TTS, Magpie, or Kokoro.
- Better suited to a Modal experiment than the first Space implementation.

## Stretch Technologies

### AVTR-1 Realtime Avatar

Decision: design the UI with an avatar-ready portal slot, but do not put AVTR-1 on the MVP critical path.

Why it is compelling:

- Real-time talking-head avatar.
- Uses a portrait plus dual-stream audio.
- Can render speaking and active listening behavior.
- Could make the portal feel like a person is truly present.

Why it is risky:

- Gated model access.
- Requires CUDA 12.x, TensorRT 10.x, and Ampere+ GPU.
- Requires building TensorRT engines.
- Hugging Face model card reports L4 near the edge of real-time performance.
- Licensing includes noncommercial pieces and consent-sensitive avatar/deepfake restrictions.
- Parameter count needs verification before hackathon use.

How to prepare for it:

- Build the portal area as a replaceable component.
- MVP should show stylized silhouette, portrait card, waveform, static, and environmental animation.
- Later, AVTR-1 can replace or augment the portal visual.

Recommended spike:

- Run AVTR-1 separately on Modal.
- Use only generated or clearly consent-safe reference portraits.
- Confirm licensing and parameter count before integrating into the submitted app.

### PersonaPlex 7B

Decision: stretch experiment, not MVP.

Why it is compelling:

- Real-time spoken conversation.
- Persona-conditioned speech-to-speech interaction.
- Could make the app feel much more alive.

Why it is risky:

- Gated.
- More complex runtime.
- Likely high-end GPU expectations.
- Could absorb time needed for cockpit polish, launch sequence, reliable voice loop, and demo flow.

Recommended spike:

- Build an isolated proof of concept after the MVP voice loop works.
- Compare latency and character quality against modular STT + LLM + TTS.
- Promote only if it clearly improves the demo without destabilizing delivery.

### llama.cpp

Decision: future stretch for the Llama Champion badge, not MVP.

Rationale:

- The project should first optimize for a polished GPU-backed demo.
- GGUF/llama.cpp can be revisited after the app works end to end.

## Frontend Strategy

The frontend should feel like a cinematic ride plus magical radio.

Default UI direction:

- Strange laboratory.
- Full or near-full cockpit.
- Portal/windshield as the visual center.
- Control panel with dials, meters, switches, and warning lights.
- Signal/status text that feels like the machine is decoding a temporal contact.
- Transcript/radio panel for voice clarity.

Core states:

- Dormant cockpit.
- Launch charging.
- Temporal tunnel.
- Turbulence.
- Landing.
- Smoke clearing.
- Destination reveal.
- Signal lock.
- Conversation.
- Souvenir.

MVP visual approach:

- Use CSS animations and small JavaScript state transitions.
- Use stylized world presets instead of generated images.
- Let the LLM emit visual motifs, then map them to known preset keys.
- Keep generated image and avatar work as stretch.

Example visual presets:

- Rainy lantern district.
- Future flooded market.
- Medieval port.
- Orbital repair bay.
- Desert archive.
- Underground signal room.
- Snowbound radio station.
- Solar eclipse observatory.

## Build Order

Follow this order unless the user changes priorities.

1. Create Gradio app shell.
2. Build strange laboratory cockpit layout.
3. Add frontend state machine and launch/reveal animations.
4. Add static mock destination/persona data to validate UI flow.
5. Implement structured destination generation.
6. Implement structured persona generation.
7. Implement transcript-first conversation loop.
8. Add souvenir generation.
9. Add TTS interface with Kokoro or a stub so audio wiring exists early.
10. Integrate Qwen3-TTS as primary TTS candidate.
11. Add Magpie fallback if Qwen3-TTS is unstable.
12. Add STT interface with clip transcription first.
13. Integrate Nemotron 3.5 ASR Streaming.
14. Add voice-first UX polish: partial transcript, signal decoding, playback states.
15. Tune prompts, timing, and audio length for a crisp recorded demo.
16. Consider Modal offload for Qwen3-TTS or heavier components if Space performance is weak.
17. Run stretch spikes only after the core encounter works end to end.

## Required Interfaces

Keep these boundaries clean so models can be swapped.

### Destination Generator

Input:

- mode.
- optional user coordinate prompt.
- random seed.

Output:

- structured destination JSON.

### Persona Generator

Input:

- destination JSON.

Output:

- structured persona JSON.

### Conversation Engine

Input:

- destination JSON.
- persona JSON.
- conversation history.
- latest user message.

Output:

- short in-character response text.
- optional emotion/prosody hint for TTS.
- optional UI signal event.

### STT Adapter

Input:

- audio file or audio stream chunk.

Output:

- transcript text.
- optional partial transcript.
- optional confidence/timing metadata.

### TTS Adapter

Input:

- response text.
- persona voice profile.
- emotion/prosody hint.

Output:

- audio file path or audio bytes.

### Souvenir Generator

Input:

- destination JSON.
- persona JSON.
- conversation highlights.

Output:

- structured souvenir JSON.

## Risks And Mitigations

### Risk: Qwen3-TTS integration takes too long

Mitigation:

- Keep TTS behind an adapter.
- Start with Kokoro or stub audio for wiring.
- Keep Magpie as first serious fallback.

### Risk: Nemotron streaming ASR is hard to wire through Gradio

Mitigation:

- Start with clip-based microphone transcription.
- Add streaming partials after the main loop works.
- Use partial transcript display as polish, not as a hard MVP dependency.

### Risk: Hugging Face Space GPU is not enough

Mitigation:

- Use Qwen3-4B instead of Qwen3-8B.
- Move TTS or LLM to Modal.
- Avoid image generation and avatar generation in MVP.

### Risk: Custom frontend becomes fragile

Mitigation:

- Keep JavaScript small.
- Use explicit UI states.
- Keep business logic in Python.
- Avoid frontend-generated source-of-truth state.

### Risk: Stretch tech distracts from the demo

Mitigation:

- AVTR-1, PersonaPlex, Voxtral, and llama.cpp are separate spikes.
- Do not block MVP on gated models, TensorRT builds, or alternate runtimes.
- Build the cockpit so these can be added later without rewiring the app.

## Current Decisions To Preserve

- Optimize for the most polished demo.
- Use GPU if it helps quality.
- Make the app voice-first.
- Use Surprise Me as the default first action.
- Start with a strange laboratory visual theme.
- Make the visual theme easy to replace later.
- Use Nemotron 3.5 ASR Streaming for preferred STT.
- Use Qwen3-TTS as the primary TTS bet.
- Keep Magpie and Kokoro as fallbacks.
- Keep AVTR-1 and PersonaPlex as stretch experiments.
- Keep llama.cpp as a later badge-oriented stretch.
- Use Modal if it materially improves product quality or unlocks a hard component.

## Open Questions For Implementation

- ~~Which exact Qwen3 LLM checkpoint should be used first: 4B for reliability or 8B for quality?~~ DECIDED: Qwen3-8B via Together AI API.
- ~~Can Qwen3-TTS run reliably inside the selected Hugging Face Space GPU?~~ DECIDED: Qwen3-TTS runs on Modal, not in HF Space.
- ~~Should Qwen3-TTS run in the Space or on Modal?~~ DECIDED: Modal.
- ~~What Space GPU tier should be used for the demo?~~ DECIDED: CPU tier is sufficient (no local model inference).
- ~~How much true streaming does Gradio need in MVP versus polished clip-turn interaction?~~ DECIDED: Walk phase uses sync/text. Sprint phase adds streaming if time permits.
- Which generated voice attributes should be allowed for safety and consistency? (Still open.)
- ~~Should the MVP include multilingual character speech, or keep speech output English-first?~~ DECIDED: English-first for MVP.
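
Implementation note: the TTS decisions recorded above (Qwen3-TTS as the primary bet, Magpie and Kokoro as fallbacks, all behind one small adapter) could be wired as a simple fallback chain. The following is a minimal sketch under stated assumptions, not the project's implementation. `VoiceProfile`, `TTSBackend`, `StubBackend`, and `FallbackTTS` are hypothetical names introduced for illustration. Only the `prepare_voice(persona)` and `synthesize(text, voice_profile)` shapes come from this document, and the sketch returns audio bytes rather than a file path. Real backends would call the hosted models instead of the stub.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class VoiceProfile:
    """Hypothetical container for a persona's voice settings."""
    description: str


class TTSBackend(Protocol):
    """Shape every TTS backend is assumed to share (names are illustrative)."""
    name: str

    def prepare_voice(self, persona: dict) -> VoiceProfile: ...

    def synthesize(self, text: str, voice_profile: VoiceProfile) -> bytes: ...


class StubBackend:
    """Stand-in backend; a real one would call a hosted TTS model."""

    def __init__(self, name: str, fail: bool = False) -> None:
        self.name = name
        self.fail = fail  # simulate an unavailable backend

    def prepare_voice(self, persona: dict) -> VoiceProfile:
        return VoiceProfile(description=persona.get("speaking_style", "neutral"))

    def synthesize(self, text: str, voice_profile: VoiceProfile) -> bytes:
        if self.fail:
            raise RuntimeError(f"{self.name} unavailable")
        # Fake "audio": enough to prove which backend produced the output.
        return f"{self.name}:{voice_profile.description}:{text}".encode()


class FallbackTTS:
    """Tries backends in priority order and returns the first success."""

    def __init__(self, backends: list[TTSBackend]) -> None:
        if not backends:
            raise ValueError("at least one TTS backend is required")
        self.backends = backends

    def synthesize(self, text: str, persona: dict) -> tuple[str, bytes]:
        errors = []
        for backend in self.backends:
            try:
                profile = backend.prepare_voice(persona)
                return backend.name, backend.synthesize(text, profile)
            except Exception as exc:  # broad on purpose: any failure falls through
                errors.append(f"{backend.name}: {exc}")
        raise RuntimeError("all TTS backends failed: " + "; ".join(errors))


# Priority order mirrors the document: primary, first fallback, emergency fallback.
tts = FallbackTTS([
    StubBackend("qwen3-tts", fail=True),  # simulate the primary failing
    StubBackend("magpie"),
    StubBackend("kokoro"),
])
used, audio = tts.synthesize("The tide is late again.", {"speaking_style": "weary"})
# used == "magpie"
```

Returning the backend name alongside the audio lets the UI or logs show which voice path served a line, which may help when tuning the recorded demo.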
## References

Local docs:

- `docs/hackathon_details.md`
- `docs/ai_time_machine_idea.md`

Model and platform references:

- Hugging Face Build Small Hackathon: https://huggingface.co/build-small-hackathon
- Gradio Spaces documentation: https://huggingface.co/docs/hub/spaces-sdks-gradio
- NVIDIA Nemotron Speech collection: https://huggingface.co/collections/nvidia/nemotron-speech
- NVIDIA Nemotron 3.5 ASR Streaming 0.6B: https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b
- NVIDIA Nemotron Speech Streaming English 0.6B: https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b
- NVIDIA Magpie TTS Multilingual 357M: https://huggingface.co/nvidia/magpie_tts_multilingual_357m
- Qwen3-TTS CustomVoice: https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
- Qwen3-TTS VoiceDesign: https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
- Kokoro 82M: https://huggingface.co/hexgrad/Kokoro-82M
- Parler-TTS Mini v1: https://huggingface.co/parler-tts/parler-tts-mini-v1
- Voxtral TTS 4B: https://huggingface.co/mistralai/Voxtral-4B-TTS-2603
- AVTR-1 realtime avatar: https://huggingface.co/avaturn-live/avtr-1
- NVIDIA PersonaPlex 7B: https://huggingface.co/nvidia/personaplex-7b-v1
- Modal: https://modal.com