# AI Time Machine Tech Stack Decision

Date: 2026-06-05 (Updated: 2026-06-09)

## Purpose

This document records the current technical direction for the AI Time Machine hackathon project. It is intended to be handoff-ready for another coding agent.

The project is a surreal, voice-first time machine experience for the Hugging Face Build Small Hackathon, Track 2: An Adventure in Thousand Token Wood. The user should feel like they launch a strange laboratory machine, arrive at an impossible coordinate in the past or future, and speak aloud with an ordinary person from that world.

## Primary Decision

Build a polished Gradio app hosted on a Hugging Face Space. Use a modular voice-agent architecture with custom frontend staging, streaming ASR, a small instruction LLM, character-oriented TTS, and a souvenir generator.

Main priority: the most polished recorded demo possible within hackathon constraints.

Secondary priority: keep the architecture modular so higher-risk voice/avatar experiments can be tried without endangering the MVP.

## Hard Constraints

Hackathon constraints:

- The app must be a Gradio application.
- The submitted app must be hosted as a Hugging Face Space.
- Total AI model parameters must stay at or below 32B.
- Core functionality must use small models naturally suited to the task.
- The result must be a working, demonstrable product.

Product constraints:

- Voice-first experience.
- Default flow is Surprise Me.
- Visual theme starts as strange laboratory.
- The theme must be easy to reskin later.
- The project should connect the user to ordinary people, not famous figures.
- The app should prioritize theatrical believability over strict historical accuracy.

Budget context:

- Hugging Face credit: USD 20.
- Modal.com credit: USD 250.
- We do not need to spend the credits, but we should use them if they make the demo materially better.

## Final MVP Stack

MVP target:

- App: Gradio Blocks.
- Host: Hugging Face Space (CPU tier; model inference runs on Together AI and Modal, see Hosting Strategy).
- Language: Python.
- Frontend: custom HTML/CSS/JavaScript embedded in Gradio.
- LLM: Qwen3-8B via Together AI API (structured JSON mode built-in).
- STT: NVIDIA Nemotron 3.5 ASR Streaming 0.6B.
- TTS primary candidate: Qwen3-TTS 1.7B VoiceDesign/CustomVoice.
- TTS fallback: NVIDIA Magpie TTS Multilingual 357M.
- Emergency TTS fallback: Kokoro 82M.
- State/output style: structured JSON between generation steps.

Approximate MVP parameter budget:

- LLM: 8B with Qwen3-8B (4B if the smaller Qwen3-4B is used instead).
- STT: 0.6B.
- TTS primary: 1.7B.
- Total: roughly 10.3B with Qwen3-8B, or 6.3B with Qwen3-4B.

This is comfortably under the 32B cap and leaves room for experimentation.
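
The budget arithmetic can be checked with a few lines of Python. This is a minimal sketch that takes the 8B LLM from the stack list; the dictionary keys are illustrative labels, not official model identifiers.

```python
# Parameter counts in billions, copied from the MVP stack above.
PARAM_CAP_B = 32.0

MVP_MODELS_B = {
    "llm_qwen3_8b": 8.0,
    "stt_nemotron_asr_streaming": 0.6,
    "tts_qwen3_tts": 1.7,
}


def total_params_b(models: dict[str, float]) -> float:
    """Return the summed parameter count in billions, rounded to hide float noise."""
    return round(sum(models.values()), 3)


def headroom_b(models: dict[str, float], cap_b: float = PARAM_CAP_B) -> float:
    """Return how many billions of parameters remain under the cap."""
    return round(cap_b - total_params_b(models), 3)


print(f"total: {total_params_b(MVP_MODELS_B)}B")
print(f"headroom: {headroom_b(MVP_MODELS_B)}B")
```

Adding a stretch model to the dictionary before a spike gives a quick check that the combined total still fits under the cap.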

## Hosting Strategy

The required Gradio application lives on Hugging Face Spaces. The stack is split across three services to keep dependencies clean and costs manageable: the Space serves the UI, and all model inference runs on external services.

Decided architecture:

- HF Spaces: Gradio UI (CPU tier is sufficient because no model inference runs locally).
- Together AI: LLM inference (Qwen3-8B with JSON mode).
- Modal: Audio models only (Nemotron STT, Qwen3-TTS).

Budget:

- $250 Modal credits available for audio GPU compute.
- $5 Together AI free credits for dev testing.

Key insight: separating LLM and audio dependencies avoids Python dependency conflicts that arise from bundling everything in one environment.

Tradeoff:

- Calling external services means the app is not local-only and should not pursue the Off the Grid badge.
- This is acceptable because polish matters more than bonus badges for this submission.

All external services remain behind narrow Python interfaces so they can be swapped.
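
One way to keep those interfaces narrow is to describe each service as a small `typing.Protocol` and pass them around in a single container. The sketch below is an assumption about shape only: the method names (`complete_json`, `transcribe`, `synthesize`) and the stub classes are hypothetical and do not call Together AI or Modal.

```python
from dataclasses import dataclass
from typing import Protocol


class LLMClient(Protocol):
    def complete_json(self, prompt: str) -> dict: ...


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str, voice_profile: dict) -> bytes: ...


@dataclass
class Services:
    """The only handle the app logic gets on external inference."""
    llm: LLMClient
    stt: SpeechToText
    tts: TextToSpeech


# Stubs for local development; real clients would wrap the remote services.
class StubLLM:
    def complete_json(self, prompt: str) -> dict:
        return {"reply": prompt}


class StubSTT:
    def transcribe(self, audio: bytes) -> str:
        return "hello from the present"


class StubTTS:
    def synthesize(self, text: str, voice_profile: dict) -> bytes:
        return text.encode("utf-8")


def run_turn(services: Services, audio: bytes) -> bytes:
    """One voice turn: speech in, character speech out."""
    transcript = services.stt.transcribe(audio)
    reply = services.llm.complete_json(transcript)
    return services.tts.synthesize(str(reply["reply"]), {})
```

Swapping a provider then means writing one new class that satisfies the protocol; `run_turn` and the UI code stay unchanged.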

## Application Architecture

The app should be organized as five replaceable layers.

### 1. Cockpit UI

Technology:

- Gradio Blocks.
- Custom HTML/CSS/JavaScript.
- CSS variables for theme tokens.
- Small JavaScript state machine for visual transitions.

Responsibilities:

- Strange laboratory cockpit.
- Large windshield/portal.
- Launch button.
- Status/signal panel.
- Voice controls.
- Transcript/radio panel.
- Souvenir display.
- Animated states: dormant, launch, tunnel, turbulence, landing, smoke clear, destination reveal, signal lock, conversation, souvenir.

Implementation guidance:

- The strange laboratory look should be a skin, not hard-coded into app logic.
- Use CSS variables for color, glow, material, portal palette, warning states, and typography.
- Use `data-state` attributes or equivalent simple state flags for animations.
- Do not make the default UI look like a basic Gradio form.
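
As a rough illustration of the `data-state` guidance, the Python side can own the list of states and emit the attribute the CSS keys on. This sketch assumes the states advance in the order listed above and uses hypothetical hyphenated state names and a hypothetical `data-theme` attribute for the skin.

```python
from html import escape

# Animated states in the order listed above; the strictly linear order is an assumption.
COCKPIT_STATES = (
    "dormant", "launch", "tunnel", "turbulence", "landing",
    "smoke-clear", "destination-reveal", "signal-lock",
    "conversation", "souvenir",
)


def next_state(current: str) -> str:
    """Advance one step along the sequence; the final state is terminal."""
    index = COCKPIT_STATES.index(current)  # raises ValueError for unknown states
    return COCKPIT_STATES[min(index + 1, len(COCKPIT_STATES) - 1)]


def cockpit_html(state: str, theme: str = "strange-laboratory") -> str:
    """Render the cockpit root element with state and theme as data attributes."""
    if state not in COCKPIT_STATES:
        raise ValueError(f"unknown cockpit state: {state!r}")
    return (
        f'<div id="cockpit" data-theme="{escape(theme)}" '
        f'data-state="{escape(state)}"></div>'
    )
```

CSS can then target selectors such as `#cockpit[data-state="tunnel"]`, and reskinning only means changing the theme value and the variables attached to it.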

### 2. World And Persona Engine

Technology:

- Python orchestration.
- Qwen3-class instruction LLM.
- Structured JSON outputs.

Responsibilities:

- Generate a destination profile.
- Generate visual motifs that map to frontend presets.
- Generate an ordinary-person persona card.
- Maintain tone, era/future context, character constraints, and safety.

Important output fields:

- destination year.
- destination place.
- destination mode: past, future, or strange.
- atmosphere.
- visual preset key.
- visual motif list.
- character name or local identifier.
- character role/occupation.
- immediate situation.
- daily concern.
- secret/fear/desire.
- worldview constraints.
- theory about the user's voice.
- speaking style.
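
Because these fields travel as JSON between generation steps, a small validator can reject incomplete LLM output before it reaches the UI. The snake_case key names below are hypothetical spellings of the fields listed above, and the checks are a minimal sketch rather than a full schema.

```python
import json

DESTINATION_MODES = {"past", "future", "strange"}

# Hypothetical snake_case keys for the output fields listed above.
REQUIRED_KEYS = (
    "destination_year", "destination_place", "destination_mode",
    "atmosphere", "visual_preset_key", "visual_motifs",
    "character_name", "character_role", "immediate_situation",
    "daily_concern", "secret_fear_desire", "worldview_constraints",
    "voice_theory", "speaking_style",
)


def parse_world(raw_json: str) -> dict:
    """Parse LLM output and reject payloads with missing or invalid fields."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    if data["destination_mode"] not in DESTINATION_MODES:
        raise ValueError(f"invalid destination_mode: {data['destination_mode']!r}")
    if not isinstance(data["visual_motifs"], list):
        raise ValueError("visual_motifs must be a list")
    return data
```

A caller could catch `ValueError` (which also covers `json.JSONDecodeError`) and retry the generation step or fall back to static mock data.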

### 3. Voice Input

Primary STT:

- NVIDIA Nemotron 3.5 ASR Streaming 0.6B.

Why:

- 600M parameters.
- Designed for streaming ASR.
- Supports multilingual transcription across many language-locales.
- Supports punctuation and capitalization.
- Has configurable chunk sizes from low-latency to higher-accuracy operation.
- Better fit than clip-only ASR for a voice-first cockpit experience.

Recommended MVP behavior:

- Start with microphone clip transcription if Gradio streaming wiring takes too long.
- Upgrade to streaming ASR once the basic loop works.
- Use streaming partials in the UI as signal decoding text when available.

Fallback STT:

- NVIDIA Nemotron Speech Streaming English 0.6B for English-only simplicity.
- Distil-Whisper if NeMo integration blocks progress.

### 4. Conversation And Souvenir Engine

Technology:

- Same Qwen3-class instruction LLM used for world/persona generation.

Responsibilities:

- Respond in character.
- Keep replies short enough for voice playback.
- Preserve the persona's worldview and misunderstanding of the time signal.
- Ask occasional questions back.
- Generate a final temporal souvenir.

Conversation prompt rules:

- The character is an ordinary person.
- The character should not know modern facts unless implied by the destination.
- The character may interpret the user as a spirit, official, dream, machine voice, ancestor, descendant, customer, omen, or strange weather.
- The response should contain sensory detail and personality.
- Avoid turning real suffering into spectacle.
- Prefer vivid, humane, surprising moments over encyclopedia-style history.

Souvenir output:

- Destination.
- Contact.
- Quote.
- Artifact.
- Stamp name.
- Short encounter summary.
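
The souvenir fields map naturally onto a small record type. The following sketch assumes hypothetical attribute names and a plain-text card layout; the actual display is up to the cockpit UI.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Souvenir:
    destination: str
    contact: str
    quote: str
    artifact: str
    stamp_name: str
    summary: str

    def to_json(self) -> str:
        """Serialize for the frontend, keeping non-ASCII characters readable."""
        return json.dumps(asdict(self), ensure_ascii=False)

    def card_lines(self) -> list[str]:
        """Return the lines of a simple text card, one field per line."""
        return [
            f"STAMP: {self.stamp_name}",
            f"Destination: {self.destination}",
            f"Contact: {self.contact}",
            f'"{self.quote}"',
            f"Artifact: {self.artifact}",
            self.summary,
        ]
```

Keeping the record frozen means the souvenir cannot be altered after generation, which fits its role as the final artifact of an encounter.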

### 5. Voice Output

Primary TTS candidate:

- Qwen3-TTS 1.7B VoiceDesign/CustomVoice.

Why:

- Best fit for theatrical character variety.
- Supports natural-language voice control.
- Supports emotion, prosody, timbre, and speaking-style instructions.
- Supports streaming generation.
- Supports custom voice and voice-design workflows.
- Apache 2.0 license.
- Lets the app create a more distinct voice per generated persona.

Recommended Qwen3-TTS workflow:

1. Generate persona.
2. Generate a short voice design instruction from the persona.
3. Use Qwen3-TTS VoiceDesign or CustomVoice to create the character voice.
4. Cache/reuse that voice setup for the encounter.
5. Generate each character line with short text and explicit emotion/prosody hints.
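
Steps 2 to 4 of this workflow can be sketched without touching the TTS model itself. In the example below, `design_voice` is a stand-in callable for whatever Qwen3-TTS call ends up creating the voice; its signature, the persona keys, and the instruction wording are all assumptions.

```python
from typing import Callable


def voice_instruction(persona: dict) -> str:
    """Step 2: turn persona fields into a short natural-language voice brief."""
    role = persona.get("character_role", "ordinary person")
    style = persona.get("speaking_style", "plain and unhurried")
    return f"Voice of a {role}; speaking style: {style}."


class EncounterVoiceCache:
    """Steps 3 and 4: design a voice once per encounter, then reuse it."""

    def __init__(self, design_voice: Callable[[str], dict]):
        self._design_voice = design_voice  # stand-in for the real TTS voice setup
        self._profiles: dict[str, dict] = {}

    def get(self, encounter_id: str, persona: dict) -> dict:
        """Return the cached voice profile, creating it on first use."""
        if encounter_id not in self._profiles:
            instruction = voice_instruction(persona)
            self._profiles[encounter_id] = self._design_voice(instruction)
        return self._profiles[encounter_id]
```

Caching by encounter keeps the character's voice stable across turns and avoids paying the voice-setup cost on every line.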

Fallback TTS:

- NVIDIA Magpie TTS Multilingual 357M.

Why:

- NVIDIA-backed.
- NeMo-compatible.
- Small.
- Commercial-ready.
- Multiple speaker options and multilingual support.
- Good reliability candidate if Qwen3-TTS integration is unstable.

Emergency fallback:

- Kokoro 82M.

Why:

- Tiny.
- Apache 2.0.
- Fast and inexpensive.
- Good enough to keep the demo working if stronger TTS options fail.

Do not make TTS a single hard dependency. Put it behind a small interface, for example:

- `synthesize(text, voice_profile) -> audio_path`
- `prepare_voice(persona) -> voice_profile`
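
A fallback chain is one simple way to honor this rule. The sketch below wraps several backends that share the `synthesize(text, voice_profile) -> audio_path` shape and tries them in priority order; the class name and the backend labels are hypothetical.

```python
from typing import Callable, Optional, Sequence

# A backend takes (text, voice_profile) and returns an audio file path.
Backend = Callable[[str, dict], str]


class FallbackTTS:
    """Try each backend in priority order and return the first success."""

    def __init__(self, backends: Sequence[tuple[str, Backend]]):
        if not backends:
            raise ValueError("at least one TTS backend is required")
        self._backends = list(backends)
        self.last_used: Optional[str] = None

    def synthesize(self, text: str, voice_profile: dict) -> str:
        errors = []
        for name, backend in self._backends:
            try:
                audio_path = backend(text, voice_profile)
            except Exception as exc:  # broad on purpose: any failure triggers fallback
                errors.append(f"{name}: {exc}")
                continue
            self.last_used = name
            return audio_path
        raise RuntimeError("all TTS backends failed: " + "; ".join(errors))
```

Registering the backends as Qwen3-TTS, then Magpie, then Kokoro would mirror the priority order in this document, and `last_used` lets the UI or logs show which voice actually spoke.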

## TTS Comparison

### Qwen3-TTS

Decision: primary TTS bet.

Strengths:

- Best character fit.
- VoiceDesign and CustomVoice match our generated-persona concept.
- Natural-language control over voice style.
- Streaming support.
- Clean Apache 2.0 license.
- 0.6B and 1.7B variants provide scaling options.

Risks:

- Newer stack.
- May require FlashAttention 2 and GPU/runtime tuning.
- Needs proof of latency and reliability in our deployment environment.

### NVIDIA Magpie TTS 357M

Decision: first fallback.

Strengths:

- Small and practical.
- NVIDIA-backed.
- Works well with the broader NVIDIA speech stack.
- Multiple voices and nine languages.
- Good voice-agent fit.

Risks:

- Less dynamic character control than Qwen3-TTS.
- Fewer English voices than ideal for many different ordinary people.
- NeMo dependency still needs validation in the Space.

### Parler-TTS Mini

Decision: optional fallback or comparison spike, not MVP default.

Strengths:

- Apache 2.0.
- Prompt-controllable voice features.
- 34 named speakers.
- Simple conceptual model for voice descriptions.

Risks:

- English-only.
- Older and less compelling than Qwen3-TTS for this project.
- Around 0.9B params, larger than Kokoro and less flexible than Qwen3-TTS.

### Kokoro 82M

Decision: emergency fallback.

Strengths:

- Very small.
- Fast.
- Apache 2.0.
- Easy to deploy and inexpensive to run.

Risks:

- Less theatrical.
- Less expressive character control.
- Better as a backup than the product-defining voice.

### Voxtral TTS 4B

Note: The user referred to this as Vostral; the model appears to be Mistral's Voxtral TTS.

Decision: stretch spike, not MVP default.

Strengths:

- Expressive, low-latency voice-agent TTS.
- 20 preset voices.
- Voice adaptation from reference audio.
- vLLM-Omni serving path.
- Runs on a single GPU with at least 16GB VRAM.

Risks:

- CC BY-NC 4.0 license.
- 4B params just for TTS.
- Heavier runtime than Qwen3-TTS, Magpie, or Kokoro.
- Better suited to a Modal experiment than the first Space implementation.

## Stretch Technologies

### AVTR-1 Realtime Avatar

Decision: design the UI with an avatar-ready portal slot, but do not put AVTR-1 on the MVP critical path.

Why it is compelling:

- Real-time talking-head avatar.
- Uses a portrait plus dual-stream audio.
- Can render speaking and active listening behavior.
- Could make the portal feel like a person is truly present.

Why it is risky:

- Gated model access.
- Requires CUDA 12.x, TensorRT 10.x, and Ampere+ GPU.
- Requires building TensorRT engines.
- The Hugging Face model card reports that an L4 GPU is near the edge of real-time performance.
- Licensing includes noncommercial pieces and consent-sensitive avatar/deepfake restrictions.
- Parameter count needs verification before hackathon use.

How to prepare for it:

- Build the portal area as a replaceable component.
- MVP should show stylized silhouette, portrait card, waveform, static, and environmental animation.
- Later, AVTR-1 can replace or augment the portal visual.

Recommended spike:

- Run AVTR-1 separately on Modal.
- Use only generated or clearly consent-safe reference portraits.
- Confirm licensing and parameter count before integrating into the submitted app.

### PersonaPlex 7B

Decision: stretch experiment, not MVP.

Why it is compelling:

- Real-time spoken conversation.
- Persona-conditioned speech-to-speech interaction.
- Could make the app feel much more alive.

Why it is risky:

- Gated.
- More complex runtime.
- Likely requires a high-end GPU.
- Could absorb time needed for cockpit polish, launch sequence, reliable voice loop, and demo flow.

Recommended spike:

- Build an isolated proof of concept after the MVP voice loop works.
- Compare latency and character quality against modular STT + LLM + TTS.
- Promote only if it clearly improves the demo without destabilizing delivery.

### llama.cpp

Decision: future stretch for the Llama Champion badge, not MVP.

Rationale:

- The project should first optimize for a polished GPU-backed demo.
- GGUF/llama.cpp can be revisited after the app works end to end.

## Frontend Strategy

The frontend should feel like a cinematic ride plus magical radio.

Default UI direction:

- Strange laboratory.
- Full or near-full cockpit.
- Portal/windshield as the visual center.
- Control panel with dials, meters, switches, and warning lights.
- Signal/status text that feels like the machine is decoding a temporal contact.
- Transcript/radio panel for voice clarity.

Core states:

- Dormant cockpit.
- Launch charging.
- Temporal tunnel.
- Turbulence.
- Landing.
- Smoke clearing.
- Destination reveal.
- Signal lock.
- Conversation.
- Souvenir.

MVP visual approach:

- Use CSS animations and small JavaScript state transitions.
- Use stylized world presets instead of generated images.
- Let the LLM emit visual motifs, then map them to known preset keys.
- Keep generated image and avatar work as stretch.

Example visual presets:

- Rainy lantern district.
- Future flooded market.
- Medieval port.
- Orbital repair bay.
- Desert archive.
- Underground signal room.
- Snowbound radio station.
- Solar eclipse observatory.
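
Mapping free-form LLM motifs onto known presets can start as a plain word-overlap match. This sketch assumes snake_case preset keys derived from the example list above and an arbitrary default preset; a real mapping would likely want richer keyword sets per preset.

```python
import re

# Preset keys derived from the example list above; the key spelling is an assumption.
PRESET_KEYS = (
    "rainy_lantern_district", "future_flooded_market", "medieval_port",
    "orbital_repair_bay", "desert_archive", "underground_signal_room",
    "snowbound_radio_station", "solar_eclipse_observatory",
)
DEFAULT_PRESET = "underground_signal_room"  # hypothetical fallback choice


def pick_preset(motifs: list[str], default: str = DEFAULT_PRESET) -> str:
    """Pick the preset whose key shares the most words with the motifs."""
    words: set[str] = set()
    for motif in motifs:
        words.update(re.findall(r"[a-z]+", motif.lower()))
    best_key, best_score = default, 0
    for key in PRESET_KEYS:
        score = len(words & set(key.split("_")))
        if score > best_score:  # ties keep the earlier preset
            best_key, best_score = key, score
    return best_key
```

Because the function always returns a known key, the frontend never has to handle a preset it has no styles for, even when the LLM invents unfamiliar motifs.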

## Build Order

Follow this order unless the user changes priorities.

1. Create Gradio app shell.
2. Build strange laboratory cockpit layout.
3. Add frontend state machine and launch/reveal animations.
4. Add static mock destination/persona data to validate UI flow.
5. Implement structured destination generation.
6. Implement structured persona generation.
7. Implement transcript-first conversation loop.
8. Add souvenir generation.
9. Add TTS interface with Kokoro or a stub so audio wiring exists early.
10. Integrate Qwen3-TTS as primary TTS candidate.
11. Add Magpie fallback if Qwen3-TTS is unstable.
12. Add STT interface with clip transcription first.
13. Integrate Nemotron 3.5 ASR Streaming.
14. Add voice-first UX polish: partial transcript, signal decoding, playback states.
15. Tune prompts, timing, and audio length for a crisp recorded demo.
16. Audio models run on Modal (see Hosting Strategy); move other heavy components there as well if Space performance is weak.
17. Run stretch spikes only after the core encounter works end to end.

## Required Interfaces

Keep these boundaries clean so models can be swapped.

### Destination Generator

Input:

- mode.
- optional user coordinate prompt.
- random seed.

Output:

- structured destination JSON.

### Persona Generator

Input:

- destination JSON.

Output:

- structured persona JSON.

### Conversation Engine

Input:

- destination JSON.
- persona JSON.
- conversation history.
- latest user message.

Output:

- short in-character response text.
- optional emotion/prosody hint for TTS.
- optional UI signal event.
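
The conversation output can be carried in a small record, with a guard that keeps replies short enough for voice playback. The field names, payload keys, and the 40-word limit below are assumptions for illustration, not tuned values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CharacterReply:
    text: str
    prosody_hint: Optional[str] = None
    ui_signal: Optional[str] = None


def trim_for_voice(text: str, max_words: int = 40) -> str:
    """Cut a reply at the last sentence end that fits within max_words."""
    words = text.split()
    if len(words) <= max_words:
        return text.strip()
    clipped = " ".join(words[:max_words])
    cut = max(clipped.rfind("."), clipped.rfind("!"), clipped.rfind("?"))
    # Fall back to an ellipsis when no sentence boundary fits in the limit.
    return clipped[: cut + 1] if cut > 0 else clipped + "..."


def reply_from_payload(payload: dict, max_words: int = 40) -> CharacterReply:
    """Build a reply from parsed LLM JSON; the optional fields may be absent."""
    return CharacterReply(
        text=trim_for_voice(str(payload["text"]), max_words),
        prosody_hint=payload.get("prosody_hint"),
        ui_signal=payload.get("ui_signal"),
    )
```

Trimming in Python rather than relying on the prompt alone gives a hard upper bound on audio length, which matters for a recorded demo.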

### STT Adapter

Input:

- audio file or audio stream chunk.

Output:

- transcript text.
- optional partial transcript.
- optional confidence/timing metadata.
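
A single result type can cover both clip and streaming use. The sketch below also shows one stopgap for partial transcripts before true streaming ASR is wired in: re-transcribing a growing audio buffer with a clip transcriber. The names are hypothetical, and this emulation is an assumption for illustration, not how Nemotron streaming works.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, Protocol


@dataclass
class Transcript:
    text: str
    is_partial: bool = False
    metadata: dict = field(default_factory=dict)  # e.g. confidence or timing


class ClipTranscriber(Protocol):
    def transcribe(self, audio: bytes) -> Transcript: ...


def stream_transcripts(
    transcriber: ClipTranscriber, chunks: Iterable[bytes]
) -> Iterator[Transcript]:
    """Yield a partial transcript after each chunk, then one final transcript."""
    buffer = b""
    latest = None
    for chunk in chunks:
        buffer += chunk
        latest = transcriber.transcribe(buffer)  # re-run on everything heard so far
        yield Transcript(latest.text, is_partial=True, metadata=latest.metadata)
    if latest is not None:
        yield Transcript(latest.text, is_partial=False, metadata=latest.metadata)
```

The UI can show partial results as signal-decoding text and only send the final transcript to the conversation engine. Re-transcribing the whole buffer grows more expensive with each chunk, so it only suits short turns.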

### TTS Adapter

Input:

- response text.
- persona voice profile.
- emotion/prosody hint.

Output:

- audio file path or audio bytes.

### Souvenir Generator

Input:

- destination JSON.
- persona JSON.
- conversation highlights.

Output:

- structured souvenir JSON.

## Risks And Mitigations

### Risk: Qwen3-TTS integration takes too long

Mitigation:

- Keep TTS behind an adapter.
- Start with Kokoro or stub audio for wiring.
- Keep Magpie as first serious fallback.

### Risk: Nemotron streaming ASR is hard to wire through Gradio

Mitigation:

- Start with clip-based microphone transcription.
- Add streaming partials after the main loop works.
- Use partial transcript display as polish, not as a hard MVP dependency.

### Risk: Hugging Face Space GPU is not enough

Mitigation:

- Use Qwen3-4B instead of Qwen3-8B.
- Keep model inference off the Space: audio models run on Modal and the LLM runs on Together AI (see Hosting Strategy).
- Avoid image generation and avatar generation in MVP.

### Risk: Custom frontend becomes fragile

Mitigation:

- Keep JavaScript small.
- Use explicit UI states.
- Keep business logic in Python.
- Avoid frontend-generated source-of-truth state.

### Risk: Stretch tech distracts from the demo

Mitigation:

- AVTR-1, PersonaPlex, Voxtral, and llama.cpp are separate spikes.
- Do not block MVP on gated models, TensorRT builds, or alternate runtimes.
- Build the cockpit so these can be added later without rewiring the app.

## Current Decisions To Preserve

- Optimize for the most polished demo.
- Use GPU if it helps quality.
- Make the app voice-first.
- Use Surprise Me as the default first action.
- Start with a strange laboratory visual theme.
- Make the visual theme easy to replace later.
- Use Nemotron 3.5 ASR Streaming as the preferred STT.
- Use Qwen3-TTS as the primary TTS bet.
- Keep Magpie and Kokoro as fallbacks.
- Keep AVTR-1 and PersonaPlex as stretch experiments.
- Keep llama.cpp as a later badge-oriented stretch.
- Use Modal if it materially improves product quality or unlocks a hard component.

## Open Questions For Implementation

- ~~Which exact Qwen3 LLM checkpoint should be used first: 4B for reliability or 8B for quality?~~ DECIDED: Qwen3-8B via Together AI API.
- ~~Can Qwen3-TTS run reliably inside the selected Hugging Face Space GPU?~~ DECIDED: Qwen3-TTS runs on Modal, not in HF Space.
- ~~Should Qwen3-TTS run in the Space or on Modal?~~ DECIDED: Modal.
- ~~What Space GPU tier should be used for the demo?~~ DECIDED: CPU tier is sufficient (no local model inference).
- ~~How much true streaming does Gradio need in MVP versus polished clip-turn interaction?~~ DECIDED: Walk phase uses sync/text. Sprint phase adds streaming if time permits.
- Which generated voice attributes should be allowed for safety and consistency? (Still open.)
- ~~Should the MVP include multilingual character speech, or keep speech output English-first?~~ DECIDED: English-first for MVP.

## References

Local docs:

- `docs/hackathon_details.md`
- `docs/ai_time_machine_idea.md`

Model and platform references:

- Hugging Face Build Small Hackathon: https://huggingface.co/build-small-hackathon
- Gradio Spaces documentation: https://huggingface.co/docs/hub/spaces-sdks-gradio
- NVIDIA Nemotron Speech collection: https://huggingface.co/collections/nvidia/nemotron-speech
- NVIDIA Nemotron 3.5 ASR Streaming 0.6B: https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b
- NVIDIA Nemotron Speech Streaming English 0.6B: https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b
- NVIDIA Magpie TTS Multilingual 357M: https://huggingface.co/nvidia/magpie_tts_multilingual_357m
- Qwen3-TTS CustomVoice: https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
- Qwen3-TTS VoiceDesign: https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
- Kokoro 82M: https://huggingface.co/hexgrad/Kokoro-82M
- Parler-TTS Mini v1: https://huggingface.co/parler-tts/parler-tts-mini-v1
- Voxtral TTS 4B: https://huggingface.co/mistralai/Voxtral-4B-TTS-2603
- AVTR-1 realtime avatar: https://huggingface.co/avaturn-live/avtr-1
- NVIDIA PersonaPlex 7B: https://huggingface.co/nvidia/personaplex-7b-v1
- Modal: https://modal.com