Small Talk: An AI-to-AI Podcast Hosted by Robots
Built by Gaurav Gosain (HF) and Nikhil Kapila (HF).

What happens when you give small robots their own podcast : and a radio station?
Small Talk is an AI-to-AI podcast where Reachy Mini robots join a live WebRTC call, each with their own personality, voice, and 3D digital twin - and just talk. You watch them in a Google Meet-style grid, except everyone on the call is a robot.
Built for the Build Small Hackathon.
What's In It
The Podcast - Two hosts, Ada and Bode, riff on small AI models. Ada is curious and warm; Bode has dry wit and loves a tangent.
Hot Dog Court - Five characters (Batman, JARVIS, Captain Jack Sparrow, Yoda, a surfer dude) debate whether a hot dog is a sandwich. Zero consensus. Batman's take: "The night does not negotiate with bread."
Reachy FM - A robot radio station. DJ Servo plays AI-generated tracks (produced with Suno) with album art and prerecorded mic breaks between shows.
Custom Shows - Pick a topic, and the system generates a fresh cast with LLM-chosen personas and wardrobes (from a curated prop list - pirate hats, monocles, bowties), writes a script, voices every line, and streams it live.
How It's Built
The HF Space runs CPU-only. All model inference happens on Modal serverless GPUs.
Modal Endpoints
We had $250 in Modal credits for the hackathon. Two endpoints:
NVIDIA-Nemotron-3-Nano-4B-GGUF (llama.cpp) - An OpenAI-compatible /v1/chat/completions endpoint for script generation and content moderation. A single structured call produces the entire cast and dialogue as JSON - speaker personas, detailed voice descriptions, wardrobe props, and multi-turn dialogue. A separate zero-temperature moderation pass screens user-submitted topics before room creation, backed by a better-profanity wordlist as a fast first gate that works even if the Modal endpoint is cold.
Qwen3-TTS-12Hz-1.7B-VoiceDesign - A /v1/audio/speech endpoint. Each line is sent with a natural-language voice description and comes back as a WAV. VoiceDesign is zero-shot (no reference audio), but it re-rolls the voice on every call. To keep characters consistent, each voice description gets an appended anchor: "Always exactly this same voice, steady and consistent across takes." It's a hack - the proper fix is a clone endpoint - but it works.
The Cascade
The core trick for live shows: TTS for line N+1 generates while line N plays. One asyncio.create_task ahead of the current line means no dead air. The script comes from one LLM call, then audio is pipelined through TTS. Shows can self-continue up to 4 rounds - write_script() gets called again with the last 6 lines as history and the prompt says "pick up naturally from there."
Frontend & Realtime
gradio.Server serves a custom three.js frontend - no default Gradio UI visible (earning the "Off-Brand" badge). The official Reachy Mini URDF meshes render via urdf-loader, with speech-reactive head wobble and real Reachy emotions/dances. LiveKit Cloud carries the WebRTC audio - the Space just mints tokens and runs publishers. Subtitles are LiveKit data messages, not audio.
Rooms auto-shutdown after 150s with zero viewers. Seed rooms restart on the next join; custom rooms get cleaned up.
The Stack
| Layer | Tech |
|---|---|
| Frontend | three.js, urdf-loader, LiveKit JS SDK |
| Backend | gradio.Server (FastAPI) |
| LLM | Nemotron via llama.cpp on Modal |
| TTS | Qwen3-TTS VoiceDesign on Modal |
| Music | Suno (Reachy FM) |
| Realtime | LiveKit Cloud (WebRTC) |
| Hosting | HuggingFace Spaces (CPU-only) |
What We Learned
Constrained structured output > chained calls. One LLM call with a tight JSON schema producing cast + script + wardrobe is more reliable than breaking it into steps. The model doesn't need to be huge - it needs to know exactly what shape to fill.
Modal's scale-to-zero is perfect for bursty workloads - but $250 disappears fast when you're iterating voice design prompts on GPUs. Set spending alerts on day one.
Voice consistency without cloning is fragile. The anchoring hack works, but a proper clone endpoint would be the first upgrade.
The cascade is the whole idea. Pipelining TTS generation is six lines of async Python and it's the difference between a batch job and a live show.
Try it: huggingface.co/spaces/build-small-hackathon/small-talk
Built by Gaurav Gosain (HF) and Nikhil Kapila (HF).