Small Talk: An AI-to-AI Podcast Hosted by Robots

Community Article

Published June 13, 2026

Upvote

Nikhil K

nkapila6

build-small-hackathon

Try it: huggingface.co/spaces/build-small-hackathon/small-talk

In-depth blog-post: https://nkapila.me/posts/small-talk

Built by Gaurav Gosain (HF) and Nikhil Kapila (HF).

What happens when you give small robots their own podcast : and a radio station?

Small Talk is an AI-to-AI podcast where Reachy Mini robots join a live WebRTC call, each with their own personality, voice, and 3D digital twin - and just talk. You watch them in a Google Meet-style grid, except everyone on the call is a robot.

Built for the Build Small Hackathon.

What's In It

The Podcast - Two hosts, Ada and Bode, riff on small AI models. Ada is curious and warm; Bode has dry wit and loves a tangent.

Hot Dog Court - Five characters (Batman, JARVIS, Captain Jack Sparrow, Yoda, a surfer dude) debate whether a hot dog is a sandwich. Zero consensus. Batman's take: "The night does not negotiate with bread."

Reachy FM - A robot radio station. DJ Servo plays AI-generated tracks (produced with Suno) with album art and prerecorded mic breaks between shows.

Custom Shows - Pick a topic, and the system generates a fresh cast with LLM-chosen personas and wardrobes (from a curated prop list - pirate hats, monocles, bowties), writes a script, voices every line, and streams it live.

How It's Built

The HF Space runs CPU-only. All model inference happens on Modal serverless GPUs.

Modal Endpoints

We had $250 in Modal credits for the hackathon. Two endpoints:

NVIDIA-Nemotron-3-Nano-4B-GGUF (llama.cpp) - An OpenAI-compatible /v1/chat/completions endpoint for script generation and content moderation. A single structured call produces the entire cast and dialogue as JSON - speaker personas, detailed voice descriptions, wardrobe props, and multi-turn dialogue. A separate zero-temperature moderation pass screens user-submitted topics before room creation, backed by a better-profanity wordlist as a fast first gate that works even if the Modal endpoint is cold.

Qwen3-TTS-12Hz-1.7B-VoiceDesign - A /v1/audio/speech endpoint. Each line is sent with a natural-language voice description and comes back as a WAV. VoiceDesign is zero-shot (no reference audio), but it re-rolls the voice on every call. To keep characters consistent, each voice description gets an appended anchor: "Always exactly this same voice, steady and consistent across takes." It's a hack - the proper fix is a clone endpoint - but it works.

The Cascade

The core trick for live shows: TTS for line N+1 generates while line N plays. One asyncio.create_task ahead of the current line means no dead air. The script comes from one LLM call, then audio is pipelined through TTS. Shows can self-continue up to 4 rounds - write_script() gets called again with the last 6 lines as history and the prompt says "pick up naturally from there."

Frontend & Realtime

gradio.Server serves a custom three.js frontend - no default Gradio UI visible (earning the "Off-Brand" badge). The official Reachy Mini URDF meshes render via urdf-loader, with speech-reactive head wobble and real Reachy emotions/dances. LiveKit Cloud carries the WebRTC audio - the Space just mints tokens and runs publishers. Subtitles are LiveKit data messages, not audio.

Rooms auto-shutdown after 150s with zero viewers. Seed rooms restart on the next join; custom rooms get cleaned up.

The Stack

Layer	Tech
Frontend	three.js, urdf-loader, LiveKit JS SDK
Backend	`gradio.Server` (FastAPI)
LLM	Nemotron via llama.cpp on Modal
TTS	Qwen3-TTS VoiceDesign on Modal
Music	Suno (Reachy FM)
Realtime	LiveKit Cloud (WebRTC)
Hosting	HuggingFace Spaces (CPU-only)

What We Learned

Constrained structured output > chained calls. One LLM call with a tight JSON schema producing cast + script + wardrobe is more reliable than breaking it into steps. The model doesn't need to be huge - it needs to know exactly what shape to fill.

Modal's scale-to-zero is perfect for bursty workloads - but $250 disappears fast when you're iterating voice design prompts on GPUs. Set spending alerts on day one.

Voice consistency without cloning is fragile. The anchoring hack works, but a proper clone endpoint would be the first upgrade.

The cascade is the whole idea. Pipelining TTS generation is six lines of async Python and it's the difference between a batch job and a live show.

Try it: huggingface.co/spaces/build-small-hackathon/small-talk

Built by Gaurav Gosain (HF) and Nikhil Kapila (HF).

Models mentioned in this article 2

Spaces mentioned in this article 1

Signal Garden: A Game Engine That Keeps Mutating

June 16, 2026

Noteworthy

June 15, 2026

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote