# SmolDuplex: Complete Architecture & PRD

## 1. SYSTEM OVERVIEW

**SmolDuplex** is a 139M-parameter full-duplex spoken interaction model. It listens and speaks simultaneously, making turn-taking decisions at 200ms granularity.

| Metric | Target |
|--------|--------|
| Trainable Parameters | ~139M |
| Turn-taking latency | <400ms |
| Chunk size | 200ms |
| Simultaneous listen+speak | Yes |
| Hardware (inference) | Any 8GB+ GPU |
| Training cost | ~$44 cloud |
| Training time | ~38 GPU-hours |

## 2. ARCHITECTURE DIAGRAM

```
USER AUDIO IN (mic)                                        AGENT AUDIO OUT (speaker)
      |                                                                     ^
      v                                                                     |
+------------+   5 tokens/200ms    +-----------+    5 tokens/200ms    +------------+
| CosyVoice  |-------------------->|  SmolLM2  |--------------------->| CosyVoice  |
|  Encoder   |                     |   135M    |                      |  Decoder   |
|  (frozen)  |                     | (trained) |                      |  (frozen)  |
|   ~70M     |                     |   ~139M   |                      |   ~80M     |
+------------+                     +-----------+                      +------------+
                                         |
                       Standard causal next-token prediction
```

## 3. LLM BACKBONE

**Base**: [HuggingFaceTB/SmolLM2-135M](https://huggingface.co/HuggingFaceTB/SmolLM2-135M)

| Parameter | Value |
|-----------|-------|
| Architecture | LlamaForCausalLM |
| Layers | 30 |
| Hidden size | 576 |
| Attention heads | 9 (GQA, 3 KV heads) |
| FFN intermediate | 1536 |
| Activation | SiLU |
| Context window | 8192 tokens |
| Position encoding | RoPE (theta=100000) |
| Original vocab | 49,152 |
| **Expanded vocab** | **53,258** (+4096 speech + 10 special) |
| **Final trainable params** | **~139M** |

## 4. VOCABULARY (53,258 tokens)

```
Tokens 0-49151:      BPE text tokens (original SmolLM2)
Tokens 49152-53247:  CosyVoice speech codes (4096 codebook)
Token 53248:         [CHUNK] - 200ms chunk boundary
Token 53249:         [ASR] - ASR task prefix
Token 53250:         [TTS] - TTS task prefix
Token 53251:         [SOS] - Start of speech
Token 53252:         [EOS] - End of speech
Token 53253:         [SOT] - Start of text
Token 53254:         [EOT] - End of text
Token 53255:          - Silent speech (agent listening)
Token 53256:          - No text this chunk
Token 53257:          - Backchannel trigger
```

## 5. TOKEN SEQUENCE FORMATS

### Stage 1 (ASR):

```
[ASR] [SOS] sp1 sp2 sp3 ... spN [EOS] [SOT] txt1 txt2 ... txtM [EOT]
```

### Stage 1 (TTS):

```
[TTS] [SOT] txt1 txt2 ... txtM [EOT] [SOS] sp1 sp2 sp3 ... spN [EOS]
```

### Stage 2 (Half-Duplex Dialogue):

```
[SOS] user_sp1..N [EOS] [SOT] agent_txt1..M [EOT] [SOS] agent_sp1..K [EOS] [SOS] user_sp... [EOS] ...
```

### Stage 3 (Full-Duplex, 200ms chunks):

Each chunk is `[CHUNK]` + 5 user speech + 2 agent text + 5 agent speech = 13 tokens. The `|` separators below are visual only and are not tokens.

```
Per chunk (13 tokens):
[CHUNK] usr_sp1 usr_sp2 usr_sp3 usr_sp4 usr_sp5 | agt_txt1 agt_txt2 | agt_sp1 agt_sp2 agt_sp3 agt_sp4 agt_sp5

Agent speaking, user silent:
[CHUNK]  | txt1 txt2 | sp1 sp2 sp3 sp4 sp5

Agent listening, user speaking:
[CHUNK] usr1 usr2 usr3 usr4 usr5 |  |

Backchannel:
[CHUNK] usr1 usr2 usr3 usr4 usr5 |  | bch_sp1 bch_sp2 bch_sp3 bch_sp4 bch_sp5
```

### Context window math:

- 8192 tokens / 13 per chunk = 630 chunks (rounded down)
- 630 x 200ms = **126 seconds** max conversation

## 6. TRAINING STAGES

### Stage 1: Modality Alignment

| | |
|--|--|
| **Goal** | Teach speech <-> text mapping |
| **Tasks** | ASR + TTS (50/50 mix) |
| **Data** | LibriSpeech 960h (free, HuggingFace) |
| **Format** | `[ASR] speech -> text` and `[TTS] text -> speech` |
| **Init** | SmolLM2-135M pretrained (expanded vocab) |
| **LR** | 5e-5, cosine schedule, 500 warmup steps |
| **Batch** | 32 (x2 grad accum = 64 effective) |
| **Seq len** | 1024 |
| **Epochs** | 3 |
| **Loss mask** | Target tokens only (text for ASR, speech for TTS) |
| **Duration** | ~2h on A10G |
| **Success** | Model can do basic ASR + TTS |

### Stage 2: Half-Duplex Dialogue

| | |
|--|--|
| **Goal** | Learn turn-based conversation |
| **Tasks** | User speaks -> Agent thinks (text) -> Agent speaks |
| **Data** | 10K synthetic conversations (LLM-generated text + CosyVoice TTS) |
| **Format** | Sequential: user_speech -> agent_text -> agent_speech -> ... |
| **Init** | Stage 1 checkpoint |
| **LR** | 2e-5, cosine, 200 warmup |
| **Batch** | 16 (x4 grad accum = 64 effective) |
| **Seq len** | 4096 |
| **Epochs** | 5 |
| **Loss mask** | Agent tokens only (not user speech) |
| **Duration** | ~3h on A10G |
| **Success** | Coherent responses to spoken queries |

### Stage 3: Full-Duplex Interaction

| | |
|--|--|
| **Goal** | Simultaneous listen+speak, 200ms turn-taking |
| **Tasks** | Turn-taking, backchanneling, interruption handling |
| **Data** | 15K conversations chunked at 200ms (augmented from Stage 2 + fresh) |
| **Format** | Flattened chunks: `[CHUNK] user_sp5 agent_txt2 agent_sp5` |
| **Sub-stage 3a** | Three-stream (user_sp + agent_txt + agent_sp), 6 epochs |
| **Sub-stage 3b** | Two-stream (user_sp + agent_sp, no text), 4 epochs |
| **Init** | Stage 2 checkpoint |
| **LR** | 1e-5, cosine, 100 warmup |
| **Batch** | 8 (x8 grad accum = 64 effective) |
| **Seq len** | 8192 |
| **Epochs** | 10 total |
| **Loss mask** | Agent tokens only |
| **Duration** | ~8h on A10G |
| **Success** | <400ms turn-taking, natural backchannels, handles interrupts |

## 7. DATA GENERATION

### Stage 1 (download only):

```python
# LibriSpeech: free, ready to download from the HuggingFace Hub
from datasets import load_dataset

ds = load_dataset("openslr/librispeech_asr", "all")
# Then tokenize audio with CosyVoice encoder (~10h on A10G)
```

### Stage 2 (synthetic generation):

```python
# Step 1: Generate 10K text dialogues with any LLM
# Step 2: Synthesize each turn with CosyVoice TTS
# Step 3: Tokenize all audio with CosyVoice encoder
# Step 4: Format as half-duplex sequences
# Time: ~10h total, Cost: ~$14
```

### Stage 3 (augmentation):

```python
# Step 1: Take Stage 2 dialogues
# Step 2: Chunk into 200ms segments
# Step 3: Inject backchannels (p=0.10 per chunk during user speech)
# Step 4: Inject interruptions (p=0.05 per agent turn)
# Step 5: Add natural pauses (1-4 chunks between turns)
# Step 6: Generate 5K fresh conversations for diversity
# Step 7: Flatten into [CHUNK] user_sp5 agent_txt2 agent_sp5 format
# Time: ~5h for fresh generation, ~1h for augmentation
```

## 8. INFERENCE

### Latency Budget (per 200ms chunk):

```
Audio capture:          ~1ms
CosyVoice encode:       ~10ms
LLM forward (7 tok):    ~20-30ms (RTX 3060)
CosyVoice decode:       ~15ms
Audio playback:         ~1ms
─────────────────────────────
TOTAL:                  ~50-60ms  ✓ (within 200ms budget)
```

### Memory:

```
Model (bf16):           270 MB
KV cache (8192 ctx):    ~1.5 GB
CosyVoice enc+dec:      ~300 MB
─────────────────────────────
TOTAL VRAM:             ~2.5 GB  (fits on any modern GPU)
```

### Realtime loop:

```python
# capture_200ms, cosyvoice_encode, cosyvoice_decode, llm, all_silent, and play
# are placeholder helpers; llm.generate is assumed to return only new tokens.
CHUNK = 53248   # [CHUNK] token id (Section 4)
context = []    # running token history, user and agent streams interleaved

while True:
    user_audio = capture_200ms()                     # 200ms mic input
    user_tokens = cosyvoice_encode(user_audio)       # -> 5 speech tokens
    context.extend([CHUNK] + user_tokens)            # append chunk marker + user stream
    agent_tokens = llm.generate(context, max_new=7)  # predict 2 text + 5 speech tokens
    context.extend(agent_tokens)                     # append generated tokens
    agent_sp = agent_tokens[2:]                      # last 5 = speech
    if not all_silent(agent_sp):
        play(cosyvoice_decode(agent_sp))             # output audio
```

## 9. COSTS

| Item | GPU Hours | Cost (A10G) |
|------|-----------|-------------|
| Tokenize LibriSpeech | 10h | $11 |
| Generate Stage 2 text | 2h | $2 |
| CosyVoice TTS (Stage 2) | 8h | $9 |
| CosyVoice TTS (Stage 3 extra) | 5h | $6 |
| Train Stage 1 | 2h | $2 |
| Train Stage 2 | 3h | $3 |
| Train Stage 3 | 8h | $9 |
| **TOTAL** | **~38h** | **~$44** |

On your own RTX 4090: ~$2 in electricity.

## 10. EVALUATION METRICS

| Metric | Target | Stage |
|--------|--------|-------|
| ASR WER (LibriSpeech dev) | <30% | 1 |
| TTS intelligibility | >70% words correct | 1 |
| Response relevance (GPT-4 judge) | >3.0/5.0 | 2 |
| Turn-taking latency | <400ms | 3 |
| Backchannel F1 | >0.5 | 3 |
| Interruption yield rate | >80% within 400ms | 3 |
| FD-Bench score | Report (no target) | 3 |

## 11. REFERENCES

- [OmniFlatten (2410.17799)](https://arxiv.org/abs/2410.17799): Core arch, proven at 500M
- [SyncLLM (2409.15594)](https://arxiv.org/abs/2409.15594): Time-sync mechanism
- [Chronological Thinking (2510.05150)](https://arxiv.org/abs/2510.05150): Think-while-listen, 1.5B
- [Full-Duplex-Bench (2503.04721)](https://arxiv.org/abs/2503.04721): Evaluation standard
- [Sommelier (2603.25750)](https://arxiv.org/abs/2603.25750): Data processing pipeline
- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice): Speech tokenizer/detokenizer
- [SmolLM2 (2502.02737)](https://arxiv.org/abs/2502.02737): LLM backbone
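
## 12. WORKED EXAMPLE: CHUNK LAYOUT AND BUDGETS

The vocabulary layout (Section 4), the 13-token chunk format (Section 5), and the context and cache budgets (Sections 5 and 8) can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the project's implementation. The helper names (`speech_token`, `pack_chunk`, `max_conversation`, `kv_cache_bytes`) and the constant names `SILENT_SPEECH` and `NO_TEXT` are hypothetical, since the spec gives those two token ids and their meanings but not their names. Padding a silent user slot with the silent-speech token is also an assumption. The KV-cache estimate assumes bf16 keys and values and a head dimension of 576 / 9 = 64.

```python
"""Sanity checks for the SmolDuplex token layout and budgets (illustrative)."""

# Token ids from Section 4. SILENT_SPEECH and NO_TEXT are assumed names.
SPEECH_BASE = 49152      # first CosyVoice speech-code token id
CODEBOOK_SIZE = 4096     # speech codes occupy 49152..53247
CHUNK = 53248            # [CHUNK] boundary token
SILENT_SPEECH = 53255    # "silent speech" filler (assumed name)
NO_TEXT = 53256          # "no text this chunk" filler (assumed name)

USER_SP, AGENT_TXT, AGENT_SP = 5, 2, 5            # slots per 200ms chunk
CHUNK_LEN = 1 + USER_SP + AGENT_TXT + AGENT_SP    # 13 tokens


def speech_token(code: int) -> int:
    """Map a CosyVoice codebook index (0..4095) to its vocabulary id."""
    if not 0 <= code < CODEBOOK_SIZE:
        raise ValueError(f"speech code out of range: {code}")
    return SPEECH_BASE + code


def _fill(tokens, width, pad):
    """Right-pad a slot to its fixed width; reject slots that overflow."""
    tokens = list(tokens)
    if len(tokens) > width:
        raise ValueError(f"{len(tokens)} tokens do not fit a {width}-token slot")
    return tokens + [pad] * (width - len(tokens))


def pack_chunk(user_codes=(), agent_text=(), agent_codes=()):
    """Flatten one 200ms chunk into exactly 13 token ids."""
    user = _fill(map(speech_token, user_codes), USER_SP, SILENT_SPEECH)
    text = _fill(agent_text, AGENT_TXT, NO_TEXT)
    agent = _fill(map(speech_token, agent_codes), AGENT_SP, SILENT_SPEECH)
    return [CHUNK] + user + text + agent


def max_conversation(context=8192, chunk_ms=200):
    """Return (whole chunks that fit in the context, their duration in seconds)."""
    chunks = context // CHUNK_LEN
    return chunks, chunks * chunk_ms / 1000


def kv_cache_bytes(context=8192, layers=30, kv_heads=3, head_dim=576 // 9,
                   bytes_per_value=2):
    """Keys plus values for every layer, KV head, and position (bf16 assumed)."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_value


if __name__ == "__main__":
    print(pack_chunk(user_codes=[7, 8, 9, 10, 11]))   # agent listening
    print(max_conversation())
    print(f"KV cache at full context: {kv_cache_bytes() / 1e6:.0f} MB")
```

Under these assumptions the context holds 630 chunks (126 seconds), matching Section 5, and the full-context KV cache comes to roughly 189 MB, comfortably inside the ~1.5 GB allowance in Section 8. The estimate covers cached keys and values only; activations, the CUDA context, and allocator overhead add to measured VRAM.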