# SmolDuplex: Complete Architecture & PRD

## 1. SYSTEM OVERVIEW

**SmolDuplex** is a 139M-parameter full-duplex spoken interaction model. It listens and speaks simultaneously, with 200ms turn-taking granularity.

| Metric | Target |
|--------|--------|
| Trainable Parameters | ~139M |
| Turn-taking latency | <400ms |
| Chunk size | 200ms |
| Simultaneous listen+speak | Yes |
| Hardware (inference) | Any 8GB+ GPU |
| Total cost (data prep + training) | ~$44 cloud |
| Total GPU time (data prep + training) | ~38 GPU-hours |

## 2. ARCHITECTURE DIAGRAM

```
USER AUDIO IN (mic)                                        AGENT AUDIO OUT (speaker)
      |                                                                     ^
      v                                                                     |
+------------+   5 tokens/200ms    +-----------+    5 tokens/200ms    +------------+
| CosyVoice  |-------------------->|  SmolLM2  |--------------------->| CosyVoice  |
| Encoder    |                     |   135M    |                      | Decoder    |
| (frozen)   |                     | (trained) |                      | (frozen)   |
| ~70M       |                     |   ~139M   |                      | ~80M       |
+------------+                     +-----------+                      +------------+
                                         |
                                  Standard causal
                               next-token prediction
```

## 3. LLM BACKBONE

**Base**: [HuggingFaceTB/SmolLM2-135M](https://huggingface.co/HuggingFaceTB/SmolLM2-135M)

| Parameter | Value |
|-----------|-------|
| Architecture | LlamaForCausalLM |
| Layers | 30 |
| Hidden size | 576 |
| Attention heads | 9 (GQA, 3 KV heads) |
| FFN intermediate | 1536 |
| Activation | SiLU |
| Context window | 8192 tokens |
| Position encoding | RoPE (theta=100000) |
| Original vocab | 49,152 |
| **Expanded vocab** | **53,258** (+4096 speech + 10 special) |
| **Final trainable params** | **~139M** |

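As a rough cross-check of the table above, the parameter count can be estimated from the listed dimensions. This is a minimal sketch that assumes a bias-free Llama-style block and tied input/output embeddings; neither assumption is stated in this document, and `llama_param_count` is a hypothetical helper, not part of any library.

```python
def llama_param_count(vocab, hidden, layers, heads, kv_heads, ffn, tied=True):
    """Estimate parameters of a bias-free Llama-style decoder (assumed layout)."""
    head_dim = hidden // heads
    # q_proj and o_proj are hidden x hidden; k_proj and v_proj shrink under GQA.
    attn = 2 * hidden * hidden + 2 * hidden * kv_heads * head_dim
    mlp = 3 * hidden * ffn            # gate, up, and down projections
    norms = 2 * hidden                # two RMSNorm weight vectors per layer
    embed = vocab * hidden
    total = embed + layers * (attn + mlp + norms) + hidden  # + final norm
    if not tied:
        total += embed                # separate output head (assumption toggle)
    return total


base = llama_param_count(49_152, 576, 30, 9, 3, 1536)
expanded = llama_param_count(53_258, 576, 30, 9, 3, 1536)
print(f"base:     {base / 1e6:.1f}M")
print(f"expanded: {expanded / 1e6:.1f}M (+{(expanded - base) / 1e6:.2f}M)")
```

Under these assumptions the 4,106 new embedding rows add about 2.4M parameters, giving roughly 136.9M in total, close to but slightly under the ~139M quoted above. Untying the output head would instead add another ~30.7M.
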
## 4. VOCABULARY (53,258 tokens)

```
Tokens 0-49151:      BPE text tokens (original SmolLM2)
Tokens 49152-53247:  CosyVoice speech codes (4096-entry codebook)
Token 53248:  [CHUNK]    - 200ms chunk boundary
Token 53249:  [ASR]      - ASR task prefix
Token 53250:  [TTS]      - TTS task prefix
Token 53251:  [SOS]      - Start of speech
Token 53252:  [EOS]      - End of speech
Token 53253:  [SOT]      - Start of text
Token 53254:  [EOT]      - End of text
Token 53255:  <sil_sp>   - Silent speech (agent listening, or user silent)
Token 53256:  <sil_txt>  - No text this chunk
Token 53257:  <bch>      - Backchannel trigger
```

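The ID layout above can be captured in a few constants and helpers. The sketch below only restates the ranges from the listing; the function names (`speech_code_to_id`, `id_to_speech_code`, `token_kind`) are hypothetical and chosen for illustration.

```python
TEXT_VOCAB = 49_152                              # original SmolLM2 BPE tokens
SPEECH_CODES = 4_096                             # CosyVoice codebook size
SPEECH_OFFSET = TEXT_VOCAB                       # first speech token ID: 49152
SPECIAL_OFFSET = SPEECH_OFFSET + SPEECH_CODES    # first special token ID: 53248

SPECIAL_TOKENS = [
    "[CHUNK]", "[ASR]", "[TTS]", "[SOS]", "[EOS]",
    "[SOT]", "[EOT]", "<sil_sp>", "<sil_txt>", "<bch>",
]
SPECIAL_IDS = {name: SPECIAL_OFFSET + i for i, name in enumerate(SPECIAL_TOKENS)}
VOCAB_SIZE = SPECIAL_OFFSET + len(SPECIAL_TOKENS)


def speech_code_to_id(code: int) -> int:
    """Map a codebook index (0..4095) to its vocabulary ID."""
    if not 0 <= code < SPEECH_CODES:
        raise ValueError(f"speech code out of range: {code}")
    return SPEECH_OFFSET + code


def id_to_speech_code(token_id: int) -> int:
    """Inverse of speech_code_to_id."""
    if not SPEECH_OFFSET <= token_id < SPECIAL_OFFSET:
        raise ValueError(f"not a speech token: {token_id}")
    return token_id - SPEECH_OFFSET


def token_kind(token_id: int) -> str:
    """Classify a vocabulary ID as 'text', 'speech', or 'special'."""
    if not 0 <= token_id < VOCAB_SIZE:
        raise ValueError(f"token ID out of range: {token_id}")
    if token_id < SPEECH_OFFSET:
        return "text"
    return "speech" if token_id < SPECIAL_OFFSET else "special"
```

Keeping the offsets derived from the two sizes, rather than hard-coding each ID, makes it harder for the speech range and the special tokens to drift apart if the codebook size changes.
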
## 5. TOKEN SEQUENCE FORMATS

### Stage 1 - ASR:
```
[ASR] [SOS] sp1 sp2 sp3 ... spN [EOS] [SOT] txt1 txt2 ... txtM [EOT]
```

### Stage 1 - TTS:
```
[TTS] [SOT] txt1 txt2 ... txtM [EOT] [SOS] sp1 sp2 sp3 ... spN [EOS]
```

### Stage 2 - Half-Duplex Dialogue:
```
[SOS] user_sp1..N [EOS] [SOT] agent_txt1..M [EOT] [SOS] agent_sp1..K [EOS] [SOS] user_sp... [EOS] ...
```

### Stage 3 - Full-Duplex (200ms chunks):
```
Per chunk (13 tokens; the | separators mark stream boundaries and are not tokens):
[CHUNK] usr_sp1 usr_sp2 usr_sp3 usr_sp4 usr_sp5 | agt_txt1 agt_txt2 | agt_sp1 agt_sp2 agt_sp3 agt_sp4 agt_sp5

Agent speaking, user silent:
[CHUNK] <sil_sp> <sil_sp> <sil_sp> <sil_sp> <sil_sp> | txt1 txt2 | sp1 sp2 sp3 sp4 sp5

Agent listening, user speaking:
[CHUNK] usr1 usr2 usr3 usr4 usr5 | <sil_txt> <sil_txt> | <sil_sp> <sil_sp> <sil_sp> <sil_sp> <sil_sp>

Backchannel:
[CHUNK] usr1 usr2 usr3 usr4 usr5 | <bch> <sil_txt> | bch_sp1 bch_sp2 bch_sp3 bch_sp4 bch_sp5
```

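The per-chunk layout can be sketched as a small builder that always emits exactly 13 tokens. The token IDs come from the vocabulary section; `build_chunk` is a hypothetical helper, and padding a short or missing stream with its silence token is an assumption made for illustration rather than something this document specifies.

```python
CHUNK, SIL_SP, SIL_TXT, BCH = 53_248, 53_255, 53_256, 53_257
SPEECH_PER_CHUNK = 5   # speech tokens per 200ms, per stream
TEXT_PER_CHUNK = 2     # agent text tokens per chunk


def _pad(tokens, size, fill):
    """Right-pad a stream with its silence token (assumed padding rule)."""
    tokens = list(tokens or [])
    if len(tokens) > size:
        raise ValueError(f"expected at most {size} tokens, got {len(tokens)}")
    return tokens + [fill] * (size - len(tokens))


def build_chunk(user_sp=None, agent_txt=None, agent_sp=None):
    """Flatten one 200ms step into [CHUNK] + 5 user + 2 text + 5 agent tokens."""
    return (
        [CHUNK]
        + _pad(user_sp, SPEECH_PER_CHUNK, SIL_SP)
        + _pad(agent_txt, TEXT_PER_CHUNK, SIL_TXT)
        + _pad(agent_sp, SPEECH_PER_CHUNK, SIL_SP)
    )


# Agent listening while the user speaks (speech IDs are arbitrary examples).
listening = build_chunk(user_sp=[49_152, 49_153, 49_154, 49_155, 49_156])

# Backchannel: <bch> in the text slot plus five backchannel speech tokens.
backchannel = build_chunk(
    user_sp=[49_200] * 5, agent_txt=[BCH], agent_sp=[50_000] * 5
)
```

Because every chunk has the same length, the position of each stream inside a chunk is fixed, which is what lets the loss mask and the inference loop slice agent tokens by offset.
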
### Context window math:
- 8192 tokens / 13 tokens per chunk = 630 whole chunks (rounded down)
- 630 x 200ms = **126 seconds** max conversation

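The same arithmetic as a short sketch, which also makes it easy to try other chunk layouts. The 11-token case corresponds to a chunk without the two text tokens and is included only as an illustration; `max_conversation_seconds` is a hypothetical helper.

```python
def max_conversation_seconds(context_tokens, tokens_per_chunk, chunk_ms=200):
    """Return (whole chunks that fit, conversation length in seconds)."""
    chunks = context_tokens // tokens_per_chunk   # partial chunks do not count
    return chunks, chunks * chunk_ms / 1000


three_stream = max_conversation_seconds(8192, 13)   # [CHUNK] + 5 + 2 + 5
two_stream = max_conversation_seconds(8192, 11)     # [CHUNK] + 5 + 5, no text
print(three_stream)  # (630, 126.0)
print(two_stream)    # (744, 148.8)
```
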
## 6. TRAINING STAGES

### Stage 1: Modality Alignment

| Setting | Value |
|--|--|
| **Goal** | Teach speech <-> text mapping |
| **Tasks** | ASR + TTS (50/50 mix) |
| **Data** | LibriSpeech 960h (free, Hugging Face) |
| **Format** | `[ASR] speech -> text` and `[TTS] text -> speech` |
| **Init** | SmolLM2-135M pretrained (expanded vocab) |
| **LR** | 5e-5, cosine schedule, 500 warmup steps |
| **Batch** | 32 (x2 grad accum = 64 effective) |
| **Seq len** | 1024 |
| **Epochs** | 3 |
| **Loss mask** | Target tokens only (text for ASR, speech for TTS) |
| **Duration** | ~2h on A10G |
| **Success** | Model can do basic ASR + TTS |

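The "target tokens only" loss mask can be illustrated by building the ASR and TTS sequences together with their labels. This is a minimal sketch: the special-token IDs come from the vocabulary section, `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, and including the target-side delimiters in the loss is an assumption, not something this document states.

```python
ASR, TTS, SOS, EOS, SOT, EOT = 53_249, 53_250, 53_251, 53_252, 53_253, 53_254
IGNORE_INDEX = -100  # positions with this label contribute no loss


def _with_labels(prompt, target):
    """Concatenate prompt and target; mask the prompt out of the loss."""
    return prompt + target, [IGNORE_INDEX] * len(prompt) + target


def build_asr_example(speech_ids, text_ids):
    """[ASR] [SOS] speech [EOS] -> [SOT] text [EOT], loss on the text side."""
    return _with_labels([ASR, SOS, *speech_ids, EOS], [SOT, *text_ids, EOT])


def build_tts_example(text_ids, speech_ids):
    """[TTS] [SOT] text [EOT] -> [SOS] speech [EOS], loss on the speech side."""
    return _with_labels([TTS, SOT, *text_ids, EOT], [SOS, *speech_ids, EOS])


# Toy IDs: two speech tokens and three text tokens.
asr_ids, asr_labels = build_asr_example([49_152, 49_153], [10, 11, 12])
tts_ids, tts_labels = build_tts_example([10, 11, 12], [49_152, 49_153])
```

The labels here are aligned position by position with the inputs; a causal LM trainer would typically apply the one-token shift internally.
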
### Stage 2: Half-Duplex Dialogue

| Setting | Value |
|--|--|
| **Goal** | Learn turn-based conversation |
| **Tasks** | User speaks -> Agent thinks (text) -> Agent speaks |
| **Data** | 10K synthetic conversations (LLM-generated text + CosyVoice TTS) |
| **Format** | Sequential: user_speech -> agent_text -> agent_speech -> ... |
| **Init** | Stage 1 checkpoint |
| **LR** | 2e-5, cosine, 200 warmup |
| **Batch** | 16 (x4 grad accum = 64 effective) |
| **Seq len** | 4096 |
| **Epochs** | 5 |
| **Loss mask** | Agent tokens only (not user speech) |
| **Duration** | ~3h on A10G |
| **Success** | Coherent responses to spoken queries |

### Stage 3: Full-Duplex Interaction

| Setting | Value |
|--|--|
| **Goal** | Simultaneous listen+speak, 200ms turn-taking |
| **Tasks** | Turn-taking, backchanneling, interruption handling |
| **Data** | 15K conversations chunked at 200ms (10K augmented from Stage 2 + 5K fresh) |
| **Format** | Flattened chunks: `[CHUNK]` + 5 user speech + 2 agent text + 5 agent speech tokens |
| **Sub-stage 3a** | Three-stream (user_sp + agent_txt + agent_sp), 6 epochs |
| **Sub-stage 3b** | Two-stream (user_sp + agent_sp, no text), 4 epochs |
| **Init** | Stage 2 checkpoint |
| **LR** | 1e-5, cosine, 100 warmup |
| **Batch** | 8 (x8 grad accum = 64 effective) |
| **Seq len** | 8192 |
| **Epochs** | 10 total |
| **Loss mask** | Agent tokens only |
| **Duration** | ~8h on A10G |
| **Success** | <400ms turn-taking, natural backchannels, handles interrupts |

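All three stages share the same recipe shape: a peak learning rate, a warmup, cosine decay, and an effective batch of 64. The sketch below shows one common way to realize that schedule. Linear warmup, decay to zero, and the `total_steps` value are assumptions for illustration; this document gives only the peak LR and the warmup length.

```python
import math

# (peak LR, warmup steps, micro-batch, grad-accum steps) from the stage tables.
STAGES = {
    "stage1": (5e-5, 500, 32, 2),
    "stage2": (2e-5, 200, 16, 4),
    "stage3": (1e-5, 100, 8, 8),
}


def lr_at(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Learning rate at a 0-indexed optimizer step (assumed linear warmup)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for name, (peak, warmup, batch, accum) in STAGES.items():
    total_steps = 10_000  # hypothetical; depends on dataset size and epochs
    midpoint = lr_at((warmup + total_steps) // 2, total_steps, peak, warmup)
    print(f"{name}: effective batch {batch * accum}, mid-run LR {midpoint:.2e}")
```

Shrinking the micro-batch while raising gradient accumulation keeps the effective batch constant as the sequence length grows from 1024 to 8192, which is presumably what lets all stages fit on the same GPU.
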
## 7. DATA GENERATION

### Stage 1 (download only):
```python
# LibriSpeech: free and ready to use
from datasets import load_dataset

ds = load_dataset("openslr/librispeech_asr", "all")
# Then tokenize the audio with the CosyVoice encoder (~10h on A10G)
```

### Stage 2 (synthetic generation):
```python
# Step 1: Generate 10K text dialogues with any LLM
# Step 2: Synthesize each turn with CosyVoice TTS
# Step 3: Tokenize all audio with CosyVoice encoder
# Step 4: Format as half-duplex sequences
# Time: ~10h total, Cost: ~$14
```

### Stage 3 (augmentation):
```python
# Step 1: Take Stage 2 dialogues
# Step 2: Chunk into 200ms segments
# Step 3: Inject backchannels (p=0.10 per chunk during user speech)
# Step 4: Inject interruptions (p=0.05 per agent turn)
# Step 5: Add natural pauses (1-4 chunks between turns)
# Step 6: Generate 5K fresh conversations for diversity
# Step 7: Flatten each chunk into [CHUNK] + 5 user speech + 2 agent text + 5 agent speech tokens
# Time: ~5h for fresh generation, ~1h for augmentation
```

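Steps 3 to 5 above can be sketched as a planner that expands turn-level dialogues into one label per 200ms chunk. The probabilities and pause range come from the steps; everything else is an assumption for illustration, including the `plan_chunks` name, the label strings, and the choice to model an interruption as truncating the agent turn with no pause before the next user turn.

```python
import random

BACKCHANNEL_P = 0.10    # per chunk of user speech
INTERRUPT_P = 0.05      # per agent turn
PAUSE_CHUNKS = (1, 4)   # pause length between turns, in 200ms chunks


def plan_chunks(turns, rng):
    """Expand (speaker, n_chunks) turns into per-chunk labels."""
    plan = []
    for index, (speaker, n_chunks) in enumerate(turns):
        interrupted = False
        if speaker == "user":
            for _ in range(n_chunks):
                hit = rng.random() < BACKCHANNEL_P
                plan.append("user+bch" if hit else "user")
        else:
            interrupted = rng.random() < INTERRUPT_P
            # An interrupted agent turn is cut short at a random chunk.
            kept = rng.randint(1, n_chunks) if interrupted else n_chunks
            plan.extend(["agent"] * kept)
        is_last = index == len(turns) - 1
        if not is_last and not interrupted:
            plan.extend(["pause"] * rng.randint(*PAUSE_CHUNKS))
    return plan


turns = [("user", 12), ("agent", 20), ("user", 8), ("agent", 15)]
plan = plan_chunks(turns, random.Random(0))
```

Each label would then be turned into a 13-token chunk, with backchannel chunks carrying `<bch>` in the text slot and pause chunks carrying silence in every stream. Passing the random generator in explicitly keeps the augmentation reproducible.
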
## 8. INFERENCE

### Latency Budget (per 200ms chunk):
```
Audio capture:        ~1ms
CosyVoice encode:     ~10ms
LLM forward (7 tok):  ~20-30ms (RTX 3060)
CosyVoice decode:     ~15ms
Audio playback:       ~1ms
-----------------------------
TOTAL:                ~50-60ms (within 200ms budget)
```

### Memory:
```
Model (bf16):         270 MB
KV cache (8192 ctx):  ~1.5 GB
CosyVoice enc+dec:    ~300 MB
-----------------------------
TOTAL VRAM:           ~2.5 GB (fits on any modern GPU)
```

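For reference, the raw size of the key/value tensors can be estimated from the backbone dimensions in section 3. This is a back-of-the-envelope sketch assuming bf16 storage, a single sequence, and a head dimension of 576 / 9 = 64; `kv_cache_bytes` is a hypothetical helper, and the estimate ignores activations and allocator overhead.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context, bytes_per_value=2):
    """Bytes for keys and values across all layers, for one sequence."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_value


head_dim = 576 // 9                       # hidden size / attention heads
size = kv_cache_bytes(layers=30, kv_heads=3, head_dim=head_dim, context=8192)
print(f"{size / 2**20:.0f} MiB")          # 180 MiB
```

Under these assumptions the tensors themselves come to about 180 MiB at the full 8192-token context, comfortably inside the ~1.5 GB budgeted above.
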
### Realtime loop:
```python
# Pseudocode: capture_200ms, cosyvoice_encode, cosyvoice_decode, llm.generate,
# all_silent, and play are placeholders, not real library calls.
context = []
while True:
    user_audio = capture_200ms()                      # 200ms mic input
    user_tokens = cosyvoice_encode(user_audio)        # -> 5 speech tokens
    context.extend([CHUNK] + user_tokens)             # append chunk marker + user stream
    agent_tokens = llm.generate(context, max_new=7)   # predict 2 text + 5 speech tokens
    context.extend(agent_tokens)                      # append generated tokens
    agent_sp = agent_tokens[2:]                       # last 5 = speech
    if not all_silent(agent_sp):
        play(cosyvoice_decode(agent_sp))              # output audio
```

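One iteration of the loop can be isolated into a pure function so that it can be exercised without audio hardware or a model. The sketch below follows the three-stream layout (7 generated tokens per chunk). `duplex_step` and the `generate` callback are hypothetical stand-ins for the real model call, and the token IDs come from the vocabulary section.

```python
from typing import Callable, List

CHUNK, SIL_SP, SIL_TXT = 53_248, 53_255, 53_256
USER_TOKENS_PER_CHUNK = 5
AGENT_TOKENS_PER_CHUNK = 7   # 2 text + 5 speech

Generate = Callable[[List[int], int], List[int]]


def duplex_step(context: List[int], user_tokens: List[int],
                generate: Generate) -> List[int]:
    """Advance the conversation by one 200ms chunk, updating context in place.

    Returns the agent speech tokens to synthesize, or [] if the agent is silent.
    """
    if len(user_tokens) != USER_TOKENS_PER_CHUNK:
        raise ValueError("expected 5 user speech tokens per chunk")
    context.append(CHUNK)
    context.extend(user_tokens)
    agent_tokens = generate(context, AGENT_TOKENS_PER_CHUNK)
    if len(agent_tokens) != AGENT_TOKENS_PER_CHUNK:
        raise ValueError("generator must return exactly 7 tokens")
    context.extend(agent_tokens)
    agent_sp = agent_tokens[2:]               # skip the 2 text tokens
    return [] if all(t == SIL_SP for t in agent_sp) else agent_sp


def always_silent(context, max_new):
    """Stub generator: the agent keeps listening."""
    return [SIL_TXT] * 2 + [SIL_SP] * 5


history: List[int] = []
to_play = duplex_step(history, [49_152] * 5, always_silent)
```

Separating the token bookkeeping from audio I/O also makes it straightforward to check that every step appends exactly 13 tokens to the context.
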
## 9. COSTS

| Item | GPU Hours | Cost (A10G) |
|------|-----------|-------------|
| Tokenize LibriSpeech | 10h | $11 |
| Generate Stage 2 text | 2h | $2 |
| CosyVoice TTS (Stage 2) | 8h | $9 |
| CosyVoice TTS (Stage 3 extra) | 5h | $6 |
| Train Stage 1 | 2h | $2 |
| Train Stage 2 | 3h | $3 |
| Train Stage 3 | 8h | $9 |
| **TOTAL** | **~38h** | **~$44** |

On your own RTX 4090, the same run costs roughly $2 in electricity.

## 10. EVALUATION METRICS

| Metric | Target | Stage |
|--------|--------|-------|
| ASR WER (LibriSpeech dev) | <30% | 1 |
| TTS intelligibility | >70% words correct | 1 |
| Response relevance (GPT-4 judge) | >3.0/5.0 | 2 |
| Turn-taking latency | <400ms | 3 |
| Backchannel F1 | >0.5 | 3 |
| Interruption yield rate | >80% within 400ms | 3 |
| FD-Bench score | Report (no target) | 3 |

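For the Stage 1 ASR target, word error rate is the word-level edit distance between hypothesis and reference, divided by the reference length. The sketch below is a plain dynamic-programming version. It splits on whitespace only, and any text normalization (casing, punctuation) is left out as an assumption the evaluation setup would have to settle.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER = {wer:.1%}, meets <30% target: {wer < 0.30}")
```
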
## 11. REFERENCES

- [OmniFlatten (2410.17799)](https://arxiv.org/abs/2410.17799): Core arch, proven at 500M
- [SyncLLM (2409.15594)](https://arxiv.org/abs/2409.15594): Time-sync mechanism
- [Chronological Thinking (2510.05150)](https://arxiv.org/abs/2510.05150): Think-while-listen, 1.5B
- [Full-Duplex-Bench (2503.04721)](https://arxiv.org/abs/2503.04721): Evaluation standard
- [Sommelier (2603.25750)](https://arxiv.org/abs/2603.25750): Data processing pipeline
- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice): Speech tokenizer/detokenizer
- [SmolLM2 (2502.02737)](https://arxiv.org/abs/2502.02737): LLM backbone