SmolDuplex: Complete Architecture & PRD
1. SYSTEM OVERVIEW
SmolDuplex is a 139M-parameter full-duplex spoken interaction model. It listens and speaks at the same time and makes turn-taking decisions once per 200ms chunk.
| Metric | Target |
|---|---|
| Trainable parameters | ~139M |
| Turn-taking latency | <400ms |
| Chunk size | 200ms |
| Simultaneous listen+speak | Yes |
| Hardware (inference) | Any 8GB+ GPU |
| Training cost | ~$44 cloud |
| Training time | ~38 GPU-hours |
2. ARCHITECTURE DIAGRAM
USER AUDIO IN (mic)                                          AGENT AUDIO OUT (speaker)
      |                                                                  ^
      v                                                                  |
+------------+   5 tokens/200ms   +-----------+   5 tokens/200ms   +------------+
| CosyVoice  |------------------->|  SmolLM2  |------------------->| CosyVoice  |
|  Encoder   |                    |   135M    |                    |  Decoder   |
|  (frozen)  |                    | (trained) |                    |  (frozen)  |
|    ~70M    |                    |   ~139M   |                    |    ~80M    |
+------------+                    +-----------+                    +------------+
                                        |
                                 Standard causal
                              next-token prediction
3. LLM BACKBONE
Base: HuggingFaceTB/SmolLM2-135M
| Parameter | Value |
|---|---|
| Architecture | LlamaForCausalLM |
| Layers | 30 |
| Hidden size | 576 |
| Attention heads | 9 (GQA, 3 KV heads) |
| FFN intermediate | 1536 |
| Activation | SiLU |
| Context window | 8192 tokens |
| Position encoding | RoPE (theta=100000) |
| Original vocab | 49,152 |
| Expanded vocab | 53,258 (+4096 speech + 10 special) |
| Final trainable params | ~139M |
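As a sanity check on the parameter rows above, the count can be estimated from the listed dimensions. This is a rough sketch, not the author's accounting: it assumes a standard bias-free Llama layout, RMSNorm weights, a head dimension of 64 (576 / 9), and tied input/output embeddings. The function name `llama_param_count` is made up for illustration. Under these assumptions the expanded-vocabulary model comes out at about 136.9M, slightly below the ~139M quoted, so the result is an estimate rather than the exact figure.

```python
def llama_param_count(vocab_size, hidden=576, layers=30, heads=9,
                      kv_heads=3, ffn=1536, tied_embeddings=True):
    """Estimate parameters of a bias-free Llama-style decoder (assumed layout)."""
    head_dim = hidden // heads                   # 64 for these dimensions
    attn = 2 * hidden * hidden                   # q_proj and o_proj
    attn += 2 * hidden * kv_heads * head_dim     # k_proj and v_proj (GQA)
    mlp = 3 * hidden * ffn                       # gate, up, and down projections
    norms = 2 * hidden                           # two RMSNorm weights per layer
    per_layer = attn + mlp + norms
    embed = vocab_size * hidden                  # token embedding matrix
    total = layers * per_layer + hidden + embed  # "+ hidden" is the final RMSNorm
    if not tied_embeddings:
        total += embed                           # separate output projection
    return total


base = llama_param_count(49_152)       # original SmolLM2 vocabulary
expanded = llama_param_count(53_258)   # after adding speech and special tokens
print(f"base: {base / 1e6:.1f}M, expanded: {expanded / 1e6:.1f}M, "
      f"added by new embeddings: {(expanded - base) / 1e6:.2f}M")
```

The added parameters come entirely from the 4,106 new embedding rows (4,106 x 576), which is why expanding the vocabulary changes the model size only slightly.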
4. VOCABULARY (53,258 tokens)
Tokens 0-49151: BPE text tokens (original SmolLM2)
Tokens 49152-53247: CosyVoice speech codes (4096 codebook)
Token 53248: [CHUNK] - 200ms chunk boundary
Token 53249: [ASR] - ASR task prefix
Token 53250: [TTS] - TTS task prefix
Token 53251: [SOS] - Start of speech
Token 53252: [EOS] - End of speech
Token 53253: [SOT] - Start of text
Token 53254: [EOT] - End of text
Token 53255: <sil_sp> - Silent speech (agent listening)
Token 53256: <sil_txt> - No text this chunk
Token 53257: <bch> - Backchannel trigger
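The ID layout above can be captured in a few constants and two helpers that map between CosyVoice codebook indices and model token IDs. This is a minimal sketch; the names (`SPECIAL`, `speech_code_to_token`, `token_to_speech_code`) are illustrative and not part of any library.

```python
TEXT_VOCAB = 49_152        # tokens 0-49151: original SmolLM2 BPE vocabulary
SPEECH_CODEBOOK = 4_096    # tokens 49152-53247: CosyVoice speech codes
SPEECH_OFFSET = TEXT_VOCAB

# Special tokens follow the speech block, in the order listed above.
SPECIAL_NAMES = ["[CHUNK]", "[ASR]", "[TTS]", "[SOS]", "[EOS]",
                 "[SOT]", "[EOT]", "<sil_sp>", "<sil_txt>", "<bch>"]
SPECIAL = {name: SPEECH_OFFSET + SPEECH_CODEBOOK + i
           for i, name in enumerate(SPECIAL_NAMES)}
VOCAB_SIZE = TEXT_VOCAB + SPEECH_CODEBOOK + len(SPECIAL_NAMES)


def speech_code_to_token(code: int) -> int:
    """Map a codebook index (0-4095) to its model token ID."""
    if not 0 <= code < SPEECH_CODEBOOK:
        raise ValueError(f"speech code out of range: {code}")
    return SPEECH_OFFSET + code


def token_to_speech_code(token: int) -> int:
    """Inverse mapping, used before handing tokens to the speech decoder."""
    code = token - SPEECH_OFFSET
    if not 0 <= code < SPEECH_CODEBOOK:
        raise ValueError(f"not a speech token: {token}")
    return code


print(VOCAB_SIZE, SPECIAL["[CHUNK]"], SPECIAL["<bch>"])
```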
5. TOKEN SEQUENCE FORMATS
Stage 1 - ASR:
[ASR] [SOS] sp1 sp2 sp3 ... spN [EOS] [SOT] txt1 txt2 ... txtM [EOT]
Stage 1 - TTS:
[TTS] [SOT] txt1 txt2 ... txtM [EOT] [SOS] sp1 sp2 sp3 ... spN [EOS]
Stage 2 - Half-Duplex Dialogue:
[SOS] user_sp1..N [EOS] [SOT] agent_txt1..M [EOT] [SOS] agent_sp1..K [EOS] [SOS] user_sp... [EOS] ...
Stage 3 - Full-Duplex (200ms chunks):
Per chunk (13 tokens):
[CHUNK] usr_sp1 usr_sp2 usr_sp3 usr_sp4 usr_sp5 | agt_txt1 agt_txt2 | agt_sp1 agt_sp2 agt_sp3 agt_sp4 agt_sp5
Agent speaking, user silent:
[CHUNK] <sil_sp> <sil_sp> <sil_sp> <sil_sp> <sil_sp> | txt1 txt2 | sp1 sp2 sp3 sp4 sp5
Agent listening, user speaking:
[CHUNK] usr1 usr2 usr3 usr4 usr5 | <sil_txt> <sil_txt> | <sil_sp> <sil_sp> <sil_sp> <sil_sp> <sil_sp>
Backchannel:
[CHUNK] usr1 usr2 usr3 usr4 usr5 | <bch> <sil_txt> | bch_sp1 bch_sp2 bch_sp3 bch_sp4 bch_sp5
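The three chunk layouts above differ only in which slots hold real tokens and which hold silence markers. A minimal sketch of a chunk builder follows. It assumes that any unused slot, including the tail of a partly filled stream, is padded with the matching silence token; the source does not specify how partial slots are handled, and `build_chunk` is a hypothetical helper name.

```python
CHUNK, SIL_SP, SIL_TXT, BCH = 53248, 53255, 53256, 53257
SPEECH_PER_CHUNK, TEXT_PER_CHUNK = 5, 2


def _pad(tokens, size, filler):
    """Right-pad a token list to a fixed slot count (assumed padding rule)."""
    tokens = list(tokens)
    if len(tokens) > size:
        raise ValueError(f"expected at most {size} tokens, got {len(tokens)}")
    return tokens + [filler] * (size - len(tokens))


def build_chunk(user_sp=(), agent_txt=(), agent_sp=()):
    """Return the 13 token IDs for one 200ms chunk."""
    return ([CHUNK]
            + _pad(user_sp, SPEECH_PER_CHUNK, SIL_SP)
            + _pad(agent_txt, TEXT_PER_CHUNK, SIL_TXT)
            + _pad(agent_sp, SPEECH_PER_CHUNK, SIL_SP))


user = [49152 + i for i in range(5)]           # five user speech tokens
listening = build_chunk(user_sp=user)          # agent stays silent
backchannel = build_chunk(user_sp=user, agent_txt=[BCH],
                          agent_sp=[50000 + i for i in range(5)])
print(len(listening), listening)
print(len(backchannel), backchannel)
```

The `|` separators in the layouts above are visual only, so they do not appear in the token list.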
Context window math:
- 8192 tokens / 13 per chunk = 630 chunks (rounded down)
- 630 x 200ms = 126 seconds max conversation
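The same arithmetic can be parameterized by chunk size, which is useful when comparing layouts. The sketch below assumes that the two-stream layout of Sub-stage 3b keeps the [CHUNK] marker and simply drops the two text slots, giving 11 tokens per chunk; the source does not state this, and `max_conversation` is an illustrative name.

```python
CONTEXT_TOKENS = 8192
CHUNK_MS = 200


def max_conversation(tokens_per_chunk, context_tokens=CONTEXT_TOKENS):
    """Return (whole chunks that fit, seconds of conversation they cover)."""
    chunks = context_tokens // tokens_per_chunk
    return chunks, chunks * CHUNK_MS / 1000


three_stream = max_conversation(1 + 5 + 2 + 5)   # 13 tokens per chunk
two_stream = max_conversation(1 + 5 + 5)         # assumed 11 tokens per chunk
print("three-stream:", three_stream)
print("two-stream:  ", two_stream)
```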
6. TRAINING STAGES
Stage 1: Modality Alignment
| Setting | Value |
|---|---|
| Goal | Teach speech <-> text mapping |
| Tasks | ASR + TTS (50/50 mix) |
| Data | LibriSpeech 960h (free, HuggingFace) |
| Format | [ASR] speech -> text and [TTS] text -> speech |
| Init | SmolLM2-135M pretrained (expanded vocab) |
| LR | 5e-5, cosine schedule, 500 warmup steps |
| Batch | 32 (x2 grad accum = 64 effective) |
| Seq len | 1024 |
| Epochs | 3 |
| Loss mask | Target tokens only (text for ASR, speech for TTS) |
| Duration | ~2h on A10G |
| Success | Model can do basic ASR + TTS |
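The Stage 1 format and loss mask can be sketched as a function that builds one training example and its labels. Two assumptions are made here that the source does not state: the delimiter tokens around the target span count as targets, and masked positions use -100, the default `ignore_index` of PyTorch's cross-entropy loss. `build_stage1_example` is a hypothetical name.

```python
ASR, TTS, SOS, EOS, SOT, EOT = 53249, 53250, 53251, 53252, 53253, 53254
IGNORE = -100  # positions with this label are skipped by the loss


def build_stage1_example(task, speech, text):
    """Return (input_ids, labels) for one ASR or TTS example."""
    speech_span = [SOS, *speech, EOS]
    text_span = [SOT, *text, EOT]
    if task == "asr":                        # speech in, text out
        prompt, target = [ASR, *speech_span], text_span
    elif task == "tts":                      # text in, speech out
        prompt, target = [TTS, *text_span], speech_span
    else:
        raise ValueError(f"unknown task: {task!r}")
    input_ids = prompt + target
    labels = [IGNORE] * len(prompt) + target  # loss on target tokens only
    return input_ids, labels


speech = [49152, 49153, 49154]   # toy speech tokens
text = [101, 102]                # toy text tokens
asr_ids, asr_labels = build_stage1_example("asr", speech, text)
tts_ids, tts_labels = build_stage1_example("tts", speech, text)
print(asr_ids, asr_labels)
print(tts_ids, tts_labels)
```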
Stage 2: Half-Duplex Dialogue
| Setting | Value |
|---|---|
| Goal | Learn turn-based conversation |
| Tasks | User speaks -> Agent thinks (text) -> Agent speaks |
| Data | 10K synthetic conversations (LLM-generated text + CosyVoice TTS) |
| Format | Sequential: user_speech -> agent_text -> agent_speech -> ... |
| Init | Stage 1 checkpoint |
| LR | 2e-5, cosine, 200 warmup |
| Batch | 16 (x4 grad accum = 64 effective) |
| Seq len | 4096 |
| Epochs | 5 |
| Loss mask | Agent tokens only (not user speech) |
| Duration | ~3h on A10G |
| Success | Coherent responses to spoken queries |
Stage 3: Full-Duplex Interaction
| Setting | Value |
|---|---|
| Goal | Simultaneous listen+speak, 200ms turn-taking |
| Tasks | Turn-taking, backchanneling, interruption handling |
| Data | 15K conversations chunked at 200ms (augmented from Stage 2 + fresh) |
| Format | Flattened chunks: [CHUNK] user_sp5 agent_txt2 agent_sp5 |
| Sub-stage 3a | Three-stream (user_sp + agent_txt + agent_sp), 6 epochs |
| Sub-stage 3b | Two-stream (user_sp + agent_sp, no text), 4 epochs |
| Init | Stage 2 checkpoint |
| LR | 1e-5, cosine, 100 warmup |
| Batch | 8 (x8 grad accum = 64 effective) |
| Seq len | 8192 |
| Epochs | 10 total |
| Loss mask | Agent tokens only |
| Duration | ~8h on A10G |
| Success | <400ms turn-taking, natural backchannels, handles interrupts |
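Each stage lists a peak learning rate, a cosine schedule, and a warmup length. A minimal sketch of such a schedule is shown below. It assumes linear warmup and decay to zero, neither of which the source specifies, and the total step count is a placeholder because the document does not give one.

```python
import math


def lr_at(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# (peak LR, warmup steps) taken from the three stage tables.
STAGES = {"stage1": (5e-5, 500), "stage2": (2e-5, 200), "stage3": (1e-5, 100)}
TOTAL_STEPS = 2_000  # hypothetical; actual step counts are not given

for name, (peak, warmup) in STAGES.items():
    samples = [0, warmup - 1, (warmup + TOTAL_STEPS) // 2, TOTAL_STEPS]
    print(name, [f"{lr_at(s, peak, warmup, TOTAL_STEPS):.2e}" for s in samples])
```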
7. DATA GENERATION
Stage 1 (download only):
from datasets import load_dataset
ds = load_dataset("openslr/librispeech_asr", "all")
Stage 2 (synthetic generation):
Stage 3 (augmentation):
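No code is given for the Stage 3 step. One way a turn-based Stage 2 conversation could be re-cut into 200ms chunks is sketched below. Everything here is an assumption for illustration: turns do not overlap, the silent party's slots are filled with silence tokens, and agent text is emitted two tokens per chunk from the start of the agent's turn. Overlaps, interruptions, and backchannels, which Stage 3 is meant to teach, would need extra handling that this sketch does not cover.

```python
CHUNK, SIL_SP, SIL_TXT = 53248, 53255, 53256
SP, TXT = 5, 2  # speech and text slots per 200ms chunk


def _slices(tokens, size):
    """Split a token list into consecutive pieces of at most `size` items."""
    return [list(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def _fill(tokens, size, filler):
    """Right-pad a piece to exactly `size` items."""
    return tokens + [filler] * (size - len(tokens))


def user_turn_to_chunks(user_sp):
    """Chunks for a user turn: agent text and speech slots stay silent."""
    return [[CHUNK] + _fill(sp, SP, SIL_SP) + [SIL_TXT] * TXT + [SIL_SP] * SP
            for sp in _slices(user_sp, SP)]


def agent_turn_to_chunks(agent_txt, agent_sp):
    """Chunks for an agent turn: user slots silent, text fed 2 tokens per chunk."""
    sp_parts = _slices(agent_sp, SP)
    txt_parts = _slices(agent_txt, TXT)
    if len(txt_parts) > len(sp_parts):
        raise ValueError("text does not fit in the chunks spanned by the speech")
    txt_parts += [[] for _ in range(len(sp_parts) - len(txt_parts))]
    return [[CHUNK] + [SIL_SP] * SP + _fill(t, TXT, SIL_TXT) + _fill(s, SP, SIL_SP)
            for t, s in zip(txt_parts, sp_parts)]


user_chunks = user_turn_to_chunks(list(range(49152, 49159)))        # 7 tokens
agent_chunks = agent_turn_to_chunks([10, 11, 12], list(range(50000, 50012)))
for chunk in user_chunks + agent_chunks:
    print(chunk)
```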
8. INFERENCE
Latency Budget (per 200ms chunk):
Audio capture:          ~1ms
CosyVoice encode:       ~10ms
LLM forward (7 tok):    ~20-30ms (RTX 3060)
CosyVoice decode:       ~15ms
Audio playback:         ~1ms
-----------------------------
TOTAL:                  ~50-60ms (within 200ms budget)
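The budget can be checked by summing the per-stage estimates and comparing the worst case against the 200ms chunk period. The figures below are copied from the list above; the dictionary layout is only an illustration. The exact sum is 47-57ms, which the total above rounds to ~50-60ms.

```python
BUDGET_MS = 200  # one chunk must be processed before the next one arrives

# (low, high) estimates in milliseconds, copied from the latency budget above.
STAGES_MS = {
    "audio capture": (1, 1),
    "CosyVoice encode": (10, 10),
    "LLM forward (7 tokens)": (20, 30),
    "CosyVoice decode": (15, 15),
    "audio playback": (1, 1),
}

low = sum(lo for lo, _ in STAGES_MS.values())
high = sum(hi for _, hi in STAGES_MS.values())
headroom = BUDGET_MS - high
print(f"total: {low}-{high} ms, worst-case headroom: {headroom} ms "
      f"({high / BUDGET_MS:.0%} of the chunk period used)")
```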
Memory:
Model (bf16):           270 MB
KV cache (8192 ctx):    ~1.5 GB
CosyVoice enc+dec:      ~300 MB
-----------------------------
TOTAL VRAM:             ~2.5 GB (fits on any modern GPU)
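The KV cache line can be cross-checked from the backbone dimensions. The sketch below assumes batch size 1, bf16 cache entries, a head dimension of 64 (576 / 9), and that only the 3 KV heads are cached under GQA. With those assumptions the estimate is about 180 MiB, well under the ~1.5 GB listed, which suggests the figure above is a conservative allowance or includes buffers not itemized here.

```python
def kv_cache_bytes(seq_len=8192, layers=30, kv_heads=3, head_dim=64,
                   bytes_per_value=2, batch=1):
    """Size of the key and value caches for a full context window."""
    # Factor 2 covers the separate K and V tensors stored in every layer.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value * batch


gqa = kv_cache_bytes()                 # 3 KV heads, as in the backbone table
full_mha = kv_cache_bytes(kv_heads=9)  # for comparison: one KV head per query head
print(f"GQA cache: {gqa / 2**20:.0f} MiB, without GQA: {full_mha / 2**20:.0f} MiB")
```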
Realtime loop:
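# Pseudocode: capture_200ms, cosyvoice_encode, cosyvoice_decode, play,
# all_silent, and llm are placeholders for the real components.
context = []  # running token history, bounded by the 8192-token window
while True:
    user_audio = capture_200ms()                      # block until 200ms of mic audio
    user_tokens = cosyvoice_encode(user_audio)        # 5 speech tokens
    context.extend([CHUNK] + user_tokens)             # open a new chunk
    agent_tokens = llm.generate(context, max_new=7)   # 2 text + 5 speech tokens
    context.extend(agent_tokens)
    agent_sp = agent_tokens[2:]                       # drop the 2 text tokens
    if not all_silent(agent_sp):
        play(cosyvoice_decode(agent_sp))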
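The loop relies on an `all_silent` check and on slicing the seven generated tokens by position, neither of which is defined. A minimal sketch of both helpers follows, assuming the model always emits exactly 2 text tokens followed by 5 speech tokens and that silence means every speech slot holds `<sil_sp>`. `split_agent_tokens` is a hypothetical name.

```python
SIL_SP, SIL_TXT, BCH = 53255, 53256, 53257
TEXT_PER_CHUNK, SPEECH_PER_CHUNK = 2, 5


def split_agent_tokens(agent_tokens):
    """Split one chunk's generated tokens into (text, speech) parts."""
    expected = TEXT_PER_CHUNK + SPEECH_PER_CHUNK
    if len(agent_tokens) != expected:
        raise ValueError(f"expected {expected} tokens, got {len(agent_tokens)}")
    return agent_tokens[:TEXT_PER_CHUNK], agent_tokens[TEXT_PER_CHUNK:]


def all_silent(speech_tokens):
    """True when every speech slot is the silent-speech token."""
    return all(tok == SIL_SP for tok in speech_tokens)


listening = [SIL_TXT, SIL_TXT] + [SIL_SP] * 5
speaking = [101, 102, 49152, 49153, 49154, 49155, 49156]
for tokens in (listening, speaking):
    text, speech = split_agent_tokens(tokens)
    print("play audio" if not all_silent(speech) else "stay quiet", text, speech)
```

Validating the length before slicing guards against a generation call that stops early, which would otherwise shift text tokens into the speech slots.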
9. COSTS
| Item | GPU Hours | Cost (A10G) |
|---|---|---|
| Tokenize LibriSpeech | 10h | $11 |
| Generate Stage 2 text | 2h | $2 |
| CosyVoice TTS (Stage 2) | 8h | $9 |
| CosyVoice TTS (Stage 3 extra) | 5h | $6 |
| Train Stage 1 | 2h | $2 |
| Train Stage 2 | 3h | $3 |
| Train Stage 3 | 8h | $9 |
| TOTAL | ~38h | ~$44 |
On an owned RTX 4090: ~$2 in electricity.
10. EVALUATION METRICS
| Metric | Target | Stage |
|---|---|---|
| ASR WER (LibriSpeech dev) | <30% | 1 |
| TTS intelligibility | >70% words correct | 1 |
| Response relevance (GPT-4 judge) | >3.0/5.0 | 2 |
| Turn-taking latency | <400ms | 3 |
| Backchannel F1 | >0.5 | 3 |
| Interruption yield rate | >80% within 400ms | 3 |
| FD-Bench score | Report (no target) | 3 |
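For the ASR WER row, word error rate is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below is a plain dynamic-programming version with whitespace tokenization; any text normalization (casing, punctuation) applied before scoring is not specified in the source and is left out here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.1%}  (Stage 1 target: below 30%)")
```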
11. REFERENCES