# SmolDuplex: Complete Architecture & PRD
## 1. SYSTEM OVERVIEW
**SmolDuplex** is a 139M-parameter full-duplex spoken interaction model. It simultaneously listens and speaks with 200ms turn-taking granularity.
| Metric | Target |
|--------|--------|
| Trainable Parameters | ~139M |
| Turn-taking latency | <400ms |
| Chunk size | 200ms |
| Simultaneous listen+speak | Yes |
| Hardware (inference) | Any 8GB+ GPU |
| Training cost | ~$44 cloud |
| Training time | ~38 GPU-hours |
## 2. ARCHITECTURE DIAGRAM
```
USER AUDIO IN (mic)                                        AGENT AUDIO OUT (speaker)
      |                                                                ^
      v                                                                |
+------------+  5 tokens/200ms   +-----------+  5 tokens/200ms   +------------+
| CosyVoice  |------------------>|  SmolLM2  |------------------>| CosyVoice  |
|  Encoder   |                   |   135M    |                   |  Decoder   |
|  (frozen)  |                   | (trained) |                   |  (frozen)  |
|    ~70M    |                   |   ~139M   |                   |    ~80M    |
+------------+                   +-----------+                   +------------+
                                       |
                                Standard causal
                             next-token prediction
```
## 3. LLM BACKBONE
**Base**: [HuggingFaceTB/SmolLM2-135M](https://huggingface.co/HuggingFaceTB/SmolLM2-135M)
| Parameter | Value |
|-----------|-------|
| Architecture | LlamaForCausalLM |
| Layers | 30 |
| Hidden size | 576 |
| Attention heads | 9 (GQA, 3 KV heads) |
| FFN intermediate | 1536 |
| Activation | SiLU |
| Context window | 8192 tokens |
| Position encoding | RoPE (theta=100000) |
| Original vocab | 49,152 |
| **Expanded vocab** | **53,258** (+4096 speech + 10 special) |
| **Final trainable params** | **~139M** |
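
As a sanity check on the parameter budget, the sketch below counts the weights of a Llama-style decoder from the values in the table above. It assumes tied input/output embeddings, no bias terms, and `head_dim = hidden / heads`; none of these are stated in this document, and `llama_param_count` is a hypothetical helper. Under those assumptions the expanded model comes out near 136.9M, slightly below the ~139M target, so the exact figure may depend on how the embeddings are counted.

```python
# Rough parameter count for a Llama-style decoder.
# Assumptions (not stated in this document): tied embeddings, no biases,
# head_dim = hidden // heads.

def llama_param_count(vocab, hidden=576, layers=30, heads=9, kv_heads=3, ffn=1536):
    head_dim = hidden // heads
    attn = 2 * hidden * hidden + 2 * hidden * (kv_heads * head_dim)  # q, o + k, v
    mlp = 3 * hidden * ffn                                           # gate, up, down
    norms = 2 * hidden                                               # two RMSNorms per layer
    final_norm = hidden
    return vocab * hidden + layers * (attn + mlp + norms) + final_norm

base = llama_param_count(49_152)       # original vocabulary
expanded = llama_param_count(53_258)   # +4096 speech codes +10 special tokens

print(f"base:     {base:,}")
print(f"expanded: {expanded:,}")
print(f"added:    {expanded - base:,}")
```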
## 4. VOCABULARY (53,258 tokens)
```
Tokens 0-49151: BPE text tokens (original SmolLM2)
Tokens 49152-53247: CosyVoice speech codes (4096 codebook)
Token 53248: [CHUNK]   - 200ms chunk boundary
Token 53249: [ASR]     - ASR task prefix
Token 53250: [TTS]     - TTS task prefix
Token 53251: [SOS]     - Start of speech
Token 53252: [EOS]     - End of speech
Token 53253: [SOT]     - Start of text
Token 53254: [EOT]     - End of text
Token 53255: <sil_sp>  - Silent speech (agent listening)
Token 53256: <sil_txt> - No text this chunk
Token 53257: <bch>     - Backchannel trigger
```
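
The layout above can be expressed as a few constants and conversion helpers. This is a minimal sketch: the ID ranges come from the listing, while the helper names (`speech_code_to_id`, `token_kind`) are hypothetical.

```python
# Vocabulary layout from the listing above; helper names are illustrative only.
TEXT_VOCAB = 49_152
SPEECH_CODES = 4_096
SPEECH_OFFSET = TEXT_VOCAB                      # first speech token ID
SPECIAL_OFFSET = SPEECH_OFFSET + SPEECH_CODES   # first special token ID

SPECIAL_TOKENS = ["[CHUNK]", "[ASR]", "[TTS]", "[SOS]", "[EOS]",
                  "[SOT]", "[EOT]", "<sil_sp>", "<sil_txt>", "<bch>"]
SPECIAL_IDS = {name: SPECIAL_OFFSET + i for i, name in enumerate(SPECIAL_TOKENS)}
VOCAB_SIZE = SPECIAL_OFFSET + len(SPECIAL_TOKENS)

def speech_code_to_id(code):
    """Map a CosyVoice codebook index (0..4095) to its token ID."""
    if not 0 <= code < SPEECH_CODES:
        raise ValueError(f"speech code out of range: {code}")
    return SPEECH_OFFSET + code

def token_kind(token_id):
    """Classify a token ID as 'text', 'speech', or 'special'."""
    if not 0 <= token_id < VOCAB_SIZE:
        raise ValueError(f"token ID out of range: {token_id}")
    if token_id < SPEECH_OFFSET:
        return "text"
    return "speech" if token_id < SPECIAL_OFFSET else "special"

print(VOCAB_SIZE, SPECIAL_IDS["[CHUNK]"], SPECIAL_IDS["<bch>"])
```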
## 5. TOKEN SEQUENCE FORMATS
### Stage 1 - ASR:
```
[ASR] [SOS] sp1 sp2 sp3 ... spN [EOS] [SOT] txt1 txt2 ... txtM [EOT]
```
### Stage 1 - TTS:
```
[TTS] [SOT] txt1 txt2 ... txtM [EOT] [SOS] sp1 sp2 sp3 ... spN [EOS]
```
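
A minimal sketch of how the two Stage 1 formats could be assembled together with labels that restrict the loss to the target span (text for ASR, speech for TTS). The function names are hypothetical. Using `-100` as the ignored label follows the common PyTorch `ignore_index` default, and including the closing delimiter in the loss is an assumption rather than something this document specifies.

```python
# Special-token IDs from the vocabulary listing in Section 4.
ASR, TTS, SOS, EOS, SOT, EOT = 53249, 53250, 53251, 53252, 53253, 53254
IGNORE = -100  # assumption: label value skipped by the loss function

def build_asr_example(speech_ids, text_ids):
    """[ASR] [SOS] speech [EOS] [SOT] text [EOT]; loss on the text span only."""
    prompt = [ASR, SOS, *speech_ids, EOS]
    target = [SOT, *text_ids, EOT]
    return prompt + target, [IGNORE] * len(prompt) + target

def build_tts_example(text_ids, speech_ids):
    """[TTS] [SOT] text [EOT] [SOS] speech [EOS]; loss on the speech span only."""
    prompt = [TTS, SOT, *text_ids, EOT]
    target = [SOS, *speech_ids, EOS]
    return prompt + target, [IGNORE] * len(prompt) + target

asr_ids, asr_labels = build_asr_example([49152, 49153, 49154], [10, 11])
tts_ids, tts_labels = build_tts_example([10, 11], [49152, 49153, 49154])
print(asr_ids)
print(asr_labels)
```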
### Stage 2 - Half-Duplex Dialogue:
```
[SOS] user_sp1..N [EOS] [SOT] agent_txt1..M [EOT] [SOS] agent_sp1..K [EOS] [SOS] user_sp... [EOS] ...
```
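
The half-duplex format can be sketched as a loop over exchanges, masking user speech so that only agent tokens contribute to the loss. `build_half_duplex_example` and the `(user_speech, agent_text, agent_speech)` tuple layout are hypothetical, and `-100` as the ignored label is an assumption.

```python
SOS, EOS, SOT, EOT = 53251, 53252, 53253, 53254
IGNORE = -100  # assumption: label value skipped by the loss function

def build_half_duplex_example(exchanges):
    """exchanges: list of (user_speech_ids, agent_text_ids, agent_speech_ids)."""
    input_ids, labels = [], []
    for user_sp, agent_txt, agent_sp in exchanges:
        user_span = [SOS, *user_sp, EOS]
        agent_span = [SOT, *agent_txt, EOT, SOS, *agent_sp, EOS]
        input_ids += user_span + agent_span
        labels += [IGNORE] * len(user_span) + agent_span  # train on agent tokens only
    return input_ids, labels

hd_ids, hd_labels = build_half_duplex_example([([49152], [5], [49153, 49154])])
print(hd_ids)
print(hd_labels)
```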
### Stage 3 - Full-Duplex (200ms chunks):
```
Per chunk (13 tokens):
[CHUNK] usr_sp1 usr_sp2 usr_sp3 usr_sp4 usr_sp5 | agt_txt1 agt_txt2 | agt_sp1 agt_sp2 agt_sp3 agt_sp4 agt_sp5
Agent speaking, user silent:
[CHUNK] <sil_sp> <sil_sp> <sil_sp> <sil_sp> <sil_sp> | txt1 txt2 | sp1 sp2 sp3 sp4 sp5
Agent listening, user speaking:
[CHUNK] usr1 usr2 usr3 usr4 usr5 | <sil_txt> <sil_txt> | <sil_sp> <sil_sp> <sil_sp> <sil_sp> <sil_sp>
Backchannel:
[CHUNK] usr1 usr2 usr3 usr4 usr5 | <bch> <sil_txt> | bch_sp1 bch_sp2 bch_sp3 bch_sp4 bch_sp5
```
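
A minimal sketch of flattening one 200ms chunk into the 13-token layout shown above. The `|` separators in the listing are visual only and are not tokens. Padding short streams with `<sil_sp>` / `<sil_txt>` is an assumption, and `flatten_chunk` is a hypothetical helper.

```python
CHUNK, SIL_SP, SIL_TXT, BCH = 53248, 53255, 53256, 53257
SPEECH_PER_CHUNK, TEXT_PER_CHUNK = 5, 2

def pad(tokens, size, fill):
    """Right-pad a stream to a fixed slot count (assumed padding policy)."""
    if len(tokens) > size:
        raise ValueError(f"stream has {len(tokens)} tokens, slot holds {size}")
    return list(tokens) + [fill] * (size - len(tokens))

def flatten_chunk(user_speech, agent_text, agent_speech):
    """Return [CHUNK] + 5 user speech + 2 agent text + 5 agent speech tokens."""
    return [CHUNK,
            *pad(user_speech, SPEECH_PER_CHUNK, SIL_SP),
            *pad(agent_text, TEXT_PER_CHUNK, SIL_TXT),
            *pad(agent_speech, SPEECH_PER_CHUNK, SIL_SP)]

user = [49152, 49153, 49154, 49155, 49156]
bch_speech = [49200, 49201, 49202, 49203, 49204]

listening = flatten_chunk(user, [], [])               # agent listening
backchannel = flatten_chunk(user, [BCH], bch_speech)  # agent backchannels
print(listening)
print(backchannel)
```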
### Context window math:
- 8192 tokens / 13 per chunk = 630 chunks
- 630 x 200ms = **126 seconds** max conversation
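
The same arithmetic as a short sketch. The 11-token case assumes the two text slots are dropped entirely in the two-stream setting (Sub-stage 3b), which this document does not spell out.

```python
CONTEXT_TOKENS = 8192
CHUNK_MS = 200

def max_conversation_seconds(tokens_per_chunk):
    """Whole chunks that fit in the context window, converted to seconds."""
    chunks = CONTEXT_TOKENS // tokens_per_chunk
    return chunks, chunks * CHUNK_MS / 1000

three_stream = max_conversation_seconds(1 + 5 + 2 + 5)  # [CHUNK] + user + text + agent
two_stream = max_conversation_seconds(1 + 5 + 5)        # assumed layout without text

print(three_stream)
print(two_stream)
```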
## 6. TRAINING STAGES
### Stage 1: Modality Alignment
| Setting | Value |
|---------|-------|
| **Goal** | Teach speech <-> text mapping |
| **Tasks** | ASR + TTS (50/50 mix) |
| **Data** | LibriSpeech 960h (free, HuggingFace) |
| **Format** | `[ASR] speech -> text` and `[TTS] text -> speech` |
| **Init** | SmolLM2-135M pretrained (expanded vocab) |
| **LR** | 5e-5, cosine schedule, 500 warmup steps |
| **Batch** | 32 (x2 grad accum = 64 effective) |
| **Seq len** | 1024 |
| **Epochs** | 3 |
| **Loss mask** | Target tokens only (text for ASR, speech for TTS) |
| **Duration** | ~2h on A10G |
| **Success** | Model can do basic ASR + TTS |
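
The learning-rate row can be read as warmup followed by cosine decay. The sketch below assumes linear warmup and decay to zero; neither detail is stated here. `total_steps` is a hypothetical value chosen only for illustration, since the step count depends on the tokenized dataset size.

```python
import math

def lr_at(step, total_steps, peak_lr=5e-5, warmup_steps=500, min_lr=0.0):
    """Linear warmup (assumed) to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

total_steps = 10_000  # hypothetical, for illustration only
for step in (0, 250, 500, 5_250, 10_000):
    print(step, lr_at(step, total_steps))
```

The other stages would use the same shape with their own peak rate and warmup length, for example `lr_at(step, total_steps, peak_lr=2e-5, warmup_steps=200)` for Stage 2.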
### Stage 2: Half-Duplex Dialogue
| Setting | Value |
|---------|-------|
| **Goal** | Learn turn-based conversation |
| **Tasks** | User speaks -> Agent thinks (text) -> Agent speaks |
| **Data** | 10K synthetic conversations (LLM-generated text + CosyVoice TTS) |
| **Format** | Sequential: user_speech -> agent_text -> agent_speech -> ... |
| **Init** | Stage 1 checkpoint |
| **LR** | 2e-5, cosine, 200 warmup |
| **Batch** | 16 (x4 grad accum = 64 effective) |
| **Seq len** | 4096 |
| **Epochs** | 5 |
| **Loss mask** | Agent tokens only (not user speech) |
| **Duration** | ~3h on A10G |
| **Success** | Coherent responses to spoken queries |
### Stage 3: Full-Duplex Interaction
| Setting | Value |
|---------|-------|
| **Goal** | Simultaneous listen+speak, 200ms turn-taking |
| **Tasks** | Turn-taking, backchanneling, interruption handling |
| **Data** | 15K conversations chunked at 200ms (augmented from Stage 2 + fresh) |
| **Format** | Flattened chunks: `[CHUNK] user_sp5 agent_txt2 agent_sp5` |
| **Sub-stage 3a** | Three-stream (user_sp + agent_txt + agent_sp), 6 epochs |
| **Sub-stage 3b** | Two-stream (user_sp + agent_sp, no text), 4 epochs |
| **Init** | Stage 2 checkpoint |
| **LR** | 1e-5, cosine, 100 warmup |
| **Batch** | 8 (x8 grad accum = 64 effective) |
| **Seq len** | 8192 |
| **Epochs** | 10 total |
| **Loss mask** | Agent tokens only |
| **Duration** | ~8h on A10G |
| **Success** | <400ms turn-taking, natural backchannels, handles interrupts |
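
A minimal sketch of the agent-only loss mask on flattened chunks, plus one possible conversion from the three-stream layout to the two-stream layout of Sub-stage 3b. Both helpers are hypothetical. Masking the `[CHUNK]` marker along with user speech, and dropping the text slots outright, are assumptions.

```python
IGNORE = -100     # assumption: label value skipped by the loss function
CHUNK_LEN = 13    # [CHUNK] + 5 user speech + 2 agent text + 5 agent speech
AGENT_START = 6   # index of the first agent token inside a chunk
TEXT_END = 8      # agent text occupies indices 6 and 7

def agent_only_labels(input_ids):
    """Keep labels for agent text/speech positions; ignore everything else."""
    if len(input_ids) % CHUNK_LEN:
        raise ValueError("sequence is not a whole number of chunks")
    return [tok if i % CHUNK_LEN >= AGENT_START else IGNORE
            for i, tok in enumerate(input_ids)]

def to_two_stream(input_ids):
    """Drop the 2 agent-text slots from every chunk, giving 11-token chunks."""
    if len(input_ids) % CHUNK_LEN:
        raise ValueError("sequence is not a whole number of chunks")
    out = []
    for start in range(0, len(input_ids), CHUNK_LEN):
        chunk = input_ids[start:start + CHUNK_LEN]
        out += chunk[:AGENT_START] + chunk[TEXT_END:]
    return out

demo = list(range(13))  # stand-in token IDs that make positions easy to read
print(agent_only_labels(demo))
print(to_two_stream(demo))
```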
## 7. DATA GENERATION
### Stage 1 (download only):
```python
# LibriSpeech: free and ready to use from the Hugging Face Hub
from datasets import load_dataset

ds = load_dataset("openslr/librispeech_asr", "all")
# Then tokenize the audio with the CosyVoice encoder (~10h on A10G)
```
### Stage 2 (synthetic generation):
```python
# Step 1: Generate 10K text dialogues with any LLM
# Step 2: Synthesize each turn with CosyVoice TTS
# Step 3: Tokenize all audio with CosyVoice encoder
# Step 4: Format as half-duplex sequences
# Time: ~10h total, Cost: ~$11 (2h text generation + 8h TTS, per the cost table in Section 9)
```
### Stage 3 (augmentation):
```python
# Step 1: Take Stage 2 dialogues
# Step 2: Chunk into 200ms segments
# Step 3: Inject backchannels (p=0.10 per chunk during user speech)
# Step 4: Inject interruptions (p=0.05 per agent turn)
# Step 5: Add natural pauses (1-4 chunks between turns)
# Step 6: Generate 5K fresh conversations for diversity
# Step 7: Flatten into [CHUNK] user_sp5 agent_txt2 agent_sp5 format
# Time: ~5h for fresh generation, ~1h for augmentation
```
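
Steps 3 and 5 above can be sketched as follows. This is an illustrative sketch rather than the project's pipeline: the function names are hypothetical, backchannel audio is assumed to fit in a single 5-token chunk, and interruption injection (Step 4) is not covered.

```python
import random

CHUNK, SIL_SP, SIL_TXT, BCH = 53248, 53255, 53256, 53257

def silent_chunk():
    """A chunk in which neither side speaks (used for pauses between turns)."""
    return [CHUNK] + [SIL_SP] * 5 + [SIL_TXT] * 2 + [SIL_SP] * 5

def inject_backchannels(user_chunks, bch_speech, p=0.10, rng=random):
    """user_chunks: list of 5-token user speech chunks from one user turn."""
    out = []
    for user_sp in user_chunks:
        if rng.random() < p:   # agent backchannels while the user keeps talking
            out.append([CHUNK, *user_sp, BCH, SIL_TXT, *bch_speech])
        else:                  # agent stays silent and listens
            out.append([CHUNK, *user_sp, SIL_TXT, SIL_TXT] + [SIL_SP] * 5)
    return out

def add_pause(rng=random):
    """Insert 1-4 silent chunks between turns."""
    return [silent_chunk() for _ in range(rng.randint(1, 4))]

rng = random.Random(0)  # seeded for reproducibility
user_turn = [[49152 + i] * 5 for i in range(20)]
bch = [49300, 49301, 49302, 49303, 49304]
chunks = inject_backchannels(user_turn, bch, p=0.10, rng=rng) + add_pause(rng)
print(len(chunks), sum(c[6] == BCH for c in chunks))
```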
## 8. INFERENCE
### Latency Budget (per 200ms chunk):
```
Audio capture: ~1ms
CosyVoice encode: ~10ms
LLM forward (7 tok): ~20-30ms (RTX 3060)
CosyVoice decode: ~15ms
Audio playback: ~1ms
─────────────────────────────
TOTAL: ~50-60ms ✓ (within 200ms budget)
```
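
Summing the per-stage estimates above reproduces the total. This is a small arithmetic sketch using the figures from the budget listing; the dictionary keys are illustrative labels.

```python
# (low, high) latency estimates in milliseconds, taken from the budget above.
budget_ms = {
    "audio_capture": (1, 1),
    "cosyvoice_encode": (10, 10),
    "llm_forward_7_tokens": (20, 30),
    "cosyvoice_decode": (15, 15),
    "audio_playback": (1, 1),
}
CHUNK_MS = 200

low = sum(lo for lo, _ in budget_ms.values())
high = sum(hi for _, hi in budget_ms.values())
headroom = CHUNK_MS - high

print(f"total: {low}-{high} ms, headroom: {headroom} ms per chunk")
```

The sum comes to 47-57 ms, in line with the rounded ~50-60 ms total, leaving well over half of each 200ms chunk as slack.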
### Memory:
```
Model (bf16): 270 MB
KV cache (8192 ctx): ~1.5 GB
CosyVoice enc+dec: ~300 MB
─────────────────────────────
TOTAL VRAM: ~2.5 GB (fits on any modern GPU)
```
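
The KV-cache line can be cross-checked from the backbone configuration in Section 3. The sketch below assumes bf16 cache entries, batch size 1, and `head_dim = 576 / 9 = 64`; `kv_cache_bytes` is a hypothetical helper. Under those assumptions the cache for a full 8192-token context is roughly 0.19 GB with 3 KV heads, which suggests the ~1.5 GB figure above is a generous upper bound rather than a tight estimate.

```python
def kv_cache_bytes(seq_len=8192, layers=30, kv_heads=3, head_dim=64,
                   bytes_per_value=2, batch=1):
    """Keys + values for every layer, KV head, and position."""
    return 2 * batch * layers * kv_heads * head_dim * seq_len * bytes_per_value

gqa = kv_cache_bytes()            # grouped-query attention, 3 KV heads
mha = kv_cache_bytes(kv_heads=9)  # what full multi-head attention would need

print(f"GQA (3 KV heads): {gqa / 1e6:.1f} MB")
print(f"MHA (9 KV heads): {mha / 1e6:.1f} MB")
```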
### Realtime loop:
```python
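# Pseudocode: capture_200ms, cosyvoice_encode, llm.generate, cosyvoice_decode,
# play, and all_silent are placeholders; generate is assumed to return only new tokens.
while True:
    user_audio = capture_200ms()                      # 200ms of mic input
    user_tokens = cosyvoice_encode(user_audio)        # -> 5 speech tokens
    context.extend([CHUNK] + user_tokens)             # append chunk marker + user speech
    agent_tokens = llm.generate(context, max_new=7)   # predict 2 text + 5 speech tokens
    context.extend(agent_tokens)                      # append generated tokens
    agent_sp = agent_tokens[2:]                       # last 5 = agent speech
    if not all_silent(agent_sp):
        play(cosyvoice_decode(agent_sp))              # output audio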
```
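
The loop relies on splitting the 7 generated tokens into text and speech and on a silence check. One possible shape for those helpers is sketched below. `split_agent_tokens` is a hypothetical name, and treating a chunk as silent only when all five speech slots are `<sil_sp>` is an assumption.

```python
SIL_SP = 53255  # <sil_sp> from the vocabulary listing in Section 4

def split_agent_tokens(agent_tokens):
    """Split one generated chunk into (2 text tokens, 5 speech tokens)."""
    if len(agent_tokens) != 7:
        raise ValueError(f"expected 7 generated tokens, got {len(agent_tokens)}")
    return agent_tokens[:2], agent_tokens[2:]

def all_silent(speech_tokens):
    """True when every speech slot is <sil_sp>, so nothing needs to be played."""
    return all(tok == SIL_SP for tok in speech_tokens)

listening = [53256, 53256] + [SIL_SP] * 5
speaking = [1234, 53256, 49152, 49153, 49154, 49155, 49156]

for generated in (listening, speaking):
    text, speech = split_agent_tokens(generated)
    print(text, speech, all_silent(speech))
```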
## 9. COSTS
| Item | GPU Hours | Cost (A10G) |
|------|-----------|-------------|
| Tokenize LibriSpeech | 10h | $11 |
| Generate Stage 2 text | 2h | $2 |
| CosyVoice TTS (Stage 2) | 8h | $9 |
| CosyVoice TTS (Stage 3 extra) | 5h | $6 |
| Train Stage 1 | 2h | $2 |
| Train Stage 2 | 3h | $3 |
| Train Stage 3 | 8h | $9 |
| **TOTAL** | **~38h** | **~$44** |
Own RTX 4090: ~$2 electricity.
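
A quick cross-check of the table, using its own figures. The rows sum to 38 GPU-hours and $42. The hourly rate is not stated in this document, so the gap to the ~$44 total may simply reflect per-row rounding.

```python
# (gpu_hours, cost_usd) per row, copied from the cost table above.
line_items = {
    "Tokenize LibriSpeech": (10, 11),
    "Generate Stage 2 text": (2, 2),
    "CosyVoice TTS (Stage 2)": (8, 9),
    "CosyVoice TTS (Stage 3 extra)": (5, 6),
    "Train Stage 1": (2, 2),
    "Train Stage 2": (3, 3),
    "Train Stage 3": (8, 9),
}

total_hours = sum(hours for hours, _ in line_items.values())
total_cost = sum(cost for _, cost in line_items.values())
training_hours = sum(h for name, (h, _) in line_items.items() if name.startswith("Train"))

print(f"{total_hours} GPU-hours, ${total_cost} summed from rows, "
      f"{training_hours}h of which is training")
```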
## 10. EVALUATION METRICS
| Metric | Target | Stage |
|--------|--------|-------|
| ASR WER (LibriSpeech dev) | <30% | 1 |
| TTS intelligibility | >70% words correct | 1 |
| Response relevance (GPT-4 judge) | >3.0/5.0 | 2 |
| Turn-taking latency | <400ms | 3 |
| Backchannel F1 | >0.5 | 3 |
| Interruption yield rate | >80% within 400ms | 3 |
| FD-Bench score | Report (no target) | 3 |
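
For the ASR target, word error rate is conventionally the word-level edit distance divided by the reference length. The sketch below is a plain dynamic-programming version with whitespace tokenization. Text normalization (casing, punctuation) is left out and would need to match whatever the evaluation script uses.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # edit distances against an empty reference
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                             # deletion
                           cur[j - 1] + 1,                          # insertion
                           prev[j - 1] + (ref_word != hyp_word)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)

score = wer("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {score:.3f}")
```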
## 11. REFERENCES
- [OmniFlatten (2410.17799)](https://arxiv.org/abs/2410.17799): Core arch, proven at 500M
- [SyncLLM (2409.15594)](https://arxiv.org/abs/2409.15594): Time-sync mechanism
- [Chronological Thinking (2510.05150)](https://arxiv.org/abs/2510.05150): Think-while-listen, 1.5B
- [Full-Duplex-Bench (2503.04721)](https://arxiv.org/abs/2503.04721): Evaluation standard
- [Sommelier (2603.25750)](https://arxiv.org/abs/2603.25750): Data processing pipeline
- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice): Speech tokenizer/detokenizer
- [SmolLM2 (2502.02737)](https://arxiv.org/abs/2502.02737): LLM backbone