	# SmolDuplex: Complete Architecture & PRD

	## 1. SYSTEM OVERVIEW

	SmolDuplex is a full-duplex spoken interaction model with ~139M trainable parameters. It listens and speaks at the same time, processing audio in 200ms chunks so that turn-taking decisions can be made every 200ms.

	\| Metric \| Target \|
	\|--------\|--------\|
	\| Trainable Parameters \| ~139M \|
	\| Turn-taking latency \| <400ms \|
	\| Chunk size \| 200ms \|
	\| Simultaneous listen+speak \| Yes \|
	\| Hardware (inference) \| Any 8GB+ GPU \|
	\| Total cost (data prep + training) \| ~$44 on cloud A10G \|
	\| Total compute (data prep + training) \| ~38 GPU-hours \|

	## 2. ARCHITECTURE DIAGRAM

	```
	USER AUDIO IN (mic)                                      AGENT AUDIO OUT (speaker)
	      |                                                              ^
	      v                                                              |
	+------------+  5 tokens/200ms  +-----------+  5 tokens/200ms  +------------+
	| CosyVoice  |----------------->|  SmolLM2  |----------------->| CosyVoice  |
	|  Encoder   |                  |   135M    |                  |  Decoder   |
	|  (frozen)  |                  | (trained) |                  |  (frozen)  |
	|    ~70M    |                  |   ~139M   |                  |    ~80M    |
	+------------+                  +-----------+                  +------------+
	                                      |
	                               Standard causal
	                            next-token prediction
	```

	## 3. LLM BACKBONE

	Base: [HuggingFaceTB/SmolLM2-135M](https://huggingface.co/HuggingFaceTB/SmolLM2-135M)

	\| Parameter \| Value \|
	\|-----------\|-------\|
	\| Architecture \| LlamaForCausalLM \|
	\| Layers \| 30 \|
	\| Hidden size \| 576 \|
	\| Attention heads \| 9 (GQA, 3 KV heads) \|
	\| FFN intermediate \| 1536 \|
	\| Activation \| SiLU \|
	\| Context window \| 8192 tokens \|
	\| Position encoding \| RoPE (theta=100000) \|
	\| Original vocab \| 49,152 \|
	\| Expanded vocab \| 53,258 (+4096 speech + 10 special) \|
	\| Final trainable params \| ~139M \|
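
The step from SmolLM2's roughly 135M parameters to ~139M can be sanity-checked from the table above. This is a minimal sketch, assuming each new token adds one 576-dimensional row per embedding matrix. Whether the input and output embeddings share weights is an assumption to verify against the checkpoint config, so both cases are computed.

```python
# Values taken from the backbone table.
HIDDEN_SIZE = 576
ORIGINAL_VOCAB = 49_152
SPEECH_CODES = 4_096
SPECIAL_TOKENS = 10

new_tokens = SPEECH_CODES + SPECIAL_TOKENS
expanded_vocab = ORIGINAL_VOCAB + new_tokens

# One new embedding row per token if input/output embeddings are tied,
# two rows (input embedding + LM head) if they are counted separately.
added_if_tied = new_tokens * HIDDEN_SIZE
added_if_untied = 2 * added_if_tied

print(expanded_vocab)   # 53258
print(added_if_tied)    # 2365056
print(added_if_untied)  # 4730112
```

Counting both matrices separately adds about 4.7M parameters, which is in line with the ~139M figure.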

	## 4. VOCABULARY (53,258 tokens)

	```
	Tokens 0-49151: BPE text tokens (original SmolLM2)
	Tokens 49152-53247: CosyVoice speech codes (4096 codebook)
	Token 53248: [CHUNK] — 200ms chunk boundary
	Token 53249: [ASR] — ASR task prefix
	Token 53250: [TTS] — TTS task prefix
	Token 53251: [SOS] — Start of speech
	Token 53252: [EOS] — End of speech
	Token 53253: [SOT] — Start of text
	Token 53254: [EOT] — End of text
	Token 53255: <sil_sp> — Silent speech (agent listening)
	Token 53256: <sil_txt> — No text this chunk
	Token 53257: <bch> — Backchannel trigger
	```
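
The layout above implies a fixed offset between CosyVoice codebook indices and model token IDs. The following sketch shows one way to express that mapping; the helper names are hypothetical and not part of any existing SmolDuplex code.

```python
TEXT_VOCAB_SIZE = 49_152        # original SmolLM2 BPE tokens: 0..49151
SPEECH_CODEBOOK_SIZE = 4_096    # CosyVoice codes: 49152..53247
SPEECH_OFFSET = TEXT_VOCAB_SIZE

# Special tokens in the order listed in the vocabulary layout.
SPECIAL_TOKENS = [
    "[CHUNK]", "[ASR]", "[TTS]", "[SOS]", "[EOS]",
    "[SOT]", "[EOT]", "<sil_sp>", "<sil_txt>", "<bch>",
]
SPECIAL_BASE = SPEECH_OFFSET + SPEECH_CODEBOOK_SIZE  # 53248
SPECIAL_IDS = {name: SPECIAL_BASE + i for i, name in enumerate(SPECIAL_TOKENS)}


def speech_code_to_token(code: int) -> int:
    """Map a CosyVoice codebook index (0..4095) to a model token ID."""
    if not 0 <= code < SPEECH_CODEBOOK_SIZE:
        raise ValueError(f"speech code out of range: {code}")
    return SPEECH_OFFSET + code


def is_speech_token(token_id: int) -> bool:
    """True if the token ID falls in the speech-code range."""
    return SPEECH_OFFSET <= token_id < SPECIAL_BASE


def token_to_speech_code(token_id: int) -> int:
    """Inverse of speech_code_to_token."""
    if not is_speech_token(token_id):
        raise ValueError(f"not a speech token: {token_id}")
    return token_id - SPEECH_OFFSET
```

For example, `speech_code_to_token(0)` gives 49152 and `SPECIAL_IDS["<bch>"]` gives 53257, matching the table.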

	## 5. TOKEN SEQUENCE FORMATS

	### Stage 1 — ASR:
	```
	[ASR] [SOS] sp1 sp2 sp3 ... spN [EOS] [SOT] txt1 txt2 ... txtM [EOT]
	```

	### Stage 1 — TTS:
	```
	[TTS] [SOT] txt1 txt2 ... txtM [EOT] [SOS] sp1 sp2 sp3 ... spN [EOS]
	```
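
The two Stage 1 formats differ only in the task prefix and the order of the speech and text spans. A minimal sketch of building them from already-tokenized inputs, with hypothetical function names and the special-token IDs from Section 4:

```python
from typing import List

# Special-token IDs from the vocabulary layout in Section 4.
ASR, TTS = 53249, 53250
SOS, EOS = 53251, 53252
SOT, EOT = 53253, 53254


def build_asr_sequence(speech: List[int], text: List[int]) -> List[int]:
    """[ASR] [SOS] speech... [EOS] [SOT] text... [EOT]"""
    return [ASR, SOS, *speech, EOS, SOT, *text, EOT]


def build_tts_sequence(text: List[int], speech: List[int]) -> List[int]:
    """[TTS] [SOT] text... [EOT] [SOS] speech... [EOS]"""
    return [TTS, SOT, *text, EOT, SOS, *speech, EOS]
```

Here `speech` is expected to hold token IDs in the 49152-53247 range and `text` ordinary BPE token IDs.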

	### Stage 2 — Half-Duplex Dialogue:
	```
	[SOS] user_sp1..N [EOS] [SOT] agent_txt1..M [EOT] [SOS] agent_sp1..K [EOS] [SOS] user_sp... [EOS] ...
	```
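
A half-duplex dialogue is a concatenation of turns, each made of a user speech span followed by the agent's text and speech spans. The sketch below builds such a sequence together with a per-token flag marking agent tokens, which is what the Stage 2 agent-only loss mask needs. Treating the boundary markers as belonging to the speaker they wrap is an assumption, and the function name is hypothetical.

```python
from typing import List, Tuple

SOS, EOS, SOT, EOT = 53251, 53252, 53253, 53254

# One turn = (user speech tokens, agent text tokens, agent speech tokens)
Turn = Tuple[List[int], List[int], List[int]]


def build_half_duplex(turns: List[Turn]) -> Tuple[List[int], List[bool]]:
    """Return the flat token sequence and a parallel is_agent flag list."""
    tokens: List[int] = []
    is_agent: List[bool] = []
    for user_sp, agent_txt, agent_sp in turns:
        user_part = [SOS, *user_sp, EOS]
        agent_part = [SOT, *agent_txt, EOT, SOS, *agent_sp, EOS]
        tokens += user_part + agent_part
        is_agent += [False] * len(user_part) + [True] * len(agent_part)
    return tokens, is_agent
```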

	### Stage 3 — Full-Duplex (200ms chunks):
	```
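	Per chunk (13 tokens: 1 [CHUNK] + 5 user speech + 2 agent text + 5 agent speech; the vertical bars are visual separators, not tokens):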
	[CHUNK] usr_sp1 usr_sp2 usr_sp3 usr_sp4 usr_sp5 \| agt_txt1 agt_txt2 \| agt_sp1 agt_sp2 agt_sp3 agt_sp4 agt_sp5

	Agent speaking, user silent:
	[CHUNK] <sil_sp> <sil_sp> <sil_sp> <sil_sp> <sil_sp> \| txt1 txt2 \| sp1 sp2 sp3 sp4 sp5

	Agent listening, user speaking:
	[CHUNK] usr1 usr2 usr3 usr4 usr5 \| <sil_txt> <sil_txt> \| <sil_sp> <sil_sp> <sil_sp> <sil_sp> <sil_sp>

	Backchannel:
	[CHUNK] usr1 usr2 usr3 usr4 usr5 \| <bch> <sil_txt> \| bch_sp1 bch_sp2 bch_sp3 bch_sp4 bch_sp5
	```
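
All four chunk variants above follow the same 13-token layout, with silence tokens filling whichever stream is inactive. A minimal sketch of a chunk builder follows; padding a partially filled stream with silence tokens is an assumption, and the helper names are hypothetical.

```python
from typing import List, Sequence

CHUNK, SIL_SP, SIL_TXT, BCH = 53248, 53255, 53256, 53257
SPEECH_PER_CHUNK = 5   # speech tokens per 200ms, per speaker
TEXT_PER_CHUNK = 2     # agent text tokens per chunk


def _pad(tokens: Sequence[int], size: int, fill: int) -> List[int]:
    """Right-pad a stream to its fixed per-chunk size."""
    if len(tokens) > size:
        raise ValueError(f"expected at most {size} tokens, got {len(tokens)}")
    return list(tokens) + [fill] * (size - len(tokens))


def build_chunk(user_sp: Sequence[int] = (),
                agent_txt: Sequence[int] = (),
                agent_sp: Sequence[int] = ()) -> List[int]:
    """[CHUNK] + 5 user speech + 2 agent text + 5 agent speech = 13 tokens."""
    return (
        [CHUNK]
        + _pad(user_sp, SPEECH_PER_CHUNK, SIL_SP)
        + _pad(agent_txt, TEXT_PER_CHUNK, SIL_TXT)
        + _pad(agent_sp, SPEECH_PER_CHUNK, SIL_SP)
    )
```

With this helper, an agent-listening chunk is `build_chunk(user_sp=usr)` and a backchannel chunk is `build_chunk(user_sp=usr, agent_txt=[BCH], agent_sp=bch_sp)`.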

	### Context window math:
	- 8192 tokens / 13 tokens per chunk = 630 full chunks (rounded down)
	- 630 chunks x 200ms = 126 seconds of conversation before the context window is full
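
The same arithmetic, written out so it can be rerun for other chunk layouts or context lengths:

```python
CONTEXT_TOKENS = 8192
TOKENS_PER_CHUNK = 13   # [CHUNK] + 5 user speech + 2 agent text + 5 agent speech
CHUNK_MS = 200

max_chunks = CONTEXT_TOKENS // TOKENS_PER_CHUNK             # whole chunks only
max_seconds = max_chunks * CHUNK_MS / 1000
unused_tokens = CONTEXT_TOKENS - max_chunks * TOKENS_PER_CHUNK

print(max_chunks, max_seconds, unused_tokens)  # 630 126.0 2
```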

	## 6. TRAINING STAGES

	### Stage 1: Modality Alignment

	\| Setting \| Value \|
	\|---------\|-------\|
	\| Goal \| Teach speech <-> text mapping \|
	\| Tasks \| ASR + TTS (50/50 mix) \|
	\| Data \| LibriSpeech 960h (free, HuggingFace) \|
	\| Format \| `[ASR] speech -> text` and `[TTS] text -> speech` \|
	\| Init \| SmolLM2-135M pretrained (expanded vocab) \|
	\| LR \| 5e-5, cosine schedule, 500 warmup steps \|
	\| Batch \| 32 (x2 grad accum = 64 effective) \|
	\| Seq len \| 1024 \|
	\| Epochs \| 3 \|
	\| Loss mask \| Target tokens only (text for ASR, speech for TTS) \|
	\| Duration \| ~2h on A10G \|
	\| Success \| Model can do basic ASR + TTS \|
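
The "target tokens only" loss mask can be expressed by setting the labels of all non-target positions to -100, the default `ignore_index` of PyTorch's cross-entropy loss. This is a sketch under two assumptions: the target span starts right after its opening marker, and the closing marker is kept in the loss so the model learns when to stop. The function name is hypothetical.

```python
from typing import List

IGNORE_INDEX = -100          # positions with this label do not contribute to the loss
SOS, SOT = 53251, 53253      # start-of-speech / start-of-text markers (Section 4)


def mask_to_target(seq: List[int], target_start_token: int) -> List[int]:
    """Keep labels only for tokens after the first target_start_token."""
    start = seq.index(target_start_token) + 1
    return [IGNORE_INDEX] * start + seq[start:]


# ASR: supervise the text span, so the target starts after [SOT].
asr = [53249, 53251, 49152, 49153, 53252, 53253, 10, 11, 53254]
asr_labels = mask_to_target(asr, SOT)

# TTS: supervise the speech span, so the target starts after [SOS].
tts = [53250, 53253, 10, 11, 53254, 53251, 49152, 49153, 53252]
tts_labels = mask_to_target(tts, SOS)
```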

	### Stage 2: Half-Duplex Dialogue

	\| Setting \| Value \|
	\|---------\|-------\|
	\| Goal \| Learn turn-based conversation \|
	\| Tasks \| User speaks -> Agent thinks (text) -> Agent speaks \|
	\| Data \| 10K synthetic conversations (LLM-generated text + CosyVoice TTS) \|
	\| Format \| Sequential: user_speech -> agent_text -> agent_speech -> ... \|
	\| Init \| Stage 1 checkpoint \|
	\| LR \| 2e-5, cosine, 200 warmup \|
	\| Batch \| 16 (x4 grad accum = 64 effective) \|
	\| Seq len \| 4096 \|
	\| Epochs \| 5 \|
	\| Loss mask \| Agent tokens only (not user speech) \|
	\| Duration \| ~3h on A10G \|
	\| Success \| Coherent responses to spoken queries \|
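
Each stage uses a cosine learning-rate schedule with warmup. The sketch below is one common form of that schedule; the linear warmup shape, the decay to zero, and the `total_steps` value in the example are assumptions, since the tables give only the peak rate and warmup length.

```python
import math


def lr_at_step(step: int, peak_lr: float, warmup_steps: int,
               total_steps: int, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Stage 2 values: peak 2e-5, 200 warmup steps; total_steps=1000 is illustrative only.
schedule = [lr_at_step(s, 2e-5, 200, 1000) for s in range(1001)]
```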

	### Stage 3: Full-Duplex Interaction

	\| Setting \| Value \|
	\|---------\|-------\|
	\| Goal \| Simultaneous listen+speak, 200ms turn-taking \|
	\| Tasks \| Turn-taking, backchanneling, interruption handling \|
	\| Data \| 15K conversations chunked at 200ms (augmented from Stage 2 + fresh) \|
	\| Format \| Flattened chunks: `[CHUNK] user_sp5 agent_txt2 agent_sp5` (5 user speech, 2 agent text, 5 agent speech tokens) \|
	\| Sub-stage 3a \| Three-stream (user_sp + agent_txt + agent_sp) — 6 epochs \|
	\| Sub-stage 3b \| Two-stream (user_sp + agent_sp, no text) — 4 epochs \|
	\| Init \| Stage 2 checkpoint \|
	\| LR \| 1e-5, cosine, 100 warmup \|
	\| Batch \| 8 (x8 grad accum = 64 effective) \|
	\| Seq len \| 8192 \|
	\| Epochs \| 10 total \|
	\| Loss mask \| Agent tokens only \|
	\| Duration \| ~8h on A10G \|
	\| Success \| <400ms turn-taking, natural backchannels, handles interrupts \|
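
For the flattened three-stream format, the agent-only loss mask follows directly from the position of each token inside its 13-token chunk. The sketch below covers only the three-stream layout of sub-stage 3a and assumes that `[CHUNK]` and the user speech tokens are never supervised, since they are supplied as input at inference time. The function name is hypothetical.

```python
from typing import List

IGNORE_INDEX = -100
TOKENS_PER_CHUNK = 13
AGENT_START = 6   # index 0 = [CHUNK], 1-5 = user speech, 6-7 = agent text, 8-12 = agent speech


def mask_full_duplex(tokens: List[int]) -> List[int]:
    """Keep labels for agent text/speech positions, ignore everything else."""
    if len(tokens) % TOKENS_PER_CHUNK != 0:
        raise ValueError("sequence length must be a multiple of 13")
    return [
        tok if i % TOKENS_PER_CHUNK >= AGENT_START else IGNORE_INDEX
        for i, tok in enumerate(tokens)
    ]
```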

	## 7. DATA GENERATION

	### Stage 1 (download only):
	```python
	# LibriSpeech: free, available on the Hugging Face Hub
	from datasets import load_dataset

	ds = load_dataset("openslr/librispeech_asr", "all")
	# Then tokenize audio with the CosyVoice encoder (~10h on A10G)
	```

	### Stage 2 (synthetic generation):
	```python
	# Step 1: Generate 10K text dialogues with any LLM
	# Step 2: Synthesize each turn with CosyVoice TTS
	# Step 3: Tokenize all audio with CosyVoice encoder
	# Step 4: Format as half-duplex sequences
	# Time: ~10h total, Cost: ~$11 (2h text generation + 8h TTS, per Section 9)
	```

	### Stage 3 (augmentation):
	```python
	# Step 1: Take Stage 2 dialogues
	# Step 2: Chunk into 200ms segments
	# Step 3: Inject backchannels (p=0.10 per chunk during user speech)
	# Step 4: Inject interruptions (p=0.05 per agent turn)
	# Step 5: Add natural pauses (1-4 chunks between turns)
	# Step 6: Generate 5K fresh conversations for diversity
	# Step 7: Flatten into [CHUNK] user_sp5 agent_txt2 agent_sp5 format
	# Time: ~5h for fresh generation, ~1h for augmentation
	```
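
Steps 3 and 5 above are random decisions made per chunk or per turn. The following sketch shows them on a simplified representation in which each chunk is a state label (`"user"`, `"agent"`, `"pause"`, `"backchannel"`) rather than real tokens. The labels, function names, and the single-chunk backchannel are illustrative assumptions.

```python
import random
from typing import List


def inject_backchannels(chunk_states: List[str], p: float = 0.10,
                        seed: int = 0) -> List[str]:
    """Turn each user-speaking chunk into a backchannel chunk with probability p."""
    rng = random.Random(seed)
    out = []
    for state in chunk_states:
        if state == "user" and rng.random() < p:
            out.append("backchannel")
        else:
            out.append(state)
    return out


def pause_chunks(rng: random.Random, low: int = 1, high: int = 4) -> List[str]:
    """Natural pause between turns: 1-4 silent chunks (200-800ms)."""
    return ["pause"] * rng.randint(low, high)


states = ["user"] * 1000 + ["agent"] * 1000
augmented = inject_backchannels(states, p=0.10, seed=0)
```

A fixed seed keeps the augmentation reproducible across runs.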

	## 8. INFERENCE

	### Latency Budget (per 200ms chunk):
	```
	Audio capture: ~1ms
	CosyVoice encode: ~10ms
	LLM forward (7 tok): ~20-30ms (RTX 3060)
	CosyVoice decode: ~15ms
	Audio playback: ~1ms
	─────────────────────────────
	TOTAL: ~50-60ms ✓ (within 200ms budget)
	```

	### Memory:
	```
	Model (bf16): 270 MB
	KV cache (8192 ctx): ~190 MB (30 layers x 3 KV heads x 64 head dim, K+V, bf16)
	CosyVoice enc+dec: ~300 MB
	─────────────────────────────
	TOTAL VRAM: ~0.8 GB plus runtime overhead (fits on any modern GPU)
	```
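
The KV-cache line can be sanity-checked from the Section 3 backbone values. This is a minimal sketch, assuming a single sequence, a bf16 cache, and a head dimension of 576 / 9 = 64. It counts only the raw key and value tensors, not activations or allocator overhead.

```python
NUM_LAYERS = 30
NUM_KV_HEADS = 3           # GQA: 9 query heads share 3 key/value heads
HEAD_DIM = 576 // 9        # hidden size / attention heads = 64
CONTEXT_TOKENS = 8192
BYTES_PER_VALUE = 2        # bf16

# Factor 2 accounts for storing both keys and values.
bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
kv_cache_bytes = bytes_per_token * CONTEXT_TOKENS

print(bytes_per_token)                 # 23040
print(round(kv_cache_bytes / 1e6))     # 189 (MB)
```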

	### Realtime loop:
	```python
	while True:
	    user_audio = capture_200ms()                      # 200ms mic input
	    user_tokens = cosyvoice_encode(user_audio)        # -> 5 tokens
	    context.extend([CHUNK] + user_tokens)             # append to context
	    agent_tokens = llm.generate(context, max_new=7)   # predict 7 tokens (2 text + 5 speech)
	    context.extend(agent_tokens)                      # append generated
	    agent_sp = agent_tokens[2:]                       # last 5 = speech
	    if not all_silent(agent_sp):
	        play(cosyvoice_decode(agent_sp))              # output audio
	```
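
The loop above grows the context by 13 tokens per chunk, so it reaches the 8192-token window after 630 chunks. The sketch below simulates one loop iteration with a stub model and adds a simple sliding window that drops the oldest whole chunks. The trimming policy, the stub, and all function names are assumptions, not part of the described system; a real implementation would also have to rebuild or shift the KV cache after trimming.

```python
from typing import Callable, List, Tuple

CHUNK, SIL_SP, SIL_TXT = 53248, 53255, 53256
TOKENS_PER_CHUNK = 13
MAX_CONTEXT = 8192


def trim_to_window(context: List[int], max_tokens: int = MAX_CONTEXT) -> List[int]:
    """Drop the oldest whole chunks so one more full chunk still fits."""
    keep = (max_tokens // TOKENS_PER_CHUNK - 1) * TOKENS_PER_CHUNK
    return context[-keep:] if len(context) > keep else context


def step(context: List[int], user_tokens: List[int],
         generate: Callable[[List[int], int], List[int]]) -> Tuple[List[int], List[int]]:
    """One 200ms iteration: append user input, generate 7 agent tokens."""
    context = trim_to_window(context)
    context = context + [CHUNK] + user_tokens
    agent_tokens = generate(context, 7)      # 2 text + 5 speech tokens
    context = context + agent_tokens
    return context, agent_tokens[2:]         # last 5 = agent speech


def silent_model(context: List[int], max_new: int) -> List[int]:
    """Stub standing in for the LLM: the agent always stays silent."""
    return [SIL_TXT] * 2 + [SIL_SP] * (max_new - 2)


context: List[int] = []
for _ in range(700):                         # longer than the 630-chunk window
    context, agent_sp = step(context, [49152] * 5, silent_model)
```

Trimming at chunk boundaries keeps every retained chunk intact, so the context always starts with a `[CHUNK]` token.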

	## 9. COSTS

	\| Item \| GPU Hours \| Cost (A10G) \|
	\|------\|-----------\|-------------\|
	\| Tokenize LibriSpeech \| 10h \| $11 \|
	\| Generate Stage 2 text \| 2h \| $2 \|
	\| CosyVoice TTS (Stage 2) \| 8h \| $9 \|
	\| CosyVoice TTS (Stage 3 extra) \| 5h \| $6 \|
	\| Train Stage 1 \| 2h \| $2 \|
	\| Train Stage 2 \| 3h \| $3 \|
	\| Train Stage 3 \| 8h \| $9 \|
	\| TOTAL \| ~38h \| ~$44 \|

	On an owned RTX 4090, the only cost is electricity, roughly $2.

	## 10. EVALUATION METRICS

	\| Metric \| Target \| Stage \|
	\|--------\|--------\|-------\|
	\| ASR WER (LibriSpeech dev) \| <30% \| 1 \|
	\| TTS intelligibility \| >70% words correct \| 1 \|
	\| Response relevance (GPT-4 judge) \| >3.0/5.0 \| 2 \|
	\| Turn-taking latency \| <400ms \| 3 \|
	\| Backchannel F1 \| >0.5 \| 3 \|
	\| Interruption yield rate \| >80% within 400ms \| 3 \|
	\| FD-Bench score \| Report (no target) \| 3 \|
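
The ASR target above is stated as word error rate (WER): the word-level edit distance between reference and hypothesis, divided by the number of reference words. A minimal sketch follows, assuming whitespace tokenization and no text normalization. A real evaluation would normally lowercase and strip punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
# One substitution (sat -> sit) and one deletion (the): 2 / 6
```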

	## 11. REFERENCES

	- [OmniFlatten (2410.17799)](https://arxiv.org/abs/2410.17799) — Core arch, proven at 500M
	- [SyncLLM (2409.15594)](https://arxiv.org/abs/2409.15594) — Time-sync mechanism
	- [Chronological Thinking (2510.05150)](https://arxiv.org/abs/2510.05150) — Think-while-listen, 1.5B
	- [Full-Duplex-Bench (2503.04721)](https://arxiv.org/abs/2503.04721) — Evaluation standard
	- [Sommelier (2603.25750)](https://arxiv.org/abs/2603.25750) — Data processing pipeline
	- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) — Speech tokenizer/detokenizer
	- [SmolLM2 (2502.02737)](https://arxiv.org/abs/2502.02737) — LLM backbone