PranavHarshan

Add complete architecture document and PRD

cca4a5e verified 15 days ago

SmolDuplex: Complete Architecture & PRD

1. SYSTEM OVERVIEW

SmolDuplex is a 139M-parameter full-duplex spoken interaction model. It simultaneously listens and speaks with 200ms turn-taking granularity.

Metric	Target
Trainable Parameters	~139M
Turn-taking latency	<400ms
Chunk size	200ms
Simultaneous listen+speak	Yes
Hardware (inference)	Any 8GB+ GPU
Training cost	~$44 cloud
Training time	~38 GPU-hours

2. ARCHITECTURE DIAGRAM

USER AUDIO IN (mic)                              AGENT AUDIO OUT (speaker)
     |                                                    ^
     v                                                    |
+------------+    5 tokens/200ms    +-----------+    5 tokens/200ms    +------------+
| CosyVoice  |-------------------->| SmolLM2   |--------------------->| CosyVoice  |
| Encoder    |                     | 135M      |                      | Decoder    |
| (frozen)   |                     | (trained) |                      | (frozen)   |
| ~70M       |                     | ~139M     |                      | ~80M       |
+------------+                     +-----------+                      +------------+
                                        |
                                   Standard causal
                                   next-token prediction

3. LLM BACKBONE

Base: HuggingFaceTB/SmolLM2-135M

Parameter	Value
Architecture	LlamaForCausalLM
Layers	30
Hidden size	576
Attention heads	9 (GQA, 3 KV heads)
FFN intermediate	1536
Activation	SiLU
Context window	8192 tokens
Position encoding	RoPE (theta=100000)
Original vocab	49,152
Expanded vocab	53,258 (+4096 speech + 10 special)
Final trainable params	~139M
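The parameter budget in the table can be sanity-checked from the listed dimensions. The sketch below is a back-of-envelope count for a Llama-style decoder; bias-free projections and tied input/output embeddings are assumptions about the checkpoint, not facts stated in this document.

```python
# Llama-style decoder parameter count; bias-free linear layers are assumed.
HIDDEN, LAYERS, HEADS, KV_HEADS, FFN = 576, 30, 9, 3, 1536
ORIG_VOCAB = 49_152
NEW_TOKENS = 4_096 + 10  # speech codes + special tokens


def decoder_params(vocab_size: int, tied_embeddings: bool = True) -> int:
    """Count weights for the given vocabulary size."""
    head_dim = HIDDEN // HEADS                    # 64
    attn = 2 * HIDDEN * HIDDEN                    # q_proj and o_proj
    attn += 2 * HIDDEN * KV_HEADS * head_dim      # k_proj and v_proj (GQA)
    mlp = 3 * HIDDEN * FFN                        # gate, up and down projections
    norms = 2 * HIDDEN                            # two RMSNorm weights per layer
    embed = vocab_size * HIDDEN
    if not tied_embeddings:
        embed *= 2                                # separate output head
    # layers + final norm + embeddings
    return LAYERS * (attn + mlp + norms) + HIDDEN + embed


base = decoder_params(ORIG_VOCAB)
expanded = decoder_params(ORIG_VOCAB + NEW_TOKENS)
print(f"base:     {base / 1e6:.1f}M")
print(f"expanded: {expanded / 1e6:.1f}M (tied embeddings)")
```

Under these assumptions the base model comes to about 134.5M parameters and the expanded one to about 136.9M. If the 4,106 new rows are also counted once more in a separate output head, the total is about 139.2M, which is in line with the ~139M quoted in the table.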

4. VOCABULARY (53,258 tokens)

Tokens 0-49151:     BPE text tokens (original SmolLM2)
Tokens 49152-53247: CosyVoice speech codes (4096 codebook)
Token 53248:        [CHUNK]    — 200ms chunk boundary
Token 53249:        [ASR]      — ASR task prefix
Token 53250:        [TTS]      — TTS task prefix
Token 53251:        [SOS]      — Start of speech
Token 53252:        [EOS]      — End of speech
Token 53253:        [SOT]      — Start of text
Token 53254:        [EOT]      — End of text
Token 53255:        <sil_sp>   — Silent speech (agent listening)
Token 53256:        <sil_txt>  — No text this chunk
Token 53257:        <bch>      — Backchannel trigger
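The layout above can be expressed as a few offsets. This is a minimal sketch; the constant and function names (`SPEECH_OFFSET`, `speech_code_to_id`, `token_kind`, and so on) are illustrative and not taken from the SmolDuplex code base.

```python
TEXT_VOCAB = 49_152        # original SmolLM2 BPE tokens: ids 0..49151
SPEECH_CODEBOOK = 4_096    # CosyVoice codes: ids 49152..53247
SPEECH_OFFSET = TEXT_VOCAB
SPECIAL_OFFSET = SPEECH_OFFSET + SPEECH_CODEBOOK  # 53248

# Order matches the listing above.
SPECIAL_TOKENS = ["[CHUNK]", "[ASR]", "[TTS]", "[SOS]", "[EOS]",
                  "[SOT]", "[EOT]", "<sil_sp>", "<sil_txt>", "<bch>"]
SPECIAL_IDS = {name: SPECIAL_OFFSET + i for i, name in enumerate(SPECIAL_TOKENS)}
VOCAB_SIZE = SPECIAL_OFFSET + len(SPECIAL_TOKENS)


def speech_code_to_id(code: int) -> int:
    """Map a CosyVoice codebook index (0..4095) to its LLM token id."""
    if not 0 <= code < SPEECH_CODEBOOK:
        raise ValueError(f"speech code out of range: {code}")
    return SPEECH_OFFSET + code


def id_to_speech_code(token_id: int) -> int:
    """Inverse of speech_code_to_id; rejects text and special tokens."""
    code = token_id - SPEECH_OFFSET
    if not 0 <= code < SPEECH_CODEBOOK:
        raise ValueError(f"not a speech token: {token_id}")
    return code


def token_kind(token_id: int) -> str:
    """Classify a token id as 'text', 'speech' or 'special'."""
    if 0 <= token_id < SPEECH_OFFSET:
        return "text"
    if SPEECH_OFFSET <= token_id < SPECIAL_OFFSET:
        return "speech"
    if SPECIAL_OFFSET <= token_id < VOCAB_SIZE:
        return "special"
    raise ValueError(f"token id out of range: {token_id}")
```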

5. TOKEN SEQUENCE FORMATS

Stage 1 — ASR:

[ASR] [SOS] sp1 sp2 sp3 ... spN [EOS] [SOT] txt1 txt2 ... txtM [EOT]

Stage 1 — TTS:

[TTS] [SOT] txt1 txt2 ... txtM [EOT] [SOS] sp1 sp2 sp3 ... spN [EOS]

Stage 2 — Half-Duplex Dialogue:

[SOS] user_sp1..N [EOS] [SOT] agent_txt1..M [EOT] [SOS] agent_sp1..K [EOS] [SOS] user_sp... [EOS] ...
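The Stage 1 and Stage 2 layouts can be assembled from plain token lists. The sketch below uses the special-token ids from Section 4; the function names and the `(user_speech, agent_text, agent_speech)` turn tuple are assumptions made for illustration.

```python
# Special-token ids from Section 4 (53248..53254).
CHUNK, ASR, TTS, SOS, EOS, SOT, EOT = range(53248, 53255)


def asr_sequence(speech, text):
    """[ASR] [SOS] speech [EOS] [SOT] text [EOT]"""
    return [ASR, SOS, *speech, EOS, SOT, *text, EOT]


def tts_sequence(text, speech):
    """[TTS] [SOT] text [EOT] [SOS] speech [EOS]"""
    return [TTS, SOT, *text, EOT, SOS, *speech, EOS]


def half_duplex_sequence(turns):
    """Concatenate turns of (user_speech, agent_text, agent_speech)."""
    seq = []
    for user_sp, agent_txt, agent_sp in turns:
        seq += [SOS, *user_sp, EOS]      # user speaks
        seq += [SOT, *agent_txt, EOT]    # agent "thinks" in text
        seq += [SOS, *agent_sp, EOS]     # agent speaks
    return seq
```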

Stage 3 — Full-Duplex (200ms chunks):

Per chunk (13 tokens):
[CHUNK] usr_sp1 usr_sp2 usr_sp3 usr_sp4 usr_sp5 | agt_txt1 agt_txt2 | agt_sp1 agt_sp2 agt_sp3 agt_sp4 agt_sp5

Agent speaking, user silent:
[CHUNK] <sil_sp> <sil_sp> <sil_sp> <sil_sp> <sil_sp> | txt1 txt2 | sp1 sp2 sp3 sp4 sp5

Agent listening, user speaking:
[CHUNK] usr1 usr2 usr3 usr4 usr5 | <sil_txt> <sil_txt> | <sil_sp> <sil_sp> <sil_sp> <sil_sp> <sil_sp>

Backchannel:
[CHUNK] usr1 usr2 usr3 usr4 usr5 | <bch> <sil_txt> | bch_sp1 bch_sp2 bch_sp3 bch_sp4 bch_sp5
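The chunk layouts above can be produced by one helper that pads each stream with its silence token. This is a sketch under two assumptions: the `|` separators in the listings are visual only (the 13-token count leaves no room for them), and a partially filled stream is padded on the right with silence. `build_chunk` is an illustrative name.

```python
CHUNK, SIL_SP, SIL_TXT, BCH = 53248, 53255, 53256, 53257
SPEECH_PER_CHUNK, TEXT_PER_CHUNK = 5, 2


def _pad(tokens, length, fill):
    """Right-pad a stream to its fixed per-chunk length."""
    tokens = list(tokens)
    if len(tokens) > length:
        raise ValueError(f"stream has {len(tokens)} tokens, limit is {length}")
    return tokens + [fill] * (length - len(tokens))


def build_chunk(user_sp=(), agent_txt=(), agent_sp=()):
    """Return one 13-token chunk: [CHUNK] + 5 user speech + 2 agent text + 5 agent speech."""
    return ([CHUNK]
            + _pad(user_sp, SPEECH_PER_CHUNK, SIL_SP)
            + _pad(agent_txt, TEXT_PER_CHUNK, SIL_TXT)
            + _pad(agent_sp, SPEECH_PER_CHUNK, SIL_SP))


user = [49152, 49153, 49154, 49155, 49156]
listening = build_chunk(user_sp=user)
backchannel = build_chunk(user_sp=user, agent_txt=[BCH], agent_sp=[49200] * 5)
```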

Context window math:

8192 tokens / 13 per chunk = 630 chunks
630 x 200ms = 126 seconds max conversation
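The same arithmetic as a small helper. The 11-token two-stream chunk (`[CHUNK]` plus 5 user and 5 agent speech tokens, as in sub-stage 3b) is an assumption derived from the format description, not a figure stated in this document.

```python
CONTEXT_WINDOW = 8192   # tokens
CHUNK_MS = 200


def max_conversation(tokens_per_chunk: int):
    """Return (whole chunks that fit, seconds of audio they cover)."""
    chunks = CONTEXT_WINDOW // tokens_per_chunk
    return chunks, chunks * CHUNK_MS / 1000


three_stream = max_conversation(1 + 5 + 2 + 5)   # 13 tokens per chunk
two_stream = max_conversation(1 + 5 + 5)         # 11 tokens per chunk (assumed)
print(three_stream, two_stream)
```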

6. TRAINING STAGES

Stage 1: Modality Alignment

Setting	Value
Goal	Teach speech <-> text mapping
Tasks	ASR + TTS (50/50 mix)
Data	LibriSpeech 960h (free, HuggingFace)
Format	`[ASR] speech -> text` and `[TTS] text -> speech`
Init	SmolLM2-135M pretrained (expanded vocab)
LR	5e-5, cosine schedule, 500 warmup steps
Batch	32 (x2 grad accum = 64 effective)
Seq len	1024
Epochs	3
Loss mask	Target tokens only (text for ASR, speech for TTS)
Duration	~2h on A10G
Success	Model can do basic ASR + TTS
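The "target tokens only" loss mask can be sketched as a label list in which every prompt position is set to the ignore index. `-100` is the default `ignore_index` of PyTorch's cross-entropy loss; supervising the closing `[EOT]` / `[EOS]` token as part of the target is an assumption.

```python
IGNORE_INDEX = -100  # positions with this label contribute no loss
ASR, TTS, SOS, EOS, SOT, EOT = 53249, 53250, 53251, 53252, 53253, 53254


def stage1_labels(sequence):
    """Mask everything up to and including the marker that opens the target span."""
    task = sequence[0]
    if task == ASR:
        start_marker = SOT      # target = text after [SOT]
    elif task == TTS:
        start_marker = SOS      # target = speech after [SOS]
    else:
        raise ValueError("sequence must start with [ASR] or [TTS]")
    start = sequence.index(start_marker) + 1
    return [IGNORE_INDEX] * start + list(sequence[start:])


asr_example = [ASR, SOS, 49152, 49153, EOS, SOT, 10, 11, EOT]
asr_labels = stage1_labels(asr_example)
```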

Stage 2: Half-Duplex Dialogue

Setting	Value
Goal	Learn turn-based conversation
Tasks	User speaks -> Agent thinks (text) -> Agent speaks
Data	10K synthetic conversations (LLM-generated text + CosyVoice TTS)
Format	Sequential: user_speech -> agent_text -> agent_speech -> ...
Init	Stage 1 checkpoint
LR	2e-5, cosine, 200 warmup
Batch	16 (x4 grad accum = 64 effective)
Seq len	4096
Epochs	5
Loss mask	Agent tokens only (not user speech)
Duration	~3h on A10G
Success	Coherent responses to spoken queries
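For the half-duplex format, "agent tokens only" means the user speech span is masked while the agent's text and speech spans are supervised. The sketch below walks the sequence with a small state machine; treating `[SOT]` as the first supervised token of each agent turn is an assumption.

```python
IGNORE_INDEX = -100
SOS, EOS, SOT, EOT = 53251, 53252, 53253, 53254


def stage2_labels(sequence):
    """Supervise from each [SOT] through the [EOS] that closes the agent's speech."""
    labels = []
    in_agent = False   # True between [SOT] and the agent's closing [EOS]
    seen_eot = False   # True once the agent's text span has ended
    for tok in sequence:
        if tok == SOT:
            in_agent, seen_eot = True, False
        labels.append(tok if in_agent else IGNORE_INDEX)
        if tok == EOT:
            seen_eot = True
        elif tok == EOS and in_agent and seen_eot:
            in_agent = False   # agent speech finished; the next span is the user
    return labels


dialogue = [SOS, 49152, 49153, EOS,          # user speech (masked)
            SOT, 10, EOT, SOS, 49160, EOS,   # agent text + speech (supervised)
            SOS, 49154, EOS]                 # next user turn (masked)
dialogue_labels = stage2_labels(dialogue)
```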

Stage 3: Full-Duplex Interaction

Setting	Value
Goal	Simultaneous listen+speak, 200ms turn-taking
Tasks	Turn-taking, backchanneling, interruption handling
Data	15K conversations chunked at 200ms (augmented from Stage 2 + fresh)
Format	Flattened chunks: `[CHUNK] user_sp5 agent_txt2 agent_sp5`
Sub-stage 3a	Three-stream (user_sp + agent_txt + agent_sp) — 6 epochs
Sub-stage 3b	Two-stream (user_sp + agent_sp, no text) — 4 epochs
Init	Stage 2 checkpoint
LR	1e-5, cosine, 100 warmup
Batch	8 (x8 grad accum = 64 effective)
Seq len	8192
Epochs	10 total
Loss mask	Agent tokens only
Duration	~8h on A10G
Success	<400ms turn-taking, natural backchannels, handles interrupts
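In the flattened chunk format the agent tokens sit at fixed positions, so the loss mask and the 3a to 3b conversion reduce to slicing. A minimal sketch, assuming `[CHUNK]` itself is not supervised and that sub-stage 3b simply drops the two text slots:

```python
IGNORE_INDEX = -100
CHUNK = 53248
CHUNK_LEN = 13      # [CHUNK] + 5 user speech + 2 agent text + 5 agent speech
AGENT_START = 6     # index of the first agent token inside a chunk
TEXT_SLOTS = 2


def _chunks(sequence):
    """Yield 13-token chunks, checking alignment and the leading [CHUNK]."""
    if len(sequence) % CHUNK_LEN:
        raise ValueError("sequence length is not a multiple of 13")
    for i in range(0, len(sequence), CHUNK_LEN):
        chunk = list(sequence[i:i + CHUNK_LEN])
        if chunk[0] != CHUNK:
            raise ValueError(f"chunk at offset {i} does not start with [CHUNK]")
        yield chunk


def stage3_labels(sequence):
    """Mask [CHUNK] and user speech; keep agent text and agent speech."""
    labels = []
    for chunk in _chunks(sequence):
        labels += [IGNORE_INDEX] * AGENT_START + chunk[AGENT_START:]
    return labels


def to_two_stream(sequence):
    """Drop the agent text slots, leaving 11 tokens per chunk."""
    out = []
    for chunk in _chunks(sequence):
        out += chunk[:AGENT_START] + chunk[AGENT_START + TEXT_SLOTS:]
    return out
```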

7. DATA GENERATION

Stage 1 (download only):

# LibriSpeech — free, ready
from datasets import load_dataset
ds = load_dataset("openslr/librispeech_asr", "all")
# Then tokenize audio with CosyVoice encoder (~10h on A10G)

Stage 2 (synthetic generation):

# Step 1: Generate 10K text dialogues with any LLM
# Step 2: Synthesize each turn with CosyVoice TTS
# Step 3: Tokenize all audio with CosyVoice encoder
# Step 4: Format as half-duplex sequences
# Time: ~10h total (2h text generation + 8h TTS), Cost: ~$11 per the Section 9 table

Stage 3 (augmentation):

# Step 1: Take Stage 2 dialogues
# Step 2: Chunk into 200ms segments
# Step 3: Inject backchannels (p=0.10 per chunk during user speech)
# Step 4: Inject interruptions (p=0.05 per agent turn)
# Step 5: Add natural pauses (1-4 chunks between turns)
# Step 6: Generate 5K fresh conversations for diversity
# Step 7: Flatten into [CHUNK] user_sp5 agent_txt2 agent_sp5 format
# Time: ~5h for fresh generation, ~1h for augmentation
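Steps 2, 3 and 5 can be sketched on token lists as follows. This is a simplified illustration, not the project's pipeline: it assumes a backchannel fits in a single chunk, pads short streams with silence, and leaves out interruption injection (step 4). All function names are hypothetical.

```python
import random

CHUNK, SIL_SP, SIL_TXT, BCH = 53248, 53255, 53256, 53257
SILENT_AGENT = [SIL_TXT, SIL_TXT] + [SIL_SP] * 5


def _pad5(tokens):
    """Truncate or right-pad a speech stream to 5 tokens."""
    tokens = list(tokens[:5])
    return tokens + [SIL_SP] * (5 - len(tokens))


def chunk_user_turn(user_speech, backchannel_speech, p_backchannel=0.10, rng=None):
    """Split a user turn into 13-token chunks, injecting backchannels at random."""
    if rng is None:
        rng = random.Random()
    chunks = []
    for i in range(0, len(user_speech), 5):
        user = _pad5(user_speech[i:i + 5])
        if rng.random() < p_backchannel:
            agent = [BCH, SIL_TXT] + _pad5(backchannel_speech)
        else:
            agent = list(SILENT_AGENT)
        chunks.append([CHUNK] + user + agent)
    return chunks


def pause_chunks(rng):
    """Return 1-4 fully silent chunks to place between turns."""
    n = rng.randint(1, 4)
    return [[CHUNK] + [SIL_SP] * 5 + list(SILENT_AGENT) for _ in range(n)]


def flatten(chunks):
    """Concatenate chunks into one training sequence."""
    return [tok for chunk in chunks for tok in chunk]
```

Passing a seeded `random.Random` keeps the augmentation reproducible across runs.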

8. INFERENCE

Latency Budget (per 200ms chunk):

Audio capture:        ~1ms
CosyVoice encode:     ~10ms
LLM forward (7 tok):  ~20-30ms (RTX 3060)
CosyVoice decode:     ~15ms
Audio playback:       ~1ms
─────────────────────────────
TOTAL:                ~50-60ms ✓ (within 200ms budget)

Memory:

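Model (bf16):         ~280 MB (139M params x 2 bytes)
KV cache (8192 ctx):  ~190 MB (bf16; 2 x 30 layers x 3 KV heads x 64 dims x 8192 tokens x 2 bytes)
CosyVoice enc+dec:    ~300 MB
─────────────────────────────
TOTAL VRAM:           ~0.8 GB plus activations and framework overhead (fits on any modern GPU)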
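A back-of-envelope check of the model and KV-cache lines, assuming batch size 1, bf16 values, and the GQA configuration from Section 3. It counts raw tensor bytes only, so real usage will be higher once activations and allocator overhead are included.

```python
def kv_cache_bytes(layers=30, kv_heads=3, head_dim=576 // 9,
                   context=8192, bytes_per_value=2):
    """Keys and values for every layer, KV head and position (batch size 1)."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_value


def weight_bytes(params=139_000_000, bytes_per_value=2):
    """Raw weight storage at the given precision."""
    return params * bytes_per_value


kv = kv_cache_bytes()
weights = weight_bytes()
print(f"KV cache: {kv / 1e6:.0f} MB, weights: {weights / 1e6:.0f} MB")
```

With only 3 KV heads, the cache is a third of what full multi-head attention (9 KV heads) would need at the same context length.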

Realtime loop:

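# capture_200ms, cosyvoice_encode, cosyvoice_decode, play and all_silent are placeholders
CHUNK_ID = 53248                                      # [CHUNK] token id (see Section 4)
MAX_HISTORY = 629 * 13                                # leaves room for one more 13-token chunk in 8192
context = []                                          # running token history

while True:
    user_audio = capture_200ms()                      # 200ms mic input
    user_tokens = cosyvoice_encode(user_audio)        # -> 5 speech tokens
    context.extend([CHUNK_ID] + user_tokens)          # append chunk marker + user speech
    agent_tokens = llm.generate(context, max_new=7)   # predict 2 text + 5 speech tokens
    context.extend(agent_tokens)                      # append generated tokens
    context = context[-MAX_HISTORY:]                  # drop oldest whole chunks when the window fills
    agent_sp = agent_tokens[2:]                       # last 5 = speech
    if not all_silent(agent_sp):
        play(cosyvoice_decode(agent_sp))              # output audio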
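The loop can be dry-run without audio hardware or a model by swapping in stubs. In the sketch below, `fake_encode` and `fake_generate` are hypothetical stand-ins for the CosyVoice encoder and the LLM, and trimming the history to 629 whole chunks is an assumption about how to stay inside the 8192-token window past the 126-second mark.

```python
CHUNK, SIL_SP, SIL_TXT = 53248, 53255, 53256
CHUNK_LEN = 13
HISTORY_TOKENS = 629 * CHUNK_LEN   # keep room for the chunk being built


def fake_encode(audio):
    """Stand-in for the speech encoder: always returns 5 speech tokens."""
    return [49152] * 5


def fake_generate(context, max_new):
    """Stand-in for the LLM: an agent that always stays silent."""
    return ([SIL_TXT] * 2 + [SIL_SP] * 5)[:max_new]


def all_silent(speech_tokens):
    return all(tok == SIL_SP for tok in speech_tokens)


def run(num_chunks, encode=fake_encode, generate=fake_generate):
    """Simulate num_chunks loop iterations; return (context, chunks played, peak length)."""
    context, played, peak = [], 0, 0
    for _ in range(num_chunks):
        context.extend([CHUNK] + encode(None))   # user side of the chunk
        agent = generate(context, 7)             # 2 text + 5 speech tokens
        context.extend(agent)
        peak = max(peak, len(context))
        context = context[-HISTORY_TOKENS:]      # drop oldest whole chunks
        if not all_silent(agent[2:]):
            played += 1                          # audio would be decoded and played here
    return context, played, peak


context, played, peak = run(700)   # 700 chunks = 140 s, longer than one window
```

Because every iteration appends exactly 13 tokens, the trimmed history always starts on a `[CHUNK]` boundary.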

9. COSTS

Item	GPU Hours	Cost (A10G)
Tokenize LibriSpeech	10h	$11
Generate Stage 2 text	2h	$2
CosyVoice TTS (Stage 2)	8h	$9
CosyVoice TTS (Stage 3 extra)	5h	$6
Train Stage 1	2h	$2
Train Stage 2	3h	$3
Train Stage 3	8h	$9
TOTAL	~38h	~$44

On an RTX 4090 you already own: roughly $2 in electricity.

10. EVALUATION METRICS

Metric	Target	Stage
ASR WER (LibriSpeech dev)	<30%	1
TTS intelligibility	>70% words correct	1
Response relevance (GPT-4 judge)	>3.0/5.0	2
Turn-taking latency	<400ms	3
Backchannel F1	>0.5	3
Interruption yield rate	>80% within 400ms	3
FD-Bench score	Report (no target)	3
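For the Stage 1 ASR target, word error rate is the word-level edit distance divided by the reference length. A dependency-free sketch follows; whitespace tokenization with no text normalization is a simplifying assumption, and a real evaluation would normally normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,                            # deletion
                           cur[j - 1] + 1,                         # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


wer = word_error_rate("a b c d", "a x c")   # one substitution + one deletion
meets_target = wer < 0.30                   # Stage 1 target: WER < 30%
```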

11. REFERENCES

OmniFlatten (2410.17799) — Core arch, proven at 500M
SyncLLM (2409.15594) — Time-sync mechanism
Chronological Thinking (2510.05150) — Think-while-listen, 1.5B
Full-Duplex-Bench (2503.04721) — Evaluation standard
Sommelier (2603.25750) — Data processing pipeline
CosyVoice — Speech tokenizer/detokenizer
SmolLM2 (2502.02737) — LLM backbone