
SmolDuplex: Complete Architecture & PRD

1. SYSTEM OVERVIEW

SmolDuplex is a 139M-parameter full-duplex spoken interaction model. It simultaneously listens and speaks with 200ms turn-taking granularity.

Metric                      Target
Trainable parameters        ~139M
Turn-taking latency         <400ms
Chunk size                  200ms
Simultaneous listen+speak   Yes
Hardware (inference)        Any 8GB+ GPU
Training cost               ~$44 cloud
Training time               ~38 GPU-hours

2. ARCHITECTURE DIAGRAM

USER AUDIO IN (mic)                              AGENT AUDIO OUT (speaker)
     |                                                    ^
     v                                                    |
+------------+    5 tokens/200ms    +-----------+    5 tokens/200ms    +------------+
| CosyVoice  |-------------------->| SmolLM2   |--------------------->| CosyVoice  |
| Encoder    |                     | 135M      |                      | Decoder    |
| (frozen)   |                     | (trained) |                      | (frozen)   |
| ~70M       |                     | ~139M     |                      | ~80M       |
+------------+                     +-----------+                      +------------+
                                        |
                                   Standard causal
                                   next-token prediction

3. LLM BACKBONE

Base: HuggingFaceTB/SmolLM2-135M

Parameter                Value
Architecture             LlamaForCausalLM
Layers                   30
Hidden size              576
Attention heads          9 (GQA, 3 KV heads)
FFN intermediate         1536
Activation               SiLU
Context window           8192 tokens
Position encoding        RoPE (theta=100000)
Original vocab           49,152
Expanded vocab           53,258 (+4096 speech + 10 special)
Final trainable params   ~139M
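
As a rough sanity check on the parameter counts, the sketch below recomputes them from the table values. It assumes a standard bias-free Llama layout with tied input/output embeddings; that assumption and the helper name `llama_param_count` are illustrative and not taken from this document.

```python
def llama_param_count(vocab, hidden=576, layers=30, heads=9, kv_heads=3, ffn=1536):
    """Estimate parameters for a bias-free Llama-style model (tied embeddings assumed)."""
    head_dim = hidden // heads                          # 576 / 9 = 64
    kv_dim = kv_heads * head_dim                        # 3 * 64 = 192 under GQA
    attn = 2 * hidden * hidden + 2 * hidden * kv_dim    # q and o, plus k and v projections
    mlp = 3 * hidden * ffn                              # gate, up and down projections
    norms = 2 * hidden                                  # two RMSNorm weights per layer
    embed = vocab * hidden                              # shared with the LM head (assumption)
    return embed + layers * (attn + mlp + norms) + hidden  # last term: final RMSNorm

base = llama_param_count(49_152)
expanded = llama_param_count(49_152 + 4_096 + 10)
print(f"base: {base / 1e6:.1f}M, expanded: {expanded / 1e6:.1f}M")
```

Under these assumptions the expanded model lands near 137M, slightly below the ~139M quoted above; untied embeddings or any extra trainable layers would change the total.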

4. VOCABULARY (53,258 tokens)

Tokens 0-49151:     BPE text tokens (original SmolLM2)
Tokens 49152-53247: CosyVoice speech codes (4096 codebook)
Token 53248:        [CHUNK]    - 200ms chunk boundary
Token 53249:        [ASR]      - ASR task prefix
Token 53250:        [TTS]      - TTS task prefix
Token 53251:        [SOS]      - Start of speech
Token 53252:        [EOS]      - End of speech
Token 53253:        [SOT]      - Start of text
Token 53254:        [EOT]      - End of text
Token 53255:        <sil_sp>   - Silent speech (agent listening)
Token 53256:        <sil_txt>  - No text this chunk
Token 53257:        <bch>      - Backchannel trigger
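
The ID layout above can be expressed as a few constants and two mapping helpers. This is a minimal sketch; the names (`SPECIAL_IDS`, `speech_code_to_id`, `id_to_speech_code`) are hypothetical and only the numeric ranges come from the table.

```python
TEXT_VOCAB = 49_152            # original SmolLM2 BPE tokens: ids 0..49151
SPEECH_CODES = 4_096           # CosyVoice codebook size
SPEECH_OFFSET = TEXT_VOCAB     # speech code c maps to token id 49152 + c

SPECIAL_TOKENS = ["[CHUNK]", "[ASR]", "[TTS]", "[SOS]", "[EOS]",
                  "[SOT]", "[EOT]", "<sil_sp>", "<sil_txt>", "<bch>"]
SPECIAL_IDS = {tok: SPEECH_OFFSET + SPEECH_CODES + i
               for i, tok in enumerate(SPECIAL_TOKENS)}
VOCAB_SIZE = SPEECH_OFFSET + SPEECH_CODES + len(SPECIAL_TOKENS)

def speech_code_to_id(code: int) -> int:
    """Map a CosyVoice codebook index (0..4095) to its LLM token id."""
    if not 0 <= code < SPEECH_CODES:
        raise ValueError(f"speech code out of range: {code}")
    return SPEECH_OFFSET + code

def id_to_speech_code(token_id: int) -> int:
    """Inverse mapping; rejects text and special token ids."""
    code = token_id - SPEECH_OFFSET
    if not 0 <= code < SPEECH_CODES:
        raise ValueError(f"not a speech token id: {token_id}")
    return code
```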

5. TOKEN SEQUENCE FORMATS

Stage 1 - ASR:

[ASR] [SOS] sp1 sp2 sp3 ... spN [EOS] [SOT] txt1 txt2 ... txtM [EOT]

Stage 1 - TTS:

[TTS] [SOT] txt1 txt2 ... txtM [EOT] [SOS] sp1 sp2 sp3 ... spN [EOS]
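
A minimal sketch of how the two Stage 1 layouts might be assembled from already-tokenized inputs. The function names are hypothetical, `speech_codes` are assumed to be raw CosyVoice codebook indices, and `text_ids` are assumed to be SmolLM2 BPE ids.

```python
ASR, TTS = 53_249, 53_250
SOS, EOS, SOT, EOT = 53_251, 53_252, 53_253, 53_254
SPEECH_OFFSET = 49_152  # speech code c maps to token id 49152 + c

def build_asr_sequence(speech_codes, text_ids):
    """[ASR] [SOS] speech... [EOS] [SOT] text... [EOT]"""
    speech = [SPEECH_OFFSET + c for c in speech_codes]
    return [ASR, SOS, *speech, EOS, SOT, *text_ids, EOT]

def build_tts_sequence(text_ids, speech_codes):
    """[TTS] [SOT] text... [EOT] [SOS] speech... [EOS]"""
    speech = [SPEECH_OFFSET + c for c in speech_codes]
    return [TTS, SOT, *text_ids, EOT, SOS, *speech, EOS]
```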

Stage 2 - Half-Duplex Dialogue:

[SOS] user_sp1..N [EOS] [SOT] agent_txt1..M [EOT] [SOS] agent_sp1..K [EOS] [SOS] user_sp... [EOS] ...

Stage 3 - Full-Duplex (200ms chunks):

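Per chunk (13 tokens: 1 [CHUNK] + 5 user speech + 2 agent text + 5 agent speech; the | separators are visual only, not tokens):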
[CHUNK] usr_sp1 usr_sp2 usr_sp3 usr_sp4 usr_sp5 | agt_txt1 agt_txt2 | agt_sp1 agt_sp2 agt_sp3 agt_sp4 agt_sp5

Agent speaking, user silent:
[CHUNK] <sil_sp> <sil_sp> <sil_sp> <sil_sp> <sil_sp> | txt1 txt2 | sp1 sp2 sp3 sp4 sp5

Agent listening, user speaking:
[CHUNK] usr1 usr2 usr3 usr4 usr5 | <sil_txt> <sil_txt> | <sil_sp> <sil_sp> <sil_sp> <sil_sp> <sil_sp>

Backchannel:
[CHUNK] usr1 usr2 usr3 usr4 usr5 | <bch> <sil_txt> | bch_sp1 bch_sp2 bch_sp3 bch_sp4 bch_sp5
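
The chunk layouts above can be produced by one helper that fills silent streams with the silence tokens. This is a sketch under the assumption that short text is right-padded with `<sil_txt>`, as in the backchannel row; `build_chunk` is a hypothetical name.

```python
CHUNK, SIL_SP, SIL_TXT, BCH = 53_248, 53_255, 53_256, 53_257
SPEECH_OFFSET = 49_152
USER_SP, AGENT_TXT, AGENT_SP = 5, 2, 5
CHUNK_LEN = 1 + USER_SP + AGENT_TXT + AGENT_SP  # 13 tokens

def build_chunk(user_codes=None, agent_text=(), agent_codes=None):
    """Build one 13-token chunk; None means that speech stream is silent."""
    if user_codes is None:
        user = [SIL_SP] * USER_SP
    else:
        user = [SPEECH_OFFSET + c for c in user_codes]
    if agent_codes is None:
        agent = [SIL_SP] * AGENT_SP
    else:
        agent = [SPEECH_OFFSET + c for c in agent_codes]
    # agent_text holds token ids (BPE ids or BCH); pad or cut to exactly 2 slots
    text = (list(agent_text) + [SIL_TXT] * AGENT_TXT)[:AGENT_TXT]
    chunk = [CHUNK, *user, *text, *agent]
    if len(chunk) != CHUNK_LEN:
        raise ValueError("each speech stream needs exactly 5 codes per chunk")
    return chunk

listening = build_chunk(user_codes=[1, 2, 3, 4, 5])
backchannel = build_chunk([1, 2, 3, 4, 5], [BCH], [7, 8, 9, 10, 11])
```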

Context window math:

  • 8192 tokens / 13 per chunk = 630 chunks (rounded down)
  • 630 x 200ms = 126 seconds max conversation
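
The same arithmetic as a small helper, which also makes it easy to try other chunk layouts. The 11-token case is an assumption about the two-stream variant (keeping [CHUNK] and dropping the two text slots), not a figure stated in this document.

```python
CONTEXT_TOKENS = 8192
CHUNK_MS = 200

def max_conversation_seconds(tokens_per_chunk: int) -> float:
    """Longest conversation that fits in the context window, in seconds."""
    max_chunks = CONTEXT_TOKENS // tokens_per_chunk   # whole chunks only
    return max_chunks * CHUNK_MS / 1000

three_stream = max_conversation_seconds(13)   # 630 chunks
two_stream = max_conversation_seconds(11)     # 744 chunks (assumed layout)
print(three_stream, two_stream)
```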

6. TRAINING STAGES

Stage 1: Modality Alignment

Goal        Teach speech <-> text mapping
Tasks       ASR + TTS (50/50 mix)
Data        LibriSpeech 960h (free, HuggingFace)
Format      [ASR] speech -> text and [TTS] text -> speech
Init        SmolLM2-135M pretrained (expanded vocab)
LR          5e-5, cosine schedule, 500 warmup steps
Batch       32 (x2 grad accum = 64 effective)
Seq len     1024
Epochs      3
Loss mask   Target tokens only (text for ASR, speech for TTS)
Duration    ~2h on A10G
Success     Model can do basic ASR + TTS
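
One way to realize the "target tokens only" loss mask is to copy the input ids into a label list and overwrite the prompt positions with -100, the default `ignore_index` of PyTorch's `CrossEntropyLoss`. The sketch below assumes the closing delimiter ([EOT] or [EOS]) stays in the loss so the model learns to stop; `target_only_labels` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss
SOS, SOT = 53_251, 53_253

def target_only_labels(input_ids, target_open):
    """Keep loss only on tokens after the first `target_open` delimiter."""
    start = input_ids.index(target_open) + 1
    return [IGNORE_INDEX] * start + list(input_ids[start:])

# ASR example: [ASR] [SOS] sp sp [EOS] [SOT] txt txt [EOT]
asr_ids = [53249, 53251, 49152, 49153, 53252, 53253, 10, 11, 53254]
asr_labels = target_only_labels(asr_ids, SOT)   # loss on text + [EOT]

# TTS example: [TTS] [SOT] txt txt [EOT] [SOS] sp sp [EOS]
tts_ids = [53250, 53253, 10, 11, 53254, 53251, 49152, 49153, 53252]
tts_labels = target_only_labels(tts_ids, SOS)   # loss on speech + [EOS]
```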

Stage 2: Half-Duplex Dialogue

Goal        Learn turn-based conversation
Tasks       User speaks -> Agent thinks (text) -> Agent speaks
Data        10K synthetic conversations (LLM-generated text + CosyVoice TTS)
Format      Sequential: user_speech -> agent_text -> agent_speech -> ...
Init        Stage 1 checkpoint
LR          2e-5, cosine schedule, 200 warmup steps
Batch       16 (x4 grad accum = 64 effective)
Seq len     4096
Epochs      5
Loss mask   Agent tokens only (not user speech)
Duration    ~3h on A10G
Success     Coherent responses to spoken queries
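
The learning-rate rows in these stage tables describe a cosine schedule with warmup. A minimal sketch of that shape follows, assuming linear warmup and decay to zero (the same shape as `get_cosine_schedule_with_warmup` in transformers); the total step count of 1000 is a placeholder, not a value from this document.

```python
import math

def lr_at(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then half-cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

# Stage 2 settings: peak 2e-5, 200 warmup steps; total_steps is hypothetical
schedule = [lr_at(s, 2e-5, 200, 1000) for s in (0, 100, 200, 600, 1000)]
```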

Stage 3: Full-Duplex Interaction

Goal           Simultaneous listen+speak, 200ms turn-taking
Tasks          Turn-taking, backchanneling, interruption handling
Data           15K conversations chunked at 200ms (10K augmented from Stage 2 + 5K fresh)
Format         Flattened chunks: [CHUNK] + 5 user_sp + 2 agent_txt + 5 agent_sp
Sub-stage 3a   Three-stream (user_sp + agent_txt + agent_sp), 6 epochs
Sub-stage 3b   Two-stream (user_sp + agent_sp, no text), 4 epochs
Init           Stage 2 checkpoint
LR             1e-5, cosine schedule, 100 warmup steps
Batch          8 (x8 grad accum = 64 effective)
Seq len        8192
Epochs         10 total
Loss mask      Agent tokens only
Duration       ~8h on A10G
Success        <400ms turn-taking, natural backchannels, handles interrupts
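
For the flattened three-stream format, an agent-only loss mask can be derived purely from token position inside each 13-token chunk. The sketch assumes [CHUNK] and the five user tokens are excluded from the loss, since they come from the input side at inference time; `agent_only_labels` is a hypothetical name.

```python
IGNORE_INDEX = -100   # default ignore_index of torch.nn.CrossEntropyLoss
CHUNK_LEN = 13        # [CHUNK] + 5 user_sp + 2 agent_txt + 5 agent_sp
AGENT_START = 6       # positions 6..12 of each chunk belong to the agent

def agent_only_labels(input_ids):
    """Keep loss on agent text/speech slots, ignore [CHUNK] and user speech."""
    if len(input_ids) % CHUNK_LEN != 0:
        raise ValueError("sequence must contain whole 13-token chunks")
    return [tok if i % CHUNK_LEN >= AGENT_START else IGNORE_INDEX
            for i, tok in enumerate(input_ids)]

two_chunks = list(range(100, 126))          # 26 dummy token ids
labels = agent_only_labels(two_chunks)
```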

7. DATA GENERATION

Stage 1 (download only):

# LibriSpeech: free, ready to download
from datasets import load_dataset
ds = load_dataset("openslr/librispeech_asr", "all")
# Then tokenize audio with CosyVoice encoder (~10h on A10G)

Stage 2 (synthetic generation):

# Step 1: Generate 10K text dialogues with any LLM
# Step 2: Synthesize each turn with CosyVoice TTS
# Step 3: Tokenize all audio with CosyVoice encoder
# Step 4: Format as half-duplex sequences
# Time: ~10h total (2h text generation + 8h TTS), Cost: ~$11 on A10G

Stage 3 (augmentation):

# Step 1: Take Stage 2 dialogues
# Step 2: Chunk into 200ms segments
# Step 3: Inject backchannels (p=0.10 per chunk during user speech)
# Step 4: Inject interruptions (p=0.05 per agent turn)
# Step 5: Add natural pauses (1-4 chunks between turns)
# Step 6: Generate 5K fresh conversations for diversity
# Step 7: Flatten into [CHUNK] user_sp5 agent_txt2 agent_sp5 format
# Time: ~5h for fresh generation, ~1h for augmentation
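
Steps 3 and 5 of the augmentation recipe are simple random draws. A minimal sketch with a seeded generator is shown below; the function names and the boolean-per-chunk input are assumptions made for illustration.

```python
import random

def pick_backchannel_chunks(user_speaking, p=0.10, seed=0):
    """Return chunk indices that receive a backchannel (only while the user speaks)."""
    rng = random.Random(seed)
    return [i for i, speaking in enumerate(user_speaking)
            if speaking and rng.random() < p]

def sample_pause_chunks(rng, low=1, high=4):
    """Number of silent 200ms chunks to insert between two turns."""
    return rng.randint(low, high)   # inclusive on both ends

# 20 chunks of user speech followed by 10 chunks of agent speech
user_speaking = [True] * 20 + [False] * 10
picked = pick_backchannel_chunks(user_speaking, p=0.10, seed=0)
pause = sample_pause_chunks(random.Random(0))
```

A fixed seed keeps the augmented dataset reproducible across runs.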

8. INFERENCE

Latency Budget (per 200ms chunk):

Audio capture:        ~1ms
CosyVoice encode:     ~10ms
LLM forward (7 tok):  ~20-30ms (RTX 3060)
CosyVoice decode:     ~15ms
Audio playback:       ~1ms
─────────────────────────────
TOTAL:                ~50-60ms ✓ (within 200ms budget)

Memory:

Model (bf16):         270 MB
KV cache (8192 ctx):  ~1.5 GB
CosyVoice enc+dec:    ~300 MB
─────────────────────────────
TOTAL VRAM:           ~2.5 GB (fits on any modern GPU)
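
A back-of-envelope check of the two model-dependent memory lines, derived from the backbone config in Section 3. It assumes batch size 1, bf16 storage (2 bytes per value), and a KV cache that stores only the 3 GQA key/value heads; `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(ctx=8192, layers=30, kv_heads=3, head_dim=64, bytes_per_elem=2):
    """Keys + values for every layer and position at batch size 1."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx

weights_mb = 139e6 * 2 / 1e6          # ~139M params in bf16
kv_mb = kv_cache_bytes() / 1e6        # full 8192-token context
print(f"weights ~{weights_mb:.0f} MB, KV cache ~{kv_mb:.0f} MB")
```

With these inputs the KV cache comes to roughly 0.19 GB at full context, so the KV-cache line in the table above leaves considerable headroom for activations and framework overhead.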

Realtime loop:

while True:
    user_audio = capture_200ms()                      # 200ms mic input
    user_tokens = cosyvoice_encode(user_audio)        # -> 5 tokens
    context.extend([CHUNK] + user_tokens)             # append to context
    agent_tokens = llm.generate(context, max_new=7)   # predict 7 tokens
    context.extend(agent_tokens)                      # append generated
    agent_sp = agent_tokens[2:]                       # last 5 = speech
    if not all_silent(agent_sp):
        play(cosyvoice_decode(agent_sp))              # output audio
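
The loop above appends 13 tokens per chunk, so after 630 chunks the context would pass 8192 tokens. One possible way to keep it bounded is to drop the oldest whole chunks before each step. This is a sketch of that idea, not a documented part of the design; `silent_generate` is a stand-in for the real model call.

```python
CHUNK, SIL_SP, SIL_TXT = 53_248, 53_255, 53_256
CHUNK_LEN, MAX_CONTEXT = 13, 8192

def trim_context(context, incoming=CHUNK_LEN):
    """Drop oldest whole chunks so `incoming` more tokens still fit."""
    overflow = len(context) + incoming - MAX_CONTEXT
    if overflow > 0:
        drop = -(-overflow // CHUNK_LEN) * CHUNK_LEN   # round up to whole chunks
        del context[:drop]
    return context

def step(context, user_tokens, generate):
    """Process one 200ms chunk; return agent speech tokens, or None if silent."""
    trim_context(context)
    context.extend([CHUNK, *user_tokens])
    agent_tokens = generate(context, 7)      # 2 text + 5 speech tokens
    context.extend(agent_tokens)
    agent_sp = agent_tokens[2:]
    return agent_sp if any(t != SIL_SP for t in agent_sp) else None

def silent_generate(context, n):
    """Placeholder model that always stays silent (for testing the loop only)."""
    return [SIL_TXT] * 2 + [SIL_SP] * 5

context = []
outputs = [step(context, [49_152] * 5, silent_generate) for _ in range(700)]
```

Dropping old chunks discards early conversation history, so a real system might prefer summarizing or resetting at a natural pause instead.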

9. COSTS

Item                            GPU Hours   Cost (A10G)
Tokenize LibriSpeech            10h         $11
Generate Stage 2 text           2h          $2
CosyVoice TTS (Stage 2)         8h          $9
CosyVoice TTS (Stage 3 extra)   5h          $6
Train Stage 1                   2h          $2
Train Stage 2                   3h          $3
Train Stage 3                   8h          $9
TOTAL                           ~38h        ~$44

On your own RTX 4090: ~$2 in electricity.

10. EVALUATION METRICS

Metric                             Target               Stage
ASR WER (LibriSpeech dev)          <30%                 1
TTS intelligibility                >70% words correct   1
Response relevance (GPT-4 judge)   >3.0/5.0             2
Turn-taking latency                <400ms               3
Backchannel F1                     >0.5                 3
Interruption yield rate            >80% within 400ms    3
FD-Bench score                     Report (no target)   3
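
The Stage 1 ASR target is stated as word error rate. For reference, a minimal WER implementation using word-level edit distance is sketched below; it does no text normalization (casing, punctuation), which a real evaluation would likely need.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))          # distances for the empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,             # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (r != h))) # substitution or match
        prev = cur
    return prev[-1] / len(ref)

passes_target = word_error_rate("the cat sat down", "the cat sat") < 0.30
```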

11. REFERENCES