🧬 Darwin-TTS-1.7B-Cross
World's first cross-modal FFN transfer from LLM to TTS: emotion-enhanced speech synthesis without any training.
Darwin-TTS blends 3% of the Qwen3-1.7B (LLM) FFN weights into the Qwen3-TTS-1.7B talker module. No training, no data, no GPU hours: just weight-space arithmetic.
Key Discovery
| Blend (α) | Emotion | Quality | Status |
|---|---|---|---|
| 0% | Baseline | Normal | Original Qwen3-TTS |
| 1% | No change | Normal | Too subtle |
| 3% | Emotion appears | Normal | ✅ This model (default) |
| 5% | Emotion intensified | Normal | ⚠️ Max stable |
| 10% | Broken | Failed | Infinite generation |
Why It Works
Qwen3-1.7B (LLM) and the Qwen3-TTS-1.7B talker share a 100% identical architecture:

| Parameter | Qwen3-1.7B (LLM) | Qwen3-TTS talker | Match |
|---|---|---|---|
| hidden_size | 2048 | 2048 | ✓ |
| intermediate_size | 6144 | 6144 | ✓ |
| num_hidden_layers | 28 | 28 | ✓ |
| num_attention_heads | 16 | 16 | ✓ |
| num_key_value_heads | 8 | 8 | ✓ |
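The compatibility check amounts to comparing these five config fields. A minimal sketch, with the values hardcoded from the table above (in practice they would be read from each model's config.json, e.g. via transformers.AutoConfig; `is_blend_compatible` is an illustrative helper, not part of any released tooling):

```python
# Shape-determining parameters, taken from the table above.
LLM_CONFIG = {
    "hidden_size": 2048,
    "intermediate_size": 6144,
    "num_hidden_layers": 28,
    "num_attention_heads": 16,
    "num_key_value_heads": 8,
}
TTS_TALKER_CONFIG = {
    "hidden_size": 2048,
    "intermediate_size": 6144,
    "num_hidden_layers": 28,
    "num_attention_heads": 16,
    "num_key_value_heads": 8,
}

def is_blend_compatible(a: dict, b: dict) -> bool:
    """True only if every shape-determining parameter matches exactly."""
    return all(a[k] == b.get(k) for k in a)

print(is_blend_compatible(LLM_CONFIG, TTS_TALKER_CONFIG))  # True
```

Any single mismatch (e.g. a different hidden_size) makes a 1:1 tensor blend impossible, which is exactly what sank the Llama-backbone attempt described later.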
This means zero SVD, zero truncation, zero layer mapping: pure 1:1 lerp blending across all 84 FFN tensors (gate_proj, up_proj, down_proj × 28 layers).
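The lerp is a one-line update per tensor: W_blend = (1 − α)·W_tts + α·W_llm. A minimal sketch over toy numpy arrays, assuming the state-dict key names shown in this card (the actual build operates on safetensors shards, not shown here):

```python
import numpy as np

ALPHA = 0.03  # 3% blend ratio
NUM_LAYERS = 28
FFN_PROJS = ("gate_proj", "up_proj", "down_proj")

def blend_ffn(tts_state: dict, llm_state: dict, alpha: float = ALPHA) -> dict:
    """Linearly interpolate the talker's FFN weights toward the LLM's.

    Key mapping is 1:1: model.layers.N.mlp.* (LLM) feeds
    talker.model.layers.N.mlp.* (TTS). All other tensors pass through.
    """
    out = dict(tts_state)
    for n in range(NUM_LAYERS):
        for proj in FFN_PROJS:
            tts_key = f"talker.model.layers.{n}.mlp.{proj}.weight"
            llm_key = f"model.layers.{n}.mlp.{proj}.weight"
            out[tts_key] = (1 - alpha) * tts_state[tts_key] + alpha * llm_state[llm_key]
    return out

# Toy demo with random 2x2 "weights" standing in for the real tensors.
rng = np.random.default_rng(0)
tts = {f"talker.model.layers.{n}.mlp.{p}.weight": rng.normal(size=(2, 2))
       for n in range(NUM_LAYERS) for p in FFN_PROJS}
llm = {f"model.layers.{n}.mlp.{p}.weight": rng.normal(size=(2, 2))
       for n in range(NUM_LAYERS) for p in FFN_PROJS}
blended = blend_ffn(tts, llm)
print(len(blended))  # 84
```

Because the shapes match exactly, no projection or truncation is needed; the same loop with α=0.01, 0.05, or 0.10 reproduces the sweep in the table above.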
Architecture
Qwen3-TTS-1.7B (4-module structure):

```
┌──────────────────────────────────────────────────────┐
│ talker (28L Qwen3 LM backbone)                       │
│  ├─ 84 FFN tensors blended with LLM (α=3%)           │ ← MODIFIED
│  └─ talker.model.layers.N.mlp.{gate,up,down}         │
├──────────────────────────────────────────────────────┤
│ code_predictor (5L, h=1024)                          │ ← UNTOUCHED
├──────────────────────────────────────────────────────┤
│ speech_tokenizer (12Hz RVQ codec)                    │ ← UNTOUCHED
├──────────────────────────────────────────────────────┤
│ encoder/decoder (audio waveform)                     │ ← UNTOUCHED
└──────────────────────────────────────────────────────┘

FFN source: Qwen3-1.7B (LLM)
 ├─ model.layers.N.mlp.{gate,up,down}_proj.weight
 └─ key mapping: model.layers.N → talker.model.layers.N (1:1)
```
Only the talker's FFN weights are modified. The code_predictor, speech_tokenizer, and encoder/decoder remain 100% original, preserving the audio codec pipeline entirely.
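Which checkpoint keys the blend touches can be expressed as a single pattern match. A small illustrative helper, assuming the Qwen3-style key naming shown above:

```python
import re

# Only talker FFN projection weights are modified; everything else
# (attention, code_predictor, speech_tokenizer, encoder/decoder) passes through.
FFN_PATTERN = re.compile(
    r"^talker\.model\.layers\.\d+\.mlp\.(gate_proj|up_proj|down_proj)\.weight$"
)

def is_modified(key: str) -> bool:
    return FFN_PATTERN.match(key) is not None

print(is_modified("talker.model.layers.0.mlp.gate_proj.weight"))    # True
print(is_modified("talker.model.layers.0.self_attn.q_proj.weight")) # False
print(is_modified("code_predictor.layers.0.mlp.gate_proj.weight"))  # False
```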
Quick Start
Option 1: Load pre-blended weights (this model)
```python
import torch
from qwen_tts import Qwen3TTSModel

# Load Darwin-TTS-1.7B-Cross (α=3% pre-blended)
model = Qwen3TTSModel.from_pretrained(
    "FINAL-Bench/Darwin-TTS-1.7B-Cross",
    device_map="cuda:0",
    dtype=torch.bfloat16,
)

# Synthesize
wavs, sr = model.generate_voice_clone(
    text="Hello, I am a next-generation AI!",
    ref_audio="your_voice.wav",
    ref_text="ref",
    x_vector_only_mode=True,
)
```
Option 2: Custom blend ratio (runtime blending)
```python
from qwen_tts import Qwen3TTSModel

# For a custom blend ratio, rebuild the weights with the
# darwin_tts_blend.py CLI (see below); loading then works the same way.
model = Qwen3TTSModel.from_pretrained("FINAL-Bench/Darwin-TTS-1.7B-Cross")

wavs, sr = model.generate_voice_clone(
    text="It's truly amazing!",
    ref_audio="voice.wav",
    ref_text="ref",
    x_vector_only_mode=True,
)
```
CLI
```bash
python darwin_tts_blend.py --alpha 3 --text "Hello, Darwin!" --ref voice.wav --output speech.wav
```
Installation
```bash
pip install torch qwen-tts safetensors soundfile huggingface_hub
```
Research Background
The Problem
Cross-modal capability transfer (e.g., adding emotion to TTS) traditionally requires:
- Thousands of hours of emotional speech data
- Hundreds of GPU hours for training
- Careful data curation and annotation
The Darwin Approach
Darwin's evolutionary merge framework, originally developed for LLM merging (Darwin LLM V7 achieved GPQA Diamond 86.9%, World #5), is extended to cross-modal transfer:
- Find architecture-compatible models across modalities (LLM → TTS)
- Blend FFN weights at low ratios (3~5%) using simple lerp
- Preserve modality-specific components (audio codec, tokenizer)
Key Findings
- Cross-modal FFN transfer works: the LLM's language-understanding patterns enhance TTS emotional expressiveness
- The sweet spot is 3~5%: TTS is far more sensitive than LLM merging, which tolerates 7~93%
- The same backbone is required: TADA-1B (Llama backbone) × Qwen3-TTS failed completely; Qwen3 × Qwen3 succeeded
- 10%+ destroys TTS: the LLM's "continue generating tokens" pattern overrides the TTS stop signal, causing 655-second outputs
- Bidirectional potential: LLM + TTS FFN may enable a "Speaking LLM" (the GPT-4o direction)
What Failed (and why it matters)
| Experiment | Why Failed | Lesson |
|---|---|---|
| TADA-1B (Llama) × Qwen3-TTS | Different backbone (Llama vs Qwen3) | Same backbone required |
| FFN 100% replacement | Too aggressive | Low ratio (3~5%) needed |
| x_vector_only_mode=False | ref_text mismatch | Use x_vector_only_mode=True |
| α=10% blend | LLM "keep generating" pattern | TTS has narrow tolerance |
Novelty (Prior Art Survey)
| Approach | Training Required | Cross-Modal | Published |
|---|---|---|---|
| LLM × LLM merging (TIES, DARE, SLERP) | No | No (same modality) | Many |
| TTS × TTS averaging (Murata 2024) | No | No (same modality) | INTERSPEECH 2024 |
| SmolTolk (adapter-based) | Yes (adapter training) | Yes | arXiv 2503.06211 |
| CSLM (fine-tuning) | Yes (continual pretraining) | Yes | arXiv 2604.11096 |
| GPT-4o (end-to-end) | Yes ($$$) | Yes | OpenAI 2024 |
| Darwin-TTS (this work) | No | Yes | World's first |
Experimental Timeline (2026-04-15)
- 09:00 TTS hidden_size compatibility analysis → h=2048 group discovered
- 09:30 TADA-1B × Qwen3-TTS download + config analysis
- 10:00 Chimera v1 (FFN 100%) → failed (noise)
- 10:30 Environment setup (darwin-tts-venv, torch 2.6.0+cu124)
- 10:50 Original Qwen3-TTS synthesis verified
- 11:00 SLERP blend 10/20/30% build (TADA) → failed (different backbone)
- 11:30 Key insight: Qwen3-1.7B (LLM) has an architecture identical to the TTS talker
- 12:00 Qwen3-1.7B download → config comparison → 5/5 parameters match
- 12:15 α=1/3/5/10% LLM→TTS blending experiments
- 12:23 Results: α=3% emotion appears, α=5% emotion intensified, α=10% broken
- 12:30 High-quality sample generation: 4 voice references × 3 blend ratios
- 13:00 Prior art survey → confirmed world's first
- 13:30 Darwin-TTS-1.7B-Cross (α=3%) final build + Hugging Face release
Model Details
- Model type: Text-to-Speech (cross-modal FFN blended)
- Base models: Qwen3-TTS-1.7B-Base + Qwen3-1.7B (3% FFN)
- Parameters: ~2.1B
- Languages: Korean, English, Japanese, Chinese + 6 more
- License: Apache 2.0
- Blend ratio: α=0.03 (3%)
- FFN tensors modified: 84 / 976 total (8.6%)
- Build time: ~2 minutes (no training)
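The "84 / 976" figure above follows directly from the structure: 3 FFN projections per layer × 28 talker layers, out of 976 tensors in the full checkpoint:

```python
projections = 3      # gate_proj, up_proj, down_proj
layers = 28          # talker depth
total_tensors = 976  # tensors in the full checkpoint (from this model card)

modified = projections * layers
print(modified)                                  # 84
print(round(100 * modified / total_tensors, 1))  # 8.6
```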
Credits
VIDRAFT: Darwin Evolutionary Merge Framework
- Darwin LLM V7: GPQA Diamond 86.9% (World #3)
- FINAL Bench: Text AGI benchmark
- 11 Pillar Technologies: AETHER, PROMETHEUS, HEPHAESTUS, Darwin, FINAL Bench, MARL, SiteAgent, νμ§+νμ, VDash, μΈκ³΅μ¬ν, StealthMark
Built on Qwen3-TTS-1.7B and Qwen3-1.7B by Alibaba Cloud (Apache 2.0).
Related
- Darwin-9B-Opus: Darwin LLM (GPQA Diamond 86.9%)
- FINAL Bench: Text AGI Benchmark
- Darwin Evolutionary Merge Framework: CMA-ES + FFN crossbreeding