Add paper link, author information, and citation
#1
by nielsr HF Staff - opened
README.md
CHANGED
|
@@ -1,37 +1,42 @@
|
|
| 1 |
---
|
| 2 |
language:
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
license: apache-2.0
|
| 14 |
-
tags:
|
| 15 |
-
- tts
|
| 16 |
-
- text-to-speech
|
| 17 |
-
- darwin
|
| 18 |
-
- cross-modal
|
| 19 |
-
- ffn-blending
|
| 20 |
-
- model-merging
|
| 21 |
-
- qwen3
|
| 22 |
-
- voice-cloning
|
| 23 |
-
- emotion
|
| 24 |
-
- vidraft
|
| 25 |
-
base_model:
|
| 26 |
-
- Qwen/Qwen3-TTS-12Hz-1.7B-Base
|
| 27 |
-
- Qwen/Qwen3-1.7B
|
| 28 |
pipeline_tag: text-to-speech
|
| 29 |
---
|
| 30 |
|
| 31 |
# 𧬠Darwin-TTS-1.7B-Cross
|
| 32 |
|
| 33 |
**World's first cross-modal FFN transfer from LLM to TTS: emotion-enhanced speech synthesis without any training.**
|
| 34 |
|
| 35 |
> Darwin-TTS blends 3% of the FFN weights of [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) (LLM) into the talker module of [Qwen3-TTS-1.7B](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) (TTS). No training, no data, no GPU hours: just weight-space arithmetic.
|
| 36 |
|
| 37 |
## Key Discovery
|
|
@@ -88,6 +93,7 @@ Only the talker's FFN weights are modified. The code_predictor, speech_tokenizer
|
|
| 88 |
|
| 89 |
```python
|
| 90 |
from qwen_tts import Qwen3TTSModel
|
|
|
|
| 91 |
|
| 92 |
# Load Darwin-TTS-1.7B-Cross (α=3% pre-blended)
|
| 93 |
model = Qwen3TTSModel.from_pretrained(
|
|
@@ -149,48 +155,10 @@ Darwin's evolutionary merge framework, originally developed for LLM merging (Dar
|
|
| 149 |
|
| 150 |
1. **Cross-modal FFN transfer works**: LLM's language understanding patterns enhance TTS emotional expressiveness
|
| 151 |
2. **Sweet spot is 3~5%**: TTS is far more sensitive than LLM merging (which tolerates 7~93%)
|
| 152 |
-
3. **Same backbone is required**:
|
| 153 |
4. **10%+ destroys TTS**: the LLM's "continue generating tokens" pattern overrides the TTS stop signal, causing 655-second outputs
|
| 154 |
5. **Bidirectional potential**: LLM + TTS FFN may enable "Speaking LLM" (the GPT-4o direction)
|
| 155 |
|
| 156 |
-
### What Failed (and why it matters)
|
| 157 |
-
|
| 158 |
-
| Experiment | Why Failed | Lesson |
|
| 159 |
-
|-----------|-----------|--------|
|
| 160 |
-
| TADA-1B (Llama) × Qwen3-TTS | Different backbone (Llama vs Qwen3) | Same backbone required |
|
| 161 |
-
| FFN 100% replacement | Too aggressive | Low ratio (3~5%) needed |
|
| 162 |
-
| x_vector_only_mode=False | ref_text mismatch | Use x_vector_only_mode=True |
|
| 163 |
-
| α=10% blend | LLM "keep generating" pattern | TTS has narrow tolerance |
|
| 164 |
-
|
| 165 |
-
### Novelty (Prior Art Survey)
|
| 166 |
-
|
| 167 |
-
| Approach | Training Required | Cross-Modal | Published |
|
| 168 |
-
|----------|:-:|:-:|:-:|
|
| 169 |
-
| LLM × LLM merging (TIES, DARE, SLERP) | No | No (same modality) | Many |
|
| 170 |
-
| TTS × TTS averaging (Murata 2024) | No | No (same modality) | INTERSPEECH 2024 |
|
| 171 |
-
| SmolTolk (adapter-based) | **Yes** (adapter training) | Yes | arxiv 2503.06211 |
|
| 172 |
-
| CSLM (fine-tuning) | **Yes** (continual pretraining) | Yes | arxiv 2604.11096 |
|
| 173 |
-
| GPT-4o (end-to-end) | **Yes** ($$$) | Yes | OpenAI 2024 |
|
| 174 |
-
| **Darwin-TTS (this work)** | **No** | **Yes** | **World's First** |
|
| 175 |
-
|
| 176 |
-
## Experimental Timeline (2026-04-15)
|
| 177 |
-
|
| 178 |
-
```
|
| 179 |
-
09:00 TTS hidden_size compatibility analysis → h=2048 group discovered
|
| 180 |
-
09:30 TADA-1B × Qwen3-TTS download + config analysis
|
| 181 |
-
10:00 Chimera v1 (FFN 100%) → failed (noise)
|
| 182 |
-
10:30 Environment setup (darwin-tts-venv, torch 2.6.0+cu124)
|
| 183 |
-
10:50 Original Qwen3-TTS synthesis verified
|
| 184 |
-
11:00 SLERP blend 10/20/30% build (TADA) → failed (different backbone)
|
| 185 |
-
11:30 Key insight: Qwen3-1.7B LLM has IDENTICAL architecture to TTS talker!
|
| 186 |
-
12:00 Qwen3-1.7B download → config comparison → 5/5 parameters match!
|
| 187 |
-
12:15 α=1/3/5/10% LLM→TTS blending experiments
|
| 188 |
-
12:23 ✅ α=3% emotion appears, α=5% emotion intensified, α=10% broken
|
| 189 |
-
12:30 4 voice references × 3 blend ratios high-quality sample generation
|
| 190 |
-
13:00 Prior art survey → confirmed world's first
|
| 191 |
-
13:30 Darwin-TTS-1.7B-Cross (α=3%) final build + HuggingFace release
|
| 192 |
-
```
|
| 193 |
-
|
| 194 |
## Model Details
|
| 195 |
|
| 196 |
- **Model type**: Text-to-Speech (cross-modal FFN blended)
|
|
@@ -202,21 +170,27 @@ Darwin's evolutionary merge framework, originally developed for LLM merging (Dar
|
|
| 202 |
- **FFN tensors modified**: 84 / 976 total (8.6%)
|
| 203 |
- **Build time**: ~2 minutes (no training)
|
| 204 |
|
| 205 |
## Credits
|
| 206 |
|
| 207 |
**[VIDRAFT](https://vidraft.nwr)** (비드래프트): Darwin Evolutionary Merge Framework
|
| 208 |
|
| 209 |
-
- Darwin LLM V7: GPQA Diamond 86.9% (World #3)
|
| 210 |
-
- FINAL Bench: Text AGI benchmark
|
| 211 |
-
- 11 Pillar Technologies: AETHER, PROMETHEUS, HEPHAESTUS, Darwin, FINAL Bench, MARL, SiteAgent, νμ§+νμ, VDash, μΈκ³΅μ¬ν, StealthMark
|
| 212 |
-
|
| 213 |
Built on [Qwen3-TTS-1.7B](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) and [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) by Alibaba Cloud (Apache 2.0).
|
| 214 |
|
| 215 |
-
|
| 216 |
## Related
|
| 217 |
|
| 218 |
-
- [Darwin-
|
| 219 |
- [FINAL Bench](https://huggingface.co/FINAL-Bench): Text AGI Benchmark
|
| 220 |
-
- [Darwin Evolutionary Merge Framework](https://huggingface.co/FINAL-Bench): CMA-ES + FFN crossbreeding
|
| 221 |
-
|
| 222 |
-
This model is introduced in [Darwin Family](https://arxiv.org/abs/2605.14386).
|
|
|
|
| 1 |
---
|
| 2 |
+
base_model:
|
| 3 |
+
- Qwen/Qwen3-TTS-12Hz-1.7B-Base
|
| 4 |
+
- Qwen/Qwen3-1.7B
|
| 5 |
language:
|
| 6 |
+
- ko
|
| 7 |
+
- en
|
| 8 |
+
- ja
|
| 9 |
+
- zh
|
| 10 |
+
- de
|
| 11 |
+
- fr
|
| 12 |
+
- ru
|
| 13 |
+
- pt
|
| 14 |
+
- es
|
| 15 |
+
- it
|
| 16 |
license: apache-2.0
|
| 17 |
pipeline_tag: text-to-speech
|
| 18 |
+
tags:
|
| 19 |
+
- tts
|
| 20 |
+
- text-to-speech
|
| 21 |
+
- darwin
|
| 22 |
+
- cross-modal
|
| 23 |
+
- ffn-blending
|
| 24 |
+
- model-merging
|
| 25 |
+
- qwen3
|
| 26 |
+
- voice-cloning
|
| 27 |
+
- emotion
|
| 28 |
+
- vidraft
|
| 29 |
+
project_page: https://vidraft.nwr
|
| 30 |
---
|
| 31 |
|
| 32 |
# 𧬠Darwin-TTS-1.7B-Cross
|
| 33 |
|
| 34 |
**World's first cross-modal FFN transfer from LLM to TTS: emotion-enhanced speech synthesis without any training.**
|
| 35 |
|
| 36 |
+
This model is a cross-modal application of the Darwin Family framework, introduced in the paper: [Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning](https://huggingface.co/papers/2605.14386).
|
| 37 |
+
|
| 38 |
+
**Authors:** Taebong Kim, Youngsik Hong, Minsik Kim, Sunyoung Choi, Jaewon Jang, Junghoon Shin, Minseo Kim.
|
| 39 |
+
|
| 40 |
> Darwin-TTS blends 3% of the FFN weights of [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) (LLM) into the talker module of [Qwen3-TTS-1.7B](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) (TTS). No training, no data, no GPU hours: just weight-space arithmetic.
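
The "weight-space arithmetic" in the blockquote above can be pictured with a minimal sketch. It assumes the blend is a plain linear interpolation, `(1 - alpha) * W_tts + alpha * W_llm`, applied only to FFN tensors; the card does not state the exact formula. The function name `blend_ffn`, the `.mlp.` name marker, and the tensor names are hypothetical, and plain Python lists stand in for real weight tensors.

```python
from typing import Dict, List

Tensor = List[float]  # stand-in for a real weight tensor


def blend_ffn(
    tts: Dict[str, Tensor],
    llm: Dict[str, Tensor],
    alpha: float = 0.03,
    ffn_marker: str = ".mlp.",  # hypothetical marker for FFN tensor names
) -> Dict[str, Tensor]:
    """Linearly blend LLM FFN weights into a TTS state dict.

    Assumption: blended = (1 - alpha) * tts + alpha * llm, FFN tensors only.
    Every other tensor is copied from the TTS model unchanged.
    """
    blended: Dict[str, Tensor] = {}
    for name, w_tts in tts.items():
        w_llm = llm.get(name)
        is_ffn = ffn_marker in name
        if is_ffn and w_llm is not None and len(w_llm) == len(w_tts):
            blended[name] = [
                (1.0 - alpha) * a + alpha * b for a, b in zip(w_tts, w_llm)
            ]
        else:
            blended[name] = list(w_tts)
    return blended


if __name__ == "__main__":
    # Toy state dicts with made-up tensor names and values.
    tts_weights = {
        "layers.0.mlp.up_proj.weight": [1.0, 2.0],
        "layers.0.self_attn.q_proj.weight": [5.0],
    }
    llm_weights = {
        "layers.0.mlp.up_proj.weight": [2.0, 0.0],
        "layers.0.self_attn.q_proj.weight": [9.0],
    }
    print(blend_ffn(tts_weights, llm_weights, alpha=0.03))
```

With real checkpoints, the tensor names of the two models would first need to be mapped onto each other, and the shapes must match, which is why the same backbone matters.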
|
| 41 |
|
| 42 |
## Key Discovery
|
|
|
|
| 93 |
|
| 94 |
```python
|
| 95 |
from qwen_tts import Qwen3TTSModel
|
| 96 |
+
import torch
|
| 97 |
|
| 98 |
# Load Darwin-TTS-1.7B-Cross (α=3% pre-blended)
|
| 99 |
model = Qwen3TTSModel.from_pretrained(
|
|
|
|
| 155 |
|
| 156 |
1. **Cross-modal FFN transfer works**: LLM's language understanding patterns enhance TTS emotional expressiveness
|
| 157 |
2. **Sweet spot is 3~5%**: TTS is far more sensitive than LLM merging (which tolerates 7~93%)
|
| 158 |
+
3. **Same backbone is required**: Qwen3 × Qwen3 succeeded; cross-backbone merges (e.g., Llama) failed.
|
| 159 |
4. **10%+ destroys TTS**: the LLM's "continue generating tokens" pattern overrides the TTS stop signal, causing 655-second outputs
|
| 160 |
5. **Bidirectional potential**: LLM + TTS FFN may enable "Speaking LLM" (the GPT-4o direction)
|
| 161 |
|
| 162 |
## Model Details
|
| 163 |
|
| 164 |
- **Model type**: Text-to-Speech (cross-modal FFN blended)
|
|
|
|
| 170 |
- **FFN tensors modified**: 84 / 976 total (8.6%)
|
| 171 |
- **Build time**: ~2 minutes (no training)
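
The "84 / 976" figure in the list above can be sanity-checked with simple arithmetic. The sketch assumes the talker has 28 decoder layers, each with three FFN projection matrices (`gate_proj`, `up_proj`, `down_proj`), as in the Qwen3-1.7B layout; the card itself does not state the layer count or projection names, so treat both as assumptions.

```python
NUM_LAYERS = 28  # assumed talker depth, not stated in the card
FFN_PROJECTIONS = ("gate_proj", "up_proj", "down_proj")  # assumed names
TOTAL_TENSORS = 976  # total tensor count given in the card

modified = NUM_LAYERS * len(FFN_PROJECTIONS)
share = 100 * modified / TOTAL_TENSORS

print(f"{modified} / {TOTAL_TENSORS} tensors = {share:.1f}%")
# prints "84 / 976 tensors = 8.6%"
```

Under those assumptions the count matches the card, which suggests every FFN projection in the talker is blended and nothing else is touched.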
|
| 172 |
|
| 173 |
+
## Citation
|
| 174 |
+
|
| 175 |
+
If you find this work useful in your research, please cite:
|
| 176 |
+
|
| 177 |
+
```bibtex
|
| 178 |
+
@article{kim2026darwin,
|
| 179 |
+
title={Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning},
|
| 180 |
+
author={Kim, Taebong and Hong, Youngsik and Kim, Minsik and Choi, Sunyoung and Jang, Jaewon and Shin, Junghoon and Kim, Minseo},
|
| 181 |
+
journal={arXiv preprint arXiv:2605.14386},
|
| 182 |
+
year={2026}
|
| 183 |
+
}
|
| 184 |
+
```
|
| 185 |
+
|
| 186 |
## Credits
|
| 187 |
|
| 188 |
**[VIDRAFT](https://vidraft.nwr)** (비드래프트): Darwin Evolutionary Merge Framework
|
| 189 |
|
| 190 |
Built on [Qwen3-TTS-1.7B](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) and [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) by Alibaba Cloud (Apache 2.0).
|
| 191 |
|
|
|
|
| 192 |
## Related
|
| 193 |
|
| 194 |
+
- [Darwin-27B-Opus](https://huggingface.co/FINAL-Bench/Darwin-27B-Opus): Darwin LLM Flagship
|
| 195 |
- [FINAL Bench](https://huggingface.co/FINAL-Bench): Text AGI Benchmark
|
| 196 |
+
- [Darwin Evolutionary Merge Framework](https://huggingface.co/FINAL-Bench): CMA-ES + FFN crossbreeding
|