Add paper link, author information, and citation

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +46 -72
README.md CHANGED
@@ -1,37 +1,42 @@
  ---
  language:
- - ko
- - en
- - ja
- - zh
- - de
- - fr
- - ru
- - pt
- - es
- - it
  license: apache-2.0
- tags:
- - tts
- - text-to-speech
- - darwin
- - cross-modal
- - ffn-blending
- - model-merging
- - qwen3
- - voice-cloning
- - emotion
- - vidraft
- base_model:
- - Qwen/Qwen3-TTS-12Hz-1.7B-Base
- - Qwen/Qwen3-1.7B
  pipeline_tag: text-to-speech
  ---

  # 🧬 Darwin-TTS-1.7B-Cross

  **World's first cross-modal FFN transfer from LLM to TTS: emotion-enhanced speech synthesis without any training.**

  > Darwin-TTS blends 3% of [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) (LLM) FFN weights into [Qwen3-TTS-1.7B](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) (TTS) talker module. No training, no data, no GPU hours, just weight-space arithmetic.

  ## Key Discovery
@@ -88,6 +93,7 @@ Only the talker's FFN weights are modified. The code_predictor, speech_tokenizer

  ```python
  from qwen_tts import Qwen3TTSModel

  # Load Darwin-TTS-1.7B-Cross (α=3% pre-blended)
  model = Qwen3TTSModel.from_pretrained(
@@ -149,48 +155,10 @@ Darwin's evolutionary merge framework, originally developed for LLM merging (Dar

  1. **Cross-modal FFN transfer works**: LLM's language understanding patterns enhance TTS emotional expressiveness
  2. **Sweet spot is 3~5%**: TTS is far more sensitive than LLM merging (which tolerates 7~93%)
- 3. **Same backbone is required**: TADA-1B (Llama backbone) × Qwen3-TTS failed completely; Qwen3 × Qwen3 succeeded
  4. **10%+ destroys TTS**: LLM's "continue generating tokens" pattern overrides the TTS stop signal, causing 655-second outputs
  5. **Bidirectional potential**: LLM + TTS FFN may enable "Speaking LLM" (the GPT-4o direction)

- ### What Failed (and why it matters)
-
- | Experiment | Why Failed | Lesson |
- |-----------|-----------|--------|
- | TADA-1B(Llama) × Qwen3-TTS | Different backbone (Llama vs Qwen3) | Same backbone required |
- | FFN 100% replacement | Too aggressive | Low ratio (3~5%) needed |
- | x_vector_only_mode=False | ref_text mismatch | Use x_vector_only_mode=True |
- | α=10% blend | LLM "keep generating" pattern | TTS has narrow tolerance |
-
- ### Novelty (Prior Art Survey)
-
- | Approach | Training Required | Cross-Modal | Published |
- |----------|:-:|:-:|:-:|
- | LLM × LLM merging (TIES, DARE, SLERP) | No | No (same modal) | Many |
- | TTS × TTS averaging (Murata 2024) | No | No (same modal) | INTERSPEECH 2024 |
- | SmolTolk (adapter-based) | **Yes** (adapter training) | Yes | arxiv 2503.06211 |
- | CSLM (fine-tuning) | **Yes** (continual pretraining) | Yes | arxiv 2604.11096 |
- | GPT-4o (end-to-end) | **Yes** ($$$) | Yes | OpenAI 2024 |
- | **Darwin-TTS (this work)** | **No** | **Yes** | **World's First** |
-
- ## Experimental Timeline (2026-04-15)
-
- ```
- 09:00 TTS hidden_size compatibility analysis → h=2048 group discovered
- 09:30 TADA-1B × Qwen3-TTS download + config analysis
- 10:00 Chimera v1 (FFN 100%) → failed (noise)
- 10:30 Environment setup (darwin-tts-venv, torch 2.6.0+cu124)
- 10:50 Original Qwen3-TTS synthesis verified
- 11:00 SLERP blend 10/20/30% build (TADA) → failed (different backbone)
- 11:30 Key insight: Qwen3-1.7B LLM has IDENTICAL architecture to TTS talker!
- 12:00 Qwen3-1.7B download → config comparison → 5/5 parameters match!
- 12:15 α=1/3/5/10% LLM→TTS blending experiments
- 12:23 ✅ α=3% emotion appears, α=5% emotion intensified, α=10% broken
- 12:30 4 voice references × 3 blend ratios high-quality sample generation
- 13:00 Prior art survey → confirmed world's first
- 13:30 Darwin-TTS-1.7B-Cross (α=3%) final build + HuggingFace release
- ```
-
  ## Model Details

  - **Model type**: Text-to-Speech (cross-modal FFN blended)
@@ -202,21 +170,27 @@ Darwin's evolutionary merge framework, originally developed for LLM merging (Dar
  - **FFN tensors modified**: 84 / 976 total (8.6%)
  - **Build time**: ~2 minutes (no training)

  ## Credits

  **[VIDRAFT](https://vidraft.nwr)** (비드래프트): Darwin Evolutionary Merge Framework

- - Darwin LLM V7: GPQA Diamond 86.9% (World #3)
- - FINAL Bench: Text AGI benchmark
- - 11 Pillar Technologies: AETHER, PROMETHEUS, HEPHAESTUS, Darwin, FINAL Bench, MARL, SiteAgent, 한지+한양, VDash, 인공사회, StealthMark
-
  Built on [Qwen3-TTS-1.7B](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) and [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) by Alibaba Cloud (Apache 2.0).

-
  ## Related

- - [Darwin-9B-Opus](https://huggingface.co/FINAL-Bench/Darwin-9B-Opus): Darwin LLM (GPQA Diamond 86.9%)
  - [FINAL Bench](https://huggingface.co/FINAL-Bench): Text AGI Benchmark
- - [Darwin Evolutionary Merge Framework](https://huggingface.co/FINAL-Bench): CMA-ES + FFN crossbreeding
-
- This model is introduced in [Darwin Family](https://arxiv.org/abs/2605.14386).
 
  ---
+ base_model:
+ - Qwen/Qwen3-TTS-12Hz-1.7B-Base
+ - Qwen/Qwen3-1.7B
  language:
+ - ko
+ - en
+ - ja
+ - zh
+ - de
+ - fr
+ - ru
+ - pt
+ - es
+ - it
  license: apache-2.0
  pipeline_tag: text-to-speech
+ tags:
+ - tts
+ - text-to-speech
+ - darwin
+ - cross-modal
+ - ffn-blending
+ - model-merging
+ - qwen3
+ - voice-cloning
+ - emotion
+ - vidraft
+ project_page: https://vidraft.nwr
  ---

  # 🧬 Darwin-TTS-1.7B-Cross

  **World's first cross-modal FFN transfer from LLM to TTS: emotion-enhanced speech synthesis without any training.**

+ This model is a cross-modal application of the Darwin Family framework, introduced in the paper: [Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning](https://huggingface.co/papers/2605.14386).
+
+ **Authors:** Taebong Kim, Youngsik Hong, Minsik Kim, Sunyoung Choi, Jaewon Jang, Junghoon Shin, Minseo Kim.
+
  > Darwin-TTS blends 3% of [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) (LLM) FFN weights into [Qwen3-TTS-1.7B](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) (TTS) talker module. No training, no data, no GPU hours, just weight-space arithmetic.

  ## Key Discovery
 

  ```python
  from qwen_tts import Qwen3TTSModel
+ import torch

  # Load Darwin-TTS-1.7B-Cross (α=3% pre-blended)
  model = Qwen3TTSModel.from_pretrained(
 

  1. **Cross-modal FFN transfer works**: LLM's language understanding patterns enhance TTS emotional expressiveness
  2. **Sweet spot is 3~5%**: TTS is far more sensitive than LLM merging (which tolerates 7~93%)
+ 3. **Same backbone is required**: Qwen3 × Qwen3 succeeded; cross-backbone merges (e.g., Llama) failed.
  4. **10%+ destroys TTS**: LLM's "continue generating tokens" pattern overrides the TTS stop signal, causing 655-second outputs
  5. **Bidirectional potential**: LLM + TTS FFN may enable "Speaking LLM" (the GPT-4o direction)

  ## Model Details

  - **Model type**: Text-to-Speech (cross-modal FFN blended)
 
  - **FFN tensors modified**: 84 / 976 total (8.6%)
  - **Build time**: ~2 minutes (no training)

+ ## Citation
+
+ If you find this work useful in your research, please cite:
+
+ ```bibtex
+ @article{kim2026darwin,
+ title={Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning},
+ author={Kim, Taebong and Hong, Youngsik and Kim, Minsik and Choi, Sunyoung and Jang, Jaewon and Shin, Junghoon and Kim, Minseo},
+ journal={arXiv preprint arXiv:2605.14386},
+ year={2026}
+ }
+ ```
+
  ## Credits

  **[VIDRAFT](https://vidraft.nwr)** (비드래프트): Darwin Evolutionary Merge Framework

  Built on [Qwen3-TTS-1.7B](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) and [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) by Alibaba Cloud (Apache 2.0).

  ## Related

+ - [Darwin-27B-Opus](https://huggingface.co/FINAL-Bench/Darwin-27B-Opus): Darwin LLM Flagship
  - [FINAL Bench](https://huggingface.co/FINAL-Bench): Text AGI Benchmark
+ - [Darwin Evolutionary Merge Framework](https://huggingface.co/FINAL-Bench): CMA-ES + FFN crossbreeding
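
For readers of this diff, the README's statement that 3% of the LLM's FFN weights are blended into the TTS talker can be illustrated with a minimal sketch. It assumes plain linear interpolation, `(1 - alpha) * tts + alpha * llm`, applied only to tensors whose names contain an FFN marker; the diff does not show the actual build script, so the function name `blend_ffn`, the `.mlp.` marker, and the key names are hypothetical, and real checkpoints would use tensors rather than Python lists.

```python
# Illustrative sketch only: linear interpolation is an assumption, and
# blend_ffn, the ".mlp." marker, and the key names are hypothetical.

def blend_ffn(tts_weights, llm_weights, alpha=0.03, ffn_marker=".mlp."):
    """Return a copy of tts_weights with FFN entries moved alpha of the way
    toward the matching llm_weights entries; all other entries are copied."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    blended = {}
    for name, tts_tensor in tts_weights.items():
        llm_tensor = llm_weights.get(name)
        if ffn_marker in name and llm_tensor is not None:
            if len(llm_tensor) != len(tts_tensor):
                raise ValueError(f"shape mismatch for {name}")
            # Weight-space arithmetic: (1 - alpha) * TTS + alpha * LLM.
            blended[name] = [
                (1.0 - alpha) * t + alpha * l
                for t, l in zip(tts_tensor, llm_tensor)
            ]
        else:
            # Non-FFN tensors (attention, embeddings, ...) stay untouched.
            blended[name] = list(tts_tensor)
    return blended


# Toy "checkpoints" with flat lists standing in for weight tensors.
tts = {
    "layers.0.mlp.up_proj.weight": [1.0, 2.0],
    "layers.0.self_attn.q_proj.weight": [5.0],
}
llm = {
    "layers.0.mlp.up_proj.weight": [3.0, 4.0],
    "layers.0.self_attn.q_proj.weight": [9.0],
}
blended = blend_ffn(tts, llm, alpha=0.03)
```

With `alpha=0.03` the FFN entry moves 3% of the way toward the LLM values while the attention entry is left as it was. Raising `alpha` toward 0.10 corresponds to the regime the README reports as breaking synthesis.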