FINAL-Bench/Darwin-TTS-1.7B-Cross

@@ -1,37 +1,42 @@
 ---
 language:
-  - ko
-  - en
-  - ja
-  - zh
-  - de
-  - fr
-  - ru
-  - pt
-  - es
-  - it
 license: apache-2.0
-tags:
-  - tts
-  - text-to-speech
-  - darwin
-  - cross-modal
-  - ffn-blending
-  - model-merging
-  - qwen3
-  - voice-cloning
-  - emotion
-  - vidraft
-base_model:
-  - Qwen/Qwen3-TTS-12Hz-1.7B-Base
-  - Qwen/Qwen3-1.7B
 pipeline_tag: text-to-speech
 ---
 # 🧬 Darwin-TTS-1.7B-Cross
 **World's first cross-modal FFN transfer from LLM to TTS — emotion-enhanced speech synthesis without any training.**
 > Darwin-TTS blends the FFN weights of [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) (LLM) into the talker module of [Qwen3-TTS-1.7B](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) (TTS) at a 3% blend ratio (α = 3%). No training, no data, no GPU hours; just weight-space arithmetic.
 ## Key Discovery
@@ -88,6 +93,7 @@ Only the talker's FFN weights are modified. The code_predictor, speech_tokenizer
 ```python
 from qwen_tts import Qwen3TTSModel
 # Load Darwin-TTS-1.7B-Cross (α=3% pre-blended)
 model = Qwen3TTSModel.from_pretrained(
@@ -149,48 +155,10 @@ Darwin's evolutionary merge framework, originally developed for LLM merging (Dar
 1. **Cross-modal FFN transfer works**: the LLM's language-understanding patterns enhance TTS emotional expressiveness
 2. **Sweet spot is 3–5%**: TTS blending is far more sensitive to the blend ratio than LLM merging (which tolerates 7–93%)
-3. **Same backbone is required**: TADA-1B (Llama backbone) × Qwen3-TTS failed completely; Qwen3 × Qwen3 succeeded
 4. **10% or more destroys TTS**: the LLM's "continue generating tokens" pattern overrides the TTS stop signal, causing 655-second outputs
 5. **Bidirectional potential**: blending LLM and TTS FFNs may enable a "Speaking LLM" (the GPT-4o direction)
-### What Failed (and why it matters)
-| Experiment | Why Failed | Lesson |
-|-----------|-----------|--------|
-| TADA-1B(Llama) × Qwen3-TTS | Different backbone (Llama vs Qwen3) | Same backbone required |
-| FFN 100% replacement | Too aggressive | Low ratio (3~5%) needed |
-| x_vector_only_mode=False | ref_text mismatch | Use x_vector_only_mode=True |
-| α=10% blend | LLM "keep generating" pattern | TTS has narrow tolerance |
-### Novelty (Prior Art Survey)
-| Approach | Training Required | Cross-Modal | Published |
-|----------|:-:|:-:|:-:|
-| LLM × LLM merging (TIES, DARE, SLERP) | No | No (same modal) | Many |
-| TTS × TTS averaging (Murata 2024) | No | No (same modal) | INTERSPEECH 2024 |
-| SmolTolk (adapter-based) | **Yes** (adapter training) | Yes | arxiv 2503.06211 |
-| CSLM (fine-tuning) | **Yes** (continual pretraining) | Yes | arxiv 2604.11096 |
-| GPT-4o (end-to-end) | **Yes** ($$$) | Yes | OpenAI 2024 |
-| **Darwin-TTS (this work)** | **No** | **Yes** | **World's First** |
-## Experimental Timeline (2026-04-15)
-```
-09:00  TTS hidden_size compatibility analysis → h=2048 group discovered
-09:30  TADA-1B × Qwen3-TTS download + config analysis
-10:00  Chimera v1 (FFN 100%) → failed (noise)
-10:30  Environment setup (darwin-tts-venv, torch 2.6.0+cu124)
-10:50  Original Qwen3-TTS synthesis verified
-11:00  SLERP blend 10/20/30% build (TADA) → failed (different backbone)
-11:30  Key insight: Qwen3-1.7B LLM has IDENTICAL architecture to TTS talker!
-12:00  Qwen3-1.7B download → config comparison → 5/5 parameters match!
-12:15  α=1/3/5/10% LLM→TTS blending experiments
-12:23  ✅ α=3% emotion appears, α=5% emotion intensified, α=10% broken
-12:30  4 voice references × 3 blend ratios high-quality sample generation
-13:00  Prior art survey → confirmed world's first
-13:30  Darwin-TTS-1.7B-Cross (α=3%) final build + HuggingFace release
-```
 ## Model Details
 - **Model type**: Text-to-Speech (cross-modal FFN blended)
@@ -202,21 +170,27 @@ Darwin's evolutionary merge framework, originally developed for LLM merging (Dar
 - **FFN tensors modified**: 84 / 976 total (8.6%)
 - **Build time**: ~2 minutes (no training)
 ## Credits
 **[VIDRAFT](https://vidraft.nwr)** (비드래프트) — Darwin Evolutionary Merge Framework
-- Darwin LLM V7: GPQA Diamond 86.9% (World #3)
-- FINAL Bench: Text AGI benchmark
-- 11 Pillar Technologies: AETHER, PROMETHEUS, HEPHAESTUS, Darwin, FINAL Bench, MARL, SiteAgent, 한지+한양, VDash, 인공사회, StealthMark
 Built on [Qwen3-TTS-1.7B](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) and [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) by Alibaba Cloud (Apache 2.0).
 ## Related
-- [Darwin-9B-Opus](https://huggingface.co/FINAL-Bench/Darwin-9B-Opus) — Darwin LLM (GPQA Diamond 86.9%)
 - [FINAL Bench](https://huggingface.co/FINAL-Bench) — Text AGI Benchmark
-- [Darwin Evolutionary Merge Framework](https://huggingface.co/FINAL-Bench) — CMA-ES + FFN crossbreeding
-This model is introduced in [Darwin Family](https://arxiv.org/abs/2605.14386).

 ---
+base_model:
+- Qwen/Qwen3-TTS-12Hz-1.7B-Base
+- Qwen/Qwen3-1.7B
 language:
+- ko
+- en
+- ja
+- zh
+- de
+- fr
+- ru
+- pt
+- es
+- it
 license: apache-2.0
 pipeline_tag: text-to-speech
+tags:
+- tts
+- text-to-speech
+- darwin
+- cross-modal
+- ffn-blending
+- model-merging
+- qwen3
+- voice-cloning
+- emotion
+- vidraft
+project_page: https://vidraft.nwr
 ---
 # 🧬 Darwin-TTS-1.7B-Cross
 **World's first cross-modal FFN transfer from LLM to TTS — emotion-enhanced speech synthesis without any training.**
+This model is a cross-modal application of the Darwin Family framework, introduced in the paper: [Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning](https://huggingface.co/papers/2605.14386).
+**Authors:** Taebong Kim, Youngsik Hong, Minsik Kim, Sunyoung Choi, Jaewon Jang, Junghoon Shin, Minseo Kim.
 > Darwin-TTS blends the FFN weights of [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) (LLM) into the talker module of [Qwen3-TTS-1.7B](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) (TTS) at a 3% blend ratio (α = 3%). No training, no data, no GPU hours; just weight-space arithmetic.
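The "weight-space arithmetic" above can be illustrated with a toy sketch. It assumes plain linear interpolation, `(1 - α) · W_tts + α · W_llm`, applied only to the talker's FFN tensors; the card does not state the exact blend rule, so treat this as one plausible reading. The key names, the `talker.` prefix, and the helper names are hypothetical, and plain Python lists stand in for real weight tensors.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # tensor name -> flattened values (toy stand-in)

TALKER_PREFIX = "talker."  # hypothetical key prefix for the TTS talker module


def is_talker_ffn(name: str) -> bool:
    """Hypothetical rule: an FFN projection that lives inside the talker."""
    return name.startswith(TALKER_PREFIX) and ".mlp." in name


def blend_ffn(tts: Weights, llm: Weights, alpha: float = 0.03) -> Weights:
    """Return a copy of `tts` whose talker FFN tensors are nudged toward `llm`."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    blended = dict(tts)  # tensors that are not talker FFNs are carried over as-is
    for name, w_tts in tts.items():
        if not is_talker_ffn(name):
            continue
        w_llm = llm[name[len(TALKER_PREFIX):]]  # same layer, minus the prefix
        if len(w_llm) != len(w_tts):
            raise ValueError(f"shape mismatch for {name}")
        # Assumed rule: linear interpolation between the two weight sets.
        blended[name] = [(1 - alpha) * a + alpha * b for a, b in zip(w_tts, w_llm)]
    return blended


# Toy checkpoints with made-up values.
tts = {
    "talker.model.layers.0.mlp.gate_proj.weight": [1.0, 2.0],
    "talker.model.layers.0.self_attn.q_proj.weight": [5.0, 5.0],
    "code_predictor.layers.0.mlp.gate_proj.weight": [7.0, 7.0],
}
llm = {
    "model.layers.0.mlp.gate_proj.weight": [2.0, 4.0],
    "model.layers.0.self_attn.q_proj.weight": [9.0, 9.0],
}

out = blend_ffn(tts, llm, alpha=0.03)
gate = out["talker.model.layers.0.mlp.gate_proj.weight"]
print([round(v, 4) for v in gate])  # [1.03, 2.06]
```

In this sketch the attention tensor and the `code_predictor` tensor pass through unchanged, matching the card's statement that only the talker's FFN weights are modified. With real checkpoints the same expression would be applied elementwise to each tensor.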
 ## Key Discovery
 ```python
 from qwen_tts import Qwen3TTSModel
+import torch
 # Load Darwin-TTS-1.7B-Cross (α=3% pre-blended)
 model = Qwen3TTSModel.from_pretrained(
 1. **Cross-modal FFN transfer works**: the LLM's language-understanding patterns enhance TTS emotional expressiveness
 2. **Sweet spot is 3–5%**: TTS blending is far more sensitive to the blend ratio than LLM merging (which tolerates 7–93%)
+3. **Same backbone is required**: Qwen3 × Qwen3 succeeded; cross-backbone merges (e.g., Llama) failed.
 4. **10% or more destroys TTS**: the LLM's "continue generating tokens" pattern overrides the TTS stop signal, causing 655-second outputs
 5. **Bidirectional potential**: blending LLM and TTS FFNs may enable a "Speaking LLM" (the GPT-4o direction)
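The runaway-generation failure in item 4 is easy to screen for when sweeping blend ratios. The sketch below is a hypothetical sanity check, not part of the released code: it compares the duration of a generated waveform against a rough budget derived from the input text length. The 24 kHz sample rate, the 0.5 seconds-per-character budget, and the 10-second floor are illustrative assumptions.

```python
def audio_seconds(num_samples: int, sample_rate: int) -> float:
    """Duration of a mono waveform in seconds."""
    return num_samples / sample_rate


def looks_runaway(num_samples: int, sample_rate: int, text: str,
                  seconds_per_char: float = 0.5,
                  floor_seconds: float = 10.0) -> bool:
    """Flag outputs far longer than the text could plausibly need.

    The budget (seconds_per_char, floor_seconds) is an illustrative
    assumption and should be tuned per language and speaking rate.
    """
    budget = max(floor_seconds, seconds_per_char * len(text))
    return audio_seconds(num_samples, sample_rate) > budget


SAMPLE_RATE = 24_000  # assumed output sample rate, for illustration only
text = "Hello, how are you today?"

# A 655-second output, like the failure reported above, is flagged.
print(looks_runaway(655 * SAMPLE_RATE, SAMPLE_RATE, text))  # True
# A 3-second output for the same sentence passes.
print(looks_runaway(3 * SAMPLE_RATE, SAMPLE_RATE, text))    # False
```

A check of this kind only detects a missing stop signal; it says nothing about the emotional quality of outputs that do terminate.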
 ## Model Details
 - **Model type**: Text-to-Speech (cross-modal FFN blended)
 - **FFN tensors modified**: 84 / 976 total (8.6%)
 - **Build time**: ~2 minutes (no training)
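The tensor counts above can be checked with a few lines of arithmetic. The per-layer breakdown is an assumption: it supposes each talker layer contributes three FFN projections (gate, up, and down), which the card does not state explicitly.

```python
modified_ffn_tensors = 84
total_tensors = 976

# Share of tensors touched by the blend.
share = modified_ffn_tensors / total_tensors
print(f"{share:.1%}")  # 8.6%

# Assumption: three FFN projections per layer (gate_proj, up_proj, down_proj).
PROJECTIONS_PER_LAYER = 3
layers, remainder = divmod(modified_ffn_tensors, PROJECTIONS_PER_LAYER)
print(layers, remainder)  # 28 0
```

Under that assumption, 84 tensors correspond to 28 layers with no remainder, so the count is consistent with every talker layer's FFN being blended.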
+## Citation
+If you find this work useful in your research, please cite:
+```bibtex
+@article{kim2026darwin,
+  title={Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning},
+  author={Kim, Taebong and Hong, Youngsik and Kim, Minsik and Choi, Sunyoung and Jang, Jaewon and Shin, Junghoon and Kim, Minseo},
+  journal={arXiv preprint arXiv:2605.14386},
+  year={2026}
+}
+```
 ## Credits
 **[VIDRAFT](https://vidraft.nwr)** (비드래프트) — Darwin Evolutionary Merge Framework
 Built on [Qwen3-TTS-1.7B](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) and [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) by Alibaba Cloud (Apache 2.0).
 ## Related
+- [Darwin-27B-Opus](https://huggingface.co/FINAL-Bench/Darwin-27B-Opus) — Darwin LLM Flagship
 - [FINAL Bench](https://huggingface.co/FINAL-Bench) — Text AGI Benchmark
+- [Darwin Evolutionary Merge Framework](https://huggingface.co/FINAL-Bench) — CMA-ES + FFN crossbreeding