---
language:
- ko
- en
- ja
- zh
- de
- fr
- ru
- pt
- es
- it
license: apache-2.0
tags:
- tts
- text-to-speech
- darwin
- cross-modal
- ffn-blending
- model-merging
- qwen3
- voice-cloning
- emotion
- vidraft
base_model:
- Qwen/Qwen3-TTS-12Hz-1.7B-Base
- Qwen/Qwen3-1.7B
pipeline_tag: text-to-speech
---

# 🧬 Darwin-TTS-1.7B-Cross

**World's first cross-modal FFN transfer from LLM to TTS: emotion-enhanced speech synthesis without any training.**

> Darwin-TTS blends 3% of [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) (LLM) FFN weights into the talker module of [Qwen3-TTS-1.7B](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) (TTS). No training, no data, no GPU hours: just weight-space arithmetic.

## Key Discovery

| Blend (α) | Emotion | Quality | Status |
|-----------|---------|---------|--------|
| 0% | Baseline | Normal | Original Qwen3-TTS |
| 1% | No change | Normal | Too subtle |
| **3%** | **Emotion appears** | **Normal** | **✅ This model (default)** |
| 5% | Emotion intensified | Normal | ✅✅ Max stable |
| 10% | Broken | Failed | Infinite generation |

## Why It Works

Qwen3-1.7B (LLM) and the Qwen3-TTS-1.7B talker share a **100% identical architecture**:

```
                      Qwen3-1.7B (LLM)   Qwen3-TTS talker   Match
hidden_size                 2048               2048           ✅
intermediate_size           6144               6144           ✅
num_hidden_layers             28                 28           ✅
num_attention_heads           16                 16           ✅
num_key_value_heads            8                  8           ✅
```

This means **zero SVD, zero truncation, zero layer mapping**: pure 1:1 lerp blending across all 84 FFN tensors (gate_proj, up_proj, down_proj × 28 layers).
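Because every tensor pair has an identical shape, the blend reduces to a single linear interpolation per tensor. A minimal sketch with toy tensors (the function and variable names are illustrative, not from the release scripts):

```python
import torch

def lerp_ffn(w_tts: torch.Tensor, w_llm: torch.Tensor, alpha: float = 0.03) -> torch.Tensor:
    """Blend one FFN tensor: (1 - alpha) * TTS + alpha * LLM.

    Identical shapes mean no SVD, truncation, or layer remapping is needed.
    """
    assert w_tts.shape == w_llm.shape, "backbones must match exactly"
    return (1.0 - alpha) * w_tts + alpha * w_llm

# Toy stand-ins for one gate_proj weight (real shape: 6144 x 2048)
w_tts = torch.zeros(4, 4)
w_llm = torch.ones(4, 4)
blended = lerp_ffn(w_tts, w_llm, alpha=0.03)
print(blended[0, 0].item())  # ~0.03: a 3% pull toward the LLM weights
```

At α=0 the talker is unchanged; at α=1 its FFNs would be fully replaced by the LLM's, which is the "FFN 100% replacement" failure mode described below.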

## Architecture

```
Qwen3-TTS-1.7B (4-module structure):
┌──────────────────────────────────────────────────────┐
│ talker (28L Qwen3 LM backbone)                       │
│  ├── 84 FFN tensors blended with LLM (α=3%)          │ ← MODIFIED
│  └── talker.model.layers.N.mlp.{gate,up,down}        │
├──────────────────────────────────────────────────────┤
│ code_predictor (5L, h=1024)                          │ ← UNTOUCHED
├──────────────────────────────────────────────────────┤
│ speech_tokenizer (12Hz RVQ codec)                    │ ← UNTOUCHED
├──────────────────────────────────────────────────────┤
│ encoder/decoder (audio waveform)                     │ ← UNTOUCHED
└──────────────────────────────────────────────────────┘

FFN Source: Qwen3-1.7B (LLM)
├── model.layers.N.mlp.{gate,up,down}_proj.weight
└── Key mapping: model.layers.N → talker.model.layers.N (1:1)
```

Only the talker's FFN weights are modified. The code_predictor, speech_tokenizer, and encoder/decoder remain 100% original, preserving the audio codec pipeline entirely.
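The 1:1 key mapping above can be sketched as a plain dictionary (a hypothetical illustration; the actual build script is not published in this repo):

```python
# Hypothetical sketch of the 1:1 FFN key mapping described above.
PROJS = ("gate_proj", "up_proj", "down_proj")

def ffn_key_mapping(num_layers: int = 28) -> dict:
    """Map each LLM FFN tensor name to its talker counterpart, layer N -> layer N."""
    mapping = {}
    for n in range(num_layers):
        for proj in PROJS:
            src = f"model.layers.{n}.mlp.{proj}.weight"
            dst = f"talker.model.layers.{n}.mlp.{proj}.weight"
            mapping[src] = dst
    return mapping

mapping = ffn_key_mapping()
print(len(mapping))  # 84 tensors: 3 projections x 28 layers
```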

## Quick Start

### Option 1: Load pre-blended weights (this model)

```python
import torch
from qwen_tts import Qwen3TTSModel

# Load Darwin-TTS-1.7B-Cross (α=3% pre-blended)
model = Qwen3TTSModel.from_pretrained(
    "FINAL-Bench/Darwin-TTS-1.7B-Cross",
    device_map="cuda:0",
    dtype=torch.bfloat16
)

# Synthesize (Korean: "Hello, I am Darwin AI!")
wavs, sr = model.generate_voice_clone(
    text="안녕하세요, 저는 다윈 인공지능입니다!",
    ref_audio="your_voice.wav",
    ref_text="ref",
    x_vector_only_mode=True
)
```

### Option 2: Custom blend ratio (runtime blending)

A minimal sketch of blending at your own α (the module attribute paths on the `qwen_tts` model object are assumptions and may differ by version):

```python
import torch
from transformers import AutoModelForCausalLM
from qwen_tts import Qwen3TTSModel

alpha = 0.05  # custom blend ratio (5% was the max stable value observed)

# Start from the original TTS model and the donor LLM
tts = Qwen3TTSModel.from_pretrained("Qwen/Qwen3-TTS-12Hz-1.7B-Base")
llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")

# Lerp the 84 talker FFN tensors in place: (1 - alpha) * TTS + alpha * LLM
for tts_layer, llm_layer in zip(tts.model.talker.model.layers, llm.model.layers):
    for proj in ("gate_proj", "up_proj", "down_proj"):
        w = getattr(tts_layer.mlp, proj).weight.data
        w_llm = getattr(llm_layer.mlp, proj).weight.data.to(w.dtype)
        w.mul_(1.0 - alpha).add_(alpha * w_llm)
```

### CLI

```bash
python darwin_tts_blend.py --alpha 3 --text "Hello, Darwin!" --ref voice.wav --output speech.wav
```

## Installation

```bash
pip install torch qwen-tts safetensors soundfile huggingface_hub
```

## Research Background

### The Problem
Cross-modal capability transfer (e.g., adding emotion to TTS) traditionally requires:
- Thousands of hours of emotional speech data
- Hundreds of GPU hours for training
- Careful data curation and annotation

### The Darwin Approach
Darwin's evolutionary merge framework, originally developed for LLM merging (Darwin LLM V7 achieved GPQA Diamond 86.9%, World #5), is extended to cross-modal transfer:

1. **Find architecture-compatible models** across modalities (LLM → TTS)
2. **Blend FFN weights** at low ratios (3~5%) using simple lerp
3. **Preserve modality-specific components** (audio codec, tokenizer)
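Step 1 above can be sketched as a simple compatibility gate. The values are hardcoded from the comparison table in "Why It Works"; in practice they would be read from each model's `config.json`, and the TADA-1B entry here is a hypothetical stand-in for a mismatched Llama backbone:

```python
# Fields that determine FFN tensor shapes; all must match for a 1:1 lerp.
KEYS = ("hidden_size", "intermediate_size", "num_hidden_layers",
        "num_attention_heads", "num_key_value_heads")

def compatible(cfg_a: dict, cfg_b: dict) -> bool:
    """True only when every shape-determining field is identical."""
    return all(cfg_a[k] == cfg_b[k] for k in KEYS)

qwen3_llm = {"hidden_size": 2048, "intermediate_size": 6144,
             "num_hidden_layers": 28, "num_attention_heads": 16,
             "num_key_value_heads": 8}
qwen3_tts_talker = dict(qwen3_llm)               # identical, per the table above
tada_1b = dict(qwen3_llm, num_hidden_layers=16)  # hypothetical mismatched config

print(compatible(qwen3_llm, qwen3_tts_talker))  # True
print(compatible(qwen3_llm, tada_1b))           # False
```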

### Key Findings

1. **Cross-modal FFN transfer works**: the LLM's language-understanding patterns enhance TTS emotional expressiveness
2. **Sweet spot is 3~5%**: TTS is far more sensitive than LLM merging (which tolerates 7~93%)
3. **Same backbone is required**: TADA-1B (Llama backbone) × Qwen3-TTS failed completely; Qwen3 × Qwen3 succeeded
4. **10%+ destroys TTS**: the LLM's "continue generating tokens" pattern overrides the TTS stop signal, causing 655-second outputs
5. **Bidirectional potential**: blending TTS FFN into an LLM may enable a "Speaking LLM" (the GPT-4o direction)

### What Failed (and why it matters)

| Experiment | Why It Failed | Lesson |
|-----------|-----------|--------|
| TADA-1B (Llama) × Qwen3-TTS | Different backbone (Llama vs Qwen3) | Same backbone required |
| FFN 100% replacement | Too aggressive | Low ratio (3~5%) needed |
| x_vector_only_mode=False | ref_text mismatch | Use x_vector_only_mode=True |
| α=10% blend | LLM "keep generating" pattern | TTS has narrow tolerance |

### Novelty (Prior Art Survey)

| Approach | Training Required | Cross-Modal | Published |
|----------|:-:|:-:|:-:|
| LLM × LLM merging (TIES, DARE, SLERP) | No | No (same modal) | Many |
| TTS × TTS averaging (Murata 2024) | No | No (same modal) | INTERSPEECH 2024 |
| SmolTolk (adapter-based) | **Yes** (adapter training) | Yes | arXiv 2503.06211 |
| CSLM (fine-tuning) | **Yes** (continual pretraining) | Yes | arXiv 2604.11096 |
| GPT-4o (end-to-end) | **Yes** ($$$) | Yes | OpenAI 2024 |
| **Darwin-TTS (this work)** | **No** | **Yes** | **World's First** |

## Experimental Timeline (2026-04-15)

```
09:00  TTS hidden_size compatibility analysis → h=2048 group discovered
09:30  TADA-1B × Qwen3-TTS download + config analysis
10:00  Chimera v1 (FFN 100%) → failed (noise)
10:30  Environment setup (darwin-tts-venv, torch 2.6.0+cu124)
10:50  Original Qwen3-TTS synthesis verified
11:00  SLERP blend 10/20/30% build (TADA) → failed (different backbone)
11:30  Key insight: Qwen3-1.7B LLM has IDENTICAL architecture to the TTS talker!
12:00  Qwen3-1.7B download → config comparison → 5/5 parameters match!
12:15  α=1/3/5/10% LLM→TTS blending experiments
12:23  ✅ α=3% emotion appears, α=5% emotion intensified, α=10% broken
12:30  4 voice references × 3 blend ratios: high-quality sample generation
13:00  Prior art survey → confirmed world's first
13:30  Darwin-TTS-1.7B-Cross (α=3%) final build + HuggingFace release
```

## Model Details

- **Model type**: Text-to-Speech (cross-modal FFN blended)
- **Base models**: Qwen3-TTS-1.7B-Base + Qwen3-1.7B (3% FFN)
- **Parameters**: ~2.1B
- **Languages**: Korean, English, Japanese, Chinese + 6 more
- **License**: Apache 2.0
- **Blend ratio**: α=0.03 (3%)
- **FFN tensors modified**: 84 / 976 total (8.6%)
- **Build time**: ~2 minutes (no training)

## Credits

**[VIDRAFT](https://vidraft.nwr)**: Darwin Evolutionary Merge Framework

- Darwin LLM V7: GPQA Diamond 86.9% (World #5)
- FINAL Bench: Text AGI benchmark
- 11 Pillar Technologies: AETHER, PROMETHEUS, HEPHAESTUS, Darwin, FINAL Bench, MARL, SiteAgent, Quality+Innovation, VDash, Artificial Society, StealthMark

Built on [Qwen3-TTS-1.7B](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) and [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) by Alibaba Cloud (Apache 2.0).

## Related

- [Darwin-9B-Opus](https://huggingface.co/FINAL-Bench/Darwin-9B-Opus): Darwin LLM (GPQA Diamond 86.9%)
- [FINAL Bench](https://huggingface.co/FINAL-Bench): Text AGI Benchmark
- [Darwin Evolutionary Merge Framework](https://huggingface.co/FINAL-Bench): CMA-ES + FFN crossbreeding