---
language:
- ko
- en
- ja
- zh
- de
- fr
- ru
- pt
- es
- it
license: apache-2.0
tags:
- tts
- text-to-speech
- darwin
- cross-modal
- ffn-blending
- model-merging
- qwen3
- voice-cloning
- emotion
- vidraft
base_model:
- Qwen/Qwen3-TTS-12Hz-1.7B-Base
- Qwen/Qwen3-1.7B
pipeline_tag: text-to-speech
---

# 🧬 Darwin-TTS-1.7B-Cross

**World's first cross-modal FFN transfer from an LLM to a TTS model: emotion-enhanced speech synthesis without any training.**

> Darwin-TTS blends 3% of the [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) (LLM) FFN weights into the talker module of [Qwen3-TTS-1.7B](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) (TTS). No training, no data, no GPU hours: just weight-space arithmetic.

## Key Discovery

| Blend (α) | Emotion | Quality | Status |
|-----------|---------|---------|--------|
| 0% | Baseline | Normal | Original Qwen3-TTS |
| 1% | No change | Normal | Too subtle |
| **3%** | **Emotion appears** | **Normal** | **★ This model (default)** |
| 5% | Emotion intensified | Normal | ★★ Max stable |
| 10% | Broken | Failed | Infinite generation |

## Why It Works

Qwen3-1.7B (the LLM) and the Qwen3-TTS-1.7B talker share a **100% identical architecture**:

```
                      Qwen3-1.7B (LLM)   Qwen3-TTS talker   Match
hidden_size           2048               2048               ✅
intermediate_size     6144               6144               ✅
num_hidden_layers     28                 28                 ✅
num_attention_heads   16                 16                 ✅
num_key_value_heads   8                  8                  ✅
```

This means **zero SVD, zero truncation, zero layer mapping**: pure 1:1 lerp blending across all 84 FFN tensors (gate_proj, up_proj, down_proj × 28 layers).
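The 1:1 lerp above is nothing more than elementwise linear interpolation. A minimal pure-Python sketch of the blend arithmetic (plain lists stand in for the real bfloat16 tensors, so this is illustrative only):

```python
ALPHA = 0.03  # 3% LLM contribution, this model's default

def lerp_ffn(tts_weight, llm_weight, alpha=ALPHA):
    """Blend one FFN tensor: (1 - alpha) * TTS + alpha * LLM.

    Shapes must already match, which the identical architecture
    guarantees -- no SVD, no truncation, no layer remapping.
    """
    assert len(tts_weight) == len(llm_weight)
    return [(1 - alpha) * t + alpha * l
            for t, l in zip(tts_weight, llm_weight)]

# Two toy weight values blended at alpha = 3%
blended = lerp_ffn([1.0, -2.0], [0.5, 4.0])
```

At α=3% the TTS weights dominate (97%), which is why the audio quality stays intact while a small amount of LLM structure leaks in.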
## Architecture

```
Qwen3-TTS-1.7B (4-module structure):
┌──────────────────────────────────────────────────────┐
│ talker (28L Qwen3 LM backbone)                       │
│   └── 84 FFN tensors blended with LLM (α=3%)         │ ← MODIFIED
│   └── talker.model.layers.N.mlp.{gate,up,down}       │
├──────────────────────────────────────────────────────┤
│ code_predictor (5L, h=1024)                          │ ← UNTOUCHED
├──────────────────────────────────────────────────────┤
│ speech_tokenizer (12Hz RVQ codec)                    │ ← UNTOUCHED
├──────────────────────────────────────────────────────┤
│ encoder/decoder (audio waveform)                     │ ← UNTOUCHED
└──────────────────────────────────────────────────────┘

FFN Source: Qwen3-1.7B (LLM)
  └── model.layers.N.mlp.{gate,up,down}_proj.weight
  └── Key mapping: model.layers.N → talker.model.layers.N (1:1)
```

Only the talker's FFN weights are modified. The code_predictor, speech_tokenizer, and encoder/decoder remain 100% original, preserving the audio codec pipeline entirely.
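The key mapping above is purely mechanical, so the full set of (LLM key, talker key) pairs can be enumerated without inspecting any weights. A short sketch using the tensor names documented above:

```python
def ffn_key_pairs(num_layers=28):
    """Enumerate the 84 blended FFN tensors: 3 projections x 28 layers.

    Each LLM key model.layers.N.mlp.{proj}.weight maps 1:1 onto the
    talker key talker.model.layers.N.mlp.{proj}.weight.
    """
    pairs = []
    for n in range(num_layers):
        for proj in ("gate_proj", "up_proj", "down_proj"):
            pairs.append((
                f"model.layers.{n}.mlp.{proj}.weight",
                f"talker.model.layers.{n}.mlp.{proj}.weight",
            ))
    return pairs

pairs = ffn_key_pairs()  # 84 pairs for the 28-layer talker
```

Everything outside this key list (code_predictor, speech_tokenizer, encoder/decoder) is copied through untouched.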
## Quick Start

### Option 1: Load pre-blended weights (this model)

```python
import torch
from qwen_tts import Qwen3TTSModel

# Load Darwin-TTS-1.7B-Cross (α=3% pre-blended)
model = Qwen3TTSModel.from_pretrained(
    "FINAL-Bench/Darwin-TTS-1.7B-Cross",
    device_map="cuda:0",
    dtype=torch.bfloat16
)

# Synthesize
wavs, sr = model.generate_voice_clone(
    text="안녕하세요, 저는 다윈 인공지능입니다!",  # "Hello, I am the Darwin AI!"
    ref_audio="your_voice.wav",
    ref_text="ref",
    x_vector_only_mode=True
)
```

### Option 2: Custom blend ratio (runtime blending)

A sketch of blending at your own ratio, assuming the LLM checkpoint is available locally and that `Qwen3TTSModel` exposes the standard torch `state_dict()`/`load_state_dict()` interface (adjust to the actual API):

```python
from safetensors.torch import load_file
from qwen_tts import Qwen3TTSModel

ALPHA = 0.05  # 3~5% is the stable range; 10%+ breaks generation

model = Qwen3TTSModel.from_pretrained("Qwen/Qwen3-TTS-12Hz-1.7B-Base")
llm_sd = load_file("Qwen3-1.7B/model.safetensors")  # adjust to your local path

# Lerp the 84 talker FFN tensors toward their LLM counterparts
sd = model.state_dict()
for n in range(28):
    for proj in ("gate_proj", "up_proj", "down_proj"):
        tts_key = f"talker.model.layers.{n}.mlp.{proj}.weight"
        llm_key = f"model.layers.{n}.mlp.{proj}.weight"
        llm_w = llm_sd[llm_key].to(sd[tts_key].dtype)
        sd[tts_key] = (1 - ALPHA) * sd[tts_key] + ALPHA * llm_w
model.load_state_dict(sd)

wavs, sr = model.generate_voice_clone(
    text="정말 기쁜 소식이에요!",  # "This is really happy news!"
    ref_audio="voice.wav", ref_text="ref", x_vector_only_mode=True
)
```

### CLI

```bash
python darwin_tts_blend.py --alpha 3 --text "Hello, Darwin!" --ref voice.wav --output speech.wav
```

## Installation

```bash
pip install torch qwen-tts safetensors soundfile huggingface_hub
```

## Research Background

### The Problem

Cross-modal capability transfer (e.g., adding emotion to TTS) traditionally requires:

- Thousands of hours of emotional speech data
- Hundreds of GPU hours for training
- Careful data curation and annotation

### The Darwin Approach

Darwin's evolutionary merge framework, originally developed for LLM merging (Darwin LLM V7 achieved GPQA Diamond 86.9%, World #5), is extended to cross-modal transfer:

1. **Find architecture-compatible models** across modalities (LLM ↔ TTS)
2. **Blend FFN weights** at low ratios (3~5%) using simple lerp
3. **Preserve modality-specific components** (audio codec, tokenizer)

### Key Findings

1. **Cross-modal FFN transfer works**: the LLM's language-understanding patterns enhance the TTS model's emotional expressiveness
2. **The sweet spot is 3~5%**: TTS is far more sensitive than LLM merging (which tolerates 7~93%)
3. **The same backbone is required**: TADA-1B (Llama backbone) × Qwen3-TTS failed completely; Qwen3 × Qwen3 succeeded
4. **10%+ destroys TTS**: the LLM's "continue generating tokens" pattern overrides the TTS stop signal, causing 655-second outputs
5. **Bidirectional potential**: blending TTS FFN weights into an LLM may enable a "Speaking LLM" (the GPT-4o direction)

### What Failed (and why it matters)

| Experiment | Why it failed | Lesson |
|-----------|-----------|--------|
| TADA-1B (Llama) × Qwen3-TTS | Different backbone (Llama vs Qwen3) | Same backbone required |
| FFN 100% replacement | Too aggressive | Low ratio (3~5%) needed |
| x_vector_only_mode=False | ref_text mismatch | Use x_vector_only_mode=True |
| α=10% blend | LLM "keep generating" pattern | TTS has a narrow tolerance |

### Novelty (Prior Art Survey)

| Approach | Training Required | Cross-Modal | Published |
|----------|:-:|:-:|:-:|
| LLM × LLM merging (TIES, DARE, SLERP) | No | No (same modality) | Many |
| TTS × TTS averaging (Murata 2024) | No | No (same modality) | INTERSPEECH 2024 |
| SmolTolk (adapter-based) | **Yes** (adapter training) | Yes | arXiv 2503.06211 |
| CSLM (fine-tuning) | **Yes** (continual pretraining) | Yes | arXiv 2604.11096 |
| GPT-4o (end-to-end) | **Yes** ($$$) | Yes | OpenAI 2024 |
| **Darwin-TTS (this work)** | **No** | **Yes** | **World's first** |

## Experimental Timeline (2026-04-15)

```
09:00  TTS hidden_size compatibility analysis → h=2048 group discovered
09:30  TADA-1B × Qwen3-TTS download + config analysis
10:00  Chimera v1 (FFN 100%) → failed (noise)
10:30  Environment setup (darwin-tts-venv, torch 2.6.0+cu124)
10:50  Original Qwen3-TTS synthesis verified
11:00  SLERP blend 10/20/30% build (TADA) → failed (different backbone)
11:30  Key insight: the Qwen3-1.7B LLM has an IDENTICAL architecture to the TTS talker!
12:00  Qwen3-1.7B download → config comparison → 5/5 parameters match!
12:15  α=1/3/5/10% LLM→TTS blending experiments
12:23  ✅ α=3% emotion appears, α=5% emotion intensified, α=10% broken
12:30  High-quality sample generation: 4 voice references × 3 blend ratios
13:00  Prior art survey → confirmed world's first
13:30  Darwin-TTS-1.7B-Cross (α=3%) final build + Hugging Face release
```

## Model Details

- **Model type**: Text-to-Speech (cross-modal FFN blended)
- **Base models**: Qwen3-TTS-1.7B-Base + Qwen3-1.7B (3% FFN)
- **Parameters**: ~2.1B
- **Languages**: Korean, English, Japanese, Chinese + 6 more
- **License**: Apache 2.0
- **Blend ratio**: α=0.03 (3%)
- **FFN tensors modified**: 84 / 976 total (8.6%)
- **Build time**: ~2 minutes (no training)

## Credits

**[VIDRAFT](https://vidraft.nwr)** (비드래프트): Darwin Evolutionary Merge Framework

- Darwin LLM V7: GPQA Diamond 86.9% (World #3)
- FINAL Bench: Text AGI benchmark
- 11 Pillar Technologies: AETHER, PROMETHEUS, HEPHAESTUS, Darwin, FINAL Bench, MARL, SiteAgent, 한지+한양, VDash, 인공사회, StealthMark

Built on [Qwen3-TTS-1.7B](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) and [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) by Alibaba Cloud (Apache 2.0).

## Related

- [Darwin-9B-Opus](https://huggingface.co/FINAL-Bench/Darwin-9B-Opus): Darwin LLM (GPQA Diamond 86.9%)
- [FINAL Bench](https://huggingface.co/FINAL-Bench): Text AGI Benchmark
- [Darwin Evolutionary Merge Framework](https://huggingface.co/FINAL-Bench): CMA-ES + FFN crossbreeding
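One way to sanity-check the "84 / 976 FFN tensors modified" figure in Model Details is to diff the blended checkpoint against the original base and confirm that only talker FFN keys changed. A sketch with toy dicts standing in for real state dicts (key names are the documented ones; the values are illustrative):

```python
def modified_keys(base_sd, blended_sd):
    """Return the keys whose values differ between two state dicts."""
    return sorted(k for k in base_sd if base_sd[k] != blended_sd[k])

# Toy stand-ins: one blended talker FFN tensor, one untouched codec tensor.
base = {
    "talker.model.layers.0.mlp.gate_proj.weight": [1.0, -2.0],
    "speech_tokenizer.encoder.weight": [0.5, 0.5],
}
blended = {
    "talker.model.layers.0.mlp.gate_proj.weight": [0.985, -1.82],
    "speech_tokenizer.encoder.weight": [0.5, 0.5],
}

changed = modified_keys(base, blended)
# In the real checkpoint, 84 of the 976 tensors (~8.6%) should appear
# here, all of them under talker.model.layers.*.mlp.*
```

The same check run in reverse (no keys outside the talker's MLPs changed) verifies that the audio codec pipeline is byte-identical to the base model.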