---
language:
- ko
- en
- ja
- zh
- de
- fr
- ru
- pt
- es
- it
license: apache-2.0
tags:
- tts
- text-to-speech
- darwin
- cross-modal
- ffn-blending
- model-merging
- qwen3
- voice-cloning
- emotion
- vidraft
base_model:
- Qwen/Qwen3-TTS-12Hz-1.7B-Base
- Qwen/Qwen3-1.7B
pipeline_tag: text-to-speech
---

# 🧬 Darwin-TTS-1.7B-Cross

**World's first cross-modal FFN transfer from LLM to TTS: emotion-enhanced speech synthesis without any training.**

> Darwin-TTS blends 3% of [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) (LLM) FFN weights into the talker module of [Qwen3-TTS-1.7B](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) (TTS). No training, no data, no GPU hours: just weight-space arithmetic.

## Key Discovery

| Blend (α) | Emotion | Quality | Status |
|-----------|---------|---------|--------|
| 0% | Baseline | Normal | Original Qwen3-TTS |
| 1% | No change | Normal | Too subtle |
| **3%** | **Emotion appears** | **Normal** | **✅ This model (default)** |
| 5% | Emotion intensified | Normal | ✅✅ Max stable |
| 10% | Broken | Failed | Infinite generation |

## Why It Works

Qwen3-1.7B (LLM) and the Qwen3-TTS-1.7B talker share a **100% identical architecture**:

```
                      Qwen3-1.7B (LLM)   Qwen3-TTS talker   Match
hidden_size                 2048               2048           ✅
intermediate_size           6144               6144           ✅
num_hidden_layers             28                 28           ✅
num_attention_heads           16                 16           ✅
num_key_value_heads            8                  8           ✅
```

This means **zero SVD, zero truncation, zero layer mapping**: pure 1:1 lerp blending across all 84 FFN tensors (gate_proj, up_proj, down_proj × 28 layers).
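Because every tensor pair has an identical shape, the blend reduces to a single linear interpolation per tensor. A minimal sketch with toy tensors (the function and variable names are illustrative, not from the release scripts):

```python
import torch

def lerp_ffn(w_tts: torch.Tensor, w_llm: torch.Tensor, alpha: float = 0.03) -> torch.Tensor:
    """Blend one FFN tensor: (1 - alpha) * TTS + alpha * LLM.

    Identical shapes mean no SVD, truncation, or layer remapping is needed.
    """
    assert w_tts.shape == w_llm.shape, "backbones must match exactly"
    return (1.0 - alpha) * w_tts + alpha * w_llm

# Toy stand-ins for one gate_proj weight (real shape: 6144 x 2048)
w_tts = torch.zeros(4, 4)
w_llm = torch.ones(4, 4)
blended = lerp_ffn(w_tts, w_llm, alpha=0.03)
print(blended[0, 0].item())  # ~0.03: a 3% pull toward the LLM weights
```

At α=0 the talker is unchanged; at α=1 its FFNs would be fully replaced by the LLM's, which is the "FFN 100% replacement" failure mode described below.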

## Architecture

```
Qwen3-TTS-1.7B (4-module structure):
┌──────────────────────────────────────────────────────┐
│ talker (28L Qwen3 LM backbone)                       │
│  ├── 84 FFN tensors blended with LLM (α=3%)          │ ← MODIFIED
│  └── talker.model.layers.N.mlp.{gate,up,down}        │
├──────────────────────────────────────────────────────┤
│ code_predictor (5L, h=1024)                          │ ← UNTOUCHED
├──────────────────────────────────────────────────────┤
│ speech_tokenizer (12Hz RVQ codec)                    │ ← UNTOUCHED
├──────────────────────────────────────────────────────┤
│ encoder/decoder (audio waveform)                     │ ← UNTOUCHED
└──────────────────────────────────────────────────────┘

FFN Source: Qwen3-1.7B (LLM)
├── model.layers.N.mlp.{gate,up,down}_proj.weight
└── Key mapping: model.layers.N → talker.model.layers.N (1:1)
```

Only the talker's FFN weights are modified. The code_predictor, speech_tokenizer, and encoder/decoder remain 100% original, preserving the audio codec pipeline entirely.
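The 1:1 key mapping above can be sketched as a plain dictionary (a hypothetical illustration; the actual build script is not published in this repo):

```python
# Hypothetical sketch of the 1:1 FFN key mapping described above.
PROJS = ("gate_proj", "up_proj", "down_proj")

def ffn_key_mapping(num_layers: int = 28) -> dict:
    """Map each LLM FFN tensor name to its talker counterpart, layer N -> layer N."""
    mapping = {}
    for n in range(num_layers):
        for proj in PROJS:
            src = f"model.layers.{n}.mlp.{proj}.weight"
            dst = f"talker.model.layers.{n}.mlp.{proj}.weight"
            mapping[src] = dst
    return mapping

mapping = ffn_key_mapping()
print(len(mapping))  # 84 tensors: 3 projections x 28 layers
```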

## Quick Start

### Option 1: Load pre-blended weights (this model)

```python
import torch
from qwen_tts import Qwen3TTSModel

# Load Darwin-TTS-1.7B-Cross (α=3% pre-blended)
model = Qwen3TTSModel.from_pretrained(
    "FINAL-Bench/Darwin-TTS-1.7B-Cross",
    device_map="cuda:0",
    dtype=torch.bfloat16
)

# Synthesize (Korean: "Hello, I am Darwin AI!")
wavs, sr = model.generate_voice_clone(
    text="안녕하세요, 저는 다윈 인공지능입니다!",
    ref_audio="your_voice.wav",
    ref_text="ref",
    x_vector_only_mode=True
)
```

### Option 2: Custom blend ratio (runtime blending)

A minimal sketch of blending at your own α (the module attribute paths on the `qwen_tts` model object are assumptions and may differ by version):

```python
import torch
from transformers import AutoModelForCausalLM
from qwen_tts import Qwen3TTSModel

alpha = 0.05  # custom blend ratio (5% was the max stable value observed)

# Start from the original TTS model and the donor LLM
tts = Qwen3TTSModel.from_pretrained("Qwen/Qwen3-TTS-12Hz-1.7B-Base")
llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")

# Lerp the 84 talker FFN tensors in place: (1 - alpha) * TTS + alpha * LLM
for tts_layer, llm_layer in zip(tts.model.talker.model.layers, llm.model.layers):
    for proj in ("gate_proj", "up_proj", "down_proj"):
        w = getattr(tts_layer.mlp, proj).weight.data
        w_llm = getattr(llm_layer.mlp, proj).weight.data.to(w.dtype)
        w.mul_(1.0 - alpha).add_(alpha * w_llm)
```

### CLI

```bash
python darwin_tts_blend.py --alpha 3 --text "Hello, Darwin!" --ref voice.wav --output speech.wav
```

## Installation

```bash
pip install torch qwen-tts safetensors soundfile huggingface_hub
```

## Research Background

### The Problem
Cross-modal capability transfer (e.g., adding emotion to TTS) traditionally requires:
- Thousands of hours of emotional speech data
- Hundreds of GPU hours for training
- Careful data curation and annotation

### The Darwin Approach
Darwin's evolutionary merge framework, originally developed for LLM merging (Darwin LLM V7 achieved GPQA Diamond 86.9%, World #5), is extended to cross-modal transfer:

1. **Find architecture-compatible models** across modalities (LLM → TTS)
2. **Blend FFN weights** at low ratios (3~5%) using simple lerp
3. **Preserve modality-specific components** (audio codec, tokenizer)
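Step 1 above can be sketched as a simple compatibility gate. The values are hardcoded from the comparison table in "Why It Works"; in practice they would be read from each model's `config.json`, and the TADA-1B entry here is a hypothetical stand-in for a mismatched Llama backbone:

```python
# Fields that determine FFN tensor shapes; all must match for a 1:1 lerp.
KEYS = ("hidden_size", "intermediate_size", "num_hidden_layers",
        "num_attention_heads", "num_key_value_heads")

def compatible(cfg_a: dict, cfg_b: dict) -> bool:
    """True only when every shape-determining field is identical."""
    return all(cfg_a[k] == cfg_b[k] for k in KEYS)

qwen3_llm = {"hidden_size": 2048, "intermediate_size": 6144,
             "num_hidden_layers": 28, "num_attention_heads": 16,
             "num_key_value_heads": 8}
qwen3_tts_talker = dict(qwen3_llm)               # identical, per the table above
tada_1b = dict(qwen3_llm, num_hidden_layers=16)  # hypothetical mismatched config

print(compatible(qwen3_llm, qwen3_tts_talker))  # True
print(compatible(qwen3_llm, tada_1b))           # False
```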

### Key Findings

1. **Cross-modal FFN transfer works**: the LLM's language-understanding patterns enhance TTS emotional expressiveness
2. **Sweet spot is 3~5%**: TTS is far more sensitive than LLM merging (which tolerates 7~93%)
3. **Same backbone is required**: TADA-1B (Llama backbone) × Qwen3-TTS failed completely; Qwen3 × Qwen3 succeeded
4. **10%+ destroys TTS**: the LLM's "continue generating tokens" pattern overrides the TTS stop signal, causing 655-second outputs
5. **Bidirectional potential**: blending TTS FFN into an LLM may enable a "Speaking LLM" (the GPT-4o direction)

### What Failed (and why it matters)

| Experiment | Why It Failed | Lesson |
|-----------|-----------|--------|
| TADA-1B (Llama) × Qwen3-TTS | Different backbone (Llama vs Qwen3) | Same backbone required |
| FFN 100% replacement | Too aggressive | Low ratio (3~5%) needed |
| x_vector_only_mode=False | ref_text mismatch | Use x_vector_only_mode=True |
| α=10% blend | LLM "keep generating" pattern | TTS has narrow tolerance |

### Novelty (Prior Art Survey)

| Approach | Training Required | Cross-Modal | Published |
|----------|:-:|:-:|:-:|
| LLM × LLM merging (TIES, DARE, SLERP) | No | No (same modal) | Many |
| TTS × TTS averaging (Murata 2024) | No | No (same modal) | INTERSPEECH 2024 |
| SmolTolk (adapter-based) | **Yes** (adapter training) | Yes | arXiv 2503.06211 |
| CSLM (fine-tuning) | **Yes** (continual pretraining) | Yes | arXiv 2604.11096 |
| GPT-4o (end-to-end) | **Yes** ($$$) | Yes | OpenAI 2024 |
| **Darwin-TTS (this work)** | **No** | **Yes** | **World's First** |

## Experimental Timeline (2026-04-15)

```
09:00  TTS hidden_size compatibility analysis → h=2048 group discovered
09:30  TADA-1B × Qwen3-TTS download + config analysis
10:00  Chimera v1 (FFN 100%) → failed (noise)
10:30  Environment setup (darwin-tts-venv, torch 2.6.0+cu124)
10:50  Original Qwen3-TTS synthesis verified
11:00  SLERP blend 10/20/30% build (TADA) → failed (different backbone)
11:30  Key insight: Qwen3-1.7B LLM has IDENTICAL architecture to the TTS talker!
12:00  Qwen3-1.7B download → config comparison → 5/5 parameters match!
12:15  α=1/3/5/10% LLM→TTS blending experiments
12:23  ✅ α=3% emotion appears, α=5% emotion intensified, α=10% broken
12:30  4 voice references × 3 blend ratios: high-quality sample generation
13:00  Prior art survey → confirmed world's first
13:30  Darwin-TTS-1.7B-Cross (α=3%) final build + HuggingFace release
```

## Model Details

- **Model type**: Text-to-Speech (cross-modal FFN blended)
- **Base models**: Qwen3-TTS-1.7B-Base + Qwen3-1.7B (3% FFN)
- **Parameters**: ~2.1B
- **Languages**: Korean, English, Japanese, Chinese + 6 more
- **License**: Apache 2.0
- **Blend ratio**: α=0.03 (3%)
- **FFN tensors modified**: 84 / 976 total (8.6%)
- **Build time**: ~2 minutes (no training)

## Credits

**[VIDRAFT](https://vidraft.nwr)**: Darwin Evolutionary Merge Framework

- Darwin LLM V7: GPQA Diamond 86.9% (World #5)
- FINAL Bench: Text AGI benchmark
- 11 Pillar Technologies: AETHER, PROMETHEUS, HEPHAESTUS, Darwin, FINAL Bench, MARL, SiteAgent, Quality+Innovation, VDash, Artificial Society, StealthMark

Built on [Qwen3-TTS-1.7B](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) and [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) by Alibaba Cloud (Apache 2.0).

## Related

- [Darwin-9B-Opus](https://huggingface.co/FINAL-Bench/Darwin-9B-Opus): Darwin LLM (GPQA Diamond 86.9%)
- [FINAL Bench](https://huggingface.co/FINAL-Bench): Text AGI Benchmark
- [Darwin Evolutionary Merge Framework](https://huggingface.co/FINAL-Bench): CMA-ES + FFN crossbreeding