---
language:
- ko
- en
- ja
- zh
- de
- fr
- ru
- pt
- es
- it
license: apache-2.0
tags:
- tts
- text-to-speech
- darwin
- cross-modal
- ffn-blending
- model-merging
- qwen3
- voice-cloning
- emotion
- vidraft
base_model:
- Qwen/Qwen3-TTS-12Hz-1.7B-Base
- Qwen/Qwen3-1.7B
pipeline_tag: text-to-speech
---

# 🧬 Darwin-TTS-1.7B-Cross

**World's first cross-modal FFN transfer from an LLM to a TTS model: emotion-enhanced speech synthesis without any training.**

> Darwin-TTS blends 3% of the [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) (LLM) FFN weights into the talker module of [Qwen3-TTS-1.7B](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) (TTS). No training, no data, no GPU hours: just weight-space arithmetic.

## Key Discovery

| Blend (α) | Emotion | Quality | Status |
|-----------|---------|---------|--------|
| 0% | Baseline | Normal | Original Qwen3-TTS |
| 1% | No change | Normal | Too subtle |
| **3%** | **Emotion appears** | **Normal** | **★ This model (default)** |
| 5% | Emotion intensified | Normal | ★★ Max stable |
| 10% | Broken | Failed | Infinite generation |

## Why It Works

Qwen3-1.7B (the LLM) and the Qwen3-TTS-1.7B talker share a **100% identical architecture**:

```
                      Qwen3-1.7B (LLM)   Qwen3-TTS talker   Match
hidden_size           2048               2048               ✅
intermediate_size     6144               6144               ✅
num_hidden_layers     28                 28                 ✅
num_attention_heads   16                 16                 ✅
num_key_value_heads   8                  8                  ✅
```

This means **zero SVD, zero truncation, zero layer mapping**: pure 1:1 lerp blending across all 84 FFN tensors (gate_proj, up_proj, down_proj × 28 layers).
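The 1:1 lerp above is nothing more than elementwise linear interpolation. A minimal pure-Python sketch of the blend arithmetic (plain lists stand in for the real bfloat16 tensors, so this is illustrative only):

```python
ALPHA = 0.03  # 3% LLM contribution, this model's default

def lerp_ffn(tts_weight, llm_weight, alpha=ALPHA):
    """Blend one FFN tensor: (1 - alpha) * TTS + alpha * LLM.

    Shapes must already match, which the identical architecture
    guarantees -- no SVD, no truncation, no layer remapping.
    """
    assert len(tts_weight) == len(llm_weight)
    return [(1 - alpha) * t + alpha * l
            for t, l in zip(tts_weight, llm_weight)]

# Two toy weight values blended at alpha = 3%
blended = lerp_ffn([1.0, -2.0], [0.5, 4.0])
```

At α=3% the TTS weights dominate (97%), which is why the audio quality stays intact while a small amount of LLM structure leaks in.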
## Architecture

```
Qwen3-TTS-1.7B (4-module structure):
┌──────────────────────────────────────────────────────┐
│ talker (28L Qwen3 LM backbone)                       │
│   └── 84 FFN tensors blended with LLM (α=3%)         │ ← MODIFIED
│   └── talker.model.layers.N.mlp.{gate,up,down}       │
├──────────────────────────────────────────────────────┤
│ code_predictor (5L, h=1024)                          │ ← UNTOUCHED
├──────────────────────────────────────────────────────┤
│ speech_tokenizer (12Hz RVQ codec)                    │ ← UNTOUCHED
├──────────────────────────────────────────────────────┤
│ encoder/decoder (audio waveform)                     │ ← UNTOUCHED
└──────────────────────────────────────────────────────┘

FFN Source: Qwen3-1.7B (LLM)
  └── model.layers.N.mlp.{gate,up,down}_proj.weight
  └── Key mapping: model.layers.N → talker.model.layers.N (1:1)
```

Only the talker's FFN weights are modified. The code_predictor, speech_tokenizer, and encoder/decoder remain 100% original, preserving the audio codec pipeline entirely.
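The key mapping above is purely mechanical, so the full set of (LLM key, talker key) pairs can be enumerated without inspecting any weights. A short sketch using the tensor names documented above:

```python
def ffn_key_pairs(num_layers=28):
    """Enumerate the 84 blended FFN tensors: 3 projections x 28 layers.

    Each LLM key model.layers.N.mlp.{proj}.weight maps 1:1 onto the
    talker key talker.model.layers.N.mlp.{proj}.weight.
    """
    pairs = []
    for n in range(num_layers):
        for proj in ("gate_proj", "up_proj", "down_proj"):
            pairs.append((
                f"model.layers.{n}.mlp.{proj}.weight",
                f"talker.model.layers.{n}.mlp.{proj}.weight",
            ))
    return pairs

pairs = ffn_key_pairs()  # 84 pairs for the 28-layer talker
```

Everything outside this key list (code_predictor, speech_tokenizer, encoder/decoder) is copied through untouched.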
## Quick Start

### Option 1: Load pre-blended weights (this model)

```python
import torch
from qwen_tts import Qwen3TTSModel

# Load Darwin-TTS-1.7B-Cross (α=3% pre-blended)
model = Qwen3TTSModel.from_pretrained(
    "FINAL-Bench/Darwin-TTS-1.7B-Cross",
    device_map="cuda:0",
    dtype=torch.bfloat16
)

# Synthesize
wavs, sr = model.generate_voice_clone(
    text="안녕하세요, 저는 다윈 인공지능입니다!",  # "Hello, I am the Darwin AI!"
    ref_audio="your_voice.wav",
    ref_text="ref",
    x_vector_only_mode=True
)
```

### Option 2: Custom blend ratio (runtime blending)

A sketch of blending at your own ratio, assuming the LLM checkpoint is available locally and that `Qwen3TTSModel` exposes the standard torch `state_dict()`/`load_state_dict()` interface (adjust to the actual API):

```python
from safetensors.torch import load_file
from qwen_tts import Qwen3TTSModel

ALPHA = 0.05  # 3~5% is the stable range; 10%+ breaks generation

model = Qwen3TTSModel.from_pretrained("Qwen/Qwen3-TTS-12Hz-1.7B-Base")
llm_sd = load_file("Qwen3-1.7B/model.safetensors")  # adjust to your local path

# Lerp the 84 talker FFN tensors toward their LLM counterparts
sd = model.state_dict()
for n in range(28):
    for proj in ("gate_proj", "up_proj", "down_proj"):
        tts_key = f"talker.model.layers.{n}.mlp.{proj}.weight"
        llm_key = f"model.layers.{n}.mlp.{proj}.weight"
        llm_w = llm_sd[llm_key].to(sd[tts_key].dtype)
        sd[tts_key] = (1 - ALPHA) * sd[tts_key] + ALPHA * llm_w
model.load_state_dict(sd)

wavs, sr = model.generate_voice_clone(
    text="정말 기쁜 소식이에요!",  # "This is really happy news!"
    ref_audio="voice.wav", ref_text="ref", x_vector_only_mode=True
)
```

### CLI

```bash
python darwin_tts_blend.py --alpha 3 --text "Hello, Darwin!" --ref voice.wav --output speech.wav
```

## Installation

```bash
pip install torch qwen-tts safetensors soundfile huggingface_hub
```

## Research Background

### The Problem

Cross-modal capability transfer (e.g., adding emotion to TTS) traditionally requires:

- Thousands of hours of emotional speech data
- Hundreds of GPU hours for training
- Careful data curation and annotation

### The Darwin Approach

Darwin's evolutionary merge framework, originally developed for LLM merging (Darwin LLM V7 achieved GPQA Diamond 86.9%, World #5), is extended to cross-modal transfer:

1. **Find architecture-compatible models** across modalities (LLM ↔ TTS)
2. **Blend FFN weights** at low ratios (3~5%) using simple lerp
3. **Preserve modality-specific components** (audio codec, tokenizer)

### Key Findings

1. **Cross-modal FFN transfer works**: the LLM's language-understanding patterns enhance the TTS model's emotional expressiveness
2. **The sweet spot is 3~5%**: TTS is far more sensitive than LLM merging (which tolerates 7~93%)
3. **The same backbone is required**: TADA-1B (Llama backbone) × Qwen3-TTS failed completely; Qwen3 × Qwen3 succeeded
4. **10%+ destroys TTS**: the LLM's "continue generating tokens" pattern overrides the TTS stop signal, causing 655-second outputs
5. **Bidirectional potential**: blending TTS FFN weights into an LLM may enable a "Speaking LLM" (the GPT-4o direction)

### What Failed (and why it matters)

| Experiment | Why it failed | Lesson |
|-----------|-----------|--------|
| TADA-1B (Llama) × Qwen3-TTS | Different backbone (Llama vs Qwen3) | Same backbone required |
| FFN 100% replacement | Too aggressive | Low ratio (3~5%) needed |
| x_vector_only_mode=False | ref_text mismatch | Use x_vector_only_mode=True |
| α=10% blend | LLM "keep generating" pattern | TTS has a narrow tolerance |

### Novelty (Prior Art Survey)

| Approach | Training Required | Cross-Modal | Published |
|----------|:-:|:-:|:-:|
| LLM × LLM merging (TIES, DARE, SLERP) | No | No (same modality) | Many |
| TTS × TTS averaging (Murata 2024) | No | No (same modality) | INTERSPEECH 2024 |
| SmolTolk (adapter-based) | **Yes** (adapter training) | Yes | arXiv 2503.06211 |
| CSLM (fine-tuning) | **Yes** (continual pretraining) | Yes | arXiv 2604.11096 |
| GPT-4o (end-to-end) | **Yes** ($$$) | Yes | OpenAI 2024 |
| **Darwin-TTS (this work)** | **No** | **Yes** | **World's first** |

## Experimental Timeline (2026-04-15)

```
09:00  TTS hidden_size compatibility analysis → h=2048 group discovered
09:30  TADA-1B × Qwen3-TTS download + config analysis
10:00  Chimera v1 (FFN 100%) → failed (noise)
10:30  Environment setup (darwin-tts-venv, torch 2.6.0+cu124)
10:50  Original Qwen3-TTS synthesis verified
11:00  SLERP blend 10/20/30% build (TADA) → failed (different backbone)
11:30  Key insight: the Qwen3-1.7B LLM has an IDENTICAL architecture to the TTS talker!
12:00  Qwen3-1.7B download → config comparison → 5/5 parameters match!
12:15  α=1/3/5/10% LLM→TTS blending experiments
12:23  ✅ α=3% emotion appears, α=5% emotion intensified, α=10% broken
12:30  High-quality sample generation: 4 voice references × 3 blend ratios
13:00  Prior art survey → confirmed world's first
13:30  Darwin-TTS-1.7B-Cross (α=3%) final build + Hugging Face release
```

## Model Details

- **Model type**: Text-to-Speech (cross-modal FFN blended)
- **Base models**: Qwen3-TTS-1.7B-Base + Qwen3-1.7B (3% FFN)
- **Parameters**: ~2.1B
- **Languages**: Korean, English, Japanese, Chinese + 6 more
- **License**: Apache 2.0
- **Blend ratio**: α=0.03 (3%)
- **FFN tensors modified**: 84 / 976 total (8.6%)
- **Build time**: ~2 minutes (no training)

## Credits

**[VIDRAFT](https://vidraft.nwr)** (비드래프트): Darwin Evolutionary Merge Framework

- Darwin LLM V7: GPQA Diamond 86.9% (World #3)
- FINAL Bench: Text AGI benchmark
- 11 Pillar Technologies: AETHER, PROMETHEUS, HEPHAESTUS, Darwin, FINAL Bench, MARL, SiteAgent, 한지+한양, VDash, 인공사회, StealthMark

Built on [Qwen3-TTS-1.7B](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) and [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) by Alibaba Cloud (Apache 2.0).

## Related

- [Darwin-9B-Opus](https://huggingface.co/FINAL-Bench/Darwin-9B-Opus): Darwin LLM (GPQA Diamond 86.9%)
- [FINAL Bench](https://huggingface.co/FINAL-Bench): Text AGI Benchmark
- [Darwin Evolutionary Merge Framework](https://huggingface.co/FINAL-Bench): CMA-ES + FFN crossbreeding
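One way to sanity-check the "84 / 976 FFN tensors modified" figure in Model Details is to diff the blended checkpoint against the original base and confirm that only talker FFN keys changed. A sketch with toy dicts standing in for real state dicts (key names are the documented ones; the values are illustrative):

```python
def modified_keys(base_sd, blended_sd):
    """Return the keys whose values differ between two state dicts."""
    return sorted(k for k in base_sd if base_sd[k] != blended_sd[k])

# Toy stand-ins: one blended talker FFN tensor, one untouched codec tensor.
base = {
    "talker.model.layers.0.mlp.gate_proj.weight": [1.0, -2.0],
    "speech_tokenizer.encoder.weight": [0.5, 0.5],
}
blended = {
    "talker.model.layers.0.mlp.gate_proj.weight": [0.985, -1.82],
    "speech_tokenizer.encoder.weight": [0.5, 0.5],
}

changed = modified_keys(base, blended)
# In the real checkpoint, 84 of the 976 tensors (~8.6%) should appear
# here, all of them under talker.model.layers.*.mlp.*
```

The same check run in reverse (no keys outside the talker's MLPs changed) verifies that the audio codec pipeline is byte-identical to the base model.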