---
language:
  - ko
  - en
  - ja
  - zh
  - de
  - fr
  - ru
  - pt
  - es
  - it
license: apache-2.0
tags:
  - tts
  - text-to-speech
  - darwin
  - cross-modal
  - ffn-blending
  - model-merging
  - qwen3
  - voice-cloning
  - emotion
  - vidraft
base_model:
  - Qwen/Qwen3-TTS-12Hz-1.7B-Base
  - Qwen/Qwen3-1.7B
pipeline_tag: text-to-speech
---

# 🧬 Darwin-TTS-1.7B-Cross

**World's first cross-modal FFN transfer from LLM to TTS: emotion-enhanced speech synthesis without any training.**

Darwin-TTS blends 3% of the FFN weights of Qwen3-1.7B (an LLM) into the talker module of Qwen3-TTS-1.7B. No training, no data, no GPU hours: just weight-space arithmetic.

## Key Discovery

| Blend (α) | Emotion | Quality | Status |
|-----------|---------|---------|--------|
| 0% | Baseline | Normal | Original Qwen3-TTS |
| 1% | No change | Normal | Too subtle |
| 3% | Emotion appears | Normal | ★ This model (default) |
| 5% | Emotion intensified | Normal | ★★ Max stable |
| 10% | Broken | Failed | Infinite generation |

## Why It Works

Qwen3-1.7B (the LLM) and the talker module of Qwen3-TTS-1.7B share a 100% identical architecture:

| Parameter | Qwen3-1.7B (LLM) | Qwen3-TTS talker | Match |
|-----------|------------------|------------------|-------|
| hidden_size | 2048 | 2048 | ✅ |
| intermediate_size | 6144 | 6144 | ✅ |
| num_hidden_layers | 28 | 28 | ✅ |
| num_attention_heads | 16 | 16 | ✅ |
| num_key_value_heads | 8 | 8 | ✅ |
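The compatibility check behind this table is a field-by-field config comparison. A minimal sketch, using the values above as plain dicts (an actual check would parse each model's `config.json`, e.g. downloaded via `huggingface_hub`):

```python
# Config values copied from the table above, as plain dicts.
llm_config = {
    "hidden_size": 2048,
    "intermediate_size": 6144,
    "num_hidden_layers": 28,
    "num_attention_heads": 16,
    "num_key_value_heads": 8,
}
talker_config = {
    "hidden_size": 2048,
    "intermediate_size": 6144,
    "num_hidden_layers": 28,
    "num_attention_heads": 16,
    "num_key_value_heads": 8,
}

# Every field must match before a 1:1 FFN blend is possible.
mismatches = {k: (v, talker_config[k])
              for k, v in llm_config.items() if talker_config[k] != v}
print("compatible" if not mismatches else f"incompatible: {mismatches}")
```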

This means zero SVD, zero truncation, zero layer mapping: pure 1:1 lerp blending across all 84 FFN tensors (gate_proj, up_proj, down_proj × 28 layers).
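The blend itself is one multiply-add per weight. A toy sketch of the lerp with plain Python floats (the real build applies the same formula elementwise to 84 bfloat16 tensors):

```python
# W_blend = (1 - alpha) * W_tts + alpha * W_llm, applied elementwise.
ALPHA = 0.03  # the 3% ratio shipped in this model

def lerp_weights(w_tts, w_llm, alpha=ALPHA):
    """Linearly interpolate flat weight lists toward the LLM weights."""
    return [(1 - alpha) * t + alpha * l for t, l in zip(w_tts, w_llm)]

# Toy stand-ins for one FFN tensor's values
w_tts = [1.0, -0.5, 0.25]
w_llm = [0.0, 0.5, -0.25]
print(lerp_weights(w_tts, w_llm))  # each value shifts 3% toward the LLM
```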

## Architecture

```text
Qwen3-TTS-1.7B (4-module structure):
┌─────────────────────────────────────────────────────┐
│ talker (28L Qwen3 LM backbone)                      │
│   └── 84 FFN tensors blended with LLM (α=3%)        │ ← MODIFIED
│       └── talker.model.layers.N.mlp.{gate,up,down}  │
├─────────────────────────────────────────────────────┤
│ code_predictor (5L, h=1024)                         │ ← UNTOUCHED
├─────────────────────────────────────────────────────┤
│ speech_tokenizer (12Hz RVQ codec)                   │ ← UNTOUCHED
├─────────────────────────────────────────────────────┤
│ encoder/decoder (audio waveform)                    │ ← UNTOUCHED
└─────────────────────────────────────────────────────┘

FFN Source: Qwen3-1.7B (LLM)
└── model.layers.N.mlp.{gate,up,down}_proj.weight
    └── Key mapping: model.layers.N → talker.model.layers.N (1:1)
```

Only the talker's FFN weights are modified. The code_predictor, speech_tokenizer, and encoder/decoder remain 100% original, preserving the audio codec pipeline entirely.
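The 1:1 key mapping above can be enumerated directly. A small sketch (pure Python, no model download) that generates the 84 source/target key pairs the build rewrites:

```python
# 28 layers x 3 projections = 84 FFN tensors, remapped 1:1 from the
# LLM key namespace into the talker key namespace.
NUM_LAYERS = 28
PROJECTIONS = ("gate_proj", "up_proj", "down_proj")

key_map = {
    f"model.layers.{n}.mlp.{proj}.weight":
        f"talker.model.layers.{n}.mlp.{proj}.weight"
    for n in range(NUM_LAYERS)
    for proj in PROJECTIONS
}

print(len(key_map))  # 84
print(key_map["model.layers.0.mlp.gate_proj.weight"])
```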

## Quick Start

### Option 1: Load pre-blended weights (this model)

```python
import torch
from qwen_tts import Qwen3TTSModel

# Load Darwin-TTS-1.7B-Cross (α=3% pre-blended)
model = Qwen3TTSModel.from_pretrained(
    "FINAL-Bench/Darwin-TTS-1.7B-Cross",
    device_map="cuda:0",
    dtype=torch.bfloat16,
)

# Synthesize
wavs, sr = model.generate_voice_clone(
    text="안녕하세요, 저는 다윈 인공지능입니다!",  # "Hello, I am the Darwin AI!"
    ref_audio="your_voice.wav",
    ref_text="ref",
    x_vector_only_mode=True,
)
```

### Option 2: Custom blend ratio

The released checkpoint is fixed at α=3%. To try a different ratio, rebuild the weights with the blend script (see the CLI below), then load the result the same way:

```python
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained("FINAL-Bench/Darwin-TTS-1.7B-Cross")
wavs, sr = model.generate_voice_clone(
    text="정말 기쁜 소식이에요!",  # "Such happy news!"
    ref_audio="voice.wav",
    ref_text="ref",
    x_vector_only_mode=True,
)
```

## CLI

```bash
python darwin_tts_blend.py --alpha 3 --text "Hello, Darwin!" --ref voice.wav --output speech.wav
```

## Installation

```bash
pip install torch qwen-tts safetensors soundfile huggingface_hub
```

## Research Background

### The Problem

Cross-modal capability transfer (e.g., adding emotion to TTS) traditionally requires:

- Thousands of hours of emotional speech data
- Hundreds of GPU hours for training
- Careful data curation and annotation

### The Darwin Approach

Darwin's evolutionary merge framework, originally developed for LLM merging (Darwin LLM V7 achieved GPQA Diamond 86.9%, World #5), is extended to cross-modal transfer:

1. Find architecture-compatible models across modalities (LLM ↔ TTS)
2. Blend FFN weights at low ratios (3~5%) using simple lerp
3. Preserve modality-specific components (audio codec, tokenizer)

### Key Findings

1. **Cross-modal FFN transfer works**: the LLM's language-understanding patterns enhance the TTS model's emotional expressiveness
2. **Sweet spot is 3~5%**: TTS is far more sensitive than LLM merging (which tolerates 7~93%)
3. **Same backbone is required**: TADA-1B (Llama backbone) × Qwen3-TTS failed completely; Qwen3 × Qwen3 succeeded
4. **10%+ destroys TTS**: the LLM's "continue generating tokens" pattern overrides the TTS stop signal, causing 655-second outputs
5. **Bidirectional potential**: LLM + TTS FFN may enable a "Speaking LLM" (the GPT-4o direction)

### What Failed (and why it matters)

| Experiment | Why it failed | Lesson |
|------------|---------------|--------|
| TADA-1B (Llama) × Qwen3-TTS | Different backbone (Llama vs Qwen3) | Same backbone required |
| FFN 100% replacement | Too aggressive | Low ratio (3~5%) needed |
| x_vector_only_mode=False | ref_text mismatch | Use x_vector_only_mode=True |
| α=10% blend | LLM "keep generating" pattern | TTS has narrow tolerance |

### Novelty (Prior Art Survey)

| Approach | Training required | Cross-modal | Published |
|----------|-------------------|-------------|-----------|
| LLM × LLM merging (TIES, DARE, SLERP) | No | No (same modality) | Many |
| TTS × TTS averaging (Murata 2024) | No | No (same modality) | INTERSPEECH 2024 |
| SmolTolk (adapter-based) | Yes (adapter training) | Yes | arXiv 2503.06211 |
| CSLM (fine-tuning) | Yes (continual pretraining) | Yes | arXiv 2604.11096 |
| Darwin-TTS (this work) | No | Yes | World's first |

## Experimental Timeline (2026-04-15)

```text
09:00  TTS hidden_size compatibility analysis → h=2048 group discovered
09:30  TADA-1B × Qwen3-TTS download + config analysis
10:00  Chimera v1 (FFN 100%) → failed (noise)
10:30  Environment setup (darwin-tts-venv, torch 2.6.0+cu124)
10:50  Original Qwen3-TTS synthesis verified
11:00  SLERP blend 10/20/30% build (TADA) → failed (different backbone)
11:30  Key insight: Qwen3-1.7B LLM has an IDENTICAL architecture to the TTS talker!
12:00  Qwen3-1.7B download → config comparison → 5/5 parameters match!
12:15  α=1/3/5/10% LLM→TTS blending experiments
12:23  ✅ α=3% emotion appears, α=5% emotion intensified, α=10% broken
12:30  4 voice references × 3 blend ratios: high-quality sample generation
13:00  Prior art survey → confirmed world's first
13:30  Darwin-TTS-1.7B-Cross (α=3%) final build + HuggingFace release
```

## Model Details

- **Model type**: Text-to-Speech (cross-modal FFN blended)
- **Base models**: Qwen3-TTS-1.7B-Base + Qwen3-1.7B (3% FFN)
- **Parameters**: ~2.1B
- **Languages**: Korean, English, Japanese, Chinese + 6 more
- **License**: Apache 2.0
- **Blend ratio**: α=0.03 (3%)
- **FFN tensors modified**: 84 / 976 total (8.6%)
- **Build time**: ~2 minutes (no training)

## Credits

**VIDRAFT (비드래프트)**: Darwin Evolutionary Merge Framework

- Darwin LLM V7: GPQA Diamond 86.9% (World #3)
- FINAL Bench: Text AGI benchmark
- 11 Pillar Technologies: AETHER, PROMETHEUS, HEPHAESTUS, Darwin, FINAL Bench, MARL, SiteAgent, 한지+한양, VDash, 인공사회, StealthMark

Built on Qwen3-TTS-1.7B and Qwen3-1.7B by Alibaba Cloud (Apache 2.0).

## Related