---
language:
  - ko
  - en
  - ja
  - zh
  - de
  - fr
  - ru
  - pt
  - es
  - it
license: apache-2.0
tags:
  - tts
  - text-to-speech
  - darwin
  - cross-modal
  - ffn-blending
  - model-merging
  - qwen3
  - voice-cloning
  - emotion
  - vidraft
base_model:
  - Qwen/Qwen3-TTS-12Hz-1.7B-Base
  - Qwen/Qwen3-1.7B
pipeline_tag: text-to-speech
---

# 🧬 Darwin-TTS-1.7B-Cross

**World's first cross-modal FFN transfer from LLM to TTS: emotion-enhanced speech synthesis without any training.**

Darwin-TTS blends 3% of the FFN weights of Qwen3-1.7B (an LLM) into the talker module of Qwen3-TTS-1.7B. No training, no data, no GPU hours: just weight-space arithmetic.

## Key Discovery

| Blend (α) | Emotion | Quality | Status |
|-----------|---------|---------|--------|
| 0% | Baseline | Normal | Original Qwen3-TTS |
| 1% | No change | Normal | Too subtle |
| 3% | Emotion appears | Normal | ★ This model (default) |
| 5% | Emotion intensified | Normal | ★★ Max stable |
| 10% | Broken | Failed | Infinite generation |

## Why It Works

Qwen3-1.7B (the LLM) and the talker module of Qwen3-TTS-1.7B share a 100% identical architecture:

| Parameter | Qwen3-1.7B (LLM) | Qwen3-TTS talker | Match |
|-----------|------------------|------------------|-------|
| hidden_size | 2048 | 2048 | ✅ |
| intermediate_size | 6144 | 6144 | ✅ |
| num_hidden_layers | 28 | 28 | ✅ |
| num_attention_heads | 16 | 16 | ✅ |
| num_key_value_heads | 8 | 8 | ✅ |
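The compatibility check behind this table is a field-by-field config comparison. A minimal sketch, using the values above as plain dicts (an actual check would parse each model's `config.json`, e.g. downloaded via `huggingface_hub`):

```python
# Config values copied from the table above, as plain dicts.
llm_config = {
    "hidden_size": 2048,
    "intermediate_size": 6144,
    "num_hidden_layers": 28,
    "num_attention_heads": 16,
    "num_key_value_heads": 8,
}
talker_config = {
    "hidden_size": 2048,
    "intermediate_size": 6144,
    "num_hidden_layers": 28,
    "num_attention_heads": 16,
    "num_key_value_heads": 8,
}

# Every field must match before a 1:1 FFN blend is possible.
mismatches = {k: (v, talker_config[k])
              for k, v in llm_config.items() if talker_config[k] != v}
print("compatible" if not mismatches else f"incompatible: {mismatches}")
```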

This means zero SVD, zero truncation, zero layer mapping: pure 1:1 lerp blending across all 84 FFN tensors (gate_proj, up_proj, down_proj × 28 layers).
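The blend itself is one multiply-add per weight. A toy sketch of the lerp with plain Python floats (the real build applies the same formula elementwise to 84 bfloat16 tensors):

```python
# W_blend = (1 - alpha) * W_tts + alpha * W_llm, applied elementwise.
ALPHA = 0.03  # the 3% ratio shipped in this model

def lerp_weights(w_tts, w_llm, alpha=ALPHA):
    """Linearly interpolate flat weight lists toward the LLM weights."""
    return [(1 - alpha) * t + alpha * l for t, l in zip(w_tts, w_llm)]

# Toy stand-ins for one FFN tensor's values
w_tts = [1.0, -0.5, 0.25]
w_llm = [0.0, 0.5, -0.25]
print(lerp_weights(w_tts, w_llm))  # each value shifts 3% toward the LLM
```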

## Architecture

```text
Qwen3-TTS-1.7B (4-module structure):
┌─────────────────────────────────────────────────────┐
│ talker (28L Qwen3 LM backbone)                      │
│   └── 84 FFN tensors blended with LLM (α=3%)        │ ← MODIFIED
│       └── talker.model.layers.N.mlp.{gate,up,down}  │
├─────────────────────────────────────────────────────┤
│ code_predictor (5L, h=1024)                         │ ← UNTOUCHED
├─────────────────────────────────────────────────────┤
│ speech_tokenizer (12Hz RVQ codec)                   │ ← UNTOUCHED
├─────────────────────────────────────────────────────┤
│ encoder/decoder (audio waveform)                    │ ← UNTOUCHED
└─────────────────────────────────────────────────────┘

FFN Source: Qwen3-1.7B (LLM)
└── model.layers.N.mlp.{gate,up,down}_proj.weight
    └── Key mapping: model.layers.N → talker.model.layers.N (1:1)
```

Only the talker's FFN weights are modified. The code_predictor, speech_tokenizer, and encoder/decoder remain 100% original, preserving the audio codec pipeline entirely.
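The 1:1 key mapping above can be enumerated directly. A small sketch (pure Python, no model download) that generates the 84 source/target key pairs the build rewrites:

```python
# 28 layers x 3 projections = 84 FFN tensors, remapped 1:1 from the
# LLM key namespace into the talker key namespace.
NUM_LAYERS = 28
PROJECTIONS = ("gate_proj", "up_proj", "down_proj")

key_map = {
    f"model.layers.{n}.mlp.{proj}.weight":
        f"talker.model.layers.{n}.mlp.{proj}.weight"
    for n in range(NUM_LAYERS)
    for proj in PROJECTIONS
}

print(len(key_map))  # 84
print(key_map["model.layers.0.mlp.gate_proj.weight"])
```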

## Quick Start

### Option 1: Load pre-blended weights (this model)

```python
import torch
from qwen_tts import Qwen3TTSModel

# Load Darwin-TTS-1.7B-Cross (α=3% pre-blended)
model = Qwen3TTSModel.from_pretrained(
    "FINAL-Bench/Darwin-TTS-1.7B-Cross",
    device_map="cuda:0",
    dtype=torch.bfloat16,
)

# Synthesize
wavs, sr = model.generate_voice_clone(
    text="안녕하세요, 저는 다윈 인공지능입니다!",  # "Hello, I am the Darwin AI!"
    ref_audio="your_voice.wav",
    ref_text="ref",
    x_vector_only_mode=True,
)
```

### Option 2: Custom blend ratio

The released checkpoint is fixed at α=3%. To try a different ratio, rebuild the weights with the blend script (see the CLI below), then load the result the same way:

```python
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained("FINAL-Bench/Darwin-TTS-1.7B-Cross")
wavs, sr = model.generate_voice_clone(
    text="정말 기쁜 소식이에요!",  # "Such happy news!"
    ref_audio="voice.wav",
    ref_text="ref",
    x_vector_only_mode=True,
)
```

## CLI

```bash
python darwin_tts_blend.py --alpha 3 --text "Hello, Darwin!" --ref voice.wav --output speech.wav
```

## Installation

```bash
pip install torch qwen-tts safetensors soundfile huggingface_hub
```

## Research Background

### The Problem

Cross-modal capability transfer (e.g., adding emotion to TTS) traditionally requires:

- Thousands of hours of emotional speech data
- Hundreds of GPU hours for training
- Careful data curation and annotation

### The Darwin Approach

Darwin's evolutionary merge framework, originally developed for LLM merging (Darwin LLM V7 achieved GPQA Diamond 86.9%, World #5), is extended to cross-modal transfer:

1. Find architecture-compatible models across modalities (LLM ↔ TTS)
2. Blend FFN weights at low ratios (3~5%) using simple lerp
3. Preserve modality-specific components (audio codec, tokenizer)

### Key Findings

1. **Cross-modal FFN transfer works**: the LLM's language-understanding patterns enhance the TTS model's emotional expressiveness
2. **Sweet spot is 3~5%**: TTS is far more sensitive than LLM merging (which tolerates 7~93%)
3. **Same backbone is required**: TADA-1B (Llama backbone) × Qwen3-TTS failed completely; Qwen3 × Qwen3 succeeded
4. **10%+ destroys TTS**: the LLM's "continue generating tokens" pattern overrides the TTS stop signal, causing 655-second outputs
5. **Bidirectional potential**: LLM + TTS FFN may enable a "Speaking LLM" (the GPT-4o direction)

### What Failed (and why it matters)

| Experiment | Why it failed | Lesson |
|------------|---------------|--------|
| TADA-1B (Llama) × Qwen3-TTS | Different backbone (Llama vs Qwen3) | Same backbone required |
| FFN 100% replacement | Too aggressive | Low ratio (3~5%) needed |
| x_vector_only_mode=False | ref_text mismatch | Use x_vector_only_mode=True |
| α=10% blend | LLM "keep generating" pattern | TTS has narrow tolerance |

### Novelty (Prior Art Survey)

| Approach | Training required | Cross-modal | Published |
|----------|-------------------|-------------|-----------|
| LLM × LLM merging (TIES, DARE, SLERP) | No | No (same modality) | Many |
| TTS × TTS averaging (Murata 2024) | No | No (same modality) | INTERSPEECH 2024 |
| SmolTolk (adapter-based) | Yes (adapter training) | Yes | arXiv 2503.06211 |
| CSLM (fine-tuning) | Yes (continual pretraining) | Yes | arXiv 2604.11096 |
| Darwin-TTS (this work) | No | Yes | World's first |

## Experimental Timeline (2026-04-15)

```text
09:00  TTS hidden_size compatibility analysis → h=2048 group discovered
09:30  TADA-1B × Qwen3-TTS download + config analysis
10:00  Chimera v1 (FFN 100%) → failed (noise)
10:30  Environment setup (darwin-tts-venv, torch 2.6.0+cu124)
10:50  Original Qwen3-TTS synthesis verified
11:00  SLERP blend 10/20/30% build (TADA) → failed (different backbone)
11:30  Key insight: Qwen3-1.7B LLM has an IDENTICAL architecture to the TTS talker!
12:00  Qwen3-1.7B download → config comparison → 5/5 parameters match!
12:15  α=1/3/5/10% LLM→TTS blending experiments
12:23  ✅ α=3% emotion appears, α=5% emotion intensified, α=10% broken
12:30  4 voice references × 3 blend ratios: high-quality sample generation
13:00  Prior art survey → confirmed world's first
13:30  Darwin-TTS-1.7B-Cross (α=3%) final build + HuggingFace release
```

## Model Details

- **Model type**: Text-to-Speech (cross-modal FFN blended)
- **Base models**: Qwen3-TTS-1.7B-Base + Qwen3-1.7B (3% FFN)
- **Parameters**: ~2.1B
- **Languages**: Korean, English, Japanese, Chinese + 6 more
- **License**: Apache 2.0
- **Blend ratio**: α=0.03 (3%)
- **FFN tensors modified**: 84 / 976 total (8.6%)
- **Build time**: ~2 minutes (no training)

## Credits

**VIDRAFT (비드래프트)**: Darwin Evolutionary Merge Framework

- Darwin LLM V7: GPQA Diamond 86.9% (World #3)
- FINAL Bench: Text AGI benchmark
- 11 Pillar Technologies: AETHER, PROMETHEUS, HEPHAESTUS, Darwin, FINAL Bench, MARL, SiteAgent, 한지+한양, VDash, 인공사회, StealthMark

Built on Qwen3-TTS-1.7B and Qwen3-1.7B by Alibaba Cloud (Apache 2.0).

## Related