Darwin-TTS-1.7B-Cross: Qwen3-TTS compatibility repack

This repository is a compatibility repack of FINAL-Bench/Darwin-TTS-1.7B-Cross.

The original Darwin checkpoint appears to omit the speech_tokenizer/ directory required by the standard qwen-tts loader. This repack adds the missing speech_tokenizer/ files from Qwen/Qwen3-TTS-12Hz-1.7B-Base.

No model blending, training, fine-tuning, or behavioral changes were performed in this repack. The purpose is only to make the model load with:

from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained("zeropointnine/Darwin-TTS-1.7B-Cross-Qwen3Tokenizer")

Provenance

  • Main model weights and model card: FINAL-Bench/Darwin-TTS-1.7B-Cross
  • Added tokenizer assets: Qwen/Qwen3-TTS-12Hz-1.7B-Base
  • License: Apache 2.0, matching the upstream model cards.

Original Darwin-TTS-1.7B-Cross model card follows below:

🧬 Darwin-TTS-1.7B-Cross

World's first cross-modal FFN transfer from LLM to TTS: emotion-enhanced speech synthesis without any training.

This model is a cross-modal application of the Darwin Family framework, introduced in the paper: Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning.

Authors: Taebong Kim, Youngsik Hong, Minsik Kim, Sunyoung Choi, Jaewon Jang, Junghoon Shin, Minseo Kim.

Darwin-TTS blends 3% of the Qwen3-1.7B (LLM) FFN weights into the talker module of Qwen3-TTS-1.7B (TTS). No training, no data, no GPU hours, just weight-space arithmetic.

Key Discovery

Blend (α)   Emotion               Quality   Status
0%          Baseline              Normal    Original Qwen3-TTS
1%          No change             Normal    Too subtle
3%          Emotion appears       Normal    ★ This model (default)
5%          Emotion intensified   Normal    ★★ Max stable
10%         Broken                Failed    Infinite generation

Why It Works

Qwen3-1.7B (LLM) and the Qwen3-TTS-1.7B talker share an identical backbone architecture:

                    Qwen3-1.7B (LLM)    Qwen3-TTS talker    Match
hidden_size         2048                 2048                ✅
intermediate_size   6144                 6144                ✅
num_hidden_layers   28                   28                  ✅
num_attention_heads 16                   16                  ✅
num_key_value_heads 8                    8                   ✅

This means zero SVD, zero truncation, zero layer mapping: pure 1:1 lerp blending across all 84 FFN tensors (gate_proj, up_proj, down_proj × 28 layers).
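
The card names lerp but does not print the formula. Assuming the conventional linear-interpolation form, and assuming zero-based layer indices, the blend applied to each FFN tensor would read:

```latex
W'_{\ell,p} \;=\; (1-\alpha)\, W^{\mathrm{TTS}}_{\ell,p} \;+\; \alpha\, W^{\mathrm{LLM}}_{\ell,p},
\qquad \ell \in \{0,\dots,27\},\quad p \in \{\mathrm{gate},\mathrm{up},\mathrm{down}\},\quad \alpha = 0.03
```

With 28 layers and 3 projections per layer, this covers 28 × 3 = 84 tensors, matching the count above. Because the shapes agree exactly, the sum is element-wise with no reshaping.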

Architecture

Qwen3-TTS-1.7B (4-module structure):
┌─────────────────────────────────────────────────────┐
│ talker (28L Qwen3 LM backbone)                      │
│   └── 84 FFN tensors blended with LLM (α=3%)        │ ← MODIFIED
│       └── talker.model.layers.N.mlp.{gate,up,down}  │
├─────────────────────────────────────────────────────┤
│ code_predictor (5L, h=1024)                         │ ← UNTOUCHED
├─────────────────────────────────────────────────────┤
│ speech_tokenizer (12Hz RVQ codec)                   │ ← UNTOUCHED
├─────────────────────────────────────────────────────┤
│ encoder/decoder (audio waveform)                    │ ← UNTOUCHED
└─────────────────────────────────────────────────────┘

FFN Source: Qwen3-1.7B (LLM)
└── model.layers.N.mlp.{gate,up,down}_proj.weight
    └── Key mapping: model.layers.N → talker.model.layers.N (1:1)

Only the talker's FFN weights are modified. The code_predictor, speech_tokenizer, and encoder/decoder remain 100% original, preserving the audio codec pipeline entirely.
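
The 1:1 key mapping described above can be sketched in a few lines. This is an illustration only: the helper names `llm_key_to_talker_key` and `expected_ffn_keys` are hypothetical, and the full talker key (including the `_proj.weight` suffix) is assumed from the diagram rather than read from a real checkpoint.

```python
import re
from typing import List, Optional

# Assumed LLM-side key layout, taken from the "FFN Source" diagram above.
FFN_KEY = re.compile(r"^model\.layers\.\d+\.mlp\.(gate|up|down)_proj\.weight$")


def llm_key_to_talker_key(llm_key: str) -> Optional[str]:
    """Map an LLM FFN weight key to its talker counterpart.

    Returns None for any key that is not an FFN projection weight,
    so attention, norm, and embedding weights are never selected.
    """
    if FFN_KEY.match(llm_key) is None:
        return None
    # model.layers.N -> talker.model.layers.N (1:1, same layer index)
    return "talker." + llm_key


def expected_ffn_keys(num_layers: int = 28) -> List[str]:
    """List the LLM-side FFN keys for a backbone with num_layers layers."""
    return [
        f"model.layers.{layer}.mlp.{proj}_proj.weight"
        for layer in range(num_layers)
        for proj in ("gate", "up", "down")
    ]
```

Calling `len(expected_ffn_keys())` gives 84 for the 28-layer backbone, which is one way to sanity-check a checkpoint's key list before blending.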

Quick Start

Option 1: Load pre-blended weights (this model)

from qwen_tts import Qwen3TTSModel
import torch

# Load Darwin-TTS-1.7B-Cross (α=3% pre-blended)
model = Qwen3TTSModel.from_pretrained(
    "FINAL-Bench/Darwin-TTS-1.7B-Cross",
    device_map="cuda:0",
    dtype=torch.bfloat16
)

# Synthesize
wavs, sr = model.generate_voice_clone(
    text="안녕하세요, 저는 다윈 인공지능입니다!",  # Korean: "Hello, I am Darwin AI!"
    ref_audio="your_voice.wav",
    ref_text="ref",
    x_vector_only_mode=True
)

Option 2: Custom blend ratio (runtime blending)

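from qwen_tts import Qwen3TTSModel

# Note: this snippet loads the published checkpoint (α=3% pre-blended);
# it does not itself change the blend ratio.
model = Qwen3TTSModel.from_pretrained("FINAL-Bench/Darwin-TTS-1.7B-Cross")
wavs, sr = model.generate_voice_clone(
    text="정말 기쁜 소식이에요!",  # Korean: "This is really happy news!"
    ref_audio="voice.wav",
    ref_text="ref",
    x_vector_only_mode=True
)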
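
The blending step for a custom ratio is not shown in this card. A minimal sketch of what it might look like follows, assuming the conventional lerp form and the key layout from the Architecture section. Plain Python lists stand in for real tensors, and `lerp` and `blend_ffn` are hypothetical names, not part of qwen_tts.

```python
from typing import Dict, List

Weights = List[float]  # stand-in for a real tensor, flattened

FFN_SUFFIXES = (
    ".mlp.gate_proj.weight",
    ".mlp.up_proj.weight",
    ".mlp.down_proj.weight",
)


def lerp(a: Weights, b: Weights, alpha: float) -> Weights:
    """Element-wise (1 - alpha) * a + alpha * b."""
    if len(a) != len(b):
        raise ValueError("shape mismatch: cannot blend 1:1")
    return [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]


def blend_ffn(
    tts_state: Dict[str, Weights],
    llm_state: Dict[str, Weights],
    alpha: float = 0.03,
) -> Dict[str, Weights]:
    """Return a copy of tts_state with talker FFN weights blended toward the LLM.

    Only keys of the assumed form model.layers.N.mlp.{gate,up,down}_proj.weight
    are read from llm_state; every other TTS entry is copied unchanged.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    blended = dict(tts_state)
    for llm_key, llm_w in llm_state.items():
        if not (llm_key.startswith("model.layers.") and llm_key.endswith(FFN_SUFFIXES)):
            continue  # skip attention, norms, embeddings
        talker_key = "talker." + llm_key  # assumed 1:1 key mapping
        if talker_key in tts_state:
            blended[talker_key] = lerp(tts_state[talker_key], llm_w, alpha)
    return blended
```

With real checkpoints the same loop would run over tensors loaded from safetensors files, and the result would be saved before loading it with `Qwen3TTSModel.from_pretrained`. Per the Key Discovery table, ratios of 10% or more broke generation, so values above 0.05 are unlikely to be usable.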

CLI

python darwin_tts_blend.py --alpha 3 --text "Hello, Darwin!" --ref voice.wav --output speech.wav

Installation

pip install torch qwen-tts safetensors soundfile huggingface_hub

Research Background

The Problem

Cross-modal capability transfer (e.g., adding emotion to TTS) traditionally requires:

  • Thousands of hours of emotional speech data
  • Hundreds of GPU hours for training
  • Careful data curation and annotation

The Darwin Approach

Darwin's evolutionary merge framework, originally developed for LLM merging (Darwin LLM V7 achieved GPQA Diamond 86.9%, World #5), is extended to cross-modal transfer:

  1. Find architecture-compatible models across modalities (LLM ↔ TTS)
  2. Blend FFN weights at low ratios (3-5%) using simple lerp
  3. Preserve modality-specific components (audio codec, tokenizer)

Key Findings

  1. Cross-modal FFN transfer works: the LLM's language understanding patterns enhance TTS emotional expressiveness
  2. Sweet spot is 3-5%: TTS is far more sensitive than LLM merging (which tolerates 7-93%)
  3. Same backbone is required: Qwen3 × Qwen3 succeeded; cross-backbone merges (e.g., Llama) failed.
  4. 10%+ destroys TTS: the LLM's "continue generating tokens" pattern overrides the TTS stop signal, causing 655-second outputs
  5. Bidirectional potential: blending in the other direction (LLM + TTS FFN) may enable a "Speaking LLM" (the GPT-4o direction)

Model Details

  • Model type: Text-to-Speech (cross-modal FFN blended)
  • Base models: Qwen3-TTS-1.7B-Base + Qwen3-1.7B (3% FFN)
  • Parameters: ~2.1B
  • Languages: Korean, English, Japanese, Chinese + 6 more
  • License: Apache 2.0
  • Blend ratio: α=0.03 (3%)
  • FFN tensors modified: 84 / 976 total (8.6%)
  • Build time: ~2 minutes (no training)

Citation

If you find this work useful in your research, please cite:

@article{kim2026darwin,
  title={Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning},
  author={Kim, Taebong and Hong, Youngsik and Kim, Minsik and Choi, Sunyoung and Jang, Jaewon and Shin, Junghoon and Kim, Minseo},
  journal={arXiv preprint arXiv:2605.14386},
  year={2026}
}

Credits

VIDRAFT (비드래프트): Darwin Evolutionary Merge Framework

Built on Qwen3-TTS-1.7B and Qwen3-1.7B by Alibaba Cloud (Apache 2.0).
