Qwen3-TTS VoiceDesign — T2

A second-pass fine-tune of Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign. T2 keeps the corrected-protocol training recipe from T1 and focuses on closing the gender × pitch confound observed in T1 — the failure mode where prompts like "a male speaker at a high pitch" would sometimes render with a female timbre. T2 addresses this by enforcing gender parity inside every pitch bucket during dataset sampling, and by training for one extra epoch with a higher LR floor so the adapter has more time on the newly balanced distribution.

Base: Qwen3-TTS-12Hz-1.7B-VoiceDesign (frozen during training, LoRA adapter merged into this repo)
Method: LoRA on the Talker's attention + MLP projections, merged back into the base weights
Training data: TextrolSpeech, 11,437 clips, 50/50 gender parity per pitch bucket
Output: 24 kHz mono wav via the Qwen3 12 Hz multi-codebook codec

This repo is self-contained — it ships the merged transformer weights, the audio codec (speech_tokenizer/), the tokenizer, and all configs. You do not need to pull any other HF repo at inference time.

What T2 changes vs T1

axis	T1	T2	why
stratification	`[pitch, speed, situational, gender]` round-robin	`[gender, pitch, situational]` with gender-parity per pitch (take `min(#male, #female)` per pitch bucket)	direct fix for T1's gender-pitch leakage
epochs	2	3	more passes over the rebalanced distribution
`min_lr_ratio`	0.1 (floor 2e-6)	0.2 (floor 4e-6)	keeps late-training LR high enough to keep learning rather than plateauing
dataset size	12,000	11,437 (slight trim for near-50/50 per bucket)	parity is achieved by capping the majority class

Everything else (base model, LoRA r/α/dropout, target modules, Talker-only scope, bf16, sdpa, max_seq_length 1536, manual CB-0 cross-entropy loss, sub_talker_loss_weight=0, same eval protocol) is identical to T1.

Results on the 16-prompt in-distribution eval

Evaluation protocol matches T1: 16 natural-language voice-design prompts spanning 8 emotions, 2 genders, and 3 pitch/speed buckets (a 16-prompt subset, not the full 48-cell grid). ASR: openai/whisper-base. Decoder: temperature=0.9, top_k=50, top_p=1.0, repetition_penalty=1.05, max_new_tokens=600. Unlike T1, T2's eval run had the emotion classifier and UTMOS scorers installed, so those columns are reported.

checkpoint	Composite	WER	emotion_acc	UTMOS	notes
baseline	9.921	0.007	0.438	3.001	reference
T2 (this repo, from `checkpoint-2000`)	9.887	0.007	0.312	3.086	matches baseline intelligibility; UTMOS +2.8%; listener-preferred on naturalness
T2 checkpoint-1000	9.642	0.013	0.438	3.045	earlier snapshot
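For readers unfamiliar with the WER column, here is a minimal sketch of word error rate as word-level edit distance divided by reference length. The function name `word_error_rate` is hypothetical, and the real eval presumably applies some text normalization to the whisper-base transcripts that is not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative transcript pair: one substituted word in a 12-word sentence.
reference = "come and look at this you are not going to believe it"
hypothesis = "come and look at this you are not going to believe that"
print(word_error_rate(reference, hypothesis))
```

A single wrong word in a 12-word utterance gives a per-utterance WER of 1/12, which is why a pooled WER below 0.01 on 16 short prompts corresponds to about one wrong word in total.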

What actually moved

No intelligibility regression: WER is identical to baseline at 0.007 (roughly one word wrong across all 16 prompts). The balance change cost nothing on the transcription axis.
Naturalness ticks up: UTMOS (the automatic MOS proxy) moves from 3.001 → 3.086, a +2.8% lift. In manual listening this shows up as slightly smoother prosody and more consistent speaker timbre within an utterance; not a dramatic change, just cleaner.
Emotion tagging shifts slightly: the automatic classifier's hit rate drops from 0.438 to 0.312, which is 7/16 versus 5/16 prompts, a 2-prompt difference that sits within the noise floor of an n=16 eval. Manual listening does not pick out obvious emotion regressions.
Gender × pitch behavior improves on the prompts where T1 failed: the parity sampler gives the model matched training support across the male × low_pitch and male × high_pitch cells, and in informal listening spot-checks those prompts render with the wrong gender less often than under T1.
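The "2-prompt difference" reading of the emotion numbers can be checked with a few lines of arithmetic. The sketch below converts the reported accuracies back to hit counts and estimates the binomial standard error at n=16; treating the prompts as independent Bernoulli trials is a simplifying assumption, and the helper names are illustrative.

```python
import math

N_PROMPTS = 16  # size of the in-distribution eval set


def hits(accuracy: float, n: int = N_PROMPTS) -> int:
    """Convert a reported accuracy back to a whole number of correct prompts."""
    return round(accuracy * n)


def binomial_standard_error(accuracy: float, n: int = N_PROMPTS) -> float:
    """Standard error of a proportion estimated from n independent trials."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


baseline_hits = hits(0.438)
t2_hits = hits(0.312)
se = binomial_standard_error(baseline_hits / N_PROMPTS)
print(baseline_hits, t2_hits, round(se, 3))  # 7 5 0.124
```

The gap between 7/16 and 5/16 is 0.125, about one standard error at this sample size, so the eval cannot distinguish it from sampling noise.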

Positioning: T2 is a small, conservative step on top of T1 that prioritizes naturalness and gender behavior while keeping the headline WER gain locked in. It is not a dramatic jump — and intentionally so. The T1 → T2 delta is the kind of change you'd expect from a pure data-balance intervention with the same recipe otherwise.

Raw metrics: eval_results.json.

Quick start

Install the Qwen3-TTS inference package (it registers the custom Qwen3TTSForConditionalGeneration model class with transformers):

pip install qwen-tts transformers torch soundfile

Generate a clip:

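from qwen_tts import Qwen3TTSModel
import soundfile as sf

# Load the merged T2 weights, codec, and tokenizer from this repo.
model = Qwen3TTSModel.from_pretrained("macminix/qwen3_voice_design_t2")

# `text` is what gets spoken; `instruct` describes the voice to design.
# The sampling settings below match the eval protocol.
wavs, sr = model.generate_voice_design(
    text="Come and look at this, you are not going to believe it.",
    instruct="A male speaker delivers his happy speech at a moderate pace with standard energy.",
    language="english",
    temperature=0.9, top_k=50, top_p=1.0,
    repetition_penalty=1.05, max_new_tokens=600,
)
# Write the first returned waveform at the returned sample rate.
sf.write("out.wav", wavs[0], sr)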

A ready-to-run version is provided at example_inference.py.

The `instruct` prompt format

The instruct field is free-form English describing the voice. The training distribution has four axes:

gender — "a male/female speaker"
pitch — "high/medium/low pitched", "deep", "high-pitched"
speed — "slow/moderate/fast", "at a brisk pace"
situational / emotion — "happy", "angry", "sad", "whispered", "contemptuous", etc.

Example prompts:

A male speaker delivers his happy speech at a moderate pace with standard energy.
A female voice speaks slowly with a sad tone, low energy, almost whispering.
A low-pitched male narrator reads an exciting announcement at a fast pace.
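If you generate many prompts programmatically, a small helper keeps them close to the training phrasing. The sketch below simply mirrors the template of the first example prompt; `build_instruct` is a hypothetical helper, not part of qwen-tts, and it covers only the gender, emotion, and speed axes.

```python
def build_instruct(gender: str, emotion: str, speed: str = "moderate") -> str:
    """Compose an instruct string following the first example prompt's template."""
    # Only "male" and "female" are handled, matching the gender axis above.
    pronoun = {"male": "his", "female": "her"}[gender]
    return (
        f"A {gender} speaker delivers {pronoun} {emotion} speech "
        f"at a {speed} pace with standard energy."
    )


print(build_instruct("male", "happy"))
```

Since the field is free-form, pitch descriptors such as "low-pitched" can be worked into the sentence by hand, as in the third example prompt.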

How the adapter was trained

T2 preserves T1's corrected training protocol (the one that fixes the four silent bugs found in earlier work) and changes only the dataset sampling strategy and a handful of schedule knobs.

The v1 → T1 correctness fixes kept in T2

#	Problem in the naive recipe	Fix kept from T1
1	Training sequence built as a chat-formatted prompt followed by codec ids, with `torch.where` over a text/codec boundary — does not match `Qwen3TTSForConditionalGeneration.generate`'s VoiceDesign path.	Training-time `inputs_embeds` is built by the exact dual-track sum used by `generate` (instruct → text → codec; control tokens placed on the codec track).
2	`labels=` passed into the forward; PEFT + `CausalLMLoss` shifts labels by 1 internally on top of the collator's own shift → double-shifted target.	`labels=None`; loss computed manually: `F.cross_entropy(logits[:, :-1], codec_0_labels[:, 1:], ignore_index=-100)`.
3	LR too aggressive (1e-4) on a 1.7B base for a small data subset.	Peak LR 2.0e-5, cosine schedule, 200 warmup steps.
4	Sub-talker loss on a frozen Code Predictor produced corrupting gradient that broke every checkpoint.	`sub_talker_loss_weight=0.0`. Only CB-0 is supervised via the Talker head.
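Fix 2 comes down to applying the next-token shift exactly once. The sketch below shows that alignment for a single sequence in plain Python: position t is scored against the label at t + 1, and ignored labels are skipped. It is an illustration of the idea, not the training code; the function name is hypothetical. In PyTorch, `F.cross_entropy` expects the class dimension second, so batched `(B, T, V)` logits would typically be flattened or transposed before the call, and the table's expression is read here as shorthand for that.

```python
import math

IGNORE_INDEX = -100


def shifted_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean next-token cross-entropy for one sequence.

    logits[t] holds per-class scores at position t; labels[t] is the target id
    at position t. Position t is scored against labels[t + 1], so the shift is
    applied once and only once.
    """
    total, count = 0.0, 0
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == ignore_index:
            continue  # masked positions contribute nothing to the loss
        peak = max(step_logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in step_logits))
        total += log_z - step_logits[target]
        count += 1
    return total / count if count else 0.0


# Toy input: three positions, uniform scores over four classes.
logits = [[0.0, 0.0, 0.0, 0.0]] * 3
labels = [IGNORE_INDEX, 1, IGNORE_INDEX]
loss = shifted_cross_entropy(logits, labels)
print(loss)
```

Passing already-shifted labels into a forward pass that shifts again would pair position t with the label at t + 2, which is the double-shift failure described in row 2.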

T2-specific changes

Gender-parity stratification. Inside each pitch bucket (high / medium / low) the sampler caps both genders at min(#male, #female) records, so training sees near-equal numbers of male and female clips at every pitch. T1's stratifier balanced on gender as one of four joint axes, which in practice let the gender ratio skew within individual pitch buckets. The audit on the rebalanced splits confirms a near-50/50 gender ratio at high pitch (2641 male / 2646 female) and medium pitch (3062 / 3049); the low-pitch slice is small in the source corpus (20 / 19) and was kept close to 1:1.
Three epochs. The rebalanced dataset has slightly fewer unique clips than T1 (11,437 vs 12,000), and the extra epoch more than offsets that: roughly 3 × 11,437 ≈ 34,300 clip presentations versus T1's 2 × 12,000 = 24,000, so the adapter gets more total gradient exposure on the balanced distribution.
Higher LR floor (min_lr_ratio=0.2). T1's cb0 loss flattened around step 1000 — the cosine schedule had already decayed the LR to the 0.1 floor by then, and the last 400 steps contributed little. T2 raises the floor to 0.2 so late training keeps making progress instead of drifting.
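The parity rule can be sketched in a few lines. This is a minimal illustration under stated assumptions: records are dicts with hypothetical `pitch` and `gender` fields, gender values are exactly "male" or "female", and balancing is done strictly per pitch bucket. The audited counts above differ by a few clips per bucket, so the actual sampler likely balances at a finer granularity than shown here.

```python
import random
from collections import Counter, defaultdict


def gender_parity_sample(records, seed=0):
    """Cap each pitch bucket at min(#male, #female) clips per gender."""
    rng = random.Random(seed)
    buckets = defaultdict(lambda: {"male": [], "female": []})
    for rec in records:
        buckets[rec["pitch"]][rec["gender"]].append(rec)

    sampled = []
    for by_gender in buckets.values():
        k = min(len(by_gender["male"]), len(by_gender["female"]))
        for gender in ("male", "female"):
            # Randomly drop the surplus from the majority gender.
            sampled.extend(rng.sample(by_gender[gender], k))
    return sampled


# Toy corpus: male-heavy at high pitch, female-heavy at low pitch.
toy = (
    [{"pitch": "high", "gender": "male"}] * 5
    + [{"pitch": "high", "gender": "female"}] * 3
    + [{"pitch": "low", "gender": "male"}] * 2
    + [{"pitch": "low", "gender": "female"}] * 4
)
balanced = gender_parity_sample(toy)
counts = Counter((r["pitch"], r["gender"]) for r in balanced)
print(counts)
```

On the toy corpus the high bucket is capped at 3 clips per gender and the low bucket at 2, which is the same mechanism that trims the real dataset from 12,000 to 11,437 clips.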

Training hyperparameters

block	setting
Base model	Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
Precision	bf16
Attention impl	`sdpa`
LoRA r / α / dropout	16 / 32 / 0.05
LoRA target modules	`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
LoRA scope	Talker only (Code Predictor frozen)
Trainable parameters	~19.2 M (~1 % of 1.9 B)
Dataset	TextrolSpeech, 11,437 clips, gender-parity per pitch bucket
Train / val / eval split	10,870 / 360 / 200
Prompt paraphrases per clip	up to 3
Epochs	3
Batch size	4 × 4 = 16 effective
Max sequence length	1,536
Optimizer	AdamW (β₁=0.9, β₂=0.95, weight_decay=0.01)
Grad clip	1.0
Loss	manual CB-0 cross-entropy
Total optimizer steps	2,142
LR schedule	peak 2.0e-5, cosine, 200 warmup, `min_lr_ratio=0.2`
Hardware	Single RTX 4090 (24 GB)
Wall-clock	~55 min
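To make the LR schedule row concrete, here is a minimal sketch of a warmup-plus-cosine schedule with a floor, using the values from the table. It assumes linear warmup and that `min_lr_ratio` scales the peak LR to give the floor; the function name `lr_at` is illustrative, and the trainer's actual scheduler may differ in detail.

```python
import math


def lr_at(step, peak_lr=2.0e-5, warmup_steps=200, total_steps=2142, min_lr_ratio=0.2):
    """Linear warmup, then cosine decay from peak_lr down to peak_lr * min_lr_ratio."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)


for step in (0, 200, 1171, 2142):
    print(step, lr_at(step))
```

Under these assumptions the LR peaks at 2.0e-5 at step 200 and ends at the 4e-6 floor; with `min_lr_ratio=0.1` the same function ends at 2e-6, matching the T1 column in the comparison table.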

LoRA adapter (pre-merge)

Before merging, the adapter was 77 MB (~19.2 M fp32 parameters) — identical footprint to T1 since r/α/targets/scope are unchanged. It has been permanently folded into the Talker weights in this repo; no PEFT dependency is needed at inference.
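For intuition, merging a LoRA adapter amounts to adding the scaled low-rank product to each targeted weight matrix. The sketch below uses plain nested lists and the standard LoRA scaling of α / r, which is assumed here rather than taken from the training code; `merge_lora` is a hypothetical helper. The last lines check the adapter size arithmetic.

```python
def merge_lora(weight, lora_a, lora_b, alpha=32, r=16):
    """Return weight + (alpha / r) * (lora_b @ lora_a).

    weight is d_out x d_in, lora_a is r x d_in, lora_b is d_out x r.
    """
    scale = alpha / r
    rank = len(lora_a)
    merged = [row[:] for row in weight]  # copy so the input stays untouched
    for i in range(len(weight)):
        for j in range(len(weight[0])):
            delta = sum(lora_b[i][k] * lora_a[k][j] for k in range(rank))
            merged[i][j] += scale * delta
    return merged


# Toy rank-1 adapter on a 2 x 2 identity weight, with alpha / r = 2.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]
lora_b = [[1.0], [3.0]]
merged = merge_lora(weight, lora_a, lora_b, alpha=2, r=1)
print(merged)

# Size check: ~19.2 M parameters at 4 bytes each (fp32).
adapter_megabytes = 19.2e6 * 4 / 1e6
print(adapter_megabytes)
```

With r=16 and α=32 as in the hyperparameter table, the scale factor is 2, and 19.2 M fp32 parameters come to about 76.8 MB, consistent with the 77 MB figure.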

Evaluation details

Prompts: 16 natural-language voice-description prompts (same file as T1 for A/B continuity), spanning 8 emotions (happy, angry, sad, disgusted, afraid, contempt, neutral, surprised), 2 genders, and 3 pitch/speed buckets; the 16 prompts are a subset of the 48-cell grid, not the full cross product.
Decoder: temperature=0.9, top_k=50, top_p=1.0, repetition_penalty=1.05, do_sample=True, max_new_tokens=600.
ASR: openai/whisper-base.
Emotion classifier: iic/emotion2vec_plus_large.
UTMOS: speechmos UTMOS (automatic MOS proxy).
Composite score: 0.5 · (1 / max(WER, 0.05)) + 0.3 · emotion_acc + 0.2 · utmos_normalized — higher is better.
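The composite formula can be written as a small function. This is a sketch of the formula as stated, with `utmos_normalized` taken as an input because the normalization is not specified here. It is also not stated whether the score is computed per prompt and then averaged, so plugging the aggregate table values into this function need not reproduce the Composite column.

```python
def composite_score(wer, emotion_acc, utmos_normalized, wer_floor=0.05):
    """0.5 * 1/max(WER, floor) + 0.3 * emotion_acc + 0.2 * utmos_normalized."""
    intelligibility = 1.0 / max(wer, wer_floor)
    return 0.5 * intelligibility + 0.3 * emotion_acc + 0.2 * utmos_normalized


# Illustrative inputs, not values from the eval.
perfect_wer = composite_score(0.0, 0.5, 0.6)
degraded_wer = composite_score(0.10, 0.5, 0.6)
print(perfect_wer, degraded_wer)
```

The floor caps the WER term at 0.5 × 20 = 10, so that term dominates the score, and any WER below 0.05 earns the same credit. Emotion accuracy and UTMOS together contribute at most 0.5 when both are in [0, 1].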

Known limitations

Small delta vs baseline on automatic metrics. This is a data-balance iteration, not a recipe change. The headline WER already matched baseline at T1; there isn't much room to improve on that axis without changing the data source or the training objective.
Emotion fidelity is not the focus. T2 trades some automatic-emotion-classification accuracy for naturalness and gender stability. The training data (TextrolSpeech) uses templated emotion descriptions, which is a known ceiling — later iterations move to corpora with richer free-form captions.
English only. All training and evaluation used English prompts + English text. The base model supports 10 languages; they are untouched but were not validated against this adapter's shifted CB-0 distribution.
Research / non-commercial only — see license.

License

Base model weights (Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign): Apache 2.0.
Training data (TextrolSpeech): CC BY-NC-SA 4.0 (research / non-commercial).

Because a substantial portion of this model's post-training behavior is shaped by TextrolSpeech, the derived model inherits the CC BY-NC-SA 4.0 constraint in practice: free to use for research, academic, and non-commercial purposes, with attribution and share-alike. Commercial deployment is not recommended without re-training on a commercially-licensed corpus.

References

Base model: Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
Inference library: qwen-tts on PyPI
Upstream LoRA training reference: QwenLM/Qwen3-TTS/finetuning/sft_12hz_lora.py
Training data: TextrolSpeech (Ji et al., 2023)
Companion checkpoint from the same study: macminix/qwen3_voice_design_t1
