Qwen3-TTS VoiceDesign: T2

A second-pass fine-tune of Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign. T2 keeps the corrected-protocol training recipe from T1 and focuses on closing the gender × pitch confound observed in T1: the failure mode where prompts like "a male speaker at a high pitch" would sometimes render with a female timbre. T2 addresses this by enforcing gender parity inside every pitch bucket during dataset sampling, and by training for one extra epoch with a higher LR floor so the adapter has more time on the newly balanced distribution.

  • Base: Qwen3-TTS-12Hz-1.7B-VoiceDesign (frozen during training, LoRA adapter merged into this repo)
  • Method: LoRA on the Talker's attention + MLP projections, merged back into the base weights
  • Training data: TextrolSpeech, 11,437 clips, 50/50 gender parity per pitch bucket
  • Output: 24 kHz mono wav via the Qwen3 12 Hz multi-codebook codec

This repo is self-contained: it ships the merged transformer weights, the audio codec (speech_tokenizer/), the tokenizer, and all configs. You do not need to pull any other HF repo at inference time.

What T2 changes vs T1

| axis | T1 | T2 | why |
| --- | --- | --- | --- |
| stratification | [pitch, speed, situational, gender] round-robin | [gender, pitch, situational] with gender parity per pitch (take min(#male, #female) per pitch bucket) | direct fix for T1's gender-pitch leakage |
| epochs | 2 | 3 | more passes over the rebalanced distribution |
| min_lr_ratio | 0.1 (floor 2e-6) | 0.2 (floor 4e-6) | keeps late-training LR high enough to keep learning rather than plateauing |
| dataset size | 12,000 | 11,437 (slight trim for exact 50/50 per bucket) | parity is achieved by capping the majority class |

Everything else (base model, LoRA r/α/dropout, target modules, Talker-only scope, bf16, sdpa, max_seq_length 1536, manual CB-0 cross-entropy loss, sub_talker_loss_weight=0, same eval protocol) is identical to T1.

Results on the 16-prompt in-distribution eval

Evaluation protocol matches T1: 16 natural-language voice-design prompts covering 8 emotions × 2 genders × 3 pitch/speed buckets; ASR: openai/whisper-base; decoder temperature=0.9, top_k=50, top_p=1.0, repetition_penalty=1.05, max_new_tokens=600. Unlike T1, T2's eval run had the emotion classifier and UTMOS scorers installed, so those columns are reported.

| checkpoint | Composite | WER | emotion_acc | UTMOS | notes |
| --- | --- | --- | --- | --- | --- |
| baseline | 9.921 | 0.007 | 0.438 | 3.001 | reference |
| T2 (this repo, from checkpoint-2000) | 9.887 | 0.007 | 0.312 | 3.086 | matches baseline intelligibility; UTMOS +2.8%; listener-preferred on naturalness |
| T2 checkpoint-1000 | 9.642 | 0.013 | 0.438 | 3.045 | earlier snapshot |

What actually moved

  • No intelligibility regression: WER is identical to baseline at 0.007 (roughly 1 token wrong across all 16 prompts). The balance change didn't cost anything on the transcription axis.
  • Naturalness ticks up: UTMOS (the automatic MOS proxy) moves from 3.001 → 3.086 (+2.8%), a small but measurable lift. In manual listening this shows as slightly smoother prosody and more consistent speaker timbre inside an utterance; not a dramatic change, just cleaner.
  • Emotion tagging shifts slightly: the automatic classifier's hit rate on the 16-prompt emotion axis drops from 0.438 → 0.312, but that is a 2-prompt difference at n=16, within the noise floor of this eval size. Manual listening does not pick out obvious emotion regressions.
  • Gender × pitch behavior improves on the prompts where T1 failed: the parity sampler gave the model matched training support across the male × low_pitch and male × high_pitch cells, and on informal spot-checks those specific prompts render with the wrong gender less often than under T1.
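The noise-floor argument for the emotion column can be checked with a quick binomial estimate. The sketch below assumes the reported rates correspond to 7/16 and 5/16 hits (inferred from 0.438 and 0.312 at n=16, not stated in the source) and treats each prompt as an independent trial.

```python
import math

def binomial_standard_error(hits, n):
    """Standard error of a hit rate estimated from n independent trials."""
    p = hits / n
    return math.sqrt(p * (1.0 - p) / n)

N_PROMPTS = 16
BASELINE_HITS = 7  # assumption: 7/16 = 0.4375, reported as 0.438
T2_HITS = 5        # assumption: 5/16 = 0.3125, reported as 0.312

delta = (BASELINE_HITS - T2_HITS) / N_PROMPTS
se = binomial_standard_error(BASELINE_HITS, N_PROMPTS)
print(f"delta={delta:.3f}, standard error={se:.3f}")
```

Under these assumptions the gap is about one standard error of the baseline rate, which is consistent with reading it as noise at this sample size.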

Positioning: T2 is a small, conservative step on top of T1 that prioritizes naturalness and gender behavior while keeping WER locked at the baseline level. It is not a dramatic jump, and intentionally so. The T1 → T2 delta is the kind of change you'd expect from a pure data-balance intervention with the same recipe otherwise.

Raw metrics: eval_results.json.

Quick start

Install the Qwen3-TTS inference package (it registers the custom Qwen3TTSForConditionalGeneration model class with transformers):

pip install qwen-tts transformers torch soundfile

Generate a clip:

from qwen_tts import Qwen3TTSModel
import soundfile as sf

wrap = Qwen3TTSModel.from_pretrained("macminix/qwen3_voice_design_t2")

wavs, sr = wrap.generate_voice_design(
    text="Come and look at this, you are not going to believe it.",
    instruct="A male speaker delivers his happy speech at a moderate pace with standard energy.",
    language="english",
    temperature=0.9, top_k=50, top_p=1.0,
    repetition_penalty=1.05, max_new_tokens=600,
)
sf.write("out.wav", wavs[0], sr)

A ready-to-run version is provided at example_inference.py.

The instruct prompt format

The instruct field is free-form English describing the voice. The training distribution has four axes:

  • gender: "a male/female speaker"
  • pitch: "high/medium/low pitched", "deep", "high-pitched"
  • speed: "slow/moderate/fast", "at a brisk pace"
  • situational / emotion: "happy", "angry", "sad", "whispered", "contemptuous", etc.

Example prompts:

A male speaker delivers his happy speech at a moderate pace with standard energy.
A female voice speaks slowly with a sad tone, low energy, almost whispering.
A low-pitched male narrator reads an exciting announcement at a fast pace.
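For batch generation it can help to assemble instruct strings from the four axes programmatically. The helper below is a hypothetical convenience, not part of qwen_tts; its sentence template is an assumption loosely modeled on the example prompts, and since the instruct field is free-form, other phrasings are equally valid.

```python
def build_instruct(gender, emotion, speed, pitch=None):
    """Compose a voice description from the gender, emotion, speed, and pitch axes.

    Hypothetical helper: only "male" and "female" are handled, matching the
    two gender values described for the training distribution.
    """
    possessive = {"male": "his", "female": "her"}[gender]
    if pitch:
        subject = f"A {pitch}-pitched {gender} speaker"
    else:
        subject = f"A {gender} speaker"
    return f"{subject} delivers {possessive} {emotion} speech at a {speed} pace."

print(build_instruct("male", "happy", "moderate"))
print(build_instruct("female", "sad", "slow", pitch="low"))
```

The returned string can be passed as the instruct argument of generate_voice_design.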

How the adapter was trained

T2 preserves T1's corrected training protocol (the one that fixes the four silent bugs found in earlier work) and changes only the dataset sampling strategy and a handful of schedule knobs.

The v1 → T1 correctness fixes kept in T2

| # | Problem in the naive recipe | Fix kept from T1 |
| --- | --- | --- |
| 1 | Training sequence built as a chat-formatted prompt followed by codec ids, with torch.where over a text/codec boundary; this does not match Qwen3TTSForConditionalGeneration.generate's VoiceDesign path. | Training-time inputs_embeds is built by the exact dual-track sum used by generate (instruct → text → codec; control tokens placed on the codec track). |
| 2 | labels= passed into the forward; PEFT + CausalLMLoss shifts labels by 1 internally on top of the collator's own shift → double-shifted target. | labels=None; loss computed manually: F.cross_entropy(logits[:, :-1], codec_0_labels[:, 1:], ignore_index=-100). |
| 3 | LR too aggressive (1e-4) on a 1.7B base for a small data subset. | Peak LR 2.0e-5, cosine schedule, 200 warmup steps. |
| 4 | Sub-talker loss on a frozen Code Predictor produced corrupting gradients that broke every checkpoint. | sub_talker_loss_weight=0.0. Only CB-0 is supervised via the Talker head. |
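To make the single-shift rule in fix 2 concrete, here is a dependency-free reference of what the manual loss computes for one sequence. It is an illustrative sketch, not the training code: the function name is hypothetical, and in PyTorch batched logits of shape (B, T, V) would typically be flattened to (B·(T-1), V) before calling F.cross_entropy.

```python
import math

IGNORE_INDEX = -100

def shifted_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean next-token cross-entropy for one sequence, shifted exactly once.

    logits: list of T rows, each a list of V unnormalized scores.
    labels: list of T CB-0 token ids; ignore_index marks unsupervised slots.
    The logits at position t are scored against labels[t + 1].
    """
    total, count = 0.0, 0
    for row, target in zip(logits[:-1], labels[1:]):
        if target == ignore_index:
            continue  # masked positions do not contribute to the mean
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[target]
        count += 1
    # Assumption: return 0.0 when every target is masked (PyTorch returns NaN).
    return total / count if count else 0.0

uniform_logits = [[0.0] * 4 for _ in range(3)]
print(shifted_cross_entropy(uniform_logits, [IGNORE_INDEX, 1, 2]))
```

With uniform logits over 4 classes the result is ln 4, and the first label is never used as a target because of the shift. Passing labels= to the model on top of this would shift a second time, which is the double-shift bug described in row 2.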

T2-specific changes

  1. Gender-parity stratification. Inside each pitch bucket (high / medium / low) the sampler caps each gender at min(#male, #female) records, so training sees effectively the same number of male and female clips at every pitch. T1's stratifier balanced on gender as one of four joint axes, which in practice let the pitch distribution skew. The audit on the rebalanced splits confirms a near-50/50 gender ratio at high pitch (2,641 male / 2,646 female) and medium pitch (3,062 / 3,049); the low-pitch slice is small in the source corpus (20 / 19) and was kept close to 1:1.
  2. Three epochs. With the rebalanced distribution the adapter sees slightly fewer unique clips (11,437 vs 12,000), and the extra epoch more than makes up the difference in total gradient exposure relative to T1's two epochs.
  3. Higher LR floor (min_lr_ratio=0.2). T1's CB-0 loss flattened around step 1000: by then the cosine schedule had already decayed the LR toward the 0.1 floor, and the last 400 steps contributed little. T2 raises the floor to 0.2 so late training keeps making progress instead of drifting.
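The parity rule in item 1 can be sketched as follows. This is a minimal illustration, not the project's sampler: the record fields "pitch" and "gender" and the function name are assumptions, and a strict min() cap as written here yields exactly equal counts per bucket.

```python
import random
from collections import defaultdict

def gender_parity_sample(records, seed=0):
    """Cap each pitch bucket at min(#male, #female) clips per gender.

    Assumes every record is a dict with a "pitch" key and a "gender" key
    whose value is either "male" or "female".
    """
    rng = random.Random(seed)
    buckets = defaultdict(lambda: {"male": [], "female": []})
    for rec in records:
        buckets[rec["pitch"]][rec["gender"]].append(rec)

    sampled = []
    for pitch in sorted(buckets):
        by_gender = buckets[pitch]
        cap = min(len(by_gender["male"]), len(by_gender["female"]))
        for gender in ("male", "female"):
            # Downsample the majority gender; the minority is kept whole.
            sampled.extend(rng.sample(by_gender[gender], cap))
    rng.shuffle(sampled)
    return sampled

toy = (
    [{"pitch": "high", "gender": "male"}] * 5
    + [{"pitch": "high", "gender": "female"}] * 3
    + [{"pitch": "low", "gender": "male"}] * 2
    + [{"pitch": "low", "gender": "female"}] * 4
)
print(len(gender_parity_sample(toy)))  # 3 + 3 + 2 + 2 = 10
```

Capping the majority class rather than upsampling the minority is why the dataset shrinks slightly instead of growing.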

Training hyperparameters

| block | setting |
| --- | --- |
| Base model | Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign |
| Precision | bf16 |
| Attention impl | sdpa |
| LoRA r / α / dropout | 16 / 32 / 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| LoRA scope | Talker only (Code Predictor frozen) |
| Trainable parameters | 19.2 M (1 % of 1.9 B) |
| Dataset | TextrolSpeech, 11,437 clips, gender-parity per pitch bucket |
| Train / val / eval split | 10,870 / 360 / 200 |
| Prompt paraphrases per clip | up to 3 |
| Epochs | 3 |
| Batch size | 4 × 4 = 16 effective |
| Max sequence length | 1,536 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, weight_decay=0.01) |
| Grad clip | 1.0 |
| Loss | manual CB-0 cross-entropy |
| Total optimizer steps | 2,142 |
| LR schedule | peak 2.0e-5, cosine, 200 warmup, min_lr_ratio=0.2 |
| Hardware | Single RTX 4090 (24 GB) |
| Wall-clock | ~55 min |
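The LR schedule settings above translate into a per-step learning rate roughly as sketched below. The peak, warmup length, step count, and min_lr_ratio come from the table; the linear warmup shape and the assumption that the cosine reaches its floor exactly at the final step are mine, since the source does not name the scheduler implementation.

```python
import math

def lr_at_step(step, total_steps=2142, warmup_steps=200,
               peak_lr=2.0e-5, min_lr_ratio=0.2):
    """Cosine decay from peak_lr down to peak_lr * min_lr_ratio after warmup."""
    if step < warmup_steps:
        # Assumption: linear warmup from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    floor = peak_lr * min_lr_ratio
    return floor + (peak_lr - floor) * cosine

print(f"{lr_at_step(200):.2e}")                      # 2.00e-05 (peak)
print(f"{lr_at_step(2142):.2e}")                     # 4.00e-06 (T2 floor)
print(f"{lr_at_step(2142, min_lr_ratio=0.1):.2e}")   # 2.00e-06 (T1-style floor)
```

Doubling min_lr_ratio doubles the learning rate the adapter still sees at the end of training, which is the lever T2 uses to avoid the late plateau.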

LoRA adapter (pre-merge)

Before merging, the adapter was 77 MB (~19.2 M fp32 parameters), an identical footprint to T1 since r/α/targets/scope are unchanged. It has been permanently folded into the Talker weights in this repo; no PEFT dependency is needed at inference.

Evaluation details

  • Prompts: 16 natural-language voice-description prompts (same file as T1 for A/B continuity): 8 emotions (happy, angry, sad, disgusted, afraid, contempt, neutral, surprised) × 2 genders × 3 pitch/speed buckets.
  • Decoder: temperature=0.9, top_k=50, top_p=1.0, repetition_penalty=1.05, do_sample=True, max_new_tokens=600.
  • ASR: openai/whisper-base.
  • Emotion classifier: iic/emotion2vec_plus_large.
  • UTMOS: speechmos UTMOS (automatic MOS proxy).
  • Composite score: 0.5 · (1 / max(WER, 0.05)) + 0.3 · emotion_acc + 0.2 · utmos_normalized (higher is better).
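The composite formula can be written as a small function. This is a sketch of the formula as stated: the source does not define how UTMOS is normalized or how per-prompt values are aggregated into the reported table, so utmos_normalized is taken as a ready-made input and the example uses toy numbers rather than values from eval_results.json.

```python
def composite_score(wer, emotion_acc, utmos_normalized, wer_floor=0.05):
    """0.5 * (1 / max(WER, floor)) + 0.3 * emotion_acc + 0.2 * utmos_normalized."""
    intelligibility = 1.0 / max(wer, wer_floor)
    return 0.5 * intelligibility + 0.3 * emotion_acc + 0.2 * utmos_normalized

# Toy inputs for illustration only.
print(round(composite_score(wer=0.10, emotion_acc=0.5, utmos_normalized=0.6), 3))  # 5.27

# Any WER at or below the 0.05 floor contributes the same amount.
print(composite_score(0.007, 0.5, 0.6) == composite_score(0.05, 0.5, 0.6))  # True
```

Because of the floor, the WER term dominates the score and saturates once WER drops below 0.05, so differences between low-WER checkpoints come from the other two terms.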

Known limitations

  • Small delta vs baseline on automatic metrics. This is a data-balance iteration, not a recipe change. The headline WER already matched baseline at T1; there isn't much room to improve on that axis without changing the data source or the training objective.
  • Emotion fidelity is not the focus. T2 trades some automatic-emotion-classification accuracy for naturalness and gender stability. The training data (TextrolSpeech) uses templated emotion descriptions, which is a known ceiling; later iterations move to corpora with richer free-form captions.
  • English only. All training and evaluation used English prompts + English text. The base model supports 10 languages; they are untouched but were not validated against this adapter's shifted CB-0 distribution.
  • Research / non-commercial only: see License.

License

  • Base model weights (Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign): Apache 2.0.
  • Training data (TextrolSpeech): CC BY-NC-SA 4.0 (research / non-commercial).

Because a substantial portion of this model's post-training behavior is shaped by TextrolSpeech, the derived model inherits the CC BY-NC-SA 4.0 constraint in practice: free to use for research, academic, and non-commercial purposes, with attribution and share-alike. Commercial deployment is not recommended without re-training on a commercially-licensed corpus.
