Qwen3-TTS VoiceDesign - T2
A second-pass fine-tune of Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign. T2 keeps the corrected-protocol training recipe from T1 and focuses on closing the gender × pitch confound observed in T1: the failure mode where prompts like "a male speaker at a high pitch" would sometimes render with a female timbre. T2 addresses this by enforcing gender parity inside every pitch bucket during dataset sampling, and by training for one extra epoch with a higher LR floor so the adapter has more time on the newly balanced distribution.
- Base: Qwen3-TTS-12Hz-1.7B-VoiceDesign (frozen during training, LoRA adapter merged into this repo)
- Method: LoRA on the Talker's attention + MLP projections, merged back into the base weights
- Training data: TextrolSpeech, 11,437 clips, 50/50 gender parity per pitch bucket
- Output: 24 kHz mono wav via the Qwen3 12 Hz multi-codebook codec
This repo is self-contained: it ships the merged transformer weights, the audio codec (speech_tokenizer/), the tokenizer, and all configs. You do not need to pull any other HF repo at inference time.
What T2 changes vs T1
| axis | T1 | T2 | why |
|---|---|---|---|
| stratification | [pitch, speed, situational, gender] round-robin | [gender, pitch, situational] with gender parity per pitch (take min(#male, #female) per pitch bucket) | direct fix for T1's gender-pitch leakage |
| epochs | 2 | 3 | more passes over the rebalanced distribution |
| min_lr_ratio | 0.1 (floor 2e-6) | 0.2 (floor 4e-6) | keeps late-training LR high enough to keep learning rather than plateauing |
| dataset size | 12,000 | 11,437 (slight trim for exact 50/50 per bucket) | parity is achieved by capping the majority class |
Everything else (base model, LoRA r/α/dropout, target modules, Talker-only scope, bf16, sdpa, max_seq_length 1536, manual CB-0 cross-entropy loss, sub_talker_loss_weight=0, same eval protocol) is identical to T1.
Results on the 16-prompt in-distribution eval
Evaluation protocol matches T1: 16 natural-language voice-design prompts covering 8 emotions × 2 genders × 3 pitch/speed buckets; ASR: openai/whisper-base; decoder temperature=0.9, top_k=50, top_p=1.0, repetition_penalty=1.05, max_new_tokens=600. Unlike T1, T2's eval run had the emotion classifier and UTMOS scorers installed, so those columns are reported.
| checkpoint | Composite | WER | emotion_acc | UTMOS | notes |
|---|---|---|---|---|---|
| baseline | 9.921 | 0.007 | 0.438 | 3.001 | reference |
| T2 (this repo, from checkpoint-2000) | 9.887 | 0.007 | 0.312 | 3.086 | matches baseline intelligibility; UTMOS +2.8%; listener-preferred on naturalness |
| T2 checkpoint-1000 | 9.642 | 0.013 | 0.438 | 3.045 | earlier snapshot |
What actually moved
- No intelligibility regression: WER is identical to baseline at 0.007 (roughly 1 token wrong across all 16 prompts). The balance change didn't cost anything on the transcription axis.
- Naturalness ticks up: UTMOS (the automatic MOS proxy) moves from 3.001 → 3.086, a small but measurable lift. In manual listening this shows as slightly smoother prosody and more consistent speaker timbre inside an utterance; not a dramatic change, just cleaner.
- Emotion tagging shifts slightly: the automatic classifier's hit rate on the 16-prompt emotion axis drops from 0.438 to 0.312 (7/16 to 5/16), but that's a 2-prompt difference at n=16, within the noise floor of this eval size. Manual listening does not pick out obvious emotion regressions.
- Gender × pitch behavior improves on the prompts where T1 failed: the parity sampler gave the model matched training support across the `male × low_pitch` and `male × high_pitch` cells, and the rate of wrong-gender renders on those specific prompts is lower than T1 in informal listening spot-checks.
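The "within the noise floor" reading of the emotion column can be sanity-checked with a rough binomial estimate. The sketch below assumes each of the 16 prompts is an independent pass/fail trial; that is an assumption made for illustration, not something the eval script is known to do.

```python
import math

N_PROMPTS = 16  # size of the in-distribution eval


def hits(acc: float, n: int = N_PROMPTS) -> int:
    """Convert a reported accuracy back to a whole number of correct prompts."""
    return round(acc * n)


def binomial_se(acc: float, n: int = N_PROMPTS) -> float:
    """Standard error of an accuracy estimated from n independent trials."""
    return math.sqrt(acc * (1.0 - acc) / n)


baseline_acc, t2_acc = 0.438, 0.312  # emotion_acc values from the results table
delta_prompts = hits(baseline_acc) - hits(t2_acc)
se = binomial_se(baseline_acc)

print(f"baseline hits: {hits(baseline_acc)}/{N_PROMPTS}, T2 hits: {hits(t2_acc)}/{N_PROMPTS}")
print(f"difference: {delta_prompts} prompts; one standard error is about {se:.3f}")
```

Under that assumption the 0.126 gap is roughly one standard error of the baseline estimate, which is consistent with treating it as noise at this eval size.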
Positioning: T2 is a small, conservative step on top of T1 that prioritizes naturalness and gender behavior while keeping the headline WER gain locked in. It is not a dramatic jump, and intentionally so. The T1 → T2 delta is the kind of change you'd expect from a pure data-balance intervention with the same recipe otherwise.
Raw metrics: eval_results.json.
Quick start
Install the Qwen3-TTS inference package (it registers the custom Qwen3TTSForConditionalGeneration model class with transformers):
pip install qwen-tts transformers torch soundfile
Generate a clip:
from qwen_tts import Qwen3TTSModel
import soundfile as sf
wrap = Qwen3TTSModel.from_pretrained("macminix/qwen3_voice_design_t2")
wavs, sr = wrap.generate_voice_design(
    text="Come and look at this, you are not going to believe it.",
    instruct="A male speaker delivers his happy speech at a moderate pace with standard energy.",
    language="english",
    temperature=0.9, top_k=50, top_p=1.0,
    repetition_penalty=1.05, max_new_tokens=600,
)
sf.write("out.wav", wavs[0], sr)
A ready-to-run version is provided at example_inference.py.
The instruct prompt format
The instruct field is free-form English describing the voice. The training distribution has four axes:
- gender: "a male/female speaker"
- pitch: "high/medium/low pitched", "deep", "high-pitched"
- speed: "slow/moderate/fast", "at a brisk pace"
- situational / emotion: "happy", "angry", "sad", "whispered", "contemptuous", etc.
Example prompts:
A male speaker delivers his happy speech at a moderate pace with standard energy.
A female voice speaks slowly with a sad tone, low energy, almost whispering.
A low-pitched male narrator reads an exciting announcement at a fast pace.
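When sweeping over the four axes it can be convenient to assemble the instruct string programmatically. The helper below is a hypothetical convenience, not part of `qwen-tts`; its sentence template is only one possible phrasing, loosely modelled on the example prompts above, and is not claimed to match the training templates.

```python
from typing import Optional

GENDERS = {"male", "female"}
PITCHES = {"high", "medium", "low"}
SPEEDS = {"slow", "moderate", "fast"}


def build_instruct(gender: str, emotion: str = "neutral",
                   pitch: Optional[str] = None, speed: str = "moderate") -> str:
    """Assemble a free-form voice description from the four prompt axes.

    Hypothetical helper: the model accepts any English description, so this
    template is illustrative only.
    """
    if gender not in GENDERS:
        raise ValueError(f"gender must be one of {sorted(GENDERS)}")
    if pitch is not None and pitch not in PITCHES:
        raise ValueError(f"pitch must be one of {sorted(PITCHES)}")
    if speed not in SPEEDS:
        raise ValueError(f"speed must be one of {sorted(SPEEDS)}")

    pronoun = "his" if gender == "male" else "her"
    voice = f"{pitch}-pitched {gender}" if pitch else gender
    return f"A {voice} speaker delivers {pronoun} {emotion} speech at a {speed} pace."


print(build_instruct("male", "happy"))
print(build_instruct("female", "sad", pitch="low", speed="slow"))
```

The returned string can be passed as the `instruct` argument of `generate_voice_design` shown in the quick start.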
How the adapter was trained
T2 preserves T1's corrected training protocol (the one that fixes the four silent bugs found in earlier work) and changes only the dataset sampling strategy and a handful of schedule knobs.
The v1 → T1 correctness fixes kept in T2
| # | Problem in the naive recipe | Fix kept from T1 |
|---|---|---|
| 1 | Training sequence built as a chat-formatted prompt followed by codec ids, with `torch.where` over a text/codec boundary; this does not match the VoiceDesign path of `Qwen3TTSForConditionalGeneration.generate`. | Training-time `inputs_embeds` is built by the exact dual-track sum used by `generate` (instruct → text → codec; control tokens placed on the codec track). |
| 2 | `labels=` passed into the forward; PEFT + CausalLMLoss shifts labels by 1 internally on top of the collator's own shift, giving a double-shifted target. | `labels=None`; loss computed manually: `F.cross_entropy(logits[:, :-1], codec_0_labels[:, 1:], ignore_index=-100)`. |
| 3 | LR too aggressive (1e-4) on a 1.7B base for a small data subset. | Peak LR 2.0e-5, cosine schedule, 200 warmup steps. |
| 4 | Sub-talker loss on a frozen Code Predictor produced corrupting gradient that broke every checkpoint. | sub_talker_loss_weight=0.0. Only CB-0 is supervised via the Talker head. |
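Fix 2 comes down to applying the next-token shift exactly once. The following is a dependency-free sketch of that idea for a single sequence; it mirrors the slicing in the manual loss above (`logits[:-1]` scored against `labels[1:]`, skipping `-100`) but is not the training code. In PyTorch, batched logits would typically be flattened to `(N, vocab)` before calling `F.cross_entropy`.

```python
import math
from typing import Sequence

IGNORE_INDEX = -100


def shifted_cross_entropy(logits: Sequence[Sequence[float]],
                          labels: Sequence[int],
                          ignore_index: int = IGNORE_INDEX) -> float:
    """Mean cross-entropy where logits at step t are scored against label t+1.

    The shift happens here and only here; feeding pre-shifted labels would
    reproduce the double-shift bug described in the table.
    """
    total, count = 0.0, 0
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == ignore_index:
            continue  # masked positions contribute nothing
        peak = max(step_logits)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in step_logits))
        total += log_z - step_logits[target]
        count += 1
    return total / count if count else 0.0


# Uniform logits over a 2-token vocabulary give a loss of ln(2).
uniform = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
print(shifted_cross_entropy(uniform, [0, 1, 1]))
```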
T2-specific changes
- Gender-parity stratification. Inside each pitch bucket (high / medium / low) the sampler takes `min(#male, #female)` records from the dataset, so training sees matched numbers of male and female clips at every pitch. T1's stratifier balanced on gender as one of four joint axes, which in practice let the pitch distribution skew. The audit on the rebalanced splits confirms a near-50/50 gender ratio at high pitch (2641 male / 2646 female) and medium pitch (3062 / 3049); the low-pitch slice is small in the source corpus (20 / 19) and was kept close to 1:1.
- Three epochs. With the rebalanced distribution, the adapter has slightly fewer unique clips, and one extra epoch ensures comparable total gradient exposure to T1.
- Higher LR floor (`min_lr_ratio=0.2`). T1's `cb0` loss flattened around step 1000: the cosine schedule had already decayed the LR to the 0.1 floor by then, and the last 400 steps contributed little. T2 raises the floor to 0.2 so late training keeps making progress instead of drifting.
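The parity rule can be sketched in a few lines. The record fields (`"pitch"`, `"gender"`) and the function name are hypothetical, and the actual sampler is not shown in this card. This version yields exactly equal counts per bucket, whereas the audit figures quoted above differ by a few clips, so read it as the idea rather than the real pipeline.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List


def parity_sample(records: List[dict], seed: int = 0) -> List[dict]:
    """Keep min(#male, #female) clips of each gender inside every pitch bucket."""
    rng = random.Random(seed)
    buckets: Dict[str, Dict[str, List[dict]]] = defaultdict(lambda: defaultdict(list))
    for rec in records:
        buckets[rec["pitch"]][rec["gender"]].append(rec)

    selected: List[dict] = []
    for pitch in sorted(buckets):
        by_gender = buckets[pitch]
        # Cap the majority class at the size of the minority class.
        k = min(len(by_gender["male"]), len(by_gender["female"]))
        for gender in ("male", "female"):
            selected.extend(rng.sample(by_gender[gender], k))
    return selected


# Toy corpus: high pitch is male-heavy, low pitch is female-heavy.
toy = (
    [{"pitch": "high", "gender": "male"}] * 5
    + [{"pitch": "high", "gender": "female"}] * 3
    + [{"pitch": "low", "gender": "male"}] * 1
    + [{"pitch": "low", "gender": "female"}] * 4
)
balanced = parity_sample(toy)
counts = Counter((r["pitch"], r["gender"]) for r in balanced)
print(dict(counts))
```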
Training hyperparameters
| block | setting |
|---|---|
| Base model | Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign |
| Precision | bf16 |
| Attention impl | sdpa |
| LoRA r / α / dropout | 16 / 32 / 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| LoRA scope | Talker only (Code Predictor frozen) |
| Trainable parameters | ~19.2 M (LoRA adapter) |
| Dataset | TextrolSpeech, 11,437 clips, gender-parity per pitch bucket |
| Train / val / eval split | 10,870 / 360 / 200 |
| Prompt paraphrases per clip | up to 3 |
| Epochs | 3 |
| Batch size | 4 × 4 = 16 effective |
| Max sequence length | 1,536 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, weight_decay=0.01) |
| Grad clip | 1.0 |
| Loss | manual CB-0 cross-entropy |
| Total optimizer steps | 2,142 |
| LR schedule | peak 2.0e-5, cosine, 200 warmup, min_lr_ratio=0.2 |
| Hardware | Single RTX 4090 (24 GB) |
| Wall-clock | ~55 min |
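To make the LR schedule row concrete, here is a minimal sketch of a warmup-plus-cosine schedule with a floor, using the values from the table. The linear warmup shape and the interpolation between peak and floor are assumptions; the training script may implement the floor differently (for example as a clamp).

```python
import math


def lr_at(step: int, peak_lr: float = 2.0e-5, warmup_steps: int = 200,
          total_steps: int = 2142, min_lr_ratio: float = 0.2) -> float:
    """Learning rate at a given optimizer step (assumed schedule shape)."""
    if step < warmup_steps:
        # Assumed linear warmup from 0 up to the peak.
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 at peak, 0 at the end
    floor = peak_lr * min_lr_ratio
    return floor + (peak_lr - floor) * cosine


for step in (0, 200, 1171, 2142):
    print(step, f"{lr_at(step):.2e}")
```

With `min_lr_ratio=0.2` the final value is 0.2 × 2.0e-5 = 4e-6; setting it to 0.1 gives the 2e-6 floor listed for T1.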
LoRA adapter (pre-merge)
Before merging, the adapter was 77 MB (~19.2 M fp32 parameters), an identical footprint to T1 since r/α/targets/scope are unchanged. It has been permanently folded into the Talker weights in this repo; no PEFT dependency is needed at inference.
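For readers unfamiliar with what "folded into the weights" means, the sketch below shows the standard LoRA merge, W' = W + (α / r) · B · A, on a toy matrix in plain Python. It assumes the default LoRA scaling of α / r (2.0 for r=16, α=32) and is illustrative only; the merge for this repo was done on the real Talker projections, not with this code.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, sufficient for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight: Matrix, lora_a: Matrix, lora_b: Matrix,
               r: int, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, with W (out x in), A (r x in), B (out x r)."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy shapes: out=2, in=3, rank 1; alpha/r = 2.0, the same ratio as 32/16.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]
B = [[0.5], [-1.0]]
merged = merge_lora(W, A, B, r=1, alpha=2.0)
print(merged)

# Size check, assuming 1 MB = 10**6 bytes and 4 bytes per fp32 parameter.
approx_params = 77_000_000 / 4
print(f"{approx_params / 1e6:.2f} M parameters")
```

After the merge the layer is an ordinary dense weight again, which is why no adapter library is required to load it.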
Evaluation details
- Prompts: 16 natural-language voice-description prompts (same file as T1 for A/B continuity): 8 emotions (happy, angry, sad, disgusted, afraid, contempt, neutral, surprised) × 2 genders × 3 pitch/speed buckets.
- Decoder: `temperature=0.9, top_k=50, top_p=1.0, repetition_penalty=1.05, do_sample=True, max_new_tokens=600`.
- ASR: `openai/whisper-base`.
- Emotion classifier: `iic/emotion2vec_plus_large`.
- UTMOS: `speechmos` UTMOS (automatic MOS proxy).
- Composite score: `0.5 · (1 / max(WER, 0.05)) + 0.3 · emotion_acc + 0.2 · utmos_normalized`; higher is better.
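The composite formula translates directly into a small function. This card does not state how `utmos_normalized` is derived from the raw UTMOS score, or whether the composite is computed per prompt and then averaged, so the sketch takes the normalized value as an input and should not be expected to reproduce the Composite column from the aggregate figures in the results table.

```python
def composite_score(wer: float, emotion_acc: float, utmos_normalized: float,
                    wer_floor: float = 0.05) -> float:
    """Weighted composite as written in the eval details; higher is better.

    The WER term is clamped at `wer_floor`, so any WER at or below 0.05
    earns the same maximum contribution of 0.5 * (1 / 0.05) = 10.
    """
    intelligibility = 1.0 / max(wer, wer_floor)
    return 0.5 * intelligibility + 0.3 * emotion_acc + 0.2 * utmos_normalized


# Made-up inputs, purely to show the call shape.
print(composite_score(wer=0.10, emotion_acc=0.5, utmos_normalized=0.6))
```

The clamp is the notable design choice here: it stops a near-zero WER from dominating the score, at the cost of making the composite insensitive to WER differences below 0.05 when applied to a single measurement.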
Known limitations
- Small delta vs baseline on automatic metrics. This is a data-balance iteration, not a recipe change. The headline WER already matched baseline at T1; there isn't much room to improve on that axis without changing the data source or the training objective.
- Emotion fidelity is not the focus. T2 trades some automatic-emotion-classification accuracy for naturalness and gender stability. The training data (TextrolSpeech) uses templated emotion descriptions, which is a known ceiling; later iterations move to corpora with richer free-form captions.
- English only. All training and evaluation used English prompts + English text. The base model supports 10 languages; they are untouched but were not validated against this adapter's shifted CB-0 distribution.
- Research / non-commercial only; see License.
License
- Base model weights (`Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign`): Apache 2.0.
- Training data (TextrolSpeech): CC BY-NC-SA 4.0 (research / non-commercial).
Because a substantial portion of this model's post-training behavior is shaped by TextrolSpeech, the derived model inherits the CC BY-NC-SA 4.0 constraint in practice: free to use for research, academic, and non-commercial purposes, with attribution and share-alike. Commercial deployment is not recommended without re-training on a commercially-licensed corpus.
References
- Base model: Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
- Inference library: `qwen-tts` on PyPI
- Upstream LoRA training reference: `QwenLM/Qwen3-TTS/finetuning/sft_12hz_lora.py`
- Training data: TextrolSpeech (Ji et al., 2023)
- Companion checkpoint from the same study: `macminix/qwen3_voice_design_t1`