# Qwen3-TTS VoiceDesign — T1
A fine-tune of Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign targeting better intelligibility on natural-language voice-description prompts. The base model is steered at inference time by a free-form English instruct string describing gender, pitch, pace, and affect; T1 sharpens how faithfully the output follows that prompt while preserving the base model's decoding stack.
- Base: Qwen3-TTS-12Hz-1.7B-VoiceDesign (frozen during training, LoRA adapter merged into this repo)
- Method: LoRA on the Talker's attention + MLP projections, merged back into the base weights
- Training data: TextrolSpeech (12,000 clips, stratified by pitch × speed × situational × gender)
- Output: 24 kHz mono wav via the Qwen3 12 Hz multi-codebook codec
This repo is self-contained — it ships the merged transformer weights, the audio codec (`speech_tokenizer/`), the tokenizer, and all configs. You do not need to pull any other HF repo at inference time.
## Results on the 16-prompt in-distribution eval
Eval format: natural-language voice-design prompts covering 8 emotions × 2 genders × 3 pitch/speed combinations. ASR: `openai/whisper-base`.
| rank | checkpoint | Composite | WER | Notes |
|---|---|---|---|---|
| 1 | T1 (this repo) | 9.219 | 0.017 | merged from checkpoint-1000 |
| 2 | T1 checkpoint-1430 | 9.188 | 0.019 | within noise of ckpt-1000 |
| 3 | base (no adapter) | 7.995 | 0.051 | reference |
WER dropped from 5.11% to 1.74% (a 66% relative reduction), and the composite score improved by 15.3%. No regression was observed on any evaluated checkpoint.
Per-prompt wavs and raw metrics (`summary.json`, `comparison.json`) can be reproduced with the same 16 prompts at `temperature=0.9, top_k=50, top_p=1.0, repetition_penalty=1.05, max_new_tokens=600`.
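The WER numbers above are token-level: Levenshtein distance between reference and ASR-hypothesis transcripts, divided by reference length. A minimal pure-Python sketch of that metric (the eval script's exact tokenization and text normalization are assumptions):

```python
def wer(ref: str, hyp: str) -> float:
    """Token-level word error rate: Levenshtein distance over
    whitespace tokens, divided by the reference length."""
    r, h = ref.lower().split(), hyp.lower().split()
    d = list(range(len(h) + 1))  # DP row: distance to each hyp prefix
    for i, rt in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, ht in enumerate(h, 1):
            # prev holds d[i-1][j-1]; d[j] still holds d[i-1][j]
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (rt != ht))
    return d[len(h)] / max(len(r), 1)
```

For example, `wer("a b c d", "a x c d")` is 0.25 (one substitution over four reference tokens).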
## Quick start
Install the Qwen3-TTS inference package (it registers the custom Qwen3TTSForConditionalGeneration model class with transformers):
```bash
pip install qwen-tts transformers torch soundfile
```
Generate a clip:
```python
from qwen_tts import Qwen3TTSModel
import soundfile as sf

wrap = Qwen3TTSModel.from_pretrained("macminix/qwen3_voice_design_t1")
wavs, sr = wrap.generate_voice_design(
    text="Come and look at this, you are not going to believe it.",
    instruct="A male speaker delivers his happy speech at a moderate pace with standard energy.",
    language="english",
    temperature=0.9, top_k=50, top_p=1.0,
    repetition_penalty=1.05, max_new_tokens=600,
)
sf.write("out.wav", wavs[0], sr)
```
A ready-to-run version is provided in `example_inference.py`.
## The instruct prompt format
The instruct field is free-form English describing the voice. The training distribution has four axes:
- gender — "a male/female speaker"
- pitch — "high/medium/low pitched", "deep", "high-pitched"
- speed — "slow/moderate/fast", "at a brisk pace"
- situational / emotion — "happy", "angry", "sad", "whispered", "contemptuous", etc.
Example prompts:
```text
A male speaker delivers his happy speech at a moderate pace with standard energy.
A female voice speaks slowly with a sad tone, low energy, almost whispering.
A low-pitched male narrator reads an exciting announcement at a fast pace.
```
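Prompts along these four axes can be composed programmatically for batch evaluation. A small sketch with a hypothetical `make_instruct` helper (not part of the training code; the templates are illustrative paraphrases in the style of the examples above):

```python
import itertools

# Hypothetical helper: compose an instruct string from the four
# training axes (gender, pitch, speed, emotion).
def make_instruct(gender: str, pitch: str, speed: str, emotion: str) -> str:
    return (
        f"A {gender} speaker delivers a {emotion} speech "
        f"in a {pitch}-pitched voice at a {speed} pace."
    )

# Cross the axes to cover the in-distribution prompt space.
prompts = [
    make_instruct(g, p, s, e)
    for g, p, s, e in itertools.product(
        ["male", "female"],
        ["low", "medium", "high"],
        ["slow", "moderate", "fast"],
        ["happy", "sad"],
    )
]
```

Each generated string can be passed directly as the `instruct` argument shown in the quick start.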
## How the adapter was trained
T1 is one iteration of a larger fine-tuning study. That study traced a regression in v1 of the same project (monotonically decreasing validation loss, yet audibly worse outputs at every checkpoint after step 500) to four concrete mistakes in the v1 training setup. T1 is the first clean run under the corrected protocol.
### Correctness fixes vs the naive recipe
| # | Problem in v1 | Fix applied in T1 |
|---|---|---|
| 1 | Training sequence built as a chat-formatted prompt followed by codec ids, with `torch.where` over a text/codec boundary. This does not match `Qwen3TTSForConditionalGeneration.generate`'s VoiceDesign path, which embeds `instruct_ids` through `text_embedding → text_projection` and sums them element-wise with a codec track of pad tokens. | Training-time `inputs_embeds` is built by the exact dual-track sum used by `generate` (instruct → text → codec; `think` / `think_bos` / `think_eos` / `codec_bos` placed on the codec track). |
| 2 | `labels=` passed into the forward; PEFT + `CausalLMLoss` shifts labels by 1 internally on top of the collator's own shift, producing a double-shifted target. | `labels=None`; loss computed manually: `F.cross_entropy(logits[:, :-1], codec_0_labels[:, 1:], ignore_index=-100)`. |
| 3 | LR 1e-4 on a 1.7B base. The Qwen team's own `sft_12hz_lora.py` reference uses 2e-6. | Peak LR 2.0e-5, cosine schedule, 200 warmup steps, `min_lr_ratio=0.1`. |
| 4 | Sub-talker loss weight 0.3 with the Code Predictor frozen, a corrupting gradient that produced 45 s of gibberish at every checkpoint. | `sub_talker_loss_weight=0.0`; only CB-0 is supervised via the Talker head. |
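The dual-track construction in fix #1 can be illustrated with toy embedding tables. The real model runs instruct/text ids through `text_embedding → text_projection`; here both tracks use random stand-in tables, so only the structure (an element-wise sum of two aligned tracks) is faithful:

```python
import numpy as np

rng = np.random.default_rng(0)
H = 8                                    # toy hidden size
text_table = rng.normal(size=(32, H))    # stand-in for text_embedding -> text_projection
codec_table = rng.normal(size=(16, H))   # stand-in codec embedding table

TEXT_PAD, CODEC_PAD = 0, 0
# One sequence: instruct+text ids on the text track; pad tokens under
# the text span, then codec ids, on the codec track.
text_ids = [5, 6, 7, 8, TEXT_PAD, TEXT_PAD, TEXT_PAD]
codec_ids = [CODEC_PAD] * 4 + [3, 9, 2]

# inputs_embeds[t] = text_embed(text_ids[t]) + codec_embed(codec_ids[t])
inputs_embeds = text_table[text_ids] + codec_table[codec_ids]
assert inputs_embeds.shape == (len(text_ids), H)
```

Matching this sum at training time is what makes the learned distribution line up with the embeddings `generate` actually feeds the Talker.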
## Training hyperparameters
| block | setting |
|---|---|
| Base model | Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign |
| Precision | bf16 |
| Attention impl | sdpa |
| LoRA r / α / dropout | 16 / 32 / 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| LoRA scope | Talker only (Code Predictor frozen) |
| Trainable parameters | 19.2 M (0.99 % of 1.9 B) |
| Dataset | TextrolSpeech, 12,000 clips stratified by pitch × speed × situational × gender |
| Train / val / eval split | 11,440 / 360 / 200 |
| Prompt paraphrases per clip | up to 3 |
| Epochs | 2 |
| Batch size | 4 × 4 = 16 effective |
| Max sequence length | 1,536 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, weight_decay=0.01) |
| Grad clip | 1.0 |
| Loss | manual CB-0 cross-entropy (no HF double-shift) |
| Total optimizer steps | 1,430 |
| Hardware | Single RTX 4090 (24 GB) |
| Wall-clock | ~36 min |
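The "manual CB-0 cross-entropy (no HF double-shift)" loss can be made concrete with a toy numpy stand-in for the `F.cross_entropy(logits[:, :-1], codec_0_labels[:, 1:], ignore_index=-100)` call: logits at position t are scored against the label at t+1, exactly once.

```python
import numpy as np

def xent_single_shift(logits, labels, ignore_index=-100):
    """Next-token cross-entropy with exactly one shift:
    logits[:, :-1] predict labels[:, 1:]. Toy numpy stand-in for the
    training-time F.cross_entropy call, not the training code itself."""
    lp, tgt = logits[:, :-1], labels[:, 1:]
    mask = tgt != ignore_index
    z = lp - lp.max(axis=-1, keepdims=True)            # stable log-softmax
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(logp, np.maximum(tgt, 0)[..., None], axis=-1)[..., 0]
    return -(picked * mask).sum() / mask.sum()

# A model that already predicts the next token perfectly gets ~0 loss;
# shifting the labels a second time (the v1 bug) would misalign every target.
labels = np.array([[0, 1, 2, 3]])
logits = np.zeros((1, 4, 4))
for t in range(3):
    logits[0, t, labels[0, t + 1]] = 10.0
assert xent_single_shift(logits, labels) < 0.01
```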
## LoRA adapter (as trained, before merge)
The adapter before merging was 77 MB (~19.2 M fp32 parameters). That adapter has been permanently folded into the model's Talker weights in this repo; the user-facing file is a single standalone model. If you need the un-merged LoRA adapter for composition or research, it is preserved at a separate location.
## Evaluation details
- Prompts: 16 natural-language voice-description prompts covering 8 emotions (happy, angry, sad, disgusted, afraid, contempt, neutral, surprised) × 2 genders × 3 pitch/speed buckets.
- Decoder: `temperature=0.9, top_k=50, top_p=1.0, repetition_penalty=1.05, do_sample=True, max_new_tokens=600`.
- ASR: `openai/whisper-base` via the `transformers` pipeline.
- Composite score: `0.5 · (1 / max(WER, 0.05)) + 0.3 · emotion_acc + 0.2 · utmos_normalized`. The T1 eval did not install the emotion classifier or UTMOS; those columns are unreported.
WER is the fraction of tokens wrong; lower is better. Composite is higher-is-better and dominated by the 1/WER term when WER is small; the `max(WER, 0.05)` floor caps that term's contribution at 10.
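The composite formula can be written directly. In this sketch `emotion_acc` and `utmos_normalized` default to zero, since those components were not installed in the T1 eval (this is an illustration of the formula, not the eval script):

```python
def composite(wer: float, emotion_acc: float = 0.0, utmos_normalized: float = 0.0) -> float:
    """Composite eval score: WER term (floored at 0.05, so capped at 10)
    plus weighted emotion accuracy and normalized UTMOS."""
    return 0.5 * (1.0 / max(wer, 0.05)) + 0.3 * emotion_acc + 0.2 * utmos_normalized
```

For example, `composite(0.1)` is 5.0, and any WER at or below the 0.05 floor contributes the same capped value of 10.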
## Known limitations
- Gender × pitch confound. Prompts like "a male speaker at a high pitch" or occasionally "a male speaker at a low pitch" can produce a female-timbre voice. Root cause: the training corpus's gender distribution per pitch bucket is imbalanced — the model partially treats pitch as a proxy for gender. Not fixed in this checkpoint.
- Emotion intensity is mixed. Some emotion prompts produce clearly stronger delivery than the base; others are roughly comparable. Emotion metrics were not logged at training time, so emotional fidelity is a subjective assessment rather than a tracked metric.
- English only. All training and evaluation used English prompts + English text. The base model supports 10 languages; they are untouched but were not validated against this adapter's shifted CB-0 distribution.
- Research / non-commercial only — see license.
## License
- Base model weights (`Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign`): Apache 2.0.
- Training data (TextrolSpeech): CC BY-NC-SA 4.0 (research / non-commercial).
Because a substantial portion of this model's post-training behavior is shaped by TextrolSpeech, the derived model inherits the CC BY-NC-SA 4.0 constraint in practice: free to use for research, academic, and non-commercial purposes, with attribution and share-alike. Commercial deployment is not recommended without re-training on a commercially-licensed corpus.
## References
- Base model: Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
- Inference library: `qwen-tts` on PyPI
- Upstream LoRA training reference: `QwenLM/Qwen3-TTS/finetuning/sft_12hz_lora.py`
- Training data: TextrolSpeech (Ji et al., 2023)