Qwen3-TTS VoiceDesign — T1

A fine-tune of Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign targeting better intelligibility on natural-language voice-description prompts. The base model is steered at inference time by a free-form English instruct string describing gender, pitch, pace, and affect; T1 sharpens how faithfully the output follows that prompt while preserving the base model's decoding stack.

  • Base: Qwen3-TTS-12Hz-1.7B-VoiceDesign (frozen during training, LoRA adapter merged into this repo)
  • Method: LoRA on the Talker's attention + MLP projections, merged back into the base weights
  • Training data: TextrolSpeech (12,000 clips, stratified by pitch × speed × situational × gender)
  • Output: 24 kHz mono wav via the Qwen3 12 Hz multi-codebook codec

This repo is self-contained — it ships the merged transformer weights, the audio codec (speech_tokenizer/), the tokenizer, and all configs. You do not need to pull any other HF repo at inference time.

Results on the 16-prompt in-distribution eval

Eval format: 16 natural-language voice-design prompts spanning 8 emotions, 2 genders, and 3 pitch/speed buckets. ASR: openai/whisper-base.

Rank  Checkpoint           Composite  WER    Notes
1     T1 (this repo)       9.219      0.017  merged from checkpoint-1000
2     T1 checkpoint-1430   9.188      0.019  within noise of ckpt-1000
3     base (no adapter)    7.995      0.051  reference

WER dropped from 5.11% to 1.74% (a 66% relative reduction) and the composite score improved by 15.3%. No regression was observed at any evaluated checkpoint.

Per-prompt wavs and raw metrics (summary.json, comparison.json) are reproducible with the same 16 prompts at temperature=0.9, top_k=50, top_p=1.0, repetition_penalty=1.05, max_new_tokens=600.
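A minimal reproduction sketch under the decoding settings above. The PROMPTS list is an abbreviated stand-in for the 16 (text, instruct) pairs, and the summary.json schema shown here is illustrative rather than the exact format shipped with the eval.

import json
import soundfile as sf
from qwen_tts import Qwen3TTSModel

# Abbreviated stand-in for the 16 (text, instruct) eval pairs.
PROMPTS = [
    ("Come and look at this, you are not going to believe it.",
     "A male speaker delivers his happy speech at a moderate pace with standard energy."),
    # ... remaining 15 prompts ...
]

wrap = Qwen3TTSModel.from_pretrained("macminix/qwen3_voice_design_t1")

summary = []
for i, (text, instruct) in enumerate(PROMPTS):
    wavs, sr = wrap.generate_voice_design(
        text=text, instruct=instruct, language="english",
        temperature=0.9, top_k=50, top_p=1.0,
        repetition_penalty=1.05, max_new_tokens=600,
    )
    path = f"eval_{i:02d}.wav"
    sf.write(path, wavs[0], sr)
    summary.append({"prompt": instruct, "text": text, "wav": path})

with open("summary.json", "w") as f:
    json.dump(summary, f, indent=2)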

Quick start

Install the Qwen3-TTS inference package (it registers the custom Qwen3TTSForConditionalGeneration model class with transformers):

pip install qwen-tts transformers torch soundfile

Generate a clip:

from qwen_tts import Qwen3TTSModel
import soundfile as sf

# Load the merged T1 checkpoint; no separate adapter or base download is needed.
wrap = Qwen3TTSModel.from_pretrained("macminix/qwen3_voice_design_t1")

# Same decoding settings as the eval above.
wavs, sr = wrap.generate_voice_design(
    text="Come and look at this, you are not going to believe it.",
    instruct="A male speaker delivers his happy speech at a moderate pace with standard energy.",
    language="english",
    temperature=0.9, top_k=50, top_p=1.0,
    repetition_penalty=1.05, max_new_tokens=600,
)

# wavs is a list of waveforms; sr is the codec sample rate (24 kHz).
sf.write("out.wav", wavs[0], sr)

A ready-to-run version of this snippet is provided in example_inference.py.

The instruct prompt format

The instruct field is free-form English describing the voice. The training distribution has four axes:

  • gender — "a male/female speaker"
  • pitch — "high/medium/low pitched", "deep", "high-pitched"
  • speed — "slow/moderate/fast", "at a brisk pace"
  • situational / emotion — "happy", "angry", "sad", "whispered", "contemptuous", etc.

Example prompts:

A male speaker delivers his happy speech at a moderate pace with standard energy.
A female voice speaks slowly with a sad tone, low energy, almost whispering.
A low-pitched male narrator reads an exciting announcement at a fast pace.
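For illustration only, a small helper (not part of the qwen_tts package) that assembles an instruct string from the four axes above, mirroring the phrasing of the example prompts:

def build_instruct(gender: str, pitch: str, speed: str, emotion: str) -> str:
    """Compose a voice-design prompt from the four training axes.

    gender:  "male" or "female"
    pitch:   e.g. "low-pitched", "high-pitched"
    speed:   e.g. "slow", "moderate", "fast"
    emotion: e.g. "happy", "sad", "angry", "whispered"
    """
    return (
        f"A {pitch} {gender} speaker delivers "
        f"{'his' if gender == 'male' else 'her'} {emotion} speech "
        f"at a {speed} pace."
    )

# build_instruct("male", "low-pitched", "fast", "excited")
# -> "A low-pitched male speaker delivers his excited speech at a fast pace."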

How the adapter was trained

T1 was one iteration in a larger fine-tuning study that traced a regression in v1 of the same project (monotonically decreasing validation loss, yet audibly worse outputs at every checkpoint after step 500) to four concrete mistakes in the v1 recipe. T1 is the first clean training run under the corrected protocol.

Correctness fixes vs the naive recipe

1. Problem in v1: The training sequence was built as a chat-formatted prompt followed by codec ids, with torch.where over a text/codec boundary. This does not match Qwen3TTSForConditionalGeneration.generate's VoiceDesign path, which embeds instruct_ids through text_embedding → text_projection and sums them element-wise with a codec track of pad tokens.
   Fix in T1: Training-time inputs_embeds is built with the exact dual-track sum used by generate (instruct → text → codec; think / think_bos / think_eos / codec_bos placed on the codec track).

2. Problem in v1: labels= was passed into the forward; PEFT + CausalLMLoss shifts labels by 1 internally on top of the collator's own shift, producing a double-shifted target.
   Fix in T1: labels=None; the loss is computed manually as F.cross_entropy(logits[:, :-1], codec_0_labels[:, 1:], ignore_index=-100) (see the sketch after this list).

3. Problem in v1: LR 1e-4 on a 1.7B base, while the Qwen team's own sft_12hz_lora.py reference uses 2e-6.
   Fix in T1: Peak LR 2.0e-5, cosine schedule, 200 warmup steps, min_lr_ratio=0.1.

4. Problem in v1: Sub-talker loss weight 0.3 with the Code Predictor frozen, a corrupting gradient that produced 45 s of gibberish at every checkpoint.
   Fix in T1: sub_talker_loss_weight=0.0; only CB-0 is supervised via the Talker head.
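A minimal sketch of the fix-#2 loss referenced in item 2 above, assuming logits come from the Talker head over codebook 0 and that non-codec positions in codec_0_labels are already set to -100. Because labels are never passed to the forward, no framework-side shift can stack on top of this single manual shift.

import torch
import torch.nn.functional as F

def cb0_loss(logits: torch.Tensor, codec_0_labels: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy on codebook 0 only, shifted exactly once.

    logits:         (batch, seq_len, codec_vocab) from the Talker head
    codec_0_labels: (batch, seq_len) with -100 on non-codec positions
    """
    shift_logits = logits[:, :-1, :].contiguous()   # predict position t+1 from t
    shift_labels = codec_0_labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )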

Training hyperparameters

Setting                      Value
Base model                   Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
Precision                    bf16
Attention impl               sdpa
LoRA r / α / dropout         16 / 32 / 0.05
LoRA target modules          q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
LoRA scope                   Talker only (Code Predictor frozen)
Trainable parameters         19.2 M (0.99 % of 1.9 B)
Dataset                      TextrolSpeech, 12,000 clips stratified by pitch × speed × situational × gender
Train / val / eval split     11,440 / 360 / 200
Prompt paraphrases per clip  up to 3
Epochs                       2
Batch size                   4 × 4 = 16 effective
Max sequence length          1,536
Optimizer                    AdamW (β₁=0.9, β₂=0.95, weight_decay=0.01)
Grad clip                    1.0
Loss                         manual CB-0 cross-entropy (no HF double-shift)
Total optimizer steps        1,430
Hardware                     Single RTX 4090 (24 GB)
Wall-clock                   ~36 min
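The LoRA rows above map onto a standard peft configuration roughly as sketched below. Loading via AutoModel with trust_remote_code and the "talker." module-path prefix are assumptions about the model's layout, not the exact training code.

from transformers import AutoModel
from peft import LoraConfig, get_peft_model

base = AutoModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign", trust_remote_code=True
)

# Restrict LoRA to the Talker's attention + MLP projections via a module-path
# regex; the Code Predictor is never matched, so it stays frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=r"talker\..*\.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()   # roughly 19.2 M trainable (~1 % of the base)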

LoRA adapter (as trained, before merge)

The adapter before merging was 77 MB (~19.2 M fp32 parameters). It has been permanently folded into the model's Talker weights in this repo, so what ships here is a single standalone model. If you need the un-merged LoRA adapter for composition or research, it is preserved at a separate location.
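If you do work from the un-merged adapter, folding it back into the base is a standard peft merge along the lines of the sketch below. The adapter path is a placeholder, and whether the adapter attaches to the full model or only the Talker submodule depends on how it was saved.

from transformers import AutoModel
from peft import PeftModel

# Paths are placeholders: this repo already ships the merged weights, so this
# step only applies if you obtain the raw (un-merged) T1 adapter separately.
base = AutoModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign", trust_remote_code=True
)
merged = PeftModel.from_pretrained(base, "path/to/t1_lora_adapter").merge_and_unload()
merged.save_pretrained("qwen3_voice_design_t1_merged")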

Evaluation details

  • Prompts: 16 natural-language voice-description prompts spanning 8 emotions (happy, angry, sad, disgusted, afraid, contempt, neutral, surprised), 2 genders, and 3 pitch/speed buckets.
  • Decoder: temperature=0.9, top_k=50, top_p=1.0, repetition_penalty=1.05, do_sample=True, max_new_tokens=600.
  • ASR: openai/whisper-base via transformers pipeline.
  • Composite score: 0.5 · (1 / max(WER, 0.05)) + 0.3 · emotion_acc + 0.2 · utmos_normalized. The T1 eval did not install the emotion classifier or UTMOS; those columns are unreported.

WER is the word error rate (fraction of words transcribed incorrectly); lower is better. Composite is higher-is-better and dominated by the 1/WER term when WER is small (the max(WER, 0.05) clamp caps that term's contribution at 10).
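For reference, the composite formula translates directly into the function below; the emotion and UTMOS terms were not measured in the T1 eval, so only the WER term is grounded in the numbers reported above.

def composite_score(wer: float, emotion_acc: float, utmos_normalized: float) -> float:
    """Composite eval metric from the formula above; higher is better."""
    return 0.5 * (1.0 / max(wer, 0.05)) + 0.3 * emotion_acc + 0.2 * utmos_normalized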

Known limitations

  • Gender × pitch confound. Prompts like "a male speaker at a high pitch" or occasionally "a male speaker at a low pitch" can produce a female-timbre voice. Root cause: the training corpus's gender distribution per pitch bucket is imbalanced — the model partially treats pitch as a proxy for gender. Not fixed in this checkpoint.
  • Emotion intensity is mixed. Some emotion prompts produce clearly stronger delivery than the base; others are roughly comparable. Emotion metrics were not logged at training time, so emotional fidelity is a subjective assessment rather than a tracked metric.
  • English only. All training and evaluation used English prompts + English text. The base model supports 10 languages; they are untouched but were not validated against this adapter's shifted CB-0 distribution.
  • Research / non-commercial only — see license.

License

  • Base model weights (Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign): Apache 2.0.
  • Training data (TextrolSpeech): CC BY-NC-SA 4.0 (research / non-commercial).

Because a substantial portion of this model's post-training behavior is shaped by TextrolSpeech, the derived model inherits the CC BY-NC-SA 4.0 constraint in practice: free to use for research, academic, and non-commercial purposes, with attribution and share-alike. Commercial deployment is not recommended without re-training on a commercially-licensed corpus.
