Qwen3-TTS VoiceDesign — T1

A fine-tune of Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign targeting better intelligibility on natural-language voice-description prompts. The base model is steered at inference time by a free-form English instruct string describing gender, pitch, pace, and affect; T1 sharpens how faithfully the output follows that prompt while preserving the base model's decoding stack.

  • Base: Qwen3-TTS-12Hz-1.7B-VoiceDesign (frozen during training, LoRA adapter merged into this repo)
  • Method: LoRA on the Talker's attention + MLP projections, merged back into the base weights
  • Training data: TextrolSpeech (12,000 clips, stratified by pitch × speed × situational × gender)
  • Output: 24 kHz mono wav via the Qwen3 12 Hz multi-codebook codec

This repo is self-contained — it ships the merged transformer weights, the audio codec (speech_tokenizer/), the tokenizer, and all configs. You do not need to pull any other HF repo at inference time.

Results on the 16-prompt in-distribution eval

Eval format: 16 natural-language voice-design prompts spanning 8 emotions, 2 genders, and 3 pitch/speed buckets. ASR: openai/whisper-base.

Rank  Checkpoint           Composite  WER    Notes
1     T1 (this repo)       9.219      0.017  merged from checkpoint-1000
2     T1 checkpoint-1430   9.188      0.019  within noise of ckpt-1000
3     base (no adapter)    7.995      0.051  reference

WER dropped from 5.11% to 1.74% (a 66% relative reduction) and the composite score improved by 15.3%. No regression was observed at any evaluated checkpoint.

Per-prompt wavs and raw metrics (summary.json, comparison.json) are reproducible with the same 16 prompts at temperature=0.9, top_k=50, top_p=1.0, repetition_penalty=1.05, max_new_tokens=600.
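A minimal reproduction sketch under the decoding settings above. The PROMPTS list is an abbreviated stand-in for the 16 (text, instruct) pairs, and the summary.json schema shown here is illustrative rather than the exact format shipped with the eval.

import json
import soundfile as sf
from qwen_tts import Qwen3TTSModel

# Abbreviated stand-in for the 16 (text, instruct) eval pairs.
PROMPTS = [
    ("Come and look at this, you are not going to believe it.",
     "A male speaker delivers his happy speech at a moderate pace with standard energy."),
    # ... remaining 15 prompts ...
]

wrap = Qwen3TTSModel.from_pretrained("macminix/qwen3_voice_design_t1")

summary = []
for i, (text, instruct) in enumerate(PROMPTS):
    wavs, sr = wrap.generate_voice_design(
        text=text, instruct=instruct, language="english",
        temperature=0.9, top_k=50, top_p=1.0,
        repetition_penalty=1.05, max_new_tokens=600,
    )
    path = f"eval_{i:02d}.wav"
    sf.write(path, wavs[0], sr)
    summary.append({"prompt": instruct, "text": text, "wav": path})

with open("summary.json", "w") as f:
    json.dump(summary, f, indent=2)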

Quick start

Install the Qwen3-TTS inference package (it registers the custom Qwen3TTSForConditionalGeneration model class with transformers):

pip install qwen-tts transformers torch soundfile

Generate a clip:

from qwen_tts import Qwen3TTSModel
import soundfile as sf

# Load the merged T1 checkpoint; no separate adapter or base download is needed.
wrap = Qwen3TTSModel.from_pretrained("macminix/qwen3_voice_design_t1")

# Same decoding settings as the eval above.
wavs, sr = wrap.generate_voice_design(
    text="Come and look at this, you are not going to believe it.",
    instruct="A male speaker delivers his happy speech at a moderate pace with standard energy.",
    language="english",
    temperature=0.9, top_k=50, top_p=1.0,
    repetition_penalty=1.05, max_new_tokens=600,
)

# wavs is a list of waveforms; sr is the codec sample rate (24 kHz).
sf.write("out.wav", wavs[0], sr)

A ready-to-run version of this snippet is provided in example_inference.py.

The instruct prompt format

The instruct field is free-form English describing the voice. The training distribution has four axes:

  • gender — "a male/female speaker"
  • pitch — "high/medium/low pitched", "deep", "high-pitched"
  • speed — "slow/moderate/fast", "at a brisk pace"
  • situational / emotion — "happy", "angry", "sad", "whispered", "contemptuous", etc.

Example prompts:

A male speaker delivers his happy speech at a moderate pace with standard energy.
A female voice speaks slowly with a sad tone, low energy, almost whispering.
A low-pitched male narrator reads an exciting announcement at a fast pace.
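For illustration only, a small helper (not part of the qwen_tts package) that assembles an instruct string from the four axes above, mirroring the phrasing of the example prompts:

def build_instruct(gender: str, pitch: str, speed: str, emotion: str) -> str:
    """Compose a voice-design prompt from the four training axes.

    gender:  "male" or "female"
    pitch:   e.g. "low-pitched", "high-pitched"
    speed:   e.g. "slow", "moderate", "fast"
    emotion: e.g. "happy", "sad", "angry", "whispered"
    """
    return (
        f"A {pitch} {gender} speaker delivers "
        f"{'his' if gender == 'male' else 'her'} {emotion} speech "
        f"at a {speed} pace."
    )

# build_instruct("male", "low-pitched", "fast", "excited")
# -> "A low-pitched male speaker delivers his excited speech at a fast pace."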

How the adapter was trained

T1 was one iteration in a larger fine-tuning study that traced a regression in v1 of the same project (monotonically decreasing validation loss, yet audibly worse outputs at every checkpoint after step 500) to four concrete mistakes in the v1 recipe. T1 is the first clean training run under the corrected protocol.

Correctness fixes vs the naive recipe

1. Problem in v1: The training sequence was built as a chat-formatted prompt followed by codec ids, with torch.where over a text/codec boundary. This does not match Qwen3TTSForConditionalGeneration.generate's VoiceDesign path, which embeds instruct_ids through text_embedding → text_projection and sums them element-wise with a codec track of pad tokens.
   Fix in T1: Training-time inputs_embeds is built with the exact dual-track sum used by generate (instruct → text → codec; think / think_bos / think_eos / codec_bos placed on the codec track).

2. Problem in v1: labels= was passed into the forward; PEFT + CausalLMLoss shifts labels by 1 internally on top of the collator's own shift, producing a double-shifted target.
   Fix in T1: labels=None; the loss is computed manually as F.cross_entropy(logits[:, :-1], codec_0_labels[:, 1:], ignore_index=-100) (see the sketch after this list).

3. Problem in v1: LR 1e-4 on a 1.7B base, while the Qwen team's own sft_12hz_lora.py reference uses 2e-6.
   Fix in T1: Peak LR 2.0e-5, cosine schedule, 200 warmup steps, min_lr_ratio=0.1.

4. Problem in v1: Sub-talker loss weight 0.3 with the Code Predictor frozen, a corrupting gradient that produced 45 s of gibberish at every checkpoint.
   Fix in T1: sub_talker_loss_weight=0.0; only CB-0 is supervised via the Talker head.
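A minimal sketch of the fix-#2 loss referenced in item 2 above, assuming logits come from the Talker head over codebook 0 and that non-codec positions in codec_0_labels are already set to -100. Because labels are never passed to the forward, no framework-side shift can stack on top of this single manual shift.

import torch
import torch.nn.functional as F

def cb0_loss(logits: torch.Tensor, codec_0_labels: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy on codebook 0 only, shifted exactly once.

    logits:         (batch, seq_len, codec_vocab) from the Talker head
    codec_0_labels: (batch, seq_len) with -100 on non-codec positions
    """
    shift_logits = logits[:, :-1, :].contiguous()   # predict position t+1 from t
    shift_labels = codec_0_labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )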

Training hyperparameters

Setting                      Value
Base model                   Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
Precision                    bf16
Attention impl               sdpa
LoRA r / α / dropout         16 / 32 / 0.05
LoRA target modules          q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
LoRA scope                   Talker only (Code Predictor frozen)
Trainable parameters         19.2 M (0.99 % of 1.9 B)
Dataset                      TextrolSpeech, 12,000 clips stratified by pitch × speed × situational × gender
Train / val / eval split     11,440 / 360 / 200
Prompt paraphrases per clip  up to 3
Epochs                       2
Batch size                   4 × 4 = 16 effective
Max sequence length          1,536
Optimizer                    AdamW (β₁=0.9, β₂=0.95, weight_decay=0.01)
Grad clip                    1.0
Loss                         manual CB-0 cross-entropy (no HF double-shift)
Total optimizer steps        1,430
Hardware                     Single RTX 4090 (24 GB)
Wall-clock                   ~36 min
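The LoRA rows above map onto a standard peft configuration roughly as sketched below. Loading via AutoModel with trust_remote_code and the "talker." module-path prefix are assumptions about the model's layout, not the exact training code.

from transformers import AutoModel
from peft import LoraConfig, get_peft_model

base = AutoModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign", trust_remote_code=True
)

# Restrict LoRA to the Talker's attention + MLP projections via a module-path
# regex; the Code Predictor is never matched, so it stays frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=r"talker\..*\.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()   # roughly 19.2 M trainable (~1 % of the base)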

LoRA adapter (as trained, before merge)

The adapter before merging was 77 MB (~19.2 M fp32 parameters). It has been permanently folded into the model's Talker weights in this repo, so what ships here is a single standalone model. If you need the un-merged LoRA adapter for composition or research, it is preserved at a separate location.
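If you do work from the un-merged adapter, folding it back into the base is a standard peft merge along the lines of the sketch below. The adapter path is a placeholder, and whether the adapter attaches to the full model or only the Talker submodule depends on how it was saved.

from transformers import AutoModel
from peft import PeftModel

# Paths are placeholders: this repo already ships the merged weights, so this
# step only applies if you obtain the raw (un-merged) T1 adapter separately.
base = AutoModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign", trust_remote_code=True
)
merged = PeftModel.from_pretrained(base, "path/to/t1_lora_adapter").merge_and_unload()
merged.save_pretrained("qwen3_voice_design_t1_merged")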

Evaluation details

  • Prompts: 16 natural-language voice-description prompts spanning 8 emotions (happy, angry, sad, disgusted, afraid, contempt, neutral, surprised), 2 genders, and 3 pitch/speed buckets.
  • Decoder: temperature=0.9, top_k=50, top_p=1.0, repetition_penalty=1.05, do_sample=True, max_new_tokens=600.
  • ASR: openai/whisper-base via transformers pipeline.
  • Composite score: 0.5 · (1 / max(WER, 0.05)) + 0.3 · emotion_acc + 0.2 · utmos_normalized. The T1 eval did not install the emotion classifier or UTMOS; those columns are unreported.

WER is the word error rate (fraction of words transcribed incorrectly); lower is better. Composite is higher-is-better and dominated by the 1/WER term when WER is small (the max(WER, 0.05) clamp caps that term's contribution at 10).
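For reference, the composite formula translates directly into the function below; the emotion and UTMOS terms were not measured in the T1 eval, so only the WER term is grounded in the numbers reported above.

def composite_score(wer: float, emotion_acc: float, utmos_normalized: float) -> float:
    """Composite eval metric from the formula above; higher is better."""
    return 0.5 * (1.0 / max(wer, 0.05)) + 0.3 * emotion_acc + 0.2 * utmos_normalized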

Known limitations

  • Gender × pitch confound. Prompts like "a male speaker at a high pitch" or occasionally "a male speaker at a low pitch" can produce a female-timbre voice. Root cause: the training corpus's gender distribution per pitch bucket is imbalanced — the model partially treats pitch as a proxy for gender. Not fixed in this checkpoint.
  • Emotion intensity is mixed. Some emotion prompts produce clearly stronger delivery than the base; others are roughly comparable. Emotion metrics were not logged at training time, so emotional fidelity is a subjective assessment rather than a tracked metric.
  • English only. All training and evaluation used English prompts + English text. The base model supports 10 languages; they are untouched but were not validated against this adapter's shifted CB-0 distribution.
  • Research / non-commercial only — see license.

License

  • Base model weights (Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign): Apache 2.0.
  • Training data (TextrolSpeech): CC BY-NC-SA 4.0 (research / non-commercial).

Because a substantial portion of this model's post-training behavior is shaped by TextrolSpeech, the derived model inherits the CC BY-NC-SA 4.0 constraint in practice: free to use for research, academic, and non-commercial purposes, with attribution and share-alike. Commercial deployment is not recommended without re-training on a commercially-licensed corpus.
