Qwen3-TTS VoiceDesign — T8 (naturalness pivot)

A fine-tune of Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign focused on naturalness rather than maximum prompt-following adherence, with a new axis added: British English accent.

The earlier iterations in this series (T3-T7) optimized for emotion-following and gender-following metrics. That direction tends to drift the adapter away from the base model's naturalness — strong-emotion training data trains caricatured delivery. T8 inverts the recipe to anchor the adapter close to base on neutral prompts, while opening one new axis the prior iterations couldn't reach.

Base: Qwen3-TTS-12Hz-1.7B-VoiceDesign (frozen during training; LoRA merged into this repo)
Method: LoRA on the Talker's attention + MLP projections, with KL regularization against the frozen base on neutral-prompt minibatches (β=0.02, 20% mix). LoRA merged back into the talker for inference.
Training data: VCTK 0.92 (109 speakers, ~40 British) + EARS regular-intensity reads + Expresso narration. ~46 k clips total. Studio-quality only — no LibriTTS, no SpeechCraft.
Output: 24 kHz mono wav via the Qwen3 12 Hz multi-codebook codec.

This repo is self-contained — it ships the merged transformer weights, the audio codec (speech_tokenizer/), the tokenizer, and all configs. No other HF repo needs to be downloaded at inference time.

What this checkpoint targets

T8's primary goal is to add new capability without regressing naturalness. We hit that on the auto-metrics:

metric	baseline	T8 (this repo)	change
UTMOS (DNSMOS proxy)	3.189	3.230	+0.04
WER (Whisper-base)	0.0215	0.0187	-13% rel
emotion accuracy	0.559	0.712	+27% rel
gender accuracy	0.847	0.746	-0.10
naturalness A/B vs base	—	PASS (5 W / 2 T / 3 L)	gate cleared
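The change column can be re-derived from the two score columns. The sketch below does that arithmetic; the dictionary keys and helper names are illustrative and are not taken from the eval code.

```python
# Scores copied from the table above.
baseline = {"utmos": 3.189, "wer": 0.0215, "emotion_acc": 0.559, "gender_acc": 0.847}
t8 = {"utmos": 3.230, "wer": 0.0187, "emotion_acc": 0.712, "gender_acc": 0.746}


def abs_change(metric):
    """Absolute difference, T8 minus baseline."""
    return t8[metric] - baseline[metric]


def rel_change(metric):
    """Difference as a fraction of the baseline value."""
    return abs_change(metric) / baseline[metric]


for name in baseline:
    print(f"{name:12s} abs={abs_change(name):+.3f} rel={rel_change(name):+.1%}")
```

UTMOS and gender accuracy are reported as absolute differences, WER and emotion accuracy as relative ones.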

Eval set: 59 prompts spanning 10 naturalness pairs (neutral content for A/B against base), 14 UK-accent prompts (RP / Scottish / Welsh / Northern Irish / Irish), an emotion × gender × pitch matrix at conversational intensity, and 9 scenario prompts.

The naturalness gate is the headline result: on neutral prompts, this checkpoint's UTMOS wins or ties against the frozen base on 70 % of pairs (5 wins + 2 ties out of 10). T6 had no equivalent gate; T7 was scaffolded but never run.
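One way a win/tie/loss tally of this kind could be computed from paired UTMOS scores is sketched below. The tie_margin parameter and both function names are assumptions; the tie rule and pass threshold actually used for the gate are not stated in this card.

```python
def tally(pairs, tie_margin=0.05):
    """Count wins, ties and losses for (candidate, base) score pairs.

    tie_margin is a hypothetical tolerance: gaps no larger than it count as ties.
    """
    wins = ties = losses = 0
    for candidate, base in pairs:
        delta = candidate - base
        if abs(delta) <= tie_margin:
            ties += 1
        elif delta > 0:
            wins += 1
        else:
            losses += 1
    return wins, ties, losses


def win_or_tie_rate(wins, ties, losses):
    """Fraction of pairs where the candidate did not lose."""
    total = wins + ties + losses
    return (wins + ties) / total if total else 0.0


# Reported tally for T8: 5 wins, 2 ties, 3 losses.
print(win_or_tie_rate(5, 2, 3))
```

With the reported tally, the rate works out to the 70 % quoted above.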

Quick start

Install the Qwen3-TTS inference package (it registers the custom Qwen3TTSForConditionalGeneration model class with transformers):

pip install qwen-tts transformers torch soundfile

Generate a clip:

from qwen_tts import Qwen3TTSModel
import soundfile as sf

wrap = Qwen3TTSModel.from_pretrained("macminix/qwen3_voice_design_t8")

wavs, sr = wrap.generate_voice_design(
    text="The train to Edinburgh departs from platform four.",
    instruct="A man with a British English accent, calm and natural.",
    language="english",
    temperature=0.9, top_k=50, top_p=1.0,
    repetition_penalty=1.05, max_new_tokens=600,
)
sf.write("out.wav", wavs[0], sr)

A ready-to-run version with multiple example prompts is at example_inference.py.
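Independently of example_inference.py (whose contents are not reproduced here), a loop over several instruct strings might look like the sketch below. It reuses only the from_pretrained and generate_voice_design calls shown in the quick start. The helper names (slugify, synthesize_all, main) and the output file naming are hypothetical.

```python
import os
import re

# Sampling settings copied from the quick-start call.
GEN_KWARGS = dict(
    language="english",
    temperature=0.9, top_k=50, top_p=1.0,
    repetition_penalty=1.05, max_new_tokens=600,
)

TEXT = "The train to Edinburgh departs from platform four."

# Instruct phrasings taken from this card.
INSTRUCTS = [
    "A man with a British English accent, calm and natural.",
    "A Scottish woman",
    "A British woman, conversational and unhurried",
]


def slugify(instruct, max_len=40):
    """Turn an instruct string into a filesystem-friendly stem."""
    return re.sub(r"[^a-z0-9]+", "_", instruct.lower()).strip("_")[:max_len]


def synthesize_all(model, instructs, text=TEXT, out_dir=".", write=None):
    """Render `text` once per instruct and return the written paths.

    `write` defaults to soundfile.write; it is injectable so the loop can be
    exercised without audio dependencies.
    """
    if write is None:
        import soundfile as sf
        write = sf.write
    paths = []
    for i, instruct in enumerate(instructs):
        wavs, sr = model.generate_voice_design(
            text=text, instruct=instruct, **GEN_KWARGS
        )
        path = os.path.join(out_dir, f"{i:02d}_{slugify(instruct)}.wav")
        write(path, wavs[0], sr)
        paths.append(path)
    return paths


def main():
    # Imported lazily; requires the qwen-tts package from the install step.
    from qwen_tts import Qwen3TTSModel
    model = Qwen3TTSModel.from_pretrained("macminix/qwen3_voice_design_t8")
    for path in synthesize_all(model, INSTRUCTS):
        print(path)
```

Calling main() loads the checkpoint once and writes one wav per instruct into the current directory.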

The instruct prompt format

The instruct field is free-form English describing the voice. T8 was trained on subtle, conversational phrasings — not intensifier-heavy ones. Phrasings like "clearly angry", "intensely sad", "nearly shouting" are in the prompt distribution but training de-emphasized them. Phrasings that work well:

gender + accent — "A man with a British English accent", "A Scottish woman", "A British woman, conversational and unhurried"
subtle emotion — "speaking warmly and pleased", "softly sad", "with a quiet sadness", "with a touch of anger, controlled rather than shouting"
scenario — "a bedtime storyteller, soft and warm", "a news anchor, professional and neutral", "a meditation guide, soft and serene"

Example prompts:

A British man speaks calmly and naturally.
A woman with a Scottish accent, in an everyday speaking tone.
A man, softly sad, calm and unhurried.
A British news anchor, professional and neutral, at a brisk steady pace.
A clear, neutral voice reading the sentence.
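Since intensifier-heavy phrasings were de-emphasized in training, it can help to check an instruct string before sending it. The sketch below flags the intensifier words this card names and assembles a prompt from a speaker description plus optional qualities. Both helpers are hypothetical conveniences and are not part of qwen-tts.

```python
import re

# Intensifier phrasings this card names as de-emphasized or dropped from captions.
INTENSIFIERS = ("very", "extremely", "intensely", "highly", "clearly", "nearly shouting")


def find_intensifiers(instruct):
    """Return the listed intensifiers that occur as whole words in the instruct."""
    lowered = instruct.lower()
    return [w for w in INTENSIFIERS if re.search(rf"\b{re.escape(w)}\b", lowered)]


def build_instruct(speaker, *qualities):
    """Join a speaker description and optional qualities with commas."""
    return ", ".join([speaker, *qualities])


prompt = build_instruct("A British woman", "conversational and unhurried")
print(prompt, find_intensifiers(prompt))
```

A non-empty result from find_intensifiers suggests rephrasing toward the subtler wording listed above.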

How the adapter was trained

The training protocol corrects the silent issues common in naive VoiceDesign fine-tunes:

Dual-track input layout. Training-time inputs_embeds is the exact element-wise sum of text-track and codec-track embeddings used by Qwen3TTSForConditionalGeneration.generate's VoiceDesign path — including the 5-position English think-prefix on the codec track. Matches inference exactly, instead of approximating it with a chat-templated prompt + boundary switch.
Single-shift loss. The loss is computed manually as F.cross_entropy(logits[:, :-1], codec_0_labels[:, 1:], ignore_index=-100), shown here without the reshape that F.cross_entropy needs for 3-D logits. The labels= argument is never passed into the wrapped forward, avoiding the double-shift that occurs when PEFT's wrapped CausalLMLoss adds its own internal shift on top of the collator's.
Conservative LR for LoRA on a 1.7 B base. T8 uses 1e-5 (half of T5/T6/T7's 2e-5) over 1 epoch with cosine schedule and min_lr_ratio=0.2. The naturalness target is "small nudge from base," not "task change."
No sub-talker loss with frozen Code Predictor. Disabled in T8 (T4 lesson — incompatible with talker-only LoRA scope).
KL-to-base anchor at higher weight. T8 uses β=0.02 with 20% neutral-prompt mix (T6 used 0.01 / 10% which proved too weak). The teacher is the same model with the LoRA disabled; on neutral-prompt minibatches the loss becomes β · KL(student || teacher) on CB-0 logits.
Source curation. VCTK (studio 48 kHz, 109 speakers, ~40 British) + EARS regular-intensity reads only (no whispered/shouted/extreme-emotion clips) + Expresso narration. Filtered EARS to subtle emotion categories (amusement, contentment, contemplation, sympathy, pride, gratitude, realization, interest) to avoid training caricature.
Caption phrasing rules. The caption library was rewritten to drop intensifiers ("very", "extremely", "intensely", "highly") and avoid imperative templates ("Generate speech where ..."). Captions bias toward "calmly", "softly", "gently", "naturally measured".
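The single-shift loss and the KL-to-base anchor from the list above can be sketched in PyTorch as follows. This is a minimal illustration, not the training code: the function names, the mask argument, and the (batch, time, vocab) tensor layout are assumptions. It also assumes the neutral-batch loss is exactly β · KL with no cross-entropy term, which is how the description reads.

```python
import torch
import torch.nn.functional as F

BETA = 0.02  # KL weight stated for T8


def single_shift_ce(logits, codec_0_labels):
    """Next-token cross-entropy with exactly one shift.

    logits: (B, T, V); codec_0_labels: (B, T), with -100 on ignored positions.
    """
    shifted_logits = logits[:, :-1, :]      # predictions made at positions 0..T-2
    shifted_labels = codec_0_labels[:, 1:]  # targets located at positions 1..T-1
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        shifted_labels.reshape(-1),
        ignore_index=-100,
    )


def kl_to_base(student_logits, teacher_logits, mask):
    """KL(student || teacher), averaged over positions where mask is True."""
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    # KL(s || t) = sum_v s(v) * (log s(v) - log t(v)), computed per position.
    kl = (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1)  # (B, T)
    mask = mask.to(kl.dtype)
    return (kl * mask).sum() / mask.sum().clamp(min=1)


def neutral_batch_loss(student_logits, teacher_logits, mask):
    """Anchor loss for a neutral-prompt minibatch; the teacher gets no gradient."""
    return BETA * kl_to_base(student_logits, teacher_logits.detach(), mask)
```

In a setup like the one described, teacher_logits would come from a forward pass with the LoRA disabled, and the other minibatches would use single_shift_ce alone.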

The adapter is LoRA r=16, α=32, dropout=0.05 on the Talker's q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj projections only. The Code Predictor and audio codec are frozen end-to-end. The final adapter (~19 M parameters, ~74 MB at fp32) was permanently merged into the Talker weights for this repo so inference does not require PEFT.
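For reference, the adapter hyperparameters above can be written down as a plain config, with the LoRA scaling factor and adapter size derived from it. The to_peft_config helper is a hypothetical illustration of passing these values to PEFT's LoraConfig; it is not the training script.

```python
# Hyperparameters as stated in this card.
LORA = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}


def lora_scaling(cfg):
    """Factor applied to the low-rank update: alpha / r."""
    return cfg["lora_alpha"] / cfg["r"]


def adapter_megabytes(n_params, bytes_per_param):
    """Storage size in MB (10^6 bytes) for a given parameter count and dtype width."""
    return n_params * bytes_per_param / 1e6


def to_peft_config(cfg=LORA):
    """Build a peft.LoraConfig from the dict (requires the peft package)."""
    from peft import LoraConfig
    return LoraConfig(**cfg)


print(lora_scaling(LORA), adapter_megabytes(19e6, 4))
```

With α=32 and r=16 the update is scaled by 2. At 4 bytes per parameter, 19 M parameters come to 76 MB, in line with the rounded figures above.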

The selected checkpoint is step 2000 of a 2819-step single-epoch run (~71 % through training). It cleared all gates: WER no-regression, naturalness A/B passing, no NaN, gradient clipping never tripped.
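Combining the run length above with the schedule settings from the training notes (peak 1e-5, cosine decay, min_lr_ratio=0.2) gives a rough picture of the learning rate at the selected step. The sketch assumes no warmup and a decay spanning all 2819 steps; neither detail is stated in this card.

```python
import math

PEAK_LR = 1e-5       # T8 learning rate
MIN_LR_RATIO = 0.2   # floor as a fraction of the peak
TOTAL_STEPS = 2819   # single-epoch run length


def cosine_lr(step, peak_lr=PEAK_LR, min_lr_ratio=MIN_LR_RATIO, total_steps=TOTAL_STEPS):
    """Cosine decay from peak_lr down to peak_lr * min_lr_ratio (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 at the start, 0 at the end
    return peak_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)


for step in (0, 2000, TOTAL_STEPS):
    print(step, f"{cosine_lr(step):.2e}")
```

Under these assumptions the rate at step 2000 sits between 3e-6 and 4e-6, well below the peak but above the 2e-6 floor.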

Strengths

Naturalness preserved on neutral prompts. A plain "a clear, neutral voice reading the sentence" produces output that ties or beats frozen base on 70 % of pairs by UTMOS.
Improved expressive output at conversational intensity. Emotion adherence on the eval matrix climbed from 0.56 to 0.71, despite the recipe NOT chasing that metric — likely from the calmer EARS-regular reads being a better signal than EARS-full.
Lower WER than the base. Whisper-base WER dropped 13 % relative to base on the same eval suite — the adapter is producing slightly more transcribable audio, not less.
First UK accent coverage in the series. VCTK contributes ~28 k clips with British / Scottish / Welsh / Northern Irish / Irish accents. Phrasings like "A man with a British English accent" now hit a real cluster in the training distribution.

Known limitations

UK accent classifier was not run during eval. The accent_acc column in the eval summary is null because speechbrain was not installed at eval time, so the lang-id classifier never executed. Listening to the 14 UK-accent eval prompts is the only honest verification right now. The eval will be rerun with speechbrain installed in a future iteration.
Slight gender accuracy regression. F0-based gender accuracy dropped from 0.847 (base) to 0.746 (T8), about 0.10. This is below T5's 0.75 floor by a hair. Inspecting the failing cases is part of the planned T9 review — initial impression is borderline F0 misclassification on UK male voices that sit slightly higher in the male F0 range, not actual wrong-gender output.
Subtle emotion is the design point. This checkpoint is intentionally undertrained for caricatured emotion (loud anger, shouted joy, dramatic sadness). For exaggerated theatrical delivery, T5 / T6 are still better fits.
English only. All training and evaluation used English prompts and English text. The base model supports 10 languages; they are untouched but not validated against this adapter's modified CB-0 distribution.
Research / non-commercial use only — see license.
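To illustrate why an F0-based gender metric can mislabel higher-pitched male voices, as suspected in the gender-accuracy limitation above, here is a minimal threshold classifier. The 165 Hz boundary and the median-of-voiced-frames rule are assumptions for illustration; the classifier used in the T8 eval is not described in this card.

```python
from statistics import median


def classify_gender_by_f0(f0_hz, threshold_hz=165.0):
    """Label a clip from its median voiced F0.

    threshold_hz is a hypothetical boundary, not the value used in the T8 eval.
    Frames with F0 <= 0 are treated as unvoiced and ignored.
    """
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        return None  # no voiced frames, so no decision
    return "male" if median(voiced) < threshold_hz else "female"


# A male speaker whose median F0 sits just above the boundary.
print(classify_gender_by_f0([168.0, 172.0, 0.0, 170.0]))
```

With this rule, a correct-gender clip whose median F0 lands slightly above the boundary still counts as an error in the metric.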

License

Base model weights (Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign): Apache 2.0.
Training data:
- VCTK 0.92: CC BY 4.0 (research + commercial OK).
- EARS: CC BY-NC-SA 4.0 (research / non-commercial).
- Expresso: CC BY-NC 4.0 (research / non-commercial).

Because EARS and Expresso carry non-commercial restrictions, the derived model effectively inherits a CC BY-NC-SA 4.0 constraint: free to use for research, academic, and non-commercial purposes, with attribution and share-alike. Commercial deployment is not recommended without re-training on a commercially-licensed corpus (a VCTK-only re-run could be commercially licensable).

References

Base model: Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
Inference library: qwen-tts on PyPI
VCTK 0.92: Edinburgh DataShare 10283/3443
EARS dataset: Effortless and Realistic Speech Dataset
Expresso dataset: ylacombe/expresso
Prior iterations in this series: T1, T2, T5, T6
