Qwen3-TTS-12Hz-1.7B Fine-tuned (French SIWIS)

Fine-tuned version of Qwen/Qwen3-TTS-12Hz-1.7B-Base on the French SIWIS dataset for improved French TTS.

Training Details

  • Base model: Qwen/Qwen3-TTS-12Hz-1.7B-Base
  • Training data: SIWIS French Speech Synthesis Database (~8,325 samples, 500 benchmark phrases excluded)
  • Training type: Full fine-tuning (speaker encoder frozen)
  • Best checkpoint: Epoch 6/10, val_loss=7.2436
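The data split above can be sketched as follows. This is illustrative only: the utterance IDs, the total count of ~8,825, and the assumption that the 500 benchmark phrases form a single contiguous block are all hypothetical, not taken from the SIWIS release.

```python
# Hypothetical utterance IDs; the real SIWIS file names differ.
all_utterances = [f"utt_{i:05d}" for i in range(8825)]

# Hold out 500 phrases as the fixed benchmark set (assumed to be
# the first 500 here purely for illustration).
benchmark_set = all_utterances[:500]
train_set = all_utterances[500:]

print(len(benchmark_set), len(train_set))  # 500 8325
```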

Hyperparameters

| Parameter | Value |
|---|---|
| Epochs | 6 (best of 10) |
| Batch size | 4 |
| Gradient accumulation | 8 |
| Effective batch size | 32 |
| Learning rate | 2e-6 |
| Warmup steps | 500 |
| Weight decay | 0.01 |
| LR scheduler | Cosine |
| Precision | bf16 |
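The table above maps to a training config along these lines. The key names are generic placeholders, not the actual training script's arguments; the point is how the effective batch size of 32 arises from the per-device batch and gradient accumulation.

```python
# Illustrative config mirroring the hyperparameter table (key names
# are assumptions, not the real training script's arguments).
config = {
    "num_epochs": 10,            # best checkpoint taken at epoch 6
    "per_device_batch_size": 4,
    "grad_accum_steps": 8,
    "learning_rate": 2e-6,
    "warmup_steps": 500,
    "weight_decay": 0.01,
    "lr_scheduler": "cosine",
    "precision": "bf16",
}

# Effective batch size = per-device batch x gradient accumulation steps.
effective_batch = config["per_device_batch_size"] * config["grad_accum_steps"]
print(effective_batch)  # 32
```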

Training Loss Progression

| Epoch | Train Loss | Val Loss |
|---|---|---|
| 1 | 12.89 | 9.13 |
| 2 | 8.15 | 7.72 |
| 3 | 7.53 | 7.43 |
| 4 | 7.35 | 7.31 |
| 5 | 7.27 | 7.26 |
| 6 | 7.23 | 7.24 |
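Best-checkpoint selection from the loss table above reduces to picking the epoch with the lowest validation loss, a minimal sketch:

```python
# Validation losses per epoch, from the training table.
val_losses = {1: 9.13, 2: 7.72, 3: 7.43, 4: 7.31, 5: 7.26, 6: 7.24}

# Keep the checkpoint with the lowest validation loss.
best_epoch = min(val_losses, key=val_losses.get)
print(best_epoch, val_losses[best_epoch])  # 6 7.24
```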

Hardware

  • GPU: NVIDIA RTX A5500 (24GB)

Usage

```python
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

# Load the fine-tuned checkpoint in bf16 with FlashAttention 2.
tts = Qwen3TTSModel.from_pretrained(
    "Rcarvalo/qwentts",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Generate French speech with the fine-tuned SIWIS voice.
wavs, sr = tts.generate_custom_voice(
    text="Bonjour, comment allez-vous aujourd'hui?",
    speaker="siwis_french",
)

# Save the first (and only) waveform at the model's sample rate.
sf.write("output.wav", wavs[0], sr)
```

Baseline Benchmark (base model before fine-tuning, on the 500 held-out SIWIS phrases)

| Metric | Value |
|---|---|
| WER (mean) | 23.4% |
| WER (median) | 14.3% |
| RTF (mean, >1 = slower than real time) | 1.300 |
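A sketch of how these metrics are conventionally computed; the card does not describe the actual evaluation script, so the helpers below are illustrative. RTF is synthesis time divided by audio duration (an RTF of 1.3 means generation runs ~30% slower than real-time playback), and WER is word-level edit distance over the reference length.

```python
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: >1.0 means generation is slower than playback."""
    return synthesis_seconds / audio_seconds

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, sub)
    return d[len(ref)][len(hyp)] / len(ref)

print(rtf(6.5, 5.0))  # 1.3
print(wer("bonjour à tous", "bonjour a tous"))  # one substitution in three words
```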

License

Apache 2.0 (same as base model)
