Qwen3-TTS-12Hz-1.7B Fine-tuned (French SIWIS)

Fine-tuned version of Qwen/Qwen3-TTS-12Hz-1.7B-Base on the French SIWIS dataset for improved French TTS.

Training Details

  • Base model: Qwen/Qwen3-TTS-12Hz-1.7B-Base
  • Training data: SIWIS French Speech Synthesis Database (~8,325 samples, 500 benchmark phrases excluded)
  • Training type: Full fine-tuning (speaker encoder frozen)
  • Best checkpoint: Epoch 6/10, val_loss=7.2436
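The data split above can be sketched as follows. This is illustrative only: the utterance IDs, the total count of ~8,825, and the assumption that the 500 benchmark phrases form a single contiguous block are all hypothetical, not taken from the SIWIS release.

```python
# Hypothetical utterance IDs; the real SIWIS file names differ.
all_utterances = [f"utt_{i:05d}" for i in range(8825)]

# Hold out 500 phrases as the fixed benchmark set (assumed to be
# the first 500 here purely for illustration).
benchmark_set = all_utterances[:500]
train_set = all_utterances[500:]

print(len(benchmark_set), len(train_set))  # 500 8325
```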

Hyperparameters

| Parameter | Value |
|---|---|
| Epochs | 6 (best of 10) |
| Batch size | 4 |
| Gradient accumulation | 8 |
| Effective batch size | 32 |
| Learning rate | 2e-6 |
| Warmup steps | 500 |
| Weight decay | 0.01 |
| LR scheduler | Cosine |
| Precision | bf16 |
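The table above maps to a training config along these lines. The key names are generic placeholders, not the actual training script's arguments; the point is how the effective batch size of 32 arises from the per-device batch and gradient accumulation.

```python
# Illustrative config mirroring the hyperparameter table (key names
# are assumptions, not the real training script's arguments).
config = {
    "num_epochs": 10,            # best checkpoint taken at epoch 6
    "per_device_batch_size": 4,
    "grad_accum_steps": 8,
    "learning_rate": 2e-6,
    "warmup_steps": 500,
    "weight_decay": 0.01,
    "lr_scheduler": "cosine",
    "precision": "bf16",
}

# Effective batch size = per-device batch x gradient accumulation steps.
effective_batch = config["per_device_batch_size"] * config["grad_accum_steps"]
print(effective_batch)  # 32
```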

Training Loss Progression

| Epoch | Train Loss | Val Loss |
|---|---|---|
| 1 | 12.89 | 9.13 |
| 2 | 8.15 | 7.72 |
| 3 | 7.53 | 7.43 |
| 4 | 7.35 | 7.31 |
| 5 | 7.27 | 7.26 |
| 6 | 7.23 | 7.24 |
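Best-checkpoint selection from the loss table above reduces to picking the epoch with the lowest validation loss, a minimal sketch:

```python
# Validation losses per epoch, from the training table.
val_losses = {1: 9.13, 2: 7.72, 3: 7.43, 4: 7.31, 5: 7.26, 6: 7.24}

# Keep the checkpoint with the lowest validation loss.
best_epoch = min(val_losses, key=val_losses.get)
print(best_epoch, val_losses[best_epoch])  # 6 7.24
```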

Hardware

  • GPU: NVIDIA RTX A5500 (24GB)

Usage

```python
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

# Load the fine-tuned checkpoint in bf16 with FlashAttention 2.
tts = Qwen3TTSModel.from_pretrained(
    "Rcarvalo/qwentts",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Generate French speech with the fine-tuned SIWIS voice.
wavs, sr = tts.generate_custom_voice(
    text="Bonjour, comment allez-vous aujourd'hui?",
    speaker="siwis_french",
)

# Save the first (and only) waveform at the model's sample rate.
sf.write("output.wav", wavs[0], sr)
```

Baseline Benchmark (base model before fine-tuning, on the 500 held-out SIWIS phrases)

| Metric | Value |
|---|---|
| WER (mean) | 23.4% |
| WER (median) | 14.3% |
| RTF (mean, >1 = slower than real time) | 1.300 |
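A sketch of how these metrics are conventionally computed; the card does not describe the actual evaluation script, so the helpers below are illustrative. RTF is synthesis time divided by audio duration (an RTF of 1.3 means generation runs ~30% slower than real-time playback), and WER is word-level edit distance over the reference length.

```python
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: >1.0 means generation is slower than playback."""
    return synthesis_seconds / audio_seconds

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, sub)
    return d[len(ref)][len(hyp)] / len(ref)

print(rtf(6.5, 5.0))  # 1.3
print(wer("bonjour à tous", "bonjour a tous"))  # one substitution in three words
```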

License

Apache 2.0 (same as base model)
