Qwen3-TTS-12Hz-1.7B Fine-tuned (French SIWIS)
A fine-tuned version of Qwen/Qwen3-TTS-12Hz-1.7B-Base, trained on the French SIWIS dataset to improve French text-to-speech (TTS) output.
Training Details
- Base model: Qwen/Qwen3-TTS-12Hz-1.7B-Base
- Training data: SIWIS French Speech Synthesis Database (~8,325 samples, 500 benchmark phrases excluded)
- Training type: Full fine-tuning (speaker encoder frozen)
- Best checkpoint: epoch 6 of 10 (validation loss 7.2436)
Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 6 (best of 10) |
| Batch size | 4 |
| Gradient accumulation | 8 |
| Effective batch size | 32 |
| Learning rate | 2e-6 |
| Warmup steps | 500 |
| Weight decay | 0.01 |
| LR scheduler | Cosine |
| Precision | bf16 |
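The effective batch size and the learning-rate schedule in the table can be worked out numerically. The sketch below is illustrative only: it assumes the ~8,325 samples are all used for training, that a final partial batch counts as one optimizer step, that warmup is linear and counted in optimizer steps, and that the cosine decay runs to zero over all 10 scheduled epochs. The function names are hypothetical and are not taken from the training script.

```python
import math

def steps_per_epoch(num_samples: int, batch_size: int, grad_accum: int) -> int:
    """Optimizer steps per epoch, counting a final partial batch as one step."""
    return math.ceil(num_samples / (batch_size * grad_accum))

def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 2e-6, warmup_steps: int = 500) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Values from the hyperparameter table.
effective_batch = 4 * 8                      # batch size x gradient accumulation
per_epoch = steps_per_epoch(8325, 4, 8)      # optimizer steps in one epoch
total = per_epoch * 10                       # 10 scheduled epochs

print(effective_batch, per_epoch, total)
print(lr_at_step(250, total), lr_at_step(500, total), lr_at_step(total, total))
```

Under these assumptions one epoch is 261 optimizer steps, so the 500 warmup steps would span a little under two epochs, and the peak learning rate of 2e-6 would be reached partway through epoch 2.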
Training Loss Progression
| Epoch | Train Loss | Val Loss |
|---|---|---|
| 1 | 12.89 | 9.13 |
| 2 | 8.15 | 7.72 |
| 3 | 7.53 | 7.43 |
| 4 | 7.35 | 7.31 |
| 5 | 7.27 | 7.26 |
| 6 | 7.23 | 7.24 |
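To make the trend in the table easier to read, the short sketch below recomputes the per-epoch drop in validation loss and the gap between train and validation loss. It uses only the rounded figures from the table; the variable names are illustrative.

```python
# (epoch, train_loss, val_loss) copied from the table above.
history = [
    (1, 12.89, 9.13),
    (2, 8.15, 7.72),
    (3, 7.53, 7.43),
    (4, 7.35, 7.31),
    (5, 7.27, 7.26),
    (6, 7.23, 7.24),
]

# Epoch with the lowest validation loss.
best_epoch, _, best_val = min(history, key=lambda row: row[2])

# Drop in validation loss from each epoch to the next.
val_drops = [round(prev[2] - cur[2], 2) for prev, cur in zip(history, history[1:])]

# Train loss minus validation loss at each epoch.
gaps = [round(train - val, 2) for _, train, val in history]

print(best_epoch, best_val)
print(val_drops)
print(gaps)
```

The validation loss falls by 1.41 after the first epoch and by only 0.02 between epochs 5 and 6, and epoch 6 is the first epoch in which the train loss is below the validation loss.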
Hardware
- GPU: NVIDIA RTX A5500 (24 GB)
Usage
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

tts = Qwen3TTSModel.from_pretrained(
    "Rcarvalo/qwentts",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

wavs, sr = tts.generate_custom_voice(
    text="Bonjour, comment allez-vous aujourd'hui?",
    speaker="siwis_french",
)

sf.write("output.wav", wavs[0], sr)
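The usage example requests `flash_attention_2`, which needs the separate `flash-attn` package. The hypothetical helper below picks a value for `attn_implementation` depending on whether that package is importable. It assumes the loader forwards this argument to Hugging Face Transformers, where `"sdpa"` is an accepted alternative; that behaviour has not been verified for `Qwen3TTSModel`.

```python
import importlib.util

def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash-attn is installed, else "sdpa".

    Hypothetical helper: the fallback value is an assumption based on
    the Transformers convention, not on the qwen_tts documentation.
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"

attn = pick_attn_implementation()
print(attn)
```

The returned string could then be passed as `attn_implementation=attn` in the `from_pretrained` call.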
Baseline Benchmark (before fine-tuning, 500 SIWIS phrases)
| Metric | Value |
|---|---|
| Word error rate (WER), mean | 23.4% |
| Word error rate (WER), median | 14.3% |
| Real-time factor (RTF), mean | 1.300 |
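The card does not state how these metrics were computed (which ASR system transcribed the audio, how text was normalized, or which RTF convention was used). As a rough sketch under common conventions, WER is the word-level edit distance between the reference text and the transcript divided by the number of reference words, and RTF is synthesis time divided by the duration of the generated audio. The function names below are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Time spent synthesizing divided by the duration of the audio produced."""
    return synthesis_seconds / audio_seconds

# One dropped word out of four reference words.
print(word_error_rate("bonjour comment allez vous", "bonjour comment vous"))
# 6.5 s of compute for 5.0 s of audio.
print(real_time_factor(6.5, 5.0))
```

Under this convention, an RTF above 1 means synthesis takes longer than the audio it produces, so a mean of 1.300 would correspond to generation running slower than real time.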
License
Apache 2.0 (same as base model)