metadata
license: mit
base_model: microsoft/VibeVoice-Realtime-0.5B
tags:
  - tts
  - text-to-speech
  - french
  - vibevoice
  - finetuned
language:
  - fr
datasets:
  - custom
pipeline_tag: text-to-speech

VibeVoice-Realtime-0.5B Fine-tuned (French SIWIS)

Fine-tuned version of microsoft/VibeVoice-Realtime-0.5B on the French SIWIS dataset for improved French TTS.

Training Details

  • Base model: microsoft/VibeVoice-Realtime-0.5B
  • Training data: SIWIS French Speech Synthesis Database (~9,200 samples; the 500 benchmark phrases were held out)
  • Training type: Full fine-tuning of the TTS language model (434M parameters)
  • Frozen components: acoustic tokenizer (VAE), diffusion prediction head, language encoder (Qwen2.5, 4 layers)
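
The split between trained and frozen components can be sketched in PyTorch as below. This is a minimal illustration on a toy module, not the training script used for this model: the class `ToyTTS`, its submodule names, and the helper `freeze_all_but` are all hypothetical stand-ins for the real VibeVoice components.

```python
import torch.nn as nn


class ToyTTS(nn.Module):
    """Stand-in model; the submodule names are hypothetical, not VibeVoice's."""

    def __init__(self) -> None:
        super().__init__()
        self.acoustic_tokenizer = nn.Linear(8, 8)  # frozen in this card
        self.prediction_head = nn.Linear(8, 8)     # frozen in this card
        self.language_encoder = nn.Linear(8, 8)    # frozen in this card
        self.tts_language_model = nn.Linear(8, 8)  # the only trained part


def freeze_all_but(model: nn.Module, trainable_prefix: str) -> None:
    """Enable gradients only for parameters whose name starts with the prefix."""
    for name, param in model.named_parameters():
        param.requires_grad_(name.startswith(trainable_prefix))


model = ToyTTS()
freeze_all_but(model, "tts_language_model")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable} / {total}")  # trainable: 72 / 288
```

Passing only the parameters with `requires_grad=True` to the optimizer keeps optimizer state small, which matters when most of the model is frozen.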

Hyperparameters

| Parameter | Value |
|---|---|
| Epochs | 10 |
| Batch size | 4 |
| Gradient accumulation | 4 |
| Effective batch size | 16 |
| Learning rate | 5e-5 |
| Weight decay | 0.01 |
| Warmup steps | 500 |
| Precision | bf16 |
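
The hyperparameters above imply a rough training length, which the following sketch works out. Several things here are assumptions rather than facts from the card: the sample count is the approximate "~9,200" figure, one optimizer step is taken per effective batch, warmup is counted in optimizer steps, and the warmup ramp is linear (the card does not name a scheduler).

```python
import math

# Values taken from the hyperparameter list.
batch_size = 4
grad_accum = 4
epochs = 10
warmup_steps = 500
peak_lr = 5e-5

# Approximate sample count ("~9,200"); the exact number is not given.
num_samples = 9_200

effective_batch = batch_size * grad_accum
steps_per_epoch = math.ceil(num_samples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps


def warmup_lr(step: int) -> float:
    """Learning rate during warmup, ASSUMING a linear ramp (not stated in the card)."""
    return peak_lr * min(1.0, step / warmup_steps)


print(effective_batch, steps_per_epoch, total_steps)  # 16 575 5750
print(f"warmup covers {warmup_fraction:.1%} of training")
```

Under these assumptions, training runs for about 5,750 optimizer steps, and the 500 warmup steps cover a little under one epoch.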

Hardware

  • GPU: NVIDIA RTX 6000 Ada (48 GB)

Benchmark Results (500 SIWIS French phrases)

| Metric | Value |
|---|---|
| WER (mean) | 35.0% |
| WER (median) | 22.9% |
| RTF (mean) | 0.416 |
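
For readers unfamiliar with the two metrics, the sketch below shows their standard definitions. It is not the evaluation code behind the numbers above: the card does not say which ASR model transcribed the audio or how text was normalized, so the lowercase-and-split normalization and the function names `word_error_rate` and `real_time_factor` are assumptions for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Compute time divided by audio duration; below 1.0 is faster than real time."""
    return synthesis_seconds / audio_seconds


# One substitution (chat -> chien) and one deletion (le) over 6 reference words.
wer = word_error_rate("le chat dort sur le canapé", "le chien dort sur canapé")
print(f"WER: {wer:.3f}")  # WER: 0.333
print(f"RTF: {real_time_factor(2.08, 5.0):.3f}")  # RTF: 0.416
```

A mean RTF of 0.416 corresponds to roughly 0.42 seconds of compute per second of generated audio. The gap between the mean and median WER suggests that a minority of phrases with very high error rates pull the mean up.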

Usage

import torch
import soundfile as sf
from vibevoice.modular.modeling_vibevoice_streaming_inference import (
    VibeVoiceStreamingForConditionalGenerationInference,
)

model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    "Rcarvalo/vibevoice",
    torch_dtype=torch.bfloat16,
).to("cuda")

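# Generate French speech
audio = model.generate(text="Bonjour, comment allez-vous aujourd'hui?")
# NumPy has no bfloat16 dtype, so cast to float32 before converting.
# The audio is written at a 24 kHz sample rate.
sf.write("output.wav", audio.float().cpu().numpy(), 24000)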

License

MIT (same as base model)