---
license: mit
base_model: microsoft/VibeVoice-Realtime-0.5B
tags:
- tts
- text-to-speech
- french
- vibevoice
- finetuned
language:
- fr
datasets:
- custom
pipeline_tag: text-to-speech
---

# VibeVoice-Realtime-0.5B Fine-tuned (French SIWIS)

A fine-tuned version of [microsoft/VibeVoice-Realtime-0.5B](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B), trained on the French SIWIS dataset to improve French text-to-speech (TTS) quality.

## Training Details

- **Base model**: microsoft/VibeVoice-Realtime-0.5B
- **Training data**: SIWIS French Speech Synthesis Database (~9,200 samples; the 500 benchmark phrases were excluded from training)
- **Training type**: Full fine-tuning of the TTS language model (434M parameters)
- **Frozen components**: Acoustic tokenizer (VAE), prediction head (diffusion), language encoder (4 Qwen2.5 layers)

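The training data excludes the 500 benchmark phrases, so the evaluation set never overlaps with what the model saw during fine-tuning. A minimal sketch of that kind of hold-out is shown here; the utterance IDs and sentences are hypothetical placeholders, not actual SIWIS entries, and the real split procedure is not described in this card.

```python
# Hypothetical utterance IDs and texts; the real SIWIS file names are not given here.
all_utterances = {
    "utt_0001": "Bonjour tout le monde.",
    "utt_0002": "Il fait beau aujourd'hui.",
    "utt_0003": "Merci beaucoup.",
    "utt_0004": "À bientôt.",
}
# Stands in for the 500 benchmark phrases held out for evaluation.
benchmark_ids = {"utt_0002", "utt_0004"}

# Keep only utterances that are not part of the benchmark.
train_set = {uid: text for uid, text in all_utterances.items() if uid not in benchmark_ids}
benchmark_set = {uid: all_utterances[uid] for uid in benchmark_ids}

# No utterance may appear in both splits.
assert train_set.keys().isdisjoint(benchmark_set.keys())
print(len(train_set), len(benchmark_set))
```
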
### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Epochs | 10 |
| Batch size | 4 |
| Gradient accumulation | 4 |
| Effective batch size | 16 |
| Learning rate | 5e-5 |
| Weight decay | 0.01 |
| Warmup steps | 500 |
| Precision | bf16 |

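The effective batch size is the per-step batch size multiplied by the number of gradient accumulation steps. The sketch below works through that arithmetic and estimates the number of optimizer steps. It assumes exactly 9,200 training samples (the card only gives an approximate count) and a linear warmup shape, which the card does not specify.

```python
import math

# Values taken from the hyperparameter table.
batch_size = 4
grad_accum_steps = 4
epochs = 10
warmup_steps = 500
peak_lr = 5e-5

# Assumption: exactly 9,200 training samples (the card says "~9,200").
num_samples = 9_200

effective_batch = batch_size * grad_accum_steps
steps_per_epoch = math.ceil(num_samples / effective_batch)
total_steps = steps_per_epoch * epochs


def warmup_lr(step: int) -> float:
    """Linear warmup to peak_lr (assumed shape; the actual scheduler is not stated)."""
    return peak_lr * min(1.0, step / warmup_steps)


print(effective_batch, steps_per_epoch, total_steps)  # 16 575 5750
print(f"warmup covers {warmup_steps / total_steps:.1%} of training")  # 8.7%
```

Under these assumptions, the 500 warmup steps span a little less than the first epoch.
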
### Hardware

- GPU: NVIDIA RTX 6000 Ada (49 GB)

## Benchmark Results (500 SIWIS French phrases)

| Metric | Value |
|--------|-------|
| WER (mean) | 35.0% |
| WER (median) | 22.9% |
| RTF (mean) | 0.416 |

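For reference, word error rate (WER) is the word-level edit distance between a reference text and a transcript, divided by the number of reference words. The real-time factor (RTF) is synthesis time divided by the duration of the generated audio, so a value below 1 means faster than real time. The sketch below implements both definitions in plain Python. It is illustrative only: this card does not state which ASR system transcribed the audio or how the text was normalized, and the example sentence and timings are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Synthesis time divided by the duration of the generated audio."""
    return synthesis_seconds / audio_seconds


# Hypothetical example: one substituted word out of four.
wer = word_error_rate("bonjour comment allez vous", "bonjour comment aller vous")
rtf = real_time_factor(synthesis_seconds=1.0, audio_seconds=2.5)
print(f"WER = {wer:.1%}, RTF = {rtf:.2f}")  # WER = 25.0%, RTF = 0.40
```

Because WER is averaged per phrase, a few badly transcribed short phrases can pull the mean well above the median, which is consistent with the gap between the two values in the table.
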
## Usage

```python
import torch
import soundfile as sf
from vibevoice.modular.modeling_vibevoice_streaming_inference import (
    VibeVoiceStreamingForConditionalGenerationInference,
)

# Load the fine-tuned weights in bf16 and move the model to the GPU.
model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    "Rcarvalo/vibevoice",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Generate French speech.
audio = model.generate(text="Bonjour, comment allez-vous aujourd'hui?")

# Assumes `audio` is a waveform tensor. NumPy has no bfloat16 dtype,
# so cast to float32 before converting, then write at 24 kHz.
sf.write("output.wav", audio.float().cpu().numpy(), 24000)
```

## License

MIT (same as base model)