---
license: mit
base_model: microsoft/VibeVoice-Realtime-0.5B
tags:
  - tts
  - text-to-speech
  - french
  - vibevoice
  - finetuned
language:
  - fr
datasets:
  - custom
pipeline_tag: text-to-speech
---

# VibeVoice-Realtime-0.5B Fine-tuned (French SIWIS)

Fine-tuned version of [microsoft/VibeVoice-Realtime-0.5B](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B) on the French SIWIS dataset for improved French TTS.

## Training Details

- **Base model**: microsoft/VibeVoice-Realtime-0.5B
- **Training data**: SIWIS French Speech Synthesis Database (~9,200 samples, 500 benchmark phrases excluded)
- **Training type**: Full fine-tuning of TTS language model (434M params)
- **Frozen components**: Acoustic tokenizer (VAE), prediction head (diffusion), language encoder (4 Qwen2.5 layers)
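
The split between trained and frozen components described above could be set up roughly as follows. This is a minimal sketch on a toy model: the class `ToyTTS`, the helper `freeze_all_but`, and all submodule names are hypothetical stand-ins, not the attribute names used by VibeVoice or by the actual training script.

```python
import torch.nn as nn


class ToyTTS(nn.Module):
    """Stand-in model; the submodule names are hypothetical, not VibeVoice's."""

    def __init__(self) -> None:
        super().__init__()
        self.acoustic_tokenizer = nn.Linear(8, 8)  # frozen in this sketch
        self.prediction_head = nn.Linear(8, 8)     # frozen in this sketch
        self.language_encoder = nn.Linear(8, 8)    # frozen in this sketch
        self.tts_language_model = nn.Linear(8, 8)  # the only trained part


def freeze_all_but(model: nn.Module, trainable_prefixes: tuple[str, ...]) -> int:
    """Disable gradients for every parameter outside the given submodules.

    Returns the number of parameters left trainable.
    """
    trainable = 0
    for name, param in model.named_parameters():
        # Parameter names look like "tts_language_model.weight".
        param.requires_grad = name.startswith(trainable_prefixes)
        if param.requires_grad:
            trainable += param.numel()
    return trainable


model = ToyTTS()
n_trainable = freeze_all_but(model, ("tts_language_model.",))
print(f"trainable parameters: {n_trainable}")
```

On the real model, the same loop would be expected to report roughly the 434M trainable parameters listed above, provided the prefixes match the actual module names.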

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Epochs | 10 |
| Batch size | 4 |
| Gradient accumulation | 4 |
| Effective batch size | 16 |
| Learning rate | 5e-5 |
| Weight decay | 0.01 |
| Warmup steps | 500 |
| Precision | bf16 |
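
To show how these values relate, the sketch below derives the effective batch size and an approximate optimizer step count. It assumes a single GPU, uses the rounded figure of 9,200 samples, and assumes the last partial batch of each epoch is kept; the variable names are illustrative and not taken from the training script.

```python
import math

num_samples = 9_200   # approximate dataset size from the card
per_step_batch = 4    # "Batch size"
grad_accum = 4        # "Gradient accumulation"
epochs = 10
warmup_steps = 500

# One optimizer update happens after `grad_accum` forward/backward passes.
effective_batch = per_step_batch * grad_accum
steps_per_epoch = math.ceil(num_samples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"total optimizer steps: {total_steps}")
print(f"warmup share of training: {warmup_fraction:.1%}")
```

Under these assumptions, warmup covers a little under a tenth of all optimizer steps.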

### Hardware

- GPU: NVIDIA RTX 6000 Ada (48 GB)

## Benchmark Results (500 SIWIS French phrases)

| Metric | Value |
|--------|-------|
| WER (mean) | 35.0% |
| WER (median) | 22.9% |
| RTF (mean) | 0.416 |
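
For readers unfamiliar with the metrics: word error rate (WER) is the word-level edit distance between a reference text and an ASR transcript of the synthesized audio, divided by the number of reference words. Real-time factor (RTF) is synthesis time divided by the duration of the generated audio, so values below 1 mean faster than real time. The card does not state which ASR model or text normalization was used, so the sketch below is only an illustration; lowercasing and whitespace splitting are assumptions, and the function names are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of an extra word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Time spent generating divided by the length of the generated audio."""
    return synthesis_seconds / audio_seconds


wer = word_error_rate("bonjour comment allez vous", "bonjour comment allez nous")
rtf = real_time_factor(synthesis_seconds=1.0, audio_seconds=2.5)
print(f"WER: {wer:.1%}, RTF: {rtf:.3f}")
```

A mean WER well above the median, as in the table, usually indicates that a minority of phrases with very high error rates pull the average up.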

## Usage

```python
import torch
import soundfile as sf
from vibevoice.modular.modeling_vibevoice_streaming_inference import (
    VibeVoiceStreamingForConditionalGenerationInference,
)

model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    "Rcarvalo/vibevoice",
    torch_dtype=torch.bfloat16,
).to("cuda")

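# Generate French speech
audio = model.generate(text="Bonjour, comment allez-vous aujourd'hui?")
# NumPy has no bfloat16 dtype, so cast to float32 before converting.
# The last argument is the sample rate in Hz.
sf.write("output.wav", audio.float().cpu().numpy(), 24000)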
```

## License

MIT (same as base model)