---
license: mit
base_model: microsoft/VibeVoice-Realtime-0.5B
tags:
- tts
- text-to-speech
- french
- vibevoice
- finetuned
language:
- fr
datasets:
- custom
pipeline_tag: text-to-speech
---

# VibeVoice-Realtime-0.5B Fine-tuned (French SIWIS)

Fine-tuned version of [microsoft/VibeVoice-Realtime-0.5B](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B) on the French SIWIS dataset for improved French TTS.

## Training Details

- **Base model**: microsoft/VibeVoice-Realtime-0.5B
- **Training data**: SIWIS French Speech Synthesis Database (~9,200 samples; the 500 benchmark phrases were excluded from training)
- **Training type**: Full fine-tuning of the TTS language model (434M params)
- **Frozen components**: Acoustic tokenizer (VAE), prediction head (diffusion), language encoder (Qwen2.5, 4 layers)

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Epochs | 10 |
| Batch size | 4 |
| Gradient accumulation | 4 |
| Effective batch size | 16 |
| Learning rate | 5e-5 |
| Weight decay | 0.01 |
| Warmup steps | 500 |
| Precision | bf16 |

### Hardware

- GPU: NVIDIA RTX 6000 Ada (49GB)

## Benchmark Results (500 SIWIS French phrases)

| Metric | Value |
|--------|-------|
| WER (mean) | 35.0% |
| WER (median) | 22.9% |
| RTF (mean) | 0.416 |

## Usage

```python
import torch
import soundfile as sf
from vibevoice.modular.modeling_vibevoice_streaming_inference import (
    VibeVoiceStreamingForConditionalGenerationInference,
)

model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
    "Rcarvalo/vibevoice",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Generate French speech
audio = model.generate(text="Bonjour, comment allez-vous aujourd'hui?")

# NumPy has no bfloat16 dtype, so cast to float32 before converting.
# The output is written at a 24 kHz sample rate.
sf.write("output.wav", audio.float().cpu().numpy(), 24000)
```

## License

MIT (same as base model)
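
## Metric Definitions

The benchmark table reports WER and RTF without describing how they were computed. The sketch below shows the conventional definitions, as an aid to reading those numbers. It is not the evaluation script used for this model. It assumes that WER is the word-level edit distance between the input text and an ASR transcript of the generated audio, divided by the number of reference words, and that RTF is synthesis time divided by the duration of the generated audio. The function names `word_error_rate` and `real_time_factor` are hypothetical. The ASR model and the text normalization used for the reported results are not stated, so only lowercasing and whitespace splitting are applied here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute spent per second of generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# One dropped word out of four reference words.
print(word_error_rate("bonjour comment allez vous", "bonjour comment allez"))  # 0.25

# Illustrative timings only, not measurements from this model.
print(real_time_factor(synthesis_seconds=1.0, audio_seconds=4.0))
```

Under these definitions, an RTF below 1 means audio is generated faster than it plays back. A mean WER well above the median, as in the table, is consistent with a minority of utterances having very high error rates.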