	---
	license: mit
	base_model: microsoft/VibeVoice-Realtime-0.5B
	tags:
	- tts
	- text-to-speech
	- french
	- vibevoice
	- finetuned
	language:
	- fr
	datasets:
	- custom
	pipeline_tag: text-to-speech
	---

	# VibeVoice-Realtime-0.5B Fine-tuned (French SIWIS)

	Fine-tuned version of [microsoft/VibeVoice-Realtime-0.5B](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B) on the French SIWIS dataset for improved French TTS.

	## Training Details

	- Base model: microsoft/VibeVoice-Realtime-0.5B
	- Training data: SIWIS French Speech Synthesis Database (~9,200 samples, 500 benchmark phrases excluded)
	- Training type: Full fine-tuning of TTS language model (434M params)
	- Frozen components: acoustic tokenizer (VAE), prediction head (diffusion), language encoder (4 Qwen2.5 layers)

	### Hyperparameters

	| Parameter | Value |
	|-----------|-------|
	| Epochs | 10 |
	| Batch size | 4 |
	| Gradient accumulation | 4 |
	| Effective batch size | 16 |
	| Learning rate | 5e-5 |
	| Weight decay | 0.01 |
	| Warmup steps | 500 |
	| Precision | bf16 |
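
The table implies a rough training schedule, sketched below as simple arithmetic. This is an estimate rather than a logged figure: it assumes the approximate sample count of 9,200 from the training-data description and one optimizer step per effective batch. The variable names are illustrative only.

```python
import math

# Values taken from the hyperparameter table and the training-data line.
num_samples = 9_200       # approximate; the card says "~9,200"
batch_size = 4
grad_accum_steps = 4
epochs = 10
warmup_steps = 500

# One optimizer step consumes batch_size * grad_accum_steps samples.
effective_batch_size = batch_size * grad_accum_steps

# Assumption: a trailing partial batch still counts as one optimizer step.
steps_per_epoch = math.ceil(num_samples / effective_batch_size)
total_steps = steps_per_epoch * epochs

# Share of the run spent in learning-rate warmup.
warmup_fraction = warmup_steps / total_steps

print(f"effective batch size: {effective_batch_size}")
print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"total optimizer steps: {total_steps}")
print(f"warmup fraction: {warmup_fraction:.1%}")
```

Under these assumptions the run takes about 5,750 optimizer steps, with warmup covering a little under 9% of them.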

	### Hardware

	- GPU: NVIDIA RTX 6000 Ada (48 GB)

	## Benchmark Results (500 SIWIS French phrases)

	| Metric | Value |
	|--------|-------|
	| WER (mean) | 35.0% |
	| WER (median) | 22.9% |
	| RTF (mean) | 0.416 |
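
For readers unfamiliar with the two metrics, here is a minimal sketch of how they are commonly defined. It is not the evaluation script behind the numbers above: the card does not say which ASR system or text normalization was used, so the lowercase-and-split normalization and the helper names `word_error_rate` and `real_time_factor` are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Synthesis time divided by the duration of the generated audio."""
    return synthesis_seconds / audio_seconds


# One substituted word out of four reference words.
print(word_error_rate("bonjour comment allez vous", "bonjour comment allez tu"))
# 2.08 s of compute for 5 s of audio.
print(real_time_factor(2.08, 5.0))
```

With these definitions, an RTF below 1 means synthesis runs faster than real time. A mean WER well above the median, as in the table, usually points to a small number of phrases with very high error rates.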

	## Usage

	```python
	import torch
	import soundfile as sf
	from vibevoice.modular.modeling_vibevoice_streaming_inference import (
	    VibeVoiceStreamingForConditionalGenerationInference,
	)

	# Load the fine-tuned checkpoint in bf16 and move it to the GPU
	model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
	    "Rcarvalo/vibevoice",
	    torch_dtype=torch.bfloat16,
	).to("cuda")

	# Generate French speech
	audio = model.generate(text="Bonjour, comment allez-vous aujourd'hui?")

	# bf16 tensors cannot be converted to NumPy directly, so cast to float32 first
	sf.write("output.wav", audio.float().cpu().numpy(), 24000)  # 24 kHz sample rate
	```

	## License

	MIT (same as base model)