ZipVoice-CA: Catalan Zero-Shot Text-to-Speech with ZipVoice
Catalan fine-tune of ZipVoice, a fast zero-shot text-to-speech model based on flow matching.
This repository contains the fine-tuned ZipVoice-CA checkpoint for Catalan speech synthesis. For the full training, preprocessing, inference, and evaluation recipe, see the GitHub repository at https://github.com/ErikUPV/ZipVoice-CA.
The metrics below are intended as indicative benchmarks under this repository's evaluation setup, not as definitive state-of-the-art claims.
Performance Metrics
| Dataset | WER (%) ↓ | CER (%) ↓ | SIM-o ↑ | UTMOS ↑ |
|---|---|---|---|---|
| Common Voice 17 | 10.96 | 3.00 | 0.68 | 3.17 |
| FestCat | 7.31 | 2.56 | 0.65 | 3.46 |
| LaFrescat | 7.61 | 2.56 | 0.67 | 3.54 |
Evaluation uses generated samples from the ZipVoice-CA recipe with guidance_scale=1.0 and num_step=25.
Installation
1. Clone the recipe repository
git clone https://github.com/ErikUPV/ZipVoice-CA.git
cd ZipVoice-CA
2. Create the environment
conda create -n ZipVoice python=3.11
conda activate ZipVoice
pip install -r requirements_zipvoice.txt
3. Download the Catalan checkpoint
# pip install huggingface_hub
huggingface-cli download \
--local-dir models \
ebellob/ZipVoice-CA \
zipvoice_ca.pt
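The same download can also be done from Python. The following is a minimal sketch, assuming `huggingface_hub` is installed; the repository ID and file name are taken from the command above, while the helper names `expected_checkpoint_path` and `download_checkpoint` are hypothetical and not part of the recipe.

```python
from pathlib import Path

REPO_ID = "ebellob/ZipVoice-CA"  # repository named in the CLI command above
FILENAME = "zipvoice_ca.pt"      # checkpoint file named in the CLI command above


def expected_checkpoint_path(local_dir: str = "models") -> Path:
    """Return where the checkpoint is expected to sit inside local_dir."""
    return Path(local_dir) / FILENAME


def download_checkpoint(local_dir: str = "models") -> Path:
    """Download the checkpoint into local_dir unless it is already there."""
    target = expected_checkpoint_path(local_dir)
    if target.exists():
        return target
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, local_dir=local_dir)
    return Path(path)
```

Calling `download_checkpoint()` with the default argument should leave the file where the inference commands in this README expect it (`--model-dir ./models`, `--checkpoint-name zipvoice_ca.pt`).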
Inference
Batch inference from a test.tsv file
python3 -m zipvoice.bin.infer_zipvoice \
--model-name zipvoice \
--model-dir ./models \
--checkpoint-name zipvoice_ca.pt \
--tokenizer espeak \
--lang ca \
--test-list data_cat/raw/test.tsv \
--res-dir results/ \
--guidance-scale 1.0 \
--num-step 25
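This README does not describe the layout of `test.tsv`. The sketch below assumes the four-column, tab-separated layout used by the upstream ZipVoice inference script (`wav_name`, `prompt_transcription`, `prompt_wav`, `text`); check this against the recipe before relying on it. The helper name `write_test_list` and all file names and sentences are hypothetical.

```python
from pathlib import Path


def write_test_list(rows, out_path):
    """Write (wav_name, prompt_text, prompt_wav, text) rows as a TSV file.

    ASSUMPTION: the column order follows upstream ZipVoice; this README
    does not confirm it for the ZipVoice-CA recipe.
    """
    lines = []
    for wav_name, prompt_text, prompt_wav, text in rows:
        fields = [wav_name, prompt_text, prompt_wav, text]
        # Tabs or newlines inside a field would corrupt the column layout.
        if any("\t" in f or "\n" in f for f in fields):
            raise ValueError(f"field contains a tab or newline: {fields!r}")
        lines.append("\t".join(fields))
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out


# Hypothetical example rows: one output name, prompt pair, and target text each.
EXAMPLE_ROWS = [
    ("sample_0001", "Bon dia a tothom.", "prompts/speaker_a.wav",
     "Aquesta frase es genera amb la veu de referència."),
    ("sample_0002", "Gràcies per venir.", "prompts/speaker_b.wav",
     "El temps serà assolellat demà."),
]
```

For example, `write_test_list(EXAMPLE_ROWS, "data_cat/raw/test.tsv")` would produce a file at the path passed to `--test-list` in the command above.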
Single-sample inference
python3 -m zipvoice.bin.infer_zipvoice \
--model-name zipvoice \
--prompt-wav prompt.wav \
--prompt-text "I am the transcription of the prompt wav." \
--text "I am the text to be synthesized." \
--res-wav-path result.wav \
--model-dir ./models \
--checkpoint-name zipvoice_ca.pt \
--tokenizer espeak \
--lang ca \
--guidance-scale 1.0 \
--num-step 25
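The prompt audio should contain the reference speaker's voice, and --prompt-text should be the exact transcription of that audio. The English strings in the command above are placeholders; since the model is a Catalan fine-tune run with --lang ca, both the prompt transcription and the text to synthesize are expected to be Catalan.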
Evaluation Setup
The reported metrics are computed on generated samples from three Catalan evaluation sources:
- held-out Common Voice 17 Catalan samples,
- FestCat prompts,
- LaFrescat prompts.
Metrics:
- WER / CER: word and character error rates, ASR-based intelligibility metrics (lower is better).
- SIM-o: speaker similarity between the prompt and the generated speech (higher is better).
- UTMOS: automatic MOS-style naturalness estimate (higher is better).
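To make the WER and CER definitions concrete, here is a minimal sketch of how both can be computed from a reference text and an ASR transcript using Levenshtein edit distance. It is not the recipe's evaluation code: the ASR model and text normalization used for the reported numbers are not described here, and this sketch assumes whitespace tokenization for WER and counts spaces as characters for CER.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, with whitespace tokenization (assumption)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, spaces counted as characters (assumption)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)


# One substituted word out of four gives a WER of 25.0.
print(wer("el gat negre dorm", "el gat negra dorm"))
```

In practice the reference would be the input text and the hypothesis the ASR transcript of the generated audio, with both normalized the same way (casing, punctuation) before scoring.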
Limitations
This model is intended for Catalan text-to-speech research and experimentation. Quality may vary depending on prompt quality, prompt duration, speaker characteristics, text normalization, and out-of-domain inputs.
As with any zero-shot TTS model, users should avoid generating speech that impersonates real people without consent.
Acknowledgments
This model is a fine-tuned version of ZipVoice, using the pretrained checkpoint released by the original authors.
License
This model is released under the Apache-2.0 License.
Base model: k2-fsa/ZipVoice