ZipVoice-CA: Catalan Zero-Shot Text-to-Speech with ZipVoice

Catalan fine-tune of ZipVoice, a fast zero-shot text-to-speech model based on flow matching.

This repository contains the fine-tuned ZipVoice-CA checkpoint for Catalan speech synthesis. For the full training, preprocessing, inference, and evaluation recipe, see the GitHub repository (https://github.com/ErikUPV/ZipVoice-CA).

The metrics below are intended as indicative benchmarks under this repository's evaluation setup, not as definitive state-of-the-art claims.

Performance Metrics

Dataset	WER (%) ↓	CER (%) ↓	SIM-o ↑	UTMOS ↑
Common Voice 17	10.96	3.00	0.68	3.17
FestCat	7.31	2.56	0.65	3.46
LaFrescat	7.61	2.56	0.67	3.54

All evaluated samples were generated with the ZipVoice-CA recipe using guidance_scale=1.0 and num_step=25, the same values passed in the inference commands under Inference.

Installation

1. Clone the recipe repository

git clone https://github.com/ErikUPV/ZipVoice-CA.git
cd ZipVoice-CA

2. Create the environment

conda create -n ZipVoice python=3.11
conda activate ZipVoice
pip install -r requirements_zipvoice.txt

3. Download the Catalan checkpoint

# pip install huggingface_hub
huggingface-cli download \
  --local-dir models \
  ebellob/ZipVoice-CA \
  zipvoice_ca.pt
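
The same download can be done from Python, which may be convenient inside a notebook or a setup script. This is a minimal sketch, assuming `huggingface_hub` is installed; the helper name `download_checkpoint` is hypothetical and not part of the recipe.

```python
from pathlib import Path


def download_checkpoint(local_dir: str = "models") -> Path:
    """Fetch zipvoice_ca.pt from the Hub into local_dir and return its path.

    Hypothetical helper: mirrors the huggingface-cli command above.
    """
    # Imported lazily so this module loads even if huggingface_hub is missing.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(
        repo_id="ebellob/ZipVoice-CA",  # model repository from the CLI command
        filename="zipvoice_ca.pt",      # checkpoint file from the CLI command
        local_dir=local_dir,            # same role as --local-dir
    )
    return Path(path)
```

Calling `download_checkpoint()` should leave the checkpoint at `models/zipvoice_ca.pt`, which is the location the inference commands expect via `--model-dir ./models`.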

Inference

Batch inference from a `test.tsv` file

python3 -m zipvoice.bin.infer_zipvoice \
  --model-name zipvoice \
  --model-dir ./models \
  --checkpoint-name zipvoice_ca.pt \
  --tokenizer espeak \
  --lang ca \
  --test-list data_cat/raw/test.tsv \
  --res-dir results/ \
  --guidance-scale 1.0 \
  --num-step 25
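
This section does not document the column layout of `test.tsv`. The sketch below assumes the four-column, tab-separated layout used by upstream ZipVoice (`wav_name`, `prompt_transcription`, `prompt_wav`, `text`); verify this against the recipe repository before relying on it. The helper name `write_test_list` and the example row are hypothetical.

```python
from pathlib import Path


def write_test_list(rows, path):
    """Write (wav_name, prompt_text, prompt_wav, text) rows as a TSV file.

    Assumed layout, one utterance per line:
        wav_name <TAB> prompt_transcription <TAB> prompt_wav <TAB> text
    """
    lines = []
    for wav_name, prompt_text, prompt_wav, text in rows:
        fields = [wav_name, prompt_text, prompt_wav, text]
        for field in fields:
            # A tab or newline inside a field would corrupt the column layout.
            if "\t" in field or "\n" in field:
                raise ValueError(f"field contains a tab or newline: {field!r}")
        lines.append("\t".join(fields))

    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out


# Hypothetical example row; replace with real prompts and target text.
example_rows = [
    ("utt_0001", "Bon dia a tothom.", "prompts/speaker1.wav", "Avui fa molt bon temps."),
]
```

For instance, `write_test_list(example_rows, "data_cat/raw/test.tsv")` would produce a file at the path passed to `--test-list` above.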

Single-sample inference

python3 -m zipvoice.bin.infer_zipvoice \
  --model-name zipvoice \
  --prompt-wav prompt.wav \
  --prompt-text "I am the transcription of the prompt wav." \
  --text "I am the text to be synthesized." \
  --res-wav-path result.wav \
  --model-dir ./models \
  --checkpoint-name zipvoice_ca.pt \
  --tokenizer espeak \
  --lang ca \
  --guidance-scale 1.0 \
  --num-step 25

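The prompt audio should contain the reference speaker's voice, and --prompt-text must match the transcription of that prompt audio. The quoted strings in the command above are placeholders: replace them with the Catalan transcription of the prompt and the Catalan text to synthesize.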
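
When synthesizing many single samples from a script, it can help to build the command programmatically. The sketch below only assembles the argument list for the single-sample command shown above, using the flags that appear in this section; the function name `build_infer_command` is hypothetical.

```python
import shlex


def build_infer_command(prompt_wav, prompt_text, text, res_wav_path,
                        model_dir="./models", checkpoint_name="zipvoice_ca.pt",
                        guidance_scale=1.0, num_step=25):
    """Return the single-sample inference command as an argv list."""
    return [
        "python3", "-m", "zipvoice.bin.infer_zipvoice",
        "--model-name", "zipvoice",
        "--prompt-wav", str(prompt_wav),
        "--prompt-text", prompt_text,
        "--text", text,
        "--res-wav-path", str(res_wav_path),
        "--model-dir", str(model_dir),
        "--checkpoint-name", checkpoint_name,
        "--tokenizer", "espeak",
        "--lang", "ca",
        "--guidance-scale", str(guidance_scale),
        "--num-step", str(num_step),
    ]


def format_command(cmd):
    """Render an argv list as a shell-safe string for logging."""
    return shlex.join(cmd)
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from inside the cloned recipe directory. Passing a list rather than a single string avoids shell-quoting problems with apostrophes, which are common in Catalan text.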

Evaluation Setup

The reported metrics are computed on generated samples from three Catalan evaluation sources:

held-out Common Voice 17 Catalan samples,
FestCat prompts,
LaFrescat prompts.

Metrics:

WER / CER: ASR-based intelligibility metrics.
SIM-o: speaker similarity between prompt and generated speech.
UTMOS: automatic MOS-style naturalness estimate.
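
To make the metric definitions concrete, here is a minimal sketch of how WER, CER, and a cosine-based speaker similarity are commonly computed. It is illustrative only and is not the recipe's evaluation code: the ASR model, the speaker-embedding model, and any text normalization are not specified in this section, so the functions take already-transcribed strings and precomputed embedding vectors as inputs. Counting spaces as characters in CER is also an assumption.

```python
import math


def edit_distance(ref, hyp):
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent, relative to the reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent (spaces counted as characters)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)


def cosine_similarity(a, b):
    """Cosine similarity between two speaker-embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm
```

In this framing, the reference is the input text, the hypothesis is the ASR transcript of the generated audio, and the two vectors are embeddings of the prompt and the generated speech. Lower WER and CER indicate better intelligibility; a higher similarity indicates a closer match to the prompt speaker.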

Limitations

This model is intended for Catalan text-to-speech research and experimentation. Quality may vary depending on prompt quality, prompt duration, speaker characteristics, text normalization, and out-of-domain inputs.

As with any zero-shot TTS model, users should avoid generating speech that impersonates real people without consent.

Acknowledgments

This model is a fine-tuned version of ZipVoice (k2-fsa/ZipVoice), initialized from the pretrained checkpoint released by the original authors.

License

This model is released under the Apache-2.0 License.
