ZipVoice-CA: Catalan Zero-Shot Text-to-Speech with ZipVoice

Catalan fine-tune of ZipVoice, a fast zero-shot text-to-speech model based on flow matching.

Links: Listen to samples · GitHub repository · Base ZipVoice repository

This repository contains the fine-tuned ZipVoice-CA checkpoint for Catalan speech synthesis. For the full training, preprocessing, inference, and evaluation recipe, see the GitHub repository (https://github.com/ErikUPV/ZipVoice-CA).

The metrics below are intended as indicative benchmarks under this repository's evaluation setup, not as definitive state-of-the-art claims.


Performance Metrics

Dataset            WER (%) ↓   CER (%) ↓   SIM-o ↑   UTMOS ↑
Common Voice 17    10.96       3.00        0.68      3.17
FestCat            7.31        2.56        0.65      3.46
LaFrescat          7.61        2.56        0.67      3.54

Evaluation uses generated samples from the ZipVoice-CA recipe with guidance_scale=1.0 and num_step=25.


Installation

1. Clone the recipe repository

git clone https://github.com/ErikUPV/ZipVoice-CA.git
cd ZipVoice-CA

2. Create the environment

conda create -n ZipVoice python=3.11
conda activate ZipVoice
pip install -r requirements_zipvoice.txt

3. Download the Catalan checkpoint

# pip install huggingface_hub
huggingface-cli download \
  --local-dir models \
  ebellob/ZipVoice-CA \
  zipvoice_ca.pt

Inference

Batch inference from a test.tsv file

python3 -m zipvoice.bin.infer_zipvoice \
  --model-name zipvoice \
  --model-dir ./models \
  --checkpoint-name zipvoice_ca.pt \
  --tokenizer espeak \
  --lang ca \
  --test-list data_cat/raw/test.tsv \
  --res-dir results/ \
  --guidance-scale 1.0 \
  --num-step 25
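
The layout of test.tsv is not described here. The upstream ZipVoice recipe documents one utterance per line as `{wav_name}\t{prompt_transcription}\t{prompt_wav}\t{text}`. Assuming this fork keeps that layout, a minimal sketch for generating the file could look like the following. The helper name `write_test_list` and the sample values in the usage note are hypothetical.

```python
from pathlib import Path


def write_test_list(rows, out_path):
    """Write (wav_name, prompt_text, prompt_wav, text) tuples as TSV lines.

    Assumption: the four-column order follows the upstream ZipVoice
    test-list convention; check the recipe before relying on it.
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8", newline="") as f:
        for wav_name, prompt_text, prompt_wav, text in rows:
            fields = [wav_name, prompt_text, prompt_wav, text]
            # A tab or newline inside a field would corrupt the column layout.
            if any("\t" in field or "\n" in field for field in fields):
                raise ValueError(f"field contains a tab or newline: {fields!r}")
            f.write("\t".join(fields) + "\n")
    return out_path
```

For example, `write_test_list([("utt_0001", "Bon dia a tothom.", "prompts/speaker1.wav", "Avui fa molt bon temps.")], "data_cat/raw/test.tsv")` would produce a one-line file at the path passed to --test-list above.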

Single-sample inference

python3 -m zipvoice.bin.infer_zipvoice \
  --model-name zipvoice \
  --prompt-wav prompt.wav \
  --prompt-text "I am the transcription of the prompt wav." \
  --text "I am the text to be synthesized." \
  --res-wav-path result.wav \
  --model-dir ./models \
  --checkpoint-name zipvoice_ca.pt \
  --tokenizer espeak \
  --lang ca \
  --guidance-scale 1.0 \
  --num-step 25

The prompt audio should contain the reference speaker voice, and --prompt-text should match the transcription of that prompt audio.
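
When single-sample inference is driven from a script, it can help to build the command programmatically. The sketch below only reassembles the flags shown above. The function names `build_infer_command` and `synthesize` are hypothetical, and it assumes the `zipvoice` package is importable in the active environment.

```python
import subprocess
import sys


def build_infer_command(prompt_wav, prompt_text, text, res_wav_path,
                        model_dir="./models",
                        checkpoint_name="zipvoice_ca.pt",
                        guidance_scale=1.0, num_step=25):
    """Return the argv list for one zero-shot synthesis call."""
    return [
        sys.executable, "-m", "zipvoice.bin.infer_zipvoice",
        "--model-name", "zipvoice",
        "--prompt-wav", str(prompt_wav),
        "--prompt-text", prompt_text,   # must transcribe prompt_wav
        "--text", text,                 # Catalan text to synthesize
        "--res-wav-path", str(res_wav_path),
        "--model-dir", str(model_dir),
        "--checkpoint-name", checkpoint_name,
        "--tokenizer", "espeak",
        "--lang", "ca",
        "--guidance-scale", str(guidance_scale),
        "--num-step", str(num_step),
    ]


def synthesize(*args, **kwargs):
    """Run the command and raise CalledProcessError on a non-zero exit."""
    subprocess.run(build_infer_command(*args, **kwargs), check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems when the prompt transcription or target text contains apostrophes, which are common in Catalan.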


Evaluation Setup

The reported metrics are computed on generated samples from three Catalan evaluation sources:

  • held-out Common Voice 17 Catalan samples,
  • FestCat prompts,
  • LaFrescat prompts.

Metrics:

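  • WER / CER: word and character error rates of an ASR transcript of the generated audio, measured against the input text; lower values indicate more intelligible speech.
  • SIM-o: speaker similarity between the original prompt audio and the generated speech, typically the cosine similarity of speaker embeddings; higher is better.
  • UTMOS: an automatic, model-predicted MOS-style naturalness estimate; higher is better.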
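
To make the metric definitions concrete, here is a minimal, dependency-free sketch of the arithmetic behind WER, CER, and a cosine-similarity score of the kind SIM-o reports. It is not the recipe's evaluation code: the ASR model, the speaker-embedding model, and any text normalization are outside its scope, and counting spaces as characters in CER is an assumption of this sketch.

```python
import math


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate (assumption: spaces count as characters)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("zero-norm vector")
    return dot / norm
```

Here `reference` is the text given to the synthesizer and `hypothesis` is the ASR transcript of the generated audio. Both rates can exceed 1.0 when the transcript contains many insertions.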

Limitations

This model is intended for Catalan text-to-speech research and experimentation. Quality may vary depending on prompt quality, prompt duration, speaker characteristics, text normalization, and out-of-domain inputs.

As with any zero-shot TTS model, users should avoid generating speech that impersonates real people without consent.


Acknowledgments

This model is a fine-tuned version of ZipVoice, using the pretrained checkpoint released by the original authors.


License

This model is released under the Apache-2.0 License.

Model tree for ebellob/ZipVoice-CA

Base model: k2-fsa/ZipVoice (this model is a fine-tune of it)
