---
license: apache-2.0
datasets:
- ebellob/annotated_catalan_common_voice_v17_cleaned_enhanced
language:
- ca
base_model:
- k2-fsa/ZipVoice
tags:
- catalan
- tts
- audio
- flashsr
- cleanunet
- zipvoice
pipeline_tag: text-to-speech
---

# ZipVoice-CA: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching (now in Catalan!)

This repository contains the checkpoint of the fine-tuned ZipVoice-CA model, which synthesises high-quality Catalan speech. Its performance metrics are listed below. For details on the training and evaluation of the model, please refer to its repository at https://github.com/ErikUPV/ZipVoice-CA. Please also visit the repository of the original [ZipVoice](https://github.com/k2-fsa/ZipVoice) model.

To get a feel for the model, click [here](https://erikupv.github.io/zipvoice-samples/) to listen to some samples.

## Performance Metrics

| Dataset | WER (%) ↓ | CER (%) ↓ | SIM-o ↑ | UTMOS ↑ |
| :--- | :---: | :---: | :---: | :---: |
| **Common Voice 17** | 10.96 | 3.00 | 0.68 | 3.17 |
| **Festcat** | 7.31 | 2.56 | 0.65 | 3.46 |
| **LaFrescat** | 7.61 | 2.56 | 0.67 | 3.54 |

---

## Installation

### 1. Clone the repository

```bash
git clone https://github.com/erikupv/zipvoice-ca
cd zipvoice-ca
```

### 2. Environment Setup

We recommend using **Conda** to manage your dependencies and ensure a clean environment:

```bash
conda create -n ZipVoice python=3.11
conda activate ZipVoice
pip install -r requirements_zipvoice.txt
```

### 3. Download the Catalan Model

Use the Hugging Face CLI to download the fine-tuned checkpoint directly into your local models directory:

```bash
# pip install huggingface_hub
huggingface-cli download \
    --local-dir models \
    ebellob/ZipVoice-CA \
    zipvoice_ca.pt
```

---

## Inference

To generate speech from a `test.tsv` file using the Catalan model, use the command below.
```bash
python3 -m zipvoice.bin.infer_zipvoice \
    --model-name zipvoice \
    --model-dir ./models \
    --checkpoint-name zipvoice_ca.pt \
    --tokenizer "espeak" \
    --lang "ca" \
    --test-list data_cat/raw/test.tsv \
    --res-dir results/ \
    --guidance-scale 1.0 \
    --num-step 16
```

For single-file inference:

```bash
python3 -m zipvoice.bin.infer_zipvoice \
    --model-name zipvoice \
    --prompt-wav prompt.wav \
    --prompt-text "I am the transcription of the prompt wav." \
    --text "I am the text to be synthesized." \
    --res-wav-path result.wav \
    --model-dir ./models \
    --checkpoint-name zipvoice_ca.pt \
    --tokenizer "espeak" \
    --lang "ca" \
    --guidance-scale 1.0 \
    --num-step 16
```

---

## Acknowledgments

This work is a fine-tuned version of the [ZipVoice](https://github.com/k2-fsa/ZipVoice) project.

## License

This project is licensed under the **Apache-2.0 License**.
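Each row of the batch test list pairs a target sentence with a reference prompt. Below is a minimal Python sketch for building such a `test.tsv` programmatically, assuming a tab-separated four-column layout (utterance name, prompt transcription, prompt wav path, target text); the helper `write_test_list` and the example rows are hypothetical, so check the ZipVoice repository for the exact format expected by your version of `--test-list`.

```python
import csv
from pathlib import Path

def write_test_list(rows, path):
    """Write a tab-separated test list for batch inference.

    Assumed (unverified) row layout:
        utterance_name, prompt_transcription, prompt_wav_path, target_text
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerows(rows)

# Hypothetical example rows in Catalan.
rows = [
    ("utt_001", "Transcripció del prompt.", "prompts/speaker1.wav",
     "Bon dia, com estàs?"),
    ("utt_002", "Transcripció del prompt.", "prompts/speaker1.wav",
     "Aquesta és una segona frase de prova."),
]
write_test_list(rows, "data_cat/raw/test.tsv")
```

Using `csv.writer` with a tab delimiter (rather than joining strings by hand) keeps the output well-formed even if a field ever contains a quote character.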