---
license: apache-2.0
datasets:
- ebellob/annotated_catalan_common_voice_v17_cleaned_enhanced
language:
- ca
base_model:
- k2-fsa/ZipVoice
tags:
- catalan
- tts
- audio
- flashsr
- cleanunet
- zipvoice
pipeline_tag: text-to-speech
---

# ZipVoice-CA: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching (now in Catalan!)

This repository contains the checkpoint for ZipVoice-CA, a fine-tuned version of ZipVoice that synthesises speech in Catalan. Its evaluation metrics are listed below.

For more information on training and evaluating the model, see its repository at https://github.com/ErikUPV/ZipVoice-CA.

Please also visit the repository of the original [ZipVoice](https://github.com/k2-fsa/ZipVoice) model.

To get a feel for the model, [listen to some samples](https://erikupv.github.io/zipvoice-samples/).

## Performance Metrics

| Dataset | WER (%) ↓ | CER (%) ↓ | SIM-o ↑ | UTMOS ↑ |
| :--- | :---: | :---: | :---: | :---: |
| **Common Voice 17** | 10.96 | 3.00 | 0.68 | 3.17 |
| **Festcat** | 7.31 | 2.56 | 0.65 | 3.46 |
| **LaFrescat** | 7.61 | 2.56 | 0.67 | 3.54 |
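
WER and CER are conventionally defined as the number of substitutions, deletions, and insertions needed to turn a transcript of the synthesised audio into the reference text, divided by the reference length in words (WER) or characters (CER). The sketch below shows only that arithmetic. The counts are made up, `error_rate` is a hypothetical helper, and the ASR system and scoring scripts behind the table are not described in this README.

```bash
# Sketch: error rate as a percentage from edit-operation counts.
# Usage: error_rate SUBSTITUTIONS DELETIONS INSERTIONS REFERENCE_LENGTH
error_rate() {
    awk -v s="$1" -v d="$2" -v i="$3" -v n="$4" \
        'BEGIN { printf "%.2f\n", 100 * (s + d + i) / n }'
}

# Hypothetical counts: 6 substitutions, 3 deletions, 2 insertions
# over a 100-word reference.
error_rate 6 3 2 100   # prints 11.00
```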

---

## Installation

### 1. Clone the repository

```bash
git clone https://github.com/erikupv/zipvoice-ca
cd zipvoice-ca
```

### 2. Environment Setup

We recommend using **Conda** to manage your dependencies and ensure a clean environment:

```bash
conda create -n ZipVoice python=3.11
conda activate ZipVoice
pip install -r requirements_zipvoice.txt
```

### 3. Download the Catalan Model

Use the Hugging Face CLI to download the fine-tuned checkpoint directly into your local `models` directory:

```bash
# pip install huggingface_hub
huggingface-cli download \
    --local-dir models \
    ebellob/ZipVoice-CA \
    zipvoice_ca.pt
```
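
Before running inference, it can help to confirm that the checkpoint actually landed where the inference commands expect it. The following is a minimal sketch, not part of the repository: `check_checkpoint` is a hypothetical helper, and the default path assumes the `--local-dir models` location used in the download command. It only tests that the file exists and is non-empty; it does not validate the checkpoint contents.

```bash
# Sketch: fail early if the checkpoint is missing or empty.
# Usage: check_checkpoint [PATH]   (defaults to models/zipvoice_ca.pt)
check_checkpoint() {
    ckpt="${1:-models/zipvoice_ca.pt}"
    if [ -s "$ckpt" ]; then
        echo "Found checkpoint: $ckpt"
    else
        echo "Missing or empty checkpoint: $ckpt" >&2
        return 1
    fi
}

check_checkpoint models/zipvoice_ca.pt || echo "Run the download step first."
```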

---

## Inference

To synthesise speech for every entry listed in a `test.tsv` file with the Catalan model, run:

```bash
python3 -m zipvoice.bin.infer_zipvoice \
    --model-name zipvoice \
    --model-dir ./models \
    --checkpoint-name zipvoice_ca.pt \
    --tokenizer "espeak" \
    --lang "ca" \
    --test-list data_cat/raw/test.tsv \
    --res-dir results/ \
    --guidance-scale 1.0 \
    --num-step 16
```
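
The layout of `test.tsv` is not spelled out in this README. Upstream ZipVoice documents one utterance per line with four tab-separated fields (`wav_name`, `prompt_transcription`, `prompt_wav`, `text`); assuming this fine-tune keeps that convention, a small list could be generated as in the sketch below. `make_test_list` is a hypothetical helper, and the utterance names, prompt path, and sentences are placeholders.

```bash
# Sketch: build a two-line test list for batch inference.
# Assumed column order (upstream ZipVoice convention, not confirmed here):
#   wav_name <TAB> prompt_transcription <TAB> prompt_wav <TAB> text
# Note: this overwrites the target file if it already exists.
make_test_list() {
    out="$1"
    mkdir -p "$(dirname "$out")"
    # printf reuses the format string for each group of four arguments.
    printf '%s\t%s\t%s\t%s\n' \
        "sample_001" "Bon dia, com esteu?" "prompts/speaker_a.wav" "Avui fa molt bon temps." \
        "sample_002" "Bon dia, com esteu?" "prompts/speaker_a.wav" "Gràcies per la vostra atenció." \
        > "$out"
}

make_test_list data_cat/raw/test.tsv
```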

To synthesise a single utterance, pass the prompt and the target text directly:

```bash
python3 -m zipvoice.bin.infer_zipvoice \
    --model-name zipvoice \
    --prompt-wav prompt.wav \
    --prompt-text "I am the transcription of the prompt wav." \
    --text "I am the text to be synthesized." \
    --res-wav-path result.wav \
    --model-dir ./models \
    --checkpoint-name zipvoice_ca.pt \
    --tokenizer "espeak" \
    --lang "ca" \
    --guidance-scale 1.0 \
    --num-step 16
```

---

## Acknowledgments

This model is fine-tuned from the [ZipVoice](https://github.com/k2-fsa/ZipVoice) project.

## License

This project is licensed under the **Apache-2.0 License**.