---
license: apache-2.0
datasets:
- ebellob/annotated_catalan_common_voice_v17_cleaned_enhanced
language:
- ca
base_model:
- k2-fsa/ZipVoice
tags:
- catalan
- tts
- audio
- flashsr
- cleanunet
- zipvoice
pipeline_tag: text-to-speech
---

# ZipVoice-CA: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching (now in Catalan!)

This repository contains the checkpoint of the fine-tuned ZipVoice-CA model, which synthesises high-quality Catalan speech. Its performance metrics are listed below. For details on the training and evaluation of the model, please refer to its repository at https://github.com/ErikUPV/ZipVoice-CA. Please also visit the repository of the original [ZipVoice](https://github.com/k2-fsa/ZipVoice) model.

To get a feel for the model, click [here](https://erikupv.github.io/zipvoice-samples/) to listen to some samples.

## Performance Metrics

| Dataset | WER (%) ↓ | CER (%) ↓ | SIM-o ↑ | UTMOS ↑ |
| :--- | :---: | :---: | :---: | :---: |
| **Common Voice 17** | 10.96 | 3.00 | 0.68 | 3.17 |
| **Festcat** | 7.31 | 2.56 | 0.65 | 3.46 |
| **LaFrescat** | 7.61 | 2.56 | 0.67 | 3.54 |

---

## Installation

### 1. Clone the repository

```bash
git clone https://github.com/erikupv/zipvoice-ca
cd zipvoice-ca
```

### 2. Environment Setup

We recommend using **Conda** to manage your dependencies and ensure a clean environment:

```bash
conda create -n ZipVoice python=3.11
conda activate ZipVoice
pip install -r requirements_zipvoice.txt
```

### 3. Download the Catalan Model

Use the Hugging Face CLI to download the fine-tuned checkpoint directly into your local models directory:

```bash
# pip install huggingface_hub
huggingface-cli download \
    --local-dir models \
    ebellob/ZipVoice-CA \
    zipvoice_ca.pt
```

---

## Inference

To generate speech from a `test.tsv` file using the Catalan model, use the command below.
```bash
python3 -m zipvoice.bin.infer_zipvoice \
    --model-name zipvoice \
    --model-dir ./models \
    --checkpoint-name zipvoice_ca.pt \
    --tokenizer "espeak" \
    --lang "ca" \
    --test-list data_cat/raw/test.tsv \
    --res-dir results/ \
    --guidance-scale 1.0 \
    --num-step 16
```

For single-file inference:

```bash
python3 -m zipvoice.bin.infer_zipvoice \
    --model-name zipvoice \
    --prompt-wav prompt.wav \
    --prompt-text "I am the transcription of the prompt wav." \
    --text "I am the text to be synthesized." \
    --res-wav-path result.wav \
    --model-dir ./models \
    --checkpoint-name zipvoice_ca.pt \
    --tokenizer "espeak" \
    --lang "ca" \
    --guidance-scale 1.0 \
    --num-step 16
```

---

## Acknowledgments

This work is a fine-tuned version of the [ZipVoice](https://github.com/k2-fsa/ZipVoice) project.

## License

This project is licensed under the **Apache-2.0 License**.
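Each row of the batch test list pairs a target sentence with a reference prompt. Below is a minimal Python sketch for building such a `test.tsv` programmatically, assuming a tab-separated four-column layout (utterance name, prompt transcription, prompt wav path, target text); the helper `write_test_list` and the example rows are hypothetical, so check the ZipVoice repository for the exact format expected by your version of `--test-list`.

```python
import csv
from pathlib import Path

def write_test_list(rows, path):
    """Write a tab-separated test list for batch inference.

    Assumed (unverified) row layout:
        utterance_name, prompt_transcription, prompt_wav_path, target_text
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerows(rows)

# Hypothetical example rows in Catalan.
rows = [
    ("utt_001", "Transcripció del prompt.", "prompts/speaker1.wav",
     "Bon dia, com estàs?"),
    ("utt_002", "Transcripció del prompt.", "prompts/speaker1.wav",
     "Aquesta és una segona frase de prova."),
]
write_test_list(rows, "data_cat/raw/test.tsv")
```

Using `csv.writer` with a tab delimiter (rather than joining strings by hand) keeps the output well-formed even if a field ever contains a quote character.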