	---
	license: apache-2.0
	datasets:
	- ebellob/annotated_catalan_common_voice_v17_cleaned_enhanced
	language:
	- ca
	base_model:
	- k2-fsa/ZipVoice
	tags:
	- catalan
	- tts
	- audio
	- flashsr
	- cleanunet
	- zipvoice
	pipeline_tag: text-to-speech
	---

	# ZipVoice-CA: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching (now in Catalan!)

	This repository contains the checkpoint of ZipVoice-CA, a ZipVoice model fine-tuned to synthesise speech in Catalan. Its evaluation metrics are listed below.
	For details on training and evaluation, see the project repository at https://github.com/ErikUPV/ZipVoice-CA.
	The base model is documented in the original [ZipVoice](https://github.com/k2-fsa/ZipVoice) repository.

	To hear what the model sounds like, listen to the [samples](https://erikupv.github.io/zipvoice-samples/).

	## Performance Metrics

	| Dataset | WER (%) ↓ | CER (%) ↓ | SIM-o ↑ | UTMOS ↑ |
	| :--- | :---: | :---: | :---: | :---: |
	| Common Voice 17 | 10.96 | 3.00 | 0.68 | 3.17 |
	| Festcat | 7.31 | 2.56 | 0.65 | 3.46 |
	| LaFrescat | 7.61 | 2.56 | 0.67 | 3.54 |

	---

	## Installation

	### 1. Clone the repository
	```bash
	git clone https://github.com/ErikUPV/ZipVoice-CA
	cd ZipVoice-CA
	```

	### 2. Environment Setup
	We recommend using Conda to manage your dependencies and ensure a clean environment:
	```bash
	conda create -n ZipVoice python=3.11
	conda activate ZipVoice
	pip install -r requirements_zipvoice.txt
	```

	### 3. Download the Catalan Model
	Use the Hugging Face CLI to download the fine-tuned checkpoint directly into your local models directory:
	```bash
	# pip install huggingface_hub
	huggingface-cli download \
	--local-dir models \
	ebellob/ZipVoice-CA \
	zipvoice_ca.pt
	```
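
Before running inference, it can help to confirm that the checkpoint ended up where the later commands expect it. The following is a minimal sketch: `check_checkpoint` is a hypothetical helper, not part of ZipVoice, and the path assumes the `--local-dir models` option and the `zipvoice_ca.pt` file name used above.

```bash
# Hypothetical helper (not part of ZipVoice): succeed only if the file exists and is non-empty.
check_checkpoint() {
  if [ -s "$1" ]; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Path assumed from the download command above (--local-dir models, zipvoice_ca.pt).
CKPT="models/zipvoice_ca.pt"
check_checkpoint "$CKPT" || echo "Checkpoint not found; re-run the download step."
```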

	---

	## Inference

	To generate speech for every entry in a `test.tsv` file with the Catalan model, run:

	```bash
	python3 -m zipvoice.bin.infer_zipvoice \
	--model-name zipvoice \
	--model-dir ./models \
	--checkpoint-name zipvoice_ca.pt \
	--tokenizer "espeak" \
	--lang "ca" \
	--test-list data_cat/raw/test.tsv \
	--res-dir results/ \
	--guidance-scale 1.0 \
	--num-step 16
	```
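
The command above reads its inputs from a tab-separated list. As a rough sketch, the snippet below writes a two-row `test.tsv` and checks its shape. It assumes the column order documented by the upstream ZipVoice project (`wav_name`, `prompt_transcription`, `prompt_wav`, `text`); please verify this against the repository before relying on it. The sample names, prompt paths, and sentences are hypothetical placeholders.

```bash
# Target file; this overwrites any existing file at that path.
TSV="${TSV:-data_cat/raw/test.tsv}"
mkdir -p "$(dirname "$TSV")"

# Assumed columns: wav_name, prompt_transcription, prompt_wav, text (tab-separated).
# All values below are placeholders, not files shipped with the model.
printf '%s\t%s\t%s\t%s\n' \
  "sample_001" "Transcripció de l'àudio de referència." "prompts/speaker_a.wav" "Bon dia, com estàs?" \
  "sample_002" "Transcripció de l'àudio de referència." "prompts/speaker_a.wav" "Gràcies per escoltar." \
  > "$TSV"

# Sanity check: every row must have exactly four tab-separated fields.
awk -F'\t' 'NF != 4 { bad = 1; print "line " NR ": " NF " fields" } END { exit bad + 0 }' "$TSV"
```

Each row produces one output file in `--res-dir`, so a malformed row is easier to catch here than halfway through a batch run.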

	For single-file inference, pass the prompt audio, its transcription, and the target text directly:

	```bash
	python3 -m zipvoice.bin.infer_zipvoice \
	--model-name zipvoice \
	--prompt-wav prompt.wav \
	--prompt-text "I am the transcription of the prompt wav." \
	--text "I am the text to be synthesized." \
	--res-wav-path result.wav \
	--model-dir ./models \
	--checkpoint-name zipvoice_ca.pt \
	--tokenizer "espeak" \
	--lang "ca" \
	--guidance-scale 1.0 \
	--num-step 16
	```
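
To synthesise several sentences with the same reference prompt, the single-file command can be wrapped in a loop. The sketch below reuses only the flags shown above; `run`, `synth`, `DRY_RUN`, the prompt transcription, and the example sentences are hypothetical. By default it only prints the commands, so it is safe to try without the model installed; set `DRY_RUN=0` to execute them.

```bash
DRY_RUN="${DRY_RUN:-1}"                                # 1 = print commands only, 0 = run them
PROMPT_TEXT="Transcripció de l'àudio de referència."   # placeholder transcription of prompt.wav

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# synth <index> <text>: one single-file inference call per sentence.
synth() {
  run python3 -m zipvoice.bin.infer_zipvoice \
    --model-name zipvoice \
    --model-dir ./models \
    --checkpoint-name zipvoice_ca.pt \
    --tokenizer espeak \
    --lang ca \
    --prompt-wav prompt.wav \
    --prompt-text "$PROMPT_TEXT" \
    --text "$2" \
    --res-wav-path "result_$1.wav" \
    --guidance-scale 1.0 \
    --num-step 16
}

i=0
for sentence in "Bon dia." "Gràcies per escoltar."; do
  i=$((i + 1))
  synth "$i" "$sentence"
done
```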
	---

	## Acknowledgments
	ZipVoice-CA is a fine-tuned version of the [ZipVoice](https://github.com/k2-fsa/ZipVoice) model by k2-fsa.

	## License
	This project is licensed under the Apache-2.0 License.