---
license: cc-by-nc-4.0
pipeline_tag: audio-to-audio
library_name: f5-tts
extra_gated_prompt: "You agree to not use the model to generate, share, or promote content that is illegal, harmful, deceptive, or intended to impersonate real individuals without their informed consent."
extra_gated_fields:
  Affiliation: text
  Country: country
  I agree to use this model for non-commercial use ONLY: checkbox
---

# EZ-VC: Easy Zero-shot Any-to-Any Voice Conversion

[![github](https://img.shields.io/badge/Github-code-brightgreen)](https://github.com/EZ-VC/EZ-VC)
[![paper](https://img.shields.io/badge/EMNLP-findings.2025.1077-b31b1b.svg?logo=acl)](https://aclanthology.org/2025.findings-emnlp.1077/)
[![demo](https://img.shields.io/badge/Demo-page-yellow.svg)](https://ez-vc.github.io/EZ-VC-Demo/)
[![lab](https://img.shields.io/badge/SPRING-Lab-purple)](https://asr.iitm.ac.in/)


### Our paper has been published in the Findings of EMNLP 2025!

## Installation

### Create a separate environment if needed

```bash
# Create a Python 3.10 conda env (you could also use virtualenv)
conda create -n ez-vc python=3.10
conda activate ez-vc
```

### Local installation

```bash
git clone https://github.com/EZ-VC/EZ-VC
cd EZ-VC
git submodule update --init --recursive
pip install -e .

# Install ESPnet for XEUS (use exactly this version)
pip install 'espnet @ git+https://github.com/wanchichen/espnet.git@ssl'
```
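As a quick sanity check after installation, the sketch below tries to import the two packages installed above. The module names `f5_tts` and `espnet` are assumptions inferred from the repository layout (`src/f5_tts`) and the ESPnet package name, and `check_import` is a hypothetical helper, not part of EZ-VC.

```bash
# Interpreter to check; override with e.g. PYTHON=python3 if needed.
PYTHON="${PYTHON:-python}"

check_import() {
    # Succeeds (exit code 0) only if the named Python module can be imported.
    "$PYTHON" -c "import importlib, sys; importlib.import_module(sys.argv[1])" "$1" 2>/dev/null
}

# Module names are assumptions (see the note above); adjust if they differ.
for module in f5_tts espnet; do
    if check_import "$module"; then
        echo "ok: $module"
    else
        echo "missing: $module"
    fi
done
```

If either line reports `missing`, re-run the corresponding `pip install` command inside the activated environment.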

## Inference

A Jupyter notebook for inference is provided at `src/f5_tts/infer/infer.ipynb`.

Open the [inference notebook](https://github.com/EZ-VC/EZ-VC/blob/main/src/f5_tts/infer/infer.ipynb) and run all cells.

The converted audio will be available in the output of the last cell.
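If you prefer not to open the notebook interactively, it may be possible to execute it from the command line with `jupyter nbconvert`. This is only a sketch: it assumes `jupyter` and `nbconvert` are installed in the environment, that it is run from the repository root, and that any input paths are set inside the notebook itself. The `outputs` directory, the `infer_executed` file name, and the `run_inference_notebook` helper are hypothetical.

```bash
# Notebook path from this README; output location is a hypothetical choice.
NOTEBOOK="${NOTEBOOK:-src/f5_tts/infer/infer.ipynb}"
OUT_DIR="${OUT_DIR:-outputs}"

run_inference_notebook() {
    # Fail early with a clear message if the notebook is not where we expect it.
    if [ ! -f "$NOTEBOOK" ]; then
        echo "notebook not found: $NOTEBOOK (run this from the EZ-VC repo root)" >&2
        return 1
    fi
    mkdir -p "$OUT_DIR"
    # --execute runs every cell in order, like "Run all" in the Jupyter UI,
    # and writes an executed copy of the notebook into OUT_DIR.
    jupyter nbconvert --to notebook --execute "$NOTEBOOK" \
        --output-dir "$OUT_DIR" --output infer_executed
}
```

Calling `run_inference_notebook` then produces an executed copy of the notebook whose last cell should hold the converted audio; open that copy in Jupyter to listen to it.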


## Acknowledgements

- [F5-TTS](https://arxiv.org/abs/2410.06885) for open-sourcing their code, which made EZ-VC possible.

## Citation

If our work or codebase is useful to you, please cite it as:
```
@inproceedings{joglekar-etal-2025-ez,
    title = "{EZ}-{VC}: Easy Zero-shot Any-to-Any Voice Conversion",
    author = "Joglekar, Advait  and
      Singh, Divyanshu  and
      Bhatia, Rooshil Rohit  and
      Umesh, Srinivasan",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.1077/",
    doi = "10.18653/v1/2025.findings-emnlp.1077",
    pages = "19768--19774",
    ISBN = "979-8-89176-335-7",
    abstract = "Voice Conversion research in recent times has increasingly focused on improving the zero-shot capabilities of existing methods. Despite remarkable advancements, current architectures still tend to struggle in zero-shot cross-lingual settings. They are also often unable to generalize for speakers of unseen languages and accents. In this paper, we adopt a simple yet effective approach that combines discrete speech representations from self-supervised models with a non-autoregressive Diffusion-Transformer based conditional flow matching speech decoder. We show that this architecture allows us to train a voice-conversion model in a purely textless, self-supervised fashion. Our technique works without requiring multiple encoders to disentangle speech features. Our model also manages to excel in zero-shot cross-lingual settings even for unseen languages. We provide our code, model checkpoint and demo samples here: https://github.com/ez-vc/ez-vc"
}
```

## License

Our code is released under the MIT License. The pre-trained models are licensed under CC-BY-NC 4.0, which permits non-commercial use only. We apologize for any inconvenience this may cause.