custom
X-VC / README.md
chenxie95's picture
Update README.md
9e54747 verified
---
license: mit
---
# X-VC
[![arXiv](https://img.shields.io/badge/arXiv-2604.12456-b31b1b.svg)](https://arxiv.org/abs/2604.12456)
[![GitHub](https://img.shields.io/badge/GitHub-Code-181717?logo=github)](https://github.com/Jerrister/X-VC)
[![Demo Page](https://img.shields.io/badge/Demo-Project%20Page-blue)](https://x-vc.github.io)
Official code release for **X-VC: Zero-shot Streaming Voice Conversion in Codec Space**.
## Environment Setup
### 1. Clone
```bash
git clone https://github.com/Jerrister/X-VC.git
cd X-VC
```
### 2. Create conda environment and install dependencies
```bash
conda create -n xvc python=3.10 -y
conda activate xvc
pip install -U pip
pip install -r requirements.txt
```
### 3. Prepare pretrained models
Prepare:
- [GLM-4-Voice-Tokenizer](https://huggingface.co/zai-org/glm-4-voice-tokenizer) (for semantic tokenization)
- [ERes2Net speaker encoder](https://modelscope.cn/models/iic/speech_eres2net_sv_en_voxceleb_16k) (for speaker feature extraction)
Then set paths in [`configs/xvc.yaml`](configs/xvc.yaml), especially:
- `model.generator.semantic_encoder.encoder.from_pretrained`
- `model.generator.semantic_encoder.cfg`
- `model.generator.speaker_encoder.pretrained_dir`
### 4. Prepare checkpoints
Put checkpoints under `ckpts/`, for example:
```text
ckpts/
xvc.pt
```
## Inference
### Single-pair Inference
Use [`scripts/infer_single.sh`](scripts/infer_single.sh).
```bash
bash scripts/infer_single.sh
```
Key arguments in this script:
- `current=0` for offline inference.
- `current>0` for streaming inference.
- `chunk/current/future/smooth` control streaming behavior.
Outputs are saved under `save_dir` (default: `outputs/xvc_single`).
### Batch Offline Inference (SeedTTS-eval as example)
Use [`scripts/batch_infer_seedtts_offline.sh`](scripts/batch_infer_seedtts_offline.sh).
```bash
bash scripts/batch_infer_seedtts_offline.sh
```
This script reports:
- `saved_dir`
- `total_rtf`
### Batch Streaming Inference (SeedTTS-eval as example)
Use [`scripts/batch_infer_seedtts_stream.sh`](scripts/batch_infer_seedtts_stream.sh).
```bash
bash scripts/batch_infer_seedtts_stream.sh
```
This script reports:
- `saved_dir`
- `avg_latency_ms`
## Training
### Step 1: Prepare pretrained dependencies
Before training, prepare the required pretrained dependencies:
- [SAC pretrained checkpoint(s)](https://huggingface.co/Soul-AILab/SAC-16k-62_5Hz) (for model initialization)
Then set corresponding paths in [`configs/xvc.yaml`](configs/xvc.yaml), especially:
- `model.generator.checkpoint`
- `model.discriminator.checkpoint`
### Step 2: Prepare training data
Organize your training/validation data in JSONL format and set:
- `datasets.train`
- `datasets.val`
in [`configs/xvc.yaml`](configs/xvc.yaml).
### Step 3: Modify training configs
You can adjust training behavior in:
- [`configs/xvc.yaml`](configs/xvc.yaml) (main training config)
- [`configs/ds_stage2.json`](configs/ds_stage2.json) (DeepSpeed config)
### Step 4: Start training
Use [`scripts/train.sh`](scripts/train.sh).
```bash
bash scripts/train.sh
```
Notes:
- Default training engine is DeepSpeed (`configs/ds_stage2.json`).
- Main experiment config is `configs/xvc.yaml`.
- Set your `WANDB_API_KEY` in `scripts/train.sh` before running if you use wandb logging.
## Data Format
Training config points to JSONL files in `configs/xvc.yaml`:
- `datasets.train`
- `datasets.val`
Each JSONL line should be a JSON object.
Required fields:
- `target_utt`
- `source_wav_path`
- `target_wav_path`
Optional field:
- `source_utt`
Minimal example:
```json
{"source_utt":"utt_0001","source_wav_path":"<path_to_source>","target_utt":"utt_0002","target_wav_path":"<path_to_target>"}
```
## Acknowledgements
This codebase builds upon open-source components from [SAC](https://github.com/Soul-AILab/SAC) and the broader audio generation ecosystem.
## Citation
If you find our work useful in your research, please consider citing:
```bibtex
@misc{zheng2026xvczeroshotstreamingvoice,
title={X-VC: Zero-shot Streaming Voice Conversion in Codec Space},
author={Qixi Zheng and Yuxiang Zhao and Tianrui Wang and Wenxi Chen and Kele Xu and Yikang Li and Qinyuan Chen and Xipeng Qiu and Kai Yu and Xie Chen},
year={2026},
eprint={2604.12456},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2604.12456},
}
```
## License
This project is licensed under the MIT License.