---
license: mit
---
# X-VC
[arXiv](https://arxiv.org/abs/2604.12456)
[GitHub](https://github.com/Jerrister/X-VC)
[Project page](https://x-vc.github.io)
Official code release for **X-VC: Zero-shot Streaming Voice Conversion in Codec Space**.
## Environment Setup
### 1. Clone
```bash
git clone https://github.com/Jerrister/X-VC.git
cd X-VC
```
### 2. Create conda environment and install dependencies
```bash
conda create -n xvc python=3.10 -y
conda activate xvc
pip install -U pip
pip install -r requirements.txt
```
### 3. Prepare pretrained models
Download the following pretrained models:
- [GLM-4-Voice-Tokenizer](https://huggingface.co/zai-org/glm-4-voice-tokenizer) (for semantic tokenization)
- [ERes2Net speaker encoder](https://modelscope.cn/models/iic/speech_eres2net_sv_en_voxceleb_16k) (for speaker feature extraction)
Then set paths in [`configs/xvc.yaml`](configs/xvc.yaml), especially:
- `model.generator.semantic_encoder.encoder.from_pretrained`
- `model.generator.semantic_encoder.cfg`
- `model.generator.speaker_encoder.pretrained_dir`
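Each dotted key above names a nested entry in the YAML file. As a minimal sketch, the helper below walks a nested dictionary along such a key and lists the entries that are still unset. The helper names and the placeholder config are hypothetical and not part of this repository; in practice the dictionary would come from parsing `configs/xvc.yaml` with a YAML library of your choice.

```python
from typing import Any

# Dotted keys named in this README; each maps to a nested entry in configs/xvc.yaml.
REQUIRED_KEYS = [
    "model.generator.semantic_encoder.encoder.from_pretrained",
    "model.generator.semantic_encoder.cfg",
    "model.generator.speaker_encoder.pretrained_dir",
]


def get_by_dotted_key(config: dict, dotted_key: str) -> Any:
    """Walk a nested dict along a dotted key; return None if any level is missing."""
    node: Any = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def unset_keys(config: dict, keys=REQUIRED_KEYS) -> list:
    """Return the dotted keys whose value is missing or empty."""
    return [key for key in keys if not get_by_dotted_key(config, key)]


# Hypothetical, partially filled config (placeholder values, not the real defaults).
example_config = {
    "model": {
        "generator": {
            "semantic_encoder": {
                "encoder": {"from_pretrained": "<path_to_glm_4_voice_tokenizer>"},
                "cfg": "",
            },
            "speaker_encoder": {},
        }
    }
}

for key in unset_keys(example_config):
    print(f"still unset: {key}")
```

Running a check like this before inference can surface a forgotten path earlier than a model-loading error would.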
### 4. Prepare checkpoints
Put checkpoints under `ckpts/`, for example:
```text
ckpts/
└── xvc.pt
```
## Inference
### Single-pair Inference
Use [`scripts/infer_single.sh`](scripts/infer_single.sh).
```bash
bash scripts/infer_single.sh
```
Key arguments in this script:
- `current=0` for offline inference.
- `current>0` for streaming inference.
- `chunk/current/future/smooth` control streaming behavior.
Outputs are saved under `save_dir` (default: `outputs/xvc_single`).
### Batch Offline Inference (SeedTTS-eval as example)
Use [`scripts/batch_infer_seedtts_offline.sh`](scripts/batch_infer_seedtts_offline.sh).
```bash
bash scripts/batch_infer_seedtts_offline.sh
```
This script reports:
- `saved_dir`
- `total_rtf`
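For readers unfamiliar with the metric, the real-time factor (RTF) is conventionally the processing time divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below assumes that `total_rtf` aggregates over the whole batch as total processing time over total audio duration; the script's exact computation is not documented here, and the function names are hypothetical.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def aggregate_rtf(runs) -> float:
    """Aggregate RTF over (processing_seconds, audio_seconds) pairs.

    Assumption: totals are summed first and divided once, so longer
    utterances weigh more than in a plain mean of per-utterance RTFs.
    """
    total_processing = sum(processing for processing, _ in runs)
    total_audio = sum(audio for _, audio in runs)
    return real_time_factor(total_processing, total_audio)


# Made-up timings for three utterances, for illustration only.
example_runs = [(0.5, 4.0), (1.0, 8.0), (0.5, 4.0)]
print(aggregate_rtf(example_runs))  # 2.0 s of compute for 16.0 s of audio -> 0.125
```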
### Batch Streaming Inference (SeedTTS-eval as example)
Use [`scripts/batch_infer_seedtts_stream.sh`](scripts/batch_infer_seedtts_stream.sh).
```bash
bash scripts/batch_infer_seedtts_stream.sh
```
This script reports:
- `saved_dir`
- `avg_latency_ms`
## Training
### Step 1: Prepare pretrained dependencies
Before training, prepare the required pretrained dependencies:
- [SAC pretrained checkpoint(s)](https://huggingface.co/Soul-AILab/SAC-16k-62_5Hz) (for model initialization)
Then set corresponding paths in [`configs/xvc.yaml`](configs/xvc.yaml), especially:
- `model.generator.checkpoint`
- `model.discriminator.checkpoint`
### Step 2: Prepare training data
Organize your training/validation data in JSONL format (see [Data Format](#data-format)) and set the following keys in [`configs/xvc.yaml`](configs/xvc.yaml):
- `datasets.train`
- `datasets.val`
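A manifest with the fields listed in the Data Format section could be generated along these lines. This is a sketch only: `build_record` and `write_manifest` are hypothetical helpers, the wav paths are placeholders, and deriving utterance IDs from file stems is an assumption rather than a requirement of this codebase.

```python
import json
import tempfile
from pathlib import Path


def build_record(source_wav: str, target_wav: str) -> dict:
    """Build one manifest entry; utterance IDs are taken from file stems (an assumption)."""
    return {
        "source_utt": Path(source_wav).stem,
        "source_wav_path": source_wav,
        "target_utt": Path(target_wav).stem,
        "target_wav_path": target_wav,
    }


def write_manifest(pairs, out_path) -> int:
    """Write one JSON object per line and return the number of lines written."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with out_path.open("w", encoding="utf-8") as f:
        for source_wav, target_wav in pairs:
            f.write(json.dumps(build_record(source_wav, target_wav)) + "\n")
            count += 1
    return count


# Hypothetical file pairs; replace with your own (source, target) wav paths.
pairs = [
    ("data/wavs/utt_0001.wav", "data/wavs/utt_0002.wav"),
    ("data/wavs/utt_0003.wav", "data/wavs/utt_0004.wav"),
]

# A temporary directory keeps the demo self-contained; use a real path in practice.
with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "train.jsonl"
    written = write_manifest(pairs, manifest)
    first_line = manifest.read_text(encoding="utf-8").splitlines()[0]

print(written, first_line)
```

The resulting file paths are what `datasets.train` and `datasets.val` would then point to.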
### Step 3: Modify training configs
You can adjust training behavior in:
- [`configs/xvc.yaml`](configs/xvc.yaml) (main training config)
- [`configs/ds_stage2.json`](configs/ds_stage2.json) (DeepSpeed config)
### Step 4: Start training
Use [`scripts/train.sh`](scripts/train.sh).
```bash
bash scripts/train.sh
```
Notes:
- Default training engine is DeepSpeed (`configs/ds_stage2.json`).
- Main experiment config is `configs/xvc.yaml`.
- Set your `WANDB_API_KEY` in `scripts/train.sh` before running if you use wandb logging.
## Data Format
The training config `configs/xvc.yaml` points to the JSONL files through two keys:
- `datasets.train`
- `datasets.val`
Each JSONL line should be a JSON object.
Required fields:
- `target_utt`
- `source_wav_path`
- `target_wav_path`
Optional field:
- `source_utt`
Minimal example:
```json
{"source_utt":"utt_0001","source_wav_path":"<path_to_source>","target_utt":"utt_0002","target_wav_path":"<path_to_target>"}
```
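Before launching a long training run, it may help to check that every manifest line parses and carries the required fields. The validator below is a minimal sketch based only on the field list above; the function name is hypothetical, blank lines are skipped by choice, and it does not check that the audio files exist.

```python
import json

# Required fields as listed in this section; `source_utt` is optional.
REQUIRED_FIELDS = ("target_utt", "source_wav_path", "target_wav_path")


def validate_manifest_lines(lines):
    """Return a list of (line_number, problem) tuples; an empty list means no issues found."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline at end of file
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((lineno, "not a JSON object"))
            continue
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            problems.append((lineno, "missing fields: " + ", ".join(missing)))
    return problems


# Illustrative input: one valid line, one missing a field, one malformed.
sample_lines = [
    '{"source_utt":"utt_0001","source_wav_path":"<path_to_source>",'
    '"target_utt":"utt_0002","target_wav_path":"<path_to_target>"}',
    '{"target_utt":"utt_0002","source_wav_path":"<path_to_source>"}',
    "not json",
]
for lineno, problem in validate_manifest_lines(sample_lines):
    print(f"line {lineno}: {problem}")
```

To check a real file, pass an open file handle, for example `validate_manifest_lines(open("train.jsonl", encoding="utf-8"))`, where `train.jsonl` stands for your own manifest path.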
## Acknowledgements
This codebase builds upon open-source components from [SAC](https://github.com/Soul-AILab/SAC) and the broader audio generation ecosystem.
## Citation
If you find our work useful in your research, please consider citing:
```bibtex
@misc{zheng2026xvczeroshotstreamingvoice,
      title={X-VC: Zero-shot Streaming Voice Conversion in Codec Space},
      author={Qixi Zheng and Yuxiang Zhao and Tianrui Wang and Wenxi Chen and Kele Xu and Yikang Li and Qinyuan Chen and Xipeng Qiu and Kai Yu and Xie Chen},
      year={2026},
      eprint={2604.12456},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2604.12456},
}
```
## License
This project is licensed under the MIT License.