Commit b645b2d · 0 parent(s)
Duplicate from Soul-AILab/SoulX-Singer
Co-authored-by: Xinsheng Wang <Xinsheng-Wang@users.noreply.huggingface.co>
- .gitattributes +38 -0
- README.md +106 -0
- assets/soul_wechat01.jpg +3 -0
- assets/soulx-logo.png +3 -0
- config.yaml +37 -0
- model.pt +3 -0
.gitattributes
ADDED
@@ -0,0 +1,38 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+assets/logo_fixed.png filter=lfs diff=lfs merge=lfs -text
+assets/soul_wechat01.jpg filter=lfs diff=lfs merge=lfs -text
+assets/soulx-logo.png filter=lfs diff=lfs merge=lfs -text
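The `.gitattributes` rules above route matching files through Git LFS. As a rough illustration of which files in this commit they catch, here is a minimal sketch that approximates the matching with Python's `fnmatch`. The helper name `is_lfs_tracked` and the shortened pattern list are assumptions made for the example, and `fnmatch` does not reproduce every gitattributes rule (for instance, Git lets `**/` match zero directories), so treat the result as an approximation rather than what Git itself would report.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# A subset of the patterns added to .gitattributes in this commit.
LFS_PATTERNS = [
    "*.pt",
    "*.safetensors",
    "*.tar.*",
    "*tfevents*",
    "saved_model/**/*",
    "assets/soul_wechat01.jpg",
    "assets/soulx-logo.png",
]


def is_lfs_tracked(path: str, patterns=LFS_PATTERNS) -> bool:
    """Approximate check of whether `path` matches one of the LFS patterns.

    Hypothetical helper: it mimics, but does not fully implement,
    gitattributes pattern matching.
    """
    posix = PurePosixPath(path)
    for pattern in patterns:
        if "/" in pattern:
            # Patterns containing a slash are compared against the full path.
            if fnmatchcase(posix.as_posix(), pattern):
                return True
        elif fnmatchcase(posix.name, pattern):
            # Slash-free patterns are compared against the file name only,
            # so they apply in any directory.
            return True
    return False


for name in ["model.pt", "config.yaml", "README.md", "assets/soulx-logo.png"]:
    print(name, is_lfs_tracked(name))
```

Under this approximation, `model.pt` and the two images are LFS-tracked while `config.yaml` and `README.md` are stored as ordinary text, which agrees with the `+3 -0` pointer-sized diffs listed for the binary files.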
README.md
ADDED
@@ -0,0 +1,106 @@
+---
+
+language:
+- en
+- zh
+library_name: huggingface_hub
+license: apache-2.0
+pipeline_tag: text-to-speech
+tags:
+- text-to-audio
+- music
+- singing-voice-synthesis
+- svs
+- zero-shot
+
+---
+
+<div align="center">
+<p><b><em>Towards High-Quality Zero-Shot Singing Voice Synthesis</em></b>
+</p>
+<p>
+<img src="assets/soulx-logo.png" alt="SoulX-Singer_Logo" style="height: 80px;">
+</p>
+<p>
+</p>
+<a href="https://soul-ailab.github.io/soulx-singer/"><img src="https://img.shields.io/badge/Demo-Page-lightgrey" alt="version"></a>
+<a href="https://github.com/Soul-AILab/SoulX-Singer"><img src='https://img.shields.io/badge/Github-Page-green' alt="Github"></a>
+<a href="https://arxiv.org/abs/2602.07803"><img src="https://img.shields.io/badge/arXiv-2602.07803-b31b1b" alt="arXiv"></a>
+<a href="https://github.com/Soul-AILab/SoulX-Singer/blob/main/assets/technical-report.pdf"><img src='https://img.shields.io/badge/Report-Github?label=Technical&color=red' alt="technical report"></a>
+<a href="https://github.com/Soul-AILab/SoulX-Singer"><img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" alt="Apache-2.0"></a>
+</div>
+
+**SoulX-Singer** is a high-fidelity, zero-shot singing voice synthesis model that enables users to generate realistic singing voices for unseen singers. It supports melody-conditioned (F0 contour) and score-conditioned (MIDI notes) control for precise pitch, rhythm, and expression.
+
+For more details, please refer to the paper: [SoulX-Singer: Towards High-Quality Zero-Shot Singing Voice Synthesis](https://arxiv.org/abs/2602.07803).
+
+## Sample Usage
+
+### 1. Set Up Environment
+
+```bash
+git clone https://github.com/Soul-AILab/SoulX-Singer.git
+cd SoulX-Singer
+conda create -n soulxsinger -y python=3.10
+conda activate soulxsinger
+pip install -r requirements.txt
+```
+
+### 2. Download Pretrained Models
+
+```bash
+pip install -U huggingface_hub
+
+# Download the SoulX-Singer SVS model
+hf download Soul-AILab/SoulX-Singer --local-dir pretrained_models/SoulX-Singer
+
+# Download models required for preprocessing
+hf download Soul-AILab/SoulX-Singer-Preprocess --local-dir pretrained_models/SoulX-Singer-Preprocess
+```
+
+### 3. Run Inference
+
+```bash
+bash example/infer.sh
+```
+
+## License
+
+We use the Apache 2.0 license. Researchers and developers are free to use the code and model weights of SoulX-Singer. See [LICENSE](LICENSE) for details.
+
+
+## Usage Disclaimer
+This project provides a singing voice synthesis model capable of zero-shot voice cloning. It is intended for academic research, educational purposes, and legitimate applications such as personalized vocal synthesis and assistive technologies.
+
+Please note:
+We advocate for the responsible development and use of AI and encourage the community to uphold safety and ethical principles in AI research and applications. If you have any concerns regarding ethics or misuse, please contact us.
+
+
+## Citation
+
+```bibtex
+@misc{soulxsinger,
+  title={SoulX-Singer: Towards High-Quality Zero-Shot Singing Voice Synthesis},
+  author={Jiale Qian and Hao Meng and Tian Zheng and Pengcheng Zhu and Haopeng Lin and Yuhang Dai and Hanke Xie and Wenxiao Cao and Ruixuan Shang and Jun Wu and Hongmei Liu and Hanlin Wen and Jian Zhao and Zhonglin Jiang and Yong Chen and Shunshun Yin and Ming Tao and Jianguo Wei and Lei Xie and Xinsheng Wang},
+  year={2026},
+  eprint={2602.07803},
+  archivePrefix={arXiv},
+  primaryClass={eess.AS},
+  url={https://arxiv.org/abs/2602.07803},
+}
+```
+
+## Contact Us
+If you would like to get in touch about our work, feel free to email qianjiale@soulapp.cn, menghao@soulapp.cn, or wangxinsheng@soulapp.cn.
+
+You’re welcome to join our WeChat or Soul APP group for technical discussions and updates.
+<p align="center">
+
+
+<br>
+<span style="display: inline-block; margin-right: 10px;">
+<img src="assets/soul_wechat01.jpg" width="500" alt="WeChat Group QR Code"/>
+</span>
+</p>
+
+
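The two `hf download` commands in the README can also be issued from Python. The following is a minimal sketch using `huggingface_hub.snapshot_download`; the repository IDs and target folders are copied from the README, while the `MODEL_REPOS` mapping and the `download_models` helper are names invented for this example.

```python
from pathlib import Path

# Repository IDs and target folders copied from the README's `hf download` commands.
MODEL_REPOS = {
    "Soul-AILab/SoulX-Singer": "pretrained_models/SoulX-Singer",
    "Soul-AILab/SoulX-Singer-Preprocess": "pretrained_models/SoulX-Singer-Preprocess",
}


def download_models(root: str = ".") -> list[Path]:
    """Download every repository in MODEL_REPOS below `root` (hypothetical helper)."""
    # Imported lazily so this module can be imported without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    targets = []
    for repo_id, local_dir in MODEL_REPOS.items():
        target = Path(root) / local_dir
        # Fetches all files of the repository into the target folder.
        snapshot_download(repo_id=repo_id, local_dir=str(target))
        targets.append(target)
    return targets
```

Calling `download_models()` from the root of the cloned `SoulX-Singer` checkout should produce the same `pretrained_models/` layout as the shell commands. Note that `model.pt` alone is about 2.8 GB, so the call is left out of the snippet on purpose.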
assets/soul_wechat01.jpg
ADDED
Binary file stored with Git LFS.
assets/soulx-logo.png
ADDED
Binary file stored with Git LFS.
config.yaml
ADDED
@@ -0,0 +1,37 @@
+infer:
+  n_steps: 32
+  cfg: 3
+
+audio:
+  hop_size: 480
+  sample_rate: 24000
+  max_length: 36000
+  n_fft: 1920
+  num_mels: 128
+  win_size: 1920
+  fmin: 0
+  fmax: 12000
+  mel_var: 8.14
+  mel_mean: -4.92
+
+model:
+  encoder:
+    vocab_size: 3000
+    text_dim: 512
+    pitch_dim: 512
+    type_dim: 512
+    f0_bin: 361
+    f0_dim: 512
+    num_layers: 4
+
+  flow_matching:
+    mel_dim: 128
+    hidden_size: 1024
+    num_layers: 22
+    num_heads: 16
+    cfg_drop_prob: 0.2
+    use_embedding: False
+    cond_codebook_size: 512
+    cond_scale_factor: 1
+    sigma: 1e-5
+    time_scheduler: cos
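The `audio` values in `config.yaml` fix the time resolution of the mel-spectrogram the model works on. The sketch below derives the frame rate and the hop and window durations from those values. The numbers are copied from the config; the helper names are made up for illustration, and the frame count ignores padding, which depends on how the repository computes its STFT.

```python
# Values copied from the `audio` section of config.yaml.
AUDIO = {
    "sample_rate": 24000,
    "hop_size": 480,
    "n_fft": 1920,
    "win_size": 1920,
    "fmin": 0,
    "fmax": 12000,
    "num_mels": 128,
}


def frame_rate_hz(cfg: dict) -> float:
    """Mel frames per second: one frame every `hop_size` samples."""
    return cfg["sample_rate"] / cfg["hop_size"]


def samples_to_ms(n_samples: int, cfg: dict) -> float:
    """Duration in milliseconds of `n_samples` at the configured sample rate."""
    return 1000.0 * n_samples / cfg["sample_rate"]


def approx_num_frames(duration_s: float, cfg: dict) -> int:
    """Approximate frame count for a clip, ignoring any STFT padding."""
    return int(round(duration_s * cfg["sample_rate"])) // cfg["hop_size"]


print(frame_rate_hz(AUDIO))                      # 50.0 frames per second
print(samples_to_ms(AUDIO["hop_size"], AUDIO))   # 20.0 ms hop
print(samples_to_ms(AUDIO["win_size"], AUDIO))   # 80.0 ms analysis window
print(approx_num_frames(10.0, AUDIO))            # 500 frames for 10 s of audio
```

With these settings `fmax` equals the Nyquist frequency (half of 24000 Hz), so the 128 mel bands span the full representable spectrum.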
model.pt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:447eaf41f91a6b6659d55e9ec3c9b809221724fb8592aebaec35a23751a5b500
+size 2818092278
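The three lines added for `model.pt` are a Git LFS pointer: the repository stores only the spec version, the SHA-256 of the real file, and its size in bytes. After downloading the weights, the pointer can be used to check the file. The following is a minimal sketch; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and the pointer text is copied from the diff above.

```python
import hashlib
from pathlib import Path

# Pointer contents copied from the model.pt diff in this commit.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:447eaf41f91a6b6659d55e9ec3c9b809221724fb8592aebaec35a23751a5b500
size 2818092278
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split a pointer file into its version, hash algorithm, digest and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` has the size and digest in `pointer`."""
    path = Path(path)
    # Cheap check first: a size mismatch means there is no need to hash.
    if path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as handle:
        # Hash in chunks so a multi-gigabyte file is never loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER)
print(pointer["algo"], pointer["size"])
```

For example, `matches_pointer("pretrained_models/SoulX-Singer/model.pt", pointer)` would confirm that a local download is complete, assuming the file was saved at the path used in the README.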