---
license: apache-2.0
base_model:
- HKUSTAudio/Llasa-3B
pipeline_tag: text-to-speech
---
<div align="center">
    <h1>
    VoiceSculptor
    </h1>
    <p>
    <b><em>VoiceSculptor: Your Voice, Designed By You</em></b>
    </p>
    <p>
    <img src="assets/logo.png" style="width: 400px; height: 400px;">
    </p>
    <a href="https://hujingbin1.github.io/VoiceSculptor-Demo"><img src="https://img.shields.io/badge/Demo-Page-lightgrey" alt="version"></a>
    <a href="https://huggingface.co/ASLP-lab/VoiceSculptor-VD"><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue' alt="HF-model"></a>
    <a href="https://github.com/ASLP-lab/VoiceSculptor"><img src='https://img.shields.io/badge/Report-Github?label=Technical&color=red' alt="technical report"></a>
    <a href="https://huggingface.co/ASLP-lab/VoiceSculptor-VD"><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Demo-blue' alt="HF-demo"></a>
    <a href="https://github.com/ASLP-lab/VoiceSculptor"><img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" alt="Apache-2.0"></a>
</div>

## 📊 Instruct TTS Eval

#### Instruct TTS Eval (ZH)

| Model | APS (%) | DSD (%) | RP (%) | AVG (%) |
|------|---------|---------|--------|---------|
| Gemini 2.5-Flash* | 88.2 | 90.9 | 77.3 | 85.4 |
| Gemini 2.5-Pro* | 89.0 | 90.1 | 75.5 | 84.8 |
| GPT-4o-Mini-TTS* | 54.9 | 52.3 | 46.0 | 51.1 |
| ElevenLabs* | 42.8 | 50.9 | 59.1 | 50.9 |
| VoxInstruct | 47.5 | 52.3 | 42.6 | 47.5 |
| MiMo-Audio-7B-Instruct | 70.1 | 66.1 | 57.1 | 64.5 |
| **VoiceSculptor** | **75.7** | **64.7** | **61.5** | **67.6** |

> **Note**
>
> - Models marked with `*` are commercial models.  
> - **InstructTTSEval**: Huang, K., Tu, Q., Fan, L., Yang, C., Zhang, D., Li, S., Fei, Z., Cheng, Q., & Qiu, X. (2025).  
>   *InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems.*  
>   arXiv preprint arXiv:2506.16381.  
>   [arXiv](https://arxiv.org/abs/2506.16381)



## 🔥 News


- **[2026-1-2]** We have opened the repository and uploaded the voice design models: [VoiceSculptor](https://huggingface.co/ASLP-lab/VoiceSculptor-VD)

## 🚀 Getting Started

### 1. Environment Setup

Follow the steps below to clone the repository and install the required environment.

```bash
# Clone the repository and enter the directory
git clone https://github.com/ASLP-lab/VoiceSculptor.git
cd VoiceSculptor

# Create and activate a Conda environment
conda create -n VoiceSculptor python=3.10 -y
conda activate VoiceSculptor

# Install dependencies
pip install -r requirements.txt
```

### 2. Download Pre-trained Models

```bash
git lfs install
git clone https://huggingface.co/ASLP-lab/VoiceSculptor-VD
```
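
If `git lfs` was not set up correctly before cloning, large files can be left as small text pointer stubs instead of real weights. The sketch below shows one way to spot that. It assumes the clone landed in `./VoiceSculptor-VD` (the default directory name for the command above), and `find_lfs_pointers` is a hypothetical helper name, not part of this repository.

```bash
# Hypothetical helper: list files under a directory that are still Git LFS pointer stubs.
# A pointer stub is a small text file (under 1024 bytes) whose first line begins with
# "version https://git-lfs.github.com/spec/v1".
find_lfs_pointers() {
    # -size -1024c keeps only files smaller than 1024 bytes; .git internals are skipped.
    # grep -l prints the names of matching files; "|| true" swallows the
    # non-zero exit status grep returns when nothing matches.
    find "$1" -type f -size -1024c ! -path '*/.git/*' \
        -exec grep -l '^version https://git-lfs.github.com/spec/v1' {} + || true
}

# Usage (prints nothing when every LFS file was fetched):
# find_lfs_pointers VoiceSculptor-VD
```

If any paths are printed, running `git lfs pull` inside the cloned directory should fetch the missing files.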

### 3. Inference

For detailed instructions on how to design high-quality voice prompts,  
please refer to [Voice Design Guide](docs/voice_design.md) or [Voice Design Guide EN](docs/voice_design_en.md).

```bash
python infer.py
```

<!-- ### 4. WebUI

```bash
python gradio.py
```


### 5. RAG

```bash
python build_rag.py
``` -->


## 📋 TODO
- [x] 🌐 **Demo website**
- [x] 🔓 **Release inference code**
- [x] 🤗 **Release HuggingFace model**
- [ ] 🤗 **HuggingFace Space**
- [ ] 📝 **Release Technical Report**
- [ ] 🔓 **Release gradio code**
- [ ] 🔓 **Release RAG code**
- [ ] 🔓 **Support vLLM**
- [ ] 🔓 **Release training code**

## Citation

```bibtex
@misc{VoiceSculptor,
      title={VoiceSculptor: Your Voice, Designed By You},
      author={Jingbin Hu and Huakang Chen and Linhan Ma and Dake Guo and Qirui Zhan and Wenhao Li and Haoyu Zhang and Kangxiang Xia and Ziyu Zhang and Wenjie Tian and Chengyou Wang and Jinrui Liang and Shuhan Guo and Zihang Yang and Bengu Wu and Binbin Zhang and Pengcheng Zhu and Pengyuan Xie and Chuan Xie and Qiang Zhang and Jie Liu and Lei Xie},
      year={2026},
      url={https://github.com/ASLP-lab/VoiceSculptor},
}
@misc{ye2025llasascalingtraintimeinferencetime,
      title={Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis},
      author={Zhen Ye and Xinfa Zhu and Chi-Min Chan and Xinsheng Wang and Xu Tan and Jiahe Lei and Yi Peng and Haohe Liu and Yizhu Jin and Zheqi Dai and Hongzhan Lin and Jianyi Chen and Xingjian Du and Liumeng Xue and Yunlin Chen and Zhifei Li and Lei Xie and Qiuqiang Kong and Yike Guo and Wei Xue},
      year={2025},
      eprint={2502.04128},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2502.04128},
}
```

## License

This project is released under the Apache 2.0 license. Researchers and developers are free to use the code and model weights of VoiceSculptor. See [LICENSE](LICENSE.txt) for details.

## Acknowledgement
- This repo benefits from [LLaSA](https://github.com/zhenye234/LLaSA_training)
- This repo benefits from [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)


## Usage Disclaimer

This project provides a speech synthesis model for voice design, intended for academic research, educational purposes, and legitimate applications such as personalized speech synthesis, assistive technologies, and linguistic research.

Please note:

- Do not use this model for unauthorized voice cloning, impersonation, fraud, scams, deepfakes, or any illegal or malicious activities.
- Ensure compliance with local laws and regulations when using this model, and uphold ethical standards.
- The developers assume no liability for any misuse of this model.

Additional notice on generated voices:

- Because this is a generative model, the voices it produces are synthetic outputs inferred by the model, not recordings of real human voices.
- The generated voice characteristics do not represent or reproduce any specific real individual, and are not derived from or intended to imitate identifiable persons.

We advocate for the responsible development and use of AI and encourage the community to uphold safety and ethical principles in AI research and applications. 

## Contact Us
If you would like to leave a message about our work, feel free to email jingbin.hu@mail.nwpu.edu.cn or lxie@nwpu.edu.cn.

You're welcome to join our WeChat group for technical discussions and updates.
<p align="center">
  <!-- <em>Due to group limits, if you can't scan the QR code, please add my WeChat for group access  -->
      <!-- : <strong>Tiamo James</strong></em> -->
  <br>
  <span style="display: inline-block; margin-right: 10px;">
    <img src="assets/wechat.png" width="300" alt="WeChat Group QR Code"/>
  </span>
  <!-- <span style="display: inline-block;">
    <img src="assets/wechat_tiamo.jpg" width="300" alt="WeChat QR Code"/>
  </span> -->
</p>