---
license: apache-2.0
pipeline_tag: text-to-speech
library_name: transformers
---

## Step-Audio-EditX

✨ [Demo Page](https://stepaudiollm.github.io/step-audio-editx/) | 🔗 [GitHub](https://github.com/stepfun-ai/Step-Audio-EditX) | 📄 [Paper](https://arxiv.org/abs/2511.03601)

Check out our open-source repository at https://github.com/stepfun-ai/Step-Audio-EditX for more details!

We are open-sourcing **Step-Audio-EditX**, a powerful LLM-based audio model specialized in expressive and **iterative audio editing**. It excels at editing **emotion**, **speaking style**, and **paralinguistics**, and also offers robust **zero-shot text-to-speech (TTS)** capabilities.

## Features

- **Zero-Shot TTS**
  - Excellent zero-shot TTS voice cloning for Mandarin, English, Sichuanese, and Cantonese.
  - To use a dialect, simply add a **[Sichuanese]** or **[Cantonese]** tag before your text.

- **Emotion and Speaking Style Editing**
  - Remarkably effective iterative control over emotions and styles, with **dozens** of editing options.
  - Emotion editing: [ *Angry*, *Happy*, *Sad*, *Excited*, *Fearful*, *Surprised*, *Disgusted*, etc. ]
  - Speaking style editing: [ *Act_coy*, *Older*, *Child*, *Whisper*, *Serious*, *Generous*, *Exaggerated*, etc. ]
  - Support for more emotions and speaking styles is on the way. **Get ready!** 🎉

- **Paralinguistic Editing**
  - Precise control over 10 types of paralinguistic features for more natural, human-like, and expressive synthetic audio.
  - Supported tags: [ *Breathing*, *Laughter*, *Surprise-oh*, *Confirmation-en*, *Uhm*, *Surprise-ah*, *Surprise-wa*, *Sigh*, *Question-ei*, *Dissatisfaction-hnn* ]

For more examples, see the [demo page](https://stepaudiollm.github.io/step-audio-editx/).
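
The bracket-tag convention above can be sketched with a small helper. This is purely illustrative: `with_tag` is a hypothetical name, not part of the Step-Audio-EditX codebase.

```python
# Illustrative only: `with_tag` is a hypothetical helper, not part of the
# Step-Audio-EditX repo. It prepends a dialect or paralinguistic tag,
# e.g. [Cantonese] or [Laughter], to the text to be synthesized.
def with_tag(text: str, tag: str) -> str:
    return f"[{tag}]{text}"

print(with_tag("How are you today?", "Cantonese"))  # [Cantonese]How are you today?
```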

## Model Usage

### 📋 Requirements

The following table shows the requirements for running the Step-Audio-EditX model (batch size = 1):

| Model | Setting<br/>(sample frequency) | GPU Minimum Memory |
|------------------|--------------------------------|--------------------|
| Step-Audio-EditX | 41.6Hz | 8GB |

* An NVIDIA GPU with CUDA support is required.
* The model has been tested on four A800 80GB GPUs.
* **Recommended**: 4×A800/H800 GPUs with 80GB memory each, for better generation quality.
* Tested operating system: Linux

### 🔧 Dependencies and Installation

- Python >= 3.10.0 (we recommend [Anaconda](https://www.anaconda.com/download/#linux) or [Miniconda](https://docs.conda.io/en/latest/miniconda.html))
- [PyTorch >= 2.3-cu121](https://pytorch.org/)
- [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads)
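
Before installing, it can help to sanity-check the version requirements above. The sketch below is stdlib-only (it deliberately avoids importing torch, so it runs anywhere); `python_ok` and `version_at_least` are hypothetical helpers, not part of the repo.

```python
import sys

# Hypothetical helpers (not part of the repo) for checking the
# "Python >= 3.10" and "PyTorch >= 2.3" requirements listed above.
def python_ok(minimum=(3, 10)) -> bool:
    return sys.version_info[:2] >= minimum

def version_at_least(version: str, minimum: str) -> bool:
    """Numeric comparison of dotted version strings, e.g. '2.3.1' >= '2.3'."""
    # Drop any local suffix like "+cu121" before parsing.
    parse = lambda v: tuple(int(x) for x in v.split("+")[0].split("."))
    a, b = parse(version), parse(minimum)
    n = max(len(a), len(b))  # pad the shorter tuple with zeros
    return a + (0,) * (n - len(a)) >= b + (0,) * (n - len(b))

print(version_at_least("2.3.1+cu121", "2.3"))  # True
print(version_at_least("2.1.0", "2.3"))        # False
```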

```bash
git clone https://github.com/stepfun-ai/Step-Audio-EditX.git
conda create -n stepaudioedit python=3.10
conda activate stepaudioedit

cd Step-Audio-EditX
pip install -r requirements.txt

git lfs install
git clone https://huggingface.co/stepfun-ai/Step-Audio-Tokenizer
git clone https://huggingface.co/stepfun-ai/Step-Audio-EditX
```

After downloading the models, `where_you_download_dir` should have the following structure:

```
where_you_download_dir
├── Step-Audio-Tokenizer
└── Step-Audio-EditX
```
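
A quick way to verify this layout is to check that both model directories exist. `missing_models` below is a hypothetical helper, not part of the repo.

```python
from pathlib import Path

EXPECTED = ("Step-Audio-Tokenizer", "Step-Audio-EditX")

# Hypothetical helper (not part of the repo): returns the names of any
# expected model directories missing under the download directory.
def missing_models(download_dir: str) -> list[str]:
    root = Path(download_dir)
    return [name for name in EXPECTED if not (root / name).is_dir()]

missing = missing_models("where_you_download_dir")
if missing:
    print("Missing model directories:", ", ".join(missing))
```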

#### Run with Docker

You can set up the environment required for running Step-Audio-EditX using the provided Dockerfile.

```bash
# Build the image
docker build . -t step-audio-editx

# Run the container
docker run --rm --gpus all \
    -v /your/code/path:/app \
    -v /your/model/path:/model \
    -p 7860:7860 \
    step-audio-editx
```

#### Launch Web Demo

Start a local server for online inference. This assumes you have 4 GPUs available and have already downloaded all the models.

```bash
# Step-Audio-EditX demo
python app.py --model-path where_you_download_dir --model-source local
```
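
If you launch the demo from a script, the command above can be assembled programmatically. `build_demo_cmd` is a hypothetical convenience, not part of the repo; it simply mirrors the flags shown above.

```python
# Hypothetical helper (not part of the repo): builds the launch command
# shown above as an argv list, e.g. for subprocess.run(cmd).
def build_demo_cmd(model_dir: str, source: str = "local") -> list[str]:
    return ["python", "app.py", "--model-path", model_dir, "--model-source", source]

print(build_demo_cmd("where_you_download_dir"))
```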

## Citation

```bibtex
@misc{yan2025stepaudioeditxtechnicalreport,
      title={Step-Audio-EditX Technical Report},
      author={Chao Yan and Boyong Wu and Peng Yang and Pengfei Tan and Guoqiang Hu and Yuxin Zhang and Xiangyu Zhang and Fei Tian and Xuerui Yang and Xiangyu Zhang and Daxin Jiang and Gang Yu},
      year={2025},
      eprint={2511.03601},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.03601},
}
```