yanchaomars committed (verified) · commit 53ac8d2 · 1 parent: 056c725

update model card
---
license: apache-2.0
pipeline_tag: text-to-speech
library_name: transformers
---
## Step-Audio-EditX

✨ [Demo Page](https://stepaudiollm.github.io/step-audio-editx/)
| 🌟 [GitHub](https://github.com/stepfun-ai/Step-Audio-EditX)
| 📑 [Paper](https://arxiv.org/abs/2511.03601)

See our open-source repository at https://github.com/stepfun-ai/Step-Audio-EditX for more details!

We are open-sourcing **Step-Audio-EditX**, a powerful LLM-based audio model specialized in expressive and **iterative audio editing**.
It excels at **editing emotion**, **speaking style**, and **paralinguistics**, and also features robust **zero-shot text-to-speech (TTS)** capabilities.

## Features
- **Zero-Shot TTS**
  - Excellent zero-shot TTS cloning for Mandarin, English, Sichuanese, and Cantonese.
  - To use a dialect, just add a **[Sichuanese]** or **[Cantonese]** tag before your text.

- **Emotion and Speaking Style Editing**
  - Remarkably effective iterative control over emotions and styles, supporting **dozens** of options for editing.
  - Emotion Editing: [ *Angry*, *Happy*, *Sad*, *Excited*, *Fearful*, *Surprised*, *Disgusted*, etc. ]
  - Speaking Style Editing: [ *Act_coy*, *Older*, *Child*, *Whisper*, *Serious*, *Generous*, *Exaggerated*, etc. ]
  - Support for more emotions and speaking styles is on the way. **Get ready!** 🚀

- **Paralinguistic Editing**
  - Precise control over 10 types of paralinguistic features for more natural, human-like, and expressive synthetic audio.
  - Supported tags:
    - [ *Breathing*, *Laughter*, *Surprise-oh*, *Confirmation-en*, *Uhm*, *Surprise-ah*, *Surprise-wa*, *Sigh*, *Question-ei*, *Dissatisfaction-hnn* ]

For more examples, see the [demo page](https://stepaudiollm.github.io/step-audio-editx/).

## Model Usage
### 📜 Requirements
The following table shows the requirements for running the Step-Audio-EditX model (batch size = 1):

| Model | Setting<br/>(sample frequency) | GPU Minimum Memory |
|------------------|--------------------------------|--------------------|
| Step-Audio-EditX | 41.6Hz | 8GB |

* An NVIDIA GPU with CUDA support is required.
* The model has been tested on four A800 80GB GPUs.
* **Recommended**: 4×A800/H800 GPUs with 80GB memory each for better generation quality.
* Tested operating system: Linux

### 🔧 Dependencies and Installation
- Python >= 3.10.0 (we recommend [Anaconda](https://www.anaconda.com/download/#linux) or [Miniconda](https://docs.conda.io/en/latest/miniconda.html))
- [PyTorch >= 2.3-cu121](https://pytorch.org/)
- [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads)

```bash
# Clone the repository and create the environment
git clone https://github.com/stepfun-ai/Step-Audio-EditX.git
conda create -n stepaudioedit python=3.10
conda activate stepaudioedit

cd Step-Audio-EditX
pip install -r requirements.txt

# Download the model weights
git lfs install
git clone https://huggingface.co/stepfun-ai/Step-Audio-Tokenizer
git clone https://huggingface.co/stepfun-ai/Step-Audio-EditX
```

After downloading the models, `where_you_download_dir` should have the following structure:
```
where_you_download_dir
├── Step-Audio-Tokenizer
└── Step-Audio-EditX
```
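
A quick sanity check for that layout can be sketched as follows (this helper is a hypothetical convenience, not part of the repository; replace `where_you_download_dir` with your actual download path):

```python
# Hypothetical sanity check, not part of the Step-Audio-EditX repo:
# verify both expected model directories exist under the download directory.
from pathlib import Path

def check_model_dirs(download_dir: str) -> list[str]:
    """Return the names of expected model directories that are missing."""
    expected = ["Step-Audio-Tokenizer", "Step-Audio-EditX"]
    root = Path(download_dir)
    return [name for name in expected if not (root / name).is_dir()]

missing = check_model_dirs("where_you_download_dir")
if missing:
    print(f"Missing model directories: {missing}")
```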

#### Run with Docker

You can set up the environment required for running Step-Audio-EditX using the provided Dockerfile.

```bash
# Build the image
docker build . -t step-audio-editx

# Run the container
docker run --rm --gpus all \
    -v /your/code/path:/app \
    -v /your/model/path:/model \
    -p 7860:7860 \
    step-audio-editx
```

#### Launch Web Demo
Start a local server for online inference.
This assumes you have 4 GPUs available and have already downloaded all the models.

```bash
# Step-Audio-EditX demo
python app.py --model-path where_you_download_dir --model-source local
```

## Citation

```bibtex
@misc{yan2025stepaudioeditxtechnicalreport,
      title={Step-Audio-EditX Technical Report},
      author={Chao Yan and Boyong Wu and Peng Yang and Pengfei Tan and Guoqiang Hu and Yuxin Zhang and Xiangyu Zhang and Fei Tian and Xuerui Yang and Xiangyu Zhang and Daxin Jiang and Gang Yu},
      year={2025},
      eprint={2511.03601},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.03601},
}
```