---
license: apache-2.0
pipeline_tag: audio-text-to-text
library_name: transformers
tags:
- audio-reasoning
- chain-of-thought
- multi-modal
- step-audio-r1
---
## Step-Audio-R1-NVFP4A16 (Quantized)

This is a **quantized version** of Step-Audio-R1 using NVFP4A16 quantization via [LLM Compressor](https://github.com/vllm-project/llm-compressor).

### Quantization Details

- **Scheme**: NVFP4A16 (FP4 weights with FP16 activations)
- **Target layers**: all Linear layers (excluding `encoder`, `adapter`, and `lm_head`)
- **Group size**: 16
- **Method**: post-training quantization (PTQ)

### Quantization Code
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "stepfun-ai/Step-Audio-R1"

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Configure the quantization algorithm and scheme:
# quantize weights to FP4 with a group size of 16 via PTQ
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4A16",
    ignore=["lm_head", "re:encoder.*", "re:adapter.*"],
)

# Apply one-shot quantization
oneshot(model=model, recipe=recipe)

# Save to disk in compressed-tensors format
SAVE_DIR = "Step-Audio-R1-NVFP4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
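For intuition, NVFP4A16 stores each weight as a 4-bit float (E2M1, representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) with one shared scale per group of 16 weights. The sketch below illustrates that arithmetic in pure Python. It is **not** the LLM Compressor kernel, and real NVFP4 also constrains how the group scales themselves are stored, which this toy version ignores:

```python
# Illustrative sketch of FP4 (E2M1) group quantization -- NOT the actual
# compressed-tensors implementation; scales here are plain Python floats.

# Representable magnitudes of an E2M1 (FP4) value, with both signs
FP4_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_CODEBOOK = sorted({s * v for v in FP4_LEVELS for s in (1.0, -1.0)})

def quantize_group(weights, group_size=16):
    """Quantize one group of weights to FP4 with a shared per-group scale."""
    assert len(weights) == group_size
    # Scale so the largest magnitude maps to the largest FP4 value (6.0)
    scale = max(abs(w) for w in weights) / 6.0 or 1.0
    dequantized = []
    for w in weights:
        # Round w/scale to the nearest representable FP4 value
        q = min(FP4_CODEBOOK, key=lambda c: abs(w / scale - c))
        dequantized.append(q * scale)
    return dequantized, scale

group = [0.01 * i for i in range(-8, 8)]   # toy weight group of size 16
deq, scale = quantize_group(group)
max_err = max(abs(a - b) for a, b in zip(group, deq))
print(f"scale={scale:.4f}  max_abs_error={max_err:.4f}")
```

Because each group carries its own scale, the worst-case rounding error is bounded by the scale times half the widest gap in the FP4 codebook, which is why small group sizes like 16 keep PTQ accuracy loss modest.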


## Step-Audio-R1

✨ [Demo Page](https://stepaudiollm.github.io/step-audio-r1/) | 🎮 [Playground](https://huggingface.co/spaces/stepfun-ai/Step-Audio-R1) | 🌟 [GitHub](https://github.com/stepfun-ai/Step-Audio-R1) | 📑 [Paper](https://arxiv.org/abs/2511.15848)

Step-Audio-R1 is the **first audio language model to successfully unlock Chain-of-Thought (CoT) reasoning**. It solves the "inverted scaling" problem that plagues existing models, where performance degrades as reasoning grows longer, and it is the first model to demonstrate that for audio, as for text and vision, allocating more compute at test time predictably improves performance.

We traced the root cause of this anomaly: owing to a modality mismatch, models were engaging in **textual surrogate reasoning** (analyzing transcripts rather than audio). To solve this, we introduce **Modality-Grounded Reasoning Distillation (MGRD)**, an iterative training framework that shifts the model's reasoning from textual abstractions to acoustic properties.

This approach yields **Step-Audio-R1**, which:
- Is the **first audio reasoning model** to successfully benefit from test-time compute scaling.
- Surpasses **Gemini 2.5 Pro** and is comparable to **Gemini 3** across major audio reasoning tasks.
- Turns extended deliberation from a liability into a **powerful asset** for audio intelligence.

## Features

- **Chain-of-Thought (CoT) Reasoning**
  - First audio language model to successfully unlock Chain-of-Thought reasoning capabilities.
  - Generates audio-relevant reasoning chains that are genuinely grounded in acoustic features.

- **Modality-Grounded Reasoning Distillation (MGRD)**
  - Iterative training framework that shifts reasoning from textual abstractions to acoustic properties.
  - Solves the modality mismatch that caused textual surrogate reasoning in previous models.

- **Superior Performance**
  - Surpasses **Gemini 2.5 Pro** across comprehensive audio understanding and reasoning benchmarks.
  - Comparable to **Gemini 3** across major audio reasoning tasks.
  - Surpasses **Qwen3** in textual reasoning.
  - Covers speech, environmental sounds, and music.

For more examples, see the [demo page](https://stepaudiollm.github.io/step-audio-r1/).

## Model Usage

### 📜 Requirements
- **GPU**: NVIDIA GPUs with CUDA support (tested on 4×L40S/H100/H800/H20).
- **Operating System**: Linux.
- **Python**: >= 3.10.

### ⬇️ Download Model
First, download the Step-Audio-R1 model weights.

**Method A · Git LFS**
```bash
git lfs install
git clone https://huggingface.co/stepfun-ai/Step-Audio-R1
```

**Method B · Hugging Face CLI**
```bash
hf download stepfun-ai/Step-Audio-R1 --local-dir ./Step-Audio-R1
```

### 🚀 Deployment and Execution
We provide two ways to serve the model: Docker (recommended) or compiling the customized vLLM backend.

#### 🐳 Method 1 · Run with Docker (Recommended)

A customized vLLM image is required.

1. **Pull the image**:
```bash
docker pull stepfun2025/vllm:step-audio-2-v20250909
```
2. **Start the service**:
Assuming the model has been downloaded to the `Step-Audio-R1` folder in the current directory:

```bash
docker run --rm -ti --gpus all \
  -v $(pwd)/Step-Audio-R1:/Step-Audio-R1 \
  -p 9999:9999 \
  stepfun2025/vllm:step-audio-2-v20250909 \
  -- vllm serve /Step-Audio-R1 \
  --served-model-name Step-Audio-R1 \
  --port 9999 \
  --max-model-len 16384 \
  --max-num-seqs 32 \
  --tensor-parallel-size 4 \
  --chat-template '{%- macro render_content(content) -%}{%- if content is string -%}{{- content.replace("<audio_patch>\n", "<audio_patch>") -}}{%- elif content is mapping -%}{{- content['"'"'value'"'"'] if '"'"'value'"'"' in content else content['"'"'text'"'"'] -}}{%- elif content is iterable -%}{%- for item in content -%}{%- if item.type == '"'"'text'"'"' -%}{{- item['"'"'value'"'"'] if '"'"'value'"'"' in item else item['"'"'text'"'"'] -}}{%- elif item.type == '"'"'audio'"'"' -%}<audio_patch>{%- endif -%}{%- endfor -%}{%- endif -%}{%- endmacro -%}{%- if tools -%}{{- '"'"'<|BOT|>system\n'"'"' -}}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{{- '"'"'<|BOT|>tool_json_schemas\n'"'"' + tools|tojson + '"'"'<|EOT|>'"'"' -}}{%- else -%}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- '"'"'<|BOT|>system\n'"'"' + render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- for message in messages -%}{%- if message["role"] == "user" -%}{{- '"'"'<|BOT|>human\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- elif message["role"] == "assistant" -%}{{- '"'"'<|BOT|>assistant\n'"'"' + (render_content(message["content"]) if message["content"] else '"'"''"'"') -}}{%- set is_last_assistant = true -%}{%- for m in messages[loop.index:] -%}{%- if m["role"] == "assistant" -%}{%- set is_last_assistant = false -%}{%- endif -%}{%- endfor -%}{%- if not is_last_assistant -%}{{- '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- elif message["role"] == "function_output" -%}{%- else -%}{%- if not (loop.first and message["role"] == "system") -%}{{- '"'"'<|BOT|>'"'"' + message["role"] + '"'"'\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{- '"'"'<|BOT|>assistant\n<think>\n'"'"' -}}{%- endif -%}' \
  --enable-log-requests \
  --interleave-mm-strings \
  --trust-remote-code
```
After the service starts, it will listen on `localhost:9999`.
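
The server exposes an OpenAI-compatible `/v1/chat/completions` endpoint. The snippet below sketches how a request with inline base64 audio might be assembled using only the standard library; the multimodal content schema shown (`input_audio` items) is an assumption, so check the official client examples for the exact format the customized vLLM build accepts:

```python
import base64
import json

def build_request(audio_bytes: bytes, prompt: str) -> dict:
    """Assemble an OpenAI-style chat request with inline base64 audio.
    The content item schema here is an assumption -- consult the
    official Step-Audio-R1 client examples for the supported format."""
    audio_b64 = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": "Step-Audio-R1",   # must match --served-model-name
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "input_audio",
                     "input_audio": {"data": audio_b64, "format": "wav"}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": 2048,
    }

# Placeholder bytes stand in for a real WAV file read from disk
payload = build_request(b"\x00\x01fake-audio", "What emotion does the speaker convey?")
print(json.dumps(payload)[:80])
# POST this JSON to http://localhost:9999/v1/chat/completions
```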

#### 🐳 Method 2 · Run from Source (Compile vLLM)
Step-Audio-R1 requires a customized vLLM backend.

1. **Download the source code**:
```bash
git clone https://github.com/stepfun-ai/vllm.git
cd vllm
```

2. **Prepare the environment**:
```bash
python3 -m venv .venv
source .venv/bin/activate
```

3. **Install and compile**:
vLLM contains both C++ and Python code. We mainly modified the Python code, so the C++ part can use a pre-compiled build to speed up the process.

```bash
# Use pre-compiled C++ extensions (recommended)
VLLM_USE_PRECOMPILED=1 pip install -e .
```

4. **Switch branches**:
After compilation, switch to the branch that supports Step-Audio.
```bash
git checkout step-audio-2-mini
```

5. **Start the service**:
```bash
# Ensure you are in the vllm directory and the virtual environment is activated
source .venv/bin/activate

python3 -m vllm.entrypoints.openai.api_server \
  --model ../Step-Audio-R1 \
  --served-model-name Step-Audio-R1 \
  --port 9999 \
  --host 0.0.0.0 \
  --max-model-len 65536 \
  --max-num-seqs 128 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.85 \
  --trust-remote-code \
  --enable-log-requests \
  --interleave-mm-strings \
  --chat-template '{%- macro render_content(content) -%}{%- if content is string -%}{{- content.replace("<audio_patch>\n", "<audio_patch>") -}}{%- elif content is mapping -%}{{- content['"'"'value'"'"'] if '"'"'value'"'"' in content else content['"'"'text'"'"'] -}}{%- elif content is iterable -%}{%- for item in content -%}{%- if item.type == '"'"'text'"'"' -%}{{- item['"'"'value'"'"'] if '"'"'value'"'"' in item else item['"'"'text'"'"'] -}}{%- elif item.type == '"'"'audio'"'"' -%}<audio_patch>{%- endif -%}{%- endfor -%}{%- endif -%}{%- endmacro -%}{%- if tools -%}{{- '"'"'<|BOT|>system\n'"'"' -}}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{{- '"'"'<|BOT|>tool_json_schemas\n'"'"' + tools|tojson + '"'"'<|EOT|>'"'"' -}}{%- else -%}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- '"'"'<|BOT|>system\n'"'"' + render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- for message in messages -%}{%- if message["role"] == "user" -%}{{- '"'"'<|BOT|>human\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- elif message["role"] == "assistant" -%}{{- '"'"'<|BOT|>assistant\n'"'"' + (render_content(message["content"]) if message["content"] else '"'"''"'"') -}}{%- set is_last_assistant = true -%}{%- for m in messages[loop.index:] -%}{%- if m["role"] == "assistant" -%}{%- set is_last_assistant = false -%}{%- endif -%}{%- endfor -%}{%- if not is_last_assistant -%}{{- '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- elif message["role"] == "function_output" -%}{%- else -%}{%- if not (loop.first and message["role"] == "system") -%}{{- '"'"'<|BOT|>'"'"' + message["role"] + '"'"'\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{- '"'"'<|BOT|>assistant\n<think>\n'"'"' -}}{%- endif -%}'
```

After the service starts, it will listen on `localhost:9999`.
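
The long `--chat-template` argument wraps every turn in `<|BOT|>role\n…<|EOT|>` markers, maps the OpenAI `user` role to `human`, and opens the generation prompt with `<|BOT|>assistant\n<think>\n` so the model begins by reasoning. A simplified, text-only Python rendering of that turn structure (the real Jinja template also converts audio items to `<audio_patch>` and handles tool schemas):

```python
def render_prompt(messages, add_generation_prompt=True):
    """Simplified text-only sketch of the Step-Audio-R1 turn format:
    each message becomes <|BOT|>role\\ncontent<|EOT|>, and generation is
    cued by opening the assistant turn with a <think> block."""
    role_map = {"user": "human"}   # the template renames "user" to "human"
    out = []
    for m in messages:
        role = role_map.get(m["role"], m["role"])
        out.append(f"<|BOT|>{role}\n{m['content']}<|EOT|>")
    if add_generation_prompt:
        out.append("<|BOT|>assistant\n<think>\n")
    return "".join(out)

prompt = render_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Describe the tone of this clip."},
])
print(prompt)
```

This is only illustrative; always pass the full Jinja template above when serving.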


### 🧪 Client Examples

Get the example code and run it:
```bash
# Clone the repository containing the example scripts
git clone https://github.com/stepfun-ai/Step-Audio-R1.git r1-scripts

# Run the example
cd r1-scripts
python examples-vllm_r1.py
```

## Citation

```bibtex
@article{tian2025step,
  title={Step-Audio-R1 Technical Report},
  author={Tian, Fei and Zhang, Xiangyu Tony and Zhang, Yuxin and Zhang, Haoyang and Li, Yuxin and Liu, Daijiao and Deng, Yayue and Wu, Donghang and Chen, Jun and Zhao, Liang and others},
  journal={arXiv preprint arXiv:2511.15848},
  year={2025}
}
```