y-ren16 committed on
Commit 76784ea · verified · 1 Parent(s): e0405e4

Update README.md

Files changed (1)
  1. README.md +7 -847
README.md CHANGED
@@ -2,855 +2,15 @@
2
  license: apache-2.0
3
  ---
4
 
5
- <div align="center">
6
- <img src="assets/logo.png" height=100>
7
- </div>
8
 
9
- <div align="center" style="line-height: 1;">
10
- <a href="https://github.com/stepfun-ai/Step-Audio2" target="_blank"><img alt="GitHub" src="https://img.shields.io/badge/GitHub-StepFun-white?logo=github&logoColor=white"/></a> &ensp;
11
- <a href="https://stepfun.com/" target="_blank"><img alt="Homepage" src="https://img.shields.io/badge/Homepage-StepFun-white?logo=StepFun&logoColor=white"/></a> &ensp;
12
- <a href="https://x.com/StepFun_ai" target="_blank"><img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-StepFun-white?logo=x&logoColor=white"/></a> &ensp;
13
- <a href="https://discord.com/invite/XHheP5Fn" target="_blank"><img alt="Discord" src="https://img.shields.io/badge/Discord-StepFun-white?logo=discord&logoColor=white"/></a>
14
- </div>
15
- <div align="center">
16
- <a href="https://huggingface.co/stepfun-ai/Step-Audio-2-mini"><img src="https://img.shields.io/static/v1?label=Step-Audio-2-mini&message=HuggingFace&color=yellow"></a> &ensp;
17
- <a href="https://huggingface.co/stepfun-ai/Step-Audio-2-mini-Base"><img src="https://img.shields.io/static/v1?label=Step-Audio-2-mini-Base&message=HuggingFace&color=yellow"></a>
18
- </div>
19
- <div align="center">
20
- <a href="https://arxiv.org/abs/2507.16632"><img src="assets/arxiv.svg"></a> &ensp;
21
- <a href="https://github.com/stepfun-ai/Step-Audio2/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/badge/License-Apache%202.0-blue?&color=blue"/></a>
22
- </div>
23
 
24
- ## Introduction
25
 
26
 
27
- Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.
28
 
29
- - **Advanced Speech and Audio Understanding**: Promising performance in ASR and audio understanding, achieved by comprehending and reasoning over semantic, paralinguistic, and non-vocal information.
30
-
31
- - **Intelligent Speech Conversation**: Achieving natural and intelligent interactions that are contextually appropriate for various conversational scenarios and paralinguistic information.
32
-
33
- - **Tool Calling and Multimodal RAG**: By leveraging tool calling and RAG to access real-world knowledge (both textual and acoustic), Step-Audio 2 can generate responses with fewer hallucinations for diverse scenarios, while also having the ability to switch timbres based on retrieved speech.
34
-
35
- - **State-of-the-Art Performance**: Achieving state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. (See [Evaluation](#evaluation) and [Technical Report](https://arxiv.org/pdf/2507.16632)).
36
-
37
- + **Open-source**: [Step-Audio 2 mini](https://huggingface.co/stepfun-ai/Step-Audio-2-mini) and [Step-Audio 2 mini Base](https://huggingface.co/stepfun-ai/Step-Audio-2-mini-Base) are released under [Apache 2.0](LICENSE) license.
38
-
39
- ## Model Download
40
- ### Huggingface
41
- | Models | 🤗 Hugging Face |
42
- |-------|-------|
43
- | Step-Audio 2 mini | [stepfun-ai/Step-Audio-2-mini](https://huggingface.co/stepfun-ai/Step-Audio-2-mini) |
44
- | Step-Audio 2 mini Base | [stepfun-ai/Step-Audio-2-mini-Base](https://huggingface.co/stepfun-ai/Step-Audio-2-mini-Base) |
45
-
46
- <!-- ### Modelscope
47
- | Models | Links |
48
- |-------|-------|
49
- | Step-Audio-2-mini | [modelscope](https://modelscope.cn/models/stepfun-ai/Step-Audio-2-mini) |
50
- | Step-Audio-2-mini-Base | [modelscope](https://modelscope.cn/models/stepfun-ai/Step-Audio-2-mini-Base) | -->
51
-
52
- ## Model Usage
53
- ### 🔧 Dependencies and Installation
54
- - Python >= 3.10
55
- - [PyTorch >= 2.3-cu121](https://pytorch.org/)
56
- - [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads)
57
-
58
- ```bash
59
- conda create -n stepaudio2 python=3.10
60
- conda activate stepaudio2
61
- pip install transformers==4.49.0 torchaudio librosa onnxruntime s3tokenizer diffusers hyperpyyaml
62
-
63
- git clone https://github.com/stepfun-ai/Step-Audio2.git
64
- cd Step-Audio2
65
- git lfs install
66
- git clone https://huggingface.co/stepfun-ai/Step-Audio-2-mini-Base
67
- ```
68
-
69
- ### 🚀 Inference Scripts
70
-
71
- ```bash
72
- python examples-base.py
73
- ```
74
-
75
- ## Online demonstration
76
-
77
- ### StepFun realtime console
78
-
79
- - Both Step-Audio 2 and Step-Audio 2 mini are available in our [StepFun realtime console](https://realtime-console.stepfun.com/) with web search tool enabled.
80
- - You will need an API key from the [StepFun Open Platform](https://platform.stepfun.com/).
81
-
82
- ### StepFun AI Assistant
83
-
84
- - Step-Audio 2 is also available in our StepFun AI Assistant mobile App with both web and audio search tools enabled.
85
- - Please scan the following QR code to download it from your app store then tap the phone icon in the top-right corner.
86
-
87
- <div align="center">
88
- <img src="./assets/qrcode.jpg" width="200" alt="QR code">
89
- </div>
90
-
91
- ## WeChat group
92
-
93
- You can scan the following QR code to join our WeChat group for communication and discussion.
94
- <div align="center">
95
- <img src="./assets/wechat_group.png" width="200" alt="QR code">
96
- </div>
97
-
98
- ## Evaluation
99
- <div align="center">
100
- <img src="assets/radar.png" alt="Architecture" width="600" />
101
- </div>
102
-
103
- ### Automatic speech recognition
104
- CER is reported for Chinese, Cantonese, and Japanese; WER is reported for Arabic and English. N/A indicates that the language is not supported.
105
-
106
- <table border="1" cellpadding="5" cellspacing="0" align="center">
107
- <thead>
108
- <tr>
109
- <th style="text-align: center;">Category</th>
110
- <th style="text-align: center;">Test set</th>
111
- <th style="text-align: center;">Doubao LLM ASR</th>
112
- <th style="text-align: center;">GPT-4o Transcribe</th>
113
- <th style="text-align: center;">Kimi-Audio</th>
114
- <th style="text-align: center;">Qwen-Omni</th>
115
- <th style="text-align: center;">Step-Audio 2</th>
116
- <th style="text-align: center;">Step-Audio 2 mini</th>
117
- </tr>
118
- </thead>
119
- <tbody>
120
- <tr>
121
- <td rowspan="5" style="text-align: center; vertical-align: middle;"><strong>English</strong></td>
122
- <td align="left">Common Voice</td>
123
- <td align="center">9.20</td>
124
- <td align="center">9.30</td>
125
- <td align="center">7.83</td>
126
- <td align="center">8.33</td>
127
- <td align="center"><strong>5.95</strong></td>
128
- <td align="center">6.76</td>
129
- </tr>
130
- <tr>
131
- <td align="left">FLEURS English</td>
132
- <td align="center">7.22</td>
133
- <td align="center"><strong>2.71</strong></td>
134
- <td align="center">4.47</td>
135
- <td align="center">5.05</td>
136
- <td align="center">3.03</td>
137
- <td align="center">3.05</td>
138
- </tr>
139
- <tr>
140
- <td align="left">LibriSpeech clean</td>
141
- <td align="center">2.92</td>
142
- <td align="center">1.75</td>
143
- <td align="center">1.49</td>
144
- <td align="center">2.93</td>
145
- <td align="center"><strong>1.17</strong></td>
146
- <td align="center">1.33</td>
147
- </tr>
148
- <tr>
149
- <td align="left">LibriSpeech other</td>
150
- <td align="center">5.32</td>
151
- <td align="center">4.23</td>
152
- <td align="center">2.91</td>
153
- <td align="center">5.07</td>
154
- <td align="center"><strong>2.42</strong></td>
155
- <td align="center">2.86</td>
156
- </tr>
157
- <tr>
158
- <td align="left"><strong>Average</strong></td>
159
- <td align="center">6.17</td>
160
- <td align="center">4.50</td>
161
- <td align="center">4.18</td>
162
- <td align="center">5.35</td>
163
- <td align="center"><strong>3.14</strong></td>
164
- <td align="center">3.50</td>
165
- </tr>
166
- <tr>
167
- <td rowspan="7" style="text-align: center; vertical-align: middle;"><strong>Chinese</strong></td>
168
- <td align="left">AISHELL</td>
169
- <td align="center">0.98</td>
170
- <td align="center">3.52</td>
171
- <td align="center">0.64</td>
172
- <td align="center">1.17</td>
173
- <td align="center"><strong>0.63</strong></td>
174
- <td align="center">0.78</td>
175
- </tr>
176
- <tr>
177
- <td align="left">AISHELL-2</td>
178
- <td align="center">3.10</td>
179
- <td align="center">4.26</td>
180
- <td align="center">2.67</td>
181
- <td align="center">2.40</td>
182
- <td align="center"><strong>2.10</strong></td>
183
- <td align="center">2.16</td>
184
- </tr>
185
- <tr>
186
- <td align="left">FLEURS Chinese</td>
187
- <td align="center">2.92</td>
188
- <td align="center">2.62</td>
189
- <td align="center">2.91</td>
190
- <td align="center">7.01</td>
191
- <td align="center">2.68</td>
192
- <td align="center"><strong>2.53</strong></td>
193
- </tr>
194
- <tr>
195
- <td align="left">KeSpeech phase1</td>
196
- <td align="center">6.48</td>
197
- <td align="center">26.80</td>
198
- <td align="center">5.11</td>
199
- <td align="center">6.45</td>
200
- <td align="center"><strong>3.63</strong></td>
201
- <td align="center">3.97</td>
202
- </tr>
203
- <tr>
204
- <td align="left">WenetSpeech meeting</td>
205
- <td align="center">4.90</td>
206
- <td align="center">31.40</td>
207
- <td align="center">5.21</td>
208
- <td align="center">6.61</td>
209
- <td align="center"><strong>4.75</strong></td>
210
- <td align="center">4.87</td>
211
- </tr>
212
- <tr>
213
- <td align="left">WenetSpeech net</td>
214
- <td align="center"><strong>4.46</strong></td>
215
- <td align="center">15.71</td>
216
- <td align="center">5.93</td>
217
- <td align="center">5.24</td>
218
- <td align="center">4.67</td>
219
- <td align="center">4.82</td>
220
- </tr>
221
- <tr>
222
- <td align="left"><strong>Average</strong></td>
223
- <td align="center">3.81</td>
224
- <td align="center">14.05</td>
225
- <td align="center">3.75</td>
226
- <td align="center">4.81</td>
227
- <td align="center"><strong>3.08</strong></td>
228
- <td align="center">3.19</td>
229
- </tr>
230
- <tr>
231
- <td rowspan="3" style="text-align: center; vertical-align: middle;"><strong>Multilingual </strong></td>
232
- <td align="left">FLEURS Arabic</td>
233
- <td align="center">N/A</td>
234
- <td align="center"><strong>11.72</strong></td>
235
- <td align="center">N/A</td>
236
- <td align="center">25.13</td>
237
- <td align="center">14.22</td>
238
- <td align="center">16.46</td>
239
- </tr>
240
- <tr>
241
- <td align="left">Common Voice yue</td>
242
- <td align="center">9.20</td>
243
- <td align="center">11.10</td>
244
- <td align="center">38.90</td>
245
- <td align="center"><strong>7.89</strong></td>
246
- <td align="center">7.90</td>
247
- <td align="center">8.32</td>
248
- </tr>
249
- <tr>
250
- <td align="left">FLEURS Japanese</td>
251
- <td align="center">N/A</td>
252
- <td align="center"><strong>3.27</strong></td>
253
- <td align="center">N/A</td>
254
- <td align="center">10.49</td>
255
- <td align="center">3.18</td>
256
- <td align="center">4.67</td>
257
- </tr>
258
- <tr>
259
- <td rowspan="7" style="text-align: center; vertical-align: middle;"><strong>In-house</strong></td>
260
- <td align="left">Anhui accent</td>
261
- <td align="center"><strong>8.83</strong></td>
262
- <td align="center">50.55</td>
263
- <td align="center">22.17</td>
264
- <td align="center">18.73</td>
265
- <td align="center">10.61</td>
266
- <td align="center">11.65</td>
267
- </tr>
268
- <tr>
269
- <td align="left">Guangdong accent</td>
270
- <td align="center">4.99</td>
271
- <td align="center">7.83</td>
272
- <td align="center"><strong>3.76</strong></td>
273
- <td align="center">4.03</td>
274
- <td align="center">3.81</td>
275
- <td align="center">4.44</td>
276
- </tr>
277
- <tr>
278
- <td align="left">Guangxi accent</td>
279
- <td align="center">3.37</td>
280
- <td align="center">7.09</td>
281
- <td align="center">4.29</td>
282
- <td align="center"><strong>3.35</strong></td>
283
- <td align="center">4.11</td>
284
- <td align="center">3.51</td>
285
- </tr>
286
- <tr>
287
- <td align="left">Shanxi accent</td>
288
- <td align="center">20.26</td>
289
- <td align="center">55.03</td>
290
- <td align="center">34.71</td>
291
- <td align="center">25.95</td>
292
- <td align="center"><strong>12.44</strong></td>
293
- <td align="center">15.60</td>
294
- </tr>
295
- <tr>
296
- <td align="left">Sichuan dialect</td>
297
- <td align="center"><strong>3.01</strong></td>
298
- <td align="center">32.85</td>
299
- <td align="center">5.26</td>
300
- <td align="center">5.61</td>
301
- <td align="center">4.35</td>
302
- <td align="center">4.57</td>
303
- </tr>
304
- <tr>
305
- <td align="left">Shanghai dialect</td>
306
- <td align="center">47.49</td>
307
- <td align="center">89.58</td>
308
- <td align="center">82.90</td>
309
- <td align="center">58.74</td>
310
- <td align="center"><strong>17.77</strong></td>
311
- <td align="center">19.30</td>
312
- </tr>
313
- <tr>
314
- <td align="left"><strong>Average</strong></td>
315
- <td align="center">14.66</td>
316
- <td align="center">40.49</td>
317
- <td align="center">25.52</td>
318
- <td align="center">19.40</td>
319
- <td align="center"><strong>8.85</strong></td>
320
- <td align="center">9.85</td>
321
- </tr>
322
- </tbody>
323
- </table>
324
-
325
- ### Paralinguistic information understanding
326
- StepEval-Audio-Paralinguistic
327
- <table border="1" cellpadding="5" cellspacing="0" align="center">
328
- <thead>
329
- <tr>
330
- <th style="text-align: center;" rowspan="2">Model</th>
331
- <th style="text-align: center;" rowspan="2">Avg.</th>
332
- <th style="text-align: center;" rowspan="2">Gender</th>
333
- <th style="text-align: center;" rowspan="2">Age</th>
334
- <th style="text-align: center;" rowspan="2">Timbre</th>
335
- <th style="text-align: center;" rowspan="2">Scenario</th>
336
- <th style="text-align: center;" rowspan="2">Event</th>
337
- <th style="text-align: center;" rowspan="2">Emotion</th>
338
- <th style="text-align: center;" rowspan="2">Pitch</th>
339
- <th style="text-align: center;" rowspan="2">Rhythm</th>
340
- <th style="text-align: center;" rowspan="2">Speed</th>
341
- <th style="text-align: center;" rowspan="2">Style</th>
342
- <th style="text-align: center;" rowspan="2">Vocal</th>
343
- </tr>
344
- </thead>
345
- <tbody>
346
- <tr>
347
- <td align="left"><strong>GPT-4o Audio</strong></td>
348
- <td align="center">43.45</td>
349
- <td align="center">18</td>
350
- <td align="center">42</td>
351
- <td align="center">34</td>
352
- <td align="center">22</td>
353
- <td align="center">14</td>
354
- <td align="center">82</td>
355
- <td align="center">40</td>
356
- <td align="center">60</td>
357
- <td align="center">58</td>
358
- <td align="center">64</td>
359
- <td align="center">44</td>
360
- </tr>
361
- <tr>
362
- <td align="left"><strong>Kimi-Audio</strong></td>
363
- <td align="center">49.64</td>
364
- <td align="center">94</td>
365
- <td align="center">50</td>
366
- <td align="center">10</td>
367
- <td align="center">30</td>
368
- <td align="center">48</td>
369
- <td align="center">66</td>
370
- <td align="center">56</td>
371
- <td align="center">40</td>
372
- <td align="center">44</td>
373
- <td align="center">54</td>
374
- <td align="center">54</td>
375
- </tr>
376
- <tr>
377
- <td align="left"><strong>Qwen-Omni</strong></td>
378
- <td align="center">44.18</td>
379
- <td align="center">40</td>
380
- <td align="center">50</td>
381
- <td align="center">16</td>
382
- <td align="center">28</td>
383
- <td align="center">42</td>
384
- <td align="center">76</td>
385
- <td align="center">32</td>
386
- <td align="center">54</td>
387
- <td align="center">50</td>
388
- <td align="center">50</td>
389
- <td align="center">48</td>
390
- </tr>
391
- <tr>
392
- <td align="left"><strong>Step-Audio-AQAA</strong></td>
393
- <td align="center">36.91</td>
394
- <td align="center">70</td>
395
- <td align="center">66</td>
396
- <td align="center">18</td>
397
- <td align="center">14</td>
398
- <td align="center">14</td>
399
- <td align="center">40</td>
400
- <td align="center">38</td>
401
- <td align="center">48</td>
402
- <td align="center">54</td>
403
- <td align="center">44</td>
404
- <td align="center">0</td>
405
- </tr>
406
- <tr>
407
- <td align="left"><strong>Step-Audio 2</strong></td>
408
- <td align="center"><strong>83.09</strong></td>
409
- <td align="center"><strong>100</strong></td>
410
- <td align="center"><strong>96</strong></td>
411
- <td align="center"><strong>82</strong></td>
412
- <td align="center"><strong>78</strong></td>
413
- <td align="center"><strong>60</strong></td>
414
- <td align="center"><strong>86</strong></td>
415
- <td align="center"><strong>82</strong></td>
416
- <td align="center"><strong>86</strong></td>
417
- <td align="center"><strong>88</strong></td>
418
- <td align="center"><strong>88</strong></td>
419
- <td align="center">68</td>
420
- </tr>
421
- <tr>
422
- <td align="left"><strong>Step-Audio 2 mini</strong></td>
423
- <td align="center">80.00</td>
424
- <td align="center"><strong>100</strong></td>
425
- <td align="center">94</td>
426
- <td align="center">80</td>
427
- <td align="center"><strong>78</strong></td>
428
- <td align="center"><strong>60</strong></td>
429
- <td align="center">82</td>
430
- <td align="center"><strong>82</strong></td>
431
- <td align="center">68</td>
432
- <td align="center">74</td>
433
- <td align="center">86</td>
434
- <td align="center"><strong>76</strong></td>
435
- </tr>
436
- </tbody>
437
- </table>
438
-
439
- ### Audio understanding and reasoning
440
- MMAU
441
- <table border="1" cellpadding="5" cellspacing="0" align="center">
442
- <thead>
443
- <tr>
444
- <th style="text-align: center;">Model</th>
445
- <th style="text-align: center;">Avg.</th>
446
- <th style="text-align: center;">Sound</th>
447
- <th style="text-align: center;">Speech</th>
448
- <th style="text-align: center;">Music</th>
449
- </tr>
450
- </thead>
451
- <tbody>
452
- <tr>
453
- <td align="left"><strong>Audio Flamingo 3</strong></td>
454
- <td align="center">73.1</td>
455
- <td align="center">76.9</td>
456
- <td align="center">66.1</td>
457
- <td align="center"><strong>73.9</strong></td>
458
- </tr>
459
- <tr>
460
- <td align="left"><strong>Gemini 2.5 Pro</strong></td>
461
- <td align="center">71.6</td>
462
- <td align="center">75.1</td>
463
- <td align="center">71.5</td>
464
- <td align="center">68.3</td>
465
- </tr>
466
- <tr>
467
- <td align="left"><strong>GPT-4o Audio</strong></td>
468
- <td align="center">58.1</td>
469
- <td align="center">58.0</td>
470
- <td align="center">64.6</td>
471
- <td align="center">51.8</td>
472
- </tr>
473
- <tr>
474
- <td align="left"><strong>Kimi-Audio</strong></td>
475
- <td align="center">69.6</td>
476
- <td align="center">79.0</td>
477
- <td align="center">65.5</td>
478
- <td align="center">64.4</td>
479
- </tr>
480
- <tr>
481
- <td align="left"><strong>Omni-R1</strong></td>
482
- <td align="center">77.0</td>
483
- <td align="center">81.7</td>
484
- <td align="center">76.0</td>
485
- <td align="center">73.4</td>
486
- </tr>
487
- <tr>
488
- <td align="left"><strong>Qwen2.5-Omni</strong></td>
489
- <td align="center">71.5</td>
490
- <td align="center">78.1</td>
491
- <td align="center">70.6</td>
492
- <td align="center">65.9</td>
493
- </tr>
494
- <tr>
495
- <td align="left"><strong>Step-Audio-AQAA</strong></td>
496
- <td align="center">49.7</td>
497
- <td align="center">50.5</td>
498
- <td align="center">51.4</td>
499
- <td align="center">47.3</td>
500
- </tr>
501
- <tr>
502
- <td align="left"><strong>Step-Audio 2</strong></td>
503
- <td align="center"><strong>78.0</strong></td>
504
- <td align="center"><strong>83.5</strong></td>
505
- <td align="center"><strong>76.9</strong></td>
506
- <td align="center">73.7</td>
507
- </tr>
508
- <tr>
509
- <td align="left"><strong>Step-Audio 2 mini</strong></td>
510
- <td align="center">73.2</td>
511
- <td align="center">76.6</td>
512
- <td align="center">71.5</td>
513
- <td align="center">71.6</td>
514
- </tr>
515
- </tbody>
516
- </table>
517
-
518
- ### Speech translation
519
-
520
- <table border="1" cellpadding="5" cellspacing="0" align="center">
521
- <thead>
522
- <tr>
523
- <th style="text-align: center;" rowspan="2">Model</th>
524
- <th style="text-align: center;" colspan="3">CoVoST 2 (S2TT)</th>
525
- </tr>
526
- <tr>
527
- <th>Avg.</th>
528
- <th>English-to-Chinese</th>
529
- <th>Chinese-to-English</th>
530
- </tr>
531
- </thead>
532
- <tbody>
533
- <tr>
534
- <td align="left"><strong>GPT-4o Audio</strong></td>
535
- <td align="center">29.61</td>
536
- <td align="center">40.20</td>
537
- <td align="center">19.01</td>
538
- </tr>
539
- <tr>
540
- <td align="left"><strong>Qwen2.5-Omni</strong></td>
541
- <td align="center">35.40</td>
542
- <td align="center">41.40</td>
543
- <td align="center">29.40</td>
544
- </tr>
545
- <tr>
546
- <td align="left"><strong>Step-Audio-AQAA</strong></td>
547
- <td align="center">28.57</td>
548
- <td align="center">37.71</td>
549
- <td align="center">19.43</td>
550
- </tr>
551
- <tr>
552
- <td align="left"><strong>Step-Audio 2</strong></td>
553
- <td align="center">39.26</td>
554
- <td align="center">49.01</td>
555
- <td align="center"><strong>29.51</strong></td>
556
- </tr>
557
- <tr>
558
- <td align="left"><strong>Step-Audio 2 mini</strong></td>
559
- <td align="center"><strong>39.29</strong></td>
560
- <td align="center"><strong>49.12</strong></td>
561
- <td align="center">29.47</td>
562
- </tr>
563
- </tbody>
564
- </table>
565
-
566
- <table border="1" cellpadding="5" cellspacing="0" align="center">
567
- <thead>
568
- <tr>
569
- <th style="text-align: center;" rowspan="2">Model</th>
570
- <th style="text-align: center;" colspan="3">CVSS (S2ST)</th>
571
- </tr>
572
- <tr>
573
- <th>Avg.</th>
574
- <th>English-to-Chinese</th>
575
- <th>Chinese-to-English</th>
576
- </tr>
577
- </thead>
578
- <tbody>
579
- <tr>
580
- <td align="left"><strong>GPT-4o Audio</strong></td>
581
- <td align="center">23.68</td>
582
- <td align="center">20.07</td>
583
- <td align="center"><strong>27.29</strong></td>
584
- </tr>
585
- <tr>
586
- <td align="left"><strong>Qwen-Omni</strong></td>
587
- <td align="center">15.35</td>
588
- <td align="center">8.04</td>
589
- <td align="center">22.66</td>
590
- </tr>
591
- <tr>
592
- <td align="left"><strong>Step-Audio-AQAA</strong></td>
593
- <td align="center">27.36</td>
594
- <td align="center">30.74</td>
595
- <td align="center">23.98</td>
596
- </tr>
597
- <tr>
598
- <td align="left"><strong>Step-Audio 2</strong></td>
599
- <td align="center"><strong>30.87</strong></td>
600
- <td align="center"><strong>34.83</strong></td>
601
- <td align="center">26.92</td>
602
- </tr>
603
- <tr>
604
- <td align="left"><strong>Step-Audio 2 mini</strong></td>
605
- <td align="center">29.08</td>
606
- <td align="center">32.81</td>
607
- <td align="center">25.35</td>
608
- </tr>
609
- </tbody>
610
- </table>
611
-
612
- ### Tool calling
613
- StepEval-Audio-Toolcall. The date and time tools take no parameters.
614
- <table border="1" cellpadding="5" cellspacing="0" align="center">
615
- <thead>
616
- <tr>
617
- <th style="text-align: center;">Model</th>
618
- <th style="text-align: center;">Objective</th>
619
- <th style="text-align: center;">Metric</th>
620
- <th style="text-align: center;">Audio search</th>
621
- <th style="text-align: center;">Date & Time</th>
622
- <th style="text-align: center;">Weather</th>
623
- <th style="text-align: center;">Web search</th>
624
- </tr>
625
- </thead>
626
- <tbody>
627
- <tr>
628
- <td style="text-align: center; vertical-align: middle;" rowspan="3"><strong>Qwen3-32B</strong><sup>†</sup></td>
629
- <td align="center"><strong>Trigger</strong></td>
630
- <td align="center"><strong>Precision / Recall</strong></td>
631
- <td align="center">67.5 / 98.5</td>
632
- <td align="center">98.4 / 100.0</td>
633
- <td align="center">90.1 / 100.0</td>
634
- <td align="center">86.8 / 98.5</td>
635
- </tr>
636
- <tr>
637
- <td align="center"><strong>Type</strong></td>
638
- <td align="center"><strong>Accuracy</strong></td>
639
- <td align="center">100.0</td>
640
- <td align="center">100.0</td>
641
- <td align="center">98.5</td>
642
- <td align="center">98.5</td>
643
- </tr>
644
- <tr>
645
- <td align="center"><strong>Parameter</strong></td>
646
- <td align="center"><strong>Accuracy</strong></td>
647
- <td align="center">100.0</td>
648
- <td align="center">N/A</td>
649
- <td align="center">100.0</td>
650
- <td align="center">100.0</td>
651
- </tr>
652
- <tr>
653
- <td style="text-align: center; vertical-align: middle;" rowspan="3"><strong>Step-Audio 2</strong></td>
654
- <td align="center"><strong>Trigger</strong></td>
655
- <td align="center"><strong>Precision / Recall</strong></td>
656
- <td align="center">86.8 / 99.5</td>
657
- <td align="center">96.9 / 98.4</td>
658
- <td align="center">92.2 / 100.0</td>
659
- <td align="center">88.4 / 95.5</td>
660
- </tr>
661
- <tr>
662
- <td align="center"><strong>Type</strong></td>
663
- <td align="center"><strong>Accuracy</strong></td>
664
- <td align="center">100.0</td>
665
- <td align="center">100.0</td>
666
- <td align="center">90.5</td>
667
- <td align="center">98.4</td>
668
- </tr>
669
- <tr>
670
- <td align="center"><strong>Parameter</strong></td>
671
- <td align="center"><strong>Accuracy</strong></td>
672
- <td align="center">100.0</td>
673
- <td align="center">N/A</td>
674
- <td align="center">100.0</td>
675
- <td align="center">100.0</td>
676
- </tr>
677
- </tbody>
678
- </table>
679
-
680
- ### Speech-to-speech conversation
681
- URO-Bench. U., R., and O. stand for understanding, reasoning, and oral conversation, respectively.
682
-
683
- <table border="1" cellpadding="5" cellspacing="0" align="center">
684
- <thead>
685
- <tr>
686
- <th style="text-align: center;" rowspan="2">Model</th>
687
- <th style="text-align: center;" rowspan="2">Language</th>
688
- <th style="text-align: center;" colspan="4">Basic</th>
689
- <th style="text-align: center;" colspan="4">Pro</th>
690
- </tr>
691
- <tr>
692
- <th style="text-align: center;">Avg.</th>
693
- <th style="text-align: center;">U.</th>
694
- <th style="text-align: center;">R.</th>
695
- <th style="text-align: center;">O.</th>
696
- <th style="text-align: center;">Avg.</th>
697
- <th style="text-align: center;">U.</th>
698
- <th style="text-align: center;">R.</th>
699
- <th style="text-align: center;">O.</th>
700
- </tr>
701
- </thead>
702
- <tbody>
703
- <tr>
704
- <td align="left"><strong>GPT-4o Audio</strong></td>
705
- <td rowspan="6" style="text-align: center; vertical-align: middle;"><strong>Chinese</strong></td>
706
- <td align="center">78.59</td>
707
- <td align="center">89.40</td>
708
- <td align="center">65.48</td>
709
- <td align="center">85.24</td>
710
- <td align="center">67.10</td>
711
- <td align="center">70.60</td>
712
- <td align="center">57.22</td>
713
- <td align="center">70.20</td>
714
- </tr>
715
- <tr>
716
- <td align="left"><strong>Kimi-Audio</strong></td>
717
- <td align="center">73.59</td>
718
- <td align="center">79.34</td>
719
- <td align="center">64.66</td>
720
- <td align="center">79.75</td>
721
- <td align="center">66.07</td>
722
- <td align="center">60.44</td>
723
- <td align="center">59.29</td>
724
- <td align="center"><strong>76.21</strong></td>
725
- </tr>
726
- <tr>
727
- <td align="left"><strong>Qwen-Omni</strong></td>
728
- <td align="center">68.98</td>
729
- <td align="center">59.66</td>
730
- <td align="center">69.74</td>
731
- <td align="center">77.27</td>
732
- <td align="center">59.11</td>
733
- <td align="center">59.01</td>
734
- <td align="center">59.82</td>
735
- <td align="center">58.74</td>
736
- </tr>
737
- <tr>
738
- <td align="left"><strong>Step-Audio-AQAA</strong></td>
739
- <td align="center">74.71</td>
740
- <td align="center">87.61</td>
741
- <td align="center">59.63</td>
742
- <td align="center">81.93</td>
743
- <td align="center">65.61</td>
744
- <td align="center">74.76</td>
745
- <td align="center">47.29</td>
746
- <td align="center">68.97</td>
747
- </tr>
748
- <tr>
749
- <td align="left"><strong>Step-Audio 2</strong></td>
750
- <td align="center"><strong>83.32</strong></td>
751
- <td align="center"><strong>91.05</strong></td>
752
- <td align="center"><strong>75.45</strong></td>
753
- <td align="center"><strong>86.08</strong></td>
754
- <td align="center">68.25</td>
755
- <td align="center">74.78</td>
756
- <td align="center"><strong>63.18</strong></td>
757
- <td align="center">65.10</td>
758
- </tr>
759
- <tr>
760
- <td align="left"><strong>Step-Audio 2 mini</strong></td>
761
- <td align="center">77.81</td>
762
- <td align="center">89.19</td>
763
- <td align="center">64.53</td>
764
- <td align="center">84.12</td>
765
- <td align="center"><strong>69.57</strong></td>
766
- <td align="center"><strong>76.84</strong></td>
767
- <td align="center">58.90</td>
768
- <td align="center">69.42</td>
769
- </tr>
770
- <tr>
771
- <td align="left"><strong>GPT-4o Audio</strong></td>
772
- <td rowspan="6" style="text-align: center; vertical-align: middle;"><strong>English</strong></td>
773
- <td align="center"><strong>84.54</strong></td>
774
- <td align="center">90.18</td>
775
- <td align="center">75.90</td>
776
- <td align="center"><strong>90.41</strong></td>
777
- <td align="center"><strong>67.51</strong></td>
778
- <td align="center">60.65</td>
779
- <td align="center">64.36</td>
780
- <td align="center"><strong>78.46</strong></td>
781
- </tr>
782
- <tr>
783
- <td align="left"><strong>Kimi-Audio</strong></td>
784
- <td align="center">60.04</td>
785
- <td align="center">83.36</td>
786
- <td align="center">42.31</td>
787
- <td align="center">60.36</td>
788
- <td align="center">49.79</td>
789
- <td align="center">50.32</td>
790
- <td align="center">40.59</td>
791
- <td align="center">56.04</td>
792
- </tr>
793
- <tr>
794
- <td align="left"><strong>Qwen-Omni</strong></td>
795
- <td align="center">70.58</td>
796
- <td align="center">66.29</td>
797
- <td align="center">69.62</td>
798
- <td align="center">76.16</td>
799
- <td align="center">50.99</td>
800
- <td align="center">44.51</td>
801
- <td align="center">63.88</td>
802
- <td align="center">49.41</td>
803
- </tr>
804
- <tr>
805
- <td align="left"><strong>Step-Audio-AQAA</strong></td>
806
- <td align="center">71.11</td>
807
- <td align="center">90.15</td>
808
- <td align="center">56.12</td>
809
- <td align="center">72.06</td>
810
- <td align="center">52.01</td>
811
- <td align="center">44.25</td>
812
- <td align="center">54.54</td>
813
- <td align="center">59.81</td>
814
- </tr>
815
- <tr>
816
- <td align="left"><strong>Step-Audio 2</strong></td>
817
- <td align="center">83.90</td>
818
- <td align="center"><strong>92.72</strong></td>
819
- <td align="center"><strong>76.51</strong></td>
820
- <td align="center">84.92</td>
821
- <td align="center">66.07</td>
822
- <td align="center"><strong>64.86</strong></td>
823
- <td align="center"><strong>67.75</strong></td>
824
- <td align="center">66.33</td>
825
- </tr>
826
- <tr>
827
- <td align="left"><strong>Step-Audio 2 mini</strong></td>
828
- <td align="center">74.36</td>
829
- <td align="center">90.07</td>
830
- <td align="center">60.12</td>
831
- <td align="center">77.65</td>
832
- <td align="center">61.25</td>
833
- <td align="center">58.79</td>
834
- <td align="center">61.94</td>
835
- <td align="center">63.80</td>
836
- </tr>
837
- </tbody>
838
- </table>
839
-
840
- ## License
841
-
842
- The model and code in this repository are licensed under the [Apache 2.0](LICENSE) License.
843
-
844
- ## Citation
845
-
846
- ```
847
- @misc{wu2025stepaudio2technicalreport,
848
- title={Step-Audio 2 Technical Report},
849
- author={Boyong Wu and Chao Yan and Chen Hu and Cheng Yi and Chengli Feng and Fei Tian and Feiyu Shen and Gang Yu and Haoyang Zhang and Jingbei Li and Mingrui Chen and Peng Liu and Wang You and Xiangyu Tony Zhang and Xingyuan Li and Xuerui Yang and Yayue Deng and Yechang Huang and Yuxin Li and Yuxin Zhang and Zhao You and Brian Li and Changyi Wan and Hanpeng Hu and Jiangjie Zhen and Siyu Chen and Song Yuan and Xuelin Zhang and Yimin Jiang and Yu Zhou and Yuxiang Yang and Bingxin Li and Buyun Ma and Changhe Song and Dongqing Pang and Guoqiang Hu and Haiyang Sun and Kang An and Na Wang and Shuli Gao and Wei Ji and Wen Li and Wen Sun and Xuan Wen and Yong Ren and Yuankai Ma and Yufan Lu and Bin Wang and Bo Li and Changxin Miao and Che Liu and Chen Xu and Dapeng Shi and Dingyuan Hu and Donghang Wu and Enle Liu and Guanzhe Huang and Gulin Yan and Han Zhang and Hao Nie and Haonan Jia and Hongyu Zhou and Jianjian Sun and Jiaoren Wu and Jie Wu and Jie Yang and Jin Yang and Junzhe Lin and Kaixiang Li and Lei Yang and Liying Shi and Li Zhou and Longlong Gu and Ming Li and Mingliang Li and Mingxiao Li and Nan Wu and Qi Han and Qinyuan Tan and Shaoliang Pang and Shengjie Fan and Siqi Liu and Tiancheng Cao and Wanying Lu and Wenqing He and Wuxun Xie and Xu Zhao and Xueqi Li and Yanbo Yu and Yang Yang and Yi Liu and Yifan Lu and Yilei Wang and Yuanhao Ding and Yuanwei Liang and Yuanwei Lu and Yuchu Luo and Yuhe Yin and Yumeng Zhan and Yuxiang Zhang and Zidong Yang and Zixin Zhang and Binxing Jiao and Daxin Jiang and Heung-Yeung Shum and Jiansheng Chen and Jing Li and Xiangyu Zhang and Yibo Zhu},
850
- year={2025},
851
- eprint={2507.16632},
852
- archivePrefix={arXiv},
853
- primaryClass={cs.CL},
854
- url={https://arxiv.org/abs/2507.16632},
855
- }
856
- ```
 
2
  license: apache-2.0
3
  ---
4
 
5
+ # OV-InstructTTS: Towards Open-Vocabulary Instruct Text-to-Speech
 
 
6
 
7
+ <p align="center">
8
+ &nbsp;&nbsp;🖥️ <a href="https://y-ren16.github.io/OV-InstructTTS">Demo</a> | 🤗 <a href="https://huggingface.co/datasets/y-ren16/OVSpeech">Datasets</a>&nbsp;&nbsp; | ⚙️ <a href="https://github.com/y-ren16/OV-InstructTTS" class="link-button"> Code</a>&nbsp;&nbsp; | 🤗 <a href="https://huggingface.co/y-ren16/OV-InstructTTS">Checkpoints</a>&nbsp;&nbsp;
9
+ <!-- |&nbsp&nbsp📑 <a href="https://arxiv.org/pdf/2510.00000">Paper</a>&nbsp&nbsp -->
10
+ <br>
 
11
 
12
+ # Introduction
13
 
14
 
15
+ Instruct Text-to-Speech (InstructTTS) uses natural language descriptions as style prompts to guide speech synthesis. However, existing InstructTTS methods rely mainly on direct combinations of audio-related labels, or on rephrasings of those labels, which makes it difficult to handle flexible, high-level instructions. Such rigid control falls short for users, such as content creators, who want to steer generation with descriptive instructions. To address these constraints, we introduce OV-InstructTTS, a new paradigm for open-vocabulary InstructTTS. We propose a comprehensive solution comprising a newly curated dataset, OV-Speech, and a novel reasoning-driven framework. The OV-Speech dataset pairs speech with open-vocabulary instructions, each augmented with a reasoning process that connects the high-level instruction to acoustic features. The reasoning-driven framework infers emotional, acoustic, and paralinguistic information from an open-vocabulary instruction before synthesizing speech. Evaluations show that this reasoning-driven approach significantly improves instruction-following fidelity and speech expressiveness. We believe this work can inspire the next generation of user-friendly InstructTTS systems with stronger generalization and real-world applicability.
16