ASR & Speech Understanding Model
ASR & Understanding Leaderboard
Bold and underlined values denote the best and second-best results, respectively.
ASR results (CER%) on various test sets
| Model | In-House Dialogue | In-House Reading | WS-Wu-Bench ASR |
|---|---|---|---|
| *ASR Models* | | | |
| Paraformer | 63.13 | 66.85 | 64.92 |
| SenseVoice-small | 29.20 | 31.00 | 46.85 |
| Whisper-medium | 79.31 | 83.94 | 78.24 |
| FireRedASR-AED-L | 51.34 | 59.92 | 56.69 |
| Step-Audio2-mini | 24.27 | 24.01 | 26.72 |
| Qwen3-ASR | 23.96 | 24.13 | 29.31 |
| Tencent-Cloud-ASR | 23.25 | 25.26 | 29.48 |
| Gemini-2.5-pro | 85.50 | 84.67 | 89.99 |
| Conformer-U2pp-Wu ⭐ | 15.20 | 12.24 | 15.14 |
| Whisper-medium-Wu ⭐ | 14.19 | 11.09 | 14.33 |
| Step-Audio2-Wu-ASR ⭐ | 8.68 | 7.86 | 12.85 |
| Annotation Models | |||
| Dolphin-small | 24.78 | 27.29 | 26.93 |
| TeleASR | 29.07 | 21.18 | 30.81 |
| Step-Audio2-FT | 8.02 | 6.14 | 15.64 |
| Tele-CTC-FT | 11.90 | 7.23 | 23.85 |
Speech understanding performance on WenetSpeech-Wu-Bench
| Model | ASR (CER%) | AST | Gender | Age | Emotion |
|---|---|---|---|---|---|
| Qwen3-Omni | 44.27 | 33.31 | 0.977 | 0.541 | 0.667 |
| Step-Audio2-mini | 26.72 | 37.81 | 0.855 | 0.370 | 0.460 |
| Step-Audio2-Wu-Und ⭐ | 13.23 | 53.13 | 0.956 | 0.729 | 0.712 |
ASR & Speech Understanding
This section describes the inference procedure for each of the speech models used in our experiments: Conformer-U2pp-Wu, Whisper-Medium-Wu, Step-Audio2-Wu-ASR, and Step-Audio2-Wu-Und.
Clone
- Clone the repo for Conformer-U2pp-Wu and Whisper-Medium-Wu:

```bash
git clone https://github.com/wenet-e2e/wenet.git
cd wenet/examples/aishell/whisper
```
- Clone the repo for Step-Audio2-Wu-ASR and Step-Audio2-Wu-Und:

```bash
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift && pip install -e .  # install ms-swift from source (assumed step; see the ms-swift README)
pip install transformers==4.53.3
```
Data Format
Different models are trained and inferred under different frameworks, with corresponding data formats.
Conformer-U2pp-Wu & Whisper-Medium-Wu
The inference data is provided in JSONL format, where each line corresponds to one utterance:
{"key": "xxxx", "wav": "xxxxx", "txt": "xxxx"}
key: utterance IDwav: path to the audio filetxt: reference transcription (optional during inference)
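For convenience, a minimal Python sketch that writes such a manifest; the file names, paths, and texts below are illustrative:

```python
import json

# Illustrative entries: (utterance ID, audio path, reference transcription).
utterances = [
    ("utt_0001", "/data/wav/utt_0001.wav", "参考文本一"),
    ("utt_0002", "/data/wav/utt_0002.wav", "参考文本二"),
]

with open("data.jsonl", "w", encoding="utf-8") as f:
    for key, wav, txt in utterances:
        # "txt" may be left empty at inference time.
        f.write(json.dumps({"key": key, "wav": wav, "txt": txt},
                           ensure_ascii=False) + "\n")
```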
Step-Audio2-Wu-ASR
The inference data follows a multi-modal dialogue format, where audio is provided explicitly:
```json
{
  "messages": [
    {
      "role": "user",
      "content": "<audio>语音说了什么"
    },
    {
      "role": "assistant",
      "content": "xxxx"
    }
  ],
  "audios": [
    "xxxx"
  ]
}
```
- `messages`: dialogue-style input/output; the user prompt `<audio>语音说了什么` asks "what does the speech say"
- `audios`: path(s) to the audio file(s)
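A matching sketch for generating records in this format; the prompt string comes from the example above, while paths and transcripts are placeholders:

```python
import json

def make_asr_record(audio_path: str, transcript: str) -> dict:
    """Build one dialogue-format record for Wu ASR inference."""
    return {
        "messages": [
            {"role": "user", "content": "<audio>语音说了什么"},
            {"role": "assistant", "content": transcript},
        ],
        "audios": [audio_path],
    }

with open("data.jsonl", "w", encoding="utf-8") as f:
    record = make_asr_record("/data/wav/utt_0001.wav", "参考文本")
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```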
Step-Audio2-Wu-Und
The data format and inference script are identical to those of Step-Audio2-Wu-ASR described above; only the user prompt needs to be modified for each task:
```json
{
  "ASR": "<audio>请记录下你所听到的语音内容。",
  "AST": "<audio>请仔细聆听这段语音,然后将其内容翻译成普通话。",
  "age": "<audio>请根据语音的声学特征,判断说话人的年龄,从儿童、少年、青年、中年、老年中选一个标签。",
  "gender": "<audio>请根据语音的声学特征,判断说话人的性别,从男性、女性中选一个标签。",
  "emotion": "<audio>请根据语音的声学特征和语义,判断语音的情感,从中立、高兴、难过、惊讶、生气选一个标签。"
}
```

In English: the ASR prompt asks for a transcript of the speech; AST asks for a Mandarin translation; age, gender, and emotion ask the model to pick one label (child/teenager/young adult/middle-aged/elderly; male/female; neutral/happy/sad/surprised/angry) based on the speech's acoustic and semantic cues.
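Evaluation files for the other tasks follow by swapping the user prompt into the record builder sketched earlier; a brief illustration (two prompts repeated here, the rest follow the mapping above; labels are placeholders):

```python
# Generalizes make_asr_record from the earlier sketch: the user prompt is
# selected per task from the mapping above.
PROMPTS = {
    "AST": "<audio>请仔细聆听这段语音,然后将其内容翻译成普通话。",
    "gender": "<audio>请根据语音的声学特征,判断说话人的性别,从男性、女性中选一个标签。",
}

def make_record(task: str, audio_path: str, target: str = "") -> dict:
    # Same structure as the ASR example; only the user prompt changes.
    return {
        "messages": [
            {"role": "user", "content": PROMPTS[task]},
            {"role": "assistant", "content": target},
        ],
        "audios": [audio_path],
    }
```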
Conformer-U2pp-Wu
```bash
dir=exp
data_type=raw
decode_checkpoint=$dir/u2++.pt
decode_modes="attention attention_rescoring ctc_prefix_beam_search ctc_greedy_search"
decode_batch=4
test_result_dir=./results
ctc_weight=0.0
reverse_weight=0.0
decoding_chunk_size=-1
# test_dir and test_set are not set above; point them at your evaluation data,
# e.g. test_dir=data and test_set=test.

python wenet/bin/recognize.py --gpu 0 \
  --modes ${decode_modes} \
  --config $dir/train.yaml \
  --data_type $data_type \
  --test_data $test_dir/$test_set/data.jsonl \
  --checkpoint $decode_checkpoint \
  --beam_size 10 \
  --batch_size ${decode_batch} \
  --blank_penalty 0.0 \
  --ctc_weight $ctc_weight \
  --reverse_weight $reverse_weight \
  --result_dir $test_result_dir \
  ${decoding_chunk_size:+--decoding_chunk_size $decoding_chunk_size}
```
This setup supports multiple decoding strategies, including attention-based and CTC-based decoding.
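The CER reported in the tables above is the character-level edit distance between hypothesis and reference, normalized by reference length. A minimal self-contained sketch of the metric, for illustration only (not the official scoring script):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    # dp[j] holds the edit distance between the processed prefix of ref
    # and hyp[:j]; the row is rolled over the characters of ref.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion (ref char missing)
                        dp[j - 1] + 1,      # insertion (extra hyp char)
                        prev + (r != h))    # substitution or match
            prev = cur
    return dp[-1] / max(len(ref), 1)

print(f"{100 * cer('今朝天气邪气好', '今朝天气蛮好'):.2f}")  # 28.57: 1 sub + 1 del
```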
Whisper-Medium-Wu
```bash
# Same recipe as Conformer-U2pp-Wu above; only the checkpoint differs.
dir=exp
data_type=raw
decode_checkpoint=$dir/whisper.pt
decode_modes="attention attention_rescoring ctc_prefix_beam_search ctc_greedy_search"
decode_batch=4
test_result_dir=./results
ctc_weight=0.0
reverse_weight=0.0
decoding_chunk_size=-1

python wenet/bin/recognize.py --gpu 0 \
  --modes ${decode_modes} \
  --config $dir/train.yaml \
  --data_type $data_type \
  --test_data $test_dir/$test_set/data.jsonl \
  --checkpoint $decode_checkpoint \
  --beam_size 10 \
  --batch_size ${decode_batch} \
  --blank_penalty 0.0 \
  --ctc_weight $ctc_weight \
  --reverse_weight $reverse_weight \
  --result_dir $test_result_dir \
  ${decoding_chunk_size:+--decoding_chunk_size $decoding_chunk_size}
```
Step-Audio2-Wu-ASR & Step-Audio2-Wu-Und
```bash
model_dir=Step-Audio-2-mini   # base model
adapter_dir=./checkpoints     # fine-tuned adapter weights

CUDA_VISIBLE_DEVICES=0 \
swift infer \
  --model $model_dir \
  --adapters $adapter_dir \
  --val_dataset data.jsonl \
  --max_new_tokens 512 \
  --torch_dtype bfloat16 \
  --result_path results.jsonl
```
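After inference, `results.jsonl` can be scored offline. A sketch for the classification tasks (gender/age/emotion), assuming each output line carries the model output under a `response` field and the reference under `labels`; these field names are an assumption about the ms-swift output format and may need adjusting:

```python
import json

correct = total = 0
with open("results.jsonl", encoding="utf-8") as f:
    for line in f:
        item = json.loads(line)
        # Assumed field names: "response" = model output, "labels" = reference.
        pred, ref = item["response"].strip(), item["labels"].strip()
        correct += pred == ref
        total += 1

print(f"accuracy: {correct / total:.3f}")
```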
Model tree for ASLP-lab/WenetSpeech-Wu-Speech-Understanding
- Base model: openai/whisper-medium