ASR & Speech Understanding Model

ASR & Understanding Leaderboard

Bold and underlined values denote the best and second-best results.

ASR results (CER%) on various test sets


| Model | In-House Dialogue | In-House Reading | WS-Wu-Bench ASR |
| --- | --- | --- | --- |
| **ASR Models** | | | |
| Paraformer | 63.13 | 66.85 | 64.92 |
| SenseVoice-small | 29.20 | 31.00 | 46.85 |
| Whisper-medium | 79.31 | 83.94 | 78.24 |
| FireRedASR-AED-L | 51.34 | 59.92 | 56.69 |
| Step-Audio2-mini | 24.27 | 24.01 | 26.72 |
| Qwen3-ASR | 23.96 | 24.13 | 29.31 |
| Tencent-Cloud-ASR | 23.25 | 25.26 | 29.48 |
| Gemini-2.5-pro | 85.50 | 84.67 | 89.99 |
| Conformer-U2pp-Wu ⭐ | 15.20 | 12.24 | 15.14 |
| Whisper-medium-Wu ⭐ | 14.19 | 11.09 | 14.33 |
| Step-Audio2-Wu-ASR ⭐ | 8.68 | 7.86 | 12.85 |
| **Annotation Models** | | | |
| Dolphin-small | 24.78 | 27.29 | 26.93 |
| TeleASR | 29.07 | 21.18 | 30.81 |
| Step-Audio2-FT | 8.02 | 6.14 | 15.64 |
| Tele-CTC-FT | 11.90 | 7.23 | 23.85 |

Speech understanding performance on WenetSpeech-Wu-Bench


| Model | ASR | AST | Gender | Age | Emotion |
| --- | --- | --- | --- | --- | --- |
| Qwen3-Omni | 44.27 | 33.31 | 0.977 | 0.541 | 0.667 |
| Step-Audio2-mini | 26.72 | 37.81 | 0.855 | 0.370 | 0.460 |
| Step-Audio2-Wu-Und ⭐ | 13.23 | 53.13 | 0.956 | 0.729 | 0.712 |

ASR & Speech Understanding

This section describes the inference procedures for the speech models used in our experiments: Conformer-U2pp-Wu, Whisper-Medium-Wu, Step-Audio2-Wu-ASR, and Step-Audio2-Wu-Und.


Clone

  • Clone the repo for Conformer-U2pp-Wu and Whisper-Medium-Wu:

```bash
git clone https://github.com/wenet-e2e/wenet.git
cd wenet/examples/aishell/whisper
```

  • Clone the repo for Step-Audio2-Wu-ASR and Step-Audio2-Wu-Und:

```bash
git clone https://github.com/modelscope/ms-swift.git
pip install transformers==4.53.3
```

Data Format

The models are trained and run under different frameworks, each with its own data format.

Conformer-U2pp-Wu & Whisper-Medium-Wu

The inference data is provided in JSONL format, where each line corresponds to one utterance:

```json
{"key": "xxxx", "wav": "xxxxx", "txt": "xxxx"}
```

  • key: utterance ID
  • wav: path to the audio file
  • txt: reference transcription (optional during inference)
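A manifest in this format can be generated from a directory of audio files with a short script. A minimal sketch (the directory layout and helper name are illustrative, not part of the WeNet toolkit):

```python
import json
from pathlib import Path

def write_manifest(wav_dir: str, out_path: str) -> int:
    """Write one JSON line per .wav file: {"key", "wav", "txt"}.

    Returns the number of utterances written. The utterance key is
    taken from the file stem; txt is left empty for inference.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for wav in sorted(Path(wav_dir).glob("*.wav")):
            entry = {"key": wav.stem, "wav": str(wav), "txt": ""}
            out.write(json.dumps(entry, ensure_ascii=False) + "\n")
            count += 1
    return count
```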

Step-Audio2-Wu-ASR

The inference data follows a multi-modal dialogue format, where audio is provided explicitly:

```json
{
  "messages": [
    {
      "role": "user",
      "content": "<audio>语音说了什么"
    },
    {
      "role": "assistant",
      "content": "xxxx"
    }
  ],
  "audios": [
    "xxxx"
  ]
}
```

  • messages: dialogue-style input/output
  • audios: path(s) to the audio file(s)
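A record in this shape can be assembled per utterance before being serialized to JSONL. A hedged sketch (the helper name is hypothetical; the prompt string is the ASR prompt shown above, roughly "what does the speech say"):

```python
import json

ASR_PROMPT = "<audio>语音说了什么"  # user prompt from the example above

def to_swift_record(wav_path: str, transcript: str = "") -> dict:
    """Wrap one utterance in the multi-modal dialogue format expected here."""
    return {
        "messages": [
            {"role": "user", "content": ASR_PROMPT},
            {"role": "assistant", "content": transcript},
        ],
        "audios": [wav_path],
    }

# One record per line of data.jsonl:
# print(json.dumps(to_swift_record("utt1.wav", "转写文本"), ensure_ascii=False))
```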

Step-Audio2-Wu-Und

The inference script is identical to that of Step-Audio2 described above; only the user prompt needs to be modified for different tasks.

```json
{
  "ASR": "<audio>请记录下你所听到的语音内容。",
  "AST": "<audio>请仔细聆听这段语音,然后将其内容翻译成普通话。",
  "age": "<audio>请根据语音的声学特征,判断说话人的年龄,从儿童、少年、青年、中年、老年中选一个标签。",
  "gender": "<audio>请根据语音的声学特征,判断说话人的性别,从男性、女性中选一个标签。",
  "emotion": "<audio>请根据语音的声学特征和语义,判断语音的情感,从中立、高兴、难过、惊讶、生气选一个标签。"
}
```

In English, the prompts read:

  • ASR: "Transcribe the speech content you hear."
  • AST: "Listen to this speech carefully, then translate its content into Mandarin."
  • age: "Based on the acoustic features of the speech, judge the speaker's age; choose one label from child, teenager, young adult, middle-aged, or elderly."
  • gender: "Based on the acoustic features of the speech, judge the speaker's gender; choose one label from male or female."
  • emotion: "Based on the acoustic features and semantics of the speech, judge the emotion; choose one label from neutral, happy, sad, surprised, or angry."
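Since only the user prompt changes between tasks, one small helper can emit inference records for any task. A minimal sketch (the function name is hypothetical; prompts are copied verbatim from the table above, here abbreviated to two tasks):

```python
# Task prompts copied from the table above (subset shown for brevity)
PROMPTS = {
    "ASR": "<audio>请记录下你所听到的语音内容。",
    "gender": "<audio>请根据语音的声学特征,判断说话人的性别,从男性、女性中选一个标签。",
}

def build_record(task: str, wav_path: str) -> dict:
    """Build one inference record for the given understanding task."""
    if task not in PROMPTS:
        raise KeyError(f"unknown task: {task}")
    return {
        "messages": [{"role": "user", "content": PROMPTS[task]}],
        "audios": [wav_path],
    }
```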

Conformer-U2pp-Wu

```bash
dir=exp
data_type=raw
decode_checkpoint=$dir/u2++.pt
decode_modes="attention attention_rescoring ctc_prefix_beam_search ctc_greedy_search"
decode_batch=4
test_result_dir=./results
ctc_weight=0.0
reverse_weight=0.0
decoding_chunk_size=-1  # -1 = full-utterance (non-streaming) decoding

# test_dir and test_set must point to your prepared test data
python wenet/bin/recognize.py --gpu 0 \
  --modes ${decode_modes} \
  --config $dir/train.yaml \
  --data_type $data_type \
  --test_data $test_dir/$test_set/data.jsonl \
  --checkpoint $decode_checkpoint \
  --beam_size 10 \
  --batch_size ${decode_batch} \
  --blank_penalty 0.0 \
  --ctc_weight $ctc_weight \
  --reverse_weight $reverse_weight \
  --result_dir $test_result_dir \
  ${decoding_chunk_size:+--decoding_chunk_size $decoding_chunk_size}
```

This setup supports multiple decoding strategies, including attention-based and CTC-based decoding.
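The leaderboard above reports CER%. Character error rate is the character-level Levenshtein distance divided by the reference length; a minimal illustrative implementation (not the official scoring script) looks like:

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance / reference length."""
    r, h = list(ref), list(hyp)
    # prev[j] holds the edit distance between r[:i-1] and h[:j]
    prev = list(range(len(h) + 1))
    for i, rc in enumerate(r, 1):
        cur = [i]
        for j, hc in enumerate(h, 1):
            cost = 0 if rc == hc else 1
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + cost))  # substitution / match
        prev = cur
    return prev[-1] / max(len(r), 1)
```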

Whisper-Medium-Wu

```bash
dir=exp
data_type=raw
decode_checkpoint=$dir/whisper.pt
decode_modes="attention attention_rescoring ctc_prefix_beam_search ctc_greedy_search"
decode_batch=4
test_result_dir=./results
ctc_weight=0.0
reverse_weight=0.0
decoding_chunk_size=-1  # -1 = full-utterance (non-streaming) decoding

# test_dir and test_set must point to your prepared test data
python wenet/bin/recognize.py --gpu 0 \
  --modes ${decode_modes} \
  --config $dir/train.yaml \
  --data_type $data_type \
  --test_data $test_dir/$test_set/data.jsonl \
  --checkpoint $decode_checkpoint \
  --beam_size 10 \
  --batch_size ${decode_batch} \
  --blank_penalty 0.0 \
  --ctc_weight $ctc_weight \
  --reverse_weight $reverse_weight \
  --result_dir $test_result_dir \
  ${decoding_chunk_size:+--decoding_chunk_size $decoding_chunk_size}
```

Step-Audio2-Wu-ASR & Step-Audio2-Wu-Und

```bash
model_dir=Step-Audio-2-mini   # base model weights
adapter_dir=./checkpoints     # fine-tuned adapter checkpoint

CUDA_VISIBLE_DEVICES=0 \
swift infer \
  --model $model_dir \
  --adapters $adapter_dir \
  --val_dataset data.jsonl \
  --max_new_tokens 512 \
  --torch_dtype bfloat16 \
  --result_path results.jsonl
```
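The `results.jsonl` file contains one JSON object per utterance. The snippet below collects the model outputs from such a file; the `response` field name is an assumption about ms-swift's output schema, so inspect one line of your own results file to confirm it:

```python
import json

def load_responses(path: str, field: str = "response") -> list:
    """Collect the model output field from each line of a results JSONL file.

    NOTE: the default field name "response" is an assumption; check one
    line of results.jsonl and adjust if your swift version differs.
    """
    out = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                out.append(json.loads(line).get(field, ""))
    return out
```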