ASR & Speech Understanding Model
ASR & Understanding Leaderboard
Bold and underlined values denote the best and second-best results, respectively.
ASR results (CER%) on various test sets
| Model | In-House Dialogue | In-House Reading | WS-Wu-Bench ASR |
|---|---|---|---|
| *ASR Models* | | | |
| Paraformer | 63.13 | 66.85 | 64.92 |
| SenseVoice-small | 29.20 | 31.00 | 46.85 |
| Whisper-medium | 79.31 | 83.94 | 78.24 |
| FireRedASR-AED-L | 51.34 | 59.92 | 56.69 |
| Step-Audio2-mini | 24.27 | 24.01 | 26.72 |
| Qwen3-ASR | 23.96 | 24.13 | 29.31 |
| Tencent-Cloud-ASR | 23.25 | 25.26 | 29.48 |
| Gemini-2.5-pro | 85.50 | 84.67 | 89.99 |
| Conformer-U2pp-Wu ⭐ | 15.20 | 12.24 | 15.14 |
| Whisper-medium-Wu ⭐ | 14.19 | 11.09 | 14.33 |
| Step-Audio2-Wu-ASR ⭐ | 8.68 | 7.86 | 12.85 |
| Annotation Models | |||
| Dolphin-small | 24.78 | 27.29 | 26.93 |
| TeleASR | 29.07 | 21.18 | 30.81 |
| Step-Audio2-FT | 8.02 | 6.14 | 15.64 |
| Tele-CTC-FT | 11.90 | 7.23 | 23.85 |
Speech understanding performance on WenetSpeech-Wu-Bench
| Model | ASR (CER%) | AST | Gender | Age | Emotion |
|---|---|---|---|---|---|
| Qwen3-Omni | 44.27 | 33.31 | 0.977 | 0.541 | 0.667 |
| Step-Audio2-mini | 26.72 | 37.81 | 0.855 | 0.370 | 0.460 |
| Step-Audio2-Wu-Und ⭐ | 13.23 | 53.13 | 0.956 | 0.729 | 0.712 |
ASR & Speech Understanding
This section describes the inference procedure for each of the speech models used in our experiments: Conformer-U2pp-Wu, Whisper-Medium-Wu, Step-Audio2-Wu-ASR, and Step-Audio2-Wu-Und.
Clone
- Clone the repo for Conformer-U2pp-Wu and Whisper-Medium-Wu:

```bash
git clone https://github.com/wenet-e2e/wenet.git
cd wenet/examples/aishell/whisper
```
- Clone the repo for Step-Audio2-Wu-ASR and Step-Audio2-Wu-Und:

```bash
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift && pip install -e .  # install ms-swift from source (assumed step; see the ms-swift README)
pip install transformers==4.53.3
```
Data Format
Different models are trained and inferred under different frameworks, with corresponding data formats.
Conformer-U2pp-Wu & Whisper-Medium-Wu
The inference data is provided in JSONL format, where each line corresponds to one utterance:
{"key": "xxxx", "wav": "xxxxx", "txt": "xxxx"}
key: utterance IDwav: path to the audio filetxt: reference transcription (optional during inference)
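For convenience, a minimal Python sketch that writes such a manifest; the file names, paths, and texts below are illustrative:

```python
import json

# Illustrative entries: (utterance ID, audio path, reference transcription).
utterances = [
    ("utt_0001", "/data/wav/utt_0001.wav", "参考文本一"),
    ("utt_0002", "/data/wav/utt_0002.wav", "参考文本二"),
]

with open("data.jsonl", "w", encoding="utf-8") as f:
    for key, wav, txt in utterances:
        # "txt" may be left empty at inference time.
        f.write(json.dumps({"key": key, "wav": wav, "txt": txt},
                           ensure_ascii=False) + "\n")
```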
Step-Audio2-Wu-ASR
The inference data follows a multi-modal dialogue format, where audio is provided explicitly:
```json
{
  "messages": [
    {
      "role": "user",
      "content": "<audio>语音说了什么"
    },
    {
      "role": "assistant",
      "content": "xxxx"
    }
  ],
  "audios": [
    "xxxx"
  ]
}
```
- `messages`: dialogue-style input/output; the user prompt `<audio>语音说了什么` asks "what does the speech say"
- `audios`: path(s) to the audio file(s)
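A matching sketch for generating records in this format; the prompt string comes from the example above, while paths and transcripts are placeholders:

```python
import json

def make_asr_record(audio_path: str, transcript: str) -> dict:
    """Build one dialogue-format record for Wu ASR inference."""
    return {
        "messages": [
            {"role": "user", "content": "<audio>语音说了什么"},
            {"role": "assistant", "content": transcript},
        ],
        "audios": [audio_path],
    }

with open("data.jsonl", "w", encoding="utf-8") as f:
    record = make_asr_record("/data/wav/utt_0001.wav", "参考文本")
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```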
Step-Audio2-Wu-Und
The data format and inference script are identical to those of Step-Audio2-Wu-ASR described above; only the user prompt needs to be modified for each task:
```json
{
  "ASR": "<audio>请记录下你所听到的语音内容。",
  "AST": "<audio>请仔细聆听这段语音,然后将其内容翻译成普通话。",
  "age": "<audio>请根据语音的声学特征,判断说话人的年龄,从儿童、少年、青年、中年、老年中选一个标签。",
  "gender": "<audio>请根据语音的声学特征,判断说话人的性别,从男性、女性中选一个标签。",
  "emotion": "<audio>请根据语音的声学特征和语义,判断语音的情感,从中立、高兴、难过、惊讶、生气选一个标签。"
}
```

In English: the ASR prompt asks for a transcript of the speech; AST asks for a Mandarin translation; age, gender, and emotion ask the model to pick one label (child/teenager/young adult/middle-aged/elderly; male/female; neutral/happy/sad/surprised/angry) based on the speech's acoustic and semantic cues.
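Evaluation files for the other tasks follow by swapping the user prompt into the record builder sketched earlier; a brief illustration (two prompts repeated here, the rest follow the mapping above; labels are placeholders):

```python
# Generalizes make_asr_record from the earlier sketch: the user prompt is
# selected per task from the mapping above.
PROMPTS = {
    "AST": "<audio>请仔细聆听这段语音,然后将其内容翻译成普通话。",
    "gender": "<audio>请根据语音的声学特征,判断说话人的性别,从男性、女性中选一个标签。",
}

def make_record(task: str, audio_path: str, target: str = "") -> dict:
    # Same structure as the ASR example; only the user prompt changes.
    return {
        "messages": [
            {"role": "user", "content": PROMPTS[task]},
            {"role": "assistant", "content": target},
        ],
        "audios": [audio_path],
    }
```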
Conformer-U2pp-Wu
```bash
dir=exp
data_type=raw
decode_checkpoint=$dir/u2++.pt
decode_modes="attention attention_rescoring ctc_prefix_beam_search ctc_greedy_search"
decode_batch=4
test_result_dir=./results
ctc_weight=0.0
reverse_weight=0.0
decoding_chunk_size=-1
# test_dir and test_set are not set above; point them at your evaluation data,
# e.g. test_dir=data and test_set=test.

python wenet/bin/recognize.py --gpu 0 \
  --modes ${decode_modes} \
  --config $dir/train.yaml \
  --data_type $data_type \
  --test_data $test_dir/$test_set/data.jsonl \
  --checkpoint $decode_checkpoint \
  --beam_size 10 \
  --batch_size ${decode_batch} \
  --blank_penalty 0.0 \
  --ctc_weight $ctc_weight \
  --reverse_weight $reverse_weight \
  --result_dir $test_result_dir \
  ${decoding_chunk_size:+--decoding_chunk_size $decoding_chunk_size}
```
This setup supports multiple decoding strategies, including attention-based and CTC-based decoding.
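The CER reported in the tables above is the character-level edit distance between hypothesis and reference, normalized by reference length. A minimal self-contained sketch of the metric, for illustration only (not the official scoring script):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    # dp[j] holds the edit distance between the processed prefix of ref
    # and hyp[:j]; the row is rolled over the characters of ref.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion (ref char missing)
                        dp[j - 1] + 1,      # insertion (extra hyp char)
                        prev + (r != h))    # substitution or match
            prev = cur
    return dp[-1] / max(len(ref), 1)

print(f"{100 * cer('今朝天气邪气好', '今朝天气蛮好'):.2f}")  # 28.57: 1 sub + 1 del
```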
Whisper-Medium-Wu
```bash
# Same recipe as Conformer-U2pp-Wu above; only the checkpoint differs.
dir=exp
data_type=raw
decode_checkpoint=$dir/whisper.pt
decode_modes="attention attention_rescoring ctc_prefix_beam_search ctc_greedy_search"
decode_batch=4
test_result_dir=./results
ctc_weight=0.0
reverse_weight=0.0
decoding_chunk_size=-1

python wenet/bin/recognize.py --gpu 0 \
  --modes ${decode_modes} \
  --config $dir/train.yaml \
  --data_type $data_type \
  --test_data $test_dir/$test_set/data.jsonl \
  --checkpoint $decode_checkpoint \
  --beam_size 10 \
  --batch_size ${decode_batch} \
  --blank_penalty 0.0 \
  --ctc_weight $ctc_weight \
  --reverse_weight $reverse_weight \
  --result_dir $test_result_dir \
  ${decoding_chunk_size:+--decoding_chunk_size $decoding_chunk_size}
```
Step-Audio2-Wu-ASR & Step-Audio2-Wu-Und
```bash
model_dir=Step-Audio-2-mini   # base model
adapter_dir=./checkpoints     # fine-tuned adapter weights

CUDA_VISIBLE_DEVICES=0 \
swift infer \
  --model $model_dir \
  --adapters $adapter_dir \
  --val_dataset data.jsonl \
  --max_new_tokens 512 \
  --torch_dtype bfloat16 \
  --result_path results.jsonl
```
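After inference, `results.jsonl` can be scored offline. A sketch for the classification tasks (gender/age/emotion), assuming each output line carries the model output under a `response` field and the reference under `labels`; these field names are an assumption about the ms-swift output format and may need adjusting:

```python
import json

correct = total = 0
with open("results.jsonl", encoding="utf-8") as f:
    for line in f:
        item = json.loads(line)
        # Assumed field names: "response" = model output, "labels" = reference.
        pred, ref = item["response"].strip(), item["labels"].strip()
        correct += pred == ref
        total += 1

print(f"accuracy: {correct / total:.3f}")
```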
Model tree for ASLP-lab/WenetSpeech-Wu-Speech-Understanding
- Base model: openai/whisper-medium