FireRed Team committed on
Commit e448fd9 · verified · 1 Parent(s): 007264a

Update README.md

Files changed (1)
  1. README.md +152 -116
README.md CHANGED
@@ -15,6 +15,7 @@ tags:
15
  - asr
16
  ---
17
 
 
18
  <div align="center">
19
  <h1>
20
  FireRedASR2S
@@ -24,23 +25,28 @@ A SOTA Industrial-Grade All-in-One ASR System
24
 
25
  </div>
26
 
27
- [[Paper]](https://arxiv.org/pdf/2501.14350)
28
- [[Model]](https://huggingface.co/FireRedTeam)
29
- [[Blog]](https://fireredteam.github.io/demos/firered_asr/)
 
30
  [[Demo]](https://huggingface.co/spaces/FireRedTeam/FireRedASR)
31
 
32
 
33
  FireRedASR2S is a state-of-the-art (SOTA), industrial-grade, all-in-one ASR system with ASR, VAD, LID, and Punc modules. All modules achieve SOTA performance:
34
- - **FireRedASR2**: Automatic Speech Recognition (ASR) supporting Chinese (Mandarin, 20+ dialects/accents), English, code-switching, and singing lyrics recognition. 2.89% average CER on Mandarin (4 test sets), 11.55% on Chinese dialects (19 test sets), outperforming Doubao-ASR, Qwen3-ASR-1.7B, Fun-ASR, and Fun-ASR-Nano-2512. FireRedASR2-AED also supports word-level timestamps and confidence scores.
35
- - **FireRedVAD**: Voice Activity Detection (VAD) supporting speech/singing/music in 100+ languages. 97.57% F1, outperforming Silero-VAD, TEN-VAD, and FunASR-VAD. Supports non-streaming/streaming VAD and Audio Event Detection.
36
- - **FireRedLID**: Spoken Language Identification (LID) supporting 100+ languages and 20+ Chinese dialects/accents. 97.18% accuracy, outperforming Whisper and SpeechBrain-LID.
37
  - **FireRedPunc**: Punctuation Prediction (Punc) for Chinese and English. 78.90% average F1, outperforming FunASR-Punc (62.77%).
38
 
39
  *`2S`: `2`nd-generation FireRedASR, now expanded to an all-in-one ASR `S`ystem*
40
 
41
 
42
  ## 🔥 News
43
- - [2026.02.12] We release FireRedASR2S (FireRedASR2-AED, FireRedVAD, FireRedLID, and FireRedPunc) with model weights and inference code. Download links below. Technical report and finetuning code coming soon.
44
 
45
 
46
 
@@ -48,123 +54,33 @@ FireRedASR2S is a state-of-the-art (SOTA), industrial-grade, all-in-one ASR syst
48
 
49
  |Model|Supported Languages & Dialects|Download|
50
  |:-------------:|:---------------------------------:|:----------:|
51
- |FireRedASR2| Chinese (Mandarin and 20+ dialects/accents<sup>*</sup>), English, Code-Switching | [🤗](https://huggingface.co/FireRedTeam/FireRedASR2-AED) \| [🤖](https://modelscope.cn/collections/FireRedTeam/FireRedASR2S)|
52
- |FireRedVAD | 100+ languages, 20+ Chinese dialects/accents<sup>*</sup> | [🤗](https://huggingface.co/FireRedTeam/FireRedVAD) \| [🤖](https://modelscope.cn/collections/FireRedTeam/FireRedVAD)|
53
- |FireRedLID | 100+ languages, 20+ Chinese dialects/accents<sup>*</sup> | [🤗](https://huggingface.co/FireRedTeam/FireRedLID) \| [🤖](https://modelscope.cn/collections/FireRedTeam/FireRedLID)|
54
- |FireRedPunc| Chinese, English | [🤗](https://huggingface.co/FireRedTeam/FireRedPunc) \| [🤖](https://modelscope.cn/collections/FireRedTeam/FireRedPunc)|
 
55
 
56
  <sup>*</sup>Supported Chinese dialects/accents: Cantonese (Hong Kong & Guangdong), Sichuan, Shanghai, Wu, Minnan, Anhui, Fujian, Gansu, Guizhou, Hebei, Henan, Hubei, Hunan, Jiangxi, Liaoning, Ningxia, Shaanxi, Shanxi, Shandong, Tianjin, Yunnan, etc.
57
 
58
 
59
 
60
  ## Method
61
  ### FireRedASR2
62
 FireRedASR2 builds upon [FireRedASR](https://github.com/FireRedTeam/FireRedASR) with improved accuracy, and is designed to meet diverse application requirements for both superior performance and optimal efficiency. It comprises two variants:
63
  - **FireRedASR2-LLM**: Designed to achieve state-of-the-art performance and to enable seamless end-to-end speech interaction. It adopts an Encoder-Adapter-LLM framework leveraging large language model (LLM) capabilities.
64
  - **FireRedASR2-AED**: Designed to balance high performance and computational efficiency and to serve as an effective speech representation module in LLM-based speech models. It utilizes an Attention-based Encoder-Decoder (AED) architecture.
65
 
 
66
 
67
  ### Other Modules
68
- - **FireRedVAD**: DFSMN-based non-streaming/streaming Voice Activity Detection and Audio Event Detection.
69
- - **FireRedLID**: FireRedASR2-based Spoken Language Identification. See [FireRedLID README](./fireredasr2s/fireredlid/README.md) for language details.
70
  - **FireRedPunc**: BERT-based Punctuation Prediction.
71
 
72
 
73
- ## Evaluation
74
- ### FireRedASR2
75
- Metrics: Character Error Rate (CER%) for Chinese and Word Error Rate (WER%) for English. Lower is better.
76
-
77
- We evaluate FireRedASR2 on 24 public test sets covering Mandarin, 20+ Chinese dialects/accents, and singing.
78
-
79
- - **Mandarin (4 test sets)**: 2.89% (LLM) / 3.05% (AED) average CER, outperforming Doubao-ASR (3.69%), Qwen3-ASR-1.7B (3.76%), Fun-ASR (4.16%) and Fun-ASR-Nano-2512 (4.55%).
80
- - **Dialects (19 test sets)**: 11.55% (LLM) / 11.67% (AED) average CER, outperforming Doubao-ASR (15.39%), Qwen3-ASR-1.7B (11.85%), Fun-ASR (12.76%) and Fun-ASR-Nano-2512 (15.07%).
81
-
82
- *Note: ws=WenetSpeech, md=MagicData, conv=Conversational, daily=Daily-use.*
83
-
84
- |ID|Testset\Model|FireRedASR2-LLM|FireRedASR2-AED|Doubao-ASR|Qwen3-ASR|Fun-ASR|Fun-ASR-Nano|
85
- |:--:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
86
- | |**Average CER<br>(All, 1-24)** |**9.67** |**9.80** |12.98 |10.12 |10.92 |12.81 |
87
- | |**Average CER<br>(Mandarin, 1-4)** |**2.89** |**3.05** |3.69 |3.76 |4.16 |4.55 |
88
- | |**Average CER<br>(Dialects, 5-23)** |**11.55**|**11.67**|15.39|11.85|12.76|15.07|
89
- |1 |aishell1 |0.64 |0.57 |1.52 |1.48 |1.64 |1.96 |
90
- |2 |aishell2 |2.15 |2.51 |2.77 |2.71 |2.38 |3.02 |
91
- |3 |ws-net |4.44 |4.57 |5.73 |4.97 |6.85 |6.93 |
92
- |4 |ws-meeting |4.32 |4.53 |4.74 |5.88 |5.78 |6.29 |
93
- |5 |kespeech |3.08 |3.60 |5.38 |5.10 |5.36 |7.66 |
94
- |6 |ws-yue-short |5.14 |5.15 |10.51|5.82 |7.34 |8.82 |
95
- |7 |ws-yue-long |8.71 |8.54 |11.39|8.85 |10.14|11.36|
96
- |8 |ws-chuan-easy |10.90|10.60|11.33|11.99|12.46|14.05|
97
- |9 |ws-chuan-hard |20.71|21.35|20.77|21.63|22.49|25.32|
98
- |10|md-heavy |7.42 |7.43 |7.69 |8.02 |9.13 |9.97 |
99
- |11|md-yue-conv |12.23|11.66|26.25|9.76 |33.71|15.68|
100
- |12|md-yue-daily |3.61 |3.35 |12.82|3.66 |2.69 |5.67 |
101
- |13|md-yue-vehicle |4.50 |4.83 |8.66 |4.28 |6.00 |7.04 |
102
- |14|md-chuan-conv |13.18|13.07|11.77|14.35|14.01|17.11|
103
- |15|md-chuan-daily |4.90 |5.17 |3.90 |4.93 |3.98 |5.95 |
104
- |16|md-shanghai-conv |28.70|27.02|45.15|29.77|25.49|37.08|
105
- |17|md-shanghai-daily |24.94|24.18|44.06|23.93|12.55|28.77|
106
- |18|md-wu |7.15 |7.14 |7.70 |7.57 |10.63|10.56|
107
- |19|md-zhengzhou-conv |10.20|10.65|9.83 |9.55 |10.85|13.09|
108
- |20|md-zhengzhou-daily|5.80 |6.26 |5.77 |5.88 |6.29 |8.18 |
109
- |21|md-wuhan |9.60 |10.81|9.94 |10.22|4.34 |8.70 |
110
- |22|md-tianjin |15.45|15.30|15.79|16.16|19.27|22.03|
111
- |23|md-changsha |23.18|25.64|23.76|23.70|25.66|29.23|
112
- |24|opencpop |1.12 |1.17 |4.36 |2.57 |3.05 |2.95 |
113
-
114
- Doubao-ASR (volc.seedasr.auc) tested in early February 2026, and Fun-ASR tested in late November 2025. Our ASR training data does not include any Chinese dialect or accented speech data from MagicData.
115
- - Doubao-ASR (API): https://www.volcengine.com/docs/6561/1354868
116
- - Qwen3-ASR (1.7B): https://github.com/QwenLM/Qwen3-ASR
117
- - Fun-ASR (API): https://help.aliyun.com/zh/model-studio/recording-file-recognition
118
- - Fun-ASR-Nano-2512: https://huggingface.co/FunAudioLLM/Fun-ASR-Nano-2512
119
-
120
-
121
- ### FireRedVAD
122
- We evaluate FireRedVAD on FLEURS-VAD-102, a multilingual VAD benchmark covering 102 languages.
123
-
124
- FireRedVAD achieves SOTA performance, outperforming Silero-VAD, TEN-VAD, FunASR-VAD, and WebRTC-VAD.
125
-
126
- |Metric\Model|FireRedVAD|[Silero-VAD](https://github.com/snakers4/silero-vad)|[TEN-VAD](https://github.com/TEN-framework/ten-vad)|[FunASR-VAD](https://modelscope.cn/models/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch)|[WebRTC-VAD](https://github.com/wiseman/py-webrtcvad)|
127
- |:-------:|:-----:|:------:|:------:|:------:|:------:|
128
- |AUC-ROC↑ |**99.60**|97.99|97.81|- |- |
129
- |F1 score↑ |**97.57**|95.95|95.19|90.91|52.30|
130
- |False Alarm Rate↓ |**2.69** |9.41 |15.47|44.03|2.83 |
131
- |Miss Rate↓|3.62 |3.95 |2.95 |0.42 |64.15|
132
-
133
- <sup>*</sup>FLEURS-VAD-102: We randomly selected ~100 audio files per language from [FLEURS test set](https://huggingface.co/datasets/google/fleurs), resulting in 9,443 audio files with manually annotated binary VAD labels (speech=1, silence=0). This VAD testset will be open sourced (coming soon).
134
-
135
- Note: FunASR-VAD achieves low Miss Rate but at the cost of high False Alarm Rate (44.03%), indicating over-prediction of speech segments.
136
-
137
-
138
- ### FireRedLID
139
- Metric: Utterance-level LID Accuracy (%). Higher is better.
140
-
141
- We evaluate FireRedLID on multilingual and Chinese dialect benchmarks.
142
-
143
- FireRedLID achieves SOTA performance, outperforming Whisper, SpeechBrain-LID, and Dolphin.
144
-
145
- |Testset\Model|Languages|FireRedLID|[Whisper](https://github.com/openai/whisper)|[SpeechBrain](https://huggingface.co/speechbrain/lang-id-voxlingua107-ecapa)|[Dolphin](https://github.com/DataoceanAI/Dolphin)|
146
- |:-----------------:|:---------:|:---------:|:-----:|:---------:|:-----:|
147
- |FLEURS test |82 languages |**97.18** |79.41 |92.91 |-|
148
- |CommonVoice test |74 languages |**92.07** |80.81 |78.75 |-|
149
- |KeSpeech + MagicData|20+ Chinese dialects/accents |**88.47** |-|-|69.01|
150
-
151
-
152
- ### FireRedPunc
153
- Metric: Precision/Recall/F1 Score (%). Higher is better.
154
-
155
- We evaluate FireRedPunc on multi-domain Chinese and English benchmarks.
156
-
157
- FireRedPunc achieves SOTA performance, outperforming FunASR-Punc (CT-Transformer).
158
-
159
- |Testset\Model|#Sentences|FireRedPunc|[FunASR-Punc](https://www.modelscope.cn/models/iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch)|
160
- |:------------------:|:---------:|:--------------:|:-----------------:|
161
- |Multi-domain Chinese| 88,644 |**82.84 / 83.08 / 82.96** | 77.27 / 74.03 / 75.62 |
162
- |Multi-domain English| 28,641 |**78.40 / 71.57 / 74.83** | 55.79 / 45.15 / 49.91 |
163
- |Average F1 Score | - |**78.90** | 62.77 |
164
-
165
-
166
-
167
-
168
  ## Quick Start
169
  ### Setup
170
  1. Create a clean Python environment:
@@ -186,10 +102,11 @@ $ export PYTHONPATH=$PWD/:$PYTHONPATH
186
  ```bash
187
  # Download via ModelScope (recommended for users in China)
188
  pip install -U modelscope
189
- modelscope download --model FireRedTeam/FireRedASR2-AED --local_dir ./pretrained_models/FireRedASR2-AED
190
- modelscope download --model FireRedTeam/FireRedVAD --local_dir ./pretrained_models/FireRedVAD
191
- modelscope download --model FireRedTeam/FireRedLID --local_dir ./pretrained_models/FireRedLID
192
- modelscope download --model FireRedTeam/FireRedPunc --local_dir ./pretrained_models/FireRedPunc
 
193
 
194
  # Download via Hugging Face
195
  pip install -U "huggingface_hub[cli]"
@@ -197,6 +114,7 @@ huggingface-cli download FireRedTeam/FireRedASR2-AED --local-dir ./pretrained_mo
197
  huggingface-cli download FireRedTeam/FireRedVAD --local-dir ./pretrained_models/FireRedVAD
198
  huggingface-cli download FireRedTeam/FireRedLID --local-dir ./pretrained_models/FireRedLID
199
  huggingface-cli download FireRedTeam/FireRedPunc --local-dir ./pretrained_models/FireRedPunc
 
200
  ```
201
 
202
  4. Convert your audio to **16kHz 16-bit mono PCM** format if needed:
@@ -247,7 +165,7 @@ $ cd examples_infer/asr
247
  $ bash inference_asr_aed.sh
248
  $ bash inference_asr_llm.sh
249
 
250
- # VAD & AED (Audio Event Detection)
251
  $ cd examples_infer/vad
252
  $ bash inference_vad.sh
253
  $ bash inference_streamvad.sh
@@ -262,6 +180,13 @@ $ cd examples_infer/punc
262
  $ bash inference_punc.sh
263
  ```
264
 
265
 
266
  ### Python API Usage
267
  Set up `PYTHONPATH` first: `export PYTHONPATH=$PWD/:$PYTHONPATH`
@@ -354,7 +279,7 @@ print(result)
354
  </details>
355
 
356
 
357
- #### Audio Event Detection (AED)
358
  <details>
359
  <summary>Click to expand</summary>
360
 
@@ -474,6 +399,7 @@ for wav_path, uttid in zip(batch_wav_path, batch_uttid):
474
  # {'uttid': 'hello_en', 'text': 'Hello speech.', 'sentences': [{'start_ms': 260, 'end_ms': 1820, 'text': 'Hello speech.', 'asr_confidence': 0.933, 'lang': 'en', 'lang_confidence': 0.993}], 'vad_segments_ms': [(260, 1820)], 'dur_s': 2.24, 'words': [{'start_ms': 400, 'end_ms': 960, 'text': 'hello'}, {'start_ms': 960, 'end_ms': 1666, 'text': 'speech'}], 'wav_path': 'assets/hello_en.wav'}
475
  ```
476
 
 
477
 
478
 
479
  ## FAQ
@@ -483,8 +409,107 @@ for wav_path, uttid in zip(batch_wav_path, batch_uttid):
483
 
484
  **Q: What are the input length limitations of ASR models?**
485
 
486
- - FireRedASR2-AED supports audio input up to 60s. Input longer than 60s may cause hallucination issues, and input exceeding 200s will trigger positional encoding errors.
487
- - FireRedASR2-LLM supports audio input up to 30s. The behavior for longer input is untested.
488
 
489
 
490
  ## Acknowledgements
@@ -492,3 +517,14 @@ Thanks to the following open-source works:
492
  - [Qwen](https://huggingface.co/Qwen)
493
  - [WenetSpeech-Yue](https://github.com/ASLP-lab/WenetSpeech-Yue)
494
  - [WenetSpeech-Chuan](https://github.com/ASLP-lab/WenetSpeech-Chuan)
15
  - asr
16
  ---
17
 
18
+
19
  <div align="center">
20
  <h1>
21
  FireRedASR2S
 
25
 
26
  </div>
27
 
28
+ [[Code]](https://github.com/FireRedTeam/FireRedASR2S)
29
+ [[Paper]](https://arxiv.org/pdf/2603.10420)
30
+ [[Model🤗]](https://huggingface.co/collections/FireRedTeam/fireredasr2s)
31
+ [[Model🤖]](https://www.modelscope.cn/collections/xukaituo/FireRedASR2S)
32
  [[Demo]](https://huggingface.co/spaces/FireRedTeam/FireRedASR)
33
 
34
 
35
  FireRedASR2S is a state-of-the-art (SOTA), industrial-grade, all-in-one ASR system with ASR, VAD, LID, and Punc modules. All modules achieve SOTA performance:
36
+ - **FireRedASR2**: Automatic Speech Recognition (ASR) supporting speech and singing transcription for Chinese (Mandarin and 20+ dialects/accents), English, and code-switching. 2.89% average CER on 4 public Mandarin benchmarks and 11.55% on 19 Chinese dialect/accent benchmarks, **outperforming Doubao-ASR, Qwen3-ASR-1.7B, Fun-ASR, and Fun-ASR-Nano-2512**. FireRedASR2-AED also supports word-level timestamps and confidence scores.
37
+ - **FireRedVAD**: Voice Activity Detection (VAD) supporting speech/singing/music in 100+ languages. 97.57% F1, **outperforming Silero-VAD, TEN-VAD, FunASR-VAD, and WebRTC-VAD**. Supports non-streaming/streaming VAD and Multi-label VAD (mVAD).
38
+ - **FireRedLID**: Spoken Language Identification (LID) supporting 100+ languages and 20+ Chinese dialects/accents. 97.18% accuracy, **outperforming Whisper and SpeechBrain**.
39
  - **FireRedPunc**: Punctuation Prediction (Punc) for Chinese and English. 78.90% average F1, outperforming FunASR-Punc (62.77%).
40
 
41
  *`2S`: `2`nd-generation FireRedASR, now expanded to an all-in-one ASR `S`ystem*
42
 
43
 
44
  ## 🔥 News
45
+ - [2026.03.12] 🔥 We release the FireRedASR2S technical report. See [arXiv](https://arxiv.org/abs/2603.10420).
46
+ - [2026.03.05] 🚀 [vLLM](https://github.com/vllm-project/vllm/pull/35727) supports FireRedASR2-LLM. See the [vLLM Usage](https://github.com/FireRedTeam/FireRedASR2S?tab=readme-ov-file#vllm-usage) section.
47
+ - [2026.02.25] 🔥 We release **FireRedASR2-LLM model weights**. [🤗](https://huggingface.co/FireRedTeam/FireRedASR2-LLM) [🤖](https://www.modelscope.cn/models/xukaituo/FireRedASR2-LLM/)
48
+ - [2026.02.13] 🚀 Support TensorRT-LLM inference acceleration for FireRedASR2-AED (contributed by NVIDIA). A benchmark on the AISHELL-1 test set shows a **12.7x speedup** over the PyTorch baseline (single H20).
49
+ - [2026.02.12] 🔥 We release FireRedASR2S (FireRedASR2-AED, FireRedVAD, FireRedLID, and FireRedPunc) with **model weights and inference code**. Download links below. Technical report and finetuning code coming soon.
50
 
51
 
52
 
 
54
 
55
  |Model|Supported Languages & Dialects|Download|
56
  |:-------------:|:---------------------------------:|:----------:|
57
+ |FireRedASR2-LLM| Chinese (Mandarin and 20+ dialects/accents<sup>*</sup>), English, Code-Switching | [🤗](https://huggingface.co/FireRedTeam/FireRedASR2-LLM) \| [🤖](https://www.modelscope.cn/models/xukaituo/FireRedASR2-LLM/)|
58
+ |FireRedASR2-AED| Chinese (Mandarin and 20+ dialects/accents<sup>*</sup>), English, Code-Switching | [🤗](https://huggingface.co/FireRedTeam/FireRedASR2-AED) \| [🤖](https://www.modelscope.cn/models/xukaituo/FireRedASR2-AED/)|
59
+ |FireRedVAD | 100+ languages, 20+ Chinese dialects/accents<sup>*</sup> | [🤗](https://huggingface.co/FireRedTeam/FireRedVAD) \| [🤖](https://www.modelscope.cn/models/xukaituo/FireRedVAD/)|
60
+ |FireRedLID | 100+ languages, 20+ Chinese dialects/accents<sup>*</sup> | [🤗](https://huggingface.co/FireRedTeam/FireRedLID) \| [🤖](https://www.modelscope.cn/models/xukaituo/FireRedLID/)|
61
+ |FireRedPunc| Chinese, English | [🤗](https://huggingface.co/FireRedTeam/FireRedPunc) \| [🤖](https://www.modelscope.cn/models/xukaituo/FireRedPunc/)|
62
 
63
  <sup>*</sup>Supported Chinese dialects/accents: Cantonese (Hong Kong & Guangdong), Sichuan, Shanghai, Wu, Minnan, Anhui, Fujian, Gansu, Guizhou, Hebei, Henan, Hubei, Hunan, Jiangxi, Liaoning, Ningxia, Shaanxi, Shanxi, Shandong, Tianjin, Yunnan, etc.
64
 
65
 
66
 
67
  ## Method
68
+ ### FireRedASR2S: System Overview
69
+ ![Model](./assets/FireRedASR2S.png)
70
+
71
  ### FireRedASR2
72
 FireRedASR2 builds upon [FireRedASR](https://github.com/FireRedTeam/FireRedASR) with improved accuracy, and is designed to meet diverse application requirements for both superior performance and optimal efficiency. It comprises two variants:
73
  - **FireRedASR2-LLM**: Designed to achieve state-of-the-art performance and to enable seamless end-to-end speech interaction. It adopts an Encoder-Adapter-LLM framework leveraging large language model (LLM) capabilities.
74
  - **FireRedASR2-AED**: Designed to balance high performance and computational efficiency and to serve as an effective speech representation module in LLM-based speech models. It utilizes an Attention-based Encoder-Decoder (AED) architecture.
75
 
76
+ ![Model](./assets/FireRedASR2.png)
77
 
78
  ### Other Modules
79
+ - **FireRedVAD**: DFSMN-based non-streaming/streaming Voice Activity Detection and Multi-label VAD (mVAD). mVAD can be viewed as a lightweight Audio Event Detection (AED) system specialized for a small set of sound categories (speech/singing/music).
80
+ - **FireRedLID**: Encoder-Decoder-based Spoken Language Identification. See [FireRedLID README](./fireredasr2s/fireredlid/README.md) for language details.
81
  - **FireRedPunc**: BERT-based Punctuation Prediction.
82
 
83
 
84
  ## Quick Start
85
  ### Setup
86
  1. Create a clean Python environment:
 
102
  ```bash
103
  # Download via ModelScope (recommended for users in China)
104
  pip install -U modelscope
105
+ modelscope download --model xukaituo/FireRedASR2-AED --local_dir ./pretrained_models/FireRedASR2-AED
106
+ modelscope download --model xukaituo/FireRedVAD --local_dir ./pretrained_models/FireRedVAD
107
+ modelscope download --model xukaituo/FireRedLID --local_dir ./pretrained_models/FireRedLID
108
+ modelscope download --model xukaituo/FireRedPunc --local_dir ./pretrained_models/FireRedPunc
109
+ modelscope download --model xukaituo/FireRedASR2-LLM --local_dir ./pretrained_models/FireRedASR2-LLM
110
 
111
  # Download via Hugging Face
112
  pip install -U "huggingface_hub[cli]"
 
114
  huggingface-cli download FireRedTeam/FireRedVAD --local-dir ./pretrained_models/FireRedVAD
115
  huggingface-cli download FireRedTeam/FireRedLID --local-dir ./pretrained_models/FireRedLID
116
  huggingface-cli download FireRedTeam/FireRedPunc --local-dir ./pretrained_models/FireRedPunc
117
+ huggingface-cli download FireRedTeam/FireRedASR2-LLM --local-dir ./pretrained_models/FireRedASR2-LLM
118
  ```
119
 
120
  4. Convert your audio to **16kHz 16-bit mono PCM** format if needed:
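As a sanity check before inference, the target format can be verified programmatically; a minimal sketch using Python's stdlib `wave` module (illustrative only, not part of the FireRedASR2S toolkit — conversion itself can be done with any tool such as ffmpeg or sox):

```python
import wave

def is_asr_ready(path: str) -> bool:
    """Return True if `path` is a 16 kHz, 16-bit, mono PCM WAV file."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000    # 16 kHz sample rate
                and w.getsampwidth() == 2    # 16-bit samples
                and w.getnchannels() == 1)   # mono
```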
 
165
  $ bash inference_asr_aed.sh
166
  $ bash inference_asr_llm.sh
167
 
168
+ # VAD & mVAD (mVAD=Audio Event Detection, AED)
169
  $ cd examples_infer/vad
170
  $ bash inference_vad.sh
171
  $ bash inference_streamvad.sh
 
180
  $ bash inference_punc.sh
181
  ```
182
 
183
+ ### vLLM Usage
184
+ ```shell
185
+ # Serve FireRedASR2-LLM with the latest vLLM for the highest performance.
186
+ # For more details, see https://github.com/vllm-project/vllm/pull/35727.
187
+ $ vllm serve allendou/FireRedASR2-LLM-vllm -tp=2 --dtype=float32
188
+ $ python3 examples/online_serving/openai_transcription_client.py --repetition_penalty=1.0 --audio_path=/root/hello_zh.wav
189
+ ```
190
 
191
  ### Python API Usage
192
  Set up `PYTHONPATH` first: `export PYTHONPATH=$PWD/:$PYTHONPATH`
 
279
  </details>
280
 
281
 
282
+ #### mVAD (Audio Event Detection, AED)
283
  <details>
284
  <summary>Click to expand</summary>
285
 
 
399
  # {'uttid': 'hello_en', 'text': 'Hello speech.', 'sentences': [{'start_ms': 260, 'end_ms': 1820, 'text': 'Hello speech.', 'asr_confidence': 0.933, 'lang': 'en', 'lang_confidence': 0.993}], 'vad_segments_ms': [(260, 1820)], 'dur_s': 2.24, 'words': [{'start_ms': 400, 'end_ms': 960, 'text': 'hello'}, {'start_ms': 960, 'end_ms': 1666, 'text': 'speech'}], 'wav_path': 'assets/hello_en.wav'}
400
  ```
401
 
402
+ **Note:** The `FireRedASR2S` code has only been tested on Ubuntu 22.04; behavior on other Linux distributions or on Windows is untested.
403
 
404
 
405
  ## FAQ
 
409
 
410
  **Q: What are the input length limitations of ASR models?**
411
 
412
+ - **FireRedASR2-AED** supports audio input **up to 60s**. Input longer than 60s may cause hallucination issues, and input exceeding 200s will trigger positional encoding errors.
413
+ - **FireRedASR2-LLM** supports audio input **up to 40s**. The behavior for longer input is untested.
414
+ - **FireRedASR2-LLM Batch Beam Search**: When performing batch beam search with FireRedASR2-LLM, even though attention masks are applied, we recommend keeping utterance lengths within a batch similar. If lengths differ significantly, shorter utterances may exhibit **repetition issues**. Either sort your dataset by length or set `batch_size` to 1 to avoid this.
415
+
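The length-sorting workaround above can be sketched as follows (an illustration only; the `(uttid, duration)` pairs and `make_batches` helper are hypothetical, not part of the FireRedASR2S API):

```python
def make_batches(utts, batch_size=8):
    """Group utterances of similar duration to reduce padding mismatch
    in batch beam search.

    utts: list of (uttid, duration_s) pairs.
    Returns a list of length-sorted batches.
    """
    ordered = sorted(utts, key=lambda u: u[1])  # shortest first
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]
```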
416
+
417
+
418
+ ## Evaluation
419
+ ### FireRedASR2
420
+ Metrics: Character Error Rate (CER%) for Chinese and Word Error Rate (WER%) for English. Lower is better.
421
+
422
+ We evaluate FireRedASR2 on 24 public test sets covering Mandarin, 20+ Chinese dialects/accents, and singing.
423
+
424
+ - **Mandarin (4 test sets)**: 2.89% (LLM) / 3.05% (AED) average CER, outperforming Doubao-ASR (3.69%), Qwen3-ASR-1.7B (3.76%), Fun-ASR (4.16%) and Fun-ASR-Nano-2512 (4.55%).
425
+ - **Dialects (19 test sets)**: 11.55% (LLM) / 11.67% (AED) average CER, outperforming Doubao-ASR (15.39%), Qwen3-ASR-1.7B (11.85%), Fun-ASR (12.76%) and Fun-ASR-Nano-2512 (15.07%).
426
+
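CER is the character-level edit distance between hypothesis and reference, divided by the reference length; a minimal sketch of the metric (not the exact evaluation script used for these benchmarks):

```python
def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: (substitutions + deletions + insertions) / len(ref),
    computed via Levenshtein edit distance."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n] / max(m, 1)
```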
427
+ *Note: FRASR2=FireRedASR2, ws=WenetSpeech, md=MagicData, conv=Conversational, daily=Daily-use.*
428
+
429
+ |ID|Testset\CER\Model|FRASR2-LLM|FRASR2-AED|Doubao-ASR|Qwen3-ASR|Fun-ASR|
430
+ |:--:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
431
+ |Avg|**All(1-24)** |**9.67** |**9.80** |12.98 |10.12 |10.92 |
432
+ |Avg|**Mandarin(1-4)** |**2.89** |**3.05** |3.69 |3.76 |4.16 |
433
+ |Avg|**Dialect(5-23)** |**11.55**|**11.67**|15.39|11.85|12.76|
434
+ |1 |aishell1 |0.64 |0.57 |1.52 |1.48 |1.64 |
435
+ |2 |aishell2 |2.15 |2.51 |2.77 |2.71 |2.38 |
436
+ |3 |ws-net |4.44 |4.57 |5.73 |4.97 |6.85 |
437
+ |4 |ws-meeting |4.32 |4.53 |4.74 |5.88 |5.78 |
438
+ |5 |kespeech |3.08 |3.60 |5.38 |5.10 |5.36 |
439
+ |6 |ws-yue-short |5.14 |5.15 |10.51|5.82 |7.34 |
440
+ |7 |ws-yue-long |8.71 |8.54 |11.39|8.85 |10.14|
441
+ |8 |ws-chuan-easy |10.90|10.60|11.33|11.99|12.46|
442
+ |9 |ws-chuan-hard |20.71|21.35|20.77|21.63|22.49|
443
+ |10|md-heavy |7.42 |7.43 |7.69 |8.02 |9.13 |
444
+ |11|md-yue-conv |12.23|11.66|26.25|9.76 |33.71|
445
+ |12|md-yue-daily |3.61 |3.35 |12.82|3.66 |2.69 |
446
+ |13|md-yue-vehicle |4.50 |4.83 |8.66 |4.28 |6.00 |
447
+ |14|md-chuan-conv |13.18|13.07|11.77|14.35|14.01|
448
+ |15|md-chuan-daily |4.90 |5.17 |3.90 |4.93 |3.98 |
449
+ |16|md-shanghai-conv |28.70|27.02|45.15|29.77|25.49|
450
+ |17|md-shanghai-daily |24.94|24.18|44.06|23.93|12.55|
451
+ |18|md-wu |7.15 |7.14 |7.70 |7.57 |10.63|
452
+ |19|md-zhengzhou-conv |10.20|10.65|9.83 |9.55 |10.85|
453
+ |20|md-zhengzhou-daily|5.80 |6.26 |5.77 |5.88 |6.29 |
454
+ |21|md-wuhan |9.60 |10.81|9.94 |10.22|4.34 |
455
+ |22|md-tianjin |15.45|15.30|15.79|16.16|19.27|
456
+ |23|md-changsha |23.18|25.64|23.76|23.70|25.66|
457
+ |24|opencpop |1.12 |1.17 |4.36 |2.57 |3.05 |
458
+
459
+
460
+ ### FireRedVAD
461
+ <details>
462
+ <summary>Click to expand</summary>
463
+ We evaluate FireRedVAD on FLEURS-VAD-102, a multilingual VAD benchmark covering 102 languages.
464
+
465
+ FireRedVAD achieves SOTA performance, outperforming Silero-VAD, TEN-VAD, FunASR-VAD, and WebRTC-VAD.
466
+
467
+ |Metric\Model|FireRedVAD|Silero-VAD|TEN-VAD|FunASR-VAD|WebRTC-VAD|
468
+ |:-------:|:-----:|:------:|:------:|:------:|:------:|
469
+ |AUC-ROC↑ |**99.60**|97.99|97.81|- |- |
470
+ |F1 score↑ |**97.57**|95.95|95.19|90.91|52.30|
471
+ |False Alarm Rate↓ |**2.69** |9.41 |15.47|44.03|2.83 |
472
+ |Miss Rate↓|3.62 |3.95 |2.95 |0.42 |64.15|
473
+
474
+ FLEURS-VAD-102: We randomly selected ~100 audio files per language from the [FLEURS test set](https://huggingface.co/datasets/google/fleurs), resulting in 9,443 audio files with manually annotated binary VAD labels (speech=1, silence=0). This VAD test set will be open-sourced (coming soon).
475
+
476
+ Note: FunASR-VAD achieves low Miss Rate but at the cost of high False Alarm Rate (44.03%), indicating over-prediction of speech segments.
477
+ </details>
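For reference, frame-level VAD metrics like those in the table can be computed from binary labels as sketched below (an illustration under common definitions of false alarm and miss rate, not the benchmark's exact scoring code):

```python
def vad_metrics(labels, preds):
    """labels/preds: per-frame 0/1 sequences (1 = speech).
    Returns (f1, false_alarm_rate, miss_rate), each in percent."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    far = fp / (fp + tn) if fp + tn else 0.0   # speech predicted on non-speech frames
    miss = fn / (fn + tp) if fn + tp else 0.0  # speech frames missed
    return 100 * f1, 100 * far, 100 * miss
```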
478
+
479
+
480
+ ### FireRedLID
481
+ <details>
482
+ <summary>Click to expand</summary>
483
+ Metric: Utterance-level LID Accuracy (%). Higher is better.
484
+
485
+ We evaluate FireRedLID on multilingual and Chinese dialect benchmarks.
486
+
487
+ FireRedLID achieves SOTA performance, outperforming Whisper, SpeechBrain-LID, and Dolphin.
488
+
489
+ |Testset\Model|Languages|FireRedLID|Whisper|SpeechBrain|Dolphin|
490
+ |:-----------------:|:---------:|:---------:|:-----:|:---------:|:-----:|
491
+ |FLEURS test |82 languages |**97.18** |79.41 |92.91 |-|
492
+ |CommonVoice test |74 languages |**92.07** |80.81 |78.75 |-|
493
+ |KeSpeech + MagicData|20+ Chinese dialects/accents |**88.47** |-|-|69.01|
494
+ </details>
495
+
496
+
497
+ ### FireRedPunc
498
+ <details>
499
+ <summary>Click to expand</summary>
500
+ Metric: Precision/Recall/F1 Score (%). Higher is better.
501
+
502
+ We evaluate FireRedPunc on multi-domain Chinese and English benchmarks.
503
+
504
+ FireRedPunc achieves SOTA performance, outperforming FunASR-Punc (CT-Transformer).
505
+
506
+ |Testset\Model|#Sentences|FireRedPunc|FunASR-Punc|
507
+ |:------------------:|:---------:|:--------------:|:-----------------:|
508
+ |Multi-domain Chinese| 88,644 |**82.84 / 83.08 / 82.96** | 77.27 / 74.03 / 75.62 |
509
+ |Multi-domain English| 28,641 |**78.40 / 71.57 / 74.83** | 55.79 / 45.15 / 49.91 |
510
+ |Average F1 Score | - |**78.90** | 62.77 |
511
+
512
+ </details>
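The F1 values in the table follow from precision and recall in the usual way, as the harmonic mean of the two; for example, the Chinese row's 82.96 comes from 82.84 precision and 83.08 recall:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (all values in percent)."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(82.84, 83.08), 2))  # Chinese row -> 82.96
print(round(f1(78.40, 71.57), 2))  # English row -> 74.83
```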
513
 
514
 
515
  ## Acknowledgements
 
517
  - [Qwen](https://huggingface.co/Qwen)
518
  - [WenetSpeech-Yue](https://github.com/ASLP-lab/WenetSpeech-Yue)
519
  - [WenetSpeech-Chuan](https://github.com/ASLP-lab/WenetSpeech-Chuan)
520
+
521
+
522
+ ## Citation
523
+ ```bibtex
524
+ @article{xu2026fireredasr2s,
525
+ title={FireRedASR2S: A State-of-the-Art Industrial-Grade All-in-One Automatic Speech Recognition System},
526
+ author={Xu, Kaituo and Jia, Yan and Huang, Kai and Chen, Junjie and Li, Wenpeng and Liu, Kun and Xie, Feng-Long and Tang, Xu and Hu, Yao},
527
+ journal={arXiv preprint arXiv:2603.10420},
528
+ year={2026}
529
+ }
530
+ ```