FireRedTeam committed · Commit 522e5f6 · verified · 1 Parent(s): 32150b6

Update README.md

Files changed (1): README.md (+18 -10)
README.md CHANGED
@@ -18,6 +18,7 @@ A SOTA Industrial-Grade All-in-One ASR System
 
 </div>
 
+ [[Code]](https://github.com/FireRedTeam/FireRedASR2S)
 [[Paper]](https://arxiv.org/pdf/2501.14350)
 [[Model]](https://huggingface.co/FireRedTeam)
 [[Blog]](https://fireredteam.github.io/demos/firered_asr/)
@@ -34,7 +35,9 @@ FireRedASR2S is a state-of-the-art (SOTA), industrial-grade, all-in-one ASR system
 
 
 ## 🔥 News
- - [2026.02.12] We release FireRedASR2S (FireRedASR2-AED, FireRedVAD, FireRedLID, and FireRedPunc) with model weights and inference code. Download links below. Technical report and finetuning code coming soon.
+ - [2026.02.25] 🔥 We release **FireRedASR2-LLM model weights**. [🤗](https://huggingface.co/FireRedTeam/FireRedASR2-LLM) [🤖](https://www.modelscope.cn/models/xukaituo/FireRedASR2-LLM/)
+ - [2026.02.13] 🚀 We now support TensorRT-LLM inference acceleration for FireRedASR2-AED (contributed by NVIDIA). A benchmark on the AISHELL-1 test set shows a **12.7x speedup** over the PyTorch baseline (single H20 GPU).
+ - [2026.02.12] 🔥 We release FireRedASR2S (FireRedASR2-AED, FireRedVAD, FireRedLID, and FireRedPunc) with **model weights and inference code**. Download links below. Technical report and finetuning code coming soon.
 
 
@@ -42,10 +45,11 @@ FireRedASR2S is a state-of-the-art (SOTA), industrial-grade, all-in-one ASR system
 
 |Model|Supported Languages & Dialects|Download|
 |:-------------:|:---------------------------------:|:----------:|
- |FireRedASR2| Chinese (Mandarin and 20+ dialects/accents<sup>*</sup>), English, Code-Switching | [🤗](https://huggingface.co/FireRedTeam/FireRedASR2-AED) \| [🤖](https://modelscope.cn/collections/FireRedTeam/FireRedASR2S)|
- |FireRedVAD | 100+ languages, 20+ Chinese dialects/accents<sup>*</sup> | [🤗](https://huggingface.co/FireRedTeam/FireRedVAD) \| [🤖](https://modelscope.cn/collections/FireRedTeam/FireRedVAD)|
- |FireRedLID | 100+ languages, 20+ Chinese dialects/accents<sup>*</sup> | [🤗](https://huggingface.co/FireRedTeam/FireRedLID) \| [🤖](https://modelscope.cn/collections/FireRedTeam/FireRedLID)|
- |FireRedPunc| Chinese, English | [🤗](https://huggingface.co/FireRedTeam/FireRedPunc) \| [🤖](https://modelscope.cn/collections/FireRedTeam/FireRedPunc)|
+ |FireRedASR2-LLM| Chinese (Mandarin and 20+ dialects/accents<sup>*</sup>), English, Code-Switching | [🤗](https://huggingface.co/FireRedTeam/FireRedASR2-LLM) \| [🤖](https://www.modelscope.cn/models/xukaituo/FireRedASR2-LLM/)|
+ |FireRedASR2-AED| Chinese (Mandarin and 20+ dialects/accents<sup>*</sup>), English, Code-Switching | [🤗](https://huggingface.co/FireRedTeam/FireRedASR2-AED) \| [🤖](https://www.modelscope.cn/models/xukaituo/FireRedASR2-AED/)|
+ |FireRedVAD | 100+ languages, 20+ Chinese dialects/accents<sup>*</sup> | [🤗](https://huggingface.co/FireRedTeam/FireRedVAD) \| [🤖](https://www.modelscope.cn/models/xukaituo/FireRedVAD/)|
+ |FireRedLID | 100+ languages, 20+ Chinese dialects/accents<sup>*</sup> | [🤗](https://huggingface.co/FireRedTeam/FireRedLID) \| [🤖](https://www.modelscope.cn/models/xukaituo/FireRedLID/)|
+ |FireRedPunc| Chinese, English | [🤗](https://huggingface.co/FireRedTeam/FireRedPunc) \| [🤖](https://www.modelscope.cn/models/xukaituo/FireRedPunc/)|
 
 <sup>*</sup>Supported Chinese dialects/accents: Cantonese (Hong Kong & Guangdong), Sichuan, Shanghai, Wu, Minnan, Anhui, Fujian, Gansu, Guizhou, Hebei, Henan, Hubei, Hunan, Jiangxi, Liaoning, Ningxia, Shaanxi, Shanxi, Shandong, Tianjin, Yunnan, etc.
@@ -57,6 +61,7 @@ FireRedASR2 builds upon [FireRedASR](https://github.com/FireRedTeam/FireRedASR)
 - **FireRedASR2-LLM**: Designed to achieve state-of-the-art performance and to enable seamless end-to-end speech interaction. It adopts an Encoder-Adapter-LLM framework leveraging large language model (LLM) capabilities.
 - **FireRedASR2-AED**: Designed to balance high performance and computational efficiency and to serve as an effective speech representation module in LLM-based speech models. It utilizes an Attention-based Encoder-Decoder (AED) architecture.
 
+ ![Model](./assets/FireRedASR2_model.png)
 
 ### Other Modules
 - **FireRedVAD**: DFSMN-based non-streaming/streaming Voice Activity Detection and Audio Event Detection.
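A typical downstream step for VAD output like FireRedVAD's is to merge speech segments separated by short silences before feeding them to ASR. The helper below is a generic illustrative sketch, not part of the FireRedASR2S repo; `max_gap_ms` and the segment format `(start_ms, end_ms)` follow the `vad_segments_ms` convention shown later in this README.

```python
def merge_segments(segments_ms, max_gap_ms=300):
    """Merge (start_ms, end_ms) speech segments whose gap is short.

    Illustrative post-processing for VAD output; not a repo API.
    """
    merged = []
    for start, end in sorted(segments_ms):
        if merged and start - merged[-1][1] <= max_gap_ms:
            # Gap is small enough: extend the previous segment.
            merged[-1] = (merged[-1][0], max(end, merged[-1][1]))
        else:
            merged.append((start, end))
    return merged

print(merge_segments([(260, 1820), (1900, 2500), (4000, 4800)]))
# [(260, 2500), (4000, 4800)]
```

The 80 ms pause between the first two segments is bridged, while the 1.5 s silence before the last segment keeps it separate.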
@@ -180,10 +185,11 @@ $ export PYTHONPATH=$PWD/:$PYTHONPATH
 ```bash
 # Download via ModelScope (recommended for users in China)
 pip install -U modelscope
- modelscope download --model FireRedTeam/FireRedASR2-AED --local_dir ./pretrained_models/FireRedASR2-AED
- modelscope download --model FireRedTeam/FireRedVAD --local_dir ./pretrained_models/FireRedVAD
- modelscope download --model FireRedTeam/FireRedLID --local_dir ./pretrained_models/FireRedLID
- modelscope download --model FireRedTeam/FireRedPunc --local_dir ./pretrained_models/FireRedPunc
+ modelscope download --model xukaituo/FireRedASR2-AED --local_dir ./pretrained_models/FireRedASR2-AED
+ modelscope download --model xukaituo/FireRedVAD --local_dir ./pretrained_models/FireRedVAD
+ modelscope download --model xukaituo/FireRedLID --local_dir ./pretrained_models/FireRedLID
+ modelscope download --model xukaituo/FireRedPunc --local_dir ./pretrained_models/FireRedPunc
+ modelscope download --model xukaituo/FireRedASR2-LLM --local_dir ./pretrained_models/FireRedASR2-LLM
 
 # Download via Hugging Face
 pip install -U "huggingface_hub[cli]"
@@ -191,6 +197,7 @@ huggingface-cli download FireRedTeam/FireRedASR2-AED --local-dir ./pretrained_mo
 huggingface-cli download FireRedTeam/FireRedVAD --local-dir ./pretrained_models/FireRedVAD
 huggingface-cli download FireRedTeam/FireRedLID --local-dir ./pretrained_models/FireRedLID
 huggingface-cli download FireRedTeam/FireRedPunc --local-dir ./pretrained_models/FireRedPunc
+ huggingface-cli download FireRedTeam/FireRedASR2-LLM --local-dir ./pretrained_models/FireRedASR2-LLM
 ```
 
 4. Convert your audio to **16kHz 16-bit mono PCM** format if needed:
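The 16kHz 16-bit mono PCM requirement can be sanity-checked with the standard-library `wave` module. This is a quick illustrative helper, not part of the repo; `is_asr_ready` and `sample.wav` are made-up names, and the block first writes a tiny silent WAV in the expected format so it is self-contained.

```python
import wave

# Create a 0.1 s silent WAV in the expected format (16 kHz, 16-bit, mono).
with wave.open("sample.wav", "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 16-bit = 2 bytes per sample
    w.setframerate(16000)  # 16 kHz
    w.writeframes(bytes(3200))  # 1600 zero frames of 2 bytes each

def is_asr_ready(path):
    """Check that a WAV file is 16 kHz, 16-bit, mono PCM."""
    with wave.open(path, "rb") as w:
        p = w.getparams()
    return p.nchannels == 1 and p.sampwidth == 2 and p.framerate == 16000

print(is_asr_ready("sample.wav"))  # True
```

Files that fail this check can be converted first (e.g. with any audio tool that resamples to 16 kHz mono 16-bit PCM) before being passed to the models.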
@@ -468,6 +475,7 @@ for wav_path, uttid in zip(batch_wav_path, batch_uttid):
 # {'uttid': 'hello_en', 'text': 'Hello speech.', 'sentences': [{'start_ms': 260, 'end_ms': 1820, 'text': 'Hello speech.', 'asr_confidence': 0.933, 'lang': 'en', 'lang_confidence': 0.993}], 'vad_segments_ms': [(260, 1820)], 'dur_s': 2.24, 'words': [{'start_ms': 400, 'end_ms': 960, 'text': 'hello'}, {'start_ms': 960, 'end_ms': 1666, 'text': 'speech'}], 'wav_path': 'assets/hello_en.wav'}
 ```
 
+ **Note:** The `FireRedASR2S` code has only been tested on Ubuntu 22.04; behavior on other Linux distributions or on Windows is untested.
 
 
 ## FAQ
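The transcription result shown above is a plain Python dict, so it can be post-processed directly. As a small sketch (the `speech_ratio` helper is illustrative, not a repo API), the VAD segments and total duration give the fraction of the audio that contains speech:

```python
# Abbreviated copy of the result dict printed in the example output above.
result = {
    "uttid": "hello_en",
    "text": "Hello speech.",
    "vad_segments_ms": [(260, 1820)],
    "dur_s": 2.24,
}

def speech_ratio(res):
    """Fraction of the audio that VAD marked as speech (illustrative helper)."""
    speech_ms = sum(end - start for start, end in res["vad_segments_ms"])
    return round(speech_ms / (res["dur_s"] * 1000), 3)

print(speech_ratio(result))  # 0.696
```

The same pattern works for the `sentences` and `words` lists, e.g. to emit subtitle timestamps from `start_ms`/`end_ms`.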
@@ -478,7 +486,7 @@ for wav_path, uttid in zip(batch_wav_path, batch_uttid):
 **Q: What are the input length limitations of ASR models?**
 
 - FireRedASR2-AED supports audio input up to 60s. Input longer than 60s may cause hallucination issues, and input exceeding 200s will trigger positional encoding errors.
- - FireRedASR2-LLM supports audio input up to 30s. The behavior for longer input is untested.
+ - FireRedASR2-LLM supports audio input up to 40s; the behavior for longer input is untested. When performing batch beam search with FireRedASR2-LLM, even though attention masks are applied, keep the input lengths within a batch similar: if utterance lengths differ significantly, shorter utterances may suffer from repetition issues. To avoid this, either sort your dataset by length or set `batch_size` to 1.
 
 
 ## Acknowledgements
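The batch beam search advice in the FAQ above (group utterances of similar length into each batch) can be sketched as follows. This is an illustrative snippet, not repo code; the file names, durations, and the `make_batches` helper are made up.

```python
# Hypothetical durations in seconds for a small dataset.
wav_durations = {
    "a.wav": 3.2, "b.wav": 31.5, "c.wav": 2.9, "d.wav": 30.8,
}

def make_batches(durations, batch_size=2):
    """Sort utterances by duration, then slice into batches so that
    each batch contains inputs of similar length."""
    ordered = sorted(durations, key=durations.get)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

print(make_batches(wav_durations))
# [['c.wav', 'a.wav'], ['d.wav', 'b.wav']]
```

Short clips end up batched together and long clips batched together, so no utterance is padded far past its own length, which is what mitigates the repetition issue described in the FAQ.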
 