FireRedTeam committed on
Commit 14fc49d · verified · 1 Parent(s): 6651a87

Update README.md

Files changed (1): README.md +10 -3

README.md CHANGED
@@ -35,8 +35,9 @@ FireRedASR2S is a state-of-the-art (SOTA), industrial-grade, all-in-one ASR syst
 
 
 ## 🔥 News
+ - [2026.02.25] 🔥 We release **FireRedASR2-LLM model weights**. [🤗](https://huggingface.co/FireRedTeam/FireRedASR2-LLM) [🤖](https://www.modelscope.cn/models/xukaituo/FireRedASR2-LLM/)
  - [2026.02.13] 🚀 Support TensorRT-LLM inference acceleration for FireRedASR2-AED (contributed by NVIDIA). Benchmark on AISHELL-1 test set shows **12.7x speedup** over PyTorch baseline (single H20).
- - [2026.02.12] We release FireRedASR2S (FireRedASR2-AED, FireRedVAD, FireRedLID, and FireRedPunc) with **model weights and inference code**. Download links below. Technical report and finetuning code coming soon.
+ - [2026.02.12] 🔥 We release FireRedASR2S (FireRedASR2-AED, FireRedVAD, FireRedLID, and FireRedPunc) with **model weights and inference code**. Download links below. Technical report and finetuning code coming soon.
 
 
 
@@ -44,7 +45,8 @@ FireRedASR2S is a state-of-the-art (SOTA), industrial-grade, all-in-one ASR syst
 
 |Model|Supported Languages & Dialects|Download|
 |:-------------:|:---------------------------------:|:----------:|
- |FireRedASR2| Chinese (Mandarin and 20+ dialects/accents<sup>*</sup>), English, Code-Switching | [🤗](https://huggingface.co/FireRedTeam/FireRedASR2-AED) \| [🤖](https://www.modelscope.cn/models/xukaituo/FireRedASR2-AED/)|
+ |FireRedASR2-LLM| Chinese (Mandarin and 20+ dialects/accents<sup>*</sup>), English, Code-Switching | [🤗](https://huggingface.co/FireRedTeam/FireRedASR2-LLM) \| [🤖](https://www.modelscope.cn/models/xukaituo/FireRedASR2-LLM/)|
+ |FireRedASR2-AED| Chinese (Mandarin and 20+ dialects/accents<sup>*</sup>), English, Code-Switching | [🤗](https://huggingface.co/FireRedTeam/FireRedASR2-AED) \| [🤖](https://www.modelscope.cn/models/xukaituo/FireRedASR2-AED/)|
 |FireRedVAD | 100+ languages, 20+ Chinese dialects/accents<sup>*</sup> | [🤗](https://huggingface.co/FireRedTeam/FireRedVAD) \| [🤖](https://www.modelscope.cn/models/xukaituo/FireRedVAD/)|
 |FireRedLID | 100+ languages, 20+ Chinese dialects/accents<sup>*</sup> | [🤗](https://huggingface.co/FireRedTeam/FireRedLID) \| [🤖](https://www.modelscope.cn/models/xukaituo/FireRedLID/)|
 |FireRedPunc| Chinese, English | [🤗](https://huggingface.co/FireRedTeam/FireRedPunc) \| [🤖](https://www.modelscope.cn/models/xukaituo/FireRedPunc/)|
@@ -59,6 +61,8 @@ FireRedASR2 builds upon [FireRedASR](https://github.com/FireRedTeam/FireRedASR)
 - **FireRedASR2-LLM**: Designed to achieve state-of-the-art performance and to enable seamless end-to-end speech interaction. It adopts an Encoder-Adapter-LLM framework leveraging large language model (LLM) capabilities.
 - **FireRedASR2-AED**: Designed to balance high performance and computational efficiency and to serve as an effective speech representation module in LLM-based speech models. It utilizes an Attention-based Encoder-Decoder (AED) architecture.
 
+ ![Model](./assets/FireRedASR2_model.png)
+
 ### Other Modules
 - **FireRedVAD**: DFSMN-based non-streaming/streaming Voice Activity Detection and Audio Event Detection.
 - **FireRedLID**: FireRedASR2-based Spoken Language Identification. See [FireRedLID README](./fireredasr2s/fireredlid/README.md) for language details.
@@ -185,6 +189,7 @@ modelscope download --model xukaituo/FireRedASR2-AED --local_dir ./pretrained_mo
 modelscope download --model xukaituo/FireRedVAD --local_dir ./pretrained_models/FireRedVAD
 modelscope download --model xukaituo/FireRedLID --local_dir ./pretrained_models/FireRedLID
 modelscope download --model xukaituo/FireRedPunc --local_dir ./pretrained_models/FireRedPunc
+ modelscope download --model xukaituo/FireRedASR2-LLM --local_dir ./pretrained_models/FireRedASR2-LLM
 
 # Download via Hugging Face
 pip install -U "huggingface_hub[cli]"
@@ -192,6 +197,7 @@ huggingface-cli download FireRedTeam/FireRedASR2-AED --local-dir ./pretrained_mo
 huggingface-cli download FireRedTeam/FireRedVAD --local-dir ./pretrained_models/FireRedVAD
 huggingface-cli download FireRedTeam/FireRedLID --local-dir ./pretrained_models/FireRedLID
 huggingface-cli download FireRedTeam/FireRedPunc --local-dir ./pretrained_models/FireRedPunc
+ huggingface-cli download FireRedTeam/FireRedASR2-LLM --local-dir ./pretrained_models/FireRedASR2-LLM
 ```
 
 4. Convert your audio to **16kHz 16-bit mono PCM** format if needed:
@@ -469,6 +475,7 @@ for wav_path, uttid in zip(batch_wav_path, batch_uttid):
 # {'uttid': 'hello_en', 'text': 'Hello speech.', 'sentences': [{'start_ms': 260, 'end_ms': 1820, 'text': 'Hello speech.', 'asr_confidence': 0.933, 'lang': 'en', 'lang_confidence': 0.993}], 'vad_segments_ms': [(260, 1820)], 'dur_s': 2.24, 'words': [{'start_ms': 400, 'end_ms': 960, 'text': 'hello'}, {'start_ms': 960, 'end_ms': 1666, 'text': 'speech'}], 'wav_path': 'assets/hello_en.wav'}
 ```
 
+ **Note:** `FireRedASR2S` code has only been tested on Linux Ubuntu 22.04. Behavior on other Linux distributions or Windows has not been tested.
 
 
 ## FAQ
@@ -479,7 +486,7 @@ for wav_path, uttid in zip(batch_wav_path, batch_uttid):
 **Q: What are the input length limitations of ASR models?**
 
 - FireRedASR2-AED supports audio input up to 60s. Input longer than 60s may cause hallucination issues, and input exceeding 200s will trigger positional encoding errors.
- - FireRedASR2-LLM supports audio input up to 30s. The behavior for longer input is untested.
+ - FireRedASR2-LLM supports audio input up to 40s. The behavior for longer input is untested. Batch Beam Search: When performing batch beam search with FireRedASR2-LLM, even though attention masks are applied, it is recommended to ensure that the input lengths of the utterances are similar. If there are significant differences in utterance lengths, shorter utterances may experience repetition issues. You can either sort your dataset by length or set `batch_size` to 1 to avoid the repetition issue.
 
 
 ## Acknowledgements
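
The FAQ entry added in the last hunk recommends grouping utterances of similar length when running batch beam search with FireRedASR2-LLM. A minimal sketch of that preprocessing step is below; it is not part of the FireRedASR2S codebase, and the `make_batches` helper and the `(uttid, duration)` pairs are illustrative assumptions.

```python
# Sketch: sort utterances by duration before batching so each batch contains
# similar-length inputs, mitigating the repetition issue noted in the FAQ for
# FireRedASR2-LLM batch beam search. Setting batch_size=1 also avoids it.

def make_batches(utts, batch_size):
    """Group (uttid, dur_s) pairs into duration-sorted batches."""
    ordered = sorted(utts, key=lambda u: u[1])  # shortest first
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

# Hypothetical utterances with their durations in seconds.
utts = [("a", 12.0), ("b", 3.1), ("c", 11.5), ("d", 2.9)]
batches = make_batches(utts, batch_size=2)
# Each batch now pairs similar durations:
# [[("d", 2.9), ("b", 3.1)], [("c", 11.5), ("a", 12.0)]]
```

Each batch can then be fed to the recognizer in turn, so no short utterance is padded against a much longer one.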