zhifeixie committed on
Commit aedaed2 · verified · 1 Parent(s): 91a30f1

Update model card figures

.gitattributes CHANGED
@@ -35,3 +35,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
  lora/lora-stage1/tokenizer.json filter=lfs diff=lfs merge=lfs -text
37
  lora/lora-stage2/tokenizer.json filter=lfs diff=lfs merge=lfs -text
38
+ assets/figures/mega_asr_logo.png filter=lfs diff=lfs merge=lfs -text
39
+ assets/figures/radar_results.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,22 +1,22 @@
1
  ---
2
  language:
3
  - en
4
  - zh
5
- license: apache-2.0
6
- pipeline_tag: automatic-speech-recognition
7
- datasets:
8
- - zhifeixie/Voices-in-the-Wild-2M
9
  tags:
10
  - automatic-speech-recognition
11
  - speech-recognition
12
  - audio
13
  - robust-asr
14
  - qwen3-asr
15
  ---
16
 
17
- # Mega-ASR: Towards In-the-wild^2 Speech Recognition
18
 
19
- [**Paper**](https://huggingface.co/papers/2605.19833) | [**Project Page**](https://xzf-thu.github.io/Mega-ASR/) | [**Code**](https://github.com/xzf-thu/Mega-ASR)
20
 
21
  Mega-ASR is a robust automatic speech recognition system designed for real-world audio with severe acoustic degradation. It targets noisy, reverberant, clipped, band-limited, overlapping, and otherwise difficult recording conditions where standard ASR systems often produce empty outputs, omissions, repetitions, or hallucinated text.
22
 
@@ -49,8 +49,6 @@ Mega-ASR is intended for speech-to-text transcription of real-world audio, espec
49
 
50
  ## Quick Start
51
 
52
- ### Installation
53
-
54
  Install the Mega-ASR codebase and dependencies:
55
 
56
  ```bash
@@ -62,7 +60,25 @@ conda activate mega-asr
62
  pip install -r requirements.txt
63
  ```
64
 
65
- ### Python Usage
66
 
67
  ```python
68
  from MegaASR.model.megaASR import MegaASR
@@ -77,9 +93,31 @@ result = model.infer("/path/to/audio.wav", return_route=True)
77
  print(result)
78
  ```
79
 
80
  ## Training Summary
81
 
82
- Mega-ASR is trained for robust speech recognition in realistic acoustic environments. The training pipeline uses acoustic-to-semantic supervised fine-tuning (A2S-SFT) on the **Voices-in-the-Wild-2M** dataset, where the model is exposed to progressively harder speech examples and learns to recover both local acoustic details and sentence-level semantics under degradation.
83
 
84
  ## Evaluation
85
 
@@ -88,6 +126,10 @@ Mega-ASR is evaluated on standard ASR benchmarks, noisy robustness benchmarks, a
88
  - **WER** for English and whitespace-tokenized languages
89
  - **CER** for Chinese and character-based evaluation
90
 
91
  The Mega-ASR repository includes an evaluation script:
92
 
93
  ```bash
@@ -97,6 +139,12 @@ python src/MegaASR/eval/evaluate_wer.py \
97
  --output_jsonl outputs/pred_with_wer.jsonl
98
  ```
99
 
100
  ## Citation
101
 
102
  If you use Mega-ASR, please cite the project:
@@ -115,4 +163,4 @@ If you use Mega-ASR, please cite the project:
115
 
116
  ## Acknowledgements
117
 
118
- Mega-ASR builds on Qwen3-ASR. We thank the Qwen3-ASR team and the creators of public speech and audio datasets used in the project.
 
1
  ---
2
+ license: apache-2.0
3
  language:
4
  - en
5
  - zh
6
  tags:
7
  - automatic-speech-recognition
8
  - speech-recognition
9
  - audio
10
  - robust-asr
11
  - qwen3-asr
12
+ pipeline_tag: automatic-speech-recognition
13
  ---
14
 
15
+ # Mega-ASR
16
 
17
+ <p align="center">
18
+ <img src="assets/figures/mega_asr_logo.png" alt="Mega-ASR overview" width="70%">
19
+ </p>
20
 
21
  Mega-ASR is a robust automatic speech recognition system designed for real-world audio with severe acoustic degradation. It targets noisy, reverberant, clipped, band-limited, overlapping, and otherwise difficult recording conditions where standard ASR systems often produce empty outputs, omissions, repetitions, or hallucinated text.
22
 
 
49
 
50
  ## Quick Start
51
 
52
  Install the Mega-ASR codebase and dependencies:
53
 
54
  ```bash
 
60
  pip install -r requirements.txt
61
  ```
62
 
63
+ Place this checkpoint directory at:
64
+
65
+ ```text
66
+ ckpt/Mega-ASR
67
+ ```
68
+
69
+ Run inference:
70
+
71
+ ```bash
72
+ python infer.py --audio /path/to/audio.wav --ckpt_dir ckpt/Mega-ASR
73
+ ```
74
+
75
+ Disable routing if you want to always use the robust recognition path:
76
+
77
+ ```bash
78
+ python infer.py --audio /path/to/audio.wav --ckpt_dir ckpt/Mega-ASR --routing false
79
+ ```
80
+
81
+ Python usage:
82
 
83
  ```python
84
  from MegaASR.model.megaASR import MegaASR
 
93
  print(result)
94
  ```
95
 
96
+ ## Decoding Defaults
97
+
98
+ The Mega-ASR wrapper uses Qwen3-ASR generation defaults unless explicitly overridden. In the provided wrapper, `max_new_tokens` is set to 256.
99
+
100
+ The default generation configuration is deterministic:
101
+
102
+ ```text
103
+ do_sample: false
104
+ num_beams: 1
105
+ repetition_penalty: 1.0
106
+ top_p: 1.0
107
+ top_k: 50
108
+ ```
109
+
110
+ Because `do_sample` is false, decoding is greedy by default and sampling controls such as temperature, top-p, and top-k do not affect normal inference.
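
The claim that sampling controls are inert under greedy decoding can be illustrated with a small sketch. This is not Mega-ASR or Qwen3-ASR code: `greedy_pick`, `softmax`, and the toy logits are hypothetical names and values chosen for illustration. Dividing logits by a positive temperature reshapes the probability distribution but never changes which token scores highest, so the greedy choice stays the same.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits, temperature=1.0):
    """Return the index of the most probable token (greedy decoding)."""
    probs = softmax(logits, temperature)
    return max(range(len(probs)), key=probs.__getitem__)


# Toy logits for a four-token vocabulary (illustrative values only).
toy_logits = [1.2, 3.4, 0.5, 2.9]

# The distribution changes with temperature, the greedy choice does not.
picks = [greedy_pick(toy_logits, t) for t in (0.5, 1.0, 2.0)]
print(picks)
```

With `do_sample: false` and `num_beams: 1`, each decoding step reduces to this kind of argmax, which is why the output is deterministic for a given input.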
111
+
112
  ## Training Summary
113
 
114
+ Mega-ASR is trained for robust speech recognition in realistic acoustic environments. The training pipeline uses acoustic-to-semantic supervised fine-tuning, where the model is exposed to progressively harder speech examples and learns to recover both local acoustic details and sentence-level semantics under degradation.
115
+
116
+ The system is designed to improve recognition robustness on difficult audio while using a routing mechanism to reduce unnecessary changes on clean audio.
117
+
118
+ <p align="center">
119
+ <img src="assets/figures/method_overview.png" alt="Mega-ASR training and inference overview" width="100%">
120
+ </p>
121
 
122
  ## Evaluation
123
 
 
126
  - **WER** for English and whitespace-tokenized languages
127
  - **CER** for Chinese and character-based evaluation
128
 
129
+ <p align="center">
130
+ <img src="assets/figures/radar_results.png" alt="Mega-ASR evaluation results" width="100%">
131
+ </p>
132
+
133
  The Mega-ASR repository includes an evaluation script:
134
 
135
  ```bash
 
139
  --output_jsonl outputs/pred_with_wer.jsonl
140
  ```
141
 
142
+ Input JSONL format:
143
+
144
+ ```json
145
+ {"audio": "examples/audio/noise.wav", "answer": "I usually take the quieter road home because the main street gets crowded after work."}
146
+ ```
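
As a rough illustration of how WER and CER relate to a record in this format, the sketch below computes both from a reference string taken from the `answer` field. It is an assumption-laden example, not the repository's `evaluate_wer.py`: the function names, the shortened record, and the sample hypothesis are hypothetical, and the real script may apply text normalization that is not shown here.

```python
import json


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference token
                            curr[j - 1] + 1,      # insert a hypothesis token
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate over whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate; spaces are ignored (an assumption of this sketch)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)


# One input record, shortened for the example; "answer" holds the reference text.
record = json.loads('{"audio": "examples/audio/noise.wav", '
                    '"answer": "I usually take the quieter road home"}')
hypothesis = "I usually take the quiet road"  # hypothetical model output

print(f"WER = {wer(record['answer'], hypothesis):.3f}")
```

Here one substitution and one deletion against a seven-word reference give a WER of 2/7. For Chinese text the same edit distance is applied per character instead of per word.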
147
+
148
  ## Citation
149
 
150
  If you use Mega-ASR, please cite the project:
 
163
 
164
  ## Acknowledgements
165
 
166
+ Mega-ASR builds on Qwen3-ASR. We thank the Qwen3-ASR team and the creators of public speech and audio datasets used in the project.
assets/figures/mega_asr_logo.png ADDED

Git LFS Details

  • SHA256: 9dcc78da3cc1ceed9952d1a932c2d4212c937220381b8c7ba13966b1238101aa
  • Pointer size: 132 Bytes
  • Size of remote file: 1.98 MB
assets/figures/method_overview.png ADDED
assets/figures/radar_results.png ADDED

Git LFS Details

  • SHA256: 0142016c3d0abad5d1a536fd77a3666c07b95922d2a2c21670d45a3cdd21b7a8
  • Pointer size: 132 Bytes
  • Size of remote file: 1.09 MB