Improve model card: add links, datasets and sample usage

#2
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +12 -48
README.md CHANGED
@@ -1,18 +1,22 @@
1
  ---
2
- license: apache-2.0
3
  language:
4
  - en
5
  - zh
6
  tags:
7
  - automatic-speech-recognition
8
  - speech-recognition
9
  - audio
10
  - robust-asr
11
  - qwen3-asr
12
- pipeline_tag: automatic-speech-recognition
13
  ---
14
 
15
- # Mega-ASR
 
 
16
 
17
  Mega-ASR is a robust automatic speech recognition system designed for real-world audio with severe acoustic degradation. It targets noisy, reverberant, clipped, band-limited, overlapping, and otherwise difficult recording conditions where standard ASR systems often produce empty outputs, omissions, repetitions, or hallucinated text.
18
 
@@ -45,6 +49,8 @@ Mega-ASR is intended for speech-to-text transcription of real-world audio, espec
45
 
46
  ## Quick Start
47
 
 
 
48
  Install the Mega-ASR codebase and dependencies:
49
 
50
  ```bash
@@ -56,25 +62,7 @@ conda activate mega-asr
56
  pip install -r requirements.txt
57
  ```
58
 
59
- Place this checkpoint directory at:
60
-
61
- ```text
62
- ckpt/Mega-ASR
63
- ```
64
-
65
- Run inference:
66
-
67
- ```bash
68
- python infer.py --audio /path/to/audio.wav --ckpt_dir ckpt/Mega-ASR
69
- ```
70
-
71
- Disable routing if you want to always use the robust recognition path:
72
-
73
- ```bash
74
- python infer.py --audio /path/to/audio.wav --ckpt_dir ckpt/Mega-ASR --routing false
75
- ```
76
-
77
- Python usage:
78
 
79
  ```python
80
  from MegaASR.model.megaASR import MegaASR
@@ -89,27 +77,9 @@ result = model.infer("/path/to/audio.wav", return_route=True)
89
  print(result)
90
  ```
91
 
92
- ## Decoding Defaults
93
-
94
- The Mega-ASR wrapper uses Qwen3-ASR generation defaults unless explicitly overridden. In the provided wrapper, `max_new_tokens` is set to 256.
95
-
96
- The default generation configuration is deterministic:
97
-
98
- ```text
99
- do_sample: false
100
- num_beams: 1
101
- repetition_penalty: 1.0
102
- top_p: 1.0
103
- top_k: 50
104
- ```
105
-
106
- Because `do_sample` is false, decoding is greedy by default and sampling controls such as temperature, top-p, and top-k do not affect normal inference.
107
-
108
  ## Training Summary
109
 
110
- Mega-ASR is trained for robust speech recognition in realistic acoustic environments. The training pipeline uses acoustic-to-semantic supervised fine-tuning, where the model is exposed to progressively harder speech examples and learns to recover both local acoustic details and sentence-level semantics under degradation.
111
-
112
- The system is designed to improve recognition robustness on difficult audio while using a routing mechanism to reduce unnecessary changes on clean audio.
113
 
114
  ## Evaluation
115
 
@@ -127,12 +97,6 @@ python src/MegaASR/eval/evaluate_wer.py \
127
  --output_jsonl outputs/pred_with_wer.jsonl
128
  ```
129
 
130
- Input JSONL format:
131
-
132
- ```json
133
- {"audio": "examples/audio/noise.wav", "answer": "I usually take the quieter road home because the main street gets crowded after work."}
134
- ```
135
-
136
  ## Citation
137
 
138
  If you use Mega-ASR, please cite the project:
@@ -151,4 +115,4 @@ If you use Mega-ASR, please cite the project:
151
 
152
  ## Acknowledgements
153
 
154
- Mega-ASR builds on Qwen3-ASR. We thank the Qwen3-ASR team and the creators of public speech and audio datasets used in the project.
 
1
  ---
 
2
  language:
3
  - en
4
  - zh
5
+ license: apache-2.0
6
+ pipeline_tag: automatic-speech-recognition
7
+ datasets:
8
+ - zhifeixie/Voices-in-the-Wild-2M
9
  tags:
10
  - automatic-speech-recognition
11
  - speech-recognition
12
  - audio
13
  - robust-asr
14
  - qwen3-asr
 
15
  ---
16
 
17
+ # Mega-ASR: Towards In-the-wild^2 Speech Recognition
18
+
19
+ [**Paper**](https://huggingface.co/papers/2605.19833) | [**Project Page**](https://xzf-thu.github.io/Mega-ASR/) | [**Code**](https://github.com/xzf-thu/Mega-ASR)
20
 
21
  Mega-ASR is a robust automatic speech recognition system designed for real-world audio with severe acoustic degradation. It targets noisy, reverberant, clipped, band-limited, overlapping, and otherwise difficult recording conditions where standard ASR systems often produce empty outputs, omissions, repetitions, or hallucinated text.
22
 
 
49
 
50
  ## Quick Start
51
 
52
+ ### Installation
53
+
54
  Install the Mega-ASR codebase and dependencies:
55
 
56
  ```bash
 
62
  pip install -r requirements.txt
63
  ```
64
 
65
+ ### Python Usage
66
 
67
  ```python
68
  from MegaASR.model.megaASR import MegaASR
 
77
  print(result)
78
  ```
79
 
80
  ## Training Summary
81
 
82
+ Mega-ASR is trained for robust speech recognition in realistic acoustic environments. The training pipeline uses acoustic-to-semantic supervised fine-tuning (A2S-SFT) on the **Voices-in-the-Wild-2M** dataset, where the model is exposed to progressively harder speech examples and learns to recover both local acoustic details and sentence-level semantics under degradation.
 
 
83
 
84
  ## Evaluation
85
 
 
97
  --output_jsonl outputs/pred_with_wer.jsonl
98
  ```
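For readers unfamiliar with the metric, word error rate (WER) can be sketched as word-level edit distance divided by the reference length. This is a minimal illustration, not the code in `evaluate_wer.py`: the function name `word_error_rate` and the whitespace tokenization are assumptions, and the actual script may normalize text differently or score Chinese at the character level.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        # Convention chosen for this sketch: an empty reference scores 0.0
        # only when the hypothesis is also empty.
        return 0.0 if not hyp else 1.0

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the main street gets crowded", "the main street gets crowded"))
print(word_error_rate("a b c d", "a x c"))
```

The second call involves one substitution and one deletion against a four-word reference.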
99
 
100
  ## Citation
101
 
102
  If you use Mega-ASR, please cite the project:
 
115
 
116
  ## Acknowledgements
117
 
118
+ Mega-ASR builds on Qwen3-ASR. We thank the Qwen3-ASR team and the creators of public speech and audio datasets used in the project.