zhifeixie/Mega-ASR

@@ -1,18 +1,22 @@
 ---
-license: apache-2.0
 language:
 - en
 - zh
 tags:
 - automatic-speech-recognition
 - speech-recognition
 - audio
 - robust-asr
 - qwen3-asr
-pipeline_tag: automatic-speech-recognition
 ---
-# Mega-ASR
 Mega-ASR is a robust automatic speech recognition system designed for real-world audio with severe acoustic degradation. It targets noisy, reverberant, clipped, band-limited, overlapping, and otherwise difficult recording conditions where standard ASR systems often produce empty outputs, omissions, repetitions, or hallucinated text.
@@ -45,6 +49,8 @@ Mega-ASR is intended for speech-to-text transcription of real-world audio, espec
 ## Quick Start
 Install the Mega-ASR codebase and dependencies:
 ```bash
@@ -56,25 +62,7 @@ conda activate mega-asr
 pip install -r requirements.txt
 ```
-Place this checkpoint directory at:
-```text
-ckpt/Mega-ASR
-```
-Run inference:
-```bash
-python infer.py --audio /path/to/audio.wav --ckpt_dir ckpt/Mega-ASR
-```
-Disable routing if you want to always use the robust recognition path:
-```bash
-python infer.py --audio /path/to/audio.wav --ckpt_dir ckpt/Mega-ASR --routing false
-```
-Python usage:
 ```python
 from MegaASR.model.megaASR import MegaASR
@@ -89,27 +77,9 @@ result = model.infer("/path/to/audio.wav", return_route=True)
 print(result)
 ```
-## Decoding Defaults
-The Mega-ASR wrapper uses Qwen3-ASR generation defaults unless explicitly overridden. In the provided wrapper, `max_new_tokens` is set to 256.
-The default generation configuration is deterministic:
-```text
-do_sample: false
-num_beams: 1
-repetition_penalty: 1.0
-top_p: 1.0
-top_k: 50
-```
-Because `do_sample` is false, decoding is greedy by default and sampling controls such as temperature, top-p, and top-k do not affect normal inference.
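A note on the removed "Decoding Defaults" text: the claim that temperature has no effect under greedy decoding can be checked with a small, self-contained sketch. This is not the Mega-ASR or Qwen3-ASR decoding code. The logits and the helper names `softmax` and `greedy_pick` are hypothetical and exist only for illustration. The sketch shows that dividing logits by any positive temperature preserves their ordering, so the argmax token chosen by greedy decoding stays the same.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by a positive temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits, temperature=1.0):
    """Return the index of the most probable token, as greedy decoding would."""
    probs = softmax(logits, temperature)
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical scores for a 4-token vocabulary (illustration only).
toy_logits = [1.2, 3.4, 0.7, 2.9]

# The same token index is selected at every temperature.
picks = {t: greedy_pick(toy_logits, temperature=t) for t in (0.5, 1.0, 2.0)}
print(picks)
```

Temperature only reshapes the probabilities and leaves their ranking intact. Top-p and top-k restrict the candidate set used for sampling and always keep the highest-probability token, so they likewise leave a greedy choice unchanged.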
 ## Training Summary
-Mega-ASR is trained for robust speech recognition in realistic acoustic environments. The training pipeline uses acoustic-to-semantic supervised fine-tuning, where the model is exposed to progressively harder speech examples and learns to recover both local acoustic details and sentence-level semantics under degradation.
-The system is designed to improve recognition robustness on difficult audio while using a routing mechanism to reduce unnecessary changes on clean audio.
 ## Evaluation
@@ -127,12 +97,6 @@ python src/MegaASR/eval/evaluate_wer.py \
   --output_jsonl outputs/pred_with_wer.jsonl
 ```
-Input JSONL format:
-```json
-{"audio": "examples/audio/noise.wav", "answer": "I usually take the quieter road home because the main street gets crowded after work."}
-```
 ## Citation
 If you use Mega-ASR, please cite the project:
@@ -151,4 +115,4 @@ If you use Mega-ASR, please cite the project:
 ## Acknowledgements
-Mega-ASR builds on Qwen3-ASR. We thank the Qwen3-ASR team and the creators of public speech and audio datasets used in the project.

 ---
 language:
 - en
 - zh
+license: apache-2.0
+pipeline_tag: automatic-speech-recognition
+datasets:
+- zhifeixie/Voices-in-the-Wild-2M
 tags:
 - automatic-speech-recognition
 - speech-recognition
 - audio
 - robust-asr
 - qwen3-asr
 ---
+# Mega-ASR: Towards In-the-wild^2 Speech Recognition
+[**Paper**](https://huggingface.co/papers/2605.19833) | [**Project Page**](https://xzf-thu.github.io/Mega-ASR/) | [**Code**](https://github.com/xzf-thu/Mega-ASR)
 Mega-ASR is a robust automatic speech recognition system designed for real-world audio with severe acoustic degradation. It targets noisy, reverberant, clipped, band-limited, overlapping, and otherwise difficult recording conditions where standard ASR systems often produce empty outputs, omissions, repetitions, or hallucinated text.
 ## Quick Start
+### Installation
 Install the Mega-ASR codebase and dependencies:
 ```bash
 pip install -r requirements.txt
 ```
+### Python Usage
 ```python
 from MegaASR.model.megaASR import MegaASR
 print(result)
 ```
 ## Training Summary
+Mega-ASR is trained for robust speech recognition in realistic acoustic environments. The training pipeline uses acoustic-to-semantic supervised fine-tuning (A2S-SFT) on the **Voices-in-the-Wild-2M** dataset, where the model is exposed to progressively harder speech examples and learns to recover both local acoustic details and sentence-level semantics under degradation.
 ## Evaluation
   --output_jsonl outputs/pred_with_wer.jsonl
 ```
 ## Citation
 If you use Mega-ASR, please cite the project:
 ## Acknowledgements
+Mega-ASR builds on Qwen3-ASR. We thank the Qwen3-ASR team and the creators of public speech and audio datasets used in the project.