Update model card figures

Files changed (5)

.gitattributes +2 -0
README.md +59 -11
assets/figures/mega_asr_logo.png +3 -0
assets/figures/method_overview.png +0 -0
assets/figures/radar_results.png +3 -0

.gitattributes CHANGED

@@ -35,3 +35,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 lora/lora-stage1/tokenizer.json filter=lfs diff=lfs merge=lfs -text
 lora/lora-stage2/tokenizer.json filter=lfs diff=lfs merge=lfs -text
+assets/figures/mega_asr_logo.png filter=lfs diff=lfs merge=lfs -text
+assets/figures/radar_results.png filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -1,22 +1,22 @@
 ---
 language:
 - en
 - zh
-license: apache-2.0
-pipeline_tag: automatic-speech-recognition
-datasets:
-- zhifeixie/Voices-in-the-Wild-2M
 tags:
 - automatic-speech-recognition
 - speech-recognition
 - audio
 - robust-asr
 - qwen3-asr
 ---
-# Mega-ASR: Towards In-the-wild^2 Speech Recognition
-[**Paper**](https://huggingface.co/papers/2605.19833) | [**Project Page**](https://xzf-thu.github.io/Mega-ASR/) | [**Code**](https://github.com/xzf-thu/Mega-ASR)
 Mega-ASR is a robust automatic speech recognition system designed for real-world audio with severe acoustic degradation. It targets noisy, reverberant, clipped, band-limited, overlapping, and otherwise difficult recording conditions where standard ASR systems often produce empty outputs, omissions, repetitions, or hallucinated text.
@@ -49,8 +49,6 @@ Mega-ASR is intended for speech-to-text transcription of real-world audio, espec
 ## Quick Start
-### Installation
 Install the Mega-ASR codebase and dependencies:
 ```bash
@@ -62,7 +60,25 @@ conda activate mega-asr
 pip install -r requirements.txt
 ```
-### Python Usage
 ```python
 from MegaASR.model.megaASR import MegaASR
@@ -77,9 +93,31 @@ result = model.infer("/path/to/audio.wav", return_route=True)
 print(result)
 ```
 ## Training Summary
-Mega-ASR is trained for robust speech recognition in realistic acoustic environments. The training pipeline uses acoustic-to-semantic supervised fine-tuning (A2S-SFT) on the **Voices-in-the-Wild-2M** dataset, where the model is exposed to progressively harder speech examples and learns to recover both local acoustic details and sentence-level semantics under degradation.
 ## Evaluation
@@ -88,6 +126,10 @@ Mega-ASR is evaluated on standard ASR benchmarks, noisy robustness benchmarks, a
 - **WER** for English and whitespace-tokenized languages
 - **CER** for Chinese and character-based evaluation
 The Mega-ASR repository includes an evaluation script:
 ```bash
@@ -97,6 +139,12 @@ python src/MegaASR/eval/evaluate_wer.py \
   --output_jsonl outputs/pred_with_wer.jsonl
 ```
 ## Citation
 If you use Mega-ASR, please cite the project:
@@ -115,4 +163,4 @@ If you use Mega-ASR, please cite the project:
 ## Acknowledgements
-Mega-ASR builds on Qwen3-ASR. We thank the Qwen3-ASR team and the creators of public speech and audio datasets used in the project.

 ---
+license: apache-2.0
 language:
 - en
 - zh
 tags:
 - automatic-speech-recognition
 - speech-recognition
 - audio
 - robust-asr
 - qwen3-asr
+pipeline_tag: automatic-speech-recognition
 ---
+# Mega-ASR
+<p align="center">
+  <img src="assets/figures/mega_asr_logo.png" alt="Mega-ASR overview" width="70%">
+</p>
 Mega-ASR is a robust automatic speech recognition system designed for real-world audio with severe acoustic degradation. It targets noisy, reverberant, clipped, band-limited, overlapping, and otherwise difficult recording conditions where standard ASR systems often produce empty outputs, omissions, repetitions, or hallucinated text.
 ## Quick Start
 Install the Mega-ASR codebase and dependencies:
 ```bash
 pip install -r requirements.txt
 ```
+Place this checkpoint directory at:
+```text
+ckpt/Mega-ASR
+```
+Run inference:
+```bash
+python infer.py --audio /path/to/audio.wav --ckpt_dir ckpt/Mega-ASR
+```
+Disable routing if you want to always use the robust recognition path:
+```bash
+python infer.py --audio /path/to/audio.wav --ckpt_dir ckpt/Mega-ASR --routing false
+```
+Python usage:
 ```python
 from MegaASR.model.megaASR import MegaASR
 print(result)
 ```
+## Decoding Defaults
+The Mega-ASR wrapper uses Qwen3-ASR generation defaults unless explicitly overridden. In the provided wrapper, `max_new_tokens` is set to 256.
+The default generation configuration is deterministic:
+```text
+do_sample: false
+num_beams: 1
+repetition_penalty: 1.0
+top_p: 1.0
+top_k: 50
+```
+Because `do_sample` is false, decoding is greedy by default and sampling controls such as temperature, top-p, and top-k do not affect normal inference.
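The effect of `do_sample: false` can be illustrated with a toy next-token picker. This is a minimal sketch in plain Python, not the Mega-ASR or Qwen3-ASR decoding code: the function `pick_next_token`, the helper `softmax`, and the logits are hypothetical and exist only to show that greedy selection ignores temperature, top-k, and top-p.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_next_token(logits, do_sample=False, temperature=1.0,
                    top_k=50, top_p=1.0, rng=None):
    """Hypothetical single-step decoder used only for illustration."""
    if not do_sample:
        # Greedy path: take the argmax and never read the sampling controls.
        return max(range(len(logits)), key=lambda i: logits[i])

    rng = rng or random.Random()
    probs = softmax(logits, temperature)
    # Keep the top_k most likely tokens, then the smallest prefix whose
    # cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for index in ranked:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


toy_logits = [0.1, 2.5, 0.3, 1.9]  # made-up scores for a 4-token vocabulary
greedy_a = pick_next_token(toy_logits, do_sample=False, temperature=0.2, top_k=1, top_p=0.5)
greedy_b = pick_next_token(toy_logits, do_sample=False, temperature=5.0, top_k=50, top_p=1.0)
print(greedy_a, greedy_b)  # both pick index 1, the highest logit
```

With `do_sample=False` the two calls return the same token even though their sampling settings differ, which is the behavior the paragraph above describes.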
 ## Training Summary
+Mega-ASR is trained for robust speech recognition in realistic acoustic environments. The training pipeline uses acoustic-to-semantic supervised fine-tuning, where the model is exposed to progressively harder speech examples and learns to recover both local acoustic details and sentence-level semantics under degradation.
+The system is designed to improve recognition robustness on difficult audio while using a routing mechanism to reduce unnecessary changes on clean audio.
+<p align="center">
+  <img src="assets/figures/method_overview.png" alt="Mega-ASR training and inference overview" width="100%">
+</p>
 ## Evaluation
 - **WER** for English and whitespace-tokenized languages
 - **CER** for Chinese and character-based evaluation
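For readers unfamiliar with the two metrics, the sketch below computes them from a Levenshtein edit distance. It is an illustration under stated assumptions, not the repository's `evaluate_wer.py`: the function names are hypothetical, and no text normalization (casing, punctuation) is applied, which a real evaluation script may handle differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate over whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, ignoring whitespace characters."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# One substituted word out of five reference words.
print(wer("the main street gets crowded", "the main road gets crowded"))
# One substituted character out of six reference characters.
print(cer("今天天气很好", "今天天汽很好"))
```

Both rates divide the edit distance by the reference length, so they can exceed 1.0 when the hypothesis contains many insertions.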
+<p align="center">
+  <img src="assets/figures/radar_results.png" alt="Mega-ASR evaluation results" width="100%">
+</p>
 The Mega-ASR repository includes an evaluation script:
 ```bash
   --output_jsonl outputs/pred_with_wer.jsonl
 ```
+Input JSONL format:
+```json
+{"audio": "examples/audio/noise.wav", "answer": "I usually take the quieter road home because the main street gets crowded after work."}
+```
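A file in this format can be checked before it is passed to the evaluation script. The following is a minimal sketch that assumes only the two fields shown above, `audio` and `answer`; the helper name `read_eval_jsonl` is hypothetical and not part of the Mega-ASR codebase.

```python
import io
import json

REQUIRED_FIELDS = {"audio", "answer"}  # field names taken from the example record


def read_eval_jsonl(stream):
    """Parse one JSON object per line and check that the required fields exist."""
    records = []
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {line_number}: missing fields {sorted(missing)}")
        records.append(record)
    return records


# In-memory stand-in for a JSONL file, using the record from the model card.
sample = io.StringIO(
    '{"audio": "examples/audio/noise.wav", "answer": "I usually take the quieter '
    'road home because the main street gets crowded after work."}\n'
)
records = read_eval_jsonl(sample)
print(len(records), records[0]["audio"])
```

To read from disk instead, pass an open file handle, for example `open(path, encoding="utf-8")`, in place of the `io.StringIO` object.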
 ## Citation
 If you use Mega-ASR, please cite the project:
 ## Acknowledgements
+Mega-ASR builds on Qwen3-ASR. We thank the Qwen3-ASR team and the creators of public speech and audio datasets used in the project.

assets/figures/mega_asr_logo.png ADDED

Git LFS Details

SHA256: 9dcc78da3cc1ceed9952d1a932c2d4212c937220381b8c7ba13966b1238101aa
Pointer size: 132 Bytes
Size of remote file: 1.98 MB

assets/figures/method_overview.png ADDED

assets/figures/radar_results.png ADDED

Git LFS Details

SHA256: 0142016c3d0abad5d1a536fd77a3666c07b95922d2a2c21670d45a3cdd21b7a8
Pointer size: 132 Bytes
Size of remote file: 1.09 MB
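
The 132-byte pointer size reported for both LFS-tracked figures follows from the Git LFS pointer file layout: a version line, an `oid sha256:` line, and a `size` line. The sketch below rebuilds such a pointer to show where the number comes from. The helper `lfs_pointer` is hypothetical, and the byte count passed as the size is a placeholder, since the page only reports rounded sizes (1.98 MB and 1.09 MB).

```python
def lfs_pointer(sha256_hex, size_bytes):
    """Build the text of a Git LFS pointer file (spec v1 layout)."""
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{sha256_hex}\n"
        f"size {size_bytes}\n"
    )


# SHA256 shown above for assets/figures/mega_asr_logo.png.
oid = "9dcc78da3cc1ceed9952d1a932c2d4212c937220381b8c7ba13966b1238101aa"

# Placeholder: the exact byte count is not shown on the page.
pointer = lfs_pointer(oid, 1_980_000)
print(len(pointer.encode("utf-8")))  # 132 for any seven-digit size
```

The only variable-length part is the decimal size, so files between 1,000,000 and 9,999,999 bytes all produce a 132-byte pointer.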