
	---
	language:
	- en
	- zh
	license: apache-2.0
	pipeline_tag: automatic-speech-recognition
	datasets:
	- zhifeixie/Voices-in-the-Wild-2M
	tags:
	- automatic-speech-recognition
	- speech-recognition
	- audio
	- robust-asr
	- qwen3-asr
	---

	# Mega-ASR: Towards In-the-wild^2 Speech Recognition

	[Paper](https://huggingface.co/papers/2605.19833) | [Project Page](https://xzf-thu.github.io/Mega-ASR/) | [Code](https://github.com/xzf-thu/Mega-ASR)

	Mega-ASR is a robust automatic speech recognition system designed for real-world audio with severe acoustic degradation. It targets noisy, reverberant, clipped, band-limited, overlapping, and otherwise difficult recording conditions where standard ASR systems often produce empty outputs, omissions, repetitions, or hallucinated text.

	The release contains the Qwen3-ASR-1.7B foundation model files, Mega-ASR adaptation weights, and an audio quality router. The router decides whether to use the robust Mega-ASR path or the base recognition path for each input, which helps preserve clean-speech recognition quality while improving robustness on degraded speech.
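The per-input routing decision can be pictured as a simple threshold rule. The sketch below is illustrative only: `choose_route`, `RouteDecision`, the `degraded_prob` score, and the direction of the comparison (scores at or above the threshold go to the robust path) are assumptions made for this example, not the repository's actual API. Only the 0.5 default threshold comes from this model card.

```python
from dataclasses import dataclass


@dataclass
class RouteDecision:
    path: str             # "mega" (robust path) or "base" (backbone path)
    degraded_prob: float  # hypothetical router score in [0, 1]


def choose_route(degraded_prob: float, threshold: float = 0.5) -> RouteDecision:
    """Pick a recognition path from a hypothetical router score.

    Assumption: a higher score means the audio looks more degraded.
    """
    if not 0.0 <= degraded_prob <= 1.0:
        raise ValueError("degraded_prob must be in [0, 1]")
    path = "mega" if degraded_prob >= threshold else "base"
    return RouteDecision(path=path, degraded_prob=degraded_prob)


if __name__ == "__main__":
    for score in (0.1, 0.5, 0.9):
        print(score, choose_route(score).path)
```

Under these assumptions, raising the threshold sends more inputs to the base path, which favors clean-speech accuracy, while lowering it sends more inputs to the robust path.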

	## Model Details

	- Model name: Mega-ASR
	- Task: Automatic speech recognition
	- Backbone: Qwen3-ASR-1.7B
	- Primary use case: In-the-wild ASR under challenging acoustic conditions
	- Default decoding: Greedy decoding
	- Default max new tokens: 256 in the Mega-ASR inference wrapper
	- Router: Audio quality classifier with a default threshold of 0.5
	- License: Apache-2.0
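To make the decoding defaults concrete, here is a minimal, framework-free sketch of greedy decoding with a cap on newly generated tokens. `greedy_decode`, `step_fn`, and `toy_step` are hypothetical names used only for illustration; the actual wrapper presumably delegates generation to the backbone model rather than to a loop like this.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    step_fn: Callable[[List[int]], Sequence[float]],
    eos_id: int,
    max_new_tokens: int = 256,
) -> List[int]:
    """Generate tokens by always taking the highest-scoring next token."""
    tokens: List[int] = []
    for _ in range(max_new_tokens):
        scores = step_fn(tokens)
        # Greedy choice: index of the maximum score, no sampling or beam search.
        next_id = max(range(len(scores)), key=lambda i: scores[i])
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens


def toy_step(prefix: List[int]) -> Sequence[float]:
    """Toy scorer over a 4-token vocabulary: prefers 1, then 2, then EOS (id 3)."""
    target = [1, 2, 3][min(len(prefix), 2)]
    return [1.0 if i == target else 0.0 for i in range(4)]


if __name__ == "__main__":
    print(greedy_decode(toy_step, eos_id=3))
```

Generation stops at the end-of-sequence token or after `max_new_tokens` steps, whichever comes first, so the cap bounds output length even if the model never emits an end token.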

	## Repository Contents

	```text
	Mega-ASR/
	├── Qwen3-ASR-1.7B/ # Backbone model, tokenizer, processor, and generation config
	├── mega-asr-merged/ # Mega-ASR adaptation weights used by the inference wrapper
	├── audio_quality_router/ # Audio quality router checkpoint
	└── README.md # Model card
	```

	## Intended Use

	Mega-ASR is intended for speech-to-text transcription of real-world audio, especially audio affected by compound acoustic distortions. Example scenarios include far-field recording, environmental noise, reverberation, low-quality microphones, compression artifacts, partial signal corruption, and mixed acoustic conditions.

	## Quick Start

	### Installation

	Install the Mega-ASR codebase and dependencies:

	```bash
	git clone https://github.com/xzf-thu/Mega-ASR.git
	cd Mega-ASR

	conda create -n mega-asr python=3.10 -y
	conda activate mega-asr
	pip install -r requirements.txt
	```

	### Python Usage

	```python
	from MegaASR.model.megaASR import MegaASR

	# Load the Qwen3-ASR-1.7B backbone together with the audio quality router.
	model = MegaASR(
	    model_path="ckpt/Mega-ASR/Qwen3-ASR-1.7B",
	    router_checkpoint="ckpt/Mega-ASR/audio_quality_router/best_acc_model.pt",
	    routing_enabled=True,
	)

	# Transcribe one audio file and print the result.
	result = model.infer("/path/to/audio.wav", return_route=True)
	print(result)
	```

	## Training Summary

	Mega-ASR is trained for robust speech recognition in realistic acoustic environments. The training pipeline uses acoustic-to-semantic supervised fine-tuning (A2S-SFT) on the Voices-in-the-Wild-2M dataset, where the model is exposed to progressively harder speech examples and learns to recover both local acoustic details and sentence-level semantics under degradation.

	## Evaluation

	Mega-ASR is evaluated on standard ASR benchmarks, noisy robustness benchmarks, and in-the-wild compound acoustic scenarios. The recommended evaluation metrics are:

	- Word error rate (WER) for English and other whitespace-tokenized languages
	- Character error rate (CER) for Chinese and other languages evaluated at the character level
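Both metrics are the edit distance between reference and hypothesis divided by the reference length: (substitutions + deletions + insertions) / N, with N counted in words for WER and in characters for CER. The following is a minimal sketch of that computation, not the repository's evaluation script. The function names are illustrative, and text normalization (casing, punctuation, number formatting) is deliberately left out, so scores may differ from those of `evaluate_wer.py`.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring whitespace."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    print(wer("the cat sat on the mat", "the cat sit on mat"))
    print(cer("今天天气很好", "今天天汽很好"))
```

Because insertions count as errors, both rates can exceed 1.0 when the hypothesis is much longer than the reference, which is typical of repetition or hallucination failures.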

	The Mega-ASR repository includes an evaluation script:

	```bash
	python src/MegaASR/eval/evaluate_wer.py \
	  --ckpt_dir ckpt/Mega-ASR \
	  --input_jsonl examples/test.jsonl \
	  --output_jsonl outputs/pred_with_wer.jsonl
	```

	## Citation

	If you use Mega-ASR, please cite the project:

	```bibtex
	@misc{xie2026megaasrinthewild2speechrecognition,
	  title={Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation},
	  author={Zhifei Xie and Kaiyu Pang and Haobin Zhang and Deheng Ye and Xiaobin Hu and Shuicheng Yan and Chunyan Miao},
	  year={2026},
	  eprint={2605.19833},
	  archivePrefix={arXiv},
	  primaryClass={cs.SD},
	  url={https://arxiv.org/abs/2605.19833},
	}
	```

	## Acknowledgements

	Mega-ASR builds on Qwen3-ASR. We thank the Qwen3-ASR team and the creators of the public speech and audio datasets used in the project.