---
tags:
- audio
license: apache-2.0
---
# AudioMCQ-Weak-to-Strong
<div align="center">
[![arXiv](https://img.shields.io/badge/arXiv-2509.21060-b31b1b.svg)](https://arxiv.org/abs/2509.21060)
[![Dataset](https://img.shields.io/badge/🤗%20Dataset-AudioMCQ-blue)](https://huggingface.co/datasets/inclusionAI/AudioMCQ)
[![DCASE 2025](https://img.shields.io/badge/DCASE%202025-1st%20Place-gold.svg)](https://dcase.community/challenge2025/task-audio-question-answering-results)
</div>
## Overview
This repository contains the **Weak-to-Strong** model checkpoint from our paper "Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models". This model demonstrates state-of-the-art performance on audio question-answering benchmarks through our novel audio-contribution-aware post-training approach.
## Training Paradigm
The **Weak-to-Strong** training paradigm follows a two-stage approach:
```
Stage 1: SFT on weak audio-contribution data
Stage 2: GRPO (RL) on strong audio-contribution data
```
This paradigm starts with supervised fine-tuning on samples with weak audio contribution, where visual or textual cues already provide substantial information. It then applies reinforcement learning on challenging strong audio-contribution samples to strengthen audio-specific understanding.
## Model Details
- **Base Model**: Qwen2.5-Omni
- **Training Data**: [AudioMCQ Dataset](https://huggingface.co/datasets/inclusionAI/AudioMCQ) (571k samples)
- **Training Stages**:
- Stage 1 (SFT): Weak audio-contribution subset
- Stage 2 (GRPO): Strong audio-contribution subset
- **System Prompt**: "You are an audio understanding model that answers multiple choice questions based on audio content."
## Usage
Our model loading and usage methods are identical to those of Qwen2.5-Omni. Please refer to the [official documentation](https://github.com/QwenLM/Qwen2.5-Omni).
### Input Format
The evaluation input prompt structure is:
```
[Question] Please choose the answer from the following options: ['Option1', 'Option2', 'Option3', 'Option4']. Output the final answer in <answer> </answer>.
```
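The template above can be rendered programmatically. Below is a minimal sketch of a helper that does so; `build_prompt` is a hypothetical function name introduced here for illustration and is not part of the released code.

```python
# Hypothetical helper (not part of the released code) that renders a question
# and its options into the evaluation prompt format shown above.
def build_prompt(question: str, options: list[str]) -> str:
    opts = ", ".join(f"'{o}'" for o in options)
    return (
        f"{question} Please choose the answer from the following options: "
        f"[{opts}]. Output the final answer in <answer> </answer>."
    )

# Example: a four-option audio MCQ.
prompt = build_prompt(
    "What instrument is playing?",
    ["Piano", "Violin", "Guitar", "Drums"],
)
print(prompt)
```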
### Example Usage
```python
# Minimal sketch following the Qwen2.5-Omni documentation. Assumptions: the
# transformers Qwen2.5-Omni classes below, and a repo id inferred from this
# model card's naming -- substitute the actual checkpoint id if it differs.
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

model_id = "inclusionAI/AudioMCQ-Weak-To-Strong"  # assumed repo id
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained(model_id)

# Build a conversation with the system prompt above and the formatted question,
# then run processor.apply_chat_template(...) and model.generate(...) exactly
# as shown in the Qwen2.5-Omni documentation.
```
## Performance
The Weak-to-Strong model achieves competitive performance across multiple benchmarks:
- **MMAU-test-mini**: general audio understanding
- **MMAR**: music understanding tasks
- **MMSU**: speech understanding
- **Strong audio-contribution splits**: challenging samples that require deep audio understanding
For detailed performance metrics and comparisons, please refer to our paper.
## Related Resources
- **AudioMCQ Dataset**: [https://huggingface.co/datasets/inclusionAI/AudioMCQ](https://huggingface.co/datasets/inclusionAI/AudioMCQ)
- **Mixed-to-Strong Checkpoint**: [https://huggingface.co/inclusionAI/AudioMCQ-Mixed-To-Strong](https://huggingface.co/inclusionAI/AudioMCQ-Mixed-To-Strong)
- **Paper**: [arXiv:2509.21060](https://arxiv.org/abs/2509.21060)
- **DCASE 2025 Challenge**: [http://dcase.community/challenge2025/](http://dcase.community/challenge2025/)
## Citation
If you find this model useful in your research, please cite:
```bibtex
@article{he2025audiomcq,
  title={Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models},
  author={He, Haolin and others},
  journal={arXiv preprint arXiv:2509.21060},
  year={2025}
}
```
## Contact
- **Haolin He**: [harlandzzc@link.cuhk.edu.hk](mailto:harlandzzc@link.cuhk.edu.hk)
## Acknowledgements
We thank the organizers of DCASE 2025 and the research community for their valuable feedback and support.