---
tags:
- audio
license: apache-2.0
---

# AudioMCQ-Mixed-to-Strong
[![arXiv](https://img.shields.io/badge/arXiv-2509.21060-b31b1b.svg)](https://arxiv.org/abs/2509.21060) [![Dataset](https://img.shields.io/badge/🤗%20Dataset-AudioMCQ-blue)](https://huggingface.co/datasets/inclusionAI/AudioMCQ) [![DCASE 2025](https://img.shields.io/badge/DCASE%202025-1st%20Place-gold.svg)](https://dcase.community/challenge2025/task-audio-question-answering-results)
## Overview

This repository contains the **Mixed-to-Strong** model checkpoint from our paper "Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models". Through our audio-contribution-aware post-training approach, the model achieves state-of-the-art performance on audio question-answering benchmarks.

## Training Paradigm

The **Mixed-to-Strong** training paradigm follows a two-stage approach:

```
Stage 1: SFT on mixed audio-contribution data (weak + strong)
Stage 2: GRPO (RL) on strong audio-contribution data
```

Supervised fine-tuning draws on both weak and strong audio-contribution samples; reinforcement learning then focuses on the challenging strong audio-contribution samples. An illustrative sketch of GRPO's group-relative advantage is given in the appendix at the end of this card.

## Model Details

- **Base Model**: Qwen2.5-Omni
- **Training Data**: [AudioMCQ Dataset](https://huggingface.co/datasets/inclusionAI/AudioMCQ) (571k samples)
- **Training Stages**:
  - Stage 1 (SFT): Mixed audio-contribution subset
  - Stage 2 (GRPO): Strong audio-contribution subset
- **System Prompt**: "You are an audio understanding model that answers multiple choice questions based on audio content."

## Usage

Model loading and inference are identical to Qwen2.5-Omni; please refer to the [official documentation](https://github.com/QwenLM/Qwen2.5-Omni).

### Input Format

The evaluation input prompt structure is:

```
[Question] Please choose the answer from the following options: ['Option1', 'Option2', 'Option3', 'Option4']. Output the final answer in .
```

### Example Usage

The sketch below follows the standard Qwen2.5-Omni inference recipe; the checkpoint ID, audio path, and question are illustrative assumptions, not values from the paper. A helper for mapping the decoded text back to an option is sketched in the appendix at the end of this card.

```python
# Minimal inference sketch following the Qwen2.5-Omni documentation.
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info  # helper shipped with the Qwen2.5-Omni repo

MODEL_ID = "inclusionAI/AudioMCQ-Mixed-To-Strong"  # assumed checkpoint ID for this repo
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto")
processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)

SYSTEM_PROMPT = ("You are an audio understanding model that answers multiple "
                 "choice questions based on audio content.")
QUESTION = ("What sound is heard? Please choose the answer from the following "
            "options: ['car horn', 'dog barking', 'rain', 'speech'].")
conversation = [
    {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
    {"role": "user", "content": [
        {"type": "audio", "audio": "example.wav"},  # placeholder audio path
        {"type": "text", "text": QUESTION},
    ]},
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(text=text, audio=audios, images=images, videos=videos,
                   return_tensors="pt", padding=True)
inputs = inputs.to(model.device).to(model.dtype)
text_ids = model.generate(**inputs, return_audio=False, max_new_tokens=64)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
```

## Performance

The Mixed-to-Strong model achieves superior performance across multiple benchmarks:

- **MMAU-test-mini**: State-of-the-art accuracy on general audio understanding
- **MMAR**: Strong performance on audio reasoning tasks
- **MMSU**: Excellent results on speech understanding
- **Strong Audio-Contribution Splits**: Significantly improved performance on challenging samples that require deep audio understanding

For detailed performance metrics and comparisons, please refer to our paper.

## Related Resources

- **AudioMCQ Dataset**: [https://huggingface.co/datasets/inclusionAI/AudioMCQ](https://huggingface.co/datasets/inclusionAI/AudioMCQ)
- **Weak-to-Strong Checkpoint**: [https://huggingface.co/inclusionAI/AudioMCQ-Weak-To-Strong](https://huggingface.co/inclusionAI/AudioMCQ-Weak-To-Strong)
- **Paper**: [arXiv:2509.21060](https://arxiv.org/abs/2509.21060)
- **DCASE 2025 Challenge**: [http://dcase.community/challenge2025/](http://dcase.community/challenge2025/)

## Citation

If you find this model useful in your research, please cite:

```bibtex
@article{he2025audiomcq,
  title={Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models},
  author={He, Haolin and others},
  journal={arXiv preprint arXiv:2509.21060},
  year={2025}
}
```

## Contact

- **Haolin He**: [harlandzzc@link.cuhk.edu.hk](mailto:harlandzzc@link.cuhk.edu.hk)

## Acknowledgements

We thank the organizers of DCASE 2025 and the research community for their valuable feedback and support.
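## Appendix: Code Sketches

### GRPO Group-Relative Advantage (Stage 2)

In Stage 2, GRPO replaces a learned value critic with a group-relative baseline: several answers are sampled per question, and each answer's reward is normalized against its own group. The sketch below is illustrative only; the function name and the binary correctness reward are our assumptions, not code from the paper.

```python
# Illustrative sketch of GRPO's group-relative advantage (not the paper's training code).
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) rewards for the sampled answers per prompt.
    Each reward is normalized against the mean and std of its own group."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# One MCQ prompt, four sampled answers, reward 1.0 iff the answer is correct.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0]])
print(group_relative_advantages(rewards))  # correct answers get positive advantage
```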
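### Parsing the Final Answer

Model outputs are free-form text, so evaluation needs to map the decoded response back to one of the listed options. `extract_choice` below is a hypothetical helper, not part of this release; it simply picks the option whose text appears earliest in the response.

```python
# Hypothetical helper (not part of this release): map a response to an option.
def extract_choice(response: str, options: list[str]) -> str | None:
    lowered = response.lower()
    # Position of each option string in the response; -1 means not found.
    hits = [(lowered.find(opt.lower()), opt) for opt in options]
    hits = [h for h in hits if h[0] != -1]
    return min(hits)[1] if hits else None  # earliest match wins

print(extract_choice("The answer is: dog barking.",
                     ["car horn", "dog barking", "rain", "speech"]))  # dog barking
```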