---
tags:
- audio
license: apache-2.0
---
# AudioMCQ-Weak-to-Strong

<div align="center">

[Paper](https://arxiv.org/abs/2509.21060)
[Dataset](https://huggingface.co/datasets/inclusionAI/AudioMCQ)
[DCASE 2025 Challenge Results](https://dcase.community/challenge2025/task-audio-question-answering-results)

</div>

## Overview

This repository contains the **Weak-to-Strong** model checkpoint from our paper "Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models". This model demonstrates state-of-the-art performance on audio question-answering benchmarks through our audio-contribution-aware post-training approach.
## Training Paradigm

The **Weak-to-Strong** training paradigm follows a two-stage approach:

```
Stage 1: SFT on weak audio-contribution data
Stage 2: GRPO (RL) on strong audio-contribution data
```

This paradigm begins with supervised fine-tuning on samples with weak audio contribution (where textual cues alone already provide much of the information needed to answer), then applies reinforcement learning on challenging strong audio-contribution samples to enhance audio-specific understanding capabilities.

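To make the weak/strong distinction concrete, the sketch below shows one simple way to think about it (an illustration, not the paper's exact measurement protocol): a sample counts as weak audio-contribution if a text-only predictor already answers it correctly, and strong otherwise. Here `answer_from_text` is a hypothetical callable standing in for such a predictor.

```python
# Illustrative sketch only; `answer_from_text` is a hypothetical callable that
# answers an MCQ from the question and options alone (no audio). See the paper
# for the actual audio-contribution measurement.

def audio_contribution(sample, answer_from_text):
    """Label a sample 'weak' if it is answerable from text alone, else 'strong'."""
    guess = answer_from_text(sample["question"], sample["options"])
    return "weak" if guess == sample["answer"] else "strong"

def split_for_weak_to_strong(samples, answer_from_text):
    """Stage 1 (SFT) trains on the weak pool; Stage 2 (GRPO) trains on the strong pool."""
    weak = [s for s in samples if audio_contribution(s, answer_from_text) == "weak"]
    strong = [s for s in samples if audio_contribution(s, answer_from_text) == "strong"]
    return weak, strong
```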
## Model Details

- **Base Model**: Qwen2.5-Omni
- **Training Data**: [AudioMCQ Dataset](https://huggingface.co/datasets/inclusionAI/AudioMCQ) (571k samples)
- **Training Stages**:
  - Stage 1 (SFT): Weak audio-contribution subset
  - Stage 2 (GRPO): Strong audio-contribution subset
- **System Prompt**: "You are an audio understanding model that answers multiple choice questions based on audio content."
## Usage

Our model loading and usage methods are identical to those of Qwen2.5-Omni. Please refer to the [official documentation](https://github.com/QwenLM/Qwen2.5-Omni).

### Input Format

The evaluation input prompt structure is:

```
[Question] Please choose the answer from the following options: ['Option1', 'Option2', 'Option3', 'Option4']. Output the final answer in <answer> </answer>.
```
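As a convenience, one way to render a question and its options into this structure is sketched below; the helper name `build_mcq_prompt` is illustrative and not part of the release.

```python
def build_mcq_prompt(question: str, options: list[str]) -> str:
    """Render a question and its options into the evaluation prompt structure above."""
    option_str = "[" + ", ".join(f"'{opt}'" for opt in options) + "]"
    return (
        f"{question} Please choose the answer from the following options: "
        f"{option_str}. Output the final answer in <answer> </answer>."
    )

# Example:
# build_mcq_prompt("Which instrument is playing?", ["Piano", "Violin", "Guitar", "Flute"])
```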
### Example Usage

```python
# 1. Load the model and processor following the Qwen2.5-Omni documentation.
# 2. Apply the system prompt: "You are an audio understanding model that answers
#    multiple choice questions based on audio content."
# 3. Format your question with the input structure above.
```
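Filling in those steps, here is a minimal sketch in the style of the official Qwen2.5-Omni Transformers example. The repository id, the class and argument names (`Qwen2_5OmniForConditionalGeneration`, `Qwen2_5OmniProcessor`, `process_mm_info`), and the audio path and question are assumptions based on that example and the companion checkpoint's naming; treat the official Qwen2.5-Omni documentation as authoritative if anything differs.

```python
# Sketch based on the official Qwen2.5-Omni example; verify class and argument
# names against the Qwen2.5-Omni documentation for your installed versions.
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info  # helper shipped with the Qwen2.5-Omni repo

MODEL_ID = "inclusionAI/AudioMCQ-Weak-To-Strong"  # this checkpoint (or a local path)
SYSTEM_PROMPT = ("You are an audio understanding model that answers multiple "
                 "choice questions based on audio content.")

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)

# Question formatted with the evaluation input structure described above.
question = (
    "Which instrument is playing? Please choose the answer from the following options: "
    "['Piano', 'Violin', 'Guitar', 'Flute']. Output the final answer in <answer> </answer>."
)
conversation = [
    {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
    {"role": "user", "content": [
        {"type": "audio", "audio": "example.wav"},  # placeholder audio path
        {"type": "text", "text": question},
    ]},
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(text=text, audio=audios, images=images, videos=videos,
                   return_tensors="pt", padding=True)
inputs = inputs.to(model.device).to(model.dtype)

# Text-only generation; the predicted option is expected inside <answer> ... </answer>.
text_ids = model.generate(**inputs, return_audio=False, max_new_tokens=256)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
```

The decoded text includes the prompt as well as the response; the model's choice appears between the `<answer>` and `</answer>` tags.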
## Performance

The Weak-to-Strong model achieves competitive performance across multiple benchmarks:

- **MMAU-test-mini**: Strong accuracy on general audio understanding
- **MMAR**: Robust performance on multi-domain audio reasoning (speech, music, and sound)
- **MMSU**: Solid results on speech understanding
- **Strong Audio-Contribution Splits**: Enhanced performance on challenging samples requiring deep audio understanding

For detailed performance metrics and comparisons, please refer to our paper.
## Related Resources

- **AudioMCQ Dataset**: [https://huggingface.co/datasets/inclusionAI/AudioMCQ](https://huggingface.co/datasets/inclusionAI/AudioMCQ)
- **Mixed-to-Strong Checkpoint**: [https://huggingface.co/inclusionAI/AudioMCQ-Mixed-To-Strong](https://huggingface.co/inclusionAI/AudioMCQ-Mixed-To-Strong)
- **Paper**: [arXiv:2509.21060](https://arxiv.org/abs/2509.21060)
- **DCASE 2025 Challenge**: [http://dcase.community/challenge2025/](http://dcase.community/challenge2025/)
## Citation

If you find this model useful in your research, please cite:

```bibtex
@article{he2025audiomcq,
  title={Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models},
  author={He, Haolin and others},
  journal={arXiv preprint arXiv:2509.21060},
  year={2025}
}
```
## Contact

- **Haolin He**: [harlandzzc@link.cuhk.edu.hk](mailto:harlandzzc@link.cuhk.edu.hk)

## Acknowledgements

We thank the organizers of DCASE 2025 and the research community for their valuable feedback and support.