---
tags:
- audio
license: apache-2.0
---

# AudioMCQ-Weak-to-Strong

<div align="center">

[![arXiv](https://img.shields.io/badge/arXiv-2509.21060-b31b1b.svg)](https://arxiv.org/abs/2509.21060)
[![Dataset](https://img.shields.io/badge/🤗%20Dataset-AudioMCQ-blue)](https://huggingface.co/datasets/inclusionAI/AudioMCQ)
[![DCASE 2025](https://img.shields.io/badge/DCASE%202025-1st%20Place-gold.svg)](https://dcase.community/challenge2025/task-audio-question-answering-results)

</div>

## Overview

This repository contains the **Weak-to-Strong** model checkpoint from our paper "Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models". This model demonstrates state-of-the-art performance on audio question-answering benchmarks through our novel audio-contribution-aware post-training approach.

## Training Paradigm

The **Weak-to-Strong** training paradigm follows a two-stage approach:

```
Stage 1: SFT on weak audio-contribution data
Stage 2: GRPO (RL) on strong audio-contribution data
```

This paradigm begins with supervised fine-tuning on samples with weak audio contribution (where textual cues in the question and options already provide substantial information), then applies reinforcement learning on challenging strong audio-contribution samples to enhance audio-specific understanding capabilities.
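
As a rough illustration, the stage-wise data split might look like the sketch below. The `audio_contribution` field name is hypothetical, assumed only for illustration; consult the AudioMCQ dataset card for the actual schema.

```python
from datasets import load_dataset

# Hypothetical sketch of the two-stage data split. The "audio_contribution"
# field name is assumed for illustration, not taken from the released schema.
ds = load_dataset("inclusionAI/AudioMCQ", split="train")

weak = ds.filter(lambda ex: ex["audio_contribution"] == "weak")      # Stage 1: SFT
strong = ds.filter(lambda ex: ex["audio_contribution"] == "strong")  # Stage 2: GRPO

print(f"SFT pool: {len(weak)} samples, GRPO pool: {len(strong)} samples")
```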

## Model Details

- **Base Model**: Qwen2.5-Omni
- **Training Data**: [AudioMCQ Dataset](https://huggingface.co/datasets/inclusionAI/AudioMCQ) (571k samples)
- **Training Stages**: 
  - Stage 1 (SFT): Weak audio-contribution subset
  - Stage 2 (GRPO): Strong audio-contribution subset
- **System Prompt**: "You are an audio understanding model that answers multiple choice questions based on audio content."

## Usage

Our model loading and usage methods are identical to those of Qwen2.5-Omni. Please refer to the [official documentation](https://github.com/QwenLM/Qwen2.5-Omni).

### Input Format

The evaluation input prompt structure is:

```
[Question] Please choose the answer from the following options: ['Option1', 'Option2', 'Option3', 'Option4']. Output the final answer in <answer> </answer>.
```
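
A small helper (hypothetical, for illustration) that assembles this prompt from a question and its options:

```python
def build_prompt(question: str, options: list[str]) -> str:
    """Assemble the evaluation prompt in the format shown above (illustrative helper)."""
    opts = ", ".join(f"'{opt}'" for opt in options)
    return (
        f"{question} Please choose the answer from the following options: "
        f"[{opts}]. Output the final answer in <answer> </answer>."
    )

print(build_prompt("What instrument is playing?", ["Piano", "Violin", "Guitar", "Drums"]))
```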

### Example Usage

The sketch below follows the standard Qwen2.5-Omni `transformers` workflow: load the checkpoint, apply the system prompt above, attach the audio clip, and format the question with the input structure above. The repository id, file paths, and `generate` arguments here are illustrative, and class names may vary across `transformers` versions; the official Qwen2.5-Omni documentation linked above is authoritative.
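
```python
import re

import librosa
import torch
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

# Assumed repository id, matching the checkpoint naming in this card.
MODEL_ID = "inclusionAI/AudioMCQ-Weak-To-Strong"

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)

SYSTEM_PROMPT = (
    "You are an audio understanding model that answers multiple choice "
    "questions based on audio content."
)
question = (
    "What sound is heard? Please choose the answer from the following options: "
    "['Rain', 'Wind', 'Thunder', 'Traffic']. "
    "Output the final answer in <answer> </answer>."
)

conversation = [
    {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "example.wav"},  # path to your audio clip
            {"type": "text", "text": question},
        ],
    },
]

# Build the chat-formatted prompt and load the audio at the expected sampling rate.
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audio, _ = librosa.load("example.wav", sr=processor.feature_extractor.sampling_rate)

inputs = processor(text=text, audio=[audio], return_tensors="pt", padding=True).to(model.device)

# return_audio=False skips speech synthesis by the Talker; only text is needed here.
output_ids = model.generate(**inputs, max_new_tokens=64, return_audio=False)
completion = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]

# Extract the choice wrapped in <answer> ... </answer>.
match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.S)
print(match.group(1) if match else completion)
```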

## Performance

The Weak-to-Strong model achieves competitive performance across multiple audio question-answering benchmarks:

- **MMAU-test-mini**: general audio understanding spanning speech, music, and environmental sound
- **MMAR**: deep reasoning over speech, music, sound, and their mixtures
- **MMSU**: spoken language understanding
- **Strong audio-contribution splits**: challenging samples that can only be answered by genuinely attending to the audio

For detailed performance metrics and comparisons, please refer to our paper.

## Related Resources

- **AudioMCQ Dataset**: [https://huggingface.co/datasets/inclusionAI/AudioMCQ](https://huggingface.co/datasets/inclusionAI/AudioMCQ)
- **Mixed-to-Strong Checkpoint**: [https://huggingface.co/inclusionAI/AudioMCQ-Mixed-To-Strong](https://huggingface.co/inclusionAI/AudioMCQ-Mixed-To-Strong)
- **Paper**: [arXiv:2509.21060](https://arxiv.org/abs/2509.21060)
- **DCASE 2025 Challenge**: [http://dcase.community/challenge2025/](http://dcase.community/challenge2025/)

## Citation

If you find this model useful in your research, please cite:

```bibtex
@article{he2025audiomcq,
  title={Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models},
  author={He, Haolin and others},
  journal={arXiv preprint arXiv:2509.21060},
  year={2025}
}
```

## Contact

- **Haolin He**: [harlandzzc@link.cuhk.edu.hk](mailto:harlandzzc@link.cuhk.edu.hk)

## Acknowledgements

We thank the organizers of DCASE 2025 and the research community for their valuable feedback and support.