Update README.md (#7), opened by GrantL10

README.md CHANGED
## Introduction

R1-AQA is an audio question answering (AQA) model based on `Qwen2-Audio-7B-Instruct`, optimized through reinforcement learning with the group relative policy optimization (GRPO) algorithm.
It achieves state-of-the-art performance on the MMAU *Test-mini* benchmark with only 38k post-training samples.
For more details, please refer to our [Github](https://github.com/xiaomi-research/r1-aqa) and [Technical Report](https://arxiv.org/abs/2503.11197).
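GRPO's group-relative idea can be sketched in a few lines: sample several answers per question, score each with a reward, and normalize the rewards within the group so that better-than-average answers receive positive advantages. This is a minimal illustrative sketch, not the training code from our repository; the binary-correctness reward and group size are assumptions.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-4):
    """Group-relative advantages: z-score each sampled answer's reward
    against the mean and std of its own group (one group per question)."""
    m = mean(rewards)
    s = pstdev(rewards)
    return [(r - m) / (s + eps) for r in rewards]

# One question, four sampled answers, binary correctness reward:
# correct answers get positive advantages, incorrect ones negative.
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Above-average answers are reinforced and below-average ones penalized, with no learned value network required.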
|
Our main findings are as follows:

- Large audio language models (LALMs) still lag far behind humans in auditory-language reasoning, suggesting that RL-based approaches warrant further exploration.

Additional Notes:

- The AVQA training set originally contains approximately 40k samples, but we use only about 38k because some data sources have become invalid. Other datasets built on YouTube sources, such as AudioSet, face the same issue. We believe the missing 2k samples do not have a significant impact on the training results.
- The statement about the 8.2B parameters is based on the *Qwen2-Audio Technical Report*.
### Table: Accuracies (%) on the MMAU Test-mini benchmark

| Model                                      | Method                  | Sound     | Music  | Speech | Average   |
|--------------------------------------------|-------------------------|-----------|--------|--------|-----------|
| \                                          | Human\*                 | 86.31     | 78.22  | 82.17  | 82.23     |
| GPT4o + Weak Cap.                          | Direct Inference\*      | 39.33     | 41.90  | 58.25  | 45.70     |
| Llama-3-8B-Instruct + Weak Cap.            | Direct Inference\*      | 34.23     | 38.02  | 54.05  | 42.10     |
| SALMONN                                    | Direct Inference\*      | 41.00     | 34.80  | 25.50  | 33.70     |
| Qwen2-Audio-7B-Instruct                    | CoTA \[1\]              | 60.06     | 64.30  | 60.70  | 61.71     |
| Qwen2-Audio-7B-Instruct                    | Zero-Shot-CoT \[2\]     | 61.86     | 56.29  | 55.26  | 57.80     |
| **Qwen2-Audio-7B-Instruct**                | **GRPO (Ours)**         | **69.37** | 66.77  | 57.36  | **64.50** |
#### Notes

\* The data are sourced from the MMAU official website: [https://sakshi113.github.io/mmau_homepage/](https://sakshi113.github.io/mmau_homepage/)

\[1\] Xie, Zhifei, et al. "Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models." arXiv preprint arXiv:2503.02318 (2025).

\[2\] Ma, Ziyang, et al. "Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model." arXiv preprint arXiv:2501.07246 (2025).
## Inference

```python
import torch
import torchaudio

# ... (unchanged lines elided in this diff) ...

generated_ids = generated_ids[:, inputs.input_ids.size(1):]
response = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

print(response)
```
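Since the diff elides the middle of the snippet, here is a hypothetical sketch of the chat-style message structure that Qwen2-Audio-style processors consume before applying the chat template; the audio URL, question text, and helper name are illustrative assumptions, not the README's actual code.

```python
# Hypothetical helper: builds the multimodal chat message a Qwen2-Audio-style
# processor expects (an audio content entry followed by the text question).
def build_conversation(audio_url, question):
    return [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio_url": audio_url},
                {"type": "text", "text": question},
            ],
        }
    ]

conversation = build_conversation("https://example.com/clip.wav", "Which sound is louder?")
```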
## Citation

```bib
@article{li2025reinforcement,
  title={Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering},
  author={Li, Gang and Liu, Jizhong and Dinkel, Heinrich and Niu, Yadong and Zhang, Junbo and Luan, Jian},
  journal={arXiv preprint arXiv:2503.11197},
  year={2025},
  url={https://github.com/xiaomi-research/r1-aqa; https://huggingface.co/mispeech/r1-aqa}
}
```