Add descriptive tags to model card
#13 by nielsr (HF Staff) · opened

README.md CHANGED
@@ -1,8 +1,11 @@
 ---
 library_name: transformers
 license: apache-2.0
-tags: []
 pipeline_tag: audio-text-to-text
+tags:
+- audio-question-answering
+- reinforcement-learning
+- multimodal-llm
 ---
 
 # R1-AQA --- Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering
@@ -38,16 +41,16 @@ Additional Notes:
 | Llama-3-8B-Instruct + Strong Cap. | Direct Inference\* | 50.75 | 49.10 | 48.93 | 48.93 | 55.25 | 62.70 | 52.10 | 53.57 |
 | Qwen2-Audio-7B-Instruct | Direct Inference\* | 54.95 | 45.90 | 50.98 | 53.26 | 42.04 | 45.90 | 49.20 | 52.50 |
 | SALAMONN | Direct Inference\* | 41.00 | 40.30 | 34.80 | 33.76 | 25.50 | 24.24 | 33.70 | 32.77 |
-| Qwen2-Audio-7B-Instruct | CoTA
-| Qwen2-Audio-7B-Instruct | Zero-Shot-CoT
+| Qwen2-Audio-7B-Instruct | CoTA [1] | 60.06 | - | 64.30 | - | 60.70 | - | 61.71 | - |
+| Qwen2-Audio-7B-Instruct | Zero-Shot-CoT [2] | 61.86 | - | 56.29 | - | 55.26 | - | 57.80 | - |
 | **Qwen2-Audio-7B-Instruct** | **GRPO (Ours) 1️⃣** | 69.37 | - | 66.77 | - | 57.36 | - | 64.50 | - |
 | **Qwen2-Audio-7B-Instruct** | **GRPO (Ours) 2️⃣** | 68.77 | 69.76 | 64.37 | 61.40 | 63.66 | 62.70 | 65.60 | 64.36 |
 
 #### Notes
 
 \* The data are sourced from the [MMAU leaderboard](https://sakshi113.github.io/mmau_homepage/#leaderboard).
-
-
+[1] Xie, Zhifei, et al. "Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models." arXiv preprint arXiv:2503.02318 (2025).
+[2] Ma, Ziyang, et al. "Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model." arXiv preprint arXiv:2501.07246 (2025).
 1️⃣ It is the original model, identical to the one on Hugging Face and described in our technical report.
 2️⃣ It is the model submitted to the [MMAU leaderboard](https://sakshi113.github.io/mmau_homepage/#leaderboard), trained multiple times to achieve balanced results.
@@ -101,4 +104,4 @@ print(response)
 year={2025},
 url={https://github.com/xiaomi-research/r1-aqa; https://huggingface.co/mispeech/r1-aqa}
 }
-```
+```
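As a sanity check on the front-matter change, the updated header can be read back programmatically. The snippet below is a minimal, stdlib-only sketch (the hand-rolled parser is illustrative only, not a real YAML implementation, and the `front_matter` string is copied from the "after" side of this PR):

```python
# Front matter as it appears after this PR (YAML between the --- fences).
front_matter = """\
library_name: transformers
license: apache-2.0
pipeline_tag: audio-text-to-text
tags:
- audio-question-answering
- reinforcement-learning
- multimodal-llm
"""

meta, tags = {}, []
for line in front_matter.splitlines():
    if line.startswith("- "):
        # YAML sequence item (here, an entry under `tags:`).
        tags.append(line[2:].strip())
    elif ":" in line:
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
meta["tags"] = tags

print(meta["pipeline_tag"])  # audio-text-to-text
print(meta["tags"])
```

On the Hub, `pipeline_tag` drives the task filter while the `tags` list feeds free-form tag search, so the three added tags make the model discoverable under those keywords.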