Add descriptive tags to model card
#13 by nielsr (HF Staff) · opened

README.md CHANGED
@@ -1,8 +1,11 @@
 ---
 library_name: transformers
 license: apache-2.0
-tags: []
 pipeline_tag: audio-text-to-text
+tags:
+- audio-question-answering
+- reinforcement-learning
+- multimodal-llm
 ---
 
 # R1-AQA --- Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering
@@ -38,16 +41,16 @@ Additional Notes:
 | Llama-3-8B-Instruct + Strong Cap. | Direct Inference\* | 50.75 | 49.10 | 48.93 | 48.93 | 55.25 | 62.70 | 52.10 | 53.57 |
 | Qwen2-Audio-7B-Instruct | Direct Inference\* | 54.95 | 45.90 | 50.98 | 53.26 | 42.04 | 45.90 | 49.20 | 52.50 |
 | SALAMONN | Direct Inference\* | 41.00 | 40.30 | 34.80 | 33.76 | 25.50 | 24.24 | 33.70 | 32.77 |
-| Qwen2-Audio-7B-Instruct | CoTA
-| Qwen2-Audio-7B-Instruct | Zero-Shot-CoT
+| Qwen2-Audio-7B-Instruct | CoTA [1] | 60.06 | - | 64.30 | - | 60.70 | - | 61.71 | - |
+| Qwen2-Audio-7B-Instruct | Zero-Shot-CoT [2] | 61.86 | - | 56.29 | - | 55.26 | - | 57.80 | - |
 | **Qwen2-Audio-7B-Instruct** | **GRPO (Ours) 1️⃣** | 69.37 | - | 66.77 | - | 57.36 | - | 64.50 | - |
 | **Qwen2-Audio-7B-Instruct** | **GRPO (Ours) 2️⃣** | 68.77 | 69.76 | 64.37 | 61.40 | 63.66 | 62.70 | 65.60 | 64.36 |
 
 #### Notes
 
 \* The data are sourced from the [MMAU leaderboard](https://sakshi113.github.io/mmau_homepage/#leaderboard).
-
-
+[1] Xie, Zhifei, et al. "Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models." arXiv preprint arXiv:2503.02318 (2025).
+[2] Ma, Ziyang, et al. "Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model." arXiv preprint arXiv:2501.07246 (2025).
 1️⃣ It is the original model, identical to the one on Hugging Face and described in our technical report.
 2️⃣ It is the model submitted to the [MMAU leaderboard](https://sakshi113.github.io/mmau_homepage/#leaderboard), trained multiple times to achieve balanced results.
@@ -101,4 +104,4 @@ print(response)
 year={2025},
 url={https://github.com/xiaomi-research/r1-aqa; https://huggingface.co/mispeech/r1-aqa}
 }
-```
+```
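As a sanity check on the front-matter change, the updated header can be read back programmatically. The snippet below is a minimal, stdlib-only sketch (the hand-rolled parser is illustrative only, not a real YAML implementation, and the `front_matter` string is copied from the "after" side of this PR):

```python
# Front matter as it appears after this PR (YAML between the --- fences).
front_matter = """\
library_name: transformers
license: apache-2.0
pipeline_tag: audio-text-to-text
tags:
- audio-question-answering
- reinforcement-learning
- multimodal-llm
"""

meta, tags = {}, []
for line in front_matter.splitlines():
    if line.startswith("- "):
        # YAML sequence item (here, an entry under `tags:`).
        tags.append(line[2:].strip())
    elif ":" in line:
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
meta["tags"] = tags

print(meta["pipeline_tag"])  # audio-text-to-text
print(meta["tags"])
```

On the Hub, `pipeline_tag` drives the task filter while the `tags` list feeds free-form tag search, so the three added tags make the model discoverable under those keywords.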