
Add descriptive tags to model card

#13
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +9 -6
README.md CHANGED
@@ -1,8 +1,11 @@
  ---
  library_name: transformers
  license: apache-2.0
- tags: []
  pipeline_tag: audio-text-to-text
+ tags:
+ - audio-question-answering
+ - reinforcement-learning
+ - multimodal-llm
  ---

  # R1-AQA --- Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering
@@ -38,16 +41,16 @@ Additional Notes:
  | Llama-3-8B-Instruct + Strong Cap. | Direct Inference\* | 50.75 | 49.10 | 48.93 | 48.93 | 55.25 | 62.70 | 52.10 | 53.57 |
  | Qwen2-Audio-7B-Instruct | Direct Inference\* | 54.95 | 45.90 | 50.98 | 53.26 | 42.04 | 45.90 | 49.20 | 52.50 |
  | SALAMONN | Direct Inference\* | 41.00 | 40.30 | 34.80 | 33.76 | 25.50 | 24.24 | 33.70 | 32.77 |
- | Qwen2-Audio-7B-Instruct | CoTA \[1\] | 60.06 | - | 64.30 | - | 60.70 | - | 61.71 | - |
- | Qwen2-Audio-7B-Instruct | Zero-Shot-CoT \[2\] | 61.86 | - | 56.29 | - | 55.26 | - | 57.80 | - |
+ | Qwen2-Audio-7B-Instruct | CoTA [1] | 60.06 | - | 64.30 | - | 60.70 | - | 61.71 | - |
+ | Qwen2-Audio-7B-Instruct | Zero-Shot-CoT [2] | 61.86 | - | 56.29 | - | 55.26 | - | 57.80 | - |
  | **Qwen2-Audio-7B-Instruct** | **GRPO (Ours) 1️⃣** | 69.37 | - | 66.77 | - | 57.36 | - | 64.50 | - |
  | **Qwen2-Audio-7B-Instruct** | **GRPO (Ours) 2️⃣** | 68.77 | 69.76 | 64.37 | 61.40 | 63.66 | 62.70 | 65.60 | 64.36 |

  #### Notes

  \* The data are sourced from the [MMAU leaderboard](https://sakshi113.github.io/mmau_homepage/#leaderboard).
- \[1\] Xie, Zhifei, et al. "Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models." arXiv preprint arXiv:2503.02318 (2025).
- \[2\] Ma, Ziyang, et al. "Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model." arXiv preprint arXiv:2501.07246 (2025).
+ [1] Xie, Zhifei, et al. "Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models." arXiv preprint arXiv:2503.02318 (2025).
+ [2] Ma, Ziyang, et al. "Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model." arXiv preprint arXiv:2501.07246 (2025).
  1️⃣ It is the original model, identical to the one on Hugging Face and described in our technical report.
  2️⃣ It is the model submitted to the [MMAU leaderboard](https://sakshi113.github.io/mmau_homepage/#leaderboard), trained multiple times to achieve balanced results.

@@ -101,4 +104,4 @@ print(response)
  year={2025},
  url={https://github.com/xiaomi-research/r1-aqa; https://huggingface.co/mispeech/r1-aqa}
  }
- ```
+ ```