Audio-Text-to-Text
Transformers
Safetensors
qwen2_audio
text2text-generation
Files changed (1)
  1. README.md +21 -18
README.md CHANGED
@@ -27,27 +27,30 @@ Additional Notes:
  - The AVQA training set originally consists of approximately 40k samples. However, we use only about 38k samples because some data sources have become invalid. Other datasets using YouTube sources face a similar issue, such as AudioSet. We believe that the missing 2k samples do not have a significant impact on the training results.
  - The statement about the 8.2B parameters is based on the *Qwen2-Audio Technical Report*.
 
- ### Table: Accuracies (%) on MMAU Test-mini benchmark
-
- | Model | Method | Sound | Music | Speech | Average |
- |--------------------------------------------|-------------------------|--------|--------|--------|---------|
- | \ | Human\* | 86.31 | 78.22 | 82.17 | 82.23 |
- | Gemini Pro 2.0 Flash | Direct Inference\* | 56.46 | 58.68 | 51.65 | 55.60 |
- | Audio Flamingo 2 | Direct Inference\* | 61.56 | **73.95** | 30.93 | 55.48 |
- | GPT4o + Strong Cap. | Direct Inference\* | 57.35 | 49.70 | **64.86** | 57.30 |
- | Llama-3-8B-Instruct + Strong Cap. | Direct Inference\* | 50.75 | 48.93 | 55.25 | 52.10 |
- | Gemini Pro v1.5 | Direct Inference\* | 56.75 | 49.40 | 58.55 | 54.90 |
- | Qwen2-Audio-7B-Instruct | Direct Inference\* | 54.95 | 50.98 | 42.04 | 49.20 |
- | GPT4o + Weak Cap. | Direct Inference\* | 39.33 | 41.90 | 58.25 | 45.70 |
- | Llama-3-8B-Instruct + Weak Cap. | Direct Inference\* | 34.23 | 38.02 | 54.05 | 42.10 |
- | SALMONN | Direct Inference\* | 41.00 | 34.80 | 25.50 | 33.70 |
- | Qwen2-Audio-7B-Instruct | CoTA \[1\] | 60.06 | 64.30 | 60.70 | 61.71 |
- | Qwen2-Audio-7B-Instruct | Zero-Shot-CoT \[2\] | 61.86 | 56.29 | 55.26 | 57.80 |
- | **Qwen2-Audio-7B-Instruct** | **GRPO (Ours)** | **69.37** | 66.77 | 57.36 | **64.50** |
+ ### Table: Accuracies (%) on MMAU benchmark
+
+ | Model | Method | Sound (Test-mini) | Sound (Test) | Music (Test-mini) | Music (Test) | Speech (Test-mini) | Speech (Test) | Average (Test-mini) | Average (Test) |
+ |---------------------------------------|-----------------------|-------------------|--------------|-------------------|--------------|--------------------|---------------|---------------------|----------------|
+ | - | Human\* | 86.31 | - | 78.22 | - | 82.17 | - | 82.23 | - |
+ | Gemini Pro 2.0 Flash | Direct Inference\* | 56.46 | 61.73 | 58.68 | 56.53 | 51.65 | 61.53 | 55.60 | 59.93 |
+ | Audio Flamingo 2 | Direct Inference\* | 61.56 | 65.10 | 73.95 | 72.90 | 30.93 | 40.26 | 55.48 | 59.42 |
+ | GPT4o + Strong Cap. | Direct Inference\* | 57.35 | 55.83 | 49.70 | 51.73 | 64.86 | 68.66 | 57.30 | 58.74 |
+ | Llama-3-8B-Instruct + Strong Cap. | Direct Inference\* | 50.75 | 49.10 | 48.93 | 48.93 | 55.25 | 62.70 | 52.10 | 53.57 |
+ | Gemini Pro v1.5 | Direct Inference\* | 56.75 | 54.46 | 49.40 | 48.56 | 58.55 | 55.90 | 54.90 | 52.97 |
+ | Qwen2-Audio-7B-Instruct | Direct Inference\* | 54.95 | 45.90 | 50.98 | 53.26 | 42.04 | 45.90 | 49.20 | 52.50 |
+ | GPT4o + Weak Cap. | Direct Inference\* | 39.33 | 35.80 | 41.90 | 39.52 | 58.25 | 68.27 | 45.70 | 48.65 |
+ | Llama-3-8B-Instruct + Weak Cap. | Direct Inference\* | 34.23 | 33.73 | 38.02 | 42.36 | 54.05 | 61.54 | 42.10 | 45.87 |
+ | SALMONN | Direct Inference\* | 41.00 | 40.30 | 34.80 | 33.76 | 25.50 | 24.24 | 33.70 | 32.77 |
+ | Qwen2-Audio-7B-Instruct | CoTA \[1\] | 60.06 | - | 64.30 | - | 60.70 | - | 61.71 | - |
+ | Qwen2-Audio-7B-Instruct | Zero-Shot-CoT \[2\] | 61.86 | - | 56.29 | - | 55.26 | - | 57.80 | - |
+ | **Qwen2-Audio-7B-Instruct** | **Ours 1️⃣** | 69.37 | - | 66.77 | - | 57.36 | - | 64.50 | - |
+ | **Qwen2-Audio-7B-Instruct** | **Ours 2️⃣** | 68.77 | 69.76 | 64.37 | 61.40 | 63.66 | 62.70 | 65.60 | 64.36 |
 
  #### Notes
 
- \* The data are sourced from the MMAU official website: [https://sakshi113.github.io/mmau_homepage/](https://sakshi113.github.io/mmau_homepage/)
+ 1️⃣ The original model, identical to the one on Hugging Face and described in our technical report.
+ 2️⃣ The model submitted to [EvalAI](https://eval.ai/web/challenges/challenge-page/2391/overview) for evaluation, trained multiple times to achieve balanced results. (**The results on the [leaderboard](https://sakshi113.github.io/mmau_homepage/#leaderboard) contain some typographical errors, and we are currently in communication with the organizers to correct them.**)
+ \* The data are sourced from the [MMAU official website](https://sakshi113.github.io/mmau_homepage/).
  \[1\] Xie, Zhifei, et al. "Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models." arXiv preprint arXiv:2503.02318 (2025).
  \[2\] Ma, Ziyang, et al. "Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model." arXiv preprint arXiv:2501.07246 (2025).
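As a reading aid for the table above: the Average column is consistent with an unweighted mean of the Sound, Music, and Speech accuracies. A quick sanity check in Python (equal weighting is an assumption inferred from the rows here, not stated by MMAU):

```python
# Check that the reported Average matches the mean of the three categories
# for two Test-mini rows taken from the table above.
rows = {
    "Ours 1 (GRPO)": (69.37, 66.77, 57.36, 64.50),
    "Gemini Pro 2.0 Flash": (56.46, 58.68, 51.65, 55.60),
}
for name, (sound, music, speech, reported_avg) in rows.items():
    computed = round((sound + music + speech) / 3, 2)
    print(f"{name}: computed {computed} vs reported {reported_avg}")
```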
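Since note 1️⃣ describes the released checkpoint as a Qwen2-Audio-7B-Instruct fine-tune, inference should go through the standard `transformers` Qwen2-Audio interface. A minimal sketch, assuming the checkpoint keeps the upstream processor and chat template; the repo id and audio file are placeholders, and `audios=` follows the upstream Qwen2-Audio usage:

```python
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

# Placeholder repo id -- substitute this model's actual Hugging Face id.
MODEL_ID = "<org>/<qwen2-audio-grpo-checkpoint>"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2AudioForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")

# An MMAU-style multiple-choice question over a local audio clip.
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "sample.wav"},
        {"type": "text", "text": "Which instrument is playing? "
                                 "Choices: (A) piano (B) violin (C) drums (D) flute"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audio, _ = librosa.load("sample.wav", sr=processor.feature_extractor.sampling_rate)

inputs = processor(text=text, audios=[audio], return_tensors="pt", padding=True).to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)

# Decode only the newly generated tokens, not the prompt.
response = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(response)
```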