Update README.md #8
by GrantL10 · opened

README.md CHANGED
@@ -27,27 +27,30 @@ Additional Notes:
 - The AVQA training set originally consists of approximately 40k samples. However, we use only about 38k samples because some data sources have become invalid. Other datasets using YouTube sources face a similar issue, such as AudioSet. We believe that the missing 2k samples do not have a significant impact on the training results.
 - The statement about the 8.2B parameters is based on the *Qwen2-Audio Technical Report*.
 
-### Table: Accuracies (%) on MMAU
-| Model
-| Gemini Pro 2.0 Flash
-| Audio Flamingo 2
-| GPT4o + Strong Cap.
-| Llama-3-8B-Instruct + Strong Cap.
-| Gemini Pro v1.5
-| Qwen2-Audio-7B-Instruct
-| GPT4o +
-| Llama-3-8B-Instruct + Weak Cap.
-| Qwen2-Audio-7B-Instruct
-| Qwen2-Audio-7B-Instruct
-| **Qwen2-Audio-7B-Instruct**
+### Table: Accuracies (%) on MMAU benchmark
+
+| Model | Method | Sound Test-mini | Sound Test | Music Test-mini | Music Test | Speech Test-mini | Speech Test | Avg Test-mini | Avg Test |
+|---------------------------------------|-----------------------|-----------|-------|-----------|-------|-----------|------|------------|-------|
+| - | Human\* | 86.31 | - | 78.22 | - | 82.17 | - | 82.23 | - |
+| Gemini Pro 2.0 Flash | Direct Inference\* | 56.46 | 61.73 | 58.68 | 56.53 | 51.65 | 61.53 | 55.60 | 59.93 |
+| Audio Flamingo 2 | Direct Inference\* | 61.56 | 65.10 | 73.95 | 72.90 | 30.93 | 40.26 | 55.48 | 59.42 |
+| GPT4o + Strong Cap. | Direct Inference\* | 57.35 | 55.83 | 49.70 | 51.73 | 64.86 | 68.66 | 57.30 | 58.74 |
+| Llama-3-8B-Instruct + Strong Cap. | Direct Inference\* | 50.75 | 49.10 | 48.93 | 48.93 | 55.25 | 62.70 | 52.10 | 53.57 |
+| Gemini Pro v1.5 | Direct Inference\* | 56.75 | 54.46 | 49.40 | 48.56 | 58.55 | 55.90 | 54.90 | 52.97 |
+| Qwen2-Audio-7B-Instruct | Direct Inference\* | 54.95 | 45.90 | 50.98 | 53.26 | 42.04 | 45.90 | 49.20 | 52.50 |
+| GPT4o + Weak Cap. | Direct Inference\* | 39.33 | 35.80 | 41.90 | 39.52 | 58.25 | 68.27 | 45.70 | 48.65 |
+| Llama-3-8B-Instruct + Weak Cap. | Direct Inference\* | 34.23 | 33.73 | 38.02 | 42.36 | 54.05 | 61.54 | 42.10 | 45.87 |
+| SALMONN | Direct Inference\* | 41.00 | 40.30 | 34.80 | 33.76 | 25.50 | 24.24 | 33.70 | 32.77 |
+| Qwen2-Audio-7B-Instruct | CoTA \[1\] | 60.06 | - | 64.30 | - | 60.70 | - | 61.71 | - |
+| Qwen2-Audio-7B-Instruct | Zero-Shot-CoT \[2\] | 61.86 | - | 56.29 | - | 55.26 | - | 57.80 | - |
+| **Qwen2-Audio-7B-Instruct** | **Ours 1️⃣** | 69.37 | - | 66.77 | - | 57.36 | - | 64.50 | - |
+| **Qwen2-Audio-7B-Instruct** | **Ours 2️⃣** | 68.77 | 69.76 | 64.37 | 61.40 | 63.66 | 62.70 | 65.60 | 64.36 |
 
 #### Notes
 
+1️⃣ This is the original model, identical to the one on Hugging Face and described in our technical report.
+2️⃣ This is the model submitted to [EvalAI](https://eval.ai/web/challenges/challenge-page/2391/overview) for evaluation, trained multiple times to achieve balanced results. (**The results on the [leaderboard](https://sakshi113.github.io/mmau_homepage/#leaderboard) contain some typographical errors, and we are currently in communication with the organizers to correct them.**)
+\* The data are sourced from the [MMAU official website](https://sakshi113.github.io/mmau_homepage/).
 \[1\] Xie, Zhifei, et al. "Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models." arXiv preprint arXiv:2503.02318 (2025).
 \[2\] Ma, Ziyang, et al. "Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model." arXiv preprint arXiv:2501.07246 (2025).
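The averaged columns in the table appear to be unweighted means of the three MMAU category scores (Sound, Music, Speech); for example, the Human row's 82.23 is the mean of 86.31, 78.22, and 82.17. A minimal sketch of that sanity check, assuming equal category weighting (the row values are copied from the table; the "Ours 2" label is shorthand for the second Qwen2-Audio-7B-Instruct entry):

```python
# Sanity-check (assumption): "Avg Test-mini" is the unweighted mean of the
# three per-category Test-mini scores, rounded to two decimal places.
rows = {
    "Human": ([86.31, 78.22, 82.17], 82.23),
    "Gemini Pro 2.0 Flash": ([56.46, 58.68, 51.65], 55.60),
    "Qwen2-Audio-7B-Instruct (Ours 2)": ([68.77, 64.37, 63.66], 65.60),
}

for name, (scores, reported) in rows.items():
    computed = round(sum(scores) / len(scores), 2)
    # Each computed mean should match the table's reported average.
    assert computed == reported, (name, computed, reported)
    print(f"{name}: computed {computed}, reported {reported}")
```

The same check applies to the full-Test averages for rows where all three category scores are reported.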