Audio-Text-to-Text
Transformers
Safetensors
qwen2_audio
text2text-generation
Files changed (1)
  1. README.md +21 -18
README.md CHANGED
@@ -27,27 +27,30 @@ Additional Notes:
  - The AVQA training set originally consists of approximately 40k samples. However, we use only about 38k samples because some data sources have become invalid. Other datasets using YouTube sources face a similar issue, such as AudioSet. We believe that the missing 2k samples do not have a significant impact on the training results.
  - The statement about the 8.2B parameters is based on the *Qwen2-Audio Technical Report*.
 
- ### Table: Accuracies (%) on MMAU Test-mini benchmark
-
- | Model | Method | Sound | Music | Speech | Average |
- |--------------------------------------------|-------------------------|--------|--------|--------|---------|
- | \ | Human\* | 86.31 | 78.22 | 82.17 | 82.23 |
- | Gemini Pro 2.0 Flash | Direct Inference\* | 56.46 | 58.68 | 51.65 | 55.60 |
- | Audio Flamingo 2 | Direct Inference\* | 61.56 | **73.95** | 30.93 | 55.48 |
- | GPT4o + Strong Cap. | Direct Inference\* | 57.35 | 49.70 | **64.86** | 57.30 |
- | Llama-3-8B-Instruct + Strong Cap. | Direct Inference\* | 50.75 | 48.93 | 55.25 | 52.10 |
- | Gemini Pro v1.5 | Direct Inference\* | 56.75 | 49.40 | 58.55 | 54.90 |
- | Qwen2-Audio-7B-Instruct | Direct Inference\* | 54.95 | 50.98 | 42.04 | 49.20 |
- | GPT4o + Weak Cap. | Direct Inference\* | 39.33 | 41.90 | 58.25 | 45.70 |
- | Llama-3-8B-Instruct + Weak Cap. | Direct Inference\* | 34.23 | 38.02 | 54.05 | 42.10 |
- | SALMONN | Direct Inference\* | 41.00 | 34.80 | 25.50 | 33.70 |
- | Qwen2-Audio-7B-Instruct | CoTA \[1\] | 60.06 | 64.30 | 60.70 | 61.71 |
- | Qwen2-Audio-7B-Instruct | Zero-Shot-CoT \[2\] | 61.86 | 56.29 | 55.26 | 57.80 |
- | **Qwen2-Audio-7B-Instruct** | **GRPO (Ours)** | **69.37** | 66.77 | 57.36 | **64.50** |
+ ### Table: Accuracies (%) on MMAU benchmark
+
+ | Model | Method | Sound (Test-mini) | Sound (Test) | Music (Test-mini) | Music (Test) | Speech (Test-mini) | Speech (Test) | Average (Test-mini) | Average (Test) |
+ |---------------------------------------|-----------------------|-------------------|--------------|-------------------|--------------|--------------------|---------------|---------------------|----------------|
+ | - | Human\* | 86.31 | - | 78.22 | - | 82.17 | - | 82.23 | - |
+ | Gemini Pro 2.0 Flash | Direct Inference\* | 56.46 | 61.73 | 58.68 | 56.53 | 51.65 | 61.53 | 55.60 | 59.93 |
+ | Audio Flamingo 2 | Direct Inference\* | 61.56 | 65.10 | 73.95 | 72.90 | 30.93 | 40.26 | 55.48 | 59.42 |
+ | GPT4o + Strong Cap. | Direct Inference\* | 57.35 | 55.83 | 49.70 | 51.73 | 64.86 | 68.66 | 57.30 | 58.74 |
+ | Llama-3-8B-Instruct + Strong Cap. | Direct Inference\* | 50.75 | 49.10 | 48.93 | 48.93 | 55.25 | 62.70 | 52.10 | 53.57 |
+ | Gemini Pro v1.5 | Direct Inference\* | 56.75 | 54.46 | 49.40 | 48.56 | 58.55 | 55.90 | 54.90 | 52.97 |
+ | Qwen2-Audio-7B-Instruct | Direct Inference\* | 54.95 | 45.90 | 50.98 | 53.26 | 42.04 | 45.90 | 49.20 | 52.50 |
+ | GPT4o + Weak Cap. | Direct Inference\* | 39.33 | 35.80 | 41.90 | 39.52 | 58.25 | 68.27 | 45.70 | 48.65 |
+ | Llama-3-8B-Instruct + Weak Cap. | Direct Inference\* | 34.23 | 33.73 | 38.02 | 42.36 | 54.05 | 61.54 | 42.10 | 45.87 |
+ | SALMONN | Direct Inference\* | 41.00 | 40.30 | 34.80 | 33.76 | 25.50 | 24.24 | 33.70 | 32.77 |
+ | Qwen2-Audio-7B-Instruct | CoTA \[1\] | 60.06 | - | 64.30 | - | 60.70 | - | 61.71 | - |
+ | Qwen2-Audio-7B-Instruct | Zero-Shot-CoT \[2\] | 61.86 | - | 56.29 | - | 55.26 | - | 57.80 | - |
+ | **Qwen2-Audio-7B-Instruct** | **Ours 1️⃣** | 69.37 | - | 66.77 | - | 57.36 | - | 64.50 | - |
+ | **Qwen2-Audio-7B-Instruct** | **Ours 2️⃣** | 68.77 | 69.76 | 64.37 | 61.40 | 63.66 | 62.70 | 65.60 | 64.36 |
 
  #### Notes
 
- \* The data are sourced from the MMAU official website: [https://sakshi113.github.io/mmau_homepage/](https://sakshi113.github.io/mmau_homepage/)
+ 1️⃣ The original model, identical to the one on Hugging Face and described in our technical report.
+ 2️⃣ The model submitted to [EvalAI](https://eval.ai/web/challenges/challenge-page/2391/overview) for evaluation, trained multiple times to achieve balanced results. (**The results on the [leaderboard](https://sakshi113.github.io/mmau_homepage/#leaderboard) contain some typographical errors, and we are currently in communication with the organizers to correct them.**)
+ \* The data are sourced from the [MMAU official website](https://sakshi113.github.io/mmau_homepage/).
  \[1\] Xie, Zhifei, et al. "Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models." arXiv preprint arXiv:2503.02318 (2025).
  \[2\] Ma, Ziyang, et al. "Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model." arXiv preprint arXiv:2501.07246 (2025).
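As a reading aid for the table above: the Average column is consistent with an unweighted mean of the Sound, Music, and Speech accuracies. A quick sanity check in Python (equal weighting is an assumption inferred from the rows here, not stated by MMAU):

```python
# Check that the reported Average matches the mean of the three categories
# for two Test-mini rows taken from the table above.
rows = {
    "Ours 1 (GRPO)": (69.37, 66.77, 57.36, 64.50),
    "Gemini Pro 2.0 Flash": (56.46, 58.68, 51.65, 55.60),
}
for name, (sound, music, speech, reported_avg) in rows.items():
    computed = round((sound + music + speech) / 3, 2)
    print(f"{name}: computed {computed} vs reported {reported_avg}")
```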
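Since note 1️⃣ describes the released checkpoint as a Qwen2-Audio-7B-Instruct fine-tune, inference should go through the standard `transformers` Qwen2-Audio interface. A minimal sketch, assuming the checkpoint keeps the upstream processor and chat template; the repo id and audio file are placeholders, and `audios=` follows the upstream Qwen2-Audio usage:

```python
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

# Placeholder repo id -- substitute this model's actual Hugging Face id.
MODEL_ID = "<org>/<qwen2-audio-grpo-checkpoint>"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2AudioForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")

# An MMAU-style multiple-choice question over a local audio clip.
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "sample.wav"},
        {"type": "text", "text": "Which instrument is playing? "
                                 "Choices: (A) piano (B) violin (C) drums (D) flute"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audio, _ = librosa.load("sample.wav", sr=processor.feature_extractor.sampling_rate)

inputs = processor(text=text, audios=[audio], return_tensors="pt", padding=True).to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)

# Decode only the newly generated tokens, not the prompt.
response = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(response)
```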