---
license: apache-2.0
language:
- en
- zh
tags:
- audio
- speech
- music
- understanding
- multimodal
- reasoning
- chain-of-thought
pipeline_tag: audio-text-to-text
---

# MOSS-Audio

| Model | Model Size | MMAU | MMAU-Pro | MMAR | MMSU | Avg |
|---|---|---|---|---|---|---|
| Open Source (small) | | | | | | |
| Kimi-Audio | 7B | 72.41 | 56.58 | 60.82 | 54.74 | 61.14 |
| Qwen2.5-Omni | 7B | 65.60 | 52.20 | 56.70 | 61.32 | 58.96 |
| Audio Flamingo 3 | 7B | 61.23 | 51.70 | 57.96 | 60.04 | 57.73 |
| MiMo-Audio-7B | 7B | 74.90 | 53.35 | 61.70 | 61.94 | 62.97 |
| MiniCPM-o-4.5 | 9B | 70.97 | 39.65 | 55.75 | 60.96 | 56.83 |
| MOSS-Audio-4B-Instruct | 4B | 75.79 | 58.16 | 59.68 | 59.68 | 64.04 |
| MOSS-Audio-4B-Thinking | 4B | 77.64 | 60.75 | 63.91 | 71.20 | 68.37 |
| MOSS-Audio-8B-Instruct | 8B | 77.03 | 57.48 | 64.42 | 66.36 | 66.32 |
| MOSS-Audio-8B-Thinking | 8B | 77.13 | 64.29 | 65.73 | 76.06 | 70.80 |
| Open Source (large) | | | | | | |
| Qwen3-Omni-30B-A3B-Instruct | 30B | 75.00 | 61.22 | 66.40 | 69.00 | 67.91 |
| Step-Audio-R1.1 | 33B | 72.18 | 60.80 | 68.75 | 64.18 | 66.48 |
| Step-Audio-R1 | 33B | 78.67 | 59.68 | 69.15 | 75.18 | 70.67 |
| Closed Source | | | | | | |
| GPT4o-Audio | - | 65.66 | 52.30 | 59.78 | 58.76 | 59.13 |
| Gemini-3-Pro | - | 80.15 | 68.28 | 81.73 | 81.28 | 77.86 |
| Gemini-3.1-Pro | - | 81.10 | 73.47 | 83.70 | 81.30 | 79.89 |
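
The Avg column appears to be the unweighted mean of the four benchmark scores (MMAU, MMAU-Pro, MMAR, MMSU), rounded to two decimals. That definition is an assumption inferred from the numbers, not something the table states. A minimal sketch that recomputes it for a few rows, using a hypothetical helper named `average`:

```python
from statistics import mean

# Scores copied from the table above, in column order:
# MMAU, MMAU-Pro, MMAR, MMSU.
scores = {
    "MOSS-Audio-8B-Instruct": [77.03, 57.48, 64.42, 66.36],
    "MOSS-Audio-8B-Thinking": [77.13, 64.29, 65.73, 76.06],
    "Step-Audio-R1": [78.67, 59.68, 69.15, 75.18],
}


def average(row):
    """Unweighted mean of the four scores, rounded to two decimals.

    Assumption: this is how the Avg column is defined.
    """
    return round(mean(row), 2)


for model, row in scores.items():
    print(f"{model}: {average(row):.2f}")
```

The same helper can be applied to any other row to compare a recomputed mean against the reported Avg value.
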

Each column below is one dataset and split. From left to right, the columns fall under these categories: Acoustic Environment (Clean); Acoustic Environment (Noisy); Acoustic Characteristics: Whisper; Acoustic Characteristics: Far-Field / Near-Field; Multi-Speaker; Age; Health Condition; Semantic Content; Code-Switching; Dialect; Singing; Non-Speech Vocalizations.

| Model | AISHELL-1 test | AISHELL-2 Android | AISHELL-2 IOS | AISHELL-2 Mic | THCHS-30 test | MAGICDATA-READ test | AISHELL6-Whisper normal | AISHELL6-Whisper whisper | AliMeeting Test_Ali_far | AliMeeting Test_Ali_near | AISHELL-4 test | SeniorTalk sentence | ChildMandarin test | AISHELL-6A mild | AISHELL-6A moderate | AISHELL-6A severe | AISHELL-6A StutteringSpeech | AISHELL_6B LRDWWS | AISHELL_6B Uncontrol | WenetSpeech test-meeting | Fleurs cmn_hans_cn | CS-Dialogue test | TALCS test | ASCEND test | KeSpeech test | WSYue-ASR-eval short | MIR-1K test | openc-pop test | MNV_17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Paraformer-Large | 1.98 | 3.28 | 3.21 | 3.00 | 4.07 | 4.67 | 1.11 | 8.92 | 25.64 | 9.27 | 20.33 | 17.31 | 12.60 | 6.98 | 9.30 | 13.34 | 10.74 | 47.59 | 45.08 | 7.88 | 6.40 | 10.64 | 10.77 | 16.55 | 11.48 | 75.42 | 57.70 | 6.98 | 4.95 |
| GLM-ASR-Nano | 2.89 | 3.75 | 3.73 | 3.78 | 4.23 | 5.02 | 0.83 | 9.06 | 40.27 | 14.76 | 28.02 | 20.33 | 14.06 | 8.74 | 12.11 | 14.38 | 12.29 | 50.34 | 49.09 | 9.70 | 4.94 | 11.06 | 11.07 | 13.50 | 9.72 | 35.07 | 95.87 | 8.03 | 4.65 |
| Fun-ASR-Nano | 2.16 | 3.04 | 2.99 | 3.07 | 3.65 | 3.46 | 0.81 | 6.76 | 27.21 | 9.55 | 19.82 | 16.96 | 12.94 | 6.60 | 8.81 | 12.98 | 10.30 | 47.42 | 45.84 | 7.39 | 4.76 | 10.47 | 8.09 | 15.13 | 7.43 | 8.17 | 35.85 | 2.84 | 4.76 |
| SenseVoice-Small | 3.23 | 4.16 | 4.02 | 3.96 | 5.26 | 4.93 | 1.25 | 9.88 | 37.01 | 16.31 | 24.06 | 21.07 | 14.18 | 7.62 | 9.85 | 14.39 | 11.47 | 52.92 | 47.97 | 8.35 | 6.75 | 12.81 | 10.52 | 18.38 | 10.45 | 7.34 | 39.51 | 8.07 | 4.92 |
| Kimi-Audio-7B-Instruct | 0.79 | 2.91 | 3.03 | 2.88 | 1.39 | 2.15 | 0.69 | 4.63 | 28.22 | 13.82 | 20.61 | 19.70 | 13.79 | 7.00 | 9.34 | 12.56 | 10.75 | 44.44 | 42.57 | 7.15 | 5.10 | 14.56 | 12.74 | 21.83 | 5.51 | 53.17 | 38.35 | 5.17 | 4.68 |
| Qwen2.5-Omni-3B | 1.51 | 3.10 | 2.94 | 2.93 | 3.32 | 3.56 | 0.82 | 7.82 | 32.14 | 12.16 | 22.91 | 17.38 | 12.96 | 6.87 | 10.55 | 14.57 | 11.33 | 54.54 | 50.03 | 9.04 | 5.45 | 10.78 | 10.94 | 13.25 | 7.67 | 60.06 | 45.00 | 3.47 | 5.54 |
| Qwen2.5-Omni-7B | 1.16 | 2.88 | 2.77 | 2.73 | 3.06 | 3.16 | 0.71 | 6.57 | 32.03 | 18.73 | 21.01 | 19.96 | 12.29 | 7.27 | 10.94 | 12.92 | 10.53 | 51.99 | 49.45 | 8.43 | 5.13 | 14.02 | 10.46 | 14.42 | 6.40 | 57.43 | 42.62 | 2.75 | 4.56 |
| Qwen3-Omni-30B-A3B-Instruct | 0.95 | 2.70 | 2.72 | 2.57 | 2.21 | 2.47 | 0.59 | 3.22 | 25.72 | 8.44 | 18.15 | 14.13 | 8.79 | 6.20 | 8.88 | 11.59 | 10.25 | 45.80 | 41.65 | 6.64 | 4.84 | 12.94 | 8.33 | 12.64 | 5.87 | 25.39 | 30.81 | 1.21 | 4.73 |
| MOSS-Audio-4B-Instruct | 2.26 | 3.22 | 3.20 | 3.33 | 3.53 | 3.72 | 0.73 | 5.86 | 27.27 | 9.68 | 20.33 | 16.93 | 13.25 | 6.36 | 9.77 | 12.68 | 10.28 | 43.35 | 44.25 | 8.17 | 8.13 | 9.14 | 8.37 | 12.83 | 14.65 | 9.04 | 18.47 | 3.10 | 4.01 |
| MOSS-Audio-8B-Instruct | 1.82 | 2.97 | 2.95 | 2.91 | 2.82 | 3.20 | 0.69 | 4.80 | 36.82 | 11.25 | 24.36 | 17.42 | 13.10 | 5.84 | 8.94 | 11.52 | 9.72 | 39.76 | 39.27 | 7.86 | 7.52 | 9.07 | 8.22 | 13.26 | 9.18 | 8.33 | 17.24 | 2.39 | 4.31 |