lewiswoncy committed on commit 348b468 · verified · 1 Parent(s): 2addc83

Update README.md

Files changed (1):
  1. README.md +431 -193
README.md CHANGED
@@ -1,199 +1,437 @@
  ---
  library_name: transformers
- tags: []
  ---
 
- # Model Card for Model ID
 
- <!-- Provide a quick summary of what the model is/does. -->
 
-
-
- ## Model Details
-
- ### Model Description
-
- <!-- Provide a longer summary of what this model is. -->
-
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-
- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]
-
- ### Model Sources [optional]
-
- <!-- Provide the basic links for the model. -->
-
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
-
- ## Uses
-
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
- ### Direct Use
-
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
- [More Information Needed]
-
- ### Downstream Use [optional]
-
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
- [More Information Needed]
-
- ### Out-of-Scope Use
-
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-
- [More Information Needed]
-
- ## Bias, Risks, and Limitations
-
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
-
- [More Information Needed]
-
- ### Recommendations
-
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-
- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- [More Information Needed]
-
- ## Training Details
-
- ### Training Data
-
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-
- [More Information Needed]
-
- ### Training Procedure
-
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
- #### Preprocessing [optional]
-
- [More Information Needed]
-
-
- #### Training Hyperparameters
-
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-
- #### Speeds, Sizes, Times [optional]
-
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
- [More Information Needed]
-
- ## Evaluation
-
- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- [More Information Needed]
-
- #### Summary
-
-
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Environmental Impact
-
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
-
- ## Technical Specifications [optional]
-
- ### Model Architecture and Objective
-
- [More Information Needed]
-
- ### Compute Infrastructure
-
- [More Information Needed]
-
- #### Hardware
-
- [More Information Needed]
-
- #### Software
-
- [More Information Needed]
-
- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
- **BibTeX:**
-
- [More Information Needed]
-
- **APA:**
-
- [More Information Needed]
-
- ## Glossary [optional]
-
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
- [More Information Needed]
-
- ## More Information [optional]
-
- [More Information Needed]
-
- ## Model Card Authors [optional]
-
- [More Information Needed]
-
- ## Model Card Contact
-
- [More Information Needed]
 
  ---
+ license: other
+ license_name: meralion-public-license-v3
+ license_link: https://huggingface.co/datasets/MERaLiON/MERaLiON_Public_Licence/blob/main/MERaLiON-Public-Licence-v3.pdf
+ extra_gated_fields:
+   First Name: text
+   Last Name: text
+   Country: country
+   Affiliation: text
+   Job title:
+     type: select
+     options:
+       - Student
+       - Research Graduate
+       - AI developer/Researcher
+       - Other
+   I consent to being contacted by the MERaLiON team for feedback or follow-up regarding my experience using the model: checkbox
+ extra_gated_description: >-
+   By downloading this model, you acknowledge that you have read and agree to be bound by the Terms and Conditions set out in this document [MERaLiON Public License v3](https://huggingface.co/datasets/MERaLiON/MERaLiON_Public_Licence/blob/main/MERaLiON-Public-Licence-v3.pdf).
+   The information you provide will be collected, stored, processed, and shared in accordance with the [A*STAR Privacy Policy](https://www.a-star.edu.sg/privacy-statement).
+ extra_gated_button_content: Submit
+ datasets:
+ - MERaLiON/Multitask-National-Speech-Corpus-v1
+ language:
+ - en
+ - zh
+ - ms
+ - ta
+ - id
+ - th
+ - vi
+ metrics:
+ - wer
+ - bleu
+ base_model:
+ - openai/whisper-large-v3
+ - google/gemma-2-9b-it
  library_name: transformers
+ tags:
+ - meralion
+ - meralion-3
  ---
 
+ <h1 align="center">🔥 MERaLiON-3 🔥</h1>
+
+ <p align="center">
+   <a href="https://huggingface.co/MERaLiON/MERaLiON-3-10B">🚀 MERaLiON-3-10B</a>
+ </p>
+
+ <p align="center">
+   <a href="https://meralion.org/demo/">💻 Web Demo</a> |
+   ⚙️ vLLM coming soon
+ </p>
+
+ ## Introduction
+
+ We are pleased to announce the release of our flagship speech-text large language model, [**MERaLiON-3-10B**](https://huggingface.co/MERaLiON/MERaLiON-3-10B). MERaLiON-3-10B demonstrates competitive performance across benchmark evaluations in Age Recognition, Gender Recognition, Spoken Question Answering (SQA), and Contextual Paralinguistic Question Answering (CPQA) in the Southeast Asian context. These results are comparable to those achieved by other state-of-the-art AudioLLMs, including Gemini 3 Flash and Qwen3 Omni Instruct. MERaLiON-3-10B also maintains competitive performance vis-à-vis MERaLiON-2-10B on other tasks, such as Multilingual Automatic Speech Recognition (ASR), Speech Translation (ST), Audio Scene Understanding, and general speech comprehension.
+
+ We constructed a benchmark containing speech and prompts in Malay, Indonesian, English, Chinese, Tamil, Thai and Vietnamese to better represent the Southeast Asian context. The following table presents task-specific evaluation scores, assessed using the LLM-as-a-Judge framework across multiple datasets. Higher scores indicate better performance. We will open-source these benchmarks separately as part of a paper. See the [Evaluation](#performance) section for detailed benchmarks.
+
+ | Benchmark | MERaLiON-3-10B | MERaLiON-2-10B | Qwen3 Omni | Gemini 3 Flash | GPT 4o Audio |
+ | :--- | :---: | :---: | :---: | :---: | :---: |
+ | Age (commonvoice-en, ta, th, vi, zh) | 75.41 | 61.77 | 70.38 | <u>**77.00**</u> | 68.90 |
+ | Gender (Multi-dataset) | <u>**96.67**</u> | 54.19 | 95.34 | 81.72 | 40.25 |
+ | Spoken Q&A (SQA) | <u>**61.50**</u> | 56.76 | 58.74 | 59.75 | 57.48 |
+ | Contextual paralinguistic Q&A (CPQA) | <u>**57.33**</u> | 48.31 | 54.21 | 54.07 | 54.54 |
+
+ ## Model Description:
+
+ MERaLiON stands for **M**ultimodal **E**mpathetic **R**easoning **a**nd **L**earning **i**n **O**ne **N**etwork, with models tailored for **Singapore’s multilingual and multicultural landscape**, as well as the wider **Southeast Asian region**.
+
+ MERaLiON-3-10B is fine-tuned on **150,000 hours of speech and audio data** across **6 diverse tasks**: Automatic Speech Recognition (ASR), SQA, Spoken Dialogue Summarization (SDS), Audio Captioning (AC), Audio-Scene Question Answering (ASQA) and CPQA.
+
+ - **Developed by:** I<sup>2</sup>R, A\*STAR, Singapore
+ - **Model type:** Multimodal LLM
+ - **Language(s):** Primarily English (Global and Singapore) and Chinese, with support for audio in regional languages including Malay, Tamil, Indonesian, Thai, and Vietnamese.
+ - **Audio:** **Mono**-channel audio, **16,000 Hz** sampling rate, up to **300** seconds.
+ - **License:** [MERaLiON Public License](https://huggingface.co/datasets/MERaLiON/MERaLiON_Public_Licence/blob/main/MERaLiON-Public-Licence-v3.pdf)
+ - **Demo:** [MERaLiON-AudioLLM Web Demo](https://meralion.org/demo/)
+
+ ## Performance:
+
+ We benchmarked MERaLiON-3-10B against Qwen3 Omni, Gemini 3 Flash, GPT 4o Audio, and MERaLiON-2-10B; it performed best on 44 of 59 benchmarks covering age recognition, gender recognition, SQA, and CPQA. MERaLiON-3-10B also maintains competitive performance vis-à-vis MERaLiON-2-10B on the AudioBench benchmarks.
+
+ **Age recognition**
+
+ Age recognition tasks categorise speakers as teens (10-19), adults (20-59), or seniors (60-100). The prompts are either in English ("eng") or in the same language as the audio ("sea"). LLM-as-a-judge is used to evaluate the correctness of each response.
+
+ | Dataset | Lang | Var | MERaLiON-3-10B | MERaLiON-2-10B | Qwen3 Omni | Gemini 3 Flash | GPT 4o Audio |
+ | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+ | Commonvoice | en | eng | 64.30 | 63.10 | 64.20 | <u>**68.00**</u> | 65.00 |
+ | | | sea | 64.30 | 63.10 | 64.20 | <u>**68.00**</u> | 65.00 |
+ | | ta | eng | 78.00 | 64.65 | 73.50 | <u>**79.00**</u> | 71.00 |
+ | | | sea | 58.00 | 47.90 | 48.40 | <u>**78.00**</u> | 62.00 |
+ | | th | eng | <u>**81.68**</u> | 57.81 | 78.06 | 77.00 | 78.00 |
+ | | | sea | 76.39 | 42.19 | 64.13 | <u>**84.00**</u> | 53.00 |
+ | | vi | eng | <u>**91.96**</u> | 73.23 | 84.39 | 81.00 | 86.00 |
+ | | | sea | <u>**91.48**</u> | 64.35 | 77.67 | 87.00 | 81.00 |
+ | | zh | eng | 74.30 | 72.40 | <u>**75.60**</u> | 75.00 | 83.00 |
+ | | | sea | <u>**73.70**</u> | 69.00 | 73.60 | 73.00 | 45.00 |
+ | **Average** | | | 75.41 | 61.77 | 70.38 | <u>**77.00**</u> | 68.90 |
+
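The three age buckets above are a simple deterministic mapping. As a hypothetical illustration of the bucket definitions (the released evaluation actually scores free-text answers with an LLM judge, not with code like this):

```python
def age_group(age: int) -> str:
    """Map a speaker's age to the benchmark's three buckets.

    Hypothetical helper mirroring the bucket definitions above.
    """
    if 10 <= age <= 19:
        return "teens"
    if 20 <= age <= 59:
        return "adults"
    if 60 <= age <= 100:
        return "seniors"
    raise ValueError("age outside the benchmark's 10-100 range")

print(age_group(16), age_group(35), age_group(72))  # teens adults seniors
```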
+ **Gender recognition**
+
+ The gender recognition benchmark consists of speech samples in Indonesian, Tamil, Thai, Vietnamese, Chinese, Malay, English, and Khmer. The text prompts are either in English, or in the same language as the audio. LLM-as-a-judge is used to evaluate the correctness of each response.
+
+ | Dataset | Lang | Var | MERaLiON-3-10B | MERaLiON-2-10B | Qwen3 Omni | Gemini 3 Flash | GPT 4o Audio |
+ | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+ | commonvoice | id | eng | <u>**97.10**</u> | 45.20 | 96.80 | 86.00 | 46.00 |
+ | | | sea | <u>**97.20**</u> | 57.30 | 96.10 | 90.00 | 53.93 |
+ | | ta | eng | <u>**97.40**</u> | 53.00 | 96.80 | 65.00 | 33.00 |
+ | | | sea | <u>**97.10**</u> | 40.40 | 81.90 | 71.00 | 35.00 |
+ | | th | eng | <u>**97.86**</u> | 50.07 | 96.92 | 87.00 | 50.00 |
+ | | | sea | <u>**97.99**</u> | 23.96 | 95.18 | 82.00 | 40.00 |
+ | | vi | eng | <u>**99.22**</u> | 24.05 | 98.82 | 87.00 | 26.00 |
+ | | | sea | <u>**99.22**</u> | 14.64 | 96.86 | 88.00 | 35.00 |
+ | | zh | eng | <u>**98.20**</u> | 53.70 | <u>**98.20**</u> | 89.00 | 49.00 |
+ | | | sea | <u>**98.30**</u> | 35.50 | 98.10 | 82.00 | 21.00 |
+ | emota | ta | eng | <u>**100.00**</u> | 67.31 | 99.89 | 83.00 | 25.00 |
+ | | | sea | <u>**100.00**</u> | 48.93 | 97.65 | 86.00 | 33.00 |
+ | fleurs | en | eng | <u>**100.00**</u> | 58.27 | <u>**100.00**</u> | 73.00 | 78.00 |
+ | | | sea | <u>**100.00**</u> | 58.27 | <u>**100.00**</u> | 73.00 | 78.00 |
+ | | km | eng | <u>**100.00**</u> | 56.60 | <u>**100.00**</u> | 94.00 | 62.00 |
+ | | | sea | 99.48 | 43.40 | <u>**100.00**</u> | 99.00 | 15.00 |
+ | indowavesentiment | id | eng | <u>**100.00**</u> | 71.67 | <u>**100.00**</u> | 84.00 | 60.00 |
+ | | | sea | <u>**100.00**</u> | 60.67 | <u>**100.00**</u> | 88.00 | 14.00 |
+ | m3ed | zh | eng | 93.30 | 84.30 | <u>**94.30**</u> | 73.00 | 23.00 |
+ | | | sea | 93.70 | 70.70 | <u>**94.40**</u> | 72.00 | 12.00 |
+ | openslr | ta | eng | <u>**100.00**</u> | 55.30 | 99.00 | 75.00 | 47.00 |
+ | | | sea | <u>**100.00**</u> | 37.80 | 87.90 | 81.00 | 36.00 |
+ | sg streets | en | eng | <u>**100.00**</u> | 89.63 | <u>**100.00**</u> | 87.00 | 32.00 |
+ | | | sea | <u>**100.00**</u> | 89.63 | <u>**100.00**</u> | 87.00 | 32.00 |
+ | asr-smaldusc | ms | eng | <u>**99.40**</u> | 52.40 | 98.60 | 97.00 | 76.00 |
+ | | | sea | <u>**99.40**</u> | 44.00 | 98.80 | 99.00 | 24.00 |
+ | thai elderly speech | th | eng | 99.09 | 68.15 | <u>**99.29**</u> | 77.00 | 46.00 |
+ | | | sea | <u>**98.99**</u> | 26.92 | 97.39 | 76.00 | 51.00 |
+ | thai ser | th | eng | <u>**91.41**</u> | 63.46 | 90.47 | 85.00 | 44.00 |
+ | | | sea | <u>**91.41**</u> | 61.78 | 89.74 | 76.00 | 34.00 |
+ | vietnam-celeb | vi | eng | <u>**73.90**</u> | 65.80 | 73.80 | 62.00 | 41.00 |
+ | | | sea | 73.80 | 61.40 | <u>**74.00**</u> | 61.00 | 36.00 |
+ | **Average** | | | <u>**96.67**</u> | 54.19 | 95.34 | 81.72 | 40.25 |
+
+ **Spoken question and answer (SQA)**
+
+ The benchmark consists of speech in English, Malay, Tamil, and Chinese, with English text prompts containing questions about the speech. Because studies have found that LLM judges tend to favor longer, more verbose answers even when they are less clear, high-quality, or accurate than shorter alternatives, we adjusted the judge's prompt to counter this verbosity bias.
+
+ | Dataset | MERaLiON-3-10B | MERaLiON-2-10B | Qwen3 Omni | Gemini 3 Flash | GPT 4o Audio |
+ | :--- | :---: | :---: | :---: | :---: | :---: |
+ | ytb_sqa_batch1 | <u>**67.65**</u> | 65.89 | 66.66 | 63.25 | 60.43 |
+ | ytb_sqa_batch3_ms | <u>**58.00**</u> | 50.40 | 56.25 | 57.75 | 55.80 |
+ | ytb_sqa_batch3_ta | 58.55 | 53.60 | 52.25 | <u>**59.45**</u> | 56.25 |
+ | ytb_sqa_batch3_zh_en | <u>**61.80**</u> | 57.15 | 59.80 | 58.55 | 57.45 |
+ | **Average** | <u>**61.50**</u> | 56.76 | 58.74 | 59.75 | 57.48 |
+
+ **Contextual paralinguistic question and answer (CPQA)**
+
+ The audio includes both speech and non-speech elements; when no speech is present, LLMs are expected to reason solely from acoustic or musical elements. The speech samples are in Chinese, Malay, Tamil, English, or a code-switched mix of these languages, and may include dialects such as Hokkien. To test robustness in instruction following, the text prompts were designed to be diverse and were written in any of the following languages: English, Malay, Tamil, Indonesian, Vietnamese, Chinese, or Thai. LLMs are expected to reply in the same language as the text prompt. As with SQA, we adjusted the judge's prompt to address verbosity bias.
+
+ | Dataset | MERaLiON-3-10B | MERaLiON-2-10B | Qwen3 Omni | Gemini 3 Flash | GPT 4o Audio |
+ | :--- | :---: | :---: | :---: | :---: | :---: |
+ | yx_youtube_zh | <u>**59.40**</u> | 50.18 | 57.27 | 54.67 | 54.79 |
+ | yx_youtube_codeswitch | <u>**61.80**</u> | 47.36 | 55.56 | 59.40 | 60.32 |
+ | yx_youtube_dialect | <u>**59.20**</u> | 47.72 | 56.36 | 55.36 | 54.92 |
+ | yx_youtube_ms | <u>**60.40**</u> | 46.16 | 53.88 | 57.00 | 56.36 |
+ | yx_youtube_ta | <u>**58.40**</u> | 38.88 | 49.60 | 56.60 | 54.64 |
+ | yx_youtube_en | <u>**58.64**</u> | 51.60 | 56.76 | 53.52 | 52.88 |
+ | ytb_short_eval_cpqa_human1 | <u>**54.64**</u> | 47.57 | 53.95 | 47.42 | 49.97 |
+ | ytb_short_eval_cpqa_llm1 | <u>**59.42**</u> | 56.25 | 56.07 | 54.94 | 52.44 |
+ | ytb_long_eval_cpqa_llm1 | <u>**60.46**</u> | 57.48 | 57.44 | 54.94 | 56.32 |
+ | ytb_long_eval_cpqa_human1 | <u>**60.94**</u> | 51.33 | 59.21 | 56.34 | 55.00 |
+ | Emotional-YTB-MY_zh_30_test_CPQA_v1 | 51.81 | 46.81 | 51.22 | 51.07 | <u>**53.41**</u> |
+ | Emotional-YTB-MY_ms_30_test_CPQA_v1 | 50.40 | 44.82 | 48.79 | 49.12 | <u>**53.01**</u> |
+ | Emotional-YTB-MY_ta_test_CPQA_v1 | 49.77 | 41.88 | 48.62 | 52.56 | <u>**54.96**</u> |
+ | **Average** | <u>**57.33**</u> | 48.31 | 54.21 | 54.07 | 54.54 |
+
+ **Automatic Speech Recognition (ASR), instruction following and audio understanding**
+
+ MERaLiON-3-10B continues to demonstrate competitive performance in ASR, instruction following and audio understanding compared with MERaLiON-2-10B, with improvements on most AudioBench metrics. Please visit the [AudioBench benchmark](https://huggingface.co/spaces/MERaLiON/AudioBench-Leaderboard) for dataset-level evaluation results.
+
+ | Benchmark | MERaLiON-3-10B | MERaLiON-2-10B | MERaLiON-2-10B-ASR | MERaLiON-2-3B |
+ | :--- | :---: | :---: | :---: | :---: |
+ | ASR (lower is better) | <u>**0.125**</u> | 0.1485 | 0.1332 | 0.1697 |
+ | Speech Instruction | <u>**76.90**</u> | 70.20 | 13.40 | 19.10 |
+ | Audio Scene Question Answering | <u>**56.98**</u> | 51.14 | 49.51 | 46.14 |
+ | Spoken QA (Singlish) | <u>**67.25**</u> | 66.55 | 61.85 | 59.70 |
+ | Audio Captioning | <u>**38.31**</u> | 35.60 | 34.47 | 33.24 |
+ | Spoken Dialogue Summarisation | <u>**56.45**</u> | 53.10 | 55.80 | 48.55 |
+ | Spoken QA (English) | <u>**83.42**</u> | 79.74 | 73.98 | 68.72 |
+ | Music Understanding | <u>**76.07**</u> | 63.94 | 60.66 | 55.60 |
+ | Accent Recognition | 57.47 | 41.82 | 47.79 | <u>**60.05**</u> |
+ | Speech Translation | <u>**28.83**</u> | 27.39 | 28.54 | 22.13 |
+
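The ASR score in the table above is a word error rate (WER), where lower is better. As a rough, self-contained illustration (AudioBench computes this with its own scoring pipeline, typically after text normalisation), WER is the word-level edit distance between the reference and hypothesis transcripts, divided by the reference length:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("please transcribe this speech", "please transcribe the speech"))  # 0.25
```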
+
+ ## How to Use
+ > [!WARNING]
+ > **Out-of-scope use**: This model is not intended for use in tool calling, math, and coding tasks.
+
+ MERaLiON-3 requires `transformers` version `4.56.2`:
+
+ ```
+ pip install transformers==4.56.2
+ pip install librosa
+ ```
+
+ To run on GPU, MERaLiON-3 requires `flash-attn`:
+
+ ```
+ pip install flash-attn --no-build-isolation
+ ```
+
+ > [!TIP]
+ > Should you face any difficulties installing the above packages, you can try installing within this Docker container instead:
+ > `pytorch/pytorch:2.5.1-cuda12.1-cudnn9-devel`, whose CUDA and torch environments have been tested and are known to work.
+
+ ### Audio Input
+
+ - For ASR tasks, we suggest a maximum audio length of 30 seconds at 16,000 Hz.
+ - For general speech and audio understanding tasks, the maximum audio length we tested was 300 seconds at a 16,000 Hz sampling rate.
+
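In practice, `librosa.load(path, sr=16000, mono=True)` handles both the resampling and the mono downmix for you. If you already have an in-memory waveform, the same preprocessing can be sketched with NumPy alone (a simple linear-interpolation resampler, shown for illustration only; a proper resampler such as librosa's is preferable for audio quality):

```python
import numpy as np

def to_mono_16k(audio: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Downmix to mono and resample via linear interpolation (illustrative sketch)."""
    if audio.ndim == 2:  # (channels, samples) -> average channels to mono
        audio = audio.mean(axis=0)
    if orig_sr != target_sr:
        duration = audio.shape[0] / orig_sr
        n_target = int(round(duration * target_sr))
        t_old = np.linspace(0.0, duration, num=audio.shape[0], endpoint=False)
        t_new = np.linspace(0.0, duration, num=n_target, endpoint=False)
        audio = np.interp(t_new, t_old, audio)
    return audio.astype(np.float32)

# 2 seconds of stereo audio at 44.1 kHz -> 32,000 mono samples at 16 kHz
stereo = np.random.randn(2, 88200).astype(np.float32)
mono16k = to_mono_16k(stereo, orig_sr=44100)
print(mono16k.shape)  # (32000,)
```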
+ ### Text Prompt
+
+ MERaLiON-3 is trained with this prompt template:
+
+ ```
+ Instruction: <TextHere> \nFollow the text instruction based on the following audio: <SpeechHere>
+ ```
+
+ It is generally recommended to follow this template, i.e., replace `<TextHere>` with your text instruction while leaving `<SpeechHere>` untouched. We list a few useful example prompts here:
+
+ **Standard prompts for better accuracy**
+
+ ```python
+ prompt_template = "Instruction: {query} \nFollow the text instruction based on the following audio: <SpeechHere>"
+
+ transcription_prompt = prompt_template.format(query="Please transcribe this speech.")
+ translation_prompt = prompt_template.format(query="Please translate the speech into Malay")
+ summarization_prompt = prompt_template.format(query="Please summarize this speech")
+ audio_captioning_prompt_1 = prompt_template.format(query="Please describe the audio")
+ audio_captioning_prompt_2 = prompt_template.format(query="Please create a caption for the audio")
+ audio_scene_understanding_prompt = prompt_template.format(query="Are there people crying in the audio?")
+ speech_as_instruction_prompt = prompt_template.format(query="Please respond to the audio")  # use when the audio clip itself contains a spoken instruction
+ emotion_recognition_prompt_1 = prompt_template.format(query="What is the emotion of the speaker")
+ emotion_recognition_prompt_2 = prompt_template.format(query="Describe the paralinguistic features of the audio")
+ gender_recognition_prompt = prompt_template.format(query="What is the gender of the speaker")
+ ```
+
+ **More flexible prompts for enriched responses**
+
+ ```python
+ prompt_template = "Instruction: {query} \nFollow the text instruction based on the following audio: <SpeechHere>"
+
+ prompt_1 = prompt_template.format(query="Describe the paralinguistic features and return them in JSON format.")
+ prompt_2 = prompt_template.format(query="Please summarize the content of the speech and analyse the paralinguistic features of this audio. Return in JSON format.")
+ prompt_3 = prompt_template.format(query="Please translate this speech to Singapore's 4 official languages.")
+ ```
+
+ **AI agent prompts (beyond the default prompt template)**
+
+ ```python
+ prompt_1 = \
+ """
+ You are MERaLiON-AudioLLM, an empathic AI assistant developed by A*STAR. MERaLiON stands for Multimodal Empathetic Reasoning and Learning in One Network.
+ You are a friendly and empathetic conversational partner, and are proficient in understanding human emotions, accents, and genders from paralinguistic features.
+ Maintain a tone that is warm, non-judgmental, and supportive while replying to the user.
+
+ User's voice: <SpeechHere>
+ """
+ ```
+
+ ### Huggingface CPU Inference
+
+ ```python
+ import librosa
+ from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
+
+ repo_id = "MERaLiON/MERaLiON-3-10B"
+
+ processor = AutoProcessor.from_pretrained(
+     repo_id,
+     trust_remote_code=True,
+ )
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(
+     repo_id,
+     use_safetensors=True,
+     trust_remote_code=True,
+ )
 
+ prompt_template = "Instruction: {query} \nFollow the text instruction based on the following audio: <SpeechHere>"
+ transcribe_prompt = "Please transcribe this speech."
+ translate_prompt = "Can you please translate this speech into written Chinese?"
 
+ # batch inference of 2 samples
+ conversation = [
+     [{"role": "user", "content": prompt_template.format(query=transcribe_prompt)}],
+     [{"role": "user", "content": prompt_template.format(query=translate_prompt)}],
+ ]
+
+ chat_prompt = processor.tokenizer.apply_chat_template(
+     conversation=conversation,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+
+ # Use mono audio sampled at 16,000 Hz.
+ audio_array, sample_rate = librosa.load("/path/to/your/audio/file", sr=16000)
+ audio_array = [audio_array] * 2
+ inputs = processor(text=chat_prompt, audios=audio_array)
+
+ # Adjust `max_new_tokens` based on your use case.
+ # Please note the inclusion of `no_repeat_ngram_size=6`.
+ outputs = model.generate(**inputs, max_new_tokens=256, no_repeat_ngram_size=6)
+ generated_ids = outputs[:, inputs['input_ids'].size(1):]
+ response = processor.batch_decode(generated_ids, skip_special_tokens=True)
+ ```
+
+ ### Huggingface GPU Inference
+
+ ```python
+ import torch
+ import librosa
+ from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
+
+ repo_id = "MERaLiON/MERaLiON-3-10B"
+ device = "cuda"
+
+ processor = AutoProcessor.from_pretrained(
+     repo_id,
+     trust_remote_code=True,
+ )
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(
+     repo_id,
+     use_safetensors=True,
+     trust_remote_code=True,
+     attn_implementation="flash_attention_2",
+     torch_dtype=torch.bfloat16
+ ).to(device)
+
+ prompt_template = "Instruction: {query} \nFollow the text instruction based on the following audio: <SpeechHere>"
+ transcribe_prompt = "Please transcribe this speech."
+ translate_prompt = "Can you please translate this speech into written Chinese?"
+
+ # batch inference of 2 samples
+ conversation = [
+     [{"role": "user", "content": prompt_template.format(query=transcribe_prompt)}],
+     [{"role": "user", "content": prompt_template.format(query=translate_prompt)}],
+ ]
+
+ chat_prompt = processor.tokenizer.apply_chat_template(
+     conversation=conversation,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+
+ # Use mono audio sampled at 16,000 Hz.
+ audio_array, sample_rate = librosa.load("/path/to/your/audio/file", sr=16000)
+ audio_array = [audio_array] * 2
+ inputs = processor(text=chat_prompt, audios=audio_array)
+
+ inputs = inputs.to(device, dtype=torch.bfloat16)
+
+ # Adjust `max_new_tokens` based on your use case.
+ # Please note the inclusion of `no_repeat_ngram_size=6`.
+ outputs = model.generate(**inputs, max_new_tokens=256, no_repeat_ngram_size=6)
+ generated_ids = outputs[:, inputs['input_ids'].size(1):]
+ response = processor.batch_decode(generated_ids, skip_special_tokens=True)
+ ```
+
+ ## ⚠️ Disclaimer
+
+ The current MERaLiON-3 has not been specifically aligned for safety and may generate content that is inappropriate, offensive, or harmful. Developers and users are responsible for performing their own safety fine-tuning and implementing necessary security measures. The authors shall not be held liable for any claims, damages, or other liabilities arising from the use of the released models, weights, or code.
+
+ ### Compute and Infrastructure
+
+ MERaLiON-3 was trained on the [**ASPIRE 2A+**](https://help.nscc.sg/aspire2aplus/about/) Supercomputer Cluster, provided by the [**National Supercomputing Centre (NSCC)**](https://www.nscc.sg/), Singapore. The ASPIRE 2A+ cluster provides multiple H100 nodes, each equipped with 8 Nvidia H100 GPUs, 2 TB of RAM, and 30 TB of locally attached NVMe storage. These nodes are interconnected via a rail-optimised, full fat-tree topology using 400 Gb/s NDR InfiniBand cables. Additionally, the cluster incorporates a 2.5 PB SSD-based Lustre file system, linked to the H100 nodes through high-speed InfiniBand connections.
+
+ With a global batch size of 768, we trained the current release of MERaLiON-3 for around 200k steps, which took around 2 days on 16 nodes (128 H100 GPUs).
+
+ ## 📚 Citation
+
+ If you find our work useful, please cite our papers:
+
+ [MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models](https://arxiv.org/abs/2412.09818) <br>
+ [AudioBench: A Universal Benchmark for Audio Large Language Models](https://aclanthology.org/2025.naacl-long.218/) <br>
+ [Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models](https://arxiv.org/abs/2501.01034) <br>
+ [MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders](https://arxiv.org/abs/2409.06635) <br>
+ [MERaLiON-TextLLM: Cross-Lingual Understanding of Large Language Models in Chinese, Indonesian, Malay, and Singlish](https://arxiv.org/abs/2501.08335) <br>
+
+ ```
+ @misc{he2024meralionaudiollmtechnicalreport,
+     title={MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models},
+     author={{MERaLiON Team}},
+     year={2024},
+     eprint={2412.09818},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL},
+     url={https://arxiv.org/abs/2412.09818},
+ }
+ ```
+
+ ```
+ @article{wang2024audiobench,
+     title={AudioBench: A Universal Benchmark for Audio Large Language Models},
+     author={Wang, Bin and Zou, Xunlong and Lin, Geyu and Sun, Shuo and Liu, Zhuohan and Zhang, Wenyu and Liu, Zhengyuan and Aw, AiTi and Chen, Nancy F},
+     journal={NAACL},
+     year={2025}
+ }
+ ```
+
+ ```
+ @article{wang2025advancing,
+     title={Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models},
+     author={Wang, Bin and Zou, Xunlong and Sun, Shuo and Zhang, Wenyu and He, Yingxu and Liu, Zhuohan and Wei, Chengwei and Chen, Nancy F and Aw, AiTi},
+     journal={arXiv preprint arXiv:2501.01034},
+     year={2025}
+ }
+ ```
+
+ ```
+ @article{zhang2024mowe,
+     title={MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders},
+     author={Zhang, Wenyu and Sun, Shuo and Wang, Bin and Zou, Xunlong and Liu, Zhuohan and He, Yingxu and Lin, Geyu and Chen, Nancy F and Aw, Ai Ti},
+     journal={ICASSP},
+     year={2025}
+ }
+ ```
+
+ ```
+ @misc{huang2025meraliontextllmcrosslingualunderstandinglarge,
+     title={MERaLiON-TextLLM: Cross-Lingual Understanding of Large Language Models in Chinese, Indonesian, Malay, and Singlish},
+     author={Xin Huang and Tarun Kumar Vangani and Minh Duc Pham and Xunlong Zou and Bin Wang and Zhengyuan Liu and Ai Ti Aw},
+     year={2025},
+     eprint={2501.08335},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL},
+     url={https://arxiv.org/abs/2501.08335},
+ }
+ ```