huzy0 committed · verified
Commit f422437 · Parent(s): 42668b9

Update README.md

Files changed (1): README.md (+15 −9)
@@ -31,7 +31,7 @@ We introduce **MERaLiON-SpeechEncoder-2**, our next-generation multilingual spee
  Unlike many existing models optimized for high-resource, Western languages, MERaLiON-SpeechEncoder-2 is designed from the ground up to reflect the linguistic diversity and complexity of Southeast Asia. Our training data was curated to contain a substantial amount originating from Singapore and SEA, including 60,000 hours of Singapore-accented speech, with a further 160,000 hours covering Singapore’s official languages Chinese, Malay and Tamil, along with a smaller portion of dialects like Hokkien and Cantonese. SEA data amounts to 200,000 hours, including significant proportions of Malay, Thai, Indonesian, Vietnamese, with smaller amounts of Tagalog, Burmese, Javanese, Sundanese, Khmer and Lao. See below for a regional breakdown of the language coverage of our pre-training data.
 
  <p align="center">
- <img src="data2.svg" width="620"/>
+ <img src="data2.svg" width="600"/>
  </p>
 
  ## Model Highlights
@@ -46,7 +46,7 @@ Building on [MERaLiON-SpeechEncoder-v1](https://huggingface.co/MERaLiON/MERaLiON
  The model retains near state-of-the-art results on the SUPERB benchmark for English, and showcases strong multilingual capabilities demonstrated through its integration into a [high-performance ASR system](#automatic-speech-recognition-asr).
 
  #### Innovative pre-training techniques
- MERaLiON-SpeechEncoder-2 was trained from scratch with a **novel extension of the BEST-RQ** self-supervised objective, by using more informative latent targets. We also adopted the **Muon optimizer**, which has previously only been shown to outperform the widely-used AdamW optimizer for LLM training. We find its advantages also carry over to speech-based models.
+ MERaLiON-SpeechEncoder-2 was trained from scratch with a **novel extension of the BEST-RQ self-supervised objective**, by using more informative latent targets. We also adopted the **Muon optimizer**, which has previously only been shown to outperform the widely-used AdamW optimizer for LLM training. We find its advantages also carry over to speech-based models.
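For background on the objective the hunk above refers to: plain BEST-RQ derives discrete training targets by passing speech features through a frozen random projection and matching against a frozen random codebook. The sketch below shows only that baseline mechanism with illustrative dimensions — it does not reproduce the "more informative latent targets" of the extension, and all sizes here are assumptions, not the model's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen random projection and codebook (BEST-RQ keeps both fixed during training).
FEAT_DIM, CODE_DIM, CODEBOOK_SIZE = 80, 16, 8192

projection = rng.standard_normal((FEAT_DIM, CODE_DIM))
codebook = rng.standard_normal((CODEBOOK_SIZE, CODE_DIM))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

def bestrq_targets(features: np.ndarray) -> np.ndarray:
    """Map (T, FEAT_DIM) speech features to discrete target ids of shape (T,)."""
    proj = features @ projection                         # (T, CODE_DIM)
    proj /= np.linalg.norm(proj, axis=1, keepdims=True)  # normalize before matching
    # nearest codebook entry by cosine similarity becomes the prediction target
    return np.argmax(proj @ codebook.T, axis=1)

targets = bestrq_targets(rng.standard_normal((100, FEAT_DIM)))
print(targets.shape)  # (100,)
```

Because projection and codebook are never trained, the targets are cheap to compute and stable across the whole pre-training run.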
 
  ## Model Summary

@@ -84,7 +84,7 @@ Leveraging on the multilingual capabilities of MERaLiON-SpeechEncoder-2, we furt
 
  ## Direct Use
 
- The following code snippet can be used to directly obtain latent features i.e. encoded speech by forwarding through the model. Inputs into the model are expected to be 80-dimensional Mel-spectrogram features transformed from 16kHz sampled audio. The AutoFeatureExtractor method can carry out the conversion.
+ The following code snippet can be used to directly obtain latent features, i.e. encoded speech, by forwarding through the model. Inputs into the model are expected to be 80-dimensional Mel-spectrogram features transformed from audio sampled at 16kHz. The AutoFeatureExtractor method can carry out this conversion.
 
 ```python
  import torch
@@ -129,15 +129,19 @@ with torch.no_grad():
  attention_mask=inputs['attention_mask'],
  output_hidden_states=True)
 
- # output is a Wav2Vec2BaseModelOutput or tuple containing:
- # last_hidden_state: torch.FloatTensor containing hidden states of the last layer of the model
- # extract_features: torch.FloatTensor containing extracted features from the convolution downsampling layers
- # hidden_states: tuple(torch.FloatTensor) containing hidden states of each layer of the model
- # attentions: tuple(torch.FloatTensor) containing attention states of each layer of the model
+ # output is a Wav2Vec2BaseModelOutput or tuple containing:
+ # last_hidden_state: torch.FloatTensor containing hidden states of the last layer of the model
+ # extract_features: torch.FloatTensor containing extracted features from the convolution downsampling layers
+ # hidden_states: tuple(torch.FloatTensor) containing hidden states of each layer of the model
+ # attentions: tuple(torch.FloatTensor) containing attention states of each layer of the model
 ```
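As a back-of-envelope check on the input contract described in the Direct Use section (16kHz audio in, 80-dimensional Mel features out), a typical Mel front-end uses a 25 ms analysis window with a 10 ms hop. Those window/hop values are assumed here for illustration only, not read from the model's configuration:

```python
SR = 16_000          # expected input sample rate
N_MELS = 80          # Mel feature dimension the encoder consumes
WIN, HOP = 400, 160  # 25 ms window / 10 ms hop at 16 kHz (assumed typical values)

def num_frames(n_samples: int, win: int = WIN, hop: int = HOP) -> int:
    """Number of full analysis frames for a waveform of n_samples."""
    return 1 + (n_samples - win) // hop if n_samples >= win else 0

# one second of audio yields a (num_frames, N_MELS) feature matrix
print(num_frames(SR), N_MELS)  # 98 80
```

The model's own feature extractor then applies further convolutional downsampling on top of these frames, as the `extract_features` output above reflects.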
 
  ## Downstream Use
 
+ <p align="center">
+ <img src="downstream.svg" width="600"/>
+ </p>
+
  Speech encoders are normally used in finetuning setups to provide the frontend to downstream speech applications. We provide an example below of an ASR finetuning setup with Huggingface. Please refer to this [blog](https://huggingface.co/blog/fine-tune-w2v2-bert) for the full ASR finetuning recipe using Huggingface Trainer. Alternatively, the Huggingface model can be loaded to any other frameworks such as Pytorch or ESPnet for custom finetuning loops.
 
 ```python
@@ -209,13 +213,15 @@ model = AutoModelForCTC.from_pretrained(
  attention_dropout=0.1,
  )
  model = model.to(device)
+
+ # Continue with training loop...
 ```
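The `AutoModelForCTC` head in the hunk above is trained with the CTC objective, whose decoding step is easy to illustrate: frame-level predictions are collapsed by dropping repeated ids and then removing the blank symbol. A minimal greedy-decoding sketch in pure Python, with hypothetical label ids and the common convention of blank id 0:

```python
def ctc_collapse(frame_ids, blank=0):
    """Greedy CTC decoding: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out

# a blank between repeats keeps a genuine double label distinct
print(ctc_collapse([0, 3, 3, 0, 3, 5, 5, 0]))  # [3, 3, 5]
```

This is why CTC vocabularies reserve a dedicated blank token: without it, "ll" in a word could never be distinguished from a single "l" held over two frames.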
 
  ### Compute and Infrastructure
 
  MERaLiON-SpeechEncoder-2 was trained on the [**ASPIRE 2A+**](https://help.nscc.sg/aspire2aplus/about/) Supercomputer Cluster, provided by [**National Supercomputing Centre (NSCC)**](https://www.nscc.sg/), Singapore.
 
- MERaLiON-SpeechEncoder-2 was trained with 64 H100 GPUs across 8 nodes for collectively around 3.5 million steps. Training time took approximately 15 days.
+ The model was trained with 64 H100 GPUs across 8 nodes for collectively around 3.5 million steps. The SEA data split was upsampled in the last phase of training. Training time took approximately 15 days.
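As a rough consistency check on the figures in this hunk (ignoring restarts, checkpointing and warm-up), 3.5 million steps over 15 days averages out to:

```python
steps, days = 3_500_000, 15
steps_per_sec = steps / (days * 86_400)  # 86,400 seconds per day
print(round(steps_per_sec, 1))  # 2.7
```

About 2.7 optimizer steps per second across the 64-GPU job, which is a plausible global step rate for an encoder of this scale.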
 
  ## Citation
 