Automatic Speech Recognition
Transformers
Safetensors
Khmer
English
troryongasr
custom_code
Kimang18 committed
Commit 81a41c6 · verified · 1 Parent(s): 7b0eb8e

Update README.md

Files changed (1)
  1. README.md +28 -16
README.md CHANGED
@@ -1,6 +1,7 @@
1
  ---
2
  library_name: transformers
3
- license: mit
 
4
  datasets:
5
  - DDD-Cambodia/khm-asr-cultural
6
  - openslr/librispeech_asr
@@ -119,7 +120,7 @@ The evaluation assesses two capabilities — language detection and transcriptio
119
 
120
  <div align="center">
121
 
122
- | Model | Metrics | Khmer (`fleurs`) | English (librispeech.clean) |
123
  |-------|---------|------------------|-----------------------------|
124
  | Tiny | Precision | 100% | 100% |
125
  | | Recall | 100% | 100% |
@@ -177,7 +178,7 @@ The evaluation assesses two capabilities — language detection and transcriptio
177
  **Note:** To compute `CER` and `WER`, whitespaces are added between words in Khmer text (Khmer text does not have word boundaries like English text). To do so, `khmercut` PyPI package is used to tokenize Khmer text into words, and then the words are joined back together with whitespaces.
178
 
179
 
180
- #### WER Comparison with Whisper
181
 
182
  | Tiny | Parameters | Khmer (`fleurs`) | English (`librispeech.clean`) |
183
  |-------|--------|---------------------------| --- |
@@ -189,18 +190,20 @@ The evaluation assesses two capabilities — language detection and transcriptio
189
  | TrorYongASR | 136M | 50.70% | 12.95% |
190
  | Whisper | 244M | 104.4% | 3.4% |
191
 
192
- **Comparison Notes:**
193
  - Whisper models have more parameters for comparable sizes (39M vs 29M for Tiny, 244M vs 136M for Small)
194
  - Whisper shows significantly lower word error rates on English (7.6% vs 31.13% for Tiny, 3.4% vs 12.95% for Small)
195
- - Whisper performs worse on Khmer (100.6% vs 86.16% for Tiny, 104.4% vs 50.70% for Small), suggesting overfitting or challenging evaluation conditions
196
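  - Error rates > 100% for Whisper on Khmer mean that the number of word errors (substitutions, deletions, and insertions) exceeds the number of reference words, which usually points to many inserted or hallucinated words rather than to overfitting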
197
 
 
 
198
 
199
  ### Result Summary
200
 
201
  **Language Detection:** Both model sizes achieved perfect 100% performance across all metrics (Precision, Recall, Accuracy, F1-score) on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio. This perfect score is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
202
 
203
- **Transcription:** The Small model shows strong performance on English (10% token error rate, 7.08% CER, 12.95% WER) and moderate performance for Khmer (46% token error rate, 35.31% CER, 50.70% WER). The Tiny model shows strong performance on English (19% token error rate, 20.98% CER, 31.13% WER) but significantly lower performance for Khmer (56% token error rate, 60.71% CER, 86.16% WER). The larger model benefits from increased embedding dimension (768 vs 384) and more layers (12 vs 4).
204
 
205
 
206
  ## How to Get Started with the Model
@@ -264,7 +267,7 @@ The model can be integrated into:
264
  - **No speech detection**: The model was not trained for this task. User needs to fine-tune the model for this task (TrorYongASRTokenizer has `<|nospeech|>` token.)
265
  - **Translate task**: The training data for translation task is scarce. User needs to fine-tune the model for better translation performance
266
  - **Noise robustness**: Performance may degrade in noisy environments
267
- - **No timestamp output**: The model does not timestamps output
268
 
269
  **Sociotechnical Limitations:**
270
  - **Accent variability**: May not perform well on diverse Khmer accents
@@ -281,6 +284,8 @@ Users (both direct and downstream) should be made aware of the risks, biases and
281
 
282
  ## Training Details
283
 
 
 
284
  ### Training Data
285
 
286
  <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
@@ -289,16 +294,16 @@ Users (both direct and downstream) should be made aware of the risks, biases and
289
 
290
  For transcription task, the model was trained on around 140 hours of Khmer audio and around 100 hours of English audio.
291
  Khmer datasets include [`DDD-Cambodia/khm-asr-cultural`](https://huggingface.co/datasets/DDD-Cambodia/khm-asr-cultural) (134.6 hours), [`openslr/openslr`](https://huggingface.co/datasets/Kimang18/openslr-SLR42/blob/main/README.md), and [`google/fleurs`](https://huggingface.co/datasets/Kimang18/google-fleurs-km-kh).
292
- Split `clean.100` of [`openslr/librispeech_asr`](https://huggingface.co/datasets/openslr/librispeech_asr) was used as English dataset.
293
 
294
  <div align="center">
295
 
296
  | Dataset | Language | Training examples | Validation examples | Description |
297
- | --------- | ---------- | ----------------- | ------------------- |- |
 
298
  | **openslr/openslr** | Khmer | 2906 | 0 | Multi-speaker TTS data for Khmer language (split `SLR42`) |
299
  | **google/fleurs** | Khmer | 1675 | 324 | TTS data for Khmer language (split `km_kh`) |
300
- | **DDD-Cambodia/khm-asr-cultural** | Khmer | 56716 | 0 | Khmer ASR Cultural Dataset (split `train`) |
301
- | **librispeech.clean** | English | 28539 | 2703 | Clean speech dataset for English transcription |
302
  </div>
303
 
304
  #### Translation Task
@@ -312,7 +317,7 @@ For translation task, the data was scarce: only 2000 examples for Khmer audio to
312
  #### Preprocessing
313
 
314
  Following `Whisper` model of openai, audios with duration longer than 30 seconds are filtered out.
315
- All audios have `16000` sample rate.
316
  For English dataset, all texts are in lowercase.
317
 
318
  #### Training Hyperparameters
@@ -329,8 +334,9 @@ For English dataset, all texts are in lowercase.
329
 
330
  #### Speeds, Sizes, Times
331
 
332
- The training was conducted over 4000 optimizer steps on 1 GPU Tesla T4.
333
- The training took around 10 hours.
 
334
 
335
 
336
  ## Citation [optional]
@@ -344,8 +350,14 @@ The training took around 10 hours.
344
 
345
  ## Model Card Author
346
 
347
- ឈ្មោះ: បណ្ឌិត ឃុន គីមអាង
348
- Name: KHUN Kimang (Ph.D.)
 
 
 
 
 
 
349
 
350
  ## Model Card Contact
351
 
 
1
  ---
2
  library_name: transformers
3
+ license: other
4
+ license_name: modified-mit
5
  datasets:
6
  - DDD-Cambodia/khm-asr-cultural
7
  - openslr/librispeech_asr
 
120
 
121
  <div align="center">
122
 
123
+ | Model | Metrics | Khmer (`fleurs`) | English (`librispeech.clean`) |
124
  |-------|---------|------------------|-----------------------------|
125
  | Tiny | Precision | 100% | 100% |
126
  | | Recall | 100% | 100% |
 
178
  **Note:** To compute `CER` and `WER`, whitespaces are added between words in Khmer text (Khmer text does not have word boundaries like English text). To do so, `khmercut` PyPI package is used to tokenize Khmer text into words, and then the words are joined back together with whitespaces.
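The note above can be illustrated with a minimal sketch of how `WER` and `CER` might be computed once the text has been split into words. This is not the evaluation script behind the reported numbers: the helper names are hypothetical, the default tokenizer is plain whitespace splitting, and counting the inserted spaces as characters in `CER` is an assumption.

```python
from typing import Callable, List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: substitutions + deletions + insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str,
        tokenize: Callable[[str], List[str]] = str.split) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = tokenize(reference)
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, tokenize(hypothesis)) / len(ref_words)


def cer(reference: str, hypothesis: str,
        tokenize: Callable[[str], List[str]] = str.split) -> float:
    """Character error rate on text whose words are re-joined with spaces."""
    ref = " ".join(tokenize(reference))
    hyp = " ".join(tokenize(hypothesis))
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# For Khmer, a word segmenter would replace the default tokenizer, e.g.
# (assumption: the `khmercut` package exposes a `tokenize` function):
#   from khmercut import tokenize
#   wer(reference_km, hypothesis_km, tokenize=tokenize)
```

Because insertions count as errors while the denominator is the reference length, a hypothesis much longer than its reference yields a `WER` above 100%.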
179
 
180
 
181
+ **WER Comparison with Whisper:**
182
 
183
  | Tiny | Parameters | Khmer (`fleurs`) | English (`librispeech.clean`) |
184
  |-------|--------|---------------------------| --- |
 
190
  | TrorYongASR | 136M | 50.70% | 12.95% |
191
  | Whisper | 244M | 104.4% | 3.4% |
192
 
193
+ **Key Observations:**
194
  - Whisper models have more parameters for comparable sizes (39M vs 29M for Tiny, 244M vs 136M for Small)
195
  - Whisper shows significantly lower word error rates on English (7.6% vs 31.13% for Tiny, 3.4% vs 12.95% for Small)
196
+ - Whisper performs worse on Khmer (100.6% vs 86.16% for Tiny, 104.4% vs 50.70% for Small)
197
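  - Error rates > 100% for Whisper on Khmer mean that the number of word errors (substitutions, deletions, and insertions) exceeds the number of reference words, which usually points to many inserted or hallucinated words rather than to overfitting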
198
 
199
+ **Note:** The Whisper `WER` figures are taken from the Whisper [paper](https://arxiv.org/abs/2212.04356).
200
+
201
 
202
  ### Result Summary
203
 
204
  **Language Detection:** Both model sizes achieved perfect 100% performance across all metrics (Precision, Recall, Accuracy, F1-score) on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio. This perfect score is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
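The fixed-prefix idea described above can be sketched as follows. The README does not say how the permutation is implemented, so a uniform shuffle of the word tokens is an assumption, and the token strings in the usage note are placeholders rather than the tokenizer's real special tokens.

```python
import random
from typing import List, Optional

PREFIX_LEN = 3  # start token, language token, task token stay in place


def permute_word_tokens(tokens: List[str],
                        rng: Optional[random.Random] = None) -> List[str]:
    """Shuffle tokens from position 3 onward; keep the first three fixed."""
    rng = rng or random.Random()
    prefix = tokens[:PREFIX_LEN]
    words = tokens[PREFIX_LEN:]  # slicing copies, so the input is not mutated
    rng.shuffle(words)
    return prefix + words
```

For example, `permute_word_tokens(["<start>", "<lang>", "<task>", "w1", "w2", "w3"])` always returns a sequence whose first three entries are unchanged, so the language token at position 1 is seen in the same place in every training example.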
205
 
206
+ **Transcription:** The Small model shows strong performance on English (12.95% WER, 7.08% CER, 10% token error rate) and moderate performance on Khmer (50.70% WER, 35.31% CER, 46% token error rate). The Tiny model also handles English (31.13% WER, 20.98% CER, 19% token error rate) better than Khmer (86.16% WER, 60.71% CER, 56% token error rate), where its performance is significantly lower. This suggests that `TrorYongASR` scales to higher performance with model size; both models use the same pre-training configuration (see the Training Details section below).
207
 
208
 
209
  ## How to Get Started with the Model
 
267
  - **No speech detection**: The model was not trained for this task. User needs to fine-tune the model for this task (TrorYongASRTokenizer has `<|nospeech|>` token.)
268
  - **Translate task**: The training data for translation task is scarce. User needs to fine-tune the model for better translation performance
269
  - **Noise robustness**: Performance may degrade in noisy environments
270
+ - **No timestamp output**: The model does not support timestamp output
271
 
272
  **Sociotechnical Limitations:**
273
  - **Accent variability**: May not perform well on diverse Khmer accents
 
284
 
285
  ## Training Details
286
 
287
+ To assess the model's scalability, both the Tiny and Small variants were trained with the same configuration, detailed below.
288
+
289
  ### Training Data
290
 
291
  <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 
294
 
295
  For transcription task, the model was trained on around 140 hours of Khmer audio and around 100 hours of English audio.
296
  Khmer datasets include [`DDD-Cambodia/khm-asr-cultural`](https://huggingface.co/datasets/DDD-Cambodia/khm-asr-cultural) (134.6 hours), [`openslr/openslr`](https://huggingface.co/datasets/Kimang18/openslr-SLR42/blob/main/README.md), and [`google/fleurs`](https://huggingface.co/datasets/Kimang18/google-fleurs-km-kh).
297
+ Split `clean.100` of [`openslr/librispeech_asr`](https://huggingface.co/datasets/openslr/librispeech_asr) was used as the English dataset.
298
 
299
  <div align="center">
300
 
301
  | Dataset | Language | Training examples | Validation examples | Description |
302
+ | --------- | ---------- | ----------------- | ------------------- |- |
303
+ | **DDD-Cambodia/khm-asr-cultural** | Khmer | 56716 | 0 | Khmer ASR Cultural Dataset (split `train`) |
304
  | **openslr/openslr** | Khmer | 2906 | 0 | Multi-speaker TTS data for Khmer language (split `SLR42`) |
305
  | **google/fleurs** | Khmer | 1675 | 324 | TTS data for Khmer language (split `km_kh`) |
306
+ | **librispeech\_asr.clean** | English | 28539 | 2703 | Clean speech dataset for English transcription |
 
307
  </div>
308
 
309
  #### Translation Task
 
317
  #### Preprocessing
318
 
319
  Following `Whisper` model of openai, audios with duration longer than 30 seconds are filtered out.
320
+ All audio has a sample rate of `16,000` Hz.
321
  For English dataset, all texts are in lowercase.
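As a rough illustration of the preprocessing rules above, the sketch below drops clips longer than 30 seconds and lowercases English transcripts. The example fields (`num_samples`, `language`, `text`) and the `"en"` language code are hypothetical, and the audio is assumed to be at 16,000 Hz already; this is not the project's actual pipeline.

```python
from typing import Dict, Iterable, List

MAX_DURATION_S = 30.0  # clips longer than this are filtered out
SAMPLE_RATE = 16_000   # samples per second, assumed for every clip


def duration_seconds(num_samples: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Clip length in seconds, derived from the sample count."""
    return num_samples / sample_rate


def preprocess(examples: Iterable[Dict]) -> List[Dict]:
    """Keep clips of at most 30 s and lowercase English transcripts."""
    kept = []
    for ex in examples:
        if duration_seconds(ex["num_samples"]) > MAX_DURATION_S:
            continue  # too long, following the Whisper-style 30 s limit
        text = ex["text"].lower() if ex["language"] == "en" else ex["text"]
        kept.append({**ex, "text": text})  # copy so the input is untouched
    return kept
```

A clip of exactly 30 seconds is kept, since only audio longer than 30 seconds is described as filtered out.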
322
 
323
  #### Training Hyperparameters
 
334
 
335
  #### Speeds, Sizes, Times
336
 
337
+ The training was conducted over 4000 optimizer steps.
338
+ For the Tiny variant, training took around 10 hours on 1 Tesla T4 GPU.
339
+ For the Small variant, training took around 2 hours on 1 A100 GPU.
340
 
341
 
342
  ## Citation [optional]
 
350
 
351
  ## Model Card Author
352
 
353
+ - ឈ្មោះ: បណ្ឌិត ឃុន គីមអាង
354
+ - Name: KHUN Kimang (Ph.D.)
355
+
356
+ ## Acknowledgement
357
+
358
+ `LightningAI` and `Google Colab` did not specifically sponsor this project.
359
+ However, both models were trained thanks to their free credits.
360
+ So, huge thanks to `LightningAI` and `Google Colab`.
361
 
362
  ## Model Card Contact
363