KrorngAI/TrorYongASR-tiny

@@ -1,6 +1,7 @@
 ---
 library_name: transformers
-license: mit
 datasets:
 - DDD-Cambodia/khm-asr-cultural
 - openslr/librispeech_asr
@@ -119,7 +120,7 @@ The evaluation assesses two capabilities — language detection and transcriptio
 <div align="center">
-| Model | Metrics | Khmer (`fleurs`) | English (librispeech.clean) |
 |-------|---------|------------------|-----------------------------|
 | Tiny | Precision | 100% | 100% |
 |  | Recall | 100% | 100% |
@@ -177,7 +178,7 @@ The evaluation assesses two capabilities — language detection and transcriptio
 **Note:** To compute `CER` and `WER`, whitespaces are added between words in Khmer text (Khmer text does not have word boundaries like English text). To do so, `khmercut` PyPI package is used to tokenize Khmer text into words, and then the words are joined back together with whitespaces.
-#### WER Comparison with Whisper
 | Tiny | Parameters | Khmer (`fleurs`) | English (`librispeech.clean`) |
 |-------|--------|---------------------------| --- |
@@ -189,18 +190,20 @@ The evaluation assesses two capabilities — language detection and transcriptio
 | TrorYongASR | 136M | 50.70% | 12.95% |
 | Whisper | 244M | 104.4% | 3.4% |
-**Comparison Notes:**
 - Whisper models have more parameters for comparable sizes (39M vs 29M for Tiny, 244M vs 136M for Small)
 - Whisper shows significantly lower word error rates on English (7.6% vs 31.13% for Tiny, 3.4% vs 12.95% for Small)
-- Whisper performs worse on Khmer (100.6% vs 86.16% for Tiny, 104.4% vs 50.70% for Small), suggesting overfitting or challenging evaluation conditions
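 - Error rates above 100% for Whisper on Khmer mean its output contains more word errors (substitutions, deletions, and insertions) than there are reference words, which points to heavy insertion or hallucinated text rather than usable transcripts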
 ### Result Summary
 **Language Detection:** Both model sizes achieved perfect 100% performance across all metrics (Precision, Recall, Accuracy, F1-score) on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio. This perfect score is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
-**Transcription:** The Small model shows strong performance on English (10% token error rate, 7.08% CER, 12.95% WER) and moderate performance for Khmer (46% token error rate, 35.31% CER, 50.70% WER). The Tiny model shows strong performance on English (19% token error rate, 20.98% CER, 31.13% WER) but significantly lower performance for Khmer (56% token error rate, 60.71% CER, 86.16% WER). The larger model benefits from increased embedding dimension (768 vs 384) and more layers (12 vs 4).
 ## How to Get Started with the Model
@@ -264,7 +267,7 @@ The model can be integrated into:
 - **No speech detection**: The model was not trained for this task. User needs to fine-tune the model for this task (TrorYongASRTokenizer has `<|nospeech|>` token.)
 - **Translate task**: The training data for translation task is scarce. User needs to fine-tune the model for better translation performance
 - **Noise robustness**: Performance may degrade in noisy environments
-- **No timestamp output**: The model does not timestamps output
 **Sociotechnical Limitations:**
 - **Accent variability**: May not perform well on diverse Khmer accents
@@ -281,6 +284,8 @@ Users (both direct and downstream) should be made aware of the risks, biases and
 ## Training Details
 ### Training Data
 <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
@@ -289,16 +294,16 @@ Users (both direct and downstream) should be made aware of the risks, biases and
 For transcription task, the model was trained on around 140 hours of Khmer audio and around 100 hours of English audio.
 Khmer datasets include [`DDD-Cambodia/khm-asr-cultural`](https://huggingface.co/datasets/DDD-Cambodia/khm-asr-cultural) (134.6 hours), [`openslr/openslr`](https://huggingface.co/datasets/Kimang18/openslr-SLR42/blob/main/README.md), and [`google/fleurs`](https://huggingface.co/datasets/Kimang18/google-fleurs-km-kh).
-Split `clean.100` of [`openslr/librispeech_asr`](https://huggingface.co/datasets/openslr/librispeech_asr) was used as English dataset.
 <div align="center">
 | Dataset                            | Language   | Training examples | Validation examples | Description                                       |
-| ---------                          | ---------- | ----------------- | ------------------- |-                                                 |
 | **openslr/openslr**                | Khmer      | 2906              | 0                   | Multi-speaker TTS data for Khmer language (split `SLR42`) |
 | **google/fleurs**                  | Khmer      | 1675              | 324                 | TTS data for Khmer language (split `km_kh`) |
-| **DDD-Cambodia/khm-asr-cultural**  | Khmer      | 56716             | 0                   | Khmer ASR Cultural Dataset  (split `train`) |
-| **librispeech.clean**              | English    | 28539             | 2703                | Clean speech dataset for English transcription    |
 </div>
 #### Translation Task
@@ -312,7 +317,7 @@ For translation task, the data was scarce: only 2000 examples for Khmer audio to
 #### Preprocessing
 Following `Whisper` model of openai, audios with duration longer than 30 seconds are filtered out.
-All audios have `16000` sample rate.
 For English dataset, all texts are in lowercase.
 #### Training Hyperparameters
@@ -329,8 +334,9 @@ For English dataset, all texts are in lowercase.
 #### Speeds, Sizes, Times
-The training was conducted over 4000 optimizer steps on 1 GPU Tesla T4.
-The training took around 10 hours.
 ## Citation [optional]
@@ -344,8 +350,14 @@ The training took around 10 hours.
 ## Model Card Author
-ឈ្មោះ: បណ្ឌិត ឃុន គីមអាង
-Name: KHUN Kimang (Ph.D.)
 ## Model Card Contact

 ---
 library_name: transformers
+license: other
+license_name: modified-mit
 datasets:
 - DDD-Cambodia/khm-asr-cultural
 - openslr/librispeech_asr
 <div align="center">
+| Model | Metrics | Khmer (`fleurs`) | English (`librispeech.clean`) |
 |-------|---------|------------------|-----------------------------|
 | Tiny | Precision | 100% | 100% |
 |  | Recall | 100% | 100% |
 **Note:** To compute `CER` and `WER`, whitespaces are added between words in Khmer text (Khmer text does not have word boundaries like English text). To do so, `khmercut` PyPI package is used to tokenize Khmer text into words, and then the words are joined back together with whitespaces.
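The note above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation script used for the card: the function names (`edit_distance`, `normalize`, `wer`, `cer`) are hypothetical, and the word segmenter is passed in as a parameter because the exact `khmercut` call is not shown in this card. It also assumes that `CER` is computed on the whitespace-joined string, spaces included.

```python
from typing import Callable, List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def normalize(text: str, segment: Callable[[str], List[str]]) -> str:
    """Segment text into words, then join the words with single spaces."""
    return " ".join(segment(text))


def wer(ref: str, hyp: str, segment: Callable[[str], List[str]]) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref_words = normalize(ref, segment).split(" ")
    hyp_words = normalize(hyp, segment).split(" ")
    if not ref_words or ref_words == [""]:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def cer(ref: str, hyp: str, segment: Callable[[str], List[str]]) -> float:
    """Character error rate on the whitespace-joined text (assumption)."""
    ref_chars = normalize(ref, segment)
    hyp_chars = normalize(hyp, segment)
    if not ref_chars:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # English already has word boundaries, so str.split is enough here.
    # For Khmer, `segment` would wrap the khmercut tokenizer instead.
    print(wer("a b", "a b c d e", str.split))  # 1.5
```

Passing the segmenter as an argument keeps the metric code identical for both languages; only the word-splitting step differs.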
+**WER Comparison with Whisper:**
 | Tiny | Parameters | Khmer (`fleurs`) | English (`librispeech.clean`) |
 |-------|--------|---------------------------| --- |
 | TrorYongASR | 136M | 50.70% | 12.95% |
 | Whisper | 244M | 104.4% | 3.4% |
+**Key Observations:**
 - Whisper models have more parameters for comparable sizes (39M vs 29M for Tiny, 244M vs 136M for Small)
 - Whisper shows significantly lower word error rates on English (7.6% vs 31.13% for Tiny, 3.4% vs 12.95% for Small)
+- Whisper performs worse on Khmer (100.6% vs 86.16% for Tiny, 104.4% vs 50.70% for Small)
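 - Error rates above 100% for Whisper on Khmer mean its output contains more word errors (substitutions, deletions, and insertions) than there are reference words, which points to heavy insertion or hallucinated text rather than usable transcripts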
+**Note:** Whisper `WER` figures are taken from the Whisper [paper](https://arxiv.org/abs/2212.04356).
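For reference, the standard definition of `WER` shows why values above 100% are possible. With $S$ substitutions, $D$ deletions, and $I$ insertions measured against a reference of $N$ words:

$$
\mathrm{WER} = \frac{S + D + I}{N}
$$

Since $I$ is not bounded by $N$, a hypothesis that is much longer than the reference (for example, one with repeated or spurious words) can push `WER` past 100%.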
 ### Result Summary
 **Language Detection:** Both model sizes achieved perfect 100% performance across all metrics (Precision, Recall, Accuracy, F1-score) on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio. This perfect score is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
+**Transcription:** The Small model shows strong performance on English (12.95% WER, 7.08% CER, 10% token error rate) and moderate performance on Khmer (50.70% WER, 35.31% CER, 46% token error rate). The Tiny model performs reasonably on English (31.13% WER, 20.98% CER, 19% token error rate) but considerably worse on Khmer (86.16% WER, 60.71% CER, 56% token error rate). Because both models share the same pre-training configuration (see Training Details below), this suggests that `TrorYongASR` gains performance as it is scaled up.
 ## How to Get Started with the Model
 - **No speech detection**: The model was not trained for this task. Users need to fine-tune the model for it (`TrorYongASRTokenizer` has a `<|nospeech|>` token).
 - **Translate task**: Training data for the translation task is scarce. Users need to fine-tune the model for better translation performance
 - **Noise robustness**: Performance may degrade in noisy environments
+- **No timestamp output**: The model does not support timestamp output
 **Sociotechnical Limitations:**
 - **Accent variability**: May not perform well on diverse Khmer accents
 ## Training Details
+To assess the model's scalability, both the tiny and small variants were trained using the same configuration detailed below.
 ### Training Data
 <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 For the transcription task, the model was trained on around 140 hours of Khmer audio and around 100 hours of English audio.
 Khmer datasets include [`DDD-Cambodia/khm-asr-cultural`](https://huggingface.co/datasets/DDD-Cambodia/khm-asr-cultural) (134.6 hours), [`openslr/openslr`](https://huggingface.co/datasets/Kimang18/openslr-SLR42/blob/main/README.md), and [`google/fleurs`](https://huggingface.co/datasets/Kimang18/google-fleurs-km-kh).
+Split `clean.100` of [`openslr/librispeech_asr`](https://huggingface.co/datasets/openslr/librispeech_asr) was used as the English dataset.
 <div align="center">
 | Dataset                            | Language   | Training examples | Validation examples | Description                                       |
+| ---------                          | ---------- | ----------------- | ------------------- |-                                                  |
+| **DDD-Cambodia/khm-asr-cultural**  | Khmer      | 56716             | 0                   | Khmer ASR Cultural Dataset  (split `train`) |
 | **openslr/openslr**                | Khmer      | 2906              | 0                   | Multi-speaker TTS data for Khmer language (split `SLR42`) |
 | **google/fleurs**                  | Khmer      | 1675              | 324                 | TTS data for Khmer language (split `km_kh`) |
+| **librispeech\_asr.clean**              | English    | 28539             | 2703                | Clean speech dataset for English transcription    |
 </div>
 #### Translation Task
 #### Preprocessing
 Following OpenAI's `Whisper` model, audio clips longer than 30 seconds are filtered out.
+All audio has a sample rate of `16000` Hz.
 For the English dataset, all text is lowercased.
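The three preprocessing rules can be sketched as a small filter. This is an illustration under stated assumptions, not the training pipeline itself: the function name `preprocess` and the example fields (`text`, `language`, `num_samples`, `sampling_rate`) are hypothetical, and a clip at the wrong sample rate is treated as an error since the card states all audio is at 16000 Hz.

```python
from typing import Dict, Iterable, List

MAX_DURATION_S = 30.0       # clips longer than this are dropped (from the card)
TARGET_SAMPLE_RATE = 16000  # Hz (from the card)


def preprocess(examples: Iterable[Dict]) -> List[Dict]:
    """Apply the duration filter, sample-rate check, and English lowercasing.

    Each example is a dict with hypothetical keys:
    "text", "language" ("en" or "km"), "num_samples", "sampling_rate".
    """
    kept = []
    for ex in examples:
        if ex["sampling_rate"] != TARGET_SAMPLE_RATE:
            raise ValueError(
                f"expected {TARGET_SAMPLE_RATE} Hz, got {ex['sampling_rate']} Hz"
            )
        duration_s = ex["num_samples"] / ex["sampling_rate"]
        if duration_s > MAX_DURATION_S:
            continue  # drop clips longer than 30 seconds
        # Only English transcripts are lowercased; Khmer has no letter case.
        text = ex["text"].lower() if ex["language"] == "en" else ex["text"]
        kept.append({**ex, "text": text})
    return kept
```

A clip of exactly 30 seconds is kept here, since the rule only removes clips longer than 30 seconds.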
 #### Training Hyperparameters
 #### Speeds, Sizes, Times
+Training ran for 4000 optimizer steps.
+For the tiny variant, training took around 10 hours on 1 Tesla T4 GPU.
+For the small variant, training took around 2 hours on 1 A100 GPU.
 ## Citation [optional]
 ## Model Card Author
+- ឈ្មោះ: បណ្ឌិត ឃុន គីមអាង
+- Name: KHUN Kimang (Ph.D.)
+## Acknowledgement
+`LightningAI` and `Google Colab` did not specifically sponsor this project.
+However, both models were trained using their free credits.
+So, huge thanks to `LightningAI` and `Google Colab`.
 ## Model Card Contact