KrorngAI/TrorYongASR-tiny

@@ -48,7 +48,7 @@ pipeline_tag: automatic-speech-recognition
 ### Model Description
-This is an ASR (Automatic Speech Recognition) model inspired by [PARSeq](https://github.com/baudm/parseq/tree/main) and [Whisper](https://github.com/openai/whisper/tree/main). It is smaller than Whisper model of openai. **TrorYongASR** supports both Khmer and English languages.
 <div align="center">
@@ -81,7 +81,7 @@ This is an ASR (Automatic Speech Recognition) model inspired by [PARSeq](https:/
 ## Evaluation
 <!-- This section describes the evaluation protocols and provides the results. -->
-The evaluation assesses two capabilities — language detection and transcription — on two datasets ([`google/fleurs`](https://huggingface.co/datasets/Kimang18/google-fleurs-km-kh) for Khmer and [`openslr/librispeech_asr`](https://huggingface.co/datasets/openslr/librispeech_asr) for English). All results are from the test split of each dataset, representing the model's generalization ability to unseen data.
 ### Testing Data
@@ -95,7 +95,7 @@ The evaluation assesses two capabilities — language detection and transcriptio
 | **librispeech.clean** | English    | 2620          | Clean speech dataset for English transcription    |
 </div>
-**Note:** All evaluation results below are from the **test split** of each dataset. For `google/fleurs`, audios longer than `30 seconds` are excluded from the evaluation.
 ### Metrics and Results
@@ -115,16 +115,20 @@ The evaluation assesses two capabilities — language detection and transcriptio
 | **F1-score** | Harmonic mean of precision and recall |
 </div>
-#### Language Detection Results
 <div align="center">
- | Model | Dataset | Precision | Recall | Accuracy | F1-score |
-|-------|---------|-----------|--------|----------|----------|
-| Tiny | Khmer (`fleurs`) | 100% | 100% | 100% | 100% |
-|  | English (librispeech.clean) | 100% | 100% | 100% | 100% |
-| Small | Khmer (`fleurs`) | 100% | 100% | 100% | 100% |
-|  | English (librispeech.clean) | 100% | 100% | 100% | 100% |
 </div>
 **Key Finding:** Both model sizes achieved perfect language detection performance on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio.
@@ -145,30 +149,29 @@ The evaluation assesses two capabilities — language detection and transcriptio
 | **Word Error Rate (WER)** | Proportion of words that are incorrect |
 </div>
-**Note on Token Error Rate:** Token Error Rate measures model's capability in predicting the next token given the audio input and the current sequence of tokens. This metric is weaker than Word Error Rate (WER) and Character Error Rate (CER) because it doesn't account for insertions, deletions, and substitutions as comprehensively. Token Error Rate is used here because Khmer text lacks word boundaries, making WER and CER calculations challenging without additional preprocessing.
-**Note on Translation Task:** The models are also trained for `translation` task, but evaluation is deferred to future work due to scarce data (2000 samples from Khmer audio to English text, and 1000 samples from English audio to Khmer text).
-#### Transcription Results
 <div align="center">
 | Model | Metric | Khmer (`fleurs`) | English (`librispeech.clean`) | Mixed (Khmer + English) |
 |-------|--------|-------|---------|---------|
-| **Tiny** | Token Error Rate | 56% | 19% | 29% |
 | | Character Error Rate (CER) | 60.71% | 20.98% | 32.89% |
-| | Word Error Rate (WER) | 86.16% | 31.13% | 46.53% |
-| **Small** | Token Error Rate | 46% | 10% | 19% |
 | | Character Error Rate (CER) | 35.31% | 7.08% | 15.54% |
-| | Word Error Rate (WER) | 50.70% | 12.95% | 23.52% |
 </div>
 **Key Observations:**
-- The tiny model shows strong performance on English (19% token error rate, 20.98% CER, 31.13% WER)
-- Performance drops significantly for Khmer (56% token error rate, 60.71% CER, 86.16% WER)
-- The small model shows strong performance on English (10% token error rate, 7.08% CER, 12.95% WER)
-- Performance for Khmer is moderate (46% token error rate, 35.31% CER, 50.70% WER)
 - The larger model benefits from increased embedding dimension (768 vs 384) and more layers (12 vs 4)
 **Note:** To compute `CER` and `WER`, whitespace is inserted between words in Khmer text, which, unlike English text, has no word boundaries. The `khmercut` PyPI package tokenizes the Khmer text into words, and the words are then joined back together with whitespace.

 ### Model Description
+This is an ASR (Automatic Speech Recognition) model inspired by [PARSeq](https://github.com/baudm/parseq/tree/main) and [Whisper](https://github.com/openai/whisper/tree/main). **TrorYongASR** is smaller than OpenAI's Whisper model and supports only Khmer and English.
 <div align="center">
 ## Evaluation
 <!-- This section describes the evaluation protocols and provides the results. -->
+The evaluation assesses two capabilities — language detection and transcription — on two datasets ([`google/fleurs`](https://huggingface.co/datasets/Kimang18/google-fleurs-km-kh) for Khmer and [`openslr/librispeech_asr`](https://huggingface.co/datasets/openslr/librispeech_asr) for English). All results are from the **test split** of each dataset, representing the model's generalization ability to unseen data.
 ### Testing Data
 | **librispeech.clean** | English    | 2620          | Clean speech dataset for English transcription    |
 </div>
+**Note:** All evaluation results below are from the **test split** of each dataset. Audio clips longer than `30 seconds` are excluded from the evaluation, which is why `google/fleurs` has 765 examples instead of 771.
 ### Metrics and Results
 | **F1-score** | Harmonic mean of precision and recall |
 </div>
+**Results:**
 <div align="center">
+| Model | Metric | Khmer (`fleurs`) | English (`librispeech.clean`) |
+|-------|---------|------------------|-----------------------------|
+| Tiny | Precision | 100% | 100% |
+|  | Recall | 100% | 100% |
+|  | Accuracy | 100% | 100% |
+|  | F1-score | 100% | 100% |
+| Small | Precision | 100% | 100% |
+|  | Recall | 100% | 100% |
+|  | Accuracy | 100% | 100% |
+|  | F1-score | 100% | 100% |
 </div>
 **Key Finding:** Both model sizes achieved perfect language detection performance on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio.
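The four language detection metrics can be illustrated with a minimal sketch. It assumes each audio clip has one true language label and one predicted label, and treats one language as the positive class. The function name `binary_metrics` and the label strings `"km"` and `"en"` are hypothetical and not taken from the model's evaluation code.

```python
def binary_metrics(y_true, y_pred, positive="km"):
    """Precision, recall, accuracy, and F1 with `positive` as the positive class.

    `y_true` and `y_pred` are equal-length sequences of language labels.
    The label values ("km", "en") are illustrative assumptions.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}
```

When every prediction matches its label, as in the table above, all four values are 1.0 regardless of which language is chosen as the positive class.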
 | **Word Error Rate (WER)** | Proportion of words that are incorrect |
 </div>
+**Note on Token Error Rate:** Token Error Rate measures the model's ability to predict the next token given the audio input and the current sequence of tokens. It is a weaker metric than Word Error Rate (WER) and Character Error Rate (CER) because it does not account for insertions, deletions, substitutions, and autoregressive decoding as comprehensively. Token Error Rate is reported here because Khmer text lacks word boundaries, making WER and CER calculations challenging without additional preprocessing.
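One plausible reading of Token Error Rate is a position-by-position comparison between the predicted next tokens and the reference tokens. The sketch below follows that reading; it is an assumption, not the model's actual evaluation code, and the function name `token_error_rate` and the `pad_id` parameter are hypothetical.

```python
def token_error_rate(predicted_ids, target_ids, pad_id=None):
    """Fraction of positions where the predicted token differs from the target.

    Assumes the two sequences are aligned position by position, so there is
    no notion of insertion or deletion. Positions whose target equals
    `pad_id` are skipped when `pad_id` is given.
    """
    if len(predicted_ids) != len(target_ids):
        raise ValueError("sequences must be aligned and of equal length")

    errors = 0
    total = 0
    for pred, target in zip(predicted_ids, target_ids):
        if pad_id is not None and target == pad_id:
            continue  # ignore padding positions
        total += 1
        if pred != target:
            errors += 1
    return errors / total if total else 0.0
```

Because each position is scored independently, one wrong token counts as a single error here, whereas in free-running decoding it could shift or corrupt everything that follows. This is the sense in which the metric is weaker than WER and CER.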
+**Note on Translation Task:** The models are also trained for the `translation` task, but its evaluation is deferred to future work because data is scarce (pre-training includes only 2000 examples from Khmer audio to English text and 1000 examples from English audio to Khmer text).
+**Transcription Results:**
 <div align="center">
 | Model | Metric | Khmer (`fleurs`) | English (`librispeech.clean`) | Mixed (Khmer + English) |
 |-------|--------|-------|---------|---------|
+| **Tiny** | Word Error Rate (WER) | 86.16% | 31.13% | 46.53% |
 | | Character Error Rate (CER) | 60.71% | 20.98% | 32.89% |
+| | Token Error Rate | 56% | 19% | 29% |
+| **Small** | Word Error Rate (WER) | 50.70% | 12.95% | 23.52% |
 | | Character Error Rate (CER) | 35.31% | 7.08% | 15.54% |
+| | Token Error Rate | 46% | 10% | 19% |
 </div>
 **Key Observations:**
+- The tiny model performs best on English (31.13% WER, 20.98% CER, 19% token error rate)
+- Its performance drops significantly for Khmer (86.16% WER, 60.71% CER, 56% token error rate)
+- The small model shows strong performance on English (12.95% WER, 7.08% CER, 10% token error rate)
+- Its performance for Khmer is moderate (50.70% WER, 35.31% CER, 46% token error rate)
 - The larger model benefits from increased embedding dimension (768 vs 384) and more layers (12 vs 4)
 **Note:** To compute `CER` and `WER`, whitespace is inserted between words in Khmer text, which, unlike English text, has no word boundaries. The `khmercut` PyPI package tokenizes the Khmer text into words, and the words are then joined back together with whitespace.
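The preprocessing described in this note can be sketched as follows. This is a minimal illustration, not the evaluation script itself: the word segmenter is passed in as a callable (`segment` is a hypothetical parameter standing in for a `khmercut`-style tokenizer), and WER and CER are computed with a plain Levenshtein distance. Counting the inserted spaces as characters in CER is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def add_word_boundaries(text, segment):
    """Join the words returned by `segment` with single spaces.

    `segment` is any callable mapping a string to a list of words; it is a
    stand-in for a Khmer word segmenter.
    """
    return " ".join(segment(text))


def wer(reference, hypothesis):
    """Word error rate over whitespace-separated words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate; spaces count as characters (an assumption)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For Khmer, both the reference and the model output would first pass through `add_word_boundaries` with the same segmenter, so that word-level errors are counted on comparable units. English text already contains spaces and can be scored directly, for example `wer("the cat sat", "the cat sit")` gives one substitution out of three words.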