KrorngAI/TrorYongASR-tiny

@@ -83,9 +83,7 @@ This is an ASR (Automatic Speech Recognition) model inspired by [PARSeq](https:/
 <!-- This section describes the evaluation protocols and provides the results. -->
 The evaluation assesses two capabilities — language detection and transcription — on two datasets ([`google/fleurs`](https://huggingface.co/datasets/Kimang18/google-fleurs-km-kh) for Khmer and [`openslr/librispeech_asr`](https://huggingface.co/datasets/openslr/librispeech_asr) for English). All results are from the test split of each dataset, representing the model's generalization ability to unseen data.
-### Testing Data & Metrics
-#### Testing Data
 <!-- This should link to a Dataset Card if possible. -->
@@ -99,11 +97,11 @@ The evaluation assesses two capabilities — language detection and transcriptio
 **Note:** All evaluation results below are from the **test split** of each dataset. For `google/fleurs`, audio clips longer than `30 seconds` are excluded from the evaluation.
-#### Metrics
 <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-##### Language Detection
 **Task:** Given audio input, detect the language.
@@ -117,26 +115,6 @@ The evaluation assesses two capabilities — language detection and transcriptio
 | **F1-score** | Harmonic mean of precision and recall |
 </div>
-##### Transcription
-**Task:** Convert audio to text (transcription).
-<div align="center">
-| Metric | Description |
-|--------|-------------|
-| **Token Error Rate** | Proportion of incorrectly transcribed tokens |
-| **Character Error Rate (CER)** | Proportion of characters that are incorrect |
-| **Word Error Rate (WER)** | Proportion of words that are incorrect |
-</div>
-**Note on Token Error Rate:** Token Error Rate measures the model's ability to predict the next token given the audio input and the current sequence of tokens. This metric is weaker than Word Error Rate (WER) and Character Error Rate (CER) because it doesn't account for insertions, deletions, and substitutions as comprehensively. Token Error Rate is used here because Khmer text lacks word boundaries, making WER and CER calculations challenging without additional preprocessing.
-**Note on Translation Task:** The models are also trained for the `translation` task, but evaluation is deferred to future work due to scarce data (2000 samples from Khmer audio to English text, and 1000 samples from English audio to Khmer text).
-### Results
 #### Language Detection Results
 <div align="center">
@@ -154,6 +132,24 @@ The evaluation assesses two capabilities — language detection and transcriptio
 **Note on Language Detection Performance:** The 100% language detection scores may appear unusually high. This is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
 #### Transcription Results
 <div align="center">
@@ -177,7 +173,8 @@ The evaluation assesses two capabilities — language detection and transcriptio
 **Note:** To compute `CER` and `WER`, whitespace is added between words in Khmer text (Khmer text does not have word boundaries like English text). To do so, the `khmercut` PyPI package is used to tokenize Khmer text into words, and the words are then joined back together with whitespace.
-##### WER Comparison with Whisper
 | Tiny | Parameters | Khmer (`fleurs`) | English (`librispeech.clean`) |
 |-------|--------|---------------------------| --- |
@@ -196,7 +193,7 @@ The evaluation assesses two capabilities — language detection and transcriptio
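 - Error rates > 100% for Whisper on Khmer mean the number of edits (including inserted words) exceeds the number of words in the reference, which suggests the model produces largely unrelated output for Khmer audio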
-#### Result Summary
 **Language Detection:** Both model sizes achieved perfect 100% performance across all metrics (Precision, Recall, Accuracy, F1-score) on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio. This perfect score is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.

 <!-- This section describes the evaluation protocols and provides the results. -->
 The evaluation assesses two capabilities — language detection and transcription — on two datasets ([`google/fleurs`](https://huggingface.co/datasets/Kimang18/google-fleurs-km-kh) for Khmer and [`openslr/librispeech_asr`](https://huggingface.co/datasets/openslr/librispeech_asr) for English). All results are from the test split of each dataset, representing the model's generalization ability to unseen data.
+### Testing Data
 <!-- This should link to a Dataset Card if possible. -->
 **Note:** All evaluation results below are from the **test split** of each dataset. For `google/fleurs`, audio clips longer than `30 seconds` are excluded from the evaluation.
+### Metrics and Results
 <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+#### Language Detection
 **Task:** Given audio input, detect the language.
 | **F1-score** | Harmonic mean of precision and recall |
 </div>
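
As a rough illustration of the metrics in the table above, the following sketch computes them for a two-class language detection task. The label strings `"km"` and `"en"`, the function name `detection_metrics`, and the sample predictions are assumptions made for this example; they are not taken from the model's evaluation code or results.

```python
from collections import Counter


def detection_metrics(y_true, y_pred, positive="km"):
    """Precision, recall, accuracy, and F1 with `positive` as the positive class."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    tp, fp, fn, tn = (counts[k] for k in ("tp", "fp", "fn", "tn"))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(y_true) if y_true else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


# Hypothetical labels: one English clip is mistaken for Khmer.
y_true = ["km", "km", "en", "en"]
y_pred = ["km", "km", "km", "en"]
scores = detection_metrics(y_true, y_pred)
```

With these made-up labels, recall for Khmer is perfect while precision drops to 2/3, which is why the card reports all four metrics rather than accuracy alone.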
 #### Language Detection Results
 <div align="center">
 **Note on Language Detection Performance:** The 100% language detection scores may appear unusually high. This is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
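
The fixed-prefix permutation described in this note can be pictured with a small sketch. This is not the model's training code: the function name `permute_after_prefix`, the constant `NUM_FIXED`, and the use of Python's `random` module are assumptions chosen only to show the idea that positions 0 to 2 never move.

```python
import random

NUM_FIXED = 3  # start, language, and task tokens stay in place


def permute_after_prefix(seq_len, rng):
    """Return a position order that keeps 0..2 fixed and shuffles the rest."""
    prefix = list(range(min(NUM_FIXED, seq_len)))
    tail = list(range(NUM_FIXED, seq_len))
    rng.shuffle(tail)  # only positions from 3 onward are permuted
    return prefix + tail


# Example: an 8-token sequence with a seeded generator for reproducibility.
order = permute_after_prefix(8, random.Random(0))
```

Whatever order the tail takes, the language token at position 1 is always predicted from the same context, which matches the explanation given for the perfect detection scores.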
+#### Transcription
+**Task:** Convert audio to text (transcription).
+<div align="center">
+| Metric | Description |
+|--------|-------------|
+| **Token Error Rate** | Proportion of incorrectly transcribed tokens |
+| **Character Error Rate (CER)** | Proportion of characters that are incorrect |
+| **Word Error Rate (WER)** | Proportion of words that are incorrect |
+</div>
+**Note on Token Error Rate:** Token Error Rate measures the model's ability to predict the next token given the audio input and the current sequence of tokens. This metric is weaker than Word Error Rate (WER) and Character Error Rate (CER) because it doesn't account for insertions, deletions, and substitutions as comprehensively. Token Error Rate is used here because Khmer text lacks word boundaries, making WER and CER calculations challenging without additional preprocessing.
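
One plausible reading of Token Error Rate, sketched below, is a position-by-position comparison between the reference tokens and the tokens the model predicts when it is fed the reference prefix at each step. This interpretation, the function name `token_error_rate`, and the token IDs are assumptions for illustration; the card does not spell out the exact computation.

```python
def token_error_rate(predicted, reference):
    """Fraction of positions where the predicted next token differs from the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must be aligned position by position")
    if not reference:
        return 0.0
    errors = sum(p != r for p, r in zip(predicted, reference))
    return errors / len(reference)


# Hypothetical token IDs: one of five next-token predictions is wrong.
reference = [11, 42, 7, 99, 3]
predicted = [11, 42, 8, 99, 3]
rate = token_error_rate(predicted, reference)
```

Because the two sequences are aligned by position, this measure has no notion of an inserted or deleted token, which is consistent with the note's remark that it is weaker than WER and CER.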
+**Note on Translation Task:** The models are also trained for the `translation` task, but evaluation is deferred to future work due to scarce data (2000 samples from Khmer audio to English text, and 1000 samples from English audio to Khmer text).
 #### Transcription Results
 <div align="center">
 **Note:** To compute `CER` and `WER`, whitespace is added between words in Khmer text (Khmer text does not have word boundaries like English text). To do so, the `khmercut` PyPI package is used to tokenize Khmer text into words, and the words are then joined back together with whitespace.
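
The segment-then-join procedure in this note could look roughly like the sketch below. It uses a plain Levenshtein distance and takes the word segmenter as a parameter, so it runs without extra packages; for Khmer, the segmenter would presumably be the tokenizer from `khmercut`. The function names, the choice to count the inserted spaces as characters in CER, and the English sample strings are all assumptions made for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_rates(ref_text, hyp_text, segment):
    """Return (WER, CER) after segmenting both texts and re-joining words with spaces."""
    ref_words = segment(ref_text)
    hyp_words = segment(hyp_text)
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    ref_spaced = " ".join(ref_words)
    hyp_spaced = " ".join(hyp_words)
    wer = edit_distance(ref_words, hyp_words) / len(ref_words)
    cer = edit_distance(ref_spaced, hyp_spaced) / len(ref_spaced)
    return wer, cer


# English stand-in: str.split plays the role of the word segmenter.
wer, cer = error_rates("the cat sat", "the cat sat down", str.split)

# Three inserted words against a two-word reference.
wer_ins, _ = error_rates("a b", "a x y z b", str.split)
```

Both rates divide by the length of the reference, so a hypothesis with many inserted words can score above 1.0, as the second example does.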
+#### WER Comparison with Whisper
 | Tiny | Parameters | Khmer (`fleurs`) | English (`librispeech.clean`) |
 |-------|--------|---------------------------| --- |
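 - Error rates > 100% for Whisper on Khmer mean the number of edits (including inserted words) exceeds the number of words in the reference, which suggests the model produces largely unrelated output for Khmer audio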
+### Result Summary
 **Language Detection:** Both model sizes achieved perfect 100% performance across all metrics (Precision, Recall, Accuracy, F1-score) on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio. This perfect score is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.