KrorngAI/TrorYongASR-tiny

@@ -34,7 +34,7 @@ pipeline_tag: automatic-speech-recognition
   <a href="https://kimang18.github.io" target="_blank"><img alt="Personal" src="https://img.shields.io/badge/KHUN-white?logoColor=white"/></a>
 </div>
 <div align="center" style="line-height: 1;">
-  <a href="https://huggingface.co/Kimang18/tror-yong-asr-tiny/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/badge/License-Modified_MIT-f5de53?&color=f5de53"/></a>
 </div>
@@ -143,10 +143,10 @@ The evaluation assesses two capabilities — language detection and transcriptio
  | Model | Dataset | Precision | Recall | Accuracy | F1-score |
 |-------|---------|-----------|--------|----------|----------|
-| Tiny | google/fleurs (Khmer) | 100% | 100% | 100% | 100% |
-|  | librispeech.clean (English) | 100% | 100% | 100% | 100% |
-| Small | google/fleurs (Khmer) | 100% | 100% | 100% | 100% |
-|  | librispeech.clean (English) | 100% | 100% | 100% | 100% |
 </div>
 **Key Finding:** Both model sizes achieved perfect language detection performance on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio.
@@ -158,8 +158,8 @@ The evaluation assesses two capabilities — language detection and transcriptio
 <div align="center">
-| Model | Metric | Khmer | English | Combined (Khmer + English) |
-|-------|--------|---------------------------|-------|---------|
 | **Tiny** | Token Error Rate | 56% | 19% | 29% |
 | | Character Error Rate (CER) | 60.71% | 20.98% | 32.89% |
 | | Word Error Rate (WER) | 86.16% | 31.13% | 46.53% |
@@ -175,9 +175,28 @@ The evaluation assesses two capabilities — language detection and transcriptio
 - Performance for Khmer is moderate (46% token error rate, 35.31% CER, 50.70% WER)
 - The larger model benefits from increased embedding dimension (768 vs 384) and more layers (12 vs 4)
-**Note:** To compute `CER` and `WER`, whitespaces are added between words in Khmer text (Khmer text has no word boundary like English text). To do so, `khmercut` PyPI package is used to tokenize Khmer text into words, and then the words are joined back together with whitespaces.
-#### Summary
 **Language Detection:** Both model sizes achieved perfect 100% performance across all metrics (Precision, Recall, Accuracy, F1-score) on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio. This perfect score is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
@@ -195,24 +214,24 @@ Then, use the code below to get started with the model.
 ```python
 from transformers import AutoProcessor
-from tror_yong_asr import TrorYongASRModel, transcribe, translate
 model_id = "KrorngAI/tror-yong-asr-tiny"
 processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
 model = TrorYongASRModel.from_pretrained(model_id, trust_remote_code=True)
-result1 = transcribe('/path/to/audio_file.mp3', model, processor, max_tokens=64)
-print(result1) # namedtuple: text: str, output_ds: torch.Tensor
-result2 = translate('/path/to/audio_file.mp3', model, processor, max_tokens=64)
-print(result2) # namedtuple: text:str, output_ids: torch.Tensor
-#TODO: add detect_language usage
 ```
-### Fine-tuning
 Notebook (TBA)
@@ -243,8 +262,9 @@ The model can be integrated into:
 **Technical Limitations:**
 - **No speech detection**: The model was not trained for this task. User needs to fine-tune the model for this task (TrorYongASRTokenizer has `<|nospeech|>` token.)
 - **Noise robustness**: Performance may degrade in noisy environments
-- **Translate task**: The training data for translation task is scarce. User needs to fine-tune the model for better translation performance.
 **Sociotechnical Limitations:**
 - **Accent variability**: May not perform well on diverse Khmer accents

   <a href="https://kimang18.github.io" target="_blank"><img alt="Personal" src="https://img.shields.io/badge/KHUN-white?logoColor=white"/></a>
 </div>
 <div align="center" style="line-height: 1;">
+  <a href="https://huggingface.co/moonshotai/Kimi-K2.6/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/badge/License-Modified_MIT-f5de53?&color=f5de53"/></a>
 </div>
  | Model | Dataset | Precision | Recall | Accuracy | F1-score |
 |-------|---------|-----------|--------|----------|----------|
+| Tiny | Khmer (`fleurs`) | 100% | 100% | 100% | 100% |
+|  | English (`librispeech.clean`) | 100% | 100% | 100% | 100% |
+| Small | Khmer (`fleurs`) | 100% | 100% | 100% | 100% |
+|  | English (`librispeech.clean`) | 100% | 100% | 100% | 100% |
 </div>
 **Key Finding:** Both model sizes achieved perfect language detection performance on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio.
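For reference, the four metrics in the table can be computed from predicted and true language labels as in the minimal sketch below. The function name `binary_metrics` and the label strings `"km"` and `"en"` are illustrative assumptions; they are not taken from the evaluation code behind this model card.

```python
def binary_metrics(y_true, y_pred, positive="km"):
    """Precision, recall, accuracy and F1 for a two-class language detector.

    `positive` is the label treated as the positive class ("km" is a
    hypothetical label for Khmer, chosen only for this illustration).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# A detector that labels every clip correctly scores 1.0 on all four metrics.
perfect = binary_metrics(["km", "en", "km"], ["km", "en", "km"])
```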
 <div align="center">
+| Model | Metric | Khmer (`fleurs`) | English (`librispeech.clean`) | Mixed (Khmer + English) |
+|-------|--------|-------|---------|---------|
 | **Tiny** | Token Error Rate | 56% | 19% | 29% |
 | | Character Error Rate (CER) | 60.71% | 20.98% | 32.89% |
 | | Word Error Rate (WER) | 86.16% | 31.13% | 46.53% |
 - Performance for Khmer is moderate (46% token error rate, 35.31% CER, 50.70% WER)
 - The larger model benefits from increased embedding dimension (768 vs 384) and more layers (12 vs 4)
+**Note:** Khmer script does not separate words with spaces the way English does. To compute `CER` and `WER`, the Khmer text is first segmented into words with the `khmercut` PyPI package, and the words are then joined back together with single spaces.
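The segment-then-score procedure in the note can be sketched as follows. This is an illustration under stated assumptions, not the evaluation script used for the reported numbers: `toy_segment` is a hypothetical stand-in for a Khmer word segmenter and works on unspaced Latin placeholder text so that the example runs without extra dependencies.

```python
from typing import Callable, List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def toy_segment(text: str,
                lexicon=("i", "love", "khmer", "speech")) -> List[str]:
    """Hypothetical stand-in segmenter: greedy longest match on a tiny lexicon."""
    candidates = sorted(lexicon, key=len, reverse=True)
    words, i = [], 0
    while i < len(text):
        match = next((w for w in candidates if text.startswith(w, i)), text[i])
        words.append(match)
        i += len(match)
    return words


def spaced(text: str, segment: Callable[[str], List[str]]) -> str:
    """Segment unspaced text and rejoin the words with single spaces."""
    return " ".join(segment(text))


def wer(ref: str, hyp: str, segment: Callable[[str], List[str]]) -> float:
    ref_words = spaced(ref, segment).split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, spaced(hyp, segment).split()) / len(ref_words)


def cer(ref: str, hyp: str, segment: Callable[[str], List[str]]) -> float:
    ref_chars = list(spaced(ref, segment))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, list(spaced(hyp, segment))) / len(ref_chars)


# One of three reference words is wrong in the hypothesis.
score = wer("ilovekhmer", "ilovespeech", toy_segment)
```

For real Khmer text, the `segment` argument would be the segmenter from `khmercut` (assumed here to be its `tokenize` function) in place of `toy_segment`.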
+##### WER Comparison with Whisper
+| Size | Model | Parameters | Khmer (`fleurs`) | English (`librispeech.clean`) |
+|------|-------|------------|------------------|-------------------------------|
+| Tiny | TrorYongASR | 29M | 86.16% | 31.13% |
+|  | Whisper | 39M | 100.6% | 7.6% |
+| Small | TrorYongASR | 136M | 50.70% | 12.95% |
+|  | Whisper | 244M | 104.4% | 3.4% |
+**Comparison Notes:**
+- Whisper has more parameters at each size (39M vs 29M for Tiny, 244M vs 136M for Small)
+- Whisper shows significantly lower word error rates on English (7.6% vs 31.13% for Tiny, 3.4% vs 12.95% for Small)
+- Whisper performs worse on Khmer (100.6% vs 86.16% for Tiny, 104.4% vs 50.70% for Small)
+- Error rates above 100% are possible because WER also counts inserted words: Whisper's substitutions, deletions, and insertions on Khmer add up to more than the number of reference words. This typically points to many spurious or hallucinated words rather than, by itself, to overfitting
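A WER above 100% is arithmetically possible because of how the metric is defined. The standard definition is shown below; the counts in the worked example are made up for illustration and are not taken from the Whisper evaluation.

$$
\mathrm{WER} = \frac{S + D + I}{N}
$$

Here $S$, $D$, and $I$ are the numbers of substituted, deleted, and inserted words, and $N$ is the number of words in the reference. For a hypothetical reference of $N = 5$ words and a hypothesis with $S = 2$, $D = 0$, and $I = 4$:

$$
\mathrm{WER} = \frac{2 + 0 + 4}{5} = 1.2 = 120\%
$$

Because $I$ is not bounded by $N$, a hypothesis much longer than the reference can push the rate past 100%.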
+#### Result Summary
 **Language Detection:** Both model sizes achieved perfect 100% performance across all metrics (Precision, Recall, Accuracy, F1-score) on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio. This perfect score is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
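The fixed-prefix idea described above can be illustrated with a small sketch. It only generates a position order in which the first three positions (start, language, task) stay in place and the remaining positions are shuffled. The function name `permuted_order` is hypothetical, and the sketch makes no claim about how the actual pre-training code applies the permutation.

```python
import random


def permuted_order(seq_len, prefix_len=3, seed=None):
    """Return a position order whose first `prefix_len` entries are fixed.

    Positions 0..prefix_len-1 keep their place; the remaining positions
    are shuffled. Illustrative only, not the model's training code.
    """
    rng = random.Random(seed)
    prefix = list(range(min(prefix_len, seq_len)))
    tail = list(range(len(prefix), seq_len))
    rng.shuffle(tail)
    return prefix + tail


order = permuted_order(8, seed=0)
# order[:3] is always [0, 1, 2]; order[3:] is some arrangement of 3..7
```

Since position 1 (the language token) never moves in any such order, the model always predicts it from the same context, which is consistent with the explanation given for the perfect detection scores.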
 ```python
 from transformers import AutoProcessor
+from tror_yong_asr import TrorYongASRModel, transcribe, translate, detect_language
 model_id = "KrorngAI/tror-yong-asr-tiny"
 processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
 model = TrorYongASRModel.from_pretrained(model_id, trust_remote_code=True)
+result1 = detect_language('/path/to/audio_file.mp3', model, processor)
+print(result1)
+result2 = transcribe('/path/to/audio_file.mp3', model, processor, max_tokens=64)
+print(result2)
+result3 = translate('/path/to/audio_file.mp3', model, processor, max_tokens=64)
+print(result3)
 ```
+## Fine-tuning
 Notebook (TBA)
 **Technical Limitations:**
 - **No speech detection**: The model was not trained for this task. User needs to fine-tune the model for this task (TrorYongASRTokenizer has `<|nospeech|>` token.)
+- **Translate task**: Training data for the translation task is scarce. Users need to fine-tune the model for better translation performance
 - **Noise robustness**: Performance may degrade in noisy environments
+- **No timestamp output**: The model does not output timestamps
 **Sociotechnical Limitations:**
 - **Accent variability**: May not perform well on diverse Khmer accents