KrorngAI/TrorYongASR-tiny

@@ -38,7 +38,7 @@ pipeline_tag: automatic-speech-recognition
 </div>
-# TrorYongASR Tiny
 > [!Note]
 > This repository contains model weights and configuration files for the pre-trained model.
@@ -48,18 +48,18 @@ pipeline_tag: automatic-speech-recognition
 ### Model Description
-This is a **29M parameter** ASR (Automatic Speech Recognition) model inspired by [PARSeq](https://github.com/baudm/parseq/tree/main) and [Whisper](https://github.com/openai/whisper/tree/main). Compared to [whisper-tiny](https://huggingface.co/openai/whisper-tiny) with **39M** parameters, **TrorYongASR Tiny** is a lightweight model designed for efficient speech-to-text task, particularly suitable for edge devices and mobile applications. The model supports both Khmer and English languages.
 <div align="center">
-| **Model Size**    | Tiny          |
-|:-------------:|:-----------------:|
-| **Parameters**    | 29M               |
-| **Audio Encoder** | 4 layers, 6 heads |
-| **Text Decoder**  | 1 layer, 12 heads |
-| **Embedding Dim** | 384 |
-| **Audio Context** | 1500 |
-| **Text Context**  | 1024 |
 </div>
 **Note:** The audio array is converted to a log-mel spectrogram with `80` mels (the same as Whisper models of the same size).
@@ -140,32 +140,40 @@ The evaluation assesses two capabilities — language detection and transcriptio
 #### Language Detection Results
 <div align="center">
-| Dataset | Precision | Recall | Accuracy | F1-score |
-|---------|-----------|--------|----------|----------|
-| google/fleurs (Khmer) | 100% | 100% | 100% | 100% |
-| librispeech.clean (English) | 100% | 100% | 100% | 100% |
 </div>
 **Key Finding:** Both model sizes achieved perfect language detection performance on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio.
-**Note on Language Detection Performance:** The 100% language detection scores may appear unusually high. This is probably due to the fact that during the training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. This means that with 6 permutations, the model learns to predict language token 6 times for a given audio. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
 #### Transcription Results
 <div align="center">
-| Metric | Combined (Khmer + English) | Khmer | English |
-|--------|---------------------------|-------|---------|
-| Token Error Rate | 29% | 56% | 19% |
-| Character Error Rate (CER) | 32.89% | 60.71% | 20.98% |
-| Word Error Rate (WER) | 46.53% | 86.16% | 31.13% |
 </div>
 **Key Observations:**
-- The model shows strong performance on English (19% token error rate, 20.98% CER, 31.13% WER)
 - Performance drops significantly for Khmer (56% token error rate, 60.71% CER, 86.16% WER)
 **Note:** To compute `CER` and `WER`, whitespaces are added between words in Khmer text (Khmer text has no word boundary like English text). To do so, `khmercut` PyPI package is used to tokenize Khmer text into words, and then the words are joined back together with whitespaces.
@@ -173,7 +181,7 @@ The evaluation assesses two capabilities — language detection and transcriptio
 **Language Detection:** Both model sizes achieved perfect 100% performance across all metrics (Precision, Recall, Accuracy, F1-score) on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio. This perfect score is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
-**Transcription:** The model shows strong performance on English (19% token error rate, 20.98% CER, 31.13% WER) but significantly lower performance for Khmer (56% token error rate, 60.71% CER, 86.16% WER).
 ## How to Get Started with the Model
@@ -199,8 +207,14 @@ print(result1) # namedtuple: text: str, output_ds: torch.Tensor
 result2 = translate('/path/to/audio_file.mp3', model, processor, max_tokens=64)
 print(result2) # namedtuple: text:str, output_ids: torch.Tensor
 ```
 ## Uses
@@ -287,8 +301,7 @@ For English dataset, all texts are in lowercase.
 - **Optimizer:** MuonAdamW (custom implementation)
 - **Learning rate:** Linear Warmup (40 optimizer steps) + CosineAnnealing (3960 optimizer steps)
 - **Weight decay:** 0.1
-- **Batch size:** 8
-- **Gradient accumulation:** 8
 - **Number of optimizer steps:** 4000
 - **Number of epochs:** roughly 2 epochs
 - **Gradient Clip Value:** 0.5 (only for parameters trained by AdamW)

 </div>
+# TrorYongASR
 > [!Note]
 > This repository contains model weights and configuration files for the pre-trained model.
 ### Model Description
+This is an ASR (Automatic Speech Recognition) model inspired by [PARSeq](https://github.com/baudm/parseq/tree/main) and [Whisper](https://github.com/openai/whisper/tree/main). It is smaller than the corresponding OpenAI Whisper model; for example, the Tiny variant has 29M parameters versus 39M for [whisper-tiny](https://huggingface.co/openai/whisper-tiny). **TrorYongASR** supports both Khmer and English.
 <div align="center">
+| **Model Size**    | Tiny          | Small |
+|:-------------:|:-----------------:|:-----:|
+| **Parameters**    | 29M               | 136M |
+| **Audio Encoder** | 4 layers, 6 heads | 12 layers, 12 heads |
+| **Text Decoder**  | 1 layer, 12 heads | 1 layer, 24 heads |
+| **Embedding Dim** | 384 | 768 |
+| **Audio Context** | 1500 | 1500 |
+| **Text Context**  | 1024 | 1024 |
 </div>
 **Note:** The audio array is converted to a log-mel spectrogram with `80` mels (the same as Whisper models of the same size).
 #### Language Detection Results
 <div align="center">
+| Model | Dataset | Precision | Recall | Accuracy | F1-score |
+|-------|---------|-----------|--------|----------|----------|
+| Tiny | google/fleurs (Khmer) | 100% | 100% | 100% | 100% |
+|  | librispeech.clean (English) | 100% | 100% | 100% | 100% |
+| Small | google/fleurs (Khmer) | 100% | 100% | 100% | 100% |
+|  | librispeech.clean (English) | 100% | 100% | 100% | 100% |
 </div>
 **Key Finding:** Both model sizes achieved perfect language detection performance on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio.
+**Note on Language Detection Performance:** The 100% language detection scores may appear unusually high. This is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
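The fixed-prefix behaviour described in the note can be illustrated with a minimal sketch. It assumes positions 0 to 2 hold the start, language, and task tokens and that only later positions are shuffled. The function name, the number of permutations, and the seed are hypothetical, and this is not necessarily the sampling scheme used in training.

```python
import random


def sample_permutations(seq_len, num_perms=6, fixed_prefix=3, seed=0):
    """Sample decoding orders that keep the first `fixed_prefix` positions in place.

    Illustrative only: `num_perms` and `seed` are assumptions, not training settings.
    """
    rng = random.Random(seed)
    prefix = list(range(min(fixed_prefix, seq_len)))  # start, language, task
    tail = list(range(len(prefix), seq_len))          # word-token positions
    perms = []
    for _ in range(num_perms):
        shuffled = tail[:]
        rng.shuffle(shuffled)
        perms.append(prefix + shuffled)
    return perms


perms = sample_permutations(seq_len=8)
for order in perms:
    print(order)  # every order starts with 0, 1, 2
```

Because the language token keeps the same position and the same left context in every sampled order, each permutation of a given utterance presents the language-prediction step identically.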
 #### Transcription Results
 <div align="center">
+| Model | Metric | Combined (Khmer + English) | Khmer | English |
+|-------|--------|---------------------------|-------|---------|
+| **Tiny** | Token Error Rate | 29% | 56% | 19% |
+| | Character Error Rate (CER) | 32.89% | 60.71% | 20.98% |
+| | Word Error Rate (WER) | 46.53% | 86.16% | 31.13% |
+| **Small** | Token Error Rate | 19% | 46% | 10% |
+| | Character Error Rate (CER) | 15.54% | 35.31% | 7.08% |
+| | Word Error Rate (WER) | 23.52% | 50.70% | 12.95% |
 </div>
 **Key Observations:**
+- The tiny model shows strong performance on English (19% token error rate, 20.98% CER, 31.13% WER)
 - Performance drops significantly for Khmer (56% token error rate, 60.71% CER, 86.16% WER)
+- The small model shows strong performance on English (10% token error rate, 7.08% CER, 12.95% WER)
+- Performance for Khmer is moderate (46% token error rate, 35.31% CER, 50.70% WER)
+- The small model benefits from a larger embedding dimension (768 vs 384) and more audio encoder layers (12 vs 4)
 **Note:** To compute `CER` and `WER`, whitespaces are added between words in Khmer text (Khmer text has no word boundary like English text). To do so, `khmercut` PyPI package is used to tokenize Khmer text into words, and then the words are joined back together with whitespaces.
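The scoring procedure in the note above can be sketched as follows. The edit-distance, `cer`, and `wer` helpers are a plain reference implementation written for illustration, not the evaluation script used for the reported numbers. The segmenter is passed in as a function: for Khmer it would be the word tokenizer from `khmercut` (assumed here to return a list of words), while the demo uses a hypothetical `toy_segment` so the example runs without extra dependencies.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def space_segmented(text, segment):
    """Split `text` into words with `segment`, then rejoin with single spaces."""
    return " ".join(segment(text))


def cer(ref, hyp):
    """Character error rate, computed on the whitespace-joined text."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(ref, hyp):
    """Word error rate over whitespace-separated words."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)


def toy_segment(text):
    """Hypothetical stand-in for a Khmer word segmenter: fixed 2-character chunks."""
    return [text[i:i + 2] for i in range(0, len(text), 2)]


ref = space_segmented("abcdef", toy_segment)  # "ab cd ef"
hyp = space_segmented("abcdxf", toy_segment)  # "ab cd xf"
print(cer(ref, hyp), wer(ref, hyp))
```

Segmenting both the reference and the hypothesis with the same function matters, since WER depends on where the word boundaries are placed.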
 **Language Detection:** Both model sizes achieved perfect 100% performance across all metrics (Precision, Recall, Accuracy, F1-score) on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio. This perfect score is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
+**Transcription:** The Small model shows strong performance on English (10% token error rate, 7.08% CER, 12.95% WER) and moderate performance on Khmer (46% token error rate, 35.31% CER, 50.70% WER). The Tiny model also shows strong performance on English (19% token error rate, 20.98% CER, 31.13% WER) but significantly lower performance on Khmer (56% token error rate, 60.71% CER, 86.16% WER). The Small model benefits from a larger embedding dimension (768 vs 384) and more audio encoder layers (12 vs 4).
 ## How to Get Started with the Model
 result2 = translate('/path/to/audio_file.mp3', model, processor, max_tokens=64)
 print(result2) # namedtuple: text:str, output_ids: torch.Tensor
+# TODO: add detect_language usage
 ```
+### Fine-tuning
+Notebook (TBA)
 ## Uses
 - **Optimizer:** MuonAdamW (custom implementation)
 - **Learning rate:** Linear Warmup (40 optimizer steps) + CosineAnnealing (3960 optimizer steps)
 - **Weight decay:** 0.1
+- **Effective Batch size:** 64
 - **Number of optimizer steps:** 4000
 - **Number of epochs:** roughly 2 epochs
 - **Gradient Clip Value:** 0.5 (only for parameters trained by AdamW)
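The learning-rate schedule and batch settings listed above can be sketched in plain Python. This is an illustration under assumptions: the peak learning rate is not stated here, so the function returns a multiplier on it; `min_scale` (the floor of the cosine phase) is a hypothetical parameter; and the 8 × 8 split of the effective batch size is taken from the previous revision of this card and may differ per model size.

```python
import math

WARMUP_STEPS = 40
COSINE_STEPS = 3960
TOTAL_STEPS = WARMUP_STEPS + COSINE_STEPS  # 4000 optimizer steps

# One combination that yields the effective batch size of 64 (assumed split).
MICRO_BATCH = 8
GRAD_ACCUM = 8
EFFECTIVE_BATCH = MICRO_BATCH * GRAD_ACCUM


def lr_scale(step, min_scale=0.0):
    """Multiplier on the peak learning rate at a 0-indexed optimizer step."""
    if step < WARMUP_STEPS:
        # Linear warmup: reaches 1.0 on the last warmup step.
        return (step + 1) / WARMUP_STEPS
    # Cosine annealing from 1.0 down to `min_scale` over the remaining steps.
    progress = min((step - WARMUP_STEPS) / COSINE_STEPS, 1.0)
    return min_scale + (1.0 - min_scale) * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 39, 40, 2020, 3999):
    print(step, round(lr_scale(step), 4))
```

With gradient accumulation, one optimizer step (and therefore one schedule step) corresponds to `GRAD_ACCUM` forward and backward passes.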