KrorngAI
/

TrorYongASR-tiny

@@ -1,35 +1,51 @@
 ---
 library_name: transformers
-tags: []
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
 ## Model Details
 ### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
 - **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
 ### Model Sources [optional]
 <!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
 - **Paper [optional]:** [More Information Needed]
 - **Demo [optional]:** [More Information Needed]
@@ -39,27 +55,39 @@ This is the model card of a 🤗 transformers model that has been pushed on the
 ### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
 ### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
 ### Out-of-Scope Use
 <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
 ## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
 ### Recommendations
@@ -69,9 +97,28 @@ Users (both direct and downstream) should be made aware of the risks, biases and
 ## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
 ## Training Details
@@ -79,7 +126,22 @@ Use the code below to get started with the model.
 <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
 ### Training Procedure
@@ -87,49 +149,125 @@ Use the code below to get started with the model.
 #### Preprocessing [optional]
-[More Information Needed]
 #### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
 #### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
 ## Evaluation
 <!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
 #### Testing Data
 <!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
 #### Metrics
 <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
 ### Results
-[More Information Needed]
 #### Summary
 ## Model Examination [optional]

 ---
 library_name: transformers
+license: mit
+datasets:
+- DDD-Cambodia/khm-asr-cultural
+- openslr/librispeech_asr
+- KrorngAI/fleurs-km-kh-openslr-SLR42
+language:
+- km
+- en
+metrics:
+- wer
+- cer
+- ter
+pipeline_tag: automatic-speech-recognition
 ---
+# TrorYongASR Tiny
+> [!Note]
+> This repository contains model weights and configuration files for the pre-trained model.
+>
 ## Model Details
 ### Model Description
+This is a **29M parameter** ASR (Automatic Speech Recognition) model inspired by [PARSeq](https://github.com/baudm/parseq/tree/main) and [Whisper](https://github.com/openai/whisper/tree/main). It is a lightweight model designed for efficient speech-to-text tasks and is particularly suited to edge devices and mobile applications. The model supports both Khmer and English.
+| Model Size | Parameters | Audio Encoder | Text Decoder | Embedding Dim | Audio Context | Text Context |
+|------------|------------|---------------|--------------|---------------|---------------|--------------|
+| **Tiny** | 29M | 4 layers, 6 heads | 1 layer, 12 heads | 384 | 1500 | 1024 |
+| **Small** | 136M | 12 layers, 12 heads | 1 layer, 24 heads | 768 | 1500 | 1024 |
+**Note:** Audio arrays are converted to log-mel spectrograms with `80` mel bins (the same as Whisper models of the same size).
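
To make the table's `Audio Context` value of 1500 concrete, here is a minimal sketch of the frame arithmetic. The 16 kHz sample rate, 30-second limit, and 80 mel bins come from this card. The 10 ms hop (`HOP_LENGTH = 160`) and the stride-2 downsampling (`CONV_STRIDE = 2`) are assumptions borrowed from Whisper's front end and are not confirmed for this model.

```python
# Values stated in this model card.
SAMPLE_RATE = 16_000   # Hz
MAX_SECONDS = 30       # longest clip kept for training
N_MELS = 80            # mel bins in the log-mel spectrogram

# Assumptions borrowed from Whisper, not confirmed for TrorYongASR.
HOP_LENGTH = 160       # samples between frames (10 ms at 16 kHz)
CONV_STRIDE = 2        # encoder front end halves the frame count


def spectrogram_shape(num_samples: int) -> tuple[int, int]:
    """Return (n_mels, n_frames) for a clip with `num_samples` samples."""
    return N_MELS, num_samples // HOP_LENGTH


def audio_context(num_samples: int) -> int:
    """Number of encoder positions after the assumed stride-2 downsampling."""
    _, n_frames = spectrogram_shape(num_samples)
    return n_frames // CONV_STRIDE


full_window = SAMPLE_RATE * MAX_SECONDS   # 480000 samples
print(spectrogram_shape(full_window))     # (80, 3000)
print(audio_context(full_window))         # 1500
```

Under these assumptions a full 30-second window maps to exactly the 1500 audio positions listed in the table.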
+- **Developed by:** Dr. KHUN Kimang
+- **Shared by [optional]:** KrorngAI
+- **Model type:** ASR (Automatic Speech Recognition)
+- **Language(s) (NLP):** Khmer and English
 - **License:** [More Information Needed]
 ### Model Sources [optional]
 <!-- Provide the basic links for the model. -->
+- **Repository:** https://github.com/Kimang18/KrorngAI/tree/main/tror-yong-asr
 - **Paper [optional]:** [More Information Needed]
 - **Demo [optional]:** [More Information Needed]
 ### Direct Use
+The Tiny model can be used directly for:
+- **Speech-to-text transcription**: transcribe Khmer and English audio
+- **Speech-to-text translation**: translate Khmer audio to English text and English audio to Khmer text
+- **Language detection**: Identify whether audio is in Khmer or English (100% accuracy on the test sets reported in the Results section)
+- **Edge computing**: Deploy on mobile devices, IoT devices, and embedded systems due to its small size (29M parameters)
+- **Real-time applications**: Low latency inference suitable for real-time speech interfaces
 ### Downstream Use [optional]
+The model can be integrated into:
+- **Mobile applications**: Android/iOS apps with speech recognition
+- **Web applications**: Browser-based speech-to-text using WebAssembly
+- **IoT devices**: Smart speakers, voice assistants
+- **Larger ASR systems**: As a component in multi-language ASR pipelines
 ### Out-of-Scope Use
 <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
 ## Bias, Risks, and Limitations
+**Technical Limitations:**
+- **No speech detection**: The model was not trained to detect the absence of speech. Users need to fine-tune it for this task (`TrorYongASRTokenizer` already includes a `<|nospeech|>` token).
+- **Noise robustness**: Performance may degrade in noisy environments
+- **Translation task**: Training data for the translation task is scarce. Users need to fine-tune the model for better translation performance.
+**Sociotechnical Limitations:**
+- **Accent variability**: May not perform well on diverse Khmer accents
+- **Background noise**: Limited robustness to background noise and reverberation
+- **Speaker variability**: May struggle with different speaking styles and rates
 ### Recommendations
 ## How to Get Started with the Model
+First, install the `tror-yong-asr` PyPI package:
+```bash
+pip install tror-yong-asr
+```
+Then, use the code below to get started with the model.
+```python
+from transformers import AutoProcessor
+from tror_yong_asr import TrorYongASRModel, transcribe, translate
+
+# Note: this ID names the Small checkpoint; substitute this repository's ID to load the Tiny model.
+model_id = "Kimang18/tror-yong-asr-small"
+processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+model = TrorYongASRModel.from_pretrained(model_id, trust_remote_code=True)
+
+# Transcribe the audio file in its spoken language.
+result1 = transcribe('/content/voice5.mp3', model, processor, max_tokens=64)
+print(result1)  # namedtuple: text: str, output_ids: torch.Tensor
+
+# Translate the audio (Khmer audio to English text, or English audio to Khmer text).
+result2 = translate('/content/voice5.mp3', model, processor, max_tokens=64)
+print(result2)  # namedtuple: text: str, output_ids: torch.Tensor
+```
 ## Training Details
 <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+#### Transcription Task
+For the transcription task, the model was trained on around 140 hours of Khmer audio and around 100 hours of English audio.
+Khmer datasets include [`DDD-Cambodia/khm-asr-cultural`](https://huggingface.co/datasets/DDD-Cambodia/khm-asr-cultural) (134.6 hours), [`openslr/openslr`](https://huggingface.co/datasets/Kimang18/openslr-SLR42/blob/main/README.md), and [`google/fleurs`](https://huggingface.co/datasets/Kimang18/google-fleurs-km-kh).
+The `clean.100` split of [`openslr/librispeech_asr`](https://huggingface.co/datasets/openslr/librispeech_asr) was used as the English dataset.
+| Dataset                            | Language   | Training examples | Validation examples | Description                                       |
+| ---------                          | ---------- | ----------------- | ------------------- | ------------------------------------------------- |
+| **openslr/openslr**                | Khmer      | 2906              | 0                   | Multi-speaker TTS data for Khmer language (split `SLR42`) |
+| **google/fleurs**                  | Khmer      | 1675              | 324                 | TTS data for Khmer language (split `km_kh`) |
+| **DDD-Cambodia/khm-asr-cultural**  | Khmer      | 56716             | 0                   | Khmer ASR Cultural Dataset  (split `train`) |
+| **librispeech.clean**              | English    | 28539             | 2703                | Clean speech dataset for English transcription    |
+#### Translation Task
+For the translation task, data was scarce: only 2000 examples of Khmer audio to English text and only 1000 examples of English audio to Khmer text.
 ### Training Procedure
 #### Preprocessing [optional]
+Following OpenAI's `Whisper` model, audio clips longer than 30 seconds are filtered out.
+All audio is sampled at `16000` Hz.
+For the English dataset, all text is lowercased.
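
The preprocessing rules above can be sketched as a filter plus a text normalizer. This is an illustration only: the record fields (`num_samples`, `sampling_rate`, `language`, `text`) and the helper names are hypothetical and are not taken from the `tror-yong-asr` code base.

```python
MAX_DURATION_S = 30.0        # clips longer than this are dropped (from this card)
TARGET_SAMPLE_RATE = 16_000  # Hz (from this card)


def keep_example(example: dict) -> bool:
    """Keep a clip only if it lasts at most MAX_DURATION_S seconds."""
    duration_s = example["num_samples"] / example["sampling_rate"]
    return duration_s <= MAX_DURATION_S


def normalize_text(example: dict) -> dict:
    """Lowercase English transcripts; leave other languages untouched."""
    if example["language"] == "en":
        return {**example, "text": example["text"].lower()}
    return example


# Hypothetical records standing in for real dataset rows.
examples = [
    {"num_samples": 2 * TARGET_SAMPLE_RATE, "sampling_rate": TARGET_SAMPLE_RATE,
     "language": "en", "text": "Hello World"},
    {"num_samples": 31 * TARGET_SAMPLE_RATE, "sampling_rate": TARGET_SAMPLE_RATE,
     "language": "en", "text": "Too Long"},
    {"num_samples": 5 * TARGET_SAMPLE_RATE, "sampling_rate": TARGET_SAMPLE_RATE,
     "language": "km", "text": "សួស្តី"},
]

prepared = [normalize_text(e) for e in examples if keep_example(e)]
print([e["text"] for e in prepared])  # ['hello world', 'សួស្តី']
```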
 #### Training Hyperparameters
+- **Training regime:** bf16 mixed precision training using the `LightningAI` package <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+- **Optimizer:** MuonAdamW (custom implementation)
+- **Learning rate schedule:** linear warmup (40 optimizer steps) followed by cosine annealing (3960 optimizer steps)
+- **Weight decay:** 0.1
+- **Batch size:** 8
+- **Gradient accumulation:** 8
+- **Number of optimizer steps:** 4000
+- **Number of epochs:** roughly 2
+- **Gradient Clip Value:** 0.5 (only for parameters trained by AdamW)
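
As a rough illustration of the schedule listed above, the sketch below computes a learning rate with 40 linear warmup steps followed by 3960 cosine annealing steps. The peak and minimum learning rates are not stated in this card, so `PEAK_LR` and `MIN_LR` are hypothetical placeholders. The effective batch size assumes that gradient accumulation multiplies the per-step batch size on the single GPU used for training.

```python
import math

# Values stated in this model card.
WARMUP_STEPS = 40
COSINE_STEPS = 3960
BATCH_SIZE = 8
GRAD_ACCUM = 8

# Hypothetical values: the card does not state the peak or final learning rate.
PEAK_LR = 1e-3
MIN_LR = 0.0


def lr_at(step: int) -> float:
    """Learning rate at 0-indexed optimizer step `step`."""
    if step < WARMUP_STEPS:
        # Linear ramp that reaches PEAK_LR on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Cosine decay from PEAK_LR towards MIN_LR over COSINE_STEPS steps.
    progress = (step - WARMUP_STEPS) / COSINE_STEPS
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))


# Samples consumed per optimizer step under the stated assumption.
effective_batch = BATCH_SIZE * GRAD_ACCUM
print(effective_batch)  # 64
```

In an actual PyTorch training loop the same shape is commonly obtained by chaining a linear warmup scheduler with a cosine annealing scheduler; whether the author did so is not stated here.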
 #### Speeds, Sizes, Times [optional]
+Training ran for 4000 optimizer steps on a single Tesla T4 GPU.
+It took around 10 hours.
 ## Evaluation
 <!-- This section describes the evaluation protocols and provides the results. -->
+The evaluation assesses two capabilities — language detection and transcription — on two datasets ([`google/fleurs`](https://huggingface.co/datasets/Kimang18/google-fleurs-km-kh) for Khmer and [`openslr/librispeech_asr`](https://huggingface.co/datasets/openslr/librispeech_asr) for English). All results are from the test split of each dataset, representing the model's generalization ability to unseen data.
+### Testing Data & Metrics
 #### Testing Data
 <!-- This should link to a Dataset Card if possible. -->
+| Dataset               | Language   | Testing examples | Description                                       |
+| ---------             | ---------- | ---------------- | ------------------------------------------------- |
+| **google/fleurs**     | Khmer      | 765           | Multi-lingual dataset with Khmer language samples |
+| **librispeech.clean** | English    | 2620          | Clean speech dataset for English transcription    |
+**Note:** All evaluation results below are from the **test split** of each dataset. For `google/fleurs`, audio clips longer than 30 seconds are excluded from the evaluation.
 #### Metrics
 <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+##### Language Detection
+**Task:** Given audio input, detect the language.
+**Approach:** Binary classification task (2 languages: Khmer and English).
+**Metrics:**
+| Metric | Description |
+|--------|-------------|
+| **Precision** | Proportion of predicted languages that are correct |
+| **Recall** | Proportion of actual language samples correctly identified |
+| **Accuracy** | Proportion of total predictions that are correct |
+| **F1-score** | Harmonic mean of precision and recall |
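
For reference, the four metrics in the table can be computed from predicted and true language labels as in the sketch below. It uses the standard definitions with one language treated as the positive class; the function name and the toy labels are illustrative and are not the evaluation script behind the reported numbers.

```python
def binary_metrics(y_true: list[str], y_pred: list[str], positive: str) -> dict:
    """Precision, recall, accuracy and F1 with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": correct / len(pairs),
        "f1": f1,
    }


# Toy labels: one Khmer clip is misclassified as English.
y_true = ["km", "km", "en", "en"]
y_pred = ["km", "en", "en", "en"]
metrics = binary_metrics(y_true, y_pred, positive="km")
print(metrics["precision"], metrics["recall"], metrics["accuracy"])  # 1.0 0.5 0.75
```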
+##### Transcription
+**Task:** Convert audio to text (transcription).
+**Metrics:**
+| Metric | Description |
+|--------|-------------|
+| **Token Error Rate** | Proportion of incorrectly transcribed tokens |
+| **Character Error Rate (CER)** | Proportion of characters that are incorrect |
+| **Word Error Rate (WER)** | Proportion of words that are incorrect |
+**Note on Translation Task:** The models are also trained for the `translation` task, but evaluation is deferred to future work due to scarce data (2000 samples from Khmer audio to English text, and 1000 samples from English audio to Khmer text).
+**Note on Token Error Rate:** Token Error Rate measures the model's capability to predict the next token given the audio input and the current sequence of tokens. This metric is weaker than Word Error Rate (WER) and Character Error Rate (CER) because it doesn't account for insertions, deletions, and substitutions as comprehensively. Token Error Rate is used here because Khmer text lacks word boundaries, making WER and CER calculations challenging without additional preprocessing.
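
One plausible reading of the Token Error Rate described above is a position-wise comparison: at each position the model sees the audio and the reference tokens so far, predicts the next token, and the prediction is compared with the reference token. The sketch below implements that reading. The exact definition used for the reported numbers is not given in this card, so the function, its `ignore_id` parameter, and the toy token IDs are assumptions.

```python
from typing import Optional, Sequence


def token_error_rate(predicted_ids: Sequence[int],
                     target_ids: Sequence[int],
                     ignore_id: Optional[int] = None) -> float:
    """Fraction of positions where the predicted token differs from the target.

    Positions whose target equals `ignore_id` (for example padding) are skipped.
    """
    if len(predicted_ids) != len(target_ids):
        raise ValueError("predictions and targets must have the same length")
    pairs = [(p, t) for p, t in zip(predicted_ids, target_ids) if t != ignore_id]
    if not pairs:
        return 0.0
    return sum(p != t for p, t in pairs) / len(pairs)


# Toy IDs: one of four next-token predictions is wrong.
print(token_error_rate([5, 7, 2, 9], [5, 8, 2, 9]))  # 0.25
```

Because every position is scored against the reference prefix, this measure has no notion of insertions or deletions, which matches the card's remark that it is weaker than WER and CER.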
 ### Results
+#### Language Detection Results
+| Model | Dataset | Precision | Recall | Accuracy | F1-score |
+|-------|---------|-----------|--------|----------|----------|
+| Tiny | google/fleurs (Khmer) | 100% | 100% | 100% | 100% |
+|  | librispeech.clean (English) | 100% | 100% | 100% | 100% |
+| Small | google/fleurs (Khmer) | 100% | 100% | 100% | 100% |
+|  | librispeech.clean (English) | 100% | 100% | 100% | 100% |
+**Key Finding:** Both model sizes achieved perfect language detection performance on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio.
+**Note on Language Detection Performance:** The 100% language detection scores may appear unusually high. This is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
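
The fixed-prefix permutation described above can be illustrated with a small sketch that samples a decoding order: positions 0 to 2 (start, language, and task tokens) keep their place, and only positions from 3 onward are shuffled. This is a simplified illustration of the idea, not the author's training code; the function name and `NUM_FIXED` constant are hypothetical.

```python
import random

NUM_FIXED = 3  # start token, language token, task token stay in place


def sample_decoding_order(seq_len: int, rng: random.Random) -> list[int]:
    """Return a permutation of range(seq_len) whose first NUM_FIXED entries are fixed."""
    fixed = list(range(min(NUM_FIXED, seq_len)))
    rest = list(range(NUM_FIXED, seq_len))
    rng.shuffle(rest)  # only the word-token positions are permuted
    return fixed + rest


order = sample_decoding_order(8, random.Random(0))
print(order[:3])  # [0, 1, 2]
```

Since the language token always sits at position 1 in every sampled order, the model is always trained to predict it from the start token and the audio alone.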
+#### Transcription Results
+| Model | Metric | Combined (Khmer + English) | Khmer | English |
+|-------|--------|---------------------------|-------|---------|
+| **Tiny** | Token Error Rate | 29% | 56% | 19% |
+| | Character Error Rate (CER) | 32.89% | 60.71% | 20.98% |
+| | Word Error Rate (WER) | 46.53% | 86.16% | 31.13% |
+| **Small** | Token Error Rate | 19% | 46% | 10% |
+| | Character Error Rate (CER) | 15.54% | 35.31% | 7.08% |
+| | Word Error Rate (WER) | 23.52% | 50.70% | 12.95% |
+**Key Observations:**
+- The Tiny model performs best on English (19% token error rate, 20.98% CER, 31.13% WER)
+- The Tiny model's performance drops significantly for Khmer (56% token error rate, 60.71% CER, 86.16% WER)
+- The Small model shows strong performance on English (10% token error rate, 7.08% CER, 12.95% WER)
+- The Small model's performance for Khmer is moderate (46% token error rate, 35.31% CER, 50.70% WER)
+- The Small model benefits from a larger embedding dimension (768 vs 384) and more audio encoder layers (12 vs 4)
+**Note:** To compute `CER` and `WER`, whitespace is inserted between words in Khmer text (unlike English, Khmer is written without spaces between words). The `khmercut` PyPI package is used to segment Khmer text into words, which are then joined back together with whitespace.
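
Once text has been segmented into words, WER and CER both reduce to an edit distance divided by the reference length. The sketch below uses a plain Levenshtein distance and assumes segmentation has already happened (for Khmer, for example with `khmercut`, whose API is not shown here). The function names are illustrative, and the toolkit actually used for the reported scores is not stated in this card.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref_words: list[str], hyp_words: list[str]) -> float:
    """Word error rate over pre-segmented word lists."""
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def cer(ref_words: list[str], hyp_words: list[str]) -> float:
    """Character error rate over the space-joined strings."""
    ref_text, hyp_text = " ".join(ref_words), " ".join(hyp_words)
    return edit_distance(ref_text, hyp_text) / len(ref_text)


# English stand-in for segmented text: one substituted word, one substituted character.
ref = ["the", "cat", "sat"]
hyp = ["the", "cat", "sit"]
print(round(wer(ref, hyp), 4))  # 0.3333
print(round(cer(ref, hyp), 4))  # 0.0909
```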
 #### Summary
+**Language Detection:** Both model sizes achieved perfect 100% performance across all metrics (Precision, Recall, Accuracy, F1-score) on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio. This perfect score is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
+**Transcription:** The Small model shows strong performance on English (10% token error rate, 7.08% CER, 12.95% WER) and moderate performance for Khmer (46% token error rate, 35.31% CER, 50.70% WER). The Tiny model shows strong performance on English (19% token error rate, 20.98% CER, 31.13% WER) but significantly lower performance for Khmer (56% token error rate, 60.71% CER, 86.16% WER). The larger model benefits from increased embedding dimension (768 vs 384) and more layers (12 vs 4).
 ## Model Examination [optional]