KrorngAI/TrorYongASR-tiny

@@ -50,21 +50,27 @@ pipeline_tag: automatic-speech-recognition
 This is a **29M-parameter** ASR (Automatic Speech Recognition) model inspired by [PARSeq](https://github.com/baudm/parseq/tree/main) and [Whisper](https://github.com/openai/whisper/tree/main). It is a lightweight model designed for efficient speech-to-text, particularly suitable for edge devices and mobile applications. The model supports both Khmer and English.
-| Model Size | Parameters | Audio Encoder | Text Decoder | Embedding Dim | Audio Context | Text Context |
-|------------|------------|---------------|--------------|---------------|---------------|--------------|
-| **Tiny** | 29M | 4 layers, 6 heads | 1 layer, 12 heads | 384 | 1500 | 1024 |
-| **Small** | 136M | 12 layers, 12 heads | 1 layer, 24 heads | 768 | 1500 | 1024 |
 **Note:** The audio arrays are converted to log-mel spectrograms with `80` mel bins (the same as Whisper models of the same size).
-- **Developed by:** Dr. KHUN Kimang
-- **Shared by [optional]:** KrorngAI
 - **Model type:** ASR (Automatic Speech Recognition)
 - **Language(s) (NLP):** Khmer and English
 - **License:** [More Information Needed]
-### Model Sources [optional]
 <!-- Provide the basic links for the model. -->
@@ -72,6 +78,116 @@ This is a **29M parameter** ASR (Automatic Speech Recognition) model inspired by
 - **Paper [optional]:** [More Information Needed]
 - **Demo [optional]:** [More Information Needed]
 ## Uses
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
@@ -95,10 +211,6 @@ The model can be integrated into:
 - **Larger ASR systems**: As a component in multi-language ASR pipelines
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
 ## Bias, Risks, and Limitations
 **Technical Limitations:**
@@ -118,30 +230,6 @@ The model can be integrated into:
 Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-First, install `tror-yong-asr` PyPI package:
-```bash
-pip install tror-yong-asr
-```
-Then, use the code below to get started with the model.
-```python
-from transformers import AutoProcessor
-from tror_yong_asr import TrorYongASRModel, transcribe, translate
-model_id = "Kimang18/tror-yong-asr-small"
-processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
-model = TrorYongASRModel.from_pretrained(model_id, trust_remote_code=True)
-result1 = transcribe('/content/voice5.mp3', model, processor, max_tokens=64)
-print(result1) # namedtuple: text: str, output_ds: torch.Tensor
-result2 = translate('/content/voice5.mp3', model, processor, max_tokens=64)
-print(result2) # namedtuple: text:str, output_ids: torch.Tensor
-```
 ## Training Details
@@ -170,7 +258,7 @@ For translation task, the data was scarce: only 2000 examples for Khmer audio to
 <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
 Following OpenAI's Whisper, audio clips longer than 30 seconds are filtered out.
 All audio clips have a sample rate of `16000` Hz.
@@ -189,146 +277,12 @@ For English dataset, all texts are in lowercase.
 - **Gradient Clip Value:** 0.5 (only for parameters trained by AdamW)
-#### Speeds, Sizes, Times [optional]
 The training was conducted over 4000 optimizer steps on a single Tesla T4 GPU.
-The trainig took around 10 hours.
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-The evaluation assesses two capabilities — language detection and transcription — on two datasets ([`google/fleurs`](https://huggingface.co/datasets/Kimang18/google-fleurs-km-kh) for Khmer and [`openslr/librispeech_asr`](https://huggingface.co/datasets/openslr/librispeech_asr) for English). All results are from the test split of each dataset, representing the model's generalization ability to unseen data.
-### Testing Data & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-| Dataset               | Language   | Testing examples | Description                                       |
-| ---------             | ---------- | ------------- | -                                                 |
-| **google/fleurs**     | Khmer      | 765           | Multi-lingual dataset with Khmer language samples |
-| **librispeech.clean** | English    | 2620          | Clean speech dataset for English transcription    |
-**Note:** All evaluation results below are from the **test split** of each dataset. For `google/fleurs`, audios longer than `30 seconds` are excluded from the evaluation.
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-##### Language Detection
-**Task:** Given audio input, detect the language.
-**Approach:** Binary classification task (2 languages: Khmer and English).
-**Metrics:**
-| Metric | Description |
-|--------|-------------|
-| **Precision** | Proportion of predicted languages that are correct |
-| **Recall** | Proportion of actual language samples correctly identified |
-| **Accuracy** | Proportion of total predictions that are correct |
-| **F1-score** | Harmonic mean of precision and recall |
-##### Transcription
-**Task:** Convert audio to text (transcription).
-**Metrics:**
-| Metric | Description |
-|--------|-------------|
-| **Token Error Rate** | Proportion of incorrectly transcribed tokens |
-| **Character Error Rate (CER)** | Proportion of characters that are incorrect |
-| **Word Error Rate (WER)** | Proportion of words that are incorrect |
-**Note on Translation Task:** The models are also trained for the `translation` task, but evaluation is deferred to future work due to scarce data (2000 samples from Khmer audio to English text, and 1000 samples from English audio to Khmer text).
-**Note on Token Error Rate:** Token Error Rate measures the model's capability to predict the next token given the audio input and the current sequence of tokens. This metric is weaker than Word Error Rate (WER) and Character Error Rate (CER) because it doesn't account for insertions, deletions, and substitutions as comprehensively. Token Error Rate is used here because Khmer text lacks word boundaries, making WER and CER calculations challenging without additional preprocessing.
-### Results
-#### Language Detection Results
-| Model | Dataset | Precision | Recall | Accuracy | F1-score |
-|-------|---------|-----------|--------|----------|----------|
-| Tiny | google/fleurs (Khmer) | 100% | 100% | 100% | 100% |
-|  | librispeech.clean (English) | 100% | 100% | 100% | 100% |
-| Small | google/fleurs (Khmer) | 100% | 100% | 100% | 100% |
-|  | librispeech.clean (English) | 100% | 100% | 100% | 100% |
-**Key Finding:** Both model sizes achieved perfect language detection performance on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio.
-**Note on Language Detection Performance:** The 100% language detection scores may appear unusually high. This is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
-#### Transcription Results
-| Model | Metric | Combined (Khmer + English) | Khmer | English |
-|-------|--------|---------------------------|-------|---------|
-| **Tiny** | Token Error Rate | 29% | 56% | 19% |
-| | Character Error Rate (CER) | 32.89% | 60.71% | 20.98% |
-| | Word Error Rate (WER) | 46.53% | 86.16% | 31.13% |
-| **Small** | Token Error Rate | 19% | 46% | 10% |
-| | Character Error Rate (CER) | 15.54% | 35.31% | 7.08% |
-| | Word Error Rate (WER) | 23.52% | 50.70% | 12.95% |
-**Key Observations:**
-- The tiny model shows strong performance on English (19% token error rate, 20.98% CER, 31.13% WER)
-- Performance drops significantly for Khmer (56% token error rate, 60.71% CER, 86.16% WER)
-- The small model shows strong performance on English (10% token error rate, 7.08% CER, 12.95% WER)
-- Performance for Khmer is moderate (46% token error rate, 35.31% CER, 50.70% WER)
-- The larger model benefits from increased embedding dimension (768 vs 384) and more layers (12 vs 4)
-**Note:** To compute `CER` and `WER`, whitespaces are added between words in Khmer text (Khmer text has no word boundary like English text). To do so, `khmercut` PyPI package is used to tokenize Khmer text into words, and then the words are joined back together with whitespaces.
-#### Summary
-**Language Detection:** Both model sizes achieved perfect 100% performance across all metrics (Precision, Recall, Accuracy, F1-score) on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio. This perfect score is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
-**Transcription:** The Small model shows strong performance on English (10% token error rate, 7.08% CER, 12.95% WER) and moderate performance for Khmer (46% token error rate, 35.31% CER, 50.70% WER). The Tiny model shows strong performance on English (19% token error rate, 20.98% CER, 31.13% WER) but significantly lower performance for Khmer (56% token error rate, 60.71% CER, 86.16% WER). The larger model benefits from increased embedding dimension (768 vs 384) and more layers (12 vs 4).
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
 ## Citation [optional]
 <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
@@ -337,24 +291,12 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
 [More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
 ## Model Card Contact
-[More Information Needed]

 This is a **29M-parameter** ASR (Automatic Speech Recognition) model inspired by [PARSeq](https://github.com/baudm/parseq/tree/main) and [Whisper](https://github.com/openai/whisper/tree/main). It is a lightweight model designed for efficient speech-to-text, particularly suitable for edge devices and mobile applications. The model supports both Khmer and English.
+<div align="center">
+| **Model Size**    | Tiny          |
+|:-------------:|:-----------------:|
+| **Parameters**    | 29M               |
+| **Audio Encoder** | 4 layers, 6 heads |
+| **Text Decoder**  | 1 layer, 12 heads |
+| **Embedding Dim** | 384 |
+| **Audio Context** | 1500 |
+| **Text Context**  | 1024 |
+</div>
 **Note:** The audio arrays are converted to log-mel spectrograms with `80` mel bins (the same as Whisper models of the same size).
+- **Developed by:** KHUN Kimang (Ph.D.)
+- **Shared by:** KrorngAI
 - **Model type:** ASR (Automatic Speech Recognition)
 - **Language(s) (NLP):** Khmer and English
 - **License:** [More Information Needed]
+### Model Sources
 <!-- Provide the basic links for the model. -->
 - **Paper [optional]:** [More Information Needed]
 - **Demo [optional]:** [More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+The evaluation assesses two capabilities — language detection and transcription — on two datasets ([`google/fleurs`](https://huggingface.co/datasets/Kimang18/google-fleurs-km-kh) for Khmer and [`openslr/librispeech_asr`](https://huggingface.co/datasets/openslr/librispeech_asr) for English). All results are from the test split of each dataset, representing the model's generalization ability to unseen data.
+### Testing Data & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+| Dataset               | Language   | Testing examples | Description                                       |
+| ---------             | ---------- | ------------- | -                                                 |
+| **google/fleurs**     | Khmer      | 765           | Multi-lingual dataset with Khmer language samples |
+| **librispeech.clean** | English    | 2620          | Clean speech dataset for English transcription    |
+**Note:** All evaluation results below are from the **test split** of each dataset. For `google/fleurs`, audios longer than `30 seconds` are excluded from the evaluation.
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+##### Language Detection
+**Task:** Given audio input, detect the language.
+| Metric | Description |
+|--------|-------------|
+| **Precision** | Proportion of predicted languages that are correct |
+| **Recall** | Proportion of actual language samples correctly identified |
+| **Accuracy** | Proportion of total predictions that are correct |
+| **F1-score** | Harmonic mean of precision and recall |
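The four metrics in the table above can be computed from paired reference and predicted language labels. The following is a minimal sketch, not the evaluation script used for this card; the `"km"` and `"en"` labels and the sample predictions are hypothetical.

```python
from collections import Counter


def binary_metrics(y_true, y_pred, positive="km"):
    """Precision, recall, accuracy and F1 with one language as the positive class."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            counts["tp"] += 1
        elif pred == positive:
            counts["fp"] += 1
        elif truth == positive:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    tp, fp, fn, tn = (counts[k] for k in ("tp", "fp", "fn", "tn"))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(y_true) if y_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


# Hypothetical labels, not taken from the evaluation sets.
y_true = ["km", "km", "en", "en", "en"]
y_pred = ["km", "en", "en", "en", "km"]
scores = binary_metrics(y_true, y_pred)
```

With two classes, swapping the `positive` argument gives the per-language view used in the results table.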
+##### Transcription
+**Task:** Convert audio to text (transcription).
+| Metric | Description |
+|--------|-------------|
+| **Token Error Rate** | Proportion of incorrectly transcribed tokens |
+| **Character Error Rate (CER)** | Proportion of characters that are incorrect |
+| **Word Error Rate (WER)** | Proportion of words that are incorrect |
+**Note on Token Error Rate:** Token Error Rate measures the model's ability to predict the next token given the audio input and the current sequence of tokens. This metric is weaker than Word Error Rate (WER) and Character Error Rate (CER) because it does not account for insertions, deletions, and substitutions as comprehensively. Token Error Rate is reported here because Khmer text lacks word boundaries, so computing WER and CER requires additional preprocessing.
+**Note on Translation Task:** The models are also trained for the `translation` task, but its evaluation is deferred to future work due to scarce data (2000 samples from Khmer audio to English text, and 1000 samples from English audio to Khmer text).
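One way to read the Token Error Rate described above is as a position-wise mismatch rate under teacher forcing. The sketch below assumes that interpretation: `predicted_ids[i]` is taken to be the model's prediction at position `i` given the audio and the reference tokens before it. The function name and the token ids are hypothetical, and the card's actual implementation may differ.

```python
def token_error_rate(reference_ids, predicted_ids):
    """Fraction of positions where the predicted next token differs from the reference.

    Assumes teacher forcing, so both sequences have the same length and
    position i in one sequence corresponds to position i in the other.
    """
    if len(reference_ids) != len(predicted_ids):
        raise ValueError("teacher-forced predictions must align with the reference")
    if not reference_ids:
        return 0.0
    errors = sum(r != p for r, p in zip(reference_ids, predicted_ids))
    return errors / len(reference_ids)


# Hypothetical token ids for illustration only.
reference = [11, 42, 7, 7, 93]
predicted = [11, 40, 7, 8, 93]
rate = token_error_rate(reference, predicted)  # 2 mismatches out of 5 positions
```

Because every position is scored against an aligned reference, an error never shifts later tokens, which is why this metric is more lenient than an edit-distance metric on free-running output.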
+### Results
+#### Language Detection Results
+| Dataset | Precision | Recall | Accuracy | F1-score |
+|---------|-----------|--------|----------|----------|
+| google/fleurs (Khmer) | 100% | 100% | 100% | 100% |
+| librispeech.clean (English) | 100% | 100% | 100% | 100% |
+**Key Finding:** Both model sizes achieved perfect language detection performance on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio.
+**Note on Language Detection Performance:** The 100% language detection scores may appear unusually high. A likely explanation is that, during training, the model permutes only the word tokens from position 3 onward, while the first three positions (start token, language token, and task token) stay fixed. With 6 permutations, the model therefore learns to predict the language token 6 times for each audio sample. Since language detection relies on the language token at position 1, which is never permuted, the model can reach perfect accuracy on this task.
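The fixed-prefix permutation idea can be illustrated with a short sketch. It samples orderings uniformly at random, which is a simplification and not necessarily the sampler used in training; `sample_orders` and its parameters are hypothetical names.

```python
import random


def sample_orders(seq_len, num_perms=6, fixed_prefix=3, seed=0):
    """Sample decoding orders that keep the first `fixed_prefix` positions in place."""
    rng = random.Random(seed)
    prefix = list(range(min(fixed_prefix, seq_len)))
    tail = list(range(len(prefix), seq_len))
    orders = []
    for _ in range(num_perms):
        shuffled = tail[:]
        rng.shuffle(shuffled)  # only the word-token positions are permuted
        orders.append(prefix + shuffled)
    return orders


# Positions 0, 1, 2 stand for the start, language, and task tokens.
orders = sample_orders(seq_len=8)
```

In every sampled order the language token stays at position 1, so it is always predicted from the same context regardless of how the remaining tokens are shuffled.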
+#### Transcription Results
+| Metric | Combined (Khmer + English) | Khmer | English |
+|--------|---------------------------|-------|---------|
+| Token Error Rate | 29% | 56% | 19% |
+| Character Error Rate (CER) | 32.89% | 60.71% | 20.98% |
+| Word Error Rate (WER) | 46.53% | 86.16% | 31.13% |
+**Key Observations:**
+- The model shows strong performance on English (19% token error rate, 20.98% CER, 31.13% WER)
+- Performance drops significantly for Khmer (56% token error rate, 60.71% CER, 86.16% WER)
+**Note:** To compute `CER` and `WER`, whitespace is inserted between words in Khmer text, since Khmer, unlike English, does not mark word boundaries. The `khmercut` PyPI package tokenizes the Khmer text into words, which are then joined back together with whitespace.
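The segment-then-join procedure can be sketched with a plain edit-distance implementation. This is an illustration, not the card's evaluation code: the `segment` parameter is a hypothetical hook that defaults to whitespace splitting, and for Khmer it would be replaced by a word segmenter such as one built on `khmercut`, whose call signature is not shown in this card.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalised by the reference length."""
    if not ref_units:
        return 0.0 if not hyp_units else 1.0
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def cer_and_wer(reference, hypothesis, segment=str.split):
    """Return (CER, WER); `segment` maps a string to a list of words."""
    ref_words, hyp_words = segment(reference), segment(hypothesis)
    # Join with single spaces so CER is computed on the same spacing WER used.
    ref_text, hyp_text = " ".join(ref_words), " ".join(hyp_words)
    return error_rate(list(ref_text), list(hyp_text)), error_rate(ref_words, hyp_words)


cer, wer = cer_and_wer("the cat sat down", "the cat sit down")
```

A single wrong character costs one full word in WER but only one character in CER, which is consistent with the WER figures in the table being higher than the CER figures.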
+#### Summary
+**Language Detection:** Both model sizes achieved perfect 100% performance across all metrics (Precision, Recall, Accuracy, F1-score) on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio. This perfect score is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
+**Transcription:** The Small model shows strong performance on English (10% token error rate, 7.08% CER, 12.95% WER) and moderate performance for Khmer (46% token error rate, 35.31% CER, 50.70% WER). The Tiny model shows strong performance on English (19% token error rate, 20.98% CER, 31.13% WER) but significantly lower performance for Khmer (56% token error rate, 60.71% CER, 86.16% WER). The larger model benefits from increased embedding dimension (768 vs 384) and more layers (12 vs 4).
+## How to Get Started with the Model
+First, install the `tror-yong-asr` PyPI package:
+```bash
+pip install tror-yong-asr
+```
+Then, use the code below to get started with the model.
+```python
+from transformers import AutoProcessor
+from tror_yong_asr import TrorYongASRModel, transcribe, translate
+model_id = "KrorngAI/tror-yong-asr-tiny"
+processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+model = TrorYongASRModel.from_pretrained(model_id, trust_remote_code=True)
+result1 = transcribe('/path/to/audio_file.mp3', model, processor, max_tokens=64)
+print(result1) # namedtuple: text: str, output_ids: torch.Tensor
+result2 = translate('/path/to/audio_file.mp3', model, processor, max_tokens=64)
+print(result2) # namedtuple: text: str, output_ids: torch.Tensor
+```
 ## Uses
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 - **Larger ASR systems**: As a component in multi-language ASR pipelines
 ## Bias, Risks, and Limitations
 **Technical Limitations:**
 Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
 ## Training Details
 <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing
 Following OpenAI's Whisper, audio clips longer than 30 seconds are filtered out.
 All audio clips have a sample rate of `16000` Hz.
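The 30-second limit and the 16000 Hz sample rate can be related to the audio context length with a little arithmetic. The sketch below assumes Whisper's front-end settings (a hop length of 160 samples and a stride-2 convolution in the encoder); neither value is stated in this card, so both are labeled as assumptions.

```python
SAMPLE_RATE = 16_000   # from the card
MAX_SECONDS = 30       # from the card
N_MELS = 80            # from the card
HOP_LENGTH = 160       # assumption: Whisper's 10 ms hop between frames
ENCODER_STRIDE = 2     # assumption: Whisper-style stride-2 convolution


def encoder_positions(seconds=MAX_SECONDS):
    """Number of encoder positions for a clip of the given length."""
    samples = seconds * SAMPLE_RATE
    mel_frames = samples // HOP_LENGTH
    return mel_frames // ENCODER_STRIDE


samples = MAX_SECONDS * SAMPLE_RATE            # samples in a 30-second clip
mel_shape = (N_MELS, samples // HOP_LENGTH)    # log-mel spectrogram shape
positions = encoder_positions()
```

Under these assumptions a full 30-second clip yields 1500 encoder positions, which matches the Audio Context value listed in the model table.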
 - **Gradient Clip Value:** 0.5 (only for parameters trained by AdamW)
+#### Speeds, Sizes, Times
 The training was conducted over 4000 optimizer steps on a single Tesla T4 GPU.
+The training took around 10 hours.
 ## Citation [optional]
 <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 [More Information Needed]
+## Model Card Authors
+Name: KHUN Kimang (Ph.D.)
+Email: kimang.khun@polytechnique.org
 ## Model Card Contact
+If you have any questions, please reach out via the [Facebook Page](https://www.facebook.com/profile.php?id=61582509385293).