Automatic Speech Recognition · Transformers · Safetensors · Khmer · English · troryongasr · custom_code
Kimang18 committed (verified) · Commit 6d23783 · 1 Parent(s): 51a732c

Update README.md

Files changed (1): README.md +85 -87
README.md CHANGED
@@ -49,18 +49,19 @@ pipeline_tag: automatic-speech-recognition

  ### Model Description

- This is an ASR (Automatic Speech Recognition) model inspired by [PARSeq](https://github.com/baudm/parseq/tree/main) and [Whisper](https://github.com/openai/whisper/tree/main). **TrorYongASR** is smaller than Whisper model of openai and supports only Khmer and English languages.

  <div align="center">

- | **Model Size** | Tiny | Small |
- |:-------------:|:-----------------:|:-----:|
- | **Parameters** | 29M | 136M |
  | **Audio Encoder** | 4 layers, 6 heads | 12 layers, 12 heads |
- | **Text Decoder** | 1 layer, 12 heads | 1 layer, 24 heads |
- | **Embedding Dim** | 384 | 768 |
- | **Audio Context** | 1500 | 1500 |
- | **Text Context** | 1024 | 1024 |
  </div>

  **Note:** The audio arrays are processed to log-mel spectrograms with `80` mels (the same as Whisper models of the same size).
@@ -75,13 +76,12 @@ This is an ASR (Automatic Speech Recognition) model inspired by [PARSeq](https:/

  <!-- Provide the basic links for the model. -->

  - **Repository:** https://github.com/Kimang18/KrorngAI/tree/main/tror-yong-asr
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]


  ## Evaluation

- <!-- This section describes the evaluation protocols and provides the results. -->
  The evaluation assesses two capabilities — language detection and transcription — on two datasets ([`google/fleurs`](https://huggingface.co/datasets/Kimang18/google-fleurs-km-kh) for Khmer and [`openslr/librispeech_asr`](https://huggingface.co/datasets/openslr/librispeech_asr) for English). All results are from the **test split** of each dataset, representing the model's generalization ability to unseen data.

  ### Testing Data
@@ -90,13 +90,13 @@ The evaluation assesses two capabilities — language detection and transcriptio

  <div align="center">

- | Dataset | Language | Testing examples | Description |
- | --------- | ---------- | ------------- | - |
- | **google/fleurs** | Khmer | 765 | Multi-lingual dataset with Khmer language samples |
- | **librispeech.clean** | English | 2620 | Clean speech dataset for English transcription |
  </div>

- **Note:** All evaluation results below are from the **test split** of each dataset. Audios longer than `30 seconds` are excluded from the evaluation (that is why `google/fleurs` has 765 examples instead of 771).

  ### Metrics and Results

@@ -104,96 +104,84 @@ The evaluation assesses two capabilities — language detection and transcriptio

  #### Language Detection

- **Task:** Given audio input, detect the language.

- <div align="center">
-
- | Metric | Description |
- |--------|-------------|
- | **Precision** | Proportion of predicted languages that are correct |
- | **Recall** | Proportion of actual language samples correctly identified |
- | **Accuracy** | Proportion of total predictions that are correct |
- | **F1-score** | Harmonic mean of precision and recall |
- </div>

  **Results:**

  <div align="center">

- | Model | Metrics | Khmer (`fleurs`) | English (`librispeech.clean`) |
- |-------|---------|------------------|-----------------------------|
- | Tiny | Precision | 100% | 100% |
- | | Recall | 100% | 100% |
- | | Accuracy | 100% | 100% |
- | | F1-score | 100% | 100% |
- | Small | Precision | 100% | 100% |
- | | Recall | 100% | 100% |
- | | Accuracy | 100% | 100% |
- | | F1-score | 100% | 100% |
  </div>

- **Key Finding:** Both model sizes achieved perfect language detection performance on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio.

- **Note on Language Detection Performance:** The 100% language detection scores may appear unusually high. This is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.


  #### Transcription

- **Task:** Convert audio to text (transcription).

- <div align="center">
-
- | Metric | Description |
- |--------|-------------|
- | **Token Error Rate** | Proportion of incorrectly transcribed tokens |
- | **Character Error Rate (CER)** | Proportion of characters that are incorrect |
- | **Word Error Rate (WER)** | Proportion of words that are incorrect |
- </div>

- **Note on Token Error Rate:** Token Error Rate measures model's capability in predicting the next token given the audio input and the current sequence of tokens. This metric is weaker than Word Error Rate (WER) and Character Error Rate (CER) because it doesn't account for insertions, deletions, substitutions, and autoregression as comprehensively. Token Error Rate is used here because Khmer text lacks word boundaries, making WER and CER calculations challenging without additional preprocessing.

- **Note on Translation Task:** The models are also trained for `translation` task, but evaluation is deferred to future work due to scarce data (there are only 2000 examples from Khmer audio to English text, and 1000 examples from English audio to Khmer text in the pre-training).

  **Transcription Results:**

  <div align="center">

- | Model | Metric | Khmer (`fleurs`) | English (`librispeech.clean`) | Mixed (Khmer + English) |
- |-------|--------|-------|---------|---------|
- | **Tiny** | Word Error Rate (WER) | 86.16% | 31.13% | 46.53% |
- | | Character Error Rate (CER) | 60.71% | 20.98% | 32.89% |
- | | Token Error Rate | 56% | 19% | 29% |
- | **Small** | Word Error Rate (WER) | 50.70% | 12.95% | 23.52% |
- | | Character Error Rate (CER) | 35.31% | 7.08% | 15.54% |
- | | Token Error Rate | 46% | 10% | 19% |
  </div>

  **Key Observations:**
- - The tiny model shows strong performance on English (31.13% WER, 20.98% CER, 19% token error rate)
- - Performance drops significantly for Khmer (86.16% WER, 60.71% CER, 56% token error rate)
- - The small model shows strong performance on English (12.95% WER, 7.08% CER, 10% token error rate)
- - Performance for Khmer is moderate (50.70% WER, 35.31% CER, 46% token error rate)
- - The larger model benefits from increased embedding dimension (768 vs 384) and more layers (12 vs 4)

  **Note:** To compute `CER` and `WER`, whitespaces are added between words in Khmer text (Khmer text does not have word boundaries like English text). To do so, the `khmercut` PyPI package is used to tokenize Khmer text into words, and then the words are joined back together with whitespaces.

  **WER Comparison with Whisper:**

- | Tiny | Parameters | Khmer (`fleurs`) | English (`librispeech.clean`) |
- |-------|--------|---------------------------| --- |
- | TrorYongASR | 29M | 86.16% | 31.13% |
- | Whisper | 39M | 100.6% | 7.6% |

- | Small | Parameters | Khmer (`fleurs`) | English (`librispeech.clean`) |
- |-------|--------|---------------------------| --- |
- | TrorYongASR | 136M | 50.70% | 12.95% |
- | Whisper | 244M | 104.4% | 3.4% |

  **Key Observations:**
- - Whisper models have more parameters for comparable sizes (39M vs 29M for Tiny, 244M vs 136M for Small)
- - Whisper shows significantly lower word error rates on English (7.6% vs 31.13% for Tiny, 3.4% vs 12.95% for Small)
- - Whisper performs worse on Khmer (100.6% vs 86.16% for Tiny, 104.4% vs 50.70% for Small)
  - Error rates > 100% for Whisper on Khmer indicate the model is overfitting to the training data

  **Note:** `WER` data of Whisper is taken from their [paper](https://arxiv.org/abs/2212.04356).
@@ -201,9 +189,11 @@ The evaluation assesses two capabilities — language detection and transcriptio

  ### Result Summary

- **Language Detection:** Both model sizes achieved perfect 100% performance across all metrics (Precision, Recall, Accuracy, F1-score) on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio. This perfect score is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.

- **Transcription:** The Small model shows strong performance on English (12.95% WER, 7.08% CER, 10% token error rate) and moderate performance for Khmer (50.70% WER, 35.31% CER, 46% token error rate). The Tiny model shows strong performance on English (31.13% WER, 20.98% CER, 19% token error rate) but significantly lower performance for Khmer (86.16% WER, 60.71% CER, 56% token error rate). This shows that `TrorYongASR` can be scaled to get higher performance (both models have the same pre-training configuration. See Section Training Details below).


  ## How to Get Started with the Model
@@ -220,7 +210,7 @@ from transformers import AutoProcessor
  from tror_yong_asr import TrorYongASRModel, transcribe, translate, detect_language


- model_id = "KrorngAI/tror-yong-asr-tiny"
  processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
  model = TrorYongASRModel.from_pretrained(model_id, trust_remote_code=True)

@@ -317,35 +307,45 @@ For translation task, the data was scarce: only 2000 examples for Khmer audio to
  #### Preprocessing

  Following OpenAI's `Whisper` model, audios longer than 30 seconds are filtered out.
- All audios have `16, 000` sample rate.
  For the English dataset, all texts are lowercased.

  #### Training Hyperparameters

- - **Training regime:** bf16 mixed precision training using `LightningAI` package <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
  - **Optimizer:** MuonAdamW (custom implementation)
- - **Learning rate:** Linear Warmup (40 optimizer steps) + CosineAnnealing (3960 optimizer steps)
  - **Weight decay:** 0.1
  - **Effective Batch size:** 64
- - **Number of optimizer steps:** 4000
  - **Number of epochs:** roughly 2 epochs
  - **Gradient Clip Value:** 0.5 (only for parameters trained by AdamW)


  #### Speeds, Sizes, Times

- The training was conducted over 4000 optimizer steps.
- For tiny variant, the training took around 10 hours on 1 Tesla T4 GPU.
- For small variant, the training took around 2 hours on 1 A100 GPU.


- ## Citation [optional]

  <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

  **BibTeX:**

- [More Information Needed]


  ## Model Card Author
@@ -361,8 +361,6 @@ So, huge thanks to [`LightningAI`](https://lightning.ai) and `Google Colab`.

  Thanks to the authors of [`PARSeq`](https://github.com/baudm/parseq/tree/main) and [`Whisper`](https://github.com/openai/whisper/tree/main) for their publicly available source code.

- Thanks to [`openslr`](https://openslr.org), [Mozilla Data Collective](https://mozilladatacollective.com/datasets/cml9h5vgc01bxmn075sjeftek) and Google for their publicly available dataset.
-
  ## Model Card Contact

  If you have any questions, please reach out at [Facebook Page](https://www.facebook.com/profile.php?id=61582509385293).
 
49
 
50
  ### Model Description
51
 
52
+ TrorYongASR is an Encoder-Decoder model for Automatic Speech Recognition (ASR) task.
53
+ It is inspired by [PARSeq](https://github.com/baudm/parseq/tree/main) and [Whisper](https://github.com/openai/whisper/tree/main): the auditory-lingual decoder has only one transformer block.
54
 
55
  <div align="center">
56
 
57
+ | **Model Size** | Tiny | Small |
58
+ |:-----------------:|:-----------------:|:-------------------:|
59
+ | **Parameters** | 29M | 135M |
60
  | **Audio Encoder** | 4 layers, 6 heads | 12 layers, 12 heads |
61
+ | **Text Decoder** | 1 layer, 12 heads | 1 layer, 24 heads |
62
+ | **Embedding Dim** | 384 | 768 |
63
+ | **Audio Context** | 1500 | 1500 |
64
+ | **Text Context** | 1024 | 1024 |
65
  </div>
66
 
67
  **Note:** The audio array are processed to log-mel spectrogram with `80` mels (the same as Whisper models of the same size)
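
This note maps onto Whisper-style feature extraction. A minimal sketch (my own illustration, not code from this repository), assuming the standard `WhisperFeatureExtractor` from `transformers`:

```python
# Sketch: produce the 80-mel log-mel features described above.
# Assumes Whisper-style defaults (16 kHz input, 30 s padded window).
import numpy as np
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor(feature_size=80, sampling_rate=16000)

audio = np.zeros(16000 * 5, dtype=np.float32)  # 5 s of silent dummy audio
features = feature_extractor(audio, sampling_rate=16000, return_tensors="np")
# (1, 80, 3000): 80 mel bins x 3000 frames; a stride-2 encoder convolution,
# as in Whisper, would reduce 3000 frames to the 1500 audio-context positions.
print(features.input_features.shape)
```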
 
  <!-- Provide the basic links for the model. -->

  - **Repository:** https://github.com/Kimang18/KrorngAI/tree/main/tror-yong-asr
+ - **Blog Post:** https://kimang18.github.io/krorngai-blog/TrorYongASR/
+ - **Demo:** TBA


  ## Evaluation

  The evaluation assesses two capabilities — language detection and transcription — on two datasets ([`google/fleurs`](https://huggingface.co/datasets/Kimang18/google-fleurs-km-kh) for Khmer and [`openslr/librispeech_asr`](https://huggingface.co/datasets/openslr/librispeech_asr) for English). All results are from the **test split** of each dataset, representing the model's generalization ability to unseen data.

  ### Testing Data

  <div align="center">

+ | Dataset | Language | Testing examples | Description |
+ | ------------- | ---------- | ------------- | - |
+ | **google/fleurs** | Khmer | 765 | Multi-lingual dataset with Khmer language samples |
+ | **librispeech.clean** | English | 2620 | Clean speech dataset for English transcription |
  </div>

+ **Note:** Audios longer than `30 seconds` are excluded from the evaluation (that is why `google/fleurs` has 765 examples instead of 771).

  ### Metrics and Results

  #### Language Detection

+ Language detection measures the model's capability to recognize the spoken language from audio input. Since TrorYongASR currently supports two languages, this is a binary classification task. The classic metrics are used (a minimal sketch of computing them follows the list):

+ - **Precision**: Proportion of predicted languages that are correct
+ - **Recall**: Proportion of actual language samples correctly identified
+ - **F1-score**: Harmonic mean of precision and recall
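
A minimal sketch of these metrics with `scikit-learn` (illustrative labels, not the actual evaluation code), treating Khmer as the positive class:

```python
# Sketch: binary language-detection metrics with scikit-learn.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["km", "km", "en", "en", "km"]  # reference languages (illustrative)
y_pred = ["km", "en", "en", "en", "km"]  # predicted languages (illustrative)

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, pos_label="km", average="binary"
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```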
  **Results:**

  <div align="center">

+ | Model | Metrics | Khmer (`fleurs`) | English (`librispeech.clean`) |
+ |-------|-----------|------------------|-------------------------------|
+ | Tiny | Precision | 100% | 100% |
+ | | Recall | 100% | 100% |
+ | | F1-score | 100% | 100% |
+ | Small | Precision | 100% | 99% |
+ | | Recall | 96% | 100% |
+ | | F1-score | 98% | 99% |
  </div>

+ The Tiny size achieved perfect language detection performance on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio. The Small size performs slightly worse, tending to over-predict English.

+ The near-perfect language detection scores may appear unusually high. This is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve near-perfect accuracy on language detection.
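
A small sketch of the mechanism described above (my illustration; the repository's implementation may differ), sampling a PARSeq-style decoding order that keeps the first three positions fixed:

```python
# Sketch: permuted decoding order that keeps the first three positions
# (start, language, task tokens) fixed and shuffles the word tokens.
import torch

def sample_decode_order(seq_len: int, n_fixed: int = 3) -> torch.Tensor:
    """First n_fixed positions stay first and in order; the rest are permuted."""
    fixed = torch.arange(n_fixed)
    permuted = n_fixed + torch.randperm(seq_len - n_fixed)
    return torch.cat([fixed, permuted])

print(sample_decode_order(8))  # e.g. tensor([0, 1, 2, 6, 4, 7, 3, 5])
```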

  #### Transcription

+ For the transcription task, the 3 metrics below are used:

+ - **Token Error Rate (TER)**: Proportion of incorrectly transcribed tokens
+ - **Character Error Rate (CER)**: Proportion of characters that are incorrect
+ - **Word Error Rate (WER)**: Proportion of words that are incorrect

+ Token Error Rate (TER) measures the model's capability to predict the next token given the audio input and the current sequence of tokens. This metric is weaker than Word Error Rate (WER) and Character Error Rate (CER) because it doesn't account for insertions, deletions, substitutions, and autoregression as comprehensively. Token Error Rate is used here because Khmer text lacks word boundaries, making WER and CER calculations challenging without additional preprocessing.
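
For reference, WER and CER can be computed with the `jiwer` PyPI package (a minimal sketch, not the evaluation script used here):

```python
# Sketch: WER and CER with jiwer (pip install jiwer).
import jiwer

reference = "the quick brown fox"
hypothesis = "the quick brown box"

print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")  # 1 wrong word out of 4 -> 25.00%
print(f"CER: {jiwer.cer(reference, hypothesis):.2%}")  # 1 wrong character out of 19
```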

  **Transcription Results:**

  <div align="center">

+ | Model | Metric | Khmer (`fleurs`) | English (`librispeech.clean`) | Mixed (Khmer + English) |
+ |-----------|--------|------------------|-------------------------------|-------------------------|
+ | **Tiny** | WER | 75.81% | 54.33% | 60.36% |
+ | | CER | 54.99% | 42.41% | 46.18% |
+ | | TER | 54% | 17% | 27% |
+ | **Small** | WER | 50.46% | 21.75% | 29.78% |
+ | | CER | 35.89% | 16.58% | 22.37% |
+ | | TER | 43% | 8% | 18% |
  </div>

  **Key Observations:**
+
+ - The tiny model reaches moderate performance on English (54.33% WER, 42.41% CER, 17% TER)
+ - Performance drops significantly for Khmer (75.81% WER, 54.99% CER, 54% TER)
+ - The small model shows strong performance on English (21.75% WER, 16.58% CER, 8% TER)
+ - Performance for Khmer is moderate (50.46% WER, 35.89% CER, 43% TER)
+ - The larger model benefits from an increased embedding dimension (768 vs 384) and more audio-encoder layers (12 vs 4)

  **Note:** To compute `CER` and `WER`, whitespaces are added between words in Khmer text (Khmer text does not have word boundaries like English text). To do so, the `khmercut` PyPI package is used to tokenize Khmer text into words, and then the words are joined back together with whitespaces.
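
A minimal sketch of that whitespace insertion, assuming `khmercut` exposes a `tokenize` function (verify against the package before relying on it):

```python
# Sketch: add word boundaries to Khmer text with khmercut, then score with jiwer.
import jiwer
from khmercut import tokenize  # assumed entry point of the khmercut package

def add_word_boundaries(khmer_text: str) -> str:
    """Join segmented Khmer words with spaces so WER becomes well-defined."""
    return " ".join(tokenize(khmer_text))

reference = add_word_boundaries("ខ្ញុំស្រឡាញ់ភាសាខ្មែរ")
hypothesis = add_word_boundaries("ខ្ញុំស្រឡាញ់ភាសាខ្មែរ")
print(jiwer.wer(reference, hypothesis))  # 0.0 for identical strings
```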

  **WER Comparison with Whisper:**

+ | Tiny | Parameters | Khmer (`fleurs`) | English (`librispeech.clean`) |
+ | ----------- | ---------- | ---------------- | ----------------------------- |
+ | TrorYongASR | 29M | 75.81% | 54.33% |
+ | Whisper | 39M | 100.6% | 7.6% |

+ | Small | Parameters | Khmer (`fleurs`) | English (`librispeech.clean`) |
+ | ----------- | ---------- | ---------------- | ----------------------------- |
+ | TrorYongASR | 135M | 50.46% | 21.75% |
+ | Whisper | 244M | 104.4% | 3.4% |

  **Key Observations:**
+
+ - Whisper models have more parameters for comparable sizes (39M vs 29M for Tiny, 244M vs 135M for Small)
+ - Whisper shows significantly lower word error rates on English (7.6% vs 54.33% for Tiny, 3.4% vs 21.75% for Small)
+ - Whisper performs worse on Khmer (100.6% vs 75.81% for Tiny, 104.4% vs 50.46% for Small)
  - Error rates > 100% for Whisper on Khmer indicate the model is overfitting to the training data

  **Note:** `WER` data of Whisper is taken from their [paper](https://arxiv.org/abs/2212.04356).
 

  ### Result Summary

+ **Language Detection:** Both model sizes achieved near-perfect performance across all metrics (Precision, Recall, F1-score) on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio. This high score is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve near-perfect accuracy on language detection tasks.

+ **Transcription:** The Small model shows strong performance on English (21.75% WER, 16.58% CER, 8% TER) and moderate performance for Khmer (50.46% WER, 35.89% CER, 43% TER). The Tiny model reaches moderate performance on English (54.33% WER, 42.41% CER, 17% TER) but significantly lower performance for Khmer (75.81% WER, 54.99% CER, 54% TER). This shows that TrorYongASR can be scaled to get higher performance.
+
+ **Note on Translation Task:** The models are also trained for the `translation` task, but evaluation is deferred to future work due to scarce data (there are only 2000 examples from Khmer audio to English text, and 1000 examples from English audio to Khmer text in the pre-training).


  ## How to Get Started with the Model
 
  from tror_yong_asr import TrorYongASRModel, transcribe, translate, detect_language


+ model_id = "KrorngAI/TrorYongASR-tiny"
  processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
  model = TrorYongASRModel.from_pretrained(model_id, trust_remote_code=True)
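  # The helpers imported above are the intended entry points, but this diff does
  # not show their signatures; the calls below are assumptions, so verify them
  # against the repository before use.
  # language = detect_language(model, processor, "sample.wav")
  # text = transcribe(model, processor, "sample.wav")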
 
  #### Preprocessing

  Following OpenAI's `Whisper` model, audios longer than 30 seconds are filtered out.
+ All audios have a `16000` Hz sample rate.
  For the English dataset, all texts are lowercased.
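
A sketch of this preprocessing with the `datasets` library (illustrative; the actual training pipeline may differ):

```python
# Sketch: resample to 16 kHz, drop clips longer than 30 s, lowercase English text.
from datasets import Audio, load_dataset

ds = load_dataset("openslr/librispeech_asr", "clean", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16000))              # resample
ds = ds.filter(lambda ex: len(ex["audio"]["array"]) / 16000 <= 30.0)  # <= 30 s
ds = ds.map(lambda ex: {"text": ex["text"].lower()})                  # lowercase
```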

  #### Training Hyperparameters

+ - **Training regime:** `16-mixed` precision training using the `LightningAI` package <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
  - **Optimizer:** MuonAdamW (custom implementation)
+ - **Learning rate:** Linear Warmup (38 optimizer steps) + CosineAnnealing (3774 optimizer steps); a scheduler sketch follows this list
  - **Weight decay:** 0.1
  - **Effective Batch size:** 64
+ - **Number of optimizer steps:** 3812
  - **Number of epochs:** roughly 2 epochs
  - **Gradient Clip Value:** 0.5 (only for parameters trained by AdamW)
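
A minimal PyTorch sketch of the warmup-plus-cosine schedule above (illustrative; the actual training code lives in the repository):

```python
# Sketch: linear warmup (38 steps) then cosine annealing (3774 steps),
# 3812 optimizer steps in total, matching the hyperparameters above.
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = torch.nn.Linear(384, 384)  # stand-in module, not the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)

scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=1e-3, total_iters=38),  # warmup
        CosineAnnealingLR(optimizer, T_max=3774),                # cosine decay
    ],
    milestones=[38],
)

for step in range(3812):
    optimizer.step()   # training computation omitted
    scheduler.step()
```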

  #### Speeds, Sizes, Times

+ The training was conducted over 3812 optimizer steps.

+ - For the tiny variant, training took around 6 hours on 1 Tesla T4 GPU.
+ - For the small variant, training took around 7 hours on 2 Tesla T4 GPUs (using the DDP strategy).

+
+ ## Citation

  <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

  **BibTeX:**

+ ```bibtex
+ @online{khun2026,
+   author = {Khun, Kimang},
+   title  = {TrorYongASR: {Permuted} {AutoRegressive} {Sequence}
+             {Modeling} for {Automatic} {Speech} {Recognition}},
+   date   = {2026-05-07},
+   url    = {https://kimang18.github.io/krorngai-blog/TrorYongASR/},
+   langid = {en}
+ }
+ ```

  ## Model Card Author
 

  Thanks to the authors of [`PARSeq`](https://github.com/baudm/parseq/tree/main) and [`Whisper`](https://github.com/openai/whisper/tree/main) for their publicly available source code.

  ## Model Card Contact

  If you have any questions, please reach out at [Facebook Page](https://www.facebook.com/profile.php?id=61582509385293).