Automatic Speech Recognition
Transformers
Safetensors
Khmer
English
troryongasr
custom_code
Kimang18 commited on
Commit
ebdd841
·
1 Parent(s): 62dc600

Update README.md

  1. README.md +268 -101
README.md CHANGED
@@ -1,199 +1,366 @@
1
  ---
2
  library_name: transformers
3
- tags: []
 
 
 
 
 
 
 
 
 
 
 
 
 
4
  ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
5
 
6
- # Model Card for Model ID
7
 
8
- <!-- Provide a quick summary of what the model is/does. -->
9
 
 
 
10
 
 
11
 
12
- ## Model Details
 
 
 
 
 
 
 
 
13
 
14
- ### Model Description
15
 
16
- <!-- Provide a longer summary of what this model is. -->
 
 
 
17
 
18
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
19
 
20
- - **Developed by:** [More Information Needed]
21
- - **Funded by [optional]:** [More Information Needed]
22
- - **Shared by [optional]:** [More Information Needed]
23
- - **Model type:** [More Information Needed]
24
- - **Language(s) (NLP):** [More Information Needed]
25
- - **License:** [More Information Needed]
26
- - **Finetuned from model [optional]:** [More Information Needed]
27
 
28
- ### Model Sources [optional]
 
 
29
 
30
- <!-- Provide the basic links for the model. -->
31
 
32
- - **Repository:** [More Information Needed]
33
- - **Paper [optional]:** [More Information Needed]
34
- - **Demo [optional]:** [More Information Needed]
35
 
36
- ## Uses
37
 
38
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
39
 
40
- ### Direct Use
41
 
42
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
43
 
44
- [More Information Needed]
 
 
 
 
45
 
46
- ### Downstream Use [optional]
47
 
48
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
 
50
- [More Information Needed]
51
 
52
- ### Out-of-Scope Use
53
 
54
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
 
56
- [More Information Needed]
 
 
57
 
58
- ## Bias, Risks, and Limitations
59
 
60
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
61
 
62
- [More Information Needed]
 
 
 
 
 
 
 
 
63
 
64
- ### Recommendations
65
 
66
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
 
68
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
69
 
70
- ## How to Get Started with the Model
71
 
72
- Use the code below to get started with the model.
73
 
74
- [More Information Needed]
 
 
75
 
76
- ## Training Details
77
 
78
- ### Training Data
79
 
80
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
 
82
- [More Information Needed]
83
 
84
- ### Training Procedure
 
 
 
 
 
 
 
 
85
 
86
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
87
 
88
- #### Preprocessing [optional]
 
 
 
 
89
 
90
- [More Information Needed]
91
 
92
 
93
- #### Training Hyperparameters
94
 
95
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
 
 
 
96
 
97
- #### Speeds, Sizes, Times [optional]
 
 
 
98
 
99
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
100
 
101
- [More Information Needed]
 
 
 
102
 
103
- ## Evaluation
104
 
105
- <!-- This section describes the evaluation protocols and provides the results. -->
106
 
107
- ### Testing Data, Factors & Metrics
108
 
109
- #### Testing Data
110
 
111
- <!-- This should link to a Dataset Card if possible. -->
112
 
113
- [More Information Needed]
114
 
115
- #### Factors
116
 
117
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
118
 
119
- [More Information Needed]
 
 
 
120
 
121
- #### Metrics
122
 
123
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
 
 
 
 
 
 
 
 
 
 
 
 
 
124
 
125
- [More Information Needed]
 
 
126
 
127
- ### Results
128
 
129
- [More Information Needed]
130
 
131
- #### Summary
 
 
 
 
 
 
 
 
 
 
 
132
 
133
 
 
134
 
135
- ## Model Examination [optional]
 
 
 
 
136
 
137
- <!-- Relevant interpretability work for the model goes here -->
138
 
139
- [More Information Needed]
140
 
141
- ## Environmental Impact
 
 
 
 
142
 
143
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
 
 
 
144
 
145
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
146
 
147
- - **Hardware Type:** [More Information Needed]
148
- - **Hours used:** [More Information Needed]
149
- - **Cloud Provider:** [More Information Needed]
150
- - **Compute Region:** [More Information Needed]
151
- - **Carbon Emitted:** [More Information Needed]
152
 
153
- ## Technical Specifications [optional]
154
 
155
- ### Model Architecture and Objective
156
 
157
- [More Information Needed]
158
 
159
- ### Compute Infrastructure
160
 
161
- [More Information Needed]
162
 
163
- #### Hardware
164
 
165
- [More Information Needed]
166
 
167
- #### Software
168
 
169
- [More Information Needed]
 
 
170
 
171
- ## Citation [optional]
 
 
 
 
 
 
 
 
172
 
173
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
174
 
175
- **BibTeX:**
176
 
177
- [More Information Needed]
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
178
 
179
- **APA:**
180
 
181
- [More Information Needed]
 
 
 
 
 
 
 
 
 
 
 
 
182
 
183
- ## Glossary [optional]
 
 
 
 
 
 
 
 
 
184
 
185
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
186
 
187
- [More Information Needed]
188
 
189
- ## More Information [optional]
 
190
 
191
- [More Information Needed]
192
 
193
- ## Model Card Authors [optional]
 
 
194
 
195
- [More Information Needed]
196
 
197
  ## Model Card Contact
198
 
199
- [More Information Needed]
 
1
  ---
2
  library_name: transformers
3
+ license: other
4
+ license_name: modified-mit
5
+ datasets:
6
+ - DDD-Cambodia/khm-asr-cultural
7
+ - openslr/librispeech_asr
8
+ - KrorngAI/fleurs-km-kh-openslr-SLR42
9
+ language:
10
+ - km
11
+ - en
12
+ metrics:
13
+ - wer
14
+ - cer
15
+ - ter
16
+ pipeline_tag: automatic-speech-recognition
17
  ---
18
+ <div align="center">
19
+ <picture>
20
+ <img src="figures/krorngai.png" width="30%" alt="KrorngAI">
21
+ </picture>
22
+ </div>
23
+ <hr>
24
+ <!--
25
+ <div align="center" style="line-height:1">
26
+ <a href="https://www.kimi.com" target="_blank"><img alt="Chat" src="https://img.shields.io/badge/🤖%20Chat-Kimi%20K2.6-ff6b6b?color=1783ff&logoColor=white"/></a>
27
+ <a href="https://www.facebook.com/profile.php?id=61582509385293" target="_blank"><img alt="Homepage" src="https://img.shields.io/badge/Homepage-Krorng%20AI-white?logoColor=white"/></a>
28
+ </div>
29
+ -->
30
+
31
+ <div align="center" style="line-height: 1;">
32
+ <a href="https://huggingface.co/KrorngAI" target="_blank"><img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Krorng%20AI-ffc107?color=ffc107&logoColor=white"/></a>
33
+ <a href="https://youtube.com/@krorngai" target="_blank"><img alt="YouTube Channel" src="https://img.shields.io/badge/Youtube-Krorng%20AI-red?logoColor=red"/></a>
34
+ <a href="https://www.facebook.com/profile.php?id=61582509385293" target="_blank"><img alt="Homepage" src="https://img.shields.io/badge/Facebook-Krorng%20AI-blue?logoColor=blue"/></a>
35
+ <a href="https://kimang18.github.io" target="_blank"><img alt="Personal" src="https://img.shields.io/badge/KHUN-white?logoColor=white"/></a>
36
+ </div>
37
+ <div align="center" style="line-height: 1;">
38
+ <a href="https://huggingface.co/moonshotai/Kimi-K2.6/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/badge/License-Modified_MIT-f5de53?&color=f5de53"/></a>
39
+ </div>
40
+
41
+
42
+ # TrorYongASR
43
+
44
+ > [!Note]
45
+ > This repository contains model weights and configuration files for the pre-trained model.
46
+ >
47
 
48
+ ## Model Details
49
 
50
+ ### Model Description
51
 
52
+ TrorYongASR is an encoder-decoder model for automatic speech recognition (ASR).
53
+ It is inspired by [PARSeq](https://github.com/baudm/parseq/tree/main) and [Whisper](https://github.com/openai/whisper/tree/main): the auditory-lingual decoder has only one transformer block.
54
 
55
+ <div align="center">
56
 
57
+ | **Model Size** | Tiny | Small |
58
+ |:-----------------:|:-----------------:|:-------------------:|
59
+ | **Parameters** | 29M | 135M |
60
+ | **Audio Encoder** | 4 layers, 6 heads | 12 layers, 12 heads |
61
+ | **Text Decoder** | 1 layer, 12 heads | 1 layer, 24 heads |
62
+ | **Embedding Dim** | 384 | 768 |
63
+ | **Audio Context** | 1500 | 1500 |
64
+ | **Text Context** | 1024 | 1024 |
65
+ </div>
66
 
67
+ **Note:** Audio arrays are converted to log-mel spectrograms with `80` mel bins (the same as Whisper models of the same size).
68
 
69
+ - **Developed by:** KHUN Kimang (Ph.D.)
70
+ - **Shared by:** KrorngAI
71
+ - **Model type:** ASR (Automatic Speech Recognition)
72
+ - **Language(s) (NLP):** Khmer and English
73
 
74
+ ### Model Sources
75
 
76
+ <!-- Provide the basic links for the model. -->
 
 
 
 
 
 
77
 
78
+ - **Repository:** https://github.com/Kimang18/KrorngAI/tree/main/tror-yong-asr
79
+ - **Blog Post:** https://kimang18.github.io/krorngai-blog/TrorYongASR/
80
+ - **Demo [optional]:** TBA
81
 
 
82
 
83
+ ## Evaluation
 
 
84
 
85
+ The evaluation assesses two capabilities, language detection and transcription, on two datasets ([`google/fleurs`](https://huggingface.co/datasets/Kimang18/google-fleurs-km-kh) for Khmer and [`openslr/librispeech_asr`](https://huggingface.co/datasets/openslr/librispeech_asr) for English). All results are from the **test split** of each dataset, so they reflect how well the model generalizes to unseen data.
86
 
87
+ ### Testing Data
88
 
89
+ <!-- This should link to a Dataset Card if possible. -->
90
 
91
+ <div align="center">
92
 
93
+ | Dataset | Language | Testing examples | Description |
94
+ | ------------- | ---------- | ------------- | - |
95
+ | **google/fleurs** | Khmer | 765 | Multi-lingual dataset with Khmer language samples |
96
+ | **librispeech.clean** | English | 2620 | Clean speech dataset for English transcription |
97
+ </div>
98
 
99
+ **Note:** Audio clips longer than 30 seconds are excluded from the evaluation, which is why `google/fleurs` has 765 examples instead of 771.
100
 
101
+ ### Metrics and Results
102
 
103
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
104
 
105
+ #### Language Detection
106
 
107
+ Language detection measures the model's ability to recognize the spoken language from the audio input. Since TrorYongASR currently supports two languages, this is a binary classification task. The standard metrics are used:
108
 
109
+ - **Precision**: Proportion of predicted languages that are correct
110
+ - **Recall** : Proportion of actual language samples correctly identified
111
+ - **F1-score** : Harmonic mean of precision and recall
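As a minimal sketch of how these three metrics are computed for one language treated as the positive class, the snippet below uses toy labels that are purely illustrative (they are not the evaluation data), and the function name `precision_recall_f1` is a hypothetical helper, not part of `tror-yong-asr`.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # correct positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # wrongly predicted positive
    fn = sum(t == positive and p != positive for t, p in pairs)  # missed positives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels (illustrative only): one Khmer clip is misclassified as English.
y_true = ["km", "km", "km", "km", "en", "en", "en", "en"]
y_pred = ["km", "km", "km", "en", "en", "en", "en", "en"]

for lang in ("km", "en"):
    p, r, f = precision_recall_f1(y_true, y_pred, positive=lang)
    print(f"{lang}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In a two-class setting, a Khmer clip predicted as English lowers Khmer recall and English precision at the same time, while Khmer precision and English recall are unaffected.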
112
 
113
+ **Results:**
114
 
115
+ <div align="center">
116
 
117
+ | Model | Metrics | Khmer (`fleurs`) | English (`librispeech.clean`) |
118
+ |-------|-----------|------------------|-------------------------------|
119
+ | Tiny | Precision | 100% | 100% |
120
+ | | Recall | 100% | 100% |
121
+ | | F1-score | 100% | 100% |
122
+ | Small | Precision | 100% | 99% |
123
+ | | Recall | 96% | 100% |
124
+ | | F1-score | 98% | 99% |
125
+ </div>
126
 
127
+ The Tiny model achieved perfect language detection on both datasets, indicating that it reliably distinguishes Khmer from English audio. The Small model performs slightly worse: it misclassifies some Khmer audio as English, which lowers Khmer recall to 96% and English precision to 99%.
128
 
129
+ The 100% language detection scores may appear unusually high. This is expected: during pre-training, permutations are applied only to word tokens from position 3 onward, while the first three positions (start token, language token, and task token, at zero-indexed positions 0, 1, and 2) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection.
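The ordering constraint described above can be illustrated with a short sketch. This is not the author's training code: the helper name `permute_after_prefix` is hypothetical, and a real PARSeq-style setup would turn such an order into attention masks, which is omitted here. The sketch only shows that the three prefix positions stay in place while later positions are shuffled.

```python
import random

NUM_FIXED = 3  # start token, language token, task token (positions 0, 1, 2)


def permute_after_prefix(num_tokens, rng):
    """Return a position order that keeps the first NUM_FIXED positions in place
    and shuffles all remaining positions."""
    fixed = list(range(min(NUM_FIXED, num_tokens)))
    rest = list(range(len(fixed), num_tokens))
    rng.shuffle(rest)  # only the word-token positions are permuted
    return fixed + rest


rng = random.Random(0)  # seeded only to make the illustration repeatable
order = permute_after_prefix(8, rng)
print(order)  # the first three entries are always 0, 1, 2
```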
130
 
 
131
 
132
+ #### Transcription
133
 
134
+ For the transcription task, the three metrics below are used:
135
 
136
+ - **Token Error Rate (TER)** : Proportion of incorrectly transcribed tokens
137
+ - **Character Error Rate (CER)** : Proportion of characters that are incorrect
138
+ - **Word Error Rate (WER)** : Proportion of words that are incorrect
139
 
140
+ Token Error Rate (TER) measures the model's ability to predict the next token given the audio input and the current sequence of tokens. It is a weaker metric than Word Error Rate (WER) and Character Error Rate (CER) because it does not account as comprehensively for insertions, deletions, and substitutions, or for errors that accumulate during autoregressive decoding. TER is reported here because Khmer text lacks word boundaries, which makes WER and CER hard to compute without additional preprocessing.
141
 
 
142
 
143
+ **Transcription Results:**
144
 
145
+ <div align="center">
146
 
147
+ | Model | Metric | Khmer (`fleurs`) | English (`librispeech.clean`) | Mixed (Khmer + English) |
148
+ |-----------|----------------------------|------------------|-------------------------------|-------------------------|
149
+ | **Tiny** | WER | 75.88% | 54.33% | 60.36% |
150
+ | | CER | 54.99% | 42.41% | 46.18% |
151
+ | | TER | 54% | 17% | 27% |
152
+ | **Small** | WER | 50.46% | 21.75% | 29.78% |
153
+ | | CER | 35.89% | 16.58% | 22.37% |
154
+ | | TER | 43% | 8% | 18% |
155
+ </div>
156
 
157
+ **Key Observations:**
158
 
159
+ - The Tiny model performs best on English (54.33% WER, 42.41% CER, 17% TER)
160
+ - Performance drops significantly for Khmer (75.88% WER, 54.99% CER, 54% TER)
161
+ - The Small model more than halves the Tiny model's English error rates (21.75% WER, 16.58% CER, 8% TER)
162
+ - Performance for Khmer is moderate (50.46% WER, 35.89% CER, 43% TER)
163
+ - The larger model benefits from a larger embedding dimension (768 vs 384) and more audio encoder layers (12 vs 4)
164
 
165
+ **Note:** To compute `CER` and `WER`, whitespace is inserted between words in the Khmer text, because Khmer text does not mark word boundaries the way English text does. The `khmercut` PyPI package is used to tokenize the Khmer text into words, which are then joined back together with spaces.
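A minimal sketch of the segment-then-score procedure follows, assuming WER and CER are defined as Levenshtein distance divided by reference length. The `segment` argument is a stand-in for whatever word segmenter is used (for Khmer, a `khmercut`-based one; its exact call is not shown here), and the helper names `edit_distance`, `wer`, `cer`, and `score` are hypothetical. English text with `str.split` is used only to keep the example self-contained.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref_words, hyp_words):
    # Assumes a non-empty reference.
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def cer(ref_text, hyp_text):
    # Assumes a non-empty reference.
    return edit_distance(ref_text, hyp_text) / len(ref_text)


def score(ref_text, hyp_text, segment):
    """`segment` is any callable that splits text into a list of words."""
    ref_joined = " ".join(segment(ref_text))
    hyp_joined = " ".join(segment(hyp_text))
    return {
        "wer": wer(ref_joined.split(), hyp_joined.split()),
        "cer": cer(ref_joined, hyp_joined),
    }


print(score("the cat sat", "the bat sat", str.split))
# Four inserted words against a three-word reference give a WER above 1.0.
print(score("the cat sat", "the cat sat on the mat today", str.split))
```

Because insertions are counted against the reference length, WER is not bounded by 100%.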
166
 
167
 
168
+ **WER Comparison with Whisper:**
169
 
170
+ | Tiny | Parameters | Khmer (`fleurs`) | English (`librispeech.clean`) |
171
+ | ------- | -------- | --------------------------- | --- |
172
+ | TrorYongASR | 29M | 75.88% | 54.33% |
173
+ | Whisper | 39M | 100.6% | 7.6% |
174
 
175
+ | Small | Parameters | Khmer (`fleurs`) | English (`librispeech.clean`) |
176
+ | ------- | -------- | --------------------------- | --- |
177
+ | TrorYongASR | 135M | 50.46% | 21.75% |
178
+ | Whisper | 244M | 104.4% | 3.4% |
179
 
180
+ **Key Observations:**
181
 
182
+ - Whisper models have more parameters for comparable sizes (39M vs 29M for Tiny, 244M vs 135M for Small)
183
+ - Whisper shows significantly lower word error rates on English (7.6% vs 54.33% for Tiny, 3.4% vs 21.75% for Small)
184
+ - Whisper performs worse on Khmer (100.6% vs 75.88% for Tiny, 104.4% vs 50.46% for Small)
185
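+ - WER above 100% for Whisper on Khmer is possible because inserted words count as errors: the output contains more word errors than the reference has words, which typically means many spurious or hallucinated words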
186
 
187
+ **Note:** The Whisper `WER` figures are taken from the Whisper [paper](https://arxiv.org/abs/2212.04356).
188
 
 
189
 
190
+ ### Result Summary
191
 
192
+ **Language Detection:** Both model sizes score between 96% and 100% on all metrics (Precision, Recall, F1-score) on both datasets, and the Tiny model reaches 100% throughout, so both reliably distinguish Khmer from English audio. These high scores are expected because, during pre-training, permutations are applied only to word tokens from position 3 onward, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can reach perfect accuracy on language detection.
193
 
194
+ **Transcription:** The Small model reaches 21.75% WER, 16.58% CER, and 8% TER on English, and 50.46% WER, 35.89% CER, and 43% TER on Khmer. The Tiny model is weaker on both: 54.33% WER, 42.41% CER, and 17% TER on English, and 75.88% WER, 54.99% CER, and 54% TER on Khmer. The gap between the two sizes suggests that scaling TrorYongASR improves performance.
195
 
196
+ **Note on Translation Task:** The models are also trained for the `translation` task, but its evaluation is deferred to future work because data is scarce (the pre-training data contains only 2000 examples of Khmer audio to English text and 1000 examples of English audio to Khmer text).
197
 
 
198
 
199
+ ## How to Get Started with the Model
200
 
201
+ First, install the `tror-yong-asr` PyPI package:
202
+ ```bash
203
+ pip install tror-yong-asr
204
+ ```
205
 
206
+ Then, use the code below to get started with the model.
207
 
208
+ ```python
209
+ from transformers import AutoProcessor
210
+ from tror_yong_asr import TrorYongASRModel, transcribe, translate, detect_language
211
+
212
+
213
+ model_id = "KrorngAI/TrorYongASR-tiny"
214
+ processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
215
+ model = TrorYongASRModel.from_pretrained(model_id, trust_remote_code=True)
216
+
217
+ result1 = detect_language('/path/to/audio_file.mp3', model, processor)
218
+ print(result1)
219
+
220
+ result2 = transcribe('/path/to/audio_file.mp3', model, processor, max_tokens=64)
221
+ print(result2)
222
 
223
+ result3 = translate('/path/to/audio_file.mp3', model, processor, max_tokens=64)
224
+ print(result3)
225
+ ```
226
 
227
+ ## Fine-tuning
228
 
229
+ Notebook (TBA)
230
 
231
+ ## Uses
232
+
233
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
234
+
235
+ ### Direct Use
236
+
237
+ The Tiny model can be used directly for:
238
+ - **Speech-to-text transcription**: transcribe Khmer and English audio
239
+ - **Speech-to-text translation**: translate Khmer audio to English text and English audio to Khmer text
240
+ - **Language detection**: Identify whether audio is in Khmer or English (100% precision and recall on the test sets above)
241
+ - **Edge computing**: Deploy on mobile devices, IoT devices, and embedded systems due to its small size (29M parameters)
242
+ - **Real-time applications**: Low latency inference suitable for real-time speech interfaces
243
 
244
 
245
+ ### Downstream Use [optional]
246
 
247
+ The model can be integrated into:
248
+ - **Mobile applications**: Android/iOS apps with speech recognition
249
+ - **Web applications**: Browser-based speech-to-text using WebAssembly
250
+ - **IoT devices**: Smart speakers, voice assistants
251
+ - **Larger ASR systems**: As a component in multi-language ASR pipelines
252
 
 
253
 
254
+ ## Bias, Risks, and Limitations
255
 
256
+ **Technical Limitations:**
257
+ - **No speech detection**: The model was not trained for this task. Users need to fine-tune the model for it (`TrorYongASRTokenizer` has a `<|nospeech|>` token).
258
+ - **Translation task**: The training data for the translation task is scarce. Users need to fine-tune the model for better translation performance
259
+ - **Noise robustness**: Performance may degrade in noisy environments
260
+ - **No timestamp output**: The model does not support timestamp output
261
 
262
+ **Sociotechnical Limitations:**
263
+ - **Accent variability**: May not perform well on diverse Khmer accents
264
+ - **Background noise**: Limited robustness to background noise and reverberation
265
+ - **Speaker variability**: May struggle with different speaking styles and rates
266
 
 
267
 
268
+ ### Recommendations
 
 
 
 
269
 
270
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
271
 
272
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
273
 
 
274
 
275
+ ## Training Details
276
 
277
+ To assess how the model scales, both the Tiny and Small variants were trained with the same configuration, detailed below.
278
 
279
+ ### Training Data
280
 
281
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
282
 
283
+ #### Transcription Task
284
 
285
+ For the transcription task, the model was trained on around 140 hours of Khmer audio and around 100 hours of English audio.
286
+ Khmer datasets include [`DDD-Cambodia/khm-asr-cultural`](https://huggingface.co/datasets/DDD-Cambodia/khm-asr-cultural) (134.6 hours), [`openslr/openslr`](https://huggingface.co/datasets/Kimang18/openslr-SLR42/blob/main/README.md), and [`google/fleurs`](https://huggingface.co/datasets/Kimang18/google-fleurs-km-kh).
287
+ The `clean.100` split of [`openslr/librispeech_asr`](https://huggingface.co/datasets/openslr/librispeech_asr) was used as the English dataset.
288
 
289
+ <div align="center">
290
+
291
+ | Dataset | Language | Training examples | Validation examples | Description |
292
+ | --------- | ---------- | ----------------- | ------------------- |- |
293
+ | **DDD-Cambodia/khm-asr-cultural** | Khmer | 56716 | 0 | Khmer ASR Cultural Dataset (split `train`) |
294
+ | **openslr/openslr** | Khmer | 2906 | 0 | Multi-speaker TTS data for Khmer language (split `SLR42`) |
295
+ | **google/fleurs** | Khmer | 1675 | 324 | Multilingual speech dataset, Khmer subset (split `km_kh`) |
296
+ | **librispeech\_asr.clean** | English | 28539 | 2703 | Clean speech dataset for English transcription |
297
+ </div>
298
 
299
+ #### Translation Task
300
 
301
+ For the translation task, the data was scarce: only 2000 examples of Khmer audio to English text, and only 1000 examples of English audio to Khmer text.
302
 
303
+ ### Training Procedure
304
+
305
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
306
+
307
+ #### Preprocessing
308
+
309
+ Following OpenAI's `Whisper` model, audio clips longer than 30 seconds are filtered out.
310
+ All audio is sampled at `16000` Hz.
311
+ For the English dataset, all text is lowercased.
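The filtering and normalization rules above can be sketched as follows. The record layout (`num_samples`, `text`, `language`) and the helper names are assumptions made for illustration; the actual dataset columns and preprocessing code are not shown in this card.

```python
SAMPLE_RATE = 16_000   # Hz, as stated above
MAX_SECONDS = 30.0     # clips longer than this are dropped


def keep_example(num_samples, sample_rate=SAMPLE_RATE, max_seconds=MAX_SECONDS):
    """Keep a clip only if its duration does not exceed max_seconds."""
    return num_samples / sample_rate <= max_seconds


def normalize_text(text, language):
    """Lowercase English transcripts; leave other languages unchanged."""
    return text.lower() if language == "en" else text


# Hypothetical records: only the sample count, transcript, and language are modeled.
examples = [
    {"num_samples": 160_000, "text": "HELLO WORLD", "language": "en"},  # 10 s, kept
    {"num_samples": 512_000, "text": "TOO LONG", "language": "en"},     # 32 s, dropped
]

kept = [
    {**ex, "text": normalize_text(ex["text"], ex["language"])}
    for ex in examples
    if keep_example(ex["num_samples"])
]
print(kept)
```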
312
+
313
+ #### Training Hyperparameters
314
+
315
+ - **Training regime:** fp16 mixed precision (`16-mixed`) training using the `LightningAI` package <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
316
+ - **Optimizer:** MuonAdamW (custom implementation)
317
+ - **Learning rate schedule:** linear warmup (38 optimizer steps) followed by cosine annealing (3774 optimizer steps)
318
+ - **Weight decay:** 0.1
319
+ - **Effective Batch size:** 64
320
+ - **Number of optimizer steps:** 3812
321
+ - **Number of epochs:** roughly 2
322
+ - **Gradient Clip Value:** 0.5 (only for parameters trained by AdamW)
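The learning rate schedule listed above can be sketched as a plain function of the optimizer step. The step counts come from the list; the peak and minimum learning rates are not stated in this card, so `PEAK_LR` and `MIN_LR` below are placeholder assumptions, and the exact warmup formula used in training may differ.

```python
import math

WARMUP_STEPS = 38     # from the hyperparameter list
COSINE_STEPS = 3774   # from the hyperparameter list
PEAK_LR = 1e-3        # assumption: the actual peak learning rate is not stated
MIN_LR = 0.0          # assumption: the actual final learning rate is not stated


def learning_rate(step):
    """Learning rate at a zero-based optimizer step."""
    if step < WARMUP_STEPS:
        # Linear warmup: reaches PEAK_LR on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Cosine annealing from PEAK_LR down to MIN_LR.
    progress = min((step - WARMUP_STEPS) / COSINE_STEPS, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))


for step in (0, 37, 38, 1925, 3812):
    print(step, learning_rate(step))
```

The two phases add up to 38 + 3774 = 3812 steps, which matches the stated number of optimizer steps.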
323
 
 
324
 
325
+ #### Speeds, Sizes, Times
326
+
327
+ The training was conducted over 3812 optimizer steps.
328
+
329
+ - For the Tiny variant, the training took around 6 hours on 1 Tesla T4 GPU.
330
+ - For the Small variant, the training took around 7 hours on 2 Tesla T4 GPUs (using the DDP strategy).
331
+
332
+
333
+ ## Citation
334
+
335
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
336
+
337
+ **BibTeX:**
338
 
339
+ ```bibtex
340
+ @online{khun2026,
341
+ author = {Khun, Kimang},
342
+ title = {TrorYongASR: {Permuted} {AutoRegressive} {Sequence}
343
+ {Modeling} for {Automatic} {Speech} {Recognition}},
344
+ date = {2026-05-07},
345
+ url = {https://kimang18.github.io/krorngai-blog/TrorYongASR/},
346
+ langid = {en}
347
+ }
348
+ ```
349
 
 
350
 
351
+ ## Model Card Author
352
 
353
+ - ឈ្មោះ: បណ្ឌិត ឃុន គីមអាង
354
+ - Name: KHUN Kimang (Ph.D.)
355
 
356
+ ## Acknowledgement
357
 
358
+ [`LightningAI`](https://lightning.ai) and `Google Colab` did not specifically sponsor this project.
359
+ However, both models were trained thanks to their free credits.
360
+ So, huge thanks to [`LightningAI`](https://lightning.ai) and `Google Colab`.
361
 
362
+ Thanks to the authors of [`PARSeq`](https://github.com/baudm/parseq/tree/main) and [`Whisper`](https://github.com/openai/whisper/tree/main) for their publicly available source code.
363
 
364
  ## Model Card Contact
365
 
366
+ If you have any questions, please reach out via the [Facebook Page](https://www.facebook.com/profile.php?id=61582509385293).