---
library_name: transformers
license: other
license_name: modified-mit
datasets:
- DDD-Cambodia/khm-asr-cultural
- openslr/librispeech_asr
- KrorngAI/fleurs-km-kh-openslr-SLR42
language:
- km
- en
metrics:
- wer
- cer
- ter
pipeline_tag: automatic-speech-recognition
---
<div align="center">
  <picture>
      <img src="figures/krorngai.png" width="30%" alt="KrorngAI">
  </picture>
</div>
<hr>
<!--
<div align="center" style="line-height:1">
  <a href="https://www.kimi.com" target="_blank"><img alt="Chat" src="https://img.shields.io/badge/🤖%20Chat-Kimi%20K2.6-ff6b6b?color=1783ff&logoColor=white"/></a>
  <a href="https://www.facebook.com/profile.php?id=61582509385293" target="_blank"><img alt="Homepage" src="https://img.shields.io/badge/Homepage-Krorng%20AI-white?logoColor=white"/></a>
</div>
-->

<div align="center" style="line-height: 1;">
  <a href="https://huggingface.co/KrorngAI" target="_blank"><img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Krorng%20AI-ffc107?color=ffc107&logoColor=white"/></a>
  <a href="https://youtube.com/@krorngai" target="_blank"><img alt="YouTube Channel" src="https://img.shields.io/badge/Youtube-Krorng%20AI-red?logoColor=red"/></a>
  <a href="https://www.facebook.com/profile.php?id=61582509385293" target="_blank"><img alt="Homepage" src="https://img.shields.io/badge/Facebook-Krorng%20AI-blue?logoColor=blue"/></a>
  <a href="https://kimang18.github.io" target="_blank"><img alt="Personal" src="https://img.shields.io/badge/KHUN-white?logoColor=white"/></a>
</div>
<div align="center" style="line-height: 1;">
  <a href="https://huggingface.co/Kimang18/tror-yong-asr-tiny/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/badge/License-Modified_MIT-f5de53?&color=f5de53"/></a>
</div>


# TrorYongASR

> [!Note]
> This repository contains model weights and configuration files for the pre-trained model. 
>

## Model Details

### Model Description

TrorYongASR is an Encoder-Decoder model for the Automatic Speech Recognition (ASR) task.
It is inspired by [PARSeq](https://github.com/baudm/parseq/tree/main) and [Whisper](https://github.com/openai/whisper/tree/main): the auditory-lingual decoder has only one transformer block.

<div align="center">
  <picture>
      <img src="figures/architecture.png" width="100%" alt="TrorYongASR">
  </picture>
</div>

TrorYongASR has **2 configurations**:
<div align="center">

|  **Model Size**   |       Tiny        |        Small        |
|:-----------------:|:-----------------:|:-------------------:|
|  **Parameters**   |        29M        |        135M         |
| **Audio Encoder** | 4 layers, 6 heads | 12 layers, 12 heads |
| **Text Decoder**  | 1 layer, 12 heads |  1 layer, 24 heads  |
| **Embedding Dim** |        384        |         768         |
| **Audio Context** |       1500        |        1500         |
| **Text Context**  |       1024        |        1024         |
</div>

**Note:** The audio arrays are processed into log-mel spectrograms with `80` mel bins (the same as Whisper models of the same size).
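
For illustration, this preprocessing can be reproduced with the 80-bin Whisper feature extractor from `transformers`; the sketch below assumes the bundled processor behaves like `WhisperFeatureExtractor`, and the file path is a placeholder.

```python
# Minimal sketch (assumption): extract 80-bin log-mel features the way Whisper does.
# The actual TrorYongASR processor is loaded via AutoProcessor (see "How to Get Started").
import librosa
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor(feature_size=80)  # 80 mel bins, as stated above

# Placeholder audio path; load at the 16 kHz sample rate expected by the model.
audio, sr = librosa.load("/path/to/audio_file.mp3", sr=16000)

features = feature_extractor(audio, sampling_rate=sr, return_tensors="pt")
print(features.input_features.shape)  # typically (1, 80, 3000) for a padded 30 s window
```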

- **Developed by:** KHUN Kimang (Ph.D.)
- **Shared by:** KrorngAI
- **Model type:** ASR (Automatic Speech Recognition)
- **Language(s) (NLP):** Khmer and English

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/Kimang18/KrorngAI/tree/main/tror-yong-asr
- **Blog Post:** https://kimang18.github.io/krorngai-blog/TrorYongASR/
- **Demo:** https://krorngai-troryongasr-demo.hf.space


## Evaluation

The evaluation assesses two capabilities — language detection and transcription — on two datasets ([`google/fleurs`](https://huggingface.co/datasets/Kimang18/google-fleurs-km-kh) for Khmer and [`openslr/librispeech_asr`](https://huggingface.co/datasets/openslr/librispeech_asr) for English). All results are from the **test split** of each dataset, representing the model's generalization ability to unseen data.

### Testing Data

<!-- This should link to a Dataset Card if possible. -->

<div align="center">

| Dataset               | Language   | Testing examples | Description                                       |
| -------------         | ---------- | -------------    | -                                                 |
| **google/fleurs**     | Khmer      | 765              | Multi-lingual dataset with Khmer language samples |
| **librispeech.clean** | English    | 2620             | Clean speech dataset for English transcription    |
</div>

**Note:** Audio clips longer than `30 seconds` are excluded from the evaluation (that is why `google/fleurs` has 765 examples instead of 771).

### Metrics and Results

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

#### Language Detection

Language detection measures the model's capability to recognize the spoken language from the audio input. Since TrorYongASR currently supports two languages, this is a binary classification task. Standard classification metrics are used (a computation sketch follows the list):

- **Precision**: Proportion of predicted languages that are correct
- **Recall**: Proportion of actual language samples correctly identified
- **F1-score**: Harmonic mean of precision and recall
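
As a quick illustration, these metrics can be computed with `scikit-learn` once predicted and reference language labels are collected; the label lists below are placeholders, not actual evaluation outputs.

```python
# Minimal sketch: language detection scored as binary classification.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["km", "km", "en", "en", "km"]  # reference languages (hypothetical)
y_pred = ["km", "km", "en", "km", "km"]  # languages returned by detect_language (hypothetical)

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=["km", "en"], average=None, zero_division=0
)
for lang, p, r, f in zip(["km", "en"], precision, recall, f1):
    print(f"{lang}: precision={p:.2%} recall={r:.2%} f1={f:.2%}")
```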

**Results:**

<div align="center">

| Model | Metrics   | Khmer (`fleurs`) | English (`librispeech.clean`) |
|-------|-----------|------------------|-------------------------------|
| Tiny  | Precision | 100%             | 100%                          |
|       | Recall    | 100%             | 100%                          |
|       | F1-score  | 100%             | 100%                          |
| Small | Precision | 100%             | 99%                          |
|       | Recall    | 96%              | 100%                          |
|       | F1-score  | 98%              | 99%                          |
</div>

The Tiny model achieved perfect language detection performance on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio. The Small model performs slightly worse, tending to over-predict English.

The 100% language detection scores may appear unusually high. This is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
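
To make the permutation scheme concrete, here is a minimal sketch (not the actual training code) of how PARSeq-style decoding orders can be sampled while keeping the first three positions fixed:

```python
# Minimal sketch (assumption): sample a permuted autoregressive decoding order
# in which positions 0-2 (start, language, and task tokens) are never permuted.
import torch

def sample_decoding_order(seq_len: int, num_fixed: int = 3) -> torch.Tensor:
    """Keep the first `num_fixed` positions in place and permute the remaining word tokens."""
    fixed = torch.arange(num_fixed)
    permuted = num_fixed + torch.randperm(seq_len - num_fixed)
    return torch.cat([fixed, permuted])

print(sample_decoding_order(seq_len=10))  # e.g. tensor([0, 1, 2, 7, 4, 9, 3, 5, 8, 6])
```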


#### Transcription

For the transcription task, the three metrics below are used:

- **Token Error Rate (TER)**: Proportion of incorrectly transcribed tokens
- **Character Error Rate (CER)**: Proportion of characters that are incorrect
- **Word Error Rate (WER)**: Proportion of words that are incorrect

Token Error Rate (TER) measures the model's capability to predict the next token given the audio input and the current sequence of tokens. This metric is weaker than Word Error Rate (WER) and Character Error Rate (CER) because it does not account for insertions, deletions, substitutions, and autoregressive error propagation as comprehensively. TER is used here because Khmer text lacks word boundaries, making WER and CER calculations challenging without additional preprocessing.
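
One plausible reading of this definition is the teacher-forced next-token error rate. The sketch below assumes per-position decoder logits and target token ids are available as tensors and that padding positions are ignored; it is an illustration, not the project's actual evaluation code.

```python
# Minimal sketch (assumption): TER as teacher-forced next-token error rate.
# `logits` has shape (batch, seq_len, vocab_size), `targets` has shape (batch, seq_len);
# positions equal to `pad_id` are excluded from the count.
import torch

def token_error_rate(logits: torch.Tensor, targets: torch.Tensor, pad_id: int) -> float:
    predictions = logits.argmax(dim=-1)  # most likely token at each position
    mask = targets.ne(pad_id)            # ignore padding positions
    errors = (predictions.ne(targets) & mask).sum()
    return (errors.float() / mask.sum()).item()
```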


**Transcription Results:**

<div align="center">

| Model     | Metric                     | Khmer (`fleurs`) | English (`librispeech.clean`) | Mixed (Khmer + English) |
|-----------|----------------------------|------------------|-------------------------------|-------------------------|
| **Tiny**  | WER | 75.81%           | 54.33%                        | 60.36%                  |
|           | CER | 54.99%           | 42.41%                        | 46.18%                  |
|           | TER | 54%              | 17%                           | 27%                     |
| **Small** | WER | 50.46%           | 21.75%                        | 29.78%                  |
|           | CER | 35.89%           | 16.58%                        | 22.37%                  |
|           | TER | 43%              | 8%                            | 18%                     |
</div>

**Key Observations:**

- The tiny model shows strong performance on English (54.33% WER, 42.41% CER, 17% TER)
- Performance drops significantly for Khmer (75.88% WER, 54.99% CER, 54% TER)
- The small model shows strong performance on English (21.75% WER, 16.58% CER, 8% TER)
- Performance for Khmer is moderate (50.46% WER, 35.89% CER, 43% TER)
- The larger model benefits from increased embedding dimension (768 vs 384) and more layers for audio encoder (12 vs 4)

**Note:** To compute `CER` and `WER`, whitespace is added between words in the Khmer text (Khmer text does not have word boundaries like English text). To do so, the `khmercut` PyPI package is used to tokenize Khmer text into words, and the words are then joined back together with whitespace.
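
For reference, the segmentation step described in the note above can be sketched as follows. This assumes `khmercut` exposes a `tokenize()` function and uses the `jiwer` package for WER/CER; both the exact APIs and the example strings are illustrative, not the project's evaluation script.

```python
# Minimal sketch (assumption): add whitespace to Khmer text with khmercut, then score with jiwer.
import jiwer
from khmercut import tokenize  # assumed khmercut API

def segment_khmer(text: str) -> str:
    """Tokenize Khmer text into words and join them back with whitespace."""
    return " ".join(tokenize(text))

reference = segment_khmer("ខ្ញុំស្រលាញ់ភាសាខ្មែរ")        # hypothetical reference transcript
hypothesis = segment_khmer("ខ្ញុំស្រលាញ់ភាសាខ្មែរណាស់")  # hypothetical model output

print("WER:", jiwer.wer(reference, hypothesis))
print("CER:", jiwer.cer(reference, hypothesis))
```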


**WER Comparison with Whisper:**

| Tiny        | Parameters | Khmer (`fleurs`)            | English (`librispeech.clean`) |
| -------     | --------   | --------------------------- | ---                           |
| TrorYongASR | 29M        | 75.88%                      | 54.33%                        |
| Whisper     | 39M        | 100.6%                      | 7.6%                          |

| Small       | Parameters | Khmer (`fleurs`)            | English (`librispeech.clean`) |
| -------     | --------   | --------------------------- | ---                           |
| TrorYongASR | 135M       | 50.46%                      | 21.75%                        |
| Whisper     | 244M       | 104.4%                      | 3.4%                          |

**Key Observations:**

- Whisper models have more parameters for comparable sizes (39M vs 29M for Tiny, 244M vs 135M for Small)
- Whisper shows significantly lower word error rates on English (7.6% vs 54.33% for Tiny, 3.4% vs 21.75% for Small)
- Whisper performs worse on Khmer (100.6% vs 75.88% for Tiny, 104.4% vs 50.46% for Small)
- Error rates above 100% for Whisper on Khmer mean its output contains more errors than there are words in the reference (mostly insertions), indicating that Whisper effectively cannot transcribe Khmer

**Note:** `WER` data of Whisper is taken from their [paper](https://arxiv.org/abs/2212.04356).


### Result Summary

**Language Detection:** Both model sizes achieved great performance across all metrics (Precision, Recall, F1-score) on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio. This high score is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.

**Transcription:** The Small model shows strong performance on English (21.75% WER, 16.58% CER, 8% TER) and moderate performance for Khmer (50.46% WER, 35.89% CER, 43% TER). The Tiny model shows strong performance on English (54.33% WER, 42.41% CER, 17% TER) but significantly lower performance for Khmer (75.88% WER, 54.99% CER, 54% TER). This suggests that TrorYongASR can be scaled up for higher performance.

**Note on Translation Task:** The models are also trained for `translation` task, but evaluation is deferred to future work due to scarce data (there are only 2000 examples from Khmer audio to English text, and 1000 examples from English audio to Khmer text in the pre-training).


## How to Get Started with the Model

First, install `tror-yong-asr` PyPI package:
```bash
pip install tror-yong-asr
```

Then, use the code below to get started with the model.

```python
from transformers import AutoProcessor
from tror_yong_asr import TrorYongASRModel, transcribe, translate, detect_language


model_id = "KrorngAI/TrorYongASR-tiny"

# Load the processor (feature extractor + tokenizer) and the model weights from the Hub
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = TrorYongASRModel.from_pretrained(model_id, trust_remote_code=True)

# Detect whether the audio is Khmer or English
result1 = detect_language('/path/to/audio_file.mp3', model, processor)
print(result1)

# Transcribe the audio in its spoken language
result2 = transcribe('/path/to/audio_file.mp3', model, processor, max_tokens=64)
print(result2)

# Translate the audio into the other supported language
result3 = translate('/path/to/audio_file.mp3', model, processor, max_tokens=64)
print(result3)
```

## Fine-tuning

Notebook (TBA)

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

The Tiny model can be used directly for:
- **Speech-to-text transcription**: Transcribe Khmer and English audio
- **Speech-to-text translation**: Translate Khmer audio to English text and English audio to Khmer text
- **Language detection**: Identify whether audio is in Khmer or English (100% accuracy)
- **Edge computing**: Deploy on mobile devices, IoT devices, and embedded systems due to its small size (29M parameters)
- **Real-time applications**: Low latency inference suitable for real-time speech interfaces


### Downstream Use

The model can be integrated into:
- **Mobile applications**: Android/iOS apps with speech recognition
- **Web applications**: Browser-based speech-to-text using WebAssembly
- **IoT devices**: Smart speakers, voice assistants
- **Larger ASR systems**: As a component in multi-language ASR pipelines


## Bias, Risks, and Limitations

**Technical Limitations:**
- **No-speech detection**: The model was not trained for this task; users need to fine-tune the model for it (the TrorYongASRTokenizer includes a `<|nospeech|>` token)
- **Translation task**: The training data for translation is scarce; users need to fine-tune the model for better translation performance
- **Noise robustness**: Performance may degrade in noisy environments
- **No timestamp output**: The model does not support timestamp output

**Sociotechnical Limitations:**
- **Accent variability**: May not perform well on diverse Khmer accents
- **Background noise**: Limited robustness to background noise and reverberation
- **Speaker variability**: May struggle with different speaking styles and rates


### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.


## Training Details

To assess the model's scalability, both the Tiny and Small variants were trained using the same configuration, detailed below.

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

#### Transcription Task

For the transcription task, the model was trained on around 140 hours of Khmer audio and around 100 hours of English audio.
The Khmer datasets include [`DDD-Cambodia/khm-asr-cultural`](https://huggingface.co/datasets/DDD-Cambodia/khm-asr-cultural) (134.6 hours), [`openslr/openslr`](https://huggingface.co/datasets/Kimang18/openslr-SLR42/blob/main/README.md), and [`google/fleurs`](https://huggingface.co/datasets/Kimang18/google-fleurs-km-kh).
The `clean.100` split of [`openslr/librispeech_asr`](https://huggingface.co/datasets/openslr/librispeech_asr) was used as the English dataset.

<div align="center">
  
| Dataset                            | Language   | Training examples | Validation examples | Description                                       |
| ---------                          | ---------- | ----------------- | ------------------- |-                                                  |
| **DDD-Cambodia/khm-asr-cultural**  | Khmer      | 56716             | 0                   | Khmer ASR Cultural Dataset  (split `train`) |
| **openslr/openslr**                | Khmer      | 2906              | 0                   | Multi-speaker TTS data for Khmer language (split `SLR42`) |
| **google/fleurs**                  | Khmer      | 1675              | 324                 | TTS data for Khmer language (split `km_kh`) |
| **librispeech\_asr.clean**              | English    | 28539             | 2703                | Clean speech dataset for English transcription    |
</div>

#### Translation Task

For the translation task, the data was scarce: only 2000 examples of Khmer audio to English text, and only 1000 examples of English audio to Khmer text.

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

#### Preprocessing

Following OpenAI's `Whisper` model, audio clips longer than 30 seconds are filtered out.
All audio uses a `16000` Hz sample rate.
For the English dataset, all texts are lowercased.
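
A minimal sketch of this preprocessing with the 🤗 `datasets` library is shown below; the dataset name, configuration, and column names are illustrative and may not match the exact training pipeline.

```python
# Minimal sketch (assumption): keep clips <= 30 s, resample to 16 kHz, lowercase English text.
from datasets import Audio, load_dataset

ds = load_dataset("openslr/librispeech_asr", "clean", split="train.100")  # example English split

ds = ds.cast_column("audio", Audio(sampling_rate=16000))  # resample to 16 kHz
ds = ds.filter(
    lambda ex: len(ex["audio"]["array"]) / ex["audio"]["sampling_rate"] <= 30.0
)
ds = ds.map(lambda ex: {"text": ex["text"].lower()})  # lowercase English transcripts
```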

#### Training Hyperparameters

- **Training regime:** 16-bit mixed precision training using the `LightningAI` package
- **Optimizer:** MuonAdamW (custom implementation)
- **Learning rate:** Linear warmup (38 optimizer steps) + cosine annealing (3774 optimizer steps); see the schedule sketch after this list
- **Weight decay:** 0.1
- **Effective Batch size:** 64
- **Number of optimizer steps:** 3812
- **Number of epochs:** roughly 2 epochs
- **Gradient Clip Value:** 0.5 (only for parameters trained by AdamW)
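
For reference, the learning-rate schedule can be reproduced with standard PyTorch schedulers. The peak learning rate and the stand-in model below are placeholders (neither is stated in this card), and plain `AdamW` is used instead of the custom MuonAdamW optimizer.

```python
# Minimal sketch (assumption): linear warmup for 38 optimizer steps followed by
# cosine annealing for 3774 steps (3812 steps total). Peak LR is a placeholder.
import torch

model = torch.nn.Linear(384, 384)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)

warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1e-3, total_iters=38)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=3774)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[38]
)

for step in range(3812):
    optimizer.step()   # gradient computation omitted in this sketch
    scheduler.step()
```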


#### Speeds, Sizes, Times

The training was conducted over 3812 optimizer steps.

- For the Tiny variant, training took around 6 hours on one Tesla T4 GPU.
- For the Small variant, training took around 7 hours on two Tesla T4 GPUs (using the DDP strategy).


## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

```bibtex
@online{khun2026,
  author = {Khun, Kimang},
  title = {TrorYongASR: {Permuted} {AutoRegressive} {Sequence}
    {Modeling} for {Automatic} {Speech} {Recognition}},
  date = {2026-05-07},
  url = {https://kimang18.github.io/krorngai-blog/TrorYongASR/},
  langid = {en}
}
```


## Model Card Author

- ឈ្មោះ: បណ្ឌិត ឃុន គីមអាង 
- Name: KHUN Kimang (Ph.D.)

## Acknowledgement

[`LightningAI`](https://lightning.ai) and `Google Colab` did not specifically sponsor this project,
but both models were trained thanks to their free credits.
So, huge thanks to [`LightningAI`](https://lightning.ai) and `Google Colab`.

Thanks to the authors of [`PARSeq`](https://github.com/baudm/parseq/tree/main) and [`Whisper`](https://github.com/openai/whisper/tree/main) for their publicly available source code.

Thanks to [`openslr`](https://openslr.org), [Mozilla Data Collective](https://mozilladatacollective.com/datasets/cml9h5vgc01bxmn075sjeftek) and Google for their publicly available datasets.

## Model Card Contact

If you have any questions, please reach out at [Facebook Page](https://www.facebook.com/profile.php?id=61582509385293).