Automatic Speech Recognition
Transformers
Safetensors
Khmer
English
troryongasr
custom_code
Kimang18 committed
Commit 3c086fa · verified · 1 Parent(s): 2ad8a42

Update README.md

Files changed (1)
  1. README.md +130 -188
README.md CHANGED
@@ -50,21 +50,27 @@ pipeline_tag: automatic-speech-recognition
50
 
51
  This is a **29M parameter** ASR (Automatic Speech Recognition) model inspired by [PARSeq](https://github.com/baudm/parseq/tree/main) and [Whisper](https://github.com/openai/whisper/tree/main). It is a lightweight model designed for efficient speech-to-text tasks and is particularly suited to edge devices and mobile applications. The model supports both Khmer and English.
52
 
 
53
 
54
- | Model Size | Parameters | Audio Encoder | Text Decoder | Embedding Dim | Audio Context | Text Context |
55
- |------------|------------|---------------|--------------|---------------|---------------|--------------|
56
- | **Tiny** | 29M | 4 layers, 6 heads | 1 layer, 12 heads | 384 | 1500 | 1024 |
57
- | **Small** | 136M | 12 layers, 12 heads | 1 layer, 24 heads | 768 | 1500 | 1024 |
 
 
 
 
 
58
 
59
  **Note:** Audio arrays are converted to log-mel spectrograms with `80` mel bins (the same as Whisper models of the same size).
60
 
61
- - **Developed by:** Dr. KHUN Kimang
62
- - **Shared by [optional]:** KrorngAI
63
  - **Model type:** ASR (Automatic Speech Recognition)
64
  - **Language(s) (NLP):** Khmer and English
65
  - **License:** [More Information Needed]
66
 
67
- ### Model Sources [optional]
68
 
69
  <!-- Provide the basic links for the model. -->
70
 
@@ -72,6 +78,116 @@ This is a **29M parameter** ASR (Automatic Speech Recognition) model inspired by
72
  - **Paper [optional]:** [More Information Needed]
73
  - **Demo [optional]:** [More Information Needed]
74
 
75
  ## Uses
76
 
77
  <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
@@ -95,10 +211,6 @@ The model can be integrated into:
95
  - **Larger ASR systems**: As a component in multi-language ASR pipelines
96
 
97
 
98
- ### Out-of-Scope Use
99
-
100
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
101
-
102
  ## Bias, Risks, and Limitations
103
 
104
  **Technical Limitations:**
@@ -118,30 +230,6 @@ The model can be integrated into:
118
 
119
  Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
120
 
121
- ## How to Get Started with the Model
122
-
123
- First, install `tror-yong-asr` PyPI package:
124
- ```bash
125
- pip install tror-yong-asr
126
- ```
127
-
128
- Then, use the code below to get started with the model.
129
- ```python
130
- from transformers import AutoProcessor
131
- from tror_yong_asr import TrorYongASRModel, transcribe, translate
132
-
133
-
134
- model_id = "Kimang18/tror-yong-asr-small"
135
- processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
136
- model = TrorYongASRModel.from_pretrained(model_id, trust_remote_code=True)
137
-
138
- result1 = transcribe('/content/voice5.mp3', model, processor, max_tokens=64)
139
- print(result1) # namedtuple: text: str, output_ds: torch.Tensor
140
-
141
- result2 = translate('/content/voice5.mp3', model, processor, max_tokens=64)
142
- print(result2) # namedtuple: text:str, output_ids: torch.Tensor
143
- ```
144
-
145
 
146
  ## Training Details
147
 
@@ -170,7 +258,7 @@ For translation task, the data was scarce: only 2000 examples for Khmer audio to
170
 
171
  <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
172
 
173
- #### Preprocessing [optional]
174
 
175
  Following OpenAI's Whisper model, audio clips longer than 30 seconds are filtered out.
176
  All audio clips have a `16000` Hz sample rate.
@@ -189,146 +277,12 @@ For English dataset, all texts are in lowercase.
189
  - **Gradient Clip Value:** 0.5 (only for parameters trained by AdamW)
190
 
191
 
192
- #### Speeds, Sizes, Times [optional]
193
 
194
  The training was conducted over 4000 optimizer steps on a single Tesla T4 GPU.
195
- The trainig took around 10 hours.
196
-
197
-
198
- ## Evaluation
199
-
200
- <!-- This section describes the evaluation protocols and provides the results. -->
201
- The evaluation assesses two capabilities — language detection and transcription — on two datasets ([`google/fleurs`](https://huggingface.co/datasets/Kimang18/google-fleurs-km-kh) for Khmer and [`openslr/librispeech_asr`](https://huggingface.co/datasets/openslr/librispeech_asr) for English). All results are from the test split of each dataset, representing the model's generalization ability to unseen data.
202
-
203
- ### Testing Data & Metrics
204
-
205
- #### Testing Data
206
-
207
- <!-- This should link to a Dataset Card if possible. -->
208
-
209
- | Dataset | Language | Testing examples | Description |
210
- | --------- | ---------- | ------------- | - |
211
- | **google/fleurs** | Khmer | 765 | Multi-lingual dataset with Khmer language samples |
212
- | **librispeech.clean** | English | 2620 | Clean speech dataset for English transcription |
213
-
214
- **Note:** All evaluation results below are from the **test split** of each dataset. For `google/fleurs`, audios longer than `30 seconds` are excluded from the evaluation.
215
-
216
- #### Metrics
217
-
218
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
219
-
220
- ##### Language Detection
221
-
222
- **Task:** Given audio input, detect the language.
223
-
224
- **Approach:** Binary classification task (2 languages: Khmer and English).
225
-
226
- **Metrics:**
227
-
228
- | Metric | Description |
229
- |--------|-------------|
230
- | **Precision** | Proportion of predicted languages that are correct |
231
- | **Recall** | Proportion of actual language samples correctly identified |
232
- | **Accuracy** | Proportion of total predictions that are correct |
233
- | **F1-score** | Harmonic mean of precision and recall |
234
-
235
- ##### Transcription
236
-
237
- **Task:** Convert audio to text (transcription).
238
-
239
- **Metrics:**
240
-
241
- | Metric | Description |
242
- |--------|-------------|
243
- | **Token Error Rate** | Proportion of incorrectly transcribed tokens |
244
- | **Character Error Rate (CER)** | Proportion of characters that are incorrect |
245
- | **Word Error Rate (WER)** | Proportion of words that are incorrect |
246
-
247
- **Note on Translation Task:** The models are also trained for the `translation` task, but evaluation is deferred to future work due to scarce data (2000 samples from Khmer audio to English text, and 1000 samples from English audio to Khmer text).
248
-
249
- **Note on Token Error Rate:** Token Error Rate measures the model's capability to predict the next token given the audio input and the current sequence of tokens. This metric is weaker than Word Error Rate (WER) and Character Error Rate (CER) because it doesn't account for insertions, deletions, and substitutions as comprehensively. Token Error Rate is used here because Khmer text lacks word boundaries, making WER and CER calculations challenging without additional preprocessing.
250
 
251
 
252
- ### Results
253
-
254
- #### Language Detection Results
255
-
256
- | Model | Dataset | Precision | Recall | Accuracy | F1-score |
257
- |-------|---------|-----------|--------|----------|----------|
258
- | Tiny | google/fleurs (Khmer) | 100% | 100% | 100% | 100% |
259
- | | librispeech.clean (English) | 100% | 100% | 100% | 100% |
260
- | Small | google/fleurs (Khmer) | 100% | 100% | 100% | 100% |
261
- | | librispeech.clean (English) | 100% | 100% | 100% | 100% |
262
-
263
- **Key Finding:** Both model sizes achieved perfect language detection performance on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio.
264
-
265
- **Note on Language Detection Performance:** The 100% language detection scores may appear unusually high. This is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
266
-
267
-
268
- #### Transcription Results
269
-
270
- | Model | Metric | Combined (Khmer + English) | Khmer | English |
271
- |-------|--------|---------------------------|-------|---------|
272
- | **Tiny** | Token Error Rate | 29% | 56% | 19% |
273
- | | Character Error Rate (CER) | 32.89% | 60.71% | 20.98% |
274
- | | Word Error Rate (WER) | 46.53% | 86.16% | 31.13% |
275
- | **Small** | Token Error Rate | 19% | 46% | 10% |
276
- | | Character Error Rate (CER) | 15.54% | 35.31% | 7.08% |
277
- | | Word Error Rate (WER) | 23.52% | 50.70% | 12.95% |
278
-
279
- **Key Observations:**
280
- - The tiny model shows strong performance on English (19% token error rate, 20.98% CER, 31.13% WER)
281
- - Performance drops significantly for Khmer (56% token error rate, 60.71% CER, 86.16% WER)
282
- - The small model shows strong performance on English (10% token error rate, 7.08% CER, 12.95% WER)
283
- - Performance for Khmer is moderate (46% token error rate, 35.31% CER, 50.70% WER)
284
- - The larger model benefits from increased embedding dimension (768 vs 384) and more layers (12 vs 4)
285
-
286
- **Note:** To compute `CER` and `WER`, whitespaces are added between words in Khmer text (Khmer text has no word boundary like English text). To do so, `khmercut` PyPI package is used to tokenize Khmer text into words, and then the words are joined back together with whitespaces.
287
-
288
-
289
- #### Summary
290
-
291
- **Language Detection:** Both model sizes achieved perfect 100% performance across all metrics (Precision, Recall, Accuracy, F1-score) on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio. This perfect score is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
292
-
293
- **Transcription:** The Small model shows strong performance on English (10% token error rate, 7.08% CER, 12.95% WER) and moderate performance for Khmer (46% token error rate, 35.31% CER, 50.70% WER). The Tiny model shows strong performance on English (19% token error rate, 20.98% CER, 31.13% WER) but significantly lower performance for Khmer (56% token error rate, 60.71% CER, 86.16% WER). The larger model benefits from increased embedding dimension (768 vs 384) and more layers (12 vs 4).
294
-
295
-
296
- ## Model Examination [optional]
297
-
298
- <!-- Relevant interpretability work for the model goes here -->
299
-
300
- [More Information Needed]
301
-
302
- ## Environmental Impact
303
-
304
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
305
-
306
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
307
-
308
- - **Hardware Type:** [More Information Needed]
309
- - **Hours used:** [More Information Needed]
310
- - **Cloud Provider:** [More Information Needed]
311
- - **Compute Region:** [More Information Needed]
312
- - **Carbon Emitted:** [More Information Needed]
313
-
314
- ## Technical Specifications [optional]
315
-
316
- ### Model Architecture and Objective
317
-
318
- [More Information Needed]
319
-
320
- ### Compute Infrastructure
321
-
322
- [More Information Needed]
323
-
324
- #### Hardware
325
-
326
- [More Information Needed]
327
-
328
- #### Software
329
-
330
- [More Information Needed]
331
-
332
  ## Citation [optional]
333
 
334
  <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
@@ -337,24 +291,12 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
337
 
338
  [More Information Needed]
339
 
340
- **APA:**
341
-
342
- [More Information Needed]
343
-
344
- ## Glossary [optional]
345
 
346
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
347
 
348
- [More Information Needed]
349
-
350
- ## More Information [optional]
351
-
352
- [More Information Needed]
353
-
354
- ## Model Card Authors [optional]
355
-
356
- [More Information Needed]
357
 
358
  ## Model Card Contact
359
 
360
- [More Information Needed]
 
50
 
51
  This is a **29M parameter** ASR (Automatic Speech Recognition) model inspired by [PARSeq](https://github.com/baudm/parseq/tree/main) and [Whisper](https://github.com/openai/whisper/tree/main). It is a lightweight model designed for efficient speech-to-text tasks and is particularly suited to edge devices and mobile applications. The model supports both Khmer and English.
52
 
53
+ <div align="center">
54
 
55
+ | **Model Size** | Tiny |
56
+ |:-------------:|:-----------------:|
57
+ | **Parameters** | 29M |
58
+ | **Audio Encoder** | 4 layers, 6 heads |
59
+ | **Text Decoder** | 1 layer, 12 heads |
60
+ | **Embedding Dim** | 384 |
61
+ | **Audio Context** | 1500 |
62
+ | **Text Context** | 1024 |
63
+ </div>
64
 
65
  **Note:** Audio arrays are converted to log-mel spectrograms with `80` mel bins (the same as Whisper models of the same size).
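
As a rough illustration of how the `1500` audio context relates to the 30-second, `16000` Hz, `80`-mel input described in this card, the arithmetic can be sketched as below. The hop length of 160 samples and the stride-2 downsampling are assumptions borrowed from Whisper's front end; this card does not state that the model uses the same values.

```python
# Values stated in this model card.
SAMPLE_RATE = 16_000   # Hz
MAX_SECONDS = 30       # longer audio is filtered out
N_MELS = 80            # mel bins of the log-mel spectrogram

# Assumptions borrowed from Whisper, not confirmed for this model.
HOP_LENGTH = 160       # samples between spectrogram frames (10 ms)
CONV_STRIDE = 2        # temporal downsampling before the encoder

n_samples = SAMPLE_RATE * MAX_SECONDS    # samples in a 30-second clip
n_frames = n_samples // HOP_LENGTH       # spectrogram frames
mel_shape = (N_MELS, n_frames)           # shape of the log-mel input
audio_context = n_frames // CONV_STRIDE  # positions seen by the encoder

print(mel_shape, audio_context)
```

Under these assumptions the result matches the audio context listed in the table above.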
66
 
67
+ - **Developed by:** KHUN Kimang (Ph.D.)
68
+ - **Shared by:** KrorngAI
69
  - **Model type:** ASR (Automatic Speech Recognition)
70
  - **Language(s) (NLP):** Khmer and English
71
  - **License:** [More Information Needed]
72
 
73
+ ### Model Sources
74
 
75
  <!-- Provide the basic links for the model. -->
76
 
 
78
  - **Paper [optional]:** [More Information Needed]
79
  - **Demo [optional]:** [More Information Needed]
80
 
81
+
82
+ ## Evaluation
83
+
84
+ <!-- This section describes the evaluation protocols and provides the results. -->
85
+ The evaluation assesses two capabilities — language detection and transcription — on two datasets ([`google/fleurs`](https://huggingface.co/datasets/Kimang18/google-fleurs-km-kh) for Khmer and [`openslr/librispeech_asr`](https://huggingface.co/datasets/openslr/librispeech_asr) for English). All results are from the test split of each dataset, representing the model's generalization ability to unseen data.
86
+
87
+ ### Testing Data & Metrics
88
+
89
+ #### Testing Data
90
+
91
+ <!-- This should link to a Dataset Card if possible. -->
92
+
93
+ | Dataset | Language | Testing examples | Description |
94
+ | --------- | ---------- | ------------- | - |
95
+ | **google/fleurs** | Khmer | 765 | Multi-lingual dataset with Khmer language samples |
96
+ | **librispeech.clean** | English | 2620 | Clean speech dataset for English transcription |
97
+
98
+ **Note:** All evaluation results below are from the **test split** of each dataset. For `google/fleurs`, audio clips longer than 30 seconds are excluded from the evaluation.
99
+
100
+ #### Metrics
101
+
102
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
103
+
104
+ ##### Language Detection
105
+
106
+ **Task:** Given audio input, detect the language.
107
+
108
+ | Metric | Description |
109
+ |--------|-------------|
110
+ | **Precision** | Proportion of predicted languages that are correct |
111
+ | **Recall** | Proportion of actual language samples correctly identified |
112
+ | **Accuracy** | Proportion of total predictions that are correct |
113
+ | **F1-score** | Harmonic mean of precision and recall |
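
The four language-detection metrics in the table above can be illustrated with a minimal Python sketch. The function name `binary_metrics`, the label strings, and the toy data are hypothetical and are not taken from the model's evaluation code; Khmer (`"km"`) is arbitrarily treated as the positive class.

```python
def binary_metrics(y_true, y_pred, positive="km"):
    """Return precision, recall, accuracy and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Toy example (made-up labels): one Khmer clip is misdetected as English.
truth = ["km", "km", "en", "en"]
preds = ["km", "en", "en", "en"]
scores = binary_metrics(truth, preds)
print(scores)
```

In this toy case precision is 1.0 while recall is 0.5, which shows why the card reports all four metrics rather than accuracy alone.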
114
+
115
+ ##### Transcription
116
+
117
+ **Task:** Convert audio to text (transcription).
118
+
119
+ | Metric | Description |
120
+ |--------|-------------|
121
+ | **Token Error Rate** | Proportion of incorrectly transcribed tokens |
122
+ | **Character Error Rate (CER)** | Proportion of characters that are incorrect |
123
+ | **Word Error Rate (WER)** | Proportion of words that are incorrect |
124
+
125
+ **Note on Token Error Rate:** Token Error Rate measures the model's capability to predict the next token given the audio input and the current sequence of tokens. This metric is weaker than Word Error Rate (WER) and Character Error Rate (CER) because it doesn't account for insertions, deletions, and substitutions as comprehensively. Token Error Rate is used here because Khmer text lacks word boundaries, making WER and CER calculations challenging without additional preprocessing.
126
+
127
+ **Note on Translation Task:** The models are also trained for the `translation` task, but evaluation is deferred to future work due to scarce data (2000 samples from Khmer audio to English text, and 1000 samples from English audio to Khmer text).
128
+
129
+
130
+ ### Results
131
+
132
+ #### Language Detection Results
133
+
134
+ | Dataset | Precision | Recall | Accuracy | F1-score |
135
+ |---------|-----------|--------|----------|----------|
136
+ | google/fleurs (Khmer) | 100% | 100% | 100% | 100% |
137
+ | librispeech.clean (English) | 100% | 100% | 100% | 100% |
138
+
139
+ **Key Finding:** Both model sizes achieved perfect language detection performance on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio.
140
+
141
+ **Note on Language Detection Performance:** The 100% language detection scores may appear unusually high. A likely explanation is that, during training, the model permutes word tokens starting from position 3 while the first three positions (start token, language token, and task token) remain fixed. With 6 permutations, the model therefore learns to predict the language token 6 times for a given audio. Since language detection relies on the language token at position 1, and this token is never permuted during training, the model can achieve perfect accuracy on language detection tasks.
142
+
143
+
144
+ #### Transcription Results
145
+
146
+ | Metric | Combined (Khmer + English) | Khmer | English |
147
+ |--------|---------------------------|-------|---------|
148
+ | Token Error Rate | 29% | 56% | 19% |
149
+ | Character Error Rate (CER) | 32.89% | 60.71% | 20.98% |
150
+ | Word Error Rate (WER) | 46.53% | 86.16% | 31.13% |
151
+
152
+ **Key Observations:**
153
+ - The model shows strong performance on English (19% token error rate, 20.98% CER, 31.13% WER)
154
+ - Performance drops significantly for Khmer (56% token error rate, 60.71% CER, 86.16% WER)
155
+
156
+ **Note:** Khmer text, unlike English text, has no whitespace between words. To compute `CER` and `WER`, the `khmercut` PyPI package is used to tokenize Khmer text into words, and the words are then joined back together with whitespace.
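
The computation described in the note above can be sketched as follows. This is a minimal illustration, not the card's evaluation script: the helper names `edit_distance` and `error_rate` are hypothetical, the example sentences are made up, and the word lists stand in for the output of a segmenter such as `khmercut` (no call to that package is made here).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalised by the reference length."""
    if len(ref_units) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


# Hypothetical pre-segmented Khmer words (reference vs. model output).
ref_words = ["ខ្ញុំ", "ទៅ", "សាលារៀន"]
hyp_words = ["ខ្ញុំ", "ទៅ", "ផ្សារ"]

wer = error_rate(ref_words, hyp_words)                      # word level
cer = error_rate(" ".join(ref_words), " ".join(hyp_words))  # character level
print(f"WER={wer:.2%}  CER={cer:.2%}")
```

Here one of three reference words is wrong, so the WER is one third; the CER is computed on the whitespace-joined strings, with characters counted as Unicode code points.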
157
+
158
+ #### Summary
159
+
160
+ **Language Detection:** Both model sizes achieved perfect 100% performance across all metrics (Precision, Recall, Accuracy, F1-score) on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio. This perfect score is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
161
+
162
+ **Transcription:** The Small model shows strong performance on English (10% token error rate, 7.08% CER, 12.95% WER) and moderate performance for Khmer (46% token error rate, 35.31% CER, 50.70% WER). The Tiny model shows strong performance on English (19% token error rate, 20.98% CER, 31.13% WER) but significantly lower performance for Khmer (56% token error rate, 60.71% CER, 86.16% WER). The larger model benefits from increased embedding dimension (768 vs 384) and more layers (12 vs 4).
163
+
164
+
165
+ ## How to Get Started with the Model
166
+
167
+ First, install `tror-yong-asr` PyPI package:
168
+ ```bash
169
+ pip install tror-yong-asr
170
+ ```
171
+
172
+ Then, use the code below to get started with the model.
173
+
174
+ ```python
175
+ from transformers import AutoProcessor
176
+ from tror_yong_asr import TrorYongASRModel, transcribe, translate
177
+
178
+
179
+ model_id = "KrorngAI/tror-yong-asr-tiny"
180
+ processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
181
+ model = TrorYongASRModel.from_pretrained(model_id, trust_remote_code=True)
182
+
183
+ result1 = transcribe('/path/to/audio_file.mp3', model, processor, max_tokens=64)
184
+ print(result1) # namedtuple: text: str, output_ids: torch.Tensor
185
+
186
+ result2 = translate('/path/to/audio_file.mp3', model, processor, max_tokens=64)
187
+ print(result2) # namedtuple: text: str, output_ids: torch.Tensor
188
+ ```
189
+
190
+
191
  ## Uses
192
 
193
  <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 
211
  - **Larger ASR systems**: As a component in multi-language ASR pipelines
212
 
213
 
 
 
 
 
214
  ## Bias, Risks, and Limitations
215
 
216
  **Technical Limitations:**
 
230
 
231
  Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
232
 
233
 
234
  ## Training Details
235
 
 
258
 
259
  <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
260
 
261
+ #### Preprocessing
262
 
263
  Following OpenAI's Whisper model, audio clips longer than 30 seconds are filtered out.
264
  All audio clips have a `16000` Hz sample rate.
 
277
  - **Gradient Clip Value:** 0.5 (only for parameters trained by AdamW)
278
 
279
 
280
+ #### Speeds, Sizes, Times
281
 
282
  The training was conducted over 4000 optimizer steps on a single Tesla T4 GPU.
283
+ The training took around 10 hours.
284
 
285
 
286
  ## Citation [optional]
287
 
288
  <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 
291
 
292
  [More Information Needed]
293
 
 
 
 
 
 
294
 
295
+ ## Model Card Authors
296
 
297
+ Name: KHUN Kimang (Ph.D.)
298
+ Email: kimang.khun@polytechnique.org
 
 
 
 
 
 
 
299
 
300
  ## Model Card Contact
301
 
302
+ If you have any questions, please reach out via the [Facebook Page](https://www.facebook.com/profile.php?id=61582509385293).