Automatic Speech Recognition
Transformers
Safetensors
Khmer
English
troryongasr
custom_code
Kimang18 committed on
Commit
3b87804
·
verified ·
1 Parent(s): 0133579

Preliminary result

  1. README.md +177 -39
README.md CHANGED
@@ -1,35 +1,51 @@
1
  ---
2
  library_name: transformers
3
- tags: []
 
 
 
 
 
 
 
 
 
 
 
 
4
  ---
5
 
6
- # Model Card for Model ID
7
-
8
- <!-- Provide a quick summary of what the model is/does. -->
9
-
10
 
 
 
 
11
 
12
  ## Model Details
13
 
14
  ### Model Description
15
 
16
- <!-- Provide a longer summary of what this model is. -->
 
 
 
 
 
 
17
 
18
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
19
 
20
- - **Developed by:** [More Information Needed]
21
- - **Funded by [optional]:** [More Information Needed]
22
- - **Shared by [optional]:** [More Information Needed]
23
- - **Model type:** [More Information Needed]
24
- - **Language(s) (NLP):** [More Information Needed]
25
  - **License:** [More Information Needed]
26
- - **Finetuned from model [optional]:** [More Information Needed]
27
 
28
  ### Model Sources [optional]
29
 
30
  <!-- Provide the basic links for the model. -->
31
 
32
- - **Repository:** [More Information Needed]
33
  - **Paper [optional]:** [More Information Needed]
34
  - **Demo [optional]:** [More Information Needed]
35
 
@@ -39,27 +55,39 @@ This is the model card of a 🤗 transformers model that has been pushed on the
39
 
40
  ### Direct Use
41
 
42
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
 
 
 
 
 
43
 
44
- [More Information Needed]
45
 
46
  ### Downstream Use [optional]
47
 
48
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
 
 
 
 
49
 
50
- [More Information Needed]
51
 
52
  ### Out-of-Scope Use
53
 
54
  <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
 
56
- [More Information Needed]
57
-
58
  ## Bias, Risks, and Limitations
59
 
60
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
 
 
 
 
 
 
 
 
61
 
62
- [More Information Needed]
63
 
64
  ### Recommendations
65
 
@@ -69,9 +97,28 @@ Users (both direct and downstream) should be made aware of the risks, biases and
69
 
70
  ## How to Get Started with the Model
71
 
72
- Use the code below to get started with the model.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
73
 
74
- [More Information Needed]
75
 
76
  ## Training Details
77
 
@@ -79,7 +126,22 @@ Use the code below to get started with the model.
79
 
80
  <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
 
82
- [More Information Needed]
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
83
 
84
  ### Training Procedure
85
 
@@ -87,49 +149,125 @@ Use the code below to get started with the model.
87
 
88
  #### Preprocessing [optional]
89
 
90
- [More Information Needed]
91
-
 
92
 
93
  #### Training Hyperparameters
94
 
95
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
 
 
 
 
 
 
 
 
 
96
 
97
  #### Speeds, Sizes, Times [optional]
98
 
99
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
 
100
 
101
- [More Information Needed]
102
 
103
  ## Evaluation
104
 
105
  <!-- This section describes the evaluation protocols and provides the results. -->
 
106
 
107
- ### Testing Data, Factors & Metrics
108
 
109
  #### Testing Data
110
 
111
  <!-- This should link to a Dataset Card if possible. -->
112
 
113
- [More Information Needed]
114
-
115
- #### Factors
116
-
117
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
118
 
119
- [More Information Needed]
120
 
121
  #### Metrics
122
 
123
  <!-- These are the evaluation metrics being used, ideally with a description of why. -->
124
 
125
- [More Information Needed]
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
126
 
127
  ### Results
128
 
129
- [More Information Needed]
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
130
 
131
  #### Summary
132
 
 
 
 
133
 
134
 
135
  ## Model Examination [optional]
 
1
  ---
2
  library_name: transformers
3
+ license: mit
4
+ datasets:
5
+ - DDD-Cambodia/khm-asr-cultural
6
+ - openslr/librispeech_asr
7
+ - KrorngAI/fleurs-km-kh-openslr-SLR42
8
+ language:
9
+ - km
10
+ - en
11
+ metrics:
12
+ - wer
13
+ - cer
14
+ - ter
15
+ pipeline_tag: automatic-speech-recognition
16
  ---
17
 
18
+ # TrorYongASR Tiny
 
 
 
19
 
20
+ > [!Note]
21
+ > This repository contains model weights and configuration files for the pre-trained model.
22
+ >
23
 
24
  ## Model Details
25
 
26
  ### Model Description
27
 
28
+ This is a **29M-parameter** ASR (Automatic Speech Recognition) model inspired by [PARSeq](https://github.com/baudm/parseq/tree/main) and [Whisper](https://github.com/openai/whisper/tree/main). It is a lightweight model designed for efficient speech-to-text tasks and is particularly suited to edge devices and mobile applications. The model supports both Khmer and English.
29
+
30
+
31
+ | Model Size | Parameters | Audio Encoder | Text Decoder | Embedding Dim | Audio Context | Text Context |
32
+ |------------|------------|---------------|--------------|---------------|---------------|--------------|
33
+ | **Tiny** | 29M | 4 layers, 6 heads | 1 layer, 12 heads | 384 | 1500 | 1024 |
34
+ | **Small** | 136M | 12 layers, 12 heads | 1 layer, 24 heads | 768 | 1500 | 1024 |
35
 
36
+ **Note:** Audio arrays are converted to log-mel spectrograms with `80` mel bins (the same as Whisper models of the same size).
37
 
38
+ - **Developed by:** Dr. KHUN Kimang
39
+ - **Shared by [optional]:** KrorngAI
40
+ - **Model type:** ASR (Automatic Speech Recognition)
41
+ - **Language(s) (NLP):** Khmer and English
 
42
  - **License:** [More Information Needed]
 
43
 
44
  ### Model Sources [optional]
45
 
46
  <!-- Provide the basic links for the model. -->
47
 
48
+ - **Repository:** https://github.com/Kimang18/KrorngAI/tree/main/tror-yong-asr
49
  - **Paper [optional]:** [More Information Needed]
50
  - **Demo [optional]:** [More Information Needed]
51
 
 
55
 
56
  ### Direct Use
57
 
58
+ The Tiny model can be used directly for:
59
+ - **Speech-to-text transcription**: transcribe Khmer and English audio
60
+ - **Speech-to-text translation**: translate Khmer audio to English text and English audio to Khmer text
61
+ - **Language detection**: identify whether audio is in Khmer or English (100% accuracy on the test sets reported in the Evaluation section)
62
+ - **Edge computing**: Deploy on mobile devices, IoT devices, and embedded systems due to its small size (29M parameters)
63
+ - **Real-time applications**: Low latency inference suitable for real-time speech interfaces
64
 
 
65
 
66
  ### Downstream Use [optional]
67
 
68
+ The model can be integrated into:
69
+ - **Mobile applications**: Android/iOS apps with speech recognition
70
+ - **Web applications**: Browser-based speech-to-text using WebAssembly
71
+ - **IoT devices**: Smart speakers, voice assistants
72
+ - **Larger ASR systems**: As a component in multi-language ASR pipelines
73
 
 
74
 
75
  ### Out-of-Scope Use
76
 
77
  <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
78
 
 
 
79
  ## Bias, Risks, and Limitations
80
 
81
+ **Technical Limitations:**
82
+ - **No-speech detection**: The model was not trained to detect the absence of speech. Users need to fine-tune it for this task (`TrorYongASRTokenizer` provides a `<|nospeech|>` token).
83
+ - **Noise robustness**: Performance may degrade in noisy environments
84
+ - **Translation task**: Training data for the translation task is scarce. Users need to fine-tune the model for better translation performance.
85
+
86
+ **Sociotechnical Limitations:**
87
+ - **Accent variability**: May not perform well on diverse Khmer accents
88
+ - **Background noise**: Limited robustness to background noise and reverberation
89
+ - **Speaker variability**: May struggle with different speaking styles and rates
90
 
 
91
 
92
  ### Recommendations
93
 
 
97
 
98
  ## How to Get Started with the Model
99
 
100
+ First, install the `tror-yong-asr` PyPI package:
101
+ ```bash
102
+ pip install tror-yong-asr
103
+ ```
104
+
105
+ Then, use the code below to get started with the model.
106
+ ```python
107
+ from transformers import AutoProcessor
108
+ from tror_yong_asr import TrorYongASRModel, transcribe, translate
109
+
110
+
111
+ model_id = "Kimang18/tror-yong-asr-small"
112
+ processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
113
+ model = TrorYongASRModel.from_pretrained(model_id, trust_remote_code=True)
114
+
115
+ result1 = transcribe('/content/voice5.mp3', model, processor, max_tokens=64)
116
+ print(result1) # namedtuple: text: str, output_ids: torch.Tensor
117
+
118
+ result2 = translate('/content/voice5.mp3', model, processor, max_tokens=64)
119
+ print(result2) # namedtuple: text: str, output_ids: torch.Tensor
120
+ ```
121
 
 
122
 
123
  ## Training Details
124
 
 
126
 
127
  <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
128
 
129
+ #### Transcription Task
130
+
131
+ For the transcription task, the model was trained on around 140 hours of Khmer audio and around 100 hours of English audio.
132
+ Khmer datasets include [`DDD-Cambodia/khm-asr-cultural`](https://huggingface.co/datasets/DDD-Cambodia/khm-asr-cultural) (134.6 hours), [`openslr/openslr`](https://huggingface.co/datasets/Kimang18/openslr-SLR42/blob/main/README.md), and [`google/fleurs`](https://huggingface.co/datasets/Kimang18/google-fleurs-km-kh).
133
+ The `clean.100` split of [`openslr/librispeech_asr`](https://huggingface.co/datasets/openslr/librispeech_asr) was used as the English dataset.
134
+
135
+ | Dataset | Language | Training examples | Validation examples | Description |
136
+ | --------- | ---------- | ----------------- | ------------------- |- |
137
+ | **openslr/openslr** | Khmer | 2906 | 0 | Multi-speaker TTS data for Khmer language (split `SLR42`) |
138
+ | **google/fleurs** | Khmer | 1675 | 324 | TTS data for Khmer language (split `km_kh`) |
139
+ | **DDD-Cambodia/khm-asr-cultural** | Khmer | 56716 | 0 | Khmer ASR Cultural Dataset (split `train`) |
140
+ | **librispeech.clean** | English | 28539 | 2703 | Clean speech dataset for English transcription |
141
+
142
+ #### Translation Task
143
+
144
+ For the translation task, data was scarce: only 2000 examples for Khmer audio to English text and only 1000 examples for English audio to Khmer text.
145
 
146
  ### Training Procedure
147
 
 
149
 
150
  #### Preprocessing [optional]
151
 
152
+ Following OpenAI's `Whisper` model, audio clips longer than 30 seconds are filtered out.
153
+ All audio is sampled at `16000` Hz.
154
+ For the English dataset, all text is lowercased.
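
The preprocessing rules above can be sketched in a few lines of arithmetic. The sample rate and the 30-second limit come from this card; the hop length of 160 samples and the stride-2 encoder convolution are assumptions borrowed from Whisper's front end and are not confirmed for this model. Under those assumptions, a 30-second clip yields the audio context of 1500 listed in the model table.

```python
SAMPLE_RATE = 16_000   # stated in this card
MAX_SECONDS = 30       # stated in this card
HOP_LENGTH = 160       # assumption: Whisper's STFT hop (10 ms at 16 kHz)
ENCODER_STRIDE = 2     # assumption: Whisper-style stride-2 convolution


def keep_example(num_samples: int, sample_rate: int = SAMPLE_RATE) -> bool:
    """Return True if the clip is at most MAX_SECONDS long."""
    return num_samples / sample_rate <= MAX_SECONDS


def audio_context(seconds: int = MAX_SECONDS) -> int:
    """Number of encoder positions for a clip of the given length."""
    mel_frames = seconds * SAMPLE_RATE // HOP_LENGTH
    return mel_frames // ENCODER_STRIDE


print(audio_context())         # encoder positions for a 30-second clip
print(keep_example(480_000))   # exactly 30 seconds at 16 kHz
```

If the actual front end uses a different hop length or stride, only the two assumed constants need to change.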
155
 
156
  #### Training Hyperparameters
157
 
158
+ - **Training regime:** bf16 mixed precision training using `LightningAI` package <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
159
+ - **Optimizer:** MuonAdamW (custom implementation)
160
+ - **Learning rate:** Linear Warmup (40 optimizer steps) + CosineAnnealing (3960 optimizer steps)
161
+ - **Weight decay:** 0.1
162
+ - **Batch size:** 8
163
+ - **Gradient accumulation:** 8
164
+ - **Number of optimizer steps:** 4000
165
+ - **Number of epochs:** roughly 2 epochs
166
+ - **Gradient Clip Value:** 0.5 (only for parameters trained by AdamW)
167
+
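
As an illustration of the schedule listed above, the following sketch computes the learning rate at each optimizer step: 40 steps of linear warmup followed by cosine annealing over the remaining 3960 steps. The peak and minimum learning rates are placeholders, since this card does not state them, and the exact warmup formula of the training code may differ. The effective batch size assumes that the batch size of 8 is the per-step micro-batch.

```python
import math

WARMUP_STEPS = 40    # from the hyperparameter list
TOTAL_STEPS = 4000   # from the hyperparameter list
PEAK_LR = 1e-3       # assumption: the peak learning rate is not stated
MIN_LR = 0.0         # assumption: the cosine floor is not stated


def lr_at(step: int) -> float:
    """Learning rate at a 0-indexed optimizer step."""
    if step < WARMUP_STEPS:
        # Linear warmup that reaches PEAK_LR on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Cosine annealing from PEAK_LR towards MIN_LR over the remaining steps.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


# Examples consumed per optimizer step: batch size x gradient accumulation.
EFFECTIVE_BATCH = 8 * 8

for step in (0, 39, 40, 2000, 3999):
    print(step, lr_at(step))
```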
168
 
169
  #### Speeds, Sizes, Times [optional]
170
 
171
+ Training ran for 4000 optimizer steps on a single Tesla T4 GPU.
172
+ The training took around 10 hours.
173
 
 
174
 
175
  ## Evaluation
176
 
177
  <!-- This section describes the evaluation protocols and provides the results. -->
178
+ The evaluation assesses two capabilities, language detection and transcription, on two datasets ([`google/fleurs`](https://huggingface.co/datasets/Kimang18/google-fleurs-km-kh) for Khmer and [`openslr/librispeech_asr`](https://huggingface.co/datasets/openslr/librispeech_asr) for English). All results are from the test split of each dataset and reflect the model's ability to generalize to unseen data.
179
 
180
+ ### Testing Data & Metrics
181
 
182
  #### Testing Data
183
 
184
  <!-- This should link to a Dataset Card if possible. -->
185
 
186
+ | Dataset | Language | Testing examples | Description |
187
+ | --------- | ---------- | ------------- | - |
188
+ | **google/fleurs** | Khmer | 765 | Multi-lingual dataset with Khmer language samples |
189
+ | **librispeech.clean** | English | 2620 | Clean speech dataset for English transcription |
 
190
 
191
+ **Note:** All evaluation results below are from the **test split** of each dataset. For `google/fleurs`, audios longer than `30 seconds` are excluded from the evaluation.
192
 
193
  #### Metrics
194
 
195
  <!-- These are the evaluation metrics being used, ideally with a description of why. -->
196
 
197
+ ##### Language Detection
198
+
199
+ **Task:** Given audio input, detect the language.
200
+
201
+ **Approach:** Binary classification task (2 languages: Khmer and English).
202
+
203
+ **Metrics:**
204
+
205
+ | Metric | Description |
206
+ |--------|-------------|
207
+ | **Precision** | Proportion of predicted languages that are correct |
208
+ | **Recall** | Proportion of actual language samples correctly identified |
209
+ | **Accuracy** | Proportion of total predictions that are correct |
210
+ | **F1-score** | Harmonic mean of precision and recall |
211
+
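
For reference, the four metrics in the table above can be computed from predicted and true language labels as sketched below. This is a generic illustration, not the evaluation script used for this card; the function name `binary_metrics` and the label strings are hypothetical.

```python
def binary_metrics(y_true, y_pred, positive):
    """Precision, recall, accuracy, and F1 with `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard against empty denominators by falling back to 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


# Toy labels for illustration only; these are not results from this model.
truth = ["km", "km", "en", "en"]
preds = ["km", "en", "en", "en"]
print(binary_metrics(truth, preds, positive="km"))
```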
212
+ ##### Transcription
213
+
214
+ **Task:** Convert audio to text (transcription).
215
+
216
+ **Metrics:**
217
+
218
+ | Metric | Description |
219
+ |--------|-------------|
220
+ | **Token Error Rate** | Proportion of incorrectly transcribed tokens |
221
+ | **Character Error Rate (CER)** | Proportion of characters that are incorrect |
222
+ | **Word Error Rate (WER)** | Proportion of words that are incorrect |
223
+
224
+ **Note on Translation Task:** The models are also trained for the `translation` task, but evaluation is deferred to future work due to scarce data (2000 samples from Khmer audio to English text, and 1000 samples from English audio to Khmer text).
225
+
226
+ **Note on Token Error Rate:** Token Error Rate measures the model's capability to predict the next token given the audio input and the current sequence of tokens. This metric is weaker than Word Error Rate (WER) and Character Error Rate (CER) because it doesn't account for insertions, deletions, and substitutions as comprehensively. Token Error Rate is used here because Khmer text lacks word boundaries, making WER and CER calculations challenging without additional preprocessing.
227
+
228
 
229
  ### Results
230
 
231
+ #### Language Detection Results
232
+
233
+ | Model | Dataset | Precision | Recall | Accuracy | F1-score |
234
+ |-------|---------|-----------|--------|----------|----------|
235
+ | Tiny | google/fleurs (Khmer) | 100% | 100% | 100% | 100% |
236
+ | | librispeech.clean (English) | 100% | 100% | 100% | 100% |
237
+ | Small | google/fleurs (Khmer) | 100% | 100% | 100% | 100% |
238
+ | | librispeech.clean (English) | 100% | 100% | 100% | 100% |
239
+
240
+ **Key Finding:** Both model sizes achieved perfect language detection performance on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio.
241
+
242
+ **Note on Language Detection Performance:** The 100% language detection scores may appear unusually high. This is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
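
The idea described in the note above can be illustrated with a minimal sketch that samples a decoding order in which the first three positions stay fixed and only the later positions are shuffled. This is a simplified assumption about the mechanism; the actual permutation sampling in the training code (and in PARSeq) may be more elaborate, and the name `sample_order` is hypothetical.

```python
import random

NUM_FIXED = 3  # start, language, and task tokens, as described in this card


def sample_order(seq_len: int, rng: random.Random) -> list:
    """Return a position ordering that keeps the first NUM_FIXED positions in place."""
    fixed = list(range(min(NUM_FIXED, seq_len)))
    rest = list(range(NUM_FIXED, seq_len))
    rng.shuffle(rest)  # only positions 3 and later are permuted
    return fixed + rest


order = sample_order(8, random.Random(0))
print(order[:NUM_FIXED])  # the prefix is always positions 0, 1, 2
print(order)
```

Because position 1 (the language token) is never moved in such a scheme, it is always predicted from the same context, namely the audio and the start token.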
243
+
244
+
245
+ #### Transcription Results
246
+
247
+ | Model | Metric | Combined (Khmer + English) | Khmer | English |
248
+ |-------|--------|---------------------------|-------|---------|
249
+ | **Tiny** | Token Error Rate | 29% | 56% | 19% |
250
+ | | Character Error Rate (CER) | 32.89% | 60.71% | 20.98% |
251
+ | | Word Error Rate (WER) | 46.53% | 86.16% | 31.13% |
252
+ | **Small** | Token Error Rate | 19% | 46% | 10% |
253
+ | | Character Error Rate (CER) | 15.54% | 35.31% | 7.08% |
254
+ | | Word Error Rate (WER) | 23.52% | 50.70% | 12.95% |
255
+
256
+ **Key Observations:**
257
+ - The Tiny model performs best on English (19% token error rate, 20.98% CER, 31.13% WER)
258
+ - Its performance drops significantly for Khmer (56% token error rate, 60.71% CER, 86.16% WER)
259
+ - The Small model shows strong performance on English (10% token error rate, 7.08% CER, 12.95% WER)
260
+ - Its performance for Khmer is moderate (46% token error rate, 35.31% CER, 50.70% WER)
261
+ - The larger model benefits from increased embedding dimension (768 vs 384) and more layers (12 vs 4)
262
+
263
+ **Note:** To compute `CER` and `WER`, whitespace is inserted between words in Khmer text, which, unlike English text, has no explicit word boundaries. The `khmercut` PyPI package is used to tokenize Khmer text into words, and the words are then joined back together with whitespace.
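
The procedure in the note above can be sketched with a plain edit-distance implementation. This is an illustrative sketch, not the evaluation code behind the reported numbers: the helper names are hypothetical, the word segmenter is passed in as a callable (for Khmer, this would be a segmenter such as the one provided by `khmercut`), and whether the reported CER counts the inserted spaces is not stated in this card.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def spaced(text, segment):
    """Join the words returned by `segment` with single spaces."""
    return " ".join(segment(text))


def wer(reference, hypothesis):
    """Word error rate over whitespace-separated words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)


def cer(reference, hypothesis):
    """Character error rate over raw characters (spaces included)."""
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)


# English toy example; for Khmer, apply `spaced(text, segmenter)` to both strings first.
print(wer("the cat sat", "the cat sit"))
print(cer("the cat sat", "the cat sit"))
```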
264
+
265
 
266
  #### Summary
267
 
268
+ **Language Detection:** Both model sizes achieved perfect 100% performance across all metrics (Precision, Recall, Accuracy, F1-score) on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio. This perfect score is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
269
+
270
+ **Transcription:** The Small model shows strong performance on English (10% token error rate, 7.08% CER, 12.95% WER) and moderate performance for Khmer (46% token error rate, 35.31% CER, 50.70% WER). The Tiny model shows strong performance on English (19% token error rate, 20.98% CER, 31.13% WER) but significantly lower performance for Khmer (56% token error rate, 60.71% CER, 86.16% WER). The larger model benefits from increased embedding dimension (768 vs 384) and more layers (12 vs 4).
271
 
272
 
273
  ## Model Examination [optional]