Commit 86d04b8 by Kimang18 · verified · 1 Parent(s): a3a97e0

Update README.md

Files changed (1)
  1. README.md +38 -18
README.md CHANGED
@@ -34,7 +34,7 @@ pipeline_tag: automatic-speech-recognition
34
  <a href="https://kimang18.github.io" target="_blank"><img alt="Personal" src="https://img.shields.io/badge/KHUN-white?logoColor=white"/></a>
35
  </div>
36
  <div align="center" style="line-height: 1;">
37
- <a href="https://huggingface.co/Kimang18/tror-yong-asr-tiny/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/badge/License-Modified_MIT-f5de53?&color=f5de53"/></a>
38
  </div>
39
 
40
 
@@ -143,10 +143,10 @@ The evaluation assesses two capabilities — language detection and transcriptio
143
 
144
  | Model | Dataset | Precision | Recall | Accuracy | F1-score |
145
  |-------|---------|-----------|--------|----------|----------|
146
- | Tiny | google/fleurs (Khmer) | 100% | 100% | 100% | 100% |
147
- | | librispeech.clean (English) | 100% | 100% | 100% | 100% |
148
- | Small | google/fleurs (Khmer) | 100% | 100% | 100% | 100% |
149
- | | librispeech.clean (English) | 100% | 100% | 100% | 100% |
150
  </div>
151
 
152
  **Key Finding:** Both model sizes achieved perfect language detection performance on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio.
@@ -158,8 +158,8 @@ The evaluation assesses two capabilities — language detection and transcriptio
158
 
159
  <div align="center">
160
 
161
- | Model | Metric | Khmer | English | Combined (Khmer + English) |
162
- |-------|--------|---------------------------|-------|---------|
163
  | **Tiny** | Token Error Rate | 56% | 19% | 29% |
164
  | | Character Error Rate (CER) | 60.71% | 20.98% | 32.89% |
165
  | | Word Error Rate (WER) | 86.16% | 31.13% | 46.53% |
@@ -175,9 +175,28 @@ The evaluation assesses two capabilities — language detection and transcriptio
175
  - Performance for Khmer is moderate (46% token error rate, 35.31% CER, 50.70% WER)
176
  - The larger model benefits from increased embedding dimension (768 vs 384) and more layers (12 vs 4)
177
 
178
- **Note:** To compute `CER` and `WER`, whitespaces are added between words in Khmer text (Khmer text has no word boundary like English text). To do so, `khmercut` PyPI package is used to tokenize Khmer text into words, and then the words are joined back together with whitespaces.
179
 
180
- #### Summary
181
 
182
  **Language Detection:** Both model sizes achieved perfect 100% performance across all metrics (Precision, Recall, Accuracy, F1-score) on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio. This perfect score is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
183
 
@@ -195,24 +214,24 @@ Then, use the code below to get started with the model.
195
 
196
  ```python
197
  from transformers import AutoProcessor
198
- from tror_yong_asr import TrorYongASRModel, transcribe, translate
199
 
200
 
201
  model_id = "KrorngAI/tror-yong-asr-tiny"
202
  processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
203
  model = TrorYongASRModel.from_pretrained(model_id, trust_remote_code=True)
204
 
205
- result1 = transcribe('/path/to/audio_file.mp3', model, processor, max_tokens=64)
206
- print(result1) # namedtuple: text: str, output_ds: torch.Tensor
207
-
208
- result2 = translate('/path/to/audio_file.mp3', model, processor, max_tokens=64)
209
- print(result2) # namedtuple: text:str, output_ids: torch.Tensor
210
 
 
 
211
 
212
- #TODO: add detect_language usage
 
213
  ```
214
 
215
- ### Fine-tuning
216
 
217
  Notebook (TBA)
218
 
@@ -243,8 +262,9 @@ The model can be integrated into:
243
 
244
  **Technical Limitations:**
245
  - **No speech detection**: The model was not trained for this task. User needs to fine-tune the model for this task (TrorYongASRTokenizer has `<|nospeech|>` token.)
 
246
  - **Noise robustness**: Performance may degrade in noisy environments
247
- - **Translate task**: The training data for translation task is scarce. User needs to fine-tune the model for better translation performance.
248
 
249
  **Sociotechnical Limitations:**
250
  - **Accent variability**: May not perform well on diverse Khmer accents
 
34
  <a href="https://kimang18.github.io" target="_blank"><img alt="Personal" src="https://img.shields.io/badge/KHUN-white?logoColor=white"/></a>
35
  </div>
36
  <div align="center" style="line-height: 1;">
37
+ <a href="https://huggingface.co/moonshotai/Kimi-K2.6/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/badge/License-Modified_MIT-f5de53?&color=f5de53"/></a>
38
  </div>
39
 
40
 
 
143
 
144
  | Model | Dataset | Precision | Recall | Accuracy | F1-score |
145
  |-------|---------|-----------|--------|----------|----------|
146
+ | Tiny | Khmer (`fleurs`) | 100% | 100% | 100% | 100% |
147
+ | | English (librispeech.clean) | 100% | 100% | 100% | 100% |
148
+ | Small | Khmer (`fleurs`) | 100% | 100% | 100% | 100% |
149
+ | | English (librispeech.clean) | 100% | 100% | 100% | 100% |
150
  </div>
151
 
152
  **Key Finding:** Both model sizes achieved perfect language detection performance on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio.
 
158
 
159
  <div align="center">
160
 
161
+ | Model | Metric | Khmer (`fleurs`) | English (`librispeech.clean`) | Mixed (Khmer + English) |
162
+ |-------|--------|-------|---------|---------|
163
  | **Tiny** | Token Error Rate | 56% | 19% | 29% |
164
  | | Character Error Rate (CER) | 60.71% | 20.98% | 32.89% |
165
  | | Word Error Rate (WER) | 86.16% | 31.13% | 46.53% |
 
175
  - Performance for Khmer is moderate (46% token error rate, 35.31% CER, 50.70% WER)
176
  - The larger model benefits from increased embedding dimension (768 vs 384) and more layers (12 vs 4)
177
 
178
+ **Note:** Khmer text does not mark word boundaries the way English text does, so spaces are added between Khmer words before computing `CER` and `WER`. The `khmercut` PyPI package tokenizes the Khmer text into words, and the words are then joined back together with spaces.
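
The procedure in the note can be sketched roughly as follows. This is a minimal illustration rather than the evaluation script behind the tables: the README does not show how `khmercut` is called, so the segmenter is passed in as a plain callable, and `toy_segment`, `add_word_boundaries`, `wer`, and `cer` are hypothetical names. The sketch also assumes that the inserted spaces count as characters for CER, which the README does not state.

```python
from typing import Callable, List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def add_word_boundaries(text: str, segment: Callable[[str], List[str]]) -> str:
    """Segment unspaced text into words and join them with single spaces."""
    return " ".join(segment(text))


def wer(reference: str, hypothesis: str, segment: Callable[[str], List[str]]) -> float:
    """Word error rate computed on the re-spaced text."""
    ref_words = add_word_boundaries(reference, segment).split()
    hyp_words = add_word_boundaries(hypothesis, segment).split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def cer(reference: str, hypothesis: str, segment: Callable[[str], List[str]]) -> float:
    """Character error rate on the re-spaced text (spaces included: an assumption)."""
    ref_chars = list(add_word_boundaries(reference, segment))
    hyp_chars = list(add_word_boundaries(hypothesis, segment))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


def toy_segment(text: str) -> List[str]:
    """Hypothetical stand-in for a Khmer word segmenter: splits every two characters."""
    return [text[i:i + 2] for i in range(0, len(text), 2)]


if __name__ == "__main__":
    print(add_word_boundaries("aabbcc", toy_segment))
    print(wer("aabbcc", "aabbcd", toy_segment))
    print(cer("aabbcc", "aabbcd", toy_segment))
```

For real Khmer text, the `segment` argument would be replaced by whatever callable wraps the word segmenter in use. Because WER depends on where the segmenter places boundaries, scores are only comparable when the same segmenter is applied to both reference and hypothesis.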
179
 
180
+ ##### WER Comparison with Whisper
181
+
182
+ | Tiny | Parameters | Khmer (`fleurs`) | English (`librispeech.clean`) |
183
+ |-------|--------|---------------------------| --- |
184
+ | TrorYongASR | 29M | 86.16% | 31.13% |
185
+ | Whisper | 39M | 100.6% | 7.6% |
186
+
187
+ | Small | Parameters | Khmer (`fleurs`) | English (`librispeech.clean`) |
188
+ |-------|--------|---------------------------| --- |
189
+ | TrorYongASR | 136M | 50.70% | 12.95% |
190
+ | Whisper | 244M | 104.4% | 3.4% |
191
+
192
+ **Comparison Notes:**
193
+ - Whisper models have more parameters for comparable sizes (39M vs 29M for Tiny, 244M vs 136M for Small)
194
+ - Whisper shows significantly lower word error rates on English (7.6% vs 31.13% for Tiny, 3.4% vs 12.95% for Small)
195
+ - Whisper performs worse on Khmer (100.6% vs 86.16% for Tiny, 104.4% vs 50.70% for Small)
196
+ - WER above 100% is possible because inserted words count as errors; for Whisper on Khmer it means the output contains more word-level errors than the reference has words, which does not by itself indicate overfitting
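
For reference, values above 100% are consistent with the standard definition of word error rate, where $S$, $D$, and $I$ are the numbers of substituted, deleted, and inserted words and $N$ is the number of words in the reference. The numbers in the worked example are made up for illustration and are not taken from the evaluation:

```latex
\mathrm{WER} = \frac{S + D + I}{N},
\qquad \text{e.g.} \qquad
\frac{2 + 0 + 4}{5} = 1.2 = 120\%
```

Because $I$ is not bounded by $N$, a hypothesis that is much longer than the reference can push the ratio past 1.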
197
+
198
+
199
+ #### Result Summary
200
 
201
  **Language Detection:** Both model sizes achieved perfect 100% performance across all metrics (Precision, Recall, Accuracy, F1-score) on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio. This perfect score is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
202
 
 
214
 
215
  ```python
216
  from transformers import AutoProcessor
217
+ from tror_yong_asr import TrorYongASRModel, transcribe, translate, detect_language
218
 
219
 
220
  model_id = "KrorngAI/tror-yong-asr-tiny"
221
  processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
222
  model = TrorYongASRModel.from_pretrained(model_id, trust_remote_code=True)
223
 
224
+ result1 = detect_language('/path/to/audio_file.mp3', model, processor)
225
+ print(result1)
 
 
 
226
 
227
+ result2 = transcribe('/path/to/audio_file.mp3', model, processor, max_tokens=64)
228
+ print(result2)
229
 
230
+ result3 = translate('/path/to/audio_file.mp3', model, processor, max_tokens=64)
231
+ print(result3)
232
  ```
233
 
234
+ ## Fine-tuning
235
 
236
  Notebook (TBA)
237
 
 
262
 
263
  **Technical Limitations:**
264
  - **No speech detection**: The model was not trained for this task. User needs to fine-tune the model for this task (TrorYongASRTokenizer has `<|nospeech|>` token.)
265
+ - **Translate task**: Training data for the translation task is scarce. Users need to fine-tune the model for better translation performance
266
  - **Noise robustness**: Performance may degrade in noisy environments
267
+ - **No timestamp output**: The model does not output timestamps
268
 
269
  **Sociotechnical Limitations:**
270
  - **Accent variability**: May not perform well on diverse Khmer accents