Kimang18 committed on commit b87688a · verified · 1 parent: 19c1e6a

Update README.md

Files changed (1)
  1. README.md +39 -26
README.md CHANGED
@@ -38,7 +38,7 @@ pipeline_tag: automatic-speech-recognition
  </div>
 
 
- # TrorYongASR Tiny
 
  > [!Note]
  > This repository contains model weights and configuration files for the pre-trained model.
@@ -48,18 +48,18 @@ pipeline_tag: automatic-speech-recognition
 
  ### Model Description
 
- This is a **29M parameter** ASR (Automatic Speech Recognition) model inspired by [PARSeq](https://github.com/baudm/parseq/tree/main) and [Whisper](https://github.com/openai/whisper/tree/main). Compared to [whisper-tiny](https://huggingface.co/openai/whisper-tiny) with **39M** parameters, **TrorYongASR Tiny** is a lightweight model designed for efficient speech-to-text tasks, particularly suitable for edge devices and mobile applications. The model supports both Khmer and English.
 
  <div align="center">
 
- | **Model Size** | Tiny |
- |:-------------:|:-----------------:|
- | **Parameters** | 29M |
- | **Audio Encoder** | 4 layers, 6 heads |
- | **Text Decoder** | 1 layer, 12 heads |
- | **Embedding Dim** | 384 |
- | **Audio Context** | 1500 |
- | **Text Context** | 1024 |
  </div>
 
  **Note:** The audio arrays are converted to log-mel spectrograms with `80` mel bins (the same as Whisper models of the same size)
@@ -140,32 +140,40 @@ The evaluation assesses two capabilities — language detection and transcriptio
  #### Language Detection Results
 
  <div align="center">
-
- | Dataset | Precision | Recall | Accuracy | F1-score |
- |---------|-----------|--------|----------|----------|
- | google/fleurs (Khmer) | 100% | 100% | 100% | 100% |
- | librispeech.clean (English) | 100% | 100% | 100% | 100% |
  </div>
 
  **Key Finding:** Both model sizes achieved perfect language detection performance on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio.
 
- **Note on Language Detection Performance:** The 100% language detection scores may appear unusually high. This is probably because, during training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. This means that with 6 permutations, the model learns to predict the language token 6 times for a given audio clip. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
 
 
  #### Transcription Results
 
  <div align="center">
-
- | Metric | Combined (Khmer + English) | Khmer | English |
- |--------|---------------------------|-------|---------|
- | Token Error Rate | 29% | 56% | 19% |
- | Character Error Rate (CER) | 32.89% | 60.71% | 20.98% |
- | Word Error Rate (WER) | 46.53% | 86.16% | 31.13% |
  </div>
 
  **Key Observations:**
- - The model shows strong performance on English (19% token error rate, 20.98% CER, 31.13% WER)
  - Performance drops significantly for Khmer (56% token error rate, 60.71% CER, 86.16% WER)
 
  **Note:** To compute `CER` and `WER`, whitespace is inserted between words in the Khmer text (unlike English, Khmer text has no explicit word boundaries). To do so, the `khmercut` PyPI package is used to tokenize Khmer text into words, and the words are then joined back together with whitespace.
 
@@ -173,7 +181,7 @@ The evaluation assesses two capabilities — language detection and transcriptio
 
  **Language Detection:** Both model sizes achieved perfect 100% performance across all metrics (Precision, Recall, Accuracy, F1-score) on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio. This perfect score is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
 
- **Transcription:** The model shows strong performance on English (19% token error rate, 20.98% CER, 31.13% WER) but significantly lower performance for Khmer (56% token error rate, 60.71% CER, 86.16% WER).
 
 
  ## How to Get Started with the Model
@@ -199,8 +207,14 @@ print(result1) # namedtuple: text: str, output_ids: torch.Tensor
 
  result2 = translate('/path/to/audio_file.mp3', model, processor, max_tokens=64)
  print(result2) # namedtuple: text: str, output_ids: torch.Tensor
  ```
 
 
  ## Uses
 
@@ -287,8 +301,7 @@ For English dataset, all texts are in lowercase.
  - **Optimizer:** MuonAdamW (custom implementation)
  - **Learning rate:** Linear Warmup (40 optimizer steps) + CosineAnnealing (3960 optimizer steps)
  - **Weight decay:** 0.1
- - **Batch size:** 8
- - **Gradient accumulation:** 8
  - **Number of optimizer steps:** 4000
  - **Number of epochs:** roughly 2 epochs
  - **Gradient Clip Value:** 0.5 (only for parameters trained by AdamW)
 
  </div>
 
 
+ # TrorYongASR
 
  > [!Note]
  > This repository contains model weights and configuration files for the pre-trained model.
 
 
  ### Model Description
 
+ This is an ASR (Automatic Speech Recognition) model inspired by [PARSeq](https://github.com/baudm/parseq/tree/main) and [Whisper](https://github.com/openai/whisper/tree/main). It is smaller than OpenAI's Whisper models of the corresponding size. **TrorYongASR** supports both Khmer and English.
 
  <div align="center">
 
+ | **Model Size** | Tiny | Small |
+ |:-------------:|:-----------------:|:-----:|
+ | **Parameters** | 29M | 136M |
+ | **Audio Encoder** | 4 layers, 6 heads | 12 layers, 12 heads |
+ | **Text Decoder** | 1 layer, 12 heads | 1 layer, 24 heads |
+ | **Embedding Dim** | 384 | 768 |
+ | **Audio Context** | 1500 | 1500 |
+ | **Text Context** | 1024 | 1024 |
  </div>
 
  **Note:** The audio arrays are converted to log-mel spectrograms with `80` mel bins (the same as Whisper models of the same size)
 
  #### Language Detection Results
 
  <div align="center">
+
+ | Model | Dataset | Precision | Recall | Accuracy | F1-score |
+ |-------|---------|-----------|--------|----------|----------|
+ | Tiny | google/fleurs (Khmer) | 100% | 100% | 100% | 100% |
+ | | librispeech.clean (English) | 100% | 100% | 100% | 100% |
+ | Small | google/fleurs (Khmer) | 100% | 100% | 100% | 100% |
+ | | librispeech.clean (English) | 100% | 100% | 100% | 100% |
  </div>
 
  **Key Finding:** Both model sizes achieved perfect language detection performance on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio.
 
+ **Note on Language Detection Performance:** The 100% language detection scores may appear unusually high. This is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
 
 
  #### Transcription Results
 
  <div align="center">
+
+ | Model | Metric | Combined (Khmer + English) | Khmer | English |
+ |-------|--------|---------------------------|-------|---------|
+ | **Tiny** | Token Error Rate | 29% | 56% | 19% |
+ | | Character Error Rate (CER) | 32.89% | 60.71% | 20.98% |
+ | | Word Error Rate (WER) | 46.53% | 86.16% | 31.13% |
+ | **Small** | Token Error Rate | 19% | 46% | 10% |
+ | | Character Error Rate (CER) | 15.54% | 35.31% | 7.08% |
+ | | Word Error Rate (WER) | 23.52% | 50.70% | 12.95% |
  </div>
 
  **Key Observations:**
+ - The tiny model shows strong performance on English (19% token error rate, 20.98% CER, 31.13% WER)
  - Performance drops significantly for Khmer (56% token error rate, 60.71% CER, 86.16% WER)
+ - The small model shows strong performance on English (10% token error rate, 7.08% CER, 12.95% WER)
+ - Performance for Khmer is moderate (46% token error rate, 35.31% CER, 50.70% WER)
+ - The larger model benefits from increased embedding dimension (768 vs 384) and more layers (12 vs 4)
 
  **Note:** To compute `CER` and `WER`, whitespace is inserted between words in the Khmer text (unlike English, Khmer text has no explicit word boundaries). To do so, the `khmercut` PyPI package is used to tokenize Khmer text into words, and the words are then joined back together with whitespace.
 
 
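The note on `CER` and `WER` in the hunk above can be illustrated with a minimal sketch. It assumes a plain Levenshtein edit distance and assumes that both rates are computed on the whitespace-joined text; `edit_distance`, `error_rates`, and the `segment_words` parameter are hypothetical names, not functions from this repository. For Khmer, one would pass a `khmercut`-based tokenizer as `segment_words`; the whitespace splitter used in the usage line is only a stand-in so the example runs without that package.

```python
from typing import Callable, List, Sequence, Tuple


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rates(ref_text: str, hyp_text: str,
                segment_words: Callable[[str], List[str]]) -> Tuple[float, float]:
    """Return (CER, WER) after re-joining segmented words with single spaces."""
    # Assumption: both reference and hypothesis go through the same segmenter.
    ref_spaced = " ".join(segment_words(ref_text))
    hyp_spaced = " ".join(segment_words(hyp_text))
    # CER compares characters of the spaced strings (spaces included).
    cer = edit_distance(ref_spaced, hyp_spaced) / max(len(ref_spaced), 1)
    # WER compares the word lists.
    ref_words, hyp_words = ref_spaced.split(), hyp_spaced.split()
    wer = edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)
    return cer, wer


# Usage with a whitespace splitter standing in for a Khmer word segmenter.
cer, wer = error_rates("the cat sat", "the cat sit", str.split)
```

Taking the segmenter as a parameter keeps the metric code independent of any particular Khmer tokenizer, so the same function covers the English and Khmer test sets.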
 
  **Language Detection:** Both model sizes achieved perfect 100% performance across all metrics (Precision, Recall, Accuracy, F1-score) on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio. This perfect score is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
 
+ **Transcription:** The Small model shows strong performance on English (10% token error rate, 7.08% CER, 12.95% WER) and moderate performance for Khmer (46% token error rate, 35.31% CER, 50.70% WER). The Tiny model shows strong performance on English (19% token error rate, 20.98% CER, 31.13% WER) but significantly lower performance for Khmer (56% token error rate, 60.71% CER, 86.16% WER). The larger model benefits from increased embedding dimension (768 vs 384) and more layers (12 vs 4).
 
 
  ## How to Get Started with the Model
 
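The language-detection explanation above (positions 0 to 2 stay fixed while later tokens are permuted) can be sketched as follows. This is an illustration under stated assumptions, not the training code: `sample_permutations` is a hypothetical name, and uniform random shuffling is an assumed sampling strategy. The README only states that the start, language, and task tokens are never permuted; the default of 6 permutations comes from the earlier wording of the note.

```python
import random
from typing import List


def sample_permutations(seq_len: int, num_perms: int = 6,
                        num_fixed: int = 3, seed: int = 0) -> List[List[int]]:
    """Sample decoding orders that keep the first `num_fixed` positions in place."""
    if seq_len < num_fixed:
        raise ValueError("sequence is shorter than the fixed prefix")
    rng = random.Random(seed)
    prefix = list(range(num_fixed))  # start, language, task tokens
    perms = []
    for _ in range(num_perms):
        tail = list(range(num_fixed, seq_len))
        rng.shuffle(tail)            # only word-token positions are permuted
        perms.append(prefix + tail)
    return perms


# Every sampled order begins with positions 0, 1, 2, so the language token
# at position 1 is always predicted from the same context.
orders = sample_permutations(seq_len=8)
```

Independent shuffles may repeat an order for short sequences; a real implementation might deduplicate or pair each order with its reverse, which is not shown here.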
 
  result2 = translate('/path/to/audio_file.mp3', model, processor, max_tokens=64)
  print(result2) # namedtuple: text: str, output_ids: torch.Tensor
+
+
+ # TODO: add detect_language usage
  ```
 
+ ### Fine-tuning
+
+ Notebook (TBA)
 
  ## Uses
 
 
  - **Optimizer:** MuonAdamW (custom implementation)
  - **Learning rate:** Linear Warmup (40 optimizer steps) + CosineAnnealing (3960 optimizer steps)
  - **Weight decay:** 0.1
+ - **Effective Batch size:** 64
  - **Number of optimizer steps:** 4000
  - **Number of epochs:** roughly 2 epochs
  - **Gradient Clip Value:** 0.5 (only for parameters trained by AdamW)
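
The learning-rate entry in the list above can be sketched as a scale factor applied to a peak learning rate. This is a minimal sketch under assumptions: the warmup is taken to rise linearly from 0 to the peak, the cosine phase is taken to decay to 0, and `lr_scale` is a hypothetical helper. The README does not state the peak value or the final value, so neither appears here.

```python
import math

WARMUP_STEPS = 40      # linear warmup, from the hyperparameter list
COSINE_STEPS = 3960    # cosine annealing, from the hyperparameter list
TOTAL_STEPS = WARMUP_STEPS + COSINE_STEPS  # 4000 optimizer steps


def lr_scale(step: int) -> float:
    """Multiplier in [0, 1] for the peak learning rate at a given optimizer step."""
    if step < WARMUP_STEPS:
        # Assumed: warmup starts at 0 and reaches 1.0 at step WARMUP_STEPS.
        return step / WARMUP_STEPS
    # Assumed: cosine decay from 1.0 down to 0.0 over COSINE_STEPS.
    progress = min((step - WARMUP_STEPS) / COSINE_STEPS, 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * progress))
```

Multiplying `lr_scale(step)` by the chosen peak learning rate gives the per-step value. The effective batch size of 64 matches the earlier entries of batch size 8 with 8 gradient-accumulation steps, so one optimizer step here corresponds to 8 forward and backward passes.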