Commit 7b0eb8e by Kimang18 (verified) · 1 Parent(s): 70f7aa2

Update README.md

  1. README.md +25 -22
README.md CHANGED
@@ -48,7 +48,7 @@ pipeline_tag: automatic-speech-recognition
  ### Model Description

- This is an ASR (Automatic Speech Recognition) model inspired by [PARSeq](https://github.com/baudm/parseq/tree/main) and [Whisper](https://github.com/openai/whisper/tree/main). It is smaller than Whisper model of openai. **TrorYongASR** supports both Khmer and English languages.

  <div align="center">
@@ -81,7 +81,7 @@ This is an ASR (Automatic Speech Recognition) model inspired by [PARSeq](https:/
  ## Evaluation

  <!-- This section describes the evaluation protocols and provides the results. -->
- The evaluation assesses two capabilities — language detection and transcription — on two datasets ([`google/fleurs`](https://huggingface.co/datasets/Kimang18/google-fleurs-km-kh) for Khmer and [`openslr/librispeech_asr`](https://huggingface.co/datasets/openslr/librispeech_asr) for English). All results are from the test split of each dataset, representing the model's generalization ability to unseen data.
  ### Testing Data
@@ -95,7 +95,7 @@ The evaluation assesses two capabilities — language detection and transcriptio
  | **librispeech.clean** | English | 2620 | Clean speech dataset for English transcription |
  </div>

- **Note:** All evaluation results below are from the **test split** of each dataset. For `google/fleurs`, audios longer than `30 seconds` are excluded from the evaluation.

  ### Metrics and Results
@@ -115,16 +115,20 @@ The evaluation assesses two capabilities — language detection and transcriptio
  | **F1-score** | Harmonic mean of precision and recall |
  </div>

- #### Language Detection Results

  <div align="center">

- | Model | Dataset | Precision | Recall | Accuracy | F1-score |
- |-------|---------|-----------|--------|----------|----------|
- | Tiny | Khmer (`fleurs`) | 100% | 100% | 100% | 100% |
- | | English (librispeech.clean) | 100% | 100% | 100% | 100% |
- | Small | Khmer (`fleurs`) | 100% | 100% | 100% | 100% |
- | | English (librispeech.clean) | 100% | 100% | 100% | 100% |
  </div>

  **Key Finding:** Both model sizes achieved perfect language detection performance on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio.
@@ -145,30 +149,29 @@ The evaluation assesses two capabilities — language detection and transcriptio
  | **Word Error Rate (WER)** | Proportion of words that are incorrect |
  </div>

- **Note on Token Error Rate:** Token Error Rate measures model's capability in predicting the next token given the audio input and the current sequence of tokens. This metric is weaker than Word Error Rate (WER) and Character Error Rate (CER) because it doesn't account for insertions, deletions, and substitutions as comprehensively. Token Error Rate is used here because Khmer text lacks word boundaries, making WER and CER calculations challenging without additional preprocessing.

- **Note on Translation Task:** The models are also trained for `translation` task, but evaluation is deferred to future work due to scarce data (2000 samples from Khmer audio to English text, and 1000 samples from English audio to Khmer text).

-
- #### Transcription Results

  <div align="center">

  | Model | Metric | Khmer (`fleurs`) | English (`librispeech.clean`) | Mixed (Khmer + English) |
  |-------|--------|-------|---------|---------|
- | **Tiny** | Token Error Rate | 56% | 19% | 29% |
  | | Character Error Rate (CER) | 60.71% | 20.98% | 32.89% |
- | | Word Error Rate (WER) | 86.16% | 31.13% | 46.53% |
- | **Small** | Token Error Rate | 46% | 10% | 19% |
  | | Character Error Rate (CER) | 35.31% | 7.08% | 15.54% |
- | | Word Error Rate (WER) | 50.70% | 12.95% | 23.52% |
  </div>

  **Key Observations:**
- - The tiny model shows strong performance on English (19% token error rate, 20.98% CER, 31.13% WER)
- - Performance drops significantly for Khmer (56% token error rate, 60.71% CER, 86.16% WER)
- - The small model shows strong performance on English (10% token error rate, 7.08% CER, 12.95% WER)
- - Performance for Khmer is moderate (46% token error rate, 35.31% CER, 50.70% WER)
  - The larger model benefits from increased embedding dimension (768 vs 384) and more layers (12 vs 4)

  **Note:** To compute `CER` and `WER`, whitespaces are added between words in Khmer text (Khmer text does not have word boundaries like English text). To do so, `khmercut` PyPI package is used to tokenize Khmer text into words, and then the words are joined back together with whitespaces.
 
  ### Model Description

+ This is an automatic speech recognition (ASR) model inspired by [PARSeq](https://github.com/baudm/parseq/tree/main) and [Whisper](https://github.com/openai/whisper/tree/main). **TrorYongASR** is smaller than OpenAI's Whisper model and supports only Khmer and English.

  <div align="center">
 
  ## Evaluation

  <!-- This section describes the evaluation protocols and provides the results. -->
+ The evaluation assesses two capabilities, language detection and transcription, on two datasets ([`google/fleurs`](https://huggingface.co/datasets/Kimang18/google-fleurs-km-kh) for Khmer and [`openslr/librispeech_asr`](https://huggingface.co/datasets/openslr/librispeech_asr) for English). All results are from the **test split** of each dataset and reflect how well the model generalizes to unseen data.

  ### Testing Data
 
  | **librispeech.clean** | English | 2620 | Clean speech dataset for English transcription |
  </div>

+ **Note:** All evaluation results below are from the **test split** of each dataset. Audio clips longer than 30 seconds are excluded from the evaluation, which is why `google/fleurs` has 765 examples instead of 771.

  ### Metrics and Results
 
  | **F1-score** | Harmonic mean of precision and recall |
  </div>

+ **Results:**

  <div align="center">

+ | Model | Metric | Khmer (`fleurs`) | English (`librispeech.clean`) |
+ |-------|--------|------------------|-------------------------------|
+ | Tiny | Precision | 100% | 100% |
+ | | Recall | 100% | 100% |
+ | | Accuracy | 100% | 100% |
+ | | F1-score | 100% | 100% |
+ | Small | Precision | 100% | 100% |
+ | | Recall | 100% | 100% |
+ | | Accuracy | 100% | 100% |
+ | | F1-score | 100% | 100% |
  </div>

  **Key Finding:** Both model sizes achieved perfect language detection performance on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio.
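
The four language detection metrics can be derived from reference and predicted language labels. The sketch below is only an illustration of the definitions: it treats Khmer as the positive class, and the function name `detection_metrics` and the label strings `"km"` and `"en"` are assumptions, not part of the model's API or the author's evaluation script.

```python
def detection_metrics(references, predictions, positive="km"):
    """Precision, recall, accuracy and F1 for one positive class."""
    pairs = list(zip(references, predictions))
    if not pairs:
        raise ValueError("at least one example is required")

    # Confusion-matrix counts for the positive class.
    tp = sum(r == positive and p == positive for r, p in pairs)
    fp = sum(r != positive and p == positive for r, p in pairs)
    fn = sum(r == positive and p != positive for r, p in pairs)
    tn = len(pairs) - tp - fp - fn

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs)
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Hypothetical labels, made up for illustration (not evaluation data).
refs = ["km", "km", "en", "en"]
preds = ["km", "en", "en", "en"]
metrics = detection_metrics(refs, preds)
print(metrics)
```

When every prediction matches its reference, as reported in the table, all four values equal 1.0.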
 
  | **Word Error Rate (WER)** | Proportion of words that are incorrect |
  </div>

+ **Note on Token Error Rate:** Token Error Rate measures the model's ability to predict the next token given the audio input and the current sequence of tokens. This metric is weaker than Word Error Rate (WER) and Character Error Rate (CER) because it does not account for insertions, deletions, substitutions, and autoregressive decoding as comprehensively. Token Error Rate is used here because Khmer text lacks word boundaries, which makes WER and CER hard to compute without additional preprocessing.
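
The README does not give a formula for Token Error Rate. One plausible reading of the description is a position-wise comparison: at each step the model sees the audio and the reference prefix, predicts one token, and the metric is the fraction of positions where that prediction differs from the reference. The sketch below implements only that assumed reading; `token_error_rate` and the token ids are hypothetical.

```python
def token_error_rate(reference_ids, predicted_ids):
    """Fraction of positions where the predicted token differs.

    Assumes one prediction per reference position, so both
    sequences must have the same length.
    """
    if len(reference_ids) != len(predicted_ids):
        raise ValueError("sequences must be aligned position by position")
    if not reference_ids:
        raise ValueError("reference must not be empty")
    errors = sum(r != p for r, p in zip(reference_ids, predicted_ids))
    return errors / len(reference_ids)


# Hypothetical token ids, made up for illustration.
reference = [12, 7, 99, 3, 2]
predicted = [12, 8, 99, 3, 5]
rate = token_error_rate(reference, predicted)
print(f"token error rate: {rate:.0%}")
```

Because the positions are aligned in advance, a comparison like this never has to handle inserted or dropped tokens, which is consistent with the note calling the metric weaker than WER and CER.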

+ **Note on Translation Task:** The models are also trained for the `translation` task, but its evaluation is deferred to future work because data is scarce: the pre-training data contains only 2000 examples of Khmer audio to English text and 1000 examples of English audio to Khmer text.

+ **Transcription Results:**

  <div align="center">

  | Model | Metric | Khmer (`fleurs`) | English (`librispeech.clean`) | Mixed (Khmer + English) |
  |-------|--------|-------|---------|---------|
+ | **Tiny** | Word Error Rate (WER) | 86.16% | 31.13% | 46.53% |
  | | Character Error Rate (CER) | 60.71% | 20.98% | 32.89% |
+ | | Token Error Rate | 56% | 19% | 29% |
+ | **Small** | Word Error Rate (WER) | 50.70% | 12.95% | 23.52% |
  | | Character Error Rate (CER) | 35.31% | 7.08% | 15.54% |
+ | | Token Error Rate | 46% | 10% | 19% |
  </div>

  **Key Observations:**
+ - The tiny model shows strong performance on English (31.13% WER, 20.98% CER, 19% token error rate)
+ - The tiny model's performance drops significantly for Khmer (86.16% WER, 60.71% CER, 56% token error rate)
+ - The small model shows strong performance on English (12.95% WER, 7.08% CER, 10% token error rate)
+ - The small model's performance for Khmer is moderate (50.70% WER, 35.31% CER, 46% token error rate)
  - The larger model benefits from increased embedding dimension (768 vs 384) and more layers (12 vs 4)

  **Note:** To compute `CER` and `WER`, whitespaces are added between words in Khmer text (Khmer text does not have word boundaries like English text). To do so, `khmercut` PyPI package is used to tokenize Khmer text into words, and then the words are joined back together with whitespaces.
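
The segment-then-join preprocessing described in this note can be sketched as follows. This is a minimal illustration, not the author's evaluation script: `edit_distance`, `spaced`, `wer` and `cer` are hypothetical helpers, and the `segment` argument stands in for a Khmer word segmenter such as the one provided by `khmercut`. The demo uses `str.split` on English text so that it runs without extra dependencies.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def spaced(text, segment):
    """Segment text into words and join them back with single spaces."""
    return " ".join(segment(text))


def wer(reference, hypothesis, segment=str.split):
    ref_words = spaced(reference, segment).split()
    hyp_words = spaced(hypothesis, segment).split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def cer(reference, hypothesis, segment=str.split):
    # Assumption: the inserted spaces count as characters.
    ref_chars = list(spaced(reference, segment))
    hyp_chars = list(spaced(hypothesis, segment))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# Made-up English example; for Khmer, pass a word segmenter as `segment`.
reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
word_error = wer(reference, hypothesis)
char_error = cer(reference, hypothesis)
print(f"WER={word_error:.2%} CER={char_error:.2%}")
```

Unlike a position-wise token comparison, the edit distance used here counts insertions, deletions and substitutions explicitly.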