Automatic Speech Recognition
Transformers
Safetensors
Khmer
English
troryongasr
custom_code
Kimang18 committed on
Commit
70f7aa2
·
verified ·
1 Parent(s): 86d04b8

Update README.md

Files changed (1)
  1. README.md +24 -27
README.md CHANGED
@@ -83,9 +83,7 @@ This is an ASR (Automatic Speech Recognition) model inspired by [PARSeq](https:/
83
  <!-- This section describes the evaluation protocols and provides the results. -->
84
  The evaluation assesses two capabilities — language detection and transcription — on two datasets ([`google/fleurs`](https://huggingface.co/datasets/Kimang18/google-fleurs-km-kh) for Khmer and [`openslr/librispeech_asr`](https://huggingface.co/datasets/openslr/librispeech_asr) for English). All results are from the test split of each dataset, representing the model's generalization ability to unseen data.
85
 
86
- ### Testing Data & Metrics
87
-
88
- #### Testing Data
89
 
90
  <!-- This should link to a Dataset Card if possible. -->
91
 
@@ -99,11 +97,11 @@ The evaluation assesses two capabilities — language detection and transcriptio
99
 
100
  **Note:** All evaluation results below are from the **test split** of each dataset. For `google/fleurs`, audio clips longer than 30 seconds are excluded from the evaluation.
101
 
102
- #### Metrics
103
 
104
  <!-- These are the evaluation metrics being used, ideally with a description of why. -->
105
 
106
- ##### Language Detection
107
 
108
  **Task:** Given audio input, detect the language.
109
 
@@ -117,26 +115,6 @@ The evaluation assesses two capabilities — language detection and transcriptio
117
  | **F1-score** | Harmonic mean of precision and recall |
118
  </div>
119
 
120
- ##### Transcription
121
-
122
- **Task:** Convert audio to text (transcription).
123
-
124
- <div align="center">
125
-
126
- | Metric | Description |
127
- |--------|-------------|
128
- | **Token Error Rate** | Proportion of incorrectly transcribed tokens |
129
- | **Character Error Rate (CER)** | Proportion of characters that are incorrect |
130
- | **Word Error Rate (WER)** | Proportion of words that are incorrect |
131
- </div>
132
-
133
- **Note on Token Error Rate:** Token Error Rate measures the model's ability to predict the next token given the audio input and the current sequence of tokens. It is a weaker metric than Word Error Rate (WER) and Character Error Rate (CER) because it does not account for insertions, deletions, and substitutions as comprehensively. It is reported here because Khmer text lacks word boundaries, which makes WER and CER difficult to compute without additional preprocessing.
134
-
135
- **Note on Translation Task:** The models are also trained for the `translation` task, but its evaluation is deferred to future work because data is scarce (2,000 samples from Khmer audio to English text and 1,000 samples from English audio to Khmer text).
136
-
137
-
138
- ### Results
139
-
140
  #### Language Detection Results
141
 
142
  <div align="center">
@@ -154,6 +132,24 @@ The evaluation assesses two capabilities — language detection and transcriptio
154
  **Note on Language Detection Performance:** The 100% language detection scores may appear unusually high. This is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
155
 
156
 
157
  #### Transcription Results
158
 
159
  <div align="center">
@@ -177,7 +173,8 @@ The evaluation assesses two capabilities — language detection and transcriptio
177
 
178
  **Note:** To compute `CER` and `WER`, whitespace is inserted between words in Khmer text, because Khmer text, unlike English text, does not mark word boundaries. The `khmercut` PyPI package is used to tokenize Khmer text into words, and the words are then joined back together with whitespace.
179
 
180
- ##### WER Comparison with Whisper
 
181
 
182
  | Tiny | Parameters | Khmer (`fleurs`) | English (`librispeech.clean`) |
183
  |-------|--------|---------------------------| --- |
@@ -196,7 +193,7 @@ The evaluation assesses two capabilities — language detection and transcriptio
196
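  - Error rates > 100% for Whisper on Khmer mean the output contains more errors (including inserted words) than the reference has words; this suggests the model may be overfitting to its training data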
197
 
198
 
199
- #### Result Summary
200
 
201
  **Language Detection:** Both model sizes scored 100% on all metrics (Precision, Recall, Accuracy, F1-score) on both datasets, so the model reliably distinguishes Khmer audio from English audio. This score is expected because during pre-training, the model permutes word tokens only from position 3 onward, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
202
 
 
83
  <!-- This section describes the evaluation protocols and provides the results. -->
84
  The evaluation assesses two capabilities — language detection and transcription — on two datasets ([`google/fleurs`](https://huggingface.co/datasets/Kimang18/google-fleurs-km-kh) for Khmer and [`openslr/librispeech_asr`](https://huggingface.co/datasets/openslr/librispeech_asr) for English). All results are from the test split of each dataset, representing the model's generalization ability to unseen data.
85
 
86
+ ### Testing Data
 
 
87
 
88
  <!-- This should link to a Dataset Card if possible. -->
89
 
 
97
 
98
  **Note:** All evaluation results below are from the **test split** of each dataset. For `google/fleurs`, audio clips longer than 30 seconds are excluded from the evaluation.
99
 
100
+ ### Metrics and Results
101
 
102
  <!-- These are the evaluation metrics being used, ideally with a description of why. -->
103
 
104
+ #### Language Detection
105
 
106
  **Task:** Given audio input, detect the language.
107
 
 
115
  | **F1-score** | Harmonic mean of precision and recall |
116
  </div>
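
The language detection metrics can be computed from paired reference and predicted labels. The snippet below is a minimal sketch, not the evaluation code used for this model: the function name `binary_metrics` and the label strings `"km"` and `"en"` are assumptions made for illustration.

```python
from typing import Dict, List


def binary_metrics(y_true: List[str], y_pred: List[str], positive: str) -> Dict[str, float]:
    """Precision, recall, accuracy and F1-score, treating `positive` as the target label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = len(pairs) - tp - fp - fn                               # true negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


# Hypothetical labels: "km" for Khmer audio, "en" for English audio.
scores = binary_metrics(["km", "km", "en", "en"], ["km", "en", "en", "en"], positive="km")
print(scores)
```

In this toy example one Khmer clip is misclassified as English, so precision stays at 1.0 while recall drops to 0.5.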
117
 
118
  #### Language Detection Results
119
 
120
  <div align="center">
 
132
  **Note on Language Detection Performance:** The 100% language detection scores may appear unusually high. This is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
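
The fixed-prefix permutation described in the note above can be illustrated with a short sketch. It only shows how a position ordering could keep the first three positions in place while shuffling the rest; the names `NUM_FIXED` and `permute_order` are hypothetical, and how the model consumes such an ordering during pre-training is not shown here.

```python
import random
from typing import List, Optional

# Assumption from the note: positions 0, 1, 2 hold the start, language and task tokens.
NUM_FIXED = 3


def permute_order(seq_len: int, rng: Optional[random.Random] = None) -> List[int]:
    """Return a position ordering that keeps the first NUM_FIXED positions
    in place and shuffles every later position."""
    rng = rng or random.Random()
    prefix = list(range(min(NUM_FIXED, seq_len)))  # never permuted
    tail = list(range(NUM_FIXED, seq_len))         # word-token positions
    rng.shuffle(tail)
    return prefix + tail


order = permute_order(8, random.Random(0))
print(order[:NUM_FIXED])  # the fixed prefix, always [0, 1, 2]
print(order[NUM_FIXED:])  # some shuffle of positions 3..7
```

Because position 1 always appears in the same place in every ordering, the language token is always predicted in the same context.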
133
 
134
 
135
+ #### Transcription
136
+
137
+ **Task:** Convert audio to text (transcription).
138
+
139
+ <div align="center">
140
+
141
+ | Metric | Description |
142
+ |--------|-------------|
143
+ | **Token Error Rate** | Proportion of incorrectly transcribed tokens |
144
+ | **Character Error Rate (CER)** | Proportion of characters that are incorrect |
145
+ | **Word Error Rate (WER)** | Proportion of words that are incorrect |
146
+ </div>
147
+
148
+ **Note on Token Error Rate:** Token Error Rate measures the model's ability to predict the next token given the audio input and the current sequence of tokens. It is a weaker metric than Word Error Rate (WER) and Character Error Rate (CER) because it does not account for insertions, deletions, and substitutions as comprehensively. It is reported here because Khmer text lacks word boundaries, which makes WER and CER difficult to compute without additional preprocessing.
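
One way to read the Token Error Rate described above is as a position-by-position comparison between reference tokens and next-token predictions. The sketch below assumes the predictions are already aligned one-to-one with the reference tokens; the function name and the token ids are hypothetical, and the actual evaluation code may differ.

```python
from typing import Sequence


def token_error_rate(target_ids: Sequence[int], predicted_ids: Sequence[int]) -> float:
    """Fraction of positions where the predicted token differs from the reference.

    Assumes both sequences are aligned position by position, so there is
    no notion of insertion or deletion, only per-position mismatches.
    """
    if len(target_ids) != len(predicted_ids):
        raise ValueError("sequences must be aligned and of equal length")
    if not target_ids:
        return 0.0
    wrong = sum(t != p for t, p in zip(target_ids, predicted_ids))
    return wrong / len(target_ids)


# Hypothetical token ids: one of four predictions is wrong.
print(token_error_rate([5, 9, 2, 7], [5, 9, 3, 7]))  # 0.25
```

Since the comparison is purely positional, this measure cannot express inserted or dropped tokens, unlike edit-distance metrics such as WER and CER.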
149
+
150
+ **Note on Translation Task:** The models are also trained for the `translation` task, but its evaluation is deferred to future work because data is scarce (2,000 samples from Khmer audio to English text and 1,000 samples from English audio to Khmer text).
151
+
152
+
153
  #### Transcription Results
154
 
155
  <div align="center">
 
173
 
174
  **Note:** To compute `CER` and `WER`, whitespace is inserted between words in Khmer text, because Khmer text, unlike English text, does not mark word boundaries. The `khmercut` PyPI package is used to tokenize Khmer text into words, and the words are then joined back together with whitespace.
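
The preprocessing in the note above can be sketched as follows. This is an illustration under stated assumptions, not the evaluation script: the helper names are hypothetical, the word segmenter is passed in as a plain function (for Khmer this would be the tokenizer from `khmercut`, whose exact call is not shown here), and CER is computed over the characters of the whitespace-joined string.

```python
from typing import Callable, List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum number of substitutions, deletions and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match when equal)
            ))
        prev = curr
    return prev[-1]


def space_words(text: str, segment: Callable[[str], List[str]]) -> str:
    """Segment text into words and join them back with single spaces."""
    return " ".join(w for w in segment(text) if w.strip())


def wer(ref: str, hyp: str) -> float:
    """Word error rate over whitespace-separated words."""
    ref_words = ref.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate over the characters of each string."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(list(ref), list(hyp)) / len(ref)


# Stand-in segmenter for the demo; a Khmer word tokenizer would be used instead.
segment = str.split
reference = space_words("hello world", segment)
hypothesis = space_words("hello there big wide world", segment)
print(wer(reference, hypothesis))  # 1.5
print(cer("abcd", "abed"))         # 0.25
```

The first result shows that WER can exceed 100%: three inserted words against a two-word reference give 3 / 2 = 1.5.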
175
 
176
+
177
+ #### WER Comparison with Whisper
178
 
179
  | Tiny | Parameters | Khmer (`fleurs`) | English (`librispeech.clean`) |
180
  |-------|--------|---------------------------| --- |
 
193
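  - Error rates > 100% for Whisper on Khmer mean the output contains more errors (including inserted words) than the reference has words; this suggests the model may be overfitting to its training data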
194
 
195
 
196
+ ### Result Summary
197
 
198
  **Language Detection:** Both model sizes scored 100% on all metrics (Precision, Recall, Accuracy, F1-score) on both datasets, so the model reliably distinguishes Khmer audio from English audio. This score is expected because during pre-training, the model permutes word tokens only from position 3 onward, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
199