Bartelds committed
Commit 0c242d4 · 1 Parent(s): efae98a

Upload checkpoint, sanitized config, and transcripts for ctc-dro_mms_set_4

Files changed (5)
  1. README.md +41 -0
  2. config.yaml +357 -0
  3. hyp.trn +0 -0
  4. ref.trn +0 -0
  5. valid.loss.best.pth +3 -0
README.md ADDED
@@ -0,0 +1,41 @@
+ ---
+ title: "CTC-DRO MMS-based ASR model - set 4"
+ language: multilingual
+ tags:
+ - asr
+ - ctc-dro
+ - MMS
+ license: cc-by-nc-4.0
+ ---
+
+ # CTC-DRO MMS-based ASR model - set 4
+
+ This repository contains a CTC-DRO MMS-based automatic speech recognition (ASR) model trained with ESPnet.
+ The model was trained on balanced training data from set 4.
+
+ ## Intended Use
+
+ This model is intended for ASR. Users can run inference using the provided checkpoint (`valid.loss.best.pth`) and configuration file (`config.yaml`):
+ ```python
+ import soundfile as sf
+ from espnet2.bin.asr_inference import Speech2Text
+
+ asr_train_config = "ctc-dro_mms_set_4/config.yaml"  # training config from this repository
+ asr_model_file = "ctc-dro_mms_set_4/valid.loss.best.pth"  # checkpoint from this repository
+
+ model = Speech2Text.from_pretrained(
+     asr_train_config=asr_train_config,
+     asr_model_file=asr_model_file,
+ )
+
+ speech, _ = sf.read("input.wav")  # config.yaml sets `fs: 16k`, so 16 kHz audio is expected
+ text, *_ = model(speech)[0]  # first element of the top hypothesis is the text
+
+ print("Recognized text:", text)
+ ```
+
+ ## How to Use
+
+ 1. Clone this repository.
+ 2. Use ESPnet’s inference scripts with the provided `config.yaml` and checkpoint file.
+ 3. Ensure any external resources referenced in `config.yaml` are available at the indicated relative paths.
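Step 3 of the README asks that resources referenced in `config.yaml` exist at their relative paths. The following is a minimal sketch of how that could be checked with the standard library alone. The helper names `find_relative_paths` and `report_missing` are hypothetical and not part of ESPnet, and the scan assumes that relative paths are written as `./...` scalar values, as they are in this commit's config. Some of these locations (for example `output_dir` or `download_dir`) may be created on demand rather than shipped.

```python
import re
from pathlib import Path

# Hypothetical helper pattern: matches "key: ./relative/path" lines in YAML text.
PATH_ENTRY = re.compile(r"^\s*([\w-]+):\s*(\./\S+)\s*$")


def find_relative_paths(config_text):
    """Return (key, path) pairs for scalar values that start with './'."""
    entries = []
    for line in config_text.splitlines():
        match = PATH_ENTRY.match(line)
        if match:
            entries.append((match.group(1), match.group(2)))
    return entries


def report_missing(config_text, base_dir="."):
    """Return the (key, path) pairs whose path does not exist under base_dir."""
    base = Path(base_dir)
    return [
        (key, rel)
        for key, rel in find_relative_paths(config_text)
        if not (base / rel).exists()
    ]


# Excerpt of the path-like values that appear in this commit's config.yaml.
sample = """\
frontend_conf:
  download_dir: ./hub
non_linguistic_symbols: ./nlsyms.txt
output_dir: ./inference_results
"""

for key, rel in find_relative_paths(sample):
    print(f"{key}: {rel}")
```

To check a cloned copy, one could pass the text of `config.yaml` and the directory inference is launched from to `report_missing` and inspect the returned list.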
config.yaml ADDED
@@ -0,0 +1,357 @@
+ accum_grad: 8
+ adapter: lora
+ adapter_conf: {}
+ allow_multi_rates: false
+ allow_variable_data_keys: false
+ aux_ctc_tasks: []
+ batch_bins: 1000000
+ batch_size: 4
+ batch_type: duration_language
+ best_model_criterion:
+ - - valid
+   - loss
+   - min
+ bpemodel: null
+ chunk_default_fs: null
+ chunk_excluded_key_prefixes: []
+ chunk_length: 500
+ chunk_shift_ratio: 0.5
+ cleaner: null
+ collect_stats: false
+ create_graph_in_tensorboard: false
+ ctc_conf:
+   accumulation: true
+   ctc_type: droctc
+   dro_group_count: 6
+   dro_q_epsilon: 1.0e-10
+   dro_step_size: 0.001
+   final_step_size: 5.0
+   init_strategy: uniform
+   initial_step_size: 0.5
+   laplace_smoothing: 0.5
+   max_epoch: 30
+   num_iters_per_epoch: 1200
+   running_mean_window: -1
+   scheduling: false
+   use_running_mean: false
+   warmup_steps: 0
+ cudnn_benchmark: false
+ cudnn_deterministic: true
+ cudnn_enabled: true
+ decoder: null
+ decoder_conf: {}
+ detect_anomaly: false
+ distributed: false
+ drop_last_iter: false
+ dry_run: false
+ duration_batch_length: -1
+ early_stopping_criterion:
+ - valid
+ - loss
+ - min
+ encoder: transformer
+ encoder_conf:
+   attention_dropout_rate: 0.1
+   attention_heads: 8
+   dropout_rate: 0.1
+   input_layer: conv2d2
+   linear_units: 1024
+   normalize_before: true
+   num_blocks: 2
+   output_size: 256
+   positional_dropout_rate: 0.1
+ exclude_weight_decay: false
+ exclude_weight_decay_conf: {}
+ fold_length:
+ - 80000
+ - 150
+ freeze_param: []
+ frontend: s3prl
+ frontend_conf:
+   download_dir: ./hub
+   frontend_conf:
+     path_or_url: facebook/mms-300m
+     upstream: hf_wav2vec2_custom
+   fs: 16k
+   multilayer_feature: true
+ g2p: null
+ grad_clip: 5.0
+ grad_clip_type: 2.0
+ grad_noise: false
+ ignore_init_mismatch: false
+ init: xavier_uniform
+ init_param: []
+ input_size: null
+ iterator_type: sequence
+ joint_net_conf: null
+ keep_nbest_models: 3
+ log_interval: null
+ log_level: INFO
+ max_cache_fd: 32
+ max_cache_size: 0.0
+ max_epoch: 30
+ model: espnet
+ model_conf:
+   ctc_weight: 1.0
+ multiple_iterator: false
+ multiprocessing_distributed: false
+ nbest_averaging_interval: 0
+ ngpu: 1
+ no_forward_run: false
+ noise_apply_prob: 1.0
+ noise_db_range: '13_15'
+ noise_scp: null
+ non_linguistic_symbols: ./nlsyms.txt
+ normalize: utterance_mvn
+ normalize_conf: {}
+ num_att_plot: 3
+ num_cache_chunks: 1024
+ num_iters_per_epoch: 1200
+ num_workers: 4
+ optim: adam
+ optim_conf:
+   lr: 0.0001
+   weight_decay: 1.0e-06
+ output_dir: ./inference_results
+ patience: null
+ postencoder: null
+ postencoder_conf: {}
+ preencoder: linear
+ preencoder_conf:
+   input_size: 1024
+   output_size: 80
+ preprocessor: default
+ preprocessor_conf: {}
+ pretrain_path: null
+ print_config: false
+ required:
+ - output_dir
+ - token_list
+ resume: true
+ rir_apply_prob: 1.0
+ rir_scp: null
+ save_strategy: all
+ scheduler: null
+ scheduler_conf: {}
+ seed: 0
+ sharded_ddp: false
+ short_noise_thres: 0.5
+ shuffle_within_batch: false
+ sort_batch: descending
+ sort_in_batch: descending
+ specaug: specaug
+ specaug_conf:
+   apply_freq_mask: true
+   apply_time_mask: true
+   apply_time_warp: true
+   freq_mask_width_range:
+   - 0
+   - 27
+   num_freq_mask: 2
+   num_time_mask: 10
+   time_mask_width_ratio_range:
+   - 0.0
+   - 0.05
+   time_warp_mode: bicubic
+   time_warp_window: 5
+ speech_volume_normalize: null
+ token_list:
+ - <blank>
+ - <unk>
+ - <space>
+ - E
+ - A
+ - O
+ - N
+ - S
+ - I
+ - ا
+ - L
+ - T
+ - R
+ - و
+ - D
+ - ن
+ - ر
+ - ی
+ - ي
+ - M
+ - U
+ - H
+ - P
+ - ک
+ - م
+ - C
+ - А
+ - Ӹ
+ - Н
+ - B
+ - ت
+ - س
+ - ل
+ - J
+ - K
+ - ہ
+ - Т
+ - ے
+ - G
+ - Ш
+ - К
+ - Е
+ - Л
+ - Ы
+ - V
+ - М
+ - ج
+ - Ӓ
+ - ه
+ - ب
+ - د
+ - О
+ - Y
+ - '[slv]'
+ - Р
+ - ڪ
+ - پ
+ - Z
+ - '[mrj]'
+ - F
+ - گ
+ - И
+ - В
+ - ئ
+ - Д
+ - '[sot]'
+ - ں
+ - '[spa]'
+ - W
+ - Q
+ - П
+ - Г
+ - ف
+ - ق
+ - С
+ - ع
+ - ش
+ - Ж
+ - ز
+ - ھ
+ - آ
+ - Č
+ - Í
+ - У
+ - ح
+ - '[urd]'
+ - Š
+ - ٹ
+ - چ
+ - Ь
+ - ٽ
+ - '[snd]'
+ - ڻ
+ - Й
+ - ط
+ - ص
+ - ٿ
+ - Ц
+ - خ
+ - Ó
+ - Я
+ - Á
+ - É
+ - Ч
+ - ۾
+ - '0'
+ - Ž
+ - З
+ - '1'
+ - ۽
+ - –
+ - ڏ
+ - Э
+ - ڊ
+ - —
+ - ڈ
+ - ء
+ - Ñ
+ - ڙ
+ - ِ
+ - '2'
+ - ٻ
+ - Х
+ - Ӱ
+ - ظ
+ - ض
+ - ث
+ - ڳ
+ - ،
+ - X
+ - ¡
+ - غ
+ - ڑ
+ - Ӧ
+ - ذ
+ - ¿
+ - '5'
+ - ڌ
+ - '3'
+ - ڀ
+ - ُ
+ - '9'
+ - Ú
+ - '4'
+ - '8'
+ - ۔
+ - '6'
+ - ٺ
+ - Ю
+ - »
+ - Б
+ - «
+ - ڇ
+ - ً
+ - ڃ
+ - '7'
+ - ڄ
+ - ؤ
+ - ڍ
+ - Ф
+ - َ
+ - ٰ
+ - ّ
+ - ڱ
+ - ”
+ - ژ
+ - ڦ
+ - Ё
+ - ؛
+ - ٍ
+ - Щ
+ - ؟
+ - ’
+ - ‘
+ - °
+ - ۃ
+ - إ
+ - Ć
+ - <sos/eos>
+ token_type: char
+ train_dtype: float32
+ unused_parameters: true
+ use_adapter: false
+ use_amp: false
+ use_lang_prompt: false
+ use_matplotlib: true
+ use_nlp_prompt: false
+ use_preprocessor: true
+ use_tensorboard: true
+ val_scheduler_criterion:
+ - valid
+ - loss
+ valid_batch_bins: null
+ valid_batch_size: null
+ valid_batch_type: null
+ valid_iterator_type: null
+ valid_max_cache_size: null
+ version: '202402'
+ write_collected_feats: false
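A few values in this config can be cross-checked against each other. The `token_list` contains six bracketed tokens that look like ISO 639-3 language codes (`[slv]`, `[mrj]`, `[sot]`, `[spa]`, `[urd]`, `[snd]`), which equals `dro_group_count: 6`. That the DRO groups correspond to these languages is an assumption here, since the commit does not state it. The sketch below also derives the number of optimizer updates per epoch from `accum_grad` and `num_iters_per_epoch`, assuming one optimizer step per `accum_grad` iterations, which is how gradient accumulation usually works. The helper `language_tags` is hypothetical, and the token list is a short excerpt rather than the full list.

```python
import re

# Excerpt of token_list from config.yaml; the full list is much longer.
token_excerpt = [
    "<blank>", "<unk>", "<space>", "E", "A",
    "[slv]", "[mrj]", "[sot]", "[spa]", "[urd]", "[snd]",
    "0", "<sos/eos>",
]

# A language tag is assumed to be three lowercase letters in square brackets.
LANG_TAG = re.compile(r"^\[([a-z]{3})\]$")


def language_tags(tokens):
    """Return the three-letter codes of tokens shaped like '[xyz]'."""
    tags = []
    for token in tokens:
        match = LANG_TAG.match(token)
        if match:
            tags.append(match.group(1))
    return tags


# Values copied from config.yaml.
dro_group_count = 6
accum_grad = 8
num_iters_per_epoch = 1200
max_epoch = 30

tags = language_tags(token_excerpt)
print("language tags:", tags)
print("matches dro_group_count:", len(tags) == dro_group_count)

# One optimizer step per accum_grad iterations (assumption stated above).
updates_per_epoch = num_iters_per_epoch // accum_grad
print("updates per epoch:", updates_per_epoch)
print("upper bound on total updates:", updates_per_epoch * max_epoch)
```

Under these assumptions, the run performs 150 optimizer updates per epoch and at most 4500 over 30 epochs.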
hyp.trn ADDED
The diff for this file is too large to render. See raw diff
 
ref.trn ADDED
The diff for this file is too large to render. See raw diff
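Since the diffs for `hyp.trn` and `ref.trn` are not rendered, their layout is not visible in this commit. The sketch below assumes the common sclite-style `trn` layout, one utterance per line with the transcript followed by the utterance ID in parentheses, and shows how a character error rate could be computed from such a pair. The function names and the two sample lines are invented for illustration. The real files may be tokenized differently (for example with space-separated characters), in which case the unit of comparison would need adjusting.

```python
def parse_trn(text):
    """Map utterance ID -> transcript for lines shaped 'transcript (utt_id)'."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith(")") or "(" not in line:
            continue  # skip blank or malformed lines
        transcript, _, utt_id = line[:-1].rpartition("(")
        entries[utt_id.strip()] = transcript.strip()
    return entries


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def character_error_rate(ref_trn, hyp_trn):
    """Total character edits divided by total reference characters."""
    refs, hyps = parse_trn(ref_trn), parse_trn(hyp_trn)
    errors = total = 0
    for utt_id, ref in refs.items():
        hyp = hyps.get(utt_id, "")  # a missing hypothesis counts as all deletions
        errors += edit_distance(ref, hyp)
        total += len(ref)
    return errors / total if total else 0.0


# Invented sample lines, not taken from the repository's files.
ref_sample = "HOLA MUNDO (utt1)\n"
hyp_sample = "HOLA MUNDA (utt1)\n"
cer = character_error_rate(ref_sample, hyp_sample)
print(f"CER: {cer:.2f}")
```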
 
valid.loss.best.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0a1607e267c052e015ac0422604639b451adcf03fb26df3ab1d01dc667de10df
+ size 1280860661
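The three lines above are a Git LFS pointer, not the checkpoint itself: the actual file is about 1.28 GB and is identified by its SHA-256 digest. As a minimal sketch, a downloaded `valid.loss.best.pth` could be checked against the pointer as follows. The function names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the local file path in the final comment is an assumption about where the checkpoint was saved.

```python
import hashlib


def parse_lfs_pointer(pointer_text):
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(
        line.split(" ", 1) for line in pointer_text.strip().splitlines()
    )
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at path has the pointer's size and digest."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so a multi-gigabyte checkpoint is never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(
    """\
version https://git-lfs.github.com/spec/v1
oid sha256:0a1607e267c052e015ac0422604639b451adcf03fb26df3ab1d01dc667de10df
size 1280860661
"""
)
print(pointer["algo"], pointer["size"])

# After downloading the checkpoint (path is an assumption):
# matches_pointer("ctc-dro_mms_set_4/valid.loss.best.pth", pointer)
```

A size or digest mismatch typically means the clone fetched only the pointer file or the download was interrupted.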