Maverick1713 committed on
Commit f7b69cd · verified · 1 Parent(s): c589b7a

Update README.md

  1. README.md +319 -320
README.md CHANGED
@@ -1,320 +1,319 @@
1
- Fine-tuned Wav2Vec2 on Hindi using the following datasets:
2
-
3
- - [Common Voice](https://huggingface.co/datasets/common_voice),
4
-
5
- - [Indic TTS- IITM](https://www.iitm.ac.in/donlab/tts/index.php) and
6
-
7
- - [IIITH - Indic Speech Datasets](http://speech.iiit.ac.in/index.php/research-svl/69.html)
8
-
9
- The Indic datasets are well balanced across gender and accents. However the CommonVoice dataset is skewed towards male voices
10
-
11
- Fine-tuned on Wav2Vec2 using Hindi dataset :: 60 epochs >> 17.05% WER
12
-
13
- When using this model, make sure that your speech input is sampled at 16kHz.
14
-
15
- ## Usage
16
-
17
- The model can be used directly (without a language model) as follows:
18
-
19
- ```python
20
-
21
- import torch
22
-
23
- import torchaudio
24
-
25
- from datasets import load_dataset
26
-
27
- from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
28
-
29
- test_dataset = load_dataset("common_voice", "hi", split="test")
30
-
31
-
32
-
33
- processor = Wav2Vec2Processor.from_pretrained("Maverick1713/Hindi-ASR")
34
-
35
- model = Wav2Vec2ForCTC.from_pretrained("Maverick1713/Hindi-ASR")
36
-
37
-
38
-
39
- resampler = torchaudio.transforms.Resample(48_000, 16_000)
40
-
41
-
42
-
43
-
44
- def speech_file_to_array_fn(batch):
45
-
46
- speech_array, sampling_rate = torchaudio.load(batch["path"])
47
-
48
- batch["speech"] = resampler(speech_array).squeeze().numpy()
49
-
50
- return batch
51
-
52
-
53
-
54
- test_dataset = test_dataset.map(speech_file_to_array_fn)
55
-
56
- inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)
57
-
58
-
59
-
60
- with torch.no_grad():
61
-
62
- logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
63
-
64
-
65
-
66
- predicted_ids = torch.argmax(logits, dim=-1)
67
-
68
- print("Prediction:", processor.batch_decode(predicted_ids))
69
-
70
- print("Reference:", test_dataset["sentence"][:2])
71
-
72
- ```
73
-
74
- ## Predictions
75
-
76
- _Some good ones ..... _
77
-
78
- | Predictions | Reference |
79
-
80
- |-------|-------|
81
-
82
- |फिर वो सूरज तारे पहाड बारिश पदछड़ दिन रात शाम नदी बर्फ़ समुद्र धुंध हवा कुछ भी हो सकती है | फिर वो सूरज तारे पहाड़ बारिश पतझड़ दिन रात शाम नदी बर्फ़ समुद्र धुंध हवा कुछ भी हो सकती है |
83
-
84
- | इस कारण जंगल में बडी दूर स्थित राघव के आश्रम में लोघ कम आने लगे और अधिकांश भक्त सुंदर के आश्रम में जाने लगे | इस कारण जंगल में बड़ी दूर स्थित राघव के आश्रम में लोग कम आने लगे और अधिकांश भक्त सुन्दर के आश्रम में जाने लगे |
85
-
86
- | अपने बचन के अनुसार शुभमूर्त पर अनंत दक्षिणी पर्वत गया और मंत्रों का जप करके सरोवर में उतरा | अपने बचन के अनुसार शुभमुहूर्त पर अनंत दक्षिणी पर्वत गया और मंत्रों का जप करके सरोवर में उतरा |
87
-
88
- _Some crappy stuff .... _
89
-
90
- | Predictions | Reference |
91
-
92
- |-------|-------|
93
-
94
- | वस गनिल साफ़ है। | उसका दिल साफ़ है। |
95
-
96
- | चाय वा एक कुछ लैंगे हब | चायवाय कुछ लेंगे आप |
97
-
98
- | टॉम आधे है स्कूल हें है | टॉम अभी भी स्कूल में है |
99
-
100
- ## Evaluation
101
-
102
- The model can be evaluated as follows on the following two datasets:
103
-
104
- 1. Custom dataset created from 20% of Indic, IIITH and CV (test): WER 17.xx%
105
-
106
- 2. CommonVoice Hindi test dataset: WER 56.xx%
107
-
108
- Links to the datasets are provided above (check the links at the start of the README)
109
-
110
- train-test csv files are shared on the following gdrive links:
111
-
112
- a. IIITH [train](https://storage.googleapis.com/indic-dataset/train_test_splits/iiit_hi_train.csv) [test](https://storage.googleapis.com/indic-dataset/train_test_splits/iiit_hi_test.csv)
113
-
114
- b. Indic TTS [train](https://storage.googleapis.com/indic-dataset/train_test_splits/indic_train_full.csv) [test](https://storage.googleapis.com/indic-dataset/train_test_splits/indic_test_full.csv)
115
-
116
- Update the audio_path as per your local file structure.
117
-
118
- ```python
119
-
120
- import torch
121
-
122
- import torchaudio
123
-
124
- from datasets import load_dataset, load_metric
125
-
126
- from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
127
-
128
- import re
129
-
130
-
131
-
132
-
133
- test_dataset = load_dataset("common_voice", "hi", split="test")
134
-
135
-
136
-
137
- indic = load_dataset("csv", data_files= {'train':"/workspace/data/hi2/indic_train_full.csv",
138
-
139
- "test": "/workspace/data/hi2/indic_test_full.csv"}, download_mode="force_redownload")
140
-
141
- iiith = load_dataset("csv", data_files= {"train": "/workspace/data/hi2/iiit_hi_train.csv",
142
-
143
- "test": "/workspace/data/hi2/iiit_hi_test.csv"}, download_mode="force_redownload")
144
-
145
-
146
-
147
- split = ['train', 'test', 'validation', 'other', 'invalidated']
148
-
149
-
150
-
151
- for sp in split:
152
-
153
- common_voice[sp] = common_voice[sp].remove_columns(['client_id', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'])
154
-
155
-
156
-
157
- common_voice = common_voice.rename_column('path', 'audio_path')
158
-
159
- common_voice = common_voice.rename_column('sentence', 'target_text')
160
-
161
-
162
-
163
- train_dataset = datasets.concatenate_datasets([indic['train'], iiith['train'], common_voice['train']])
164
-
165
- test_dataset = datasets.concatenate_datasets([indic['test'], iiith['test'], common_voice['test'], common_voice['validation']])
166
-
167
-
168
-
169
-
170
-
171
- wer = load_metric("wer")
172
-
173
-
174
-
175
- processor = Wav2Vec2Processor.from_pretrained("Maverick1713/Hindi-ASR")
176
-
177
- model = Wav2Vec2ForCTC.from_pretrained("Maverick1713/Hindi-ASR")
178
-
179
- model.to("cuda")
180
-
181
-
182
-
183
- chars_to_ignore_regex = '[\,\?\.\!\-\'\;\:\"\“\%\‘\”\�Utrnle\_]'
184
-
185
- unicode_ignore_regex = r'[dceMaWpmFui\xa0\u200d]' # Some unwanted unicode chars
186
-
187
- resampler = torchaudio.transforms.Resample(48_000, 16_000)
188
-
189
-
190
-
191
-
192
-
193
- def speech_file_to_array_fn(batch):
194
-
195
- batch["target_text"] = re.sub(chars_to_ignore_regex, '', batch["target_text"])
196
-
197
- batch["target_text"] = re.sub(unicode_ignore_regex, '', batch["target_text"])
198
-
199
-
200
-
201
- speech_array, sampling_rate = torchaudio.load(batch["audio_path"])
202
-
203
- batch["speech"] = resampler(speech_array).squeeze().numpy()
204
-
205
- return batch
206
-
207
-
208
-
209
- test_dataset = test_dataset.map(speech_file_to_array_fn)
210
-
211
-
212
-
213
-
214
- def evaluate(batch):
215
-
216
- inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
217
-
218
- with torch.no_grad():
219
-
220
- logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
221
-
222
- pred_ids = torch.argmax(logits, dim=-1)
223
-
224
- batch["pred_strings"] = processor.batch_decode(pred_ids)
225
-
226
- return batch
227
-
228
-
229
-
230
- result = test_dataset.map(evaluate, batched=True, batch_size=8)
231
-
232
- print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
233
-
234
- ```
235
-
236
- **Test Result on custom dataset**: 17.23 %
237
-
238
- ```python
239
-
240
- import torch
241
-
242
- import torchaudio
243
-
244
- from datasets import load_dataset, load_metric
245
-
246
- from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
247
-
248
- import re
249
-
250
-
251
-
252
- test_dataset = load_dataset("common_voice", "hi", split="test")
253
-
254
- wer = load_metric("wer")
255
-
256
-
257
-
258
- processor = Wav2Vec2Processor.from_pretrained("Maverick1713/Hindi-ASR")
259
-
260
- model = Wav2Vec2ForCTC.from_pretrained("Maverick1713/Hindi-ASR")
261
-
262
- model.to("cuda")
263
-
264
-
265
-
266
- chars_to_ignore_regex = '[\,\?\.\!\-\'\;\:\"\“\%\‘\”\�Utrnle\_]'
267
-
268
- unicode_ignore_regex = r'[dceMaWpmFui\xa0\u200d]'
269
-
270
- resampler = torchaudio.transforms.Resample(48_000, 16_000)
271
-
272
-
273
-
274
-
275
-
276
- def speech_file_to_array_fn(batch):
277
-
278
- batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).sub(unicode_ignore_regex, '', batch["sentence"])
279
-
280
- speech_array, sampling_rate = torchaudio.load(batch["path"])
281
-
282
- batch["speech"] = resampler(speech_array).squeeze().numpy()
283
-
284
- return batch
285
-
286
-
287
-
288
- test_dataset = test_dataset.map(speech_file_to_array_fn)
289
-
290
-
291
-
292
-
293
-
294
- def evaluate(batch):
295
-
296
- inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
297
-
298
- with torch.no_grad():
299
-
300
- logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
301
-
302
- pred_ids = torch.argmax(logits, dim=-1)
303
-
304
- batch["pred_strings"] = processor.batch_decode(pred_ids)
305
-
306
- return batch
307
-
308
-
309
-
310
- result = test_dataset.map(evaluate, batched=True, batch_size=8)
311
-
312
- print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
313
-
314
- ```
315
-
316
- **Test Result on CommonVoice**: 56.46 %
317
-
318
- ## Training
319
-
320
- The Common Voice `train`, `validation`, datasets were used for training as well as
 
1
+ Fine-tuned Wav2Vec2 on Hindi using the following datasets:
2
+
3
+ - [Common Voice](https://huggingface.co/datasets/common_voice),
4
+
5
+ - [Indic TTS- IITM](https://www.iitm.ac.in/donlab/tts/index.php) and
6
+
7
+
8
+ The Indic datasets are well balanced across gender and accents. However the CommonVoice dataset is skewed towards male voices
9
+
10
+ Fine-tuned on Wav2Vec2 using Hindi dataset :: 60 epochs >> 17.05% WER
11
+
12
+ When using this model, make sure that your speech input is sampled at 16kHz.
13
+
14
+ ## Usage
15
+
16
+ The model can be used directly (without a language model) as follows:
17
+
18
+ ```python
19
+
20
+ import torch
21
+
22
+ import torchaudio
23
+
24
+ from datasets import load_dataset
25
+
26
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
27
+
28
+ test_dataset = load_dataset("common_voice", "hi", split="test")
29
+
30
+
31
+
32
+ processor = Wav2Vec2Processor.from_pretrained("Maverick1713/Hindi-ASR")
33
+
34
+ model = Wav2Vec2ForCTC.from_pretrained("Maverick1713/Hindi-ASR")
35
+
36
+
37
+
38
+ resampler = torchaudio.transforms.Resample(48_000, 16_000)
39
+
40
+
41
+
42
+
43
+ def speech_file_to_array_fn(batch):
44
+
45
+ speech_array, sampling_rate = torchaudio.load(batch["path"])
46
+
47
+ batch["speech"] = resampler(speech_array).squeeze().numpy()
48
+
49
+ return batch
50
+
51
+
52
+
53
+ test_dataset = test_dataset.map(speech_file_to_array_fn)
54
+
55
+ inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)
56
+
57
+
58
+
59
+ with torch.no_grad():
60
+
61
+ logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
62
+
63
+
64
+
65
+ predicted_ids = torch.argmax(logits, dim=-1)
66
+
67
+ print("Prediction:", processor.batch_decode(predicted_ids))
68
+
69
+ print("Reference:", test_dataset["sentence"][:2])
70
+
71
+ ```
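
The usage snippet takes `torch.argmax` over the logits and passes the ids to `processor.batch_decode`. For a CTC model, that decode step roughly amounts to collapsing repeated frame predictions and dropping the blank token. The following is a minimal pure-Python sketch of that idea, not the library's implementation; the toy vocabulary, the `blank_id` value and the function name `ctc_greedy_decode` are assumptions made for illustration only.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0):
    """Collapse consecutive duplicate ids, then drop the CTC blank id."""
    # groupby yields one key per run of identical consecutive ids.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_token[i] for i in collapsed if i != blank_id)


# Hypothetical toy vocabulary; the real one ships with the processor,
# and the real tokenizer uses a word-delimiter token instead of a space.
vocab = {0: "<pad>", 1: "ह", 2: "म", 3: " "}

# One predicted id per audio frame (what argmax over the logits would give).
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3, 1, 0, 2]
decoded = ctc_greedy_decode(frames, vocab)
print(decoded)
```

A blank between two identical ids keeps them as separate characters, which is how CTC represents doubled letters.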
72
+
73
+ ## Predictions
74
+
75
+ _Some good ones ..... _
76
+
77
+ | Predictions | Reference |
78
+
79
+ |-------|-------|
80
+
81
+ |फिर वो सूरज तारे पहाड बारिश पदछड़ दिन रात शाम नदी बर्फ़ समुद्र धुंध हवा कुछ भी हो सकती है | फिर वो सूरज तारे पहाड़ बारिश पतझड़ दिन रात शाम नदी बर्फ़ समुद्र धुंध हवा कुछ भी हो सकती है |
82
+
83
+ | इस कारण जंगल में बडी दूर स्थित राघव के आश्रम में लोघ कम आने लगे और अधिकांश भक्त सुंदर के आश्रम में जाने लगे | इस कारण जंगल में बड़ी दूर स्थित राघव के आश्रम में लोग कम आने लगे और अधिकांश भक्त सुन्दर के आश्रम में जाने लगे |
84
+
85
+ | अपने बचन के अनुसार शुभमूर्त पर अनंत दक्षिणी पर्वत गया और मंत्रों का जप करके सरोवर में उतरा | अपने बचन के अनुसार शुभमुहूर्त पर अनंत दक्षिणी पर्वत गया और मंत्रों का जप करके सरोवर में उतरा |
86
+
87
+ _Some crappy stuff .... _
88
+
89
+ | Predictions | Reference |
90
+
91
+ |-------|-------|
92
+
93
+ | वस गनिल साफ़ है। | उसका दिल साफ़ है। |
94
+
95
+ | चाय वा एक कुछ लैंगे हब | चायवाय कुछ लेंगे आप |
96
+
97
+ | टॉम आधे है स्कूल हें है | टॉम अभी भी स्कूल में है |
98
+
99
+ ## Evaluation
100
+
101
+ The model can be evaluated as follows on the following two datasets:
102
+
103
+ 1. Custom dataset created from 20% of Indic, IIITH and CV (test): WER 17.xx%
104
+
105
+ 2. CommonVoice Hindi test dataset: WER 56.xx%
106
+
107
+ Links to the datasets are provided above (check the links at the start of the README)
108
+
109
+ train-test csv files are shared on the following gdrive links:
110
+
111
+ a. IIITH [train](https://storage.googleapis.com/indic-dataset/train_test_splits/iiit_hi_train.csv) [test](https://storage.googleapis.com/indic-dataset/train_test_splits/iiit_hi_test.csv)
112
+
113
+ b. Indic TTS [train](https://storage.googleapis.com/indic-dataset/train_test_splits/indic_train_full.csv) [test](https://storage.googleapis.com/indic-dataset/train_test_splits/indic_test_full.csv)
114
+
115
+ Update the audio_path as per your local file structure.
116
+
117
+ ```python
118
+
119
+ import torch
120
+
121
+ import torchaudio
122
+
123
+ from datasets import load_dataset, load_metric
124
+
125
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
126
+
127
+ import re
128
+
129
+
130
+
131
+
132
+ test_dataset = load_dataset("common_voice", "hi", split="test")
133
+
134
+
135
+
136
+ indic = load_dataset("csv", data_files= {'train':"/workspace/data/hi2/indic_train_full.csv",
137
+
138
+ "test": "/workspace/data/hi2/indic_test_full.csv"}, download_mode="force_redownload")
139
+
140
+ iiith = load_dataset("csv", data_files= {"train": "/workspace/data/hi2/iiit_hi_train.csv",
141
+
142
+ "test": "/workspace/data/hi2/iiit_hi_test.csv"}, download_mode="force_redownload")
143
+
144
+
145
+
146
+ split = ['train', 'test', 'validation', 'other', 'invalidated']
147
+
148
+
149
+
150
+ for sp in split:
151
+
152
+ common_voice[sp] = common_voice[sp].remove_columns(['client_id', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'])
153
+
154
+
155
+
156
+ common_voice = common_voice.rename_column('path', 'audio_path')
157
+
158
+ common_voice = common_voice.rename_column('sentence', 'target_text')
159
+
160
+
161
+
162
+ train_dataset = datasets.concatenate_datasets([indic['train'], iiith['train'], common_voice['train']])
163
+
164
+ test_dataset = datasets.concatenate_datasets([indic['test'], iiith['test'], common_voice['test'], common_voice['validation']])
165
+
166
+
167
+
168
+
169
+
170
+ wer = load_metric("wer")
171
+
172
+
173
+
174
+ processor = Wav2Vec2Processor.from_pretrained("Maverick1713/Hindi-ASR")
175
+
176
+ model = Wav2Vec2ForCTC.from_pretrained("Maverick1713/Hindi-ASR")
177
+
178
+ model.to("cuda")
179
+
180
+
181
+
182
+ chars_to_ignore_regex = '[\,\?\.\!\-\'\;\:\"\“\%\‘\”\�Utrnle\_]'
183
+
184
+ unicode_ignore_regex = r'[dceMaWpmFui\xa0\u200d]' # Some unwanted unicode chars
185
+
186
+ resampler = torchaudio.transforms.Resample(48_000, 16_000)
187
+
188
+
189
+
190
+
191
+
192
+ def speech_file_to_array_fn(batch):
193
+
194
+ batch["target_text"] = re.sub(chars_to_ignore_regex, '', batch["target_text"])
195
+
196
+ batch["target_text"] = re.sub(unicode_ignore_regex, '', batch["target_text"])
197
+
198
+
199
+
200
+ speech_array, sampling_rate = torchaudio.load(batch["audio_path"])
201
+
202
+ batch["speech"] = resampler(speech_array).squeeze().numpy()
203
+
204
+ return batch
205
+
206
+
207
+
208
+ test_dataset = test_dataset.map(speech_file_to_array_fn)
209
+
210
+
211
+
212
+
213
+ def evaluate(batch):
214
+
215
+ inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
216
+
217
+ with torch.no_grad():
218
+
219
+ logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
220
+
221
+ pred_ids = torch.argmax(logits, dim=-1)
222
+
223
+ batch["pred_strings"] = processor.batch_decode(pred_ids)
224
+
225
+ return batch
226
+
227
+
228
+
229
+ result = test_dataset.map(evaluate, batched=True, batch_size=8)
230
+
231
+ print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
232
+
233
+ ```
234
+
235
+ **Test Result on custom dataset**: 17.23 %
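
For readers unfamiliar with the metric, the sketch below shows how a word error rate can be computed for a single sentence pair: the word-level edit distance divided by the number of reference words. It is an illustrative stand-in, not the `wer` metric loaded in the script above (which aggregates over the whole test set); the function name `word_error_rate` is hypothetical. The sentence pair is taken from the predictions table in this README.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # delete a reference word
                curr[j - 1] + 1,                       # insert a hypothesis word
                prev[j - 1] + (ref_word != hyp_word),  # substitute or match
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "टॉम अभी भी स्कूल में है"
hypothesis = "टॉम आधे है स्कूल हें है"
score = word_error_rate(reference, hypothesis)
print("WER: {:.2f}".format(100 * score))
```

Note the `{:.2f}` format spec here: it limits the output to two decimal places, whereas `{:2f}` sets a minimum field width and keeps the default six decimals.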
236
+
237
+ ```python
238
+
239
+ import torch
240
+
241
+ import torchaudio
242
+
243
+ from datasets import load_dataset, load_metric
244
+
245
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
246
+
247
+ import re
248
+
249
+
250
+
251
+ test_dataset = load_dataset("common_voice", "hi", split="test")
252
+
253
+ wer = load_metric("wer")
254
+
255
+
256
+
257
+ processor = Wav2Vec2Processor.from_pretrained("Maverick1713/Hindi-ASR")
258
+
259
+ model = Wav2Vec2ForCTC.from_pretrained("Maverick1713/Hindi-ASR")
260
+
261
+ model.to("cuda")
262
+
263
+
264
+
265
+ chars_to_ignore_regex = '[\,\?\.\!\-\'\;\:\"\“\%\‘\”\�Utrnle\_]'
266
+
267
+ unicode_ignore_regex = r'[dceMaWpmFui\xa0\u200d]'
268
+
269
+ resampler = torchaudio.transforms.Resample(48_000, 16_000)
270
+
271
+
272
+
273
+
274
+
275
+ def speech_file_to_array_fn(batch):
276
+
277
+ batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).sub(unicode_ignore_regex, '', batch["sentence"])
278
+
279
+ speech_array, sampling_rate = torchaudio.load(batch["path"])
280
+
281
+ batch["speech"] = resampler(speech_array).squeeze().numpy()
282
+
283
+ return batch
284
+
285
+
286
+
287
+ test_dataset = test_dataset.map(speech_file_to_array_fn)
288
+
289
+
290
+
291
+
292
+
293
+ def evaluate(batch):
294
+
295
+ inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
296
+
297
+ with torch.no_grad():
298
+
299
+ logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
300
+
301
+ pred_ids = torch.argmax(logits, dim=-1)
302
+
303
+ batch["pred_strings"] = processor.batch_decode(pred_ids)
304
+
305
+ return batch
306
+
307
+
308
+
309
+ result = test_dataset.map(evaluate, batched=True, batch_size=8)
310
+
311
+ print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
312
+
313
+ ```
314
+
315
+ **Test Result on CommonVoice**: 56.46 %
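
In the Common Voice script above, the normalization line calls `.sub(...)` on the string returned by `re.sub`, which would raise an `AttributeError` because `str` has no `sub` method. A minimal sketch of the likely intent, applying the two patterns one after the other, is shown below. The patterns are copied as-is from the script (written as raw strings, and including the Latin letters they contain, whose purpose the README does not explain); the helper names `normalize_text` and `prepare_example` are hypothetical, and audio loading is left out.

```python
import re

# Patterns copied from the evaluation script, rewritten as raw strings.
CHARS_TO_IGNORE = r"[,?.!\-';:\"“%‘”�Utrnle_]"
UNICODE_TO_IGNORE = r"[dceMaWpmFui\xa0\u200d]"


def normalize_text(sentence: str) -> str:
    """Apply both clean-up patterns in sequence."""
    sentence = re.sub(CHARS_TO_IGNORE, "", sentence)
    return re.sub(UNICODE_TO_IGNORE, "", sentence)


def prepare_example(batch: dict) -> dict:
    """Normalize the transcript column of one example (audio handling omitted)."""
    batch["sentence"] = normalize_text(batch["sentence"])
    return batch


cleaned = prepare_example({"sentence": "चायवाय, कुछ लेंगे आप?"})["sentence"]
print(cleaned)
```

The same two-step form is what the custom-dataset script already uses on its `target_text` column.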
316
+
317
+ ## Training
318
+
319
+ The Common Voice `train`, `validation`, datasets were used for training as well as