bangla-speech-processing
/

BanglaASR

Automatic Speech Recognition


saiful9379 committed on Jul 2, 2023

Commit fb730d7 · 1 Parent(s): b3c405c

update readme


README.md +78 -1


@@ -11,4 +11,81 @@ widget:
 - example_title: sample 3
   src: https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31617644.mp3
 pipeline_tag: automatic-speech-recognition
----

+---
+BanglaASR is a Whisper model fine-tuned for Bangla automatic speech recognition on the Bangla portion of the Mozilla Common Voice dataset.
+Training used about 40k training samples and 7k validation samples, around 400 hours of audio in total. The model was trained for 12,000 steps
+and reached a word error rate (WER) of 4.58%.
+```py
+import librosa
+import numpy as np
+import torch
+import torchaudio
+from transformers import (
+    WhisperFeatureExtractor,
+    WhisperForConditionalGeneration,
+    WhisperProcessor,
+)
+
+# Run on the GPU when one is available, otherwise fall back to the CPU.
+device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+
+# Audio to transcribe. Whether torchaudio can read a URL directly depends on the
+# installed backend; if loading fails, download the file and pass a local path.
+mp3_path = "https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31515636.mp3"
+model_path = "bangla-speech-processing/BanglaASR"
+
+# The processor bundles the tokenizer, so no separate tokenizer is needed for decoding.
+feature_extractor = WhisperFeatureExtractor.from_pretrained(model_path)
+processor = WhisperProcessor.from_pretrained(model_path)
+model = WhisperForConditionalGeneration.from_pretrained(model_path).to(device)
+
+# Load the audio, keep the first channel, and resample to the 16 kHz rate Whisper expects.
+speech_array, sampling_rate = torchaudio.load(mp3_path, format="mp3")
+speech_array = speech_array[0].numpy()
+speech_array = librosa.resample(np.asarray(speech_array), orig_sr=sampling_rate, target_sr=16000)
+
+# Convert the waveform into the model's input features.
+input_features = feature_extractor(speech_array, sampling_rate=16000, return_tensors="pt").input_features
+
+# Generate token ids for the clip and decode them to text.
+predicted_ids = model.generate(inputs=input_features.to(device))[0]
+transcription = processor.decode(predicted_ids, skip_special_tokens=True)
+print(transcription)
+```
+# Dataset
+The model was trained on the Mozilla Common Voice dataset: around 400 hours of audio, split into 40k training and 7k validation MP3 samples.
+For more information about the dataset, see the [Common Voice Bangla page](https://commonvoice.mozilla.org/bn/datasets).
+# Training Model Information
+| Size | Layers | Width | Heads | Parameters | Bangla-only | Training Status |
+| ------ | ------ | ----- | ----- | ---------- | ----------- | --------------- |
+| tiny | 4 | 384 | 6 | 39 M | X | X |
+| base | 6 | 512 | 8 | 74 M | X | X |
+| small | 12 | 768 | 12 | 244 M | ✓ | ✓ |
+| medium | 24 | 1024 | 16 | 769 M | X | X |
+| large | 32 | 1280 | 20 | 1550 M | X | X |
+# Evaluation
+Word error rate (WER): 4.58%.
+For more details, see the [GitHub repository](https://github.com/saiful9379/BanglaASR/tree/main).
+```
+@misc{BanglaASR,
+  title={Transformer Based Whisper Bangla ASR Model},
+  author={Md Saiful Islam},
+  howpublished={},
+  year={2023}
+}
+```
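
The README reports a word error rate of 4.58% but does not say how it was computed. As a minimal sketch, assuming the standard definition (word-level edit distance divided by the number of reference words), WER for a single utterance could be calculated as below. The function name `word_error_rate` and the example sentences are illustrative and do not come from the repository; the author's evaluation script may normalize text differently or aggregate over the whole validation set.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref_words)


# Made-up example pair: one substituted word out of four reference words.
reference = "আমি বাংলায় গান গাই"
hypothesis = "আমি বাংলা গান গাই"
print(word_error_rate(reference, hypothesis))  # 0.25
```

To score a model, `hypothesis` would be the `transcription` string produced by the inference snippet in the README and `reference` the ground-truth sentence for the same clip.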