---
license: mit
tags:
- audio
- automatic-speech-recognition
widget:
- example_title: sample 1
  src: https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31515636.mp3
- example_title: sample 2
  src: https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31549899.mp3
- example_title: sample 3
  src: https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31617644.mp3
pipeline_tag: automatic-speech-recognition
---

BanglaASR is a Bangla automatic speech recognition (ASR) model, fine-tuned from Whisper on the Bangla Mozilla Common Voice dataset.
Training used about 40k training samples and 7k validation samples, around 400 hours of data in total. After 12,000 training steps the model
reached a word error rate of 4.58%. It is based on the Whisper small variant (244 M parameters).

```py
import librosa
import torch
import torchaudio

from transformers import WhisperFeatureExtractor
from transformers import WhisperForConditionalGeneration
from transformers import WhisperProcessor

# Run on the GPU when one is available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Audio to transcribe. If your torchaudio backend cannot read from a URL,
# download the file first and point mp3_path at the local copy.
mp3_path = "https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31515636.mp3"

model_path = "bangla-speech-processing/BanglaASR"

# Load the feature extractor, the processor (used for decoding), and the model.
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_path)
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path).to(device)

# Load the audio, keep the first channel, and resample to the 16 kHz that Whisper expects.
speech_array, sampling_rate = torchaudio.load(mp3_path, format="mp3")
speech_array = speech_array[0].numpy()
speech_array = librosa.resample(speech_array, orig_sr=sampling_rate, target_sr=16000)

# Convert the waveform to log-Mel input features.
input_features = feature_extractor(speech_array, sampling_rate=16000, return_tensors="pt").input_features

# Generate token IDs for the single input and decode them to text.
predicted_ids = model.generate(inputs=input_features.to(device))[0]
transcription = processor.decode(predicted_ids, skip_special_tokens=True)

print(transcription)
```
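
Whisper's feature extractor pads or truncates each input to 30 seconds by default, so the snippet above only transcribes the first 30 seconds of a longer recording. The sketch below shows one simple way to split a 16 kHz waveform into consecutive 30-second pieces before feature extraction. The helper name `split_into_chunks` and the two constants are illustrative assumptions, not part of this repository or of `transformers`.

```python
from typing import Iterator, Sequence

SAMPLE_RATE = 16000   # Whisper models expect 16 kHz audio
CHUNK_SECONDS = 30    # Whisper processes audio in 30-second windows


def split_into_chunks(samples: Sequence[float],
                      sample_rate: int = SAMPLE_RATE,
                      chunk_seconds: int = CHUNK_SECONDS) -> Iterator[Sequence[float]]:
    """Yield consecutive slices of `samples`, each at most `chunk_seconds` long.

    Works with any sliceable sequence, including a 1-D NumPy array
    such as `speech_array` after resampling.
    """
    chunk_size = sample_rate * chunk_seconds
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]
```

Each chunk can then go through `feature_extractor` and `model.generate` exactly as a single clip does, and the decoded strings can be joined with spaces. Fixed-size cuts may split a word at a chunk boundary, so overlapping windows or silence-based splitting may work better on real recordings.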

# Dataset
The model was trained on the Mozilla Common Voice Bangla dataset: around 400 hours of MP3 audio, split into training (40k samples) and validation (7k samples).
For more information about the dataset, see the [Common Voice Bangla page](https://commonvoice.mozilla.org/bn/datasets).

# Training Model Information

| Size | Layers | Width | Heads | Parameters | Bangla-only | Training Status |
| ------------- | ------------- | -------- | -------- | ------------- | ------------- | -------- |
| tiny | 4 | 384 | 6 | 39 M | X | X |
| base | 6 | 512 | 8 | 74 M | X | X |
| small | 12 | 768 | 12 | 244 M | ✓ | ✓ |
| medium | 24 | 1024 | 16 | 769 M | X | X |
| large | 32 | 1280 | 20 | 1550 M | X | X |

# Evaluation

Word error rate (WER): 4.58%

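
For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal pure-Python illustration of that definition. The function name `word_error_rate` is hypothetical, and this is not necessarily the script or text normalization used to obtain the figure reported here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three reference words.
wer = word_error_rate("আমি ভাত খাই", "আমি ভাত খাব")
print(f"{wer:.2%}")  # prints 33.33%
```

Reported WER values depend on how the text is normalized (punctuation, spacing, numerals) before scoring, so numbers from different evaluation scripts are not always directly comparable.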
For more details, see the [GitHub repository](https://github.com/saiful9379/BanglaASR/tree/main).

```
@misc{BanglaASR,
  title={Transformer Based Whisper Bangla ASR Model},
  author={Md Saiful Islam},
  howpublished={},
  year={2023}
}
```