# Whisper Medium for Multi-Dialectal Bengali ASR (Shobdotori) by Team AutoBot
This repository contains the fine-tuned Whisper-Medium model developed by Team AutoBot for the Shobdotori Kaggle Competition. The model is specialized for transcribing audio from 20 different regional Bengali dialects into standard Bengali text.
Our core strategy was to perform transfer learning on an exceptionally powerful base model, adapting its robust general Bengali ASR capabilities to the specific nuances of the competition's dialectal dataset.
## Model Foundation: A Champion Base Model
This model is a fine-tuned version of the first-place model from the Bengali.AI Speech Recognition competition: `bengaliAI/tugstugi_bengaliai-regional-asr_whisper-medium`. We gratefully acknowledge the extensive work of the original author, which provided a state-of-the-art foundation for our work.
The base model was developed through a sophisticated and rigorous training pipeline, which included:
- Base Architecture: OpenAI's `whisper-medium`.
- Massive & Diverse Training Data: It was trained on a vast collection of datasets, including OpenSLR 37 & 53, MadASR, Shrutilipi, Kathbath, audio synthesized with Google TTS, and a significant amount of pseudo-labeled data from YouTube videos.
- Advanced Augmentation: The original training incorporated techniques like spectrogram masking, dithering, resampling, and speed/pitch variations, making the model highly robust.
- Custom Bengali Tokenizer: A key innovation of the base model is its custom 12k vocabulary Whisper tokenizer trained specifically on Bengali texts, which dramatically improves inference speed and transcription accuracy for the language.
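The exact recipe for the base model's custom tokenizer is not reproduced here, but the idea of training a compact subword vocabulary on Bengali text can be sketched with the `tokenizers` library. The corpus below is a tiny placeholder; the real tokenizer was trained on a much larger Bengali text collection, and the 12k vocabulary size is taken from the description above.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Tiny illustrative Bengali corpus (placeholder for a real text collection).
corpus = [
    "আমি বাংলায় গান গাই",
    "বাংলা আমার মাতৃভাষা",
    "আমি বাংলাকে ভালোবাসি",
]

# BPE tokenizer with a capped vocabulary, analogous in spirit to the
# base model's 12k-entry Bengali vocabulary.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=12000, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

encoding = tokenizer.encode("আমি বাংলায় গান গাই")
print(encoding.tokens)
```

A smaller, language-specific vocabulary means fewer decoding steps per sentence, which is the mechanism behind the speed improvement claimed for the base model.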
## Fine-tuning for the Shobdotori Competition
Our contribution was to take this champion model and further specialize it for the Shobdotori competition. The fine-tuning process by Team AutoBot involved:
- Dataset: The competition's training set, comprising audio from 20 regional Bengali dialects.
- Objective: To adapt the model's weights to better recognize the specific phonetic, lexical, and prosodic variations present in these dialects.
- Methodology: We fine-tuned the model for 10 epochs using a rigorous text normalization pipeline and gentle on-the-fly audio augmentations (time stretching, pitch shifting, and noise reduction).
- Training Innovation: To work around the disk space limits of the Kaggle platform, we employed a zero-disk checkpointing strategy, keeping the best model's state dictionary in memory after each validation step instead of writing checkpoints to disk.
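Our training loop is not reproduced here, but the in-memory checkpointing idea can be sketched in plain PyTorch. The model and validation function below are illustrative stand-ins, not our actual fine-tuning code.

```python
import copy

import torch
import torch.nn as nn

# Illustrative stand-in for the fine-tuned ASR model.
model = nn.Linear(4, 2)

best_score = float("-inf")
best_state = None  # best weights live in RAM, never written to disk


def validation_score(m):
    # Placeholder for real validation (e.g. Levenshtein similarity on a dev set).
    return 0.5


for epoch in range(3):
    # ... training steps would go here ...
    score = validation_score(model)
    if score > best_score:
        best_score = score
        # Deep-copy the state dict to CPU so later updates don't overwrite it.
        best_state = copy.deepcopy(
            {k: v.detach().cpu() for k, v in model.state_dict().items()}
        )

# At the end of training, restore the best weights without touching disk.
model.load_state_dict(best_state)
```

The trade-off is RAM usage: a Whisper-Medium state dict is roughly 3 GB in fp32, which fits comfortably in Kaggle's system memory but would not scale to much larger models.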
The result is a model that retains the powerful general transcription capabilities of its predecessor while demonstrating enhanced accuracy on specific regional Bengali speech.
## How to Use

You can use this model with the `transformers` library's `pipeline` API. Make sure to install the necessary dependencies:

```bash
pip install --upgrade transformers accelerate "datasets[audio]"
```
### Python Usage
```python
import torch
from transformers import pipeline
# Use a GPU if available
device = "cuda:0" if torch.cuda.is_available() else "cpu"
# Load the ASR pipeline from the Hub
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model="YourHuggingFaceUsername/your-model-name-here",  # e.g., AutoBot/whisper-medium-shobdotori
    device=device,
)
# Replace this with the path to your audio file
audio_file_path = "path/to/your/audio.wav"
# Transcribe the audio
# The task and language are already set in the model's config, but can be specified
result = asr_pipeline(
    audio_file_path,
    generate_kwargs={"language": "bn", "task": "transcribe"},
)
print(result["text"])
```

## Training Details
- Base Model: `bengaliAI/tugstugi_bengaliai-regional-asr_whisper-medium`
- Competition: Shobdotori
- Team: AutoBot
- Training Split: 90% for training, 10% for validation (stratified by region).
- Fine-tuning: 10 epochs with `Seq2SeqTrainer`.
- Text Normalization: All non-Bengali characters were removed, and punctuation was standardized.
- Audio Augmentation: On-the-fly time stretching, pitch shifting, and noise reduction were applied to the training data.
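Our exact normalization rules are summarized above but not listed exhaustively. A minimal sketch of the "remove non-Bengali characters, standardize punctuation" step might look like the following; the Unicode range filter and the punctuation-to-danda mapping are assumptions for illustration, not our precise pipeline.

```python
import re

# Keep Bengali letters and signs (U+0980–U+09FF), whitespace, and the danda (।).
_ALLOWED = re.compile(r"[^\u0980-\u09FF\s\u0964]")
_SPACES = re.compile(r"\s+")


def normalize_bengali(text: str) -> str:
    # Map common Western sentence-final punctuation to the Bengali danda
    # (an assumed standardization rule).
    text = text.replace(".", "\u0964").replace("?", "\u0964").replace("!", "\u0964")
    # Drop any remaining non-Bengali characters (Latin letters, digits, symbols).
    text = _ALLOWED.sub(" ", text)
    # Collapse runs of whitespace introduced by the removals.
    return _SPACES.sub(" ", text).strip()


print(normalize_bengali("আমি hello বাংলায় গান গাই!"))
```

Keeping the normalization identical between training targets and evaluation references is important: any mismatch inflates the edit distance even for correct transcriptions.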
## Evaluation
The model achieved the following score on the competition's blind private test set:
- Normalized Levenshtein Similarity: 0.87401
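The competition's precise metric definition is not reproduced here; a common formulation of normalized Levenshtein similarity is `1 - edit_distance / max(len(reference), len(hypothesis))`, sketched below under that assumption.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert / delete / substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]


def normalized_similarity(ref: str, hyp: str) -> float:
    if not ref and not hyp:
        return 1.0
    return 1.0 - levenshtein(ref, hyp) / max(len(ref), len(hyp))


print(normalized_similarity("বাংলা", "বাংলা"))  # identical strings score 1.0
```

Under this formulation a score of 0.87401 means that, on average, roughly 87% of the characters in the longer of the reference/hypothesis pair survive unedited.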
## Citation
Please cite the original Whisper paper:
```bibtex
@inproceedings{radford2023robust,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  booktitle={Proceedings of the 40th International Conference on Machine Learning},
  pages={28492--28518},
  year={2023},
  editor={Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume={202},
  series={Proceedings of Machine Learning Research},
  month={23--29 Jul},
  publisher={PMLR},
}
```
## Acknowledgements
Our work would not have been possible without the foundational model and insights provided by the winner of the Bengali.AI Speech Recognition competition. We extend our thanks for making the model and training details publicly available.
- Base Model: `bengaliAI/tugstugi_bengaliai-regional-asr_whisper-medium`
- Original Author's Discussion: Kaggle Discussion Link