Whisper Fine-tuned on Bangla Dialects (Shobdotori)
This model is a fine-tuned version of openai/whisper-small trained on the dataset from the "Shobdotori: Where Dialects Flow into Bangla" AI Hackathon competition.
It is designed to handle the linguistic diversity of Bangladesh, covering 20 regional dialects.
Model Performance
- Competition Test Score: 0.88573
- Metric: Levenshtein Similarity Score ($1.0 - \frac{distance}{\max(len_{ref}, len_{pred})}$)
- Validation Loss: ~0.0417
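The competition metric above can be reproduced with plain Python. This is a minimal sketch; the function names are illustrative, not the competition's official scorer.

```python
# Levenshtein similarity: 1 - distance / max(len(ref), len(pred)).
def levenshtein_distance(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute),
    # keeping only the previous row to stay O(min memory).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(ref: str, pred: str) -> float:
    # 1.0 means an exact match; two empty strings count as identical.
    if not ref and not pred:
        return 1.0
    return 1.0 - levenshtein_distance(ref, pred) / max(len(ref), len(pred))

print(similarity("kitten", "sitting"))  # 3 edits over max length 7 -> ~0.571
```

A score of 0.88573 therefore means roughly 11% of characters (relative to the longer string) differ between reference and prediction on average.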
Training Dynamics
The model converged steadily, with validation loss falling from 0.3935 to 0.0417 over 1200 steps.
| Step | Training Loss | Validation Loss |
|---|---|---|
| 50 | 0.3669 | 0.3935 |
| 100 | 0.2040 | 0.2728 |
| 200 | 0.0678 | 0.1421 |
| 400 | 0.0055 | 0.0655 |
| 600 | 0.0033 | 0.0543 |
| 800 | 0.0026 | 0.0450 |
| 1000 | 0.0005 | 0.0440 |
| 1200 | 0.0001 | 0.0417 |
Dataset & Dialects
The model was trained on a diverse dataset containing audio samples from the following 20 regions of Bangladesh:
| Region 1 | Region 2 | Region 3 | Region 4 |
|---|---|---|---|
| Barisal | Bhola | Bogura | Brahmanbaria |
| Chittagong | Comilla | Dhaka | Feni |
| Jessore | Jhenaidah | Khulna | Kushtia |
| Lakshmipur | Mymensingh | Natore | Noakhali |
| Pabna | Rajshahi | Rangpur | Sylhet |
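For programmatic use, the 20 regions above can be kept as a list, e.g. to validate or filter dataset metadata. This is an illustrative sketch; the `region` field name is an assumption, not part of the dataset schema.

```python
# The 20 dialect regions covered by the training data (from the table above).
REGIONS = [
    "Barisal", "Bhola", "Bogura", "Brahmanbaria",
    "Chittagong", "Comilla", "Dhaka", "Feni",
    "Jessore", "Jhenaidah", "Khulna", "Kushtia",
    "Lakshmipur", "Mymensingh", "Natore", "Noakhali",
    "Pabna", "Rajshahi", "Rangpur", "Sylhet",
]

def is_known_region(sample: dict) -> bool:
    # True if a sample's region matches one of the 20 covered dialects.
    # The "region" key is hypothetical — adapt to your metadata layout.
    return sample.get("region") in REGIONS

print(len(REGIONS))  # 20
```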
Usage
You can use this model directly with the Hugging Face pipeline or the transformers library.
1. Using pipeline (Easiest)
```python
from transformers import pipeline

# Load the pipeline
pipe = pipeline("automatic-speech-recognition", model="YOUR_USERNAME/YOUR_MODEL_NAME")

# Transcribe an audio file
transcription = pipe("path_to_audio.wav")
print(transcription["text"])
```
2. Using WhisperForConditionalGeneration (Custom)
```python
import librosa
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load model and processor
model_id = "YOUR_USERNAME/YOUR_MODEL_NAME"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# Load audio (resampling to 16 kHz is mandatory)
audio_path = "path_to_audio.wav"
speech_array, sr = librosa.load(audio_path, sr=16000)

# Process audio
input_features = processor(speech_array, sampling_rate=16000, return_tensors="pt").input_features

# Generate token ids
predicted_ids = model.generate(input_features)

# Decode token ids to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
Training Hyperparameters
The model was trained using the following hyperparameters:
- Learning Rate:
1e-05 - Train Batch Size:
2 - Eval Batch Size:
8 - Gradient Accumulation:
16 - Total Steps:
1200 - Warmup Steps:
450 - FP16:
True - Optimizer: AdamW
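With gradient accumulation, the effective batch size is larger than the per-device batch size. A quick sanity check on the numbers above (assuming a single device):

```python
# Effective batch size = per-device batch size x gradient accumulation steps.
train_batch_size = 2
grad_accum_steps = 16
total_steps = 1200

effective_batch_size = train_batch_size * grad_accum_steps
samples_seen = effective_batch_size * total_steps  # total training examples processed

print(effective_batch_size)  # 32
print(samples_seen)          # 38400
```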
Acknowledgements
- Base Model: whisper-small by OpenAI
- Competition: Shobdotori - AI Hackathon
Model Tree for mahirmasud/AmarBangla-whisper
- Base model: openai/whisper-small

Evaluation Results
- Competition Similarity Score on "Shobdotori - Where Dialects Flow into Bangla" (self-reported): 0.886