Whisper Fine-tuned on Bangla Dialects (Shobdotori)

This model is a fine-tuned version of openai/whisper-small trained on the dataset from the "Shobdotori: Where Dialects Flow into Bangla" AI Hackathon competition.

It is designed to handle the linguistic diversity of Bangladesh, covering 20 different regional dialects with high accuracy.

Model Performance

  • Competition Test Score: 0.88573
  • Metric: Levenshtein similarity, $1.0 - \frac{\text{distance}}{\max(\text{len}_{\text{ref}}, \text{len}_{\text{pred}})}$
  • Validation Loss: ~0.0417
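For reference, the competition metric above can be reproduced with a standard dynamic-programming edit distance. This is a minimal sketch of that formula; the competition's exact tokenization or normalization may differ:

```python
def levenshtein_distance(ref: str, pred: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    prev = list(range(len(pred) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, p in enumerate(pred, start=1):
            cost = 0 if r == p else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity_score(ref: str, pred: str) -> float:
    """1.0 - distance / max(len_ref, len_pred); defined as 1.0 for two empty strings."""
    denom = max(len(ref), len(pred))
    return 1.0 if denom == 0 else 1.0 - levenshtein_distance(ref, pred) / denom
```

A perfect transcription scores 1.0, and each character-level error reduces the score in proportion to the longer of the two strings.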

Training Dynamics

The model converged steadily, with validation loss falling from 0.3935 to 0.0417 over 1200 training steps.

Step   Training Loss   Validation Loss
  50          0.3669            0.3935
 100          0.2040            0.2728
 200          0.0678            0.1421
 400          0.0055            0.0655
 600          0.0033            0.0543
 800          0.0026            0.0450
1000          0.0005            0.0440
1200          0.0001            0.0417
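The convergence trend in the table can be sanity-checked numerically; validation loss decreases at every logged checkpoint:

```python
# (step, training loss, validation loss) pairs copied from the table above
history = [
    (50, 0.3669, 0.3935), (100, 0.2040, 0.2728), (200, 0.0678, 0.1421),
    (400, 0.0055, 0.0655), (600, 0.0033, 0.0543), (800, 0.0026, 0.0450),
    (1000, 0.0005, 0.0440), (1200, 0.0001, 0.0417),
]

val_losses = [v for _, _, v in history]
# Validation loss is strictly decreasing across all logged checkpoints
assert all(a > b for a, b in zip(val_losses, val_losses[1:]))
```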

Dataset & Dialects

The model was trained on a diverse dataset containing audio samples from the following 20 regions of Bangladesh:

Barisal, Bhola, Bogura, Brahmanbaria, Chittagong, Comilla, Dhaka, Feni, Jessore, Jhenaidah, Khulna, Kushtia, Lakshmipur, Mymensingh, Natore, Noakhali, Pabna, Rajshahi, Rangpur, Sylhet

Usage

You can use this model with the Hugging Face transformers library, either through the high-level pipeline API or directly with WhisperForConditionalGeneration.

1. Using pipeline (Easiest)

from transformers import pipeline

# Load the pipeline
pipe = pipeline("automatic-speech-recognition", model="YOUR_USERNAME/YOUR_MODEL_NAME")

# Transcribe an audio file
transcription = pipe("path_to_audio.wav")

print(transcription["text"])

2. Using WhisperForConditionalGeneration (Custom)

import librosa
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load model and processor
model_id = "YOUR_USERNAME/YOUR_MODEL_NAME"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# Load audio (Whisper expects 16 kHz input, so resampling is mandatory)
audio_path = "path_to_audio.wav"
speech_array, sr = librosa.load(audio_path, sr=16000)

# Process audio
input_features = processor(speech_array, sampling_rate=16000, return_tensors="pt").input_features

# Generate token ids (inference only, so no gradients are needed)
with torch.no_grad():
    predicted_ids = model.generate(input_features)

# Decode token ids to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

print(transcription)

Training Hyperparameters

The model was trained using the following hyperparameters:

  • Learning Rate: 1e-05
  • Train Batch Size: 2
  • Eval Batch Size: 8
  • Gradient Accumulation: 16
  • Total Steps: 1200
  • Warmup Steps: 450
  • FP16: True
  • Optimizer: AdamW
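With gradient accumulation, the batch size seen by each optimizer update is the per-device batch size times the accumulation steps. A quick check of the schedule implied by the numbers above (trainer internals may differ slightly):

```python
# Hyperparameters from the list above
train_batch_size = 2
gradient_accumulation = 16
total_steps = 1200
warmup_steps = 450

# Effective batch size per optimizer update
effective_batch_size = train_batch_size * gradient_accumulation
print(effective_batch_size)  # 32

# Fraction of training spent warming up the learning rate
warmup_fraction = warmup_steps / total_steps
print(round(warmup_fraction, 3))  # 0.375
```

So each optimizer step effectively averages gradients over 32 samples, and the learning rate warms up for the first ~37.5% of training.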

Evaluation results

  • Competition Similarity Score on "Shobdotori: Where Dialects Flow into Bangla" (self-reported): 0.886