Whisper Fine-tuned on Bangla Dialects (Shobdotori)

This model is a fine-tuned version of openai/whisper-small trained on the dataset from the "Shobdotori: Where Dialects Flow into Bangla" AI Hackathon competition.

It is designed to handle the linguistic diversity of Bangladesh, covering 20 different regional dialects with high accuracy.

Model Performance

  • Competition Test Score: 0.88573
  • Metric: Levenshtein similarity, $1.0 - \frac{\text{distance}}{\max(\text{len}_{\text{ref}}, \text{len}_{\text{pred}})}$
  • Validation Loss: ~0.0417
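For reference, the competition metric above can be reproduced with a standard dynamic-programming edit distance. This is a minimal sketch of that formula; the competition's exact tokenization or normalization may differ:

```python
def levenshtein_distance(ref: str, pred: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    prev = list(range(len(pred) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, p in enumerate(pred, start=1):
            cost = 0 if r == p else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity_score(ref: str, pred: str) -> float:
    """1.0 - distance / max(len_ref, len_pred); defined as 1.0 for two empty strings."""
    denom = max(len(ref), len(pred))
    return 1.0 if denom == 0 else 1.0 - levenshtein_distance(ref, pred) / denom
```

A perfect transcription scores 1.0, and each character-level error reduces the score in proportion to the longer of the two strings.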

Training Dynamics

The model converged steadily, with validation loss falling from 0.3935 to 0.0417 over 1200 training steps.

Step   Training Loss   Validation Loss
  50          0.3669            0.3935
 100          0.2040            0.2728
 200          0.0678            0.1421
 400          0.0055            0.0655
 600          0.0033            0.0543
 800          0.0026            0.0450
1000          0.0005            0.0440
1200          0.0001            0.0417
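The convergence trend in the table can be sanity-checked numerically; validation loss decreases at every logged checkpoint:

```python
# (step, training loss, validation loss) pairs copied from the table above
history = [
    (50, 0.3669, 0.3935), (100, 0.2040, 0.2728), (200, 0.0678, 0.1421),
    (400, 0.0055, 0.0655), (600, 0.0033, 0.0543), (800, 0.0026, 0.0450),
    (1000, 0.0005, 0.0440), (1200, 0.0001, 0.0417),
]

val_losses = [v for _, _, v in history]
# Validation loss is strictly decreasing across all logged checkpoints
assert all(a > b for a, b in zip(val_losses, val_losses[1:]))
```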

Dataset & Dialects

The model was trained on a diverse dataset containing audio samples from the following 20 regions of Bangladesh:

Barisal, Bhola, Bogura, Brahmanbaria, Chittagong, Comilla, Dhaka, Feni, Jessore, Jhenaidah, Khulna, Kushtia, Lakshmipur, Mymensingh, Natore, Noakhali, Pabna, Rajshahi, Rangpur, Sylhet

Usage

You can use this model with the Hugging Face transformers library, either through the high-level pipeline API or directly with WhisperForConditionalGeneration.

1. Using pipeline (Easiest)

from transformers import pipeline

# Load the pipeline
pipe = pipeline("automatic-speech-recognition", model="YOUR_USERNAME/YOUR_MODEL_NAME")

# Transcribe an audio file
transcription = pipe("path_to_audio.wav")

print(transcription["text"])

2. Using WhisperForConditionalGeneration (Custom)

import librosa
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load model and processor
model_id = "YOUR_USERNAME/YOUR_MODEL_NAME"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# Load audio (Whisper expects 16 kHz input, so resampling is mandatory)
audio_path = "path_to_audio.wav"
speech_array, sr = librosa.load(audio_path, sr=16000)

# Process audio
input_features = processor(speech_array, sampling_rate=16000, return_tensors="pt").input_features

# Generate token ids (inference only, so no gradients are needed)
with torch.no_grad():
    predicted_ids = model.generate(input_features)

# Decode token ids to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

print(transcription)

Training Hyperparameters

The model was trained using the following hyperparameters:

  • Learning Rate: 1e-05
  • Train Batch Size: 2
  • Eval Batch Size: 8
  • Gradient Accumulation: 16
  • Total Steps: 1200
  • Warmup Steps: 450
  • FP16: True
  • Optimizer: AdamW
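With gradient accumulation, the batch size seen by each optimizer update is the per-device batch size times the accumulation steps. A quick check of the schedule implied by the numbers above (trainer internals may differ slightly):

```python
# Hyperparameters from the list above
train_batch_size = 2
gradient_accumulation = 16
total_steps = 1200
warmup_steps = 450

# Effective batch size per optimizer update
effective_batch_size = train_batch_size * gradient_accumulation
print(effective_batch_size)  # 32

# Fraction of training spent warming up the learning rate
warmup_fraction = warmup_steps / total_steps
print(round(warmup_fraction, 3))  # 0.375
```

So each optimizer step effectively averages gradients over 32 samples, and the learning rate warms up for the first ~37.5% of training.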

Evaluation results

  • Competition Similarity Score on "Shobdotori: Where Dialects Flow into Bangla" (self-reported): 0.886