## Model Description
This is a fine-tuned Wav2Vec2 model for Bangla Automatic Speech Recognition (ASR). The model is based on utpal07/wav2vec2-bangla-dialect-finetuned-large and has been fine-tuned on a custom Bangla dialect dataset containing ~3,350 audio samples.
- **Team:** Neuralsight
- **Developers:** Utpal Barua, Nuzhat Tabassum Medha
## Model Usage
```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch
import numpy as np
import librosa
import soundfile as sf

# Load model and processor
processor = Wav2Vec2Processor.from_pretrained("utpal07/wav2vec2-bangla-finetuned-xlarge")
model = Wav2Vec2ForCTC.from_pretrained("utpal07/wav2vec2-bangla-finetuned-xlarge")

def transcribe_bangla_audio(audio_path):
    # Load audio and downmix multi-channel recordings to mono
    speech, sr = sf.read(audio_path)
    if len(speech.shape) > 1:
        speech = np.mean(speech, axis=1)
    # Resample to the 16 kHz rate the model expects
    if sr != 16000:
        speech = librosa.resample(speech, orig_sr=sr, target_sr=16000)

    # Tokenize the waveform and run greedy CTC decoding
    inputs = processor(speech, sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    pred_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(pred_ids)[0]
    return transcription
```
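Under the hood, `processor.batch_decode` turns the per-frame `argmax` ids into text by merging consecutive repeats and dropping the CTC blank token. A minimal sketch of that collapse step, with an illustrative toy vocabulary and an assumed blank id of 0 (not the model's actual vocabulary):

```python
def ctc_collapse(ids, blank_id=0):
    """Greedy CTC collapse: merge repeated ids, then drop the blank token."""
    out = []
    prev = None
    for i in ids:
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out

# Toy vocabulary and per-frame argmax ids, purely for illustration
vocab = {1: "b", 2: "a", 3: "t"}
frame_ids = [0, 1, 1, 0, 2, 2, 2, 0, 3, 3]
print("".join(vocab[i] for i in ctc_collapse(frame_ids)))  # → "bat"
```

Note that a blank between two identical ids keeps them distinct, which is how CTC can emit doubled letters.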
## Training Details
### Hyperparameters
- Training Epochs: 50
- Batch Size: 32 (per device)
- Learning Rate: 7.5e-5
- Warmup Steps: 2000
- Gradient Accumulation Steps: 1
- Activation Dropout: 0.1
- Mask Time Probability: 0.75
- Mask Time Length: 10
- Mask Feature Probability: 0.25
- Mask Feature Length: 64
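As a rough guide to reproducing this setup, the hyperparameters above map onto a Hugging Face `transformers` configuration roughly as follows. This is a sketch, not the exact training script: the `output_dir` is a placeholder, and the argument names assume `transformers.TrainingArguments` and `Wav2Vec2Config`.

```python
from transformers import TrainingArguments, Wav2Vec2ForCTC

# Masking and dropout settings go on the model config at load time
model = Wav2Vec2ForCTC.from_pretrained(
    "utpal07/wav2vec2-bangla-dialect-finetuned-large",
    activation_dropout=0.1,
    mask_time_prob=0.75,
    mask_time_length=10,
    mask_feature_prob=0.25,
    mask_feature_length=64,
)

# Optimization schedule from the table above
training_args = TrainingArguments(
    output_dir="./wav2vec2-bangla-finetuned-xlarge",  # placeholder path
    num_train_epochs=50,
    per_device_train_batch_size=32,
    learning_rate=7.5e-5,
    warmup_steps=2000,
    gradient_accumulation_steps=1,
)
```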
### Training Configuration
- Feature Encoder: Frozen during training
- Minimum Audio Duration: 0.5 seconds
- Evaluation Strategy: Steps (every 3000 steps)
- Save Strategy: Steps (every 2000 steps)
- Logging Steps: 100
- Preprocessing Workers: 32
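Two of the items above are worth spelling out. Freezing the feature encoder is a one-liner in `transformers` (`model.freeze_feature_encoder()`), and the 0.5-second minimum-duration filter amounts to dropping clips whose sample count is too small for their sample rate. A minimal sketch of that filter, with made-up clip lengths:

```python
MIN_DURATION_S = 0.5
SAMPLE_RATE = 16000

def long_enough(num_samples, sr=SAMPLE_RATE, min_s=MIN_DURATION_S):
    """Keep a clip only if it lasts at least min_s seconds."""
    return num_samples / sr >= min_s

# Illustrative clip lengths in samples: 0.25 s, 0.5 s, 0.75 s, 1.5 s
clip_lengths = [4000, 8000, 12000, 24000]
kept = [n for n in clip_lengths if long_enough(n)]
print(kept)  # → [8000, 12000, 24000]
```

In a `datasets` pipeline this would typically be a `dataset.filter(...)` call with `num_proc=32`, matching the preprocessing-worker count above.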