## Model Description
This is a fine-tuned Wav2Vec2 model for Bangla Automatic Speech Recognition (ASR). The model is based on utpal07/wav2vec2-bangla-dialect-finetuned-large and has been fine-tuned on a custom Bangla dialect dataset containing ~3,350 audio samples.
- **Team:** Neuralsight
- **Developers:** Utpal Barua, Nuzhat Tabassum Medha
## Model Usage
```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch
import numpy as np
import librosa
import soundfile as sf

# Load model and processor
processor = Wav2Vec2Processor.from_pretrained("utpal07/wav2vec2-bangla-finetuned-xlarge")
model = Wav2Vec2ForCTC.from_pretrained("utpal07/wav2vec2-bangla-finetuned-xlarge")

def transcribe_bangla_audio(audio_path):
    # Load audio and downmix multi-channel recordings to mono
    speech, sr = sf.read(audio_path)
    if len(speech.shape) > 1:
        speech = np.mean(speech, axis=1)
    # Resample to the 16 kHz rate the model expects
    if sr != 16000:
        speech = librosa.resample(speech, orig_sr=sr, target_sr=16000)

    # Tokenize the waveform and run greedy CTC decoding
    inputs = processor(speech, sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    pred_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(pred_ids)[0]
    return transcription
```
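Under the hood, `processor.batch_decode` turns the per-frame `argmax` ids into text by merging consecutive repeats and dropping the CTC blank token. A minimal sketch of that collapse step, with an illustrative toy vocabulary and an assumed blank id of 0 (not the model's actual vocabulary):

```python
def ctc_collapse(ids, blank_id=0):
    """Greedy CTC collapse: merge repeated ids, then drop the blank token."""
    out = []
    prev = None
    for i in ids:
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out

# Toy vocabulary and per-frame argmax ids, purely for illustration
vocab = {1: "b", 2: "a", 3: "t"}
frame_ids = [0, 1, 1, 0, 2, 2, 2, 0, 3, 3]
print("".join(vocab[i] for i in ctc_collapse(frame_ids)))  # → "bat"
```

Note that a blank between two identical ids keeps them distinct, which is how CTC can emit doubled letters.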
## Training Details
### Hyperparameters
- Training Epochs: 50
- Batch Size: 32 (per device)
- Learning Rate: 7.5e-5
- Warmup Steps: 2000
- Gradient Accumulation Steps: 1
- Activation Dropout: 0.1
- Mask Time Probability: 0.75
- Mask Time Length: 10
- Mask Feature Probability: 0.25
- Mask Feature Length: 64
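As a rough guide to reproducing this setup, the hyperparameters above map onto a Hugging Face `transformers` configuration roughly as follows. This is a sketch, not the exact training script: the `output_dir` is a placeholder, and the argument names assume `transformers.TrainingArguments` and `Wav2Vec2Config`.

```python
from transformers import TrainingArguments, Wav2Vec2ForCTC

# Masking and dropout settings go on the model config at load time
model = Wav2Vec2ForCTC.from_pretrained(
    "utpal07/wav2vec2-bangla-dialect-finetuned-large",
    activation_dropout=0.1,
    mask_time_prob=0.75,
    mask_time_length=10,
    mask_feature_prob=0.25,
    mask_feature_length=64,
)

# Optimization schedule from the table above
training_args = TrainingArguments(
    output_dir="./wav2vec2-bangla-finetuned-xlarge",  # placeholder path
    num_train_epochs=50,
    per_device_train_batch_size=32,
    learning_rate=7.5e-5,
    warmup_steps=2000,
    gradient_accumulation_steps=1,
)
```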
### Training Configuration
- Feature Encoder: Frozen during training
- Minimum Audio Duration: 0.5 seconds
- Evaluation Strategy: Steps (every 3000 steps)
- Save Strategy: Steps (every 2000 steps)
- Logging Steps: 100
- Preprocessing Workers: 32
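Two of the items above are worth spelling out. Freezing the feature encoder is a one-liner in `transformers` (`model.freeze_feature_encoder()`), and the 0.5-second minimum-duration filter amounts to dropping clips whose sample count is too small for their sample rate. A minimal sketch of that filter, with made-up clip lengths:

```python
MIN_DURATION_S = 0.5
SAMPLE_RATE = 16000

def long_enough(num_samples, sr=SAMPLE_RATE, min_s=MIN_DURATION_S):
    """Keep a clip only if it lasts at least min_s seconds."""
    return num_samples / sr >= min_s

# Illustrative clip lengths in samples: 0.25 s, 0.5 s, 0.75 s, 1.5 s
clip_lengths = [4000, 8000, 12000, 24000]
kept = [n for n in clip_lengths if long_enough(n)]
print(kept)  # → [8000, 12000, 24000]
```

In a `datasets` pipeline this would typically be a `dataset.filter(...)` call with `num_proc=32`, matching the preprocessing-worker count above.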