
Model Description

This is a fine-tuned Wav2Vec2 model for Bangla Automatic Speech Recognition (ASR). The model is based on utpal07/wav2vec2-bangla-dialect-finetuned-large and has been fine-tuned on a custom Bangla dialect dataset containing ~3,350 audio samples.

Team: Neuralsight
Developers: Utpal Barua, Nuzhat Tabassum Medha

Model Usage

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch
import librosa
import numpy as np
import soundfile as sf

# Load model and processor
processor = Wav2Vec2Processor.from_pretrained("utpal07/wav2vec2-bangla-finetuned-xlarge")
model = Wav2Vec2ForCTC.from_pretrained("utpal07/wav2vec2-bangla-finetuned-xlarge")

def transcribe_bangla_audio(audio_path):
    # Load and preprocess audio
    speech, sr = sf.read(audio_path)
    if len(speech.shape) > 1:
        speech = np.mean(speech, axis=1)
    if sr != 16000:
        speech = librosa.resample(speech, orig_sr=sr, target_sr=16000)
    
    # Process and predict
    inputs = processor(speech, sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    pred_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(pred_ids)[0]
    
    return transcription
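The stereo-to-mono step in transcribe_bangla_audio averages the channels with np.mean. A minimal, model-free sketch of just that downmix on synthetic data (the sample values are illustrative, not real audio):

```python
import numpy as np

# Synthetic stereo signal: 4 frames x 2 channels
stereo = np.array([[0.5, 0.5],
                   [0.0, 1.0],
                   [-0.5, 0.5],
                   [1.0, 1.0]])

# Same downmix used in transcribe_bangla_audio: average across channels
mono = np.mean(stereo, axis=1)
print(mono.tolist())  # [0.5, 0.5, 0.0, 1.0]
```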

Training Details

Hyperparameters

  • Training Epochs: 50
  • Batch Size: 32 (per device)
  • Learning Rate: 7.5e-5
  • Warmup Steps: 2000
  • Gradient Accumulation Steps: 1
  • Activation Dropout: 0.1
  • Mask Time Probability: 0.75
  • Mask Time Length: 10
  • Mask Feature Probability: 0.25
  • Mask Feature Length: 64
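The hyperparameters above map onto the transformers APIs roughly as follows. This is a sketch, not the published training script: the output directory is a placeholder, and the split between TrainingArguments and model-config overrides is an assumption based on where these options normally live.

```python
from transformers import TrainingArguments, Wav2Vec2ForCTC

# Dropout and SpecAugment-style masking are model-config options
model = Wav2Vec2ForCTC.from_pretrained(
    "utpal07/wav2vec2-bangla-dialect-finetuned-large",
    activation_dropout=0.1,
    mask_time_prob=0.75,
    mask_time_length=10,
    mask_feature_prob=0.25,
    mask_feature_length=64,
)

# Optimizer and schedule settings go through TrainingArguments
training_args = TrainingArguments(
    output_dir="./wav2vec2-bangla-ctc",  # placeholder path
    num_train_epochs=50,
    per_device_train_batch_size=32,
    learning_rate=7.5e-5,
    warmup_steps=2000,
    gradient_accumulation_steps=1,
)
```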

Training Configuration

  • Feature Encoder: Frozen during training
  • Minimum Audio Duration: 0.5 seconds
  • Evaluation Strategy: Steps (every 3000 steps)
  • Save Strategy: Steps (every 2000 steps)
  • Logging Steps: 100
  • Preprocessing Workers: 32
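Freezing the feature encoder means the CNN feature extractor's parameters receive no gradient updates; Wav2Vec2ForCTC exposes this as model.freeze_feature_encoder(), which sets requires_grad=False on that submodule. A toy stand-in showing the same mechanism on a small PyTorch model (the Sequential network is illustrative, not the ASR model):

```python
import torch.nn as nn

# Toy model: freeze the first layer, train only the second,
# mirroring what freeze_feature_encoder() does to the CNN front end
net = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
for p in net[0].parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in net.parameters() if p.requires_grad)
print(trainable)  # only the second Linear: 4*2 + 2 = 10
```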
Model size: 0.3B parameters (Safetensors, F32)