---
library_name: transformers
license: mit
datasets:
- Mohan-diffuser/odia-english-ASR
- google/fleurs
language:
- or
metrics:
- cer
base_model:
- openai/whisper-small
pipeline_tag: automatic-speech-recognition
---

# Model Card for Odia-English Whisper ASR with LoRA

This model is a fine-tuned version of OpenAI's `whisper-small` for automatic speech recognition (ASR) on an Odia-English bilingual dataset. Fine-tuning was done with LoRA (Low-Rank Adaptation) for parameter-efficient training. The model transcribes Odia speech (with transcriptions written in the Bengali script) and uses the standard Whisper tokenizer and feature extractor.

## Model Details

* **Developed by:** Dr. Balyogi Mohan Dash
* **Model type:** Whisper (sequence-to-sequence transformer for ASR)
* **Language(s):** Odia (written in Bengali script), English
* **License:** MIT
* **Fine-tuned from model:** `openai/whisper-small`

## Model Sources

* **Training Code:** Private repo / project (not shared in this card)
* **Dataset:** [Mohan-diffuser/odia-english-ASR](https://huggingface.co/datasets/Mohan-diffuser/odia-english-ASR)

## Uses

### Direct Use

* Automatic transcription of Odia-English speech recordings
* Educational or accessibility tools for low-resource language ASR
* Dataset bootstrapping for speech corpora in Indian languages

## How to Get Started

```python
import torch
from datasets import load_dataset
from scipy.signal import resample
from transformers import (
    WhisperFeatureExtractor,
    WhisperForConditionalGeneration,
    WhisperTokenizer,
)
from peft import PeftModel


def down_sample_audio(audio_original, original_sample_rate):
    """Resample an audio array to Whisper's expected 16 kHz."""
    target_sample_rate = 16000
    # Number of samples after resampling to the target rate
    num_samples = int(len(audio_original) * target_sample_rate / original_sample_rate)
    return resample(audio_original, num_samples)


# The Bengali language setting is used, matching the Bengali-script transcriptions
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="bengali", task="transcribe")
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

# Load the base model and attach the LoRA adapter
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to("cuda")
model = PeftModel.from_pretrained(
    model, "Mohan-diffuser/whisper-small-odia-finetuned", is_trainable=False, device_map="cuda"
)
model.eval()
model.config.use_cache = True

# Transcribe one validation example
asr_dataset = load_dataset("Mohan-diffuser/odia-english-ASR")
idx = 0
target = asr_dataset["validation"][idx]["transcription"]
audio_original = asr_dataset["validation"][idx]["audio"]["array"]
original_sample_rate = asr_dataset["validation"][idx]["audio"]["sampling_rate"]
audio_16000 = down_sample_audio(audio_original, original_sample_rate)

input_feature = feature_extractor(raw_speech=audio_16000, sampling_rate=16000, return_tensors="pt").input_features

with torch.no_grad():
    op = model.generate(input_feature.to("cuda"), language="bengali", task="transcribe")
text_pred = tokenizer.batch_decode(op, skip_special_tokens=True)[0]
```

## Training Details

### Training Data

* **Dataset:** [Mohan-diffuser/odia-english-ASR](https://huggingface.co/datasets/Mohan-diffuser/odia-english-ASR)
* **Audio:** Native Odia and code-mixed English speech with manual transcriptions.
* **Sampling rate:** Downsampled to 16 kHz using `scipy.signal.resample`.
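Before fine-tuning, each dataset row has to be converted into the inputs Whisper expects: log-Mel `input_features` from 16 kHz audio and tokenized `labels` from the transcription. The snippet below is a minimal sketch of such a preprocessing step, reusing the `down_sample_audio` helper from the example above; the original training pipeline is not published here, so the exact preprocessing (and the split name) are assumptions.

```python
from transformers import WhisperFeatureExtractor, WhisperTokenizer

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="bengali", task="transcribe")


def prepare_example(example):
    """Convert one dataset row into Whisper training inputs (sketch, not the released pipeline)."""
    audio = example["audio"]
    # Resample the raw waveform to 16 kHz (down_sample_audio is defined in the example above)
    audio_16000 = down_sample_audio(audio["array"], audio["sampling_rate"])
    # Log-Mel spectrogram features for the encoder
    example["input_features"] = feature_extractor(
        raw_speech=audio_16000, sampling_rate=16000
    ).input_features[0]
    # Token ids of the reference transcription, used as decoder labels
    example["labels"] = tokenizer(example["transcription"]).input_ids
    return example


# For example (split name is an assumption):
# processed = asr_dataset["train"].map(prepare_example)
```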
### Training Procedure

![Training loss](images/loss.png)

* **LoRA Parameters:** `r=64`, `lora_alpha=64`, dropout `0.05` (a configuration sketch appears at the end of this card)
* **Scheduler:** Linear warmup
* **Warmup steps:** 20
* **Max steps:** 1400
* **Batch size:** 8
* **Gradient accumulation:** 4
* **Optimizer:** AdamW on the trainable LoRA parameters
* **Eval steps:** 100

## Evaluation

### Dataset

* Validation split of [Mohan-diffuser/odia-english-ASR](https://huggingface.co/datasets/Mohan-diffuser/odia-english-ASR)

### Metrics

* **CER (Character Error Rate):** computed with `jiwer.cer` (a short usage sketch appears at the end of this card)
* The model achieved a CER of **14.14** on the validation split
* Predictions were logged every 100 steps for qualitative (manual) monitoring

## Environmental Impact

* **Hardware Used:** Single NVIDIA GeForce RTX 4060 Ti GPU
* **Training Duration:** ~1400 steps of small-scale LoRA tuning
* **Framework:** PyTorch, Transformers, PEFT
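For completeness, the fine-tuning setup described under *Training Procedure* could be assembled with PEFT roughly as follows. This is a minimal sketch under stated assumptions, not the released training code: `target_modules`, the output directory, and every setting not listed on this card are assumptions; only the hyperparameters given above (rank, alpha, dropout, warmup steps, max steps, batch size, gradient accumulation, eval interval) come from the card.

```python
# Hedged sketch of the LoRA fine-tuning setup; values not listed on this card are assumptions.
from transformers import (
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
)
from peft import LoraConfig, get_peft_model

base_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

lora_config = LoraConfig(
    r=64,                                  # rank, as listed above
    lora_alpha=64,                         # scaling, as listed above
    lora_dropout=0.05,                     # dropout, as listed above
    target_modules=["q_proj", "v_proj"],   # assumption: attention projections
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()    # only the LoRA parameters remain trainable

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-odia-lora",  # hypothetical output path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    warmup_steps=20,
    max_steps=1400,
    lr_scheduler_type="linear",
    eval_strategy="steps",                 # "evaluation_strategy" in older transformers releases
    eval_steps=100,
    remove_unused_columns=False,           # keep input_features/labels for a custom collator
)

# A Seq2SeqTrainer (default AdamW optimizer) would then be built with the
# preprocessed dataset and a collator that pads input_features and labels:
# trainer = Seq2SeqTrainer(model=peft_model, args=training_args,
#                          train_dataset=..., eval_dataset=..., data_collator=...)
# trainer.train()
```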
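The character error rate above was computed with `jiwer.cer`; below is a minimal sketch of that computation, reusing the `target` reference and `text_pred` prediction from the "How to Get Started" example. How the score was aggregated over the full validation split is not documented here, so treat this as an illustration only.

```python
# Minimal CER sketch; `target` and `text_pred` come from the "How to Get Started" example above.
from jiwer import cer

score = cer(target, text_pred)  # (reference, hypothesis)
print(f"CER: {score:.4f}")      # jiwer returns a fraction; multiply by 100 for a percentage
```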