---
library_name: transformers
license: mit
datasets:
- Mohan-diffuser/odia-english-ASR
- google/fleurs
language:
- or
metrics:
- cer
base_model:
- openai/whisper-small
pipeline_tag: automatic-speech-recognition
---
# Model Card for Odia-English Whisper ASR with LoRA

This model is a fine-tuned version of OpenAI's `whisper-small` for automatic speech recognition (ASR) on an Odia-English bilingual dataset. Fine-tuning used LoRA (Low-Rank Adaptation) for parameter-efficient training. The model transcribes Odia speech (written in Bengali script) using the standard Whisper tokenizer and feature extractor.
## Model Details

**Developed by:** Dr. Balyogi Mohan Dash

**Model type:** Whisper (sequence-to-sequence transformer for ASR)

**Language(s):** Odia (written in Bengali script), English

**License:** MIT

**Fine-tuned from model:** `openai/whisper-small`
## Model Sources

**Training Code:** Private repo / project (not shared in this card)

**Dataset:** [Mohan-diffuser/odia-english-ASR](https://huggingface.co/datasets/Mohan-diffuser/odia-english-ASR)
## Uses

### Direct Use

* Automatic transcription of Odia-English speech recordings
* Educational or accessibility tools for low-resource language ASR
* Dataset bootstrapping for speech corpora in Indian languages
## How to Get Started

```python
import torch
from datasets import load_dataset
from transformers import (
    WhisperFeatureExtractor,
    WhisperForConditionalGeneration,
    WhisperTokenizer,
)
from peft import PeftModel
from scipy.signal import resample


def down_sample_audio(audio_original, original_sample_rate):
    """Resample an audio array to the 16 kHz rate Whisper expects."""
    target_sample_rate = 16000

    # Calculate the number of samples for the target sample rate
    num_samples = int(len(audio_original) * target_sample_rate / original_sample_rate)

    # Resample the audio array to the target sample rate
    return resample(audio_original, num_samples)


# Odia is written here in Bengali script, so the Bengali language setting is used.
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="bengali", task="transcribe")
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to("cuda")

# Attach the fine-tuned LoRA adapter to the base model for inference.
model = PeftModel.from_pretrained(model, "Mohan-diffuser/whisper-small-odia-finetuned", is_trainable=False)
model.eval()
model.config.use_cache = True

asr_dataset = load_dataset("Mohan-diffuser/odia-english-ASR")

idx = 0
target = asr_dataset["validation"][idx]["transcription"]
audio_original = asr_dataset["validation"][idx]["audio"]["array"]
original_sample_rate = asr_dataset["validation"][idx]["audio"]["sampling_rate"]

audio_16000 = down_sample_audio(audio_original, original_sample_rate)

input_feature = feature_extractor(
    raw_speech=audio_16000,
    sampling_rate=16000,
    return_tensors="pt",
).input_features

with torch.no_grad():
    op = model.generate(input_feature.to("cuda"), language="bengali", task="transcribe")

text_pred = tokenizer.batch_decode(op, skip_special_tokens=True)[0]
```
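The decoded `text_pred` can be compared against the reference string in `target` to spot-check transcription quality; the CER snippet in the Evaluation section below operates on exactly this pair.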
## Training Details

### Training Data

* **Dataset:** [Mohan-diffuser/odia-english-ASR](https://huggingface.co/datasets/Mohan-diffuser/odia-english-ASR)
* **Audio:** Native Odia and code-mixed English speech with manual transcriptions.
* **Sampling rate:** Downsampled to 16 kHz using `scipy.signal.resample` (see the sketch below).
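As an alternative to resampling by hand with `scipy`, the `datasets` library can decode audio at the target rate on the fly. This is a convenience sketch, not the pipeline used for training:

```python
from datasets import Audio, load_dataset

# Re-decode every clip at 16 kHz whenever the "audio" column is accessed,
# instead of resampling each example manually.
asr_dataset = load_dataset("Mohan-diffuser/odia-english-ASR")
asr_dataset = asr_dataset.cast_column("audio", Audio(sampling_rate=16000))
```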
### Training Procedure

![](image/png)
* **LoRA Parameters:** `r=64`, `lora_alpha=64`, `lora_dropout=0.05` (see the sketch after this list)
* **Scheduler:** Linear warmup
* **Warmup steps:** 20
* **Max steps:** 1400
* **Batch size:** 8
* **Gradient accumulation:** 4
* **Optimizer:** AdamW on the trainable LoRA parameters
* **Eval steps:** 100
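For orientation, here is a minimal sketch of how such a setup is typically wired with `peft` and `transformers`. The `target_modules` choice, precision flag, and output path are illustrative assumptions, not the released training script:

```python
from peft import LoraConfig, get_peft_model
from transformers import Seq2SeqTrainingArguments, WhisperForConditionalGeneration

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# LoRA hyperparameters from this card; the attention projections targeted
# here are a common choice for Whisper but are assumed, not confirmed.
lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters require grad

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-odia-lora",  # illustrative path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    max_steps=1400,
    warmup_steps=20,
    lr_scheduler_type="linear",
    eval_steps=100,  # pair with an eval strategy of "steps"
    fp16=True,  # assumed mixed precision
)
```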
## Evaluation

### Dataset

* Validation split from [Mohan-diffuser/odia-english-ASR](https://huggingface.co/datasets/Mohan-diffuser/odia-english-ASR)
### Metrics

* **CER (Character Error Rate):** Computed using `jiwer.cer`
* The model achieved a CER of **14.14** on the validation split (see the snippet below).
* Manual predictions were logged every 100 steps for qualitative monitoring.
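A minimal sketch of the CER computation, reusing the `target` and `text_pred` variables from the Get Started snippet above:

```python
from jiwer import cer

# Character error rate between the reference transcription and the prediction.
score = cer(target, text_pred)
print(f"CER: {score:.4f}")
```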
## Environmental Impact

* **Hardware Used:** Single GPU (NVIDIA RTX 4060 Ti)
* **Training Duration:** ~1400 steps of small-scale LoRA tuning
* **Framework:** PyTorch, Transformers, PEFT