Fine-tuned version of openai/whisper-medium on the AI4Bharat Kathbath Hindi speech dataset for automatic speech recognition (ASR) in Hindi.
| Metric | Base whisper-medium | After fine-tuning | Relative reduction |
|---|---|---|---|
| WER | 0.4133 (41.3%) | 0.2318 (23.2%) | 43.9% ↓ |
| CER | 0.2292 (22.9%) | 0.0704 (7.0%) | 69.3% ↓ |
Evaluated on 50 examples from the Kathbath valid split (never seen during training).
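For readers who want to check the numbers, WER and CER are edit-distance ratios, and the last column of the table is the relative reduction from the base score. The following is a minimal plain-Python sketch, not the evaluation script used for this model. The helper names (`edit_distance`, `wer`, `cer`, `relative_reduction`) are illustrative, and no text normalization is applied, which may differ from how the reported scores were computed.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Corpus-level word error rate: total word edits / total reference words."""
    edits = sum(edit_distance(r.split(), h.split())
                for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return edits / words


def cer(references, hypotheses):
    """Corpus-level character error rate (spaces count as characters here)."""
    edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    chars = sum(len(r) for r in references)
    return edits / chars


def relative_reduction(base, tuned):
    """Fraction by which the error rate dropped relative to the base score."""
    return (base - tuned) / base


# Reproduce the last column of the table from the reported scores.
print(round(relative_reduction(0.4133, 0.2318) * 100, 1))  # WER
print(round(relative_reduction(0.2292, 0.0704) * 100, 1))  # CER
```

With only 50 evaluation examples, a corpus-level ratio like this one weights long utterances more heavily than a per-utterance average would.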
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
processor = WhisperProcessor.from_pretrained("ShaikhAnis007/whisper-medium-hindi")
model = WhisperForConditionalGeneration.from_pretrained("ShaikhAnis007/whisper-medium-hindi")
# Load your audio (must be 16kHz mono)
# audio = load your audio array here
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features,
        language="hindi",
        task="transcribe",
    )
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
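The snippet above leaves loading the audio to the reader. As one possible approach, the sketch below reads a 16-bit PCM WAV file with the Python standard library, downmixes it to mono, and rejects files that are not sampled at 16 kHz. The function name `load_wav_mono_16k` and the file path in the usage note are hypothetical; libraries such as `librosa` or `soundfile` are common alternatives and also handle resampling or other formats.

```python
import array
import sys
import wave


def load_wav_mono_16k(path):
    """Read a 16-bit PCM WAV file and return mono samples as floats in [-1, 1)."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        if wf.getframerate() != 16000:
            raise ValueError(
                f"expected 16 kHz audio, got {wf.getframerate()} Hz; resample first"
            )
        channels = wf.getnchannels()
        frames = wf.readframes(wf.getnframes())

    # Signed 16-bit samples, interleaved by channel.
    samples = array.array("h", frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    # Average the channels of each frame to get a mono signal.
    mono = [
        sum(samples[i:i + channels]) / channels
        for i in range(0, len(samples), channels)
    ]
    # Scale 16-bit integers to floating point.
    return [s / 32768.0 for s in mono]
```

The result of `load_wav_mono_16k("sample.wav")` can then be used as `audio` in the inference snippet; wrapping it in `numpy.asarray(..., dtype="float32")` is a reasonable precaution if the processor expects an array.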
| Parameter | Value |
|---|---|
| Learning rate | 1e-5 |
| LR scheduler | Linear with warmup |
| Warmup steps | 50 |
| Epochs | 4 |
| Batch size (physical) | 1 |
| Gradient accumulation | 8 (effective batch = 8) |
| fp16 | True |
| gradient_checkpointing | True |
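To make the schedule in the table concrete, the sketch below computes the effective batch size and the learning rate at a given optimizer step under a linear schedule with warmup, using the values listed above. It is an illustration, not the training script: the function name `linear_warmup_lr` is made up for this example, and `total_steps = 1000` is a hypothetical value, since the total step count depends on the size of the training subset, which is not stated here.

```python
def linear_warmup_lr(step, total_steps, peak_lr=1e-5, warmup_steps=50):
    """Learning rate at an optimizer step: linear ramp-up, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Gradients from 8 single-example batches are accumulated per optimizer step.
physical_batch_size = 1
gradient_accumulation_steps = 8
effective_batch_size = physical_batch_size * gradient_accumulation_steps

total_steps = 1000  # hypothetical; not taken from the actual training run
for step in (0, 25, 50, 525, 1000):
    print(step, linear_warmup_lr(step, total_steps))
```

Each optimizer step here corresponds to 8 forward and backward passes, which keeps memory use at the level of a batch of 1 while updating weights as if the batch size were 8.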
If you use this model, please cite the base model and dataset:
@misc{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and others},
  year={2022}
}
@misc{kathbath2022,
  title={IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian Languages},
  author={Javed, Tahir and others},
  year={2022}
}