---
license: cc-by-4.0
---
# Whisper-Tiny-hindi
This model is a version of [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) fine-tuned on the following datasets:
| Dataset | Hours (Hi) | License | Source |
|----------------------------------------|------------|-----------------------------------|------------------------------------------------------------------------|
| **Shrutilipi** | ~1,558 h | CC BY 4.0 | [ai4bharat/shrutilipi](https://huggingface.co/datasets/ai4bharat/Shrutilipi) |
| **IITM Madras SpringLab** | ~900 h | CC BY 4.0 | [SpringLab](https://asr.iitm.ac.in/dataset) |
| **Common Voice 11.0 (Mozilla)** | ~20 h | CC 0 1.0 (public domain) | [mozilla/commonvoice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) |
| **IndicSUPERB** | 150 h | Apache License 2.0 | [ai4bharat/indic-superb](https://github.com/AI4Bharat/IndicSUPERB) |
| **snow-mountain** | 67.6 h | CC BY-SA 4.0 | [bridgeconn/snow-mountain](https://huggingface.co/datasets/bridgeconn/snow-mountain/) |
| **yodas** | ~200 h | CC BY 3.0 | [espnet/yodas](https://huggingface.co/datasets/espnet/yodas) |
| **IndicVoices-R_Hindi** | 75 h | CC BY 4.0 | [SPRINGLab/IndicVoices-R_Hindi](https://huggingface.co/datasets/SPRINGLab/IndicVoices-R_Hindi) |
| **Lahaja** | 12.5 h | CC BY 4.0 | [ai4bharat/lahaja](https://ai4bharat.iitm.ac.in/datasets/lahaja) |
| **fleurs** | 30.0 h | CC BY 4.0 | [google/fleurs](https://huggingface.co/datasets/google/fleurs) |
The model was trained on around 3,000 hours of Hindi speech and is optimized for Hindi ASR, with a particular focus on high-accuracy transcription.
## How to use
Whisper is designed to work on audio samples of up to 30 s in duration. With a chunking algorithm, however, it can transcribe audio of arbitrary length. The Transformers `pipeline` method supports this: chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline can also run batched inference, and it can predict sequence-level timestamps when `return_timestamps=True` is passed:
```python
>>> import torch
>>> from transformers import pipeline
>>> from datasets import load_dataset
>>> # Use the first GPU when available, otherwise fall back to CPU
>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"
>>> asr_pipe = pipeline(
...     "automatic-speech-recognition",
...     model="collabora/whisper-tiny-hindi",
...     chunk_length_s=30,  # enable chunked long-form transcription
...     device=device,
... )
>>> ds = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="validation")
>>> sample = ds[0]["audio"]
>>> prediction = asr_pipe(sample.copy(), return_timestamps=True)
>>> prediction
{'text': ' हमने उस उम्मीदवार को चुना', 'chunks': [{'timestamp': (0.0, 4.42), 'text': ' हमने उस उम्मीदवार को चुना'}]}
```
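The pipeline handles chunking internally, so none of the following is needed to use the model. As a rough illustration of the idea only, the sketch below lays out overlapping 30 s windows over a longer recording. The function name `chunk_windows`, the symmetric 5 s overlap, and the window layout are assumptions made for this example and are not taken from the Transformers implementation.

```python
def chunk_windows(total_s, chunk_s=30.0, stride_s=5.0):
    """Return (start, end) windows in seconds that cover `total_s` of audio.

    Consecutive windows overlap by 2 * stride_s seconds, so words cut at a
    window boundary also appear whole in a neighbouring window.
    """
    if chunk_s <= 2 * stride_s:
        raise ValueError("chunk_s must be larger than twice stride_s")
    step = chunk_s - 2 * stride_s  # how far each new window advances
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        windows.append((start, end))
        if end >= total_s:  # the last window reached the end of the audio
            break
        start += step
    return windows


# A 70 s recording is covered by three overlapping 30 s windows.
windows = chunk_windows(70.0)
print(windows)  # [(0.0, 30.0), (20.0, 50.0), (40.0, 70.0)]
```

Audio shorter than one chunk yields a single window, which matches the case where no chunking is required.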
## Intended Use
- The model is designed for high-quality transcription of Hindi speech.
- It is also suitable for academic use in ASR-related tasks.
## Limitations
- May not perform well on noisy or low-quality audio.
- Focused primarily on Hindi.
### Model Performance
Whisper normalization is counterproductive for Hindi because it strips the meaning out of a sentence. For example, consider the Hindi phrase:
```
'क्षेत्रफल बढ़ने से उत्पादन बढ़ा।'
```
After Whisper normalization:
```
'कषतरफल बढन स उतप दन बढ'
```
So, we use [indic-normalization](https://github.com/anoopkunchukuttan/indic_nlp_library/blob/4cead0ae6c78fe9a19a51ef679f586206df9c476/indicnlp/normalize/indic_normalize.py#L325) for evaluation instead. Indic normalization produces the following output:
```
'क्षेत्रफल बढ़ने से उत्पादन बढ़ा।'
```
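The damage shown above comes from removing Unicode combining marks, which in Devanagari carry vowel signs, the virama, and the nukta. The sketch below is a simplified, standard-library approximation of that behaviour and not the actual Whisper normalizer: it assumes that non-spacing marks are dropped and that other marks, symbols, and punctuation are replaced by spaces. The function name `strip_marks` is hypothetical.

```python
import unicodedata


def strip_marks(text):
    """Approximate a mark-stripping normalizer (illustrative assumption)."""
    pieces = []
    for ch in unicodedata.normalize("NFKD", text):
        category = unicodedata.category(ch)
        if category == "Mn":
            # Non-spacing marks (virama, nukta, some vowel signs) are dropped.
            continue
        if category[0] in "MSP":
            # Other marks, symbols, and punctuation become a space.
            pieces.append(" ")
        else:
            pieces.append(ch)
    # Collapse runs of whitespace and trim the ends.
    return " ".join("".join(pieces).split())


phrase = "क्षेत्रफल बढ़ने से उत्पादन बढ़ा।"
print(strip_marks(phrase))  # कषतरफल बढन स उतप दन बढ
```

Because only the bare consonants and independent vowels survive, distinct Hindi words can collapse into the same string, which is why scores computed this way can look better than the transcription really is.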
Baseline results for `openai/whisper-tiny` on `google/fleurs -- hindi`:
```
Word Error Rate (WER) with whisper norm: 172.60 %
Word Error Rate (WER) with indic norm: 196.57 %
```
The model achieves the following results on the held-out test set `google/fleurs -- hindi`:
```
Word Error Rate (WER) with whisper norm: 10.10 %
Word Error Rate (WER) with indic norm: 18.94 %
```
Indic normalization retains diacritics and complex characters in Hindi text. This can increase the Word Error Rate (WER) compared with Whisper's default normalization, but it produces more semantically accurate transcriptions.
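For readers unfamiliar with the metric, WER is the word-level edit distance between the hypothesis and the reference, divided by the number of reference words. The sketch below is a minimal standard-library illustration and not the evaluation script used for the numbers above; the function name `word_error_rate` is hypothetical and the inputs are assumed to be already normalized. It also shows why WER can exceed 100 %, as in the baseline results: every inserted word counts as an error, while the denominator stays fixed at the reference length.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution in a three-word reference.
print(word_error_rate("a b c", "a x c"))
# Three insertions against a two-word reference give a WER of 150 %.
print(word_error_rate("a b", "a x y z b"))  # 1.5
```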
### Acknowledgments
We thank the contributors and organizations behind the datasets:
- [AI4Bharat](https://ai4bharat.iitm.ac.in/datasets/shrutilipi) for the Shrutilipi dataset.
- [IIT Madras SpringLab](https://asr.iitm.ac.in/dataset) for their springx-hindi dataset.
- [IndicNLP Library](https://github.com/anoopkunchukuttan/indic_nlp_library) by Anoop Kunchukuttan for providing normalization tools that were crucial for evaluation.
### BibTeX entry and citation info
#### Model Citation
```bibtex
@misc{whisper-tiny-hindi,
title = {Whisper-Tiny Fine-Tuned on Hindi},
author = {Collabora Ltd.},
year = {2025},
publisher = {Hugging Face},
note = {Fine-tuned using Shrutilipi and IITM Madras SpringLab datasets},
howpublished = {\url{https://huggingface.co/collabora/whisper-tiny-hindi/}},
}
```
#### IndicNLP Library Citation
```bibtex
@misc{kunchukuttan2020indicnlp,
author = "Anoop Kunchukuttan",
title = "{The IndicNLP Library}",
year = "2020",
howpublished={\url{https://github.com/anoopkunchukuttan/indic_nlp_library/blob/master/docs/indicnlp.pdf}}
}
```
#### AI4Bharat - Shrutilipi dataset
```bibtex
@misc{https://doi.org/10.48550/arxiv.2208.12666,
doi = {10.48550/ARXIV.2208.12666},
url = {https://arxiv.org/abs/2208.12666},
author = {Bhogale, Kaushal Santosh and Raman, Abhigyan and Javed, Tahir and Doddapaneni, Sumanth and Kunchukuttan, Anoop and Kumar, Pratyush and Khapra, Mitesh M.},
title = {Effectiveness of Mining Audio and Text Pairs from Public Data for Improving ASR Systems for Low-Resource Languages},
publisher = {arXiv},
year = {2022},
copyright = {arXiv.org perpetual, non-exclusive license}
}
```