---
license: cc-by-4.0
---

# Whisper-Tiny-hindi

This model is a version of [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) fine-tuned on the following Hindi speech datasets:

| Dataset | Hours (Hi) | License | Source |
|---------|------------|---------|--------|
| **Shrutilipi** | ~1,558 h | CC BY 4.0 | [ai4bharat/shrutilipi](https://huggingface.co/datasets/ai4bharat/Shrutilipi) |
| **IIT Madras SpringLab** | ~900 h | CC BY 4.0 | [SpringLab](https://asr.iitm.ac.in/dataset) |
| **Common Voice 11.0 (Mozilla)** | ~20 h | CC0 1.0 (public domain) | [mozilla/commonvoice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) |
| **IndicSUPERB** | 150 h | Apache License 2.0 | [ai4bharat/indic-superb](https://github.com/AI4Bharat/IndicSUPERB) |
| **snow-mountain** | 67.6 h | CC BY-SA 4.0 | [bridgeconn/snow-mountain](https://huggingface.co/datasets/bridgeconn/snow-mountain/) |
| **yodas** | ~200 h | CC BY 3.0 | [espnet/yodas](https://huggingface.co/datasets/espnet/yodas) |
| **IndicVoices-R_Hindi** | 75 h | CC BY 4.0 | [SPRINGLab/IndicVoices-R_Hindi](https://huggingface.co/datasets/SPRINGLab/IndicVoices-R_Hindi) |
| **Lahaja** | 12.5 h | CC BY 4.0 | [ai4bharat/lahaja](https://ai4bharat.iitm.ac.in/datasets/lahaja) |
| **fleurs** | 30.0 h | CC BY 4.0 | [google/fleurs](https://huggingface.co/datasets/google/fleurs) |

The model was trained on roughly 3,000 hours of Hindi speech and is optimized for Hindi automatic speech recognition (ASR), with a particular focus on transcription accuracy.

## How to use

Whisper is designed to work on audio samples of up to 30 s in duration. With a chunking algorithm, however, it can transcribe audio of arbitrary length. The Transformers `pipeline` API supports this: chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline can also run batched inference, and it can predict sequence-level timestamps when called with `return_timestamps=True`:

```python
>>> import torch
>>> from transformers import pipeline
>>> from datasets import load_dataset

>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"

>>> # Build the ASR pipeline; chunk_length_s=30 enables long-form chunking.
>>> asr_pipe = pipeline(
...     "automatic-speech-recognition",
...     model="collabora/whisper-tiny-hindi",
...     chunk_length_s=30,
...     device=device,
... )

>>> ds = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="validation")
>>> sample = ds[0]["audio"]
>>> prediction = asr_pipe(sample.copy(), return_timestamps=True)
>>> prediction
{'text': ' हमने उस उम्मीदवार को चुना', 'chunks': [{'timestamp': (0.0, 4.42), 'text': ' हमने उस उम्मीदवार को चुना'}]}
```
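
To make the chunking step concrete, the sketch below computes the overlapping windows that a chunked pipeline would cut from a long recording. It is an illustration only, not the Transformers implementation: the helper name `chunk_windows` is hypothetical, and the overlap of one sixth of the chunk length on each side is an assumption based on the documented default of `stride_length_s`.

```python
from typing import List, Optional, Tuple


def chunk_windows(
    duration_s: float,
    chunk_length_s: float = 30.0,
    stride_length_s: Optional[float] = None,
) -> List[Tuple[float, float]]:
    """Return (start, end) times of overlapping chunks covering the audio."""
    if stride_length_s is None:
        # Assumed default: overlap of chunk_length_s / 6 on each side.
        stride_length_s = chunk_length_s / 6
    # Each new chunk starts one "step" after the previous one.
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("stride is too large for the chosen chunk length")

    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break  # the last chunk reaches the end of the audio
        start += step
    return windows


# A 70 s recording is covered by three 30 s windows that overlap by 10 s.
print(chunk_windows(70.0))
```

Because neighbouring windows overlap, the text decoded near a chunk boundary can be reconciled between the two chunks instead of being cut mid-word.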

## Intended Use

- The model is designed for high-quality transcription of Hindi speech.
- It is suitable for academic use in ASR-related tasks.

## Limitations

- May not perform well on noisy or low-quality audio.
- Focused primarily on Hindi.

### Model Performance

Whisper's default text normalization is counterproductive for Hindi: it strips the vowel signs and other diacritics that Devanagari relies on, which destroys the meaning of the sentence. For example, consider the Hindi phrase:

```
'क्षेत्रफल बढ़ने से उत्पादन बढ़ा।'
```

After Whisper normalization:

```
'कषतरफल बढन स उतप दन बढ'
```
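
The loss comes from how a diacritic-stripping normalizer treats Unicode categories. To the best of our understanding, Whisper's basic normalizer decomposes the text, drops nonspacing marks, and replaces other marks, symbols, and punctuation with spaces. The sketch below is a simplified approximation of that behaviour using only the standard library; the function name `strip_marks_and_symbols` is hypothetical, and the real normalizer does more (for example lowercasing and bracket removal).

```python
import re
import unicodedata


def strip_marks_and_symbols(text: str) -> str:
    """Approximate a diacritic-stripping normalizer (illustrative only)."""
    out = []
    for ch in unicodedata.normalize("NFKD", text):
        category = unicodedata.category(ch)
        if category == "Mn":
            # Nonspacing marks (virama, nukta, some vowel signs) are dropped.
            continue
        if category[0] in "MSP":
            # Other marks, symbols, and punctuation become spaces.
            out.append(" ")
        else:
            out.append(ch)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", "".join(out)).strip()


phrase = "क्षेत्रफल बढ़ने से उत्पादन बढ़ा।"
normalized = strip_marks_and_symbols(phrase)
print(normalized)
print(len(phrase.split()), "words before,", len(normalized.split()), "tokens after")
```

In Devanagari, vowel signs and the virama are combining marks, so removing them leaves bare consonants and even splits a word wherever a spacing vowel sign is turned into a space.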

We therefore use [indic-normalization](https://github.com/anoopkunchukuttan/indic_nlp_library/blob/4cead0ae6c78fe9a19a51ef679f586206df9c476/indicnlp/normalize/indic_normalize.py#L325) for evaluation. Indic normalization leaves the same phrase intact:

```
'क्षेत्रफल बढ़ने से उत्पादन बढ़ा।'
```

`openai-whisper/tiny` baseline results on `google/fleurs -- hindi`:

```
Word Error Rate (WER) with whisper norm: 172.60 %
Word Error Rate (WER) with indic norm: 196.57 %
```

This model achieves the following results on the held-out `google/fleurs -- hindi` test set:

```
Word Error Rate (WER) with whisper norm: 10.10 %
Word Error Rate (WER) with indic norm: 18.94 %
```

Indic normalization retains diacritics and complex characters in Hindi text. This can raise the measured Word Error Rate (WER) relative to Whisper's default normalization, but the comparison is made on text that preserves the meaning of the sentence.
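
For reference, WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. This is why it can exceed 100 % when the hypothesis contains many inserted words, as in the baseline figures above. The following is a minimal sketch of that definition; the function name `word_error_rate` is hypothetical, and the evaluation reported here may have used a different implementation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("a b c d", "a x c d"))
# One substitution plus two insertions against a single reference word.
print(word_error_rate("a", "x y z"))
```

Both the reference and the hypothesis should pass through the same normalizer before being scored, since the choice of normalizer changes which words count as matching.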

### Acknowledgments

We thank the contributors and organizations behind the datasets and tools:

- [AI4Bharat](https://ai4bharat.iitm.ac.in/datasets/shrutilipi) for the Shrutilipi dataset.
- [IIT Madras SpringLab](https://asr.iitm.ac.in/dataset) for their springx-hindi dataset.
- [IndicNLP Library](https://github.com/anoopkunchukuttan/indic_nlp_library) by Anoop Kunchukuttan for providing normalization tools that were crucial for evaluation.

### BibTeX entry and citation info

#### Model Citation

```bibtex
@misc{whisper-tiny-hindi,
  title = {Whisper-Tiny Fine-Tuned on Hindi},
  author = {Collabora Ltd.},
  year = {2025},
  publisher = {Hugging Face},
  note = {Fine-tuned using Shrutilipi and IITM Madras SpringLab datasets},
  howpublished = {\url{https://huggingface.co/collabora/whisper-tiny-hindi/}},
}
```

#### IndicNLP Library Citation

```bibtex
@misc{kunchukuttan2020indicnlp,
  author = "Anoop Kunchukuttan",
  title = "{The IndicNLP Library}",
  year = "2020",
  howpublished={\url{https://github.com/anoopkunchukuttan/indic_nlp_library/blob/master/docs/indicnlp.pdf}}
}
```

#### AI4Bharat - Shrutilipi dataset

```bibtex
@misc{https://doi.org/10.48550/arxiv.2208.12666,
  doi = {10.48550/ARXIV.2208.12666},
  url = {https://arxiv.org/abs/2208.12666},
  author = {Bhogale, Kaushal Santosh and Raman, Abhigyan and Javed, Tahir and Doddapaneni, Sumanth and Kunchukuttan, Anoop and Kumar, Pratyush and Khapra, Mitesh M.},
  title = {Effectiveness of Mining Audio and Text Pairs from Public Data for Improving ASR Systems for Low-Resource Languages},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```