Commit c7b7d2c
Parent(s): af26d67
Update models
- README.md +20 -11
- model.safetensors +1 -1
- small.pt → pytorch_model.bin +2 -2
README.md
CHANGED
|
@@ -4,10 +4,19 @@ license: cc-by-4.0
|
|
| 4 |
# Whisper-Small-hindi
|
| 5 |
|
| 6 |
This is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small), fine-tuned on the following datasets:
|
| 7 |
-
|
| 8 |
-
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
|
| 12 |
## How to use
|
| 13 |
The Whisper model is intrinsically designed to work on audio samples of up to 30s in duration. However, with a chunking algorithm it can transcribe audio of arbitrary length. This is possible through the Transformers `pipeline` method. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline can be run with batched inference. It can also be extended to predict sequence-level timestamps by passing `return_timestamps=True`:
|
|
@@ -28,8 +37,8 @@ The Whisper model is intrinsically designed to work on audio samples of up to 30
|
|
| 28 |
|
| 29 |
>>> ds = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="validation")
|
| 30 |
>>> sample = ds[0]["audio"]
|
| 31 |
-
>>> prediction = asr_pipe(sample.copy(),
|
| 32 |
-
हमने उस उम्मीदवार को चुना।
|
| 33 |
```
|
| 34 |
|
| 35 |
## Intended Use
|
|
@@ -43,17 +52,17 @@ The Whisper model is intrinsically designed to work on audio samples of up to 30
|
|
| 43 |
### Model Performance
|
| 44 |
Whisper normalization is counterproductive for Hindi because it strips the meaning out of a sentence. For example, consider the Hindi phrase:
|
| 45 |
```
|
| 46 |
-
'
|
| 47 |
```
|
| 48 |
|
| 49 |
After whisper normalization:
|
| 50 |
```
|
| 51 |
-
'
|
| 52 |
```
|
| 53 |
|
| 54 |
So, we use [indic-normalization](https://github.com/anoopkunchukuttan/indic_nlp_library/blob/4cead0ae6c78fe9a19a51ef679f586206df9c476/indicnlp/normalize/indic_normalize.py#L325) for evaluation. Indic-norm produces the below output:
|
| 55 |
```
|
| 56 |
-
'
|
| 57 |
```
|
| 58 |
|
| 59 |
`openai-whisper/small` baseline results on `google/fleurs -- hindi`:
|
|
@@ -64,8 +73,8 @@ Word Error Rate (WER) with indic norm: 89.73 %
|
|
| 64 |
|
| 65 |
The model achieves the following benchmarks on the held out test set `google/fleurs -- hindi`:
|
| 66 |
```
|
| 67 |
-
Word Error Rate (WER) with whisper norm:
|
| 68 |
-
Word Error Rate (WER) with indic norm:
|
| 69 |
```
|
| 70 |
|
| 71 |
Indic normalization retains diacritics and complex characters in Hindi text, which can increase the Word Error Rate (WER) when compared to Whisper's default normalization but produces more semantically accurate transcriptions.
|
|
|
|
| 4 |
# Whisper-Small-hindi
|
| 5 |
|
| 6 |
This is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small), fine-tuned on the following datasets:
|
| 7 |
+
| Dataset | Hours (Hi) | License | Source |
|
| 8 |
+
|----------------------------------------|------------|-----------------------------------|------------------------------------------------------------------------|
|
| 9 |
+
| **Shrutilipi** | ~1,558 h | CC BY 4.0 | [ai4bharat/shrutilipi](https://huggingface.co/datasets/ai4bharat/Shrutilipi) |
|
| 10 |
+
| **IITM Madras SpringLab** | ~900 h | CC BY 4.0 | [SpringLab](https://asr.iitm.ac.in/dataset) |
|
| 11 |
+
| **Common Voice 11.0 (Mozilla)** | ~20 h | CC 0 1.0 (public domain) | [mozilla/commonvoice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) |
|
| 12 |
+
| **IndicSUPERB** | 150 h | Apache License 2.0 | [ai4bharat/indic-superb](https://github.com/AI4Bharat/IndicSUPERB) |
|
| 13 |
+
| **snow-mountain** | 67.6 h | CC BY-SA 4.0 | [bridgeconn/snow-mountain](https://huggingface.co/datasets/bridgeconn/snow-mountain/) |
|
| 14 |
+
| **yodas** | ~200 h | CC BY 3.0 | [espnet/yodas](https://huggingface.co/datasets/espnet/yodas) |
|
| 15 |
+
| **IndicVoices-R_Hindi** | 75 h | CC BY 4.0 | [SPRINGLab/IndicVoices-R_Hindi](https://huggingface.co/datasets/SPRINGLab/IndicVoices-R_Hindi) |
|
| 16 |
+
| **Lahaja** | 12.5 h | CC BY 4.0 | [ai4bharat/lahaja](https://ai4bharat.iitm.ac.in/datasets/lahaja) |
|
| 17 |
+
| **fleurs** | 30.0 h | CC BY 4.0 | [google/fleurs](https://huggingface.co/datasets/google/fleurs) |
|
| 18 |
+
|
| 19 |
+
The model is trained on around 3,000 hours of Hindi speech and optimized for Hindi ASR, with a particular focus on high-accuracy transcription.
|
| 20 |
|
| 21 |
## How to use
|
| 22 |
The Whisper model is intrinsically designed to work on audio samples of up to 30s in duration. However, with a chunking algorithm it can transcribe audio of arbitrary length. This is possible through the Transformers `pipeline` method. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline can be run with batched inference. It can also be extended to predict sequence-level timestamps by passing `return_timestamps=True`:
|
|
|
|
| 37 |
|
| 38 |
>>> ds = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="validation")
|
| 39 |
>>> sample = ds[0]["audio"]
|
| 40 |
+
>>> prediction = asr_pipe(sample.copy(), return_timestamps=True)
|
| 41 |
+
{'text': ' हमने उस उम्मीदवार को चुना।', 'chunks': [{'timestamp': (0.0, 4.42), 'text': ' हमने उस उम्मीदवार को चुना।'}]}
|
| 42 |
```
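
As a rough illustration of the chunking idea described in the README text, the sketch below computes overlapping windows over a long recording. The helper name `chunk_spans` and the 5-second overlap on each side are assumptions made for illustration only. This is not the Transformers implementation, which handles striding and merging internally.

```python
from typing import List, Tuple


def chunk_spans(total_s: float,
                chunk_length_s: float = 30.0,
                stride_s: float = 5.0) -> List[Tuple[float, float]]:
    """Return (start, end) windows in seconds that cover `total_s` of audio.

    Consecutive windows overlap by 2 * stride_s so that words cut at a
    window edge also appear whole in the neighbouring window.
    `stride_s` is an assumed value, not a setting taken from this model card.
    """
    if chunk_length_s <= 2 * stride_s:
        raise ValueError("chunk_length_s must exceed twice stride_s")
    step = chunk_length_s - 2 * stride_s
    spans = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        spans.append((start, end))
        if end >= total_s:  # the last window reaches the end of the audio
            break
        start += step
    return spans


# A 70-second recording is covered by three overlapping 30-second windows.
print(chunk_spans(70.0))
```

Audio shorter than one window yields a single span, which matches the case where no chunking is needed.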
|
| 43 |
|
| 44 |
## Intended Use
|
|
|
|
| 52 |
### Model Performance
|
| 53 |
Whisper Normalization is counter-productive for hindi since it takes the meaning out of a sentence for e.g. consider the hindi phrase:
|
| 54 |
```
|
| 55 |
+
'क्षेत्रफल बढ़ने से उत्पादन बढ़ा।'
|
| 56 |
```
|
| 57 |
|
| 58 |
After whisper normalization:
|
| 59 |
```
|
| 60 |
+
'कषतरफल बढन स उतप दन बढ'
|
| 61 |
```
|
| 62 |
|
| 63 |
So, we use [indic-normalization](https://github.com/anoopkunchukuttan/indic_nlp_library/blob/4cead0ae6c78fe9a19a51ef679f586206df9c476/indicnlp/normalize/indic_normalize.py#L325) for evaluation. Indic-norm produces the below output:
|
| 64 |
```
|
| 65 |
+
'क्षेत्रफल बढ़ने से उत्पादन बढ़ा।'
|
| 66 |
```
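
To see why mark-stripping normalization damages Devanagari text, the following minimal sketch approximates that one step using only the standard library. The function name `strip_marks` is hypothetical, and the logic is an assumption about the kind of processing involved (decompose, drop non-spacing marks, replace other marks, symbols and punctuation with spaces). It is not Whisper's normalizer code.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Approximate a diacritic-stripping normalizer (illustrative only)."""
    out = []
    for ch in unicodedata.normalize("NFKD", text):
        cat = unicodedata.category(ch)
        if cat == "Mn":
            # Non-spacing marks (virama, nukta, some vowel signs) are dropped.
            continue
        # Spacing marks, symbols and punctuation become spaces.
        out.append(" " if cat[0] in "MSP" else ch)
    # Collapse the whitespace introduced above.
    return " ".join("".join(out).split())


phrase = "क्षेत्रफल बढ़ने से उत्पादन बढ़ा।"
print(strip_marks(phrase))
```

Because vowel signs and the virama are Unicode combining marks, removing them leaves only bare consonants and can split one word into several tokens, which is the effect shown in the README example.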
|
| 67 |
|
| 68 |
`openai-whisper/small` baseline results on `google/fleurs -- hindi`:
|
|
|
|
| 73 |
|
| 74 |
The model achieves the following benchmarks on the held out test set `google/fleurs -- hindi`:
|
| 75 |
```
|
| 76 |
+
Word Error Rate (WER) with whisper norm: 7.17 %
|
| 77 |
+
Word Error Rate (WER) with indic norm: 15.10 %
|
| 78 |
```
|
| 79 |
|
| 80 |
Indic normalization retains diacritics and complex characters in Hindi text, which can increase the Word Error Rate (WER) when compared to Whisper's default normalization but produces more semantically accurate transcriptions.
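
For reference, WER is the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below is a minimal pure-Python version, assuming whitespace tokenization. The function name `word_error_rate` is hypothetical, and this is not the evaluation script used to produce the numbers in this commit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the processed reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("a b c d", "a x c d"))
```

Whichever normalizer is chosen would be applied to both strings before scoring, so a normalizer that preserves diacritics counts differences that a mark-stripping normalizer would hide.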
|
model.safetensors
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 966995080
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:3844aceb78c375a907a3782fd61b553c38719bd3efb17dd0ca2c2af6eef6f535
|
| 3 |
size 966995080
|
small.pt → pytorch_model.bin
RENAMED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:56b44176a2517693721016e73fb6ae8b7a77004b44ac531d1435856764d83233
|
| 3 |
+
size 967103174
|