| 6 |
base_model: openai/whisper-base
|
| 7 |
tags:
|
| 8 |
- generated_from_trainer
|
| 9 |
+
- arabic
|
| 10 |
+
- automatic-speech-recognition
|
| 11 |
+
- quran
|
| 12 |
+
- whisper
|
| 13 |
metrics:
|
| 14 |
- wer
|
| 15 |
+
- cer
|
| 16 |
model-index:
|
| 17 |
- name: Whisper base AR - YA
|
| 18 |
+
results:
|
| 19 |
+
- task:
|
| 20 |
+
type: automatic-speech-recognition
|
| 21 |
+
name: Automatic Speech Recognition
|
| 22 |
+
dataset:
|
| 23 |
+
name: Quran Ayat Speech-to-Text
|
| 24 |
+
type: audio
|
| 25 |
+
metrics:
|
| 26 |
+
- name: WER (Validation)
|
| 27 |
+
type: wer
|
| 28 |
+
value: 0.0405
|
| 29 |
+
- name: CER (Validation)
|
| 30 |
+
type: cer
|
| 31 |
+
value: 0.0195
|
| 32 |
+
- name: WER (Test)
|
| 33 |
+
type: wer
|
| 34 |
+
value: 0.082
|
| 35 |
+
- name: CER (Test)
|
| 36 |
+
type: cer
|
| 37 |
+
value: 0.0327
|
| 38 |
+
pipeline_tag: automatic-speech-recognition
|
| 39 |
---

# Whisper base AR - YA

This model is a fine-tuned version of [openai/whisper-base](https://huggingface.co/openai/whisper-base) on an Arabic Quran recitation dataset focused on verse-level speech-to-text transcription. The goal was to create a lightweight ASR system that can accurately transcribe Quranic audio into Arabic text, optimized for clear, male recitation audio.

It achieves the following results:

- **Validation set:**
  - **Loss**: 0.0023
  - **WER (Word Error Rate)**: 4.05%
  - **CER (Character Error Rate)**: 1.95%
- **Test set:**
  - **WER (Word Error Rate)**: 8.2%
  - **CER (Character Error Rate)**: 3.27%
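
The WER figures above are standard edit-distance metrics. As a rough illustration only (not this model's actual evaluation code, which would typically use a library such as `evaluate` or `jiwer`), word error rate can be computed like this:

```python
# Minimal WER sketch: word-level Levenshtein distance divided by the
# number of reference words. Illustrative only.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("بسم الله الرحمن الرحيم", "بسم الله الرحمن الرحيم"))  # 0.0
```

CER is computed the same way over characters instead of words.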

## Model description

This model builds upon OpenAI's Whisper base architecture and is fine-tuned specifically for Modern Standard Arabic, with a focus on Quranic verses. Audio samples were cleaned, resampled to 16 kHz, and aligned with text for training.

The model is trained in a supervised setting with Whisper's standard sequence-to-sequence cross-entropy objective, making it suitable for inference in streaming or batch-based ASR systems. Whisper's multilingual capabilities were leveraged to build a domain-specific Arabic transcription model.

## Intended uses & limitations

### Intended uses

- Speech recognition for Arabic Quran recitations
- Educational tools or Quran learning applications
- Mobile-friendly deployment of ASR for religious audio content
- Fine-tuning or distillation for low-resource Arabic ASR projects

### Limitations

- Optimized for clear, male Quran recitation; performance may degrade on female voices or conversational Arabic
- Not designed for dialectal or informal speech
- Background noise or overlapping speakers may reduce accuracy

## Training and evaluation data

The dataset consists of verse-level Quran recitations in Arabic. The recordings are primarily from male speakers with clear tajweed (recitation rules), each aligned to its corresponding Arabic text.

Audio files were resampled to 16 kHz and normalized for Whisper compatibility.

Evaluation was conducted on both a held-out validation set and a separate test set to assess generalization.
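
The 16 kHz resampling step can be sketched as follows. This is an illustrative stand-in, not the card's actual preprocessing (which would typically use `torchaudio` or `librosa` resamplers with proper anti-aliasing):

```python
# Naive linear-interpolation resampling to Whisper's expected 16 kHz.
# Fine as a sketch; production pipelines should use a filtered resampler.
import numpy as np

def resample(audio: np.ndarray, sr_in: int, sr_out: int = 16_000) -> np.ndarray:
    duration = len(audio) / sr_in
    n_out = int(round(duration * sr_out))
    t_in = np.linspace(0.0, duration, num=len(audio), endpoint=False)
    t_out = np.linspace(0.0, duration, num=n_out, endpoint=False)
    return np.interp(t_out, t_in, audio)

tone = np.sin(2 * np.pi * 440 * np.arange(44_100) / 44_100)  # 1 s at 44.1 kHz
print(resample(tone, 44_100).shape)  # (16000,)
```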
## Training procedure

### Training hyperparameters

- `learning_rate`: 0.0001
- `train_batch_size`: 8
- `eval_batch_size`: 8
- `gradient_accumulation_steps`: 2
- `total_train_batch_size`: 16
- `num_train_epochs`: 30
- `seed`: 42
- `lr_scheduler_type`: linear
- `lr_scheduler_warmup_steps`: 500
- `optimizer`: AdamW (betas=(0.9, 0.999), eps=1e-08)
- `mixed_precision_training`: Native AMP
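
As a quick sanity check of the values above (plain arithmetic from the listed hyperparameters; 525 is the logged optimizer steps per epoch):

```python
# Effective batch size = per-device batch size x gradient accumulation steps.
train_batch_size = 8
gradient_accumulation_steps = 2
total_train_batch_size = train_batch_size * gradient_accumulation_steps
print(total_train_batch_size)  # 16

# 525 optimizer steps per epoch at an effective batch of 16 implies
# roughly this many training samples per epoch:
steps_per_epoch = 525
approx_train_samples = steps_per_epoch * total_train_batch_size
print(approx_train_samples)  # 8400
```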

Training was conducted using PyTorch with the Hugging Face Trainer API. Metrics monitored include WER and CER.

### Training results

| Training Loss | Epoch | Step  | Validation Loss | Wer    | Cer    |
|:-------------:|:-----:|:-----:|:---------------:|:------:|:------:|
| 0.0058        | 1.0   | 525   | 0.0025          | 0.0353 | 0.0177 |
| 0.0018        | 2.0   | 1050  | 0.0031          | 0.0428 | 0.0197 |
| 0.0017        | 3.0   | 1575  | 0.0040          | 0.0511 | 0.0246 |
| 0.001         | 4.0   | 2100  | 0.0039          | 0.0469 | 0.0212 |
| 0.0013        | 5.0   | 2625  | 0.0043          | 0.0505 | 0.0240 |
| 0.0006        | 6.0   | 3150  | 0.0042          | 0.0478 | 0.0223 |
| 0.0007        | 7.0   | 3675  | 0.0049          | 0.0534 | 0.0227 |
| 0.0007        | 8.0   | 4200  | 0.0048          | 0.0552 | 0.0235 |
| 0.0005        | 9.0   | 4725  | 0.0048          | 0.0501 | 0.0218 |
| 0.0005        | 10.0  | 5250  | 0.0048          | 0.0513 | 0.0215 |
| 0.0006        | 11.0  | 5775  | 0.0055          | 0.0528 | 0.0217 |
| 0.0002        | 12.0  | 6300  | 0.0055          | 0.0542 | 0.0232 |
| 0.0003        | 13.0  | 6825  | 0.0056          | 0.0530 | 0.0238 |
| 0.0002        | 14.0  | 7350  | 0.0057          | 0.0498 | 0.0237 |
| 0.0001        | 15.0  | 7875  | 0.0057          | 0.0446 | 0.0189 |
| 0.0003        | 16.0  | 8400  | 0.0054          | 0.0567 | 0.0254 |
| 0.0002        | 17.0  | 8925  | 0.0057          | 0.0540 | 0.0256 |
| 0.0002        | 18.0  | 9450  | 0.0057          | 0.0530 | 0.0239 |
| 0.0           | 19.0  | 9975  | 0.0056          | 0.0478 | 0.0228 |
| 0.0           | 20.0  | 10500 | 0.0055          | 0.0473 | 0.0223 |
| 0.0           | 21.0  | 11025 | 0.0056          | 0.0449 | 0.0202 |
| 0.0           | 22.0  | 11550 | 0.0056          | 0.0461 | 0.0213 |
| 0.0           | 23.0  | 12075 | 0.0057          | 0.0461 | 0.0213 |
| 0.0           | 24.0  | 12600 | 0.0058          | 0.0465 | 0.0218 |
| 0.0           | 25.0  | 13125 | 0.0058          | 0.0474 | 0.0224 |
| 0.0           | 26.0  | 13650 | 0.0059          | 0.0465 | 0.0218 |
| 0.0           | 27.0  | 14175 | 0.0059          | 0.0469 | 0.0219 |
| 0.0           | 28.0  | 14700 | 0.0059          | 0.0461 | 0.0218 |
| 0.0           | 29.0  | 15225 | 0.0054          | 0.0513 | 0.0229 |
| 0.0           | 30.0  | 15750 | 0.0060          | 0.0463 | 0.0217 |

### Framework versions

- Transformers: 4.51.1
- PyTorch: 2.5.1+cu124
- Datasets: 2.20.0
- Tokenizers: 0.21.0