---
library_name: transformers
language:
- ar
license: apache-2.0
base_model: openai/whisper-base
tags:
- generated_from_trainer
- arabic
- automatic-speech-recognition
- quran
- whisper
metrics:
- wer
- cer
model-index:
- name: Whisper base AR - YA
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Quran Ayat Speech-to-Text
      type: audio
    metrics:
    - name: WER (Validation)
      type: wer
      value: 0.0405
    - name: CER (Validation)
      type: cer
      value: 0.0195
    - name: WER (Test)
      type: wer
      value: 0.082
    - name: CER (Test)
      type: cer
      value: 0.0327
pipeline_tag: automatic-speech-recognition
---

# Whisper base AR - YA

This model is a fine-tuned version of [openai/whisper-base](https://huggingface.co/openai/whisper-base) on an Arabic Quran recitation dataset focused on verse-level speech-to-text transcription. The goal was to create a lightweight ASR system that can accurately transcribe Quranic audio into Arabic text, optimized for clear, male recitation audio.

It achieves the following results:

- **Validation set:**
  - **Loss**: 0.0023
  - **WER (Word Error Rate)**: 4.05%
  - **CER (Character Error Rate)**: 1.95%
- **Test set:**
  - **WER (Word Error Rate)**: 8.2%
  - **CER (Character Error Rate)**: 3.27%

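
For quick experimentation, inference can be sketched as below. The Hub repo id is a placeholder (this card does not state the published id), and the `generate_kwargs` language/task hints are an assumption based on common Whisper fine-tuning practice:

```python
def transcribe(audio_path: str, model_id: str = "your-username/whisper-base-ar-ya") -> str:
    """Transcribe one verse-level audio clip with the fine-tuned checkpoint.

    `model_id` is a placeholder; point it at this model's actual Hub repo.
    """
    from transformers import pipeline  # deferred so this module imports without transformers

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        # Pin decoding to Arabic transcription (assumed; matches the training domain).
        generate_kwargs={"language": "arabic", "task": "transcribe"},
    )
    return asr(audio_path)["text"]
```

The deferred import keeps the snippet copy-pasteable even where `transformers` is not yet installed.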
## Model description

This model builds upon OpenAI's Whisper base architecture and is fine-tuned specifically for Modern Standard Arabic, with a focus on Quranic verses. Audio samples were cleaned, resampled to 16 kHz, and aligned with their text for training.

Whisper is an encoder-decoder model, so fine-tuning used the standard supervised sequence-to-sequence cross-entropy objective rather than CTC. The fine-tuned model is suitable for inference in streaming or batch-based ASR systems, and Whisper's multilingual pretraining was leveraged to build a domain-specific Arabic transcription model.

## Intended uses & limitations

### Intended uses

- Speech recognition for Arabic Quran recitations
- Educational tools and Quran learning applications
- Mobile-friendly deployment of ASR for religious audio content
- Fine-tuning or distillation for low-resource Arabic ASR projects

### Limitations

- Optimized for clear, male Quran recitation; performance may degrade on female voices or conversational Arabic
- Not designed for dialectal or informal speech
- Background noise or overlapping speakers may reduce accuracy

## Training and evaluation data

The dataset consists of verse-level Quran recitations in Arabic. The recordings come primarily from male speakers reciting with clear tajweed (recitation rules), each aligned to its corresponding Arabic text.

Audio files were resampled to 16 kHz and normalized for Whisper compatibility.
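
The resample-and-normalize step can be sketched as follows. This is a minimal NumPy-only version for illustration (linear interpolation plus peak normalization); a real pipeline would typically use a proper resampler such as those in `torchaudio` or `librosa`:

```python
import numpy as np

TARGET_SR = 16_000  # Whisper expects 16 kHz mono input

def resample_and_normalize(audio: np.ndarray, orig_sr: int) -> np.ndarray:
    """Linearly resample a mono waveform to 16 kHz and peak-normalize it."""
    n_out = int(round(len(audio) * TARGET_SR / orig_sr))
    # Linear interpolation onto the new time grid (crude but dependency-free).
    t_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
    t_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    resampled = np.interp(t_new, t_old, audio)
    # Peak-normalize so the loudest sample sits at +/-1.0.
    peak = np.max(np.abs(resampled))
    return resampled / peak if peak > 0 else resampled
```
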

Evaluation was conducted on both a held-out validation set and a separate test set to assess generalization.

## Training procedure

### Training hyperparameters

- `learning_rate`: 0.0001
- `train_batch_size`: 8
- `eval_batch_size`: 8
- `gradient_accumulation_steps`: 2
- `total_train_batch_size`: 16
- `num_train_epochs`: 30
- `seed`: 42
- `lr_scheduler_type`: linear
- `lr_scheduler_warmup_steps`: 500
- `optimizer`: AdamW (betas=(0.9, 0.999), eps=1e-08)
- `mixed_precision_training`: Native AMP
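
These settings map onto Hugging Face `Seq2SeqTrainingArguments` roughly as below. This is a sketch: the output directory is illustrative, and any argument not listed above is left at its default:

```python
def build_training_args(output_dir: str = "./whisper-base-ar-ya"):
    """Mirror the hyperparameters above as Seq2SeqTrainingArguments (sketch)."""
    from transformers import Seq2SeqTrainingArguments  # deferred import

    return Seq2SeqTrainingArguments(
        output_dir=output_dir,          # illustrative path
        learning_rate=1e-4,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        gradient_accumulation_steps=2,  # effective train batch size: 8 * 2 = 16
        num_train_epochs=30,
        seed=42,
        lr_scheduler_type="linear",
        warmup_steps=500,
        fp16=True,                      # Native AMP mixed precision
    )
```
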

Training was conducted in PyTorch using the Hugging Face Trainer API. Metrics monitored include WER and CER.
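
Both metrics are edit-distance rates: word-level edits over reference word count (WER) and character-level edits over reference length (CER). A minimal reference implementation, assumed equivalent to the library metric actually used during training (e.g. `evaluate`'s `wer`/`cer`):

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (r != h),    # substitution (0 if equal)
            ))
        prev = curr
    return prev[-1]

def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: word-level edits / number of reference words."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: character-level edits / reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)
```
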

### Training results

_Note: the training loss shown is taken from the last batch of each epoch, not averaged over all batches._

| Training Loss | Epoch | Step  | Validation Loss | WER    | CER    |
|:-------------:|:-----:|:-----:|:---------------:|:------:|:------:|
| 0.0058        | 1.0   | 525   | 0.0025          | 0.0353 | 0.0177 |
| 0.0018        | 2.0   | 1050  | 0.0031          | 0.0428 | 0.0197 |
| 0.0017        | 3.0   | 1575  | 0.0040          | 0.0511 | 0.0246 |
| 0.001         | 4.0   | 2100  | 0.0039          | 0.0469 | 0.0212 |
| 0.0013        | 5.0   | 2625  | 0.0043          | 0.0505 | 0.0240 |
| 0.0006        | 6.0   | 3150  | 0.0042          | 0.0478 | 0.0223 |
| 0.0007        | 7.0   | 3675  | 0.0049          | 0.0534 | 0.0227 |
| 0.0007        | 8.0   | 4200  | 0.0048          | 0.0552 | 0.0235 |
| 0.0005        | 9.0   | 4725  | 0.0048          | 0.0501 | 0.0218 |
| 0.0005        | 10.0  | 5250  | 0.0048          | 0.0513 | 0.0215 |
| 0.0006        | 11.0  | 5775  | 0.0055          | 0.0528 | 0.0217 |
| 0.0002        | 12.0  | 6300  | 0.0055          | 0.0542 | 0.0232 |
| 0.0003        | 13.0  | 6825  | 0.0056          | 0.0530 | 0.0238 |
| 0.0002        | 14.0  | 7350  | 0.0057          | 0.0498 | 0.0237 |
| 0.0001        | 15.0  | 7875  | 0.0057          | 0.0446 | 0.0189 |
| 0.0003        | 16.0  | 8400  | 0.0054          | 0.0567 | 0.0254 |
| 0.0002        | 17.0  | 8925  | 0.0057          | 0.0540 | 0.0256 |
| 0.0002        | 18.0  | 9450  | 0.0057          | 0.0530 | 0.0239 |
| 0.0           | 19.0  | 9975  | 0.0056          | 0.0478 | 0.0228 |
| 0.0           | 20.0  | 10500 | 0.0055          | 0.0473 | 0.0223 |
| 0.0           | 21.0  | 11025 | 0.0056          | 0.0449 | 0.0202 |
| 0.0           | 22.0  | 11550 | 0.0056          | 0.0461 | 0.0213 |
| 0.0           | 23.0  | 12075 | 0.0057          | 0.0461 | 0.0213 |
| 0.0           | 24.0  | 12600 | 0.0058          | 0.0465 | 0.0218 |
| 0.0           | 25.0  | 13125 | 0.0058          | 0.0474 | 0.0224 |
| 0.0           | 26.0  | 13650 | 0.0059          | 0.0465 | 0.0218 |
| 0.0           | 27.0  | 14175 | 0.0059          | 0.0469 | 0.0219 |
| 0.0           | 28.0  | 14700 | 0.0059          | 0.0461 | 0.0218 |
| 0.0           | 29.0  | 15225 | 0.0054          | 0.0513 | 0.0229 |
| 0.0           | 30.0  | 15750 | 0.0060          | 0.0463 | 0.0217 |

### Framework versions

- Transformers: 4.51.1
- PyTorch: 2.5.1+cu124
- Datasets: 2.20.0
- Tokenizers: 0.21.0