|
|
--- |
|
|
base_model: facebook/w2v-bert-2.0 |
|
|
library_name: transformers |
|
|
license: mit |
|
|
metrics: |
|
|
- accuracy |
|
|
- f1 |
|
|
- precision |
|
|
- recall |
|
|
tags: |
|
|
- generated_from_trainer |
|
|
- arabic |
|
|
- quran |
|
|
- speech-segmentation |
|
|
model-index: |
|
|
- name: recitation-segmenter-v2 |
|
|
results: [] |
|
|
pipeline_tag: automatic-speech-recognition |
|
|
language: ar |
|
|
--- |
|
|
|
|
|
# recitation-segmenter-v2: Quranic Recitation Segmenter |
|
|
|
|
|
This model is a fine-tuned version of [facebook/w2v-bert-2.0](https://huggingface.co/facebook/w2v-bert-2.0) for segmenting Holy Quran recitations based on pause points (waqf). It was presented in the paper [Automatic Pronunciation Error Detection and Correction of the Holy Quran's Learners Using Deep Learning](https://huggingface.co/papers/2509.00094). |
|
|
|
|
|
Project Page: https://obadx.github.io/prepare-quran-dataset/ |
|
|
GitHub Repository: https://github.com/obadx/recitations-segmenter |
|
|
|
|
|
It achieves the following results on the evaluation set: |
|
|
- Accuracy: 0.9958 |
|
|
- F1: 0.9964 |
|
|
- Loss: 0.0132 |
|
|
- Precision: 0.9976 |
|
|
- Recall: 0.9951 |
|
|
|
|
|
## Model description |
|
|
|
|
|
The `recitation-segmenter-v2` model segments Holy Quran recitations at pause points (`waqf`) with high accuracy. It is a fine-tuned [Wav2Vec2Bert](https://huggingface.co/docs/transformers/model_doc/wav2vec2-bert) model that performs frame-level sequence classification at a 20-millisecond resolution. The model and its accompanying Python library are designed for high-performance processing of any number of Quranic recitations of any length, from a few seconds to several hours, without performance degradation.
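To make the frame-level formulation concrete, here is a minimal sketch that loads the model directly through `transformers` and inspects the raw per-frame logits. This is an illustration only (the dummy waveform and the binary speech/silence reading of the logits are assumptions); for real use, prefer the `recitations-segmenter` library shown under Usage below.

```python
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioFrameClassification

# Minimal sketch: obtain one classification decision per ~20 ms audio frame.
# Treating the logits as binary speech/silence labels is an assumption here;
# the accompanying library converts these frames into speech intervals for you.
processor = AutoFeatureExtractor.from_pretrained("obadx/recitation-segmenter-v2")
model = AutoModelForAudioFrameClassification.from_pretrained("obadx/recitation-segmenter-v2")

wave = torch.zeros(16000 * 4)  # 4 seconds of dummy audio at 16000 Hz
inputs = processor(wave.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_frames, num_labels)

frame_labels = logits.argmax(dim=-1)[0]  # one label per ~20 ms frame
print(frame_labels.shape)
```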
|
|
|
|
|
Key Features: |
|
|
* Segments Quranic recitations according to `waqf` (pause rules). |
|
|
* Specifically trained for Quranic recitations. |
|
|
* High temporal precision, down to 20 milliseconds.
|
|
* Requires only ~3 GB of GPU memory. |
|
|
* Capable of processing recitations of any duration without performance loss. |
|
|
|
|
|
The model is part of a larger effort described in the associated paper, aiming to bridge gaps in assessing spoken language for the Holy Quran. This includes an automated pipeline to produce high-quality Quranic datasets and a novel ASR-based approach for pronunciation error detection using a custom Quran Phonetic Script (QPS). |
|
|
|
|
|
## Intended uses & limitations |
|
|
|
|
|
This model is primarily intended for: |
|
|
* Automatic segmentation of Holy Quran recitations for educational purposes or content analysis. |
|
|
* Building high-quality Quranic audio databases. |
|
|
* As a foundational component for larger systems focused on pronunciation error detection and correction for Quran learners. |
|
|
|
|
|
**Limitations**: |
|
|
* The segmenter currently treats `sakt` (a very short pause without taking a breath) as a full `waqf` (stop), which may matter for advanced Tajweed analysis.
|
|
* The model is specifically trained and optimized for Quranic recitations and might not generalize well to other forms of spoken Arabic. |
|
|
|
|
|
## Training and evaluation data |
|
|
|
|
|
The model was fine-tuned on a meticulously collected dataset of Quranic recitations. The data collection process, described in the associated paper, involved a 98% automated pipeline including collection from expert reciters, segmentation at pause points (`waqf`) using a fine-tuned `wav2vec2-BERT` model, transcription of segments, and transcript verification via a novel Tasmeea algorithm. The dataset comprises over 850 hours of audio (~300K annotated utterances). |
|
|
|
|
|
The data preparation involved: |
|
|
1. Downloading Quranic recitations and converting them to the Hugging Face Audio Dataset format at a 16000 Hz sample rate.
|
|
2. Pre-segmenting verses from [everyayah.com](https://everyayah.com) at pause points using `silero-vad-v4`.
|
|
3. Applying post-processing (e.g., `min_silence_duration_ms`, `min_speech_duration_ms`, `pad_duration_ms`) to refine the segments, followed by manual verification to ensure high-quality divisions.
|
|
4. Applying data augmentation, including time stretching (speeding up or slowing down 40% of the recitations) and various audio effects (Aliasing, AddGaussianNoise, BandPassFilter, PitchShift, RoomSimulator, etc.) using the `audiomentations` library; see the sketch after this list.
|
|
5. Normalizing audio segments to 16000 Hz and chunking them to a maximum length of 20 seconds, using a sliding-window approach for longer segments.
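As a rough illustration of the augmentation in step 4, the sketch below assembles an `audiomentations` pipeline with some of the effects named above. The probabilities and parameter ranges are illustrative assumptions, not the values used to build the actual dataset.

```python
import numpy as np
from audiomentations import (
    AddGaussianNoise, BandPassFilter, Compose, PitchShift, TimeStretch,
)

# Illustrative augmentation pipeline; probabilities and ranges are assumptions.
augment = Compose([
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.4),  # speed up / slow down
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.3),
    BandPassFilter(p=0.2),
    PitchShift(min_semitones=-4, max_semitones=4, p=0.2),
])

samples = np.zeros(16000 * 5, dtype=np.float32)  # 5 s of dummy 16000 Hz audio
augmented = augment(samples=samples, sample_rate=16000)
```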
|
|
|
|
|
The training dataset and its augmented version are available on Hugging Face: |
|
|
* [Training Data](https://huggingface.co/datasets/obadx/recitation-segmentation) |
|
|
* [Augmented Training Data](https://huggingface.co/datasets/obadx/recitation-segmentation-augmented) |
|
|
|
|
|
## Usage |
|
|
|
|
|
You can use this model with its accompanying Python library, `recitations-segmenter`, which integrates with Hugging Face `transformers`. |
|
|
|
|
|
|
|
|
|
|
### Requirements |
|
|
|
|
|
Install `ffmpeg` and `libsndfile` system-wide.
|
|
|
|
|
#### Linux |
|
|
|
|
|
```bash |
|
|
sudo apt-get update |
|
|
sudo apt-get install -y ffmpeg libsndfile1 portaudio19-dev |
|
|
``` |
|
|
|
|
|
#### Windows & Mac |
|
|
|
|
|
You can create an `anaconda` environment and then install these libraries: |
|
|
|
|
|
```bash |
|
|
conda create -n segment python=3.12 |
|
|
conda activate segment |
|
|
conda install -c conda-forge ffmpeg libsndfile |
|
|
``` |
|
|
|
|
|
### Via pip |
|
|
|
|
|
```bash |
|
|
pip install recitations-segmenter |
|
|
``` |
|
|
|
|
|
### Sample usage (Python API) |
|
|
|
|
|
Here's a complete example of using the library in Python. A Google Colab notebook is also available: [Open in Colab](https://colab.research.google.com/drive/1-RuRQOj4l2MA_SG2p4m-afR7MAsT5I22?usp=sharing)
|
|
|
|
|
```python |
|
|
from pathlib import Path |
|
|
|
|
|
from recitations_segmenter import segment_recitations, read_audio, clean_speech_intervals |
|
|
from transformers import AutoFeatureExtractor, AutoModelForAudioFrameClassification |
|
|
import torch |
|
|
|
|
|
if __name__ == '__main__': |
|
|
device = torch.device('cuda') |
|
|
dtype = torch.bfloat16 |
|
|
|
|
|
processor = AutoFeatureExtractor.from_pretrained( |
|
|
"obadx/recitation-segmenter-v2") |
|
|
model = AutoModelForAudioFrameClassification.from_pretrained( |
|
|
"obadx/recitation-segmenter-v2", |
|
|
) |
|
|
|
|
|
model.to(device, dtype=dtype) |
|
|
|
|
|
    # Replace these with the paths to your Holy Quran recitation files
    file_paths = [
|
|
'./assets/dussary_002282.mp3', |
|
|
'./assets/hussary_053001.mp3', |
|
|
] |
|
|
    waves = [read_audio(p) for p in file_paths]
|
|
|
|
|
    # Extract speech intervals, measured in samples at a 16000 Hz sample rate
|
|
sampled_outputs = segment_recitations( |
|
|
waves, |
|
|
model, |
|
|
processor, |
|
|
device=device, |
|
|
dtype=dtype, |
|
|
batch_size=8, |
|
|
) |
|
|
|
|
|
    for out, path in zip(sampled_outputs, file_paths):
|
|
        # Clean the speech intervals by:
        # * merging short silence durations
        # * removing short speech durations
        # * adding padding to each speech interval
        # Raises:
        # * NoSpeechIntervals: if the wave is complete silence
        # * TooHighMinSpeechDruation: if `min_speech_duration` is too high,
        #   which results in deleting all speech intervals
|
|
clean_out = clean_speech_intervals( |
|
|
out.speech_intervals, |
|
|
out.is_complete, |
|
|
min_silence_duration_ms=30, |
|
|
min_speech_duration_ms=30, |
|
|
pad_duration_ms=30, |
|
|
return_seconds=True, |
|
|
) |
|
|
|
|
|
        print(f'Speech intervals of {Path(path).name}:')
|
|
print(clean_out.clean_speech_intervals) |
|
|
print(f'Is Recitation Complete: {clean_out.is_complete}') |
|
|
print('-' * 40) |
|
|
``` |
|
|
|
|
|
## Training procedure |
|
|
|
|
|
The model was trained with `Wav2Vec2BertForAudioFrameClassification` from the `transformers` library. More detailed motivation, methodology, and setup can be found in the GitHub repository's "تفاصيل التدريب" (Training Details) section.
|
|
|
|
|
### Training hyperparameters |
|
|
|
|
|
The following hyperparameters were used during training (see the `TrainingArguments` sketch after this list):
|
|
- learning_rate: 5e-05 |
|
|
- train_batch_size: 50 |
|
|
- eval_batch_size: 64 |
|
|
- seed: 42 |
|
|
- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
|
|
- lr_scheduler_type: constant |
|
|
- lr_scheduler_warmup_ratio: 0.2 |
|
|
- num_epochs: 1 |
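For orientation, here is a minimal sketch of how these settings map onto Hugging Face `TrainingArguments`. The `output_dir` is an assumed placeholder, and the actual training script lives in the GitHub repository.

```python
from transformers import TrainingArguments

# Sketch of the listed hyperparameters as TrainingArguments;
# output_dir is an assumed placeholder, not the authors' value.
training_args = TrainingArguments(
    output_dir="recitation-segmenter-v2",
    learning_rate=5e-5,
    per_device_train_batch_size=50,
    per_device_eval_batch_size=64,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="constant",
    warmup_ratio=0.2,
    num_train_epochs=1,
)
```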
|
|
|
|
|
### Training results |
|
|
|
|
|
| Training Loss | Epoch | Step | Accuracy | F1 | Validation Loss | Precision | Recall | |
|
|
|:-------------:|:------:|:----:|:--------:|:------:|:---------------:|:---------:|:------:| |
|
|
| 0.0701 | 0.2507 | 275 | 0.9953 | 0.9959 | 0.0249 | 0.9947 | 0.9971 | |
|
|
| 0.0234 | 0.5014 | 550 | 0.9953 | 0.9959 | 0.0185 | 0.9940 | 0.9977 | |
|
|
| 0.0186 | 0.7521 | 825 | 0.9958 | 0.9964 | 0.0132 | 0.9976 | 0.9951 | |
|
|
|
|
|
### Framework versions |
|
|
|
|
|
- Transformers 4.51.3 |
|
|
- Pytorch 2.2.1+cu121 |
|
|
- Datasets 3.5.0 |
|
|
- Tokenizers 0.21.1 |
|
|
|
|
|
## Citation |
|
|
If you find our work helpful or inspiring, please feel free to cite it. |
|
|
|
|
|
```bibtex |
|
|
@article{ibrahim2025automatic, |
|
|
title={Automatic Pronunciation Error Detection and Correction of the Holy Quran's Learners Using Deep Learning}, |
|
|
  author={Abdullah Abdelfattah and Mahmoud I. Khalil and Hazem M. Abbas},
|
|
journal={arXiv preprint arXiv:2509.00094}, |
|
|
year={2025} |
|
|
} |
|
|
``` |