Improve model card: Add pipeline tag, language, project/code links, description, and usage
This PR significantly enhances the model card for `recitation-segmenter-v2` by:
* Adding the `pipeline_tag: automatic-speech-recognition` and `language: ar` metadata for better discoverability and context on the Hub.
* Including links to the associated paper ([Automatic Pronunciation Error Detection and Correction of the Holy Quran's Learners Using Deep Learning](https://huggingface.co/papers/2509.00094)), the GitHub repository (https://github.com/obadx/recitations-segmenter), and the project page (https://obadx.github.io/prepare-quran-dataset/).
* Replacing generic placeholders with a detailed model description, intended uses and limitations, and training data information, extracted from the paper abstract and GitHub README.
* Adding relevant `tags` such as `arabic`, `quran`, and `speech-segmentation`.
* Adding a `transformers`-based Python code snippet for easy inference, as provided in the original GitHub repository.
* Including a BibTeX citation for proper academic attribution.
These improvements make the model card more informative, discoverable, and user-friendly.

The updated `README.md`:

---
base_model: facebook/w2v-bert-2.0
library_name: transformers
license: mit
metrics:
- accuracy
- f1
- precision
- recall
tags:
- generated_from_trainer
- arabic
- quran
- speech-segmentation
model-index:
- name: recitation-segmenter-v2
  results: []
pipeline_tag: automatic-speech-recognition
language: ar
---

# recitation-segmenter-v2: Quranic Recitation Segmenter

This model is a fine-tuned version of [facebook/w2v-bert-2.0](https://huggingface.co/facebook/w2v-bert-2.0) for segmenting Holy Quran recitations at pause points (`waqf`). It was presented in the paper [Automatic Pronunciation Error Detection and Correction of the Holy Quran's Learners Using Deep Learning](https://huggingface.co/papers/2509.00094).

Project Page: https://obadx.github.io/prepare-quran-dataset/
GitHub Repository: https://github.com/obadx/recitations-segmenter

It achieves the following results on the evaluation set:
- Accuracy: 0.9958
- F1: 0.9964
- Loss: 0.0132
- Precision: 0.9976
- Recall: 0.9951

## Model description

The `recitation-segmenter-v2` model segments Holy Quran recitations at pause points (`waqf`) with high accuracy. It is built on a fine-tuned [Wav2Vec2Bert](https://huggingface.co/docs/transformers/model_doc/wav2vec2-bert) model that performs sequence frame-level classification at a 20-millisecond resolution. The model and its accompanying Python library are designed for high-performance processing of Quranic recitations of any number and length, from a few seconds to several hours, without performance degradation.

Key features:

* Segments Quranic recitations according to `waqf` (pause) rules.
* Trained specifically on Quranic recitations.
* High accuracy, down to 20-millisecond precision.
* Requires only ~3 GB of GPU memory.
* Processes recitations of any duration without performance loss.

The model is part of a larger effort, described in the associated paper, to bridge gaps in assessing spoken language for the Holy Quran. This includes an automated pipeline that produces high-quality Quranic datasets and a novel ASR-based approach to pronunciation error detection using a custom Quran Phonetic Script (QPS).

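To make the 20 ms frame resolution concrete, here is a minimal sketch (not the library's API; `frame_labels` is a hypothetical per-frame output, 1 = speech, 0 = silence) of how frame-level predictions collapse into speech intervals:

```python
import numpy as np

FRAME_SEC = 0.02  # each classified frame covers 20 ms


def frames_to_intervals(frame_labels: np.ndarray) -> list[tuple[float, float]]:
    """Collapse runs of consecutive speech frames into (start_sec, end_sec) intervals."""
    intervals = []
    start = None
    for i, is_speech in enumerate(frame_labels):
        if is_speech and start is None:
            start = i * FRAME_SEC  # first frame of a speech run
        elif not is_speech and start is not None:
            intervals.append((start, i * FRAME_SEC))
            start = None
    if start is not None:  # final run extends to the end of the audio
        intervals.append((start, len(frame_labels) * FRAME_SEC))
    return intervals


# 0.1 s silence, 0.2 s speech, 0.1 s silence -> one interval (0.1, 0.3)
print(frames_to_intervals(np.array([0] * 5 + [1] * 10 + [0] * 5)))
```
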
## Intended uses & limitations

This model is primarily intended for:

* Automatic segmentation of Holy Quran recitations for educational purposes or content analysis.
* Building high-quality Quranic audio databases.
* Serving as a foundational component of larger systems for pronunciation error detection and correction for Quran learners.

**Limitations**:

* The segmenter currently treats `sakt` (a very short pause without taking a breath) as a full `waqf` (stop), which may matter for advanced Tajweed analysis.
* The model is trained and optimized specifically for Quranic recitations and may not generalize to other forms of spoken Arabic.

## Training and evaluation data

The model was fine-tuned on a meticulously collected dataset of Quranic recitations. The data-collection process, described in the associated paper, is 98% automated: recitations are collected from expert reciters, segmented at pause points (`waqf`) using a fine-tuned `wav2vec2-BERT` model, transcribed, and the transcripts verified via a novel Tasmeea algorithm. The dataset comprises over 850 hours of audio (~300K annotated utterances).

The data preparation involved:

1. Downloading Quranic recitations and converting them to the Hugging Face audio dataset format at a 16000 Hz sample rate.
2. Pre-segmenting verses at pause points using `silero-vad-v4` on recitations from [everyayah.com](https://everyayah.com).
3. Post-processing (e.g., `min_silence_duration_ms`, `min_speech_duration_ms`, `pad_duration_ms`) to refine the segments, followed by manual verification to ensure high-quality divisions.
4. Applying data augmentation, including time stretching (speeding up or slowing down 40% of the recitations) and audio effects (Aliasing, AddGaussianNoise, BandPassFilter, PitchShift, RoomSimulator, etc.) from the `audiomentations` library (see the sketch after this list).
5. Normalizing audio segments to 16000 Hz and chunking them to a maximum length of 20 seconds, with a sliding-window approach for longer segments.

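As referenced in step 4, here is a minimal sketch of such an augmentation pipeline with `audiomentations`; the exact transforms and parameters used in training are not listed here, so these values are illustrative assumptions:

```python
import numpy as np
from audiomentations import AddGaussianNoise, BandPassFilter, Compose, PitchShift

# Each transform is applied independently with probability p.
augment = Compose([
    AddGaussianNoise(p=0.3),  # simulate background noise
    BandPassFilter(p=0.3),    # simulate narrow-band recording conditions
    PitchShift(p=0.3),        # simulate pitch variation across reciters
])

wave = np.random.randn(16_000).astype(np.float32)  # 1 s of dummy 16 kHz audio
augmented = augment(samples=wave, sample_rate=16_000)
```
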
The training dataset and its augmented version are available on Hugging Face:
* [Training Data](https://huggingface.co/datasets/obadx/recitation-segmentation)
* [Augmented Training Data](https://huggingface.co/datasets/obadx/recitation-segmentation-augmented)

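Both datasets can be streamed with the `datasets` library. A minimal sketch (the `train` split name is an assumption):

```python
from datasets import load_dataset

# Stream to avoid downloading hundreds of hours of audio up front.
ds = load_dataset('obadx/recitation-segmentation-augmented', split='train', streaming=True)
print(next(iter(ds)))  # inspect one example
```
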
## Usage

You can use this model through its accompanying Python library, `recitations-segmenter`, which integrates with Hugging Face `transformers`.

### Requirements

Install `ffmpeg` and `libsndfile` system-wide.

#### Linux

```bash
sudo apt-get update
sudo apt-get install -y ffmpeg libsndfile1 portaudio19-dev
```

#### Windows & Mac

You can create an `anaconda` environment and install these libraries there:

```bash
conda create -n segment python=3.12
conda activate segment
conda install -c conda-forge ffmpeg libsndfile
```

### Via pip

```bash
pip install recitations-segmenter
```

### Sample usage (Python API)

Here's a complete example of using the library from Python. A Google Colab notebook is also available: [Open in Colab](https://colab.research.google.com/drive/1-RuRQOj4l2MA_SG2p4m-afR7MAsT5I22?usp=sharing)

```python
from pathlib import Path

import torch
from recitations_segmenter import segment_recitations, read_audio, clean_speech_intervals
from transformers import AutoFeatureExtractor, AutoModelForAudioFrameClassification

if __name__ == '__main__':
    device = torch.device('cuda')
    dtype = torch.bfloat16

    processor = AutoFeatureExtractor.from_pretrained(
        "obadx/recitation-segmenter-v2")
    model = AutoModelForAudioFrameClassification.from_pretrained(
        "obadx/recitation-segmenter-v2",
    )
    model.to(device, dtype=dtype)

    # Change these to the paths of your Holy Quran recitation files.
    file_pathes = [
        './assets/dussary_002282.mp3',
        './assets/hussary_053001.mp3',
    ]
    waves = [read_audio(p) for p in file_pathes]

    # Extract speech intervals, measured in samples at a 16000 Hz sample rate.
    sampled_outputs = segment_recitations(
        waves,
        model,
        processor,
        device=device,
        dtype=dtype,
        batch_size=8,
    )

    for out, path in zip(sampled_outputs, file_pathes):
        # Clean the speech intervals by:
        # * merging short silences
        # * removing too-short speech segments
        # * padding every speech interval
        # Raises:
        # * NoSpeechIntervals: if the wave is complete silence
        # * TooHighMinSpeechDruation: if `min_speech_duration` is so high
        #   that every speech interval would be deleted
        clean_out = clean_speech_intervals(
            out.speech_intervals,
            out.is_complete,
            min_silence_duration_ms=30,
            min_speech_duration_ms=30,
            pad_duration_ms=30,
            return_seconds=True,
        )

        print(f'Speech Intervals of: {Path(path).name}: ')
        print(clean_out.clean_speech_intervals)
        print(f'Is Recitation Complete: {clean_out.is_complete}')
        print('-' * 40)
```

## Training procedure

The model was trained as a `Wav2Vec2BertForAudioFrameClassification` model using the `transformers` library. The motivation, methodology, and setup are described in more detail in the "تفاصيل التدريب" (Training Details) section of the GitHub repository.

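A rough sketch of this setup, under stated assumptions (not the actual training script; `num_labels=2` for silence vs. speech is assumed):

```python
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2BertForAudioFrameClassification

processor = AutoFeatureExtractor.from_pretrained('facebook/w2v-bert-2.0')
model = Wav2Vec2BertForAudioFrameClassification.from_pretrained(
    'facebook/w2v-bert-2.0',
    num_labels=2,  # assumed labels: 0 = silence/pause (waqf), 1 = speech
)

wave = torch.zeros(16_000).numpy()  # 1 s of dummy 16 kHz audio
inputs = processor(wave, sampling_rate=16_000, return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, n_frames, num_labels)
frame_preds = logits.argmax(dim=-1)  # one prediction per ~20 ms frame
```
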
### Training hyperparameters

The following hyperparameters were used during training:

…

### Training results

| Training Loss | Epoch  | Step | Accuracy | F1     | Validation Loss | Precision | Recall |
|:-------------:|:------:|:----:|:--------:|:------:|:---------------:|:---------:|:------:|
| 0.0234        | 0.5014 | 550  | 0.9953   | 0.9959 | 0.0185          | 0.9940    | 0.9977 |
| 0.0186        | 0.7521 | 825  | 0.9958   | 0.9964 | 0.0132          | 0.9976    | 0.9951 |

### Framework versions

- Transformers 4.51.3
- Pytorch 2.2.1+cu121
- Datasets 3.5.0
- Tokenizers 0.21.1

## Citation

If you find our work helpful or inspiring, please feel free to cite it:

```bibtex
@article{ibrahim2025automatic,
  title={Automatic Pronunciation Error Detection and Correction of the Holy Quran's Learners Using Deep Learning},
  author={Ibrahim, Obad and El-Sayed, Tamer and El-Din, Sherif Amin},
  journal={arXiv preprint arXiv:2509.00094},
  year={2025}
}
```