metadata
title: Quranic Universal Aligner
emoji: 🎯
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.5.1
app_file: app.py
pinned: false
short_description: Segment recitations and extract text and word timestamps
license: mit
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/684abe5b6327ae8863d106d2/O-S42Itgk6PbM-xaxgxcD.png
Quranic Universal Aligner
Automatic forced alignment for Quran recitations. Upload an audio recording of any surah and get word-level timestamps aligned to the Quranic text.
What it does
- Voice Activity Detection: Detects speech regions in the audio using a custom VAD model
- Phoneme ASR: Recognizes phonemes in each speech segment using wav2vec2 CTC models
- Anchor Detection: Identifies the chapter and verse where the recitation starts using n-gram voting
- DP Alignment: Matches recognized phonemes against the known Quranic text using substring Levenshtein dynamic programming with word-boundary constraints
- Special Segment Detection: Identifies Basmala and Isti'adha segments that precede surahs
- Word Timestamps (optional): Submits aligned segments to an external MFA service for precise word-level, letter-level, and phoneme-level timing
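The anchor detection step can be pictured with a small sketch. This is not the Space's actual code: the function names, the trigram size, and the toy phoneme strings below are all assumptions chosen for illustration. Each phoneme n-gram in the recognized audio casts one vote for every verse whose reference phonemes contain it, and the verse with the most votes becomes the anchor.

```python
from collections import Counter, defaultdict

def build_ngram_index(verses, n=3):
    """Map each phoneme n-gram to the set of verse keys that contain it."""
    index = defaultdict(set)
    for key, phonemes in verses.items():
        for i in range(len(phonemes) - n + 1):
            index[tuple(phonemes[i:i + n])].add(key)
    return index

def vote_anchor(recognized, index, n=3):
    """Return the verse key with the most n-gram votes, or None if nothing matches."""
    votes = Counter()
    for i in range(len(recognized) - n + 1):
        for key in index.get(tuple(recognized[i:i + n]), ()):
            votes[key] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]

# Toy reference: hypothetical phoneme sequences keyed by (chapter, verse).
verses = {
    (1, 1): list("bismillah"),
    (1, 2): list("alhamdulillah"),
    (112, 1): list("qulhuwallahu"),
}
index = build_ngram_index(verses)
anchor = vote_anchor(list("alhamdu"), index)  # (1, 2)
```

Voting on short n-grams tolerates scattered recognition errors, because a few corrupted n-grams only remove a few votes from the correct verse.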
Models
| Model | Purpose |
|---|---|
| obadx/recitation-segmenter-v2 | Voice activity detection |
| hetchyy/r15_95m | Phoneme ASR (Base: 95M params, faster) |
| hetchyy/r7 | Phoneme ASR (Large: higher accuracy) |
| hetchyy/Quran-phoneme-mfa | MFA forced alignment (external Space) |
How it works
Alignment algorithm
The core alignment uses substring Levenshtein DP with word-boundary constraints:
- A sliding window of reference words constrains the search space around the expected position
- DP start positions must align with word boundaries; only word-end positions are evaluated as match candidates
- A position prior biases toward sequential matching, penalizing jumps
- Custom phoneme substitution costs account for phonetically similar sounds
- Confidence = 1 - normalized edit distance (green ≥ 80%, yellow 60-79%, red < 60%)
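The points above can be sketched as a small dynamic program. This is a minimal illustration under stated assumptions, not the Space's implementation: the `SIMILAR` cost table, the function names, and the choice to normalize by the longer of the query and the matched span are all hypothetical, and the position prior is left out for brevity. The `ref_words` argument stands in for the sliding window of reference words.

```python
import math

# Hypothetical reduced costs for phonetically similar pairs (illustrative only).
SIMILAR = {frozenset(("s", "S")): 0.5, frozenset(("t", "T")): 0.5}

def sub_cost(a, b):
    if a == b:
        return 0.0
    return SIMILAR.get(frozenset((a, b)), 1.0)

def align(query, ref_words):
    """Find the run of whole reference words that best matches `query`.

    Returns (first_word, last_word, confidence) with inclusive word indices,
    or None if the inputs are empty or no non-empty span is the best match.
    Assumes every reference word is non-empty.
    """
    ref, starts, ends = [], {}, {}
    for w, word in enumerate(ref_words):
        starts[len(ref)] = w   # flat position where word w begins
        ref.extend(word)
        ends[len(ref)] = w     # flat position just past word w
    if not ref or not query:
        return None
    m, n = len(query), len(ref)
    # dist[i][j]: cheapest cost of matching query[:i] to ref[s:j], s a word start.
    # origin[i][j]: the word start s that achieves that cost.
    dist = [[math.inf] * (n + 1) for _ in range(m + 1)]
    origin = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        if j in starts:        # a match may begin for free at a word boundary
            dist[0][j], origin[0][j] = 0.0, j
        else:                  # otherwise pay to skip reference phonemes
            dist[0][j], origin[0][j] = dist[0][j - 1] + 1.0, origin[0][j - 1]
    for i in range(1, m + 1):
        for j in range(n + 1):
            best, src = dist[i - 1][j] + 1.0, origin[i - 1][j]   # extra query phoneme
            if j > 0:
                skip = dist[i][j - 1] + 1.0                      # missing reference phoneme
                if skip < best:
                    best, src = skip, origin[i][j - 1]
                diag = dist[i - 1][j - 1] + sub_cost(query[i - 1], ref[j - 1])
                if diag < best:
                    best, src = diag, origin[i - 1][j - 1]
            dist[i][j], origin[i][j] = best, src
    end = min(ends, key=lambda j: dist[m][j])   # only word ends are candidates
    start = origin[m][end]
    if end == start:
        return None
    confidence = max(0.0, 1.0 - dist[m][end] / max(m, end - start))
    return starts[start], ends[end], confidence

def band(confidence):
    """Color band using the thresholds stated above."""
    if confidence >= 0.80:
        return "green"
    if confidence >= 0.60:
        return "yellow"
    return "red"

ref_words = [list("bismi"), list("llahi"), list("rrahmani"), list("rrahimi")]
exact = align(list("llahirrahmani"), ref_words)   # (1, 2, 1.0)
noisy = align(list("llahirrahmni"), ref_words)    # one phoneme dropped
```

Restricting free starts to word beginnings and candidates to word ends guarantees that every returned match covers whole words, which is what lets the result be mapped back to verse text.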
Retry and recovery
When alignment fails for a segment:
- Tier 1: Expanded search window
- Tier 2: Expanded window + relaxed confidence threshold (0.45)
- Re-anchoring: After 2 consecutive failures, n-gram voting re-localizes position within the surah
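The escalation logic can be sketched as a loop over attempt settings. Only the relaxed threshold of 0.45 and the two-failure trigger come from the description above; the window sizes, the base threshold of 0.60, the callback signatures, and the choice to re-anchor on the segment that follows the failures are assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Attempt:
    window: int       # reference words searched (hypothetical units)
    threshold: float  # minimum confidence needed to accept a match

# Placeholder settings, except the 0.45 relaxed threshold.
BASE = Attempt(window=20, threshold=0.60)
TIER_1 = Attempt(window=60, threshold=0.60)
TIER_2 = Attempt(window=60, threshold=0.45)
REANCHOR_AFTER = 2    # consecutive failures before re-anchoring

def align_with_retry(segments, try_align, reanchor):
    """Align segments in order, escalating tiers and re-anchoring after repeated failure.

    try_align(segment, position, window) -> (confidence, new_position)
    reanchor(segment) -> position
    Returns a list of (segment, confidence or None).
    """
    position, failures, results = 0, 0, []
    for seg in segments:
        if failures >= REANCHOR_AFTER:
            position, failures = reanchor(seg), 0
        accepted = None
        for attempt in (BASE, TIER_1, TIER_2):
            confidence, new_position = try_align(seg, position, attempt.window)
            if confidence >= attempt.threshold:
                accepted = (confidence, new_position)
                break
        if accepted is not None:
            failures = 0
            position = accepted[1]
            results.append((seg, accepted[0]))
        else:
            failures += 1
            results.append((seg, None))
    return results
```

Here `try_align` and `reanchor` are injected so the control flow can be exercised with stubs; in a real pipeline they would wrap the DP alignment and the n-gram voting step.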
Animation
Two playback modes with real-time word highlighting:
- Per-segment: Animate a single aligned segment with word/character-level karaoke
- Mega card: Unified text flow across all segments with click-to-seek and configurable opacity windowing (Reveal, Fade, Spotlight, Isolate, Consume modes)
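Both modes depend on mapping a playback time to the word being recited, and on mapping a clicked word back to a time. The sketch below shows one way to do that with a binary search. The `(start, end, text)` tuple layout, the function names, and the toy timestamps are assumptions, and the Space's front end may work quite differently.

```python
from bisect import bisect_right

def active_word(words, t):
    """Index of the word being recited at time t (seconds), or None in a gap.

    `words` is a list of (start, end, text) tuples sorted by start time.
    """
    starts = [w[0] for w in words]
    i = bisect_right(starts, t) - 1   # last word that starts at or before t
    if i >= 0 and t < words[i][1]:
        return i
    return None

def seek_time(words, index):
    """Click-to-seek: playback position for a clicked word."""
    return words[index][0]

# Toy timestamps, not real alignment output.
words = [(0.00, 0.42, "bismi"), (0.42, 0.95, "llahi"), (1.10, 1.80, "rrahmani")]
current = active_word(words, 0.5)
```

Returning `None` inside a gap lets the caller decide whether to keep the previous word highlighted or clear the highlight during pauses.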