
# Client API Reference

- [Quick Start](#quick-start)
- [Sessions](#sessions)
- [Alignment Endpoints](#alignment-endpoints) — `/process_audio_session`, `/process_url_session`, `/resegment`, `/retranscribe`, `/realign_from_timestamps`
- [Word Timestamps](#word-timestamps) — `/timestamps`, `/timestamps_direct`
- [Utilities](#utilities) — `/estimate_duration`
- [Response Reference](#response-reference) — segment fields, special types, word arrays, GPU warning, errors

## API Changelog

**30/03/2026**
- New `/process_url_session` endpoint: pass a URL (YouTube, SoundCloud, MP3Quran, etc.) instead of uploading audio

**29/03/2026**
- API calls now skip HTML rendering and audio file I/O, returning JSON faster


---

## GPU Usage & Access

- **Free Tier:** Every user receives a **free daily GPU quota**. Once it is exhausted, you can keep using every endpoint on CPU with no limit.
- **Unlimited GPU Access:** If you need unlimited GPU access through the API (e.g., for high-volume or production use), please get in touch to arrange a payment plan and higher limits.
- **Note:** CPU processing is always available and unlimited, but much slower. When the GPU quota is exceeded, requests are automatically routed to CPU and a `"warning"` field appears in the response.

## Quick Start

```python
from gradio_client import Client

client = Client("https://hetchyy-quran-multi-aligner.hf.space")

# Or pass your HF token to use your own account's ZeroGPU quota
client = Client("https://hetchyy-quran-multi-aligner.hf.space", token="hf_...")

# Full pipeline
result = client.predict(
    "recitation.mp3",   # audio file path
    200,                # min_silence_ms
    1000,               # min_speech_ms
    100,                # pad_ms
    "Base",             # model_name
    "GPU",              # device
    api_name="/process_audio_session"
)
audio_id = result["audio_id"]

# Re-segment with different params (reuses cached audio)
result = client.predict(audio_id, 600, 1500, 300, "Base", "GPU", api_name="/resegment")

# Re-transcribe with a different model (reuses cached segments)
result = client.predict(audio_id, "Large", "GPU", api_name="/retranscribe")

# Realign with custom timestamps
result = client.predict(
    audio_id,
    [{"start": 0.5, "end": 3.2}, {"start": 3.8, "end": 7.1}],
    "Base", "GPU",
    api_name="/realign_from_timestamps"
)

# Get word-level timestamps (uses stored session segments)
ts = client.predict(audio_id, None, "words", api_name="/timestamps")

# Get timestamps without a session (standalone)
ts = client.predict("recitation.mp3", result["segments"], "words", api_name="/timestamps_direct")

# From URL (YouTube, SoundCloud, MP3Quran, etc.)
result = client.predict(
    "https://server8.mp3quran.net/afs/112.mp3",
    200, 1000, 100, "Base", "GPU",
    api_name="/process_url_session"
)
print(result["url_metadata"]["title"])  # Source metadata
# All follow-up calls work the same as with /process_audio_session
```

---

## Sessions

The first call (`/process_audio_session` or `/process_url_session`) returns an `audio_id` (32-character hex string). Pass it to subsequent calls to skip re-uploading and reprocessing audio. Sessions expire after **5 hours**.

**What the server caches per session:**

| Data | Updated by |
|---|---|
| Preprocessed audio | — |
| Detected speech intervals | — |
| Cleaned segment boundaries | `/resegment`, `/realign_from_timestamps` |
| Model name | `/retranscribe` |
| Alignment segments | Any alignment call |

If `audio_id` is missing, expired, or invalid:
```json
{"error": "Session not found or expired", "segments": []}
```
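Because sessions expire after 5 hours, a long-running client may want to detect this error and create a new session before retrying. The following is a minimal sketch, not part of the API: `session_expired`, `call_with_fresh_session`, `call`, and `create_session` are hypothetical names, and the two callables stand in for whatever `client.predict(...)` calls you already make.

```python
from typing import Callable, Tuple

SESSION_ERROR = "Session not found or expired"  # message documented above


def session_expired(result: dict) -> bool:
    """True if the response is the documented session-expiry error."""
    return result.get("error") == SESSION_ERROR


def call_with_fresh_session(
    audio_id: str,
    call: Callable[[str], dict],
    create_session: Callable[[], dict],
) -> Tuple[str, dict]:
    """Run call(audio_id); if the session is gone, create one and retry once.

    Both callables are hypothetical wrappers around client.predict(...).
    Returns the audio_id that was finally used together with the response.
    """
    result = call(audio_id)
    if session_expired(result):
        new_session = create_session()
        if "audio_id" not in new_session:
            # Session creation itself failed (e.g. no speech detected).
            return audio_id, new_session
        audio_id = new_session["audio_id"]
        result = call(audio_id)
    return audio_id, result
```

Retrying only once keeps a persistent failure from looping; the caller still sees the final error response.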

---

## Alignment Endpoints

### `POST /process_audio_session`

Processes a recitation audio file: detects speech segments, recognizes text, and aligns with the Quran. Creates a session for follow-up calls.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `audio` | file | required | Audio file (any common format) |
| `min_silence_ms` | int | 200 | Minimum silence gap to split segments |
| `min_speech_ms` | int | 1000 | Minimum speech duration to keep a segment |
| `pad_ms` | int | 100 | Padding added to each side of a segment |
| `model_name` | str | `"Base"` | `"Base"` (faster) or `"Large"` (more accurate). **Only these two values are accepted** — any other value will cause an error |
| `device` | str | `"GPU"` | `"GPU"` or `"CPU"` |

If the GPU is temporarily unavailable, processing continues on CPU (slower). When this happens, a `"warning"` field is included in the response (see [GPU Fallback Warning](#gpu-fallback-warning)).

**Segmentation presets:**

| Style | min_silence_ms | min_speech_ms | pad_ms |
|---|---|---|---|
| Mujawwad (slow) | 600 | 1500 | 300 |
| Murattal (normal) | 200 | 1000 | 100 |
| Fast | 75 | 750 | 40 |

**Response:**
```json
{
  "audio_id": "a1b2c3d4e5f67890a1b2c3d4e5f67890",
  "segments": [
    {
      "segment": 1,
      "time_from": 0.480,
      "time_to": 2.880,
      "ref_from": "112:1:1",
      "ref_to": "112:1:4",
      "matched_text": "قُلْ هُوَ ٱللَّهُ أَحَدٌ",
      "confidence": 0.921,
      "has_missing_words": false,
      "error": null
    },
    {
      "segment": 2,
      "time_from": 4.320,
      "time_to": 6.540,
      "ref_from": "",
      "ref_to": "",
      "matched_text": "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيم",
      "confidence": 0.952,
      "has_missing_words": false,
      "special_type": "Basmala",
      "error": null
    }
  ]
}
```

See [Segment Object](#segment-object) for field descriptions. See [Special Segment Types](#special-segment-types) for non-Quranic segments.
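The segmentation presets listed for this endpoint can be kept in a small lookup table so that callers pick a style by name. This is only a sketch: `SEGMENTATION_PRESETS` and `segmentation_args` are hypothetical helpers, and the values are copied from the presets table above.

```python
from typing import Tuple

# Values copied from the "Segmentation presets" table.
SEGMENTATION_PRESETS = {
    "mujawwad": {"min_silence_ms": 600, "min_speech_ms": 1500, "pad_ms": 300},
    "murattal": {"min_silence_ms": 200, "min_speech_ms": 1000, "pad_ms": 100},
    "fast": {"min_silence_ms": 75, "min_speech_ms": 750, "pad_ms": 40},
}
VALID_MODELS = ("Base", "Large")
VALID_DEVICES = ("GPU", "CPU")


def segmentation_args(
    style: str, model_name: str = "Base", device: str = "GPU"
) -> Tuple[int, int, int, str, str]:
    """Return (min_silence_ms, min_speech_ms, pad_ms, model_name, device).

    The tuple follows the positional order the endpoint expects after the
    audio argument, so it can be unpacked into client.predict(...).
    """
    if model_name not in VALID_MODELS:
        raise ValueError(f"model_name must be one of {VALID_MODELS}")
    if device not in VALID_DEVICES:
        raise ValueError(f"device must be one of {VALID_DEVICES}")
    preset = SEGMENTATION_PRESETS[style.lower()]
    return (
        preset["min_silence_ms"],
        preset["min_speech_ms"],
        preset["pad_ms"],
        model_name,
        device,
    )
```

A call would then look like `client.predict("recitation.mp3", *segmentation_args("mujawwad"), api_name="/process_audio_session")`. Checking `model_name` locally avoids a round trip that would end in the invalid-model error.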

---

### `POST /process_url_session`

Downloads audio from a URL, then runs the same pipeline as `/process_audio_session`. Supports YouTube, SoundCloud, MP3Quran, TikTok, and [500+ sites](https://github.com/yt-dlp/yt-dlp/blob/master/supportedsites.md) via yt-dlp.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | str | required | URL to download audio from |
| `min_silence_ms` | int | 200 | Minimum silence gap to split segments |
| `min_speech_ms` | int | 1000 | Minimum speech duration to keep a segment |
| `pad_ms` | int | 100 | Padding added to each side of a segment |
| `model_name` | str | `"Base"` | `"Base"` or `"Large"` only |
| `device` | str | `"GPU"` | `"GPU"` or `"CPU"` |

**Response:** Same as `/process_audio_session`, plus a `url_metadata` field:
```json
{
  "audio_id": "a1b2c3d4e5f67890a1b2c3d4e5f67890",
  "url_metadata": {
    "title": "Surah Al-Ikhlas - Sheikh Mishary",
    "duration": 45.0,
    "source_url": "https://..."
  },
  "segments": [...]
}
```

**Notes:**
- Playlists are rejected — pass a single video/audio URL.
- Some sites (YouTube, Facebook, Instagram) may not work from the server due to IP restrictions. If a download fails, download the audio locally and use `/process_audio_session` instead.
- After the session is created, all follow-up endpoints (`/resegment`, `/retranscribe`, etc.) work identically.
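The fallback described in the notes (download locally, then upload) can be wrapped in one helper. This is a sketch under assumptions: `process_url_or_file` and its three callables are hypothetical, and the check relies on the `"Download failed: ..."` error message listed under [Errors](#errors).

```python
from typing import Callable


def process_url_or_file(
    url: str,
    process_url: Callable[[str], dict],
    download_locally: Callable[[str], str],
    process_file: Callable[[str], dict],
) -> dict:
    """Try /process_url_session first; on a download failure, fall back.

    process_url and process_file are hypothetical wrappers around
    client.predict(..., api_name="/process_url_session") and
    client.predict(..., api_name="/process_audio_session").
    download_locally is any function that saves the audio and returns its path.
    """
    result = process_url(url)
    error = result.get("error") or ""
    if error.startswith("Download failed"):
        local_path = download_locally(url)
        return process_file(local_path)
    return result
```

Other errors, such as an empty URL, are returned unchanged so the caller can handle them.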

---

### `POST /resegment`

Re-splits the audio into segments using different silence/speech settings, then re-aligns. Reuses the session's cached audio.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `audio_id` | str | required | Session ID from a previous call |
| `min_silence_ms` | int | 200 | New minimum silence gap |
| `min_speech_ms` | int | 1000 | New minimum speech duration |
| `pad_ms` | int | 100 | New padding |
| `model_name` | str | `"Base"` | `"Base"` or `"Large"` only |
| `device` | str | `"GPU"` | `"GPU"` or `"CPU"` |

**Response:** Same shape as `/process_audio_session`. Session boundaries are updated.

---

### `POST /retranscribe`

Re-recognizes text using a different model on the same segments, then re-aligns.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `audio_id` | str | required | Session ID from a previous call |
| `model_name` | str | `"Base"` | `"Base"` or `"Large"` only |
| `device` | str | `"GPU"` | `"GPU"` or `"CPU"` |

**Response:** Same shape as `/process_audio_session`. Session model and results are updated.

> **Note:** Returns an error if `model_name` is the same as the current session's model and the segment boundaries have not changed. To re-run with the same model on different boundaries, use `/resegment` or `/realign_from_timestamps` instead (they already include recognition + alignment).

---

### `POST /realign_from_timestamps`

Aligns audio using custom time boundaries you provide. Useful for manually adjusting where segments start and end.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `audio_id` | str | required | Session ID from a previous call |
| `timestamps` | list | required | Array of `{"start": float, "end": float}` in seconds |
| `model_name` | str | `"Base"` | `"Base"` or `"Large"` only |
| `device` | str | `"GPU"` | `"GPU"` or `"CPU"` |

**Example request body:**
```json
{
  "audio_id": "a1b2c3d4e5f67890a1b2c3d4e5f67890",
  "timestamps": [
    {"start": 0.5, "end": 3.2},
    {"start": 3.8, "end": 5.1},
    {"start": 5.1, "end": 7.4}
  ],
  "model_name": "Base",
  "device": "GPU"
}
```

**Response:** Same shape as `/process_audio_session`. Session boundaries are replaced with the provided timestamps.
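A common workflow is to start from the segments of an earlier response, adjust a few boundaries, and send the result back. The sketch below converts segment objects (`time_from`/`time_to`) into the `{"start", "end"}` items this endpoint expects. `segments_to_timestamps` and its `adjustments` argument are hypothetical, and rejecting `start >= end` is a local sanity check rather than a documented server rule.

```python
from typing import Dict, List, Optional, Tuple


def segments_to_timestamps(
    segments: List[dict],
    adjustments: Optional[Dict[int, Tuple[float, float]]] = None,
) -> List[dict]:
    """Build the `timestamps` argument from alignment segments.

    adjustments maps a 1-indexed segment number to a replacement
    (start, end) pair in seconds; other segments keep their boundaries.
    """
    adjustments = adjustments or {}
    timestamps = []
    for seg in segments:
        start, end = adjustments.get(
            seg["segment"], (seg["time_from"], seg["time_to"])
        )
        if not start < end:
            raise ValueError(f"segment {seg['segment']}: start must be < end")
        timestamps.append({"start": float(start), "end": float(end)})
    return timestamps
```

The returned list can be passed as the second positional argument of `client.predict(..., api_name="/realign_from_timestamps")`.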

---

## Word Timestamps

### `POST /timestamps`

Gets word-level timing for each word in the aligned segments. Letter-level timing (`"words+chars"`) is currently disabled via the API.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `audio_id` | str | required | Session ID from a previous alignment call |
| `segments` | list? | `None` (JSON `null`) | Segment list to align. `None` uses stored segments from the session |
| `granularity` | str | `"words"` | Only `"words"` is supported. `"words+chars"` is currently disabled via API and returns an error |

**Example — using stored segments:**
```python
result = client.predict(
    "a1b2c3d4e5f67890a1b2c3d4e5f67890",  # audio_id
    None,                                # segments (null = use stored)
    "words",                             # granularity
    api_name="/timestamps",
)
```

**Example — with segments override (minimal):**
```python
result = client.predict(
    "a1b2c3d4e5f67890a1b2c3d4e5f67890",
    [   # segments override
        {"time_from": 0.48, "time_to": 2.88, "ref_from": "112:1:1", "ref_to": "112:1:4"},
        {"time_from": 3.12, "time_to": 5.44, "ref_from": "112:2:1", "ref_to": "112:2:3"},
    ],
    "words",
    api_name="/timestamps",
)
```

**Example — special segment (Basmala):**
```python
# Special segments use empty ref_from/ref_to and carry a special_type field
{"time_from": 0.0, "time_to": 2.1, "ref_from": "", "ref_to": "", "special_type": "Basmala"}
```

**Segment input fields:**

| Field | Type | Required | Description |
|---|---|---|---|
| `time_from` | float | yes | Start time in seconds |
| `time_to` | float | yes | End time in seconds |
| `ref_from` | str | yes | First word as `"surah:ayah:word"`. Empty for special segments |
| `ref_to` | str | yes | Last word as `"surah:ayah:word"`. Empty for special segments |
| `segment` | int | no | 1-indexed segment number. Auto-assigned from position if omitted |
| `confidence` | float | no | Defaults to 1.0. Segments with confidence ≤ 0 are skipped |
| `special_type` | str | no | Only for special segments (`"Basmala"`, `"Isti'adha"`, etc.) |

**Response:**
```json
{
  "audio_id": "a1b2c3d4e5f67890a1b2c3d4e5f67890",
  "segments": [
    {
      "segment": 1,
      "words": [
        ["112:1:1", 0.0, 0.32],
        ["112:1:2", 0.32, 0.58],
        ["112:1:3", 0.58, 1.12],
        ["112:1:4", 1.12, 1.68]
      ]
    }
  ]
}
```

See [Word Timestamp Arrays](#word-timestamp-arrays) for field details.
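Word times are relative to the start of their segment, so placing words on the timeline of the full recording means adding each segment's start time. The sketch below assumes that the offset is the segment's `time_from` from the alignment response and that `segment` numbers match between the two responses; `absolute_word_times` is a hypothetical helper.

```python
from typing import List, Tuple


def absolute_word_times(
    alignment_segments: List[dict], timestamp_segments: List[dict]
) -> List[Tuple[str, float, float]]:
    """Return (location, start_s, end_s) with times relative to the audio.

    alignment_segments: "segments" from an alignment endpoint.
    timestamp_segments: "segments" from /timestamps or /timestamps_direct.
    """
    offsets = {seg["segment"]: seg["time_from"] for seg in alignment_segments}
    words = []
    for seg in timestamp_segments:
        offset = offsets[seg["segment"]]
        for word in seg["words"]:
            # Index only the first three entries so a longer array still works.
            location, start, end = word[0], word[1], word[2]
            words.append(
                (location, round(offset + start, 3), round(offset + end, 3))
            )
    return words
```

Rounding to milliseconds avoids floating-point noise such as `0.7999999999999999` in the output.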

---

### `POST /timestamps_direct`

Same as `/timestamps` but accepts an audio file directly — no session needed.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `audio` | file | required | Audio file (any common format) |
| `segments` | list | required | Segment list with `time_from`/`time_to` boundaries |
| `granularity` | str | `"words"` | Only `"words"` is supported. `"words+chars"` is currently disabled via API and returns an error |

**Response:** Same shape as `/timestamps` but without `audio_id`.

**Example (minimal):**
```python
result = client.predict(
    "recitation.mp3",
    [
        {"time_from": 0.48, "time_to": 2.88, "ref_from": "112:1:1", "ref_to": "112:1:4"},
        {"time_from": 3.12, "time_to": 5.44, "ref_from": "112:2:1", "ref_to": "112:2:3"},
    ],
    "words",
    api_name="/timestamps_direct",
)
```

Segment input format is the same as for `/timestamps` — see above.
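Since this endpoint requires a segment list, one way to build it is from the `segments` of an earlier alignment response. The sketch below keeps only the documented input fields. `to_timestamp_segments` is a hypothetical helper, and skipping segments whose `error` is set is a choice made here, not a requirement of the API.

```python
from typing import List

REQUIRED_FIELDS = ("time_from", "time_to", "ref_from", "ref_to")
OPTIONAL_FIELDS = ("segment", "confidence", "special_type")


def to_timestamp_segments(segments: List[dict]) -> List[dict]:
    """Reduce alignment segments to the fields accepted as segment input."""
    result = []
    for seg in segments:
        if seg.get("error"):
            continue  # alignment failed for this segment; nothing to time
        item = {key: seg[key] for key in REQUIRED_FIELDS}
        for key in OPTIONAL_FIELDS:
            if key in seg:
                item[key] = seg[key]
        result.append(item)
    return result
```

Dropping `matched_text` and the other output-only fields keeps the request payload small.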

---

## Utilities

### `POST /estimate_duration`

Estimates the processing time of a request before you start it.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `endpoint` | str | required | Target endpoint name (e.g. `"process_audio_session"`) |
| `audio_duration_s` | float | `None` | Audio length in seconds. Required if no `audio_id` |
| `audio_id` | str | `None` | Session ID — looks up audio duration from the session |
| `model_name` | str | `"Base"` | `"Base"` or `"Large"` only |
| `device` | str | `"GPU"` | `"GPU"` or `"CPU"` |

**Example — before first processing call:**
```python
est = client.predict(
    "process_audio_session",  # endpoint
    60.0,                     # audio_duration_s (seconds)
    None,                     # audio_id (not yet available)
    "Base",                   # model_name
    "GPU",                    # device
    api_name="/estimate_duration",
)
print(f"Estimated time: {est['estimated_duration_s']}s")
```

**Example — with existing session (e.g. before getting timestamps):**
```python
est = client.predict(
    "timestamps",              # endpoint
    None,                      # audio_duration_s (looked up from session)
    audio_id,                  # audio_id
    "Base",                    # model_name
    "GPU",                     # device
    api_name="/estimate_duration",
)
```

**Response:**
```json
{
  "endpoint": "process_audio_session",
  "estimated_duration_s": 28.0,
  "device": "GPU",
  "model_name": "Base"
}
```
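One possible use of the estimate is to derive a client-side timeout. The sketch below is a heuristic only: `timeout_from_estimate`, the safety factor, and the minimum are assumptions, not values recommended by the API.

```python
def timeout_from_estimate(
    estimate: dict, safety_factor: float = 3.0, minimum_s: float = 30.0
) -> float:
    """Turn an /estimate_duration response into a timeout in seconds.

    safety_factor leaves headroom for queueing and CPU fallback;
    minimum_s prevents very short timeouts for tiny inputs.
    Both defaults are arbitrary assumptions.
    """
    return max(minimum_s, estimate["estimated_duration_s"] * safety_factor)
```

The result could be used, for example, as the `timeout` when waiting on a job started with `client.submit(...)`.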

---

## Response Reference

### Segment Object

Returned by all alignment endpoints (`/process_audio_session`, `/resegment`, `/retranscribe`, `/realign_from_timestamps`).

| Field | Type | Description |
|---|---|---|
| `segment` | int | 1-indexed segment number |
| `time_from` | float | Start time in seconds |
| `time_to` | float | End time in seconds |
| `ref_from` | str | First matched word as `"surah:ayah:word"`. Empty string for special segments |
| `ref_to` | str | Last matched word as `"surah:ayah:word"`. Empty string for special segments |
| `matched_text` | str | Quran text for the matched range (or special segment text) |
| `confidence` | float | 0.0–1.0 — how well the segment matched the Quran text |
| `has_missing_words` | bool | Whether some expected words were not found in the audio |
| `special_type` | str | Only present for special (non-Quranic) segments — see below. Absent for normal segments |
| `error` | str? | Error message if alignment failed, else `null` |
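The `confidence`, `has_missing_words`, and `error` fields are enough to pick out segments that deserve a manual check. A minimal sketch follows; `segments_needing_review` is a hypothetical helper, and the `0.8` threshold is an arbitrary assumption to tune for your data.

```python
from typing import List, Tuple


def segments_needing_review(
    segments: List[dict], min_confidence: float = 0.8
) -> List[Tuple[int, List[str]]]:
    """Return (segment number, reasons) for segments that look unreliable."""
    flagged = []
    for seg in segments:
        reasons = []
        if seg.get("error"):
            reasons.append("error")
        if seg.get("has_missing_words"):
            reasons.append("missing_words")
        if seg.get("confidence", 0.0) < min_confidence:
            reasons.append("low_confidence")
        if reasons:
            flagged.append((seg["segment"], reasons))
    return flagged
```

Flagged segments are natural candidates for `/realign_from_timestamps` with hand-adjusted boundaries.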

### Special Segment Types

Non-Quranic segments detected within recitations. When `special_type` is present, `ref_from` and `ref_to` are empty strings.

| `special_type` | Arabic Text |
|----------------|-------------|
| `Basmala` | بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيم |
| `Isti'adha` | أَعُوذُ بِٱللَّهِ مِنَ الشَّيْطَانِ الرَّجِيم |
| `Amin` | آمِين |
| `Takbir` | اللَّهُ أَكْبَر |
| `Tahmeed` | سَمِعَ اللَّهُ لِمَنْ حَمِدَه |
| `Tasleem` | ٱلسَّلَامُ عَلَيْكُمْ وَرَحْمَةُ ٱللَّه |
| `Sadaqa` | صَدَقَ ٱللَّهُ ٱلْعَظِيم |
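Because special segments carry empty references, code that parses `ref_from`/`ref_to` should separate them first. The sketch below shows one way to do that; `parse_ref` and `split_segments` are hypothetical helpers built on the `"surah:ayah:word"` format and the presence of `special_type` described above.

```python
from typing import List, Tuple


def parse_ref(ref: str) -> Tuple[int, int, int]:
    """Split "surah:ayah:word" into three integers."""
    surah, ayah, word = (int(part) for part in ref.split(":"))
    return surah, ayah, word


def split_segments(segments: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Return (quranic, special) lists, preserving the original order."""
    quranic = [seg for seg in segments if "special_type" not in seg]
    special = [seg for seg in segments if "special_type" in seg]
    return quranic, special
```

Testing for the key rather than for an empty `ref_from` follows the rule that `special_type` is absent on normal segments.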

### Word Timestamp Arrays

Returned by `/timestamps` and `/timestamps_direct`. Each word is an array: `[location, start, end]`. A fourth `letters` element (`[location, start, end, letters]`) belongs to the `"words+chars"` granularity, which is currently disabled.

| Index | Type | Description |
|---|---|---|
| 0 | str | Word position as `"surah:ayah:word"` |
| 1 | float | Start time relative to segment (seconds) |
| 2 | float | End time relative to segment (seconds) |

> **Note:** `"words+chars"` granularity (letter-level timestamps) is currently disabled via API. Only word-level timestamps are returned.

### GPU Fallback Warning

When the GPU is temporarily unavailable or your GPU quota is exhausted, processing continues on CPU (slower). In that case, every endpoint adds a `"warning"` field to its response:

```json
{
  "audio_id": "...",
  "warning": "GPU quota reached — processed on CPU (slower). Resets in 13:53:59.",
  "segments": [...]
}
```

The `"warning"` key is **absent** (not `null`) when processing ran on GPU normally. Clients should check `if "warning" in result` rather than checking for `null`.
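The presence check can be combined with a best-effort parse of the reset time. This is a sketch: `gpu_fallback` is a hypothetical helper, and the `Resets in HH:MM:SS` pattern is inferred only from the example warning above, so the parse may return `None` if the wording differs.

```python
import re
from datetime import timedelta
from typing import Optional, Tuple

# Pattern inferred from the example warning; treat it as an assumption.
RESET_PATTERN = re.compile(r"Resets in (\d+):(\d{2}):(\d{2})")


def gpu_fallback(result: dict) -> Optional[Tuple[str, Optional[timedelta]]]:
    """Return None if the request ran on GPU, else (warning, time_until_reset).

    time_until_reset is None when the warning has no recognisable reset time.
    """
    if "warning" not in result:
        return None
    warning = result["warning"]
    match = RESET_PATTERN.search(warning)
    reset = None
    if match:
        hours, minutes, seconds = (int(group) for group in match.groups())
        reset = timedelta(hours=hours, minutes=minutes, seconds=seconds)
    return warning, reset
```

A batch job could use the returned `timedelta` to decide whether to continue on CPU or pause until the quota resets.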

### Errors

All errors follow the same shape: `{"error": "...", "segments": []}`. Endpoints that have an active session also include `audio_id`.

| Condition | Error message | `audio_id` present? |
|---|---|---|
| Session not found or expired | `"Session not found or expired"` | No |
| No speech detected (process) | `"No speech detected in audio"` | No (no session created) |
| No segments after resegment | `"No segments with these settings"` | Yes |
| Invalid model name | `"Invalid model_name '...'. Must be one of: Base, Large"` | Depends on endpoint |
| Retranscribe with same model | `"Model and boundaries unchanged. Change model_name or call /resegment first."` | Yes |
| Retranscription failed | `"Retranscription failed"` | Yes |
| Realignment failed | `"Alignment failed"` | Yes |
| No segments in session (timestamps) | `"No segments found in session"` | Yes |
| Timestamp alignment failed | `"Alignment failed: ..."` | Yes (session) / No (direct) |
| No segments provided (timestamps direct) | `"No segments provided"` | No |
| URL is empty (process_url) | `"URL is required"` | No |
| URL download failed (process_url) | `"Download failed: ..."` | No |
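Since every error shares the `{"error": "...", "segments": []}` shape, a client can turn error responses into exceptions in one place. The sketch below is one way to do so; `AlignerError` and `raise_for_error` are hypothetical names, not part of the API or of `gradio_client`.

```python
from typing import Optional


class AlignerError(Exception):
    """Raised when a response carries a top-level error message."""

    def __init__(self, message: str, audio_id: Optional[str] = None):
        super().__init__(message)
        self.audio_id = audio_id  # present only if the session is still active


def raise_for_error(result: dict) -> dict:
    """Return the response unchanged, or raise AlignerError on failure."""
    error = result.get("error")
    if error:
        raise AlignerError(error, result.get("audio_id"))
    return result
```

Keeping `audio_id` on the exception lets the caller retry a follow-up endpoint without creating a new session. Per-segment `error` fields are left untouched, because a response can succeed overall while individual segments fail.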