Quran-multi-aligner / docs /client_api.md
hetchyy's picture
feat: add /process_url_session API endpoint for URL-based alignment
e67922d verified
# Client API Reference
- [Quick Start](#quick-start)
- [Sessions](#sessions)
- [Alignment Endpoints](#alignment-endpoints) โ€” `/process_audio_session`, `/process_url_session`, `/resegment`, `/retranscribe`, `/realign_from_timestamps`
- [Word Timestamps](#word-timestamps) โ€” `/timestamps`, `/timestamps_direct`
- [Utilities](#utilities) โ€” `/estimate_duration`
- [Response Reference](#response-reference) โ€” segment fields, special types, word arrays, GPU warning, errors
## API Changelog
**30/03/2026**
- New `/process_url_session` endpoint: pass a URL (YouTube, SoundCloud, MP3Quran, etc.) instead of uploading audio
**29/03/2026**
- API calls now skip HTML rendering and audio file I/O, returning JSON faster
---
## GPU Usage & Access
- **Free Tier:** Every user receives **free daily GPU quota**. Once your daily GPU quota is exhausted, you can continue using unlimited CPU processing for all endpoints.
- **Unlimited GPU Access:** If you need unlimited API access on GPU (e.g., for high-volume or production use), please get in touch to arrange a payment plan and higher limits.
- **Note:** CPU processing is always unlimited and available, but is much slower. When GPU quota is exceeded, requests will be automatically routed to CPU and a warning will appear in the response.
## Quick Start
```python
from gradio_client import Client
client = Client("https://hetchyy-quran-multi-aligner.hf.space")
# Or pass your HF token to use your own account's ZeroGPU quota
client = Client("https://hetchyy-quran-multi-aligner.hf.space", token="hf_...")
# Full pipeline
result = client.predict(
"recitation.mp3", # audio file path
200, # min_silence_ms
1000, # min_speech_ms
100, # pad_ms
"Base", # model_name
"GPU", # device
api_name="/process_audio_session"
)
audio_id = result["audio_id"]
# Re-segment with different params (reuses cached audio)
result = client.predict(audio_id, 600, 1500, 300, "Base", "GPU", api_name="/resegment")
# Re-transcribe with a different model (reuses cached segments)
result = client.predict(audio_id, "Large", "GPU", api_name="/retranscribe")
# Realign with custom timestamps
result = client.predict(
audio_id,
[{"start": 0.5, "end": 3.2}, {"start": 3.8, "end": 7.1}],
"Base", "GPU",
api_name="/realign_from_timestamps"
)
# Get word-level timestamps (uses stored session segments)
ts = client.predict(audio_id, None, "words", api_name="/timestamps")
# Get timestamps without a session (standalone)
ts = client.predict("recitation.mp3", result["segments"], "words", api_name="/timestamps_direct")
# From URL (YouTube, SoundCloud, MP3Quran, etc.)
result = client.predict(
"https://server8.mp3quran.net/afs/112.mp3",
200, 1000, 100, "Base", "GPU",
api_name="/process_url_session"
)
print(result["url_metadata"]["title"]) # Source metadata
# All follow-up calls work the same as with /process_audio_session
```
---
## Sessions
The first call returns an `audio_id` (32-character hex string). Pass it to subsequent calls to skip re-uploading and reprocessing audio. Sessions expire after **5 hours**.
**What the server caches per session:**
| Data | Updated by |
|---|---|
| Preprocessed audio | โ€” |
| Detected speech intervals | โ€” |
| Cleaned segment boundaries | `/resegment`, `/realign_from_timestamps` |
| Model name | `/retranscribe` |
| Alignment segments | Any alignment call |
If `audio_id` is missing, expired, or invalid:
```json
{"error": "Session not found or expired", "segments": []}
```
---
## Alignment Endpoints
### `POST /process_audio_session`
Processes a recitation audio file: detects speech segments, recognizes text, and aligns with the Quran. Creates a session for follow-up calls.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `audio` | file | required | Audio file (any common format) |
| `min_silence_ms` | int | 200 | Minimum silence gap to split segments |
| `min_speech_ms` | int | 1000 | Minimum speech duration to keep a segment |
| `pad_ms` | int | 100 | Padding added to each side of a segment |
| `model_name` | str | `"Base"` | `"Base"` (faster) or `"Large"` (more accurate). **Only these two values are accepted** โ€” any other value will cause an error |
| `device` | str | `"GPU"` | `"GPU"` or `"CPU"` |
If the GPU is temporarily unavailable, processing continues on CPU (slower). When this happens, a `"warning"` field is included in the response (see [GPU Fallback Warning](#gpu-fallback-warning)).
**Segmentation presets:**
| Style | min_silence_ms | min_speech_ms | pad_ms |
|---|---|---|---|
| Mujawwad (slow) | 600 | 1500 | 300 |
| Murattal (normal) | 200 | 1000 | 100 |
| Fast | 75 | 750 | 40 |
**Response:**
```json
{
"audio_id": "a1b2c3d4e5f67890a1b2c3d4e5f67890",
"segments": [
{
"segment": 1,
"time_from": 0.480,
"time_to": 2.880,
"ref_from": "112:1:1",
"ref_to": "112:1:4",
"matched_text": "ู‚ูู„ู’ ู‡ููˆูŽ ูฑู„ู„ูŽู‘ู‡ู ุฃูŽุญูŽุฏูŒ",
"confidence": 0.921,
"has_missing_words": false,
"error": null
},
{
"segment": 2,
"time_from": 4.320,
"time_to": 6.540,
"ref_from": "",
"ref_to": "",
"matched_text": "ุจูุณู’ู…ู ูฑู„ู„ูŽู‘ู‡ู ูฑู„ุฑูŽู‘ุญู’ู…ูŽูฐู†ู ูฑู„ุฑูŽู‘ุญููŠู…",
"confidence": 0.952,
"has_missing_words": false,
"special_type": "Basmala",
"error": null
}
]
}
```
See [Segment Object](#segment-object) for field descriptions. See [Special Segment Types](#special-segment-types) for non-Quranic segments.
---
### `POST /process_url_session`
Downloads audio from a URL, then runs the same pipeline as `/process_audio_session`. Supports YouTube, SoundCloud, MP3Quran, TikTok, and [500+ sites](https://github.com/yt-dlp/yt-dlp/blob/master/supportedsites.md) via yt-dlp.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | str | required | URL to download audio from |
| `min_silence_ms` | int | 200 | Minimum silence gap to split segments |
| `min_speech_ms` | int | 1000 | Minimum speech duration to keep a segment |
| `pad_ms` | int | 100 | Padding added to each side of a segment |
| `model_name` | str | `"Base"` | `"Base"` or `"Large"` only |
| `device` | str | `"GPU"` | `"GPU"` or `"CPU"` |
**Response:** Same as `/process_audio_session`, plus a `url_metadata` field:
```json
{
"audio_id": "a1b2c3d4e5f67890a1b2c3d4e5f67890",
"url_metadata": {
"title": "Surah Al-Ikhlas - Sheikh Mishary",
"duration": 45.0,
"source_url": "https://..."
},
"segments": [...]
}
```
**Notes:**
- Playlists are rejected โ€” pass a single video/audio URL.
- Some sites (YouTube, Facebook, Instagram) may not work from the server due to IP restrictions. If a download fails, download the audio locally and use `/process_audio_session` instead.
- After the session is created, all follow-up endpoints (`/resegment`, `/retranscribe`, etc.) work identically.
---
### `POST /resegment`
Re-splits the audio into segments using different silence/speech settings, then re-aligns. Reuses the uploaded audio.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `audio_id` | str | required | Session ID from a previous call |
| `min_silence_ms` | int | 200 | New minimum silence gap |
| `min_speech_ms` | int | 1000 | New minimum speech duration |
| `pad_ms` | int | 100 | New padding |
| `model_name` | str | `"Base"` | `"Base"` or `"Large"` only |
| `device` | str | `"GPU"` | `"GPU"` or `"CPU"` |
**Response:** Same shape as `/process_audio_session`. Session boundaries are updated.
---
### `POST /retranscribe`
Re-recognizes text using a different model on the same segments, then re-aligns.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `audio_id` | str | required | Session ID from a previous call |
| `model_name` | str | `"Base"` | `"Base"` or `"Large"` only |
| `device` | str | `"GPU"` | `"GPU"` or `"CPU"` |
**Response:** Same shape as `/process_audio_session`. Session model and results are updated.
> **Note:** Returns an error if `model_name` is the same as the current session's model. To re-run with the same model on different boundaries, use `/resegment` or `/realign_from_timestamps` instead (they already include recognition + alignment).
---
### `POST /realign_from_timestamps`
Aligns audio using custom time boundaries you provide. Useful for manually adjusting where segments start and end.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `audio_id` | str | required | Session ID from a previous call |
| `timestamps` | list | required | Array of `{"start": float, "end": float}` in seconds |
| `model_name` | str | `"Base"` | `"Base"` or `"Large"` only |
| `device` | str | `"GPU"` | `"GPU"` or `"CPU"` |
**Example request body:**
```json
{
"audio_id": "a1b2c3d4e5f67890a1b2c3d4e5f67890",
"timestamps": [
{"start": 0.5, "end": 3.2},
{"start": 3.8, "end": 5.1},
{"start": 5.1, "end": 7.4}
],
"model_name": "Base",
"device": "GPU"
}
```
**Response:** Same shape as `/process_audio_session`. Session boundaries are replaced with the provided timestamps.
---
## Word Timestamps
### `POST /timestamps`
Gets precise word-level (and optionally letter-level) timing for each word in the aligned segments.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `audio_id` | str | required | Session ID from a previous alignment call |
| `segments` | list? | `None` (JSON `null`) | Segment list to align. `None` uses stored segments from the session |
| `granularity` | str | `"words"` | Only `"words"` is supported. `"words+chars"` is currently disabled via API and returns an error |
**Example โ€” using stored segments:**
```python
result = client.predict(
"a1b2c3d4e5f67890a1b2c3d4e5f67890", # audio_id
None, # segments (null = use stored)
"words", # granularity
api_name="/timestamps",
)
```
**Example โ€” with segments override (minimal):**
```python
result = client.predict(
"a1b2c3d4e5f67890a1b2c3d4e5f67890",
[ # segments override
{"time_from": 0.48, "time_to": 2.88, "ref_from": "112:1:1", "ref_to": "112:1:4"},
{"time_from": 3.12, "time_to": 5.44, "ref_from": "112:2:1", "ref_to": "112:2:3"},
],
"words",
api_name="/timestamps",
)
```
**Example โ€” special segment (Basmala):**
```python
# Special segments use empty ref_from/ref_to and carry a special_type field
{"time_from": 0.0, "time_to": 2.1, "ref_from": "", "ref_to": "", "special_type": "Basmala"}
```
**Segment input fields:**
| Field | Type | Required | Description |
|---|---|---|---|
| `time_from` | float | yes | Start time in seconds |
| `time_to` | float | yes | End time in seconds |
| `ref_from` | str | yes | First word as `"surah:ayah:word"`. Empty for special segments |
| `ref_to` | str | yes | Last word as `"surah:ayah:word"`. Empty for special segments |
| `segment` | int | no | 1-indexed segment number. Auto-assigned from position if omitted |
| `confidence` | float | no | Defaults to 1.0. Segments with confidence โ‰ค 0 are skipped |
| `special_type` | str | no | Only for special segments (`"Basmala"`, `"Isti'adha"`, etc.) |
**Response:**
```json
{
"audio_id": "a1b2c3d4e5f67890a1b2c3d4e5f67890",
"segments": [
{
"segment": 1,
"words": [
["112:1:1", 0.0, 0.32],
["112:1:2", 0.32, 0.58],
["112:1:3", 0.58, 1.12],
["112:1:4", 1.12, 1.68]
]
}
]
}
```
See [Word Timestamp Arrays](#word-timestamp-arrays) for field details.
---
### `POST /timestamps_direct`
Same as `/timestamps` but accepts an audio file directly โ€” no session needed.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `audio` | file | required | Audio file (any common format) |
| `segments` | list | required | Segment list with `time_from`/`time_to` boundaries |
| `granularity` | str | `"words"` | Only `"words"` is supported. `"words+chars"` is currently disabled via API and returns an error |
**Response:** Same shape as `/timestamps` but without `audio_id`.
**Example (minimal):**
```python
result = client.predict(
"recitation.mp3",
[
{"time_from": 0.48, "time_to": 2.88, "ref_from": "112:1:1", "ref_to": "112:1:4"},
{"time_from": 3.12, "time_to": 5.44, "ref_from": "112:2:1", "ref_to": "112:2:3"},
],
"words",
api_name="/timestamps_direct",
)
```
Segment input format is the same as for `/timestamps` โ€” see above.
---
## Utilities
### `POST /estimate_duration`
Estimate processing time before starting a request.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `endpoint` | str | required | Target endpoint name (e.g. `"process_audio_session"`) |
| `audio_duration_s` | float | `None` | Audio length in seconds. Required if no `audio_id` |
| `audio_id` | str | `None` | Session ID โ€” looks up audio duration from the session |
| `model_name` | str | `"Base"` | `"Base"` or `"Large"` only |
| `device` | str | `"GPU"` | `"GPU"` or `"CPU"` |
**Example โ€” before first processing call:**
```python
est = client.predict(
"process_audio_session", # endpoint
60.0, # audio_duration_s (seconds)
None, # audio_id (not yet available)
"Base", # model_name
"GPU", # device
api_name="/estimate_duration",
)
print(f"Estimated time: {est['estimated_duration_s']}s")
```
**Example โ€” with existing session (e.g. before getting timestamps):**
```python
est = client.predict(
"timestamps", # endpoint
None, # audio_duration_s (looked up from session)
audio_id, # audio_id
"Base", # model_name
"GPU", # device
api_name="/estimate_duration",
)
```
**Response:**
```json
{
"endpoint": "process_audio_session",
"estimated_duration_s": 28.0,
"device": "GPU",
"model_name": "Base"
}
```
---
## Response Reference
### Segment Object
Returned by all alignment endpoints (`/process_audio_session`, `/resegment`, `/retranscribe`, `/realign_from_timestamps`).
| Field | Type | Description |
|---|---|---|
| `segment` | int | 1-indexed segment number |
| `time_from` | float | Start time in seconds |
| `time_to` | float | End time in seconds |
| `ref_from` | str | First matched word as `"surah:ayah:word"`. Empty string for special segments |
| `ref_to` | str | Last matched word as `"surah:ayah:word"`. Empty string for special segments |
| `matched_text` | str | Quran text for the matched range (or special segment text) |
| `confidence` | float | 0.0โ€“1.0 โ€” how well the segment matched the Quran text |
| `has_missing_words` | bool | Whether some expected words were not found in the audio |
| `special_type` | str | Only present for special (non-Quranic) segments โ€” see below. Absent for normal segments |
| `error` | str? | Error message if alignment failed, else `null` |
### Special Segment Types
Non-Quranic segments detected within recitations. When `special_type` is present, `ref_from` and `ref_to` are empty strings.
| `special_type` | Arabic Text |
|----------------|-------------|
| `Basmala` | ุจูุณู’ู…ู ูฑู„ู„ูŽู‘ู‡ู ูฑู„ุฑูŽู‘ุญู’ู…ูŽูฐู†ู ูฑู„ุฑูŽู‘ุญููŠู… |
| `Isti'adha` | ุฃูŽุนููˆุฐู ุจููฑู„ู„ูŽู‘ู‡ู ู…ูู†ูŽ ุงู„ุดูŽู‘ูŠู’ุทูŽุงู†ู ุงู„ุฑูŽู‘ุฌููŠู… |
| `Amin` | ุขู…ููŠู† |
| `Takbir` | ุงู„ู„ูŽู‘ู‡ู ุฃูŽูƒู’ุจูŽุฑ |
| `Tahmeed` | ุณูŽู…ูุนูŽ ุงู„ู„ูŽู‘ู‡ู ู„ูู…ูŽู†ู’ ุญูŽู…ูุฏูŽู‡ |
| `Tasleem` | ูฑู„ุณูŽู‘ู„ูŽุงู…ู ุนูŽู„ูŽูŠู’ูƒูู…ู’ ูˆูŽุฑูŽุญู’ู…ูŽุฉู ูฑู„ู„ูŽู‘ู‡ |
| `Sadaqa` | ุตูŽุฏูŽู‚ูŽ ูฑู„ู„ูŽู‘ู‡ู ูฑู„ู’ุนูŽุธููŠู… |
### Word Timestamp Arrays
Returned by `/timestamps` and `/timestamps_direct`. Each word is an array: `[location, start, end]` or `[location, start, end, letters]`.
| Index | Type | Description |
|---|---|---|
| 0 | str | Word position as `"surah:ayah:word"` |
| 1 | float | Start time relative to segment (seconds) |
| 2 | float | End time relative to segment (seconds) |
> **Note:** `"words+chars"` granularity (letter-level timestamps) is currently disabled via API. Only word-level timestamps are returned.
### GPU Fallback Warning
When the server's GPU is temporarily unavailable, processing continues on CPU (slower). All endpoints include a `"warning"` field in the response:
```json
{
"audio_id": "...",
"warning": "GPU quota reached โ€” processed on CPU (slower). Resets in 13:53:59.",
"segments": [...]
}
```
The `"warning"` key is **absent** (not `null`) when processing ran on GPU normally. Clients should check `if "warning" in result` rather than checking for `null`.
### Errors
All errors follow the same shape: `{"error": "...", "segments": []}`. Endpoints that have an active session also include `audio_id`.
| Condition | Error message | `audio_id` present? |
|---|---|---|
| Session not found or expired | `"Session not found or expired"` | No |
| No speech detected (process) | `"No speech detected in audio"` | No (no session created) |
| No segments after resegment | `"No segments with these settings"` | Yes |
| Invalid model name | `"Invalid model_name '...'. Must be one of: Base, Large"` | Depends on endpoint |
| Retranscribe with same model | `"Model and boundaries unchanged. Change model_name or call /resegment first."` | Yes |
| Retranscription failed | `"Retranscription failed"` | Yes |
| Realignment failed | `"Alignment failed"` | Yes |
| No segments in session (timestamps) | `"No segments found in session"` | Yes |
| Timestamp alignment failed | `"Alignment failed: ..."` | Yes (session) / No (direct) |
| No segments provided (timestamps direct) | `"No segments provided"` | No |
| URL is empty (process_url) | `"URL is required"` | No |
| URL download failed (process_url) | `"Download failed: ..."` | No |