# Client API Reference

- [Quick Start](#quick-start)
- [Sessions](#sessions)
- [Alignment Endpoints](#alignment-endpoints): `/process_audio_session`, `/process_url_session`, `/resegment`, `/retranscribe`, `/realign_from_timestamps`
- [Word Timestamps](#word-timestamps): `/timestamps`, `/timestamps_direct`
- [Utilities](#utilities): `/estimate_duration`
- [Response Reference](#response-reference): segment fields, special types, word arrays, GPU warning, errors
## API Changelog

**30/03/2026**
- New `/process_url_session` endpoint: pass a URL (YouTube, SoundCloud, MP3Quran, etc.) instead of uploading audio

**29/03/2026**
- API calls now skip HTML rendering and audio file I/O, returning JSON faster

---
## GPU Usage & Access

- **Free Tier:** Every user receives a **free daily GPU quota**. Once it is exhausted, you can continue with unlimited CPU processing on all endpoints.
- **Unlimited GPU Access:** If you need unlimited GPU API access (e.g., for high-volume or production use), please get in touch to arrange a payment plan and higher limits.
- **Note:** CPU processing is always available and unlimited, but much slower. When the GPU quota is exceeded, requests are automatically routed to the CPU and a warning appears in the response.
## Quick Start

```python
from gradio_client import Client

client = Client("https://hetchyy-quran-multi-aligner.hf.space")

# Or pass your HF token to use your own account's ZeroGPU quota
client = Client("https://hetchyy-quran-multi-aligner.hf.space", token="hf_...")

# Full pipeline
result = client.predict(
    "recitation.mp3",  # audio file path
    200,               # min_silence_ms
    1000,              # min_speech_ms
    100,               # pad_ms
    "Base",            # model_name
    "GPU",             # device
    api_name="/process_audio_session"
)
audio_id = result["audio_id"]

# Re-segment with different params (reuses cached audio)
result = client.predict(audio_id, 600, 1500, 300, "Base", "GPU", api_name="/resegment")

# Re-transcribe with a different model (reuses cached segments)
result = client.predict(audio_id, "Large", "GPU", api_name="/retranscribe")

# Realign with custom timestamps
result = client.predict(
    audio_id,
    [{"start": 0.5, "end": 3.2}, {"start": 3.8, "end": 7.1}],
    "Base", "GPU",
    api_name="/realign_from_timestamps"
)

# Get word-level timestamps (uses stored session segments)
ts = client.predict(audio_id, None, "words", api_name="/timestamps")

# Get timestamps without a session (standalone)
ts = client.predict("recitation.mp3", result["segments"], "words", api_name="/timestamps_direct")

# From URL (YouTube, SoundCloud, MP3Quran, etc.)
result = client.predict(
    "https://server8.mp3quran.net/afs/112.mp3",
    200, 1000, 100, "Base", "GPU",
    api_name="/process_url_session"
)
print(result["url_metadata"]["title"])  # Source metadata

# All follow-up calls work the same as with /process_audio_session
```
---

## Sessions

The first call returns an `audio_id` (32-character hex string). Pass it to subsequent calls to skip re-uploading and reprocessing audio. Sessions expire after **5 hours**.

**What the server caches per session:**

| Data | Updated by |
|---|---|
| Preprocessed audio | – |
| Detected speech intervals | – |
| Cleaned segment boundaries | `/resegment`, `/realign_from_timestamps` |
| Model name | `/retranscribe` |
| Alignment segments | Any alignment call |

If `audio_id` is missing, expired, or invalid:

```json
{"error": "Session not found or expired", "segments": []}
```
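Since sessions expire after 5 hours, a long-running client may want to recover transparently. A minimal sketch of a retry-once wrapper; `call` and `recreate` are caller-supplied functions (e.g. thin wrappers around `client.predict`), not part of the API:

```python
def call_with_session_retry(call, recreate, audio_id):
    """Run call(audio_id); if the session expired, recreate it and retry once.

    `call` and `recreate` are hypothetical caller-supplied helpers:
    `call(audio_id)` performs the follow-up request, and `recreate()`
    re-uploads the audio and returns a fresh /process_audio_session response.
    """
    result = call(audio_id)
    if result.get("error") == "Session not found or expired":
        audio_id = recreate()["audio_id"]  # fresh session
        result = call(audio_id)
    return result
```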
---

## Alignment Endpoints

### `POST /process_audio_session`

Processes a recitation audio file: detects speech segments, recognizes text, and aligns it with the Quran. Creates a session for follow-up calls.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `audio` | file | required | Audio file (any common format) |
| `min_silence_ms` | int | 200 | Minimum silence gap to split segments |
| `min_speech_ms` | int | 1000 | Minimum speech duration to keep a segment |
| `pad_ms` | int | 100 | Padding added to each side of a segment |
| `model_name` | str | `"Base"` | `"Base"` (faster) or `"Large"` (more accurate). **Only these two values are accepted**; any other value causes an error |
| `device` | str | `"GPU"` | `"GPU"` or `"CPU"` |

If the GPU is temporarily unavailable, processing continues on CPU (slower). When this happens, a `"warning"` field is included in the response (see [GPU Fallback Warning](#gpu-fallback-warning)).

**Segmentation presets:**

| Style | min_silence_ms | min_speech_ms | pad_ms |
|---|---|---|---|
| Mujawwad (slow) | 600 | 1500 | 300 |
| Murattal (normal) | 200 | 1000 | 100 |
| Fast | 75 | 750 | 40 |
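The presets above can be kept in a small lookup so the right parameter triple is passed positionally. The dict and helper names here are our own convention, not part of the API:

```python
# (min_silence_ms, min_speech_ms, pad_ms) per recitation style
SEGMENTATION_PRESETS = {
    "mujawwad": (600, 1500, 300),  # slow
    "murattal": (200, 1000, 100),  # normal
    "fast": (75, 750, 40),
}

def preset_args(style):
    """Return the three segmentation parameters for a named style."""
    return SEGMENTATION_PRESETS[style]

# Usage sketch:
# client.predict("recitation.mp3", *preset_args("mujawwad"), "Base", "GPU",
#                api_name="/process_audio_session")
```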
**Response:**

```json
{
  "audio_id": "a1b2c3d4e5f67890a1b2c3d4e5f67890",
  "segments": [
    {
      "segment": 1,
      "time_from": 0.480,
      "time_to": 2.880,
      "ref_from": "112:1:1",
      "ref_to": "112:1:4",
      "matched_text": "قُلْ هُوَ ٱللَّهُ أَحَدٌ",
      "confidence": 0.921,
      "has_missing_words": false,
      "error": null
    },
    {
      "segment": 2,
      "time_from": 4.320,
      "time_to": 6.540,
      "ref_from": "",
      "ref_to": "",
      "matched_text": "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ",
      "confidence": 0.952,
      "has_missing_words": false,
      "special_type": "Basmala",
      "error": null
    }
  ]
}
```

See [Segment Object](#segment-object) for field descriptions and [Special Segment Types](#special-segment-types) for non-Quranic segments.
---

### `POST /process_url_session`

Downloads audio from a URL, then runs the same pipeline as `/process_audio_session`. Supports YouTube, SoundCloud, MP3Quran, TikTok, and [500+ sites](https://github.com/yt-dlp/yt-dlp/blob/master/supportedsites.md) via yt-dlp.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | str | required | URL to download audio from |
| `min_silence_ms` | int | 200 | Minimum silence gap to split segments |
| `min_speech_ms` | int | 1000 | Minimum speech duration to keep a segment |
| `pad_ms` | int | 100 | Padding added to each side of a segment |
| `model_name` | str | `"Base"` | `"Base"` or `"Large"` only |
| `device` | str | `"GPU"` | `"GPU"` or `"CPU"` |

**Response:** Same as `/process_audio_session`, plus a `url_metadata` field:

```json
{
  "audio_id": "a1b2c3d4e5f67890a1b2c3d4e5f67890",
  "url_metadata": {
    "title": "Surah Al-Ikhlas - Sheikh Mishary",
    "duration": 45.0,
    "source_url": "https://..."
  },
  "segments": [...]
}
```

**Notes:**
- Playlists are rejected: pass a single video/audio URL.
- Some sites (YouTube, Facebook, Instagram) may not work from the server due to IP restrictions. If a download fails, download the audio locally and use `/process_audio_session` instead.
- After the session is created, all follow-up endpoints (`/resegment`, `/retranscribe`, etc.) work identically.
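Because some sites block the server's IP, a client can detect a failed download and fall back to uploading the file via `/process_audio_session`. A minimal sketch; the helper name is our own:

```python
def is_download_failure(result):
    """True when a /process_url_session response reports a failed download.

    Download errors follow the documented shape
    {"error": "Download failed: ...", "segments": []}.
    """
    return (result.get("error") or "").startswith("Download failed")

# Usage sketch: if is_download_failure(result), fetch the audio locally
# (e.g. with yt-dlp) and call /process_audio_session with the file instead.
```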
---

### `POST /resegment`

Re-splits the audio into segments using different silence/speech settings, then re-aligns. Reuses the uploaded audio.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `audio_id` | str | required | Session ID from a previous call |
| `min_silence_ms` | int | 200 | New minimum silence gap |
| `min_speech_ms` | int | 1000 | New minimum speech duration |
| `pad_ms` | int | 100 | New padding |
| `model_name` | str | `"Base"` | `"Base"` or `"Large"` only |
| `device` | str | `"GPU"` | `"GPU"` or `"CPU"` |

**Response:** Same shape as `/process_audio_session`. Session boundaries are updated.

---

### `POST /retranscribe`

Re-recognizes text using a different model on the same segments, then re-aligns.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `audio_id` | str | required | Session ID from a previous call |
| `model_name` | str | `"Base"` | `"Base"` or `"Large"` only |
| `device` | str | `"GPU"` | `"GPU"` or `"CPU"` |

**Response:** Same shape as `/process_audio_session`. Session model and results are updated.

> **Note:** Returns an error if `model_name` matches the current session's model. To re-run the same model on different boundaries, use `/resegment` or `/realign_from_timestamps` instead (both already include recognition + alignment).
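Since `/retranscribe` rejects a repeated model name, a client can guard the call by tracking the session's current model. An illustrative helper, not a server feature:

```python
def can_retranscribe(session_model, new_model):
    """Check a /retranscribe call before making it.

    Only "Base" and "Large" are accepted, and the model must actually
    change from the one the session last used.
    """
    return new_model in ("Base", "Large") and new_model != session_model
```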
---

### `POST /realign_from_timestamps`

Aligns audio using custom time boundaries you provide. Useful for manually adjusting where segments start and end.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `audio_id` | str | required | Session ID from a previous call |
| `timestamps` | list | required | Array of `{"start": float, "end": float}` in seconds |
| `model_name` | str | `"Base"` | `"Base"` or `"Large"` only |
| `device` | str | `"GPU"` | `"GPU"` or `"CPU"` |

**Example request body:**

```json
{
  "audio_id": "a1b2c3d4e5f67890a1b2c3d4e5f67890",
  "timestamps": [
    {"start": 0.5, "end": 3.2},
    {"start": 3.8, "end": 5.1},
    {"start": 5.1, "end": 7.4}
  ],
  "model_name": "Base",
  "device": "GPU"
}
```

**Response:** Same shape as `/process_audio_session`. Session boundaries are replaced with the provided timestamps.
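A common use is cleaning up machine-detected boundaries before realigning, for example merging adjacent segments separated by a very short gap. This merge logic is our own illustration of preparing the `timestamps` payload, not something the server does:

```python
def merge_close_segments(timestamps, max_gap=0.3):
    """Merge adjacent {"start", "end"} entries whose gap is below max_gap seconds.

    Returns a new list suitable as the `timestamps` argument of
    /realign_from_timestamps; the input is left unmodified.
    """
    merged = []
    for ts in sorted(timestamps, key=lambda t: t["start"]):
        if merged and ts["start"] - merged[-1]["end"] < max_gap:
            merged[-1]["end"] = max(merged[-1]["end"], ts["end"])
        else:
            merged.append(dict(ts))
    return merged
```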
---

## Word Timestamps

### `POST /timestamps`

Gets precise word-level (and optionally letter-level) timing for each word in the aligned segments.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `audio_id` | str | required | Session ID from a previous alignment call |
| `segments` | list? | `None` (JSON `null`) | Segment list to align. `None` uses stored segments from the session |
| `granularity` | str | `"words"` | Only `"words"` is supported. `"words+chars"` is currently disabled via API and returns an error |

**Example – using stored segments:**

```python
result = client.predict(
    "a1b2c3d4e5f67890a1b2c3d4e5f67890",  # audio_id
    None,     # segments (null = use stored)
    "words",  # granularity
    api_name="/timestamps",
)
```

**Example – with segments override (minimal):**

```python
result = client.predict(
    "a1b2c3d4e5f67890a1b2c3d4e5f67890",
    [  # segments override
        {"time_from": 0.48, "time_to": 2.88, "ref_from": "112:1:1", "ref_to": "112:1:4"},
        {"time_from": 3.12, "time_to": 5.44, "ref_from": "112:2:1", "ref_to": "112:2:3"},
    ],
    "words",
    api_name="/timestamps",
)
```

**Example – special segment (Basmala):**

```python
# Special segments use empty ref_from/ref_to and carry a special_type field
{"time_from": 0.0, "time_to": 2.1, "ref_from": "", "ref_to": "", "special_type": "Basmala"}
```

**Segment input fields:**

| Field | Type | Required | Description |
|---|---|---|---|
| `time_from` | float | yes | Start time in seconds |
| `time_to` | float | yes | End time in seconds |
| `ref_from` | str | yes | First word as `"surah:ayah:word"`. Empty for special segments |
| `ref_to` | str | yes | Last word as `"surah:ayah:word"`. Empty for special segments |
| `segment` | int | no | 1-indexed segment number. Auto-assigned from position if omitted |
| `confidence` | float | no | Defaults to 1.0. Segments with confidence ≤ 0 are skipped |
| `special_type` | str | no | Only for special segments (`"Basmala"`, `"Isti'adha"`, etc.) |
**Response:**

```json
{
  "audio_id": "a1b2c3d4e5f67890a1b2c3d4e5f67890",
  "segments": [
    {
      "segment": 1,
      "words": [
        ["112:1:1", 0.0, 0.32],
        ["112:1:2", 0.32, 0.58],
        ["112:1:3", 0.58, 1.12],
        ["112:1:4", 1.12, 1.68]
      ]
    }
  ]
}
```

See [Word Timestamp Arrays](#word-timestamp-arrays) for field details.
---

### `POST /timestamps_direct`

Same as `/timestamps` but accepts an audio file directly; no session needed.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `audio` | file | required | Audio file (any common format) |
| `segments` | list | required | Segment list with `time_from`/`time_to` boundaries |
| `granularity` | str | `"words"` | Only `"words"` is supported. `"words+chars"` is currently disabled via API and returns an error |

**Response:** Same shape as `/timestamps` but without `audio_id`.

**Example (minimal):**

```python
result = client.predict(
    "recitation.mp3",
    [
        {"time_from": 0.48, "time_to": 2.88, "ref_from": "112:1:1", "ref_to": "112:1:4"},
        {"time_from": 3.12, "time_to": 5.44, "ref_from": "112:2:1", "ref_to": "112:2:3"},
    ],
    "words",
    api_name="/timestamps_direct",
)
```

Segment input format is the same as for `/timestamps`; see above.
---

## Utilities

### `POST /estimate_duration`

Estimates processing time before starting a request.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `endpoint` | str | required | Target endpoint name (e.g. `"process_audio_session"`) |
| `audio_duration_s` | float | `None` | Audio length in seconds. Required if no `audio_id` |
| `audio_id` | str | `None` | Session ID; the audio duration is looked up from the session |
| `model_name` | str | `"Base"` | `"Base"` or `"Large"` only |
| `device` | str | `"GPU"` | `"GPU"` or `"CPU"` |

**Example – before the first processing call:**

```python
est = client.predict(
    "process_audio_session",  # endpoint
    60.0,    # audio_duration_s (seconds)
    None,    # audio_id (not yet available)
    "Base",  # model_name
    "GPU",   # device
    api_name="/estimate_duration",
)
print(f"Estimated time: {est['estimated_duration_s']}s")
```

**Example – with an existing session (e.g. before getting timestamps):**

```python
est = client.predict(
    "timestamps",  # endpoint
    None,          # audio_duration_s (looked up from session)
    audio_id,      # audio_id
    "Base",        # model_name
    "GPU",         # device
    api_name="/estimate_duration",
)
```

**Response:**

```json
{
  "endpoint": "process_audio_session",
  "estimated_duration_s": 28.0,
  "device": "GPU",
  "model_name": "Base"
}
```
---

## Response Reference

### Segment Object

Returned by all alignment endpoints (`/process_audio_session`, `/resegment`, `/retranscribe`, `/realign_from_timestamps`).

| Field | Type | Description |
|---|---|---|
| `segment` | int | 1-indexed segment number |
| `time_from` | float | Start time in seconds |
| `time_to` | float | End time in seconds |
| `ref_from` | str | First matched word as `"surah:ayah:word"`. Empty string for special segments |
| `ref_to` | str | Last matched word as `"surah:ayah:word"`. Empty string for special segments |
| `matched_text` | str | Quran text for the matched range (or special segment text) |
| `confidence` | float | 0.0–1.0; how well the segment matched the Quran text |
| `has_missing_words` | bool | Whether some expected words were not found in the audio |
| `special_type` | str | Only present for special (non-Quranic) segments; see below. Absent for normal segments |
| `error` | str? | Error message if alignment failed, else `null` |
### Special Segment Types

Non-Quranic segments detected within recitations. When `special_type` is present, `ref_from` and `ref_to` are empty strings.

| `special_type` | Arabic Text |
|----------------|-------------|
| `Basmala` | بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ |
| `Isti'adha` | أَعُوذُ بِٱللَّهِ مِنَ الشَّيْطَانِ الرَّجِيمِ |
| `Amin` | آمِين |
| `Takbir` | اللَّهُ أَكْبَر |
| `Tahmeed` | سَمِعَ اللَّهُ لِمَنْ حَمِدَهُ |
| `Tasleem` | ٱلسَّلَامُ عَلَيْكُمْ وَرَحْمَةُ ٱللَّهِ |
| `Sadaqa` | صَدَقَ ٱللَّهُ ٱلْعَظِيمُ |
### Word Timestamp Arrays

Returned by `/timestamps` and `/timestamps_direct`. Each word is an array: `[location, start, end]` or `[location, start, end, letters]`.

| Index | Type | Description |
|---|---|---|
| 0 | str | Word position as `"surah:ayah:word"` |
| 1 | float | Start time relative to segment (seconds) |
| 2 | float | End time relative to segment (seconds) |

> **Note:** `"words+chars"` granularity (letter-level timestamps) is currently disabled via API. Only word-level timestamps are returned.
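Because word times are relative to their segment, absolute positions in the audio require adding the segment's `time_from` from the alignment response. A sketch that pairs the two responses by `segment` number; the helper name is our own:

```python
def absolute_word_times(alignment_segments, timestamp_segments):
    """Convert segment-relative word times to absolute times in the audio.

    `alignment_segments` come from an alignment endpoint (carry time_from),
    `timestamp_segments` from /timestamps (carry the per-word arrays).
    Returns a flat list of [location, abs_start, abs_end] entries.
    """
    offsets = {s["segment"]: s["time_from"] for s in alignment_segments}
    words = []
    for seg in timestamp_segments:
        offset = offsets[seg["segment"]]
        for w in seg["words"]:  # [location, start, end] (+ optional letters)
            words.append([w[0], offset + w[1], offset + w[2]])
    return words
```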
### GPU Fallback Warning

When the server's GPU is temporarily unavailable, processing continues on CPU (slower). All endpoints include a `"warning"` field in the response:

```json
{
  "audio_id": "...",
  "warning": "GPU quota reached – processed on CPU (slower). Resets in 13:53:59.",
  "segments": [...]
}
```

The `"warning"` key is **absent** (not `null`) when processing ran on GPU normally. Clients should check `if "warning" in result` rather than checking for `null`.
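The membership check above can be wrapped in a one-line helper (our own convenience, not part of the API):

```python
def gpu_fallback_warning(result):
    """Return the fallback warning text, or None when processing ran on GPU.

    The key is absent (not null) on a normal GPU run, so dict.get covers
    both the absent and the present case.
    """
    return result.get("warning")
```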
### Errors

All errors follow the same shape: `{"error": "...", "segments": []}`. Endpoints that have an active session also include `audio_id`.

| Condition | Error message | `audio_id` present? |
|---|---|---|
| Session not found or expired | `"Session not found or expired"` | No |
| No speech detected (process) | `"No speech detected in audio"` | No (no session created) |
| No segments after resegment | `"No segments with these settings"` | Yes |
| Invalid model name | `"Invalid model_name '...'. Must be one of: Base, Large"` | Depends on endpoint |
| Retranscribe with same model | `"Model and boundaries unchanged. Change model_name or call /resegment first."` | Yes |
| Retranscription failed | `"Retranscription failed"` | Yes |
| Realignment failed | `"Alignment failed"` | Yes |
| No segments in session (timestamps) | `"No segments found in session"` | Yes |
| Timestamp alignment failed | `"Alignment failed: ..."` | Yes (session) / No (direct) |
| No segments provided (timestamps direct) | `"No segments provided"` | No |
| URL is empty (process_url) | `"URL is required"` | No |
| URL download failed (process_url) | `"Download failed: ..."` | No |
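Since every error shares this shape, a single check covers all endpoints. A minimal raise-on-error wrapper; the exception class and function name are our own convention:

```python
class AlignerError(RuntimeError):
    """Raised when a response carries a non-empty "error" field."""

def unwrap(result):
    """Return the response unchanged, raising AlignerError on a reported failure.

    Successful responses either omit "error" or set it to null, so a
    truthiness check handles both.
    """
    if result.get("error"):
        raise AlignerError(result["error"])
    return result
```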