# Client API Reference - [Quick Start](#quick-start) - [Sessions](#sessions) - [Alignment Endpoints](#alignment-endpoints) — `/process_audio_session`, `/process_url_session`, `/resegment`, `/retranscribe`, `/realign_from_timestamps` - [Word Timestamps](#word-timestamps) — `/timestamps`, `/timestamps_direct` - [Utilities](#utilities) — `/estimate_duration` - [Response Reference](#response-reference) — segment fields, special types, word arrays, GPU warning, errors ## API Changelog **30/03/2026** - New `/process_url_session` endpoint: pass a URL (YouTube, SoundCloud, MP3Quran, etc.) instead of uploading audio **29/03/2026** - API calls now skip HTML rendering and audio file I/O, returning JSON faster --- ## GPU Usage & Access - **Free Tier:** Every user receives **free daily GPU quota**. Once your daily GPU quota is exhausted, you can continue using unlimited CPU processing for all endpoints. - **Unlimited GPU Access:** If you need unlimited API access on GPU (e.g., for high-volume or production use), please get in touch to arrange a payment plan and higher limits. - **Note:** CPU processing is always unlimited and available, but is much slower. When GPU quota is exceeded, requests will be automatically routed to CPU and a warning will appear in the response. ## Quick Start ```python from gradio_client import Client client = Client("https://hetchyy-quran-multi-aligner.hf.space") # Or pass your HF token to use your own account's ZeroGPU quota client = Client("https://hetchyy-quran-multi-aligner.hf.space", token="hf_...") # Full pipeline result = client.predict( "recitation.mp3", # audio file path 200, # min_silence_ms 1000, # min_speech_ms 100, # pad_ms "Base", # model_name "GPU", # device api_name="/process_audio_session" ) audio_id = result["audio_id"] # Re-segment with different params (reuses cached audio) result = client.predict(audio_id, 600, 1500, 300, "Base", "GPU", api_name="/resegment") # Re-transcribe with a different model (reuses cached segments) result = client.predict(audio_id, "Large", "GPU", api_name="/retranscribe") # Realign with custom timestamps result = client.predict( audio_id, [{"start": 0.5, "end": 3.2}, {"start": 3.8, "end": 7.1}], "Base", "GPU", api_name="/realign_from_timestamps" ) # Get word-level timestamps (uses stored session segments) ts = client.predict(audio_id, None, "words", api_name="/timestamps") # Get timestamps without a session (standalone) ts = client.predict("recitation.mp3", result["segments"], "words", api_name="/timestamps_direct") # From URL (YouTube, SoundCloud, MP3Quran, etc.) result = client.predict( "https://server8.mp3quran.net/afs/112.mp3", 200, 1000, 100, "Base", "GPU", api_name="/process_url_session" ) print(result["url_metadata"]["title"]) # Source metadata # All follow-up calls work the same as with /process_audio_session ``` --- ## Sessions The first call returns an `audio_id` (32-character hex string). Pass it to subsequent calls to skip re-uploading and reprocessing audio. Sessions expire after **5 hours**. **What the server caches per session:** | Data | Updated by | |---|---| | Preprocessed audio | — | | Detected speech intervals | — | | Cleaned segment boundaries | `/resegment`, `/realign_from_timestamps` | | Model name | `/retranscribe` | | Alignment segments | Any alignment call | If `audio_id` is missing, expired, or invalid: ```json {"error": "Session not found or expired", "segments": []} ``` --- ## Alignment Endpoints ### `POST /process_audio_session` Processes a recitation audio file: detects speech segments, recognizes text, and aligns with the Quran. Creates a session for follow-up calls. | Parameter | Type | Default | Description | |---|---|---|---| | `audio` | file | required | Audio file (any common format) | | `min_silence_ms` | int | 200 | Minimum silence gap to split segments | | `min_speech_ms` | int | 1000 | Minimum speech duration to keep a segment | | `pad_ms` | int | 100 | Padding added to each side of a segment | | `model_name` | str | `"Base"` | `"Base"` (faster) or `"Large"` (more accurate). **Only these two values are accepted** — any other value will cause an error | | `device` | str | `"GPU"` | `"GPU"` or `"CPU"` | If the GPU is temporarily unavailable, processing continues on CPU (slower). When this happens, a `"warning"` field is included in the response (see [GPU Fallback Warning](#gpu-fallback-warning)). **Segmentation presets:** | Style | min_silence_ms | min_speech_ms | pad_ms | |---|---|---|---| | Mujawwad (slow) | 600 | 1500 | 300 | | Murattal (normal) | 200 | 1000 | 100 | | Fast | 75 | 750 | 40 | **Response:** ```json { "audio_id": "a1b2c3d4e5f67890a1b2c3d4e5f67890", "segments": [ { "segment": 1, "time_from": 0.480, "time_to": 2.880, "ref_from": "112:1:1", "ref_to": "112:1:4", "matched_text": "قُلْ هُوَ ٱللَّهُ أَحَدٌ", "confidence": 0.921, "has_missing_words": false, "error": null }, { "segment": 2, "time_from": 4.320, "time_to": 6.540, "ref_from": "", "ref_to": "", "matched_text": "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيم", "confidence": 0.952, "has_missing_words": false, "special_type": "Basmala", "error": null } ] } ``` See [Segment Object](#segment-object) for field descriptions. See [Special Segment Types](#special-segment-types) for non-Quranic segments. --- ### `POST /process_url_session` Downloads audio from a URL, then runs the same pipeline as `/process_audio_session`. Supports YouTube, SoundCloud, MP3Quran, TikTok, and [500+ sites](https://github.com/yt-dlp/yt-dlp/blob/master/supportedsites.md) via yt-dlp. | Parameter | Type | Default | Description | |---|---|---|---| | `url` | str | required | URL to download audio from | | `min_silence_ms` | int | 200 | Minimum silence gap to split segments | | `min_speech_ms` | int | 1000 | Minimum speech duration to keep a segment | | `pad_ms` | int | 100 | Padding added to each side of a segment | | `model_name` | str | `"Base"` | `"Base"` or `"Large"` only | | `device` | str | `"GPU"` | `"GPU"` or `"CPU"` | **Response:** Same as `/process_audio_session`, plus a `url_metadata` field: ```json { "audio_id": "a1b2c3d4e5f67890a1b2c3d4e5f67890", "url_metadata": { "title": "Surah Al-Ikhlas - Sheikh Mishary", "duration": 45.0, "source_url": "https://..." }, "segments": [...] } ``` **Notes:** - Playlists are rejected — pass a single video/audio URL. - Some sites (YouTube, Facebook, Instagram) may not work from the server due to IP restrictions. If a download fails, download the audio locally and use `/process_audio_session` instead. - After the session is created, all follow-up endpoints (`/resegment`, `/retranscribe`, etc.) work identically. --- ### `POST /resegment` Re-splits the audio into segments using different silence/speech settings, then re-aligns. Reuses the uploaded audio. | Parameter | Type | Default | Description | |---|---|---|---| | `audio_id` | str | required | Session ID from a previous call | | `min_silence_ms` | int | 200 | New minimum silence gap | | `min_speech_ms` | int | 1000 | New minimum speech duration | | `pad_ms` | int | 100 | New padding | | `model_name` | str | `"Base"` | `"Base"` or `"Large"` only | | `device` | str | `"GPU"` | `"GPU"` or `"CPU"` | **Response:** Same shape as `/process_audio_session`. Session boundaries are updated. --- ### `POST /retranscribe` Re-recognizes text using a different model on the same segments, then re-aligns. | Parameter | Type | Default | Description | |---|---|---|---| | `audio_id` | str | required | Session ID from a previous call | | `model_name` | str | `"Base"` | `"Base"` or `"Large"` only | | `device` | str | `"GPU"` | `"GPU"` or `"CPU"` | **Response:** Same shape as `/process_audio_session`. Session model and results are updated. > **Note:** Returns an error if `model_name` is the same as the current session's model. To re-run with the same model on different boundaries, use `/resegment` or `/realign_from_timestamps` instead (they already include recognition + alignment). --- ### `POST /realign_from_timestamps` Aligns audio using custom time boundaries you provide. Useful for manually adjusting where segments start and end. | Parameter | Type | Default | Description | |---|---|---|---| | `audio_id` | str | required | Session ID from a previous call | | `timestamps` | list | required | Array of `{"start": float, "end": float}` in seconds | | `model_name` | str | `"Base"` | `"Base"` or `"Large"` only | | `device` | str | `"GPU"` | `"GPU"` or `"CPU"` | **Example request body:** ```json { "audio_id": "a1b2c3d4e5f67890a1b2c3d4e5f67890", "timestamps": [ {"start": 0.5, "end": 3.2}, {"start": 3.8, "end": 5.1}, {"start": 5.1, "end": 7.4} ], "model_name": "Base", "device": "GPU" } ``` **Response:** Same shape as `/process_audio_session`. Session boundaries are replaced with the provided timestamps. --- ## Word Timestamps ### `POST /timestamps` Gets precise word-level (and optionally letter-level) timing for each word in the aligned segments. | Parameter | Type | Default | Description | |---|---|---|---| | `audio_id` | str | required | Session ID from a previous alignment call | | `segments` | list? | `None` (JSON `null`) | Segment list to align. `None` uses stored segments from the session | | `granularity` | str | `"words"` | Only `"words"` is supported. `"words+chars"` is currently disabled via API and returns an error | **Example — using stored segments:** ```python result = client.predict( "a1b2c3d4e5f67890a1b2c3d4e5f67890", # audio_id None, # segments (null = use stored) "words", # granularity api_name="/timestamps", ) ``` **Example — with segments override (minimal):** ```python result = client.predict( "a1b2c3d4e5f67890a1b2c3d4e5f67890", [ # segments override {"time_from": 0.48, "time_to": 2.88, "ref_from": "112:1:1", "ref_to": "112:1:4"}, {"time_from": 3.12, "time_to": 5.44, "ref_from": "112:2:1", "ref_to": "112:2:3"}, ], "words", api_name="/timestamps", ) ``` **Example — special segment (Basmala):** ```python # Special segments use empty ref_from/ref_to and carry a special_type field {"time_from": 0.0, "time_to": 2.1, "ref_from": "", "ref_to": "", "special_type": "Basmala"} ``` **Segment input fields:** | Field | Type | Required | Description | |---|---|---|---| | `time_from` | float | yes | Start time in seconds | | `time_to` | float | yes | End time in seconds | | `ref_from` | str | yes | First word as `"surah:ayah:word"`. Empty for special segments | | `ref_to` | str | yes | Last word as `"surah:ayah:word"`. Empty for special segments | | `segment` | int | no | 1-indexed segment number. Auto-assigned from position if omitted | | `confidence` | float | no | Defaults to 1.0. Segments with confidence ≤ 0 are skipped | | `special_type` | str | no | Only for special segments (`"Basmala"`, `"Isti'adha"`, etc.) | **Response:** ```json { "audio_id": "a1b2c3d4e5f67890a1b2c3d4e5f67890", "segments": [ { "segment": 1, "words": [ ["112:1:1", 0.0, 0.32], ["112:1:2", 0.32, 0.58], ["112:1:3", 0.58, 1.12], ["112:1:4", 1.12, 1.68] ] } ] } ``` See [Word Timestamp Arrays](#word-timestamp-arrays) for field details. --- ### `POST /timestamps_direct` Same as `/timestamps` but accepts an audio file directly — no session needed. | Parameter | Type | Default | Description | |---|---|---|---| | `audio` | file | required | Audio file (any common format) | | `segments` | list | required | Segment list with `time_from`/`time_to` boundaries | | `granularity` | str | `"words"` | Only `"words"` is supported. `"words+chars"` is currently disabled via API and returns an error | **Response:** Same shape as `/timestamps` but without `audio_id`. **Example (minimal):** ```python result = client.predict( "recitation.mp3", [ {"time_from": 0.48, "time_to": 2.88, "ref_from": "112:1:1", "ref_to": "112:1:4"}, {"time_from": 3.12, "time_to": 5.44, "ref_from": "112:2:1", "ref_to": "112:2:3"}, ], "words", api_name="/timestamps_direct", ) ``` Segment input format is the same as for `/timestamps` — see above. --- ## Utilities ### `POST /estimate_duration` Estimate processing time before starting a request. | Parameter | Type | Default | Description | |---|---|---|---| | `endpoint` | str | required | Target endpoint name (e.g. `"process_audio_session"`) | | `audio_duration_s` | float | `None` | Audio length in seconds. Required if no `audio_id` | | `audio_id` | str | `None` | Session ID — looks up audio duration from the session | | `model_name` | str | `"Base"` | `"Base"` or `"Large"` only | | `device` | str | `"GPU"` | `"GPU"` or `"CPU"` | **Example — before first processing call:** ```python est = client.predict( "process_audio_session", # endpoint 60.0, # audio_duration_s (seconds) None, # audio_id (not yet available) "Base", # model_name "GPU", # device api_name="/estimate_duration", ) print(f"Estimated time: {est['estimated_duration_s']}s") ``` **Example — with existing session (e.g. before getting timestamps):** ```python est = client.predict( "timestamps", # endpoint None, # audio_duration_s (looked up from session) audio_id, # audio_id "Base", # model_name "GPU", # device api_name="/estimate_duration", ) ``` **Response:** ```json { "endpoint": "process_audio_session", "estimated_duration_s": 28.0, "device": "GPU", "model_name": "Base" } ``` --- ## Response Reference ### Segment Object Returned by all alignment endpoints (`/process_audio_session`, `/resegment`, `/retranscribe`, `/realign_from_timestamps`). | Field | Type | Description | |---|---|---| | `segment` | int | 1-indexed segment number | | `time_from` | float | Start time in seconds | | `time_to` | float | End time in seconds | | `ref_from` | str | First matched word as `"surah:ayah:word"`. Empty string for special segments | | `ref_to` | str | Last matched word as `"surah:ayah:word"`. Empty string for special segments | | `matched_text` | str | Quran text for the matched range (or special segment text) | | `confidence` | float | 0.0–1.0 — how well the segment matched the Quran text | | `has_missing_words` | bool | Whether some expected words were not found in the audio | | `special_type` | str | Only present for special (non-Quranic) segments — see below. Absent for normal segments | | `error` | str? | Error message if alignment failed, else `null` | ### Special Segment Types Non-Quranic segments detected within recitations. When `special_type` is present, `ref_from` and `ref_to` are empty strings. | `special_type` | Arabic Text | |----------------|-------------| | `Basmala` | بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيم | | `Isti'adha` | أَعُوذُ بِٱللَّهِ مِنَ الشَّيْطَانِ الرَّجِيم | | `Amin` | آمِين | | `Takbir` | اللَّهُ أَكْبَر | | `Tahmeed` | سَمِعَ اللَّهُ لِمَنْ حَمِدَه | | `Tasleem` | ٱلسَّلَامُ عَلَيْكُمْ وَرَحْمَةُ ٱللَّه | | `Sadaqa` | صَدَقَ ٱللَّهُ ٱلْعَظِيم | ### Word Timestamp Arrays Returned by `/timestamps` and `/timestamps_direct`. Each word is an array: `[location, start, end]` or `[location, start, end, letters]`. | Index | Type | Description | |---|---|---| | 0 | str | Word position as `"surah:ayah:word"` | | 1 | float | Start time relative to segment (seconds) | | 2 | float | End time relative to segment (seconds) | > **Note:** `"words+chars"` granularity (letter-level timestamps) is currently disabled via API. Only word-level timestamps are returned. ### GPU Fallback Warning When the server's GPU is temporarily unavailable, processing continues on CPU (slower). All endpoints include a `"warning"` field in the response: ```json { "audio_id": "...", "warning": "GPU quota reached — processed on CPU (slower). Resets in 13:53:59.", "segments": [...] } ``` The `"warning"` key is **absent** (not `null`) when processing ran on GPU normally. Clients should check `if "warning" in result` rather than checking for `null`. ### Errors All errors follow the same shape: `{"error": "...", "segments": []}`. Endpoints that have an active session also include `audio_id`. | Condition | Error message | `audio_id` present? | |---|---|---| | Session not found or expired | `"Session not found or expired"` | No | | No speech detected (process) | `"No speech detected in audio"` | No (no session created) | | No segments after resegment | `"No segments with these settings"` | Yes | | Invalid model name | `"Invalid model_name '...'. Must be one of: Base, Large"` | Depends on endpoint | | Retranscribe with same model | `"Model and boundaries unchanged. Change model_name or call /resegment first."` | Yes | | Retranscription failed | `"Retranscription failed"` | Yes | | Realignment failed | `"Alignment failed"` | Yes | | No segments in session (timestamps) | `"No segments found in session"` | Yes | | Timestamp alignment failed | `"Alignment failed: ..."` | Yes (session) / No (direct) | | No segments provided (timestamps direct) | `"No segments provided"` | No | | URL is empty (process_url) | `"URL is required"` | No | | URL download failed (process_url) | `"Download failed: ..."` | No |