
	# Client API Reference

	- [Quick Start](#quick-start)
	- [Sessions](#sessions)
	- [Alignment Endpoints](#alignment-endpoints) — `/process_audio_session`, `/process_url_session`, `/resegment`, `/retranscribe`, `/realign_from_timestamps`
	- [Word Timestamps](#word-timestamps) — `/timestamps`, `/timestamps_direct`
	- [Utilities](#utilities) — `/estimate_duration`
	- [Response Reference](#response-reference) — segment fields, special types, word arrays, GPU warning, errors

	## API Changelog

	30/03/2026
	- New `/process_url_session` endpoint: pass a URL (YouTube, SoundCloud, MP3Quran, etc.) instead of uploading audio

	29/03/2026
	- API calls now skip HTML rendering and audio file I/O, returning JSON faster


	---

	## GPU Usage & Access

	- Free Tier: Every user receives free daily GPU quota. Once your daily GPU quota is exhausted, you can continue using unlimited CPU processing for all endpoints.
	- Unlimited GPU Access: If you need unlimited API access on GPU (e.g., for high-volume or production use), please get in touch to arrange a payment plan and higher limits.
	- Note: CPU processing is always unlimited and available, but is much slower. When GPU quota is exceeded, requests will be automatically routed to CPU and a warning will appear in the response.

	## Quick Start

	```python
	from gradio_client import Client

	client = Client("https://hetchyy-quran-multi-aligner.hf.space")

	# Or pass your HF token to use your own account's ZeroGPU quota
	# (uncomment and replace the placeholder with a real token):
	# client = Client("https://hetchyy-quran-multi-aligner.hf.space", token="hf_...")

	# Full pipeline
	result = client.predict(
	    "recitation.mp3",  # audio file path
	    200,               # min_silence_ms
	    1000,              # min_speech_ms
	    100,               # pad_ms
	    "Base",            # model_name
	    "GPU",             # device
	    api_name="/process_audio_session",
	)
	audio_id = result["audio_id"]

	# Re-segment with different params (reuses cached audio)
	result = client.predict(audio_id, 600, 1500, 300, "Base", "GPU", api_name="/resegment")

	# Re-transcribe with a different model (reuses cached segments)
	result = client.predict(audio_id, "Large", "GPU", api_name="/retranscribe")

	# Realign with custom timestamps
	result = client.predict(
	    audio_id,
	    [{"start": 0.5, "end": 3.2}, {"start": 3.8, "end": 7.1}],
	    "Base", "GPU",
	    api_name="/realign_from_timestamps",
	)

	# Get word-level timestamps (uses stored session segments)
	ts = client.predict(audio_id, None, "words", api_name="/timestamps")

	# Get timestamps without a session (standalone)
	ts = client.predict("recitation.mp3", result["segments"], "words", api_name="/timestamps_direct")

	# From URL (YouTube, SoundCloud, MP3Quran, etc.)
	result = client.predict(
	    "https://server8.mp3quran.net/afs/112.mp3",
	    200, 1000, 100, "Base", "GPU",
	    api_name="/process_url_session",
	)
	print(result["url_metadata"]["title"])  # Source metadata
	# All follow-up calls work the same as with /process_audio_session
	```

	---

	## Sessions

	The first alignment call (`/process_audio_session` or `/process_url_session`) returns an `audio_id` (a 32-character hex string). Pass it to subsequent calls to skip re-uploading and reprocessing the audio. Sessions expire after 5 hours.

	What the server caches per session:

	| Data | Updated by |
	|---|---|
	| Preprocessed audio | (never updated) |
	| Detected speech intervals | (never updated) |
	| Cleaned segment boundaries | `/resegment`, `/realign_from_timestamps` |
	| Model name | `/retranscribe` |
	| Alignment segments | Any alignment call |

	If `audio_id` is missing, expired, or invalid:
	```json
	{"error": "Session not found or expired", "segments": []}
	```
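
A client can use this response to decide when to start a new session. The sketch below is one way to do that, assuming the error string is exactly the one shown above; `looks_like_audio_id` and `is_session_expired` are hypothetical helper names, not part of the API.

```python
import re

# Assumption: the server uses exactly this message for missing/expired sessions.
SESSION_ERROR = "Session not found or expired"

# audio_id is documented as a 32-character hex string.
_AUDIO_ID_RE = re.compile(r"[0-9a-fA-F]{32}")


def looks_like_audio_id(value):
    """Return True if value has the documented audio_id format."""
    return isinstance(value, str) and _AUDIO_ID_RE.fullmatch(value) is not None


def is_session_expired(result):
    """Return True if an API response reports a missing or expired session."""
    return result.get("error") == SESSION_ERROR
```

When `is_session_expired(result)` is true, the usual recovery is to call `/process_audio_session` or `/process_url_session` again and store the new `audio_id`.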

	---

	## Alignment Endpoints

	### `POST /process_audio_session`

	Processes a recitation audio file: detects speech segments, recognizes text, and aligns with the Quran. Creates a session for follow-up calls.

	| Parameter | Type | Default | Description |
	|---|---|---|---|
	| `audio` | file | required | Audio file (any common format) |
	| `min_silence_ms` | int | 200 | Minimum silence gap to split segments |
	| `min_speech_ms` | int | 1000 | Minimum speech duration to keep a segment |
	| `pad_ms` | int | 100 | Padding added to each side of a segment |
	| `model_name` | str | `"Base"` | `"Base"` (faster) or `"Large"` (more accurate). Only these two values are accepted; any other value causes an error |
	| `device` | str | `"GPU"` | `"GPU"` or `"CPU"` |

	If the GPU is unavailable or your GPU quota is exhausted, processing continues on CPU (slower). When this happens, a `"warning"` field is included in the response (see [GPU Fallback Warning](#gpu-fallback-warning)).

	Segmentation presets:

	| Style | min_silence_ms | min_speech_ms | pad_ms |
	|---|---|---|---|
	| Mujawwad (slow) | 600 | 1500 | 300 |
	| Murattal (normal) | 200 | 1000 | 100 |
	| Fast | 75 | 750 | 40 |
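
The presets can be kept in a small lookup so that calls stay readable. This is only a sketch: the numbers come from the table above, while the dictionary keys and the `alignment_args` helper are illustrative names that the API does not define.

```python
# Values copied from the presets table; the keys are illustrative names.
PRESETS = {
    "mujawwad": (600, 1500, 300),
    "murattal": (200, 1000, 100),
    "fast": (75, 750, 40),
}


def alignment_args(audio, preset="murattal", model_name="Base", device="GPU"):
    """Build positional arguments in the documented parameter order."""
    min_silence_ms, min_speech_ms, pad_ms = PRESETS[preset]
    return (audio, min_silence_ms, min_speech_ms, pad_ms, model_name, device)


# Usage with a gradio_client Client (not executed here):
# result = client.predict(*alignment_args("recitation.mp3", "mujawwad"),
#                         api_name="/process_audio_session")
```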

	Response:
	```json
	{
	  "audio_id": "a1b2c3d4e5f67890a1b2c3d4e5f67890",
	  "segments": [
	    {
	      "segment": 1,
	      "time_from": 0.480,
	      "time_to": 2.880,
	      "ref_from": "112:1:1",
	      "ref_to": "112:1:4",
	      "matched_text": "قُلْ هُوَ ٱللَّهُ أَحَدٌ",
	      "confidence": 0.921,
	      "has_missing_words": false,
	      "error": null
	    },
	    {
	      "segment": 2,
	      "time_from": 4.320,
	      "time_to": 6.540,
	      "ref_from": "",
	      "ref_to": "",
	      "matched_text": "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيم",
	      "confidence": 0.952,
	      "has_missing_words": false,
	      "special_type": "Basmala",
	      "error": null
	    }
	  ]
	}
	```

	See [Segment Object](#segment-object) for field descriptions. See [Special Segment Types](#special-segment-types) for non-Quranic segments.
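
A typical first step after an alignment call is to separate Quranic from special segments and to flag the ones that need review. The following is a minimal sketch of that, assuming the response shape shown above; `parse_ref`, `summarize`, and the `min_confidence` threshold are illustrative choices rather than API features.

```python
def parse_ref(ref):
    """Split a "surah:ayah:word" reference into three integers."""
    surah, ayah, word = (int(part) for part in ref.split(":"))
    return surah, ayah, word


def summarize(segments, min_confidence=0.8):
    """Return (quranic, special, flagged) lists for review.

    min_confidence is an arbitrary illustrative threshold, not an API value.
    """
    quranic, special, flagged = [], [], []
    for seg in segments:
        if seg.get("special_type"):  # key is absent for normal segments
            special.append(seg)
        else:
            quranic.append(seg)
        low_confidence = seg.get("confidence", 0.0) < min_confidence
        if seg.get("error") or low_confidence or seg.get("has_missing_words"):
            flagged.append(seg)
    return quranic, special, flagged
```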

	---

	### `POST /process_url_session`

	Downloads audio from a URL, then runs the same pipeline as `/process_audio_session`. Supports YouTube, SoundCloud, MP3Quran, TikTok, and [500+ sites](https://github.com/yt-dlp/yt-dlp/blob/master/supportedsites.md) via yt-dlp.

	| Parameter | Type | Default | Description |
	|---|---|---|---|
	| `url` | str | required | URL to download audio from |
	| `min_silence_ms` | int | 200 | Minimum silence gap to split segments |
	| `min_speech_ms` | int | 1000 | Minimum speech duration to keep a segment |
	| `pad_ms` | int | 100 | Padding added to each side of a segment |
	| `model_name` | str | `"Base"` | `"Base"` or `"Large"` only |
	| `device` | str | `"GPU"` | `"GPU"` or `"CPU"` |

	Response: Same as `/process_audio_session`, plus a `url_metadata` field:
	```json
	{
	  "audio_id": "a1b2c3d4e5f67890a1b2c3d4e5f67890",
	  "url_metadata": {
	    "title": "Surah Al-Ikhlas - Sheikh Mishary",
	    "duration": 45.0,
	    "source_url": "https://..."
	  },
	  "segments": [...]
	}
	```

	Notes:
	- Playlists are rejected — pass a single video/audio URL.
	- Some sites (YouTube, Facebook, Instagram) may not work from the server due to IP restrictions. If a download fails, download the audio locally and use `/process_audio_session` instead.
	- After the session is created, all follow-up endpoints (`/resegment`, `/retranscribe`, etc.) work identically.
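
The fallback described in the notes above could be automated roughly as follows. This is a sketch under assumptions: it relies on the documented `"Download failed: ..."` error prefix, `process_url_with_fallback` is a hypothetical helper, and `local_path` is assumed to point at a copy of the audio that you obtained separately.

```python
def process_url_with_fallback(predict, url, local_path,
                              params=(200, 1000, 100, "Base", "GPU")):
    """Try /process_url_session; on a download error, retry with a local file.

    predict is any callable with the same calling convention as
    gradio_client's Client.predict, so the logic can be tested offline.
    """
    result = predict(url, *params, api_name="/process_url_session")
    error = result.get("error") or ""
    if error.startswith("Download failed"):
        # Assumption: local_path is a copy of the same audio, fetched by the caller.
        result = predict(local_path, *params, api_name="/process_audio_session")
    return result


# Usage (not executed here):
# result = process_url_with_fallback(client.predict, "https://...", "recitation.mp3")
```

Passing `predict` as an argument keeps the helper independent of a live connection; other errors, such as `"URL is required"`, are returned to the caller unchanged.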

	---

	### `POST /resegment`

	Re-splits the audio into segments using different silence/speech settings, then re-aligns. Reuses the session's cached audio, so nothing is uploaded or downloaded again.

	| Parameter | Type | Default | Description |
	|---|---|---|---|
	| `audio_id` | str | required | Session ID from a previous call |
	| `min_silence_ms` | int | 200 | New minimum silence gap |
	| `min_speech_ms` | int | 1000 | New minimum speech duration |
	| `pad_ms` | int | 100 | New padding |
	| `model_name` | str | `"Base"` | `"Base"` or `"Large"` only |
	| `device` | str | `"GPU"` | `"GPU"` or `"CPU"` |

	Response: Same shape as `/process_audio_session`. Session boundaries are updated.

	---

	### `POST /retranscribe`

	Re-recognizes text using a different model on the same segments, then re-aligns.

	| Parameter | Type | Default | Description |
	|---|---|---|---|
	| `audio_id` | str | required | Session ID from a previous call |
	| `model_name` | str | `"Base"` | `"Base"` or `"Large"` only |
	| `device` | str | `"GPU"` | `"GPU"` or `"CPU"` |

	Response: Same shape as `/process_audio_session`. Session model and results are updated.

	> Note: Returns an error if `model_name` is the same as the current session's model. To re-run with the same model on different boundaries, use `/resegment` or `/realign_from_timestamps` instead (they already include recognition + alignment).

	---

	### `POST /realign_from_timestamps`

	Aligns audio using custom time boundaries you provide. Useful for manually adjusting where segments start and end.

	| Parameter | Type | Default | Description |
	|---|---|---|---|
	| `audio_id` | str | required | Session ID from a previous call |
	| `timestamps` | list | required | Array of `{"start": float, "end": float}` in seconds |
	| `model_name` | str | `"Base"` | `"Base"` or `"Large"` only |
	| `device` | str | `"GPU"` | `"GPU"` or `"CPU"` |

	Example request body:
	```json
	{
	  "audio_id": "a1b2c3d4e5f67890a1b2c3d4e5f67890",
	  "timestamps": [
	    {"start": 0.5, "end": 3.2},
	    {"start": 3.8, "end": 5.1},
	    {"start": 5.1, "end": 7.4}
	  ],
	  "model_name": "Base",
	  "device": "GPU"
	}
	```

	Response: Same shape as `/process_audio_session`. Session boundaries are replaced with the provided timestamps.
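
Because the provided timestamps replace the session boundaries, it can help to sanity-check them on the client before sending. The rules in this sketch (positive duration, ascending order, no overlap) are assumptions chosen for illustration; the server may accept or reject inputs differently, and `validate_timestamps` is a hypothetical helper.

```python
def validate_timestamps(timestamps):
    """Check a list of {"start", "end"} dicts before sending it to the API.

    Raises ValueError on the first entry that breaks the client-side rules.
    """
    previous_end = 0.0
    for i, ts in enumerate(timestamps):
        start, end = float(ts["start"]), float(ts["end"])
        if start < 0 or end <= start:
            raise ValueError(f"timestamp {i}: need 0 <= start < end, got {start}, {end}")
        if start < previous_end:
            raise ValueError(f"timestamp {i}: overlaps the previous segment")
        previous_end = end
    return timestamps
```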

	---

	## Word Timestamps

	### `POST /timestamps`

	Gets word-level timing for each word in the aligned segments. Letter-level timing (`"words+chars"`) is currently disabled via the API.

	| Parameter | Type | Default | Description |
	|---|---|---|---|
	| `audio_id` | str | required | Session ID from a previous alignment call |
	| `segments` | list? | `None` (JSON `null`) | Segment list to align. `None` uses stored segments from the session |
	| `granularity` | str | `"words"` | Only `"words"` is supported. `"words+chars"` is currently disabled via API and returns an error |

	Example — using stored segments:
	```python
	result = client.predict(
	    "a1b2c3d4e5f67890a1b2c3d4e5f67890",  # audio_id
	    None,                                # segments (null = use stored)
	    "words",                             # granularity
	    api_name="/timestamps",
	)
	```

	Example — with segments override (minimal):
	```python
	result = client.predict(
	    "a1b2c3d4e5f67890a1b2c3d4e5f67890",
	    [  # segments override
	        {"time_from": 0.48, "time_to": 2.88, "ref_from": "112:1:1", "ref_to": "112:1:4"},
	        {"time_from": 3.12, "time_to": 5.44, "ref_from": "112:2:1", "ref_to": "112:2:3"},
	    ],
	    "words",
	    api_name="/timestamps",
	)
	```

	Example — special segment (Basmala):
	```python
	# Special segments use empty ref_from/ref_to and carry a special_type field
	{"time_from": 0.0, "time_to": 2.1, "ref_from": "", "ref_to": "", "special_type": "Basmala"}
	```

	Segment input fields:

	| Field | Type | Required | Description |
	|---|---|---|---|
	| `time_from` | float | yes | Start time in seconds |
	| `time_to` | float | yes | End time in seconds |
	| `ref_from` | str | yes | First word as `"surah:ayah:word"`. Empty for special segments |
	| `ref_to` | str | yes | Last word as `"surah:ayah:word"`. Empty for special segments |
	| `segment` | int | no | 1-indexed segment number. Auto-assigned from position if omitted |
	| `confidence` | float | no | Defaults to 1.0. Segments with confidence ≤ 0 are skipped |
	| `special_type` | str | no | Only for special segments (`"Basmala"`, `"Isti'adha"`, etc.) |
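
If you already hold segments from an alignment response, they can be reduced to this input format before being passed as a `segments` override. A minimal sketch, assuming the segment fields documented here; `to_timestamp_input` is a hypothetical helper, and dropping segments that carry an `error` is a choice made for this example.

```python
REQUIRED_FIELDS = ("time_from", "time_to", "ref_from", "ref_to")


def to_timestamp_input(segments):
    """Reduce alignment segments to the fields listed in the table above."""
    out = []
    for seg in segments:
        if seg.get("error"):
            continue  # assumption: segments that failed alignment are not re-sent
        item = {key: seg[key] for key in REQUIRED_FIELDS}
        if seg.get("special_type"):
            item["special_type"] = seg["special_type"]
        out.append(item)
    return out
```

Since `segment` is omitted, the table above implies that numbers are auto-assigned from position, so dropped entries shift the numbering of later segments.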

	Response:
	```json
	{
	  "audio_id": "a1b2c3d4e5f67890a1b2c3d4e5f67890",
	  "segments": [
	    {
	      "segment": 1,
	      "words": [
	        ["112:1:1", 0.0, 0.32],
	        ["112:1:2", 0.32, 0.58],
	        ["112:1:3", 0.58, 1.12],
	        ["112:1:4", 1.12, 1.68]
	      ]
	    }
	  ]
	}
	```

	See [Word Timestamp Arrays](#word-timestamp-arrays) for field details.
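
Word times are documented as relative to their segment, so positioning words in the full recording requires adding the segment's `time_from`. The sketch below assumes the `segment` numbers in the timestamps response match those in the alignment response; `absolute_word_times` is a hypothetical helper.

```python
def absolute_word_times(alignment_segments, timestamp_segments):
    """Convert segment-relative word times to times relative to the audio start.

    Joins the two responses on the 1-indexed "segment" number.
    """
    offsets = {seg["segment"]: seg["time_from"] for seg in alignment_segments}
    words = []
    for seg in timestamp_segments:
        offset = offsets[seg["segment"]]
        # Each word is [location, start, end]; any extra elements are ignored.
        for location, start, end, *_ in seg["words"]:
            words.append((location, round(offset + start, 3), round(offset + end, 3)))
    return words
```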

	---

	### `POST /timestamps_direct`

	Same as `/timestamps` but accepts an audio file directly — no session needed.

	| Parameter | Type | Default | Description |
	|---|---|---|---|
	| `audio` | file | required | Audio file (any common format) |
	| `segments` | list | required | Segment list with `time_from`/`time_to` boundaries |
	| `granularity` | str | `"words"` | Only `"words"` is supported. `"words+chars"` is currently disabled via API and returns an error |

	Response: Same shape as `/timestamps` but without `audio_id`.

	Example (minimal):
	```python
	result = client.predict(
	    "recitation.mp3",
	    [
	        {"time_from": 0.48, "time_to": 2.88, "ref_from": "112:1:1", "ref_to": "112:1:4"},
	        {"time_from": 3.12, "time_to": 5.44, "ref_from": "112:2:1", "ref_to": "112:2:3"},
	    ],
	    "words",
	    api_name="/timestamps_direct",
	)
	```

	Segment input format is the same as for `/timestamps` — see above.

	---

	## Utilities

	### `POST /estimate_duration`

	Estimate processing time before starting a request.

	| Parameter | Type | Default | Description |
	|---|---|---|---|
	| `endpoint` | str | required | Target endpoint name (e.g. `"process_audio_session"`) |
	| `audio_duration_s` | float | `None` | Audio length in seconds. Required if no `audio_id` |
	| `audio_id` | str | `None` | Session ID; looks up audio duration from the session |
	| `model_name` | str | `"Base"` | `"Base"` or `"Large"` only |
	| `device` | str | `"GPU"` | `"GPU"` or `"CPU"` |

	Example — before first processing call:
	```python
	est = client.predict(
	    "process_audio_session",  # endpoint
	    60.0,                     # audio_duration_s (seconds)
	    None,                     # audio_id (not yet available)
	    "Base",                   # model_name
	    "GPU",                    # device
	    api_name="/estimate_duration",
	)
	print(f"Estimated time: {est['estimated_duration_s']}s")
	print(f"Estimated time: {est['estimated_duration_s']}s")
	```

	Example — with existing session (e.g. before getting timestamps):
	```python
	est = client.predict(
	    "timestamps",  # endpoint
	    None,          # audio_duration_s (looked up from session)
	    audio_id,      # audio_id
	    "Base",        # model_name
	    "GPU",         # device
	    api_name="/estimate_duration",
	)
	```

	Response:
	```json
	{
	  "endpoint": "process_audio_session",
	  "estimated_duration_s": 28.0,
	  "device": "GPU",
	  "model_name": "Base"
	}
	```
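
Both examples above use the same positional order and differ only in which of `audio_duration_s` and `audio_id` is set. A small wrapper can enforce that at least one is given; this is a sketch, and `estimate_args` is a hypothetical helper name.

```python
def estimate_args(endpoint, audio_duration_s=None, audio_id=None,
                  model_name="Base", device="GPU"):
    """Build positional arguments for /estimate_duration in the documented order."""
    if audio_duration_s is None and audio_id is None:
        raise ValueError("pass audio_duration_s or audio_id")
    # Assumption: endpoint names are sent without a leading slash, as in the examples.
    return (endpoint.lstrip("/"), audio_duration_s, audio_id, model_name, device)


# Usage (not executed here):
# est = client.predict(*estimate_args("/timestamps", audio_id=audio_id),
#                      api_name="/estimate_duration")
```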

	---

	## Response Reference

	### Segment Object

	Returned by all alignment endpoints (`/process_audio_session`, `/process_url_session`, `/resegment`, `/retranscribe`, `/realign_from_timestamps`).

	| Field | Type | Description |
	|---|---|---|
	| `segment` | int | 1-indexed segment number |
	| `time_from` | float | Start time in seconds |
	| `time_to` | float | End time in seconds |
	| `ref_from` | str | First matched word as `"surah:ayah:word"`. Empty string for special segments |
	| `ref_to` | str | Last matched word as `"surah:ayah:word"`. Empty string for special segments |
	| `matched_text` | str | Quran text for the matched range (or special segment text) |
	| `confidence` | float | 0.0–1.0, indicating how well the segment matched the Quran text |
	| `has_missing_words` | bool | Whether some expected words were not found in the audio |
	| `special_type` | str | Only present for special (non-Quranic) segments; see below. Absent for normal segments |
	| `error` | str? | Error message if alignment failed, else `null` |

	### Special Segment Types

	Non-Quranic segments detected within recitations. When `special_type` is present, `ref_from` and `ref_to` are empty strings.

	| `special_type` | Arabic Text |
	|----------------|-------------|
	| `Basmala` | بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيم |
	| `Isti'adha` | أَعُوذُ بِٱللَّهِ مِنَ الشَّيْطَانِ الرَّجِيم |
	| `Amin` | آمِين |
	| `Takbir` | اللَّهُ أَكْبَر |
	| `Tahmeed` | سَمِعَ اللَّهُ لِمَنْ حَمِدَه |
	| `Tasleem` | ٱلسَّلَامُ عَلَيْكُمْ وَرَحْمَةُ ٱللَّه |
	| `Sadaqa` | صَدَقَ ٱللَّهُ ٱلْعَظِيم |

	### Word Timestamp Arrays

	Returned by `/timestamps` and `/timestamps_direct`. Each word is an array `[location, start, end]`. The four-element form `[location, start, end, letters]` applies only to `"words+chars"` granularity.

	| Index | Type | Description |
	|---|---|---|
	| 0 | str | Word position as `"surah:ayah:word"` |
	| 1 | float | Start time relative to segment (seconds) |
	| 2 | float | End time relative to segment (seconds) |

	> Note: `"words+chars"` granularity (letter-level timestamps) is currently disabled via API. Only word-level timestamps are returned.

	### GPU Fallback Warning

	When GPU processing is unavailable (for example, because the GPU quota is exhausted), processing continues on CPU, which is slower. In that case the response from any endpoint includes a `"warning"` field:

	```json
	{
	  "audio_id": "...",
	  "warning": "GPU quota reached — processed on CPU (slower). Resets in 13:53:59.",
	  "segments": [...]
	}
	```

	The `"warning"` key is absent (not `null`) when processing ran on GPU normally. Clients should check `if "warning" in result` rather than checking for `null`.
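
The membership check can be combined with parsing the reset time out of the message. This sketch assumes the `Resets in H:MM:SS` wording from the example above is stable, which the API does not guarantee; `gpu_fallback_reset` is a hypothetical helper.

```python
import re
from datetime import timedelta


def gpu_fallback_reset(result):
    """Return the time until the quota resets, or None if it cannot be determined."""
    if "warning" not in result:  # key is absent, not null, on normal GPU runs
        return None
    # Assumption: the wording matches the documented example message.
    match = re.search(r"Resets in (\d+):(\d{2}):(\d{2})", result["warning"])
    if match is None:
        return None
    hours, minutes, seconds = map(int, match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)
```

A `None` return covers both the normal GPU case and a warning whose text does not match, so callers that need to distinguish the two should still test `"warning" in result` directly.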

	### Errors

	All errors follow the same shape: `{"error": "...", "segments": []}`. Endpoints that have an active session also include `audio_id`.

	| Condition | Error message | `audio_id` present? |
	|---|---|---|
	| Session not found or expired | `"Session not found or expired"` | No |
	| No speech detected (process) | `"No speech detected in audio"` | No (no session created) |
	| No segments after resegment | `"No segments with these settings"` | Yes |
	| Invalid model name | `"Invalid model_name '...'. Must be one of: Base, Large"` | Depends on endpoint |
	| Retranscribe with same model | `"Model and boundaries unchanged. Change model_name or call /resegment first."` | Yes |
	| Retranscription failed | `"Retranscription failed"` | Yes |
	| Realignment failed | `"Alignment failed"` | Yes |
	| No segments in session (timestamps) | `"No segments found in session"` | Yes |
	| Timestamp alignment failed | `"Alignment failed: ..."` | Yes (session) / No (direct) |
	| No segments provided (timestamps direct) | `"No segments provided"` | No |
	| URL is empty (process_url) | `"URL is required"` | No |
	| URL download failed (process_url) | `"Download failed: ..."` | No |
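
Since every error shares the same shape, a client can convert error responses into exceptions in one place. The following is a minimal sketch; `AlignerError` and `check` are hypothetical names, and only the top-level `error` field is inspected here, not the per-segment `error` values.

```python
class AlignerError(Exception):
    """Raised when a response carries a top-level "error" field."""

    def __init__(self, message, audio_id=None):
        super().__init__(message)
        # audio_id is set only when the failing endpoint had an active session.
        self.audio_id = audio_id


def check(result):
    """Return result unchanged, or raise AlignerError if it reports a failure."""
    if result.get("error"):
        raise AlignerError(result["error"], result.get("audio_id"))
    return result


# Usage (not executed here):
# result = check(client.predict(audio_id, 75, 750, 40, "Base", "GPU", api_name="/resegment"))
```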