API Documentation
Current Endpoints
POST /process_audio_json
Stateless endpoint. Accepts audio and segmentation parameters, returns aligned JSON output.
Inputs: audio file, min_silence_ms, min_speech_ms, pad_ms, model_name, device
Returns: JSON with segments array (segment index, timestamps, Quran references, matched text, confidence, errors).
Limitation: Every call requires re-uploading the audio. No way to resegment or retranscribe without re-sending the full file.
Planned: Session-Based Endpoints
The Gradio UI already caches intermediate results (preprocessed audio, VAD output, segment boundaries, model name) in gr.State so that resegment/retranscribe operations skip expensive steps. But gr.State is tied to the browser's WebSocket session, so API clients using gradio_client can't benefit from it.
Approach: Server-Side Session Store
On the first request, the server stores all intermediate data keyed by a UUID (audio_id) and returns it in the response. Subsequent requests reference this audio_id instead of re-uploading audio.
What gets stored per session:
- Preprocessed audio (float32, 16 kHz mono): saved to disk as .npy
- Raw VAD speech intervals: in memory (small)
- VAD completeness flags: in memory
- Cleaned segment boundaries: in memory
- Model name used: in memory
Lifecycle: Sessions expire after the same TTL as the existing Gradio cache (5 hours). A background thread purges expired sessions periodically. Audio files live under /tmp/sessions/{audio_id}/.
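A minimal sketch of such a store, assuming a single-process server; the class name, method names, and error mapping here are illustrative, not the actual implementation:

```python
import threading
import time
import uuid

SESSION_TTL_SEC = 5 * 3600  # match the existing Gradio cache TTL

class SessionStore:
    """Thread-safe in-memory map of audio_id -> session metadata."""

    def __init__(self, ttl_sec=SESSION_TTL_SEC):
        self._ttl = ttl_sec
        self._lock = threading.Lock()
        self._sessions = {}  # audio_id -> (created_at, data)

    def create(self, data):
        audio_id = str(uuid.uuid4())
        with self._lock:
            self._sessions[audio_id] = (time.time(), data)
        return audio_id

    def get(self, audio_id):
        with self._lock:
            entry = self._sessions.get(audio_id)
            if entry is None or time.time() - entry[0] > self._ttl:
                # caller maps None to the "Session not found or expired" error
                self._sessions.pop(audio_id, None)
                return None
            return entry[1]

    def purge_expired(self):
        """Called periodically by the background purge thread."""
        now = time.time()
        with self._lock:
            expired = [k for k, (t, _) in self._sessions.items() if now - t > self._ttl]
            for k in expired:
                del self._sessions[k]
        return len(expired)
```

A real purge would also delete the session's directory under /tmp/sessions/{audio_id}/, since only the small metadata lives in memory.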
POST /process_audio_session
Full pipeline. Same as /process_audio_json but additionally creates a server-side session.
Inputs: audio file, min_silence_ms, min_speech_ms, pad_ms, model_name, device
Returns: Same JSON as /process_audio_json with an added audio_id field.
POST /resegment_session
Re-cleans VAD boundaries with new segmentation parameters and re-runs ASR. Skips audio upload, preprocessing, and VAD inference.
Inputs: audio_id, min_silence_ms, min_speech_ms, pad_ms, model_name, device
Returns: JSON with segments array and the same audio_id.
POST /retranscribe_session
Re-runs ASR with a different model on the existing segment boundaries. Skips audio upload, preprocessing, VAD, and resegmentation.
Inputs: audio_id, model_name, device
Returns: JSON with segments array and the same audio_id.
POST /realign_from_timestamps
Accepts an arbitrary list of (start, end) timestamp pairs and runs ASR + phoneme alignment on each slice. Skips VAD entirely; the client defines the segment boundaries directly. This is the core endpoint for timeline-based editing where the user drags segment boundaries manually.
Inputs: audio_id, timestamps (list of {start, end} objects in seconds), model_name, device
Returns: JSON with segments array and the same audio_id. Session boundaries are updated to match the provided timestamps.
This subsumes /resegment_session for most client use cases: the client can split, merge, and drag boundaries freely, then send the final timestamp list in one call.
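Server-side validation of the client-supplied timestamp list might look like the following sketch. The constraints (ordered, non-overlapping, within the audio duration) are assumptions about sensible behaviour, not documented requirements:

```python
def validate_timestamps(timestamps, audio_duration_sec):
    """Check a client-supplied list of {start, end} pairs before realigning.

    Returns a list of error strings; an empty list means the input is usable.
    """
    errors = []
    prev_end = 0.0
    for i, ts in enumerate(timestamps):
        start, end = ts["start"], ts["end"]
        if start < 0 or end > audio_duration_sec:
            errors.append(f"segment {i}: outside audio range [0, {audio_duration_sec}]")
        if end <= start:
            errors.append(f"segment {i}: end must be greater than start")
        if start < prev_end:
            errors.append(f"segment {i}: overlaps previous segment")
        prev_end = max(prev_end, end)
    return errors
```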
Planned: Segment Editing Endpoints
Fine-grained operations for modifying individual segments without reprocessing the full recitation.
POST /split_segment
Split one segment at a given timestamp into two. Re-runs alignment on each half independently.
Inputs: audio_id, segment_index, split_time (seconds)
Returns: Updated segments array with the split segment replaced by two new segments.
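The boundary arithmetic is straightforward; a sketch, with segments represented as plain (start, end) tuples, which is an assumption about the stored form:

```python
def split_segment(boundaries, index, split_time):
    """Replace boundaries[index] with two segments split at split_time.

    boundaries is a list of (start, end) pairs in seconds. The server
    would then re-run alignment on each half independently.
    """
    start, end = boundaries[index]
    if not (start < split_time < end):
        raise ValueError(f"split_time {split_time} not inside ({start}, {end})")
    return boundaries[:index] + [(start, split_time), (split_time, end)] + boundaries[index + 1:]
```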
POST /merge_segments
Merge two adjacent segments into one. Re-runs alignment on the combined audio slice.
Inputs: audio_id, segment_index_a, segment_index_b (must be adjacent)
Returns: Updated segments array with the two segments replaced by one.
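A matching sketch for the merge, under the same assumed (start, end) representation; the combined slice would then be re-aligned:

```python
def merge_segments(boundaries, index_a, index_b):
    """Merge two adjacent segments into one spanning both."""
    if abs(index_a - index_b) != 1:
        raise ValueError("segments must be adjacent")
    lo, hi = min(index_a, index_b), max(index_a, index_b)
    merged = (boundaries[lo][0], boundaries[hi][1])
    return boundaries[:lo] + [merged] + boundaries[hi + 1:]
```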
POST /adjust_boundary
Shift a segment's start or end time. Re-runs alignment on the affected segment, and on its neighbour if the new boundary overlaps it.
Inputs: audio_id, segment_index, new_start (seconds, optional), new_end (seconds, optional)
Returns: Updated segments array.
POST /override_segment_text
Manually assign a Quran reference range to a segment, skipping alignment entirely. For when the aligner gets it wrong and the user knows the correct ayah.
Inputs: audio_id, segment_index, ref_from (e.g. "2:255:1"), ref_to (e.g. "2:255:7")
Returns: Updated segment with the overridden reference and corresponding Quran text.
POST /bulk_update_segments
Batch update: client sends a full modified segment list (adjusted times, overridden labels). Server validates, persists to session, and optionally re-aligns changed segments.
Inputs: audio_id, segments (list of {start, end, ref_from?, ref_to?}), realign (boolean, default true: re-run ASR on segments whose boundaries changed)
Returns: Full updated segments array.
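With realign=true, the server only needs to re-run ASR on segments whose boundaries actually moved. A sketch of that diff; the tolerance value is an assumption:

```python
def changed_segment_indices(old, new, tol_sec=0.01):
    """Indices in `new` whose (start, end) differ from `old` by more than tol_sec.

    Segments beyond the length of `old` (freshly added ones) always count
    as changed, since they have no prior alignment to reuse.
    """
    changed = []
    for i, seg in enumerate(new):
        if i >= len(old):
            changed.append(i)
            continue
        if abs(seg["start"] - old[i]["start"]) > tol_sec or abs(seg["end"] - old[i]["end"]) > tol_sec:
            changed.append(i)
    return changed
```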
Planned: Word-Level Timing
POST /compute_word_timestamps
Compute word-level start/end times for every word in every segment. This is the backbone of karaoke-style highlighting and word-by-word caption animation.
Inputs: audio_id, model_name, device
Returns: JSON with per-segment word timestamps:
{
  "audio_id": "...",
  "segments": [
    {
      "segment": 1,
      "words": [
        {"word": "بِسْمِ", "start": 0.81, "end": 1.12},
        {"word": "اللَّهِ", "start": 1.12, "end": 1.45}
      ]
    }
  ]
}
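On the client, karaoke-style highlighting reduces to "which word contains playback time t". With the segment word lists flattened into one list sorted by start, a binary search answers that; a sketch:

```python
import bisect

def word_at_time(words, t):
    """Return the word active at time t, or None during a pause.

    `words` is a flat list of {"word", "start", "end"} dicts sorted by start.
    """
    starts = [w["start"] for w in words]
    i = bisect.bisect_right(starts, t) - 1
    if i >= 0 and words[i]["start"] <= t < words[i]["end"]:
        return words[i]["word"]
    return None
```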
Planned: Export Endpoints
Generate subtitle files from session data. All accept audio_id and optionally use word-level timestamps if previously computed.
POST /export_srt
Standard SRT subtitle format. One entry per segment (or per word if word_level=true).
Inputs: audio_id, word_level (boolean, default false)
Returns: SRT file content.
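The SRT format itself is simple enough to sketch; the segment fields here ("start", "end", "text") are assumed to mirror the session's JSON output:

```python
def to_srt_time(sec):
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(sec * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments):
    """Render segments ({"start", "end", "text"}) as an SRT document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n{seg['text']}")
    return "\n\n".join(blocks) + "\n"
```

WebVTT differs mainly in the header line and a dot instead of a comma in timestamps, so /export_vtt can share most of this logic.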
POST /export_vtt
WebVTT format. Supports styling cues and is the standard for web video players.
Inputs: audio_id, word_level (boolean, default false)
Returns: VTT file content.
POST /export_ass
ASS/SSA format with Arabic font and styling presets. Most useful for video editors producing styled Quran captions.
Inputs: audio_id, word_level (boolean, default false), font_name (optional), font_size (optional)
Returns: ASS file content.
Planned: Quran Lookup Endpoints
Utility endpoints for client-side UI (dropdowns, search, manual labelling).
GET /quran_text
Return Quran text with diacritics for a given reference range.
Inputs: ref_from (e.g. "2:255:1"), ref_to (e.g. "2:255:7")
Returns: {"text": "...", "ref_from": "...", "ref_to": "..."}. All 114 chapters are pre-cached in memory.
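References use the surah:ayah:word form shown in the examples. A parsing sketch either side might use; the range checks are assumptions about validation behaviour:

```python
def parse_ref(ref):
    """Parse a "surah:ayah:word" reference like "2:255:1" into an int tuple."""
    parts = ref.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected surah:ayah:word, got {ref!r}")
    surah, ayah, word = (int(p) for p in parts)
    if not 1 <= surah <= 114:
        raise ValueError(f"surah {surah} out of range 1-114")
    if ayah < 1 or word < 1:
        raise ValueError("ayah and word are 1-based")
    return surah, ayah, word
```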
GET /surah_info
List of all surahs with metadata.
Returns: Array of {number, name_arabic, name_english, ayah_count, revelation_type}.
Planned: Recitation Analytics
POST /recitation_stats
Derive pace and timing analytics from an existing session's alignment results.
Inputs: audio_id
Returns:
{
  "audio_id": "...",
  "total_duration_sec": 312.5,
  "total_segments": 7,
  "total_words": 86,
  "words_per_minute": 16.5,
  "avg_segment_duration_sec": 8.2,
  "avg_pause_duration_sec": 1.4,
  "per_segment": [
    {
      "segment": 1,
      "ref_from": "112:1:1",
      "ref_to": "112:1:4",
      "duration_sec": 2.18,
      "word_count": 4,
      "words_per_minute": 110.1,
      "pause_after_sec": 1.82
    }
  ]
}
Useful for learning apps tracking student fluency, reciter comparisons, or detecting rushed/slow sections.
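The aggregates derive directly from the aligned segments. A sketch of the arithmetic, matching the example above (overall pace over total duration, per-segment pace over segment duration); the word_count field is an assumption about the stored session data:

```python
def recitation_stats(segments, total_duration_sec):
    """Compute pace/timing aggregates from aligned segments.

    segments: list of {"start", "end", "word_count"} dicts, ordered by start.
    """
    total_words = sum(s["word_count"] for s in segments)
    durations = [s["end"] - s["start"] for s in segments]
    # pauses are the gaps between consecutive segments
    pauses = [b["start"] - a["end"] for a, b in zip(segments, segments[1:])]
    return {
        "total_duration_sec": total_duration_sec,
        "total_segments": len(segments),
        "total_words": total_words,
        "words_per_minute": round(total_words / (total_duration_sec / 60), 1),
        "avg_segment_duration_sec": round(sum(durations) / len(durations), 2) if durations else 0.0,
        "avg_pause_duration_sec": round(sum(pauses) / len(pauses), 2) if pauses else 0.0,
    }
```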
Planned: Streaming
POST /process_chunk
Streaming-friendly endpoint for incremental audio processing. The client sends audio chunks as they become available, and the server returns partial alignment results progressively. Designed for live "now playing" displays (e.g. Quran radio showing the current ayah in real time).
Inputs: audio_id (optional β omit on first chunk to start a new session), audio_chunk (raw audio bytes), is_final (boolean)
Returns:
{
  "audio_id": "...",
  "status": "partial",
  "latest_segments": [
    {
      "segment": 5,
      "ref_from": "36:1:1",
      "ref_to": "36:1:2",
      "matched_text": "يسٓ",
      "time_from": 24.3,
      "time_to": 25.8,
      "confidence": 0.95
    }
  ]
}
When is_final=true, the server finalises the session and returns the complete aligned output (same structure as /process_audio_session).
Chunking notes: The server buffers audio internally and runs VAD + ASR when enough speech has accumulated to form a segment. Earlier segments are locked in and won't change; only the trailing edge is provisional.
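The locked-vs-provisional distinction can be sketched as a pure function: a segment is locked once it ends well before the edge of the audio buffered so far, since later chunks can no longer move it. The margin value is an assumption:

```python
def lock_segments(segments, buffered_sec, margin_sec=1.0):
    """Split detected segments into locked and provisional lists.

    A segment whose end is at least margin_sec before the end of the
    buffered audio is locked; anything nearer the trailing edge stays
    provisional and may change as more audio arrives.
    """
    locked, provisional = [], []
    for seg in segments:
        (locked if seg["end"] <= buffered_sec - margin_sec else provisional).append(seg)
    return locked, provisional
```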
Planned: Health / Status
GET /health
Server status for monitoring dashboards and client-side availability checks.
Returns:
{
  "status": "ok",
  "gpu_available": true,
  "gpu_quota_exhausted": false,
  "quota_reset_time": null,
  "active_sessions": 12,
  "models_loaded": ["Base", "Large"],
  "uptime_sec": 84200
}
Error Handling
If audio_id is missing, expired, or invalid, session endpoints return:
{"error": "Session not found or expired", "segments": []}
The client should call /process_audio_session again to get a fresh session.
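Client-side, the recovery path is mechanical: make the session call, and on the error above rebuild the session once and retry. A sketch with the transport abstracted away; call_endpoint and reprocess are hypothetical stand-ins for whatever HTTP or gradio_client wrapper the client uses:

```python
def with_session_retry(call_endpoint, endpoint, params, reprocess):
    """Run a session endpoint; on session expiry, rebuild the session once and retry.

    call_endpoint(endpoint, params) -> response dict from the server
    reprocess() -> fresh response from /process_audio_session (contains audio_id)
    """
    resp = call_endpoint(endpoint, params)
    if resp.get("error", "").startswith("Session not found"):
        fresh = reprocess()
        params = {**params, "audio_id": fresh["audio_id"]}
        resp = call_endpoint(endpoint, params)
    return resp
```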
Design Notes
- Thread safety: Gradio handles concurrent requests via threading. The session store uses a lock around its internal dict.
- Storage: Audio on disk (can be large), metadata in memory (always small). Audio loaded via memory-mapped reads on demand.
- No auth needed: Session IDs are random 128-bit UUIDs, effectively unguessable.
- HF Spaces compatibility: /tmp is ephemeral and cleared on restart, which is fine since sessions are transient. The existing allowed_paths=["/tmp"] covers the new directory.
- Backward compatible: /process_audio_json remains unchanged.