API Documentation
Current Endpoints
POST /process_audio_json
Stateless endpoint. Accepts audio and segmentation parameters, returns aligned JSON output.
Inputs: audio file, min_silence_ms, min_speech_ms, pad_ms, model_name, device
Returns: JSON with segments array (segment index, timestamps, Quran references, matched text, confidence, errors).
Limitation: Every call requires re-uploading the audio. No way to resegment or retranscribe without re-sending the full file.
Planned: Session-Based Endpoints
The Gradio UI already caches intermediate results (preprocessed audio, VAD output, segment boundaries, model name) in gr.State so that resegment/retranscribe operations skip expensive steps. But gr.State is tied to the browser's WebSocket session, so API clients using gradio_client can't benefit from it.
Approach: Server-Side Session Store
On the first request, the server stores all intermediate data keyed by a UUID (audio_id) and returns it in the response. Subsequent requests reference this audio_id instead of re-uploading audio.
What gets stored per session:
- Preprocessed audio (float32, 16 kHz mono): saved to disk as .npy
- Raw VAD speech intervals: in memory (small)
- VAD completeness flags: in memory
- Cleaned segment boundaries: in memory
- Model name used: in memory
Lifecycle: Sessions expire after the same TTL as the existing Gradio cache (5 hours). A background thread purges expired sessions periodically. Audio files live under /tmp/sessions/{audio_id}/.
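A minimal sketch of such a store, assuming a single-process server; the class name, method names, and error mapping here are illustrative, not the actual implementation:

```python
import threading
import time
import uuid

SESSION_TTL_SEC = 5 * 3600  # match the existing Gradio cache TTL

class SessionStore:
    """Thread-safe in-memory map of audio_id -> session metadata."""

    def __init__(self, ttl_sec=SESSION_TTL_SEC):
        self._ttl = ttl_sec
        self._lock = threading.Lock()
        self._sessions = {}  # audio_id -> (created_at, data)

    def create(self, data):
        audio_id = str(uuid.uuid4())
        with self._lock:
            self._sessions[audio_id] = (time.time(), data)
        return audio_id

    def get(self, audio_id):
        with self._lock:
            entry = self._sessions.get(audio_id)
            if entry is None or time.time() - entry[0] > self._ttl:
                # caller maps None to the "Session not found or expired" error
                self._sessions.pop(audio_id, None)
                return None
            return entry[1]

    def purge_expired(self):
        """Called periodically by the background purge thread."""
        now = time.time()
        with self._lock:
            expired = [k for k, (t, _) in self._sessions.items() if now - t > self._ttl]
            for k in expired:
                del self._sessions[k]
        return len(expired)
```

A real purge would also delete the session's directory under /tmp/sessions/{audio_id}/, since only the small metadata lives in memory.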
POST /process_audio_session
Full pipeline. Same as /process_audio_json but additionally creates a server-side session.
Inputs: audio file, min_silence_ms, min_speech_ms, pad_ms, model_name, device
Returns: Same JSON as /process_audio_json with an added audio_id field.
POST /resegment_session
Re-cleans VAD boundaries with new segmentation parameters and re-runs ASR. Skips audio upload, preprocessing, and VAD inference.
Inputs: audio_id, min_silence_ms, min_speech_ms, pad_ms, model_name, device
Returns: JSON with segments array and the same audio_id.
POST /retranscribe_session
Re-runs ASR with a different model on the existing segment boundaries. Skips audio upload, preprocessing, VAD, and resegmentation.
Inputs: audio_id, model_name, device
Returns: JSON with segments array and the same audio_id.
POST /realign_from_timestamps
Accepts an arbitrary list of (start, end) timestamp pairs and runs ASR + phoneme alignment on each slice. Skips VAD entirely; the client defines the segment boundaries directly. This is the core endpoint for timeline-based editing where the user drags segment boundaries manually.
Inputs: audio_id, timestamps (list of {start, end} objects in seconds), model_name, device
Returns: JSON with segments array and the same audio_id. Session boundaries are updated to match the provided timestamps.
This subsumes /resegment_session for most client use cases: the client can split, merge, and drag boundaries freely, then send the final timestamp list in one call.
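Server-side validation of the client-supplied timestamp list might look like the following sketch. The constraints (ordered, non-overlapping, within the audio duration) are assumptions about sensible behaviour, not documented requirements:

```python
def validate_timestamps(timestamps, audio_duration_sec):
    """Check a client-supplied list of {start, end} pairs before realigning.

    Returns a list of error strings; an empty list means the input is usable.
    """
    errors = []
    prev_end = 0.0
    for i, ts in enumerate(timestamps):
        start, end = ts["start"], ts["end"]
        if start < 0 or end > audio_duration_sec:
            errors.append(f"segment {i}: outside audio range [0, {audio_duration_sec}]")
        if end <= start:
            errors.append(f"segment {i}: end must be greater than start")
        if start < prev_end:
            errors.append(f"segment {i}: overlaps previous segment")
        prev_end = max(prev_end, end)
    return errors
```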
Planned: Segment Editing Endpoints
Fine-grained operations for modifying individual segments without reprocessing the full recitation.
POST /split_segment
Split one segment at a given timestamp into two. Re-runs alignment on each half independently.
Inputs: audio_id, segment_index, split_time (seconds)
Returns: Updated segments array with the split segment replaced by two new segments.
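The boundary arithmetic is straightforward; a sketch, with segments represented as plain (start, end) tuples, which is an assumption about the stored form:

```python
def split_segment(boundaries, index, split_time):
    """Replace boundaries[index] with two segments split at split_time.

    boundaries is a list of (start, end) pairs in seconds. The server
    would then re-run alignment on each half independently.
    """
    start, end = boundaries[index]
    if not (start < split_time < end):
        raise ValueError(f"split_time {split_time} not inside ({start}, {end})")
    return boundaries[:index] + [(start, split_time), (split_time, end)] + boundaries[index + 1:]
```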
POST /merge_segments
Merge two adjacent segments into one. Re-runs alignment on the combined audio slice.
Inputs: audio_id, segment_index_a, segment_index_b (must be adjacent)
Returns: Updated segments array with the two segments replaced by one.
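A matching sketch for the merge, under the same assumed (start, end) representation; the combined slice would then be re-aligned:

```python
def merge_segments(boundaries, index_a, index_b):
    """Merge two adjacent segments into one spanning both."""
    if abs(index_a - index_b) != 1:
        raise ValueError("segments must be adjacent")
    lo, hi = min(index_a, index_b), max(index_a, index_b)
    merged = (boundaries[lo][0], boundaries[hi][1])
    return boundaries[:lo] + [merged] + boundaries[hi + 1:]
```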
POST /adjust_boundary
Shift a segment's start or end time. Re-runs alignment on the affected segment, and on its neighbour if the new boundary overlaps it.
Inputs: audio_id, segment_index, new_start (seconds, optional), new_end (seconds, optional)
Returns: Updated segments array.
POST /override_segment_text
Manually assign a Quran reference range to a segment, skipping alignment entirely. For when the aligner gets it wrong and the user knows the correct ayah.
Inputs: audio_id, segment_index, ref_from (e.g. "2:255:1"), ref_to (e.g. "2:255:7")
Returns: Updated segment with the overridden reference and corresponding Quran text.
POST /bulk_update_segments
Batch update: client sends a full modified segment list (adjusted times, overridden labels). Server validates, persists to session, and optionally re-aligns changed segments.
Inputs: audio_id, segments (list of {start, end, ref_from?, ref_to?}), realign (boolean, default true: re-run ASR on segments whose boundaries changed)
Returns: Full updated segments array.
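With realign=true, the server only needs to re-run ASR on segments whose boundaries actually moved. A sketch of that diff; the tolerance value is an assumption:

```python
def changed_segment_indices(old, new, tol_sec=0.01):
    """Indices in `new` whose (start, end) differ from `old` by more than tol_sec.

    Segments beyond the length of `old` (freshly added ones) always count
    as changed, since they have no prior alignment to reuse.
    """
    changed = []
    for i, seg in enumerate(new):
        if i >= len(old):
            changed.append(i)
            continue
        if abs(seg["start"] - old[i]["start"]) > tol_sec or abs(seg["end"] - old[i]["end"]) > tol_sec:
            changed.append(i)
    return changed
```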
Planned: Word-Level Timing
POST /compute_word_timestamps
Compute word-level start/end times for every word in every segment. This is the backbone of karaoke-style highlighting and word-by-word caption animation.
Inputs: audio_id, model_name, device
Returns: JSON with per-segment word timestamps:
{
  "audio_id": "...",
  "segments": [
    {
      "segment": 1,
      "words": [
        {"word": "بِسْمِ", "start": 0.81, "end": 1.12},
        {"word": "اللَّهِ", "start": 1.12, "end": 1.45}
      ]
    }
  ]
}
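On the client, karaoke-style highlighting reduces to "which word contains playback time t". With the segment word lists flattened into one list sorted by start, a binary search answers that; a sketch:

```python
import bisect

def word_at_time(words, t):
    """Return the word active at time t, or None during a pause.

    `words` is a flat list of {"word", "start", "end"} dicts sorted by start.
    """
    starts = [w["start"] for w in words]
    i = bisect.bisect_right(starts, t) - 1
    if i >= 0 and words[i]["start"] <= t < words[i]["end"]:
        return words[i]["word"]
    return None
```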
Planned: Export Endpoints
Generate subtitle files from session data. All accept audio_id and optionally use word-level timestamps if previously computed.
POST /export_srt
Standard SRT subtitle format. One entry per segment (or per word if word_level=true).
Inputs: audio_id, word_level (boolean, default false)
Returns: SRT file content.
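The SRT format itself is simple enough to sketch; the segment fields here ("start", "end", "text") are assumed to mirror the session's JSON output:

```python
def to_srt_time(sec):
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(sec * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments):
    """Render segments ({"start", "end", "text"}) as an SRT document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n{seg['text']}")
    return "\n\n".join(blocks) + "\n"
```

WebVTT differs mainly in the header line and a dot instead of a comma in timestamps, so /export_vtt can share most of this logic.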
POST /export_vtt
WebVTT format. Supports styling cues and is the standard for web video players.
Inputs: audio_id, word_level (boolean, default false)
Returns: VTT file content.
POST /export_ass
ASS/SSA format with Arabic font and styling presets. Most useful for video editors producing styled Quran captions.
Inputs: audio_id, word_level (boolean, default false), font_name (optional), font_size (optional)
Returns: ASS file content.
Planned: Quran Lookup Endpoints
Utility endpoints for client-side UI (dropdowns, search, manual labelling).
GET /quran_text
Return Quran text with diacritics for a given reference range.
Inputs: ref_from (e.g. "2:255:1"), ref_to (e.g. "2:255:7")
Returns: {"text": "...", "ref_from": "...", "ref_to": "..."}. All 114 chapters are pre-cached in memory.
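References use the surah:ayah:word form shown in the examples. A parsing sketch either side might use; the range checks are assumptions about validation behaviour:

```python
def parse_ref(ref):
    """Parse a "surah:ayah:word" reference like "2:255:1" into an int tuple."""
    parts = ref.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected surah:ayah:word, got {ref!r}")
    surah, ayah, word = (int(p) for p in parts)
    if not 1 <= surah <= 114:
        raise ValueError(f"surah {surah} out of range 1-114")
    if ayah < 1 or word < 1:
        raise ValueError("ayah and word are 1-based")
    return surah, ayah, word
```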
GET /surah_info
List of all surahs with metadata.
Returns: Array of {number, name_arabic, name_english, ayah_count, revelation_type}.
Planned: Recitation Analytics
POST /recitation_stats
Derive pace and timing analytics from an existing session's alignment results.
Inputs: audio_id
Returns:
{
  "audio_id": "...",
  "total_duration_sec": 312.5,
  "total_segments": 7,
  "total_words": 86,
  "words_per_minute": 16.5,
  "avg_segment_duration_sec": 8.2,
  "avg_pause_duration_sec": 1.4,
  "per_segment": [
    {
      "segment": 1,
      "ref_from": "112:1:1",
      "ref_to": "112:1:4",
      "duration_sec": 2.18,
      "word_count": 4,
      "words_per_minute": 110.1,
      "pause_after_sec": 1.82
    }
  ]
}
Useful for learning apps tracking student fluency, reciter comparisons, or detecting rushed/slow sections.
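The aggregates derive directly from the aligned segments. A sketch of the arithmetic, matching the example above (overall pace over total duration, per-segment pace over segment duration); the word_count field is an assumption about the stored session data:

```python
def recitation_stats(segments, total_duration_sec):
    """Compute pace/timing aggregates from aligned segments.

    segments: list of {"start", "end", "word_count"} dicts, ordered by start.
    """
    total_words = sum(s["word_count"] for s in segments)
    durations = [s["end"] - s["start"] for s in segments]
    # pauses are the gaps between consecutive segments
    pauses = [b["start"] - a["end"] for a, b in zip(segments, segments[1:])]
    return {
        "total_duration_sec": total_duration_sec,
        "total_segments": len(segments),
        "total_words": total_words,
        "words_per_minute": round(total_words / (total_duration_sec / 60), 1),
        "avg_segment_duration_sec": round(sum(durations) / len(durations), 2) if durations else 0.0,
        "avg_pause_duration_sec": round(sum(pauses) / len(pauses), 2) if pauses else 0.0,
    }
```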
Planned: Streaming
POST /process_chunk
Streaming-friendly endpoint for incremental audio processing. The client sends audio chunks as they become available, and the server returns partial alignment results progressively. Designed for live "now playing" displays (e.g. Quran radio showing the current ayah in real time).
Inputs: audio_id (optional β omit on first chunk to start a new session), audio_chunk (raw audio bytes), is_final (boolean)
Returns:
{
  "audio_id": "...",
  "status": "partial",
  "latest_segments": [
    {
      "segment": 5,
      "ref_from": "36:1:1",
      "ref_to": "36:1:2",
      "matched_text": "يسٓ",
      "time_from": 24.3,
      "time_to": 25.8,
      "confidence": 0.95
    }
  ]
}
When is_final=true, the server finalises the session and returns the complete aligned output (same structure as /process_audio_session).
Chunking notes: The server buffers audio internally and runs VAD + ASR when enough speech has accumulated to form a segment. Earlier segments are locked in and won't change; only the trailing edge is provisional.
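The locked-vs-provisional distinction can be sketched as a pure function: a segment is locked once it ends well before the edge of the audio buffered so far, since later chunks can no longer move it. The margin value is an assumption:

```python
def lock_segments(segments, buffered_sec, margin_sec=1.0):
    """Split detected segments into locked and provisional lists.

    A segment whose end is at least margin_sec before the end of the
    buffered audio is locked; anything nearer the trailing edge stays
    provisional and may change as more audio arrives.
    """
    locked, provisional = [], []
    for seg in segments:
        (locked if seg["end"] <= buffered_sec - margin_sec else provisional).append(seg)
    return locked, provisional
```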
Planned: Health / Status
GET /health
Server status for monitoring dashboards and client-side availability checks.
Returns:
{
  "status": "ok",
  "gpu_available": true,
  "gpu_quota_exhausted": false,
  "quota_reset_time": null,
  "active_sessions": 12,
  "models_loaded": ["Base", "Large"],
  "uptime_sec": 84200
}
Error Handling
If audio_id is missing, expired, or invalid, session endpoints return:
{"error": "Session not found or expired", "segments": []}
The client should call /process_audio_session again to get a fresh session.
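Client-side, the recovery path is mechanical: make the session call, and on the error above rebuild the session once and retry. A sketch with the transport abstracted away; call_endpoint and reprocess are hypothetical stand-ins for whatever HTTP or gradio_client wrapper the client uses:

```python
def with_session_retry(call_endpoint, endpoint, params, reprocess):
    """Run a session endpoint; on session expiry, rebuild the session once and retry.

    call_endpoint(endpoint, params) -> response dict from the server
    reprocess() -> fresh response from /process_audio_session (contains audio_id)
    """
    resp = call_endpoint(endpoint, params)
    if resp.get("error", "").startswith("Session not found"):
        fresh = reprocess()
        params = {**params, "audio_id": fresh["audio_id"]}
        resp = call_endpoint(endpoint, params)
    return resp
```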
Design Notes
- Thread safety: Gradio handles concurrent requests via threading. The session store uses a lock around its internal dict.
- Storage: Audio on disk (can be large), metadata in memory (always small). Audio loaded via memory-mapped reads on demand.
- No auth needed: Session IDs are random 128-bit UUIDs, effectively unguessable.
- HF Spaces compatibility: /tmp is ephemeral and cleared on restart, which is fine since sessions are transient. The existing allowed_paths=["/tmp"] covers the new directory.
- Backward compatible: /process_audio_json remains unchanged.