# API Documentation
## Current Endpoints
### `POST /process_audio_json`
Stateless endpoint. Accepts audio and segmentation parameters, returns aligned JSON output.
**Inputs:** audio file, min_silence_ms, min_speech_ms, pad_ms, model_name, device
**Returns:** JSON with `segments` array (segment index, timestamps, Quran references, matched text, confidence, errors).
**Limitation:** Every call requires re-uploading the audio. No way to resegment or retranscribe without re-sending the full file.
---
## Planned: Session-Based Endpoints
The Gradio UI already caches intermediate results (preprocessed audio, VAD output, segment boundaries, model name) in `gr.State` so that resegment/retranscribe operations skip expensive steps. But `gr.State` is WebSocket-only; API clients using `gradio_client` can't benefit from this.
### Approach: Server-Side Session Store
On the first request, the server stores all intermediate data keyed by a UUID (`audio_id`) and returns it in the response. Subsequent requests reference this `audio_id` instead of re-uploading audio.
**What gets stored per session:**
- Preprocessed audio (float32, 16 kHz mono), saved to disk as `.npy`
- Raw VAD speech intervals β€” in memory (small)
- VAD completeness flags β€” in memory
- Cleaned segment boundaries β€” in memory
- Model name used β€” in memory
**Lifecycle:** Sessions expire after the same TTL as the existing Gradio cache (5 hours). A background thread purges expired sessions periodically. Audio files live under `/tmp/sessions/{audio_id}/`.
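A minimal sketch of the session store described above, assuming hypothetical names (`SessionStore`, `create`, `get`, `purge_expired`); the real server would additionally write the preprocessed audio to `/tmp/sessions/{audio_id}/` as `.npy`:

```python
import threading
import time
import uuid

SESSION_TTL_SEC = 5 * 3600  # same TTL as the existing Gradio cache

class SessionStore:
    """In-memory session metadata keyed by audio_id, purged after a TTL."""

    def __init__(self, ttl_sec=SESSION_TTL_SEC):
        self._ttl = ttl_sec
        self._lock = threading.Lock()  # Gradio serves requests from threads
        self._sessions = {}  # audio_id -> {"created": float, "data": dict}

    def create(self, data):
        """Store intermediate results and return a fresh audio_id."""
        audio_id = str(uuid.uuid4())
        with self._lock:
            self._sessions[audio_id] = {"created": time.time(), "data": data}
        return audio_id

    def get(self, audio_id):
        """Return session data, or None if missing or expired."""
        with self._lock:
            entry = self._sessions.get(audio_id)
            if entry is None or time.time() - entry["created"] > self._ttl:
                self._sessions.pop(audio_id, None)
                return None
            return entry["data"]

    def purge_expired(self):
        """Drop expired sessions; run periodically from a background thread."""
        now = time.time()
        with self._lock:
            expired = [k for k, v in self._sessions.items()
                       if now - v["created"] > self._ttl]
            for k in expired:
                del self._sessions[k]
        return len(expired)
```

A background `threading.Timer` or daemon thread would call `purge_expired()` on an interval and also delete the matching `/tmp/sessions/{audio_id}/` directories.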
### `POST /process_audio_session`
Full pipeline. Same as `/process_audio_json` but additionally creates a server-side session.
**Inputs:** audio file, min_silence_ms, min_speech_ms, pad_ms, model_name, device
**Returns:** Same JSON as `/process_audio_json` with an added `audio_id` field.
### `POST /resegment_session`
Re-cleans VAD boundaries with new segmentation parameters and re-runs ASR. Skips audio upload, preprocessing, and VAD inference.
**Inputs:** audio_id, min_silence_ms, min_speech_ms, pad_ms, model_name, device
**Returns:** JSON with `segments` array and the same `audio_id`.
### `POST /retranscribe_session`
Re-runs ASR with a different model on the existing segment boundaries. Skips audio upload, preprocessing, VAD, and resegmentation.
**Inputs:** audio_id, model_name, device
**Returns:** JSON with `segments` array and the same `audio_id`.
### `POST /realign_from_timestamps`
Accepts an arbitrary list of `(start, end)` timestamp pairs and runs ASR + phoneme alignment on each slice. Skips VAD entirely: the client defines the segment boundaries directly. This is the core endpoint for timeline-based editing where the user drags segment boundaries manually.
**Inputs:** audio_id, timestamps (list of `{start, end}` objects in seconds), model_name, device
**Returns:** JSON with `segments` array and the same `audio_id`. Session boundaries are updated to match the provided timestamps.
This endpoint subsumes `/resegment_session` for most client use cases: the client can split, merge, and drag boundaries freely, then send the final timestamp list in one call.
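Because the client supplies boundaries directly, the server would need to sanity-check them before slicing audio. A sketch of one plausible validation pass (the helper name and error messages are illustrative, not part of the API):

```python
def validate_timestamps(timestamps, audio_duration):
    """Check that client-supplied {start, end} pairs are within the audio,
    positive-length, sorted, and non-overlapping. Returns (start, end) tuples."""
    cleaned = []
    prev_end = 0.0
    for i, ts in enumerate(timestamps):
        start, end = float(ts["start"]), float(ts["end"])
        if start < 0 or end > audio_duration:
            raise ValueError(f"segment {i} outside audio: ({start}, {end})")
        if end <= start:
            raise ValueError(f"segment {i} has non-positive length")
        if start < prev_end:
            raise ValueError(f"segment {i} overlaps the previous segment")
        cleaned.append((start, end))
        prev_end = end
    return cleaned
```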
---
## Planned: Segment Editing Endpoints
Fine-grained operations for modifying individual segments without reprocessing the full recitation.
### `POST /split_segment`
Split one segment at a given timestamp into two. Re-runs alignment on each half independently.
**Inputs:** audio_id, segment_index, split_time (seconds)
**Returns:** Updated `segments` array with the split segment replaced by two new segments.
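The boundary arithmetic for a split is simple; a sketch of the list manipulation the server might do before re-running alignment on each half (helper name is hypothetical):

```python
def split_segment(boundaries, index, split_time):
    """Replace boundaries[index] with two segments cut at split_time.
    boundaries is a list of (start, end) pairs in seconds."""
    start, end = boundaries[index]
    if not (start < split_time < end):
        raise ValueError(f"split_time {split_time} not inside ({start}, {end})")
    return (boundaries[:index]
            + [(start, split_time), (split_time, end)]
            + boundaries[index + 1:])
```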
### `POST /merge_segments`
Merge two adjacent segments into one. Re-runs alignment on the combined audio slice.
**Inputs:** audio_id, segment_index_a, segment_index_b (must be adjacent)
**Returns:** Updated `segments` array with the two segments replaced by one.
### `POST /adjust_boundary`
Shift a segment's start or end time. Re-runs alignment on the affected segment and, if the new boundary touches a neighbouring segment, on that neighbour as well.
**Inputs:** audio_id, segment_index, new_start (seconds, optional), new_end (seconds, optional)
**Returns:** Updated `segments` array.
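One possible policy for the boundary math, sketched here with a hypothetical helper: clamp the adjusted edge so it cannot cross a neighbour or invert the segment, and report which segments need re-alignment. (An alternative policy would shrink the neighbour instead of clamping.)

```python
def adjust_boundary(boundaries, index, new_start=None, new_end=None):
    """Shift one segment's edges, clamped to its neighbours.
    Returns (updated_boundaries, indices_to_realign)."""
    start, end = boundaries[index]
    if new_start is not None:
        lower = boundaries[index - 1][1] if index > 0 else 0.0
        start = max(lower, new_start)
    if new_end is not None:
        upper = (boundaries[index + 1][0]
                 if index + 1 < len(boundaries) else float("inf"))
        end = min(upper, new_end)
    if end <= start:
        raise ValueError("adjustment would invert the segment")
    updated = list(boundaries)
    updated[index] = (start, end)
    return updated, [index]
```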
### `POST /override_segment_text`
Manually assign a Quran reference range to a segment, skipping alignment entirely. For when the aligner gets it wrong and the user knows the correct ayah.
**Inputs:** audio_id, segment_index, ref_from (e.g. `"2:255:1"`), ref_to (e.g. `"2:255:7"`)
**Returns:** Updated segment with the overridden reference and corresponding Quran text.
### `POST /bulk_update_segments`
Batch update: client sends a full modified segment list (adjusted times, overridden labels). Server validates, persists to session, and optionally re-aligns changed segments.
**Inputs:** audio_id, segments (list of `{start, end, ref_from?, ref_to?}`), realign (boolean, default true: re-run ASR on segments whose boundaries changed)
**Returns:** Full updated `segments` array.
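With `realign=true`, the server only needs to re-run ASR on segments whose boundaries actually moved. A sketch of that diff, assuming the session stores boundaries as `(start, end)` tuples (helper name and tolerance are illustrative):

```python
def segments_to_realign(old_boundaries, new_segments, tolerance=1e-3):
    """Return indices of new_segments whose (start, end) differ from the
    stored session boundaries. Segments beyond the old list count as changed."""
    changed = []
    for i, seg in enumerate(new_segments):
        if i >= len(old_boundaries):
            changed.append(i)
            continue
        old_start, old_end = old_boundaries[i]
        if (abs(seg["start"] - old_start) > tolerance
                or abs(seg["end"] - old_end) > tolerance):
            changed.append(i)
    return changed
```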
---
## Planned: Word-Level Timing
### `POST /compute_word_timestamps`
Compute word-level start/end times for every word in every segment. This is the backbone of karaoke-style highlighting and word-by-word caption animation.
**Inputs:** audio_id, model_name, device
**Returns:** JSON with per-segment word timestamps:
```json
{
  "audio_id": "...",
  "segments": [
    {
      "segment": 1,
      "words": [
        {"word": "بِسْمِ", "start": 0.81, "end": 1.12},
        {"word": "اللَّهِ", "start": 1.12, "end": 1.45}
      ]
    }
  ]
}
```
---
## Planned: Export Endpoints
Generate subtitle files from session data. All accept `audio_id` and optionally use word-level timestamps if previously computed.
### `POST /export_srt`
Standard SRT subtitle format. One entry per segment (or per word if `word_level=true`).
**Inputs:** audio_id, word_level (boolean, default false)
**Returns:** SRT file content.
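The core of SRT generation is the timestamp format (`HH:MM:SS,mmm`, comma as the millisecond separator) and the numbered-entry layout. A sketch, assuming each aligned segment carries `start`, `end`, and `text` fields (function names are illustrative):

```python
def srt_timestamp(seconds):
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Render aligned segments as SRT text, one entry per segment."""
    entries = []
    for i, seg in enumerate(segments, start=1):
        entries.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text']}\n"
        )
    return "\n".join(entries)
```

With `word_level=true` the same renderer would simply be fed the word list from `/compute_word_timestamps` instead of the segment list.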
### `POST /export_vtt`
WebVTT format. Supports styling cues and is the standard for web video players.
**Inputs:** audio_id, word_level (boolean, default false)
**Returns:** VTT file content.
### `POST /export_ass`
ASS/SSA format with Arabic font and styling presets. Most useful for video editors producing styled Quran captions.
**Inputs:** audio_id, word_level (boolean, default false), font_name (optional), font_size (optional)
**Returns:** ASS file content.
---
## Planned: Quran Lookup Endpoints
Utility endpoints for client-side UI (dropdowns, search, manual labelling).
### `GET /quran_text`
Return Quran text with diacritics for a given reference range.
**Inputs:** ref_from (e.g. `"2:255:1"`), ref_to (e.g. `"2:255:7"`)
**Returns:** `{"text": "...", "ref_from": "...", "ref_to": "..."}`. All 114 chapters are pre-cached in memory.
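The `surah:ayah:word` reference strings used throughout the API are easy to parse and bounds-check on the server side; a sketch (helper name is hypothetical):

```python
def parse_ref(ref):
    """Parse a 'surah:ayah:word' reference like '2:255:1' into three ints."""
    parts = ref.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected surah:ayah:word, got {ref!r}")
    surah, ayah, word = (int(p) for p in parts)
    if not 1 <= surah <= 114:
        raise ValueError(f"surah out of range: {surah}")
    return surah, ayah, word
```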
### `GET /surah_info`
List of all surahs with metadata.
**Returns:** Array of `{number, name_arabic, name_english, ayah_count, revelation_type}`.
---
## Planned: Recitation Analytics
### `POST /recitation_stats`
Derive pace and timing analytics from an existing session's alignment results.
**Inputs:** audio_id
**Returns:**
```json
{
  "audio_id": "...",
  "total_duration_sec": 312.5,
  "total_segments": 7,
  "total_words": 86,
  "words_per_minute": 16.5,
  "avg_segment_duration_sec": 8.2,
  "avg_pause_duration_sec": 1.4,
  "per_segment": [
    {
      "segment": 1,
      "ref_from": "112:1:1",
      "ref_to": "112:1:4",
      "duration_sec": 2.18,
      "word_count": 4,
      "words_per_minute": 110.1,
      "pause_after_sec": 1.82
    }
  ]
}
```
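All of these metrics derive directly from the session's aligned segments. A sketch of the aggregate computation, assuming each segment has `start`, `end`, and `word_count`; note the top-level `words_per_minute` here is computed over the total recording duration, consistent with the sample numbers above (86 words over 312.5 s is about 16.5 wpm), while per-segment pace would use each segment's own duration:

```python
def recitation_stats(segments, total_duration_sec):
    """Aggregate pace/timing metrics from aligned segments."""
    total_words = sum(s["word_count"] for s in segments)
    durations = [s["end"] - s["start"] for s in segments]
    # Pauses are the gaps between consecutive segments.
    pauses = [b["start"] - a["end"] for a, b in zip(segments, segments[1:])]
    speech_sec = sum(durations)
    return {
        "total_duration_sec": total_duration_sec,
        "total_segments": len(segments),
        "total_words": total_words,
        "words_per_minute": (round(total_words / (total_duration_sec / 60), 1)
                             if total_duration_sec else 0.0),
        "avg_segment_duration_sec": (round(speech_sec / len(segments), 2)
                                     if segments else 0.0),
        "avg_pause_duration_sec": (round(sum(pauses) / len(pauses), 2)
                                   if pauses else 0.0),
    }
```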
Useful for learning apps tracking student fluency, reciter comparisons, or detecting rushed/slow sections.
---
## Planned: Streaming
### `POST /process_chunk`
Streaming-friendly endpoint for incremental audio processing. The client sends audio chunks as they become available, and the server returns partial alignment results progressively. Designed for live "now playing" displays (e.g. Quran radio showing the current ayah in real time).
**Inputs:** audio_id (optional; omit on the first chunk to start a new session), audio_chunk (raw audio bytes), is_final (boolean)
**Returns:**
```json
{
  "audio_id": "...",
  "status": "partial",
  "latest_segments": [
    {
      "segment": 5,
      "ref_from": "36:1:1",
      "ref_to": "36:1:2",
      "matched_text": "يسٓ",
      "time_from": 24.3,
      "time_to": 25.8,
      "confidence": 0.95
    }
  ]
}
```
When `is_final=true`, the server finalises the session and returns the complete aligned output (same structure as `/process_audio_session`).
**Chunking notes:** The server buffers audio internally and runs VAD + ASR when enough speech has accumulated to form a segment. Earlier segments are locked in and won't change; only the trailing edge is provisional.
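The locked-vs-provisional split described above could be decided from the VAD intervals alone. One possible rule, sketched with a hypothetical helper: a speech interval is locked once it is followed by at least `min_silence_ms` of buffered silence, since no later chunk can extend it; everything else stays provisional.

```python
def lock_segments(speech_intervals, buffered_sec, min_silence_ms=500):
    """Split VAD speech intervals into locked segments (followed by enough
    silence that they cannot grow) and a provisional trailing edge.
    speech_intervals: (start, end) pairs; buffered_sec: audio received so far."""
    min_silence = min_silence_ms / 1000
    locked, provisional = [], []
    for i, (start, end) in enumerate(speech_intervals):
        next_start = (speech_intervals[i + 1][0]
                      if i + 1 < len(speech_intervals) else buffered_sec)
        if next_start - end >= min_silence:
            locked.append((start, end))
        else:
            provisional.append((start, end))
    return locked, provisional
```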
---
## Planned: Health / Status
### `GET /health`
Server status for monitoring dashboards and client-side availability checks.
**Returns:**
```json
{
  "status": "ok",
  "gpu_available": true,
  "gpu_quota_exhausted": false,
  "quota_reset_time": null,
  "active_sessions": 12,
  "models_loaded": ["Base", "Large"],
  "uptime_sec": 84200
}
```
---
## Error Handling
If `audio_id` is missing, expired, or invalid, session endpoints return:
```json
{"error": "Session not found or expired", "segments": []}
```
The client should call `/process_audio_session` again to get a fresh session.
---
## Design Notes
- **Thread safety:** Gradio handles concurrent requests via threading. The session store uses a lock around its internal dict.
- **Storage:** Audio on disk (can be large), metadata in memory (always small). Audio loaded via memory-mapped reads on demand.
- **No auth needed:** Session IDs are random version-4 UUIDs (122 bits of entropy), effectively unguessable.
- **HF Spaces compatibility:** `/tmp` is ephemeral and cleared on restart, which is fine since sessions are transient. The existing `allowed_paths=["/tmp"]` covers the new directory.
- **Backward compatible:** `/process_audio_json` remains unchanged.