# API Documentation
## Current Endpoints
### `POST /process_audio_json`
Stateless endpoint. Accepts audio and segmentation parameters, returns aligned JSON output.
**Inputs:** audio file, min_silence_ms, min_speech_ms, pad_ms, model_name, device
**Returns:** JSON with `segments` array (segment index, timestamps, Quran references, matched text, confidence, errors).
**Limitation:** Every call requires re-uploading the audio. No way to resegment or retranscribe without re-sending the full file.
---
## Planned: Session-Based Endpoints
The Gradio UI already caches intermediate results (preprocessed audio, VAD output, segment boundaries, model name) in `gr.State` so that resegment/retranscribe operations skip expensive steps. But `gr.State` is WebSocket-only; API clients using `gradio_client` can't benefit from this.
### Approach: Server-Side Session Store
On the first request, the server stores all intermediate data keyed by a UUID (`audio_id`) and returns it in the response. Subsequent requests reference this `audio_id` instead of re-uploading audio.
**What gets stored per session:**
- Preprocessed audio (float32, 16 kHz mono): saved to disk as `.npy`
- Raw VAD speech intervals: in memory (small)
- VAD completeness flags: in memory
- Cleaned segment boundaries: in memory
- Model name used: in memory
**Lifecycle:** Sessions expire after the same TTL as the existing Gradio cache (5 hours). A background thread purges expired sessions periodically. Audio files live under `/tmp/sessions/{audio_id}/`.
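The store described above can be sketched as a small thread-safe class. This is a minimal illustration, not the actual implementation: the class name `SessionStore` and its internal layout are assumptions; only the UUID keying, the lock, the 5-hour TTL, and the background purge come from the description.

```python
import threading
import time
import uuid

SESSION_TTL_SEC = 5 * 60 * 60  # match the existing Gradio cache TTL (5 hours)

class SessionStore:
    """Thread-safe in-memory store for per-audio session metadata."""

    def __init__(self, ttl_sec: float = SESSION_TTL_SEC):
        self._ttl = ttl_sec
        self._lock = threading.Lock()
        self._sessions: dict = {}

    def create(self, data: dict) -> str:
        """Store session data under a fresh random audio_id."""
        audio_id = str(uuid.uuid4())
        with self._lock:
            self._sessions[audio_id] = {"data": data, "expires": time.time() + self._ttl}
        return audio_id

    def get(self, audio_id: str):
        """Return session data, or None if the id is missing or expired."""
        with self._lock:
            entry = self._sessions.get(audio_id)
            if entry is None or entry["expires"] < time.time():
                self._sessions.pop(audio_id, None)
                return None
            return entry["data"]

    def purge_expired(self) -> int:
        """Drop expired sessions; intended to run from a background thread."""
        now = time.time()
        with self._lock:
            dead = [k for k, v in self._sessions.items() if v["expires"] < now]
            for k in dead:
                del self._sessions[k]
        return len(dead)
```

On-disk `.npy` audio would be deleted in the same purge pass; that part is omitted here for brevity.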
### `POST /process_audio_session`
Full pipeline. Same as `/process_audio_json` but additionally creates a server-side session.
**Inputs:** audio file, min_silence_ms, min_speech_ms, pad_ms, model_name, device
**Returns:** Same JSON as `/process_audio_json` with an added `audio_id` field.
### `POST /resegment_session`
Re-cleans VAD boundaries with new segmentation parameters and re-runs ASR. Skips audio upload, preprocessing, and VAD inference.
**Inputs:** audio_id, min_silence_ms, min_speech_ms, pad_ms, model_name, device
**Returns:** JSON with `segments` array and the same `audio_id`.
### `POST /retranscribe_session`
Re-runs ASR with a different model on the existing segment boundaries. Skips audio upload, preprocessing, VAD, and resegmentation.
**Inputs:** audio_id, model_name, device
**Returns:** JSON with `segments` array and the same `audio_id`.
### `POST /realign_from_timestamps`
Accepts an arbitrary list of `(start, end)` timestamp pairs and runs ASR + phoneme alignment on each slice. Skips VAD entirely; the client defines the segment boundaries directly. This is the core endpoint for timeline-based editing where the user drags segment boundaries manually.
**Inputs:** audio_id, timestamps (list of `{start, end}` objects in seconds), model_name, device
**Returns:** JSON with `segments` array and the same `audio_id`. Session boundaries are updated to match the provided timestamps.
Subsumes `/resegment_session` for most client use cases: the client can split, merge, and drag boundaries however they want, then send the final timestamp list in one call.
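Client-supplied timestamps need sanitising before the server slices audio with them. A plausible sketch (the helper name `normalize_timestamps` is an assumption, not part of the API):

```python
def normalize_timestamps(timestamps, duration_sec):
    """Validate and normalize client-supplied {start, end} pairs.

    Clamps each pair to [0, duration_sec], drops empty or inverted
    spans, and returns the result sorted by start time.
    """
    cleaned = []
    for ts in timestamps:
        start = max(0.0, min(float(ts["start"]), duration_sec))
        end = max(0.0, min(float(ts["end"]), duration_sec))
        if end > start:
            cleaned.append({"start": start, "end": end})
    return sorted(cleaned, key=lambda t: t["start"])
```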
---
## Planned: Segment Editing Endpoints
Fine-grained operations for modifying individual segments without reprocessing the full recitation.
### `POST /split_segment`
Split one segment at a given timestamp into two. Re-runs alignment on each half independently.
**Inputs:** audio_id, segment_index, split_time (seconds)
**Returns:** Updated `segments` array with the split segment replaced by two new segments.
### `POST /merge_segments`
Merge two adjacent segments into one. Re-runs alignment on the combined audio slice.
**Inputs:** audio_id, segment_index_a, segment_index_b (must be adjacent)
**Returns:** Updated `segments` array with the two segments replaced by one.
### `POST /adjust_boundary`
Shift a segment's start or end time. Re-runs alignment on the affected segment and, if the new boundary overlaps a neighbouring segment, on that neighbour as well.
**Inputs:** audio_id, segment_index, new_start (seconds, optional), new_end (seconds, optional)
**Returns:** Updated `segments` array.
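The boundary arithmetic behind split and merge is straightforward and can be expressed as pure operations on a list of `(start, end)` pairs. A hedged sketch (function names are illustrative; the server would re-run alignment on the returned spans afterwards):

```python
def split_segment(boundaries, index, split_time):
    """Replace boundaries[index] with two halves split at split_time.

    boundaries is a list of (start, end) tuples in seconds.
    """
    start, end = boundaries[index]
    if not (start < split_time < end):
        raise ValueError("split_time must fall strictly inside the segment")
    return boundaries[:index] + [(start, split_time), (split_time, end)] + boundaries[index + 1:]

def merge_segments(boundaries, index_a, index_b):
    """Replace two adjacent segments with one spanning both."""
    if abs(index_a - index_b) != 1:
        raise ValueError("segments must be adjacent")
    lo, hi = sorted((index_a, index_b))
    merged = (boundaries[lo][0], boundaries[hi][1])
    return boundaries[:lo] + [merged] + boundaries[hi + 1:]
```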
### `POST /override_segment_text`
Manually assign a Quran reference range to a segment, skipping alignment entirely. For when the aligner gets it wrong and the user knows the correct ayah.
**Inputs:** audio_id, segment_index, ref_from (e.g. `"2:255:1"`), ref_to (e.g. `"2:255:7"`)
**Returns:** Updated segment with the overridden reference and corresponding Quran text.
### `POST /bulk_update_segments`
Batch update: client sends a full modified segment list (adjusted times, overridden labels). Server validates, persists to session, and optionally re-aligns changed segments.
**Inputs:** audio_id, segments (list of `{start, end, ref_from?, ref_to?}`), realign (boolean, default true; re-run ASR on segments whose boundaries changed)
**Returns:** Full updated `segments` array.
---
## Planned: Word-Level Timing
### `POST /compute_word_timestamps`
Compute word-level start/end times for every word in every segment. This is the backbone of karaoke-style highlighting and word-by-word caption animation.
**Inputs:** audio_id, model_name, device
**Returns:** JSON with per-segment word timestamps:
```json
{
  "audio_id": "...",
  "segments": [
    {
      "segment": 1,
      "words": [
        {"word": "بِسْمِ", "start": 0.81, "end": 1.12},
        {"word": "اللَّهِ", "start": 1.12, "end": 1.45}
      ]
    }
  ]
}
```
---
## Planned: Export Endpoints
Generate subtitle files from session data. All accept `audio_id` and optionally use word-level timestamps if previously computed.
### `POST /export_srt`
Standard SRT subtitle format. One entry per segment (or per word if `word_level=true`).
**Inputs:** audio_id, word_level (boolean, default false)
**Returns:** SRT file content.
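SRT rendering from session segments is mechanical: a timestamp formatter plus one numbered entry per segment. A sketch, assuming segments carry the `time_from`/`time_to`/`matched_text` fields described for the JSON endpoints (function names are illustrative):

```python
def format_srt_time(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Render aligned segments as SRT text, one entry per segment."""
    entries = []
    for i, seg in enumerate(segments, start=1):
        entries.append(
            f"{i}\n"
            f"{format_srt_time(seg['time_from'])} --> {format_srt_time(seg['time_to'])}\n"
            f"{seg['matched_text']}\n"
        )
    return "\n".join(entries)
```

With `word_level=true`, the same loop would iterate over the word timestamps from `/compute_word_timestamps` instead of whole segments.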
### `POST /export_vtt`
WebVTT format. Supports styling cues and is the standard for web video players.
**Inputs:** audio_id, word_level (boolean, default false)
**Returns:** VTT file content.
### `POST /export_ass`
ASS/SSA format with Arabic font and styling presets. Most useful for video editors producing styled Quran captions.
**Inputs:** audio_id, word_level (boolean, default false), font_name (optional), font_size (optional)
**Returns:** ASS file content.
---
## Planned: Quran Lookup Endpoints
Utility endpoints for client-side UI (dropdowns, search, manual labelling).
### `GET /quran_text`
Return Quran text with diacritics for a given reference range.
**Inputs:** ref_from (e.g. `"2:255:1"`), ref_to (e.g. `"2:255:7"`)
**Returns:** `{"text": "...", "ref_from": "...", "ref_to": "..."}`. All 114 chapters are pre-cached in memory.
### `GET /surah_info`
List of all surahs with metadata.
**Returns:** Array of `{number, name_arabic, name_english, ayah_count, revelation_type}`.
---
## Planned: Recitation Analytics
### `POST /recitation_stats`
Derive pace and timing analytics from an existing session's alignment results.
**Inputs:** audio_id
**Returns:**
```json
{
  "audio_id": "...",
  "total_duration_sec": 312.5,
  "total_segments": 7,
  "total_words": 86,
  "words_per_minute": 16.5,
  "avg_segment_duration_sec": 8.2,
  "avg_pause_duration_sec": 1.4,
  "per_segment": [
    {
      "segment": 1,
      "ref_from": "112:1:1",
      "ref_to": "112:1:4",
      "duration_sec": 2.18,
      "word_count": 4,
      "words_per_minute": 110.1,
      "pause_after_sec": 1.82
    }
  ]
}
```
Useful for learning apps tracking student fluency, reciter comparisons, or detecting rushed/slow sections.
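The aggregate fields above derive directly from the aligned segments. A sketch of the arithmetic, consistent with the example response (global WPM over total duration, pauses as gaps between consecutive segments; the function name is illustrative):

```python
def recitation_stats(segments, total_duration_sec):
    """Derive pace analytics from aligned segments.

    Each segment needs time_from, time_to, and word_count fields.
    """
    durations = [s["time_to"] - s["time_from"] for s in segments]
    pauses = [b["time_from"] - a["time_to"] for a, b in zip(segments, segments[1:])]
    total_words = sum(s["word_count"] for s in segments)
    return {
        "total_duration_sec": total_duration_sec,
        "total_segments": len(segments),
        "total_words": total_words,
        "words_per_minute": round(total_words / (total_duration_sec / 60), 1),
        "avg_segment_duration_sec": round(sum(durations) / len(segments), 1),
        "avg_pause_duration_sec": round(sum(pauses) / len(pauses), 1) if pauses else 0.0,
    }
```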
---
## Planned: Streaming
### `POST /process_chunk`
Streaming-friendly endpoint for incremental audio processing. The client sends audio chunks as they become available, and the server returns partial alignment results progressively. Designed for live "now playing" displays (e.g. Quran radio showing the current ayah in real time).
**Inputs:** audio_id (optional; omit on first chunk to start a new session), audio_chunk (raw audio bytes), is_final (boolean)
**Returns:**
```json
{
  "audio_id": "...",
  "status": "partial",
  "latest_segments": [
    {
      "segment": 5,
      "ref_from": "36:1:1",
      "ref_to": "36:1:2",
      "matched_text": "يسٓ",
      "time_from": 24.3,
      "time_to": 25.8,
      "confidence": 0.95
    }
  ]
}
```
When `is_final=true`, the server finalises the session and returns the complete aligned output (same structure as `/process_audio_session`).
**Chunking notes:** The server buffers audio internally and runs VAD + ASR when enough speech has accumulated to form a segment. Earlier segments are locked in and won't change; only the trailing edge is provisional.
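The "locked in vs. provisional" split described above amounts to tracking a committed sample offset into a growing buffer. A minimal sketch (class and method names are assumptions, not the real implementation):

```python
class ChunkBuffer:
    """Accumulates streamed audio and tracks which part is committed.

    Samples before `committed` belong to finalized segments and will
    not change; the trailing remainder stays provisional until enough
    speech accumulates (or is_final arrives).
    """

    def __init__(self, sample_rate=16000):
        self.sample_rate = sample_rate
        self.samples = []
        self.committed = 0  # sample index; everything before it is locked in

    def append(self, chunk):
        self.samples.extend(chunk)

    def provisional(self):
        """The trailing audio not yet assigned to a finalized segment."""
        return self.samples[self.committed:]

    def commit(self, up_to_sec):
        """Lock in audio up to a segment boundary (in seconds)."""
        idx = min(int(up_to_sec * self.sample_rate), len(self.samples))
        self.committed = max(self.committed, idx)
```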
---
## Planned: Health / Status
### `GET /health`
Server status for monitoring dashboards and client-side availability checks.
**Returns:**
```json
{
  "status": "ok",
  "gpu_available": true,
  "gpu_quota_exhausted": false,
  "quota_reset_time": null,
  "active_sessions": 12,
  "models_loaded": ["Base", "Large"],
  "uptime_sec": 84200
}
```
---
## Error Handling
If `audio_id` is missing, expired, or invalid, session endpoints return:
```json
{"error": "Session not found or expired", "segments": []}
```
The client should call `/process_audio_session` again to get a fresh session.
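On the client side, that recovery can be wrapped in a small retry helper. A sketch, with the endpoint calls injected as callables so it stays transport-agnostic (the function names are hypothetical):

```python
def call_with_session_retry(call_endpoint, create_session, audio_id):
    """Call a session endpoint; on an expired session, create a fresh
    one and retry once with the new audio_id.

    call_endpoint(audio_id) -> response dict from a session endpoint.
    create_session() -> response dict with a fresh "audio_id"
    (i.e. a /process_audio_session call).
    """
    response = call_endpoint(audio_id)
    if response.get("error", "").startswith("Session not found"):
        fresh = create_session()
        response = call_endpoint(fresh["audio_id"])
    return response
```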
---
## Design Notes
- **Thread safety:** Gradio handles concurrent requests via threading. The session store uses a lock around its internal dict.
- **Storage:** Audio on disk (can be large), metadata in memory (always small). Audio loaded via memory-mapped reads on demand.
- **No auth needed:** Session IDs are random version-4 UUIDs (122 bits of entropy); effectively unguessable.
- **HF Spaces compatibility:** `/tmp` is ephemeral and cleared on restart, which is fine since sessions are transient. The existing `allowed_paths=["/tmp"]` covers the new directory.
- **Backward compatible:** `/process_audio_json` remains unchanged.