# API Documentation

## Current Endpoints

### `POST /process_audio_json`

Stateless endpoint. Accepts audio and segmentation parameters, returns aligned JSON output.

**Inputs:** audio file, min_silence_ms, min_speech_ms, pad_ms, model_name, device

**Returns:** JSON with `segments` array (segment index, timestamps, Quran references, matched text, confidence, errors).

**Limitation:** Every call requires re-uploading the audio. No way to resegment or retranscribe without re-sending the full file.
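A minimal client sketch for this endpoint, using the `requests` library. The multipart field names and the default parameter values are assumptions, not confirmed by the server implementation; the response fields follow the JSON shapes shown later in this document.

```python
import requests

def build_params(min_silence_ms=300, min_speech_ms=250, pad_ms=100,
                 model_name="Base", device="cuda"):
    """Bundle the segmentation parameters into one form payload.
    Default values here are illustrative assumptions."""
    return {"min_silence_ms": min_silence_ms, "min_speech_ms": min_speech_ms,
            "pad_ms": pad_ms, "model_name": model_name, "device": device}

if __name__ == "__main__":
    # Every call re-uploads the full audio file -- the limitation noted above.
    with open("recitation.mp3", "rb") as f:
        resp = requests.post("http://localhost:7860/process_audio_json",
                             files={"audio": f}, data=build_params())
    for seg in resp.json()["segments"]:
        print(seg["segment"], seg["time_from"], seg["time_to"], seg["ref_from"])
```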

---

## Planned: Session-Based Endpoints

The Gradio UI already caches intermediate results (preprocessed audio, VAD output, segment boundaries, model name) in `gr.State` so that resegment/retranscribe operations skip expensive steps. But `gr.State` is WebSocket-only, so API clients using `gradio_client` can't benefit from this.

### Approach: Server-Side Session Store

On the first request, the server stores all intermediate data keyed by a UUID (`audio_id`) and returns it in the response. Subsequent requests reference this `audio_id` instead of re-uploading audio.

**What gets stored per session:**
- Preprocessed audio (float32, 16 kHz mono), saved to disk as `.npy`
- Raw VAD speech intervals, kept in memory (small)
- VAD completeness flags, kept in memory
- Cleaned segment boundaries, kept in memory
- Model name used, kept in memory

**Lifecycle:** Sessions expire after the same TTL as the existing Gradio cache (5 hours). A background thread purges expired sessions periodically. Audio files live under `/tmp/sessions/{audio_id}/`.
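A minimal sketch of the store described above. The class and attribute names are illustrative, not the actual implementation; only the lock-protected dict, UUID keys, the 5-hour TTL, and the `/tmp/sessions/{audio_id}/` layout come from this document.

```python
import os
import threading
import time
import uuid

TTL_SECONDS = 5 * 3600  # matches the 5-hour Gradio cache TTL

class SessionStore:
    """Server-side session store: small metadata in memory, audio on disk."""

    def __init__(self, root="/tmp/sessions"):
        self.root = root
        self._lock = threading.Lock()  # guards the dict (see Design Notes)
        self._sessions = {}            # audio_id -> metadata dict

    def create(self, vad_intervals, boundaries, model_name):
        audio_id = str(uuid.uuid4())   # 128-bit random, effectively unguessable
        with self._lock:
            self._sessions[audio_id] = {
                "created": time.time(),
                "vad_intervals": vad_intervals,
                "boundaries": boundaries,
                "model_name": model_name,
                "audio_path": os.path.join(self.root, audio_id, "audio.npy"),
            }
        return audio_id

    def get(self, audio_id):
        with self._lock:
            sess = self._sessions.get(audio_id)
        if sess is None or time.time() - sess["created"] > TTL_SECONDS:
            return None                # treated as "not found or expired"
        return sess

    def purge_expired(self):
        """Called periodically by the background purge thread."""
        now = time.time()
        with self._lock:
            expired = [k for k, v in self._sessions.items()
                       if now - v["created"] > TTL_SECONDS]
            for k in expired:
                del self._sessions[k]
        return len(expired)
```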

### `POST /process_audio_session`

Full pipeline. Same as `/process_audio_json` but additionally creates a server-side session.

**Inputs:** audio file, min_silence_ms, min_speech_ms, pad_ms, model_name, device

**Returns:** Same JSON as `/process_audio_json` with an added `audio_id` field.

### `POST /resegment_session`

Re-cleans VAD boundaries with new segmentation parameters and re-runs ASR. Skips audio upload, preprocessing, and VAD inference.

**Inputs:** audio_id, min_silence_ms, min_speech_ms, pad_ms, model_name, device

**Returns:** JSON with `segments` array and the same `audio_id`.

### `POST /retranscribe_session`

Re-runs ASR with a different model on the existing segment boundaries. Skips audio upload, preprocessing, VAD, and resegmentation.

**Inputs:** audio_id, model_name, device

**Returns:** JSON with `segments` array and the same `audio_id`.
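The three session endpoints above compose into a simple flow: one full pipeline call, then cheap follow-ups against the returned `audio_id`. A hedged sketch, assuming JSON request bodies and the field names listed above:

```python
import requests

BASE = "http://localhost:7860"  # assumed host/port

def session_payload(audio_id, **params):
    """Merge the session id with whichever parameters change on this call."""
    return {"audio_id": audio_id, **params}

if __name__ == "__main__":
    # 1. Full pipeline once; the response carries the session id.
    with open("recitation.mp3", "rb") as f:
        first = requests.post(f"{BASE}/process_audio_session",
                              files={"audio": f},
                              data={"min_silence_ms": 300, "min_speech_ms": 250,
                                    "pad_ms": 100, "model_name": "Base",
                                    "device": "cuda"}).json()
    audio_id = first["audio_id"]

    # 2. Tighter silence threshold: skips upload, preprocessing, VAD inference.
    second = requests.post(f"{BASE}/resegment_session",
                           json=session_payload(audio_id, min_silence_ms=150,
                                                min_speech_ms=250, pad_ms=100,
                                                model_name="Base",
                                                device="cuda")).json()

    # 3. Swap models on the same boundaries: skips everything but ASR.
    third = requests.post(f"{BASE}/retranscribe_session",
                          json=session_payload(audio_id, model_name="Large",
                                               device="cuda")).json()
```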

### `POST /realign_from_timestamps`

Accepts an arbitrary list of `(start, end)` timestamp pairs and runs ASR + phoneme alignment on each slice. Skips VAD entirely: the client defines the segment boundaries directly. This is the core endpoint for timeline-based editing, where the user drags segment boundaries manually.

**Inputs:** audio_id, timestamps (list of `{start, end}` objects in seconds), model_name, device

**Returns:** JSON with `segments` array and the same `audio_id`. Session boundaries are updated to match the provided timestamps.

Subsumes `/resegment_session` for most client use cases: the client can split, merge, and drag boundaries however they want, then send the final timestamp list in one call.
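A sketch of building this request, with basic client-side validation of the dragged boundaries. The helper name is hypothetical; the payload shape follows the input list above.

```python
import requests

def realign_payload(audio_id, pairs, model_name="Base", device="cuda"):
    """Convert (start, end) tuples into the {start, end} objects the
    endpoint expects, rejecting empty or inverted slices."""
    timestamps = []
    for start, end in pairs:
        if end <= start:
            raise ValueError(f"empty or inverted slice: {start}..{end}")
        timestamps.append({"start": start, "end": end})
    return {"audio_id": audio_id, "timestamps": timestamps,
            "model_name": model_name, "device": device}

if __name__ == "__main__":
    # Final boundary list after the user split, merged, and dragged segments.
    body = realign_payload("session-id-here", [(0.0, 2.2), (2.4, 6.1)])
    resp = requests.post("http://localhost:7860/realign_from_timestamps",
                         json=body).json()
```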

---

## Planned: Segment Editing Endpoints

Fine-grained operations for modifying individual segments without reprocessing the full recitation.

### `POST /split_segment`

Split one segment at a given timestamp into two. Re-runs alignment on each half independently.

**Inputs:** audio_id, segment_index, split_time (seconds)

**Returns:** Updated `segments` array with the split segment replaced by two new segments.

### `POST /merge_segments`

Merge two adjacent segments into one. Re-runs alignment on the combined audio slice.

**Inputs:** audio_id, segment_index_a, segment_index_b (must be adjacent)

**Returns:** Updated `segments` array with the two segments replaced by one.
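The boundary arithmetic behind split and merge can be sketched as pure functions over a `(start, end)` list; re-alignment of the affected slices (not shown) would follow. This is an illustration of the semantics, not the server's implementation.

```python
def split_segment(boundaries, index, split_time):
    """Replace one segment with two halves split at split_time."""
    start, end = boundaries[index]
    if not (start < split_time < end):
        raise ValueError("split_time must fall strictly inside the segment")
    return (boundaries[:index]
            + [(start, split_time), (split_time, end)]
            + boundaries[index + 1:])

def merge_segments(boundaries, a, b):
    """Replace two adjacent segments with one spanning both."""
    if abs(a - b) != 1:
        raise ValueError("segments must be adjacent")
    lo, hi = sorted((a, b))
    merged = (boundaries[lo][0], boundaries[hi][1])
    return boundaries[:lo] + [merged] + boundaries[hi + 1:]
```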

### `POST /adjust_boundary`

Shift a segment's start or end time. Re-runs alignment on the affected segment(s) and its neighbour if boundaries overlap.

**Inputs:** audio_id, segment_index, new_start (seconds, optional), new_end (seconds, optional)

**Returns:** Updated `segments` array.

### `POST /override_segment_text`

Manually assign a Quran reference range to a segment, skipping alignment entirely. For when the aligner gets it wrong and the user knows the correct ayah.

**Inputs:** audio_id, segment_index, ref_from (e.g. `"2:255:1"`), ref_to (e.g. `"2:255:7"`)

**Returns:** Updated segment with the overridden reference and corresponding Quran text.

### `POST /bulk_update_segments`

Batch update: client sends a full modified segment list (adjusted times, overridden labels). Server validates, persists to session, and optionally re-aligns changed segments.

**Inputs:** audio_id, segments (list of `{start, end, ref_from?, ref_to?}`), realign (boolean, default true: re-run ASR on segments whose boundaries changed)

**Returns:** Full updated `segments` array.
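One way the server might decide which segments need re-alignment under `realign=true`: compare the submitted list against the stored boundaries and re-run ASR only where something moved. The tolerance value is an assumption.

```python
def changed_indices(old, new, tol=1e-3):
    """Indices of segments whose boundaries moved (or that are new)."""
    changed = []
    for i, seg in enumerate(new):
        if i >= len(old):
            changed.append(i)          # segment added by the client
            continue
        if (abs(seg["start"] - old[i]["start"]) > tol
                or abs(seg["end"] - old[i]["end"]) > tol):
            changed.append(i)          # boundary moved beyond tolerance
    return changed
```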

---

## Planned: Word-Level Timing

### `POST /compute_word_timestamps`

Compute word-level start/end times for every word in every segment. This is the backbone of karaoke-style highlighting and word-by-word caption animation.

**Inputs:** audio_id, model_name, device

**Returns:** JSON with per-segment word timestamps:
```json
{
  "audio_id": "...",
  "segments": [
    {
      "segment": 1,
      "words": [
        {"word": "بِسْمِ", "start": 0.81, "end": 1.12},
        {"word": "Ψ§Ω„Ω„ΩŽΩ‘Ω‡Ω", "start": 1.12, "end": 1.45}
      ]
    }
  ]
}
```
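A typical consumer of this response is a playback UI that highlights the current word. A minimal sketch over the JSON structure shown above:

```python
def active_word(segments, t):
    """Return the word whose span contains playback time t, else None.
    Half-open intervals avoid double-matching at shared boundaries."""
    for seg in segments:
        for w in seg["words"]:
            if w["start"] <= t < w["end"]:
                return w["word"]
    return None  # t falls in a pause or outside the recitation
```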

---

## Planned: Export Endpoints

Generate subtitle files from session data. All accept `audio_id` and optionally use word-level timestamps if previously computed.

### `POST /export_srt`

Standard SRT subtitle format. One entry per segment (or per word if `word_level=true`).

**Inputs:** audio_id, word_level (boolean, default false)

**Returns:** SRT file content.
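SRT generation from session segments is mostly timestamp formatting. A sketch, assuming the `time_from`/`time_to`/`matched_text` field names used in the JSON examples elsewhere in this document:

```python
def srt_timestamp(seconds):
    """Format seconds as the SRT HH:MM:SS,mmm timestamp."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments):
    """One numbered SRT entry per segment (the word_level=false case)."""
    entries = []
    for i, seg in enumerate(segments, start=1):
        entries.append(f"{i}\n{srt_timestamp(seg['time_from'])} --> "
                       f"{srt_timestamp(seg['time_to'])}\n"
                       f"{seg['matched_text']}\n")
    return "\n".join(entries)
```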

### `POST /export_vtt`

WebVTT format. Supports styling cues and is the standard for web video players.

**Inputs:** audio_id, word_level (boolean, default false)

**Returns:** VTT file content.

### `POST /export_ass`

ASS/SSA format with Arabic font and styling presets. Most useful for video editors producing styled Quran captions.

**Inputs:** audio_id, word_level (boolean, default false), font_name (optional), font_size (optional)

**Returns:** ASS file content.

---

## Planned: Quran Lookup Endpoints

Utility endpoints for client-side UI (dropdowns, search, manual labelling).

### `GET /quran_text`

Return Quran text with diacritics for a given reference range.

**Inputs:** ref_from (e.g. `"2:255:1"`), ref_to (e.g. `"2:255:7"`)

**Returns:** `{"text": "...", "ref_from": "...", "ref_to": "..."}`. All 114 chapters are pre-cached in memory.

### `GET /surah_info`

List of all surahs with metadata.

**Returns:** Array of `{number, name_arabic, name_english, ayah_count, revelation_type}`.

---

## Planned: Recitation Analytics

### `POST /recitation_stats`

Derive pace and timing analytics from an existing session's alignment results.

**Inputs:** audio_id

**Returns:**
```json
{
  "audio_id": "...",
  "total_duration_sec": 312.5,
  "total_segments": 7,
  "total_words": 86,
  "words_per_minute": 16.5,
  "avg_segment_duration_sec": 8.2,
  "avg_pause_duration_sec": 1.4,
  "per_segment": [
    {
      "segment": 1,
      "ref_from": "112:1:1",
      "ref_to": "112:1:4",
      "duration_sec": 2.18,
      "word_count": 4,
      "words_per_minute": 110.1,
      "pause_after_sec": 1.82
    }
  ]
}
```

Useful for learning apps that track student fluency, for reciter comparisons, or for detecting rushed or slow sections.
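The aggregate numbers above derive directly from the per-segment alignment results. A sketch of the arithmetic, assuming the `time_from`/`time_to` field names from the JSON examples and a stored `word_count` per segment:

```python
def recitation_stats(segments, total_duration_sec):
    """Aggregate pace/timing metrics from aligned segments."""
    total_words = sum(s["word_count"] for s in segments)
    speech_time = sum(s["time_to"] - s["time_from"] for s in segments)
    # Pause between each segment's end and the next segment's start.
    pauses = [b["time_from"] - a["time_to"]
              for a, b in zip(segments, segments[1:])]
    return {
        "total_segments": len(segments),
        "total_words": total_words,
        "words_per_minute": round(total_words / (total_duration_sec / 60), 1),
        "avg_segment_duration_sec": round(speech_time / len(segments), 1),
        "avg_pause_duration_sec": (round(sum(pauses) / len(pauses), 1)
                                   if pauses else 0.0),
    }
```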

---

## Planned: Streaming

### `POST /process_chunk`

Streaming-friendly endpoint for incremental audio processing. The client sends audio chunks as they become available, and the server returns partial alignment results progressively. Designed for live "now playing" displays (e.g. Quran radio showing the current ayah in real time).

**Inputs:** audio_id (optional; omit on first chunk to start a new session), audio_chunk (raw audio bytes), is_final (boolean)

**Returns:**
```json
{
  "audio_id": "...",
  "status": "partial",
  "latest_segments": [
    {
      "segment": 5,
      "ref_from": "36:1:1",
      "ref_to": "36:1:2",
      "matched_text": "ΩŠΨ³Ω“",
      "time_from": 24.3,
      "time_to": 25.8,
      "confidence": 0.95
    }
  ]
}
```

When `is_final=true`, the server finalises the session and returns the complete aligned output (same structure as `/process_audio_session`).

**Chunking notes:** The server buffers audio internally and runs VAD + ASR when enough speech has accumulated to form a segment. Earlier segments are locked in and won't change; only the trailing edge is provisional.
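A hedged client-side loop for this endpoint, reading a file in chunks to simulate a live feed. The chunk size and multipart field names are assumptions.

```python
import requests

def iter_chunks(data, chunk_bytes=32_000):
    """Yield (chunk, is_final) pairs; is_final is True for the last chunk.
    32,000 bytes is roughly 1 s of 16 kHz 16-bit mono audio (an assumption)."""
    for i in range(0, len(data), chunk_bytes):
        yield data[i:i + chunk_bytes], i + chunk_bytes >= len(data)

if __name__ == "__main__":
    audio_id = None
    raw = open("recitation.pcm", "rb").read()
    for chunk, final in iter_chunks(raw):
        resp = requests.post("http://localhost:7860/process_chunk",
                             data={"audio_id": audio_id or "",
                                   "is_final": final},
                             files={"audio_chunk": chunk}).json()
        audio_id = resp["audio_id"]
        for seg in resp.get("latest_segments", []):
            print(seg["segment"], seg["matched_text"])  # live "now playing"
```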

---

## Planned: Health / Status

### `GET /health`

Server status for monitoring dashboards and client-side availability checks.

**Returns:**
```json
{
  "status": "ok",
  "gpu_available": true,
  "gpu_quota_exhausted": false,
  "quota_reset_time": null,
  "active_sessions": 12,
  "models_loaded": ["Base", "Large"],
  "uptime_sec": 84200
}
```

---

## Error Handling

If `audio_id` is missing, expired, or invalid, session endpoints return:

```json
{"error": "Session not found or expired", "segments": []}
```

The client should call `/process_audio_session` again to get a fresh session.
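A client can make that recovery transparent by retrying once with a fresh session when the sentinel error comes back. A sketch; the `call` and `process_audio_session` helpers are hypothetical client-side functions, not part of the API.

```python
def with_session_retry(call, process_audio_session, audio_path, audio_id):
    """Run call(audio_id); on session expiry, rebuild the session and retry.
    `call` wraps one session endpoint; `process_audio_session` re-uploads."""
    resp = call(audio_id)
    if resp.get("error") == "Session not found or expired":
        fresh = process_audio_session(audio_path)  # new upload, new session
        resp = call(fresh["audio_id"])
    return resp
```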

---

## Design Notes

- **Thread safety:** Gradio handles concurrent requests via threading. The session store uses a lock around its internal dict.
- **Storage:** Audio on disk (can be large), metadata in memory (always small). Audio loaded via memory-mapped reads on demand.
- **No auth needed:** Session IDs are 128-bit random UUIDs, effectively unguessable.
- **HF Spaces compatibility:** `/tmp` is ephemeral and cleared on restart, which is fine since sessions are transient. The existing `allowed_paths=["/tmp"]` covers the new directory.
- **Backward compatible:** `/process_audio_json` remains unchanged.