# Debug Process API — Response Schema

Hidden endpoint for development debugging. Returns comprehensive structured data from every pipeline stage.

## Endpoint

```
POST /api/debug_process
```

## Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `audio_data` | Audio (numpy) | Audio file to process |
| `min_silence_ms` | int | Minimum silence duration for VAD segment splitting |
| `min_speech_ms` | int | Minimum speech duration to keep a segment |
| `pad_ms` | int | Padding added to each segment boundary |
| `model_name` | str | ASR model: `"Base"` or `"Large"` |
| `device` | str | `"GPU"` or `"CPU"` |
| `hf_token` | str | HF token for authentication |

## Usage

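```python
from gradio_client import Client, handle_file

client = Client("hetchyy/quranic-universal-aligner")
result = client.predict(
    # Recent gradio_client versions expect file inputs wrapped in handle_file()
    handle_file("path/to/audio.mp3"),  # audio_data
    300, 100, 50,        # min_silence_ms, min_speech_ms, pad_ms
    "Base", "GPU",       # model_name, device
    "hf_xxxx...",        # hf_token
    api_name="/debug_process"
)
```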

---

## Response Schema

### Top Level

```json
{
  "status": "ok",
  "timestamp": "2026-04-03T12:00:00+00:00",
  "profiling": { ... },
  "vad": { ... },
  "asr": { ... },
  "anchor": { ... },
  "specials": { ... },
  "alignment_detail": [ ... ],
  "events": [ ... ],
  "segments": [ ... ]
}
```

On error, the response is `{"error": "message"}` instead of the structure above. Causes include authentication failure, pipeline failure, and audio with no detected speech.
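
Because a failed call returns a dict with only an `error` key, it can help to check for it before reading any other section. A minimal sketch, assuming the client returns the response as a plain Python dict; `unwrap_debug_response` is a hypothetical helper and the error text is a made-up sample:

```python
def unwrap_debug_response(result):
    """Return the payload unchanged, or raise if the endpoint reported an error."""
    if not isinstance(result, dict):
        raise TypeError(f"expected a dict, got {type(result).__name__}")
    if "error" in result:
        raise RuntimeError(f"debug_process failed: {result['error']}")
    return result


# Sample payloads (hypothetical values, shaped like the schema above)
ok = unwrap_debug_response({"status": "ok", "segments": []})

try:
    unwrap_debug_response({"error": "no speech"})
except RuntimeError as exc:
    message = str(exc)
```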

---

### `profiling`

All timing fields from `ProfilingData` plus computed fields. Times in seconds unless noted.

| Field | Type | Description |
|-------|------|-------------|
| `resample_time` | float | Audio resampling to 16 kHz |
| `vad_model_load_time` | float | VAD model loading |
| `vad_model_move_time` | float | VAD model GPU transfer |
| `vad_inference_time` | float | VAD model inference |
| `vad_gpu_time` | float | Actual VAD GPU execution |
| `vad_wall_time` | float | VAD wall-clock (includes queue wait) |
| `asr_time` | float | ASR wall-clock (includes queue wait) |
| `asr_gpu_time` | float | Actual ASR GPU execution |
| `asr_model_move_time` | float | ASR model GPU transfer |
| `asr_sorting_time` | float | Duration-sorting for batching |
| `asr_batch_build_time` | float | Dynamic batch construction |
| `asr_batch_profiling` | array | Per-batch timing (see below) |
| `anchor_time` | float | N-gram voting anchor detection |
| `phoneme_total_time` | float | Overall phoneme matching |
| `phoneme_ref_build_time` | float | Chapter reference build |
| `phoneme_dp_total_time` | float | Total DP across all segments |
| `phoneme_dp_min_time` | float | Min DP time per segment |
| `phoneme_dp_max_time` | float | Max DP time per segment |
| `phoneme_dp_avg_time` | float | Average DP time per segment (computed) |
| `phoneme_window_setup_time` | float | Total window slicing |
| `phoneme_result_build_time` | float | Result construction |
| `phoneme_num_segments` | int | Number of DP alignment calls |
| `match_wall_time` | float | Total matching wall-clock |
| `tier1_attempts` | int | Tier 1 retry attempts |
| `tier1_passed` | int | Tier 1 retries that succeeded |
| `tier1_segments` | int[] | Segment indices that went to tier 1 |
| `tier2_attempts` | int | Tier 2 retry attempts |
| `tier2_passed` | int | Tier 2 retries that succeeded |
| `tier2_segments` | int[] | Segment indices that went to tier 2 |
| `consec_reanchors` | int | Times consecutive-failure reanchor triggered |
| `segments_attempted` | int | Total segments processed |
| `segments_passed` | int | Segments that matched successfully |
| `special_merges` | int | Basmala-fused wins |
| `transition_skips` | int | Transition segments detected |
| `phoneme_wraps_detected` | int | Repetition wraps |
| `result_build_time` | float | Total result building |
| `result_audio_encode_time` | float | Audio int16 conversion |
| `gpu_peak_vram_mb` | float | Peak GPU VRAM (MB) |
| `gpu_reserved_vram_mb` | float | Reserved GPU VRAM (MB) |
| `total_time` | float | End-to-end pipeline time |
| `summary_text` | str | Formatted profiling summary (same as terminal output) |
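
The counters in this table can be combined into match and retry success rates. A minimal sketch, assuming the field names listed above; `match_rates` is a hypothetical helper and the numbers are made-up sample data:

```python
def match_rates(profiling):
    """Compute overall and per-tier success rates from the profiling counters."""
    attempted = profiling["segments_attempted"]
    rates = {
        "overall": profiling["segments_passed"] / attempted if attempted else 0.0
    }
    for tier in ("tier1", "tier2"):
        attempts = profiling[f"{tier}_attempts"]
        # None means the tier was never used, which differs from a 0% pass rate
        rates[tier] = profiling[f"{tier}_passed"] / attempts if attempts else None
    return rates


sample_profiling = {
    "segments_attempted": 20,
    "segments_passed": 18,
    "tier1_attempts": 4,
    "tier1_passed": 2,
    "tier2_attempts": 0,
    "tier2_passed": 0,
}
rates = match_rates(sample_profiling)
```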

#### `asr_batch_profiling[]`

| Field | Type | Description |
|-------|------|-------------|
| `batch_num` | int | Batch index (1-based) |
| `size` | int | Number of segments in batch |
| `time` | float | Total batch processing time |
| `feat_time` | float | Feature extraction + GPU transfer |
| `infer_time` | float | Model inference |
| `decode_time` | float | CTC greedy decode |
| `min_dur` | float | Shortest audio in batch (seconds) |
| `max_dur` | float | Longest audio in batch (seconds) |
| `avg_dur` | float | Average audio duration |
| `total_seconds` | float | Sum of all segment durations |
| `pad_waste` | float | Fraction of padding waste (0–1) |
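
To spot inefficient batches, the per-batch entries can be sorted by `pad_waste` alongside the share of batch time spent in inference. A minimal sketch, assuming the fields above; `batch_breakdown` is a hypothetical helper and the values are made-up sample data:

```python
def batch_breakdown(batches):
    """Return per-batch rows sorted by padding waste, worst first."""
    rows = []
    for batch in batches:
        total = batch["time"]
        rows.append({
            "batch_num": batch["batch_num"],
            "pad_waste": batch["pad_waste"],
            # Fraction of the batch's total time spent in model inference
            "infer_share": batch["infer_time"] / total if total else 0.0,
        })
    return sorted(rows, key=lambda row: row["pad_waste"], reverse=True)


sample_batches = [
    {"batch_num": 1, "time": 2.0, "infer_time": 1.5, "pad_waste": 0.1},
    {"batch_num": 2, "time": 1.0, "infer_time": 0.5, "pad_waste": 0.4},
]
breakdown = batch_breakdown(sample_batches)
```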

---

### `vad`

VAD segmentation details — raw model output vs. cleaned intervals.

| Field | Type | Description |
|-------|------|-------------|
| `raw_interval_count` | int | Intervals from VAD model before cleaning |
| `raw_intervals` | float[][] | `[[start, end], ...]` before silence merge / min_speech filter |
| `cleaned_interval_count` | int | Intervals after cleaning |
| `cleaned_intervals` | float[][] | `[[start, end], ...]` final segment boundaries |
| `params` | object | `{min_silence_ms, min_speech_ms, pad_ms}` |
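
Comparing the raw and cleaned intervals shows how much the silence merge and `min_speech_ms` filter changed the segmentation. A minimal sketch, assuming interval bounds are expressed in seconds (the schema does not state the unit); `vad_summary` is a hypothetical helper and the intervals are made-up sample data:

```python
def total_duration(intervals):
    """Sum the lengths of [start, end] intervals."""
    return sum(end - start for start, end in intervals)


def vad_summary(vad):
    """Compare interval counts and total speech before and after cleaning."""
    return {
        "raw_count": vad["raw_interval_count"],
        "cleaned_count": vad["cleaned_interval_count"],
        "raw_speech": total_duration(vad["raw_intervals"]),
        "cleaned_speech": total_duration(vad["cleaned_intervals"]),
    }


sample_vad = {
    "raw_interval_count": 3,
    "raw_intervals": [[0.0, 1.0], [1.25, 2.0], [5.0, 5.5]],
    "cleaned_interval_count": 2,
    "cleaned_intervals": [[0.0, 2.0], [5.0, 5.5]],
    "params": {"min_silence_ms": 300, "min_speech_ms": 100, "pad_ms": 50},
}
summary = vad_summary(sample_vad)
```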

---

### `asr`

ASR phoneme recognition results per segment.

| Field | Type | Description |
|-------|------|-------------|
| `model_name` | str | `"Base"` or `"Large"` |
| `num_segments` | int | Total segments transcribed |
| `per_segment_phonemes` | array | Per-segment phoneme output (see below) |

#### `per_segment_phonemes[]`

| Field | Type | Description |
|-------|------|-------------|
| `segment_idx` | int | Segment index (0-based) |
| `phonemes` | str[] | Array of phoneme strings from CTC decode |

---

### `anchor`

N-gram voting for chapter/verse anchor detection.

| Field | Type | Description |
|-------|------|-------------|
| `segments_used` | int | Number of segments used for voting |
| `combined_phoneme_count` | int | Total phonemes in combined segments |
| `ngrams_extracted` | int | N-grams extracted from ASR output |
| `ngrams_matched` | int | N-grams found in Quran index |
| `ngrams_missed` | int | N-grams not in index |
| `distinct_pairs` | int | Distinct (surah, ayah) pairs voted for |
| `surah_ranking` | array | Candidate surahs ranked by best run weight |
| `winner_surah` | int | Winning surah number |
| `winner_ayah` | int | Starting ayah of best contiguous run |
| `start_pointer` | int | Word index corresponding to winner ayah |

#### `surah_ranking[]`

| Field | Type | Description |
|-------|------|-------------|
| `surah` | int | Surah number |
| `total_weight` | float | Sum of all vote weights |
| `best_run` | object | `{start_ayah, end_ayah, weight}` — best contiguous ayah run |
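
A quick way to judge anchor quality is the n-gram hit rate together with the leading entries of `surah_ranking`. A minimal sketch, assuming `surah_ranking` already arrives sorted as described above; `anchor_summary` is a hypothetical helper and the values are made-up sample data:

```python
def anchor_summary(anchor, top_n=3):
    """Summarize the n-gram hit rate, the winner, and the top-ranked surahs."""
    extracted = anchor["ngrams_extracted"]
    hit_rate = anchor["ngrams_matched"] / extracted if extracted else 0.0
    top = [
        (entry["surah"], entry["best_run"]["weight"])
        for entry in anchor["surah_ranking"][:top_n]
    ]
    return {
        "hit_rate": hit_rate,
        "winner": (anchor["winner_surah"], anchor["winner_ayah"]),
        "top": top,
    }


sample_anchor = {
    "ngrams_extracted": 40,
    "ngrams_matched": 30,
    "ngrams_missed": 10,
    "winner_surah": 2,
    "winner_ayah": 255,
    "surah_ranking": [
        {"surah": 2, "total_weight": 12.5,
         "best_run": {"start_ayah": 255, "end_ayah": 257, "weight": 9.0}},
        {"surah": 3, "total_weight": 2.0,
         "best_run": {"start_ayah": 2, "end_ayah": 2, "weight": 1.5}},
    ],
}
anchor_info = anchor_summary(sample_anchor)
```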

---

### `specials`

Special segment detection (Isti'adha, Basmala, Takbir at recording start).

| Field | Type | Description |
|-------|------|-------------|
| `candidates_tested` | array | Every detection attempt with edit distance |
| `detected` | array | Confirmed special segments |
| `first_quran_idx` | int | Index where Quran content starts (after specials) |

#### `candidates_tested[]`

| Field | Type | Description |
|-------|------|-------------|
| `segment_idx` | int | Which segment was tested |
| `type` | str | Candidate type (`"Isti'adha"`, `"Basmala"`, `"Combined Isti'adha+Basmala"`, `"Takbir"`) |
| `edit_distance` | float | Normalized edit distance (0 = exact match) |
| `threshold` | float | Maximum edit distance for acceptance |
| `matched` | bool | Whether distance ≤ threshold |

#### `detected[]`

| Field | Type | Description |
|-------|------|-------------|
| `segment_idx` | int | Segment index |
| `type` | str | Special type |
| `confidence` | float | 1 − edit_distance |
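
When tuning detection thresholds, candidates that narrowly failed are often the most informative. A minimal sketch that lists rejected candidates whose distance exceeds the threshold by at most a small margin; `near_misses`, the `margin` parameter, and the sample thresholds are all hypothetical:

```python
def near_misses(specials, margin=0.05):
    """Return rejected candidates within `margin` of their acceptance threshold."""
    return [
        candidate
        for candidate in specials["candidates_tested"]
        if not candidate["matched"]
        and candidate["edit_distance"] - candidate["threshold"] <= margin
    ]


sample_specials = {
    "candidates_tested": [
        {"segment_idx": 0, "type": "Isti'adha", "edit_distance": 0.1,
         "threshold": 0.25, "matched": True},
        {"segment_idx": 1, "type": "Basmala", "edit_distance": 0.27,
         "threshold": 0.25, "matched": False},
        {"segment_idx": 1, "type": "Takbir", "edit_distance": 0.8,
         "threshold": 0.25, "matched": False},
    ],
    "detected": [{"segment_idx": 0, "type": "Isti'adha", "confidence": 0.9}],
    "first_quran_idx": 1,
}
close_calls = near_misses(sample_specials)
```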

---

### `alignment_detail[]`

Per-segment DP alignment results. One entry per alignment attempt (primary + retries appear separately).

| Field | Type | Description |
|-------|------|-------------|
| `segment_idx` | int | 1-based segment display index (unlike the 0-based `segment_idx` in `per_segment_phonemes[]`) |
| `asr_phonemes` | str | Space-separated ASR phonemes (truncated to 60) |
| `asr_phoneme_count` | int | Full phoneme count |
| `window` | object | `{pointer, surah}` — DP search window info |
| `expected_pointer` | int | Word pointer at time of alignment |
| `retry_tier` | str\|null | `null` for primary, `"tier1"` or `"tier2"` for retries |
| `result` | object\|null | Alignment result (null if failed) |
| `timing` | object | `{window_setup_ms, dp_ms, result_build_ms}` |
| `failed_reason` | str\|null | Why alignment failed (if applicable) |

#### `result` (when present)

| Field | Type | Description |
|-------|------|-------------|
| `matched_ref` | str | Reference location (`"2:255:1-2:255:3"`) |
| `start_word_idx` | int | First matched word index in chapter reference |
| `end_word_idx` | int | Last matched word index |
| `edit_cost` | float | Raw edit distance (with substitution costs) |
| `confidence` | float | 1 − normalized_edit_distance |
| `j_start` | int | Start position in reference phoneme window |
| `best_j` | int | End position in reference phoneme window |
| `basmala_consumed` | bool | Whether Basmala prefix was consumed |
| `n_wraps` | int | Number of repetition wraps |
| `wrap_points` | array\|null | `[(i, j_end, j_start), ...]` for each wrap |
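
Since retries appear as separate entries, grouping by `segment_idx` reconstructs the attempt history of each segment. A minimal sketch, assuming a segment counts as unresolved when every one of its attempts has a null `result`; both helpers are hypothetical and the entries are trimmed sample data:

```python
from collections import defaultdict


def attempts_by_segment(alignment_detail):
    """Map each segment index to the ordered list of attempt tiers."""
    grouped = defaultdict(list)
    for entry in alignment_detail:
        grouped[entry["segment_idx"]].append(entry["retry_tier"] or "primary")
    return dict(grouped)


def unresolved_segments(alignment_detail):
    """Return segment indices for which no attempt produced a result."""
    seen = {entry["segment_idx"] for entry in alignment_detail}
    solved = {
        entry["segment_idx"]
        for entry in alignment_detail
        if entry["result"] is not None
    }
    return sorted(seen - solved)


sample_detail = [
    {"segment_idx": 1, "retry_tier": None, "result": {"confidence": 0.97}},
    {"segment_idx": 2, "retry_tier": None, "result": None},
    {"segment_idx": 2, "retry_tier": "tier1", "result": {"confidence": 0.81}},
    {"segment_idx": 3, "retry_tier": None, "result": None},
    {"segment_idx": 3, "retry_tier": "tier1", "result": None},
    {"segment_idx": 3, "retry_tier": "tier2", "result": None},
]
history = attempts_by_segment(sample_detail)
unresolved = unresolved_segments(sample_detail)
```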

---

### `events[]`

Pipeline events in chronological order. Each has a `type` field plus event-specific data.

#### Event Types

| Type | Fields | Description |
|------|--------|-------------|
| `gap` | `position`, `segment_before`/`segment_after`/`segment_idx`, `missing_words` | Missing words between consecutive segments or at boundaries |
| `reanchor` | `at_segment`, `reason`, `new_surah`, `new_ayah`, `new_pointer` | Global re-anchor after consecutive failures or transition mode exit |
| `chapter_transition` | `at_segment`, `from_surah`, `to_surah` | Sequential chapter boundary crossing |
| `chapter_end` | `at_segment`, `from_surah`, `next_action` | End of chapter detected |
| `basmala_fused` | `segment_idx`, `fused_conf`, `plain_conf`, `chose` | Basmala merged with first verse (chosen when `fused_conf` > `plain_conf`) |
| `transition_detected` | `segment_idx`, `transition_type`, `confidence`, `context` | Non-Quranic transition segment (Amin, Takbir, Tahmeed, etc.) |
| `tahmeed_merge` | `segment_idx`, `merged_segment` | Two Tahmeed segments merged |
| `retry_tier1` | `segment_idx`, `passed`, `confidence` | Tier 1 retry succeeded |
| `retry_tier2` | `segment_idx`, `passed`, `confidence` | Tier 2 retry succeeded |
| `retry_failed` | `segment_idx`, `tier1`, `tier2` | All retry tiers exhausted |
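
Because every event carries a `type`, a simple tally gives a quick overview of what happened during a run. A minimal sketch using only the `type` field; `event_counts` and `events_of_type` are hypothetical helpers and the events are trimmed sample data:

```python
from collections import Counter


def event_counts(events):
    """Count events per type."""
    return Counter(event["type"] for event in events)


def events_of_type(events, event_type):
    """Return events of one type, preserving chronological order."""
    return [event for event in events if event["type"] == event_type]


sample_events = [
    {"type": "gap", "position": "between", "missing_words": 2},
    {"type": "retry_tier1", "segment_idx": 4, "passed": True, "confidence": 0.9},
    {"type": "gap", "position": "between", "missing_words": 1},
    {"type": "retry_failed", "segment_idx": 7, "tier1": False, "tier2": False},
]
counts = event_counts(sample_events)
gaps = events_of_type(sample_events, "gap")
```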

---

### `segments[]`

Final alignment output (same schema as `/process_audio_session` response).

| Field | Type | Description |
|-------|------|-------------|
| `segment` | int | 1-based segment number |
| `time_from` | float | Start time (seconds) |
| `time_to` | float | End time (seconds) |
| `ref_from` | str | Reference start (`"surah:ayah:word"`) |
| `ref_to` | str | Reference end |
| `matched_text` | str | Matched Quran text |
| `confidence` | float | Alignment confidence (0–1) |
| `has_missing_words` | bool | Gap detected before/after this segment |
| `error` | str\|null | Error message if alignment failed |
| `special_type` | str | Present only for special segments |
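
For review, the final segments can be filtered down to those that need attention. A minimal sketch that flags segments with an error, missing words, or low confidence, and parses `"surah:ayah:word"` references; `flag_segments`, `parse_ref`, and the 0.8 cutoff are hypothetical, and special segments may use references that `parse_ref` does not handle:

```python
def parse_ref(ref):
    """Split a "surah:ayah:word" reference into three integers."""
    surah, ayah, word = (int(part) for part in ref.split(":"))
    return surah, ayah, word


def flag_segments(segments, min_confidence=0.8):
    """Return segment numbers with an error, a gap, or low confidence."""
    flagged = []
    for seg in segments:
        if (
            seg.get("error")
            or seg["has_missing_words"]
            or seg["confidence"] < min_confidence
        ):
            flagged.append(seg["segment"])
    return flagged


sample_segments = [
    {"segment": 1, "ref_from": "2:255:1", "ref_to": "2:255:3",
     "confidence": 0.95, "has_missing_words": False, "error": None},
    {"segment": 2, "ref_from": "2:255:4", "ref_to": "2:255:9",
     "confidence": 0.6, "has_missing_words": False, "error": None},
    {"segment": 3, "ref_from": "", "ref_to": "",
     "confidence": 0.0, "has_missing_words": False, "error": "alignment failed"},
]
flagged = flag_segments(sample_segments)
start = parse_ref(sample_segments[0]["ref_from"])
```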