# Debug Process API — Response Schema Hidden endpoint for development debugging. Returns comprehensive structured data from every pipeline stage. ## Endpoint ``` POST /api/debug_process ``` ## Parameters | Parameter | Type | Description | |-----------|------|-------------| | `audio_data` | Audio (numpy) | Audio file to process | | `min_silence_ms` | int | Minimum silence duration for VAD segment splitting | | `min_speech_ms` | int | Minimum speech duration to keep a segment | | `pad_ms` | int | Padding added to each segment boundary | | `model_name` | str | ASR model: `"Base"` or `"Large"` | | `device` | str | `"GPU"` or `"CPU"` | | `hf_token` | str | HF token for authentication | ## Usage ```python from gradio_client import Client client = Client("hetchyy/quranic-universal-aligner") result = client.predict( "path/to/audio.mp3", 300, 100, 50, # silence, speech, pad "Base", "GPU", "hf_xxxx...", # HF token api_name="/debug_process" ) ``` --- ## Response Schema ### Top Level ```json { "status": "ok", "timestamp": "2026-04-03T12:00:00+00:00", "profiling": { ... }, "vad": { ... }, "asr": { ... }, "anchor": { ... }, "specials": { ... }, "alignment_detail": [ ... ], "events": [ ... ], "segments": [ ... ] } ``` On error: `{"error": "message"}` (auth failure, pipeline failure, no speech). --- ### `profiling` All timing fields from `ProfilingData` plus computed fields. Times in seconds unless noted. | Field | Type | Description | |-------|------|-------------| | `resample_time` | float | Audio resampling to 16kHz | | `vad_model_load_time` | float | VAD model loading | | `vad_model_move_time` | float | VAD model GPU transfer | | `vad_inference_time` | float | VAD model inference | | `vad_gpu_time` | float | Actual VAD GPU execution | | `vad_wall_time` | float | VAD wall-clock (includes queue wait) | | `asr_time` | float | ASR wall-clock (includes queue wait) | | `asr_gpu_time` | float | Actual ASR GPU execution | | `asr_model_move_time` | float | ASR model GPU transfer | | `asr_sorting_time` | float | Duration-sorting for batching | | `asr_batch_build_time` | float | Dynamic batch construction | | `asr_batch_profiling` | array | Per-batch timing (see below) | | `anchor_time` | float | N-gram voting anchor detection | | `phoneme_total_time` | float | Overall phoneme matching | | `phoneme_ref_build_time` | float | Chapter reference build | | `phoneme_dp_total_time` | float | Total DP across all segments | | `phoneme_dp_min_time` | float | Min DP time per segment | | `phoneme_dp_max_time` | float | Max DP time per segment | | `phoneme_dp_avg_time` | float | Average DP time per segment (computed) | | `phoneme_window_setup_time` | float | Total window slicing | | `phoneme_result_build_time` | float | Result construction | | `phoneme_num_segments` | int | Number of DP alignment calls | | `match_wall_time` | float | Total matching wall-clock | | `tier1_attempts` | int | Tier 1 retry attempts | | `tier1_passed` | int | Tier 1 retries that succeeded | | `tier1_segments` | int[] | Segment indices that went to tier 1 | | `tier2_attempts` | int | Tier 2 retry attempts | | `tier2_passed` | int | Tier 2 retries that succeeded | | `tier2_segments` | int[] | Segment indices that went to tier 2 | | `consec_reanchors` | int | Times consecutive-failure reanchor triggered | | `segments_attempted` | int | Total segments processed | | `segments_passed` | int | Segments that matched successfully | | `special_merges` | int | Basmala-fused wins | | `transition_skips` | int | Transition segments detected | | `phoneme_wraps_detected` | int | Repetition wraps | | `result_build_time` | float | Total result building | | `result_audio_encode_time` | float | Audio int16 conversion | | `gpu_peak_vram_mb` | float | Peak GPU VRAM (MB) | | `gpu_reserved_vram_mb` | float | Reserved GPU VRAM (MB) | | `total_time` | float | End-to-end pipeline time | | `summary_text` | str | Formatted profiling summary (same as terminal output) | #### `asr_batch_profiling[]` | Field | Type | Description | |-------|------|-------------| | `batch_num` | int | Batch index (1-based) | | `size` | int | Number of segments in batch | | `time` | float | Total batch processing time | | `feat_time` | float | Feature extraction + GPU transfer | | `infer_time` | float | Model inference | | `decode_time` | float | CTC greedy decode | | `min_dur` | float | Shortest audio in batch (seconds) | | `max_dur` | float | Longest audio in batch (seconds) | | `avg_dur` | float | Average audio duration | | `total_seconds` | float | Sum of all segment durations | | `pad_waste` | float | Fraction of padding waste (0–1) | --- ### `vad` VAD segmentation details — raw model output vs. cleaned intervals. | Field | Type | Description | |-------|------|-------------| | `raw_interval_count` | int | Intervals from VAD model before cleaning | | `raw_intervals` | float[][] | `[[start, end], ...]` before silence merge / min_speech filter | | `cleaned_interval_count` | int | Intervals after cleaning | | `cleaned_intervals` | float[][] | `[[start, end], ...]` final segment boundaries | | `params` | object | `{min_silence_ms, min_speech_ms, pad_ms}` | --- ### `asr` ASR phoneme recognition results per segment. | Field | Type | Description | |-------|------|-------------| | `model_name` | str | `"Base"` or `"Large"` | | `num_segments` | int | Total segments transcribed | | `per_segment_phonemes` | array | Per-segment phoneme output (see below) | #### `per_segment_phonemes[]` | Field | Type | Description | |-------|------|-------------| | `segment_idx` | int | Segment index (0-based) | | `phonemes` | str[] | Array of phoneme strings from CTC decode | --- ### `anchor` N-gram voting for chapter/verse anchor detection. | Field | Type | Description | |-------|------|-------------| | `segments_used` | int | Number of segments used for voting | | `combined_phoneme_count` | int | Total phonemes in combined segments | | `ngrams_extracted` | int | N-grams extracted from ASR output | | `ngrams_matched` | int | N-grams found in Quran index | | `ngrams_missed` | int | N-grams not in index | | `distinct_pairs` | int | Distinct (surah, ayah) pairs voted for | | `surah_ranking` | array | Candidate surahs ranked by best run weight | | `winner_surah` | int | Winning surah number | | `winner_ayah` | int | Starting ayah of best contiguous run | | `start_pointer` | int | Word index corresponding to winner ayah | #### `surah_ranking[]` | Field | Type | Description | |-------|------|-------------| | `surah` | int | Surah number | | `total_weight` | float | Sum of all vote weights | | `best_run` | object | `{start_ayah, end_ayah, weight}` — best contiguous ayah run | --- ### `specials` Special segment detection (Isti'adha, Basmala, Takbir at recording start). | Field | Type | Description | |-------|------|-------------| | `candidates_tested` | array | Every detection attempt with edit distance | | `detected` | array | Confirmed special segments | | `first_quran_idx` | int | Index where Quran content starts (after specials) | #### `candidates_tested[]` | Field | Type | Description | |-------|------|-------------| | `segment_idx` | int | Which segment was tested | | `type` | str | Candidate type (`"Isti'adha"`, `"Basmala"`, `"Combined Isti'adha+Basmala"`, `"Takbir"`) | | `edit_distance` | float | Normalized edit distance (0 = exact match) | | `threshold` | float | Maximum edit distance for acceptance | | `matched` | bool | Whether distance ≤ threshold | #### `detected[]` | Field | Type | Description | |-------|------|-------------| | `segment_idx` | int | Segment index | | `type` | str | Special type | | `confidence` | float | 1 − edit_distance | --- ### `alignment_detail[]` Per-segment DP alignment results. One entry per alignment attempt (primary + retries appear separately). | Field | Type | Description | |-------|------|-------------| | `segment_idx` | int | 1-based segment display index | | `asr_phonemes` | str | Space-separated ASR phonemes (truncated to 60) | | `asr_phoneme_count` | int | Full phoneme count | | `window` | object | `{pointer, surah}` — DP search window info | | `expected_pointer` | int | Word pointer at time of alignment | | `retry_tier` | str\|null | `null` for primary, `"tier1"` or `"tier2"` for retries | | `result` | object\|null | Alignment result (null if failed) | | `timing` | object | `{window_setup_ms, dp_ms, result_build_ms}` | | `failed_reason` | str\|null | Why alignment failed (if applicable) | #### `result` (when present) | Field | Type | Description | |-------|------|-------------| | `matched_ref` | str | Reference location (`"2:255:1-2:255:3"`) | | `start_word_idx` | int | First matched word index in chapter reference | | `end_word_idx` | int | Last matched word index | | `edit_cost` | float | Raw edit distance (with substitution costs) | | `confidence` | float | 1 − normalized_edit_distance | | `j_start` | int | Start position in reference phoneme window | | `best_j` | int | End position in reference phoneme window | | `basmala_consumed` | bool | Whether Basmala prefix was consumed | | `n_wraps` | int | Number of repetition wraps | | `wrap_points` | array\|null | `[(i, j_end, j_start), ...]` for each wrap | --- ### `events[]` Pipeline events in chronological order. Each has a `type` field plus event-specific data. #### Event Types | Type | Fields | Description | |------|--------|-------------| | `gap` | `position`, `segment_before`/`segment_after`/`segment_idx`, `missing_words` | Missing words between consecutive segments or at boundaries | | `reanchor` | `at_segment`, `reason`, `new_surah`, `new_ayah`, `new_pointer` | Global re-anchor after consecutive failures or transition mode exit | | `chapter_transition` | `at_segment`, `from_surah`, `to_surah` | Sequential chapter boundary crossing | | `chapter_end` | `at_segment`, `from_surah`, `next_action` | End of chapter detected | | `basmala_fused` | `segment_idx`, `fused_conf`, `plain_conf`, `chose` | Basmala merged with first verse (chosen when fused > plain) | | `transition_detected` | `segment_idx`, `transition_type`, `confidence`, `context` | Non-Quranic transition segment (Amin, Takbir, Tahmeed, etc.) | | `tahmeed_merge` | `segment_idx`, `merged_segment` | Two Tahmeed segments merged | | `retry_tier1` | `segment_idx`, `passed`, `confidence` | Tier 1 retry succeeded | | `retry_tier2` | `segment_idx`, `passed`, `confidence` | Tier 2 retry succeeded | | `retry_failed` | `segment_idx`, `tier1`, `tier2` | All retry tiers exhausted | --- ### `segments[]` Final alignment output (same schema as `/process_audio_session` response). | Field | Type | Description | |-------|------|-------------| | `segment` | int | 1-based segment number | | `time_from` | float | Start time (seconds) | | `time_to` | float | End time (seconds) | | `ref_from` | str | Reference start (`"surah:ayah:word"`) | | `ref_to` | str | Reference end | | `matched_text` | str | Matched Quran text | | `confidence` | float | Alignment confidence (0–1) | | `has_missing_words` | bool | Gap detected before/after this segment | | `error` | str\|null | Error message if alignment failed | | `special_type` | str | Present only for special segments |