Spaces:

hetchyy
/

quranic-universal-aligner

Running on Zero

App Files Files Community

quranic-universal-aligner / docs /debug_api_schema.md

hetchyy

Upload folder using huggingface_hub

602b5d3 verified 2 days ago

preview code

raw

history blame contribute delete

11.3 kB

	# Debug Process API — Response Schema

	Hidden endpoint for development debugging. Returns comprehensive structured data from every pipeline stage.

	## Endpoint

	```
	POST /api/debug_process
	```

	## Parameters

	\| Parameter \| Type \| Description \|
	\|-----------\|------\|-------------\|
	\| `audio_data` \| Audio (numpy) \| Audio file to process \|
	\| `min_silence_ms` \| int \| Minimum silence duration for VAD segment splitting \|
	\| `min_speech_ms` \| int \| Minimum speech duration to keep a segment \|
	\| `pad_ms` \| int \| Padding added to each segment boundary \|
	\| `model_name` \| str \| ASR model: `"Base"` or `"Large"` \|
	\| `device` \| str \| `"GPU"` or `"CPU"` \|
	\| `hf_token` \| str \| HF token for authentication \|

	## Usage

	```python
	from gradio_client import Client

	client = Client("hetchyy/quranic-universal-aligner")
	result = client.predict(
	"path/to/audio.mp3",
	300, 100, 50, # silence, speech, pad
	"Base", "GPU",
	"hf_xxxx...", # HF token
	api_name="/debug_process"
	)
	```

	---

	## Response Schema

	### Top Level

	```json
	{
	"status": "ok",
	"timestamp": "2026-04-03T12:00:00+00:00",
	"profiling": { ... },
	"vad": { ... },
	"asr": { ... },
	"anchor": { ... },
	"specials": { ... },
	"alignment_detail": [ ... ],
	"events": [ ... ],
	"segments": [ ... ]
	}
	```

	On error: `{"error": "message"}` (auth failure, pipeline failure, no speech).

	---

	### `profiling`

	All timing fields from `ProfilingData` plus computed fields. Times in seconds unless noted.

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `resample_time` \| float \| Audio resampling to 16kHz \|
	\| `vad_model_load_time` \| float \| VAD model loading \|
	\| `vad_model_move_time` \| float \| VAD model GPU transfer \|
	\| `vad_inference_time` \| float \| VAD model inference \|
	\| `vad_gpu_time` \| float \| Actual VAD GPU execution \|
	\| `vad_wall_time` \| float \| VAD wall-clock (includes queue wait) \|
	\| `asr_time` \| float \| ASR wall-clock (includes queue wait) \|
	\| `asr_gpu_time` \| float \| Actual ASR GPU execution \|
	\| `asr_model_move_time` \| float \| ASR model GPU transfer \|
	\| `asr_sorting_time` \| float \| Duration-sorting for batching \|
	\| `asr_batch_build_time` \| float \| Dynamic batch construction \|
	\| `asr_batch_profiling` \| array \| Per-batch timing (see below) \|
	\| `anchor_time` \| float \| N-gram voting anchor detection \|
	\| `phoneme_total_time` \| float \| Overall phoneme matching \|
	\| `phoneme_ref_build_time` \| float \| Chapter reference build \|
	\| `phoneme_dp_total_time` \| float \| Total DP across all segments \|
	\| `phoneme_dp_min_time` \| float \| Min DP time per segment \|
	\| `phoneme_dp_max_time` \| float \| Max DP time per segment \|
	\| `phoneme_dp_avg_time` \| float \| Average DP time per segment (computed) \|
	\| `phoneme_window_setup_time` \| float \| Total window slicing \|
	\| `phoneme_result_build_time` \| float \| Result construction \|
	\| `phoneme_num_segments` \| int \| Number of DP alignment calls \|
	\| `match_wall_time` \| float \| Total matching wall-clock \|
	\| `tier1_attempts` \| int \| Tier 1 retry attempts \|
	\| `tier1_passed` \| int \| Tier 1 retries that succeeded \|
	\| `tier1_segments` \| int[] \| Segment indices that went to tier 1 \|
	\| `tier2_attempts` \| int \| Tier 2 retry attempts \|
	\| `tier2_passed` \| int \| Tier 2 retries that succeeded \|
	\| `tier2_segments` \| int[] \| Segment indices that went to tier 2 \|
	\| `consec_reanchors` \| int \| Times consecutive-failure reanchor triggered \|
	\| `segments_attempted` \| int \| Total segments processed \|
	\| `segments_passed` \| int \| Segments that matched successfully \|
	\| `special_merges` \| int \| Basmala-fused wins \|
	\| `transition_skips` \| int \| Transition segments detected \|
	\| `phoneme_wraps_detected` \| int \| Repetition wraps \|
	\| `result_build_time` \| float \| Total result building \|
	\| `result_audio_encode_time` \| float \| Audio int16 conversion \|
	\| `gpu_peak_vram_mb` \| float \| Peak GPU VRAM (MB) \|
	\| `gpu_reserved_vram_mb` \| float \| Reserved GPU VRAM (MB) \|
	\| `total_time` \| float \| End-to-end pipeline time \|
	\| `summary_text` \| str \| Formatted profiling summary (same as terminal output) \|

	#### `asr_batch_profiling[]`

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `batch_num` \| int \| Batch index (1-based) \|
	\| `size` \| int \| Number of segments in batch \|
	\| `time` \| float \| Total batch processing time \|
	\| `feat_time` \| float \| Feature extraction + GPU transfer \|
	\| `infer_time` \| float \| Model inference \|
	\| `decode_time` \| float \| CTC greedy decode \|
	\| `min_dur` \| float \| Shortest audio in batch (seconds) \|
	\| `max_dur` \| float \| Longest audio in batch (seconds) \|
	\| `avg_dur` \| float \| Average audio duration \|
	\| `total_seconds` \| float \| Sum of all segment durations \|
	\| `pad_waste` \| float \| Fraction of padding waste (0–1) \|

	---

	### `vad`

	VAD segmentation details — raw model output vs. cleaned intervals.

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `raw_interval_count` \| int \| Intervals from VAD model before cleaning \|
	\| `raw_intervals` \| float[][] \| `[[start, end], ...]` before silence merge / min_speech filter \|
	\| `cleaned_interval_count` \| int \| Intervals after cleaning \|
	\| `cleaned_intervals` \| float[][] \| `[[start, end], ...]` final segment boundaries \|
	\| `params` \| object \| `{min_silence_ms, min_speech_ms, pad_ms}` \|

	---

	### `asr`

	ASR phoneme recognition results per segment.

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `model_name` \| str \| `"Base"` or `"Large"` \|
	\| `num_segments` \| int \| Total segments transcribed \|
	\| `per_segment_phonemes` \| array \| Per-segment phoneme output (see below) \|

	#### `per_segment_phonemes[]`

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `segment_idx` \| int \| Segment index (0-based) \|
	\| `phonemes` \| str[] \| Array of phoneme strings from CTC decode \|

	---

	### `anchor`

	N-gram voting for chapter/verse anchor detection.

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `segments_used` \| int \| Number of segments used for voting \|
	\| `combined_phoneme_count` \| int \| Total phonemes in combined segments \|
	\| `ngrams_extracted` \| int \| N-grams extracted from ASR output \|
	\| `ngrams_matched` \| int \| N-grams found in Quran index \|
	\| `ngrams_missed` \| int \| N-grams not in index \|
	\| `distinct_pairs` \| int \| Distinct (surah, ayah) pairs voted for \|
	\| `surah_ranking` \| array \| Candidate surahs ranked by best run weight \|
	\| `winner_surah` \| int \| Winning surah number \|
	\| `winner_ayah` \| int \| Starting ayah of best contiguous run \|
	\| `start_pointer` \| int \| Word index corresponding to winner ayah \|

	#### `surah_ranking[]`

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `surah` \| int \| Surah number \|
	\| `total_weight` \| float \| Sum of all vote weights \|
	\| `best_run` \| object \| `{start_ayah, end_ayah, weight}` — best contiguous ayah run \|

	---

	### `specials`

	Special segment detection (Isti'adha, Basmala, Takbir at recording start).

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `candidates_tested` \| array \| Every detection attempt with edit distance \|
	\| `detected` \| array \| Confirmed special segments \|
	\| `first_quran_idx` \| int \| Index where Quran content starts (after specials) \|

	#### `candidates_tested[]`

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `segment_idx` \| int \| Which segment was tested \|
	\| `type` \| str \| Candidate type (`"Isti'adha"`, `"Basmala"`, `"Combined Isti'adha+Basmala"`, `"Takbir"`) \|
	\| `edit_distance` \| float \| Normalized edit distance (0 = exact match) \|
	\| `threshold` \| float \| Maximum edit distance for acceptance \|
	\| `matched` \| bool \| Whether distance ≤ threshold \|

	#### `detected[]`

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `segment_idx` \| int \| Segment index \|
	\| `type` \| str \| Special type \|
	\| `confidence` \| float \| 1 − edit_distance \|

	---

	### `alignment_detail[]`

	Per-segment DP alignment results. One entry per alignment attempt (primary + retries appear separately).

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `segment_idx` \| int \| 1-based segment display index \|
	\| `asr_phonemes` \| str \| Space-separated ASR phonemes (truncated to 60) \|
	\| `asr_phoneme_count` \| int \| Full phoneme count \|
	\| `window` \| object \| `{pointer, surah}` — DP search window info \|
	\| `expected_pointer` \| int \| Word pointer at time of alignment \|
	\| `retry_tier` \| str\\|null \| `null` for primary, `"tier1"` or `"tier2"` for retries \|
	\| `result` \| object\\|null \| Alignment result (null if failed) \|
	\| `timing` \| object \| `{window_setup_ms, dp_ms, result_build_ms}` \|
	\| `failed_reason` \| str\\|null \| Why alignment failed (if applicable) \|

	#### `result` (when present)

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `matched_ref` \| str \| Reference location (`"2:255:1-2:255:3"`) \|
	\| `start_word_idx` \| int \| First matched word index in chapter reference \|
	\| `end_word_idx` \| int \| Last matched word index \|
	\| `edit_cost` \| float \| Raw edit distance (with substitution costs) \|
	\| `confidence` \| float \| 1 − normalized_edit_distance \|
	\| `j_start` \| int \| Start position in reference phoneme window \|
	\| `best_j` \| int \| End position in reference phoneme window \|
	\| `basmala_consumed` \| bool \| Whether Basmala prefix was consumed \|
	\| `n_wraps` \| int \| Number of repetition wraps \|
	\| `wrap_points` \| array\\|null \| `[(i, j_end, j_start), ...]` for each wrap \|

	---

	### `events[]`

	Pipeline events in chronological order. Each has a `type` field plus event-specific data.

	#### Event Types

	\| Type \| Fields \| Description \|
	\|------\|--------\|-------------\|
	\| `gap` \| `position`, `segment_before`/`segment_after`/`segment_idx`, `missing_words` \| Missing words between consecutive segments or at boundaries \|
	\| `reanchor` \| `at_segment`, `reason`, `new_surah`, `new_ayah`, `new_pointer` \| Global re-anchor after consecutive failures or transition mode exit \|
	\| `chapter_transition` \| `at_segment`, `from_surah`, `to_surah` \| Sequential chapter boundary crossing \|
	\| `chapter_end` \| `at_segment`, `from_surah`, `next_action` \| End of chapter detected \|
	\| `basmala_fused` \| `segment_idx`, `fused_conf`, `plain_conf`, `chose` \| Basmala merged with first verse (chosen when fused > plain) \|
	\| `transition_detected` \| `segment_idx`, `transition_type`, `confidence`, `context` \| Non-Quranic transition segment (Amin, Takbir, Tahmeed, etc.) \|
	\| `tahmeed_merge` \| `segment_idx`, `merged_segment` \| Two Tahmeed segments merged \|
	\| `retry_tier1` \| `segment_idx`, `passed`, `confidence` \| Tier 1 retry succeeded \|
	\| `retry_tier2` \| `segment_idx`, `passed`, `confidence` \| Tier 2 retry succeeded \|
	\| `retry_failed` \| `segment_idx`, `tier1`, `tier2` \| All retry tiers exhausted \|

	---

	### `segments[]`

	Final alignment output (same schema as `/process_audio_session` response).

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `segment` \| int \| 1-based segment number \|
	\| `time_from` \| float \| Start time (seconds) \|
	\| `time_to` \| float \| End time (seconds) \|
	\| `ref_from` \| str \| Reference start (`"surah:ayah:word"`) \|
	\| `ref_to` \| str \| Reference end \|
	\| `matched_text` \| str \| Matched Quran text \|
	\| `confidence` \| float \| Alignment confidence (0–1) \|
	\| `has_missing_words` \| bool \| Gap detected before/after this segment \|
	\| `error` \| str\\|null \| Error message if alignment failed \|
	\| `special_type` \| str \| Present only for special segments \|