Debug Process API — Response Schema

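Hidden endpoint for development debugging. It returns structured data from every pipeline stage: profiling, VAD, ASR, anchor detection, special-segment detection, per-segment alignment detail, pipeline events, and the final segments.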

Endpoint

POST /api/debug_process

Parameters

Parameter	Type	Description
`audio_data`	Audio (numpy)	Audio file to process
`min_silence_ms`	int	Minimum silence duration for VAD segment splitting
`min_speech_ms`	int	Minimum speech duration to keep a segment
`pad_ms`	int	Padding added to each segment boundary
`model_name`	str	ASR model: `"Base"` or `"Large"`
`device`	str	`"GPU"` or `"CPU"`
`hf_token`	str	HF token for authentication

Usage

from gradio_client import Client

# Connect to the hosted Space
client = Client("hetchyy/quranic-universal-aligner")
result = client.predict(
    "path/to/audio.mp3", # audio_data
    300, 100, 50,        # min_silence_ms, min_speech_ms, pad_ms
    "Base", "GPU",       # model_name, device
    "hf_xxxx...",        # hf_token
    api_name="/debug_process"
)

Response Schema

Top Level

{
  "status": "ok",
  "timestamp": "2026-04-03T12:00:00+00:00",
  "profiling": { ... },
  "vad": { ... },
  "asr": { ... },
  "anchor": { ... },
  "specials": { ... },
  "alignment_detail": [ ... ],
  "events": [ ... ],
  "segments": [ ... ]
}

On error, the response is {"error": "message"} instead of the structure above. This covers authentication failure, pipeline failure, and audio with no detected speech.
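
A caller can separate the two response shapes before reading any stage data. The sketch below assumes the result is already a Python dict; `DebugProcessError` and `unwrap_debug_response` are hypothetical names used only for illustration and are not part of the API.

```python
class DebugProcessError(RuntimeError):
    """Raised for the documented {"error": ...} shape (hypothetical helper)."""


def unwrap_debug_response(result: dict) -> dict:
    """Return the payload on success, raise on an error response."""
    if "error" in result:
        raise DebugProcessError(result["error"])
    if result.get("status") != "ok":
        # Assumption: any status other than "ok" is unexpected.
        raise DebugProcessError(f"unexpected status: {result.get('status')!r}")
    return result
```

Only the top-level `error` key is checked here; per-segment `error` values inside `segments[]` are a separate field.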

`profiling`

All timing fields from ProfilingData plus computed fields. Times in seconds unless noted.

Field	Type	Description
`resample_time`	float	Audio resampling to 16kHz
`vad_model_load_time`	float	VAD model loading
`vad_model_move_time`	float	VAD model GPU transfer
`vad_inference_time`	float	VAD model inference
`vad_gpu_time`	float	Actual VAD GPU execution
`vad_wall_time`	float	VAD wall-clock (includes queue wait)
`asr_time`	float	ASR wall-clock (includes queue wait)
`asr_gpu_time`	float	Actual ASR GPU execution
`asr_model_move_time`	float	ASR model GPU transfer
`asr_sorting_time`	float	Duration-sorting for batching
`asr_batch_build_time`	float	Dynamic batch construction
`asr_batch_profiling`	array	Per-batch timing (see below)
`anchor_time`	float	N-gram voting anchor detection
`phoneme_total_time`	float	Overall phoneme matching
`phoneme_ref_build_time`	float	Chapter reference build
`phoneme_dp_total_time`	float	Total DP across all segments
`phoneme_dp_min_time`	float	Min DP time per segment
`phoneme_dp_max_time`	float	Max DP time per segment
`phoneme_dp_avg_time`	float	Average DP time per segment (computed)
`phoneme_window_setup_time`	float	Total window slicing
`phoneme_result_build_time`	float	Result construction
`phoneme_num_segments`	int	Number of DP alignment calls
`match_wall_time`	float	Total matching wall-clock
`tier1_attempts`	int	Tier 1 retry attempts
`tier1_passed`	int	Tier 1 retries that succeeded
`tier1_segments`	int[]	Segment indices that went to tier 1
`tier2_attempts`	int	Tier 2 retry attempts
`tier2_passed`	int	Tier 2 retries that succeeded
`tier2_segments`	int[]	Segment indices that went to tier 2
`consec_reanchors`	int	Times consecutive-failure reanchor triggered
`segments_attempted`	int	Total segments processed
`segments_passed`	int	Segments that matched successfully
`special_merges`	int	Basmala-fused wins
`transition_skips`	int	Transition segments detected
`phoneme_wraps_detected`	int	Repetition wraps
`result_build_time`	float	Total result building
`result_audio_encode_time`	float	Audio int16 conversion
`gpu_peak_vram_mb`	float	Peak GPU VRAM (MB)
`gpu_reserved_vram_mb`	float	Reserved GPU VRAM (MB)
`total_time`	float	End-to-end pipeline time
`summary_text`	str	Formatted profiling summary (same as terminal output)
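
A few ratios can be derived from these fields. The following is a minimal sketch, assuming `profiling` is the dict from the response; the function name and output keys are made up for illustration. Wall-clock minus GPU time is reported as "non-GPU" time because it may include more than queue wait.

```python
def profiling_overview(profiling: dict) -> dict:
    """Derive simple ratios from the documented profiling fields."""
    attempted = profiling["segments_attempted"]
    return {
        # Wall-clock minus GPU execution: queue wait plus any other overhead.
        "vad_non_gpu_s": profiling["vad_wall_time"] - profiling["vad_gpu_time"],
        "asr_non_gpu_s": profiling["asr_time"] - profiling["asr_gpu_time"],
        # Share of processed segments that matched successfully.
        "pass_rate": profiling["segments_passed"] / attempted if attempted else 0.0,
    }
```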

`asr_batch_profiling[]`

Field	Type	Description
`batch_num`	int	Batch index (1-based)
`size`	int	Number of segments in batch
`time`	float	Total batch processing time
`feat_time`	float	Feature extraction + GPU transfer
`infer_time`	float	Model inference
`decode_time`	float	CTC greedy decode
`min_dur`	float	Shortest audio in batch (seconds)
`max_dur`	float	Longest audio in batch (seconds)
`avg_dur`	float	Average audio duration
`total_seconds`	float	Sum of all segment durations
`pad_waste`	float	Fraction of padding waste (0–1)
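
The per-batch entries lend themselves to a quick throughput check. This sketch assumes `batches` is the `asr_batch_profiling` array as a list of dicts; `summarize_batches` and its output keys are hypothetical.

```python
def summarize_batches(batches: list) -> dict:
    """Summarize ASR batches: overall throughput and the most padded batch."""
    total_audio = sum(b["total_seconds"] for b in batches)
    total_time = sum(b["time"] for b in batches)
    worst = max(batches, key=lambda b: b["pad_waste"], default=None)
    return {
        # Seconds of audio transcribed per second of batch processing time.
        "audio_seconds_per_second": total_audio / total_time if total_time else 0.0,
        # batch_num of the batch with the largest padding fraction.
        "worst_pad_waste_batch": worst["batch_num"] if worst else None,
    }
```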

`vad`

VAD segmentation details — raw model output vs. cleaned intervals.

Field	Type	Description
`raw_interval_count`	int	Intervals from VAD model before cleaning
`raw_intervals`	float[][]	`[[start, end], ...]` before silence merge / min_speech filter
`cleaned_interval_count`	int	Intervals after cleaning
`cleaned_intervals`	float[][]	`[[start, end], ...]` final segment boundaries
`params`	object	`{min_silence_ms, min_speech_ms, pad_ms}`
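
To see what cleaning did to the raw VAD output, the interval lists can be reduced to a couple of numbers. This is a sketch only: `vad_overview` is a hypothetical helper, and it assumes interval bounds are in seconds, which the table does not state explicitly.

```python
def vad_overview(vad: dict) -> dict:
    """Total speech time and the change in interval count after cleaning."""
    speech = sum(end - start for start, end in vad["cleaned_intervals"])
    return {
        "speech_seconds": speech,  # assumes bounds are in seconds
        # Negative when cleaning merged or dropped intervals.
        "interval_delta": vad["cleaned_interval_count"] - vad["raw_interval_count"],
    }
```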

`asr`

ASR phoneme recognition results per segment.

Field	Type	Description
`model_name`	str	`"Base"` or `"Large"`
`num_segments`	int	Total segments transcribed
`per_segment_phonemes`	array	Per-segment phoneme output (see below)

`per_segment_phonemes[]`

Field	Type	Description
`segment_idx`	int	Segment index (0-based)
`phonemes`	str[]	Array of phoneme strings from CTC decode

`anchor`

N-gram voting for chapter/verse anchor detection.

Field	Type	Description
`segments_used`	int	Number of segments used for voting
`combined_phoneme_count`	int	Total phonemes in combined segments
`ngrams_extracted`	int	N-grams extracted from ASR output
`ngrams_matched`	int	N-grams found in Quran index
`ngrams_missed`	int	N-grams not in index
`distinct_pairs`	int	Distinct (surah, ayah) pairs voted for
`surah_ranking`	array	Candidate surahs ranked by best run weight
`winner_surah`	int	Winning surah number
`winner_ayah`	int	Starting ayah of best contiguous run
`start_pointer`	int	Word index corresponding to winner ayah

`surah_ranking[]`

Field	Type	Description
`surah`	int	Surah number
`total_weight`	float	Sum of all vote weights
`best_run`	object	`{start_ayah, end_ayah, weight}` — best contiguous ayah run
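
The anchor fields can be cross-checked against each other. The sketch below uses a hypothetical helper, `anchor_overview`, and assumes that a larger `best_run` weight ranks first; it re-sorts the ranking rather than trusting array order.

```python
def anchor_overview(anchor: dict) -> dict:
    """N-gram hit rate and whether the winner is the top-ranked surah."""
    extracted = anchor["ngrams_extracted"]
    # Assumption: a larger best_run weight means a better candidate.
    ranking = sorted(
        anchor["surah_ranking"],
        key=lambda entry: entry["best_run"]["weight"],
        reverse=True,
    )
    top = ranking[0]["surah"] if ranking else None
    return {
        "ngram_hit_rate": anchor["ngrams_matched"] / extracted if extracted else 0.0,
        "top_ranked_surah": top,
        "winner_is_top_ranked": top == anchor["winner_surah"],
    }
```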

`specials`

Special segment detection (Isti'adha, Basmala, Takbir at recording start).

Field	Type	Description
`candidates_tested`	array	Every detection attempt with edit distance
`detected`	array	Confirmed special segments
`first_quran_idx`	int	Index where Quran content starts (after specials)

`candidates_tested[]`

Field	Type	Description
`segment_idx`	int	Which segment was tested
`type`	str	Candidate type (`"Isti'adha"`, `"Basmala"`, `"Combined Isti'adha+Basmala"`, `"Takbir"`)
`edit_distance`	float	Normalized edit distance (0 = exact match)
`threshold`	float	Maximum edit distance for acceptance
`matched`	bool	Whether distance ≤ threshold

`detected[]`

Field	Type	Description
`segment_idx`	int	Segment index
`type`	str	Special type
`confidence`	float	1 − edit_distance
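
The two tables above imply a simple relationship: a candidate is accepted when its distance is at most the threshold, and confidence is one minus the distance. The sketch below recomputes both from `candidates_tested`; `accepted_candidates` is a hypothetical helper, and whether every accepted candidate also appears in `detected` is not stated here.

```python
def accepted_candidates(specials: dict) -> list:
    """Recompute acceptance and confidence from candidates_tested."""
    return [
        {
            "segment_idx": c["segment_idx"],
            "type": c["type"],
            "confidence": 1.0 - c["edit_distance"],
        }
        for c in specials["candidates_tested"]
        if c["edit_distance"] <= c["threshold"]  # same rule as `matched`
    ]
```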

`alignment_detail[]`

Per-segment DP alignment results. One entry per alignment attempt (primary + retries appear separately).

Field	Type	Description
`segment_idx`	int	1-based segment display index
`asr_phonemes`	str	Space-separated ASR phonemes (truncated to 60)
`asr_phoneme_count`	int	Full phoneme count
`window`	object	`{pointer, surah}` — DP search window info
`expected_pointer`	int	Word pointer at time of alignment
`retry_tier`	str|null	`null` for primary, `"tier1"` or `"tier2"` for retries
`result`	object|null	Alignment result (null if failed)
`timing`	object	`{window_setup_ms, dp_ms, result_build_ms}`
`failed_reason`	str|null	Why alignment failed (if applicable)
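
Because primary attempts and retries are separate entries, it helps to group them by segment. A minimal sketch, assuming `alignment_detail` is a list of dicts shaped as in the table above; `attempts_by_segment` is a hypothetical name.

```python
from collections import defaultdict


def attempts_by_segment(alignment_detail: list) -> dict:
    """Map each segment_idx to its attempt tiers, in order of appearance."""
    grouped = defaultdict(list)
    for entry in alignment_detail:
        # retry_tier is null (None) for the primary attempt.
        grouped[entry["segment_idx"]].append(entry["retry_tier"] or "primary")
    return dict(grouped)
```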

`result` (when present)

Field	Type	Description
`matched_ref`	str	Reference location (`"2:255:1-2:255:3"`)
`start_word_idx`	int	First matched word index in chapter reference
`end_word_idx`	int	Last matched word index
`edit_cost`	float	Raw edit distance (with substitution costs)
`confidence`	float	1 − normalized_edit_distance
`j_start`	int	Start position in reference phoneme window
`best_j`	int	End position in reference phoneme window
`basmala_consumed`	bool	Whether Basmala prefix was consumed
`n_wraps`	int	Number of repetition wraps
`wrap_points`	array|null	`[(i, j_end, j_start), ...]` for each wrap
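
The `matched_ref` string can be split into numeric start and end positions. This sketch assumes the value always follows the `surah:ayah:word-surah:ayah:word` form of the example above; `parse_matched_ref` is a hypothetical helper.

```python
def parse_matched_ref(ref: str) -> tuple:
    """Split 'surah:ayah:word-surah:ayah:word' into two integer triples."""
    start, end = ref.split("-")
    return (
        tuple(int(part) for part in start.split(":")),
        tuple(int(part) for part in end.split(":")),
    )
```

For example, `parse_matched_ref("2:255:1-2:255:3")` yields `((2, 255, 1), (2, 255, 3))`.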

`events[]`

Pipeline events in chronological order. Each event has a `type` field plus event-specific data.

Event Types

Type	Fields	Description
`gap`	`position`, `segment_before`/`segment_after`/`segment_idx`, `missing_words`	Missing words between consecutive segments or at boundaries
`reanchor`	`at_segment`, `reason`, `new_surah`, `new_ayah`, `new_pointer`	Global re-anchor after consecutive failures or transition mode exit
`chapter_transition`	`at_segment`, `from_surah`, `to_surah`	Sequential chapter boundary crossing
`chapter_end`	`at_segment`, `from_surah`, `next_action`	End of chapter detected
`basmala_fused`	`segment_idx`, `fused_conf`, `plain_conf`, `chose`	Basmala merged with first verse (chosen when fused > plain)
`transition_detected`	`segment_idx`, `transition_type`, `confidence`, `context`	Non-Quranic transition segment (Amin, Takbir, Tahmeed, etc.)
`tahmeed_merge`	`segment_idx`, `merged_segment`	Two Tahmeed segments merged
`retry_tier1`	`segment_idx`, `passed`, `confidence`	Tier 1 retry succeeded
`retry_tier2`	`segment_idx`, `passed`, `confidence`	Tier 2 retry succeeded
`retry_failed`	`segment_idx`, `tier1`, `tier2`	All retry tiers exhausted
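
Since every event carries a `type`, a quick tally gives a first picture of a run. A minimal sketch, assuming `events` is a list of dicts; `count_events` is a hypothetical name.

```python
from collections import Counter


def count_events(events: list) -> Counter:
    """Count pipeline events by their type field."""
    return Counter(event["type"] for event in events)
```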

`segments[]`

Final alignment output (same schema as /process_audio_session response).

Field	Type	Description
`segment`	int	1-based segment number
`time_from`	float	Start time (seconds)
`time_to`	float	End time (seconds)
`ref_from`	str	Reference start (`"surah:ayah:word"`)
`ref_to`	str	Reference end
`matched_text`	str	Matched Quran text
`confidence`	float	Alignment confidence (0–1)
`has_missing_words`	bool	Gap detected before/after this segment
`error`	str|null	Error message if alignment failed
`special_type`	str	Present only for special segments
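
One way to use the final output is to list segments worth a manual look. The sketch below is illustrative only: `segments_needing_review` is a hypothetical helper, and the `0.8` confidence cutoff is an arbitrary assumption, not a value used by the pipeline.

```python
def segments_needing_review(segments: list, min_confidence: float = 0.8) -> list:
    """Return segment numbers with an error, a gap, or low confidence."""
    flagged = []
    for seg in segments:
        if (
            seg.get("error")
            or seg["has_missing_words"]
            or seg["confidence"] < min_confidence  # arbitrary cutoff
        ):
            flagged.append(seg["segment"])
    return flagged
```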
