quranic-universal-aligner / docs /debug_api_schema.md
hetchyy's picture
Upload folder using huggingface_hub
602b5d3 verified

A newer version of the Gradio SDK is available: 6.11.0

Upgrade

Debug Process API β€” Response Schema

Hidden endpoint for development debugging. Returns comprehensive structured data from every pipeline stage.

Endpoint

POST /api/debug_process

Parameters

Parameter Type Description
audio_data Audio (numpy) Audio file to process
min_silence_ms int Minimum silence duration for VAD segment splitting
min_speech_ms int Minimum speech duration to keep a segment
pad_ms int Padding added to each segment boundary
model_name str ASR model: "Base" or "Large"
device str "GPU" or "CPU"
hf_token str HF token for authentication

Usage

from gradio_client import Client

client = Client("hetchyy/quranic-universal-aligner")
result = client.predict(
    "path/to/audio.mp3",
    300, 100, 50,        # silence, speech, pad
    "Base", "GPU",
    "hf_xxxx...",        # HF token
    api_name="/debug_process"
)

Response Schema

Top Level

{
  "status": "ok",
  "timestamp": "2026-04-03T12:00:00+00:00",
  "profiling": { ... },
  "vad": { ... },
  "asr": { ... },
  "anchor": { ... },
  "specials": { ... },
  "alignment_detail": [ ... ],
  "events": [ ... ],
  "segments": [ ... ]
}

On error: {"error": "message"} (auth failure, pipeline failure, no speech).


profiling

All timing fields from ProfilingData plus computed fields. Times in seconds unless noted.

Field Type Description
resample_time float Audio resampling to 16kHz
vad_model_load_time float VAD model loading
vad_model_move_time float VAD model GPU transfer
vad_inference_time float VAD model inference
vad_gpu_time float Actual VAD GPU execution
vad_wall_time float VAD wall-clock (includes queue wait)
asr_time float ASR wall-clock (includes queue wait)
asr_gpu_time float Actual ASR GPU execution
asr_model_move_time float ASR model GPU transfer
asr_sorting_time float Duration-sorting for batching
asr_batch_build_time float Dynamic batch construction
asr_batch_profiling array Per-batch timing (see below)
anchor_time float N-gram voting anchor detection
phoneme_total_time float Overall phoneme matching
phoneme_ref_build_time float Chapter reference build
phoneme_dp_total_time float Total DP across all segments
phoneme_dp_min_time float Min DP time per segment
phoneme_dp_max_time float Max DP time per segment
phoneme_dp_avg_time float Average DP time per segment (computed)
phoneme_window_setup_time float Total window slicing
phoneme_result_build_time float Result construction
phoneme_num_segments int Number of DP alignment calls
match_wall_time float Total matching wall-clock
tier1_attempts int Tier 1 retry attempts
tier1_passed int Tier 1 retries that succeeded
tier1_segments int[] Segment indices that went to tier 1
tier2_attempts int Tier 2 retry attempts
tier2_passed int Tier 2 retries that succeeded
tier2_segments int[] Segment indices that went to tier 2
consec_reanchors int Times consecutive-failure reanchor triggered
segments_attempted int Total segments processed
segments_passed int Segments that matched successfully
special_merges int Basmala-fused wins
transition_skips int Transition segments detected
phoneme_wraps_detected int Repetition wraps
result_build_time float Total result building
result_audio_encode_time float Audio int16 conversion
gpu_peak_vram_mb float Peak GPU VRAM (MB)
gpu_reserved_vram_mb float Reserved GPU VRAM (MB)
total_time float End-to-end pipeline time
summary_text str Formatted profiling summary (same as terminal output)

asr_batch_profiling[]

Field Type Description
batch_num int Batch index (1-based)
size int Number of segments in batch
time float Total batch processing time
feat_time float Feature extraction + GPU transfer
infer_time float Model inference
decode_time float CTC greedy decode
min_dur float Shortest audio in batch (seconds)
max_dur float Longest audio in batch (seconds)
avg_dur float Average audio duration
total_seconds float Sum of all segment durations
pad_waste float Fraction of padding waste (0–1)

vad

VAD segmentation details β€” raw model output vs. cleaned intervals.

Field Type Description
raw_interval_count int Intervals from VAD model before cleaning
raw_intervals float[][] [[start, end], ...] before silence merge / min_speech filter
cleaned_interval_count int Intervals after cleaning
cleaned_intervals float[][] [[start, end], ...] final segment boundaries
params object {min_silence_ms, min_speech_ms, pad_ms}

asr

ASR phoneme recognition results per segment.

Field Type Description
model_name str "Base" or "Large"
num_segments int Total segments transcribed
per_segment_phonemes array Per-segment phoneme output (see below)

per_segment_phonemes[]

Field Type Description
segment_idx int Segment index (0-based)
phonemes str[] Array of phoneme strings from CTC decode

anchor

N-gram voting for chapter/verse anchor detection.

Field Type Description
segments_used int Number of segments used for voting
combined_phoneme_count int Total phonemes in combined segments
ngrams_extracted int N-grams extracted from ASR output
ngrams_matched int N-grams found in Quran index
ngrams_missed int N-grams not in index
distinct_pairs int Distinct (surah, ayah) pairs voted for
surah_ranking array Candidate surahs ranked by best run weight
winner_surah int Winning surah number
winner_ayah int Starting ayah of best contiguous run
start_pointer int Word index corresponding to winner ayah

surah_ranking[]

Field Type Description
surah int Surah number
total_weight float Sum of all vote weights
best_run object {start_ayah, end_ayah, weight} β€” best contiguous ayah run

specials

Special segment detection (Isti'adha, Basmala, Takbir at recording start).

Field Type Description
candidates_tested array Every detection attempt with edit distance
detected array Confirmed special segments
first_quran_idx int Index where Quran content starts (after specials)

candidates_tested[]

Field Type Description
segment_idx int Which segment was tested
type str Candidate type ("Isti'adha", "Basmala", "Combined Isti'adha+Basmala", "Takbir")
edit_distance float Normalized edit distance (0 = exact match)
threshold float Maximum edit distance for acceptance
matched bool Whether distance ≀ threshold

detected[]

Field Type Description
segment_idx int Segment index
type str Special type
confidence float 1 βˆ’ edit_distance

alignment_detail[]

Per-segment DP alignment results. One entry per alignment attempt (primary + retries appear separately).

Field Type Description
segment_idx int 1-based segment display index
asr_phonemes str Space-separated ASR phonemes (truncated to 60)
asr_phoneme_count int Full phoneme count
window object {pointer, surah} β€” DP search window info
expected_pointer int Word pointer at time of alignment
retry_tier str|null null for primary, "tier1" or "tier2" for retries
result object|null Alignment result (null if failed)
timing object {window_setup_ms, dp_ms, result_build_ms}
failed_reason str|null Why alignment failed (if applicable)

result (when present)

Field Type Description
matched_ref str Reference location ("2:255:1-2:255:3")
start_word_idx int First matched word index in chapter reference
end_word_idx int Last matched word index
edit_cost float Raw edit distance (with substitution costs)
confidence float 1 βˆ’ normalized_edit_distance
j_start int Start position in reference phoneme window
best_j int End position in reference phoneme window
basmala_consumed bool Whether Basmala prefix was consumed
n_wraps int Number of repetition wraps
wrap_points array|null [(i, j_end, j_start), ...] for each wrap

events[]

Pipeline events in chronological order. Each has a type field plus event-specific data.

Event Types

Type Fields Description
gap position, segment_before/segment_after/segment_idx, missing_words Missing words between consecutive segments or at boundaries
reanchor at_segment, reason, new_surah, new_ayah, new_pointer Global re-anchor after consecutive failures or transition mode exit
chapter_transition at_segment, from_surah, to_surah Sequential chapter boundary crossing
chapter_end at_segment, from_surah, next_action End of chapter detected
basmala_fused segment_idx, fused_conf, plain_conf, chose Basmala merged with first verse (chosen when fused > plain)
transition_detected segment_idx, transition_type, confidence, context Non-Quranic transition segment (Amin, Takbir, Tahmeed, etc.)
tahmeed_merge segment_idx, merged_segment Two Tahmeed segments merged
retry_tier1 segment_idx, passed, confidence Tier 1 retry succeeded
retry_tier2 segment_idx, passed, confidence Tier 2 retry succeeded
retry_failed segment_idx, tier1, tier2 All retry tiers exhausted

segments[]

Final alignment output (same schema as /process_audio_session response).

Field Type Description
segment int 1-based segment number
time_from float Start time (seconds)
time_to float End time (seconds)
ref_from str Reference start ("surah:ayah:word")
ref_to str Reference end
matched_text str Matched Quran text
confidence float Alignment confidence (0–1)
has_missing_words bool Gap detected before/after this segment
error str|null Error message if alignment failed
special_type str Present only for special segments