hetchyy Claude Opus 4.6 committed on
Commit 6cdb091 · 1 Parent(s): 0351f22

Add session-based API endpoints for stateless client access


Implement 4 endpoints (process_audio_session, resegment_session,
retranscribe_session, realign_from_timestamps) that persist session
data to /tmp/aligner_sessions so gradio_client consumers can reuse
cached audio and VAD results across follow-up calls without gr.State.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
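The intended call pattern is: one `process_audio_session` call returns an `audio_id`, and follow-up calls pass that id instead of re-uploading audio. A hedged `gradio_client` sketch — the endpoint names come from this commit, but the `api_name` routes, argument order, and Space id are assumptions, not taken from the diff:

```python
# Hypothetical client flow for the session endpoints. Routes and argument
# order are assumed from the wrappers in src/api/session_api.py.

def extract_audio_id(response):
    """Pull the reusable session handle out of an endpoint response dict."""
    if response.get("error"):
        return None
    return response.get("audio_id")

if __name__ == "__main__":
    from gradio_client import Client, handle_file  # pip install gradio_client

    client = Client("<space-id>")  # the Space hosting these endpoints

    first = client.predict(
        handle_file("recitation.wav"), 300, 500, 100, "Base", "GPU",
        api_name="/process_audio_session",
    )
    audio_id = extract_audio_id(first)

    # Follow-up calls reuse the cached audio + VAD results server-side,
    # so no re-upload and no second VAD pass:
    refined = client.predict(
        audio_id, 200, 400, 80, "Base", "GPU",
        api_name="/resegment_session",
    )
```

Because the session lives in `/tmp/aligner_sessions` rather than `gr.State`, this works from any stateless HTTP client, not just a browser session.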

.gitignore CHANGED
@@ -49,6 +49,5 @@ test_api.py
 data/api_result.json
 
 CLAUDE.md
-inference_optimization.md
 
 docs/
CLAUDE.md CHANGED
@@ -2,83 +2,184 @@
 
 This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
 
+**Keep this file up to date.** After any file/folder structure change, update the tree below without asking. After implementing features or making architectural changes, suggest additions to this file explaining why they would help future context.
+
 ## Project Overview
 
 Quran recitation alignment tool that segments audio recordings and aligns them with Quranic text using phoneme-based ASR and dynamic programming. Deployed as a Hugging Face Space with Gradio.
 
 **Pipeline:** Audio → Preprocessing (16kHz mono) → VAD Segmentation → Phoneme ASR (wav2vec2) → Special Segment Detection (Basmala/Isti'adha) → N-gram Anchor Voting → DP Alignment → Word-level Timestamps (optional via external MFA) → UI Rendering
 
-## Architecture
-
-### Entry Point
-
-`app.py` (~85 lines) — Bootstrap entry point: path setup, Cython build, imports `build_interface()` from `src/ui/interface.py`, and `__main__` block with model preloading.
-
-### Top-level Modules (`src/`)
-
-- **`src/pipeline.py`** — GPU-decorated pipeline functions: VAD+ASR GPU leases, post-VAD alignment pipeline, `process_audio`, `resegment_audio`, `retranscribe_audio`, `save_json_export`.
-- **`src/mfa.py`** — MFA forced-alignment integration: upload/submit to external MFA Space, SSE result polling, progress bar HTML, and `compute_mfa_timestamps` generator that injects word/letter timestamps into segment HTML.
-
-### Core Infrastructure (`src/core/`)
-
-- **`segment_types.py`** — Shared dataclasses (`VadSegment`, `SegmentInfo`, `ProfilingData`).
-- **`quran_index.py`** — Quran text index for reference lookups.
-- **`zero_gpu.py`** — `@gpu_with_fallback` decorator for ZeroGPU quota handling with automatic CPU fallback.
-- **`usage_logger.py`** — HF Dataset logging (ParquetScheduler for alignment runs).
-
-### Alignment (`src/alignment/`)
-
-- **`alignment_pipeline.py`** — Main alignment orchestrator. Coordinates ASR anchor detection → DP alignment.
-- **`phoneme_asr.py`** — wav2vec2 CTC inference with dynamic batching (duration-based batch construction to minimize padding waste).
-- **`phoneme_anchor.py`** — N-gram rarity-weighted voting to determine which chapter/verse a segment belongs to.
-- **`phoneme_matcher.py`** — Substring Levenshtein DP alignment between ASR phonemes and reference Quran phonemes. Uses windowed alignment with lookback/lookahead.
-- **`_dp_core.pyx`** — Cython-accelerated DP inner loop (10-20x speedup). Falls back to pure Python if not compiled.
-- **`phonemizer_utils.py`** — Phonemizer wrapper for Arabic/Quranic text phonemization.
-- **`special_segments.py`** — Detects Basmala and Isti'adha via phoneme edit distance.
-- **`phoneme_matcher_cache.py`** — Pre-loads and caches phonemized chapter references from `data/phoneme_cache.pkl`.
-- **`ngram_index.py`** — N-gram index data structure used by anchor voting, loaded from `data/phoneme_ngram_index_5.pkl`.
-
-### Segmenter (`src/segmenter/`)
-
-- **`segmenter_model.py`** — Model lifecycle and device management for the VAD segmenter.
-- **`segmenter_aoti.py`** — Ahead-of-time compiled model support.
-- **`vad.py`** — Voice activity detection and speech segment extraction.
-
-### UI (`src/ui/`)
-
-- **`interface.py`** — `build_interface()`: Gradio layout (CSS, JS animation system, component definitions).
-- **`event_wiring.py`** — Connects all Gradio component events.
-- **`handlers.py`** — Python event handler functions.
-- **`segments.py`** — Segment rendering helpers (HTML cards, confidence classes, timestamps, audio encoding).
-- **`styles.py`** — CSS builder.
-- **`js_config.py`** — JS configuration bridge.
-
-### Configuration
-
-`config.py` — Centralized settings: model paths, alignment hyperparameters (edit costs, thresholds, window sizes), segmentation presets (Mujawwad/Murattal/Fast), batching strategy, UI settings, and debug flags.
-
-### Data Files (`data/`)
-
-- `phoneme_cache.pkl` (7.9MB) — Pre-phonemized Quran text for all 114 chapters
-- `phoneme_ngram_index_5.pkl` (6.2MB) — 5-gram index for anchor detection
-- `phoneme_sub_costs.json` — Custom phoneme substitution cost matrix
-- `digital_khatt_v2_script.json` (14.8MB) — Full Quran text with positional metadata
-- `surah_info.json` — Chapter metadata (names, verse counts)
-- `font_data.py` — Base64-encoded Arabic fonts for offline rendering
-
-### Models
+## Commands
+
+```bash
+# Run locally
+python app.py          # Start on port 7860
+python app.py --share  # With public HF link
+
+# Build Cython DP extension (auto-attempted on startup, falls back to pure Python)
+python setup.py build_ext --inplace
+
+# Rebuild data caches (run offline, not during serving)
+python scripts/build_phoneme_cache.py
+python scripts/build_phoneme_ngram_index.py
+```
+
+## File Tree
+
+```
+├── app.py                           # ~85 lines — Bootstrap only: path setup, Cython build, build_interface(), model preloading
+├── config.py                        # All constants, hyperparameters, model paths, presets, UI settings, debug flags
+├── align_config.py                  # Override config for constrained (known-surah) alignment (tighter windows, no debug)
+├── setup.py                         # Cython build for _dp_core.pyx
+├── requirements.txt                 # Pinned deps: torch 2.8, transformers 5.0, gradio >=6.5.1
+│
+├── src/
+│   ├── pipeline.py                  # GPU-decorated pipeline: VAD+ASR leases, post-VAD alignment, process/resegment/retranscribe/realign
+│   ├── mfa.py                       # MFA forced-alignment: upload to external Space, SSE polling, timestamp injection into HTML
+│   │
+│   ├── api/
+│   │   └── session_api.py           # Session persistence + 4 endpoint wrappers (process/resegment/retranscribe/realign)
+│   │
+│   ├── core/
+│   │   ├── segment_types.py         # Dataclasses: VadSegment, SegmentInfo, ProfilingData (50+ timing fields)
+│   │   ├── quran_index.py           # QuranIndex: dual-script word lookup (QPC Hafs for compute, DigitalKhatt for display)
+│   │   ├── zero_gpu.py              # @gpu_with_fallback decorator: ZeroGPU quota detection, automatic CPU fallback
+│   │   └── usage_logger.py          # HF Dataset logging: ParquetScheduler, audio embedding, error JSONL fallback
+│   │
+│   ├── alignment/
+│   │   ├── alignment_pipeline.py    # Orchestrator: sequential alignment with retry tiers, re-anchoring, chapter transitions
+│   │   ├── phoneme_asr.py           # wav2vec2 CTC inference with dynamic batching (duration-based, padding waste minimization)
+│   │   ├── phoneme_anchor.py        # N-gram rarity-weighted voting: determines chapter/verse anchor point
+│   │   ├── phoneme_matcher.py       # Substring Levenshtein DP with word-boundary constraints and position prior
+│   │   ├── _dp_core.pyx             # Cython DP inner loop (10-20x speedup), pure Python fallback
+│   │   ├── special_segments.py      # Basmala/Isti'adha detection via phoneme edit distance (threshold 0.35)
+│   │   ├── phoneme_matcher_cache.py # Pre-loads ChapterReference objects from phoneme_cache.pkl
+│   │   ├── ngram_index.py           # PhonemeNgramIndex dataclass, loaded from pickle
+│   │   └── phonemizer_utils.py      # Singleton wrapper for Quranic Phonemizer
+│   │
+│   ├── segmenter/
+│   │   ├── segmenter_model.py       # VAD model lifecycle: load, GPU/CPU movement, device management
+│   │   ├── segmenter_aoti.py        # Ahead-of-time compilation via torch.export for ZeroGPU persistence
+│   │   └── vad.py                   # VAD inference: detect_speech_segments() with interval cleaning
+│   │
+│   └── ui/
+│       ├── interface.py             # build_interface(): Gradio Blocks layout, component definitions, state components
+│       ├── event_wiring.py          # Connects Gradio component events to handlers and pipeline functions
+│       ├── handlers.py              # Python callbacks: preset buttons, slider wiring, animation mode changes
+│       ├── segments.py              # Segment card HTML rendering: confidence badges, verse markers, audio players
+│       ├── styles.py                # CSS: fonts, segment cards, confidence colors, mega card, animation UI
+│       ├── js_config.py             # Python→JS bridge: exports config as window.* globals, concatenates JS files
+│       └── static/
+│           ├── animation-core.js    # Per-segment animation: audio warmup, element caching, window opacity engine, tick loop
+│           └── animate-all.js       # Mega card: builds unified text flow, deduplicates shared words, click-to-seek
+│
+├── data/
+│   ├── phoneme_cache.pkl            # 7.9MB — Pre-phonemized Quran text (114 chapters)
+│   ├── phoneme_ngram_index_5.pkl    # 6.2MB — 5-gram index for anchor voting
+│   ├── phoneme_sub_costs.json       # Custom phoneme substitution cost matrix
+│   ├── digital_khatt_v2_script.json # 14.8MB — Full Quran text with positional metadata
+│   ├── qpc_hafs.json                # QPC Hafs Quran text (computational reference)
+│   ├── surah_info.json              # Chapter metadata (names, verse counts)
+│   ├── ligatures.json               # Surah name ligature mappings for DigitalKhatt font
+│   ├── font_data.py                 # Base64-encoded Arabic fonts for offline rendering
+│   ├── DigitalKhattV2.otf           # Arabic Quran font
+│   └── surah-name-v2.ttf            # Surah name ligature font
+│
+├── scripts/
+│   ├── build_phoneme_cache.py       # Generate phoneme_cache.pkl from Quran text
+│   ├── build_phoneme_ngram_index.py # Generate phoneme_ngram_index_5.pkl from cache
+│   ├── export_onnx.py               # Export models to ONNX format
+│   ├── add_open_tanween.py          # Text preprocessing: add open tanween marks
+│   └── fix_stop_sign_spacing.py     # Text preprocessing: fix stop sign spacing
+│
+├── tests/
+│   └── test_session_api.py          # Integration tests for session API (requires running server)
+│
+├── docs/
+│   ├── api.md                       # API endpoint documentation (current + planned)
+│   ├── client_api.md                # Client-side API docs
+│   └── usage-logging.md             # Usage logging schema and design
+│
+└── usage_logs/errors/               # Runtime error JSONL files (fallback when Hub upload fails)
+```
+
+## Architecture Principles
+
+**`app.py` must stay minimal** (~85 lines). It only bootstraps: path setup, Cython build, `build_interface()`, and model preloading. All logic lives in `src/`.
+
+**All constants go in `config.py`.** Model paths, thresholds, window sizes, edit costs, UI settings, presets, slider ranges, debug flags — everything configurable lives here. Never hardcode magic numbers in module code.
+
+## DP Alignment Algorithm
+
+The core alignment (`phoneme_matcher.py`) uses **substring Levenshtein DP** with word-boundary constraints to find where ASR phonemes best match within the Quran reference:
+
+1. **Windowed search:** A window of `LOOKBACK_WORDS` (15) before and `LOOKAHEAD_WORDS` (10) after the current pointer defines the search region. Pre-flattened phoneme arrays avoid per-segment rebuilds.
+2. **Word-boundary constraints:** DP start positions must align with word boundaries (INF cost elsewhere). Only word-end positions are evaluated as candidates.
+3. **Position prior:** Adds `START_PRIOR_WEIGHT` (0.005) penalty per word away from the expected position, biasing sequential matching.
+4. **Edit costs:** Substitution (1.0), insertion (1.0), deletion (0.8). Custom substitution costs from `phoneme_sub_costs.json` for phonetically similar pairs.
+5. **Scoring:** `normalized_edit_distance + position_prior`. Confidence = `1 - normalized_distance`.
+6. **Cython acceleration:** `_dp_core.pyx` provides 10-20x speedup for the inner loop. Falls back to pure Python if not compiled.
+
+### Special Cases
+
+- **Basmala/Isti'adha detection** (`special_segments.py`): Before main alignment, checks first segments against hardcoded phoneme sequences using edit distance (threshold 0.35). If a combined Isti'adha+Basmala is detected in one segment, it splits at the midpoint.
+- **Fused Basmala:** After chapter transitions, tries prepending Basmala phonemes to the first verse segment and compares confidence with plain alignment. Picks the better match.
+- **N-gram anchor voting** (`phoneme_anchor.py`): Extracts 5-grams from ASR output, looks up in pre-built index, weights by `1/count` (rarity). Finds best contiguous ayah run, trims edges below 15% of max weight.
+- **Graduated retry on failure** (`alignment_pipeline.py`):
+  - Tier 1: Expanded window (60 lookback, 40 lookahead), same threshold
+  - Tier 2: Expanded window + relaxed threshold (0.45)
+- **Re-anchoring:** After 2 consecutive failures (`MAX_CONSECUTIVE_FAILURES`), runs n-gram voting on remaining segments to jump to a new position within the surah.
+- **Chapter transitions:** When the pointer exceeds chapter end, detects inter-chapter specials and moves to the next chapter. After Surah 1, triggers global re-anchor.
+
+## Animation System
+
+Two animation modes, both driven by `requestAnimationFrame` tick loops matching `audio.currentTime` to word/character timestamps:
+
+### Per-Segment Animation (`animation-core.js`)
+Each segment card has an "Animate" button. On click: builds word/char element caches from `.word`/`.char` spans, activates lazy audio, starts RAF loop. The tick function uses a **fast path** (check current word → next word, covers ~99% of frames) with full-scan fallback for seeking.
+
+### Mega Card Animation (`animate-all.js`)
+"Animate All" builds a **unified text flow** from all segment cards: clones word elements, deduplicates shared positions (overlapping segment boundaries), inserts surah separators with ligature font, handles fused Basmala prefixes. Uses a single `<audio>` element for the full recording. Segment transitions are boundary-driven (when `currentTime >= segEndTime`, advance to the next segment's tick loop).
+
+### Window Opacity Engine
+Both modes use the same windowing system: configurable prev/after word counts with opacity gradients. Display modes (Reveal, Fade, Spotlight, Isolate, Consume, Custom) are presets that set opacity + window size. Verse-only mode hides all words outside the current verse. Settings persist to `localStorage`.
+
+Click-to-seek in mega card: click a word → find its segment from timing, reset highlights, seek unified audio.
+
+## Profiling & Performance
+
+**Always consider performance when adding features.** The `ProfilingData` dataclass tracks 50+ timing fields across every pipeline stage: resampling, VAD (model load, inference, GPU time), ASR (per-batch timing, padding waste), anchor detection, DP alignment (per-segment min/max/avg), retry counts, result building, and audio encoding.
+
+Key optimizations to maintain:
+- **Dynamic batching** (ASR): Groups segments by duration to minimize padding waste (max 15%). Tracks `pad_waste` per batch.
+- **Pre-flattened phoneme arrays** (DP): Chapter references pre-concatenate all word phonemes with offset mapping, avoiding per-segment array construction.
+- **Lazy audio loading** (UI): Audio elements use `data-src` with a play button; `<audio>` controls only activate on click. First 5 segments use `preload="auto"`.
+- **Audio warmup** (JS): `pointerdown` event primes AudioContext + silent WAV before first play.
+- **RAF fast path** (animation): Checks current/next word index before falling back to full scan.
+- **Cython DP core:** 10-20x speedup for the alignment inner loop.
+- **AoT compilation** (ZeroGPU): Compiles VAD model ahead-of-time for persistence across GPU leases.
+
+## Audio & Temp Storage
+
+Audio files use HF Spaces' `/tmp` directory. `SEGMENT_AUDIO_DIR = /tmp/segments`. Per-segment WAVs are written to a UUID-keyed subdirectory for each run. The full-recording WAV is written separately for mega card playback. Gradio's `allowed_paths=["/tmp"]` enables serving these files. Cache cleanup runs every 5 hours (`DELETE_CACHE_FREQUENCY`), deleting files older than 5 hours.
+
+Audio preprocessing: resample to 16kHz mono via librosa (`soxr_lq` for speed), normalize int16/int32/float32 → float32, stereo → mono by averaging.
+
+## Models
 
 | Model | ID | Purpose |
 |-------|----|---------|
 | VAD | `obadx/recitation-segmenter-v2` | Voice activity detection |
 | ASR Base | `hetchyy/r15_95m` | Phoneme recognition (95M params) |
-| ASR Large | `hetchyy/r7` | Phoneme recognition (higher accuracy, slower) |
+| ASR Large | `hetchyy/r7` | Phoneme recognition (higher accuracy, 3x slower) |
 | MFA | External Space `hetchyy-quran-phoneme-mfa` | Word-level forced alignment |
 
-### Key Patterns
+## Key Patterns
 
-- **State caching:** Preprocessed audio, VAD intervals, and segment boundaries are cached in Gradio `gr.State` to allow resegmentation/retranscription without re-uploading.
-- **Environment detection:** `IS_HF_SPACE` flag switches behavior for HF Spaces deployment (ZeroGPU, model preloading).
-- **Retry/re-anchor:** Alignment retries with expanded windows on failure; re-anchors after `MAX_CONSECUTIVE_FAILURES` (2) consecutive failures.
+- **State caching:** Preprocessed audio, raw VAD intervals, and segment boundaries are cached in `gr.State` to allow resegment/retranscribe without re-uploading or re-running VAD.
+- **GPU quota management:** `@gpu_with_fallback` decorator detects ZeroGPU quota exhaustion, parses reset time, falls back to CPU with `gr.Warning()` toast.
+- **Idempotent model movement:** `ensure_models_on_gpu()`/`ensure_models_on_cpu()` check current device before moving.
 - **Confidence scoring:** Green ≥80%, Yellow 60-79%, Red <60%.
+- **Dual-script Quran text:** QPC Hafs for phoneme computation, DigitalKhatt for display rendering (proper Arabic typography with verse markers as combining marks).
+- **Usage logging:** Alignment runs logged to HF Dataset via ParquetScheduler. Audio embedded as bytes. Error fallback to local JSONL.
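The scoring that the DP Alignment section describes can be sketched in a few lines of plain Python. This is a simplified illustration only, using the listed costs (substitution 1.0, insertion 1.0, deletion 0.8) and omitting the word-boundary constraints, position prior, custom substitution matrix, and Cython path that the real `phoneme_matcher.py` adds:

```python
# Substring Levenshtein: align the ASR phoneme sequence against ANY substring
# of the reference window (free start/end on the reference side), then
# confidence = 1 - (best edit distance / len(asr)).
SUB, INS, DEL = 1.0, 1.0, 0.8  # costs from the CLAUDE.md description

def substring_confidence(asr, ref):
    n, m = len(asr), len(ref)
    # Row 0 is all zeros: the match may start anywhere in the reference.
    prev = [0.0] * (m + 1)
    for i in range(1, n + 1):
        cur = [prev[0] + INS] + [0.0] * m
        for j in range(1, m + 1):
            cur[j] = min(
                prev[j - 1] + (0.0 if asr[i - 1] == ref[j - 1] else SUB),
                prev[j] + INS,      # extra phoneme in the ASR output
                cur[j - 1] + DEL,   # reference phoneme missing from ASR
            )
        prev = cur
    # Free end: take the best cost over all reference end positions.
    return 1.0 - min(prev) / max(n, 1)
```

For example, `substring_confidence(list("milla"), list("bismillah"))` is `1.0` (exact substring), while one substituted phoneme in a 5-phoneme segment yields `1 - 1/5 = 0.8` — the boundary between the green and yellow confidence bands above.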
config.py CHANGED
@@ -23,6 +23,13 @@ AUDIO_PRELOAD_COUNT = 5 # First N segments use preload="auto
 DELETE_CACHE_FREQUENCY = 3600*5 # Gradio cache cleanup interval (seconds)
 DELETE_CACHE_AGE = 3600*5 # Delete cached files older than this (seconds)
 
+# =============================================================================
+# Session API settings
+# =============================================================================
+
+SESSION_DIR = Path("/tmp/aligner_sessions") # Per-session cached data (audio, VAD, metadata)
+SESSION_EXPIRY_SECONDS = 3600*5 # 5 hours — matches DELETE_CACHE_AGE
+
 # =============================================================================
 # Model and data paths
 # =============================================================================
src/api/__init__.py ADDED
File without changes
src/api/session_api.py ADDED
@@ -0,0 +1,276 @@
+"""Session-based API: persistence layer + endpoint wrappers.
+
+Sessions store preprocessed audio and VAD data in /tmp so that
+follow-up calls (resegment, retranscribe, realign) skip expensive
+re-uploads and re-inference.
+"""
+
+import hashlib
+import json
+import os
+import re
+import shutil
+import time
+import uuid
+
+import numpy as np
+
+from config import SESSION_DIR, SESSION_EXPIRY_SECONDS
+
+# ---------------------------------------------------------------------------
+# Session manager
+# ---------------------------------------------------------------------------
+
+_last_cleanup_time = 0.0
+_CLEANUP_INTERVAL = 1800  # sweep at most every 30 min
+
+_VALID_ID = re.compile(r"^[0-9a-f]{32}$")
+
+
+def _session_dir(audio_id: str):
+    return SESSION_DIR / audio_id
+
+
+def _validate_id(audio_id: str) -> bool:
+    return isinstance(audio_id, str) and bool(_VALID_ID.match(audio_id))
+
+
+def _is_expired(meta: dict) -> bool:
+    return (time.time() - meta.get("created_at", 0)) > SESSION_EXPIRY_SECONDS
+
+
+def _read_metadata(session_path):
+    meta_path = session_path / "metadata.json"
+    if not meta_path.exists():
+        return None
+    with open(meta_path) as f:
+        return json.load(f)
+
+
+def _write_metadata(session_path, meta: dict):
+    """Atomic write via temp file + os.replace."""
+    tmp = session_path / "metadata.tmp"
+    with open(tmp, "w") as f:
+        json.dump(meta, f)
+    os.replace(tmp, session_path / "metadata.json")
+
+
+def _sweep_expired():
+    """Delete expired session directories (runs at most every 30 min)."""
+    global _last_cleanup_time
+    now = time.time()
+    if now - _last_cleanup_time < _CLEANUP_INTERVAL:
+        return
+    _last_cleanup_time = now
+    if not SESSION_DIR.exists():
+        return
+    for entry in SESSION_DIR.iterdir():
+        if not entry.is_dir():
+            continue
+        meta = _read_metadata(entry)
+        if meta is None or _is_expired(meta):
+            shutil.rmtree(entry, ignore_errors=True)
+
+
+def _intervals_hash(intervals) -> str:
+    return hashlib.md5(json.dumps(intervals).encode()).hexdigest()
+
+
+def create_session(audio, speech_intervals, is_complete, intervals, model_name):
+    """Persist session data and return audio_id (32-char hex UUID)."""
+    _sweep_expired()
+    audio_id = uuid.uuid4().hex
+    path = _session_dir(audio_id)
+    path.mkdir(parents=True, exist_ok=True)
+
+    np.save(path / "audio.npy", audio)
+    np.save(path / "speech_intervals.npy", speech_intervals)
+
+    meta = {
+        "is_complete": bool(is_complete),
+        "intervals": intervals,
+        "model_name": model_name,
+        "intervals_hash": _intervals_hash(intervals),
+        "created_at": time.time(),
+    }
+    _write_metadata(path, meta)
+    return audio_id
+
+
+def load_session(audio_id):
+    """Load session data. Returns dict or None if missing/expired/invalid."""
+    if not _validate_id(audio_id):
+        return None
+    path = _session_dir(audio_id)
+    if not path.exists():
+        return None
+    meta = _read_metadata(path)
+    if meta is None or _is_expired(meta):
+        shutil.rmtree(path, ignore_errors=True)
+        return None
+
+    audio = np.load(path / "audio.npy")
+    speech_intervals = np.load(path / "speech_intervals.npy")
+
+    return {
+        "audio": audio,
+        "speech_intervals": speech_intervals,
+        "is_complete": meta["is_complete"],
+        "intervals": meta["intervals"],
+        "model_name": meta["model_name"],
+        "intervals_hash": meta.get("intervals_hash", ""),
+        "audio_id": audio_id,
+    }
+
+
+def update_session(audio_id, *, intervals=None, model_name=None):
+    """Update mutable session fields (intervals, model_name)."""
+    path = _session_dir(audio_id)
+    meta = _read_metadata(path)
+    if meta is None:
+        return
+    if intervals is not None:
+        meta["intervals"] = intervals
+        meta["intervals_hash"] = _intervals_hash(intervals)
+    if model_name is not None:
+        meta["model_name"] = model_name
+    _write_metadata(path, meta)
+
+
+# ---------------------------------------------------------------------------
+# Response formatting
+# ---------------------------------------------------------------------------
+
+_SESSION_ERROR = {"error": "Session not found or expired", "segments": []}
+
+
+def _format_response(audio_id, json_output):
+    """Convert pipeline json_output to the documented API response schema."""
+    segments = []
+    for seg in json_output.get("segments", []):
+        segments.append({
+            "segment": seg["segment"],
+            "time_from": seg["time_from"],
+            "time_to": seg["time_to"],
+            "ref_from": seg["ref_from"],
+            "ref_to": seg["ref_to"],
+            "matched_text": seg["matched_text"],
+            "confidence": seg["confidence"],
+            "has_missing_words": seg.get("has_missing_words", False),
+            "error": seg["error"],
+        })
+    return {"audio_id": audio_id, "segments": segments}
+
+
+# ---------------------------------------------------------------------------
+# Endpoint wrappers
+# ---------------------------------------------------------------------------
+
+def process_audio_session(audio_data, min_silence_ms, min_speech_ms, pad_ms,
+                          model_name="Base", device="GPU"):
+    """Full pipeline: preprocess -> VAD -> ASR -> alignment. Creates session."""
+    from src.pipeline import process_audio
+
+    result = process_audio(
+        audio_data, int(min_silence_ms), int(min_speech_ms), int(pad_ms),
+        model_name, device,
+    )
+    # result is a 9-tuple:
+    # (html, json_output, speech_intervals, is_complete, audio, sr, intervals, seg_dir, log_row)
+    json_output = result[1]
+    if json_output is None:
+        return {"error": "No speech detected in audio", "segments": []}
+
+    speech_intervals = result[2]
+    is_complete = result[3]
+    audio = result[4]
+    intervals = result[6]
+
+    audio_id = create_session(
+        audio, speech_intervals, is_complete, intervals, model_name,
+    )
+    return _format_response(audio_id, json_output)
+
+
+def resegment_session(audio_id, min_silence_ms, min_speech_ms, pad_ms,
+                      model_name="Base", device="GPU"):
+    """Re-clean VAD boundaries with new params and re-run ASR + alignment."""
+    session = load_session(audio_id)
+    if session is None:
+        return _SESSION_ERROR
+
+    from src.pipeline import resegment_audio
+
+    result = resegment_audio(
+        session["speech_intervals"], session["is_complete"],
+        session["audio"], 16000,
+        int(min_silence_ms), int(min_speech_ms), int(pad_ms),
+        model_name, device,
+    )
+    json_output = result[1]
+    if json_output is None:
+        return {"audio_id": audio_id, "error": "No segments with these settings", "segments": []}
+
+    new_intervals = result[6]
+    update_session(audio_id, intervals=new_intervals, model_name=model_name)
+    return _format_response(audio_id, json_output)
+
+
+def retranscribe_session(audio_id, model_name="Base", device="GPU"):
+    """Re-run ASR with a different model on current segment boundaries."""
+    session = load_session(audio_id)
+    if session is None:
+        return _SESSION_ERROR
+
+    # Guard: reject if model and boundaries unchanged
+    if (model_name == session["model_name"]
+            and _intervals_hash(session["intervals"]) == session["intervals_hash"]):
+        return {
+            "audio_id": audio_id,
+            "error": "Model and boundaries unchanged. Change model_name or call /resegment_session first.",
+            "segments": [],
+        }
+
+    from src.pipeline import retranscribe_audio
+
+    result = retranscribe_audio(
+        session["intervals"],
+        session["audio"], 16000,
+        session["speech_intervals"], session["is_complete"],
+        model_name, device,
+    )
+    json_output = result[1]
+    if json_output is None:
+        return {"audio_id": audio_id, "error": "Retranscription failed", "segments": []}
+
+    update_session(audio_id, model_name=model_name)
+    return _format_response(audio_id, json_output)
+
+
+def realign_from_timestamps(audio_id, timestamps, model_name="Base", device="GPU"):
+    """Run ASR + alignment on caller-provided timestamp intervals."""
+    session = load_session(audio_id)
+    if session is None:
+        return _SESSION_ERROR
+
+    # Parse timestamps: accept list of {"start": f, "end": f} dicts
+    if isinstance(timestamps, str):
+        timestamps = json.loads(timestamps)
+
+    intervals = [(ts["start"], ts["end"]) for ts in timestamps]
+
+    from src.pipeline import realign_audio
+
+    result = realign_audio(
+        intervals,
+        session["audio"], 16000,
+        session["speech_intervals"], session["is_complete"],
+        model_name, device,
+    )
+    json_output = result[1]
+    if json_output is None:
+        return {"audio_id": audio_id, "error": "Alignment failed", "segments": []}
+
+    new_intervals = result[6]
+    update_session(audio_id, intervals=new_intervals, model_name=model_name)
+    return _format_response(audio_id, json_output)
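The persistence contract above (atomic metadata write via temp file + `os.replace`, time-based expiry on read) can be exercised with a stdlib-only mirror — a hypothetical demo that drops numpy and the real directory layout, not the module itself:

```python
# Minimal stdlib mirror of the session store's metadata lifecycle:
# atomic write, then expiry makes a stale session read as missing.
import json, os, tempfile, time, uuid
from pathlib import Path

EXPIRY_SECONDS = 3600 * 5  # mirrors SESSION_EXPIRY_SECONDS

def write_meta(session_path: Path, meta: dict) -> None:
    tmp = session_path / "metadata.tmp"
    tmp.write_text(json.dumps(meta))
    os.replace(tmp, session_path / "metadata.json")  # atomic rename

def load_meta(session_path: Path):
    meta_path = session_path / "metadata.json"
    if not meta_path.exists():
        return None
    meta = json.loads(meta_path.read_text())
    if time.time() - meta.get("created_at", 0) > EXPIRY_SECONDS:
        return None  # expired sessions are treated as not found
    return meta

root = Path(tempfile.mkdtemp()) / uuid.uuid4().hex
root.mkdir(parents=True)

write_meta(root, {"model_name": "Base", "created_at": time.time()})
fresh = load_meta(root)  # readable: within the expiry window

write_meta(root, {"model_name": "Base", "created_at": time.time() - EXPIRY_SECONDS - 1})
stale = load_meta(root)  # None: past the expiry window
```

The temp-file + `os.replace` pattern matters here because concurrent endpoint calls may read `metadata.json` while another call rewrites it; a rename is atomic, so readers never see a half-written file.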
src/pipeline.py CHANGED
@@ -473,6 +473,7 @@ def _run_post_vad_pipeline(
         "ref_to": parse_ref(seg.matched_ref)[1],
         "matched_text": seg.matched_text or "",
         "confidence": round(seg.match_score, 3),
+        "has_missing_words": seg.has_missing_words,
         "potentially_undersegmented": seg.potentially_undersegmented,
         "error": seg.error
     }
@@ -721,6 +722,56 @@ def retranscribe_audio(
     return html, json_output, cached_speech_intervals, cached_is_complete, cached_audio, cached_sample_rate, cached_intervals, seg_dir, log_row
 
 
+def realign_audio(
+    intervals,
+    cached_audio, cached_sample_rate,
+    cached_speech_intervals, cached_is_complete,
+    model_name="Base", device="GPU",
+    cached_log_row=None,
+    request: gr.Request = None,
+    progress=gr.Progress(),
+):
+    """Run ASR + alignment on caller-provided intervals.
+
+    Same as retranscribe_audio but uses externally-provided intervals
+    instead of cached_intervals, bypassing VAD entirely.
+
+    Returns:
+        (html, json_output, cached_speech_intervals, cached_is_complete,
+         cached_audio, cached_sample_rate, intervals, segment_dir, log_row)
+    """
+    import time
+
+    if cached_audio is None:
+        return "<div>No cached data.</div>", None, None, None, None, None, None, None, None
+
+    device = device.lower()
+
+    from src.core.zero_gpu import reset_quota_flag, force_cpu_mode
+    reset_quota_flag()
+    if device == "cpu":
+        force_cpu_mode()
+
+    print(f"\n{'='*60}")
+    print(f"REALIGNING with {len(intervals)} custom timestamps, model={model_name}")
+    print(f"{'='*60}")
+
+    profiling = ProfilingData()
+    pipeline_start = time.time()
+
+    pct, desc = PROGRESS_RETRANSCRIBE["retranscribe"]
+    progress(pct, desc=desc.format(model=model_name))
+
+    html, json_output, seg_dir, log_row = _run_post_vad_pipeline(
+        cached_audio, cached_sample_rate, intervals,
+        model_name, device, profiling, pipeline_start, PROGRESS_RETRANSCRIBE,
+        progress=progress,
+        request=request, log_row=cached_log_row,
+    )
+
+    return html, json_output, cached_speech_intervals, cached_is_complete, cached_audio, cached_sample_rate, intervals, seg_dir, log_row
+
+
 def _retranscribe_wrapper(
     cached_intervals, cached_audio, cached_sample_rate,
     cached_speech_intervals, cached_is_complete,
src/ui/event_wiring.py CHANGED
@@ -3,7 +3,11 @@ import gradio as gr
 
 from src.pipeline import (
     process_audio, resegment_audio,
-    _retranscribe_wrapper, process_audio_json, save_json_export,
+    _retranscribe_wrapper, save_json_export,
+)
+from src.api.session_api import (
+    process_audio_session, resegment_session,
+    retranscribe_session, realign_from_timestamps,
 )
 from src.mfa import compute_mfa_timestamps
 from src.ui.handlers import (
@@ -418,11 +422,30 @@ def _wire_settings_restoration(app, c):
 
 
 def _wire_api_endpoint(c):
-    """Hidden API-only endpoint for JSON output."""
+    """Hidden API-only endpoints for session-based programmatic access."""
+    gr.Button(visible=False).click(
+        fn=process_audio_session,
+        inputs=[c.api_audio, c.api_silence, c.api_speech, c.api_pad,
+                c.api_model, c.api_device],
+        outputs=[c.api_result],
+        api_name="process_audio_session",
+    )
+    gr.Button(visible=False).click(
+        fn=resegment_session,
+        inputs=[c.api_audio_id, c.api_silence, c.api_speech, c.api_pad,
+                c.api_model, c.api_device],
+        outputs=[c.api_result],
+        api_name="resegment_session",
+    )
+    gr.Button(visible=False).click(
+        fn=retranscribe_session,
+        inputs=[c.api_audio_id, c.api_model, c.api_device],
+        outputs=[c.api_result],
+        api_name="retranscribe_session",
+    )
     gr.Button(visible=False).click(
-        fn=process_audio_json,
-        inputs=[c.audio_input, c.min_silence_slider, c.min_speech_slider,
-                c.pad_slider, c.model_radio, c.device_radio],
-        outputs=[c.output_json],
-        api_name="process_audio_json"
+        fn=realign_from_timestamps,
+        inputs=[c.api_audio_id, c.api_timestamps, c.api_model, c.api_device],
+        outputs=[c.api_result],
+        api_name="realign_from_timestamps",
     )
src/ui/interface.py CHANGED
@@ -67,6 +67,17 @@ def build_interface():
     c.cached_log_row = gr.State(value=None)
     c.resegment_panel_visible = gr.State(value=False)
 
+    # Session API components (hidden, API-only)
+    c.api_audio = gr.Audio(visible=False, type="numpy")
+    c.api_audio_id = gr.Textbox(visible=False)
+    c.api_silence = gr.Number(visible=False, precision=0)
+    c.api_speech = gr.Number(visible=False, precision=0)
+    c.api_pad = gr.Number(visible=False, precision=0)
+    c.api_model = gr.Textbox(visible=False)
+    c.api_device = gr.Textbox(visible=False)
+    c.api_timestamps = gr.JSON(visible=False)
+    c.api_result = gr.JSON(visible=False)
+
     wire_events(app, c)
 
     return app
tests/test_session_api.py ADDED
@@ -0,0 +1,122 @@
+"""Integration tests for session-based API endpoints.
+
+Requires the app to be running on localhost:7860.
+Start with: python app.py
+
+Run with: python -m pytest tests/test_session_api.py -v -s
+"""
+
+import pytest
+from gradio_client import Client
+
+SERVER_URL = "http://localhost:7860"
+AUDIO_FILE = "data/112.mp3"  # Surah Al-Ikhlas (~15s)
+
+
+@pytest.fixture(scope="module")
+def client():
+    return Client(SERVER_URL)
+
+
+@pytest.fixture(scope="module")
+def session(client):
+    """Run process_audio_session once, share audio_id across tests."""
+    result = client.predict(
+        AUDIO_FILE, 200, 1000, 100, "Base", "CPU",
+        api_name="/process_audio_session",
+    )
+    assert "audio_id" in result, f"Missing audio_id: {result}"
+    assert result["audio_id"] is not None
+    return result
+
+
+# -- 1. process_audio_session -----------------------------------------------
+
+def test_process_audio_session(session):
+    assert len(session["segments"]) > 0, "Expected at least one segment"
+    seg = session["segments"][0]
+    for field in ("segment", "time_from", "time_to", "ref_from", "ref_to",
+                  "matched_text", "confidence", "has_missing_words", "error"):
+        assert field in seg, f"Missing field: {field}"
+    assert seg["segment"] == 1
+    assert seg["time_from"] >= 0
+    assert seg["time_to"] > seg["time_from"]
+    assert 0 <= seg["confidence"] <= 1
+
+
+# -- 2. resegment_session ---------------------------------------------------
+
+def test_resegment_session(client, session):
+    audio_id = session["audio_id"]
+    result = client.predict(
+        audio_id, 600, 1500, 300, "Base", "CPU",
+        api_name="/resegment_session",
+    )
+    assert result["audio_id"] == audio_id
+    assert "segments" in result
+    assert len(result["segments"]) > 0
+
+
+# -- 3. retranscribe_session ------------------------------------------------
+
+def test_retranscribe_session(client, session):
+    audio_id = session["audio_id"]
+    result = client.predict(
+        audio_id, "Large", "CPU",
+        api_name="/retranscribe_session",
+    )
+    assert result["audio_id"] == audio_id
+    assert len(result["segments"]) > 0
+
+
+# -- 4. retranscribe guard --------------------------------------------------
+
+def test_retranscribe_guard(client, session):
+    """Same model + same boundaries should return error."""
+    audio_id = session["audio_id"]
+    result = client.predict(
+        audio_id, "Large", "CPU",
+        api_name="/retranscribe_session",
+    )
+    assert "error" in result
+    assert result["segments"] == []
+
+
+# -- 5. realign_from_timestamps ---------------------------------------------

+def test_realign_from_timestamps(client, session):
+    audio_id = session["audio_id"]
+    timestamps = [
+        {"start": 0.5, "end": 3.0},
+        {"start": 3.5, "end": 6.0},
+    ]
+    result = client.predict(
+        audio_id, timestamps, "Base", "CPU",
+        api_name="/realign_from_timestamps",
+    )
+    assert result["audio_id"] == audio_id
+    assert len(result["segments"]) == 2
+
+
+# -- 6. invalid audio_id ----------------------------------------------------
+
+def test_invalid_audio_id(client):
+    result = client.predict(
+        "00000000000000000000000000000000", "Base", "CPU",
+        api_name="/retranscribe_session",
+    )
+    assert "error" in result
+    assert "not found" in result["error"].lower() or "expired" in result["error"].lower()
+    assert result["segments"] == []
+
+
+# -- 7. resegment after realign (session still valid) -----------------------
+
+def test_resegment_after_realign(client, session):
+    audio_id = session["audio_id"]
+    result = client.predict(
+        audio_id, 200, 1000, 100, "Base", "CPU",
+        api_name="/resegment_session",
+    )
+    assert result["audio_id"] == audio_id
+    assert len(result["segments"]) > 0