hetchyy Claude Opus 4.6 committed on
Commit 6cdb091 · 1 Parent(s): 0351f22

Add session-based API endpoints for stateless client access


Implement 4 endpoints (process_audio_session, resegment_session,
retranscribe_session, realign_from_timestamps) that persist session
data to /tmp/aligner_sessions so gradio_client consumers can reuse
cached audio and VAD results across follow-up calls without gr.State.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
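The intended call pattern is: one `process_audio_session` call returns an `audio_id`, and follow-up calls pass that id instead of re-uploading audio. A hedged `gradio_client` sketch — the endpoint names come from this commit, but the `api_name` routes, argument order, and Space id are assumptions, not taken from the diff:

```python
# Hypothetical client flow for the session endpoints. Routes and argument
# order are assumed from the wrappers in src/api/session_api.py.

def extract_audio_id(response):
    """Pull the reusable session handle out of an endpoint response dict."""
    if response.get("error"):
        return None
    return response.get("audio_id")

if __name__ == "__main__":
    from gradio_client import Client, handle_file  # pip install gradio_client

    client = Client("<space-id>")  # the Space hosting these endpoints

    first = client.predict(
        handle_file("recitation.wav"), 300, 500, 100, "Base", "GPU",
        api_name="/process_audio_session",
    )
    audio_id = extract_audio_id(first)

    # Follow-up calls reuse the cached audio + VAD results server-side,
    # so no re-upload and no second VAD pass:
    refined = client.predict(
        audio_id, 200, 400, 80, "Base", "GPU",
        api_name="/resegment_session",
    )
```

Because the session lives in `/tmp/aligner_sessions` rather than `gr.State`, this works from any stateless HTTP client, not just a browser session.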

.gitignore CHANGED
@@ -49,6 +49,5 @@ test_api.py
 data/api_result.json
 
 CLAUDE.md
-inference_optimization.md
 
 docs/
CLAUDE.md CHANGED
@@ -2,83 +2,184 @@
 
 This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
 
+**Keep this file up to date.** After any file/folder structure change, update the tree below without asking. After implementing features or making architectural changes, suggest additions to this file explaining why they would help future context.
+
 ## Project Overview
 
 Quran recitation alignment tool that segments audio recordings and aligns them with Quranic text using phoneme-based ASR and dynamic programming. Deployed as a Hugging Face Space with Gradio.
 
 **Pipeline:** Audio → Preprocessing (16kHz mono) → VAD Segmentation → Phoneme ASR (wav2vec2) → Special Segment Detection (Basmala/Isti'adha) → N-gram Anchor Voting → DP Alignment → Word-level Timestamps (optional via external MFA) → UI Rendering
 
-## Architecture
-
-### Entry Point
-
-`app.py` (~85 lines) — Bootstrap entry point: path setup, Cython build, imports `build_interface()` from `src/ui/interface.py`, and `__main__` block with model preloading.
-
-### Top-level Modules (`src/`)
-
-- **`src/pipeline.py`** — GPU-decorated pipeline functions: VAD+ASR GPU leases, post-VAD alignment pipeline, `process_audio`, `resegment_audio`, `retranscribe_audio`, `save_json_export`.
-- **`src/mfa.py`** — MFA forced-alignment integration: upload/submit to external MFA Space, SSE result polling, progress bar HTML, and `compute_mfa_timestamps` generator that injects word/letter timestamps into segment HTML.
-
-### Core Infrastructure (`src/core/`)
-
-- **`segment_types.py`** — Shared dataclasses (`VadSegment`, `SegmentInfo`, `ProfilingData`).
-- **`quran_index.py`** — Quran text index for reference lookups.
-- **`zero_gpu.py`** — `@gpu_with_fallback` decorator for ZeroGPU quota handling with automatic CPU fallback.
-- **`usage_logger.py`** — HF Dataset logging (ParquetScheduler for alignment runs).
-
-### Alignment (`src/alignment/`)
-
-- **`alignment_pipeline.py`** — Main alignment orchestrator. Coordinates ASR anchor detection → DP alignment.
-- **`phoneme_asr.py`** — wav2vec2 CTC inference with dynamic batching (duration-based batch construction to minimize padding waste).
-- **`phoneme_anchor.py`** — N-gram rarity-weighted voting to determine which chapter/verse a segment belongs to.
-- **`phoneme_matcher.py`** — Substring Levenshtein DP alignment between ASR phonemes and reference Quran phonemes. Uses windowed alignment with lookback/lookahead.
-- **`_dp_core.pyx`** — Cython-accelerated DP inner loop (10-20x speedup). Falls back to pure Python if not compiled.
-- **`phonemizer_utils.py`** — Phonemizer wrapper for Arabic/Quranic text phonemization.
-- **`special_segments.py`** — Detects Basmala and Isti'adha via phoneme edit distance.
-- **`phoneme_matcher_cache.py`** — Pre-loads and caches phonemized chapter references from `data/phoneme_cache.pkl`.
-- **`ngram_index.py`** — N-gram index data structure used by anchor voting, loaded from `data/phoneme_ngram_index_5.pkl`.
-
-### Segmenter (`src/segmenter/`)
-
-- **`segmenter_model.py`** — Model lifecycle and device management for the VAD segmenter.
-- **`segmenter_aoti.py`** — Ahead-of-time compiled model support.
-- **`vad.py`** — Voice activity detection and speech segment extraction.
-
-### UI (`src/ui/`)
-
-- **`interface.py`** — `build_interface()`: Gradio layout (CSS, JS animation system, component definitions).
-- **`event_wiring.py`** — Connects all Gradio component events.
-- **`handlers.py`** — Python event handler functions.
-- **`segments.py`** — Segment rendering helpers (HTML cards, confidence classes, timestamps, audio encoding).
-- **`styles.py`** — CSS builder.
-- **`js_config.py`** — JS configuration bridge.
-
-### Configuration
-
-`config.py` — Centralized settings: model paths, alignment hyperparameters (edit costs, thresholds, window sizes), segmentation presets (Mujawwad/Murattal/Fast), batching strategy, UI settings, and debug flags.
-
-### Data Files (`data/`)
-
-- `phoneme_cache.pkl` (7.9MB) — Pre-phonemized Quran text for all 114 chapters
-- `phoneme_ngram_index_5.pkl` (6.2MB) — 5-gram index for anchor detection
-- `phoneme_sub_costs.json` — Custom phoneme substitution cost matrix
-- `digital_khatt_v2_script.json` (14.8MB) — Full Quran text with positional metadata
-- `surah_info.json` — Chapter metadata (names, verse counts)
-- `font_data.py` — Base64-encoded Arabic fonts for offline rendering
-
-### Models
+## Commands
+
+```bash
+# Run locally
+python app.py          # Start on port 7860
+python app.py --share  # With public HF link
+
+# Build Cython DP extension (auto-attempted on startup, falls back to pure Python)
+python setup.py build_ext --inplace
+
+# Rebuild data caches (run offline, not during serving)
+python scripts/build_phoneme_cache.py
+python scripts/build_phoneme_ngram_index.py
+```
+
+## File Tree
+
+```
+├── app.py                           # ~85 lines — Bootstrap only: path setup, Cython build, build_interface(), model preloading
+├── config.py                        # All constants, hyperparameters, model paths, presets, UI settings, debug flags
+├── align_config.py                  # Override config for constrained (known-surah) alignment (tighter windows, no debug)
+├── setup.py                         # Cython build for _dp_core.pyx
+├── requirements.txt                 # Pinned deps: torch 2.8, transformers 5.0, gradio >=6.5.1
+│
+├── src/
+│   ├── pipeline.py                  # GPU-decorated pipeline: VAD+ASR leases, post-VAD alignment, process/resegment/retranscribe/realign
+│   ├── mfa.py                       # MFA forced-alignment: upload to external Space, SSE polling, timestamp injection into HTML
+│   │
+│   ├── api/
+│   │   └── session_api.py           # Session persistence + 4 endpoint wrappers (process/resegment/retranscribe/realign)
+│   │
+│   ├── core/
+│   │   ├── segment_types.py         # Dataclasses: VadSegment, SegmentInfo, ProfilingData (50+ timing fields)
+│   │   ├── quran_index.py           # QuranIndex: dual-script word lookup (QPC Hafs for compute, DigitalKhatt for display)
+│   │   ├── zero_gpu.py              # @gpu_with_fallback decorator: ZeroGPU quota detection, automatic CPU fallback
+│   │   └── usage_logger.py          # HF Dataset logging: ParquetScheduler, audio embedding, error JSONL fallback
+│   │
+│   ├── alignment/
+│   │   ├── alignment_pipeline.py    # Orchestrator: sequential alignment with retry tiers, re-anchoring, chapter transitions
+│   │   ├── phoneme_asr.py           # wav2vec2 CTC inference with dynamic batching (duration-based, padding waste minimization)
+│   │   ├── phoneme_anchor.py        # N-gram rarity-weighted voting: determines chapter/verse anchor point
+│   │   ├── phoneme_matcher.py       # Substring Levenshtein DP with word-boundary constraints and position prior
+│   │   ├── _dp_core.pyx             # Cython DP inner loop (10-20x speedup), pure Python fallback
+│   │   ├── special_segments.py      # Basmala/Isti'adha detection via phoneme edit distance (threshold 0.35)
+│   │   ├── phoneme_matcher_cache.py # Pre-loads ChapterReference objects from phoneme_cache.pkl
+│   │   ├── ngram_index.py           # PhonemeNgramIndex dataclass, loaded from pickle
+│   │   └── phonemizer_utils.py      # Singleton wrapper for Quranic Phonemizer
+│   │
+│   ├── segmenter/
+│   │   ├── segmenter_model.py       # VAD model lifecycle: load, GPU/CPU movement, device management
+│   │   ├── segmenter_aoti.py        # Ahead-of-time compilation via torch.export for ZeroGPU persistence
+│   │   └── vad.py                   # VAD inference: detect_speech_segments() with interval cleaning
+│   │
+│   └── ui/
+│       ├── interface.py             # build_interface(): Gradio Blocks layout, component definitions, state components
+│       ├── event_wiring.py          # Connects Gradio component events to handlers and pipeline functions
+│       ├── handlers.py              # Python callbacks: preset buttons, slider wiring, animation mode changes
+│       ├── segments.py              # Segment card HTML rendering: confidence badges, verse markers, audio players
+│       ├── styles.py                # CSS: fonts, segment cards, confidence colors, mega card, animation UI
+│       ├── js_config.py             # Python→JS bridge: exports config as window.* globals, concatenates JS files
+│       └── static/
+│           ├── animation-core.js    # Per-segment animation: audio warmup, element caching, window opacity engine, tick loop
+│           └── animate-all.js       # Mega card: builds unified text flow, deduplicates shared words, click-to-seek
+│
+├── data/
+│   ├── phoneme_cache.pkl            # 7.9MB — Pre-phonemized Quran text (114 chapters)
+│   ├── phoneme_ngram_index_5.pkl    # 6.2MB — 5-gram index for anchor voting
+│   ├── phoneme_sub_costs.json       # Custom phoneme substitution cost matrix
+│   ├── digital_khatt_v2_script.json # 14.8MB — Full Quran text with positional metadata
+│   ├── qpc_hafs.json                # QPC Hafs Quran text (computational reference)
+│   ├── surah_info.json              # Chapter metadata (names, verse counts)
+│   ├── ligatures.json               # Surah name ligature mappings for DigitalKhatt font
+│   ├── font_data.py                 # Base64-encoded Arabic fonts for offline rendering
+│   ├── DigitalKhattV2.otf           # Arabic Quran font
+│   └── surah-name-v2.ttf            # Surah name ligature font
+│
+├── scripts/
+│   ├── build_phoneme_cache.py       # Generate phoneme_cache.pkl from Quran text
+│   ├── build_phoneme_ngram_index.py # Generate phoneme_ngram_index_5.pkl from cache
+│   ├── export_onnx.py               # Export models to ONNX format
+│   ├── add_open_tanween.py          # Text preprocessing: add open tanween marks
+│   └── fix_stop_sign_spacing.py     # Text preprocessing: fix stop sign spacing
+│
+├── tests/
+│   └── test_session_api.py          # Integration tests for session API (requires running server)
+│
+├── docs/
+│   ├── api.md                       # API endpoint documentation (current + planned)
+│   ├── client_api.md                # Client-side API docs
+│   └── usage-logging.md             # Usage logging schema and design
+│
+└── usage_logs/errors/               # Runtime error JSONL files (fallback when Hub upload fails)
+```
+
+## Architecture Principles
+
+**`app.py` must stay minimal** (~85 lines). It only bootstraps: path setup, Cython build, `build_interface()`, and model preloading. All logic lives in `src/`.
+
+**All constants go in `config.py`.** Model paths, thresholds, window sizes, edit costs, UI settings, presets, slider ranges, debug flags — everything configurable lives here. Never hardcode magic numbers in module code.
+
+## DP Alignment Algorithm
+
+The core alignment (`phoneme_matcher.py`) uses **substring Levenshtein DP** with word-boundary constraints to find where ASR phonemes best match within the Quran reference:
+
+1. **Windowed search:** A window of `LOOKBACK_WORDS` (15) before and `LOOKAHEAD_WORDS` (10) after the current pointer defines the search region. Pre-flattened phoneme arrays avoid per-segment rebuilds.
+2. **Word-boundary constraints:** DP start positions must align with word boundaries (INF cost elsewhere). Only word-end positions are evaluated as candidates.
+3. **Position prior:** Adds `START_PRIOR_WEIGHT` (0.005) penalty per word away from the expected position, biasing sequential matching.
+4. **Edit costs:** Substitution (1.0), insertion (1.0), deletion (0.8). Custom substitution costs from `phoneme_sub_costs.json` for phonetically similar pairs.
+5. **Scoring:** `normalized_edit_distance + position_prior`. Confidence = `1 - normalized_distance`.
+6. **Cython acceleration:** `_dp_core.pyx` provides 10-20x speedup for the inner loop. Falls back to pure Python if not compiled.
+
+### Special Cases
+
+- **Basmala/Isti'adha detection** (`special_segments.py`): Before main alignment, checks first segments against hardcoded phoneme sequences using edit distance (threshold 0.35). If a combined Isti'adha+Basmala is detected in one segment, it splits at the midpoint.
+- **Fused Basmala:** After chapter transitions, tries prepending Basmala phonemes to the first verse segment and compares confidence with plain alignment. Picks the better match.
+- **N-gram anchor voting** (`phoneme_anchor.py`): Extracts 5-grams from ASR output, looks up in pre-built index, weights by `1/count` (rarity). Finds best contiguous ayah run, trims edges below 15% of max weight.
+- **Graduated retry on failure** (`alignment_pipeline.py`):
+  - Tier 1: Expanded window (60 lookback, 40 lookahead), same threshold
+  - Tier 2: Expanded window + relaxed threshold (0.45)
+- **Re-anchoring:** After 2 consecutive failures (`MAX_CONSECUTIVE_FAILURES`), runs n-gram voting on remaining segments to jump to a new position within the surah.
+- **Chapter transitions:** When the pointer exceeds chapter end, detects inter-chapter specials and moves to the next chapter. After Surah 1, triggers global re-anchor.
+
+## Animation System
+
+Two animation modes, both driven by `requestAnimationFrame` tick loops matching `audio.currentTime` to word/character timestamps:
+
+### Per-Segment Animation (`animation-core.js`)
+Each segment card has an "Animate" button. On click: builds word/char element caches from `.word`/`.char` spans, activates lazy audio, starts RAF loop. The tick function uses a **fast path** (check current word → next word, covers ~99% of frames) with full-scan fallback for seeking.
+
+### Mega Card Animation (`animate-all.js`)
+"Animate All" builds a **unified text flow** from all segment cards: clones word elements, deduplicates shared positions (overlapping segment boundaries), inserts surah separators with ligature font, handles fused Basmala prefixes. Uses a single `<audio>` element for the full recording. Segment transitions are boundary-driven (when `currentTime >= segEndTime`, advance to the next segment's tick loop).
+
+### Window Opacity Engine
+Both modes use the same windowing system: configurable prev/after word counts with opacity gradients. Display modes (Reveal, Fade, Spotlight, Isolate, Consume, Custom) are presets that set opacity + window size. Verse-only mode hides all words outside the current verse. Settings persist to `localStorage`.
+
+Click-to-seek in mega card: click a word → find its segment from timing, reset highlights, seek unified audio.
+
+## Profiling & Performance
+
+**Always consider performance when adding features.** The `ProfilingData` dataclass tracks 50+ timing fields across every pipeline stage: resampling, VAD (model load, inference, GPU time), ASR (per-batch timing, padding waste), anchor detection, DP alignment (per-segment min/max/avg), retry counts, result building, and audio encoding.
+
+Key optimizations to maintain:
+- **Dynamic batching** (ASR): Groups segments by duration to minimize padding waste (max 15%). Tracks `pad_waste` per batch.
+- **Pre-flattened phoneme arrays** (DP): Chapter references pre-concatenate all word phonemes with offset mapping, avoiding per-segment array construction.
+- **Lazy audio loading** (UI): Audio elements use `data-src` with a play button; `<audio>` controls only activate on click. First 5 segments use `preload="auto"`.
+- **Audio warmup** (JS): `pointerdown` event primes AudioContext + silent WAV before first play.
+- **RAF fast path** (animation): Checks current/next word index before falling back to full scan.
+- **Cython DP core:** 10-20x speedup for the alignment inner loop.
+- **AoT compilation** (ZeroGPU): Compiles VAD model ahead-of-time for persistence across GPU leases.
+
+## Audio & Temp Storage
+
+Audio files use HF Spaces' `/tmp` directory. `SEGMENT_AUDIO_DIR = /tmp/segments`. Per-segment WAVs are written to a UUID-keyed subdirectory for each run. The full-recording WAV is written separately for mega card playback. Gradio's `allowed_paths=["/tmp"]` enables serving these files. Cache cleanup runs every 5 hours (`DELETE_CACHE_FREQUENCY`), deleting files older than 5 hours.
+
+Audio preprocessing: resample to 16kHz mono via librosa (`soxr_lq` for speed), normalize int16/int32/float32 → float32, stereo → mono by averaging.
+
+## Models
 
 | Model | ID | Purpose |
 |-------|----|---------|
 | VAD | `obadx/recitation-segmenter-v2` | Voice activity detection |
 | ASR Base | `hetchyy/r15_95m` | Phoneme recognition (95M params) |
-| ASR Large | `hetchyy/r7` | Phoneme recognition (higher accuracy, slower) |
+| ASR Large | `hetchyy/r7` | Phoneme recognition (higher accuracy, 3x slower) |
 | MFA | External Space `hetchyy-quran-phoneme-mfa` | Word-level forced alignment |
 
-### Key Patterns
+## Key Patterns
 
-- **State caching:** Preprocessed audio, VAD intervals, and segment boundaries are cached in Gradio `gr.State` to allow resegmentation/retranscription without re-uploading.
-- **Environment detection:** `IS_HF_SPACE` flag switches behavior for HF Spaces deployment (ZeroGPU, model preloading).
-- **Retry/re-anchor:** Alignment retries with expanded windows on failure; re-anchors after `MAX_CONSECUTIVE_FAILURES` (2) consecutive failures.
+- **State caching:** Preprocessed audio, raw VAD intervals, and segment boundaries are cached in `gr.State` to allow resegment/retranscribe without re-uploading or re-running VAD.
+- **GPU quota management:** `@gpu_with_fallback` decorator detects ZeroGPU quota exhaustion, parses reset time, falls back to CPU with `gr.Warning()` toast.
+- **Idempotent model movement:** `ensure_models_on_gpu()`/`ensure_models_on_cpu()` check current device before moving.
 - **Confidence scoring:** Green ≥80%, Yellow 60-79%, Red <60%.
+- **Dual-script Quran text:** QPC Hafs for phoneme computation, DigitalKhatt for display rendering (proper Arabic typography with verse markers as combining marks).
+- **Usage logging:** Alignment runs logged to HF Dataset via ParquetScheduler. Audio embedded as bytes. Error fallback to local JSONL.
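The scoring that the DP Alignment section describes can be sketched in a few lines of plain Python. This is a simplified illustration only, using the listed costs (substitution 1.0, insertion 1.0, deletion 0.8) and omitting the word-boundary constraints, position prior, custom substitution matrix, and Cython path that the real `phoneme_matcher.py` adds:

```python
# Substring Levenshtein: align the ASR phoneme sequence against ANY substring
# of the reference window (free start/end on the reference side), then
# confidence = 1 - (best edit distance / len(asr)).
SUB, INS, DEL = 1.0, 1.0, 0.8  # costs from the CLAUDE.md description

def substring_confidence(asr, ref):
    n, m = len(asr), len(ref)
    # Row 0 is all zeros: the match may start anywhere in the reference.
    prev = [0.0] * (m + 1)
    for i in range(1, n + 1):
        cur = [prev[0] + INS] + [0.0] * m
        for j in range(1, m + 1):
            cur[j] = min(
                prev[j - 1] + (0.0 if asr[i - 1] == ref[j - 1] else SUB),
                prev[j] + INS,      # extra phoneme in the ASR output
                cur[j - 1] + DEL,   # reference phoneme missing from ASR
            )
        prev = cur
    # Free end: take the best cost over all reference end positions.
    return 1.0 - min(prev) / max(n, 1)
```

For example, `substring_confidence(list("milla"), list("bismillah"))` is `1.0` (exact substring), while one substituted phoneme in a 5-phoneme segment yields `1 - 1/5 = 0.8` — the boundary between the green and yellow confidence bands above.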
config.py CHANGED
@@ -23,6 +23,13 @@ AUDIO_PRELOAD_COUNT = 5 # First N segments use preload="auto
 DELETE_CACHE_FREQUENCY = 3600*5 # Gradio cache cleanup interval (seconds)
 DELETE_CACHE_AGE = 3600*5 # Delete cached files older than this (seconds)
 
+# =============================================================================
+# Session API settings
+# =============================================================================
+
+SESSION_DIR = Path("/tmp/aligner_sessions") # Per-session cached data (audio, VAD, metadata)
+SESSION_EXPIRY_SECONDS = 3600*5 # 5 hours — matches DELETE_CACHE_AGE
+
 # =============================================================================
 # Model and data paths
 # =============================================================================
src/api/__init__.py ADDED
File without changes
src/api/session_api.py ADDED
@@ -0,0 +1,276 @@
+"""Session-based API: persistence layer + endpoint wrappers.
+
+Sessions store preprocessed audio and VAD data in /tmp so that
+follow-up calls (resegment, retranscribe, realign) skip expensive
+re-uploads and re-inference.
+"""
+
+import hashlib
+import json
+import os
+import re
+import shutil
+import time
+import uuid
+
+import numpy as np
+
+from config import SESSION_DIR, SESSION_EXPIRY_SECONDS
+
+# ---------------------------------------------------------------------------
+# Session manager
+# ---------------------------------------------------------------------------
+
+_last_cleanup_time = 0.0
+_CLEANUP_INTERVAL = 1800  # sweep at most every 30 min
+
+_VALID_ID = re.compile(r"^[0-9a-f]{32}$")
+
+
+def _session_dir(audio_id: str):
+    return SESSION_DIR / audio_id
+
+
+def _validate_id(audio_id: str) -> bool:
+    return isinstance(audio_id, str) and bool(_VALID_ID.match(audio_id))
+
+
+def _is_expired(meta: dict) -> bool:
+    return (time.time() - meta.get("created_at", 0)) > SESSION_EXPIRY_SECONDS
+
+
+def _read_metadata(session_path):
+    meta_path = session_path / "metadata.json"
+    if not meta_path.exists():
+        return None
+    with open(meta_path) as f:
+        return json.load(f)
+
+
+def _write_metadata(session_path, meta: dict):
+    """Atomic write via temp file + os.replace."""
+    tmp = session_path / "metadata.tmp"
+    with open(tmp, "w") as f:
+        json.dump(meta, f)
+    os.replace(tmp, session_path / "metadata.json")
+
+
+def _sweep_expired():
+    """Delete expired session directories (runs at most every 30 min)."""
+    global _last_cleanup_time
+    now = time.time()
+    if now - _last_cleanup_time < _CLEANUP_INTERVAL:
+        return
+    _last_cleanup_time = now
+    if not SESSION_DIR.exists():
+        return
+    for entry in SESSION_DIR.iterdir():
+        if not entry.is_dir():
+            continue
+        meta = _read_metadata(entry)
+        if meta is None or _is_expired(meta):
+            shutil.rmtree(entry, ignore_errors=True)
+
+
+def _intervals_hash(intervals) -> str:
+    return hashlib.md5(json.dumps(intervals).encode()).hexdigest()
+
+
+def create_session(audio, speech_intervals, is_complete, intervals, model_name):
+    """Persist session data and return audio_id (32-char hex UUID)."""
+    _sweep_expired()
+    audio_id = uuid.uuid4().hex
+    path = _session_dir(audio_id)
+    path.mkdir(parents=True, exist_ok=True)
+
+    np.save(path / "audio.npy", audio)
+    np.save(path / "speech_intervals.npy", speech_intervals)
+
+    meta = {
+        "is_complete": bool(is_complete),
+        "intervals": intervals,
+        "model_name": model_name,
+        "intervals_hash": _intervals_hash(intervals),
+        "created_at": time.time(),
+    }
+    _write_metadata(path, meta)
+    return audio_id
+
+
+def load_session(audio_id):
+    """Load session data. Returns dict or None if missing/expired/invalid."""
+    if not _validate_id(audio_id):
+        return None
+    path = _session_dir(audio_id)
+    if not path.exists():
+        return None
+    meta = _read_metadata(path)
+    if meta is None or _is_expired(meta):
+        shutil.rmtree(path, ignore_errors=True)
+        return None
+
+    audio = np.load(path / "audio.npy")
+    speech_intervals = np.load(path / "speech_intervals.npy")
+
+    return {
+        "audio": audio,
+        "speech_intervals": speech_intervals,
+        "is_complete": meta["is_complete"],
+        "intervals": meta["intervals"],
+        "model_name": meta["model_name"],
+        "intervals_hash": meta.get("intervals_hash", ""),
+        "audio_id": audio_id,
+    }
+
+
+def update_session(audio_id, *, intervals=None, model_name=None):
+    """Update mutable session fields (intervals, model_name)."""
+    path = _session_dir(audio_id)
+    meta = _read_metadata(path)
+    if meta is None:
+        return
+    if intervals is not None:
+        meta["intervals"] = intervals
+        meta["intervals_hash"] = _intervals_hash(intervals)
+    if model_name is not None:
+        meta["model_name"] = model_name
+    _write_metadata(path, meta)
+
+
+# ---------------------------------------------------------------------------
+# Response formatting
+# ---------------------------------------------------------------------------
+
+_SESSION_ERROR = {"error": "Session not found or expired", "segments": []}
+
+
+def _format_response(audio_id, json_output):
+    """Convert pipeline json_output to the documented API response schema."""
+    segments = []
+    for seg in json_output.get("segments", []):
+        segments.append({
+            "segment": seg["segment"],
+            "time_from": seg["time_from"],
+            "time_to": seg["time_to"],
+            "ref_from": seg["ref_from"],
+            "ref_to": seg["ref_to"],
+            "matched_text": seg["matched_text"],
+            "confidence": seg["confidence"],
+            "has_missing_words": seg.get("has_missing_words", False),
+            "error": seg["error"],
+        })
+    return {"audio_id": audio_id, "segments": segments}
+
+
+# ---------------------------------------------------------------------------
+# Endpoint wrappers
+# ---------------------------------------------------------------------------
+
+def process_audio_session(audio_data, min_silence_ms, min_speech_ms, pad_ms,
+                          model_name="Base", device="GPU"):
+    """Full pipeline: preprocess -> VAD -> ASR -> alignment. Creates session."""
+    from src.pipeline import process_audio
+
+    result = process_audio(
+        audio_data, int(min_silence_ms), int(min_speech_ms), int(pad_ms),
+        model_name, device,
+    )
+    # result is a 9-tuple:
+    # (html, json_output, speech_intervals, is_complete, audio, sr, intervals, seg_dir, log_row)
+    json_output = result[1]
+    if json_output is None:
+        return {"error": "No speech detected in audio", "segments": []}
+
+    speech_intervals = result[2]
+    is_complete = result[3]
+    audio = result[4]
+    intervals = result[6]
+
+    audio_id = create_session(
+        audio, speech_intervals, is_complete, intervals, model_name,
+    )
+    return _format_response(audio_id, json_output)
+
+
+def resegment_session(audio_id, min_silence_ms, min_speech_ms, pad_ms,
+                      model_name="Base", device="GPU"):
+    """Re-clean VAD boundaries with new params and re-run ASR + alignment."""
+    session = load_session(audio_id)
+    if session is None:
+        return _SESSION_ERROR
+
+    from src.pipeline import resegment_audio
+
+    result = resegment_audio(
+        session["speech_intervals"], session["is_complete"],
+        session["audio"], 16000,
+        int(min_silence_ms), int(min_speech_ms), int(pad_ms),
+        model_name, device,
+    )
+    json_output = result[1]
+    if json_output is None:
+        return {"audio_id": audio_id, "error": "No segments with these settings", "segments": []}
+
+    new_intervals = result[6]
+    update_session(audio_id, intervals=new_intervals, model_name=model_name)
+    return _format_response(audio_id, json_output)
+
+
+def retranscribe_session(audio_id, model_name="Base", device="GPU"):
+    """Re-run ASR with a different model on current segment boundaries."""
+    session = load_session(audio_id)
+    if session is None:
+        return _SESSION_ERROR
+
+    # Guard: reject if model and boundaries unchanged
+    if (model_name == session["model_name"]
+            and _intervals_hash(session["intervals"]) == session["intervals_hash"]):
+        return {
+            "audio_id": audio_id,
+            "error": "Model and boundaries unchanged. Change model_name or call /resegment_session first.",
+            "segments": [],
+        }
+
+    from src.pipeline import retranscribe_audio
+
+    result = retranscribe_audio(
+        session["intervals"],
+        session["audio"], 16000,
+        session["speech_intervals"], session["is_complete"],
+        model_name, device,
+    )
+    json_output = result[1]
+    if json_output is None:
+        return {"audio_id": audio_id, "error": "Retranscription failed", "segments": []}
+
+    update_session(audio_id, model_name=model_name)
+    return _format_response(audio_id, json_output)
+
+
+def realign_from_timestamps(audio_id, timestamps, model_name="Base", device="GPU"):
+    """Run ASR + alignment on caller-provided timestamp intervals."""
+    session = load_session(audio_id)
+    if session is None:
+        return _SESSION_ERROR
+
+    # Parse timestamps: accept list of {"start": f, "end": f} dicts
+    if isinstance(timestamps, str):
+        timestamps = json.loads(timestamps)
+
+    intervals = [(ts["start"], ts["end"]) for ts in timestamps]
+
+    from src.pipeline import realign_audio
+
+    result = realign_audio(
+        intervals,
+        session["audio"], 16000,
+        session["speech_intervals"], session["is_complete"],
+        model_name, device,
+    )
+    json_output = result[1]
+    if json_output is None:
+        return {"audio_id": audio_id, "error": "Alignment failed", "segments": []}
+
+    new_intervals = result[6]
+    update_session(audio_id, intervals=new_intervals, model_name=model_name)
+    return _format_response(audio_id, json_output)
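The persistence contract above (atomic metadata write via temp file + `os.replace`, time-based expiry on read) can be exercised with a stdlib-only mirror — a hypothetical demo that drops numpy and the real directory layout, not the module itself:

```python
# Minimal stdlib mirror of the session store's metadata lifecycle:
# atomic write, then expiry makes a stale session read as missing.
import json, os, tempfile, time, uuid
from pathlib import Path

EXPIRY_SECONDS = 3600 * 5  # mirrors SESSION_EXPIRY_SECONDS

def write_meta(session_path: Path, meta: dict) -> None:
    tmp = session_path / "metadata.tmp"
    tmp.write_text(json.dumps(meta))
    os.replace(tmp, session_path / "metadata.json")  # atomic rename

def load_meta(session_path: Path):
    meta_path = session_path / "metadata.json"
    if not meta_path.exists():
        return None
    meta = json.loads(meta_path.read_text())
    if time.time() - meta.get("created_at", 0) > EXPIRY_SECONDS:
        return None  # expired sessions are treated as not found
    return meta

root = Path(tempfile.mkdtemp()) / uuid.uuid4().hex
root.mkdir(parents=True)

write_meta(root, {"model_name": "Base", "created_at": time.time()})
fresh = load_meta(root)  # readable: within the expiry window

write_meta(root, {"model_name": "Base", "created_at": time.time() - EXPIRY_SECONDS - 1})
stale = load_meta(root)  # None: past the expiry window
```

The temp-file + `os.replace` pattern matters here because concurrent endpoint calls may read `metadata.json` while another call rewrites it; a rename is atomic, so readers never see a half-written file.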
src/pipeline.py CHANGED
@@ -473,6 +473,7 @@ def _run_post_vad_pipeline(
         "ref_to": parse_ref(seg.matched_ref)[1],
         "matched_text": seg.matched_text or "",
         "confidence": round(seg.match_score, 3),
+        "has_missing_words": seg.has_missing_words,
         "potentially_undersegmented": seg.potentially_undersegmented,
         "error": seg.error
     }
@@ -721,6 +722,56 @@ def retranscribe_audio(
     return html, json_output, cached_speech_intervals, cached_is_complete, cached_audio, cached_sample_rate, cached_intervals, seg_dir, log_row
 
 
+def realign_audio(
+    intervals,
+    cached_audio, cached_sample_rate,
+    cached_speech_intervals, cached_is_complete,
+    model_name="Base", device="GPU",
+    cached_log_row=None,
+    request: gr.Request = None,
+    progress=gr.Progress(),
+):
+    """Run ASR + alignment on caller-provided intervals.
+
+    Same as retranscribe_audio but uses externally-provided intervals
+    instead of cached_intervals, bypassing VAD entirely.
+
+    Returns:
+        (html, json_output, cached_speech_intervals, cached_is_complete,
+         cached_audio, cached_sample_rate, intervals, segment_dir, log_row)
+    """
+    import time
+
+    if cached_audio is None:
+        return "<div>No cached data.</div>", None, None, None, None, None, None, None, None
+
+    device = device.lower()
+
+    from src.core.zero_gpu import reset_quota_flag, force_cpu_mode
+    reset_quota_flag()
+    if device == "cpu":
+        force_cpu_mode()
+
+    print(f"\n{'='*60}")
+    print(f"REALIGNING with {len(intervals)} custom timestamps, model={model_name}")
+    print(f"{'='*60}")
+
+    profiling = ProfilingData()
+    pipeline_start = time.time()
+
+    pct, desc = PROGRESS_RETRANSCRIBE["retranscribe"]
+    progress(pct, desc=desc.format(model=model_name))
+
+    html, json_output, seg_dir, log_row = _run_post_vad_pipeline(
+        cached_audio, cached_sample_rate, intervals,
+        model_name, device, profiling, pipeline_start, PROGRESS_RETRANSCRIBE,
+        progress=progress,
+        request=request, log_row=cached_log_row,
+    )
+
+    return html, json_output, cached_speech_intervals, cached_is_complete, cached_audio, cached_sample_rate, intervals, seg_dir, log_row
+
+
 def _retranscribe_wrapper(
     cached_intervals, cached_audio, cached_sample_rate,
     cached_speech_intervals, cached_is_complete,
src/ui/event_wiring.py CHANGED
@@ -3,7 +3,11 @@ import gradio as gr
 
 from src.pipeline import (
     process_audio, resegment_audio,
-    _retranscribe_wrapper, process_audio_json, save_json_export,
+    _retranscribe_wrapper, save_json_export,
+)
+from src.api.session_api import (
+    process_audio_session, resegment_session,
+    retranscribe_session, realign_from_timestamps,
 )
 from src.mfa import compute_mfa_timestamps
 from src.ui.handlers import (
@@ -418,11 +422,30 @@ def _wire_settings_restoration(app, c):
 
 
 def _wire_api_endpoint(c):
-    """Hidden API-only endpoint for JSON output."""
+    """Hidden API-only endpoints for session-based programmatic access."""
+    gr.Button(visible=False).click(
+        fn=process_audio_session,
+        inputs=[c.api_audio, c.api_silence, c.api_speech, c.api_pad,
+                c.api_model, c.api_device],
+        outputs=[c.api_result],
+        api_name="process_audio_session",
+    )
+    gr.Button(visible=False).click(
+        fn=resegment_session,
+        inputs=[c.api_audio_id, c.api_silence, c.api_speech, c.api_pad,
+                c.api_model, c.api_device],
+        outputs=[c.api_result],
+        api_name="resegment_session",
+    )
+    gr.Button(visible=False).click(
+        fn=retranscribe_session,
+        inputs=[c.api_audio_id, c.api_model, c.api_device],
+        outputs=[c.api_result],
+        api_name="retranscribe_session",
+    )
     gr.Button(visible=False).click(
-        fn=process_audio_json,
-        inputs=[c.audio_input, c.min_silence_slider, c.min_speech_slider,
-                c.pad_slider, c.model_radio, c.device_radio],
-        outputs=[c.output_json],
-        api_name="process_audio_json"
+        fn=realign_from_timestamps,
+        inputs=[c.api_audio_id, c.api_timestamps, c.api_model, c.api_device],
+        outputs=[c.api_result],
+        api_name="realign_from_timestamps",
     )
src/ui/interface.py CHANGED
@@ -67,6 +67,17 @@ def build_interface():
     c.cached_log_row = gr.State(value=None)
     c.resegment_panel_visible = gr.State(value=False)
 
+    # Session API components (hidden, API-only)
+    c.api_audio = gr.Audio(visible=False, type="numpy")
+    c.api_audio_id = gr.Textbox(visible=False)
+    c.api_silence = gr.Number(visible=False, precision=0)
+    c.api_speech = gr.Number(visible=False, precision=0)
+    c.api_pad = gr.Number(visible=False, precision=0)
+    c.api_model = gr.Textbox(visible=False)
+    c.api_device = gr.Textbox(visible=False)
+    c.api_timestamps = gr.JSON(visible=False)
+    c.api_result = gr.JSON(visible=False)
+
     wire_events(app, c)
 
     return app
tests/test_session_api.py ADDED
@@ -0,0 +1,122 @@
+"""Integration tests for session-based API endpoints.
+
+Requires the app to be running on localhost:7860.
+Start with: python app.py
+
+Run with: python -m pytest tests/test_session_api.py -v -s
+"""
+
+import pytest
+from gradio_client import Client
+
+SERVER_URL = "http://localhost:7860"
+AUDIO_FILE = "data/112.mp3"  # Surah Al-Ikhlas (~15s)
+
+
+@pytest.fixture(scope="module")
+def client():
+    return Client(SERVER_URL)
+
+
+@pytest.fixture(scope="module")
+def session(client):
+    """Run process_audio_session once, share audio_id across tests."""
+    result = client.predict(
+        AUDIO_FILE, 200, 1000, 100, "Base", "CPU",
+        api_name="/process_audio_session",
+    )
+    assert "audio_id" in result, f"Missing audio_id: {result}"
+    assert result["audio_id"] is not None
+    return result
+
+
+# -- 1. process_audio_session -----------------------------------------------
+
+def test_process_audio_session(session):
+    assert len(session["segments"]) > 0, "Expected at least one segment"
+    seg = session["segments"][0]
+    for field in ("segment", "time_from", "time_to", "ref_from", "ref_to",
+                  "matched_text", "confidence", "has_missing_words", "error"):
+        assert field in seg, f"Missing field: {field}"
+    assert seg["segment"] == 1
+    assert seg["time_from"] >= 0
+    assert seg["time_to"] > seg["time_from"]
+    assert 0 <= seg["confidence"] <= 1
+
+
+# -- 2. resegment_session ---------------------------------------------------
+
+def test_resegment_session(client, session):
+    audio_id = session["audio_id"]
+    result = client.predict(
+        audio_id, 600, 1500, 300, "Base", "CPU",
+        api_name="/resegment_session",
+    )
+    assert result["audio_id"] == audio_id
+    assert "segments" in result
+    assert len(result["segments"]) > 0
+
+
+# -- 3. retranscribe_session ------------------------------------------------
+
+def test_retranscribe_session(client, session):
+    audio_id = session["audio_id"]
+    result = client.predict(
+        audio_id, "Large", "CPU",
+        api_name="/retranscribe_session",
+    )
+    assert result["audio_id"] == audio_id
+    assert len(result["segments"]) > 0
+
+
+# -- 4. retranscribe guard --------------------------------------------------
+
+def test_retranscribe_guard(client, session):
+    """Same model + same boundaries should return error."""
+    audio_id = session["audio_id"]
+    result = client.predict(
+        audio_id, "Large", "CPU",
+        api_name="/retranscribe_session",
+    )
+    assert "error" in result
+    assert result["segments"] == []
+
+
+# -- 5. realign_from_timestamps ---------------------------------------------

+def test_realign_from_timestamps(client, session):
+    audio_id = session["audio_id"]
+    timestamps = [
+        {"start": 0.5, "end": 3.0},
+        {"start": 3.5, "end": 6.0},
+    ]
+    result = client.predict(
+        audio_id, timestamps, "Base", "CPU",
+        api_name="/realign_from_timestamps",
+    )
+    assert result["audio_id"] == audio_id
+    assert len(result["segments"]) == 2
+
+
+# -- 6. invalid audio_id ----------------------------------------------------
+
+def test_invalid_audio_id(client):
+    result = client.predict(
+        "00000000000000000000000000000000", "Base", "CPU",
+        api_name="/retranscribe_session",
+    )
+    assert "error" in result
+    assert "not found" in result["error"].lower() or "expired" in result["error"].lower()
+    assert result["segments"] == []
+
+
+# -- 7. resegment after realign (session still valid) -----------------------
+
+def test_resegment_after_realign(client, session):
+    audio_id = session["audio_id"]
+    result = client.predict(
+        audio_id, 200, 1000, 100, "Base", "CPU",
+        api_name="/resegment_session",
+    )
+    assert result["audio_id"] == audio_id
+    assert len(result["segments"]) > 0