Running on Zero
Add session-based API endpoints for stateless client access
Implement 4 endpoints (process_audio_session, resegment_session,
retranscribe_session, realign_from_timestamps) that persist session
data to /tmp/aligner_sessions so gradio_client consumers can reuse
cached audio and VAD results across follow-up calls without gr.State.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
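
The four endpoints can be driven from `gradio_client`. A hedged sketch follows: the Space id is a placeholder, the `api_name` routes are assumed to match the function names (the actual routes depend on how `event_wiring.py` registers them), and the 300/500/100 VAD values are example parameters, not defaults from the repo.

```python
import json


def realign_payload(intervals):
    """Serialize (start, end) second pairs into the timestamps JSON
    that realign_from_timestamps accepts."""
    return json.dumps([{"start": float(s), "end": float(e)} for s, e in intervals])


def run_session_flow(space_id: str, wav_path: str):
    """One full call plus a cached follow-up against a live Space.
    Parameter order mirrors the process_audio_session signature:
    (audio, min_silence_ms, min_speech_ms, pad_ms, model_name, device)."""
    from gradio_client import Client, handle_file  # requires a reachable Space

    client = Client(space_id)  # e.g. "owner/quran-aligner" (hypothetical id)
    first = client.predict(
        handle_file(wav_path), 300, 500, 100, "Base", "GPU",
        api_name="/process_audio_session",  # assumed route
    )
    audio_id = first["audio_id"]  # server caches audio + VAD under this id for ~5 h
    # Follow-up reuses the cached session: no re-upload, no re-run of VAD
    return client.predict(
        audio_id, realign_payload([(0.0, 4.2), (4.2, 9.7)]),
        "Base", "GPU",
        api_name="/realign_from_timestamps",  # assumed route
    )
```

`run_session_flow` is only a shape illustration; the payload helper is the part worth reusing, since `realign_from_timestamps` parses exactly this `[{"start": …, "end": …}]` form.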
- .gitignore +0 -1
- CLAUDE.md +166 -65
- config.py +7 -0
- src/api/__init__.py +0 -0
- src/api/session_api.py +276 -0
- src/pipeline.py +51 -0
- src/ui/event_wiring.py +30 -7
- src/ui/interface.py +11 -0
- tests/test_session_api.py +122 -0
.gitignore
CHANGED
@@ -49,6 +49,5 @@ test_api.py
 data/api_result.json

 CLAUDE.md
-inference_optimization.md

 docs/
CLAUDE.md
CHANGED
@@ -2,83 +2,184 @@
 
 This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
 
+**Keep this file up to date.** After any file/folder structure change, update the tree below without asking. After implementing features or making architectural changes, suggest additions to this file explaining why they would help future context.
+
 ## Project Overview
 
 Quran recitation alignment tool that segments audio recordings and aligns them with Quranic text using phoneme-based ASR and dynamic programming. Deployed as a Hugging Face Space with Gradio.
 
 **Pipeline:** Audio → Preprocessing (16kHz mono) → VAD Segmentation → Phoneme ASR (wav2vec2) → Special Segment Detection (Basmala/Isti'adha) → N-gram Anchor Voting → DP Alignment → Word-level Timestamps (optional via external MFA) → UI Rendering
 
-[… 59 deleted lines; their text is not recoverable from this render …]
+## Commands
+
+```bash
+# Run locally
+python app.py          # Start on port 7860
+python app.py --share  # With public HF link
+
+# Build Cython DP extension (auto-attempted on startup, falls back to pure Python)
+python setup.py build_ext --inplace
+
+# Rebuild data caches (run offline, not during serving)
+python scripts/build_phoneme_cache.py
+python scripts/build_phoneme_ngram_index.py
+```
+
+## File Tree
+
+```
+├── app.py             # ~85 lines — Bootstrap only: path setup, Cython build, build_interface(), model preloading
+├── config.py          # All constants, hyperparameters, model paths, presets, UI settings, debug flags
+├── align_config.py    # Override config for constrained (known-surah) alignment (tighter windows, no debug)
+├── setup.py           # Cython build for _dp_core.pyx
+├── requirements.txt   # Pinned deps: torch 2.8, transformers 5.0, gradio >=6.5.1
+│
+├── src/
+│   ├── pipeline.py    # GPU-decorated pipeline: VAD+ASR leases, post-VAD alignment, process/resegment/retranscribe/realign
+│   ├── mfa.py         # MFA forced-alignment: upload to external Space, SSE polling, timestamp injection into HTML
+│   │
+│   ├── api/
+│   │   └── session_api.py   # Session persistence + 4 endpoint wrappers (process/resegment/retranscribe/realign)
+│   │
+│   ├── core/
+│   │   ├── segment_types.py # Dataclasses: VadSegment, SegmentInfo, ProfilingData (50+ timing fields)
+│   │   ├── quran_index.py   # QuranIndex: dual-script word lookup (QPC Hafs for compute, DigitalKhatt for display)
+│   │   ├── zero_gpu.py      # @gpu_with_fallback decorator: ZeroGPU quota detection, automatic CPU fallback
+│   │   └── usage_logger.py  # HF Dataset logging: ParquetScheduler, audio embedding, error JSONL fallback
+│   │
+│   ├── alignment/
+│   │   ├── alignment_pipeline.py    # Orchestrator: sequential alignment with retry tiers, re-anchoring, chapter transitions
+│   │   ├── phoneme_asr.py           # wav2vec2 CTC inference with dynamic batching (duration-based, padding waste minimization)
+│   │   ├── phoneme_anchor.py        # N-gram rarity-weighted voting: determines chapter/verse anchor point
+│   │   ├── phoneme_matcher.py       # Substring Levenshtein DP with word-boundary constraints and position prior
+│   │   ├── _dp_core.pyx             # Cython DP inner loop (10-20x speedup), pure Python fallback
+│   │   ├── special_segments.py      # Basmala/Isti'adha detection via phoneme edit distance (threshold 0.35)
+│   │   ├── phoneme_matcher_cache.py # Pre-loads ChapterReference objects from phoneme_cache.pkl
+│   │   ├── ngram_index.py           # PhonemeNgramIndex dataclass, loaded from pickle
+│   │   └── phonemizer_utils.py      # Singleton wrapper for Quranic Phonemizer
+│   │
+│   ├── segmenter/
+│   │   ├── segmenter_model.py # VAD model lifecycle: load, GPU/CPU movement, device management
+│   │   ├── segmenter_aoti.py  # Ahead-of-time compilation via torch.export for ZeroGPU persistence
+│   │   └── vad.py             # VAD inference: detect_speech_segments() with interval cleaning
+│   │
+│   └── ui/
+│       ├── interface.py    # build_interface(): Gradio Blocks layout, component definitions, state components
+│       ├── event_wiring.py # Connects Gradio component events to handlers and pipeline functions
+│       ├── handlers.py     # Python callbacks: preset buttons, slider wiring, animation mode changes
+│       ├── segments.py     # Segment card HTML rendering: confidence badges, verse markers, audio players
+│       ├── styles.py       # CSS: fonts, segment cards, confidence colors, mega card, animation UI
+│       ├── js_config.py    # Python→JS bridge: exports config as window.* globals, concatenates JS files
+│       └── static/
+│           ├── animation-core.js # Per-segment animation: audio warmup, element caching, window opacity engine, tick loop
+│           └── animate-all.js    # Mega card: builds unified text flow, deduplicates shared words, click-to-seek
+│
+├── data/
+│   ├── phoneme_cache.pkl            # 7.9MB — Pre-phonemized Quran text (114 chapters)
+│   ├── phoneme_ngram_index_5.pkl    # 6.2MB — 5-gram index for anchor voting
+│   ├── phoneme_sub_costs.json       # Custom phoneme substitution cost matrix
+│   ├── digital_khatt_v2_script.json # 14.8MB — Full Quran text with positional metadata
+│   ├── qpc_hafs.json                # QPC Hafs Quran text (computational reference)
+│   ├── surah_info.json              # Chapter metadata (names, verse counts)
+│   ├── ligatures.json               # Surah name ligature mappings for DigitalKhatt font
+│   ├── font_data.py                 # Base64-encoded Arabic fonts for offline rendering
+│   ├── DigitalKhattV2.otf           # Arabic Quran font
+│   └── surah-name-v2.ttf            # Surah name ligature font
+│
+├── scripts/
+│   ├── build_phoneme_cache.py       # Generate phoneme_cache.pkl from Quran text
+│   ├── build_phoneme_ngram_index.py # Generate phoneme_ngram_index_5.pkl from cache
+│   ├── export_onnx.py               # Export models to ONNX format
+│   ├── add_open_tanween.py          # Text preprocessing: add open tanween marks
+│   └── fix_stop_sign_spacing.py     # Text preprocessing: fix stop sign spacing
+│
+├── tests/
+│   └── test_session_api.py          # Integration tests for session API (requires running server)
+│
+├── docs/
+│   ├── api.md                       # API endpoint documentation (current + planned)
+│   ├── client_api.md                # Client-side API docs
+│   └── usage-logging.md             # Usage logging schema and design
+│
+└── usage_logs/errors/               # Runtime error JSONL files (fallback when Hub upload fails)
+```
+
+## Architecture Principles
+
+**`app.py` must stay minimal** (~85 lines). It only bootstraps: path setup, Cython build, `build_interface()`, and model preloading. All logic lives in `src/`.
+
+**All constants go in `config.py`.** Model paths, thresholds, window sizes, edit costs, UI settings, presets, slider ranges, debug flags — everything configurable lives here. Never hardcode magic numbers in module code.
+
+## DP Alignment Algorithm
+
+The core alignment (`phoneme_matcher.py`) uses **substring Levenshtein DP** with word-boundary constraints to find where ASR phonemes best match within the Quran reference:
+
+1. **Windowed search:** A window of `LOOKBACK_WORDS` (15) before and `LOOKAHEAD_WORDS` (10) after the current pointer defines the search region. Pre-flattened phoneme arrays avoid per-segment rebuilds.
+2. **Word-boundary constraints:** DP start positions must align with word boundaries (INF cost elsewhere). Only word-end positions are evaluated as candidates.
+3. **Position prior:** Adds `START_PRIOR_WEIGHT` (0.005) penalty per word away from the expected position, biasing sequential matching.
+4. **Edit costs:** Substitution (1.0), insertion (1.0), deletion (0.8). Custom substitution costs from `phoneme_sub_costs.json` for phonetically similar pairs.
+5. **Scoring:** `normalized_edit_distance + position_prior`. Confidence = `1 - normalized_distance`.
+6. **Cython acceleration:** `_dp_core.pyx` provides 10-20x speedup for the inner loop. Falls back to pure Python if not compiled.
+
+### Special Cases
+
+- **Basmala/Isti'adha detection** (`special_segments.py`): Before main alignment, checks first segments against hardcoded phoneme sequences using edit distance (threshold 0.35). If a combined Isti'adha+Basmala is detected in one segment, it splits at the midpoint.
+- **Fused Basmala:** After chapter transitions, tries prepending Basmala phonemes to the first verse segment and compares confidence with plain alignment. Picks the better match.
+- **N-gram anchor voting** (`phoneme_anchor.py`): Extracts 5-grams from ASR output, looks up in pre-built index, weights by `1/count` (rarity). Finds best contiguous ayah run, trims edges below 15% of max weight.
+- **Graduated retry on failure** (`alignment_pipeline.py`):
+  - Tier 1: Expanded window (60 lookback, 40 lookahead), same threshold
+  - Tier 2: Expanded window + relaxed threshold (0.45)
+- **Re-anchoring:** After 2 consecutive failures (`MAX_CONSECUTIVE_FAILURES`), runs n-gram voting on remaining segments to jump to a new position within the surah.
+- **Chapter transitions:** When the pointer exceeds chapter end, detects inter-chapter specials and moves to the next chapter. After Surah 1, triggers global re-anchor.
+
+## Animation System
+
+Two animation modes, both driven by `requestAnimationFrame` tick loops matching `audio.currentTime` to word/character timestamps:
+
+### Per-Segment Animation (`animation-core.js`)
+Each segment card has an "Animate" button. On click: builds word/char element caches from `.word`/`.char` spans, activates lazy audio, starts RAF loop. The tick function uses a **fast path** (check current word → next word, covers ~99% of frames) with full-scan fallback for seeking.
+
+### Mega Card Animation (`animate-all.js`)
+"Animate All" builds a **unified text flow** from all segment cards: clones word elements, deduplicates shared positions (overlapping segment boundaries), inserts surah separators with ligature font, handles fused Basmala prefixes. Uses a single `<audio>` element for the full recording. Segment transitions are boundary-driven (when `currentTime >= segEndTime`, advance to next segment's tick loop).
+
+### Window Opacity Engine
+Both modes use the same windowing system: configurable prev/after word counts with opacity gradients. Display modes (Reveal, Fade, Spotlight, Isolate, Consume, Custom) are presets that set opacity + window size. Verse-only mode hides all words outside the current verse. Settings persist to `localStorage`.
+
+Click-to-seek in mega card: click a word → find its segment from timing, reset highlights, seek unified audio.
+
+## Profiling & Performance
+
+**Always consider performance when adding features.** The `ProfilingData` dataclass tracks 50+ timing fields across every pipeline stage: resampling, VAD (model load, inference, GPU time), ASR (per-batch timing, padding waste), anchor detection, DP alignment (per-segment min/max/avg), retry counts, result building, and audio encoding.
+
+Key optimizations to maintain:
+- **Dynamic batching** (ASR): Groups segments by duration to minimize padding waste (max 15%). Tracks `pad_waste` per batch.
+- **Pre-flattened phoneme arrays** (DP): Chapter references pre-concatenate all word phonemes with offset mapping, avoiding per-segment array construction.
+- **Lazy audio loading** (UI): Audio elements use `data-src` with a play button; `<audio>` controls only activate on click. First 5 segments use `preload="auto"`.
+- **Audio warmup** (JS): `pointerdown` event primes AudioContext + silent WAV before first play.
+- **RAF fast path** (animation): Checks current/next word index before falling back to full scan.
+- **Cython DP core**: 10-20x speedup for the alignment inner loop.
+- **AoT compilation** (ZeroGPU): Compiles VAD model ahead-of-time for persistence across GPU leases.
+
+## Audio & Temp Storage
+
+Audio files use HF Spaces' `/tmp` directory. `SEGMENT_AUDIO_DIR = /tmp/segments`. Per-segment WAVs are written to a UUID-keyed subdirectory for each run. Full recording WAV is written separately for mega card playback. Gradio's `allowed_paths=["/tmp"]` enables serving these files. Cache cleanup runs every 5 hours (`DELETE_CACHE_FREQUENCY`), deleting files older than 5 hours.
+
+Audio preprocessing: resample to 16kHz mono via librosa (`soxr_lq` for speed), normalize int16/int32/float32 → float32, stereo → mono by averaging.
+
+## Models
 
 | Model | ID | Purpose |
 |-------|----|---------|
 | VAD | `obadx/recitation-segmenter-v2` | Voice activity detection |
 | ASR Base | `hetchyy/r15_95m` | Phoneme recognition (95M params) |
-| ASR Large | `hetchyy/r7` | Phoneme recognition (higher accuracy, slower) |
+| ASR Large | `hetchyy/r7` | Phoneme recognition (higher accuracy, 3x slower) |
 | MFA | External Space `hetchyy-quran-phoneme-mfa` | Word-level forced alignment |
 
-##
+## Key Patterns
 
-- **State caching:** Preprocessed audio, VAD intervals, and segment boundaries are cached in
-- **
-- **
+- **State caching:** Preprocessed audio, raw VAD intervals, and segment boundaries are cached in `gr.State` to allow resegment/retranscribe without re-uploading or re-running VAD.
+- **GPU quota management:** `@gpu_with_fallback` decorator detects ZeroGPU quota exhaustion, parses reset time, falls back to CPU with `gr.Warning()` toast.
+- **Idempotent model movement:** `ensure_models_on_gpu()`/`ensure_models_on_cpu()` check current device before moving.
 - **Confidence scoring:** Green ≥80%, Yellow 60-79%, Red <60%.
-- **
+- **Dual-script Quran text:** QPC Hafs for phoneme computation, DigitalKhatt for display rendering (proper Arabic typography with verse markers as combining marks).
+- **Usage logging:** Alignment runs logged to HF Dataset via ParquetScheduler. Audio embedded as bytes. Error fallback to local JSONL.
config.py
CHANGED
@@ -23,6 +23,13 @@ AUDIO_PRELOAD_COUNT = 5  # First N segments use preload="auto"
 DELETE_CACHE_FREQUENCY = 3600*5  # Gradio cache cleanup interval (seconds)
 DELETE_CACHE_AGE = 3600*5        # Delete cached files older than this (seconds)
 
+# =============================================================================
+# Session API settings
+# =============================================================================
+
+SESSION_DIR = Path("/tmp/aligner_sessions")  # Per-session cached data (audio, VAD, metadata)
+SESSION_EXPIRY_SECONDS = 3600*5              # 5 hours — matches DELETE_CACHE_AGE
+
 # =============================================================================
 # Model and data paths
 # =============================================================================
src/api/__init__.py
ADDED
File without changes
src/api/session_api.py
ADDED
|
@@ -0,0 +1,276 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Session-based API: persistence layer + endpoint wrappers.
|
| 2 |
+
|
| 3 |
+
Sessions store preprocessed audio and VAD data in /tmp so that
|
| 4 |
+
follow-up calls (resegment, retranscribe, realign) skip expensive
|
| 5 |
+
re-uploads and re-inference.
|
| 6 |
+
"""
|
| 7 |
+
|
| 8 |
+
import hashlib
|
| 9 |
+
import json
|
| 10 |
+
import os
|
| 11 |
+
import re
|
| 12 |
+
import shutil
|
| 13 |
+
import time
|
| 14 |
+
import uuid
|
| 15 |
+
|
| 16 |
+
import numpy as np
|
| 17 |
+
|
| 18 |
+
from config import SESSION_DIR, SESSION_EXPIRY_SECONDS
|
| 19 |
+
|
| 20 |
+
# ---------------------------------------------------------------------------
|
| 21 |
+
# Session manager
|
| 22 |
+
# ---------------------------------------------------------------------------
|
| 23 |
+
|
| 24 |
+
_last_cleanup_time = 0.0
|
| 25 |
+
_CLEANUP_INTERVAL = 1800 # sweep at most every 30 min
|
| 26 |
+
|
| 27 |
+
_VALID_ID = re.compile(r"^[0-9a-f]{32}$")
|
| 28 |
+
|
| 29 |
+
|
| 30 |
+
def _session_dir(audio_id: str):
|
| 31 |
+
return SESSION_DIR / audio_id
|
| 32 |
+
|
| 33 |
+
|
| 34 |
+
def _validate_id(audio_id: str) -> bool:
|
| 35 |
+
return isinstance(audio_id, str) and bool(_VALID_ID.match(audio_id))
|
| 36 |
+
|
| 37 |
+
|
| 38 |
+
def _is_expired(meta: dict) -> bool:
|
| 39 |
+
return (time.time() - meta.get("created_at", 0)) > SESSION_EXPIRY_SECONDS
|
| 40 |
+
|
| 41 |
+
|
| 42 |
+
def _read_metadata(session_path):
|
| 43 |
+
meta_path = session_path / "metadata.json"
|
| 44 |
+
if not meta_path.exists():
|
| 45 |
+
return None
|
| 46 |
+
with open(meta_path) as f:
|
| 47 |
+
return json.load(f)
|
| 48 |
+
|
| 49 |
+
|
| 50 |
+
def _write_metadata(session_path, meta: dict):
|
| 51 |
+
"""Atomic write via temp file + os.replace."""
|
| 52 |
+
tmp = session_path / "metadata.tmp"
|
| 53 |
+
with open(tmp, "w") as f:
|
| 54 |
+
json.dump(meta, f)
|
| 55 |
+
os.replace(tmp, session_path / "metadata.json")
|
| 56 |
+
|
| 57 |
+
|
| 58 |
+
def _sweep_expired():
|
| 59 |
+
"""Delete expired session directories (runs at most every 30 min)."""
|
| 60 |
+
global _last_cleanup_time
|
| 61 |
+
now = time.time()
|
| 62 |
+
if now - _last_cleanup_time < _CLEANUP_INTERVAL:
|
| 63 |
+
return
|
| 64 |
+
_last_cleanup_time = now
|
| 65 |
+
if not SESSION_DIR.exists():
|
| 66 |
+
return
|
| 67 |
+
for entry in SESSION_DIR.iterdir():
|
| 68 |
+
if not entry.is_dir():
|
| 69 |
+
continue
|
| 70 |
+
meta = _read_metadata(entry)
|
| 71 |
+
if meta is None or _is_expired(meta):
|
| 72 |
+
shutil.rmtree(entry, ignore_errors=True)
|
| 73 |
+
|
| 74 |
+
|
| 75 |
+
def _intervals_hash(intervals) -> str:
|
| 76 |
+
return hashlib.md5(json.dumps(intervals).encode()).hexdigest()
|
| 77 |
+
|
| 78 |
+
|
| 79 |
+
def create_session(audio, speech_intervals, is_complete, intervals, model_name):
|
| 80 |
+
"""Persist session data and return audio_id (32-char hex UUID)."""
|
| 81 |
+
_sweep_expired()
|
| 82 |
+
audio_id = uuid.uuid4().hex
|
| 83 |
+
path = _session_dir(audio_id)
|
| 84 |
+
path.mkdir(parents=True, exist_ok=True)
|
| 85 |
+
|
| 86 |
+
np.save(path / "audio.npy", audio)
|
| 87 |
+
np.save(path / "speech_intervals.npy", speech_intervals)
|
| 88 |
+
|
| 89 |
+
meta = {
|
| 90 |
+
"is_complete": bool(is_complete),
|
| 91 |
+
"intervals": intervals,
|
| 92 |
+
"model_name": model_name,
|
| 93 |
+
"intervals_hash": _intervals_hash(intervals),
|
| 94 |
+
"created_at": time.time(),
|
| 95 |
+
}
|
| 96 |
+
_write_metadata(path, meta)
|
| 97 |
+
return audio_id
|
| 98 |
+
|
| 99 |
+
|
| 100 |
+
def load_session(audio_id):
|
| 101 |
+
"""Load session data. Returns dict or None if missing/expired/invalid."""
|
| 102 |
+
if not _validate_id(audio_id):
|
| 103 |
+
return None
|
| 104 |
+
path = _session_dir(audio_id)
|
| 105 |
+
if not path.exists():
|
| 106 |
+
return None
|
| 107 |
+
meta = _read_metadata(path)
|
| 108 |
+
if meta is None or _is_expired(meta):
|
| 109 |
+
shutil.rmtree(path, ignore_errors=True)
|
| 110 |
+
return None
|
| 111 |
+
|
| 112 |
+
audio = np.load(path / "audio.npy")
|
| 113 |
+
speech_intervals = np.load(path / "speech_intervals.npy")
|
| 114 |
+
|
| 115 |
+
return {
|
| 116 |
+
"audio": audio,
|
| 117 |
+
"speech_intervals": speech_intervals,
|
| 118 |
+
"is_complete": meta["is_complete"],
|
| 119 |
+
"intervals": meta["intervals"],
|
| 120 |
+
"model_name": meta["model_name"],
|
| 121 |
+
"intervals_hash": meta.get("intervals_hash", ""),
|
| 122 |
+
"audio_id": audio_id,
|
| 123 |
+
}
|
| 124 |
+
|
| 125 |
+
|
| 126 |
+
def update_session(audio_id, *, intervals=None, model_name=None):
|
| 127 |
+
"""Update mutable session fields (intervals, model_name)."""
|
| 128 |
+
path = _session_dir(audio_id)
|
| 129 |
+
meta = _read_metadata(path)
|
| 130 |
+
if meta is None:
|
| 131 |
+
return
|
| 132 |
+
if intervals is not None:
|
| 133 |
+
meta["intervals"] = intervals
|
| 134 |
+
meta["intervals_hash"] = _intervals_hash(intervals)
|
| 135 |
+
if model_name is not None:
|
| 136 |
+
meta["model_name"] = model_name
|
| 137 |
+
_write_metadata(path, meta)
|
| 138 |
+
|
| 139 |
+
|
| 140 |
+
# ---------------------------------------------------------------------------
|
| 141 |
+
# Response formatting
|
| 142 |
+
# ---------------------------------------------------------------------------
|
| 143 |
+
|
| 144 |
+
_SESSION_ERROR = {"error": "Session not found or expired", "segments": []}
|
| 145 |
+
|
| 146 |
+
|
| 147 |
+
def _format_response(audio_id, json_output):
|
| 148 |
+
"""Convert pipeline json_output to the documented API response schema."""
|
| 149 |
+
segments = []
|
| 150 |
+
for seg in json_output.get("segments", []):
|
| 151 |
+
segments.append({
|
| 152 |
+
"segment": seg["segment"],
|
| 153 |
+
"time_from": seg["time_from"],
|
| 154 |
+
"time_to": seg["time_to"],
|
| 155 |
+
"ref_from": seg["ref_from"],
|
| 156 |
+
"ref_to": seg["ref_to"],
|
| 157 |
+
"matched_text": seg["matched_text"],
|
| 158 |
+
"confidence": seg["confidence"],
|
| 159 |
+
"has_missing_words": seg.get("has_missing_words", False),
|
| 160 |
+
"error": seg["error"],
|
| 161 |
+
})
|
| 162 |
+
return {"audio_id": audio_id, "segments": segments}
|
| 163 |
+
|
| 164 |
+
|
| 165 |
+
# ---------------------------------------------------------------------------
|
| 166 |
+
# Endpoint wrappers
|
| 167 |
+
# ---------------------------------------------------------------------------
|
| 168 |
+
|
| 169 |
+
def process_audio_session(audio_data, min_silence_ms, min_speech_ms, pad_ms,
|
| 170 |
+
model_name="Base", device="GPU"):
|
| 171 |
+
"""Full pipeline: preprocess -> VAD -> ASR -> alignment. Creates session."""
|
| 172 |
+
from src.pipeline import process_audio
|
| 173 |
+
|
| 174 |
+
result = process_audio(
|
| 175 |
+
audio_data, int(min_silence_ms), int(min_speech_ms), int(pad_ms),
|
| 176 |
+
model_name, device,
|
| 177 |
+
)
|
| 178 |
+
# result is a 9-tuple:
|
| 179 |
+
# (html, json_output, speech_intervals, is_complete, audio, sr, intervals, seg_dir, log_row)
|
| 180 |
+
json_output = result[1]
|
| 181 |
+
if json_output is None:
|
| 182 |
+
return {"error": "No speech detected in audio", "segments": []}
|
| 183 |
+
|
| 184 |
+
speech_intervals = result[2]
|
| 185 |
+
is_complete = result[3]
|
| 186 |
+
audio = result[4]
|
| 187 |
+
intervals = result[6]
|
| 188 |
+
|
| 189 |
+
audio_id = create_session(
|
| 190 |
+
audio, speech_intervals, is_complete, intervals, model_name,
|
| 191 |
+
)
|
| 192 |
+
return _format_response(audio_id, json_output)
|
| 193 |
+
|
| 194 |
+
|
| 195 |
+
def resegment_session(audio_id, min_silence_ms, min_speech_ms, pad_ms,
|
| 196 |
+
model_name="Base", device="GPU"):
|
| 197 |
+
"""Re-clean VAD boundaries with new params and re-run ASR + alignment."""
|
| 198 |
+
session = load_session(audio_id)
|
| 199 |
+
if session is None:
|
| 200 |
+
return _SESSION_ERROR
|
| 201 |
+
|
| 202 |
+
from src.pipeline import resegment_audio
|
| 203 |
+
|
| 204 |
+
result = resegment_audio(
|
| 205 |
+
session["speech_intervals"], session["is_complete"],
|
| 206 |
+
session["audio"], 16000,
|
| 207 |
+
int(min_silence_ms), int(min_speech_ms), int(pad_ms),
|
| 208 |
+
model_name, device,
|
| 209 |
+
)
|
| 210 |
+
json_output = result[1]
|
| 211 |
+
if json_output is None:
|
| 212 |
+
return {"audio_id": audio_id, "error": "No segments with these settings", "segments": []}
|
| 213 |
+
|
| 214 |
+
new_intervals = result[6]
|
| 215 |
+
update_session(audio_id, intervals=new_intervals, model_name=model_name)
|
    return _format_response(audio_id, json_output)


def retranscribe_session(audio_id, model_name="Base", device="GPU"):
    """Re-run ASR with a different model on current segment boundaries."""
    session = load_session(audio_id)
    if session is None:
        return _SESSION_ERROR

    # Guard: reject if model and boundaries unchanged
    if (model_name == session["model_name"]
            and _intervals_hash(session["intervals"]) == session["intervals_hash"]):
        return {
            "audio_id": audio_id,
            "error": "Model and boundaries unchanged. Change model_name or call /resegment_session first.",
            "segments": [],
        }

    from src.pipeline import retranscribe_audio

    result = retranscribe_audio(
        session["intervals"],
        session["audio"], 16000,
        session["speech_intervals"], session["is_complete"],
        model_name, device,
    )
    json_output = result[1]
    if json_output is None:
        return {"audio_id": audio_id, "error": "Retranscription failed", "segments": []}

    update_session(audio_id, model_name=model_name)
    return _format_response(audio_id, json_output)


def realign_from_timestamps(audio_id, timestamps, model_name="Base", device="GPU"):
    """Run ASR + alignment on caller-provided timestamp intervals."""
    session = load_session(audio_id)
    if session is None:
        return _SESSION_ERROR

    # Parse timestamps: accept list of {"start": f, "end": f} dicts
    if isinstance(timestamps, str):
        timestamps = json.loads(timestamps)

    intervals = [(ts["start"], ts["end"]) for ts in timestamps]

    from src.pipeline import realign_audio

    result = realign_audio(
        intervals,
        session["audio"], 16000,
        session["speech_intervals"], session["is_complete"],
        model_name, device,
    )
    json_output = result[1]
    if json_output is None:
        return {"audio_id": audio_id, "error": "Alignment failed", "segments": []}

    new_intervals = result[6]
    update_session(audio_id, intervals=new_intervals, model_name=model_name)
    return _format_response(audio_id, json_output)
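The guard in `retranscribe_session` above rejects a call when neither the model nor the segment boundaries changed, by comparing a stored hash against a fresh `_intervals_hash` of the current intervals. The helper itself is defined earlier in `session_api.py`, outside this hunk; the sketch below is a hypothetical stand-alone version showing how such a fingerprint can work — the millisecond rounding is an assumption, not necessarily the Space's actual behavior:

```python
import hashlib
import json


def intervals_hash(intervals):
    """Stable fingerprint of segment boundaries (illustrative sketch).

    Identical (start, end) lists always map to the same hex digest, so a
    retranscribe call with unchanged boundaries can be rejected cheaply
    without comparing the full interval arrays.
    """
    # Round to milliseconds so float noise from serialization round-trips
    # does not defeat the comparison (assumed tolerance, see lead-in).
    canonical = [[round(float(s), 3), round(float(e), 3)] for s, e in intervals]
    payload = json.dumps(canonical, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


a = intervals_hash([(0.5, 3.0), (3.5, 6.0)])
b = intervals_hash([[0.5, 3.0], [3.5, 6.0]])  # tuples vs lists: same digest
c = intervals_hash([(0.5, 3.0), (3.5, 6.1)])  # changed boundary: new digest
```

Hashing a canonical JSON encoding rather than the raw objects keeps the digest stable across tuple/list differences and is what makes the "boundaries unchanged" check safe after a session has been persisted to disk and reloaded.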
src/pipeline.py
CHANGED

@@ -473,6 +473,7 @@ def _run_post_vad_pipeline(
                 "ref_to": parse_ref(seg.matched_ref)[1],
                 "matched_text": seg.matched_text or "",
                 "confidence": round(seg.match_score, 3),
+                "has_missing_words": seg.has_missing_words,
                 "potentially_undersegmented": seg.potentially_undersegmented,
                 "error": seg.error
             }
@@ -721,6 +722,56 @@ def retranscribe_audio(
     return html, json_output, cached_speech_intervals, cached_is_complete, cached_audio, cached_sample_rate, cached_intervals, seg_dir, log_row


+def realign_audio(
+    intervals,
+    cached_audio, cached_sample_rate,
+    cached_speech_intervals, cached_is_complete,
+    model_name="Base", device="GPU",
+    cached_log_row=None,
+    request: gr.Request = None,
+    progress=gr.Progress(),
+):
+    """Run ASR + alignment on caller-provided intervals.
+
+    Same as retranscribe_audio but uses externally-provided intervals
+    instead of cached_intervals, bypassing VAD entirely.
+
+    Returns:
+        (html, json_output, cached_speech_intervals, cached_is_complete,
+         cached_audio, cached_sample_rate, intervals, segment_dir, log_row)
+    """
+    import time
+
+    if cached_audio is None:
+        return "<div>No cached data.</div>", None, None, None, None, None, None, None, None
+
+    device = device.lower()
+
+    from src.core.zero_gpu import reset_quota_flag, force_cpu_mode
+    reset_quota_flag()
+    if device == "cpu":
+        force_cpu_mode()
+
+    print(f"\n{'='*60}")
+    print(f"REALIGNING with {len(intervals)} custom timestamps, model={model_name}")
+    print(f"{'='*60}")
+
+    profiling = ProfilingData()
+    pipeline_start = time.time()
+
+    pct, desc = PROGRESS_RETRANSCRIBE["retranscribe"]
+    progress(pct, desc=desc.format(model=model_name))
+
+    html, json_output, seg_dir, log_row = _run_post_vad_pipeline(
+        cached_audio, cached_sample_rate, intervals,
+        model_name, device, profiling, pipeline_start, PROGRESS_RETRANSCRIBE,
+        progress=progress,
+        request=request, log_row=cached_log_row,
+    )
+
+    return html, json_output, cached_speech_intervals, cached_is_complete, cached_audio, cached_sample_rate, intervals, seg_dir, log_row
+
+
 def _retranscribe_wrapper(
     cached_intervals, cached_audio, cached_sample_rate,
     cached_speech_intervals, cached_is_complete,
src/ui/event_wiring.py
CHANGED

@@ -3,7 +3,11 @@ import gradio as gr

 from src.pipeline import (
     process_audio, resegment_audio,
-    _retranscribe_wrapper,
+    _retranscribe_wrapper, save_json_export,
+)
+from src.api.session_api import (
+    process_audio_session, resegment_session,
+    retranscribe_session, realign_from_timestamps,
 )
 from src.mfa import compute_mfa_timestamps
 from src.ui.handlers import (
@@ -418,11 +422,30 @@ def _wire_settings_restoration(app, c):


 def _wire_api_endpoint(c):
-    """Hidden API-only …
+    """Hidden API-only endpoints for session-based programmatic access."""
+    gr.Button(visible=False).click(
+        fn=process_audio_session,
+        inputs=[c.api_audio, c.api_silence, c.api_speech, c.api_pad,
+                c.api_model, c.api_device],
+        outputs=[c.api_result],
+        api_name="process_audio_session",
+    )
+    gr.Button(visible=False).click(
+        fn=resegment_session,
+        inputs=[c.api_audio_id, c.api_silence, c.api_speech, c.api_pad,
+                c.api_model, c.api_device],
+        outputs=[c.api_result],
+        api_name="resegment_session",
+    )
+    gr.Button(visible=False).click(
+        fn=retranscribe_session,
+        inputs=[c.api_audio_id, c.api_model, c.api_device],
+        outputs=[c.api_result],
+        api_name="retranscribe_session",
+    )
     gr.Button(visible=False).click(
-        fn=…
-        inputs=[c.…
-        …
-        api_name="process_audio_json"
+        fn=realign_from_timestamps,
+        inputs=[c.api_audio_id, c.api_timestamps, c.api_model, c.api_device],
+        outputs=[c.api_result],
+        api_name="realign_from_timestamps",
     )
src/ui/interface.py
CHANGED

@@ -67,6 +67,17 @@ def build_interface():
     c.cached_log_row = gr.State(value=None)
     c.resegment_panel_visible = gr.State(value=False)

+    # Session API components (hidden, API-only)
+    c.api_audio = gr.Audio(visible=False, type="numpy")
+    c.api_audio_id = gr.Textbox(visible=False)
+    c.api_silence = gr.Number(visible=False, precision=0)
+    c.api_speech = gr.Number(visible=False, precision=0)
+    c.api_pad = gr.Number(visible=False, precision=0)
+    c.api_model = gr.Textbox(visible=False)
+    c.api_device = gr.Textbox(visible=False)
+    c.api_timestamps = gr.JSON(visible=False)
+    c.api_result = gr.JSON(visible=False)
+
     wire_events(app, c)

     return app
tests/test_session_api.py
ADDED

@@ -0,0 +1,122 @@
"""Integration tests for session-based API endpoints.

Requires the app to be running on localhost:7860.
Start with: python app.py

Run with: python -m pytest tests/test_session_api.py -v -s
"""

import pytest
from gradio_client import Client

SERVER_URL = "http://localhost:7860"
AUDIO_FILE = "data/112.mp3"  # Surah Al-Ikhlas (~15s)


@pytest.fixture(scope="module")
def client():
    return Client(SERVER_URL)


@pytest.fixture(scope="module")
def session(client):
    """Run process_audio_session once, share audio_id across tests."""
    result = client.predict(
        AUDIO_FILE, 200, 1000, 100, "Base", "CPU",
        api_name="/process_audio_session",
    )
    assert "audio_id" in result, f"Missing audio_id: {result}"
    assert result["audio_id"] is not None
    return result


# -- 1. process_audio_session -----------------------------------------------

def test_process_audio_session(session):
    assert len(session["segments"]) > 0, "Expected at least one segment"
    seg = session["segments"][0]
    for field in ("segment", "time_from", "time_to", "ref_from", "ref_to",
                  "matched_text", "confidence", "has_missing_words", "error"):
        assert field in seg, f"Missing field: {field}"
    assert seg["segment"] == 1
    assert seg["time_from"] >= 0
    assert seg["time_to"] > seg["time_from"]
    assert 0 <= seg["confidence"] <= 1


# -- 2. resegment_session ---------------------------------------------------

def test_resegment_session(client, session):
    audio_id = session["audio_id"]
    result = client.predict(
        audio_id, 600, 1500, 300, "Base", "CPU",
        api_name="/resegment_session",
    )
    assert result["audio_id"] == audio_id
    assert "segments" in result
    assert len(result["segments"]) > 0


# -- 3. retranscribe_session ------------------------------------------------

def test_retranscribe_session(client, session):
    audio_id = session["audio_id"]
    result = client.predict(
        audio_id, "Large", "CPU",
        api_name="/retranscribe_session",
    )
    assert result["audio_id"] == audio_id
    assert len(result["segments"]) > 0


# -- 4. retranscribe guard --------------------------------------------------

def test_retranscribe_guard(client, session):
    """Same model + same boundaries should return error."""
    audio_id = session["audio_id"]
    result = client.predict(
        audio_id, "Large", "CPU",
        api_name="/retranscribe_session",
    )
    assert "error" in result
    assert result["segments"] == []


# -- 5. realign_from_timestamps ---------------------------------------------

def test_realign_from_timestamps(client, session):
    audio_id = session["audio_id"]
    timestamps = [
        {"start": 0.5, "end": 3.0},
        {"start": 3.5, "end": 6.0},
    ]
    result = client.predict(
        audio_id, timestamps, "Base", "CPU",
        api_name="/realign_from_timestamps",
    )
    assert result["audio_id"] == audio_id
    assert len(result["segments"]) == 2


# -- 6. invalid audio_id ----------------------------------------------------

def test_invalid_audio_id(client):
    result = client.predict(
        "00000000000000000000000000000000", "Base", "CPU",
        api_name="/retranscribe_session",
    )
    assert "error" in result
    assert "not found" in result["error"].lower() or "expired" in result["error"].lower()
    assert result["segments"] == []


# -- 7. resegment after realign (session still valid) -----------------------

def test_resegment_after_realign(client, session):
    audio_id = session["audio_id"]
    result = client.predict(
        audio_id, 200, 1000, 100, "Base", "CPU",
        api_name="/resegment_session",
    )
    assert result["audio_id"] == audio_id
    assert len(result["segments"]) > 0