hetchyy committed
Commit 2ce56b1 · 1 Parent(s): fb6ec07

Add timestamps API endpoints
.gitignore CHANGED
@@ -50,5 +50,6 @@ models/
 captures/
 
 docs/api.md
+docs/lease_duration_history.md
 scripts/
 tests/
docs/client_api.md CHANGED
@@ -32,6 +32,15 @@ result = client.predict(
     "Base", "GPU",
     api_name="/realign_from_timestamps"
 )
+
+# Compute MFA word timestamps (uses stored session segments)
+mfa = client.predict(audio_id, None, "words", api_name="/mfa_timestamps_session")
+
+# Compute MFA word + letter timestamps
+mfa = client.predict(audio_id, None, "words+chars", api_name="/mfa_timestamps_session")
+
+# Direct MFA timestamps (no session needed)
+mfa = client.predict("recitation.mp3", result["segments"], "words", api_name="/mfa_timestamps_direct")
 ```
 
 ---
@@ -48,6 +57,7 @@ The first call returns an `audio_id` (32-character hex string). Pass it to subse
 | Raw VAD speech intervals | Disk (pickle) | No |
 | Cleaned segment boundaries | Disk (JSON) | Yes (resegment / realign) |
 | Model name | Disk (JSON) | Yes (retranscribe) |
+| Alignment segments | Disk (JSON) | Yes (any alignment call) |
 
 If `audio_id` is missing, expired, or invalid:
 ```json
@@ -173,6 +183,9 @@ All errors follow the same shape: `{"error": "...", "segments": []}`. Endpoints
 | Retranscribe with same model | `"Model and boundaries unchanged. Change model_name or call /resegment_session first."` | Yes |
 | Retranscription failed | `"Retranscription failed"` | Yes |
 | Realignment failed | `"Alignment failed"` | Yes |
+| No segments in session (MFA) | `"No segments found in session"` | Yes |
+| MFA alignment failed | `"MFA alignment failed: ..."` | Yes (session) / No (direct) |
+| No segments provided (MFA direct) | `"No segments provided"` | No |
 
 ---
 
@@ -237,3 +250,129 @@ Accepts arbitrary `(start, end)` timestamp pairs and runs ASR + alignment on eac
 **Response:** Same shape as `/process_audio_session`. Session boundaries are replaced with the provided timestamps.
 
 This endpoint subsumes split, merge, and boundary adjustment — the client computes the desired timestamps locally and sends them in one call.
+
+---
+
+### `POST /mfa_timestamps_session`
+
+Compute word-level (and optionally letter-level) MFA timestamps using session audio. Segments come from the stored session or can be overridden.
+
+| Parameter | Type | Default | Description |
+|---|---|---|---|
+| `audio_id` | str | required | Session ID from a previous alignment call |
+| `segments` | list? | `null` | Segment list to align. `null` uses stored segments from the session |
+| `granularity` | str | `"words"` | `"words"` (word timestamps only) or `"words+chars"` (word + letter timestamps) |
+
+**Example — using stored segments:**
+```python
+result = client.predict(
+    "a1b2c3d4e5f67890a1b2c3d4e5f67890",  # audio_id
+    None,                                # segments (null = use stored)
+    "words",                             # granularity
+    api_name="/mfa_timestamps_session",
+)
+```
+
+**Example — with segments override (minimal):**
+```python
+result = client.predict(
+    "a1b2c3d4e5f67890a1b2c3d4e5f67890",  # audio_id
+    [                                    # segments override
+        {"time_from": 0.48, "time_to": 2.88, "ref_from": "112:1:1", "ref_to": "112:1:4"},
+        {"time_from": 3.12, "time_to": 5.44, "ref_from": "112:2:1", "ref_to": "112:2:3"},
+    ],
+    "words+chars",                       # granularity
+    api_name="/mfa_timestamps_session",
+)
+```
+
+**Example — passing alignment results directly:**
+```python
+# Segments from /process_audio_session can be passed as-is
+proc = client.predict("recitation.mp3", 200, 1000, 100, "Base", "CPU", api_name="/process_audio_session")
+mfa = client.predict(proc["audio_id"], proc["segments"], "words+chars", api_name="/mfa_timestamps_session")
+```
+
+**Example — special segment (Basmala):**
+```python
+# Special segments use empty ref_from/ref_to and carry a special_type field
+{"time_from": 0.0, "time_to": 2.1, "ref_from": "", "ref_to": "", "special_type": "Basmala"}
+```
+
+**Segment input fields:**
+
+| Field | Type | Required | Description |
+|---|---|---|---|
+| `time_from` | float | yes | Start time in seconds (used to slice audio) |
+| `time_to` | float | yes | End time in seconds (used to slice audio) |
+| `ref_from` | str | yes | First word as `"surah:ayah:word"`. Empty for special segments |
+| `ref_to` | str | yes | Last word as `"surah:ayah:word"`. Empty for special segments |
+| `segment` | int | no | 1-indexed segment number. Auto-assigned from position if omitted |
+| `confidence` | float | no | Defaults to 1.0. Segments with confidence ≤ 0 are skipped |
+| `special_type` | str | no | Only for special segments (`"Basmala"`, `"Isti'adha"`, etc.) |
+| `matched_text` | str | no | Quran text. Used for fused Basmala/Isti'adha prefix detection |
+
+> **Tip:** You can pass the `segments` array from any alignment endpoint directly — all extra fields are preserved and echoed back in the response.
+
+**Response:**
+```json
+{
+  "audio_id": "a1b2c3d4e5f67890a1b2c3d4e5f67890",
+  "segments": [
+    {
+      "segment": 1,
+      "words": [
+        ["112:1:1", 0.0, 0.32],
+        ["112:1:2", 0.32, 0.58],
+        ["112:1:3", 0.58, 1.12],
+        ["112:1:4", 1.12, 1.68]
+      ]
+    }
+  ]
+}
+```
+
+With `granularity="words+chars"`, each word includes a 4th element — letter timestamps:
+```json
+["112:1:1", 0.0, 0.32, [["ق", 0.0, 0.15], ["ل", 0.15, 0.32]]]
+```
+
+**Word array:** `[location, start, end]` or `[location, start, end, letters]`
+
+| Index | Type | Description |
+|---|---|---|
+| 0 | str | Word position as `"surah:ayah:word"` |
+| 1 | float | Start time relative to segment (seconds) |
+| 2 | float | End time relative to segment (seconds) |
+| 3 | list? | Only present when `granularity="words+chars"`. Array of `[char, start, end]` tuples |
+
+> **Note:** All timestamps are **relative to the segment** (not to the full recording). Add `time_from` to convert to absolute times.
+
+---
+
+### `POST /mfa_timestamps_direct`
+
+Compute MFA timestamps with a provided audio file and segments. No session required — standalone endpoint.
+
+| Parameter | Type | Default | Description |
+|---|---|---|---|
+| `audio` | file | required | Audio file (any format) |
+| `segments` | list | required | Segment list with `time_from`/`time_to` boundaries |
+| `granularity` | str | `"words"` | `"words"` or `"words+chars"` |
+
+**Response:** Same shape as `/mfa_timestamps_session` but without `audio_id`.
+
+**Example (minimal):**
+```python
+result = client.predict(
+    "recitation.mp3",
+    [
+        {"time_from": 0.48, "time_to": 2.88, "ref_from": "112:1:1", "ref_to": "112:1:4"},
+        {"time_from": 3.12, "time_to": 5.44, "ref_from": "112:2:1", "ref_to": "112:2:3"},
+    ],
+    "words+chars",
+    api_name="/mfa_timestamps_direct",
+)
+```
+
+Segment input format is the same as for `/mfa_timestamps_session` — see [segment input fields](#segment-input-fields) above.
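The note above says MFA timestamps are segment-relative. As a quick illustration, a small client-side helper can shift them to absolute recording times — a hypothetical sketch, not part of the API: `to_absolute` and its argument names are invented here, and it assumes the response segments come back in the same order as the submitted list.

```python
def to_absolute(sent_segments, mfa_result):
    """Shift segment-relative MFA times to absolute recording times.

    sent_segments: the segment list you submitted (carries time_from offsets).
    mfa_result: the endpoint response, assumed to be in the same segment order.
    """
    shifted = []
    for seg_in, seg_out in zip(sent_segments, mfa_result["segments"]):
        offset = seg_in["time_from"]
        words = []
        for loc, start, end, *rest in seg_out["words"]:
            entry = [loc, start + offset, end + offset]
            if rest:  # letter tuples from "words+chars" are shifted too
                entry.append([[c, s + offset, e + offset] for c, s, e in rest[0]])
            words.append(entry)
        shifted.append({"segment": seg_out["segment"], "words": words})
    return shifted
```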
src/api/session_api.py CHANGED
@@ -149,6 +149,30 @@ def update_session(audio_id, *, intervals=None, model_name=None):
     os.replace(tmp, meta_path)
 
 
+def _save_segments(audio_id, segments):
+    """Persist alignment segments for later MFA use."""
+    path = _session_dir(audio_id)
+    if not path.exists():
+        return
+    seg_path = path / "segments.json"
+    tmp = path / "segments.tmp"
+    with open(tmp, "w") as f:
+        json.dump(segments, f)
+    os.replace(tmp, seg_path)
+
+
+def _load_segments(audio_id):
+    """Load stored segments. Returns list or None."""
+    if not _validate_id(audio_id):
+        return None
+    path = _session_dir(audio_id)
+    seg_path = path / "segments.json"
+    if not seg_path.exists():
+        return None
+    with open(seg_path) as f:
+        return json.load(f)
+
+
 # ---------------------------------------------------------------------------
 # Response formatting
 # ---------------------------------------------------------------------------
@@ -174,6 +198,7 @@ def _format_response(audio_id, json_output, warning=None):
         if seg.get("special_type"):
             entry["special_type"] = seg["special_type"]
         segments.append(entry)
+    _save_segments(audio_id, segments)
     resp = {"audio_id": audio_id, "segments": segments}
     if warning:
         resp["warning"] = warning
@@ -338,3 +363,146 @@ def realign_from_timestamps(audio_id, timestamps, model_name="Base", device="GPU
     new_intervals = result[6]
     update_session(audio_id, intervals=new_intervals, model_name=model_name)
     return _format_response(audio_id, json_output, warning=quota_warning)
+
+
+# ---------------------------------------------------------------------------
+# MFA timestamp helpers
+# ---------------------------------------------------------------------------
+
+def _preprocess_api_audio(audio_data):
+    """Convert audio input to 16kHz mono float32 numpy array.
+
+    Handles file path (str) and Gradio numpy tuple (sample_rate, array).
+    Returns (audio_np, sample_rate).
+    """
+    import librosa
+    from config import RESAMPLE_TYPE
+
+    if isinstance(audio_data, str):
+        audio, sr = librosa.load(audio_data, sr=16000, mono=True, res_type=RESAMPLE_TYPE)
+        return audio, 16000
+
+    sample_rate, audio = audio_data
+    if audio.dtype == np.int16:
+        audio = audio.astype(np.float32) / 32768.0
+    elif audio.dtype == np.int32:
+        audio = audio.astype(np.float32) / 2147483648.0
+    if len(audio.shape) > 1:
+        audio = audio.mean(axis=1)
+    if sample_rate != 16000:
+        audio = librosa.resample(audio, orig_sr=sample_rate, target_sr=16000, res_type=RESAMPLE_TYPE)
+        sample_rate = 16000
+    return audio, sample_rate
+
+
+def _create_segment_wavs(audio_np, sample_rate, segments):
+    """Slice audio by segment boundaries and write WAV files.
+
+    Returns the temp directory path containing seg_0.wav, seg_1.wav, etc.
+    """
+    import tempfile
+    import soundfile as sf
+
+    seg_dir = tempfile.mkdtemp(prefix="mfa_api_")
+    for seg in segments:
+        seg_idx = seg.get("segment", 0) - 1
+        time_from = seg.get("time_from", 0)
+        time_to = seg.get("time_to", 0)
+        start_sample = int(time_from * sample_rate)
+        end_sample = int(time_to * sample_rate)
+        segment_audio = audio_np[start_sample:end_sample]
+        wav_path = os.path.join(seg_dir, f"seg_{seg_idx}.wav")
+        sf.write(wav_path, segment_audio, sample_rate)
+    return seg_dir
+
+
+def _normalize_segments(segments):
+    """Fill defaults so callers can pass minimal segment dicts (timestamps + refs).
+
+    Auto-assigns ``segment`` numbers and defaults ``confidence`` to 1.0 so
+    segments are not accidentally skipped by ``_build_mfa_refs``.
+    """
+    normalized = []
+    for i, seg in enumerate(segments):
+        entry = dict(seg)
+        if "segment" not in entry:
+            entry["segment"] = i + 1
+        if "confidence" not in entry:
+            entry["confidence"] = 1.0
+        if "matched_text" not in entry:
+            entry["matched_text"] = ""
+        normalized.append(entry)
+    return normalized
+
+
+# ---------------------------------------------------------------------------
+# MFA timestamp endpoints
+# ---------------------------------------------------------------------------
+
+def mfa_timestamps_session(audio_id, segments_json=None, granularity="words"):
+    """Compute MFA word/letter timestamps using session audio."""
+    session = load_session(audio_id)
+    if session is None:
+        return _SESSION_ERROR
+
+    # Parse segments: use provided or load stored
+    if isinstance(segments_json, str):
+        segments_json = json.loads(segments_json)
+
+    if segments_json:
+        segments = _normalize_segments(segments_json)
+    else:
+        segments = _load_segments(audio_id)
+        if not segments:
+            return {"audio_id": audio_id, "error": "No segments found in session", "segments": []}
+
+    # Create segment WAVs from session audio
+    try:
+        seg_dir = _create_segment_wavs(session["audio"], 16000, segments)
+    except Exception as e:
+        return {"audio_id": audio_id, "error": f"Failed to create segment audio: {e}", "segments": []}
+
+    from src.mfa import compute_mfa_timestamps_api
+    try:
+        result = compute_mfa_timestamps_api(segments, seg_dir, granularity or "words")
+    except Exception as e:
+        return {"audio_id": audio_id, "error": f"MFA alignment failed: {e}", "segments": []}
+
+    result["audio_id"] = audio_id
+    return result
+
+
+def mfa_timestamps_direct(audio_data, segments_json, granularity="words"):
+    """Compute MFA word/letter timestamps with provided audio and segments."""
+    # Parse segments
+    if isinstance(segments_json, str):
+        segments_json = json.loads(segments_json)
+
+    if not segments_json:
+        return {"error": "No segments provided", "segments": []}
+
+    segments = _normalize_segments(segments_json)
+
+    # Preprocess audio
+    try:
+        audio_np, sr = _preprocess_api_audio(audio_data)
+    except Exception as e:
+        return {"error": f"Failed to preprocess audio: {e}", "segments": []}
+
+    # Create segment WAVs
+    try:
+        seg_dir = _create_segment_wavs(audio_np, sr, segments)
+    except Exception as e:
+        return {"error": f"Failed to create segment audio: {e}", "segments": []}
+
+    from src.mfa import compute_mfa_timestamps_api
+    try:
+        result = compute_mfa_timestamps_api(segments, seg_dir, granularity or "words")
+    except Exception as e:
+        return {"error": f"MFA alignment failed: {e}", "segments": []}
+
+    return result
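Every endpoint in this file reports failures in-band as `{"error": "...", "segments": []}` rather than raising. A thin client-side wrapper can promote those payloads to exceptions — a hypothetical sketch: `predict_or_raise` is not part of this codebase, and `client` stands for any object with a `gradio_client`-style `predict` method.

```python
def predict_or_raise(client, *args, api_name):
    """Call an endpoint and raise if it returned an in-band error payload."""
    result = client.predict(*args, api_name=api_name)
    if isinstance(result, dict) and result.get("error"):
        raise RuntimeError(f"{api_name}: {result['error']}")
    return result
```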
src/mfa.py CHANGED
@@ -5,6 +5,10 @@ from config import MFA_SPACE_URL, MFA_TIMEOUT, MFA_PROGRESS_SEGMENT_RATE
 # Lowercase special ref names for case-insensitive matching
 _SPECIAL_REFS = {"basmala", "isti'adha", "isti'adha+basmala"}
 
 
 def _mfa_upload_and_submit(refs, audio_paths):
     """Upload audio files and submit alignment batch to the MFA Space.
@@ -95,6 +99,395 @@ def _mfa_wait_result(event_id, headers, base):
     return parsed["results"]
 
 
 def _ts_progress_bar_html(total_segments, rate, animated=True):
     """Return HTML for a progress bar showing Segment x/N.
 
@@ -149,6 +542,10 @@ def _ts_progress_bar_html(total_segments, rate, animated=True):
     </div>'''
 
 
 def compute_mfa_timestamps(current_html, json_output, segment_dir, cached_log_row=None):
     """Compute word-level timestamps via MFA forced alignment and inject into HTML.
 
@@ -169,61 +566,11 @@ def compute_mfa_timestamps(current_html, json_output, segment_dir, cached_log_ro
         yield current_html, gr.update(), gr.update(), gr.update(), gr.update()
         return
 
-    # Build refs and audio paths from structured JSON output
     segments = json_output.get("segments", []) if json_output else []
     print(f"[MFA_TS] {len(segments)} segments in JSON")
-    refs = []
-    audio_paths = []
-    seg_to_result_idx = {}  # Maps segment index (0-based) → result index
-
-    _BASMALA_TEXT = "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيم"
-    _ISTIATHA_TEXT = "أَعُوذُ بِٱللَّهِ مِنَ الشَّيْطَانِ الرَّجِيم"
-    _COMBINED_PREFIX = _ISTIATHA_TEXT + " ۝ " + _BASMALA_TEXT
-
-    for seg in segments:
-        ref_from = seg.get("ref_from", "")
-        ref_to = seg.get("ref_to", "")
-        seg_idx = seg.get("segment", 0) - 1  # 0-indexed
-        confidence = seg.get("confidence", 0)
-
-        # For special segments (Basmala/Isti'adha), ref_from is empty but
-        # special_type carries the ref name needed for MFA
-        if not ref_from:
-            ref_from = seg.get("special_type", "")
-            ref_to = ref_from  # Special segments use same ref for both
-        if not ref_from or confidence <= 0:
-            continue
-
-        # Build MFA ref
-        if ref_from == ref_to:
-            mfa_ref = ref_from
-        else:
-            mfa_ref = f"{ref_from}-{ref_to}"
-
-        # Detect fused special prefix and build compound ref
-        # (skip when the ref itself is already a special like "Basmala")
-        _is_special_ref = ref_from.strip().lower() in _SPECIAL_REFS
-        if not _is_special_ref:
-            matched_text = seg.get("matched_text", "")
-            if matched_text.startswith(_COMBINED_PREFIX):
-                mfa_ref = f"Isti'adha+Basmala+{mfa_ref}"
-            elif matched_text.startswith(_ISTIATHA_TEXT):
-                mfa_ref = f"Isti'adha+{mfa_ref}"
-            elif matched_text.startswith(_BASMALA_TEXT):
-                mfa_ref = f"Basmala+{mfa_ref}"
-
-        # Check audio file exists
-        audio_path = os.path.join(segment_dir, f"seg_{seg_idx}.wav") if segment_dir else None
-        if not audio_path or not os.path.exists(audio_path):
-            print(f"[MFA_TS] Skipping seg {seg_idx}: audio not found at {audio_path}")
-            continue
-
-        # Track mapping from segment index to result index
-        seg_to_result_idx[seg_idx] = len(refs)
-        refs.append(mfa_ref)
-        audio_paths.append(audio_path)
-
-    print(f"[MFA_TS] {len(refs)} refs to align: {refs[:5]}{'...' if len(refs) > 5 else ''}")
 
     if not refs:
         print("[MFA_TS] Early return: no valid refs/audio pairs")
@@ -282,217 +629,29 @@ def compute_mfa_timestamps(current_html, json_output, segment_dir, cached_log_ro
         )
         raise
 
-    # Build lookup: "result_idx:location" → (start, end) from all successful results
-    # Using result_idx prefix ensures each segment has its own timestamps even for shared words
-    word_timestamps = {}  # "result_idx:location" → (start, end)
-    letter_timestamps = {}  # "result_idx:location" → list of letter dicts with group_id
-    word_to_all_results = {}  # word_pos → [result_idx, ...] (all occurrences)
-
-    def _assign_letter_groups(letters, word_location):
-        """Assign group_id to letters sharing identical (start, end) timestamps."""
-        if not letters:
-            return []
-        result = []
-        group_id = 0
-        prev_ts = None
-        for letter in letters:
-            ts = (letter.get("start"), letter.get("end"))
-            if ts != prev_ts:
-                group_id += 1
-                prev_ts = ts
-            result.append({
-                "char": letter.get("char", ""),
-                "start": letter.get("start"),
-                "end": letter.get("end"),
-                "group_id": f"{word_location}:{group_id}",  # Unique across words
-            })
-        return result
-
-    for result_idx, result in enumerate(results):
-        if result.get("status") != "ok":
-            print(f"[MFA_TS] Segment failed: ref={result.get('ref')} error={result.get('error')}")
-            continue
-        ref = result.get("ref", "")
-        is_special = ref.strip().lower() in _SPECIAL_REFS
-        is_fused = "+" in ref
-        for word in result.get("words", []):
-            loc = word.get("location", "")
-            if loc:
-                if is_special:
-                    base_key = f"{ref}:{loc}"
-                elif is_fused and loc.startswith("0:0:"):
-                    base_key = f"{ref}:{loc}"
-                else:
-                    base_key = loc
-                key = f"{result_idx}:{base_key}"  # Prefix with result index
-                word_timestamps[key] = (word["start"], word["end"])
-                # Extract letter timestamps if available
-                letters = word.get("letters")
-                if letters:
-                    letter_timestamps[key] = _assign_letter_groups(letters, loc)
-                # Track word→result_idx mapping for lookup (regular words only)
-                if not is_special and not (is_fused and loc.startswith("0:0:")):
-                    if loc not in word_to_all_results:
-                        word_to_all_results[loc] = []
-                    word_to_all_results[loc].append(result_idx)
-
-    print(f"[MFA_TS] {len(word_timestamps)} word timestamps collected, {len(letter_timestamps)} with letter-level data")
-
-    # Build cross-word overlap groups for simultaneous highlighting
-    def _build_crossword_groups(results_list, letter_ts_dict):
-        """
-        Build mapping of (key, letter_idx) -> cross-word group_id.
-        Only checks word boundaries: last letter(s) of word N vs first letter(s) of word N+1.
-        """
-        crossword_groups = {}  # (key, idx) -> group_id
-
-        for result_idx, result in enumerate(results_list):
-            if result.get("status") != "ok":
-                continue
-            ref = result.get("ref", "")
-            is_special = ref.strip().lower() in _SPECIAL_REFS
-            is_fused = "+" in ref
-            words = result.get("words", [])
-
-            # Iterate through consecutive word pairs
-            for word_i in range(len(words) - 1):
-                word_a = words[word_i]
-                word_b = words[word_i + 1]
-
-                loc_a = word_a.get("location", "")
-                loc_b = word_b.get("location", "")
-                if not loc_a or not loc_b:
-                    continue
-
-                # Build keys for letter_timestamps lookup
-                def make_key(loc):
-                    if is_special:
-                        base_key = f"{ref}:{loc}"
-                    elif is_fused and loc.startswith("0:0:"):
-                        base_key = f"{ref}:{loc}"
-                    else:
-                        base_key = loc
-                    return f"{result_idx}:{base_key}"
-
-                key_a = make_key(loc_a)
-                key_b = make_key(loc_b)
-                letters_a = letter_ts_dict.get(key_a, [])
-                letters_b = letter_ts_dict.get(key_b, [])
-
-                if not letters_a or not letters_b:
-                    continue
-
-                # Compare last letter(s) of word A with first letter(s) of word B
-                # Check last few letters of A against first few letters of B
-                for idx_a in range(len(letters_a) - 1, max(len(letters_a) - 3, -1), -1):
-                    letter_a = letters_a[idx_a]
-                    if letter_a.get("start") is None or letter_a.get("end") is None:
-                        continue
-                    for idx_b in range(min(3, len(letters_b))):
-                        letter_b = letters_b[idx_b]
-                        if letter_b.get("start") is None or letter_b.get("end") is None:
-                            continue
-                        # Check for exact timestamp match (MFA marks simultaneous letters identically)
-                        if letter_a["start"] == letter_b["start"] and letter_a["end"] == letter_b["end"]:
-                            group_id = f"xword-{result_idx}-{word_i}"
-                            crossword_groups[(key_a, idx_a)] = group_id
-                            crossword_groups[(key_b, idx_b)] = group_id
-
-        if crossword_groups:
-            print(f"[MFA_TS] Found {len(crossword_groups)} cross-word overlapping letters")
-
-        return crossword_groups
 
     crossword_groups = _build_crossword_groups(results, letter_timestamps)
 
-    # Post-process: extend each word's end to the start of the next word
-    # so words don't disappear between timestamps during animation.
-    import wave
-    for seg in segments:
-        ref_from = seg.get("ref_from", "")
-        ref_to = seg.get("ref_to", "")
-        seg_idx = seg.get("segment", 0) - 1
-        confidence = seg.get("confidence", 0)
-        if not ref_from:
-            ref_from = seg.get("special_type", "")
-            ref_to = ref_from  # Special segments use same ref for both
-        if not ref_from or confidence <= 0:
-            continue
-        # Get result_idx for this segment (may not exist if segment was skipped)
-        result_idx = seg_to_result_idx.get(seg_idx)
-        if result_idx is None:
-            continue
-        # Find the matching MFA result and collect word locations in order
-        ref_key = f"{ref_from}-{ref_to}" if ref_from != ref_to else ref_from
-        is_special = ref_from.strip().lower() in _SPECIAL_REFS
-        # Reconstruct compound ref for fused segments
-        # (skip when the ref itself is already a special like "Basmala")
-        if not is_special:
-            matched_text = seg.get("matched_text", "")
-            if matched_text.startswith(_COMBINED_PREFIX):
-                ref_key = f"Isti'adha+Basmala+{ref_key}"
-            elif matched_text.startswith(_ISTIATHA_TEXT):
-                ref_key = f"Isti'adha+{ref_key}"
-            elif matched_text.startswith(_BASMALA_TEXT):
-                ref_key = f"Basmala+{ref_key}"
-        is_fused = "+" in ref_key
-        seg_word_locs = []
-        for result in results:
-            if result.get("ref") == ref_key and result.get("status") == "ok":
-                for w in result.get("words", []):
-                    loc = w.get("location", "")
-                    if loc:
-                        if is_special:
-                            base_key = f"{ref_key}:{loc}"
-                        elif is_fused and loc.startswith("0:0:"):
-                            base_key = f"{ref_key}:{loc}"
-                        else:
-                            base_key = loc
-                        key = f"{result_idx}:{base_key}"  # Use result_idx prefix
-                        if key in word_timestamps:
-                            seg_word_locs.append(key)
-                break
-        if not seg_word_locs:
-            continue
-        # Extend each word's end to the next word's start
-        for i in range(len(seg_word_locs) - 1):
-            cur_start, cur_end = word_timestamps[seg_word_locs[i]]
-            nxt_start, _ = word_timestamps[seg_word_locs[i + 1]]
-            if nxt_start > cur_end:
-                word_timestamps[seg_word_locs[i]] = (cur_start, nxt_start)
-        # Extend first word back to time 0 so highlight starts immediately
-        first_loc = seg_word_locs[0]
-        first_start, first_end = word_timestamps[first_loc]
-        if first_start > 0:
-            word_timestamps[first_loc] = (0, first_end)
-        # Extend last word to segment audio duration
-        last_loc = seg_word_locs[-1]
-        last_start, last_end = word_timestamps[last_loc]
-        audio_path = os.path.join(segment_dir, f"seg_{seg_idx}.wav") if segment_dir else None
-        if audio_path and os.path.exists(audio_path):
-            with wave.open(audio_path, 'rb') as wf:
-                seg_duration = wf.getnframes() / wf.getframerate()
-            if seg_duration > last_end:
-                word_timestamps[last_loc] = (last_start, seg_duration)
 
-    print(f"[MFA_TS] Post-processed timestamps: extended word ends to fill gaps")
 
     # Inject timestamps into word spans, using segment boundaries to determine result_idx
-    # Step 1: Find all segment boundaries (position → seg_idx)
-    seg_boundaries = []  # [(position, seg_idx), ...]
     for m in re.finditer(r'data-segment-idx="(\d+)"', current_html):
         seg_boundaries.append((m.start(), int(m.group(1))))
     seg_boundaries.sort(key=lambda x: x[0])
 
-    # Build segment offset lookup: seg_idx → time_from (for absolute timestamp conversion)
-    seg_offset_map = {}  # seg_idx (0-based) → time_from
     for seg in segments:
-        idx = seg.get("segment", 0) - 1  # Convert to 0-based
         seg_offset_map[idx] = seg.get("time_from", 0)
 
-    # Step 2: For each word span, find which segment it belongs to
     def _get_seg_idx_at_pos(pos):
-        """Find the segment index for a position in the HTML."""
         seg_idx = None
         for boundary_pos, idx in seg_boundaries:
             if boundary_pos > pos:
@@ -508,13 +667,10 @@ compute_mfa_timestamps(current_html, json_output, segment_dir, cached_log_ro
         if not pos_m:
             return orig
         pos = pos_m.group(1)
-        # Find which segment this word belongs to
         seg_idx = _get_seg_idx_at_pos(m.start())
         if seg_idx is None:
            return orig
-        # Get expected result_idx for this segment
        expected_result_idx = seg_to_result_idx.get(seg_idx)
-        # For regular words, use word-based mapping to find correct result_idx
        result_idx = None
        if pos and not pos.startswith("0:0:"):
            candidates = word_to_all_results.get(pos, [])
@@ -529,16 +685,13 @@ compute_mfa_timestamps(current_html, json_output, segment_dir, cached_log_ro
                result_idx = expected_result_idx
        if result_idx is None:
            return orig
-        # Use result_idx prefix to get segment-specific timestamp
        key = f"{result_idx}:{pos}"
        ts = word_timestamps.get(key)
        if not ts:
            return orig
-        # Convert relative timestamps to absolute by adding segment offset
        seg_offset = seg_offset_map.get(seg_idx, 0)
        abs_start = ts[0] + seg_offset
        abs_end = ts[1] + seg_offset
-        # Include result_idx so char-level injection can find letter timestamps
        return orig[:-1] + f' data-result-idx="{result_idx}" data-start="{abs_start:.4f}" data-end="{abs_end:.4f}">'
 
    html = re.sub(word_open_re, _inject_word_ts, current_html)
@@ -551,19 +704,16 @@ def compute_mfa_timestamps(current_html, json_output, segment_dir, cached_log_ro

  def _stamp_chars_with_mfa(word_m):
  word_open = word_m.group(1)
- word_abs_start = float(word_m.group(2)) # data-start (already correctly injected)
  inner = word_m.group(4)

- # Extract data-pos from word tag
  pos_m = re.search(r'data-pos="([^"]+)"', word_open)
  word_pos = pos_m.group(1) if pos_m else None

- # Find result_idx from word tag's data-result-idx if available, else use mapping
  result_idx_m = re.search(r'data-result-idx="(\d+)"', word_open)
  if result_idx_m:
  result_idx = int(result_idx_m.group(1))
  else:
- # Fallback: use word-based mapping to find correct result_idx
  result_idx = None
  if word_pos and not word_pos.startswith("0:0:"):
  candidates = word_to_all_results.get(word_pos, [])
@@ -571,51 +721,42 @@ def compute_mfa_timestamps(current_html, json_output, segment_dir, cached_log_ro
  if len(candidates) == 1:
  result_idx = candidates[0]
  else:
- # Without position info, just take the first candidate
  result_idx = candidates[0]

  key = f"{result_idx}:{word_pos}" if result_idx is not None and word_pos else None

- # Look up word's relative start from MFA to calculate offset
  word_ts = word_timestamps.get(key) if key else None
  mfa_letters = letter_timestamps.get(key) if key else None
  if not mfa_letters or not word_ts:
  return word_m.group(0)

- word_rel_start = word_ts[0] # Word's relative start from MFA

  char_matches = list(re.finditer(r'<span class="char">([^<]*)</span>', inner))
  if not char_matches:
  return word_m.group(0)

- # Match MFA letters to HTML chars (no NFC — base-char comparison instead)
  mfa_chars = [l["char"] for l in mfa_letters]
  html_chars = [m.group(1).replace('\u0640', '') for m in char_matches]

- # Allowed character mappings (MFA char → HTML char)
- # ى (alef maksura) ↔ ي (ya) are visually similar and interchangeable
  CHAR_EQUIVALENTS = {
- 'ى': 'ي', # alef maksura → ya
- 'ي': 'ى', # ya → alef maksura
  }

  def _first_base(s):
- """First non-combining character after NFD decomposition."""
  for c in unicodedata.normalize("NFD", s):
  if not unicodedata.category(c).startswith('M'):
  return c
  return s[0] if s else ''

  def chars_match(mfa_c, html_c, log_substitution=False):
- """Check if MFA char matches HTML char, including allowed equivalents."""
  if mfa_c == html_c or html_c in mfa_c or mfa_c in html_c:
  return True
- # Check allowed equivalents
  if CHAR_EQUIVALENTS.get(mfa_c) == html_c:
  if log_substitution:
  print(f"[MFA_TS] Char substitution: MFA '{mfa_c}' → HTML '{html_c}' (key={key})")
  return True
- # Base-char comparison (handles decomposed↔precomposed without NFC)
  mb, hb = _first_base(mfa_c), _first_base(html_c)
  if mb and hb and (mb == hb or CHAR_EQUIVALENTS.get(mb) == hb):
  if log_substitution:
@@ -634,26 +775,19 @@ def compute_mfa_timestamps(current_html, json_output, segment_dir, cached_log_ro
  mfa_char = mfa_chars[mfa_idx]
  if chars_match(mfa_char, html_char, log_substitution=True):
  letter = mfa_letters[mfa_idx]
- # Skip letters without valid timestamps
  if letter["start"] is None or letter["end"] is None:
  print(f"[MFA_TS] Skipping letter with missing timestamp: char='{letter.get('char')}' key={key} mfa_idx={mfa_idx}")
  if chars_match(mfa_char, html_char) or len(html_char) >= len(mfa_char):
  mfa_idx += 1
  continue
- # Convert letter timestamps using word anchor
- # word_abs_start is already correct from word-level injection
- # letter times are relative to segment, so offset by (letter_start - word_rel_start)
  abs_start = word_abs_start + (letter["start"] - word_rel_start)
  abs_end = word_abs_start + (letter["end"] - word_rel_start)
- # Determine group_id: prefer cross-word group if exists, else use MFA's
  crossword_gid = crossword_groups.get((key, mfa_idx), "")
  final_group_id = crossword_gid or letter.get("group_id", "")
  char_replacements.append((
  cm.start(), cm.end(),
  f'<span class="char" data-start="{abs_start:.4f}" data-end="{abs_end:.4f}" data-group-id="{final_group_id}">{cm.group(1)}</span>'
  ))
- # Lookahead: stamp combining continuations with same MFA timestamp
- # (handles precomposed MFA char like ئ split into [يْـ, ٔ] in HTML)
  mfa_nfd = unicodedata.normalize("NFD", letter["char"])
  peek = html_idx + 1
  while peek < len(char_matches):
@@ -671,7 +805,6 @@ def compute_mfa_timestamps(current_html, json_output, segment_dir, cached_log_ro
  if chars_match(mfa_char, html_char) or len(html_char) >= len(mfa_char):
  mfa_idx += 1

- # Apply replacements in reverse order
  stamped_inner = inner
  for start, end, replacement in reversed(char_replacements):
  stamped_inner = stamped_inner[:start] + replacement + stamped_inner[end:]
@@ -703,7 +836,6 @@ def compute_mfa_timestamps(current_html, json_output, segment_dir, cached_log_ro
  for w in result.get("words", []) if w.get("start") is not None and w.get("end") is not None
  ],
  })
- # Collect char-level timestamps
  _char_ts_log.append({
  "ref": result.get("ref", ""),
  "words": [
@@ -726,81 +858,11 @@ def compute_mfa_timestamps(current_html, json_output, segment_dir, cached_log_ro
  except Exception as e:
  print(f"[USAGE_LOG] Failed to log word timestamps: {e}")

- # Build enriched JSON with word/letter timestamps (relative to segment)
- from src.core.quran_index import get_quran_index
- index = get_quran_index()
-
- def _get_word_text(location: str) -> str:
- """Look up word text from Quran index by location (surah:ayah:word)."""
- if not location or location.startswith("0:0:"):
- return "" # Special segments (Basmala/Isti'adha) use 0:0:N
- try:
- parts = location.split(":")
- if len(parts) >= 3:
- key = (int(parts[0]), int(parts[1]), int(parts[2]))
- idx = index.word_lookup.get(key)
- if idx is not None:
- return index.words[idx].display_text
- except (ValueError, IndexError):
- pass
- return ""
-
- enriched_segments = []
- for seg in segments:
- seg_idx = seg.get("segment", 0) - 1
- result_idx = seg_to_result_idx.get(seg_idx)
-
- segment_data = dict(seg) # Copy original segment data
-
- if result_idx is not None:
- # For special segments (Basmala/Isti'adha), get words from matched_text
- _ref = seg.get("ref_from", "") or seg.get("special_type", "")
- is_special = _ref.lower() in _SPECIAL_REFS
- special_words = seg.get("matched_text", "").replace(" \u06dd ", " ").split() if is_special else []
-
- # Find matching MFA result for this segment
- for i, result in enumerate(results):
- if i != result_idx or result.get("status") != "ok":
- continue
- words_with_ts = []
- for word_idx, word in enumerate(result.get("words", [])):
- if word.get("start") is None or word.get("end") is None:
- continue
-
- location = word.get("location", "")
-
- # Get word text: from matched_text for special, from index for regular
- if is_special or location.startswith("0:0:"):
- word_text = special_words[word_idx] if word_idx < len(special_words) else ""
- else:
- word_text = _get_word_text(location)
-
- word_data = {
- "word": word_text,
- "location": location,
- "start": round(word["start"], 4), # Relative to segment
- "end": round(word["end"], 4),
- }
- # Add letter timestamps if available
- if word.get("letters"):
- word_data["letters"] = [
- {
- "char": lt.get("char", ""),
- "start": round(lt["start"], 4),
- "end": round(lt["end"], 4),
- }
- for lt in word.get("letters", [])
- if lt.get("start") is not None
- ]
- words_with_ts.append(word_data)
-
- if words_with_ts:
- segment_data["words"] = words_with_ts
- break
-
- enriched_segments.append(segment_data)
-
- enriched_json = {"segments": enriched_segments}

  # Final yield: updated HTML, hide progress bar, show Animate All, enriched JSON
  animate_all_btn_html = '<button class="animate-all-btn">Animate All</button>'
 
  # Lowercase special ref names for case-insensitive matching
  _SPECIAL_REFS = {"basmala", "isti'adha", "isti'adha+basmala"}

+ _BASMALA_TEXT = "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيم"
+ _ISTIATHA_TEXT = "أَعُوذُ بِٱللَّهِ مِنَ الشَّيْطَانِ الرَّجِيم"
+ _COMBINED_PREFIX = _ISTIATHA_TEXT + " ۝ " + _BASMALA_TEXT
+

  def _mfa_upload_and_submit(refs, audio_paths):
  """Upload audio files and submit alignment batch to the MFA Space.
 
  return parsed["results"]


+ # ---------------------------------------------------------------------------
+ # Reusable helpers (shared by UI generator and API function)
+ # ---------------------------------------------------------------------------
+
+ def _make_ts_key(result_idx, ref, loc):
+ """Build the composite key used in word/letter timestamp dicts."""
+ is_special = ref.strip().lower() in _SPECIAL_REFS
+ is_fused = "+" in ref
+ if is_special:
+ base_key = f"{ref}:{loc}"
+ elif is_fused and loc.startswith("0:0:"):
+ base_key = f"{ref}:{loc}"
+ else:
+ base_key = loc
+ return f"{result_idx}:{base_key}"
+
+
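The keying scheme above can be exercised standalone. This is an illustrative sketch mirroring `_make_ts_key`: regular words are keyed by `result_idx:location`, while special segments (Basmala/Isti'adha) and fused-prefix `0:0:N` words fold the ref into the key so identical `0:0:N` locations from different segments cannot collide. The example refs and locations below are made-up values, not data from this repository.

```python
# Standalone copy of the _make_ts_key logic from the diff above.
_SPECIAL_REFS = {"basmala", "isti'adha", "isti'adha+basmala"}

def make_ts_key(result_idx, ref, loc):
    is_special = ref.strip().lower() in _SPECIAL_REFS
    is_fused = "+" in ref
    # Special segments and fused 0:0:N words need the ref in the key.
    if is_special or (is_fused and loc.startswith("0:0:")):
        base_key = f"{ref}:{loc}"
    else:
        base_key = loc
    return f"{result_idx}:{base_key}"

print(make_ts_key(0, "2:1-2:5", "2:3:1"))          # → 0:2:3:1
print(make_ts_key(1, "Basmala", "0:0:2"))          # → 1:Basmala:0:0:2
print(make_ts_key(2, "Basmala+2:1-2:5", "0:0:1"))  # → 2:Basmala+2:1-2:5:0:0:1
```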
+ def _build_mfa_ref(seg):
+ """Build the MFA ref string for a single segment. Returns None to skip."""
+ ref_from = seg.get("ref_from", "")
+ ref_to = seg.get("ref_to", "")
+ confidence = seg.get("confidence", 0)
+
+ if not ref_from:
+ ref_from = seg.get("special_type", "")
+ ref_to = ref_from
+ if not ref_from or confidence <= 0:
+ return None
+
+ if ref_from == ref_to:
+ mfa_ref = ref_from
+ else:
+ mfa_ref = f"{ref_from}-{ref_to}"
+
+ _is_special_ref = ref_from.strip().lower() in _SPECIAL_REFS
+ if not _is_special_ref:
+ matched_text = seg.get("matched_text", "")
+ if matched_text.startswith(_COMBINED_PREFIX):
+ mfa_ref = f"Isti'adha+Basmala+{mfa_ref}"
+ elif matched_text.startswith(_ISTIATHA_TEXT):
+ mfa_ref = f"Isti'adha+{mfa_ref}"
+ elif matched_text.startswith(_BASMALA_TEXT):
+ mfa_ref = f"Basmala+{mfa_ref}"
+
+ return mfa_ref
+
+
+ def _build_mfa_refs(segments, segment_dir):
+ """Build MFA refs and audio paths from segments.
+
+ Returns (refs, audio_paths, seg_to_result_idx).
+ """
+ refs = []
+ audio_paths = []
+ seg_to_result_idx = {}
+
+ for seg in segments:
+ seg_idx = seg.get("segment", 0) - 1
+ mfa_ref = _build_mfa_ref(seg)
+ if mfa_ref is None:
+ continue
+
+ audio_path = os.path.join(segment_dir, f"seg_{seg_idx}.wav") if segment_dir else None
+ if not audio_path or not os.path.exists(audio_path):
+ print(f"[MFA_TS] Skipping seg {seg_idx}: audio not found at {audio_path}")
+ continue
+
+ seg_to_result_idx[seg_idx] = len(refs)
+ refs.append(mfa_ref)
+ audio_paths.append(audio_path)
+
+ print(f"[MFA_TS] {len(refs)} refs to align: {refs[:5]}{'...' if len(refs) > 5 else ''}")
+ return refs, audio_paths, seg_to_result_idx
+
+
+ def _assign_letter_groups(letters, word_location):
+ """Assign group_id to letters sharing identical (start, end) timestamps."""
+ if not letters:
+ return []
+ result = []
+ group_id = 0
+ prev_ts = None
+ for letter in letters:
+ ts = (letter.get("start"), letter.get("end"))
+ if ts != prev_ts:
+ group_id += 1
+ prev_ts = ts
+ result.append({
+ "char": letter.get("char", ""),
+ "start": letter.get("start"),
+ "end": letter.get("end"),
+ "group_id": f"{word_location}:{group_id}",
+ })
+ return result
+
+
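The grouping rule in `_assign_letter_groups` above is simple enough to demonstrate standalone: consecutive letters that share an identical `(start, end)` span get the same `group_id`, so the UI can highlight them together. The sample letters and the `"1:1:1"` location below are invented for illustration.

```python
# Standalone copy of the grouping rule from _assign_letter_groups above.
def assign_letter_groups(letters, word_location):
    result = []
    group_id = 0
    prev_ts = None
    for letter in letters:
        ts = (letter.get("start"), letter.get("end"))
        if ts != prev_ts:        # a new time span starts a new group
            group_id += 1
            prev_ts = ts
        result.append({**letter, "group_id": f"{word_location}:{group_id}"})
    return result

letters = [
    {"char": "ب", "start": 0.10, "end": 0.18},
    {"char": "ِ", "start": 0.10, "end": 0.18},   # same span as previous letter
    {"char": "س", "start": 0.18, "end": 0.30},
]
grouped = assign_letter_groups(letters, "1:1:1")
print([l["group_id"] for l in grouped])  # → ['1:1:1:1', '1:1:1:1', '1:1:1:2']
```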
+ def _build_timestamp_lookups(results):
+ """Build timestamp lookup dicts from MFA results.
+
+ Returns (word_timestamps, letter_timestamps, word_to_all_results).
+ """
+ word_timestamps = {}
+ letter_timestamps = {}
+ word_to_all_results = {}
+
+ for result_idx, result in enumerate(results):
+ if result.get("status") != "ok":
+ print(f"[MFA_TS] Segment failed: ref={result.get('ref')} error={result.get('error')}")
+ continue
+ ref = result.get("ref", "")
+ is_special = ref.strip().lower() in _SPECIAL_REFS
+ is_fused = "+" in ref
+ for word in result.get("words", []):
+ loc = word.get("location", "")
+ if loc:
+ key = _make_ts_key(result_idx, ref, loc)
+ word_timestamps[key] = (word["start"], word["end"])
+ letters = word.get("letters")
+ if letters:
+ letter_timestamps[key] = _assign_letter_groups(letters, loc)
+ if not is_special and not (is_fused and loc.startswith("0:0:")):
+ if loc not in word_to_all_results:
+ word_to_all_results[loc] = []
+ word_to_all_results[loc].append(result_idx)
+
+ print(f"[MFA_TS] {len(word_timestamps)} word timestamps collected, {len(letter_timestamps)} with letter-level data")
+ return word_timestamps, letter_timestamps, word_to_all_results
+
+
+ def _build_crossword_groups(results, letter_ts_dict):
+ """Build mapping of (key, letter_idx) -> cross-word group_id.
+
+ Only checks word boundaries: last letter(s) of word N vs first
+ letter(s) of word N+1.
+ """
+ crossword_groups = {}
+
+ for result_idx, result in enumerate(results):
+ if result.get("status") != "ok":
+ continue
+ ref = result.get("ref", "")
+ words = result.get("words", [])
+
+ for word_i in range(len(words) - 1):
+ word_a = words[word_i]
+ word_b = words[word_i + 1]
+
+ loc_a = word_a.get("location", "")
+ loc_b = word_b.get("location", "")
+ if not loc_a or not loc_b:
+ continue
+
+ key_a = _make_ts_key(result_idx, ref, loc_a)
+ key_b = _make_ts_key(result_idx, ref, loc_b)
+ letters_a = letter_ts_dict.get(key_a, [])
+ letters_b = letter_ts_dict.get(key_b, [])
+
+ if not letters_a or not letters_b:
+ continue
+
+ for idx_a in range(len(letters_a) - 1, max(len(letters_a) - 3, -1), -1):
+ letter_a = letters_a[idx_a]
+ if letter_a.get("start") is None or letter_a.get("end") is None:
+ continue
+ for idx_b in range(min(3, len(letters_b))):
+ letter_b = letters_b[idx_b]
+ if letter_b.get("start") is None or letter_b.get("end") is None:
+ continue
+ if letter_a["start"] == letter_b["start"] and letter_a["end"] == letter_b["end"]:
+ group_id = f"xword-{result_idx}-{word_i}"
+ crossword_groups[(key_a, idx_a)] = group_id
+ crossword_groups[(key_b, idx_b)] = group_id
+
+ if crossword_groups:
+ print(f"[MFA_TS] Found {len(crossword_groups)} cross-word overlapping letters")
+
+ return crossword_groups
+
+
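The boundary check in `_build_crossword_groups` above can be sketched in a compact standalone form. This simplified version drops the `None`-timestamp guards of the real helper and takes a key function as a parameter; the sample result and letter data are invented. It compares up to three trailing letters of word N against up to three leading letters of word N+1 and tags exact-span matches with a shared `xword-…` id.

```python
# Simplified sketch of the cross-word overlap detection from the diff above.
def build_crossword_groups(results, letter_ts, make_key):
    groups = {}
    for ridx, result in enumerate(results):
        words = result.get("words", [])
        for wi in range(len(words) - 1):
            ka = make_key(ridx, result["ref"], words[wi]["location"])
            kb = make_key(ridx, result["ref"], words[wi + 1]["location"])
            la, lb = letter_ts.get(ka, []), letter_ts.get(kb, [])
            # Compare up to 3 trailing letters of word N with 3 leading of N+1.
            for ia in range(len(la) - 1, max(len(la) - 3, -1), -1):
                for ib in range(min(3, len(lb))):
                    if (la[ia]["start"], la[ia]["end"]) == (lb[ib]["start"], lb[ib]["end"]):
                        gid = f"xword-{ridx}-{wi}"
                        groups[(ka, ia)] = groups[(kb, ib)] = gid
    return groups

results = [{"ref": "1:1", "status": "ok",
            "words": [{"location": "1:1:1"}, {"location": "1:1:2"}]}]
letter_ts = {
    "0:1:1:1": [{"char": "م", "start": 0.5, "end": 0.6}],
    "0:1:1:2": [{"char": "م", "start": 0.5, "end": 0.6},
                {"char": "ا", "start": 0.6, "end": 0.7}],
}
groups = build_crossword_groups(results, letter_ts,
                                lambda r, ref, loc: f"{r}:{loc}")
print(groups)  # → {('0:1:1:1', 0): 'xword-0-0', ('0:1:1:2', 0): 'xword-0-0'}
```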
+ def _reconstruct_ref_key(seg):
+ """Reconstruct the MFA ref key for a segment (for result matching)."""
+ ref_from = seg.get("ref_from", "")
+ ref_to = seg.get("ref_to", "")
+ if not ref_from:
+ ref_from = seg.get("special_type", "")
+ ref_to = ref_from
+ ref_key = f"{ref_from}-{ref_to}" if ref_from != ref_to else ref_from
+ is_special = ref_from.strip().lower() in _SPECIAL_REFS
+ if not is_special:
+ matched_text = seg.get("matched_text", "")
+ if matched_text.startswith(_COMBINED_PREFIX):
+ ref_key = f"Isti'adha+Basmala+{ref_key}"
+ elif matched_text.startswith(_ISTIATHA_TEXT):
+ ref_key = f"Isti'adha+{ref_key}"
+ elif matched_text.startswith(_BASMALA_TEXT):
+ ref_key = f"Basmala+{ref_key}"
+ return ref_key
+
+
+ def _extend_word_timestamps(word_timestamps, segments, seg_to_result_idx,
+ results, segment_dir):
+ """Extend word ends to fill gaps between consecutive words.
+
+ Mutates word_timestamps in place.
+ """
+ import wave
+ for seg in segments:
+ ref_from = seg.get("ref_from", "")
+ confidence = seg.get("confidence", 0)
+ if not ref_from:
+ ref_from = seg.get("special_type", "")
+ if not ref_from or confidence <= 0:
+ continue
+ seg_idx = seg.get("segment", 0) - 1
+ result_idx = seg_to_result_idx.get(seg_idx)
+ if result_idx is None:
+ continue
+ ref_key = _reconstruct_ref_key(seg)
+ seg_word_locs = []
+ for result in results:
+ if result.get("ref") == ref_key and result.get("status") == "ok":
+ for w in result.get("words", []):
+ loc = w.get("location", "")
+ if loc:
+ key = _make_ts_key(result_idx, ref_key, loc)
+ if key in word_timestamps:
+ seg_word_locs.append(key)
+ break
+ if not seg_word_locs:
+ continue
+ # Extend each word's end to the next word's start
+ for i in range(len(seg_word_locs) - 1):
+ cur_start, cur_end = word_timestamps[seg_word_locs[i]]
+ nxt_start, _ = word_timestamps[seg_word_locs[i + 1]]
+ if nxt_start > cur_end:
+ word_timestamps[seg_word_locs[i]] = (cur_start, nxt_start)
+ # Extend first word back to time 0 so highlight starts immediately
+ first_loc = seg_word_locs[0]
+ first_start, first_end = word_timestamps[first_loc]
+ if first_start > 0:
+ word_timestamps[first_loc] = (0, first_end)
+ # Extend last word to segment audio duration
+ last_loc = seg_word_locs[-1]
+ last_start, last_end = word_timestamps[last_loc]
+ audio_path = os.path.join(segment_dir, f"seg_{seg_idx}.wav") if segment_dir else None
+ if audio_path and os.path.exists(audio_path):
+ with wave.open(audio_path, 'rb') as wf:
+ seg_duration = wf.getnframes() / wf.getframerate()
+ if seg_duration > last_end:
+ word_timestamps[last_loc] = (last_start, seg_duration)
+
+ print(f"[MFA_TS] Post-processed timestamps: extended word ends to fill gaps")
+
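The gap-filling done by `_extend_word_timestamps` above is easy to see on a toy example. This sketch reproduces only the core dict manipulation (no WAV files or segment matching): each word's end is pulled forward to the next word's start, the first word is extended back to 0, and the last word is extended to the segment duration. The keys `w1`–`w3` and the times are invented values.

```python
# Toy demonstration of the gap-filling logic from _extend_word_timestamps above.
word_ts = {
    "w1": (0.00, 0.40),
    "w2": (0.55, 0.90),   # 0.15 s silence after w1
    "w3": (0.90, 1.20),
}
order = ["w1", "w2", "w3"]   # words in recitation order
seg_duration = 1.50          # would come from the segment WAV in the real code

# Extend each word's end to the next word's start.
for i in range(len(order) - 1):
    cur_start, cur_end = word_ts[order[i]]
    nxt_start, _ = word_ts[order[i + 1]]
    if nxt_start > cur_end:
        word_ts[order[i]] = (cur_start, nxt_start)
# Extend the first word back to time 0.
first_start, first_end = word_ts[order[0]]
if first_start > 0:
    word_ts[order[0]] = (0, first_end)
# Extend the last word to the segment duration.
last_start, last_end = word_ts[order[-1]]
if seg_duration > last_end:
    word_ts[order[-1]] = (last_start, seg_duration)

print(word_ts)  # → {'w1': (0.0, 0.55), 'w2': (0.55, 0.9), 'w3': (0.9, 1.5)}
```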
356
+ def _build_enriched_json(segments, results, seg_to_result_idx,
357
+ word_timestamps, letter_timestamps, granularity,
358
+ *, minimal=False):
359
+ """Build enriched segments with word (and optionally letter) timestamps.
360
+
361
+ When *minimal* is True (API path), each segment only contains
362
+ ``segment`` number + ``words`` array. When False (UI path), all
363
+ original segment fields are preserved.
364
+
365
+ Returns dict with "segments" key.
366
+ """
367
+ from src.core.quran_index import get_quran_index
368
+ index = get_quran_index()
369
+ include_letters = (granularity == "words+chars")
370
+
371
+ def _get_word_text(location):
372
+ if not location or location.startswith("0:0:"):
373
+ return ""
374
+ try:
375
+ parts = location.split(":")
376
+ if len(parts) >= 3:
377
+ key = (int(parts[0]), int(parts[1]), int(parts[2]))
378
+ idx = index.word_lookup.get(key)
379
+ if idx is not None:
380
+ return index.words[idx].display_text
381
+ except (ValueError, IndexError):
382
+ pass
383
+ return ""
384
+
385
+ enriched_segments = []
386
+ for seg in segments:
387
+ seg_idx = seg.get("segment", 0) - 1
388
+ result_idx = seg_to_result_idx.get(seg_idx)
389
+
390
+ if minimal:
391
+ segment_data = {"segment": seg.get("segment", 0)}
392
+ else:
393
+ segment_data = dict(seg)
394
+
395
+ if result_idx is not None:
396
+ _ref = seg.get("ref_from", "") or seg.get("special_type", "")
397
+ is_special = _ref.lower() in _SPECIAL_REFS
398
+ special_words = seg.get("matched_text", "").replace(" \u06dd ", " ").split() if is_special else []
399
+
400
+ for i, result in enumerate(results):
401
+ if i != result_idx or result.get("status") != "ok":
402
+ continue
403
+ words_with_ts = []
404
+ for word_idx, word in enumerate(result.get("words", [])):
405
+ if word.get("start") is None or word.get("end") is None:
406
+ continue
407
+
408
+ location = word.get("location", "")
409
+
410
+ if minimal:
411
+ # API: compact — [location, start, end] or [location, start, end, letters]
412
+ word_entry = [location, round(word["start"], 4), round(word["end"], 4)]
413
+ if include_letters and word.get("letters"):
414
+ word_entry.append([
415
+ [lt.get("char", ""), round(lt["start"], 4), round(lt["end"], 4)]
416
+ for lt in word.get("letters", [])
417
+ if lt.get("start") is not None
418
+ ])
419
+ words_with_ts.append(word_entry)
420
+ else:
421
+ # UI: keyed objects with display text
422
+ if is_special or location.startswith("0:0:"):
423
+ word_text = special_words[word_idx] if word_idx < len(special_words) else ""
424
+ else:
425
+ word_text = _get_word_text(location)
426
+
427
+ word_data = {
428
+ "word": word_text,
429
+ "location": location,
430
+ "start": round(word["start"], 4),
431
+ "end": round(word["end"], 4),
432
+ }
433
+ if include_letters and word.get("letters"):
434
+ word_data["letters"] = [
435
+ {
436
+ "char": lt.get("char", ""),
437
+ "start": round(lt["start"], 4),
438
+ "end": round(lt["end"], 4),
439
+ }
440
+ for lt in word.get("letters", [])
441
+ if lt.get("start") is not None
442
+ ]
443
+ words_with_ts.append(word_data)
444
+
445
+ if words_with_ts:
446
+ segment_data["words"] = words_with_ts
447
+ break
448
+
449
+ enriched_segments.append(segment_data)
450
+
451
+ return {"segments": enriched_segments}
452
+
453
+
454
+ # ---------------------------------------------------------------------------
455
+ # Synchronous API function
456
+ # ---------------------------------------------------------------------------
457
+
458
+ def compute_mfa_timestamps_api(segments, segment_dir, granularity="words"):
459
+ """Run MFA forced alignment and return enriched segments (no UI/HTML).
460
+
461
+ Args:
462
+ segments: List of segment dicts (same format as alignment response).
463
+ segment_dir: Path to directory containing per-segment WAV files.
464
+ granularity: "words" or "words+chars".
465
+
466
+ Returns:
467
+ Dict with "segments" key containing enriched segment data.
468
+ """
469
+ if not granularity or granularity not in ("words", "words+chars"):
470
+ granularity = "words"
471
+
472
+ refs, audio_paths, seg_to_result_idx = _build_mfa_refs(segments, segment_dir)
473
+ if not refs:
474
+ return {"segments": segments}
475
+
476
+ event_id, headers, base = _mfa_upload_and_submit(refs, audio_paths)
477
+ results = _mfa_wait_result(event_id, headers, base)
478
+ print(f"[MFA_TS] Got {len(results)} results from MFA API")
479
+
480
+ word_ts, letter_ts, _ = _build_timestamp_lookups(results)
481
+ _build_crossword_groups(results, letter_ts)
482
+ _extend_word_timestamps(word_ts, segments, seg_to_result_idx, results, segment_dir)
483
+ return _build_enriched_json(segments, results, seg_to_result_idx,
484
+ word_ts, letter_ts, granularity, minimal=True)
485
+
486
+
487
+ # ---------------------------------------------------------------------------
+ # UI progress bar
+ # ---------------------------------------------------------------------------
+
  def _ts_progress_bar_html(total_segments, rate, animated=True):
  """Return HTML for a progress bar showing Segment x/N.

  </div>'''


+ # ---------------------------------------------------------------------------
+ # UI generator (Gradio — yields progress, injects HTML timestamps)
+ # ---------------------------------------------------------------------------
+
  def compute_mfa_timestamps(current_html, json_output, segment_dir, cached_log_row=None):
  """Compute word-level timestamps via MFA forced alignment and inject into HTML.

  yield current_html, gr.update(), gr.update(), gr.update(), gr.update()
  return

+ # Build refs and audio paths using shared helper
  segments = json_output.get("segments", []) if json_output else []
  print(f"[MFA_TS] {len(segments)} segments in JSON")

+ refs, audio_paths, seg_to_result_idx = _build_mfa_refs(segments, segment_dir)

  if not refs:
  print("[MFA_TS] Early return: no valid refs/audio pairs")

  )
  raise

+ # Build timestamp lookups using shared helper
+ word_timestamps, letter_timestamps, word_to_all_results = _build_timestamp_lookups(results)

+ # Build cross-word groups using shared helper
  crossword_groups = _build_crossword_groups(results, letter_timestamps)

+ # Extend word timestamps using shared helper
+ _extend_word_timestamps(word_timestamps, segments, seg_to_result_idx, results, segment_dir)

+ # --- HTML injection (UI-only, not shared with API) ---

  # Inject timestamps into word spans, using segment boundaries to determine result_idx
+ seg_boundaries = []
  for m in re.finditer(r'data-segment-idx="(\d+)"', current_html):
  seg_boundaries.append((m.start(), int(m.group(1))))
  seg_boundaries.sort(key=lambda x: x[0])

+ seg_offset_map = {}
  for seg in segments:
+ idx = seg.get("segment", 0) - 1
  seg_offset_map[idx] = seg.get("time_from", 0)

  def _get_seg_idx_at_pos(pos):
  seg_idx = None
  for boundary_pos, idx in seg_boundaries:
  if boundary_pos > pos:

  if not pos_m:
  return orig
  pos = pos_m.group(1)
  seg_idx = _get_seg_idx_at_pos(m.start())
  if seg_idx is None:
  return orig
  expected_result_idx = seg_to_result_idx.get(seg_idx)
  result_idx = None
  if pos and not pos.startswith("0:0:"):
  candidates = word_to_all_results.get(pos, [])

  result_idx = expected_result_idx
  if result_idx is None:
  return orig
  key = f"{result_idx}:{pos}"
  ts = word_timestamps.get(key)
  if not ts:
  return orig
  seg_offset = seg_offset_map.get(seg_idx, 0)
  abs_start = ts[0] + seg_offset
  abs_end = ts[1] + seg_offset
  return orig[:-1] + f' data-result-idx="{result_idx}" data-start="{abs_start:.4f}" data-end="{abs_end:.4f}">'

  html = re.sub(word_open_re, _inject_word_ts, current_html)
 

  def _stamp_chars_with_mfa(word_m):
  word_open = word_m.group(1)
+ word_abs_start = float(word_m.group(2))
  inner = word_m.group(4)

  pos_m = re.search(r'data-pos="([^"]+)"', word_open)
  word_pos = pos_m.group(1) if pos_m else None

  result_idx_m = re.search(r'data-result-idx="(\d+)"', word_open)
  if result_idx_m:
  result_idx = int(result_idx_m.group(1))
  else:
  result_idx = None
  if word_pos and not word_pos.startswith("0:0:"):
  candidates = word_to_all_results.get(word_pos, [])

  if len(candidates) == 1:
  result_idx = candidates[0]
  else:
  result_idx = candidates[0]

  key = f"{result_idx}:{word_pos}" if result_idx is not None and word_pos else None

  word_ts = word_timestamps.get(key) if key else None
  mfa_letters = letter_timestamps.get(key) if key else None
  if not mfa_letters or not word_ts:
  return word_m.group(0)

+ word_rel_start = word_ts[0]

  char_matches = list(re.finditer(r'<span class="char">([^<]*)</span>', inner))
  if not char_matches:
  return word_m.group(0)

  mfa_chars = [l["char"] for l in mfa_letters]
  html_chars = [m.group(1).replace('\u0640', '') for m in char_matches]

  CHAR_EQUIVALENTS = {
+ 'ى': 'ي',
+ 'ي': 'ى',
  }

  def _first_base(s):
  for c in unicodedata.normalize("NFD", s):
  if not unicodedata.category(c).startswith('M'):
  return c
  return s[0] if s else ''

  def chars_match(mfa_c, html_c, log_substitution=False):
  if mfa_c == html_c or html_c in mfa_c or mfa_c in html_c:
  return True
  if CHAR_EQUIVALENTS.get(mfa_c) == html_c:
  if log_substitution:
  print(f"[MFA_TS] Char substitution: MFA '{mfa_c}' → HTML '{html_c}' (key={key})")
  return True
  mb, hb = _first_base(mfa_c), _first_base(html_c)
  if mb and hb and (mb == hb or CHAR_EQUIVALENTS.get(mb) == hb):
  if log_substitution:

  mfa_char = mfa_chars[mfa_idx]
  if chars_match(mfa_char, html_char, log_substitution=True):
  letter = mfa_letters[mfa_idx]
  if letter["start"] is None or letter["end"] is None:
  print(f"[MFA_TS] Skipping letter with missing timestamp: char='{letter.get('char')}' key={key} mfa_idx={mfa_idx}")
  if chars_match(mfa_char, html_char) or len(html_char) >= len(mfa_char):
  mfa_idx += 1
  continue

  abs_start = word_abs_start + (letter["start"] - word_rel_start)
  abs_end = word_abs_start + (letter["end"] - word_rel_start)
  crossword_gid = crossword_groups.get((key, mfa_idx), "")
  final_group_id = crossword_gid or letter.get("group_id", "")
  char_replacements.append((
  cm.start(), cm.end(),
  f'<span class="char" data-start="{abs_start:.4f}" data-end="{abs_end:.4f}" data-group-id="{final_group_id}">{cm.group(1)}</span>'
  ))

  mfa_nfd = unicodedata.normalize("NFD", letter["char"])
  peek = html_idx + 1
  while peek < len(char_matches):

  if chars_match(mfa_char, html_char) or len(html_char) >= len(mfa_char):
  mfa_idx += 1

  stamped_inner = inner
  for start, end, replacement in reversed(char_replacements):
  stamped_inner = stamped_inner[:start] + replacement + stamped_inner[end:]

  for w in result.get("words", []) if w.get("start") is not None and w.get("end") is not None
  ],
  })
  _char_ts_log.append({
  "ref": result.get("ref", ""),
  "words": [

  except Exception as e:
  print(f"[USAGE_LOG] Failed to log word timestamps: {e}")

+ # Build enriched JSON using shared helper (UI always includes letters)
+ enriched_json = _build_enriched_json(
+ segments, results, seg_to_result_idx,
+ word_timestamps, letter_timestamps, "words+chars",
+ )

  # Final yield: updated HTML, hide progress bar, show Animate All, enriched JSON
  animate_all_btn_html = '<button class="animate-all-btn">Animate All</button>'
src/ui/event_wiring.py CHANGED
@@ -9,6 +9,7 @@ from src.pipeline import (
  from src.api.session_api import (
  process_audio_session, resegment_session,
  retranscribe_session, realign_from_timestamps,
+ mfa_timestamps_session, mfa_timestamps_direct,
  )
  from src.mfa import compute_mfa_timestamps
  from src.ui.handlers import (
@@ -483,3 +484,15 @@ def _wire_api_endpoint(c):
  outputs=[c.api_result],
  api_name="realign_from_timestamps",
  )
+ gr.Button(visible=False).click(
+ fn=mfa_timestamps_session,
+ inputs=[c.api_audio_id, c.api_mfa_segments, c.api_mfa_granularity],
+ outputs=[c.api_result],
+ api_name="mfa_timestamps_session",
+ )
+ gr.Button(visible=False).click(
+ fn=mfa_timestamps_direct,
+ inputs=[c.api_audio, c.api_mfa_segments, c.api_mfa_granularity],
+ outputs=[c.api_result],
+ api_name="mfa_timestamps_direct",
+ )
src/ui/interface.py CHANGED
@@ -78,6 +78,8 @@ def build_interface():
  c.api_model = gr.Textbox(visible=False)
  c.api_device = gr.Textbox(visible=False)
  c.api_timestamps = gr.JSON(visible=False)
+ c.api_mfa_segments = gr.JSON(visible=False)
+ c.api_mfa_granularity = gr.Textbox(visible=False)
  c.api_result = gr.JSON(visible=False)

  wire_events(app, c)
@@ -110,7 +112,7 @@ def _build_left_column(c):
  choices=["Base", "Large"],
  value="Base",
  label="ASR Model",
- info="Large: more robust to noisy/non-studio recitations but much slower (10x bigger)"
+ info="Large: more robust to noisy/non-studio recitations but slower"
  )
  c.device_radio = gr.Radio(
  choices=["GPU", "CPU"],
tests/test_session_api.py CHANGED
@@ -263,6 +263,156 @@ class TestWorkflow:
  # 6. Error handling
  # ---------------------------------------------------------------------------

+ # ---------------------------------------------------------------------------
+ # 7. MFA timestamps — session-based
+ # ---------------------------------------------------------------------------
+
+ class TestMfaTimestampsSession:
+     def test_basic_words_only(self, client, session):
+         """Session endpoint with stored segments, words granularity."""
+         result = client.predict(
+             session["audio_id"], None, "words",
+             api_name="/mfa_timestamps_session",
+         )
+         assert result["audio_id"] == session["audio_id"]
+         assert len(result["segments"]) > 0
+         has_words = any("words" in seg for seg in result["segments"])
+         assert has_words, "Expected at least one segment with words"
+         # Words-only: each word is [location, start, end] (3 elements)
+         for seg in result["segments"]:
+             for word in seg.get("words", []):
+                 assert len(word) == 3, f"words granularity should give 3-element arrays, got {len(word)}"
+
+     def test_words_plus_chars(self, client, session):
+         """Session endpoint with words+chars granularity."""
+         result = client.predict(
+             session["audio_id"], None, "words+chars",
+             api_name="/mfa_timestamps_session",
+         )
+         has_letters = any(
+             len(word) == 4
+             for seg in result["segments"]
+             for word in seg.get("words", [])
+         )
+         assert has_letters, "words+chars should include letter arrays (4th element)"
+
+     def test_with_segments_override(self, client, session):
+         """Session endpoint with explicit segments (override stored)."""
+         segments_override = session["segments"][:2]
+         result = client.predict(
+             session["audio_id"], segments_override, "words",
+             api_name="/mfa_timestamps_session",
+         )
+         assert result["audio_id"] == session["audio_id"]
+         assert len(result["segments"]) == 2
+
+     def test_word_timestamp_fields(self, client, session):
+         """Verify word arrays have correct structure: [location, start, end, ?letters]."""
+         result = client.predict(
+             session["audio_id"], None, "words+chars",
+             api_name="/mfa_timestamps_session",
+         )
+         for seg in result["segments"]:
+             for word in seg.get("words", []):
+                 assert isinstance(word[0], str), "word[0] should be location string"
+                 assert isinstance(word[1], (int, float)), "word[1] should be start time"
+                 assert isinstance(word[2], (int, float)), "word[2] should be end time"
+                 assert word[2] > word[1], "end should be > start"
+                 if len(word) == 4:
+                     # Letters: list of [char, start, end]
+                     for letter in word[3]:
+                         assert len(letter) == 3
+                         assert isinstance(letter[0], str)
+
+     def test_invalid_session(self, client):
+         result = client.predict(
+             FAKE_ID, None, "words",
+             api_name="/mfa_timestamps_session",
+         )
+         assert "error" in result
+         assert result["segments"] == []
+
+     def test_default_granularity(self, client, session):
+         """Empty granularity should default to words."""
+         result = client.predict(
+             session["audio_id"], None, "",
+             api_name="/mfa_timestamps_session",
+         )
+         assert len(result["segments"]) > 0
+         for seg in result["segments"]:
+             for word in seg.get("words", []):
+                 assert len(word) == 3, "default granularity should not include letters"
+
+
+ # ---------------------------------------------------------------------------
+ # 8. MFA timestamps — direct
+ # ---------------------------------------------------------------------------
+
+ class TestMfaTimestampsDirect:
+     def test_basic(self, client, session):
+         """Direct endpoint with audio file and segments."""
+         result = client.predict(
+             AUDIO_FILE, session["segments"], "words",
+             api_name="/mfa_timestamps_direct",
+         )
+         assert "segments" in result
+         assert len(result["segments"]) > 0
+         has_words = any("words" in seg for seg in result["segments"])
+         assert has_words
+
+     def test_words_plus_chars(self, client, session):
+         result = client.predict(
+             AUDIO_FILE, session["segments"], "words+chars",
+             api_name="/mfa_timestamps_direct",
+         )
+         has_letters = any(
+             len(word) == 4
+             for seg in result["segments"]
+             for word in seg.get("words", [])
+         )
+         assert has_letters
+
+     def test_no_audio_id_in_response(self, client, session):
+         """Direct endpoint should not return audio_id."""
+         result = client.predict(
+             AUDIO_FILE, session["segments"], "words",
+             api_name="/mfa_timestamps_direct",
+         )
+         assert "audio_id" not in result
+
+     def test_empty_segments_error(self, client):
+         result = client.predict(
+             AUDIO_FILE, [], "words",
+             api_name="/mfa_timestamps_direct",
+         )
+         assert "error" in result
+         assert result["segments"] == []
+
+
+ # ---------------------------------------------------------------------------
+ # 9. Segments stored in session after alignment
+ # ---------------------------------------------------------------------------
+
+ class TestSegmentStorage:
+     def test_segments_stored_after_process(self, client):
+         """process_audio_session should store segments for later MFA use."""
+         proc = client.predict(
+             AUDIO_FILE, 200, 1000, 100, "Base", "CPU",
+             api_name="/process_audio_session",
+         )
+         # MFA session endpoint should find stored segments
+         result = client.predict(
+             proc["audio_id"], None, "words",
+             api_name="/mfa_timestamps_session",
+         )
+         assert "error" not in result or result.get("segments")
+         assert result["audio_id"] == proc["audio_id"]
+
+
+ # ---------------------------------------------------------------------------
+ # 10. Error handling
+ # ---------------------------------------------------------------------------
+
  class TestErrorHandling:
      def test_invalid_audio_id_retranscribe(self, client):
          result = client.predict(
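The word-array shape these tests pin down (`[location, start, end]`, with an optional fourth element holding `[char, start, end]` letter triples under `words+chars`) is easy to consume with a small helper. A sketch; the function name and the location strings in the sample payload are illustrative, not taken from the app:

```python
def iter_words(response):
    """Yield (location, start, end, letters) for every word in a response.

    `letters` is a list of [char, start, end] triples, or [] when the
    response was produced with plain "words" granularity.
    """
    for seg in response.get("segments", []):
        for word in seg.get("words", []):
            location, start, end = word[0], word[1], word[2]
            letters = word[3] if len(word) == 4 else []
            yield location, start, end, letters


# Hand-built payload matching the shape the tests assert:
sample = {
    "segments": [
        {"words": [["1:1:1", 0.10, 0.55,
                    [["b", 0.10, 0.30], ["a", 0.30, 0.55]]]]},
        {"words": [["1:1:2", 0.60, 0.90]]},
    ]
}
for loc, start, end, letters in iter_words(sample):
    assert end > start
```

Because the helper only touches list positions, it works identically on responses from `/mfa_timestamps_session` and `/mfa_timestamps_direct`.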