# Usage Logging

## Part 1 – Reference: `recitation_app` Logging

Documents the recitation logging system used in `recitation_app` to collect anonymised analysis data on the HuggingFace Hub. Included here as a reference for the `quran_aligner` schema below.
### Dataset

| Property | Value |
|----------|-------|
| Repo | `hetchyy/recitation-logs` (private) |
| Type | HuggingFace Dataset |
| Format | Parquet files in `data/` |
| Push interval | 1 minute |
Configured in `config.py`:

```python
USAGE_LOG_DATASET_REPO = "hetchyy/recitation-logs"
USAGE_LOG_PUSH_INTERVAL_MINUTES = 1
USAGE_LOG_AUDIO = False  # toggleable at runtime
```
### Schema

Defined in `utils/usage_logger.py` as `_RECITATION_SCHEMA`:

| Field | HF Type | Description |
|-------|---------|-------------|
| `audio` | `Audio` | Optional FLAC-encoded audio bytes embedded in the parquet file |
| `timestamp` | `Value(string)` | ISO 8601 datetime of the analysis |
| `user_id` | `Value(string)` | SHA-256 hash (12-char) of username or IP+UA |
| `verse_ref` | `Value(string)` | Quranic reference, e.g. `"1:1"` |
| `canonical_text` | `Value(string)` | Arabic text of the verse |
| `segments` | `Value(string)` | JSON array of segment results (see below) |
| `multi_model` | `Value(bool)` | Whether multiple ASR models were used |
| `settings` | `Value(string)` | JSON dict of Tajweed settings |
| `vad_timestamps` | `Value(string)` | JSON list of VAD segment boundaries |
#### Segment object (inside `segments` JSON)

```json
{
  "segment_ref": "1:1",
  "canonical_phonemes": "b i s m i ...",
  "detected_phonemes": "b i s m i ..."
}
```
#### Settings object (inside `settings` JSON)

```json
{
  "tolerance": 0.15,
  "iqlab_sound": "m",
  "ghunnah_length": 2,
  "jaiz_length": 4,
  "wajib_length": 4,
  "arid_length": 2,
  "leen_length": 2
}
```
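To make the schema concrete, a row could be assembled like this minimal sketch (`build_recitation_row` is a hypothetical helper, not the real `log_analysis`; nested structures are serialised to JSON strings so the parquet columns stay flat, as the schema above requires):

```python
import json
from datetime import datetime, timezone

def build_recitation_row(user_id, verse_ref, canonical_text, segments,
                         multi_model, settings, vad_timestamps, audio_path=None):
    """Assemble one row matching _RECITATION_SCHEMA. Nested structures
    (segments, settings, VAD boundaries) become JSON strings."""
    return {
        "audio": audio_path,  # file path; the scheduler embeds the bytes later
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "user_id": user_id,
        "verse_ref": verse_ref,
        "canonical_text": canonical_text,
        "segments": json.dumps(segments, ensure_ascii=False),
        "multi_model": multi_model,
        "settings": json.dumps(settings),
        "vad_timestamps": json.dumps(vad_timestamps),
    }
```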
### ParquetScheduler

Custom subclass of `huggingface_hub.CommitScheduler` (`utils/usage_logger.py`).

#### How it works

1. **Buffer** – Rows accumulate in an in-memory list via `.append(row)`. Access is protected by a threading lock.
2. **Flush** – On each scheduler tick (every `USAGE_LOG_PUSH_INTERVAL_MINUTES`):
   - Lock the buffer, swap it out, release the lock.
   - For any `audio` field containing a file path, read the file and convert it to `{"path": filename, "bytes": binary_data}`.
   - Build a PyArrow table from the rows.
   - Embed the HF feature schema in the parquet metadata (`replace_schema_metadata` returns a new table rather than mutating in place):

     ```python
     table = table.replace_schema_metadata(
         {"huggingface": json.dumps({"info": {"features": schema}})}
     )
     ```

   - Write to a temporary parquet file, then upload via `api.upload_file()` to `data/{uuid4()}.parquet`.
   - Clean up temporary audio files.
#### Audio encoding

When `USAGE_LOG_AUDIO` is enabled:

```python
sf.write(filepath, audio_array, sample_rate, format="FLAC")
row["audio"] = str(filepath)  # ParquetScheduler reads and embeds the bytes
```

The audio is 16 kHz mono, encoded as FLAC, and stored as embedded bytes inside the parquet file.
### Lazy Initialisation

Schedulers are **not** created at import time. They are initialised on the first call to `_ensure_schedulers()` using double-checked locking:

```python
_recitation_scheduler = None
_schedulers_initialized = False
_init_lock = threading.Lock()

def _ensure_schedulers():
    global _recitation_scheduler, _schedulers_initialized
    if _schedulers_initialized:
        return
    with _init_lock:
        if _schedulers_initialized:
            return
        _schedulers_initialized = True
        _recitation_scheduler = ParquetScheduler(
            repo_id=USAGE_LOG_DATASET_REPO,
            schema=_RECITATION_SCHEMA,
            every=USAGE_LOG_PUSH_INTERVAL_MINUTES,
            path_in_repo="data",
            repo_type="dataset",
            private=True,
        )
```

This avoids interfering with ZeroGPU, which is sensitive to early network calls.
### Error Logging

Errors use a separate `CommitScheduler` (not `ParquetScheduler`) that watches a local directory:

- Local path: `/usage_logs/errors/error_log-{uuid4()}.jsonl`
- Remote path: `data/errors/`
- Format: JSONL with fields `timestamp`, `user_id`, `verse_ref`, `error_message`

Errors are appended to the JSONL file under a file lock. The `CommitScheduler` syncs the directory to the Hub periodically.
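The append-under-lock step might look like this (a minimal sketch with a module-level `threading.Lock` standing in for the scheduler's file lock; `append_error` is a hypothetical name):

```python
import json
import threading
from datetime import datetime, timezone
from pathlib import Path

_error_lock = threading.Lock()

def append_error(log_path, user_id, verse_ref, error_message):
    """Append one error record to a JSONL file under a lock so
    concurrent handlers do not interleave partial lines."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "user_id": user_id,
        "verse_ref": verse_ref,
        "error_message": error_message,
    }
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with _error_lock:
        with path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

The scheduler then commits the whole directory, so no upload logic is needed here.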
### User Anonymisation

```python
def get_user_id(request) -> str:
    username = getattr(request, "username", None)
    if username:
        return hashlib.sha256(username.encode()).hexdigest()[:12]
    headers = getattr(request, "headers", {}) or {}
    ip = headers.get("x-forwarded-for", "").split(",")[0].strip()
    ua = headers.get("user-agent", "")
    return hashlib.sha256(f"{ip}|{ua}".encode()).hexdigest()[:12]
```

- Logged-in HF users: hash of the username
- Anonymous users: hash of IP + User-Agent
- Always truncated to 12 hex characters
### Fallback

If the scheduler fails to initialise (no HF token, network issues), rows are written to a local JSONL file at `usage_logs/recitations_fallback.jsonl` (without audio).
### Integration Point

Logging is called from the audio processing handler (`ui/handlers/audio_processing.py`) after each analysis completes:

```python
log_analysis(
    user_id, ref, text, segments,
    multi_model=bool(use_multi),
    settings=_settings,
    audio=audio_for_log,    # tuple of (sample_rate, np.ndarray) or None
    vad_timestamps=vad_ts,  # list of [start, end] pairs
)
```

Errors are logged separately:

```python
log_error(user_id, ref, "Audio loading failed")
```
### Dependencies

- `huggingface_hub` – `CommitScheduler` base class and Hub API
- `pyarrow` – parquet table creation and schema metadata
- `soundfile` – FLAC audio encoding

---
## Part 2 – `quran_aligner` Logging Schema

Schema for logging alignment runs from this project. One row per audio upload. The row is mutated in place while it sits in the `ParquetScheduler` buffer (before the next push-to-Hub tick). Run-level fields (profiling, reciter stats, quality stats, settings) are **overwritten** to reflect the latest run. Segment results are **appended** so every setting combination is preserved.
### Run-level fields

#### Identity

| Field | HF Type | Description |
|-------|---------|-------------|
| `audio` | `Audio` | FLAC-encoded audio (16 kHz mono) |
| `audio_id` | `Value(string)` | `{sha256(audio_bytes)[:16]}:{timestamp}`, e.g. `a3f7b2c91e04d8f2:20260203T141532` |
| `timestamp` | `Value(string)` | ISO 8601 datetime truncated to seconds, e.g. `2026-02-03T01:50:45` |
| `user_id` | `Value(string)` | SHA-256 hash (12-char) of IP+UA |

The `audio_id` hash prefix enables grouping/deduplication of the same recording across runs; the timestamp suffix makes each run unique. Hashing costs roughly 90 ms for a 5-minute recording.
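The `audio_id` construction can be sketched as follows (`make_audio_id` is a hypothetical helper reproducing the documented format):

```python
import hashlib
from datetime import datetime, timezone

def make_audio_id(audio_bytes, now=None):
    """Content-hash prefix groups re-runs of one recording;
    the timestamp suffix keeps each run unique."""
    now = now or datetime.now(timezone.utc)
    digest = hashlib.sha256(audio_bytes).hexdigest()[:16]
    return f"{digest}:{now.strftime('%Y%m%dT%H%M%S')}"
```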
#### Input metadata

| Field | HF Type | Description |
|-------|---------|-------------|
| `audio_duration_s` | `Value(float64)` | Total audio duration in seconds |
| `num_segments` | `Value(int32)` | Number of VAD segments |
| `surah` | `Value(int32)` | Detected surah (1–114) |

#### Segmentation settings

| Field | HF Type | Description |
|-------|---------|-------------|
| `min_silence_ms` | `Value(int32)` | Minimum silence duration to split on |
| `min_speech_ms` | `Value(int32)` | Minimum speech duration for a valid segment |
| `pad_ms` | `Value(int32)` | Padding around speech segments |
| `asr_model` | `Value(string)` | `"Base"` (`hetchyy/r15_95m`) or `"Large"` (`hetchyy/r7`) |
| `device` | `Value(string)` | `"GPU"` or `"CPU"` |
#### Profiling (seconds)

| Field | HF Type | Description |
|-------|---------|-------------|
| `total_time` | `Value(float64)` | End-to-end pipeline wall time |
| `vad_queue_time` | `Value(float64)` | VAD queue wait time |
| `vad_gpu_time` | `Value(float64)` | Actual VAD GPU execution time |
| `asr_gpu_time` | `Value(float64)` | Actual ASR GPU execution time |
| `dp_total_time` | `Value(float64)` | Total DP alignment time across all segments |
#### Quality & retry stats

| Field | HF Type | Description |
|-------|---------|-------------|
| `segments_passed` | `Value(int32)` | Segments with confidence > 0 |
| `segments_failed` | `Value(int32)` | Segments with confidence <= 0 |
| `mean_confidence` | `Value(float64)` | Average confidence across all segments |
| `tier1_retries` | `Value(int32)` | Expanded-window retry attempts |
| `tier1_passed` | `Value(int32)` | Successful tier 1 retries |
| `tier2_retries` | `Value(int32)` | Relaxed-threshold retry attempts |
| `tier2_passed` | `Value(int32)` | Successful tier 2 retries |
| `reanchors` | `Value(int32)` | Re-anchor events (after consecutive failures) |
| `special_merges` | `Value(int32)` | Basmala-fused segments |
#### Reciter stats

Computed from matched segments (those with `word_count > 0`). Already calculated in `app.py:877-922` for console output.

| Field | HF Type | Description |
|-------|---------|-------------|
| `words_per_minute` | `Value(float64)` | `total_words / (total_speech_s / 60)` |
| `phonemes_per_second` | `Value(float64)` | `total_phonemes / total_speech_s` |
| `avg_segment_duration` | `Value(float64)` | Mean duration of matched segments |
| `std_segment_duration` | `Value(float64)` | Std dev of matched segment durations |
| `avg_pause_duration` | `Value(float64)` | Mean inter-segment silence gap |
| `std_pause_duration` | `Value(float64)` | Std dev of pause durations |
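Given hypothetical segment dicts with `start`, `end`, `word_count`, and `phoneme_count` keys, these stats could be computed as in this sketch (population std dev assumed; the real computation lives in `app.py`):

```python
from statistics import mean, pstdev

def reciter_stats(segments):
    """Rate and rhythm stats over matched segments
    (those with word_count > 0), per the table above."""
    matched = [s for s in segments if s["word_count"] > 0]
    durations = [s["end"] - s["start"] for s in matched]
    total_speech = sum(durations)
    total_words = sum(s["word_count"] for s in matched)
    total_phonemes = sum(s["phoneme_count"] for s in matched)
    # silence gaps between consecutive matched segments
    pauses = [b["start"] - a["end"] for a, b in zip(matched, matched[1:])]
    return {
        "words_per_minute": total_words / (total_speech / 60),
        "phonemes_per_second": total_phonemes / total_speech,
        "avg_segment_duration": mean(durations),
        "std_segment_duration": pstdev(durations),
        "avg_pause_duration": mean(pauses) if pauses else 0.0,
        "std_pause_duration": pstdev(pauses) if pauses else 0.0,
    }
```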
#### Session flags

| Field | HF Type | Description |
|-------|---------|-------------|
| `resegmented` | `Value(bool)` | User resegmented with different VAD settings |
| `retranscribed` | `Value(bool)` | User retranscribed with a different ASR model |

#### Segments, timestamps & error

| Field | HF Type | Description |
|-------|---------|-------------|
| `segments` | `Value(string)` | JSON array of run objects (see below); **appended** on resegment/retranscribe |
| `word_timestamps` | `Value(string)` | JSON array of per-segment MFA word timings (see below); null until computed |
| `error` | `Value(string)` | Top-level error message if the pipeline failed |
### Segment runs (inside `segments` JSON)

Each run with different settings appends a new run object. The array preserves the full history, so every setting combination is available.

```json
[
  {
    "min_silence_ms": 200,
    "min_speech_ms": 1000,
    "pad_ms": 100,
    "asr_model": "Base",
    "segments": [
      {
        "idx": 1,
        "start": 0.512,
        "end": 3.841,
        "duration": 3.329,
        "ref": "2:255:1-2:255:5",
        "confidence": 0.87,
        "word_count": 5,
        "ayah_span": 1,
        "phoneme_count": 42,
        "undersegmented": false,
        "missing_words": false,
        "special_type": null,
        "error": null
      }
    ]
  },
  {
    "min_silence_ms": 600,
    "min_speech_ms": 1500,
    "pad_ms": 300,
    "asr_model": "Base",
    "segments": [...]
  }
]
```
#### Run object

| Field | Type | Description |
|-------|------|-------------|
| `min_silence_ms` | int | Silence setting used for this run |
| `min_speech_ms` | int | Speech setting used for this run |
| `pad_ms` | int | Pad setting used for this run |
| `asr_model` | string | `"Base"` or `"Large"` |
| `segments` | array | Per-segment objects for this run |

#### Per-segment object

| Field | Type | Description |
|-------|------|-------------|
| `idx` | int | 1-indexed segment number |
| `start` | float | Segment start time in seconds |
| `end` | float | Segment end time in seconds |
| `duration` | float | `end - start` |
| `ref` | string | Matched reference `"S:A:W1-S:A:W2"`, empty if failed |
| `confidence` | float | Alignment confidence in [0.0, 1.0] |
| `word_count` | int | Number of words matched |
| `ayah_span` | int | Number of ayahs spanned |
| `phoneme_count` | int | Length of the ASR phoneme sequence |
| `undersegmented` | bool | Flagged if `word_count >= 20`, or if `ayah_span >= 2` and `duration >= 15` s |
| `missing_words` | bool | Gaps detected in the word alignment |
| `special_type` | string\|null | `"Basmala"`, `"Isti'adha"`, `"Isti'adha+Basmala"`, or null |
| `error` | string\|null | Per-segment error message |
### Word timestamps (inside `word_timestamps` JSON)

Populated when the user computes MFA timestamps. An array of per-segment objects, each holding a `words` timing array:

```json
[
  {
    "segment_idx": 1,
    "ref": "2:255:1-2:255:5",
    "words": [
      {"word": "ٱللَّهُ", "start": 0.512, "end": 0.841},
      {"word": "لَآ", "start": 0.870, "end": 1.023}
    ]
  }
]
```
### In-place mutation

The row dict is appended to the `ParquetScheduler` buffer on the initial run, and a reference to it is stored in `gr.State`. Subsequent actions (resegment, retranscribe, compute timestamps) mutate the dict in place before the next push-to-Hub tick (every 1 minute).

- **Overwritten on each run:** profiling, quality/retry stats, reciter stats, run-level settings (`min_silence_ms`, `asr_model`, etc.), `num_segments`, `surah`.
- **Appended on each run:** the `segments` JSON array gains a new run object with its settings and per-segment results.
- **Set once:** `word_timestamps` is populated when the user computes MFA timestamps (null until then).
- **If the push already fired** before a subsequent action, the mutation is a no-op on the already-uploaded row. The new results are lost for that row – acceptable, since the initial run is always captured.
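The overwrite/append flow can be sketched as follows (hypothetical helper names; the dict being mutated is the same object held by the scheduler buffer and by `gr.State`):

```python
import json

def initial_row(settings, run_segments):
    """First run: the row enters the scheduler buffer with one run object."""
    return {
        **settings,  # run-level settings, overwritten on later runs
        "segments": json.dumps([{**settings, "segments": run_segments}]),
        "word_timestamps": None,  # set once, when MFA timestamps are computed
    }

def apply_rerun(row, settings, run_segments):
    """Resegment/retranscribe: overwrite run-level fields and append a
    run object. Has no effect on rows already pushed to the Hub."""
    row.update(settings)  # overwrite run-level settings in place
    runs = json.loads(row["segments"])
    runs.append({**settings, "segments": run_segments})
    row["segments"] = json.dumps(runs)  # append preserves the run history
```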
### Design rationale

- **Settings are denormalised** into each row so config changes can be correlated with quality without joins.
- **Profiling fields are flat columns**, not nested JSON, so they are directly queryable in the HF dataset viewer and pandas.
- **Segments are an array of run objects** – each run includes its settings alongside the per-segment results, so different setting combinations are preserved even though run-level fields reflect the latest state.
- **`mean_confidence` is pre-computed** at the run level for easy filtering and sorting without parsing the segments array.
- **Audio is always uploaded** as the first column so every run is reproducible and the dataset is playable in the HF viewer.
- **`audio_id`** combines a content hash with a timestamp – the hash prefix groups re-runs of the same recording, the suffix makes each row unique.
- **All sources are existing objects** – `ProfilingData` (`segment_processor.py`), `SegmentInfo` (`segment_processor.py`), and `config.py` values. No new computation is required beyond assembling the row.