# Usage Logging
## Part 1 – Reference: `recitation_app` Logging
This page documents the recitation logging system used in `recitation_app` to collect anonymised analysis data on the HuggingFace Hub. It is included as a reference for the `quran_aligner` schema in Part 2.
### Dataset
| Property | Value |
|----------|-------|
| Repo | `hetchyy/recitation-logs` (private) |
| Type | HuggingFace Dataset |
| Format | Parquet files in `data/` |
| Push interval | 1 minute |
Configured in `config.py`:
```python
USAGE_LOG_DATASET_REPO = "hetchyy/recitation-logs"
USAGE_LOG_PUSH_INTERVAL_MINUTES = 1
USAGE_LOG_AUDIO = False # toggleable at runtime
```
### Schema
Defined in `utils/usage_logger.py` as `_RECITATION_SCHEMA`:
| Field | HF Type | Description |
|-------|---------|-------------|
| `audio` | `Audio` | Optional FLAC-encoded audio bytes embedded in parquet |
| `timestamp` | `Value(string)` | ISO 8601 datetime of the analysis |
| `user_id` | `Value(string)` | SHA-256 hash (12-char) of username or IP+UA |
| `verse_ref` | `Value(string)` | Quranic reference, e.g. `"1:1"` |
| `canonical_text` | `Value(string)` | Arabic text of the verse |
| `segments` | `Value(string)` | JSON array of segment results (see below) |
| `multi_model` | `Value(bool)` | Whether multiple ASR models were used |
| `settings` | `Value(string)` | JSON dict of Tajweed settings |
| `vad_timestamps` | `Value(string)` | JSON list of VAD segment boundaries |
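For reference, the features dict that ends up in the parquet metadata has roughly this shape. This is a sketch assuming the HF `datasets` JSON convention for feature descriptors; the names `RECITATION_FEATURES` and `hf_metadata` are illustrative, and the real schema lives in `utils/usage_logger.py`:

```python
import json

# Sketch (not the actual source) of the feature-descriptor dict that
# ParquetScheduler embeds in parquet metadata. "dtype"/"_type" follow the
# HF datasets JSON convention for Value/Audio features.
RECITATION_FEATURES = {
    "audio": {"_type": "Audio"},
    "timestamp": {"dtype": "string", "_type": "Value"},
    "user_id": {"dtype": "string", "_type": "Value"},
    "verse_ref": {"dtype": "string", "_type": "Value"},
    "canonical_text": {"dtype": "string", "_type": "Value"},
    "segments": {"dtype": "string", "_type": "Value"},
    "multi_model": {"dtype": "bool", "_type": "Value"},
    "settings": {"dtype": "string", "_type": "Value"},
    "vad_timestamps": {"dtype": "string", "_type": "Value"},
}

# The value stored under the "huggingface" metadata key in the parquet file,
# which lets the HF dataset viewer reconstruct the features.
hf_metadata = json.dumps({"info": {"features": RECITATION_FEATURES}})
```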
#### Segment object (inside `segments` JSON)
```json
{
  "segment_ref": "1:1",
  "canonical_phonemes": "b i s m i ...",
  "detected_phonemes": "b i s m i ..."
}
```
#### Settings object (inside `settings` JSON)
```json
{
  "tolerance": 0.15,
  "iqlab_sound": "m",
  "ghunnah_length": 2,
  "jaiz_length": 4,
  "wajib_length": 4,
  "arid_length": 2,
  "leen_length": 2
}
```
### ParquetScheduler
Custom subclass of `huggingface_hub.CommitScheduler` (`utils/usage_logger.py`).
#### How it works
1. **Buffer** – Rows accumulate in an in-memory list via `.append(row)`. Access is protected by a threading lock.
2. **Flush** – On each scheduler tick (every `USAGE_LOG_PUSH_INTERVAL_MINUTES`):
   - Lock the buffer, swap it out, release the lock.
   - For any `audio` field containing a file path, read the file and convert it to `{"path": filename, "bytes": binary_data}`.
   - Build a PyArrow table from the rows.
   - Embed the HF feature schema in the parquet metadata (note that `replace_schema_metadata` returns a new table):
     ```python
     table = table.replace_schema_metadata(
         {"huggingface": json.dumps({"info": {"features": schema}})}
     )
     ```
   - Write to a temp parquet file, then upload via `api.upload_file()` to `data/{uuid4()}.parquet`.
   - Clean up temp audio files.
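The lock-and-swap in the flush step can be sketched with stdlib threading. `BufferedScheduler` is a hypothetical stand-in; the real `ParquetScheduler` additionally builds the PyArrow table and uploads it:

```python
import threading

class BufferedScheduler:
    """Minimal sketch of the buffer/flush pattern (hypothetical class)."""

    def __init__(self):
        self._rows = []
        self._lock = threading.Lock()

    def append(self, row: dict) -> None:
        # Producers only ever touch the buffer under the lock.
        with self._lock:
            self._rows.append(row)

    def flush(self) -> list:
        # Swap the buffer out under the lock so append() is never blocked
        # during the (slow) parquet build and upload that follow.
        with self._lock:
            rows, self._rows = self._rows, []
        return rows  # real code: pyarrow table -> temp file -> upload_file()
```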
#### Audio encoding
When `USAGE_LOG_AUDIO` is enabled:
```python
sf.write(filepath, audio_array, sample_rate, format="FLAC")
row["audio"] = str(filepath) # ParquetScheduler reads and embeds the bytes
```
The audio is 16kHz mono, encoded as FLAC, and stored as embedded bytes inside the parquet file.
### Lazy Initialisation
Schedulers are **not** created at import time. They are initialised on first call to `_ensure_schedulers()` using double-checked locking:
```python
_recitation_scheduler = None
_schedulers_initialized = False
_init_lock = threading.Lock()

def _ensure_schedulers():
    global _recitation_scheduler, _schedulers_initialized
    if _schedulers_initialized:
        return
    with _init_lock:
        if _schedulers_initialized:
            return
        _schedulers_initialized = True
        _recitation_scheduler = ParquetScheduler(
            repo_id=USAGE_LOG_DATASET_REPO,
            schema=_RECITATION_SCHEMA,
            every=USAGE_LOG_PUSH_INTERVAL_MINUTES,
            path_in_repo="data",
            repo_type="dataset",
            private=True,
        )
```
This avoids interfering with ZeroGPU, which is sensitive to early network calls.
### Error Logging
Errors use a separate `CommitScheduler` (not `ParquetScheduler`) that watches a local directory:
- Local path: `/usage_logs/errors/error_log-{uuid4()}.jsonl`
- Remote path: `data/errors/`
- Format: JSONL with fields `timestamp`, `user_id`, `verse_ref`, `error_message`
Errors are appended to the JSONL file under a file lock. The `CommitScheduler` syncs the directory to Hub periodically.
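A minimal sketch of the append path, with a thread lock standing in for the file lock. `append_error` is a hypothetical helper name; the periodic upload itself is handled by `CommitScheduler`:

```python
import json
import threading
from pathlib import Path

_error_lock = threading.Lock()  # stand-in for the file lock used in the app

def append_error(log_path: Path, timestamp: str, user_id: str,
                 verse_ref: str, error_message: str) -> None:
    # One JSON object per line, matching the documented JSONL fields.
    row = {
        "timestamp": timestamp,
        "user_id": user_id,
        "verse_ref": verse_ref,
        "error_message": error_message,
    }
    with _error_lock:
        with log_path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```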
### User Anonymisation
```python
def get_user_id(request) -> str:
    username = getattr(request, "username", None)
    if username:
        return hashlib.sha256(username.encode()).hexdigest()[:12]
    headers = request.headers
    ip = headers.get("x-forwarded-for", "").split(",")[0].strip()
    ua = headers.get("user-agent", "")
    return hashlib.sha256(f"{ip}|{ua}".encode()).hexdigest()[:12]
```
- Logged-in HF users: hash of username
- Anonymous users: hash of IP + User-Agent
- Always truncated to 12 hex characters
### Fallback
If the scheduler fails to initialise (no HF token, network issues), rows are written to a local JSONL file at `usage_logs/recitations_fallback.jsonl` (without audio).
### Integration Point
Logging is called from the audio processing handler (`ui/handlers/audio_processing.py`) after each analysis completes:
```python
log_analysis(
    user_id, ref, text, segments,
    multi_model=bool(use_multi),
    settings=_settings,
    audio=audio_for_log,      # tuple of (sample_rate, np.ndarray) or None
    vad_timestamps=vad_ts,    # list of [start, end] pairs
)
```
Errors are logged separately:
```python
log_error(user_id, ref, "Audio loading failed")
```
### Dependencies
- `huggingface_hub` – `CommitScheduler` base class and Hub API
- `pyarrow` – Parquet table creation and schema metadata
- `soundfile` – FLAC audio encoding
---
## Part 2 – `quran_aligner` Logging Schema
Schema for logging alignment runs from this project. One row per audio upload. The row is mutated in-place while it sits in the `ParquetScheduler` buffer (before the next push-to-Hub tick). Run-level fields (profiling, reciter stats, quality stats, settings) are **overwritten** to reflect the latest run. Segment results are **appended** so every setting combination is preserved.
### Run-level fields
#### Identity
| Field | HF Type | Description |
|-------|---------|-------------|
| `audio` | `Audio` | FLAC-encoded audio (16kHz mono) |
| `audio_id` | `Value(string)` | `{sha256(audio_bytes)[:16]}:{timestamp}`, e.g. `a3f7b2c91e04d8f2:20260203T141532` |
| `timestamp` | `Value(string)` | ISO 8601 datetime truncated to seconds, e.g. `2026-02-03T01:50:45` |
| `user_id` | `Value(string)` | SHA-256 hash (12-char) of IP+UA |
The `audio_id` hash prefix enables grouping/deduplication of the same recording across runs; the timestamp suffix makes each run unique. Cost is ~90ms for a 5-minute recording.
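For illustration, the documented format can be reproduced with stdlib `hashlib` (`make_audio_id` is a hypothetical helper name, not the actual source):

```python
import hashlib
from datetime import datetime

def make_audio_id(audio_bytes: bytes, now: datetime) -> str:
    # First 16 hex chars of the content hash, then a compact
    # second-resolution timestamp, joined by ":".
    digest = hashlib.sha256(audio_bytes).hexdigest()[:16]
    return f"{digest}:{now.strftime('%Y%m%dT%H%M%S')}"
```

Two uploads of the same recording share the hash prefix but get distinct timestamp suffixes, which is what makes grouping and deduplication possible.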
#### Input metadata
| Field | HF Type | Description |
|-------|---------|-------------|
| `audio_duration_s` | `Value(float64)` | Total audio duration in seconds |
| `num_segments` | `Value(int32)` | Number of VAD segments |
| `surah` | `Value(int32)` | Detected surah (1-114) |
#### Segmentation settings
| Field | HF Type | Description |
|-------|---------|-------------|
| `min_silence_ms` | `Value(int32)` | Minimum silence duration to split |
| `min_speech_ms` | `Value(int32)` | Minimum speech duration for a valid segment |
| `pad_ms` | `Value(int32)` | Padding around speech segments |
| `asr_model` | `Value(string)` | `"Base"` (`hetchyy/r15_95m`) or `"Large"` (`hetchyy/r7`) |
| `device` | `Value(string)` | `"GPU"` or `"CPU"` |
#### Profiling (seconds)
| Field | HF Type | Description |
|-------|---------|-------------|
| `total_time` | `Value(float64)` | End-to-end pipeline wall time |
| `vad_queue_time` | `Value(float64)` | VAD queue wait time |
| `vad_gpu_time` | `Value(float64)` | VAD actual GPU execution |
| `asr_gpu_time` | `Value(float64)` | ASR actual GPU execution |
| `dp_total_time` | `Value(float64)` | Total DP alignment across all segments |
#### Quality & retry stats
| Field | HF Type | Description |
|-------|---------|-------------|
| `segments_passed` | `Value(int32)` | Segments with confidence > 0 |
| `segments_failed` | `Value(int32)` | Segments with confidence <= 0 |
| `mean_confidence` | `Value(float64)` | Average confidence across all segments |
| `tier1_retries` | `Value(int32)` | Expanded-window retry attempts |
| `tier1_passed` | `Value(int32)` | Successful tier 1 retries |
| `tier2_retries` | `Value(int32)` | Relaxed-threshold retry attempts |
| `tier2_passed` | `Value(int32)` | Successful tier 2 retries |
| `reanchors` | `Value(int32)` | Re-anchor events (after consecutive failures) |
| `special_merges` | `Value(int32)` | Basmala-fused segments |
#### Reciter stats
Computed from matched segments (those with `word_count > 0`). Already calculated in `app.py:877-922` for console output.
| Field | HF Type | Description |
|-------|---------|-------------|
| `words_per_minute` | `Value(float64)` | `total_words / (total_speech_s / 60)` |
| `phonemes_per_second` | `Value(float64)` | `total_phonemes / total_speech_s` |
| `avg_segment_duration` | `Value(float64)` | Mean duration of matched segments |
| `std_segment_duration` | `Value(float64)` | Std dev of matched segment durations |
| `avg_pause_duration` | `Value(float64)` | Mean inter-segment silence gap |
| `std_pause_duration` | `Value(float64)` | Std dev of pause durations |
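These stats can be reproduced from the per-segment fields. A sketch assuming each segment dict carries `start`, `end`, `word_count`, and `phoneme_count`, and that at least one segment matched (the actual computation lives in `app.py:877-922`):

```python
from statistics import mean, pstdev

def reciter_stats(segments: list) -> dict:
    """Sketch of the reciter stats; hypothetical helper name."""
    # Matched segments are those with at least one aligned word.
    matched = [s for s in segments if s["word_count"] > 0]
    durations = [s["end"] - s["start"] for s in matched]
    total_speech_s = sum(durations)
    total_words = sum(s["word_count"] for s in matched)
    total_phonemes = sum(s["phoneme_count"] for s in matched)
    # Pauses are the silences between consecutive matched segments.
    pauses = [b["start"] - a["end"] for a, b in zip(matched, matched[1:])]
    return {
        "words_per_minute": total_words / (total_speech_s / 60),
        "phonemes_per_second": total_phonemes / total_speech_s,
        "avg_segment_duration": mean(durations),
        "std_segment_duration": pstdev(durations),
        "avg_pause_duration": mean(pauses) if pauses else 0.0,
        "std_pause_duration": pstdev(pauses) if pauses else 0.0,
    }
```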
#### Session flags
| Field | HF Type | Description |
|-------|---------|-------------|
| `resegmented` | `Value(bool)` | User resegmented with different VAD settings |
| `retranscribed` | `Value(bool)` | User retranscribed with a different ASR model |
#### Segments, timestamps & error
| Field | HF Type | Description |
|-------|---------|-------------|
| `segments` | `Value(string)` | JSON array of run objects (see below) – **appended** on resegment/retranscribe |
| `word_timestamps` | `Value(string)` | JSON array of per-segment MFA word timings (see below), null until computed |
| `error` | `Value(string)` | Top-level error message if the pipeline failed |
### Segment runs (inside `segments` JSON)
Each run with different settings appends a new run object. The array preserves the full history so every setting combination is available.
```json
[
  {
    "min_silence_ms": 200,
    "min_speech_ms": 1000,
    "pad_ms": 100,
    "asr_model": "Base",
    "segments": [
      {
        "idx": 1,
        "start": 0.512,
        "end": 3.841,
        "duration": 3.329,
        "ref": "2:255:1-2:255:5",
        "confidence": 0.87,
        "word_count": 5,
        "ayah_span": 1,
        "phoneme_count": 42,
        "undersegmented": false,
        "missing_words": false,
        "special_type": null,
        "error": null
      }
    ]
  },
  {
    "min_silence_ms": 600,
    "min_speech_ms": 1500,
    "pad_ms": 300,
    "asr_model": "Base",
    "segments": [...]
  }
]
```
#### Run object
| Field | Type | Description |
|-------|------|-------------|
| `min_silence_ms` | int | Silence setting used for this run |
| `min_speech_ms` | int | Speech setting used for this run |
| `pad_ms` | int | Pad setting used for this run |
| `asr_model` | string | `"Base"` or `"Large"` |
| `segments` | array | Per-segment objects for this run |
#### Per-segment object
| Field | Type | Description |
|-------|------|-------------|
| `idx` | int | 1-indexed segment number |
| `start` | float | Segment start time in seconds |
| `end` | float | Segment end time in seconds |
| `duration` | float | `end - start` |
| `ref` | string | Matched reference `"S:A:W1-S:A:W2"`, empty if failed |
| `confidence` | float | Alignment confidence [0.0, 1.0] |
| `word_count` | int | Number of words matched |
| `ayah_span` | int | Number of ayahs spanned |
| `phoneme_count` | int | Length of ASR phoneme sequence |
| `undersegmented` | bool | Flagged if `word_count >= 20`, or if `ayah_span >= 2` and `duration >= 15` s |
| `missing_words` | bool | Gaps detected in word alignment |
| `special_type` | string\|null | `"Basmala"`, `"Isti'adha"`, `"Isti'adha+Basmala"`, or null |
| `error` | string\|null | Per-segment error message |
### Word timestamps (inside `word_timestamps` JSON)
Populated when the user computes MFA timestamps. Array of per-segment word timing arrays:
```json
[
  {
    "segment_idx": 1,
    "ref": "2:255:1-2:255:5",
    "words": [
      {"word": "ٱللَّهُ", "start": 0.512, "end": 0.841},
      {"word": "لَآ", "start": 0.870, "end": 1.023}
    ]
  }
]
```
### In-place mutation
The row dict is appended to the `ParquetScheduler` buffer on the initial run, and a reference to it is stored in `gr.State`. Subsequent actions (resegment, retranscribe, compute timestamps) mutate the dict in-place before the next push-to-Hub tick (every 1 minute).
- **Overwritten on each run:** profiling, quality/retry stats, reciter stats, run-level settings (`min_silence_ms`, `asr_model`, etc.), `num_segments`, `surah`.
- **Appended on each run:** `segments` JSON array gains a new run object with its settings and per-segment results.
- **Set once:** `word_timestamps` is populated when the user computes MFA timestamps (null until then).
- **If the push already fired** before a subsequent action, the mutation is a no-op on the already-uploaded row. The new results are lost for that row – acceptable, since the initial run is always captured.
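The overwrite/append rules above can be sketched as follows. `record_rerun` is a hypothetical helper; the app mutates the `gr.State` row directly rather than through a function like this:

```python
import json

def record_rerun(row: dict, run_settings: dict, run_segments: list,
                 run_stats: dict) -> None:
    """Sketch of the in-place mutation pattern (hypothetical helper)."""
    # Overwritten: the latest run-level settings and stats win.
    row.update(run_settings)
    row.update(run_stats)
    row["num_segments"] = len(run_segments)
    # Appended: the segments JSON keeps one run object per settings combo.
    history = json.loads(row["segments"]) if row.get("segments") else []
    history.append({**run_settings, "segments": run_segments})
    row["segments"] = json.dumps(history, ensure_ascii=False)
```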
### Design rationale
- **Settings are denormalised** into each row so config changes can be correlated with quality without joins.
- **Profiling fields are flat columns**, not nested JSON, so they are directly queryable in the HF dataset viewer and pandas.
- **Segments are an array of run objects** – each run includes its settings alongside the per-segment results, so different setting combinations are preserved even though run-level fields reflect the latest state.
- **`mean_confidence` is pre-computed** at the run level for easy filtering and sorting without parsing the segments array.
- **Audio is always uploaded** as the first column so every run is reproducible and the dataset is playable in the HF viewer.
- **`audio_id`** combines a content hash with a timestamp – the hash prefix groups re-runs of the same recording, the suffix makes each row unique.
- **All fields come from existing objects** – `ProfilingData` (segment_processor.py), `SegmentInfo` (segment_processor.py), and `config.py` values. No new computation is required beyond assembling the row.