# Usage Logging

## Part 1 — Reference: `recitation_app` Logging

Documents the recitation logging system used in `recitation_app` to collect anonymised analysis data on HuggingFace Hub. Included here as a reference for the quran_aligner schema below.

### Dataset

| Property | Value |
|----------|-------|
| Repo | `hetchyy/recitation-logs` (private) |
| Type | HuggingFace Dataset |
| Format | Parquet files in `data/` |
| Push interval | 1 minute |

Configured in `config.py`:

```python
USAGE_LOG_DATASET_REPO = "hetchyy/recitation-logs"
USAGE_LOG_PUSH_INTERVAL_MINUTES = 1
USAGE_LOG_AUDIO = False  # toggleable at runtime
```

### Schema

Defined in `utils/usage_logger.py` as `_RECITATION_SCHEMA`:

| Field | HF Type | Description |
|-------|---------|-------------|
| `audio` | `Audio` | Optional FLAC-encoded audio bytes embedded in parquet |
| `timestamp` | `Value(string)` | ISO 8601 datetime of the analysis |
| `user_id` | `Value(string)` | First 12 hex characters of the SHA-256 hash of the username, or of IP + User-Agent |
| `verse_ref` | `Value(string)` | Quranic reference, e.g. `"1:1"` |
| `canonical_text` | `Value(string)` | Arabic text of the verse |
| `segments` | `Value(string)` | JSON array of segment results (see below) |
| `multi_model` | `Value(bool)` | Whether multiple ASR models were used |
| `settings` | `Value(string)` | JSON dict of Tajweed settings |
| `vad_timestamps` | `Value(string)` | JSON list of VAD segment boundaries |

#### Segment object (inside `segments` JSON)

```json
{
  "segment_ref": "1:1",
  "canonical_phonemes": "b i s m i ...",
  "detected_phonemes": "b i s m i ..."
}
```

#### Settings object (inside `settings` JSON)

```json
{
  "tolerance": 0.15,
  "iqlab_sound": "m",
  "ghunnah_length": 2,
  "jaiz_length": 4,
  "wajib_length": 4,
  "arid_length": 2,
  "leen_length": 2
}
```
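
Because `segments`, `settings` and `vad_timestamps` are `Value(string)` columns, nested values have to be serialised to JSON text before a row is buffered and parsed again when the dataset is read. A minimal sketch of that round trip follows; the helper name `encode_json_columns` is hypothetical and not taken from `recitation_app`.

```python
import json

def encode_json_columns(row, json_fields=("segments", "settings", "vad_timestamps")):
    """Return a copy of row with nested values serialised to JSON strings."""
    encoded = dict(row)
    for name in json_fields:
        if name in encoded and not isinstance(encoded[name], str):
            # ensure_ascii=False keeps any Arabic text readable in the file.
            encoded[name] = json.dumps(encoded[name], ensure_ascii=False)
    return encoded

row = {
    "verse_ref": "1:1",
    "segments": [{"segment_ref": "1:1"}],
    "settings": {"tolerance": 0.15, "iqlab_sound": "m"},
    "vad_timestamps": [[0.5, 3.8]],
}
encoded = encode_json_columns(row)

# Reading side: parse the string columns back into Python objects.
settings = json.loads(encoded["settings"])
```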

### ParquetScheduler

Custom subclass of `huggingface_hub.CommitScheduler` (`utils/usage_logger.py`).

#### How it works

1. **Buffer** — Rows accumulate in an in-memory list via `.append(row)`. Access is protected by a threading lock.
2. **Flush** — On each scheduler tick (every `USAGE_LOG_PUSH_INTERVAL_MINUTES`):
   - Lock the buffer, swap it out, release the lock.
   - For any `audio` field containing a file path, read the file and convert to `{"path": filename, "bytes": binary_data}`.
   - Build a PyArrow table from the rows.
   - Embed the HF feature schema in parquet metadata:
     ```python
     # PyArrow tables are immutable, so keep the table that is returned.
     table = table.replace_schema_metadata(
         {"huggingface": json.dumps({"info": {"features": schema}})}
     )
     ```
   - Write to a temp parquet file, then upload via `api.upload_file()` to `data/{uuid4()}.parquet`.
   - Clean up temp audio files.
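
The buffer-and-swap part of the steps above can be sketched without any Hub dependency. The class name `RowBuffer` and its `drain()` method are illustrative assumptions; in `recitation_app` this logic lives inside `ParquetScheduler` itself.

```python
import threading

class RowBuffer:
    """Thread-safe list of rows with a swap-and-drain operation."""

    def __init__(self):
        self._rows = []
        self._lock = threading.Lock()

    def append(self, row):
        with self._lock:
            self._rows.append(row)

    def drain(self):
        # Hold the lock only long enough to swap in a fresh list, so
        # producers are not blocked while the old rows are being uploaded.
        with self._lock:
            rows, self._rows = self._rows, []
        return rows

buffer = RowBuffer()
buffer.append({"verse_ref": "1:1"})
buffer.append({"verse_ref": "1:2"})

pending = buffer.drain()   # rows that would become one parquet file
leftover = buffer.drain()  # nothing has arrived since the swap
```

Swapping the list rather than copying it keeps the critical section short, which matters because the slow work (reading audio files, writing parquet, uploading) happens after the lock is released.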

#### Audio encoding

When `USAGE_LOG_AUDIO` is enabled:

```python
import soundfile as sf

sf.write(filepath, audio_array, sample_rate, format="FLAC")
row["audio"] = str(filepath)  # ParquetScheduler reads and embeds the bytes
```

The audio is 16 kHz mono, encoded as FLAC, and stored as embedded bytes inside the parquet file.

### Lazy Initialisation

Schedulers are **not** created at import time. They are initialised on first call to `_ensure_schedulers()` using double-checked locking:

```python
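import threading

_recitation_scheduler = None
_schedulers_initialized = False
_init_lock = threading.Lock()

def _ensure_schedulers():
    global _recitation_scheduler, _schedulers_initialized
    if _schedulers_initialized:  # fast path: no lock once initialised
        return
    with _init_lock:
        if _schedulers_initialized:  # re-check now that we hold the lock
            return
        try:
            _recitation_scheduler = ParquetScheduler(
                repo_id=USAGE_LOG_DATASET_REPO,
                schema=_RECITATION_SCHEMA,
                every=USAGE_LOG_PUSH_INTERVAL_MINUTES,
                path_in_repo="data",
                repo_type="dataset",
                private=True,
            )
        finally:
            # Set the flag only after the constructor has returned or raised,
            # so no thread can take the fast path while construction is still
            # in progress. It is set even on failure, so initialisation is
            # attempted only once.
            _schedulers_initialized = True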
```

This avoids interfering with ZeroGPU, which is sensitive to early network calls.

### Error Logging

Errors use a separate `CommitScheduler` (not `ParquetScheduler`) that watches a local directory:

- Local path: `/usage_logs/errors/error_log-{uuid4()}.jsonl`
- Remote path: `data/errors/`
- Format: JSONL with fields `timestamp`, `user_id`, `verse_ref`, `error_message`

Errors are appended to the JSONL file under a file lock. The `CommitScheduler` syncs the directory to Hub periodically.
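
A minimal sketch of the append step, assuming an in-process `threading.Lock` stands in for the file lock mentioned above. The function name `append_error` and the example file name are hypothetical; syncing the directory to the Hub is left to `CommitScheduler` and is not shown.

```python
import json
import tempfile
import threading
from datetime import datetime, timezone
from pathlib import Path

_error_lock = threading.Lock()  # assumption: stand-in for the real file lock

def append_error(log_path, user_id, verse_ref, error_message):
    """Append one error record to the JSONL file as a single line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "user_id": user_id,
        "verse_ref": verse_ref,
        "error_message": error_message,
    }
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with _error_lock, log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record

# Usage: write two errors to a temporary directory and read them back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "errors" / "error_log-example.jsonl"
    append_error(path, "0123456789ab", "1:1", "Audio loading failed")
    append_error(path, "0123456789ab", "1:2", "Audio loading failed")
    lines = path.read_text(encoding="utf-8").splitlines()
```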

### User Anonymisation

```python
import hashlib

def get_user_id(request) -> str:
    # Logged-in HF users: hash the username.
    username = getattr(request, "username", None)
    if username:
        return hashlib.sha256(username.encode()).hexdigest()[:12]
    # Anonymous users: hash the first forwarded IP plus the User-Agent.
    headers = request.headers
    ip = headers.get("x-forwarded-for", "").split(",")[0].strip()
    ua = headers.get("user-agent", "")
    return hashlib.sha256(f"{ip}|{ua}".encode()).hexdigest()[:12]
```

- Logged-in HF users: hash of username
- Anonymous users: hash of IP + User-Agent
- Always truncated to 12 hex characters

### Fallback

If the scheduler fails to initialise (no HF token, network issues), rows are written to a local JSONL file at `usage_logs/recitations_fallback.jsonl` (without audio).
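
The fallback path might look roughly like the sketch below. The function name `log_row` and its return values are assumptions made for illustration; only the `.append(row)` call and the "no audio in the fallback file" rule come from the description above.

```python
import json
import tempfile
from pathlib import Path

def log_row(row, scheduler, fallback_path):
    """Buffer the row on the scheduler, or append it to a local JSONL file."""
    if scheduler is not None:
        scheduler.append(row)
        return "hub"
    # No scheduler available: drop the audio and keep the rest locally.
    slim = {k: v for k, v in row.items() if k != "audio"}
    fallback_path = Path(fallback_path)
    fallback_path.parent.mkdir(parents=True, exist_ok=True)
    with fallback_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(slim, ensure_ascii=False) + "\n")
    return "local"

# Usage: with no scheduler, the row lands in the fallback file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "usage_logs" / "recitations_fallback.jsonl"
    destination = log_row(
        {"verse_ref": "1:1", "audio": "/tmp/example.flac"}, None, path
    )
    written = json.loads(path.read_text(encoding="utf-8"))
```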

### Integration Point

Logging is called from the audio processing handler (`ui/handlers/audio_processing.py`) after each analysis completes:

```python
log_analysis(
    user_id, ref, text, segments,
    multi_model=bool(use_multi),
    settings=_settings,
    audio=audio_for_log,       # tuple of (sample_rate, np.ndarray) or None
    vad_timestamps=vad_ts,     # list of [start, end] pairs
)
```

Errors are logged separately:

```python
log_error(user_id, ref, "Audio loading failed")
```

### Dependencies

- `huggingface_hub` — `CommitScheduler` base class and Hub API
- `pyarrow` — Parquet table creation and schema metadata
- `soundfile` — FLAC audio encoding

---

## Part 2 — `quran_aligner` Logging Schema

Schema for logging alignment runs from this project. One row per audio upload. The row is mutated in-place while it sits in the `ParquetScheduler` buffer (before the next push-to-Hub tick). Run-level fields (profiling, reciter stats, quality stats, settings) are **overwritten** to reflect the latest run. Segment results are **appended** so every setting combination is preserved.

### Run-level fields

#### Identity

| Field | HF Type | Description |
|-------|---------|-------------|
| `audio` | `Audio` | FLAC-encoded audio (16 kHz mono) |
| `audio_id` | `Value(string)` | `{sha256(audio_bytes)[:16]}:{timestamp}`, e.g. `a3f7b2c91e04d8f2:20260203T141532` |
| `timestamp` | `Value(string)` | ISO 8601 datetime truncated to seconds, e.g. `2026-02-03T01:50:45` |
| `user_id` | `Value(string)` | First 12 hex characters of the SHA-256 hash of IP + User-Agent |

The `audio_id` hash prefix enables grouping and deduplication of the same recording across runs; the timestamp suffix makes each run unique. Computing the hash costs about 90 ms for a 5-minute recording.
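
A sketch of how such an ID could be built, following the format in the table. The function name `make_audio_id` is hypothetical, and which bytes are hashed (the uploaded file or the decoded samples) is an assumption.

```python
import hashlib
from datetime import datetime

def make_audio_id(audio_bytes: bytes, when: datetime) -> str:
    """Content-hash prefix (16 hex chars) plus a compact timestamp suffix."""
    digest = hashlib.sha256(audio_bytes).hexdigest()[:16]
    return f"{digest}:{when.strftime('%Y%m%dT%H%M%S')}"

# The same recording processed twice shares a prefix but not a full ID.
first = make_audio_id(b"example-audio", datetime(2026, 2, 3, 14, 15, 32))
rerun = make_audio_id(b"example-audio", datetime(2026, 2, 3, 14, 20, 0))
same_recording = first.split(":")[0] == rerun.split(":")[0]
```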

#### Input metadata

| Field | HF Type | Description |
|-------|---------|-------------|
| `audio_duration_s` | `Value(float64)` | Total audio duration in seconds |
| `num_segments` | `Value(int32)` | Number of VAD segments |
| `surah` | `Value(int32)` | Detected surah (1-114) |

#### Segmentation settings

| Field | HF Type | Description |
|-------|---------|-------------|
| `min_silence_ms` | `Value(int32)` | Minimum silence duration to split |
| `min_speech_ms` | `Value(int32)` | Minimum speech duration for a valid segment |
| `pad_ms` | `Value(int32)` | Padding around speech segments |
| `asr_model` | `Value(string)` | `"Base"` (`hetchyy/r15_95m`) or `"Large"` (`hetchyy/r7`) |
| `device` | `Value(string)` | `"GPU"` or `"CPU"` |

#### Profiling (seconds)

| Field | HF Type | Description |
|-------|---------|-------------|
| `total_time` | `Value(float64)` | End-to-end pipeline wall time |
| `vad_queue_time` | `Value(float64)` | VAD queue wait time |
| `vad_gpu_time` | `Value(float64)` | VAD actual GPU execution |
| `asr_gpu_time` | `Value(float64)` | ASR actual GPU execution |
| `dp_total_time` | `Value(float64)` | Total DP alignment across all segments |

#### Quality & retry stats

| Field | HF Type | Description |
|-------|---------|-------------|
| `segments_passed` | `Value(int32)` | Segments with confidence > 0 |
| `segments_failed` | `Value(int32)` | Segments with confidence <= 0 |
| `mean_confidence` | `Value(float64)` | Average confidence across all segments |
| `tier1_retries` | `Value(int32)` | Expanded-window retry attempts |
| `tier1_passed` | `Value(int32)` | Successful tier 1 retries |
| `tier2_retries` | `Value(int32)` | Relaxed-threshold retry attempts |
| `tier2_passed` | `Value(int32)` | Successful tier 2 retries |
| `reanchors` | `Value(int32)` | Re-anchor events (after consecutive failures) |
| `special_merges` | `Value(int32)` | Basmala-fused segments |

#### Reciter stats

Computed from matched segments (those with `word_count > 0`). Already calculated in `app.py:877-922` for console output.

| Field | HF Type | Description |
|-------|---------|-------------|
| `words_per_minute` | `Value(float64)` | `total_words / (total_speech_s / 60)` |
| `phonemes_per_second` | `Value(float64)` | `total_phonemes / total_speech_s` |
| `avg_segment_duration` | `Value(float64)` | Mean duration of matched segments |
| `std_segment_duration` | `Value(float64)` | Std dev of matched segment durations |
| `avg_pause_duration` | `Value(float64)` | Mean inter-segment silence gap |
| `std_pause_duration` | `Value(float64)` | Std dev of pause durations |
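
The formulas in the table can be sketched as below. This is not the code from `app.py`: the function name `reciter_stats` is hypothetical, and three details are assumptions, namely that `total_speech_s` is the summed duration of matched segments, that pauses are the gaps between consecutive matched segments, and that the standard deviations are population values (`statistics.pstdev`).

```python
from statistics import fmean, pstdev

def reciter_stats(segments):
    """Compute reciter stats from segments that matched at least one word."""
    matched = [s for s in segments if s["word_count"] > 0]
    if not matched:
        return None
    durations = [s["end"] - s["start"] for s in matched]
    total_speech_s = sum(durations)
    # Silence between the end of one matched segment and the start of the next.
    pauses = [b["start"] - a["end"] for a, b in zip(matched, matched[1:])]
    total_words = sum(s["word_count"] for s in matched)
    total_phonemes = sum(s["phoneme_count"] for s in matched)
    return {
        "words_per_minute": total_words / (total_speech_s / 60),
        "phonemes_per_second": total_phonemes / total_speech_s,
        "avg_segment_duration": fmean(durations),
        "std_segment_duration": pstdev(durations),
        "avg_pause_duration": fmean(pauses) if pauses else 0.0,
        "std_pause_duration": pstdev(pauses) if pauses else 0.0,
    }

example = reciter_stats([
    {"start": 0.0, "end": 3.0, "word_count": 6, "phoneme_count": 30},
    {"start": 4.0, "end": 6.0, "word_count": 4, "phoneme_count": 20},
    {"start": 6.5, "end": 7.0, "word_count": 0, "phoneme_count": 5},  # unmatched
])
```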

#### Session flags

| Field | HF Type | Description |
|-------|---------|-------------|
| `resegmented` | `Value(bool)` | User resegmented with different VAD settings |
| `retranscribed` | `Value(bool)` | User retranscribed with a different ASR model |

#### Segments, timestamps & error

| Field | HF Type | Description |
|-------|---------|-------------|
| `segments` | `Value(string)` | JSON array of run objects (see below) — **appended** on resegment/retranscribe |
| `word_timestamps` | `Value(string)` | JSON array of per-segment MFA word timings (see below), null until computed |
| `error` | `Value(string)` | Top-level error message if the pipeline failed |

### Segment runs (inside `segments` JSON)

Each run with different settings appends a new run object. The array preserves the full history so every setting combination is available.

```json
[
  {
    "min_silence_ms": 200,
    "min_speech_ms": 1000,
    "pad_ms": 100,
    "asr_model": "Base",
    "segments": [
      {
        "idx": 1,
        "start": 0.512,
        "end": 3.841,
        "duration": 3.329,
        "ref": "2:255:1-2:255:5",
        "confidence": 0.87,
        "word_count": 5,
        "ayah_span": 1,
        "phoneme_count": 42,
        "undersegmented": false,
        "missing_words": false,
        "special_type": null,
        "error": null
      }
    ]
  },
  {
    "min_silence_ms": 600,
    "min_speech_ms": 1500,
    "pad_ms": 300,
    "asr_model": "Base",
    "segments": [...]
  }
]
```

#### Run object

| Field | Type | Description |
|-------|------|-------------|
| `min_silence_ms` | int | Silence setting used for this run |
| `min_speech_ms` | int | Speech setting used for this run |
| `pad_ms` | int | Pad setting used for this run |
| `asr_model` | string | `"Base"` or `"Large"` |
| `segments` | array | Per-segment objects for this run |

#### Per-segment object

| Field | Type | Description |
|-------|------|-------------|
| `idx` | int | 1-indexed segment number |
| `start` | float | Segment start time in seconds |
| `end` | float | Segment end time in seconds |
| `duration` | float | `end - start` |
| `ref` | string | Matched reference `"S:A:W1-S:A:W2"`, empty if failed |
| `confidence` | float | Alignment confidence [0.0, 1.0] |
| `word_count` | int | Number of words matched |
| `ayah_span` | int | Number of ayahs spanned |
| `phoneme_count` | int | Length of ASR phoneme sequence |
| `undersegmented` | bool | Flagged if word_count >= 20 or ayah_span >= 2 and duration >= 15s |
| `missing_words` | bool | Gaps detected in word alignment |
| `special_type` | string\|null | `"Basmala"`, `"Isti'adha"`, `"Isti'adha+Basmala"`, or null |
| `error` | string\|null | Per-segment error message |
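
When analysing the logs, the `ref` string usually needs to be split into its start and end positions. A small parser sketch, assuming both sides of the dash always carry three colon-separated integers as in the example `"2:255:1-2:255:5"`; the function name `parse_ref` is hypothetical.

```python
def parse_ref(ref):
    """Split "S:A:W-S:A:W" into two (surah, ayah, word) tuples.

    Returns None for the empty string, which marks a failed segment.
    """
    if not ref:
        return None
    start, end = ref.split("-")
    return (
        tuple(int(part) for part in start.split(":")),
        tuple(int(part) for part in end.split(":")),
    )

span = parse_ref("2:255:1-2:255:5")
failed = parse_ref("")
```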

### Word timestamps (inside `word_timestamps` JSON)

Populated when the user computes MFA timestamps. Array of per-segment word timing arrays:

```json
[
  {
    "segment_idx": 1,
    "ref": "2:255:1-2:255:5",
    "words": [
      {"word": "ٱللَّهُ", "start": 0.512, "end": 0.841},
      {"word": "لَآ", "start": 0.870, "end": 1.023}
    ]
  }
]
```

### In-place mutation

The row dict is appended to `ParquetScheduler` on the initial run, and a reference is stored in `gr.State`. Subsequent actions (resegment, retranscribe, compute timestamps) mutate the dict in-place before the next push-to-Hub tick (every 1 minute).

- **Overwritten on each run:** profiling, quality/retry stats, reciter stats, run-level settings (`min_silence_ms`, `asr_model`, etc.), `num_segments`, `surah`.
- **Appended on each run:** `segments` JSON array gains a new run object with its settings and per-segment results.
- **Set once:** `word_timestamps` is populated when the user computes MFA timestamps (null until then).
- **If the push already fired** before a subsequent action, the mutation only changes the local dict: the uploaded row is not updated, and the new results are never pushed for that row. This is acceptable because the initial run is always captured.
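
The overwrite and append rules above can be sketched with a plain list standing in for the `ParquetScheduler` buffer. The helper name `apply_run` is hypothetical, and `state_ref` only imitates the reference that `gr.State` would hold.

```python
import json

def apply_run(row, run_fields, run_object):
    """Overwrite run-level fields and append one run object to the history."""
    row.update(run_fields)  # latest run wins for flat columns
    history = json.loads(row["segments"]) if row.get("segments") else []
    history.append(run_object)  # earlier runs are preserved
    row["segments"] = json.dumps(history)

buffer = []  # stand-in for the scheduler's in-memory buffer
row = {
    "audio_id": "a3f7b2c91e04d8f2:20260203T141532",
    "min_silence_ms": 200,
    "asr_model": "Base",
    "segments": json.dumps(
        [{"min_silence_ms": 200, "asr_model": "Base", "segments": []}]
    ),
}
buffer.append(row)
state_ref = row  # the same dict object, not a copy

# Resegmenting mutates the dict that is still sitting in the buffer.
apply_run(
    state_ref,
    {"min_silence_ms": 600},
    {"min_silence_ms": 600, "asr_model": "Base", "segments": []},
)
runs = json.loads(buffer[0]["segments"])
```

This works only because the buffer and the session state share one object; copying the row at either end would silently break the update.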

### Design rationale

- **Settings are denormalised** into each row so config changes can be correlated with quality without joins.
- **Profiling fields are flat columns**, not nested JSON, so they are directly queryable in the HF dataset viewer and pandas.
- **Segments are an array of run objects** — each run includes its settings alongside the per-segment results, so different setting combinations are preserved even though run-level fields reflect the latest state.
- **`mean_confidence` is pre-computed** at the run level for easy filtering and sorting without parsing the segments array.
- **Audio is always uploaded** as the first column so every run is reproducible and the dataset is playable in the HF viewer.
- **`audio_id`** combines a content hash with a timestamp — the hash prefix groups re-runs of the same recording, the suffix makes each row unique.
- **All sources are from existing objects** — `ProfilingData` (segment_processor.py), `SegmentInfo` (segment_processor.py), and `config.py` values. No new computation is required beyond assembling the row.