# Usage Logging
## Part 1 – Reference: `recitation_app` Logging
This page documents the recitation logging system used in `recitation_app` to collect anonymised analysis data on the HuggingFace Hub. It is included as a reference for the `quran_aligner` schema in Part 2.
### Dataset
| Property | Value |
|----------|-------|
| Repo | `hetchyy/recitation-logs` (private) |
| Type | HuggingFace Dataset |
| Format | Parquet files in `data/` |
| Push interval | 1 minute |
Configured in `config.py`:
```python
USAGE_LOG_DATASET_REPO = "hetchyy/recitation-logs"
USAGE_LOG_PUSH_INTERVAL_MINUTES = 1
USAGE_LOG_AUDIO = False # toggleable at runtime
```
### Schema
Defined in `utils/usage_logger.py` as `_RECITATION_SCHEMA`:
| Field | HF Type | Description |
|-------|---------|-------------|
| `audio` | `Audio` | Optional FLAC-encoded audio bytes embedded in parquet |
| `timestamp` | `Value(string)` | ISO 8601 datetime of the analysis |
| `user_id` | `Value(string)` | SHA-256 hash (12-char) of username or IP+UA |
| `verse_ref` | `Value(string)` | Quranic reference, e.g. `"1:1"` |
| `canonical_text` | `Value(string)` | Arabic text of the verse |
| `segments` | `Value(string)` | JSON array of segment results (see below) |
| `multi_model` | `Value(bool)` | Whether multiple ASR models were used |
| `settings` | `Value(string)` | JSON dict of Tajweed settings |
| `vad_timestamps` | `Value(string)` | JSON list of VAD segment boundaries |
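For reference, the features dict that ends up in the parquet metadata has roughly this shape. This is a sketch assuming the HF `datasets` JSON convention for feature descriptors; the names `RECITATION_FEATURES` and `hf_metadata` are illustrative, and the real schema lives in `utils/usage_logger.py`:

```python
import json

# Sketch (not the actual source) of the feature-descriptor dict that
# ParquetScheduler embeds in parquet metadata. "dtype"/"_type" follow the
# HF datasets JSON convention for Value/Audio features.
RECITATION_FEATURES = {
    "audio": {"_type": "Audio"},
    "timestamp": {"dtype": "string", "_type": "Value"},
    "user_id": {"dtype": "string", "_type": "Value"},
    "verse_ref": {"dtype": "string", "_type": "Value"},
    "canonical_text": {"dtype": "string", "_type": "Value"},
    "segments": {"dtype": "string", "_type": "Value"},
    "multi_model": {"dtype": "bool", "_type": "Value"},
    "settings": {"dtype": "string", "_type": "Value"},
    "vad_timestamps": {"dtype": "string", "_type": "Value"},
}

# The value stored under the "huggingface" metadata key in the parquet file,
# which lets the HF dataset viewer reconstruct the features.
hf_metadata = json.dumps({"info": {"features": RECITATION_FEATURES}})
```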
#### Segment object (inside `segments` JSON)
```json
{
  "segment_ref": "1:1",
  "canonical_phonemes": "b i s m i ...",
  "detected_phonemes": "b i s m i ..."
}
```
#### Settings object (inside `settings` JSON)
```json
{
  "tolerance": 0.15,
  "iqlab_sound": "m",
  "ghunnah_length": 2,
  "jaiz_length": 4,
  "wajib_length": 4,
  "arid_length": 2,
  "leen_length": 2
}
```
### ParquetScheduler
Custom subclass of `huggingface_hub.CommitScheduler` (`utils/usage_logger.py`).
#### How it works
1. **Buffer** – Rows accumulate in an in-memory list via `.append(row)`. Access is protected by a threading lock.
2. **Flush** – On each scheduler tick (every `USAGE_LOG_PUSH_INTERVAL_MINUTES`):
   - Lock the buffer, swap it out, release the lock.
   - For any `audio` field containing a file path, read the file and convert it to `{"path": filename, "bytes": binary_data}`.
   - Build a PyArrow table from the rows.
   - Embed the HF feature schema in the parquet metadata (note that `replace_schema_metadata` returns a new table):
     ```python
     table = table.replace_schema_metadata(
         {"huggingface": json.dumps({"info": {"features": schema}})}
     )
     ```
   - Write to a temp parquet file, then upload via `api.upload_file()` to `data/{uuid4()}.parquet`.
   - Clean up temp audio files.
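The lock-and-swap in the flush step can be sketched with stdlib threading. `BufferedScheduler` is a hypothetical stand-in; the real `ParquetScheduler` additionally builds the PyArrow table and uploads it:

```python
import threading

class BufferedScheduler:
    """Minimal sketch of the buffer/flush pattern (hypothetical class)."""

    def __init__(self):
        self._rows = []
        self._lock = threading.Lock()

    def append(self, row: dict) -> None:
        # Producers only ever touch the buffer under the lock.
        with self._lock:
            self._rows.append(row)

    def flush(self) -> list:
        # Swap the buffer out under the lock so append() is never blocked
        # during the (slow) parquet build and upload that follow.
        with self._lock:
            rows, self._rows = self._rows, []
        return rows  # real code: pyarrow table -> temp file -> upload_file()
```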
#### Audio encoding
When `USAGE_LOG_AUDIO` is enabled:
```python
sf.write(filepath, audio_array, sample_rate, format="FLAC")
row["audio"] = str(filepath) # ParquetScheduler reads and embeds the bytes
```
The audio is 16kHz mono, encoded as FLAC, and stored as embedded bytes inside the parquet file.
### Lazy Initialisation
Schedulers are **not** created at import time. They are initialised on first call to `_ensure_schedulers()` using double-checked locking:
```python
_recitation_scheduler = None
_schedulers_initialized = False
_init_lock = threading.Lock()

def _ensure_schedulers():
    global _recitation_scheduler, _schedulers_initialized
    if _schedulers_initialized:
        return
    with _init_lock:
        if _schedulers_initialized:
            return
        _schedulers_initialized = True
        _recitation_scheduler = ParquetScheduler(
            repo_id=USAGE_LOG_DATASET_REPO,
            schema=_RECITATION_SCHEMA,
            every=USAGE_LOG_PUSH_INTERVAL_MINUTES,
            path_in_repo="data",
            repo_type="dataset",
            private=True,
        )
```
This avoids interfering with ZeroGPU, which is sensitive to early network calls.
### Error Logging
Errors use a separate `CommitScheduler` (not `ParquetScheduler`) that watches a local directory:
- Local path: `/usage_logs/errors/error_log-{uuid4()}.jsonl`
- Remote path: `data/errors/`
- Format: JSONL with fields `timestamp`, `user_id`, `verse_ref`, `error_message`
Errors are appended to the JSONL file under a file lock. The `CommitScheduler` syncs the directory to Hub periodically.
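A minimal sketch of the append path, with a thread lock standing in for the file lock. `append_error` is a hypothetical helper name; the periodic upload itself is handled by `CommitScheduler`:

```python
import json
import threading
from pathlib import Path

_error_lock = threading.Lock()  # stand-in for the file lock used in the app

def append_error(log_path: Path, timestamp: str, user_id: str,
                 verse_ref: str, error_message: str) -> None:
    # One JSON object per line, matching the documented JSONL fields.
    row = {
        "timestamp": timestamp,
        "user_id": user_id,
        "verse_ref": verse_ref,
        "error_message": error_message,
    }
    with _error_lock:
        with log_path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```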
### User Anonymisation
```python
def get_user_id(request) -> str:
    username = getattr(request, "username", None)
    if username:
        return hashlib.sha256(username.encode()).hexdigest()[:12]
    headers = request.headers
    ip = headers.get("x-forwarded-for", "").split(",")[0].strip()
    ua = headers.get("user-agent", "")
    return hashlib.sha256(f"{ip}|{ua}".encode()).hexdigest()[:12]
```
- Logged-in HF users: hash of username
- Anonymous users: hash of IP + User-Agent
- Always truncated to 12 hex characters
### Fallback
If the scheduler fails to initialise (no HF token, network issues), rows are written to a local JSONL file at `usage_logs/recitations_fallback.jsonl` (without audio).
### Integration Point
Logging is called from the audio processing handler (`ui/handlers/audio_processing.py`) after each analysis completes:
```python
log_analysis(
    user_id, ref, text, segments,
    multi_model=bool(use_multi),
    settings=_settings,
    audio=audio_for_log,      # tuple of (sample_rate, np.ndarray) or None
    vad_timestamps=vad_ts,    # list of [start, end] pairs
)
```
Errors are logged separately:
```python
log_error(user_id, ref, "Audio loading failed")
```
### Dependencies
- `huggingface_hub` – `CommitScheduler` base class and Hub API
- `pyarrow` – Parquet table creation and schema metadata
- `soundfile` – FLAC audio encoding
---
## Part 2 – `quran_aligner` Logging Schema
Schema for logging alignment runs from this project. One row per audio upload. The row is mutated in-place while it sits in the `ParquetScheduler` buffer (before the next push-to-Hub tick). Run-level fields (profiling, reciter stats, quality stats, settings) are **overwritten** to reflect the latest run. Segment results are **appended** so every setting combination is preserved.
### Run-level fields
#### Identity
| Field | HF Type | Description |
|-------|---------|-------------|
| `audio` | `Audio` | FLAC-encoded audio (16kHz mono) |
| `audio_id` | `Value(string)` | `{sha256(audio_bytes)[:16]}:{timestamp}`, e.g. `a3f7b2c91e04d8f2:20260203T141532` |
| `timestamp` | `Value(string)` | ISO 8601 datetime truncated to seconds, e.g. `2026-02-03T01:50:45` |
| `user_id` | `Value(string)` | SHA-256 hash (12-char) of IP+UA |
The `audio_id` hash prefix enables grouping/deduplication of the same recording across runs; the timestamp suffix makes each run unique. Cost is ~90ms for a 5-minute recording.
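For illustration, the documented format can be reproduced with stdlib `hashlib` (`make_audio_id` is a hypothetical helper name, not the actual source):

```python
import hashlib
from datetime import datetime

def make_audio_id(audio_bytes: bytes, now: datetime) -> str:
    # First 16 hex chars of the content hash, then a compact
    # second-resolution timestamp, joined by ":".
    digest = hashlib.sha256(audio_bytes).hexdigest()[:16]
    return f"{digest}:{now.strftime('%Y%m%dT%H%M%S')}"
```

Two uploads of the same recording share the hash prefix but get distinct timestamp suffixes, which is what makes grouping and deduplication possible.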
#### Input metadata
| Field | HF Type | Description |
|-------|---------|-------------|
| `audio_duration_s` | `Value(float64)` | Total audio duration in seconds |
| `num_segments` | `Value(int32)` | Number of VAD segments |
| `surah` | `Value(int32)` | Detected surah (1-114) |
#### Segmentation settings
| Field | HF Type | Description |
|-------|---------|-------------|
| `min_silence_ms` | `Value(int32)` | Minimum silence duration to split |
| `min_speech_ms` | `Value(int32)` | Minimum speech duration for a valid segment |
| `pad_ms` | `Value(int32)` | Padding around speech segments |
| `asr_model` | `Value(string)` | `"Base"` (`hetchyy/r15_95m`) or `"Large"` (`hetchyy/r7`) |
| `device` | `Value(string)` | `"GPU"` or `"CPU"` |
#### Profiling (seconds)
| Field | HF Type | Description |
|-------|---------|-------------|
| `total_time` | `Value(float64)` | End-to-end pipeline wall time |
| `vad_queue_time` | `Value(float64)` | VAD queue wait time |
| `vad_gpu_time` | `Value(float64)` | VAD actual GPU execution |
| `asr_gpu_time` | `Value(float64)` | ASR actual GPU execution |
| `dp_total_time` | `Value(float64)` | Total DP alignment across all segments |
#### Quality & retry stats
| Field | HF Type | Description |
|-------|---------|-------------|
| `segments_passed` | `Value(int32)` | Segments with confidence > 0 |
| `segments_failed` | `Value(int32)` | Segments with confidence <= 0 |
| `mean_confidence` | `Value(float64)` | Average confidence across all segments |
| `tier1_retries` | `Value(int32)` | Expanded-window retry attempts |
| `tier1_passed` | `Value(int32)` | Successful tier 1 retries |
| `tier2_retries` | `Value(int32)` | Relaxed-threshold retry attempts |
| `tier2_passed` | `Value(int32)` | Successful tier 2 retries |
| `reanchors` | `Value(int32)` | Re-anchor events (after consecutive failures) |
| `special_merges` | `Value(int32)` | Basmala-fused segments |
#### Reciter stats
Computed from matched segments (those with `word_count > 0`). Already calculated in `app.py:877-922` for console output.
| Field | HF Type | Description |
|-------|---------|-------------|
| `words_per_minute` | `Value(float64)` | `total_words / (total_speech_s / 60)` |
| `phonemes_per_second` | `Value(float64)` | `total_phonemes / total_speech_s` |
| `avg_segment_duration` | `Value(float64)` | Mean duration of matched segments |
| `std_segment_duration` | `Value(float64)` | Std dev of matched segment durations |
| `avg_pause_duration` | `Value(float64)` | Mean inter-segment silence gap |
| `std_pause_duration` | `Value(float64)` | Std dev of pause durations |
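These stats can be reproduced from the per-segment fields. A sketch assuming each segment dict carries `start`, `end`, `word_count`, and `phoneme_count`, and that at least one segment matched (the actual computation lives in `app.py:877-922`):

```python
from statistics import mean, pstdev

def reciter_stats(segments: list) -> dict:
    """Sketch of the reciter stats; hypothetical helper name."""
    # Matched segments are those with at least one aligned word.
    matched = [s for s in segments if s["word_count"] > 0]
    durations = [s["end"] - s["start"] for s in matched]
    total_speech_s = sum(durations)
    total_words = sum(s["word_count"] for s in matched)
    total_phonemes = sum(s["phoneme_count"] for s in matched)
    # Pauses are the silences between consecutive matched segments.
    pauses = [b["start"] - a["end"] for a, b in zip(matched, matched[1:])]
    return {
        "words_per_minute": total_words / (total_speech_s / 60),
        "phonemes_per_second": total_phonemes / total_speech_s,
        "avg_segment_duration": mean(durations),
        "std_segment_duration": pstdev(durations),
        "avg_pause_duration": mean(pauses) if pauses else 0.0,
        "std_pause_duration": pstdev(pauses) if pauses else 0.0,
    }
```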
#### Session flags
| Field | HF Type | Description |
|-------|---------|-------------|
| `resegmented` | `Value(bool)` | User resegmented with different VAD settings |
| `retranscribed` | `Value(bool)` | User retranscribed with a different ASR model |
#### Segments, timestamps & error
| Field | HF Type | Description |
|-------|---------|-------------|
| `segments` | `Value(string)` | JSON array of run objects (see below) – **appended** on resegment/retranscribe |
| `word_timestamps` | `Value(string)` | JSON array of per-segment MFA word timings (see below), null until computed |
| `error` | `Value(string)` | Top-level error message if the pipeline failed |
### Segment runs (inside `segments` JSON)
Each run with different settings appends a new run object. The array preserves the full history so every setting combination is available.
```json
[
  {
    "min_silence_ms": 200,
    "min_speech_ms": 1000,
    "pad_ms": 100,
    "asr_model": "Base",
    "segments": [
      {
        "idx": 1,
        "start": 0.512,
        "end": 3.841,
        "duration": 3.329,
        "ref": "2:255:1-2:255:5",
        "confidence": 0.87,
        "word_count": 5,
        "ayah_span": 1,
        "phoneme_count": 42,
        "undersegmented": false,
        "missing_words": false,
        "special_type": null,
        "error": null
      }
    ]
  },
  {
    "min_silence_ms": 600,
    "min_speech_ms": 1500,
    "pad_ms": 300,
    "asr_model": "Base",
    "segments": [...]
  }
]
```
#### Run object
| Field | Type | Description |
|-------|------|-------------|
| `min_silence_ms` | int | Silence setting used for this run |
| `min_speech_ms` | int | Speech setting used for this run |
| `pad_ms` | int | Pad setting used for this run |
| `asr_model` | string | `"Base"` or `"Large"` |
| `segments` | array | Per-segment objects for this run |
#### Per-segment object
| Field | Type | Description |
|-------|------|-------------|
| `idx` | int | 1-indexed segment number |
| `start` | float | Segment start time in seconds |
| `end` | float | Segment end time in seconds |
| `duration` | float | `end - start` |
| `ref` | string | Matched reference `"S:A:W1-S:A:W2"`, empty if failed |
| `confidence` | float | Alignment confidence [0.0, 1.0] |
| `word_count` | int | Number of words matched |
| `ayah_span` | int | Number of ayahs spanned |
| `phoneme_count` | int | Length of ASR phoneme sequence |
| `undersegmented` | bool | Flagged if `word_count >= 20`, or if `ayah_span >= 2` and `duration >= 15` s |
| `missing_words` | bool | Gaps detected in word alignment |
| `special_type` | string\|null | `"Basmala"`, `"Isti'adha"`, `"Isti'adha+Basmala"`, or null |
| `error` | string\|null | Per-segment error message |
### Word timestamps (inside `word_timestamps` JSON)
Populated when the user computes MFA timestamps. Array of per-segment word timing arrays:
```json
[
  {
    "segment_idx": 1,
    "ref": "2:255:1-2:255:5",
    "words": [
      {"word": "ٱللَّهُ", "start": 0.512, "end": 0.841},
      {"word": "لَآ", "start": 0.870, "end": 1.023}
    ]
  }
]
```
### In-place mutation
The row dict is appended to the `ParquetScheduler` buffer on the initial run, and a reference to it is stored in `gr.State`. Subsequent actions (resegment, retranscribe, compute timestamps) mutate the dict in-place before the next push-to-Hub tick (every 1 minute).
- **Overwritten on each run:** profiling, quality/retry stats, reciter stats, run-level settings (`min_silence_ms`, `asr_model`, etc.), `num_segments`, `surah`.
- **Appended on each run:** `segments` JSON array gains a new run object with its settings and per-segment results.
- **Set once:** `word_timestamps` is populated when the user computes MFA timestamps (null until then).
- **If the push already fired** before a subsequent action, the mutation is a no-op on the already-uploaded row. The new results are lost for that row – acceptable, since the initial run is always captured.
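The overwrite/append rules above can be sketched as follows. `record_rerun` is a hypothetical helper; the app mutates the `gr.State` row directly rather than through a function like this:

```python
import json

def record_rerun(row: dict, run_settings: dict, run_segments: list,
                 run_stats: dict) -> None:
    """Sketch of the in-place mutation pattern (hypothetical helper)."""
    # Overwritten: the latest run-level settings and stats win.
    row.update(run_settings)
    row.update(run_stats)
    row["num_segments"] = len(run_segments)
    # Appended: the segments JSON keeps one run object per settings combo.
    history = json.loads(row["segments"]) if row.get("segments") else []
    history.append({**run_settings, "segments": run_segments})
    row["segments"] = json.dumps(history, ensure_ascii=False)
```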
### Design rationale
- **Settings are denormalised** into each row so config changes can be correlated with quality without joins.
- **Profiling fields are flat columns**, not nested JSON, so they are directly queryable in the HF dataset viewer and pandas.
- **Segments are an array of run objects** – each run includes its settings alongside the per-segment results, so different setting combinations are preserved even though run-level fields reflect the latest state.
- **`mean_confidence` is pre-computed** at the run level for easy filtering and sorting without parsing the segments array.
- **Audio is always uploaded** as the first column so every run is reproducible and the dataset is playable in the HF viewer.
- **`audio_id`** combines a content hash with a timestamp – the hash prefix groups re-runs of the same recording, the suffix makes each row unique.
- **All fields come from existing objects** – `ProfilingData` (segment_processor.py), `SegmentInfo` (segment_processor.py), and `config.py` values. No new computation is required beyond assembling the row.