""" Strip PDF/HTML extraction artefacts from corpus markdown. Three classes of noise are addressed: 1. **Hard rules** — always stripped: - Control characters (form-feeds, NULs, etc. that escape from PDF parsing) - Standalone page numbers (`^\\s*\\d+\\s*$`) - Page header/footer patterns (`X | 37`, `X | P a g e`) - Known UI chrome (Material Design icon labels, TOC nav arrows, video transcript markers) 2. **Auto-detected boilerplate** — stripped with a high-confidence threshold: - Any line of 12–100 chars that appears more than 10 times in a single document - Skips lines that look like markdown headings, metadata key:value lines, or list markers - Catches "Health Information Privacy Code 2020" repeating 27× as a page header 3. **Cosmetic cleanup** — collapse runs of 3+ blank lines to a paragraph break. Returns (cleaned_text, stats_dict) so the caller can log what was removed. Used by all `build_*_compilation.py` scripts via `from clean_artifacts import clean_corpus_artifacts`. """ from __future__ import annotations import re from collections import Counter # ---- Hard-coded chrome patterns ----------------------------------- # Each pattern is matched as a *whole-line* fullmatch (case-insensitive) # against the stripped line content. If any pattern matches, the line is dropped. _CHROME_PATTERNS = [ # Standalone page number r"\d+", # Material Design icon labels (text content of