---
title: NoteGuard - NHS De-Identification Gate
emoji: 🛡️
sdk: docker
app_port: 8501
pinned: false
---

# 🛡️ NoteGuard

**Automatic PII sanitisation for NHS clinical notes: clean data in, no identifiers out.**

NoteGuard discovers, inspects, redacts and de-identifies PII in free-text NHS clinical notes **before** the
data is used to train any model. It runs **locally at each institution** ("sanitise at source"), so every
Trust cleans its own data inside its own governance boundary before anything is shared or used in
collaborative / federated training.

> Federated learning lets institutions train without moving data. NoteGuard is the **privacy-preserving
> on-ramp** that makes the data safe to train on in the first place: the missing layer in front of an
> NHS Secure Data Environment / the Federated Data Platform.

Encode Vibe Coding Hackathon: *FLock Sovereign AI Challenge* (hosted by Encode Hub).
Built on **Microsoft Presidio** + **spaCy**, evaluated on
[NHSEDataScience/synthetic_clinical_notes](https://huggingface.co/datasets/NHSEDataScience/synthetic_clinical_notes).

## What makes this more than "just Presidio"

Presidio is the detection **engine**; we don't reinvent it. NoteGuard is the **clinical assurance layer**
Presidio leaves to you:

1. **Measured residual leakage.** Because the dataset keeps PII in structured tables, we join them back
   to each note for free ground truth and report a real **re-identification risk** number, not a vibe.
2. **Domain adaptation to messy clinical text.** NHS-aware recognisers: checksum-validated NHS numbers
   **plus** context-anchored detection for the dataset's 9-digit synthetic numbers Presidio's `UK_NHS`
   misses, plus GMC/NMC clinician IDs, ODS org codes and record UUIDs.
3. **Patient-consistent de-identification.** Same patient → same surrogate across their whole admission journey.
   Among dates, only date-of-birth is treated as PII (shifted by a consistent per-patient offset);
   visit / admission dates are clinically useful and left intact. Surrogates are realistic en_GB fakes
   (or `[label]` redaction).
4. **Pluggable + degrades gracefully.** One `Detector` interface (Rule / Presidio); the pure-Python rule
   layer + eval run even if spaCy/Presidio are unavailable.
5. **Governance wrapper.** Per-note audit of what was removed + the dataset-level leakage report, mapped
   to the NHS **Five Safes**.

## Results: residual leakage drops as we layer detection

*Known identifiers (joined from the structured tables) still present after sanitisation. Measured on all
**1,602 notes** (1,027 known-PII occurrences). Reproduce with `python tests/run_eval.py --compare`.*

| Detector | NHS number F1 | PERSON recall | **Residual leakage** |
|---|---|---|---|
| rules only | 0.98 | 0.00 | **74.8 %** |
| **presidio + rules** (shipping) | **0.99** | **0.68** | **8.5 %** |

The rules→engine drop is the headline: it shows, with numbers, exactly what the NER engine buys you.

> Precision is reported against *structured* PII only, so it is a conservative lower bound: correctly
> removing a clinician's name (not in the tables) counts here as a false positive.
> **Recall and leakage are the sound, headline metrics.**

## Architecture

```
                ┌──────────────────── inside Trust A ─────────────────────┐
 raw notes ──►  │  fix mojibake ─► detect (Presidio NER + rules)          │ ──► de-identified
   (PHI)        │               ─► transform (redact | pseudonymise)      │     text + audit log
                │                  patient-consistent + date-shift, vault │     (no PHI leaves)
                └─────────────────────────────────────────────────────────┘
        same gate runs inside Trust B ──►  ┌────────────────────────────┐
                                           │  shared de-identified pool │ ──► federated AI training
                                           └────────────────────────────┘
```

**Project layout** (Gold-RAP "analysis as a product"):

```
src/                 the package
  data.py            load CSVs + ground-truth join (EVAL-ONLY oracle)
  recognisers.py     pure-Python rules: NHS checksum/context, postcode, date, phone, email, GMC/NMC/ODS, UUID
  detect.py          RuleDetector / PresidioDetector behind one Detector interface
  transform.py       redaction | patient-consistent pseudonymisation + DOB date-shift (Faker vault)
  pipeline.py        single-note detect -> sanitise -> audit
  evaluate.py        detection P/R/F1 + residual-leakage metric
  trust_demo.py      two-Trust sanitise-at-source demo
tests/               unit tests + run_eval.py (the evaluation CLI)
docs/                tool_card.md · CHANGELOG.md
data/                input CSVs (gitignored)
outputs/             generated artifacts: results.json, manifests (gitignored)
streamlit_app.py     demo UI + Hugging Face Space entry point
Dockerfile           HF Spaces (Docker) deploy
pyproject.toml       packaging + lint/test config
```

## Trust & governance: mapped to the NHS Five Safes

- **Safe data**: PII removed to DAPB1523/ICO standard across patient + staff + org identifiers.
- **Safe settings**: runs inside the Trust; raw CSVs and the vault are gitignored, never leave.
- **Safe outputs**: only de-identified text + content-free audit logs; the measured leakage gates them.
- **Safe people / projects**: the re-identification vault stays Trust-local; pseudonymised data is still
  personal data under UK GDPR. Stated honestly, no over-claim.

## Run it

```bash
# 1) create AND activate the virtual environment
python -m venv .venv
.\.venv\Scripts\Activate.ps1          # Windows PowerShell
# source .venv/Scripts/activate       # ... or Windows Git Bash
# source .venv/bin/activate           # ... or macOS / Linux

# 2) install the package (editable) + the spaCy model
pip install -e ".[app,dev]"
python -m spacy download en_core_web_lg   # or en_core_web_sm for a faster, lighter run

# 3) run
python tests/run_eval.py --compare --limit 300   # reproduce the table -> outputs/results.json
python -m src.trust_demo                         # two NHS Trusts share only de-identified data -> outputs/
streamlit run streamlit_app.py                   # demo: Try-it · Metrics · Governance · Two-Trust
pytest -q                                        # unit tests
```

The dataset is pulled automatically on first run. To run fully offline, drop the three CSVs in a folder and
set `NOTEGUARD_DATA_DIR=/path/to/csvs`.

## Deploy the live demo (Hugging Face Spaces)

```bash
pip install -U huggingface_hub        # provides the `hf` CLI
hf auth login                         # paste a WRITE token from https://huggingface.co/settings/tokens
hf repos create /noteguard --repo-type space --space-sdk docker
git remote add space https://huggingface.co/spaces//noteguard
git push space HEAD:main              # builds the image and serves streamlit_app.py
```

## Data notes (found by inspecting the data, not assuming)

- NHS numbers in this synthetic set are **9 digits** (real ones are 10 digits, the last being a mod-11
  check digit). We catch both: checksum-validated 10-digit numbers anywhere, **and** context-anchored
  numbers after an "NHS …" label.
- Some fields are double-encoded (`·`); `_fix_mojibake` repairs them so they don't pollute ground truth.

Built with Claude Code (`CLAUDE.md`, `.claude/`).
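
### Sketch: the two NHS-number rules from the data notes

The two detection paths in the data notes (checksum-validated 10-digit numbers, and 9-digit numbers anchored to an "NHS …" label) can be illustrated roughly as follows. This is a minimal sketch, not the code in `recognisers.py`: the names `is_valid_nhs_number`, `find_context_anchored` and `CONTEXT_ANCHORED` are hypothetical, and the label pattern (up to 20 non-digit characters between "NHS" and the number) is an assumption chosen for illustration. The checksum itself follows the standard modulus-11 scheme for NHS numbers.

```python
import re


def is_valid_nhs_number(candidate: str) -> bool:
    """Return True if `candidate` is 10 digits and passes the mod-11 check."""
    digits = re.sub(r"[\s-]", "", candidate)  # tolerate "943 476 5919" style spacing
    if not re.fullmatch(r"\d{10}", digits):
        return False
    # Weights 10..2 are applied to the first nine digits.
    total = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    # A computed check value of 10 means the number is never valid.
    return check != 10 and check == int(digits[9])


# Hypothetical label pattern (an assumption, not the project's regex):
# a 9- or 10-digit run appearing shortly after the token "NHS".
CONTEXT_ANCHORED = re.compile(
    r"\bNHS\b[^\d\n]{0,20}?(\d{3}[\s-]?\d{3}[\s-]?\d{3,4})\b",
    re.IGNORECASE,
)


def find_context_anchored(text: str) -> list[str]:
    """Return digit strings that follow an 'NHS ...' label, separators removed."""
    return [re.sub(r"[\s-]", "", m.group(1)) for m in CONTEXT_ANCHORED.finditer(text)]
```

The checksum path can run anywhere in a note because a valid check digit is strong evidence on its own. The 9-digit synthetic numbers carry no check digit, so in this sketch they are only accepted when a nearby label supplies the context.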