| --- |
| title: NoteGuard — NHS De-Identification Gate |
| emoji: 🛡️ |
| sdk: docker |
| app_port: 8501 |
| pinned: false |
| --- |
| |
| # 🛡️ NoteGuard |
|
|
| **Automatic PII sanitisation for NHS clinical notes — clean data in, no identifiers out.** |
|
|
| NoteGuard discovers, inspects, redacts and de-identifies PII in free-text NHS clinical notes **before** |
| the data is used to train any model. It runs **locally at each institution** ("sanitise at source"), so |
| every Trust cleans its own data inside its own governance boundary before anything is shared or used in |
| collaborative / federated training. |
|
|
| > Federated learning lets institutions train without moving data. NoteGuard is the **privacy-preserving |
| > on-ramp** that makes the data safe to train on in the first place — the missing layer in front of an |
| > NHS Secure Data Environment / the Federated Data Platform. |
|
|
| Entry for the Encode Vibe Coding Hackathon, *FLock Sovereign AI Challenge* (hosted by Encode Hub). Built on **Microsoft Presidio** + **spaCy** and
| evaluated on [NHSEDataScience/synthetic_clinical_notes](https://huggingface.co/datasets/NHSEDataScience/synthetic_clinical_notes).
|
|
| ## What makes this more than "just Presidio" |
|
|
| Presidio is the detection **engine** — we don't reinvent it. NoteGuard is the **clinical assurance |
| layer** Presidio leaves to you: |
|
|
| 1. **Measured residual leakage.** Because the dataset keeps PII in structured tables, we join them back |
| to each note for free ground truth and report a real **re-identification risk** number — not a vibe. |
| 2. **Domain adaptation to messy clinical text.** NHS-aware recognisers: checksum-validated NHS numbers
| **plus** context-anchored detection for the dataset's 9-digit synthetic numbers, which Presidio's
| `UK_NHS` recogniser misses, plus GMC/NMC clinician IDs, ODS org codes and record UUIDs.
| 3. **Patient-consistent de-identification.** Same patient → same surrogate across their whole
| admission journey. Among dates, only date of birth is treated as PII (shifted by a consistent
| per-patient offset); visit and admission dates are clinically useful and left intact. Surrogates are
| realistic en_GB fakes (or `[label]` redaction).
| 4. **Pluggable + degrades gracefully.** One `Detector` interface (Rule / Presidio); the pure-Python |
| rule layer + eval run even if spaCy/Presidio are unavailable. |
| 5. **Governance wrapper.** Per-note audit of what was removed + the dataset-level leakage report, |
| mapped to the NHS **Five Safes**. |
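
The NHS-number recogniser in item 2 can be sketched as follows. This is a minimal illustration, not the code in `recognisers.py`: the function names and the `CONTEXT_NHS` pattern are hypothetical, and the exact label wording the project anchors on is an assumption. The checksum itself is the standard modulus-11 check for 10-digit NHS numbers.

```python
import re


def is_valid_nhs_number(candidate: str) -> bool:
    """Modulus-11 check for a 10-digit NHS number (spaces/hyphens allowed)."""
    digits = re.sub(r"[\s-]", "", candidate)
    if not re.fullmatch(r"\d{10}", digits):
        return False
    # Weights 10..2 apply to the first nine digits.
    total = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    # A computed check digit of 10 means the number is invalid.
    return check != 10 and check == int(digits[9])


# Hypothetical pattern: a 9- or 10-digit number following an "NHS ..." label.
CONTEXT_NHS = re.compile(
    r"\bNHS(?:\s+(?:number|no\.?))?\s*[:#]?\s*(\d{3}[\s-]?\d{3}[\s-]?\d{3,4})\b",
    re.IGNORECASE,
)


def find_nhs_numbers(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, reason) spans for likely NHS numbers."""
    spans: dict[tuple[int, int], str] = {}
    # Pass 1: any 10-digit number that passes the checksum, anywhere in the text.
    for m in re.finditer(r"\b\d{3}[\s-]?\d{3}[\s-]?\d{4}\b", text):
        if is_valid_nhs_number(m.group()):
            spans[(m.start(), m.end())] = "checksum"
    # Pass 2: numbers that follow an NHS label, even if they fail the checksum.
    for m in CONTEXT_NHS.finditer(text):
        spans.setdefault((m.start(1), m.end(1)), "context")
    return sorted((s, e, reason) for (s, e), reason in spans.items())
```

Running both passes means a 9-digit synthetic number is still caught when it sits next to a label, while a bare 10-digit number is only flagged if its check digit is correct, which keeps false positives on other long numbers down.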
| |
| ## Results — residual leakage drops as we layer detection |
| |
| *Residual leakage is the share of known identifiers (joined from the structured tables) still present after
| sanitisation. Measured on all **1,602 notes** (1,027 known-PII occurrences). Reproduce with
| `python tests/run_eval.py --compare`.*
|
|
| | Detector | NHS number F1 | PERSON recall | **Residual leakage** | |
| |---|---|---|---| |
| | rules only | 0.98 | 0.00 | **74.8 %** | |
| | **presidio + rules** (shipping) | **0.99** | **0.68** | **8.5 %** | |
|
|
| The rules→engine drop is the headline: it shows, with numbers, exactly what the NER engine buys you. |
|
|
| > Precision is reported against *structured* PII only, so it is a conservative lower bound — correctly |
| > removing a clinician's name (not in the tables) counts here as a false positive. **Recall and leakage |
| > are the sound, headline metrics.** |
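
As a rough sketch of how a residual-leakage figure like the one above could be computed: count every known identifier occurrence and check whether it still appears in the sanitised text. The note schema (`sanitised`, `known_pii`) and the case-insensitive substring match are assumptions for illustration; the matching in `evaluate.py` may differ.

```python
def residual_leakage(notes: list[dict]) -> float:
    """Fraction of known identifier occurrences still present after sanitisation.

    Each note dict (hypothetical schema) has:
      "sanitised": the de-identified text
      "known_pii": identifier strings, joined from the structured tables,
                   that occurred in the raw note
    """
    total = 0
    leaked = 0
    for note in notes:
        haystack = note["sanitised"].casefold()
        for identifier in note["known_pii"]:
            total += 1
            # An identifier "leaks" if it survives verbatim (ignoring case).
            if identifier.casefold() in haystack:
                leaked += 1
    return leaked / total if total else 0.0
```

Because the ground truth comes from the structured tables only, this measures recall-side failures; it says nothing about identifiers that never appear in those tables.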
|
|
| ## Architecture |
|
|
| ``` |
| ┌──────────────────── inside Trust A ─────────────────────┐ |
| raw notes ──► │ fix mojibake ─► detect (Presidio NER + rules) │ ──► de-identified |
| (PHI) │ ─► transform (redact | pseudonymise) │ text + audit log |
| │ patient-consistent + date-shift, vault│ (no PHI leaves) |
| └─────────────────────────────────────────────────────────┘ |
| same gate runs inside Trust B ──► ┌────────────────────────────┐ |
| │ shared de-identified pool │ ──► federated AI training |
| └────────────────────────────┘ |
| ``` |
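
The "patient-consistent + date-shift" step in the diagram needs an offset that is stable for a given patient. The README describes a Faker-backed vault for this; the sketch below swaps in a different technique, a keyed hash (HMAC) of the patient ID, purely to illustrate the consistency property. The function names, the `secret` parameter and the 365-day window are all hypothetical.

```python
import hashlib
import hmac
from datetime import date, timedelta


def patient_offset_days(patient_id: str, secret: bytes, max_days: int = 365) -> int:
    """Derive a stable, non-zero day offset in [-max_days, max_days] for one patient."""
    digest = hmac.new(secret, patient_id.encode("utf-8"), hashlib.sha256).digest()
    n = int.from_bytes(digest[:8], "big")
    offset = n % (2 * max_days) - max_days  # in [-max_days, max_days - 1]
    # Skip zero so the shifted date never equals the real one.
    return offset if offset < 0 else offset + 1


def shift_dob(dob: date, patient_id: str, secret: bytes) -> date:
    """Shift a date of birth by the patient's own offset."""
    return dob + timedelta(days=patient_offset_days(patient_id, secret))
```

With this approach the same patient ID always yields the same shift across notes, and the secret would need to stay inside the Trust, since anyone holding it can recompute the offset.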
|
|
| **Project layout** (Gold-RAP "analysis as a product"): |
|
|
| ``` |
| src/ the package |
| data.py load CSVs + ground-truth join (EVAL-ONLY oracle) |
| recognisers.py pure-Python rules: NHS checksum/context, postcode, date, phone, email, GMC/NMC/ODS, UUID |
| detect.py RuleDetector / PresidioDetector behind one Detector interface |
| transform.py redaction | patient-consistent pseudonymisation + DOB date-shift (Faker vault) |
| pipeline.py single-note detect -> sanitise -> audit |
| evaluate.py detection P/R/F1 + residual-leakage metric |
| trust_demo.py two-Trust sanitise-at-source demo |
| tests/ unit tests + run_eval.py (the evaluation CLI) |
| docs/ tool_card.md · CHANGELOG.md |
| data/ input CSVs (gitignored) |
| outputs/ generated artifacts: results.json, manifests (gitignored) |
| streamlit_app.py demo UI + Hugging Face Space entry point |
| Dockerfile HF Spaces (Docker) deploy
| pyproject.toml packaging + lint/test config
| ``` |
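
To illustrate the "one `Detector` interface" idea behind `detect.py`, here is a minimal sketch under stated assumptions. The `Span` type, the single email rule and `available_detectors` are hypothetical and much simpler than the real package; only `AnalyzerEngine` and its `analyze(text=..., language=...)` call come from Presidio itself.

```python
import re
from dataclasses import dataclass
from importlib.util import find_spec
from typing import Protocol


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str


class Detector(Protocol):
    def detect(self, text: str) -> list[Span]: ...


class RuleDetector:
    """Pure-Python rules; works with no third-party dependencies."""

    _EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

    def detect(self, text: str) -> list[Span]:
        return [Span(m.start(), m.end(), "EMAIL") for m in self._EMAIL.finditer(text)]


class PresidioDetector:
    """Wraps Presidio; imported lazily so this module loads without it."""

    def __init__(self) -> None:
        from presidio_analyzer import AnalyzerEngine  # needs presidio + a spaCy model

        self._engine = AnalyzerEngine()

    def detect(self, text: str) -> list[Span]:
        results = self._engine.analyze(text=text, language="en")
        return [Span(r.start, r.end, r.entity_type) for r in results]


def available_detectors() -> list[str]:
    """List the detectors that can be constructed in this environment."""
    names = ["rules"]
    if find_spec("presidio_analyzer") and find_spec("spacy"):
        names.append("presidio")
    return names
```

Keeping the Presidio import inside `__init__` is what lets the rule layer and the evaluation run on a machine where spaCy or Presidio are not installed.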
|
|
| ## Trust & governance — mapped to the NHS Five Safes |
| - **Safe data** — PII removed to DAPB1523/ICO standard across patient + staff + org identifiers. |
| - **Safe settings** — runs inside the Trust; raw CSVs and the vault are gitignored, never leave. |
| - **Safe outputs** — only de-identified text + content-free audit logs; the measured leakage gates them. |
| - **Safe people / projects** — the re-identification vault stays Trust-local; pseudonymised data is |
| still personal data under UK GDPR — stated honestly, no over-claim. |
|
|
| ## Run it |
|
|
| ```bash |
| # 1) create AND activate the virtual environment |
| python -m venv .venv |
| .\.venv\Scripts\Activate.ps1 # Windows PowerShell |
| # source .venv/Scripts/activate # ... or Windows Git Bash |
| # source .venv/bin/activate # ... or macOS / Linux |
| |
| # 2) install the package (editable) + the spaCy model |
| pip install -e ".[app,dev]" |
| python -m spacy download en_core_web_lg # or en_core_web_sm for a faster, lighter run |
| |
| # 3) run |
| python tests/run_eval.py --compare --limit 300 # quick check on a 300-note subset; drop --limit to reproduce the full table -> outputs/results.json
| python -m src.trust_demo # two NHS Trusts share only de-identified data -> outputs/ |
| streamlit run streamlit_app.py # demo: Try-it · Metrics · Governance · Two-Trust |
| pytest -q # unit tests |
| ``` |
|
|
| The dataset is pulled automatically on first run. To run fully offline, drop the three CSVs in a |
| folder and set `NOTEGUARD_DATA_DIR=/path/to/csvs`. |
|
|
| ## Deploy the live demo (Hugging Face Spaces) |
|
|
| ```bash |
| pip install -U huggingface_hub # provides the `hf` CLI |
| hf auth login # paste a WRITE token from https://huggingface.co/settings/tokens |
| hf repos create <user>/noteguard --repo-type space --space-sdk docker |
| git remote add space https://huggingface.co/spaces/<user>/noteguard |
| git push space HEAD:main # builds the image and serves streamlit_app.py |
| ``` |
|
|
| ## Data notes (found by inspecting the data, not assuming) |
| - NHS numbers in this synthetic set are **9 digits** (real ones are 10 digits, the last a mod-11 check digit). We catch both:
| checksum-validated 10-digit numbers anywhere, **and** context-anchored numbers after an "NHS …" label.
| - Some fields are double-encoded (e.g. `·` appearing as `Â·`); `_fix_mojibake` repairs them so they don't pollute ground truth.
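
For the double-encoding note above, one common repair is to reverse the mistaken decode: re-encode the text as cp1252 and decode the bytes as UTF-8. The sketch below is an illustration of that idea, not the project's `_fix_mojibake`; the name `repair_mojibake` and the two-round limit are assumptions.

```python
def repair_mojibake(text: str, max_rounds: int = 2) -> str:
    """Undo UTF-8 text that was mis-decoded as cp1252 (possibly more than once)."""
    for _ in range(max_rounds):
        try:
            repaired = text.encode("cp1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Not representable as cp1252, or the bytes are not valid UTF-8:
            # the text is already clean (or damaged in some other way).
            break
        if repaired == text:
            break
        text = repaired
    return text
```

Correctly encoded accented text fails the round trip and is returned unchanged, so the function is safe to run over every field. It does not cover every case (some bytes have no cp1252 mapping); a dedicated library such as `ftfy` handles more.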
|
|
| Built with Claude Code (`CLAUDE.md`, `.claude/`). |
|
|