---
title: NoteGuard NHS De-Identification Gate
emoji: 🛡️
sdk: docker
app_port: 8501
pinned: false
---
# 🛡️ NoteGuard
**Automatic PII sanitisation for NHS clinical notes — clean data in, no identifiers out.**
NoteGuard discovers, inspects, redacts and de-identifies PII in free-text NHS clinical notes **before**
the data is used to train any model. It runs **locally at each institution** ("sanitise at source"), so
every Trust cleans its own data inside its own governance boundary before anything is shared or used in
collaborative / federated training.
> Federated learning lets institutions train without moving data. NoteGuard is the **privacy-preserving
> on-ramp** that makes the data safe to train on in the first place — the missing layer in front of an
> NHS Secure Data Environment / the Federated Data Platform.
Encode Vibe Coding Hackathon — *FLock Sovereign AI Challenge* (hosted by Encode Hub). Built on **Microsoft Presidio** + **spaCy**,
evaluated on [NHSEDataScience/synthetic_clinical_notes](https://huggingface.co/datasets/NHSEDataScience/synthetic_clinical_notes).
## What makes this more than "just Presidio"
Presidio is the detection **engine** — we don't reinvent it. NoteGuard is the **clinical assurance
layer** Presidio leaves to you:
1. **Measured residual leakage.** Because the dataset keeps PII in structured tables, we join them back
to each note for free ground truth and report a real **re-identification risk** number — not a vibe.
2. **Domain adaptation to messy clinical text.** NHS-aware recognisers: checksum-validated NHS numbers
**plus** context-anchored detection for the dataset's 9-digit synthetic numbers Presidio's `UK_NHS`
misses, plus GMC/NMC clinician IDs, ODS org codes and record UUIDs.
3. **Patient-consistent de-identification.** Same patient → same surrogate across their whole
admission journey. Among dates, only date of birth is treated as PII (shifted by a consistent
per-patient offset); visit / admission dates are clinically useful and left intact. Surrogates are
realistic en_GB fakes (or `[label]` redaction).
4. **Pluggable + degrades gracefully.** One `Detector` interface (Rule / Presidio); the pure-Python
rule layer + eval run even if spaCy/Presidio are unavailable.
5. **Governance wrapper.** Per-note audit of what was removed + the dataset-level leakage report,
mapped to the NHS **Five Safes**.
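
As an illustration of point 3, here is a minimal sketch of patient-consistent surrogates and a per-patient DOB shift. It is not the code in `transform.py`: the names `SECRET_KEY`, `patient_offset_days`, `shift_dob` and `SurrogateVault` are hypothetical, the offset is derived with an HMAC as one possible way to keep it stable, and numbered placeholders stand in for the Faker-generated en_GB values so the example has no dependencies.

```python
import hashlib
import hmac
from datetime import date, timedelta

# Hypothetical Trust-local secret; assumed to stay inside the Trust with the vault.
SECRET_KEY = b"trust-local-secret"


def patient_offset_days(patient_id: str, max_days: int = 365) -> int:
    """Derive a stable, non-zero day offset in [-max_days, max_days] for one patient."""
    digest = hmac.new(SECRET_KEY, patient_id.encode("utf-8"), hashlib.sha256).digest()
    value = int.from_bytes(digest[:8], "big")
    offset = value % (2 * max_days) - max_days  # lands in [-max_days, max_days - 1]
    return offset if offset != 0 else max_days  # never leave the DOB unshifted


def shift_dob(patient_id: str, dob: date) -> date:
    """Shift a date of birth by the patient's own offset; other dates are not touched."""
    return dob + timedelta(days=patient_offset_days(patient_id))


class SurrogateVault:
    """Maps (patient, label, original value) to one surrogate so repeats stay consistent."""

    def __init__(self) -> None:
        self._store = {}

    def surrogate(self, patient_id: str, label: str, original: str) -> str:
        key = (patient_id, label, original)
        if key not in self._store:
            # Placeholder surrogate; a realistic fake name could be generated here instead.
            self._store[key] = f"[{label}_{len(self._store) + 1}]"
        return self._store[key]


if __name__ == "__main__":
    vault = SurrogateVault()
    print(vault.surrogate("patient-001", "PERSON", "Jane Doe"))
    print(vault.surrogate("patient-001", "PERSON", "Jane Doe"))  # same surrogate again
    print(shift_dob("patient-001", date(1950, 3, 14)))
```

Because the offset depends only on the patient identifier and the secret, every note in the same admission journey gets the same shifted DOB without storing the offset anywhere.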
## Results — residual leakage drops as we layer detection
*Known identifiers (joined from the structured tables) still present after sanitisation. Measured on all
**1,602 notes** (1,027 known-PII occurrences). Reproduce with `python tests/run_eval.py --compare`.*
| Detector | NHS number F1 | PERSON recall | **Residual leakage** |
|---|---|---|---|
| rules only | 0.98 | 0.00 | **74.8 %** |
| **presidio + rules** (shipping) | **0.99** | **0.68** | **8.5 %** |
The rules→engine drop is the headline: it shows, with numbers, exactly what the NER engine buys you.
> Precision is reported against *structured* PII only, so it is a conservative lower bound — correctly
> removing a clinician's name (not in the tables) counts here as a false positive. **Recall and leakage
> are the sound, headline metrics.**
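
For readers who want the metric spelled out, the sketch below shows one plausible reading of residual leakage: the share of known identifier occurrences that still appear verbatim in the sanitised text. The function name `residual_leakage`, the input shapes and the case-insensitive substring match are assumptions for illustration; `evaluate.py` may normalise or match differently, and the toy data is not from the dataset.

```python
from typing import Dict, List


def residual_leakage(
    sanitised_notes: Dict[str, str],
    known_pii: Dict[str, List[str]],
) -> float:
    """Fraction of known identifier occurrences still present after sanitisation.

    sanitised_notes: note id -> sanitised text
    known_pii:       note id -> identifiers known (from structured tables) to be in the raw note
    """
    total = 0
    leaked = 0
    for note_id, identifiers in known_pii.items():
        text = sanitised_notes.get(note_id, "").lower()
        for identifier in identifiers:
            total += 1
            if identifier.lower() in text:  # identifier survived sanitisation
                leaked += 1
    return leaked / total if total else 0.0


if __name__ == "__main__":
    # Toy example: four known identifiers, one of which survives.
    notes = {
        "n1": "Patient [PERSON] NHS no [NHS_NUMBER] reviewed.",
        "n2": "Seen by [PERSON]; lives at SW1A 1AA.",
    }
    truth = {
        "n1": ["Jane Doe", "123456789"],
        "n2": ["John Smith", "SW1A 1AA"],
    }
    print(f"{residual_leakage(notes, truth):.1%}")
```

A metric of this shape needs only the sanitised output and the ground-truth join, so it can be computed for any detector configuration without labelled spans.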
## Architecture
```
┌──────────────────── inside Trust A ─────────────────────┐
raw notes ──► │ fix mojibake ─► detect (Presidio NER + rules) │ ──► de-identified
(PHI) │ ─► transform (redact | pseudonymise) │ text + audit log
│ patient-consistent + date-shift, vault│ (no PHI leaves)
└─────────────────────────────────────────────────────────┘
same gate runs inside Trust B ──► ┌────────────────────────────┐
│ shared de-identified pool │ ──► federated AI training
└────────────────────────────┘
```
**Project layout** (Gold-RAP "analysis as a product"):
```
src/ the package
data.py load CSVs + ground-truth join (EVAL-ONLY oracle)
recognisers.py pure-Python rules: NHS checksum/context, postcode, date, phone, email, GMC/NMC/ODS, UUID
detect.py RuleDetector / PresidioDetector behind one Detector interface
transform.py redaction | patient-consistent pseudonymisation + DOB date-shift (Faker vault)
pipeline.py single-note detect -> sanitise -> audit
evaluate.py detection P/R/F1 + residual-leakage metric
trust_demo.py two-Trust sanitise-at-source demo
tests/ unit tests + run_eval.py (the evaluation CLI)
docs/ tool_card.md · CHANGELOG.md
data/ input CSVs (gitignored)
outputs/ generated artifacts: results.json, manifests (gitignored)
streamlit_app.py demo UI + Hugging Face Space entry point
Dockerfile HF Spaces (Docker) deploy
pyproject.toml packaging + lint/test config
```
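
The "one `Detector` interface" idea from the layout above can be sketched as follows. This is an assumed shape, not the contents of `detect.py`: the `Span` type, the single email rule and the `build_detector` fallback helper are hypothetical. The only Presidio calls used are `AnalyzerEngine()` and `AnalyzerEngine.analyze(text=..., language=...)`, whose results expose `start`, `end` and `entity_type`.

```python
import re
from dataclasses import dataclass
from typing import List, Protocol


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str


class Detector(Protocol):
    def detect(self, text: str) -> List[Span]: ...


class RuleDetector:
    """Pure-Python rules; a single email pattern stands in for the full rule set."""

    EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

    def detect(self, text: str) -> List[Span]:
        return [Span(m.start(), m.end(), "EMAIL") for m in self.EMAIL.finditer(text)]


class PresidioDetector:
    """Presidio NER plus the rule layer, behind the same detect() method."""

    def __init__(self) -> None:
        # Imported lazily so the module still loads when Presidio is not installed.
        from presidio_analyzer import AnalyzerEngine

        self._engine = AnalyzerEngine()
        self._rules = RuleDetector()

    def detect(self, text: str) -> List[Span]:
        results = self._engine.analyze(text=text, language="en")
        ner = [Span(r.start, r.end, r.entity_type) for r in results]
        merged = set(ner + self._rules.detect(text))
        return sorted(merged, key=lambda s: (s.start, s.end, s.label))


def build_detector() -> Detector:
    """Prefer Presidio; fall back to rules if it (or its spaCy model) is unavailable."""
    try:
        return PresidioDetector()
    except Exception:
        return RuleDetector()


if __name__ == "__main__":
    text = "Contact jane.doe@nhs.net today"
    for span in RuleDetector().detect(text):
        print(span.label, text[span.start:span.end])
```

Keeping the import inside `PresidioDetector.__init__` is what lets the rule layer and the evaluation run on a machine without spaCy or Presidio.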
## Trust & governance — mapped to the NHS Five Safes
- **Safe data** — PII removed to DAPB1523/ICO standard across patient + staff + org identifiers.
- **Safe settings** — runs inside the Trust; raw CSVs and the vault are gitignored, never leave.
- **Safe outputs** — only de-identified text + content-free audit logs; the measured leakage gates them.
- **Safe people / projects** — the re-identification vault stays Trust-local; pseudonymised data is
still personal data under UK GDPR — stated honestly, no over-claim.
## Run it
```bash
# 1) create AND activate the virtual environment
python -m venv .venv
.\.venv\Scripts\Activate.ps1 # Windows PowerShell
# source .venv/Scripts/activate # ... or Windows Git Bash
# source .venv/bin/activate # ... or macOS / Linux
# 2) install the package (editable) + the spaCy model
pip install -e ".[app,dev]"
python -m spacy download en_core_web_lg # or en_core_web_sm for a faster, lighter run
# 3) run
python tests/run_eval.py --compare --limit 300 # quick run on a 300-note subset; omit --limit for the full table -> outputs/results.json
python -m src.trust_demo # two NHS Trusts share only de-identified data -> outputs/
streamlit run streamlit_app.py # demo: Try-it · Metrics · Governance · Two-Trust
pytest -q # unit tests
```
The dataset is pulled automatically on first run. To run fully offline, drop the three CSVs in a
folder and set `NOTEGUARD_DATA_DIR=/path/to/csvs`.
## Deploy the live demo (Hugging Face Spaces)
```bash
pip install -U huggingface_hub # provides the `hf` CLI
hf auth login # paste a WRITE token from https://huggingface.co/settings/tokens
hf repos create <user>/noteguard --repo-type space --space-sdk docker
git remote add space https://huggingface.co/spaces/<user>/noteguard
git push space HEAD:main # builds the image and serves streamlit_app.py
```
## Data notes (found by inspecting the data, not assuming)
- NHS numbers in this synthetic set are **9 digits** (real ones are 10 digits, the last being a
Modulus 11 check digit). We catch both: checksum-validated 10-digit numbers anywhere in the text,
**and** context-anchored numbers after an "NHS …" label.
- Some fields are double-encoded (`·`); `_fix_mojibake` repairs them so they don't pollute ground truth.
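
The two data notes above can be sketched in a few lines. The Modulus 11 check follows the published NHS number rule (weights 10 down to 2 on the first nine digits). The label regex and the names `is_valid_nhs_number`, `find_labelled_numbers` and `repair_double_encoding` are hypothetical and are not the patterns in `recognisers.py` or the project's `_fix_mojibake`. The cp1252 round trip is one common way to undo double encoding, assumed here for illustration.

```python
import re
from typing import List


def is_valid_nhs_number(candidate: str) -> bool:
    """Validate a 10-digit NHS number using the Modulus 11 check digit."""
    digits = re.sub(r"[\s-]", "", candidate)
    if not re.fullmatch(r"\d{10}", digits):
        return False
    # Weights 10..2 applied to the first nine digits.
    total = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    # A computed check digit of 10 means the number can never be valid.
    return check != 10 and check == int(digits[9])


# Hypothetical label pattern: "NHS", an optional "number" / "no." / "#", then 9 or 10 digits.
NHS_LABELLED = re.compile(
    r"\bNHS\b\s*(?:number|no\.?|#)?\s*[:\-]?\s*(\d{9,10})\b",
    re.IGNORECASE,
)


def find_labelled_numbers(text: str) -> List[str]:
    """Return digit runs that directly follow an NHS label, checksum or not."""
    return [m.group(1) for m in NHS_LABELLED.finditer(text)]


def repair_double_encoding(text: str) -> str:
    """Undo UTF-8 text that was wrongly decoded as cp1252; leave clean text alone."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


if __name__ == "__main__":
    print(is_valid_nhs_number("943 476 5919"))
    print(find_labelled_numbers("NHS Number: 123456789, seen today"))
    print(repair_double_encoding("\u00c2\u00b7"))
```

The label anchor is what makes the 9-digit synthetic numbers detectable at all: without a check digit to validate, a bare 9-digit run cannot be told apart from any other number in the note.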
Built with Claude Code (`CLAUDE.md`, `.claude/`).