
	---
	title: NoteGuard — NHS De-Identification Gate
	emoji: 🛡️
	sdk: docker
	app_port: 8501
	pinned: false
	---

	# 🛡️ NoteGuard

	Automatic PII sanitisation for NHS clinical notes — clean data in, no identifiers out.

	NoteGuard discovers, inspects, redacts and de-identifies PII in free-text NHS clinical notes before
	the data is used to train any model. It runs locally at each institution ("sanitise at source"), so
	every Trust cleans its own data inside its own governance boundary before anything is shared or used in
	collaborative / federated training.

	> Federated learning lets institutions train without moving data. NoteGuard is the **privacy-preserving
	> on-ramp** that makes the data safe to train on in the first place — the missing layer in front of an
	> NHS Secure Data Environment / the Federated Data Platform.

	Encode Vibe Coding Hackathon — FLock Sovereign AI Challenge (hosted by Encode Hub). Built on Microsoft Presidio + spaCy,
	evaluated on [NHSEDataScience/synthetic_clinical_notes](https://huggingface.co/datasets/NHSEDataScience/synthetic_clinical_notes).

	## What makes this more than "just Presidio"

	Presidio is the detection engine — we don't reinvent it. NoteGuard is the **clinical assurance
	layer** Presidio leaves to you:

	1. Measured residual leakage. Because the dataset keeps PII in structured tables, we join them back
	to each note for free ground truth and report a real re-identification risk number — not a vibe.
	2. Domain adaptation to messy clinical text. NHS-aware recognisers: checksum-validated NHS numbers,
	context-anchored detection for the dataset's 9-digit synthetic numbers (which Presidio's `UK_NHS`
	recogniser misses), and GMC/NMC clinician IDs, ODS org codes and record UUIDs.
	3. Patient-consistent de-identification. Same patient → same surrogate across their whole
	admission journey. Among dates, only date of birth is treated as PII (shifted by a consistent
	per-patient offset); visit / admission dates are clinically useful and left intact. Identifiers are
	replaced with realistic en_GB fakes (or `[label]` redaction).
	4. Pluggable + degrades gracefully. One `Detector` interface (Rule / Presidio); the pure-Python
	rule layer + eval run even if spaCy/Presidio are unavailable.
	5. Governance wrapper. Per-note audit of what was removed + the dataset-level leakage report,
	mapped to the NHS Five Safes.
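
The NHS-number handling in item 2 can be sketched roughly as follows. This is a minimal illustration, not NoteGuard's actual recogniser: the pattern names, the `find_nhs_numbers` helper and the exact regexes are assumptions. Only the Modulus 11 check itself is the standard NHS-number algorithm.

```python
import re

# Hypothetical patterns for illustration; NoteGuard's own rules may differ.
# 10 digits, optionally grouped 3-3-4 with spaces or hyphens.
TEN_DIGIT = re.compile(r"(?<!\d)\d{3}[ -]?\d{3}[ -]?\d{4}(?!\d)")
# 9 or 10 digits that follow an "NHS ..." label on the same line.
LABELLED = re.compile(r"\bNHS\b[^\d\n]{0,20}(\d(?:[ -]?\d){8,9})(?!\d)", re.IGNORECASE)


def nhs_checksum_ok(candidate: str) -> bool:
    """Return True if `candidate` has 10 digits and a valid Modulus 11 check digit."""
    digits = [int(c) for c in candidate if c.isdigit()]
    if len(digits) != 10:
        return False
    # Weights 10 down to 2 apply to the first nine digits.
    total = sum(d * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    # A computed check digit of 10 means the number is invalid.
    return check != 10 and check == digits[9]


def find_nhs_numbers(text: str) -> list[tuple[int, int, str]]:
    """Return sorted (start, end, reason) spans for likely NHS numbers."""
    found: dict[tuple[int, int], str] = {}
    # Pass 1: any 10-digit number that passes the checksum, with or without a label.
    for m in TEN_DIGIT.finditer(text):
        if nhs_checksum_ok(m.group()):
            found[m.span()] = "checksum"
    # Pass 2: numbers anchored by an "NHS" label, even if they fail the checksum.
    for m in LABELLED.finditer(text):
        found.setdefault(m.span(1), "context")
    return [(start, end, why) for (start, end), why in sorted(found.items())]


print(find_nhs_numbers("NHS No: 123456789 seen today; ref 943 476 5919 on file"))
```

The two passes are deliberately separate: the checksum pass is precise enough to run anywhere in a note, while the label pass trades some precision for recall on numbers that cannot be validated.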

	## Results — residual leakage drops as we layer detection

	*Known identifiers (joined from the structured tables) still present after sanitisation. Measured on all
	1,602 notes (1,027 known-PII occurrences). Reproduce with `python tests/run_eval.py --compare`.*

	| Detector | NHS number F1 | PERSON recall | Residual leakage |
	|---|---|---|---|
	| rules only | 0.98 | 0.00 | 74.8 % |
	| presidio + rules (shipping) | 0.99 | 0.68 | 8.5 % |

	The rules→engine drop is the headline: it shows, with numbers, exactly what the NER engine buys you.

	> Precision is reported against structured PII only, so it is a conservative lower bound — correctly
	> removing a clinician's name (not in the tables) counts here as a false positive. **Recall and leakage
	> are the sound, headline metrics.**
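
For readers who want the leakage metric spelled out, here is a minimal sketch. It assumes leakage is the share of known identifier occurrences that still appear, case-insensitively, in the sanitised text. The function name, the input shapes and the substring matching are assumptions for illustration; `evaluate.py` may match differently.

```python
def residual_leakage(sanitised: dict[str, str], known_pii: dict[str, list[str]]) -> float:
    """Fraction of known identifiers still present in the sanitised notes.

    sanitised: note id -> text after de-identification.
    known_pii: note id -> identifiers known (from the structured tables)
               to occur in the raw version of that note.
    """
    total = 0
    leaked = 0
    for note_id, identifiers in known_pii.items():
        text = sanitised.get(note_id, "").casefold()
        for value in identifiers:
            total += 1
            # Count the identifier as leaked if it survives anywhere in the text.
            if value.casefold() in text:
                leaked += 1
    return leaked / total if total else 0.0


# Toy data, not taken from the dataset.
notes = {
    "n1": "Patient [PERSON] (NHS [NHS_NUMBER]) reviewed on ward.",
    "n2": "Jane Doe discharged home.",
}
truth = {
    "n1": ["John Smith", "123456789"],
    "n2": ["Jane Doe", "987654321"],
}
print(f"{residual_leakage(notes, truth):.1%}")
```

Because the ground truth comes from the structured tables rather than from the detector, a missed identifier cannot hide: it simply shows up as leakage.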

	## Architecture

	```
	              ┌──────────────────── inside Trust A ─────────────────────┐
	raw notes ──► │ fix mojibake ─► detect (Presidio NER + rules)           │ ──► de-identified
	  (PHI)       │              ─► transform (redact | pseudonymise)       │     text + audit log
	              │                 patient-consistent + date-shift, vault  │     (no PHI leaves)
	              └─────────────────────────────────────────────────────────┘
	same gate runs inside Trust B ──► ┌────────────────────────────┐
	                                  │ shared de-identified pool  │ ──► federated AI training
	                                  └────────────────────────────┘
	```
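
The "patient-consistent + date-shift, vault" step in the diagram could look something like the sketch below. It is an assumption-laden illustration: the HMAC-derived offset, the `SECRET` value, the `SurrogateVault` class and the fixed name pool are all hypothetical stand-ins (the project itself describes a Faker-backed vault).

```python
import hashlib
import hmac
from datetime import date, timedelta

# Hypothetical Trust-local secret; in practice it would never leave the Trust.
SECRET = b"trust-local-secret"


def patient_offset_days(patient_id: str, max_days: int = 365) -> int:
    """Derive a stable, non-zero offset in [-max_days, max_days] for one patient."""
    digest = hmac.new(SECRET, patient_id.encode("utf-8"), hashlib.sha256).digest()
    k = int.from_bytes(digest[:8], "big") % (2 * max_days)  # 0 .. 2*max_days - 1
    # Lower half maps to negative offsets, upper half to positive ones (never 0).
    return k - max_days if k < max_days else k - max_days + 1


def shift_dob(dob: date, patient_id: str) -> date:
    """Shift a date of birth by the patient's own offset."""
    return dob + timedelta(days=patient_offset_days(patient_id))


class SurrogateVault:
    """Remembers the surrogate chosen for each (patient, entity type, value)."""

    def __init__(self, pools: dict[str, list[str]]) -> None:
        self._pools = pools
        self._store: dict[tuple[str, str, str], str] = {}

    def surrogate(self, patient_id: str, entity_type: str, original: str) -> str:
        key = (patient_id, entity_type, original.casefold())
        if key not in self._store:
            pool = self._pools[entity_type]
            digest = hmac.new(SECRET, "|".join(key).encode("utf-8"), hashlib.sha256).digest()
            self._store[key] = pool[int.from_bytes(digest[:4], "big") % len(pool)]
        return self._store[key]


vault = SurrogateVault({"PERSON": ["Alex Morgan", "Sam Hughes", "Priya Shah"]})
first = vault.surrogate("patient-001", "PERSON", "John Smith")
again = vault.surrogate("patient-001", "PERSON", "john smith")
shifted = shift_dob(date(1950, 1, 1), "patient-001")
print(first, again, shifted)
```

Deriving the offset from a keyed hash means the same patient gets the same shift in every note without storing the offset anywhere, while the stored surrogate map is what would need to stay Trust-local.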

	Project layout (Gold-RAP "analysis as a product"):

	```
	src/                 the package
	  data.py            load CSVs + ground-truth join (EVAL-ONLY oracle)
	  recognisers.py     pure-Python rules: NHS checksum/context, postcode, date, phone, email, GMC/NMC/ODS, UUID
	  detect.py          RuleDetector / PresidioDetector behind one Detector interface
	  transform.py       redaction | patient-consistent pseudonymisation + DOB date-shift (Faker vault)
	  pipeline.py        single-note detect -> sanitise -> audit
	  evaluate.py        detection P/R/F1 + residual-leakage metric
	  trust_demo.py      two-Trust sanitise-at-source demo
	tests/               unit tests + run_eval.py (the evaluation CLI)
	docs/                tool_card.md · CHANGELOG.md
	data/                input CSVs (gitignored)
	outputs/             generated artifacts: results.json, manifests (gitignored)
	streamlit_app.py     demo UI + Hugging Face Space entry point
	Dockerfile           HF Spaces (Docker) deploy
	pyproject.toml       packaging + lint/test config
	```
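
As a rough picture of "RuleDetector / PresidioDetector behind one Detector interface", here is a minimal sketch. The `Span` type, the class bodies and `build_detectors` are assumptions, not the code in `detect.py`; the only real library call is Presidio's `AnalyzerEngine.analyze(text=..., language=...)`.

```python
import re
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str


class Detector(Protocol):
    def detect(self, text: str) -> list[Span]: ...


class RuleDetector:
    """Pure-Python rules; only a simple email pattern is shown here."""

    _EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

    def detect(self, text: str) -> list[Span]:
        return [Span(m.start(), m.end(), "EMAIL") for m in self._EMAIL.finditer(text)]


class PresidioDetector:
    """Thin wrapper around Presidio; construction fails if it is not installed."""

    def __init__(self) -> None:
        from presidio_analyzer import AnalyzerEngine  # optional dependency

        self._engine = AnalyzerEngine()

    def detect(self, text: str) -> list[Span]:
        results = self._engine.analyze(text=text, language="en")
        return [Span(r.start, r.end, r.entity_type) for r in results]


def build_detectors() -> list[Detector]:
    """Always include the rules; add Presidio only if it can be constructed."""
    detectors: list[Detector] = [RuleDetector()]
    try:
        detectors.append(PresidioDetector())
    except Exception:  # e.g. ImportError, or a missing spaCy model
        pass
    return detectors


def detect_all(detectors: list[Detector], text: str) -> list[Span]:
    """Union of every detector's spans, sorted by position."""
    spans = {span for d in detectors for span in d.detect(text)}
    return sorted(spans, key=lambda s: (s.start, s.end, s.label))


print(detect_all([RuleDetector()], "Contact jo.bloggs@nhs.net today"))
```

Keeping the import inside `PresidioDetector.__init__` is what lets the rule layer and the evaluation load on a machine without spaCy or Presidio.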

	## Trust & governance — mapped to the NHS Five Safes
	- Safe data — PII removed to DAPB1523/ICO standard across patient + staff + org identifiers.
	- Safe settings — runs inside the Trust; raw CSVs and the vault are gitignored, never leave.
	- Safe outputs — only de-identified text + content-free audit logs; the measured leakage gates them.
	- Safe people / projects — the re-identification vault stays Trust-local; pseudonymised data is
	still personal data under UK GDPR — stated honestly, no over-claim.

	## Run it

	```bash
	# 1) create AND activate the virtual environment
	python -m venv .venv
	.\.venv\Scripts\Activate.ps1        # Windows PowerShell
	# source .venv/Scripts/activate     # ... or Windows Git Bash
	# source .venv/bin/activate         # ... or macOS / Linux

	# 2) install the package (editable) + the spaCy model
	pip install -e ".[app,dev]"
	python -m spacy download en_core_web_lg   # or en_core_web_sm for a faster, lighter run

	# 3) run
	python tests/run_eval.py --compare --limit 300   # quick 300-note run -> outputs/results.json (omit --limit for the full table)
	python -m src.trust_demo                         # two NHS Trusts share only de-identified data -> outputs/
	streamlit run streamlit_app.py                   # demo: Try-it · Metrics · Governance · Two-Trust
	pytest -q                                        # unit tests
	```

	The dataset is pulled automatically on first run. To run fully offline, drop the three CSVs in a
	folder and set `NOTEGUARD_DATA_DIR=/path/to/csvs`.

	## Deploy the live demo (Hugging Face Spaces)

	```bash
	pip install -U huggingface_hub # provides the `hf` CLI
	hf auth login # paste a WRITE token from https://huggingface.co/settings/tokens
	hf repos create <user>/noteguard --repo-type space --space-sdk docker
	git remote add space https://huggingface.co/spaces/<user>/noteguard
	git push space HEAD:main # builds the image and serves streamlit_app.py
	```

	## Data notes (found by inspecting the data, not assuming)
	- NHS numbers in this synthetic set are 9 digits (real ones are 10 digits, the last being a
	Modulus 11 check digit). We catch both: checksum-validated 10-digit numbers anywhere, and
	context-anchored numbers after an "NHS …" label.
	- Some fields are double-encoded (`Â·`); `_fix_mojibake` repairs them so they don't pollute ground truth.
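
A double-encoded field such as `Â·` is what UTF-8 bytes look like when they are read back as Latin-1. The sketch below shows one way such a repair could work. It is not the project's `_fix_mojibake`: the function name and the regex are assumptions, and it covers only the Latin-1 case (Windows-1252 damage such as curly quotes would need extra handling).

```python
import re

# Runs of characters that look like one UTF-8 sequence misread as Latin-1:
# a lead byte (0xC2-0xF4) followed by the right number of continuation bytes.
_SUSPECT = re.compile(
    "[\u00c2-\u00df][\u0080-\u00bf]"
    "|[\u00e0-\u00ef][\u0080-\u00bf]{2}"
    "|[\u00f0-\u00f4][\u0080-\u00bf]{3}"
)


def fix_mojibake_sketch(text: str) -> str:
    """Undo one round of UTF-8-read-as-Latin-1 damage, leaving clean text alone."""

    def repair(match: re.Match) -> str:
        try:
            return match.group().encode("latin-1").decode("utf-8")
        except UnicodeDecodeError:
            # Not a valid UTF-8 sequence after all; keep the original characters.
            return match.group()

    return _SUSPECT.sub(repair, text)


damaged = "Ward 3 · Bed 2".encode("utf-8").decode("latin-1")
print(repr(damaged), "->", repr(fix_mojibake_sketch(damaged)))
```

Repairing match by match, rather than re-encoding the whole string, keeps correctly encoded accents such as the `é` in "café" untouched.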

	Built with Claude Code (`CLAUDE.md`, `.claude/`).