title: NoteGuard  NHS De-Identification Gate
emoji: 🛡️
sdk: docker
app_port: 8501
pinned: false

🛡️ NoteGuard

Automatic PII sanitisation for NHS clinical notes — clean data in, no identifiers out.

NoteGuard discovers, inspects, redacts and de-identifies PII in free-text NHS clinical notes before the data is used to train any model. It runs locally at each institution ("sanitise at source"), so every Trust cleans its own data inside its own governance boundary before anything is shared or used in collaborative / federated training.

Federated learning lets institutions train without moving data. NoteGuard is the privacy-preserving on-ramp that makes the data safe to train on in the first place — the missing layer in front of an NHS Secure Data Environment / the Federated Data Platform.

Encode Vibe Coding Hackathon — FLock Sovereign AI Challenge (hosted by Encode Hub). Built on Microsoft Presidio + spaCy, evaluated on NHSEDataScience/synthetic_clinical_notes.

What makes this more than "just Presidio"

Presidio is the detection engine — we don't reinvent it. NoteGuard is the clinical assurance layer Presidio leaves to you:

  1. Measured residual leakage. Because the dataset keeps PII in structured tables, we join them back to each note for free ground truth and report a real re-identification risk number — not a vibe.
  2. Domain adaptation to messy clinical text. NHS-aware recognisers: checksum-validated NHS numbers, context-anchored detection for the dataset's 9-digit synthetic numbers (which Presidio's UK_NHS recogniser misses), GMC/NMC clinician IDs, ODS org codes and record UUIDs.
  3. Patient-consistent de-identification. Same patient → same surrogate across their whole admission journey. Only date-of-birth is treated as PII (shifted by a consistent per-patient offset); visit / admission dates are clinically useful and left intact. Realistic en_GB fakes (or [label] redaction).
  4. Pluggable + degrades gracefully. One Detector interface (Rule / Presidio); the pure-Python rule layer + eval run even if spaCy/Presidio are unavailable.
  5. Governance wrapper. Per-note audit of what was removed + the dataset-level leakage report, mapped to the NHS Five Safes.
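
The patient-consistent behaviour in point 3 can be sketched as follows. This is a minimal illustration, not the code in transform.py: the names `SECRET_KEY`, `patient_offset_days`, `shift_dob` and `SurrogateVault` are hypothetical, the keyed hash is one assumed way to get a stable per-patient offset, and a fixed name pool stands in for the Faker-backed vault.

```python
import hashlib
import hmac
from datetime import date, timedelta

# Hypothetical Trust-local secret; in practice it would never leave the Trust.
SECRET_KEY = b"trust-local-secret"


def patient_offset_days(patient_id: str, max_days: int = 365) -> int:
    """Derive a stable, non-zero offset in [-max_days, max_days] from the patient ID."""
    digest = hmac.new(SECRET_KEY, patient_id.encode("utf-8"), hashlib.sha256).digest()
    value = int.from_bytes(digest[:8], "big") % (2 * max_days)  # 0 .. 2*max_days - 1
    offset = value - max_days                                   # -max_days .. max_days - 1
    return offset if offset < 0 else offset + 1                 # skip 0 so the DOB always moves


def shift_dob(dob: date, patient_id: str) -> date:
    """Shift a date of birth by the patient's own offset (same patient, same shift)."""
    return dob + timedelta(days=patient_offset_days(patient_id))


class SurrogateVault:
    """Remembers (patient, original) -> surrogate so repeats get the same replacement."""

    def __init__(self, surrogate_pool):
        self._pool = list(surrogate_pool)  # assumption: a fixed pool instead of Faker
        self._map = {}

    def surrogate_for(self, patient_id: str, original: str) -> str:
        key = (patient_id, original.lower())
        if key not in self._map:
            # Cycles through the pool; a real vault would generate fresh fakes instead.
            self._map[key] = self._pool[len(self._map) % len(self._pool)]
        return self._map[key]
```

Because the offset depends only on the patient ID and the secret, every note in one admission journey shifts the date of birth by the same amount, while visit and admission dates are never passed through `shift_dob`.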

Results — residual leakage drops as we layer detection

Residual leakage is the share of known identifiers (joined from the structured tables) that are still present after sanitisation. Measured on all 1,602 notes (1,027 known-PII occurrences). Reproduce with python tests/run_eval.py --compare.

| Detector | NHS number F1 | PERSON recall | Residual leakage |
|---|---|---|---|
| rules only | 0.98 | 0.00 | 74.8 % |
| presidio + rules (shipping) | 0.99 | 0.68 | 8.5 % |

The rules→engine drop is the headline: it shows, with numbers, exactly what the NER engine buys you.

Precision is reported against structured PII only, so it is a conservative lower bound — correctly removing a clinician's name (not in the tables) counts here as a false positive. Recall and leakage are the sound, headline metrics.
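
To make the leakage metric concrete, here is a rough sketch of how such a number could be computed. It assumes a case-insensitive substring match and a hypothetical `residual_leakage` helper; the real logic in evaluate.py may normalise or tokenise differently.

```python
from typing import Iterable, List, Tuple


def residual_leakage(notes: Iterable[Tuple[str, List[str]]]) -> float:
    """Fraction of known identifier occurrences still present after sanitisation.

    Each item is (sanitised_text, known_identifiers), where known_identifiers
    come from the structured tables joined to that note.
    """
    total = 0
    leaked = 0
    for sanitised_text, known_identifiers in notes:
        haystack = sanitised_text.lower()
        for identifier in known_identifiers:
            total += 1
            if identifier.lower() in haystack:  # identifier survived sanitisation
                leaked += 1
    return leaked / total if total else 0.0
```

Only identifiers present in the structured tables count here, which is why a correctly removed clinician name cannot improve this number and only shows up on the precision side.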

Architecture

                 ┌──────────────────── inside Trust A ─────────────────────┐
 raw notes ──►   │  fix mojibake ─► detect (Presidio NER + rules)           │ ──► de-identified
 (PHI)           │                  ─► transform (redact | pseudonymise)    │     text + audit log
                 │                     patient-consistent + date-shift, vault│     (no PHI leaves)
                 └─────────────────────────────────────────────────────────┘
        same gate runs inside Trust B ──►  ┌────────────────────────────┐
                                           │  shared de-identified pool  │ ──► federated AI training
                                           └────────────────────────────┘

Project layout (Gold-RAP "analysis as a product"):

src/                 the package
  data.py            load CSVs + ground-truth join (EVAL-ONLY oracle)
  recognisers.py     pure-Python rules: NHS checksum/context, postcode, date, phone, email, GMC/NMC/ODS, UUID
  detect.py          RuleDetector / PresidioDetector behind one Detector interface
  transform.py       redaction | patient-consistent pseudonymisation + DOB date-shift (Faker vault)
  pipeline.py        single-note detect -> sanitise -> audit
  evaluate.py        detection P/R/F1 + residual-leakage metric
  trust_demo.py      two-Trust sanitise-at-source demo
tests/               unit tests + run_eval.py (the evaluation CLI)
docs/                tool_card.md · CHANGELOG.md
data/                input CSVs (gitignored)
outputs/             generated artifacts: results.json, manifests (gitignored)
streamlit_app.py     demo UI + Hugging Face Space entry point
Dockerfile           HF Spaces (Docker) deploy
pyproject.toml       packaging + lint/test config
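
The "one Detector interface, degrade gracefully" idea behind detect.py might look roughly like this. It is a sketch under assumptions: `Span`, `build_detector` and the single email rule are illustrative, and only the `AnalyzerEngine.analyze` call and its result fields come from Presidio itself.

```python
import re
from dataclasses import dataclass
from typing import List, Protocol


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str


class Detector(Protocol):
    def detect(self, text: str) -> List[Span]: ...


class RuleDetector:
    """Pure-Python rules; works with no third-party packages installed."""

    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # illustrative rule only

    def detect(self, text: str) -> List[Span]:
        return [Span(m.start(), m.end(), "EMAIL") for m in self.EMAIL.finditer(text)]


class PresidioDetector:
    """Presidio NER plus the rule layer, behind the same interface."""

    def __init__(self) -> None:
        # Imported lazily so the module still loads when Presidio is absent.
        from presidio_analyzer import AnalyzerEngine

        self._engine = AnalyzerEngine()
        self._rules = RuleDetector()

    def detect(self, text: str) -> List[Span]:
        results = self._engine.analyze(text=text, language="en")
        spans = [Span(r.start, r.end, r.entity_type) for r in results]
        return spans + self._rules.detect(text)


def build_detector() -> Detector:
    """Prefer Presidio; fall back to rules if it (or its spaCy model) is unavailable."""
    try:
        return PresidioDetector()
    except Exception:
        return RuleDetector()
```

Callers only ever see `detect(text)`, so the pipeline and the evaluation can run unchanged on whichever detector `build_detector` returns.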

Trust & governance — mapped to the NHS Five Safes

  • Safe data — PII removed to DAPB1523/ICO standard across patient + staff + org identifiers.
  • Safe settings — runs inside the Trust; raw CSVs and the vault are gitignored, never leave.
  • Safe outputs — only de-identified text + content-free audit logs; the measured leakage gates them.
  • Safe people / projects — the re-identification vault stays Trust-local; pseudonymised data is still personal data under UK GDPR — stated honestly, no over-claim.

Run it

# 1) create AND activate the virtual environment
python -m venv .venv
.\.venv\Scripts\Activate.ps1        # Windows PowerShell
# source .venv/Scripts/activate     #   ... or Windows Git Bash
# source .venv/bin/activate         #   ... or macOS / Linux

# 2) install the package (editable) + the spaCy model
pip install -e ".[app,dev]"
python -m spacy download en_core_web_lg   # or en_core_web_sm for a faster, lighter run

# 3) run
python tests/run_eval.py --compare --limit 300   # quick 300-note run -> outputs/results.json (drop --limit for the full table)
python -m src.trust_demo                          # two NHS Trusts share only de-identified data -> outputs/
streamlit run streamlit_app.py                    # demo: Try-it · Metrics · Governance · Two-Trust
pytest -q                                          # unit tests

The dataset is pulled automatically on first run. To run fully offline, drop the three CSVs in a folder and set NOTEGUARD_DATA_DIR=/path/to/csvs.
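
Before going offline, a small pre-flight check along these lines can confirm the folder is set up. The helper name `offline_csvs` is hypothetical, and how the package itself reads NOTEGUARD_DATA_DIR is not shown here; this only checks that the variable points at a folder containing CSV files.

```python
import os
from pathlib import Path
from typing import List


def offline_csvs(env_var: str = "NOTEGUARD_DATA_DIR") -> List[Path]:
    """Return the CSV files in the override folder, or [] if it is unset or missing."""
    override = os.environ.get(env_var)
    if not override:
        return []
    folder = Path(override).expanduser()
    if not folder.is_dir():
        return []
    return sorted(folder.glob("*.csv"))


if __name__ == "__main__":
    found = offline_csvs()
    print(f"{len(found)} CSV file(s) found:", [p.name for p in found])
```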

Deploy the live demo (Hugging Face Spaces)

pip install -U huggingface_hub      # provides the `hf` CLI
hf auth login                        # paste a WRITE token from https://huggingface.co/settings/tokens
hf repos create <user>/noteguard --repo-type space --space-sdk docker
git remote add space https://huggingface.co/spaces/<user>/noteguard
git push space HEAD:main             # builds the image and serves streamlit_app.py

Data notes (found by inspecting the data, not assuming)

  • NHS numbers in this synthetic set are 9 digits (real ones are 10 digits, the last being a mod-11 check digit). We catch both: checksum-validated 10-digit numbers anywhere, and context-anchored numbers after an "NHS …" label.
  • Some fields are double-encoded (for example · appearing as Â·); _fix_mojibake repairs them so they don't pollute ground truth.
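
Both data notes can be illustrated with a short sketch. The mod-11 check is the standard NHS number algorithm; the regular expressions, `find_nhs_numbers` and `fix_mojibake_sketch` are assumptions for illustration and are not the code in recognisers.py or the project's own `_fix_mojibake`.

```python
import re

TEN_DIGIT = re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{4}\b")
# Assumed context rule: 9 or 10 digits shortly after an "NHS ..." label.
CONTEXT_NHS = re.compile(r"\bNHS\b[^\d\n]{0,20}(\d{9,10})\b", re.IGNORECASE)


def is_valid_nhs_number(digits: str) -> bool:
    """Mod-11 check: weight the first nine digits 10..2, compare with the tenth."""
    if len(digits) != 10 or not digits.isdigit():
        return False
    total = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    return check != 10 and check == int(digits[9])


def find_nhs_numbers(text: str) -> set:
    """Checksum-valid 10-digit numbers anywhere, plus labelled 9/10-digit numbers."""
    found = set()
    for match in TEN_DIGIT.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if is_valid_nhs_number(digits):
            found.add(digits)
    for match in CONTEXT_NHS.finditer(text):
        found.add(match.group(1))
    return found


def fix_mojibake_sketch(text: str) -> str:
    """Undo one round of UTF-8 read as cp1252; return the input if that fails."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # already clean, or not this kind of damage
```

The context rule is what catches the 9-digit synthetic numbers, since they can never pass a 10-digit checksum; the checksum rule is what keeps arbitrary 10-digit strings from being flagged.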

Built with Claude Code (CLAUDE.md, .claude/).