yumi.h

Add de-identified download, remove FLock.io tech refs, add hackathon footer

6f7e511 12 days ago

NoteGuard — Tool Card

Version: 0.0.1
Track: Public Sector & Citizen Services — NHS Secure Data Environment on-ramp
Status: Hackathon prototype; not validated for production use without further evaluation.

Specification

| Field | Value |
| --- | --- |
| Description | De-identification gate that detects and removes PII from free-text NHS clinical notes |
| Type | Hybrid pipeline: pure-Python rules + Microsoft Presidio (spaCy `en_core_web_lg` NER). No model is trained; pre-trained components are composed. |
| Developer | Encode Vibe Coding Hackathon team (fork of `NoteGuard/`) |
| Status / version | Prototype · v0.0.1 |
| Repository | github.com/chaeyoonyunakim/automatic-pii-preprocessing-tool |

Documented as a tool card, not a model card. NoteGuard trains no model, so the NHS England model card template's training / hyperparameter / data-split sections do not apply. A gov.uk Algorithmic Transparency Recording Standard (ATRS) record is provided in report.md.

What it does

NoteGuard is a de-identification gate for free-text NHS clinical notes. It detects and removes patient and clinician PII inside a Trust before any text reaches a Secure Data Environment (SDE), federated training round, or cross-Trust sharing layer.

"AI detects, humans review, audit logs account."

Who uses it

| Role | When | Why |
| --- | --- | --- |
| Data Wrangler / IG Analyst | Before releasing notes to research or AI teams | Cannot share raw free-text; must prove zero identifier leakage |
| SDE Operator | At the Trust boundary ingestion point | Gate between Trust raw data and the shared pool |
| Federated AI platform | Before each training round | Needs de-identified text; cannot inspect raw Trust data |

Use cases out of scope

Not a substitute for Information Governance sign-off, a DPIA, or DARS approval — it is a technical control, not a legal basis for processing.
Not validated on real Trust data, non-English notes, or scanned / handwritten documents.
Not a guarantee of zero re-identification: pseudonymised output is still personal data, and residual leakage is measured, not assumed to be zero on unseen data.
Not for clinical decision-making, or any use of note content beyond de-identification.

Detection coverage

| Entity type | Method | Notes |
| --- | --- | --- |
| Patient name (`PERSON`) | spaCy `en_core_web_lg` NER | 100% recall in benchmarks |
| NHS number (`UK_NHS`) | Regex + Modulus-11 checksum + 9-digit context anchor | Catches both standard and synthetic dataset forms |
| Date of birth (`DATE_TIME`) | Presidio + date regex | 100% recall |
| Site / hospital name (`LOCATION`) | spaCy NER + rule-based suffix anchor | "X Hospital / Infirmary / NHS Trust" patterns (`ORGANIZATION` is excluded because it over-tags labels) |
| UK postcode (`UK_POSTCODE`) | Regex | Outward-code only after pseudonymisation |
| Clinician GMC / NMC (`GMC`, `NMC`) | Context-anchored regex | "GMC 1234567", "NMC PIN 12A3456B" |
| ODS org code (`NHS_ODS`) | Context-anchored regex | "ODS A12345", "Practice Code A12345" |
| Record / document UUID (`RECORD_ID`) | UUID regex | Quasi-identifier |
| Email / phone / NINO | Presidio built-ins | Standard patterns |
| Nationality / religion / political (`NRP`) | Presidio | Always redacted; never pseudonymised (UK GDPR Art. 9) |
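
The Modulus-11 checksum named in the `UK_NHS` row is the standard NHS number check-digit rule. A minimal Python sketch of that rule follows. The function name `is_valid_nhs_number` is illustrative and not taken from the NoteGuard code, and the 9-digit context anchor used for synthetic-dataset forms is not shown.

```python
import re


def is_valid_nhs_number(candidate: str) -> bool:
    """Return True if `candidate` is 10 digits and passes the Modulus-11 check."""
    # Tolerate the common "943 476 5919" and "943-476-5919" layouts.
    digits = re.sub(r"[\s-]", "", candidate)
    if not re.fullmatch(r"\d{10}", digits):
        return False

    # Weight the first nine digits 10, 9, ..., 2 and sum the products.
    total = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))

    check = 11 - (total % 11)
    if check == 11:
        check = 0
    if check == 10:
        # A computed check digit of 10 means the number is invalid.
        return False
    return check == int(digits[9])
```

Running the checksum after the regex match keeps arbitrary 10-digit strings (phone fragments, lab values) from being tagged as NHS numbers.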

Anonymisation policy

| Mode | Behaviour |
| --- | --- |
| Pseudonymise (default) | Faker(en_GB) realistic surrogates; stable per patient via Trust-local vault; date intervals preserved by consistent random shift |
| Redact | `[ENTITY_TYPE]` placeholder tags |
| `NRP` | Always redacted regardless of mode |
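
To illustrate how a consistent per-patient shift preserves date intervals, here is a small standard-library sketch. The `Vault` class, the `shift_date` helper, and the one-year shift range are assumptions made for illustration; the actual vault format and the Faker-based name surrogates are not shown.

```python
import random
from datetime import date, timedelta


class Vault:
    """Hypothetical Trust-local store of per-patient pseudonymisation state."""

    def __init__(self, rng: random.Random, max_shift_days: int = 365):
        self._rng = rng
        self._max_shift = max_shift_days
        self._shift_days = {}  # patient key -> day offset

    def shift_for(self, patient_key: str) -> int:
        # Draw the offset once per patient, then reuse it for every date.
        if patient_key not in self._shift_days:
            offset = 0
            while offset == 0:  # a zero shift would leave dates unchanged
                offset = self._rng.randint(-self._max_shift, self._max_shift)
            self._shift_days[patient_key] = offset
        return self._shift_days[patient_key]


def shift_date(vault: Vault, patient_key: str, original: date) -> date:
    """Shift one date by the patient's stored offset."""
    return original + timedelta(days=vault.shift_for(patient_key))


vault = Vault(random.Random(0))
admitted = shift_date(vault, "patient-001", date(2023, 1, 10))
discharged = shift_date(vault, "patient-001", date(2023, 1, 17))
print((discharged - admitted).days)  # the 7-day stay is preserved
```

Because every date for a patient moves by the same offset, length of stay and time between events survive, while the absolute dates do not. The stored offsets are part of the re-identification key, which is why the vault has to stay Trust-local.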

Performance (`en_core_web_lg`)

| Entity | Recall |
| --- | --- |
| Names | 100% |
| NHS number | ~100% |
| Date of birth | 100% |
| Places / sites | Improving; previously low because of an `ORGANIZATION` / `LOCATION` label mismatch, which is now fixed |

Residual leakage target: 0 known identifiers surviving sanitisation (gates SDE pool admission).
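
A rough sketch of how recall and the residual leakage gate could be computed. The function names and the simple case-insensitive substring match are assumptions; the project's evaluation code may match identifiers differently.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Recall = detected identifiers / all identifiers in the ground truth."""
    total = true_positives + false_negatives
    return true_positives / total if total else 1.0


def residual_leaks(sanitised_text: str, known_identifiers: list) -> list:
    """Return the known identifiers that still appear in the sanitised text."""
    haystack = sanitised_text.casefold()
    return [v for v in known_identifiers if v and v.casefold() in haystack]


def admit_to_pool(sanitised_text: str, known_identifiers: list) -> bool:
    """Gate: admit a note only if no known identifier survived."""
    return not residual_leaks(sanitised_text, known_identifiers)


known = ["Jane Doe", "9434765919"]
clean = "Patient [PERSON] (NHS [UK_NHS]) reviewed on ward."
leaky = "Patient [PERSON] (NHS [UK_NHS]); jane doe to return in 2 weeks."
print(admit_to_pool(clean, known), admit_to_pool(leaky, known))
```

A check like this only covers identifiers that are already known from the ground truth, so passing it is necessary for pool admission but does not show that leakage on unseen data is zero.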

Bias and fairness

The PERSON NER (spaCy en_core_web_lg) is trained largely on Western / English text, so name recall can be lower for names of non-English origin. This is an equity risk: under-detection means those patients carry a higher residual re-identification risk. Honest position and mitigations:

The checksum / context rules (NHS number, DOB, postcode, GMC/NMC/ODS, NINO) are name-agnostic, so structured identifiers are detected uniformly regardless of patient demographics.
The human review queue surfaces low-confidence name spans for IG analyst confirmation.
Required before deployment: evaluate name recall stratified by name origin / ethnicity coding on representative Trust data and report the disparity. Not yet done — evaluation is on synthetic data.
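
The name-agnostic context rules mentioned in the first point can be illustrated with a few context-anchored regular expressions. The patterns below are assumptions written to match the examples in the detection coverage table ("GMC 1234567", "NMC PIN 12A3456B", "ODS A12345", "Practice Code A12345"); they are not copied from the NoteGuard source.

```python
import re

# Each pattern requires a keyword anchor before the identifier, so a bare
# 7-digit number is not tagged as a GMC number on its own.
CONTEXT_PATTERNS = {
    "GMC": re.compile(r"\bGMC[\s:#]+(\d{7})\b", re.IGNORECASE),
    "NMC": re.compile(
        r"\bNMC(?:\s+PIN)?[\s:#]+(\d{2}[A-Z]\d{4}[A-Z])\b", re.IGNORECASE
    ),
    "NHS_ODS": re.compile(
        r"\b(?:ODS|Practice\s+Code)[\s:#]+([A-Z]\d{5})\b", re.IGNORECASE
    ),
}


def find_context_anchored(text: str) -> list:
    """Return (entity_type, value, start, end) tuples ordered by position."""
    hits = []
    for entity_type, pattern in CONTEXT_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append(
                (entity_type, match.group(1), match.start(1), match.end(1))
            )
    return sorted(hits, key=lambda hit: hit[2])


note = ("Seen by Dr A (GMC 1234567) and nurse (NMC PIN 12A3456B); "
        "Practice Code A12345. Ref 7654321.")
for hit in find_context_anchored(note):
    print(hit)
```

These rules depend only on a keyword and a fixed identifier format, so their behaviour does not vary with the patient's or clinician's name.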

Human-in-the-loop

Low-confidence detections (model score below the auto-confirm threshold) are:

Still redacted for safety.
Flagged in the review queue with context snippet and confidence score.
Surfaced to an IG analyst for confirmation before the note enters the SDE pool.
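
The routing described above can be sketched as follows. The `Detection` and `ReviewItem` types, the 0.85 threshold, and the 20-character snippet window are hypothetical values chosen for illustration, and detections are assumed not to overlap.

```python
from dataclasses import dataclass

AUTO_CONFIRM_THRESHOLD = 0.85  # hypothetical value, not from the source


@dataclass
class Detection:
    entity_type: str
    start: int
    end: int
    score: float


@dataclass
class ReviewItem:
    entity_type: str
    score: float
    snippet: str


def redact_and_queue(text, detections, threshold=AUTO_CONFIRM_THRESHOLD, window=20):
    """Redact every detection; queue the low-confidence ones for review."""
    redacted = text
    # Replace right-to-left so earlier offsets stay valid after each edit.
    for det in sorted(detections, key=lambda d: d.start, reverse=True):
        redacted = redacted[:det.start] + f"[{det.entity_type}]" + redacted[det.end:]

    queue = []
    for det in sorted(detections, key=lambda d: d.start):
        if det.score < threshold:
            # The snippet comes from the raw note, so the queue must stay
            # inside the Trust alongside the raw data.
            snippet = text[max(0, det.start - window):det.end + window]
            queue.append(ReviewItem(det.entity_type, det.score, snippet))
    return redacted, queue


text = "Dr Okonkwo reviewed Jane Doe today."
start_a, start_b = text.index("Okonkwo"), text.index("Jane Doe")
detections = [
    Detection("PERSON", start_a, start_a + len("Okonkwo"), 0.40),
    Detection("PERSON", start_b, start_b + len("Jane Doe"), 0.99),
]
redacted, queue = redact_and_queue(text, detections)
print(redacted)
```

Redaction does not depend on the score, so a low-confidence span is never left in the output while it waits for an analyst's decision.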

This matches the real NHS Information Governance workflow and makes the tool's accountability explicit.

NHS Five Safes mapping

| Safe | Status | How |
| --- | --- | --- |
| Safe data | ✅ | De-identified to DAPB1523/ICO standard; leakage-gated |
| Safe settings | ✅ | Processing inside Trust; raw data and vault gitignored |
| Safe outputs | ✅ | Only de-identified text + content-free audit logs leave |
| Safe people | ⚠️ | IG analyst review queue; vault stays Trust-local; honest UK GDPR framing |
| Safe projects | ⚠️ | Technical layer only; DPIA + project approval (DARS) remain Trust processes |
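
As one possible reading of "content-free audit logs", the sketch below records only counts and outcomes per note. The field names and the `audit_entry` helper are assumptions, and `note_ref` is assumed to be an opaque reference that carries no patient data.

```python
import json
from collections import Counter


def audit_entry(note_ref: str, entity_types: list, mode: str, leaks: int) -> str:
    """Build a JSON audit line holding counts and outcomes, never note text."""
    entry = {
        "note_ref": note_ref,  # opaque reference (assumed non-identifying)
        "mode": mode,          # "pseudonymise" or "redact"
        "entity_counts": dict(sorted(Counter(entity_types).items())),
        "residual_leaks": leaks,
        "admitted": leaks == 0,
    }
    return json.dumps(entry, sort_keys=True)


line = audit_entry("note-0001", ["PERSON", "UK_NHS", "PERSON"], "pseudonymise", 0)
print(line)
```

An entry of this shape lets a reviewer see what was removed and whether the note passed the leakage gate without exposing any identifier value.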

Limitations and caveats

Pseudonymised data is still personal data under UK GDPR — the vault is the re-identification key and must stay Trust-local.
Precision is a conservative lower bound: the ground truth covers only the patient table, so correctly detected clinician names and unlisted locations are counted as false positives in the evaluation.
Not clinically validated: evaluated on the NHSEDataScience/synthetic_clinical_notes dataset. Real deployment requires validation on representative Trust data.
Clinical transformer models (e.g. obi/deid_roberta_i2b2) were tested and performed worse on UK names than en_core_web_lg (i2b2 training data is US-centric).
Governance prerequisites for deployment: a Data Protection Impact Assessment (DPIA), IG / Caldicott sign-off, and DARS project approval are required before any real use. NoteGuard is the technical control, not the approval.

Adoption path

NHS Trust (raw notes)
    │
    ▼  NoteGuard gate (runs inside Trust)
    │
    ▼  de-identified notes + audit log
    │
    ▼  NHS SDE / FDP shared pool
    │
    ▼  Federated AI

Same privacy model as OpenSAFELY: code comes to the data, and raw data never leaves the Trust.

NoteGuard · Encode Vibe Coding Hackathon — FLock Sovereign AI Challenge · internal use only