# NoteGuard — Tool Card
**Version:** 0.0.1
**Track:** Public Sector & Citizen Services — NHS Secure Data Environment on-ramp
**Status:** Hackathon prototype; not validated for production use without further evaluation.
---
## Specification
| Field | Value |
|---|---|
| Description | De-identification gate that detects and removes PII from free-text NHS clinical notes |
| Type | Hybrid pipeline — pure-Python rules + Microsoft Presidio (spaCy `en_core_web_lg` NER). **No model is trained**; pre-trained components are composed. |
| Developer | Encode Vibe Coding Hackathon team (fork of `NoteGuard/`) |
| Status / version | Prototype · v0.0.1 |
| Repository | github.com/chaeyoonyunakim/automatic-pii-preprocessing-tool |
> Documented as a **tool card**, not a model card. NoteGuard trains no model, so the NHS England
> *model card* template's training / hyperparameter / data-split sections do not apply. A gov.uk
> Algorithmic Transparency Recording Standard (ATRS) record is provided in [`report.md`](report.md).
---
## What it does
NoteGuard is a **de-identification gate** for free-text NHS clinical notes. It detects and removes patient and clinician PII *inside* a Trust before any text reaches a Secure Data Environment (SDE), federated training round, or cross-Trust sharing layer.
> "AI detects, humans review, audit logs account."
---
## Who uses it
| Role | When | Why |
|---|---|---|
| Data Wrangler / IG Analyst | Before releasing notes to research or AI teams | Cannot share raw free-text; must prove zero identifier leakage |
| SDE Operator | At the Trust boundary ingestion point | Gate between Trust raw data and the shared pool |
| Federated AI platform | Before each training round | Needs de-identified text; cannot inspect raw Trust data |
---
## Use cases out of scope
- **Not** a substitute for Information Governance sign-off, a DPIA, or DARS approval — it is a
technical control, not a legal basis for processing.
- **Not** validated on real Trust data, non-English notes, or scanned / handwritten documents.
- **Not** a guarantee of zero re-identification: pseudonymised output is still personal data, and
residual leakage is *measured*, not assumed to be zero on unseen data.
- **Not** for clinical decision-making, or any use of note *content* beyond de-identification.
---
## Detection coverage
| Entity type | Method | Notes |
|---|---|---|
| Patient name (`PERSON`) | spaCy `en_core_web_lg` NER | 100% recall on the synthetic evaluation dataset |
| NHS number (`UK_NHS`) | Regex + Modulus-11 checksum + 9-digit context anchor | Catches both standard and synthetic dataset forms |
| Date of birth (`DATE_TIME`) | Presidio + date regex | 100% recall |
| Site / hospital name (`LOCATION`) | spaCy NER + rule-based suffix anchor | "X Hospital / Infirmary / NHS Trust" patterns (ORGANIZATION is excluded — it over-tags labels) |
| UK postcode (`UK_POSTCODE`) | Regex | Outward-code only after pseudonymisation |
| Clinician GMC / NMC (`GMC`, `NMC`) | Context-anchored regex | "GMC 1234567", "NMC PIN 12A3456B" |
| ODS org code (`NHS_ODS`) | Context-anchored regex | "ODS A12345", "Practice Code A12345" |
| Record / document UUID (`RECORD_ID`) | UUID regex | Quasi-identifier |
| Email / phone / NINO | Presidio built-ins | Standard patterns |
| Nationality / religion / political (`NRP`) | Presidio | Always redacted; never pseudonymised (UK GDPR Art. 9) |
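
The checksum step in the NHS number row can be sketched as follows. This is a minimal, standard-library illustration of the public Modulus-11 rule, not NoteGuard's actual recogniser: the names `NHS_CANDIDATE`, `nhs_checksum_ok` and `find_nhs_numbers` are hypothetical, and the 9-digit context anchor used for the synthetic dataset form is not shown.

```python
import re

# Candidate: 10 digits, optionally grouped 3-3-4 with single spaces or hyphens.
NHS_CANDIDATE = re.compile(r"\b(\d{3})[ -]?(\d{3})[ -]?(\d{4})\b")


def nhs_checksum_ok(digits: str) -> bool:
    """Return True if a 10-digit string passes the NHS number Modulus-11 check."""
    if len(digits) != 10 or not (digits.isascii() and digits.isdigit()):
        return False
    # Weight the first nine digits 10, 9, ..., 2 and sum the products.
    total = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    # A computed check digit of 10 means the number is invalid.
    return check != 10 and check == int(digits[9])


def find_nhs_numbers(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, digits) for every candidate that passes the checksum."""
    hits = []
    for match in NHS_CANDIDATE.finditer(text):
        digits = "".join(match.groups())
        if nhs_checksum_ok(digits):
            hits.append((match.start(), match.end(), digits))
    return hits


# The second number differs only in its last digit and fails the checksum.
sample = "NHS No: 943 476 5919 (ext. 943 476 5918)"
print(find_nhs_numbers(sample))
```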
---
## Anonymisation policy
| Mode | Behaviour |
|---|---|
| **Pseudonymise** (default) | Realistic surrogates generated with Faker (`en_GB`); each surrogate is stable per patient through the Trust-local vault; intervals between dates are preserved by applying one consistent random shift per patient |
| **Redact** | `[ENTITY_TYPE]` placeholder tags |
| `NRP` | Always redacted regardless of mode |
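
The policy above can be illustrated with a small sketch. It assumes a hypothetical in-memory `Vault` class in place of the Trust-local vault, a seeded `random.Random` in place of whatever randomness source the tool uses, and a plain dictionary of surrogates standing in for Faker output. The functions `shift_date` and `anonymise` are illustrative names, not NoteGuard's API.

```python
import random
from datetime import date, timedelta


class Vault:
    """Hypothetical in-memory stand-in for the Trust-local vault."""

    def __init__(self, seed: int = 0, max_shift_days: int = 365) -> None:
        self._rng = random.Random(seed)  # assumption: a real vault would persist its state
        self._max_shift_days = max_shift_days
        self._shift_by_patient: dict[str, int] = {}

    def shift_days(self, patient_id: str) -> int:
        """Draw a non-zero shift once per patient, then always reuse it."""
        if patient_id not in self._shift_by_patient:
            magnitude = self._rng.randint(1, self._max_shift_days)
            self._shift_by_patient[patient_id] = magnitude * self._rng.choice((-1, 1))
        return self._shift_by_patient[patient_id]


def shift_date(vault: Vault, patient_id: str, original: date) -> date:
    """Apply the patient's single shift, so intervals between dates are unchanged."""
    return original + timedelta(days=vault.shift_days(patient_id))


def anonymise(text: str, spans: list[tuple[int, int, str]], mode: str,
              surrogates: dict[str, str]) -> str:
    """Replace (start, end, entity_type) spans; NRP is redacted in every mode."""
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, entity_type in sorted(spans, reverse=True):
        if mode == "redact" or entity_type == "NRP":
            replacement = f"[{entity_type}]"
        else:
            replacement = surrogates.get(entity_type, f"[{entity_type}]")
        out = out[:start] + replacement + out[end:]
    return out


vault = Vault(seed=1)
admitted, discharged = date(2024, 1, 10), date(2024, 3, 1)
note = "Jane Doe, British, seen 2024-01-10"
spans = [(0, 8, "PERSON"), (10, 17, "NRP")]
print(anonymise(note, spans, "pseudonymise", {"PERSON": "Alex Smith"}))
print(shift_date(vault, "p1", discharged) - shift_date(vault, "p1", admitted))
```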
---
## Performance (`en_core_web_lg`)
| Entity | Recall |
|---|---|
| Names | 100% |
| NHS number | ~100% |
| Date of birth | 100% |
| Places / sites | Improving; previously low because of an ORG/LOCATION label mismatch, which is now fixed |

**Residual leakage target:** 0 known identifiers surviving sanitisation (gates SDE pool admission).
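
A leakage gate of this kind could look roughly like the sketch below. It assumes the known identifiers for a note are available as plain strings (for example from the patient table) and compares them after stripping whitespace and hyphens. The helper names are hypothetical. The normalisation is deliberately loose, so it may flag harmless coincidental matches rather than miss a reformatted identifier.

```python
import re


def _normalise(value: str) -> str:
    """Drop whitespace and hyphens and fold case, so '943 476 5919' matches '943-476-5919'."""
    return re.sub(r"[\s\-]+", "", value).casefold()


def residual_leaks(sanitised: str, known_identifiers: list[str]) -> list[str]:
    """Return every known identifier that still appears in the sanitised text."""
    haystack = _normalise(sanitised)
    return [k for k in known_identifiers if _normalise(k) and _normalise(k) in haystack]


def admit_to_pool(sanitised: str, known_identifiers: list[str]) -> bool:
    """Admit a note only when zero known identifiers survive sanitisation."""
    return not residual_leaks(sanitised, known_identifiers)


known = ["Jane Doe", "943 476 5919"]
print(residual_leaks("Pt [PERSON], NHS 943-476-5919", known))
print(admit_to_pool("Pt [PERSON], NHS [UK_NHS]", known))
```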
---
## Bias and fairness
The `PERSON` NER (spaCy `en_core_web_lg`) is trained largely on Western / English text, so **name
recall can be lower for names of non-English origin**. This is an equity risk: under-detection means
those patients carry a *higher residual re-identification risk*. Current position and mitigations:
- The checksum / context rules (NHS number, DOB, postcode, GMC/NMC/ODS, NINO) are **name-agnostic**,
so structured identifiers are detected uniformly regardless of patient demographics.
- The **human review queue** surfaces low-confidence name spans for IG analyst confirmation.
- **Required before deployment:** evaluate name recall *stratified by name origin / ethnicity coding*
on representative Trust data and report the disparity. Not yet done — evaluation is on synthetic data.
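
The stratified evaluation called for above could be computed along these lines. This is a sketch only: `recall_by_group` and `recall_gap` are hypothetical helpers, the group labels are placeholders, and the toy records are made up to exercise the code, not measured results.

```python
from collections import defaultdict


def recall_by_group(records: list[tuple[str, bool]]) -> dict[str, float]:
    """records holds one (group, was_detected) pair per ground-truth name."""
    detected: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for group, was_detected in records:
        total[group] += 1
        detected[group] += int(was_detected)
    return {group: detected[group] / total[group] for group in total}


def recall_gap(recalls: dict[str, float]) -> float:
    """Disparity to report: best-served group minus worst-served group."""
    return max(recalls.values()) - min(recalls.values())


# Toy placeholder data, NOT evaluation results.
toy_records = [("group_a", True)] * 4 + [("group_b", True)] * 3 + [("group_b", False)]
recalls = recall_by_group(toy_records)
print(recalls, recall_gap(recalls))
```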
---
## Human-in-the-loop
Low-confidence detections (model score below the auto-confirm threshold) are:
1. Still **redacted** for safety.
2. Flagged in the **review queue** with context snippet and confidence score.
3. Surfaced to an IG analyst for confirmation before the note enters the SDE pool.

This matches the real NHS Information Governance workflow and makes the tool's accountability explicit.
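
The three steps above can be sketched as a small triage function. The threshold value, the `Detection` and `ReviewItem` types, and the `triage` function are all assumptions made for illustration; the source does not state the actual threshold or data model. The context snippet contains raw text, so in this sketch the queue is assumed to stay inside the Trust.

```python
from dataclasses import dataclass

AUTO_CONFIRM_THRESHOLD = 0.85  # assumed value, for illustration only


@dataclass
class Detection:
    start: int
    end: int
    entity_type: str
    score: float


@dataclass
class ReviewItem:
    entity_type: str
    score: float
    snippet: str  # surrounding raw text, shown to the IG analyst


def triage(text: str, detections: list[Detection],
           threshold: float = AUTO_CONFIRM_THRESHOLD,
           window: int = 20) -> tuple[list[Detection], list[ReviewItem]]:
    """Redact every detection; additionally queue the low-confidence ones."""
    queue = []
    for d in detections:
        if d.score < threshold:
            lo = max(0, d.start - window)
            hi = min(len(text), d.end + window)
            queue.append(ReviewItem(d.entity_type, d.score, text[lo:hi]))
    # Low-confidence spans are still redacted for safety, so nothing is dropped.
    return list(detections), queue


text = "Seen by Dr Okonkwo, NHS 9434765919"
detections = [Detection(11, 18, "PERSON", 0.40), Detection(24, 34, "UK_NHS", 0.99)]
to_redact, review_queue = triage(text, detections)
print(len(to_redact), review_queue)
```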
---
## NHS Five Safes mapping
| Safe | Status | How |
|---|---|---|
| **Safe data** | ✅ | De-identified to DAPB1523/ICO standard; leakage-gated |
| **Safe settings** | ✅ | Processing inside Trust; raw data and vault gitignored |
| **Safe outputs** | ✅ | Only de-identified text + content-free audit logs leave |
| **Safe people** | ⚠️ | IG analyst review queue; vault stays Trust-local; pseudonymised output is treated as personal data under UK GDPR |
| **Safe projects** | ⚠️ | Technical layer only; DPIA + project approval (DARS) remain Trust processes |
---
## Limitations and caveats
- **Pseudonymised data is still personal data** under UK GDPR — the vault is the re-identification key and must stay Trust-local.
- **Precision is a conservative lower bound**: the evaluation ground truth covers only the patient table, so correctly detected clinician names and unlisted locations are counted as false positives.
- **Not clinically validated**: evaluated on the `NHSEDataScience/synthetic_clinical_notes` dataset. Real deployment requires validation on representative Trust data.
- **Clinical transformer models** (e.g. `obi/deid_roberta_i2b2`) were tested and performed worse on UK names than `en_core_web_lg` (i2b2 training data is US-centric).
- **Governance prerequisites for deployment:** a Data Protection Impact Assessment (DPIA), IG /
Caldicott sign-off, and DARS project approval are required before any real use. NoteGuard is the
technical control, not the approval.
---
## Adoption path
```
NHS Trust (raw notes)
▼ NoteGuard gate (runs inside Trust)
▼ de-identified notes + audit log
▼ NHS SDE / FDP shared pool
▼ Federated AI
```
Same privacy model as OpenSAFELY: *code comes to the data, data never leaves*.
---
*NoteGuard · Encode Vibe Coding Hackathon — FLock Sovereign AI Challenge · internal use only*