	# NoteGuard — Tool Card

	Version: 0.0.1
	Track: Public Sector & Citizen Services — NHS Secure Data Environment on-ramp
	Status: Hackathon prototype; not validated for production use without further evaluation.

	---

	## Specification

	| Field | Value |
	|---|---|
	| Description | De-identification gate that detects and removes PII from free-text NHS clinical notes |
	| Type | Hybrid pipeline: pure-Python rules + Microsoft Presidio (spaCy `en_core_web_lg` NER). No model is trained; pre-trained components are composed. |
	| Developer | Encode Vibe Coding Hackathon team (fork of `NoteGuard/`) |
	| Status / version | Prototype · v0.0.1 |
	| Repository | github.com/chaeyoonyunakim/automatic-pii-preprocessing-tool |

	> Documented as a tool card, not a model card. NoteGuard trains no model, so the NHS England
	> model card template's training / hyperparameter / data-split sections do not apply. A gov.uk
	> Algorithmic Transparency Recording Standard (ATRS) record is provided in [`report.md`](report.md).

	---

	## What it does

	NoteGuard is a de-identification gate for free-text NHS clinical notes. It detects and removes patient and clinician PII inside a Trust before any text reaches a Secure Data Environment (SDE), federated training round, or cross-Trust sharing layer.

	> "AI detects, humans review, audit logs account."

	---

	## Who uses it

	| Role | When | Why |
	|---|---|---|
	| Data Wrangler / IG Analyst | Before releasing notes to research or AI teams | Cannot share raw free-text; must prove zero identifier leakage |
	| SDE Operator | At the Trust boundary ingestion point | Gate between Trust raw data and the shared pool |
	| Federated AI platform | Before each training round | Needs de-identified text; cannot inspect raw Trust data |

	---

	## Use cases out of scope

	- Not a substitute for Information Governance sign-off, a DPIA, or DARS approval — it is a
	technical control, not a legal basis for processing.
	- Not validated on real Trust data, non-English notes, or scanned / handwritten documents.
	- Not a guarantee of zero re-identification: pseudonymised output is still personal data, and
	residual leakage is measured, not assumed to be zero on unseen data.
	- Not for clinical decision-making, or any use of note content beyond de-identification.

	---

	## Detection coverage

	| Entity type | Method | Notes |
	|---|---|---|
	| Patient name (`PERSON`) | spaCy `en_core_web_lg` NER | 100% recall on the synthetic benchmark |
	| NHS number (`UK_NHS`) | Regex + Modulus-11 checksum + 9-digit context anchor | Catches both standard and synthetic dataset forms |
	| Date of birth (`DATE_TIME`) | Presidio + date regex | 100% recall on the synthetic benchmark |
	| Site / hospital name (`LOCATION`) | spaCy NER + rule-based suffix anchor | "X Hospital / Infirmary / NHS Trust" patterns (ORGANIZATION is excluded because it over-tags labels) |
	| UK postcode (`UK_POSTCODE`) | Regex | Outward-code only after pseudonymisation |
	| Clinician GMC / NMC (`GMC`, `NMC`) | Context-anchored regex | "GMC 1234567", "NMC PIN 12A3456B" |
	| ODS org code (`NHS_ODS`) | Context-anchored regex | "ODS A12345", "Practice Code A12345" |
	| Record / document UUID (`RECORD_ID`) | UUID regex | Quasi-identifier |
	| Email / phone / NINO | Presidio built-ins | Standard patterns |
	| Nationality / religion / political (`NRP`) | Presidio | Always redacted; never pseudonymised (UK GDPR Art. 9) |
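
To make the `UK_NHS` row concrete, here is a minimal sketch of a regex candidate match followed by the standard Modulus-11 check on 10-digit NHS numbers. It is not NoteGuard's actual recogniser: the pattern `NHS_CANDIDATE` and the function names are hypothetical, and the 9-digit context anchor for synthetic dataset forms is not covered.

```python
import re

# Hypothetical candidate pattern: 10 digits, optionally grouped 3-3-4
# with single spaces or hyphens. NoteGuard's real pattern may differ.
NHS_CANDIDATE = re.compile(r"\b(\d{3})[ -]?(\d{3})[ -]?(\d{4})\b")


def nhs_checksum_ok(digits: str) -> bool:
    """Return True if a 10-digit string passes the Modulus-11 check."""
    if len(digits) != 10 or not (digits.isascii() and digits.isdigit()):
        return False
    # Weight the first nine digits 10, 9, ..., 2 and sum the products.
    total = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    # A computed check digit of 10 means the number is invalid.
    return check != 10 and check == int(digits[9])


def find_nhs_numbers(text: str) -> list[str]:
    """Return checksum-valid NHS number candidates found in text."""
    found = []
    for match in NHS_CANDIDATE.finditer(text):
        digits = "".join(match.groups())
        if nhs_checksum_ok(digits):
            found.append(digits)
    return found
```

The checksum step is what keeps arbitrary 10-digit runs (phone fragments, reference numbers) from being flagged purely on shape.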

	---

	## Anonymisation policy

	| Mode | Behaviour |
	|---|---|
	| Pseudonymise (default) | Faker(en_GB) realistic surrogates; stable per patient via Trust-local vault; date intervals preserved by consistent random shift |
	| Redact | `[ENTITY_TYPE]` placeholder tags |
	| `NRP` | Always redacted regardless of mode |
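
The "consistent random shift" idea can be sketched as follows: each patient gets one random day offset, stored once and reused, so every date for that patient moves by the same amount and intervals between dates are unchanged. The `ShiftVault` class, its in-memory dictionary, and the 365-day bound are assumptions for illustration, not NoteGuard's vault implementation.

```python
import random
from datetime import date, timedelta


class ShiftVault:
    """In-memory stand-in (hypothetical) for a Trust-local vault of offsets."""

    def __init__(self, max_days=365, rng=None):
        self._offsets = {}
        self._max_days = max_days
        # SystemRandom draws from the OS entropy source; a seeded
        # random.Random can be passed in for reproducible tests.
        self._rng = rng if rng is not None else random.SystemRandom()

    def offset_for(self, patient_id: str) -> int:
        """Return the patient's stored offset, drawing it on first use."""
        if patient_id not in self._offsets:
            shift = 0
            while shift == 0:  # a zero shift would leave dates unchanged
                shift = self._rng.randint(-self._max_days, self._max_days)
            self._offsets[patient_id] = shift
        return self._offsets[patient_id]

    def shift(self, patient_id: str, d: date) -> date:
        """Shift a date by the patient's fixed offset."""
        return d + timedelta(days=self.offset_for(patient_id))
```

Because the offset is per patient rather than per date, a 51-day gap between admission and follow-up is still 51 days after pseudonymisation. The stored offsets are re-identification material and would need to stay inside the Trust with the rest of the vault.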

	---

	## Performance (`en_core_web_lg`)

	| Entity | Recall |
	|---|---|
	| Names | 100% |
	| NHS number | ~100% |
	| Date of birth | 100% |
	| Places / sites | Improving; previously low because of an ORG/LOCATION mismatch, now fixed |

	Residual leakage target: 0 known identifiers surviving sanitisation (gates SDE pool admission).
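
A residual leakage gate of this kind can be sketched as a check that none of a note's known identifiers still appear in the sanitised text. The function names and the case-insensitive substring match are assumptions; a real check would likely also normalise spacing and formatting variants (for example `943 476 5919` versus `9434765919`).

```python
def residual_leaks(sanitised_text: str, known_identifiers: list[str]) -> list[str]:
    """Return the known identifiers that still appear in the sanitised text.

    Matching here is a plain case-insensitive substring test, which is an
    illustrative simplification.
    """
    haystack = sanitised_text.casefold()
    return [
        ident
        for ident in known_identifiers
        if ident and ident.casefold() in haystack
    ]


def passes_leakage_gate(sanitised_text: str, known_identifiers: list[str]) -> bool:
    """Admit a note only if zero known identifiers survive sanitisation."""
    return not residual_leaks(sanitised_text, known_identifiers)
```

A gate like this only measures leakage of identifiers that are already known for the note, so passing it is evidence about those identifiers, not proof that no PII remains.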

	---

	## Bias and fairness

	The `PERSON` NER (spaCy `en_core_web_lg`) is trained largely on Western / English text, so **name
	recall can be lower for names of non-English origin**. This is an equity risk: under-detection means
	those patients carry a higher residual re-identification risk. Current position and mitigations:

	- The checksum / context rules (NHS number, DOB, postcode, GMC/NMC/ODS, NINO) are name-agnostic,
	so structured identifiers are detected uniformly regardless of patient demographics.
	- The human review queue surfaces low-confidence name spans for IG analyst confirmation.
	- Required before deployment: evaluate name recall stratified by name origin / ethnicity coding
	on representative Trust data and report the disparity. Not yet done — evaluation is on synthetic data.
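
The stratified evaluation described in the last bullet could be computed along these lines. This is a sketch under assumptions: the input is one `(group, detected)` pair per ground-truth name, and the group labels stand in for whatever name-origin or ethnicity coding the Trust data provides.

```python
from collections import defaultdict


def recall_by_group(records):
    """Compute name recall per group.

    records: iterable of (group, detected) pairs, one per ground-truth
    name, where detected is True if the pipeline flagged that name.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, detected in records:
        totals[group] += 1
        hits[group] += bool(detected)
    return {group: hits[group] / totals[group] for group in totals}


def recall_gap(per_group):
    """Disparity as the difference between best and worst group recall."""
    values = list(per_group.values())
    return max(values) - min(values)
```

Reporting the per-group figures alongside the gap, rather than a single pooled recall, is what makes under-detection for a smaller group visible.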

	---

	## Human-in-the-loop

	Low-confidence detections (model score below auto-confirm threshold) are:
	1. Still redacted for safety.
	2. Flagged in the review queue with context snippet and confidence score.
	3. Surfaced to an IG analyst for confirmation before the note enters the SDE pool.

	This matches the real NHS Information Governance workflow and makes the tool's accountability explicit.
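
The three steps above can be sketched as a small triage function: every detection is redacted, and those scoring below the threshold are also queued for review with a context snippet. The `Detection` dataclass, the `0.85` threshold, and the queue record layout are hypothetical; spans are assumed not to overlap.

```python
from dataclasses import dataclass

AUTO_CONFIRM_THRESHOLD = 0.85  # hypothetical value, not taken from NoteGuard


@dataclass
class Detection:
    entity_type: str
    start: int
    end: int
    score: float


def triage(text, detections, threshold=AUTO_CONFIRM_THRESHOLD, context_chars=20):
    """Redact all detections and queue the low-confidence ones for review."""
    review_queue = []
    for det in detections:
        if det.score < threshold:
            lo = max(0, det.start - context_chars)
            hi = min(len(text), det.end + context_chars)
            review_queue.append(
                {
                    "entity_type": det.entity_type,
                    "score": det.score,
                    # Raw snippet: for analyst review inside the Trust only.
                    "snippet": text[lo:hi],
                }
            )
    # Replace spans right to left so earlier offsets stay valid.
    redacted = text
    for det in sorted(detections, key=lambda d: d.start, reverse=True):
        redacted = redacted[: det.start] + f"[{det.entity_type}]" + redacted[det.end :]
    return redacted, review_queue
```

Redacting low-confidence spans as well as queueing them means a missed review never results in a leak, only in possible over-redaction.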

	---

	## NHS Five Safes mapping

	| Safe | Status | How |
	|---|---|---|
	| Safe data | ✅ | De-identified to DAPB1523/ICO standard; leakage-gated |
	| Safe settings | ✅ | Processing inside Trust; raw data and vault gitignored |
	| Safe outputs | ✅ | Only de-identified text + content-free audit logs leave |
	| Safe people | ⚠️ | IG analyst review queue; vault stays Trust-local; pseudonymised output treated as personal data under UK GDPR |
	| Safe projects | ⚠️ | Technical layer only; DPIA + project approval (DARS) remain Trust processes |

	---

	## Limitations and caveats

	- Pseudonymised data is still personal data under UK GDPR — the vault is the re-identification key and must stay Trust-local.
	- Precision is a conservative lower bound: clinician names and unlisted locations correctly detected count as false positives in the evaluation (ground truth is patient-table-only).
	- Not clinically validated: evaluated on the `NHSEDataScience/synthetic_clinical_notes` dataset. Real deployment requires validation on representative Trust data.
	- Clinical transformer models (e.g. `obi/deid_roberta_i2b2`) were tested and performed worse on UK names than `en_core_web_lg` (i2b2 training data is US-centric).
	- Governance prerequisites for deployment: a Data Protection Impact Assessment (DPIA), IG /
	Caldicott sign-off, and DARS project approval are required before any real use. NoteGuard is the
	technical control, not the approval.

	---

	## Adoption path

	```
	NHS Trust (raw notes)
	│
	▼ NoteGuard gate (runs inside Trust)
	│
	▼ de-identified notes + audit log
	│
	▼ NHS SDE / FDP shared pool
	│
	▼ Federated AI
	```

	Same privacy model as OpenSAFELY: code comes to the data, data never leaves.

	---

	NoteGuard · Encode Vibe Coding Hackathon — FLock Sovereign AI Challenge · internal use only
