Spaces:

chaeyoona
/

noteguard

Running

App Files Files Community

noteguard / docs /report.md

yumi.h

Add de-identified download, remove FLock.io tech refs, add hackathon footer

6f7e511 9 days ago

preview code

Raw

History Blame Contribute Delete

8.36 kB

	# Algorithmic Transparency Record — NoteGuard

	> Illustrative record following the UK government [Algorithmic Transparency Recording Standard
	> (ATRS)](https://www.gov.uk/government/collections/algorithmic-transparency-recording-standard-hub),
	> modelled on the [NHS.UK Reviews Automoderation Tool record](https://www.gov.uk/algorithmic-transparency-records/nhs-england-nhs-dot-uk-reviews-automoderation-tool).
	> NoteGuard is a hackathon prototype evaluated on synthetic data — this is not an official
	> published ATRS entry, but is structured so it could become one.

	## Tier 1 — Summary

	- Name: NoteGuard — NHS clinical-note de-identification gate
	- Description: Detects and removes patient/clinician PII from free-text NHS clinical notes
	inside a Trust, so only de-identified text leaves for a Secure Data Environment (SDE) or
	federated AI. Combines pure-Python rules with Microsoft Presidio (spaCy NER). No model is trained.
	- Website / repository: https://github.com/chaeyoonyunakim/automatic-pii-preprocessing-tool
	- Contact: via GitHub issues on the repository (maintainer @chaeyoonyunakim)

	---

	## Tier 2

	### 1. Owner and responsibility
	- 1.1 Organisation: Encode Vibe Coding Hackathon team — FLock Sovereign AI Challenge (fork of `NoteGuard/`).
	- 1.2 Team: Project contributors (see repository history / `docs/CHANGELOG.md`).
	- 1.3 Senior responsible owner: None — prototype, not in service. An SRO would be required before deployment.
	- 1.4 External supplier involvement: No commercial supplier. Built on open-source components
	(Microsoft Presidio, spaCy `en_core_web_lg`, Faker).

	### 2. Description and rationale
	- 2.1 Detailed description: A note is cleaned of mojibake, then scanned by a `RuleDetector`
	(checksum/context rules) unioned with a `PresidioDetector` (spaCy NER for `PERSON`/`LOCATION` +
	built-ins). Detected spans are removed by a transform — redaction (`[type]` tags) or
	pseudonymisation (realistic, patient-consistent Faker surrogates; valid fake NHS numbers;
	consistent date-of-birth shift). A content-free audit log records what was removed.
	- 2.2 Scope: Free-text English NHS clinical notes. Evaluated on the `NHSEDataScience/synthetic_clinical_notes`
	dataset only. Not evaluated on real Trust data, other languages, or scanned documents.
	- 2.3 Benefit: Enables cross-Trust / federated AI without sharing raw PHI ("sanitise at source"),
	with a measured residual-leakage rate rather than an unverified assurance.
	- 2.4 Previous process: Manual redaction by an analyst, or — more commonly — free-text notes simply
	not shared because the re-identification risk could not be quantified.
	- 2.5 Alternatives considered: Manual redaction (does not scale, inconsistent); Presidio alone
	(misses the dataset's 9-digit NHS numbers and UK staff/org identifiers); a clinical transformer
	(`obi/deid_roberta_i2b2`, tested — worse on UK names, US-trained). Rejected in favour of the
	rules + Presidio hybrid.

	### 3. Decision-making process
	- 3.1 Process integration: Sits at the Trust egress boundary. It supports an IG decision and
	automatically removes high-confidence PII; low-confidence spans are still removed but flagged.
	- 3.2 Information provided to reviewers: entity type, confidence score, surrounding context snippet,
	and per-note audit counts (never raw values in the shareable log).
	- 3.3 Frequency and scale: Prototype, batch-oriented. Benchmarked on 1,602 notes / 1,027 known-PII
	occurrences.
	- 3.4 Human decisions and review: An IG analyst reviews the review queue (spans scored between
	the review and auto-confirm thresholds) and makes the final call before a note enters the SDE pool.
	- 3.5 Required training: Reviewers need training on the tool's limitations (esp. name-recall bias),
	the residual-leakage metric, and the escalation route.
	- 3.6 Appeals / redress: Not a citizen-facing decision system, so no external appeal. Internally,
	any missed identifier found downstream is corrected and fed back into the recogniser rules/tests.

	### 4. Tool specification
	- 4.1.1 System architecture: Python package (`src/`) run inside the Trust; `RuleDetector` +
	`PresidioDetector` behind one `Detector` interface; Streamlit demo UI; deployable as a Docker
	Hugging Face Space. Raw notes and the re-identification vault never leave the Trust.
	- 4.1.2 Phase: Prototype (hackathon) — not deployed.
	- 4.1.3 Maintenance: CI (`ruff` + `pytest`) on every change; residual leakage acts as a regression
	gate; recognisers re-evaluated when the data or rules change.
	- 4.1.4 Components: (a) pure-Python rule recognisers; (b) Presidio analyzer with spaCy
	`en_core_web_lg` + custom UK recognisers; (c) Faker pseudonymisation vault.

	4.2 Component specifications

	\| Component \| Task \| Method \| Measured (synthetic, 1,602 notes) \|
	\|---\|---\|---\|---\|
	\| Rule recognisers \| NHS number, postcode, date, phone, email, GMC/NMC/ODS, NINO, vehicle, UUID \| regex + Modulus-11 checksum + context anchors (name-agnostic) \| NHS number F1 ≈ 0.99 \|
	\| Presidio NER \| `PERSON`, `LOCATION` \| spaCy `en_core_web_lg`, score-thresholded, unioned with rules \| PERSON recall ≈ 0.68 \|
	\| Transform \| redact / pseudonymise \| per-entity policy; Faker(en_GB) vault; per-patient DOB shift \| — \|
	\| End-to-end \| residual leakage after sanitisation \| known-PII oracle from structured tables \| rules 74.8% → presidio+rules 8.5% \|

	Precision is a conservative lower bound (correctly removing PII absent from the tables counts as a
	false positive). Recall and leakage are the sound headline metrics.

	4.3 Data specification
	- 4.3.1 Source: `NHSEDataScience/synthetic_clinical_notes` (Hugging Face).
	- 4.3.2 Modality: Text (3 linked CSVs: patients, admissions, notes).
	- 4.3.3 Description: Synthetic clinical notes joined to synthetic patient/admission records on
	`person_id` / `admission_id` — the join provides free ground truth for the leakage metric (EVAL-ONLY).
	- 4.3.4 Quantities: ~70 patients, ~1,602 notes, 1,027 known-PII occurrences.
	- 4.3.5 Sensitive attributes: Synthetic names, NHS numbers, DOBs, sites — treated as if real PHI.
	- 4.3.6 Representativeness: Synthetic; not representative of real Trust notes. Real validation required.
	- 4.3.7 Source URL: https://huggingface.co/datasets/NHSEDataScience/synthetic_clinical_notes
	- 4.3.8 Collection: Generated synthetically by NHS England Data Science; downloaded at runtime.
	- 4.3.9 Cleaning: mojibake repair (`ftfy`/`_fix_mojibake`); Modulus-11 validation; table joins.
	- 4.3.10 Sharing: Only de-identified text + content-free audit logs are shareable. Raw data and the
	vault are gitignored and never committed/shared.
	- 4.3.11 Access/storage: Local to the Trust; `data/` and `outputs/` are gitignored.

	### 5. Risks, mitigations and impact assessments
	- 5.1 Impact assessment: A DPIA is required before any real deployment and has not been done
	(prototype on synthetic data). IG / Caldicott sign-off and DARS approval also required.
	- 5.2 Risks and mitigations:

	\| Risk \| Impact \| Mitigation \|
	\|---\|---\|---\|
	\| False negative (missed PII) \| Re-identification of a patient \| Name-agnostic checksum/context rules; human review queue; leakage measured and gated; recall stratification recommended \|
	\| Name-recall bias (non-English names) \| Unequal re-identification risk across demographics \| Structured-identifier rules are demographic-agnostic; human review; stratified recall evaluation required before deployment \|
	\| Over-redaction (false positive) \| Loss of clinical utility \| Pseudonymise mode preserves structure & timelines; precision reported as a conservative bound \|
	\| Vault compromise \| Re-identification via the linkage key \| Vault stays Trust-local, gitignored; treated as the re-identification key \|
	\| Pseudonymised ≠ anonymised (UK GDPR) \| Mistaken belief data is non-personal \| Stated honestly throughout; DPIA + IG sign-off required \|
	\| Pretrained-component provenance \| No control over Presidio/spaCy training data \| Composed with auditable rules + human review; alternatives documented \|

	---

	NoteGuard · Encode Vibe Coding Hackathon — FLock Sovereign AI Challenge · prototype · v0.0.1

	# Algorithmic Transparency Record — NoteGuard

	> Illustrative record following the UK government [Algorithmic Transparency Recording Standard
	> (ATRS)](https://www.gov.uk/government/collections/algorithmic-transparency-recording-standard-hub),
	> modelled on the [NHS.UK Reviews Automoderation Tool record](https://www.gov.uk/algorithmic-transparency-records/nhs-england-nhs-dot-uk-reviews-automoderation-tool).
	> NoteGuard is a hackathon prototype evaluated on synthetic data — this is not an official
	> published ATRS entry, but is structured so it could become one.

	## Tier 1 — Summary

	- Name: NoteGuard — NHS clinical-note de-identification gate
	- Description: Detects and removes patient/clinician PII from free-text NHS clinical notes
	inside a Trust, so only de-identified text leaves for a Secure Data Environment (SDE) or
	federated AI. Combines pure-Python rules with Microsoft Presidio (spaCy NER). No model is trained.
	- Website / repository: https://github.com/chaeyoonyunakim/automatic-pii-preprocessing-tool
	- Contact: via GitHub issues on the repository (maintainer @chaeyoonyunakim)

	---

	## Tier 2

	### 1. Owner and responsibility
	- 1.1 Organisation: Encode Vibe Coding Hackathon team — FLock Sovereign AI Challenge (fork of `NoteGuard/`).
	- 1.2 Team: Project contributors (see repository history / `docs/CHANGELOG.md`).
	- 1.3 Senior responsible owner: None — prototype, not in service. An SRO would be required before deployment.
	- 1.4 External supplier involvement: No commercial supplier. Built on open-source components
	(Microsoft Presidio, spaCy `en_core_web_lg`, Faker).

	### 2. Description and rationale
	- 2.1 Detailed description: A note is cleaned of mojibake, then scanned by a `RuleDetector`
	(checksum/context rules) unioned with a `PresidioDetector` (spaCy NER for `PERSON`/`LOCATION` +
	built-ins). Detected spans are removed by a transform — redaction (`[type]` tags) or
	pseudonymisation (realistic, patient-consistent Faker surrogates; valid fake NHS numbers;
	consistent date-of-birth shift). A content-free audit log records what was removed.
	- 2.2 Scope: Free-text English NHS clinical notes. Evaluated on the `NHSEDataScience/synthetic_clinical_notes`
	dataset only. Not evaluated on real Trust data, other languages, or scanned documents.
	- 2.3 Benefit: Enables cross-Trust / federated AI without sharing raw PHI ("sanitise at source"),
	with a measured residual-leakage rate rather than an unverified assurance.
	- 2.4 Previous process: Manual redaction by an analyst, or — more commonly — free-text notes simply
	not shared because the re-identification risk could not be quantified.
	- 2.5 Alternatives considered: Manual redaction (does not scale, inconsistent); Presidio alone
	(misses the dataset's 9-digit NHS numbers and UK staff/org identifiers); a clinical transformer
	(`obi/deid_roberta_i2b2`, tested — worse on UK names, US-trained). Rejected in favour of the
	rules + Presidio hybrid.

	### 3. Decision-making process
	- 3.1 Process integration: Sits at the Trust egress boundary. It supports an IG decision and
	automatically removes high-confidence PII; low-confidence spans are still removed but flagged.
	- 3.2 Information provided to reviewers: entity type, confidence score, surrounding context snippet,
	and per-note audit counts (never raw values in the shareable log).
	- 3.3 Frequency and scale: Prototype, batch-oriented. Benchmarked on 1,602 notes / 1,027 known-PII
	occurrences.
	- 3.4 Human decisions and review: An IG analyst reviews the review queue (spans scored between
	the review and auto-confirm thresholds) and makes the final call before a note enters the SDE pool.
	- 3.5 Required training: Reviewers need training on the tool's limitations (esp. name-recall bias),
	the residual-leakage metric, and the escalation route.
	- 3.6 Appeals / redress: Not a citizen-facing decision system, so no external appeal. Internally,
	any missed identifier found downstream is corrected and fed back into the recogniser rules/tests.

	### 4. Tool specification
	- 4.1.1 System architecture: Python package (`src/`) run inside the Trust; `RuleDetector` +
	`PresidioDetector` behind one `Detector` interface; Streamlit demo UI; deployable as a Docker
	Hugging Face Space. Raw notes and the re-identification vault never leave the Trust.
	- 4.1.2 Phase: Prototype (hackathon) — not deployed.
	- 4.1.3 Maintenance: CI (`ruff` + `pytest`) on every change; residual leakage acts as a regression
	gate; recognisers re-evaluated when the data or rules change.
	- 4.1.4 Components: (a) pure-Python rule recognisers; (b) Presidio analyzer with spaCy
	`en_core_web_lg` + custom UK recognisers; (c) Faker pseudonymisation vault.

	4.2 Component specifications

	\| Component \| Task \| Method \| Measured (synthetic, 1,602 notes) \|
	\|---\|---\|---\|---\|
	\| Rule recognisers \| NHS number, postcode, date, phone, email, GMC/NMC/ODS, NINO, vehicle, UUID \| regex + Modulus-11 checksum + context anchors (name-agnostic) \| NHS number F1 ≈ 0.99 \|
	\| Presidio NER \| `PERSON`, `LOCATION` \| spaCy `en_core_web_lg`, score-thresholded, unioned with rules \| PERSON recall ≈ 0.68 \|
	\| Transform \| redact / pseudonymise \| per-entity policy; Faker(en_GB) vault; per-patient DOB shift \| — \|
	\| End-to-end \| residual leakage after sanitisation \| known-PII oracle from structured tables \| rules 74.8% → presidio+rules 8.5% \|

	Precision is a conservative lower bound (correctly removing PII absent from the tables counts as a
	false positive). Recall and leakage are the sound headline metrics.

	4.3 Data specification
	- 4.3.1 Source: `NHSEDataScience/synthetic_clinical_notes` (Hugging Face).
	- 4.3.2 Modality: Text (3 linked CSVs: patients, admissions, notes).
	- 4.3.3 Description: Synthetic clinical notes joined to synthetic patient/admission records on
	`person_id` / `admission_id` — the join provides free ground truth for the leakage metric (EVAL-ONLY).
	- 4.3.4 Quantities: ~70 patients, ~1,602 notes, 1,027 known-PII occurrences.
	- 4.3.5 Sensitive attributes: Synthetic names, NHS numbers, DOBs, sites — treated as if real PHI.
	- 4.3.6 Representativeness: Synthetic; not representative of real Trust notes. Real validation required.
	- 4.3.7 Source URL: https://huggingface.co/datasets/NHSEDataScience/synthetic_clinical_notes
	- 4.3.8 Collection: Generated synthetically by NHS England Data Science; downloaded at runtime.
	- 4.3.9 Cleaning: mojibake repair (`ftfy`/`_fix_mojibake`); Modulus-11 validation; table joins.
	- 4.3.10 Sharing: Only de-identified text + content-free audit logs are shareable. Raw data and the
	vault are gitignored and never committed/shared.
	- 4.3.11 Access/storage: Local to the Trust; `data/` and `outputs/` are gitignored.

	### 5. Risks, mitigations and impact assessments
	- 5.1 Impact assessment: A DPIA is required before any real deployment and has not been done
	(prototype on synthetic data). IG / Caldicott sign-off and DARS approval also required.
	- 5.2 Risks and mitigations:

	\| Risk \| Impact \| Mitigation \|
	\|---\|---\|---\|
	\| False negative (missed PII) \| Re-identification of a patient \| Name-agnostic checksum/context rules; human review queue; leakage measured and gated; recall stratification recommended \|
	\| Name-recall bias (non-English names) \| Unequal re-identification risk across demographics \| Structured-identifier rules are demographic-agnostic; human review; stratified recall evaluation required before deployment \|
	\| Over-redaction (false positive) \| Loss of clinical utility \| Pseudonymise mode preserves structure & timelines; precision reported as a conservative bound \|
	\| Vault compromise \| Re-identification via the linkage key \| Vault stays Trust-local, gitignored; treated as the re-identification key \|
	\| Pseudonymised ≠ anonymised (UK GDPR) \| Mistaken belief data is non-personal \| Stated honestly throughout; DPIA + IG sign-off required \|
	\| Pretrained-component provenance \| No control over Presidio/spaCy training data \| Composed with auditable rules + human review; alternatives documented \|

	---

	NoteGuard · Encode Vibe Coding Hackathon — FLock Sovereign AI Challenge · prototype · v0.0.1