Fix pseudonym mis-mapping from spaCy ORGANIZATION over-tagging
Reported: "NHS" and "DOB 02/03/1981" were tagged ORGANIZATION and pseudonymised
as ORGANIZATION_NNN, and "Manchester Royal Infirmary"/"Ward 9" mapping was garbled.
Root cause: spaCy lg over-tags labels/abbreviations as ORG, which (a) false-flags
"NHS"/"GMC", (b) swallows the rule DATE inside "DOB 02/03/1981", and (c) produces
a span partially overlapping the _SITE_RE LOCATION, corrupting the output on
replacement.
- detect.py: drop ORGANIZATION from PresidioDetector.KEEP.
- detect.py: _merge now returns DISJOINT spans and, on overlap, prefers precise
rule entities (date/NHS/GMC/…) over broad NER spans — no more text corruption.
- recognizers.py: _SITE_RE now also matches "… Trust" so NHS site names that ORG
used to catch are still caught as LOCATION.
- CLAUDE.md: document the reversal + overlap-safe merge.
Result: "Pt [PERSON], NHS no [UK_NHS], DOB [DATE_TIME], lives [UK_POSTCODE].
[LOCATION] Ward 9. Reviewed by [PERSON], GMC [GMC]." Tests pass (23).
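Root cause (c) can be reproduced in isolation. The sketch below assumes the transform substitutes spans right-to-left by character offset; the transform itself is not part of this commit, so `replace_spans` and the example offsets are hypothetical. It shows how a span that partially overlaps the site name garbles the output, while disjoint spans do not.

```python
def replace_spans(text, spans):
    """Replace (start, end, label) spans right-to-left by character offset.

    Hypothetical stand-in for the project's transform. Offsets stay valid
    only if the spans are disjoint.
    """
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


text = "Seen at Manchester Royal Infirmary Ward 9."

# Rule span covering "Manchester Royal Infirmary".
site = (text.index("Manchester"), text.index(" Ward"), "LOCATION")
# Hypothetical spurious NER span covering "Infirmary Ward" (partial overlap).
org = (text.index("Infirmary"), text.index(" 9"), "ORGANIZATION")

overlapping = replace_spans(text, [site, org])
disjoint = replace_spans(text, [site])

print(overlapping)  # Seen at [LOCATION]TION] 9.
print(disjoint)     # Seen at [LOCATION] Ward 9.
```

The second replacement reuses offsets computed against the original text, so it cuts into the placeholder written by the first. Returning disjoint spans from `_merge` removes that failure mode regardless of how the transform orders its replacements.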
- CLAUDE.md +6 -3
- noteguard/detect.py +22 -8
- noteguard/recognizers.py +1 -1
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -38,12 +38,15 @@ python -m pytest tests/ -v
 ## Decisions locked in (version 1 branch)
 - **Default model: `en_core_web_lg`** — 100% name recall vs 91% for sm; clinical transformer
 (`obi/deid_roberta_i2b2`) was tested and performed worse on UK names (US i2b2 training data).
-- **ORGANIZATION…
-…
+- **ORGANIZATION excluded from PresidioDetector.KEEP** — spaCy lg over-tags labels/abbreviations
+("NHS", "DOB …", "GMC") as ORG, causing false positives and swallowing precise rule spans. NHS
+site names are caught by the `_SITE_RE` LOCATION rule (incl. "… Trust") instead.
+- **`_merge` is overlap-safe + priority-ranked** — output spans are disjoint (no transform
+corruption); on overlap, precise rule entities (date/NHS/GMC/…) beat broad NER spans.
 - **Human-in-the-loop review queue** — spans with score in `[review_threshold, score_threshold)`
 are redacted but flagged `needs_review=True` for IG analyst review before SDE pool admission.
 - **Places recall** — low recall (0–0.7) was mostly generic "ward"/"bay" in GT (now filtered by
-`_GENERIC`)
+`_GENERIC`); NHS site names are caught by the `_SITE_RE` LOCATION rule in recognizers.
 
 ## Gotchas
 - Note text has mojibake (`·`) — `_fix_mojibake` runs before detection.
--- a/noteguard/detect.py
+++ b/noteguard/detect.py
@@ -35,12 +35,13 @@ class PresidioDetector:
 
     name = "presidio+rules"
 
-    # Presidio entity types we keep. ORGANIZATION is…
-    # (…
-    # …
+    # Presidio entity types we keep. ORGANIZATION is deliberately EXCLUDED: spaCy lg
+    # over-tags abbreviations/labels ("NHS", "DOB …", "GMC") as ORG, which both creates
+    # false positives and swallows precise rule spans (dates, NHS numbers). NHS site
+    # names are caught instead by the _SITE_RE LOCATION rule (incl. "… Trust").
     KEEP = {
         "PERSON", "DATE_TIME", "EMAIL_ADDRESS", "PHONE_NUMBER",
-        "LOCATION", "…
+        "LOCATION", "UK_NHS", "UK_NINO", "UK_PASSPORT",
         "UK_VEHICLE_REGISTRATION", "IP_ADDRESS", "URL",
     }
 
@@ -114,14 +115,27 @@ class PresidioDetector:
         return _merge(spans)
 
 
+# Precise pattern/checksum entities win over broad NER spans (PERSON/LOCATION) when
+# they overlap — e.g. a rule DATE inside a spurious NER span should survive as the date.
+_PRECISE = {
+    "UK_NHS", "DATE_TIME", "EMAIL_ADDRESS", "PHONE_NUMBER", "UK_POSTCODE",
+    "UK_NINO", "UK_VEHICLE_REGISTRATION", "UK_PASSPORT", "GMC", "NMC",
+    "NHS_ODS", "RECORD_ID",
+}
+
+
 def _merge(spans: list[Span]) -> list[Span]:
-    """…
-    …
+    """Return disjoint spans. On overlap, prefer precise rule entities, then the longer,
+    higher-scoring span. Disjoint output guarantees the transform can't corrupt text."""
+    def rank(s: Span):
+        return (1 if s.entity_type in _PRECISE else 0, s.end - s.start, s.score)
+
     kept: list[Span] = []
-    for s in spans:
-        if any(…
+    for s in sorted(spans, key=rank, reverse=True):
+        if any(s.start < k.end and k.start < s.end for k in kept):  # overlaps a kept span
             continue
         kept.append(s)
+    kept.sort(key=lambda s: s.start)
     return kept
 
 
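The new `_merge` can be exercised on its own. The following is a minimal sketch, not the project's module: `Span` is a hypothetical dataclass carrying only the four attributes the diff uses (`start`, `end`, `entity_type`, `score`), `_PRECISE` is abbreviated, and the sample text, offsets and scores are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    # Hypothetical stand-in for the project's Span type.
    start: int
    end: int
    entity_type: str
    score: float


# Abbreviated copy of the _PRECISE set from the diff.
_PRECISE = {"UK_NHS", "DATE_TIME", "UK_POSTCODE", "GMC"}


def merge(spans):
    """Greedy selection: best-ranked span first, skip anything overlapping a kept span."""
    def rank(s):
        # Precise rule entities first, then longer spans, then higher scores.
        return (1 if s.entity_type in _PRECISE else 0, s.end - s.start, s.score)

    kept = []
    for s in sorted(spans, key=rank, reverse=True):
        if any(s.start < k.end and k.start < s.end for k in kept):
            continue
        kept.append(s)
    kept.sort(key=lambda s: s.start)  # restore document order
    return kept


text = "Dr Patel, DOB 02/03/1981"
spans = [
    Span(10, 24, "LOCATION", 0.85),   # spurious NER span over "DOB 02/03/1981"
    Span(14, 24, "DATE_TIME", 0.60),  # rule span over "02/03/1981"
    Span(3, 8, "PERSON", 0.85),       # disjoint NER span over "Patel"
]
merged = merge(spans)
print([(s.entity_type, text[s.start:s.end]) for s in merged])
# [('PERSON', 'Patel'), ('DATE_TIME', '02/03/1981')]
```

The date survives even though the overlapping NER span is longer and scores higher, because membership in `_PRECISE` is the first element of the rank tuple. One consequence of the greedy choice is that a dropped broad span is discarded whole rather than trimmed, so text covered only by the losing span ("DOB " here) is left unredacted.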
--- a/noteguard/recognizers.py
+++ b/noteguard/recognizers.py
@@ -89,7 +89,7 @@ _VEHICLE_RE = re.compile(r"\b[A-Z]{2}\d{2}\s?[A-Z]{3}\b")
 # Context-anchored to title-case to avoid flagging generic lowercase mentions.
 _SITE_RE = re.compile(
     r"\b(?:[A-Z][A-Za-z']+\s+){1,4}"
-    r"(?:Hospital|Infirmary|…
+    r"(?:Hospital|Infirmary|Trust|Medical\s+Centre|Health\s+Centre|Clinic|Surgery)\b"
 )
 
 # (regex, entity_type, capture_group): group 0 = whole match, 1 = inner capture
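A quick way to check what the widened `_SITE_RE` accepts is to run it against a few sample strings. The pattern below is copied from the new side of the hunk; the helper `first_site`, the sample sentences and the trust name are hypothetical and only serve the illustration.

```python
import re

# Copied from the new side of the recognizers.py hunk.
_SITE_RE = re.compile(
    r"\b(?:[A-Z][A-Za-z']+\s+){1,4}"
    r"(?:Hospital|Infirmary|Trust|Medical\s+Centre|Health\s+Centre|Clinic|Surgery)\b"
)


def first_site(text):
    """Return the first site-name match in text, or None (hypothetical helper)."""
    m = _SITE_RE.search(text)
    return m.group(0) if m else None


a = first_site("Seen at Manchester Royal Infirmary Ward 9.")
b = first_site("Referred by Northfield Community NHS Trust.")  # made-up trust name
c = first_site("patient attended the hospital yesterday")
d = first_site("At Manchester Royal Infirmary today.")

print(a)  # Manchester Royal Infirmary
print(b)  # Northfield Community NHS Trust
print(c)  # None
print(d)  # At Manchester Royal Infirmary
```

The last case shows a limit of the title-case anchor: a capitalised word directly before the site name, such as a sentence-initial "At", is absorbed into the match because the pattern allows up to four leading capitalised words.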