Commit 2da24e4 · 1 Parent(s): 4d404c0 · committed by yumi.h

Fix pseudonym mis-mapping from spaCy ORGANIZATION over-tagging

Reported: "NHS" and "DOB 02/03/1981" were tagged ORGANIZATION and pseudonymised
as ORGANIZATION_NNN, and "Manchester Royal Infirmary"/"Ward 9" mapping was garbled.

Root cause: spaCy lg over-tags labels/abbreviations as ORG, which (a) false-flags
"NHS"/"GMC", (b) swallows the rule DATE inside "DOB 02/03/1981", and (c) produces
a span partially overlapping the _SITE_RE LOCATION, corrupting the output on
replacement.

- detect.py: drop ORGANIZATION from PresidioDetector.KEEP.
- detect.py: _merge now returns DISJOINT spans and, on overlap, prefers precise
rule entities (date/NHS/GMC/…) over broad NER spans — no more text corruption.
- recognizers.py: _SITE_RE now also matches "… Trust" so NHS site names that ORG
used to catch are still caught as LOCATION.
- CLAUDE.md: document the reversal + overlap-safe merge.

Result: "Pt [PERSON], NHS no [UK_NHS], DOB [DATE_TIME], lives [UK_POSTCODE].
[LOCATION] Ward 9. Reviewed by [PERSON], GMC [GMC]." Tests pass (23).
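
To illustrate root cause (c), here is a minimal sketch of how a partially overlapping span can corrupt offset-based replacement. The `replace_spans` helper and the exact ORG span boundaries are assumptions made for illustration; they are not taken from the noteguard code.

```python
def replace_spans(text, spans):
    """Naively replace (start, end, label) spans by offset, right to left."""
    out = text
    for start, end, label in sorted(spans, reverse=True):
        out = out[:start] + f"[{label}]" + out[end:]
    return out


text = "Seen at Manchester Royal Infirmary Ward 9."

# Rule-based site span covering "Manchester Royal Infirmary".
site_start = text.index("Manchester")
site = (site_start, site_start + len("Manchester Royal Infirmary"), "LOCATION")

# Hypothetical NER over-tag covering "Royal Infirmary Ward 9",
# which only partially overlaps the site span.
org_start = text.index("Royal")
org = (org_start, org_start + len("Royal Infirmary Ward 9"), "ORGANIZATION")

corrupted = replace_spans(text, [site, org])  # offsets go stale after the first edit
clean = replace_spans(text, [site])           # disjoint input stays well-formed

print(corrupted)
print(clean)
```

With overlapping input, the second replacement uses offsets that no longer line up with the edited string, so surrounding text is lost. With disjoint input, each replacement touches only its own characters.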

Files changed (3)
  1. CLAUDE.md +6 -3
  2. noteguard/detect.py +22 -8
  3. noteguard/recognizers.py +1 -1
CLAUDE.md CHANGED
@@ -38,12 +38,15 @@ python -m pytest tests/ -v
 ## Decisions locked in (version 1 branch)
 - **Default model: `en_core_web_lg`** — 100% name recall vs 91% for sm; clinical transformer
   (`obi/deid_roberta_i2b2`) was tested and performed worse on UK names (US i2b2 training data).
-- **ORGANIZATION added to PresidioDetector.KEEP** — hospital names are often tagged as ORG;
-  excluding them was the root cause of low places recall.
+- **ORGANIZATION excluded from PresidioDetector.KEEP** — spaCy lg over-tags labels/abbreviations
+  ("NHS", "DOB …", "GMC") as ORG, causing false positives and swallowing precise rule spans. NHS
+  site names are caught by the `_SITE_RE` LOCATION rule (incl. "… Trust") instead.
+- **`_merge` is overlap-safe + priority-ranked** — output spans are disjoint (no transform
+  corruption); on overlap, precise rule entities (date/NHS/GMC/…) beat broad NER spans.
 - **Human-in-the-loop review queue** — spans with score in `[review_threshold, score_threshold)`
   are redacted but flagged `needs_review=True` for IG analyst review before SDE pool admission.
 - **Places recall** — low recall (0–0.7) was mostly generic "ward"/"bay" in GT (now filtered by
-  `_GENERIC`) and ORG vs LOCATION mismatch (now fixed via KEEP + `_SITE_RE` in recognizers).
+  `_GENERIC`); NHS site names are caught by the `_SITE_RE` LOCATION rule in recognizers.
 
 ## Gotchas
 - Note text has mojibake (`·`) — `_fix_mojibake` runs before detection.
noteguard/detect.py CHANGED
@@ -35,12 +35,13 @@ class PresidioDetector:
 
     name = "presidio+rules"
 
-    # Presidio entity types we keep. ORGANIZATION is included because NHS site names
-    # (e.g. "Manchester Royal Infirmary") are often tagged as ORG rather than LOCATION;
-    # excluding them was the main cause of low places recall.
+    # Presidio entity types we keep. ORGANIZATION is deliberately EXCLUDED: spaCy lg
+    # over-tags abbreviations/labels ("NHS", "DOB …", "GMC") as ORG, which both creates
+    # false positives and swallows precise rule spans (dates, NHS numbers). NHS site
+    # names are caught instead by the _SITE_RE LOCATION rule (incl. "… Trust").
     KEEP = {
         "PERSON", "DATE_TIME", "EMAIL_ADDRESS", "PHONE_NUMBER",
-        "LOCATION", "ORGANIZATION", "UK_NHS", "UK_NINO", "UK_PASSPORT",
+        "LOCATION", "UK_NHS", "UK_NINO", "UK_PASSPORT",
         "UK_VEHICLE_REGISTRATION", "IP_ADDRESS", "URL",
     }
 
@@ -114,14 +115,27 @@ class PresidioDetector:
         return _merge(spans)
 
 
+# Precise pattern/checksum entities win over broad NER spans (PERSON/LOCATION) when
+# they overlap — e.g. a rule DATE inside a spurious NER span should survive as the date.
+_PRECISE = {
+    "UK_NHS", "DATE_TIME", "EMAIL_ADDRESS", "PHONE_NUMBER", "UK_POSTCODE",
+    "UK_NINO", "UK_VEHICLE_REGISTRATION", "UK_PASSPORT", "GMC", "NMC",
+    "NHS_ODS", "RECORD_ID",
+}
+
+
 def _merge(spans: list[Span]) -> list[Span]:
-    """Sort, then drop spans fully contained in a longer span (keep highest score)."""
-    spans = sorted(spans, key=lambda s: (s.start, -(s.end - s.start), -s.score))
+    """Return disjoint spans. On overlap, prefer precise rule entities, then the longer,
+    higher-scoring span. Disjoint output guarantees the transform can't corrupt text."""
+    def rank(s: Span):
+        return (1 if s.entity_type in _PRECISE else 0, s.end - s.start, s.score)
+
     kept: list[Span] = []
-    for s in spans:
-        if any(k.start <= s.start and s.end <= k.end for k in kept):
+    for s in sorted(spans, key=rank, reverse=True):
+        if any(s.start < k.end and k.start < s.end for k in kept):  # overlaps a kept span
             continue
         kept.append(s)
+    kept.sort(key=lambda s: s.start)
     return kept
 
 
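
The priority-ranked merge in the diff above can be tried in isolation with the sketch below. It assumes a hypothetical `Span` dataclass with `start`, `end`, `entity_type` and `score` fields (the real `Span` type is not shown in this commit), uses only a subset of `_PRECISE`, and invents the sample scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:  # hypothetical stand-in for noteguard's Span type
    start: int
    end: int
    entity_type: str
    score: float


_PRECISE = {"UK_NHS", "DATE_TIME", "GMC"}  # subset of the set in the diff


def merge(spans):
    """Greedy selection: highest-ranked span first, skip anything overlapping a kept span."""
    def rank(s):
        return (1 if s.entity_type in _PRECISE else 0, s.end - s.start, s.score)

    kept = []
    for s in sorted(spans, key=rank, reverse=True):
        if any(s.start < k.end and k.start < s.end for k in kept):
            continue
        kept.append(s)
    kept.sort(key=lambda s: s.start)
    return kept


text = "DOB 02/03/1981, seen by Dr Patel"
date_start = text.index("02/03/1981")
date_end = date_start + len("02/03/1981")

spans = [
    Span(0, date_end, "LOCATION", 0.85),            # spurious NER span over "DOB 02/03/1981"
    Span(date_start, date_end, "DATE_TIME", 0.60),  # precise rule span inside it
    Span(text.index("Patel"), len(text), "PERSON", 0.85),
]

merged = merge(spans)
print([(s.entity_type, text[s.start:s.end]) for s in merged])
```

In this example the shorter, lower-scoring `DATE_TIME` span survives because membership in `_PRECISE` is the first element of the rank tuple. A containment-only merge that prefers longer spans would have kept the spurious outer span instead.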
noteguard/recognizers.py CHANGED
@@ -89,7 +89,7 @@ _VEHICLE_RE = re.compile(r"\b[A-Z]{2}\d{2}\s?[A-Z]{3}\b")
 # Context-anchored to title-case to avoid flagging generic lowercase mentions.
 _SITE_RE = re.compile(
     r"\b(?:[A-Z][A-Za-z']+\s+){1,4}"
-    r"(?:Hospital|Infirmary|NHS\s+Trust|Medical\s+Centre|Health\s+Centre|Clinic|Surgery)\b"
+    r"(?:Hospital|Infirmary|Trust|Medical\s+Centre|Health\s+Centre|Clinic|Surgery)\b"
 )
 
 # (regex, entity_type, capture_group): group 0 = whole match, 1 = inner capture
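
A quick way to see the effect of the `_SITE_RE` change is to compare the old and new patterns side by side. The two regexes below are copied from the diff; the names `OLD_SITE_RE` and `NEW_SITE_RE` and the trust name in the sample text are made up for illustration.

```python
import re

OLD_SITE_RE = re.compile(
    r"\b(?:[A-Z][A-Za-z']+\s+){1,4}"
    r"(?:Hospital|Infirmary|NHS\s+Trust|Medical\s+Centre|Health\s+Centre|Clinic|Surgery)\b"
)
NEW_SITE_RE = re.compile(
    r"\b(?:[A-Z][A-Za-z']+\s+){1,4}"
    r"(?:Hospital|Infirmary|Trust|Medical\s+Centre|Health\s+Centre|Clinic|Surgery)\b"
)

# Hypothetical site name ending in "Foundation Trust" rather than "NHS Trust".
trust_note = "Referred to Northfield Community Foundation Trust today."
infirmary_note = "Seen at Manchester Royal Infirmary Ward 9."

old_trust = OLD_SITE_RE.search(trust_note)
new_trust = NEW_SITE_RE.search(trust_note)
new_infirmary = NEW_SITE_RE.search(infirmary_note)

print(old_trust)                 # old pattern requires the literal "NHS Trust" suffix
print(new_trust.group(0))
print(new_infirmary.group(0))
```

One trade-off worth checking against real notes: the bare `Trust` suffix will also match any title-case phrase ending in "Trust", so non-NHS names of that shape could be tagged LOCATION.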