Spaces: chaeyoona/noteguard

yumi.h committed 14 days ago

Commit 2da24e4, 1 parent: 4d404c0

Fix pseudonym mis-mapping from spaCy ORGANIZATION over-tagging

Reported: "NHS" and "DOB 02/03/1981" were tagged ORGANIZATION and pseudonymised
as ORGANIZATION_NNN, and "Manchester Royal Infirmary"/"Ward 9" mapping was garbled.

Root cause: spaCy lg over-tags labels/abbreviations as ORG, which (a) false-flags
"NHS"/"GMC", (b) swallows the rule DATE inside "DOB 02/03/1981", and (c) produces
a span partially overlapping the _SITE_RE LOCATION, corrupting the output on
replacement.

- detect.py: drop ORGANIZATION from PresidioDetector.KEEP.
- detect.py: _merge now returns DISJOINT spans and, on overlap, prefers precise
rule entities (date/NHS/GMC/…) over broad NER spans — no more text corruption.
- recognizers.py: _SITE_RE now also matches "… Trust" so NHS site names that ORG
used to catch are still caught as LOCATION.
- CLAUDE.md: document the reversal + overlap-safe merge.

Result: "Pt [PERSON], NHS no [UK_NHS], DOB [DATE_TIME], lives [UK_POSTCODE].
[LOCATION] Ward 9. Reviewed by [PERSON], GMC [GMC]." Tests pass (23).
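
A minimal sketch shows why a partially overlapping span corrupts the output. The `replace_spans` and `span_of` helpers are hypothetical (noteguard's actual transform is not shown in this commit). The sketch assumes spans are replaced right to left using offsets into the original text, and the ORGANIZATION span boundaries are invented for illustration.

```python
def replace_spans(text, spans):
    """Hypothetical transform: substitute [LABEL] for each (start, end, label),
    working right to left with offsets taken from the ORIGINAL text."""
    out = text
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        out = out[:start] + f"[{label}]" + out[end:]
    return out


def span_of(text, phrase, label):
    """Build a (start, end, label) tuple for the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


text = "Seen at Manchester Royal Infirmary Ward 9."
site = span_of(text, "Manchester Royal Infirmary", "LOCATION")  # rule match
org = span_of(text, "Royal Infirmary Ward 9", "ORGANIZATION")   # assumed NER span

# Overlapping input: the second replacement uses offsets that the first one
# has already invalidated, so text is lost.
garbled = replace_spans(text, [site, org])

# Disjoint input: offsets stay valid and the sentence survives intact.
clean = replace_spans(text, [site])

print(garbled)
print(clean)
```

With disjoint spans every offset still points at the text it was computed from, which is the property the new `_merge` is meant to guarantee.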

Files changed (3)

CLAUDE.md +6 -3
noteguard/detect.py +22 -8
noteguard/recognizers.py +1 -1

CLAUDE.md CHANGED

@@ -38,12 +38,15 @@ python -m pytest tests/ -v
 ## Decisions locked in (version 1 branch)
 - **Default model: `en_core_web_lg`** — 100% name recall vs 91% for sm; clinical transformer
   (`obi/deid_roberta_i2b2`) was tested and performed worse on UK names (US i2b2 training data).
-- **ORGANIZATION added to PresidioDetector.KEEP** — hospital names are often tagged as ORG;
-  excluding them was the root cause of low places recall.
+- **ORGANIZATION excluded from PresidioDetector.KEEP** — spaCy lg over-tags labels/abbreviations
+  ("NHS", "DOB …", "GMC") as ORG, causing false positives and swallowing precise rule spans. NHS
+  site names are caught by the `_SITE_RE` LOCATION rule (incl. "… Trust") instead.
+- **`_merge` is overlap-safe + priority-ranked** — output spans are disjoint (no transform
+  corruption); on overlap, precise rule entities (date/NHS/GMC/…) beat broad NER spans.
 - **Human-in-the-loop review queue** — spans with score in `[review_threshold, score_threshold)`
   are redacted but flagged `needs_review=True` for IG analyst review before SDE pool admission.
 - **Places recall** — low recall (0–0.7) was mostly generic "ward"/"bay" in GT (now filtered by
-  `_GENERIC`) and ORG vs LOCATION mismatch (now fixed via KEEP + `_SITE_RE` in recognizers).
+  `_GENERIC`); NHS site names are caught by the `_SITE_RE` LOCATION rule in recognizers.
 ## Gotchas
 - Note text has mojibake (`Â·`) — `_fix_mojibake` runs before detection.
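
The review-queue rule recorded in CLAUDE.md (scores in `[review_threshold, score_threshold)` are redacted but flagged) could look roughly like the sketch below. Everything here is an assumption made for illustration: the `Span` stand-in, the `triage` name, the threshold values 0.4 and 0.6, and the choice to drop spans that score below `review_threshold`, which the notes do not specify.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Span:
    """Stand-in for noteguard's span type; field names are assumed."""
    start: int
    end: int
    entity_type: str
    score: float
    needs_review: bool = False


def triage(spans, review_threshold=0.4, score_threshold=0.6):
    """Keep spans scoring at or above review_threshold; flag those that fall
    in the half-open band [review_threshold, score_threshold) for review."""
    kept = []
    for s in spans:
        if s.score < review_threshold:
            continue  # assumption: too weak to redact at all
        kept.append(replace(s, needs_review=s.score < score_threshold))
    return kept


candidates = [
    Span(0, 5, "PERSON", 0.90),     # confident: redact, no review
    Span(10, 20, "LOCATION", 0.50),  # borderline: redact and flag
    Span(30, 35, "PERSON", 0.20),    # below the review band
]
triaged = triage(candidates)
for s in triaged:
    print(s.entity_type, s.score, s.needs_review)
```

The half-open interval means a span scoring exactly `score_threshold` is treated as confident, while one scoring exactly `review_threshold` is still redacted and flagged.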

noteguard/detect.py CHANGED

@@ -35,12 +35,13 @@ class PresidioDetector:
     name = "presidio+rules"
-    # Presidio entity types we keep. ORGANIZATION is included because NHS site names
-    # (e.g. "Manchester Royal Infirmary") are often tagged as ORG rather than LOCATION;
-    # excluding them was the main cause of low places recall.
+    # Presidio entity types we keep. ORGANIZATION is deliberately EXCLUDED: spaCy lg
+    # over-tags abbreviations/labels ("NHS", "DOB …", "GMC") as ORG, which both creates
+    # false positives and swallows precise rule spans (dates, NHS numbers). NHS site
+    # names are caught instead by the _SITE_RE LOCATION rule (incl. "… Trust").
     KEEP = {
         "PERSON", "DATE_TIME", "EMAIL_ADDRESS", "PHONE_NUMBER",
-        "LOCATION", "ORGANIZATION", "UK_NHS", "UK_NINO", "UK_PASSPORT",
+        "LOCATION", "UK_NHS", "UK_NINO", "UK_PASSPORT",
         "UK_VEHICLE_REGISTRATION", "IP_ADDRESS", "URL",
     }
@@ -114,14 +115,27 @@ class PresidioDetector:
         return _merge(spans)
+# Precise pattern/checksum entities win over broad NER spans (PERSON/LOCATION) when
+# they overlap — e.g. a rule DATE inside a spurious NER span should survive as the date.
+_PRECISE = {
+    "UK_NHS", "DATE_TIME", "EMAIL_ADDRESS", "PHONE_NUMBER", "UK_POSTCODE",
+    "UK_NINO", "UK_VEHICLE_REGISTRATION", "UK_PASSPORT", "GMC", "NMC",
+    "NHS_ODS", "RECORD_ID",
+}
 def _merge(spans: list[Span]) -> list[Span]:
-    """Sort, then drop spans fully contained in a longer span (keep highest score)."""
-    spans = sorted(spans, key=lambda s: (s.start, -(s.end - s.start), -s.score))
+    """Return disjoint spans. On overlap, prefer precise rule entities, then the longer,
+    higher-scoring span. Disjoint output guarantees the transform can't corrupt text."""
+    def rank(s: Span):
+        return (1 if s.entity_type in _PRECISE else 0, s.end - s.start, s.score)
     kept: list[Span] = []
-    for s in spans:
-        if any(k.start <= s.start and s.end <= k.end for k in kept):
+    for s in sorted(spans, key=rank, reverse=True):
+        if any(s.start < k.end and k.start < s.end for k in kept):  # overlaps a kept span
             continue
         kept.append(s)
+    kept.sort(key=lambda s: s.start)
     return kept
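
To see the new ranking at work, here is a self-contained sketch that copies `_merge` and `_PRECISE` from the diff and runs them on a made-up sentence. The `Span` dataclass and the `span_of` helper are stand-ins (the real `Span` type is not part of this commit), and the spurious NER spans and all scores are invented.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Stand-in with only the fields _merge reads."""
    start: int
    end: int
    entity_type: str
    score: float


_PRECISE = {
    "UK_NHS", "DATE_TIME", "EMAIL_ADDRESS", "PHONE_NUMBER", "UK_POSTCODE",
    "UK_NINO", "UK_VEHICLE_REGISTRATION", "UK_PASSPORT", "GMC", "NMC",
    "NHS_ODS", "RECORD_ID",
}


def _merge(spans: list[Span]) -> list[Span]:
    """Greedy selection: visit spans best-first, keep one only if it does not
    overlap anything already kept, then restore document order."""
    def rank(s: Span):
        return (1 if s.entity_type in _PRECISE else 0, s.end - s.start, s.score)

    kept: list[Span] = []
    for s in sorted(spans, key=rank, reverse=True):
        if any(s.start < k.end and k.start < s.end for k in kept):
            continue
        kept.append(s)
    kept.sort(key=lambda s: s.start)
    return kept


def span_of(text: str, phrase: str, entity_type: str, score: float) -> Span:
    """Hypothetical helper: span covering the first occurrence of phrase."""
    start = text.index(phrase)
    return Span(start, start + len(phrase), entity_type, score)


text = "DOB 02/03/1981, seen at Manchester Royal Infirmary Ward 9"
spans = [
    span_of(text, "DOB 02/03/1981", "PERSON", 0.85),          # spurious NER span
    span_of(text, "02/03/1981", "DATE_TIME", 0.60),           # rule date
    span_of(text, "Manchester Royal Infirmary", "LOCATION", 0.70),
    span_of(text, "Royal Infirmary Ward 9", "PERSON", 0.85),  # partial overlap
]
merged = _merge(spans)
for s in merged:
    print(s.entity_type, repr(text[s.start:s.end]))
```

The rule date outranks the wider, higher-scoring NER span around it because membership in `_PRECISE` is the first element of the rank tuple. Between the two non-precise spans, the longer one is kept.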

noteguard/recognizers.py CHANGED

@@ -89,7 +89,7 @@ _VEHICLE_RE = re.compile(r"\b[A-Z]{2}\d{2}\s?[A-Z]{3}\b")
 # Context-anchored to title-case to avoid flagging generic lowercase mentions.
 _SITE_RE = re.compile(
     r"\b(?:[A-Z][A-Za-z']+\s+){1,4}"
-    r"(?:Hospital|Infirmary|NHS\s+Trust|Medical\s+Centre|Health\s+Centre|Clinic|Surgery)\b"
+    r"(?:Hospital|Infirmary|Trust|Medical\s+Centre|Health\s+Centre|Clinic|Surgery)\b"
 )
 # (regex, entity_type, capture_group): group 0 = whole match, 1 = inner capture