Spaces: chaeyoona/noteguard

yumi.h committed 15 days ago

Commit 4d404c0

1 Parent(s): 9db610d

Remove the roster / gazetteer feature

Drop the optional Trust-roster (gazetteer) layer entirely:
- detect.py: remove GazetteerDetector and CompositeDetector (now unused), tidy imports
- data.py: remove roster_terms()
- trust_demo.py / run_eval.py / app/streamlit_app.py: drop roster wiring; the
Try-it "Add Trust roster" checkbox and the +roster eval row are gone
- README / CLAUDE / tool_card / run-evaluation skill: remove roster references

Detection is Presidio (en_core_web_lg) + rules only. Tests pass (23), demo runs.

Files changed (9)

.claude/skills/run-evaluation/SKILL.md +2 -4
CLAUDE.md +2 -6
README.md +6 -8
app/streamlit_app.py +4 -7
docs/tool_card.md +1 -2
noteguard/data.py +0 -20
noteguard/detect.py +1 -44
noteguard/trust_demo.py +5 -7
run_eval.py +2 -8

.claude/skills/run-evaluation/SKILL.md CHANGED

@@ -16,11 +16,9 @@ The eval is the project's pass/fail signal — it proves sanitisation actually r
    - **residual leakage** = known identifiers still present after sanitisation. This is the headline.
 ## How to read it
-- `--compare` prints three rows: **rules** → **presidio+rules** (the shipping detector) →
-  **presidio+rules+roster** (optional gazetteer). The leakage should drop sharply across them.
 - Watch residual leakage as the headline. If it regresses after a change to `noteguard/recognizers.py`,
   `detect.py`, or `transform.py`, fix it before continuing.
-- Keep the roster/gazetteer OUT of the headline claim — it's seeded from known values, so it's an
-  optional recall-lift layer, reported separately.
 Log anything that didn't work in `experiments/FAILED.md`.

    - **residual leakage** = known identifiers still present after sanitisation. This is the headline.
 ## How to read it
+- `--compare` prints two rows: **rules** → **presidio+rules** (the shipping detector). The leakage
+  should drop sharply between them.
 - Watch residual leakage as the headline. If it regresses after a change to `noteguard/recognizers.py`,
   `detect.py`, or `transform.py`, fix it before continuing.
 Log anything that didn't work in `experiments/FAILED.md`.
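
As a rough illustration of the headline metric in this skill file, residual leakage can be read as the share of known identifiers that still appear in the sanitised text. The sketch below is a stand-in under stated assumptions, not the project's `evaluate` module: `residual_leakage` and its arguments are hypothetical names, and matching is plain case-insensitive substring search rather than whatever variant matching the real evaluator uses.

```python
def residual_leakage(known_identifiers, sanitised_text):
    """Return (leaked, total, rate) for known identifiers still present after sanitisation.

    Hypothetical helper for illustration only; not part of noteguard.
    """
    haystack = sanitised_text.lower()
    # Ignore empty values, which would otherwise match every text.
    values = [v.strip() for v in known_identifiers if v and v.strip()]
    leaked = sum(1 for v in values if v.lower() in haystack)
    total = len(values)
    rate = leaked / total if total else 0.0
    return leaked, total, rate


if __name__ == "__main__":
    # Made-up example note: two identifiers were replaced, the phone number was missed.
    text = "Patient [PERSON] seen at [LOCATION]; phone 07700 900123 left on file."
    known = ["Jane Smith", "Leeds General Infirmary", "07700 900123"]
    print(residual_leakage(known, text))
```

A lower rate is better; a regression shows up as a higher leaked count for the same set of notes.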

CLAUDE.md CHANGED

@@ -9,7 +9,7 @@ data leaves a Trust. Encode Club "Trusted Data & AI Infrastructure" hackathon; f
 python -m venv .venv; .\.venv\Scripts\Activate.ps1
 pip install -r requirements.txt; python -m spacy download en_core_web_lg
-python run_eval.py --compare --limit 300   # VERIFIABLE SIGNAL: rules vs presidio+rules vs +roster -> results.json
 python -m noteguard.trust_demo             # two NHS Trusts share only de-identified data -> data/out/
 streamlit run app/streamlit_app.py         # demo (Try-it / Metrics / Governance / Two-Trust)
 python -m pytest tests/ -v
@@ -19,7 +19,7 @@ python -m pytest tests/ -v
 ## Architecture
 - `noteguard/` — `data` (load + ground-truth join, EVAL-ONLY oracle) · `recognizers` (pure-Python
-  rules) · `detect` (Rule / Presidio / Gazetteer / Composite, graceful fallback) · `transform`
   (redact | patient-consistent pseudonymise + date-shift, Faker) · `evaluate` (P/R/F1 + residual
   leakage) · `pipeline` · `trust_demo`.
 - `run_eval.py` CLI · `app/streamlit_app.py` demo · `tests/` mirror `noteguard/`.
@@ -33,15 +33,11 @@ python -m pytest tests/ -v
   into prompts; point at file paths.
 - The note→patient join (`noteguard/data.py` ground truth) is the EVAL-ONLY oracle. It must NEVER feed
   detection/transform — that is data leakage and invalidates the metric.
-- The roster/gazetteer is seeded from known values, so keep it OUT of the headline metric — report it
-  only as an optional recall-lift layer.
 - Never silently fall back to an older/cached dataset — fail loudly.
 ## Decisions locked in (version 1 branch)
 - **Default model: `en_core_web_lg`** — 100% name recall vs 91% for sm; clinical transformer
   (`obi/deid_roberta_i2b2`) was tested and performed worse on UK names (US i2b2 training data).
-- **Roster OFF by default** — `--roster` flag available to show the recall lift separately;
-  not the headline metric because the gazetteer is seeded from the same known values.
 - **ORGANIZATION added to PresidioDetector.KEEP** — hospital names are often tagged as ORG;
   excluding them was the root cause of low places recall.
 - **Human-in-the-loop review queue** — spans with score in `[review_threshold, score_threshold)`

 python -m venv .venv; .\.venv\Scripts\Activate.ps1
 pip install -r requirements.txt; python -m spacy download en_core_web_lg
+python run_eval.py --compare --limit 300   # VERIFIABLE SIGNAL: rules vs presidio+rules -> results.json
 python -m noteguard.trust_demo             # two NHS Trusts share only de-identified data -> data/out/
 streamlit run app/streamlit_app.py         # demo (Try-it / Metrics / Governance / Two-Trust)
 python -m pytest tests/ -v
 ## Architecture
 - `noteguard/` — `data` (load + ground-truth join, EVAL-ONLY oracle) · `recognizers` (pure-Python
+  rules) · `detect` (Rule / Presidio, graceful fallback) · `transform`
   (redact | patient-consistent pseudonymise + date-shift, Faker) · `evaluate` (P/R/F1 + residual
   leakage) · `pipeline` · `trust_demo`.
 - `run_eval.py` CLI · `app/streamlit_app.py` demo · `tests/` mirror `noteguard/`.
   into prompts; point at file paths.
 - The note→patient join (`noteguard/data.py` ground truth) is the EVAL-ONLY oracle. It must NEVER feed
   detection/transform — that is data leakage and invalidates the metric.
 - Never silently fall back to an older/cached dataset — fail loudly.
 ## Decisions locked in (version 1 branch)
 - **Default model: `en_core_web_lg`** — 100% name recall vs 91% for sm; clinical transformer
   (`obi/deid_roberta_i2b2`) was tested and performed worse on UK names (US i2b2 training data).
 - **ORGANIZATION added to PresidioDetector.KEEP** — hospital names are often tagged as ORG;
   excluding them was the root cause of low places recall.
 - **Human-in-the-loop review queue** — spans with score in `[review_threshold, score_threshold)`

README.md CHANGED

@@ -27,8 +27,8 @@ layer** Presidio leaves to you:
 3. **Patient-consistent, longitudinal de-identification.** Same patient → same surrogate across their
    whole admission journey, with each patient's dates shifted by one consistent offset so clinical
    intervals survive — *useful* data, not just safe data. Realistic en_GB fakes (or `[TYPE]` redaction).
-4. **Pluggable + degrades gracefully.** One `Detector` interface (Rule / Presidio / Gazetteer /
-   Composite); the pure-Python rule layer + eval run even if spaCy/Presidio are unavailable.
 5. **Governance wrapper.** Per-note audit of what was removed + the dataset-level leakage report,
    mapped to the NHS **Five Safes**.
@@ -41,20 +41,18 @@ layer** Presidio leaves to you:
 |---|---|---|---|
 | rules only | 0.98 | 0.00 | **74.8 %** |
 | **presidio + rules** (shipping) | **0.99** | **0.68** | **8.5 %** |
-| presidio + rules + Trust roster | 0.99 | 0.73 | **0.10 %** (1 / 1027) |
-The rules→engine→roster drop is the headline: it shows, with numbers, exactly what each layer buys you.
 > Precision is reported against *structured* PII only, so it is a conservative lower bound — correctly
 > removing a clinician's name (not in the tables) counts here as a false positive. **Recall and leakage
-> are the sound, headline metrics.** The roster/gazetteer is seeded from a Trust's known patient list,
-> so it's reported as an optional recall-lift layer, kept out of the headline to avoid circularity.
 ## Architecture
 ```
                  ┌──────────────────── inside Trust A ─────────────────────┐
- raw notes ──►   │  fix mojibake ─► detect (Presidio NER + rules +roster)   │ ──► de-identified
  (PHI)           │                  ─► transform (redact | pseudonymise)    │     text + audit log
                  │                     patient-consistent + date-shift, vault│     (no PHI leaves)
                  └─────────────────────────────────────────────────────────┘
@@ -65,7 +63,7 @@ The rules→engine→roster drop is the headline: it shows, with numbers, exactl
 `noteguard/` — `data` (load + ground-truth join, **eval-only oracle**) · `recognizers` (pure-Python
 rules: NHS checksum/context, postcode, date, phone, email, GMC/NMC/ODS, UUID) · `detect`
-(`RuleDetector` / `PresidioDetector` / `GazetteerDetector` / `CompositeDetector`) · `transform`
 (redaction | patient-consistent pseudonymisation + date-shift, Faker vault) · `evaluate` (P/R/F1 +
 residual leakage) · `pipeline` · `trust_demo`.

 3. **Patient-consistent, longitudinal de-identification.** Same patient → same surrogate across their
    whole admission journey, with each patient's dates shifted by one consistent offset so clinical
    intervals survive — *useful* data, not just safe data. Realistic en_GB fakes (or `[TYPE]` redaction).
+4. **Pluggable + degrades gracefully.** One `Detector` interface (Rule / Presidio); the pure-Python
+   rule layer + eval run even if spaCy/Presidio are unavailable.
 5. **Governance wrapper.** Per-note audit of what was removed + the dataset-level leakage report,
    mapped to the NHS **Five Safes**.
 |---|---|---|---|
 | rules only | 0.98 | 0.00 | **74.8 %** |
 | **presidio + rules** (shipping) | **0.99** | **0.68** | **8.5 %** |
+The rules→engine drop is the headline: it shows, with numbers, exactly what the NER engine buys you.
 > Precision is reported against *structured* PII only, so it is a conservative lower bound — correctly
 > removing a clinician's name (not in the tables) counts here as a false positive. **Recall and leakage
+> are the sound, headline metrics.**
 ## Architecture
 ```
                  ┌──────────────────── inside Trust A ─────────────────────┐
+ raw notes ──►   │  fix mojibake ─► detect (Presidio NER + rules)           │ ──► de-identified
  (PHI)           │                  ─► transform (redact | pseudonymise)    │     text + audit log
                  │                     patient-consistent + date-shift, vault│     (no PHI leaves)
                  └─────────────────────────────────────────────────────────┘
 `noteguard/` — `data` (load + ground-truth join, **eval-only oracle**) · `recognizers` (pure-Python
 rules: NHS checksum/context, postcode, date, phone, email, GMC/NMC/ODS, UUID) · `detect`
+(`RuleDetector` / `PresidioDetector`) · `transform`
 (redaction | patient-consistent pseudonymisation + date-shift, Faker vault) · `evaluate` (P/R/F1 +
 residual leakage) · `pipeline` · `trust_demo`.
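
The README's point about shifting each patient's dates by one consistent offset can be sketched as follows. This is only an illustration of the idea: `patient_offset`, `shift_date`, the keyed-hash scheme, and the `secret` and `max_days` parameters are all assumptions made for this example, and the project's own `transform` module may derive or store offsets differently (for instance in its vault).

```python
import hashlib
from datetime import date, timedelta


def patient_offset(person_id: str, secret: str = "demo-secret", max_days: int = 365) -> timedelta:
    """Derive one stable offset per patient from a keyed hash (hypothetical scheme)."""
    digest = hashlib.sha256(f"{secret}:{person_id}".encode("utf-8")).digest()
    # Map the first four bytes onto the range [-max_days, +max_days].
    days = int.from_bytes(digest[:4], "big") % (2 * max_days + 1) - max_days
    return timedelta(days=days)


def shift_date(person_id: str, d: date) -> date:
    """Shift a date by the patient's own offset; same patient, same shift."""
    return d + patient_offset(person_id)


if __name__ == "__main__":
    admitted, discharged = date(2024, 3, 1), date(2024, 3, 9)
    a = shift_date("p-001", admitted)
    b = shift_date("p-001", discharged)
    # Both dates move by the same amount, so the length of stay is unchanged.
    print(a, b, (b - a).days)  # interval stays 8 days
```

Because the offset depends only on the patient identifier and the secret, every note for the same patient shifts by the same amount, which is what keeps clinical intervals intact.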

app/streamlit_app.py CHANGED

@@ -18,8 +18,8 @@ import streamlit as st
 REPO = Path(__file__).resolve().parent.parent
 sys.path.insert(0, str(REPO))
-from noteguard.data import load_notes, roster_terms  # noqa: E402
-from noteguard.detect import CompositeDetector, GazetteerDetector, build_detector  # noqa: E402
 from noteguard.evaluate import evaluate  # noqa: E402
 from noteguard.pipeline import Pipeline  # noqa: E402
 from noteguard.transform import PSEUDONYM, REDACTION, PseudonymVault  # noqa: E402
@@ -89,7 +89,6 @@ st.caption(
 )
 detector, NOTES = load_engine()
-ROSTER = roster_terms(NOTES) if NOTES else []
 tab_try, tab_metrics, tab_gov, tab_trust = st.tabs(
     ["🔎 Try it", "📊 Metrics & leakage", "🏛️ Governance (Five Safes)", "🤝 Two-Trust sharing"]
@@ -106,7 +105,6 @@ with tab_try:
         method = st.radio("Transform", [PSEUDONYM, REDACTION],
                           format_func=lambda m: "Pseudonymise (realistic, patient-consistent)"
                           if m == PSEUDONYM else "Redact ([TYPE] tags)")
-        use_roster = st.checkbox("Add Trust roster (gazetteer) — catches names NER misses", value=False)
         source = st.radio("Input", ["Sample note", "Paste your own"])
     with c1:
         if source == "Sample note" and NOTES:
@@ -121,8 +119,7 @@ with tab_try:
             person_id = "demo"
     if text.strip():
-        det = CompositeDetector(detector, GazetteerDetector(ROSTER)) if (use_roster and ROSTER) else detector
-        result = Pipeline(det, PseudonymVault()).sanitise(text, method, person_id)
         st.markdown("##### 1) Detected PII")
         scroll_box(highlight(text, result.spans))
@@ -198,7 +195,7 @@ with tab_metrics:
             hide_index=True, use_container_width=True,
         )
         st.caption(
-            f"Detector: `{name}` · model: `en_core_web_lg` · roster: OFF (honest generalisation). "
             "Precision is a conservative lower bound — clinician names and unlisted locations "
             "detected correctly are counted as false positives."
         )

 REPO = Path(__file__).resolve().parent.parent
 sys.path.insert(0, str(REPO))
+from noteguard.data import load_notes  # noqa: E402
+from noteguard.detect import build_detector  # noqa: E402
 from noteguard.evaluate import evaluate  # noqa: E402
 from noteguard.pipeline import Pipeline  # noqa: E402
 from noteguard.transform import PSEUDONYM, REDACTION, PseudonymVault  # noqa: E402
 )
 detector, NOTES = load_engine()
 tab_try, tab_metrics, tab_gov, tab_trust = st.tabs(
     ["🔎 Try it", "📊 Metrics & leakage", "🏛️ Governance (Five Safes)", "🤝 Two-Trust sharing"]
         method = st.radio("Transform", [PSEUDONYM, REDACTION],
                           format_func=lambda m: "Pseudonymise (realistic, patient-consistent)"
                           if m == PSEUDONYM else "Redact ([TYPE] tags)")
         source = st.radio("Input", ["Sample note", "Paste your own"])
     with c1:
         if source == "Sample note" and NOTES:
             person_id = "demo"
     if text.strip():
+        result = Pipeline(detector, PseudonymVault()).sanitise(text, method, person_id)
         st.markdown("##### 1) Detected PII")
         scroll_box(highlight(text, result.spans))
             hide_index=True, use_container_width=True,
         )
         st.caption(
+            f"Detector: `{name}` · model: `en_core_web_lg` (honest generalisation). "
             "Precision is a conservative lower bound — clinician names and unlisted locations "
             "detected correctly are counted as false positives."
         )

docs/tool_card.md CHANGED

@@ -51,7 +51,7 @@ NoteGuard is a **de-identification gate** for free-text NHS clinical notes. It d
 ---
-## Performance (honest baseline — roster OFF, `en_core_web_lg`)
 | Entity | Recall |
 |---|---|
@@ -93,7 +93,6 @@ This matches the real NHS Information Governance workflow and makes the tool's a
 - **Precision is a conservative lower bound**: clinician names and unlisted locations correctly detected count as false positives in the evaluation (ground truth is patient-table-only).
 - **Not clinically validated**: evaluated on the `NHSEDataScience/synthetic_clinical_notes` dataset. Real deployment requires validation on representative Trust data.
 - **Clinical transformer models** (e.g. `obi/deid_roberta_i2b2`) were tested and performed worse on UK names than `en_core_web_lg` (i2b2 training data is US-centric). See `experiments/FAILED.md`.
-- **Roster / gazetteer** gives a recall lift but is seeded from known patient values — kept out of the headline metric to avoid circularity. Available as `--roster` option.
 ---

 ---
+## Performance (`en_core_web_lg`)
 | Entity | Recall |
 |---|---|
 - **Precision is a conservative lower bound**: clinician names and unlisted locations correctly detected count as false positives in the evaluation (ground truth is patient-table-only).
 - **Not clinically validated**: evaluated on the `NHSEDataScience/synthetic_clinical_notes` dataset. Real deployment requires validation on representative Trust data.
 - **Clinical transformer models** (e.g. `obi/deid_roberta_i2b2`) were tested and performed worse on UK names than `en_core_web_lg` (i2b2 training data is US-centric). See `experiments/FAILED.md`.
 ---

noteguard/data.py CHANGED

@@ -210,26 +210,6 @@ def load_notes(limit: int | None = None, local_dir: str | None = None) -> list[N
     return records
-def roster_terms(records: list[NoteRecord]) -> list[tuple[str, str]]:
-    """Build (term, entity_type) pairs for a GazetteerDetector from notes' ground truth.
-    This is the patient/site roster a Trust legitimately holds. Used as an OPTIONAL
-    recall-lift layer (and by the two-Trust demo) — kept out of the headline eval to
-    avoid circularity, since the gazetteer is seeded from the same known values.
-    """
-    terms: dict[str, str] = {}
-    for rec in records:
-        for gt in rec.ground_truth:
-            if gt.entity_type not in ("PERSON", "LOCATION"):
-                continue
-            terms.setdefault(gt.text, gt.entity_type)
-            if gt.entity_type == "PERSON":
-                for tok in gt.text.replace(",", " ").split():
-                    if len(tok) >= 3:
-                        terms.setdefault(tok, "PERSON")
-    return list(terms.items())
 if __name__ == "__main__":
     recs = load_notes(limit=5)
     for rec in recs:

     return records
 if __name__ == "__main__":
     recs = load_notes(limit=5)
     for rec in recs:

noteguard/detect.py CHANGED

@@ -8,8 +8,7 @@ are unavailable.
 """
 from __future__ import annotations
-import re
-from typing import Iterable, Protocol
 from .recognizers import Span, find_rule_spans
@@ -115,48 +114,6 @@ class PresidioDetector:
         return _merge(spans)
-class GazetteerDetector:
-    """Match a known list of names/sites (the roster a trust actually holds).
-    Catches identifiers the NER model misses (rare names, typo'd surnames) using
-    whole-word, case-insensitive matching. Used as an optional layer to show the
-    recall lift — not part of the headline eval, to avoid circularity.
-    """
-    name = "gazetteer"
-    def __init__(self, terms: Iterable[tuple[str, str]], min_len: int = 3):
-        self._patterns: list[tuple[re.Pattern, str]] = []
-        seen: set[str] = set()
-        for term, etype in terms:
-            term = (term or "").strip()
-            if len(term) < min_len or term.lower() in seen:
-                continue
-            seen.add(term.lower())
-            self._patterns.append(
-                (re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE), etype)
-            )
-    def detect(self, text: str) -> list[Span]:
-        spans: list[Span] = []
-        for pat, etype in self._patterns:
-            for m in pat.finditer(text):
-                spans.append(Span(m.start(), m.end(), etype, m.group(), 0.9))
-        return spans
-class CompositeDetector:
-    def __init__(self, *detectors: Detector):
-        self.detectors = detectors
-        self.name = "+".join(getattr(d, "name", "?") for d in detectors)
-    def detect(self, text: str) -> list[Span]:
-        spans: list[Span] = []
-        for d in self.detectors:
-            spans += d.detect(text)
-        return _merge(spans)
 def _merge(spans: list[Span]) -> list[Span]:
     """Sort, then drop spans fully contained in a longer span (keep highest score)."""
     spans = sorted(spans, key=lambda s: (s.start, -(s.end - s.start), -s.score))

 """
 from __future__ import annotations
+from typing import Protocol
 from .recognizers import Span, find_rule_spans
         return _merge(spans)
 def _merge(spans: list[Span]) -> list[Span]:
     """Sort, then drop spans fully contained in a longer span (keep highest score)."""
     spans = sorted(spans, key=lambda s: (s.start, -(s.end - s.start), -s.score))
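
The hunk cuts `_merge` off after the sort, so the body is not visible here. One way the documented behaviour (sort, then drop spans fully contained in a longer span) could work is sketched below. The `Span` dataclass and its field names are assumptions inferred from the positional call in the removed `GazetteerDetector`, and `merge` is a hypothetical reimplementation, not the author's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    # Field names are assumed for illustration; the real Span lives in noteguard.recognizers.
    start: int
    end: int
    entity_type: str
    text: str
    score: float


def merge(spans):
    """Keep only spans that are not fully contained in an already-kept span."""
    # Same sort key as the diff: earliest start, then longest, then highest score.
    ordered = sorted(spans, key=lambda s: (s.start, -(s.end - s.start), -s.score))
    kept = []
    for s in ordered:
        # Any span that contains s sorts before it, so checking kept is enough.
        if any(k.start <= s.start and s.end <= k.end for k in kept):
            continue
        kept.append(s)
    return kept


if __name__ == "__main__":
    spans = [
        Span(0, 4, "PERSON", "Jane", 0.90),
        Span(0, 10, "PERSON", "Jane Smith", 0.85),
        Span(20, 25, "LOCATION", "Leeds", 0.60),
    ]
    for s in merge(spans):
        print(s.start, s.end, s.entity_type, s.text)
```

In this sketch the score only breaks ties between spans with identical extents: the higher-scoring one sorts first and the duplicate is dropped. Partially overlapping spans are both kept.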

noteguard/trust_demo.py CHANGED

@@ -1,7 +1,7 @@
 """Simulate two NHS Trusts collaborating without sharing sensitive data.
-Each Trust holds its own patients, its own roster (gazetteer), and its own
-re-identification vault. It sanitises its notes LOCALLY and contributes only the
 de-identified text + a content-free audit manifest to a shared pool. Raw notes and
 vaults never leave the Trust. This is the sanitise-at-source gate that sits in
 front of a federated SDE / FLock.io training round.
@@ -16,8 +16,8 @@ import sys
 from collections import Counter
 from pathlib import Path
-from .data import NoteRecord, load_notes, roster_terms
-from .detect import CompositeDetector, GazetteerDetector, build_detector
 from .evaluate import ground_truth_spans, value_variants, _find_all
 from .pipeline import Pipeline
 from .transform import PSEUDONYM, PseudonymVault
@@ -43,9 +43,7 @@ def _residual_leaks(rec: NoteRecord, sanitised: str) -> tuple[int, int]:
 def _run_trust(trust_id: int, records: list[NoteRecord], method: str, base_detector) -> dict:
     """Sanitise one Trust's notes locally; return a shareable manifest + de-identified records."""
-    # Trust-local roster gazetteer layered on the shared detection engine.
-    detector = CompositeDetector(base_detector, GazetteerDetector(roster_terms(records)))
-    pipeline = Pipeline(detector=detector, vault=PseudonymVault())  # vault stays local
     entity_counts: Counter = Counter()
     deidentified: list[dict] = []

 """Simulate two NHS Trusts collaborating without sharing sensitive data.
+Each Trust holds its own patients and its own re-identification vault. It
+sanitises its notes LOCALLY and contributes only the
 de-identified text + a content-free audit manifest to a shared pool. Raw notes and
 vaults never leave the Trust. This is the sanitise-at-source gate that sits in
 front of a federated SDE / FLock.io training round.
 from collections import Counter
 from pathlib import Path
+from .data import NoteRecord, load_notes
+from .detect import build_detector
 from .evaluate import ground_truth_spans, value_variants, _find_all
 from .pipeline import Pipeline
 from .transform import PSEUDONYM, PseudonymVault
 def _run_trust(trust_id: int, records: list[NoteRecord], method: str, base_detector) -> dict:
     """Sanitise one Trust's notes locally; return a shareable manifest + de-identified records."""
+    pipeline = Pipeline(detector=base_detector, vault=PseudonymVault())  # vault stays local
     entity_counts: Counter = Counter()
     deidentified: list[dict] = []

run_eval.py CHANGED

@@ -11,8 +11,8 @@ from __future__ import annotations
 import argparse
 import json
-from noteguard.data import load_notes, roster_terms
-from noteguard.detect import CompositeDetector, GazetteerDetector, RuleDetector, build_detector
 from noteguard.evaluate import EvalResult, evaluate
 from noteguard.transform import REDACTION
@@ -55,12 +55,6 @@ def main() -> None:
         presidio = build_detector(True)
         runs["presidio+rules"] = evaluate(records, presidio, args.method)
         _print_summary(runs["presidio+rules"])
-        # Optional recall-lift layer: the Trust roster as a gazetteer. Reported
-        # separately and NOT the headline, because it's seeded from known values.
-        print("\n=== presidio+rules+roster (optional gazetteer layer) ===")
-        roster_det = CompositeDetector(presidio, GazetteerDetector(roster_terms(records)))
-        runs["presidio+rules+roster"] = evaluate(records, roster_det, args.method)
-        _print_summary(runs["presidio+rules+roster"])
     else:
         det = RuleDetector() if args.no_presidio else build_detector(True)
         res = evaluate(records, det, args.method)

 import argparse
 import json
+from noteguard.data import load_notes
+from noteguard.detect import RuleDetector, build_detector
 from noteguard.evaluate import EvalResult, evaluate
 from noteguard.transform import REDACTION
         presidio = build_detector(True)
         runs["presidio+rules"] = evaluate(records, presidio, args.method)
         _print_summary(runs["presidio+rules"])
     else:
         det = RuleDetector() if args.no_presidio else build_detector(True)
         res = evaluate(records, det, args.method)