yumi.h committed on
Commit 4d404c0 · Parent(s): 9db610d
Remove the roster / gazetteer feature
Drop the optional Trust-roster (gazetteer) layer entirely:
- detect.py: remove GazetteerDetector and CompositeDetector (now unused), tidy imports
- data.py: remove roster_terms()
- trust_demo.py / run_eval.py / app/streamlit_app.py: drop roster wiring; the
Try-it "Add Trust roster" checkbox and the +roster eval row are gone
- README / CLAUDE / tool_card / run-evaluation skill: remove roster references
Detection is Presidio (en_core_web_lg) + rules only. Tests pass (23), demo runs.
- .claude/skills/run-evaluation/SKILL.md +2 -4
- CLAUDE.md +2 -6
- README.md +6 -8
- app/streamlit_app.py +4 -7
- docs/tool_card.md +1 -2
- noteguard/data.py +0 -20
- noteguard/detect.py +1 -44
- noteguard/trust_demo.py +5 -7
- run_eval.py +2 -8
.claude/skills/run-evaluation/SKILL.md
CHANGED
|
@@ -16,11 +16,9 @@ The eval is the project's pass/fail signal — it proves sanitisation actually r
|
|
| 16 |
- **residual leakage** = known identifiers still present after sanitisation. This is the headline.
|
| 17 |
|
| 18 |
## How to read it
|
| 19 |
-
- `--compare` prints
|
| 20 |
-
|
| 21 |
- Watch residual leakage as the headline. If it regresses after a change to `noteguard/recognizers.py`,
|
| 22 |
`detect.py`, or `transform.py`, fix it before continuing.
|
| 23 |
-
- Keep the roster/gazetteer OUT of the headline claim — it's seeded from known values, so it's an
|
| 24 |
-
optional recall-lift layer, reported separately.
|
| 25 |
|
| 26 |
Log anything that didn't work in `experiments/FAILED.md`.
|
|
|
|
| 16 |
- **residual leakage** = known identifiers still present after sanitisation. This is the headline.
|
| 17 |
|
| 18 |
## How to read it
|
| 19 |
+
- `--compare` prints two rows: **rules** → **presidio+rules** (the shipping detector). The leakage
|
| 20 |
+
should drop sharply between them.
|
| 21 |
- Watch residual leakage as the headline. If it regresses after a change to `noteguard/recognizers.py`,
|
| 22 |
`detect.py`, or `transform.py`, fix it before continuing.
|
| 23 |
|
| 24 |
Log anything that didn't work in `experiments/FAILED.md`.
|
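The headline metric in this skill file, residual leakage, can be illustrated with a minimal sketch. It assumes leakage is the share of known identifier values that still occur verbatim (case-insensitively) in the sanitised text. The name `residual_leakage` and the substring matching rule are hypothetical and are not taken from the project's `evaluate` module.

```python
def residual_leakage(known_identifiers, sanitised_text):
    """Count known identifiers that survive sanitisation.

    Hypothetical helper: returns (leaked, total, rate), where rate is
    leaked / total, or 0.0 when there are no identifiers to check.
    """
    haystack = sanitised_text.casefold()
    # Ignore empty values so they cannot match everything.
    values = [v for v in known_identifiers if v.strip()]
    leaked = sum(1 for v in values if v.casefold() in haystack)
    total = len(values)
    rate = leaked / total if total else 0.0
    return leaked, total, rate


leaked, total, rate = residual_leakage(
    ["Alice Smith", "LS1 4AP"],
    "Patient [PERSON] lives at LS1 4AP.",
)
print(f"{leaked}/{total} identifiers leaked ({rate:.0%})")
```

A real evaluation would probably also match spelling variants of each value; this sketch only checks exact substrings.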
CLAUDE.md
CHANGED
|
@@ -9,7 +9,7 @@ data leaves a Trust. Encode Club "Trusted Data & AI Infrastructure" hackathon; f
|
|
| 9 |
python -m venv .venv; .\.venv\Scripts\Activate.ps1
|
| 10 |
pip install -r requirements.txt; python -m spacy download en_core_web_lg
|
| 11 |
|
| 12 |
-
python run_eval.py --compare --limit 300 # VERIFIABLE SIGNAL: rules vs presidio+rules
|
| 13 |
python -m noteguard.trust_demo # two NHS Trusts share only de-identified data -> data/out/
|
| 14 |
streamlit run app/streamlit_app.py # demo (Try-it / Metrics / Governance / Two-Trust)
|
| 15 |
python -m pytest tests/ -v
|
|
@@ -19,7 +19,7 @@ python -m pytest tests/ -v
|
|
| 19 |
|
| 20 |
## Architecture
|
| 21 |
- `noteguard/` — `data` (load + ground-truth join, EVAL-ONLY oracle) · `recognizers` (pure-Python
|
| 22 |
-
rules) · `detect` (Rule / Presidio
|
| 23 |
(redact | patient-consistent pseudonymise + date-shift, Faker) · `evaluate` (P/R/F1 + residual
|
| 24 |
leakage) · `pipeline` · `trust_demo`.
|
| 25 |
- `run_eval.py` CLI · `app/streamlit_app.py` demo · `tests/` mirror `noteguard/`.
|
|
@@ -33,15 +33,11 @@ python -m pytest tests/ -v
|
|
| 33 |
into prompts; point at file paths.
|
| 34 |
- The note→patient join (`noteguard/data.py` ground truth) is the EVAL-ONLY oracle. It must NEVER feed
|
| 35 |
detection/transform — that is data leakage and invalidates the metric.
|
| 36 |
-
- The roster/gazetteer is seeded from known values, so keep it OUT of the headline metric — report it
|
| 37 |
-
only as an optional recall-lift layer.
|
| 38 |
- Never silently fall back to an older/cached dataset — fail loudly.
|
| 39 |
|
| 40 |
## Decisions locked in (version 1 branch)
|
| 41 |
- **Default model: `en_core_web_lg`** — 100% name recall vs 91% for sm; clinical transformer
|
| 42 |
(`obi/deid_roberta_i2b2`) was tested and performed worse on UK names (US i2b2 training data).
|
| 43 |
-
- **Roster OFF by default** — `--roster` flag available to show the recall lift separately;
|
| 44 |
-
not the headline metric because the gazetteer is seeded from the same known values.
|
| 45 |
- **ORGANIZATION added to PresidioDetector.KEEP** — hospital names are often tagged as ORG;
|
| 46 |
excluding them was the root cause of low places recall.
|
| 47 |
- **Human-in-the-loop review queue** — spans with score in `[review_threshold, score_threshold)`
|
|
|
|
| 9 |
python -m venv .venv; .\.venv\Scripts\Activate.ps1
|
| 10 |
pip install -r requirements.txt; python -m spacy download en_core_web_lg
|
| 11 |
|
| 12 |
+
python run_eval.py --compare --limit 300 # VERIFIABLE SIGNAL: rules vs presidio+rules -> results.json
|
| 13 |
python -m noteguard.trust_demo # two NHS Trusts share only de-identified data -> data/out/
|
| 14 |
streamlit run app/streamlit_app.py # demo (Try-it / Metrics / Governance / Two-Trust)
|
| 15 |
python -m pytest tests/ -v
|
|
|
|
| 19 |
|
| 20 |
## Architecture
|
| 21 |
- `noteguard/` — `data` (load + ground-truth join, EVAL-ONLY oracle) · `recognizers` (pure-Python
|
| 22 |
+
rules) · `detect` (Rule / Presidio, graceful fallback) · `transform`
|
| 23 |
(redact | patient-consistent pseudonymise + date-shift, Faker) · `evaluate` (P/R/F1 + residual
|
| 24 |
leakage) · `pipeline` · `trust_demo`.
|
| 25 |
- `run_eval.py` CLI · `app/streamlit_app.py` demo · `tests/` mirror `noteguard/`.
|
|
|
|
| 33 |
into prompts; point at file paths.
|
| 34 |
- The note→patient join (`noteguard/data.py` ground truth) is the EVAL-ONLY oracle. It must NEVER feed
|
| 35 |
detection/transform — that is data leakage and invalidates the metric.
|
| 36 |
- Never silently fall back to an older/cached dataset — fail loudly.
|
| 37 |
|
| 38 |
## Decisions locked in (version 1 branch)
|
| 39 |
- **Default model: `en_core_web_lg`** — 100% name recall vs 91% for sm; clinical transformer
|
| 40 |
(`obi/deid_roberta_i2b2`) was tested and performed worse on UK names (US i2b2 training data).
|
| 41 |
- **ORGANIZATION added to PresidioDetector.KEEP** — hospital names are often tagged as ORG;
|
| 42 |
excluding them was the root cause of low places recall.
|
| 43 |
- **Human-in-the-loop review queue** — spans with score in `[review_threshold, score_threshold)`
|
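The review-queue decision above names a half-open score band, `[review_threshold, score_threshold)`. The sketch below shows one way such a band could be applied. It assumes spans at or above `score_threshold` are accepted automatically, spans inside the band go to a human reviewer, and lower-scoring spans are dropped. The `ScoredSpan` type, the `triage` function, and the default threshold values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredSpan:
    """Hypothetical stand-in for a detected span with a confidence score."""
    start: int
    end: int
    score: float


def triage(spans, review_threshold=0.3, score_threshold=0.5):
    """Split spans into (accepted, needs_review) by score band."""
    if review_threshold > score_threshold:
        raise ValueError("review_threshold must not exceed score_threshold")
    accepted, needs_review = [], []
    for span in spans:
        if span.score >= score_threshold:
            accepted.append(span)        # confident enough to act on
        elif span.score >= review_threshold:
            needs_review.append(span)    # in [review_threshold, score_threshold)
        # below review_threshold: treated as noise in this sketch
    return accepted, needs_review


accepted, needs_review = triage(
    [ScoredSpan(0, 5, 0.9), ScoredSpan(8, 12, 0.4), ScoredSpan(20, 24, 0.1)]
)
print(len(accepted), len(needs_review))
```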
README.md
CHANGED
|
@@ -27,8 +27,8 @@ layer** Presidio leaves to you:
|
|
| 27 |
3. **Patient-consistent, longitudinal de-identification.** Same patient → same surrogate across their
|
| 28 |
whole admission journey, with each patient's dates shifted by one consistent offset so clinical
|
| 29 |
intervals survive — *useful* data, not just safe data. Realistic en_GB fakes (or `[TYPE]` redaction).
|
| 30 |
-
4. **Pluggable + degrades gracefully.** One `Detector` interface (Rule / Presidio
|
| 31 |
-
|
| 32 |
5. **Governance wrapper.** Per-note audit of what was removed + the dataset-level leakage report,
|
| 33 |
mapped to the NHS **Five Safes**.
|
| 34 |
|
|
@@ -41,20 +41,18 @@ layer** Presidio leaves to you:
|
|
| 41 |
|---|---|---|---|
|
| 42 |
| rules only | 0.98 | 0.00 | **74.8 %** |
|
| 43 |
| **presidio + rules** (shipping) | **0.99** | **0.68** | **8.5 %** |
|
| 44 |
-
| presidio + rules + Trust roster | 0.99 | 0.73 | **0.10 %** (1 / 1027) |
|
| 45 |
|
| 46 |
-
The rules→engine
|
| 47 |
|
| 48 |
> Precision is reported against *structured* PII only, so it is a conservative lower bound — correctly
|
| 49 |
> removing a clinician's name (not in the tables) counts here as a false positive. **Recall and leakage
|
| 50 |
-
> are the sound, headline metrics.**
|
| 51 |
-
> so it's reported as an optional recall-lift layer, kept out of the headline to avoid circularity.
|
| 52 |
|
| 53 |
## Architecture
|
| 54 |
|
| 55 |
```
|
| 56 |
┌──────────────────── inside Trust A ─────────────────────┐
|
| 57 |
-
raw notes ──► │ fix mojibake ─► detect (Presidio NER + rules
|
| 58 |
(PHI) │ ─► transform (redact | pseudonymise) │ text + audit log
|
| 59 |
│ patient-consistent + date-shift, vault│ (no PHI leaves)
|
| 60 |
└─────────────────────────────────────────────────────────┘
|
|
@@ -65,7 +63,7 @@ The rules→engine→roster drop is the headline: it shows, with numbers, exactl
|
|
| 65 |
|
| 66 |
`noteguard/` — `data` (load + ground-truth join, **eval-only oracle**) · `recognizers` (pure-Python
|
| 67 |
rules: NHS checksum/context, postcode, date, phone, email, GMC/NMC/ODS, UUID) · `detect`
|
| 68 |
-
(`RuleDetector` / `PresidioDetector`
|
| 69 |
(redaction | patient-consistent pseudonymisation + date-shift, Faker vault) · `evaluate` (P/R/F1 +
|
| 70 |
residual leakage) · `pipeline` · `trust_demo`.
|
| 71 |
|
|
|
|
| 27 |
3. **Patient-consistent, longitudinal de-identification.** Same patient → same surrogate across their
|
| 28 |
whole admission journey, with each patient's dates shifted by one consistent offset so clinical
|
| 29 |
intervals survive — *useful* data, not just safe data. Realistic en_GB fakes (or `[TYPE]` redaction).
|
| 30 |
+
4. **Pluggable + degrades gracefully.** One `Detector` interface (Rule / Presidio); the pure-Python
|
| 31 |
+
rule layer + eval run even if spaCy/Presidio are unavailable.
|
| 32 |
5. **Governance wrapper.** Per-note audit of what was removed + the dataset-level leakage report,
|
| 33 |
mapped to the NHS **Five Safes**.
|
| 34 |
|
|
|
|
| 41 |
|---|---|---|---|
|
| 42 |
| rules only | 0.98 | 0.00 | **74.8 %** |
|
| 43 |
| **presidio + rules** (shipping) | **0.99** | **0.68** | **8.5 %** |
|
|
|
|
| 44 |
|
| 45 |
+
The rules→engine drop is the headline: it shows, with numbers, exactly what the NER engine buys you.
|
| 46 |
|
| 47 |
> Precision is reported against *structured* PII only, so it is a conservative lower bound — correctly
|
| 48 |
> removing a clinician's name (not in the tables) counts here as a false positive. **Recall and leakage
|
| 49 |
+
> are the sound, headline metrics.**
|
|
|
|
| 50 |
|
| 51 |
## Architecture
|
| 52 |
|
| 53 |
```
|
| 54 |
┌──────────────────── inside Trust A ─────────────────────┐
|
| 55 |
+
raw notes ──► │ fix mojibake ─► detect (Presidio NER + rules) │ ──► de-identified
|
| 56 |
(PHI) │ ─► transform (redact | pseudonymise) │ text + audit log
|
| 57 |
│ patient-consistent + date-shift, vault│ (no PHI leaves)
|
| 58 |
└─────────────────────────────────────────────────────────┘
|
|
|
|
| 63 |
|
| 64 |
`noteguard/` — `data` (load + ground-truth join, **eval-only oracle**) · `recognizers` (pure-Python
|
| 65 |
rules: NHS checksum/context, postcode, date, phone, email, GMC/NMC/ODS, UUID) · `detect`
|
| 66 |
+
(`RuleDetector` / `PresidioDetector`) · `transform`
|
| 67 |
(redaction | patient-consistent pseudonymisation + date-shift, Faker vault) · `evaluate` (P/R/F1 +
|
| 68 |
residual leakage) · `pipeline` · `trust_demo`.
|
| 69 |
|
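The README's point about patient-consistent date shifting (one offset per patient, so clinical intervals survive) can be sketched as follows. This is an illustration under stated assumptions, not the project's `transform` module: the `DateShiftVault` class, its `max_days` parameter, and the choice of a random non-zero offset are hypothetical.

```python
import random
from datetime import date, timedelta


class DateShiftVault:
    """Hypothetical vault that assigns each patient one fixed date offset."""

    def __init__(self, max_days=30, seed=None):
        self._rng = random.Random(seed)
        self._max_days = max_days
        self._offsets = {}  # person_id -> timedelta, kept local to the Trust

    def offset_for(self, person_id):
        # Draw the offset once per patient, then reuse it for every date.
        if person_id not in self._offsets:
            days = self._rng.randint(1, self._max_days) * self._rng.choice((-1, 1))
            self._offsets[person_id] = timedelta(days=days)
        return self._offsets[person_id]

    def shift(self, person_id, original):
        return original + self.offset_for(person_id)


vault = DateShiftVault(seed=0)
admitted = vault.shift("patient-1", date(2024, 1, 10))
discharged = vault.shift("patient-1", date(2024, 1, 17))
# The length of stay is unchanged because both dates moved by the same offset.
print((discharged - admitted).days)
```

Because the offset lives only in the vault, the shifted dates cannot be mapped back without access to it.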
app/streamlit_app.py
CHANGED
|
@@ -18,8 +18,8 @@ import streamlit as st
|
|
| 18 |
REPO = Path(__file__).resolve().parent.parent
|
| 19 |
sys.path.insert(0, str(REPO))
|
| 20 |
|
| 21 |
-
from noteguard.data import load_notes
|
| 22 |
-
from noteguard.detect import
|
| 23 |
from noteguard.evaluate import evaluate # noqa: E402
|
| 24 |
from noteguard.pipeline import Pipeline # noqa: E402
|
| 25 |
from noteguard.transform import PSEUDONYM, REDACTION, PseudonymVault # noqa: E402
|
|
@@ -89,7 +89,6 @@ st.caption(
|
|
| 89 |
)
|
| 90 |
|
| 91 |
detector, NOTES = load_engine()
|
| 92 |
-
ROSTER = roster_terms(NOTES) if NOTES else []
|
| 93 |
|
| 94 |
tab_try, tab_metrics, tab_gov, tab_trust = st.tabs(
|
| 95 |
["🔎 Try it", "📊 Metrics & leakage", "🏛️ Governance (Five Safes)", "🤝 Two-Trust sharing"]
|
|
@@ -106,7 +105,6 @@ with tab_try:
|
|
| 106 |
method = st.radio("Transform", [PSEUDONYM, REDACTION],
|
| 107 |
format_func=lambda m: "Pseudonymise (realistic, patient-consistent)"
|
| 108 |
if m == PSEUDONYM else "Redact ([TYPE] tags)")
|
| 109 |
-
use_roster = st.checkbox("Add Trust roster (gazetteer) — catches names NER misses", value=False)
|
| 110 |
source = st.radio("Input", ["Sample note", "Paste your own"])
|
| 111 |
with c1:
|
| 112 |
if source == "Sample note" and NOTES:
|
|
@@ -121,8 +119,7 @@ with tab_try:
|
|
| 121 |
person_id = "demo"
|
| 122 |
|
| 123 |
if text.strip():
|
| 124 |
-
|
| 125 |
-
result = Pipeline(det, PseudonymVault()).sanitise(text, method, person_id)
|
| 126 |
|
| 127 |
st.markdown("##### 1) Detected PII")
|
| 128 |
scroll_box(highlight(text, result.spans))
|
|
@@ -198,7 +195,7 @@ with tab_metrics:
|
|
| 198 |
hide_index=True, use_container_width=True,
|
| 199 |
)
|
| 200 |
st.caption(
|
| 201 |
-
f"Detector: `{name}` · model: `en_core_web_lg`
|
| 202 |
"Precision is a conservative lower bound — clinician names and unlisted locations "
|
| 203 |
"detected correctly are counted as false positives."
|
| 204 |
)
|
|
|
|
| 18 |
REPO = Path(__file__).resolve().parent.parent
|
| 19 |
sys.path.insert(0, str(REPO))
|
| 20 |
|
| 21 |
+
from noteguard.data import load_notes # noqa: E402
|
| 22 |
+
from noteguard.detect import build_detector # noqa: E402
|
| 23 |
from noteguard.evaluate import evaluate # noqa: E402
|
| 24 |
from noteguard.pipeline import Pipeline # noqa: E402
|
| 25 |
from noteguard.transform import PSEUDONYM, REDACTION, PseudonymVault # noqa: E402
|
|
|
|
| 89 |
)
|
| 90 |
|
| 91 |
detector, NOTES = load_engine()
|
|
|
|
| 92 |
|
| 93 |
tab_try, tab_metrics, tab_gov, tab_trust = st.tabs(
|
| 94 |
["🔎 Try it", "📊 Metrics & leakage", "🏛️ Governance (Five Safes)", "🤝 Two-Trust sharing"]
|
|
|
|
| 105 |
method = st.radio("Transform", [PSEUDONYM, REDACTION],
|
| 106 |
format_func=lambda m: "Pseudonymise (realistic, patient-consistent)"
|
| 107 |
if m == PSEUDONYM else "Redact ([TYPE] tags)")
|
|
|
|
| 108 |
source = st.radio("Input", ["Sample note", "Paste your own"])
|
| 109 |
with c1:
|
| 110 |
if source == "Sample note" and NOTES:
|
|
|
|
| 119 |
person_id = "demo"
|
| 120 |
|
| 121 |
if text.strip():
|
| 122 |
+
result = Pipeline(detector, PseudonymVault()).sanitise(text, method, person_id)
|
|
|
|
| 123 |
|
| 124 |
st.markdown("##### 1) Detected PII")
|
| 125 |
scroll_box(highlight(text, result.spans))
|
|
|
|
| 195 |
hide_index=True, use_container_width=True,
|
| 196 |
)
|
| 197 |
st.caption(
|
| 198 |
+
f"Detector: `{name}` · model: `en_core_web_lg` (honest generalisation). "
|
| 199 |
"Precision is a conservative lower bound — clinician names and unlisted locations "
|
| 200 |
"detected correctly are counted as false positives."
|
| 201 |
)
|
docs/tool_card.md
CHANGED
|
@@ -51,7 +51,7 @@ NoteGuard is a **de-identification gate** for free-text NHS clinical notes. It d
|
|
| 51 |
|
| 52 |
---
|
| 53 |
|
| 54 |
-
## Performance (
|
| 55 |
|
| 56 |
| Entity | Recall |
|
| 57 |
|---|---|
|
|
@@ -93,7 +93,6 @@ This matches the real NHS Information Governance workflow and makes the tool's a
|
|
| 93 |
- **Precision is a conservative lower bound**: clinician names and unlisted locations correctly detected count as false positives in the evaluation (ground truth is patient-table-only).
|
| 94 |
- **Not clinically validated**: evaluated on the `NHSEDataScience/synthetic_clinical_notes` dataset. Real deployment requires validation on representative Trust data.
|
| 95 |
- **Clinical transformer models** (e.g. `obi/deid_roberta_i2b2`) were tested and performed worse on UK names than `en_core_web_lg` (i2b2 training data is US-centric). See `experiments/FAILED.md`.
|
| 96 |
-
- **Roster / gazetteer** gives a recall lift but is seeded from known patient values — kept out of the headline metric to avoid circularity. Available as `--roster` option.
|
| 97 |
|
| 98 |
---
|
| 99 |
|
|
|
|
| 51 |
|
| 52 |
---
|
| 53 |
|
| 54 |
+
## Performance (`en_core_web_lg`)
|
| 55 |
|
| 56 |
| Entity | Recall |
|
| 57 |
|---|---|
|
|
|
|
| 93 |
- **Precision is a conservative lower bound**: clinician names and unlisted locations correctly detected count as false positives in the evaluation (ground truth is patient-table-only).
|
| 94 |
- **Not clinically validated**: evaluated on the `NHSEDataScience/synthetic_clinical_notes` dataset. Real deployment requires validation on representative Trust data.
|
| 95 |
- **Clinical transformer models** (e.g. `obi/deid_roberta_i2b2`) were tested and performed worse on UK names than `en_core_web_lg` (i2b2 training data is US-centric). See `experiments/FAILED.md`.
|
|
|
|
| 96 |
|
| 97 |
---
|
| 98 |
|
noteguard/data.py
CHANGED
|
@@ -210,26 +210,6 @@ def load_notes(limit: int | None = None, local_dir: str | None = None) -> list[N
|
|
| 210 |
return records
|
| 211 |
|
| 212 |
|
| 213 |
-
def roster_terms(records: list[NoteRecord]) -> list[tuple[str, str]]:
|
| 214 |
-
"""Build (term, entity_type) pairs for a GazetteerDetector from notes' ground truth.
|
| 215 |
-
|
| 216 |
-
This is the patient/site roster a Trust legitimately holds. Used as an OPTIONAL
|
| 217 |
-
recall-lift layer (and by the two-Trust demo) — kept out of the headline eval to
|
| 218 |
-
avoid circularity, since the gazetteer is seeded from the same known values.
|
| 219 |
-
"""
|
| 220 |
-
terms: dict[str, str] = {}
|
| 221 |
-
for rec in records:
|
| 222 |
-
for gt in rec.ground_truth:
|
| 223 |
-
if gt.entity_type not in ("PERSON", "LOCATION"):
|
| 224 |
-
continue
|
| 225 |
-
terms.setdefault(gt.text, gt.entity_type)
|
| 226 |
-
if gt.entity_type == "PERSON":
|
| 227 |
-
for tok in gt.text.replace(",", " ").split():
|
| 228 |
-
if len(tok) >= 3:
|
| 229 |
-
terms.setdefault(tok, "PERSON")
|
| 230 |
-
return list(terms.items())
|
| 231 |
-
|
| 232 |
-
|
| 233 |
if __name__ == "__main__":
|
| 234 |
recs = load_notes(limit=5)
|
| 235 |
for rec in recs:
|
|
|
|
| 210 |
return records
|
| 211 |
|
| 212 |
|
| 213 |
if __name__ == "__main__":
|
| 214 |
recs = load_notes(limit=5)
|
| 215 |
for rec in recs:
|
noteguard/detect.py
CHANGED
|
@@ -8,8 +8,7 @@ are unavailable.
|
|
| 8 |
"""
|
| 9 |
from __future__ import annotations
|
| 10 |
|
| 11 |
-
import
|
| 12 |
-
from typing import Iterable, Protocol
|
| 13 |
|
| 14 |
from .recognizers import Span, find_rule_spans
|
| 15 |
|
|
@@ -115,48 +114,6 @@ class PresidioDetector:
|
|
| 115 |
return _merge(spans)
|
| 116 |
|
| 117 |
|
| 118 |
-
class GazetteerDetector:
|
| 119 |
-
"""Match a known list of names/sites (the roster a trust actually holds).
|
| 120 |
-
|
| 121 |
-
Catches identifiers the NER model misses (rare names, typo'd surnames) using
|
| 122 |
-
whole-word, case-insensitive matching. Used as an optional layer to show the
|
| 123 |
-
recall lift — not part of the headline eval, to avoid circularity.
|
| 124 |
-
"""
|
| 125 |
-
|
| 126 |
-
name = "gazetteer"
|
| 127 |
-
|
| 128 |
-
def __init__(self, terms: Iterable[tuple[str, str]], min_len: int = 3):
|
| 129 |
-
self._patterns: list[tuple[re.Pattern, str]] = []
|
| 130 |
-
seen: set[str] = set()
|
| 131 |
-
for term, etype in terms:
|
| 132 |
-
term = (term or "").strip()
|
| 133 |
-
if len(term) < min_len or term.lower() in seen:
|
| 134 |
-
continue
|
| 135 |
-
seen.add(term.lower())
|
| 136 |
-
self._patterns.append(
|
| 137 |
-
(re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE), etype)
|
| 138 |
-
)
|
| 139 |
-
|
| 140 |
-
def detect(self, text: str) -> list[Span]:
|
| 141 |
-
spans: list[Span] = []
|
| 142 |
-
for pat, etype in self._patterns:
|
| 143 |
-
for m in pat.finditer(text):
|
| 144 |
-
spans.append(Span(m.start(), m.end(), etype, m.group(), 0.9))
|
| 145 |
-
return spans
|
| 146 |
-
|
| 147 |
-
|
| 148 |
-
class CompositeDetector:
|
| 149 |
-
def __init__(self, *detectors: Detector):
|
| 150 |
-
self.detectors = detectors
|
| 151 |
-
self.name = "+".join(getattr(d, "name", "?") for d in detectors)
|
| 152 |
-
|
| 153 |
-
def detect(self, text: str) -> list[Span]:
|
| 154 |
-
spans: list[Span] = []
|
| 155 |
-
for d in self.detectors:
|
| 156 |
-
spans += d.detect(text)
|
| 157 |
-
return _merge(spans)
|
| 158 |
-
|
| 159 |
-
|
| 160 |
def _merge(spans: list[Span]) -> list[Span]:
|
| 161 |
"""Sort, then drop spans fully contained in a longer span (keep highest score)."""
|
| 162 |
spans = sorted(spans, key=lambda s: (s.start, -(s.end - s.start), -s.score))
|
|
|
|
| 8 |
"""
|
| 9 |
from __future__ import annotations
|
| 10 |
|
| 11 |
+
from typing import Protocol
|
|
|
|
| 12 |
|
| 13 |
from .recognizers import Span, find_rule_spans
|
| 14 |
|
|
|
|
| 114 |
return _merge(spans)
|
| 115 |
|
| 116 |
|
| 117 |
def _merge(spans: list[Span]) -> list[Span]:
|
| 118 |
"""Sort, then drop spans fully contained in a longer span (keep highest score)."""
|
| 119 |
spans = sorted(spans, key=lambda s: (s.start, -(s.end - s.start), -s.score))
|
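The `_merge` docstring and sort key shown in this hunk describe a containment-based de-duplication. The sketch below reuses that sort key and fills in the remainder of the function as an assumption, since the rest of the body is not shown in the diff. The `Span` field names `entity_type` and `text` are guesses based on the constructor call in the removed `GazetteerDetector`, and `merge_spans` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Stand-in for the project's Span; entity_type/text names are assumed."""
    start: int
    end: int
    entity_type: str
    text: str
    score: float


def merge_spans(spans):
    """Sort, then drop spans fully contained in an already-kept span."""
    # Earliest start first; for equal starts the longest span comes first,
    # and for equal extents the highest score comes first.
    ordered = sorted(spans, key=lambda s: (s.start, -(s.end - s.start), -s.score))
    kept = []
    for span in ordered:
        if any(k.start <= span.start and span.end <= k.end for k in kept):
            continue  # covered by a longer (or equal, higher-scoring) span
        kept.append(span)
    return kept


merged = merge_spans([
    Span(5, 10, "PERSON", "Smith", 0.9),
    Span(0, 10, "PERSON", "John Smith", 0.6),
    Span(0, 10, "PERSON", "John Smith", 0.85),
    Span(12, 20, "LOCATION", "Leeds GH", 0.7),
])
print([(s.start, s.end, s.score) for s in merged])
```

Partially overlapping spans are both kept in this sketch; only full containment causes a drop.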
noteguard/trust_demo.py
CHANGED
|
@@ -1,7 +1,7 @@
|
|
| 1 |
"""Simulate two NHS Trusts collaborating without sharing sensitive data.
|
| 2 |
|
| 3 |
-
Each Trust holds its own patients
|
| 4 |
-
|
| 5 |
de-identified text + a content-free audit manifest to a shared pool. Raw notes and
|
| 6 |
vaults never leave the Trust. This is the sanitise-at-source gate that sits in
|
| 7 |
front of a federated SDE / FLock.io training round.
|
|
@@ -16,8 +16,8 @@ import sys
|
|
| 16 |
from collections import Counter
|
| 17 |
from pathlib import Path
|
| 18 |
|
| 19 |
-
from .data import NoteRecord, load_notes
|
| 20 |
-
from .detect import
|
| 21 |
from .evaluate import ground_truth_spans, value_variants, _find_all
|
| 22 |
from .pipeline import Pipeline
|
| 23 |
from .transform import PSEUDONYM, PseudonymVault
|
|
@@ -43,9 +43,7 @@ def _residual_leaks(rec: NoteRecord, sanitised: str) -> tuple[int, int]:
|
|
| 43 |
|
| 44 |
def _run_trust(trust_id: int, records: list[NoteRecord], method: str, base_detector) -> dict:
|
| 45 |
"""Sanitise one Trust's notes locally; return a shareable manifest + de-identified records."""
|
| 46 |
-
|
| 47 |
-
detector = CompositeDetector(base_detector, GazetteerDetector(roster_terms(records)))
|
| 48 |
-
pipeline = Pipeline(detector=detector, vault=PseudonymVault()) # vault stays local
|
| 49 |
|
| 50 |
entity_counts: Counter = Counter()
|
| 51 |
deidentified: list[dict] = []
|
|
|
|
| 1 |
"""Simulate two NHS Trusts collaborating without sharing sensitive data.
|
| 2 |
|
| 3 |
+
Each Trust holds its own patients and its own re-identification vault. It
|
| 4 |
+
sanitises its notes LOCALLY and contributes only the
|
| 5 |
de-identified text + a content-free audit manifest to a shared pool. Raw notes and
|
| 6 |
vaults never leave the Trust. This is the sanitise-at-source gate that sits in
|
| 7 |
front of a federated SDE / FLock.io training round.
|
|
|
|
| 16 |
from collections import Counter
|
| 17 |
from pathlib import Path
|
| 18 |
|
| 19 |
+
from .data import NoteRecord, load_notes
|
| 20 |
+
from .detect import build_detector
|
| 21 |
from .evaluate import ground_truth_spans, value_variants, _find_all
|
| 22 |
from .pipeline import Pipeline
|
| 23 |
from .transform import PSEUDONYM, PseudonymVault
|
|
|
|
| 43 |
|
| 44 |
def _run_trust(trust_id: int, records: list[NoteRecord], method: str, base_detector) -> dict:
|
| 45 |
"""Sanitise one Trust's notes locally; return a shareable manifest + de-identified records."""
|
| 46 |
+
pipeline = Pipeline(detector=base_detector, vault=PseudonymVault()) # vault stays local
|
| 47 |
|
| 48 |
entity_counts: Counter = Counter()
|
| 49 |
deidentified: list[dict] = []
|
run_eval.py
CHANGED
|
@@ -11,8 +11,8 @@ from __future__ import annotations
|
|
| 11 |
import argparse
|
| 12 |
import json
|
| 13 |
|
| 14 |
-
from noteguard.data import load_notes
|
| 15 |
-
from noteguard.detect import
|
| 16 |
from noteguard.evaluate import EvalResult, evaluate
|
| 17 |
from noteguard.transform import REDACTION
|
| 18 |
|
|
@@ -55,12 +55,6 @@ def main() -> None:
|
|
| 55 |
presidio = build_detector(True)
|
| 56 |
runs["presidio+rules"] = evaluate(records, presidio, args.method)
|
| 57 |
_print_summary(runs["presidio+rules"])
|
| 58 |
-
# Optional recall-lift layer: the Trust roster as a gazetteer. Reported
|
| 59 |
-
# separately and NOT the headline, because it's seeded from known values.
|
| 60 |
-
print("\n=== presidio+rules+roster (optional gazetteer layer) ===")
|
| 61 |
-
roster_det = CompositeDetector(presidio, GazetteerDetector(roster_terms(records)))
|
| 62 |
-
runs["presidio+rules+roster"] = evaluate(records, roster_det, args.method)
|
| 63 |
-
_print_summary(runs["presidio+rules+roster"])
|
| 64 |
else:
|
| 65 |
det = RuleDetector() if args.no_presidio else build_detector(True)
|
| 66 |
res = evaluate(records, det, args.method)
|
|
|
|
| 11 |
import argparse
|
| 12 |
import json
|
| 13 |
|
| 14 |
+
from noteguard.data import load_notes
|
| 15 |
+
from noteguard.detect import RuleDetector, build_detector
|
| 16 |
from noteguard.evaluate import EvalResult, evaluate
|
| 17 |
from noteguard.transform import REDACTION
|
| 18 |
|
|
|
|
| 55 |
presidio = build_detector(True)
|
| 56 |
runs["presidio+rules"] = evaluate(records, presidio, args.method)
|
| 57 |
_print_summary(runs["presidio+rules"])
|
| 58 |
else:
|
| 59 |
det = RuleDetector() if args.no_presidio else build_detector(True)
|
| 60 |
res = evaluate(records, det, args.method)
|