PHI Leak Checker (SAFE / REVIEW / UNSAFE) — Synthetic
This model classifies whether a text still contains Protected Health Information (PHI) after redaction, or is ambiguous enough to need human review. It is designed as a second safety gate in privacy-first pipelines such as:
- Clinical note de-identification workflows
- Zero-trust logging guardrails (pre-log or post-redaction validation)
- Data preparation for analytics/LLM usage (PHI risk screening)
Labels
The model outputs one of three labels:
- SAFE: no PHI detected (or properly redacted)
- REVIEW: ambiguous / potentially sensitive patterns remain (recommend human review or stricter redaction)
- UNSAFE: PHI likely present (block/quarantine)
Tip: Treat REVIEW as “allow with flag” and UNSAFE as “quarantine/drop” in production guardrails.
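The tip above could be wired into a guardrail roughly as follows. This is a minimal sketch, not part of the model: the `route` function and the action names are hypothetical, and it assumes the classifier emits the label strings `SAFE`, `REVIEW`, and `UNSAFE` verbatim (worth confirming against the model's `id2label` config). Unrecognized labels fall back to quarantine so the gate fails closed.

```python
# Hypothetical routing table: the action names are illustrative only.
ACTIONS = {
    "SAFE": "allow",
    "REVIEW": "allow_with_flag",
    "UNSAFE": "quarantine",
}

def route(prediction: dict) -> str:
    """Map one prediction dict ({"label": ..., "score": ...}) to an action.

    Unrecognized or missing labels map to "quarantine" (fail closed).
    """
    return ACTIONS.get(prediction.get("label"), "quarantine")

print(route({"label": "REVIEW", "score": 0.81}))
```

Failing closed is a design choice: a mislabeled or unexpected output is treated like the worst case rather than silently allowed.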
How it works
This is a sequence classification model (Transformers) trained on synthetic clinical-note-like and log-like text examples.
Training data is generated by:
- Creating synthetic notes/logs with inserted PHI-like fields (name, phone, email, address, IDs, dates, etc.)
- Producing three variants per source:
  - fully redacted → SAFE
  - partially redacted / ambiguous → REVIEW
  - unredacted → UNSAFE
Because each variant's label follows directly from how it was generated, every example is labeled by construction, and no real patient data is needed.
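The generation steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the actual training script: the template, the placeholder tags (`[NAME]`, `[PHONE]`), the sample values, and the `make_variants` helper are all hypothetical, and the real rule for building REVIEW cases is not specified here (this sketch simply leaves one field unredacted).

```python
import random

# Invented PHI-like values for illustration; none refer to real people.
NAMES = ["Jane Doe", "Alex Kim"]
PHONES = ["555-0101", "555-0142"]

TEMPLATE = "Patient {name} called from {phone} about a refill."

def make_variants(rng: random.Random) -> list[tuple[str, str]]:
    """Return three (text, label) pairs built from one synthetic source note."""
    name, phone = rng.choice(NAMES), rng.choice(PHONES)
    unredacted = TEMPLATE.format(name=name, phone=phone)
    # Assumption: "partially redacted" means one PHI field is left in place.
    partial = TEMPLATE.format(name="[NAME]", phone=phone)
    redacted = TEMPLATE.format(name="[NAME]", phone="[PHONE]")
    return [(redacted, "SAFE"), (partial, "REVIEW"), (unredacted, "UNSAFE")]

for text, label in make_variants(random.Random(0)):
    print(label, "|", text)
```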
Intended Use
✅ Appropriate uses
- PHI leakage screening for synthetic / internal test data
- Guardrails in logging pipelines (detect if PHI slipped through)
- Research prototypes and privacy tooling demos
❌ Not intended for
- Medical diagnosis or treatment advice
- Sole control for HIPAA/GDPR compliance decisions
- High-stakes decisions without additional safeguards
Limitations
- Trained on synthetic text: real-world hospital data may contain formats, abbreviations, and edge cases not represented here.
- May produce false positives on ID-like numbers or addresses.
- Does not guarantee removal of all PHI; it should be part of a layered approach.
Recommended: Combine with a span-based PHI detector + deterministic redaction + this leak-checker as a second gate.
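A layered setup along these lines might look like the sketch below. Everything in it is an assumption made for illustration: the two regexes are toy stand-ins for a real span-based PHI detector, `redact` and `gate` are hypothetical helpers, and `stub_classify` replaces the actual model call so the example runs offline. In practice the `classify` argument would wrap the Transformers pipeline.

```python
import re
from typing import Callable

# Toy stand-in for a span-based detector: two regexes, NOT real PHI coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Deterministically replace every matched span with its tag."""
    for tag, pattern in PATTERNS.items():
        text = pattern.sub(f"[{tag}]", text)
    return text

def gate(text: str, classify: Callable[[str], str]) -> tuple[str, str]:
    """Redact first, then ask the leak checker for a verdict on the result."""
    cleaned = redact(text)
    return cleaned, classify(cleaned)

# Stub classifier for illustration only; swap in the real model in practice.
def stub_classify(text: str) -> str:
    return "UNSAFE" if "@" in text else "SAFE"

print(gate("Contact jane@example.org or 555-123-4567.", stub_classify))
```

Passing the classifier in as a callable keeps the redaction step testable on its own and makes the second gate easy to replace.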
Usage
Transformers pipeline
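from transformers import pipeline
# Load the classifier from the Hub (weights download on first use).
clf = pipeline("text-classification", model="bharathja/phi-leak-checker-deberta-v3")
# Returns a list with one {"label": ..., "score": ...} dict per input.
print(clf("Patient John Smith MRN 001-23-4567 visited on 12/19/2025."))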