PHI Leak Checker (SAFE / REVIEW / UNSAFE) — Synthetic

This model predicts whether a text still contains Protected Health Information (PHI) after redaction, or is ambiguous enough to need human review. It is designed as a second safety gate in privacy-first pipelines such as:

Clinical note de-identification workflows
Zero-trust logging guardrails (pre-log or post-redaction validation)
Data preparation for analytics/LLM usage (PHI risk screening)

Labels

The model outputs one of three labels:

SAFE: no PHI detected (or properly redacted)
REVIEW: ambiguous / potentially sensitive patterns remain (recommend human review or stricter redaction)
UNSAFE: PHI likely present (block/quarantine)

Tip: Treat REVIEW as “allow with flag” and UNSAFE as “quarantine/drop” in production guardrails.
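
The tip above can be sketched as a small routing table. This is a minimal illustration, not part of the model: the action names and the `route` helper are hypothetical, and it assumes the model emits the label strings SAFE, REVIEW, and UNSAFE (check the model's `id2label` config before relying on that). Unknown labels fail closed to quarantine.

```python
# Label-to-action policy taken from the tip above; action names are illustrative.
POLICY = {
    "SAFE": "allow",
    "REVIEW": "allow_with_flag",
    "UNSAFE": "quarantine",
}


def route(label: str) -> str:
    """Map a predicted label to a guardrail action.

    Any label not in POLICY is quarantined, so an unexpected label
    string never lets text through unchecked.
    """
    return POLICY.get(label.upper(), "quarantine")
```

Failing closed is a design choice for a privacy guardrail: a misconfigured label map then blocks traffic instead of silently passing PHI.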

How it works

This is a Transformers sequence-classification model trained on synthetic text that resembles clinical notes and logs.

Training data is generated by:

Creating synthetic notes/logs with inserted PHI-like fields (name, phone, email, address, IDs, dates, etc.)
Producing three variants per source:
- fully redacted → SAFE
- partially redacted / ambiguous → REVIEW
- unredacted → UNSAFE

Because each label follows directly from how the variant was constructed, this yields labeled training data without using real patient data.
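
The generation scheme described above might look roughly like the following. This is a sketch under stated assumptions, not the author's generator: the template, the value pools, the bracketed placeholder format, and the choice to leave exactly one field unmasked for REVIEW are all hypothetical.

```python
import random

# Hypothetical PHI-like values; a real generator would use much larger pools.
FIELDS = {
    "name": ["John Smith", "Maria Garcia"],
    "phone": ["555-0100", "555-0199"],
    "date": ["12/19/2025", "01/02/2024"],
}
TEMPLATE = "Patient {name} called from {phone} on {date}."


def make_variants(rng: random.Random) -> list[tuple[str, str]]:
    """Build (text, label) pairs for one synthetic source note."""
    values = {key: rng.choice(pool) for key, pool in FIELDS.items()}
    redacted = {key: f"[{key.upper()}]" for key in FIELDS}
    # Partial redaction (assumption): leave exactly one field unmasked.
    leaked = rng.choice(sorted(FIELDS))
    partial = {**redacted, leaked: values[leaked]}
    return [
        (TEMPLATE.format(**redacted), "SAFE"),
        (TEMPLATE.format(**partial), "REVIEW"),
        (TEMPLATE.format(**values), "UNSAFE"),
    ]
```

Calling `make_variants(random.Random(0))` returns three labeled texts derived from the same source note, one per label.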

Intended Use

✅ Appropriate uses

PHI leakage screening for synthetic / internal test data
Guardrails in logging pipelines (detect if PHI slipped through)
Research prototypes and privacy tooling demos

❌ Not intended for

Medical diagnosis or treatment advice
Sole control for HIPAA/GDPR compliance decisions
High-stakes decisions without additional safeguards

Limitations

Trained on synthetic text: real-world hospital data may contain formats, abbreviations, and edge cases not represented here.
May produce false positives on ID-like numbers or addresses.
Does not guarantee removal of all PHI; it should be part of a layered approach.

Recommended: Run a span-based PHI detector and deterministic redaction first, then use this leak checker as a second gate.
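
A layered setup along these lines could be wired together as below. This is only a sketch: the regex patterns stand in for a real span-based detector, and `redact`, `gate`, and the `classify` callable are hypothetical names. In practice `classify` would wrap this model; here it is a parameter so the flow can be exercised without downloading weights.

```python
import re
from typing import Callable

# Illustrative patterns only; they are a stand-in for a span-based PHI detector.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def redact(text: str) -> str:
    """First gate: deterministic redaction with fixed placeholders."""
    for tag, pattern in PATTERNS.items():
        text = pattern.sub(f"[{tag}]", text)
    return text


def gate(text: str, classify: Callable[[str], str]) -> tuple[str, str]:
    """Second gate: classify the redacted text and return (redacted, label)."""
    redacted = redact(text)
    return redacted, classify(redacted)
```

With the real model, `classify` could be something like `lambda t: clf(t)[0]["label"]`, where `clf` is the pipeline from the Usage section.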

Usage

Transformers pipeline

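from transformers import pipeline

# Load the classifier from the Hugging Face Hub (weights download on first use).
clf = pipeline("text-classification", model="bharathja/phi-leak-checker-deberta-v3")

# Unredacted example; the pipeline returns a list with one {"label", "score"} dict.
result = clf("Patient John Smith MRN 001-23-4567 visited on 12/19/2025.")
print(result)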
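
The pipeline returns a confidence score alongside each label, so one possible refinement is to escalate low-confidence SAFE predictions to REVIEW. The sketch below is an illustration only: `decide` and the `min_safe_score` threshold are hypothetical, and it again assumes the label strings SAFE, REVIEW, and UNSAFE.

```python
def decide(prediction: dict, min_safe_score: float = 0.9) -> str:
    """Return the label to act on, given one pipeline prediction dict.

    A SAFE prediction below min_safe_score is downgraded to REVIEW.
    The default threshold is a placeholder; tune it on held-out data.
    """
    label = prediction["label"]
    score = prediction["score"]
    if label == "SAFE" and score < min_safe_score:
        return "REVIEW"
    return label
```

Usage would look like `decide(clf(text)[0])`, since the pipeline returns a list with one prediction per input string.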

Model size: 82.1M params (Safetensors, tensor type F32)