PHI Leak Checker (SAFE / REVIEW / UNSAFE) — Synthetic

This model predicts whether a text still contains Protected Health Information (PHI) after redaction, or whether it is ambiguous enough to need human review. It is designed as a second safety gate in privacy-first pipelines such as:

  • Clinical note de-identification workflows
  • Zero-trust logging guardrails (pre-log or post-redaction validation)
  • Data preparation for analytics/LLM usage (PHI risk screening)

Labels

The model outputs one of three labels:

  • SAFE: no PHI detected (or properly redacted)
  • REVIEW: ambiguous / potentially sensitive patterns remain (recommend human review or stricter redaction)
  • UNSAFE: PHI likely present (block/quarantine)

Tip: Treat REVIEW as “allow with flag” and UNSAFE as “quarantine/drop” in production guardrails.
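
The routing described in the tip can be sketched as a small lookup. This is a minimal illustration, not part of the model: the function name `route` and the action strings are hypothetical, and treating unknown labels as quarantine (fail closed) is an assumption.

```python
from typing import Dict

# Hypothetical mapping from model label to guardrail action (names are illustrative).
ACTIONS: Dict[str, str] = {
    "SAFE": "allow",
    "REVIEW": "allow_with_flag",
    "UNSAFE": "quarantine",
}


def route(label: str) -> str:
    """Return the action for a predicted label; unknown labels fail closed."""
    return ACTIONS.get(label.strip().upper(), "quarantine")


for label in ("SAFE", "REVIEW", "UNSAFE", "something-else"):
    print(label, "->", route(label))
```

Failing closed means a renamed or unexpected label never results in text being logged unchecked.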


How it works

This is a sequence classification model (Transformers) trained on synthetic clinical-note-like and log-like text examples.

Training data is generated by:

  1. Creating synthetic notes/logs with inserted PHI-like fields (name, phone, email, address, IDs, dates, etc.)
  2. Producing three variants per source:
    • fully redacted → SAFE
    • partially redacted / ambiguous → REVIEW
    • unredacted → UNSAFE

This yields explicitly labeled examples for all three classes without using real patient data.
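
The generation steps above might look roughly like the sketch below. The template, the fake field values, and the helper `make_variants` are all hypothetical; the actual generator and its redaction format are not described in this card.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical template and PHI-like values; none of these come from the real training set.
TEMPLATE = "Patient {name} (phone {phone}, email {email}) was seen on {date}."
FAKE_PHI: Dict[str, List[str]] = {
    "name": ["Jane Doe", "Alex Roe"],
    "phone": ["555-0100", "555-0199"],
    "email": ["jane@example.com", "alex@example.com"],
    "date": ["01/02/2025", "03/04/2025"],
}


def make_variants(rng: random.Random) -> List[Tuple[str, str]]:
    """Build (text, label) pairs: fully redacted, partially redacted, unredacted."""
    fields = {key: rng.choice(values) for key, values in FAKE_PHI.items()}
    placeholders = {key: f"[{key.upper()}]" for key in fields}

    # Partially redacted: mask every field except one randomly chosen leak.
    leaked = rng.choice(sorted(fields))
    partial = {**placeholders, leaked: fields[leaked]}

    return [
        (TEMPLATE.format(**placeholders), "SAFE"),
        (TEMPLATE.format(**partial), "REVIEW"),
        (TEMPLATE.format(**fields), "UNSAFE"),
    ]


for text, label in make_variants(random.Random(0)):
    print(label, "|", text)
```

Each source note contributes one example per class, so the resulting dataset is balanced by construction.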


Intended Use

Appropriate uses

  • PHI leakage screening for synthetic / internal test data
  • Guardrails in logging pipelines (detect if PHI slipped through)
  • Research prototypes and privacy tooling demos

Not intended for

  • Medical diagnosis or treatment advice
  • Sole control for HIPAA/GDPR compliance decisions
  • High-stakes decisions without additional safeguards

Limitations

  • Trained on synthetic text: real-world hospital data may contain formats, abbreviations, and edge cases not represented here.
  • May produce false positives on ID-like numbers or addresses.
  • Does not guarantee removal of all PHI; it should be part of a layered approach.

Recommended: Run a span-based PHI detector and deterministic redaction first, then use this leak checker as a second gate.
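
A layered setup of this kind could be wired together as in the sketch below. The regex patterns, the `redact` and `gate` helpers, and `stub_checker` are hypothetical stand-ins: a real deployment would use a proper span-based detector, and the stub only imitates the model so that the example runs without downloading anything.

```python
import re
from typing import Callable, Tuple

# Hypothetical deterministic patterns; a real span-based detector would cover far more.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace every pattern match with a bracketed placeholder."""
    for name, pattern in PATTERNS.items():
        text = pattern.sub(f"[{name}]", text)
    return text


def gate(text: str, leak_checker: Callable[[str], str]) -> Tuple[str, str]:
    """Redact first, then ask the leak checker to label the redacted text."""
    cleaned = redact(text)
    return cleaned, leak_checker(cleaned)


def stub_checker(text: str) -> str:
    """Stand-in for the model: flags any remaining run of six or more digits."""
    return "REVIEW" if re.search(r"\d{6,}", text) else "SAFE"


print(gate("Call 555-123-4567 or mail a.b@example.org", stub_checker))
print(gate("Account 12345678 was updated", stub_checker))
```

To use the actual model, `leak_checker` could be a callable that wraps the Transformers pipeline and returns the predicted label string, assuming the model's labels are named SAFE, REVIEW, and UNSAFE.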


Usage

Transformers pipeline

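from transformers import pipeline

# Load the classifier from the Hugging Face Hub.
clf = pipeline("text-classification", model="bharathja/phi-leak-checker-deberta-v3")

# The pipeline returns a list with one dict per input, holding the predicted label and its score.
result = clf("Patient John Smith MRN 001-23-4567 visited on 12/19/2025.")
print(result)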