HIPAA PHI Detector (DistilBERT)

A fine-tuned DistilBERT model for detecting Protected Health Information (PHI) in text, covering all 18 HIPAA Safe Harbor categories.

Model Details

  • Architecture: DistilBERT (66M params) with token classification head
  • Training: Fine-tuned on 5,000+ synthetic HIPAA examples
  • Labels: 37 BIO labels (18 entity types × 2, plus O)
  • Framework: PyTorch / HuggingFace Transformers
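The label count above can be reproduced with a short sketch. It assumes the conventional `B-`/`I-` prefixes and uses the entity labels listed under Supported Entity Types. The ordering of ids here is an assumption for illustration; the authoritative mapping is whatever `id2label` the model's own config ships with.

```python
# Entity labels as listed under "Supported Entity Types".
ENTITY_TYPES = [
    "NAME", "LOCATION", "DATE", "PHONE", "FAX", "EMAIL", "SSN", "MRN",
    "HEALTH_PLAN", "ACCOUNT", "LICENSE", "VEHICLE", "DEVICE", "URL", "IP",
    "BIOMETRIC", "PHOTO", "OTHER",
]


def build_bio_labels(entity_types):
    """Return ["O", "B-X", "I-X", ...] for the given entity types."""
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")  # first token of an entity
        labels.append(f"I-{entity}")  # continuation token of the same entity
    return labels


labels = build_bio_labels(ENTITY_TYPES)

# Assumed ordering for illustration only; the real mapping is in the model config.
id2label = dict(enumerate(labels))
label2id = {label: idx for idx, label in id2label.items()}

print(len(labels))  # 18 * 2 + 1 = 37
```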

Supported Entity Types

Label        HIPAA Category
NAME         Names
LOCATION     Geographic subdivisions
DATE         Dates
PHONE        Phone numbers
FAX          Fax numbers
EMAIL        Email addresses
SSN          Social Security numbers
MRN          Medical record numbers
HEALTH_PLAN  Health plan beneficiary numbers
ACCOUNT      Account numbers
LICENSE      Certificate/license numbers
VEHICLE      Vehicle identifiers
DEVICE       Device identifiers
URL          Web URLs
IP           IP addresses
BIOMETRIC    Biometric identifiers
PHOTO        Photographic images
OTHER        Any other unique identifying number
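At the token level, each of these labels shows up as a `B-` (begin) or `I-` (inside) tag. The sketch below shows one common way to collapse a tagged token sequence into entity spans. The tokens and tags are hand-written for illustration rather than real model output, the `B-`/`I-` prefix format is assumed, and `bio_to_spans` is a hypothetical helper, not part of the model or of Transformers.

```python
def bio_to_spans(tokens, tags):
    """Merge per-token BIO tags into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []

    def flush():
        # Close the open span, if any, and record it.
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span:
            # close the span and drop the stray token.
            flush()
            current_type, current_tokens = None, []
    flush()
    return spans


# Hand-written example, not model output.
tokens = ["Patient", "John", "Smith", ",", "SSN", "123-45-6789"]
tags = ["O", "B-NAME", "I-NAME", "O", "O", "B-SSN"]
print(bio_to_spans(tokens, tags))
```

Dropping stray `I-` tags is one design choice among several; some decoders instead treat an orphaned `I-` tag as the start of a new span.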

Usage

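from transformers import pipeline

# aggregation_strategy="simple" merges B-/I- token predictions into whole entity spans.
pipe = pipeline("token-classification", model="mkocher/hipaa-phi-detector", aggregation_strategy="simple")
results = pipe("Patient John Smith, SSN 123-45-6789")
# Each result is a dict with entity_group, score, word, start, and end.
for entity in results:
    print(entity["entity_group"], entity["word"], entity["start"], entity["end"])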
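With an aggregation strategy set, the pipeline returns a list of dicts that include `entity_group`, `start`, and `end` character offsets. A minimal sketch of using those offsets to redact the input follows. The `redact` helper is hypothetical, and the entity list is hand-written in the pipeline's output shape so the example runs without downloading the model; actual predictions may differ.

```python
def redact(text, entities):
    """Replace each detected span with a [LABEL] placeholder."""
    # Work from the end of the string so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        text = text[:ent["start"]] + placeholder + text[ent["end"]:]
    return text


text = "Patient John Smith, SSN 123-45-6789"

# Hand-written entities in the pipeline's output shape, not real model output.
example_entities = [
    {"entity_group": "NAME", "start": 8, "end": 18},
    {"entity_group": "SSN", "start": 24, "end": 35},
]

print(redact(text, example_entities))  # Patient [NAME], SSN [SSN]
```

In practice, `example_entities` would be replaced by the `results` list returned by the pipeline call.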

Or with the aare-core package:

from aare import HIPAAGuardrail

guardrail = HIPAAGuardrail()
result = guardrail.check("Patient John Smith, SSN 123-45-6789")
if result.blocked:
    print(f"PHI detected: {result.violations}")

License

Apache 2.0
