---
language: en
license: apache-2.0
tags:
- token-classification
- ner
- hipaa
- phi
- healthcare
- privacy
- distilbert
datasets:
- custom
pipeline_tag: token-classification
---

# HIPAA PHI Detector (DistilBERT)

A fine-tuned DistilBERT model for detecting Protected Health Information (PHI) in text, covering all 18 HIPAA Safe Harbor categories.

## Model Details

- **Architecture**: DistilBERT (66M params) with token classification head
- **Training**: Fine-tuned on 5,000+ synthetic HIPAA examples
- **Labels**: 37 BIO labels (18 entity types × 2 + O)
- **Framework**: PyTorch / HuggingFace Transformers

## Supported Entity Types

| Label | HIPAA Category |
|-------|---------------|
| NAME | Names |
| LOCATION | Geographic subdivisions |
| DATE | Dates |
| PHONE | Phone numbers |
| FAX | Fax numbers |
| EMAIL | Email addresses |
| SSN | Social Security numbers |
| MRN | Medical record numbers |
| HEALTH_PLAN | Health plan beneficiary numbers |
| ACCOUNT | Account numbers |
| LICENSE | Certificate/license numbers |
| VEHICLE | Vehicle identifiers |
| DEVICE | Device identifiers |
| URL | Web URLs |
| IP | IP addresses |
| BIOMETRIC | Biometric identifiers |
| PHOTO | Photographic images |
| OTHER | Any other unique identifying number |

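The 37-label count follows from applying the BIO scheme to the 18 entity types: one `B-` (begin) and one `I-` (inside) tag per type, plus a single `O` tag for non-PHI tokens. The sketch below rebuilds that label set from the type names listed in this card. The `B-`/`I-` prefix spelling, the label order, and the helper name `build_bio_labels` are assumptions for illustration; the `id2label` mapping in the checkpoint's config is the authoritative source.

```python
# Entity type names as listed in this model card.
ENTITY_TYPES = [
    "NAME", "LOCATION", "DATE", "PHONE", "FAX", "EMAIL",
    "SSN", "MRN", "HEALTH_PLAN", "ACCOUNT", "LICENSE", "VEHICLE",
    "DEVICE", "URL", "IP", "BIOMETRIC", "PHOTO", "OTHER",
]


def build_bio_labels(entity_types):
    """Return ["O", "B-X", "I-X", ...] for the given entity types."""
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")  # first token of a span
        labels.append(f"I-{entity}")  # continuation tokens of a span
    return labels


LABELS = build_bio_labels(ENTITY_TYPES)

# Hypothetical ordering: the released checkpoint may number its labels differently.
LABEL2ID = {label: i for i, label in enumerate(LABELS)}
ID2LABEL = {i: label for label, i in LABEL2ID.items()}

print(len(LABELS))  # 18 * 2 + 1 = 37
```
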
## Usage

```python
from transformers import pipeline

pipe = pipeline("token-classification", model="mkocher/hipaa-phi-detector", aggregation_strategy="simple")
results = pipe("Patient John Smith, SSN 123-45-6789")
```

Or with the `aare-core` package:

```python
from aare import HIPAAGuardrail

guardrail = HIPAAGuardrail()
result = guardrail.check("Patient John Smith, SSN 123-45-6789")
if result.blocked:
    print(f"PHI detected: {result.violations}")
```

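With `aggregation_strategy="simple"`, the `transformers` pipeline returns a list of dicts that include `entity_group`, `start`, and `end` character offsets. One possible way to use that output for redaction is sketched below. The `redact` helper and its placeholder format are hypothetical, and the sample entities are hand-written stand-ins with manually computed offsets, not actual model output.

```python
def redact(text, entities, placeholder="[{label}]"):
    """Replace each detected span in `text` with a bracketed label."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        tag = placeholder.format(label=ent["entity_group"])
        text = text[:ent["start"]] + tag + text[ent["end"]:]
    return text


# Illustrative input only: in practice, pass the list returned by `pipe(text)`.
sample_text = "Patient John Smith, SSN 123-45-6789"
sample_entities = [
    {"entity_group": "NAME", "start": 8, "end": 18},
    {"entity_group": "SSN", "start": 24, "end": 35},
]

print(redact(sample_text, sample_entities))
# Patient [NAME], SSN [SSN]
```

Replacing spans from the end of the string backwards avoids having to recompute offsets after each substitution. The sketch assumes the spans do not overlap.
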
## License

Apache 2.0