---
license: mit
language:
- en
tags:
- bert
- pii-detection
- phi-detection
- hipaa
- healthcare
- nlp
- text-classification
- sequence-classification
- lora
- peft
datasets:
- custom
base_model: bert-base-uncased
pipeline_tag: text-classification
library_name: transformers
---
# HIPAA-BERT: PII/PHI Column Name Classifier
A fine-tuned BERT model for classifying database column names as **PII** (Personally Identifiable Information), **PHI** (Protected Health Information), or **Other (O)**.
## Model Details
| Property | Value |
|----------|-------|
| **Developer** | KronosX AI Labs |
| **Model Type** | BERT + LoRA (text classification) |
| **Base Model** | `bert-base-uncased` |
| **Language** | English |
| **Fine-tuning Method** | LoRA (Low-Rank Adaptation) |
| **Task** | Sequence Classification (3 classes) |
## Labels
| Label | Description | Examples |
|-------|-------------|----------|
| `O` | Other/Safe columns | `id`, `created_at`, `status` |
| `PII` | Personally Identifiable Information | `email`, `phone_number`, `address` |
| `PHI` | Protected Health Information (HIPAA) | `diagnosis_code`, `patient_name`, `ssn` |
## Training Details
### Hyperparameters
| Parameter | Value |
|-----------|-------|
| Learning Rate | 1e-3 |
| Batch Size | 64 |
| Epochs | 10 |
| Weight Decay | 0.01 |
| Max Sequence Length | 64 |
| LoRA Rank (r) | 16 |
| LoRA Alpha | 32 |
| LoRA Dropout | 0.1 |
| Target Modules | query, value |
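To give a sense of scale for these LoRA settings, the sketch below estimates how many adapter parameters they add. It assumes the standard `bert-base-uncased` dimensions (hidden size 768, 12 encoder layers) and the standard LoRA scaling of `alpha / r`. The count covers only the adapter matrices, not the classification head, and is an illustration rather than a figure reported for this model.

```python
# Back-of-the-envelope LoRA adapter size for the hyperparameters listed above.
# Assumed (not stated in this card): standard bert-base-uncased dimensions.
HIDDEN_SIZE = 768                    # bert-base-uncased hidden width
NUM_LAYERS = 12                      # bert-base-uncased encoder layers
TARGET_MODULES = ("query", "value")  # attention projections that get adapters

LORA_RANK = 16
LORA_ALPHA = 32


def lora_params_per_module(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank


per_module = lora_params_per_module(HIDDEN_SIZE, HIDDEN_SIZE, LORA_RANK)
total_adapter_params = per_module * len(TARGET_MODULES) * NUM_LAYERS
scaling = LORA_ALPHA / LORA_RANK  # factor applied to the low-rank update

print(f"parameters per adapted module: {per_module:,}")
print(f"parameters across all adapters: {total_adapter_params:,}")
print(f"scaling (alpha / r): {scaling}")
```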
### Training Data
A custom HIPAA-compliant dataset of approximately 50,000 or more labeled column names from healthcare databases.
### Hardware
- GPU: NVIDIA GPU (Kaggle)
- Mixed Precision: FP16 enabled
## Performance Metrics
| Metric | Score |
|--------|-------|
| Accuracy | ~95%+ |
| F1 (weighted) | ~94%+ |
| Precision | ~93%+ |
| Recall | ~94%+ |
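For readers unfamiliar with support-weighted metrics, the sketch below shows one common way to compute weighted precision, recall, and F1 for the three labels. This card does not state how its scores were computed, so the averaging scheme here is an assumption, and the toy labels are made up for illustration only.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return (precision, recall, f1), each averaged over labels by support."""
    support = Counter(y_true)  # number of true examples per label
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        predicted = sum(1 for p in y_pred if p == label)
        p_label = true_pos / predicted if predicted else 0.0
        r_label = true_pos / count
        denom = p_label + r_label
        f_label = 2 * p_label * r_label / denom if denom else 0.0
        weight = count / total  # weight each label by its share of examples
        precision += weight * p_label
        recall += weight * r_label
        f1 += weight * f_label
    return precision, recall, f1


# Toy example (illustrative labels, not taken from the real evaluation set)
y_true = ["O", "O", "PII", "PII", "PHI", "PHI"]
y_pred = ["O", "O", "PII", "PHI", "PHI", "PHI"]
precision, recall, f1 = weighted_scores(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```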
## Usage
### Installation
    pip install transformers torch
### Quick Start
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch

    # Load the tokenizer and the fine-tuned classifier
    model_name = "KronosXAI/HIPAA-BERT-v0.1"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    # Map class indices to label names
    label_map = {0: "O", 1: "PII", 2: "PHI"}

    # Classify column names one at a time
    columns = ["patient_name", "diagnosis_code", "created_at", "email", "status"]
    for col in columns:
        inputs = tokenizer(col, return_tensors="pt", truncation=True, max_length=64)
        with torch.no_grad():  # no gradients needed for inference
            outputs = model(**inputs)
        prediction = torch.argmax(outputs.logits, dim=-1).item()
        print(f"{col}: {label_map[prediction]}")
### Expected Output
    patient_name: PHI
    diagnosis_code: PHI
    created_at: O
    email: PII
    status: O
## Intended Use
### Primary Use Cases
* Automatic PII/PHI detection in database schemas
* Data privacy compliance audits
* HIPAA compliance automation
* Healthcare data anonymization pipelines
### Out-of-Scope
* Classifying actual data content: the model only looks at column names
* Classifying free-text or unstructured data
* Acting as the sole arbiter of compliance: use it as one part of a larger compliance workflow
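One way to keep the model from acting as the sole arbiter is to send low-confidence predictions to a human reviewer. The sketch below is a minimal illustration of that idea: it turns raw logits into probabilities and flags anything below a threshold. The `triage` helper, the `review_threshold` value of 0.90, and the example logits are all hypothetical and not part of this model's API.

```python
import math

# Same index-to-label mapping as in the Quick Start
LABELS = {0: "O", 1: "PII", 2: "PHI"}


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def triage(column, logits, review_threshold=0.90):
    """Attach a label, a confidence, and a needs-review flag to one column.

    review_threshold is a hypothetical cut-off; tune it on validated data.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {
        "column": column,
        "label": LABELS[best],
        "confidence": probs[best],
        "needs_review": probs[best] < review_threshold,
    }


# Made-up logits standing in for `outputs.logits[0].tolist()`
confident = triage("email", [0.1, 4.0, 0.2])
uncertain = triage("notes", [1.0, 1.2, 0.9])
print(confident)
print(uncertain)
```

In a real pipeline the logits would come from the model call shown in the Quick Start, and flagged columns would be queued for a domain expert.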
## Limitations & Bias
* Trained primarily on English column naming conventions
* May not generalize to non-standard or domain-specific naming patterns
* Should be validated with domain experts before production use
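Because column names in real schemas come in many styles, it may help to normalize them before classification. The sketch below converts camelCase, PascalCase, and separator-delimited names to lowercase snake_case. The `normalize_column_name` helper is hypothetical, and whether this preprocessing improves predictions is an assumption that would need checking, since this card does not describe the naming styles in the training data.

```python
import re


def normalize_column_name(name: str) -> str:
    """Best-effort conversion of a column name to lowercase snake_case."""
    # Split lower/digit -> upper boundaries: "patientName" -> "patient_Name"
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Split an acronym from a following word: "SSNNumber" -> "SSN_Number"
    name = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", "_", name)
    # Treat whitespace, hyphens, and dots as separators
    name = re.sub(r"[\s\-.]+", "_", name)
    # Collapse repeated underscores and trim them from both ends
    name = re.sub(r"_+", "_", name).strip("_")
    return name.lower()


for raw in ["patientName", "SSNNumber", "Diagnosis Code", "dbo.Patient-Email"]:
    print(f"{raw!r} -> {normalize_column_name(raw)!r}")
```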
## Model Card Authors
Abishek - KronosX AI Labs
## Citation
    @misc{hipaa-bert-2024,
      author = {KronosX AI Labs},
      title = {HIPAA-BERT: PII/PHI Column Name Classifier},
      year = {2026},
      url = {https://huggingface.co/KronosXAI/HIPAA-BERT-v0.1}
    }
## Links
* Organization: KronosX AI Labs