license: mit
language:
  - en
tags:
  - bert
  - pii-detection
  - phi-detection
  - hipaa
  - healthcare
  - nlp
  - text-classification
  - sequence-classification
  - lora
  - peft
datasets:
  - custom
base_model: bert-base-uncased
pipeline_tag: text-classification
library_name: transformers

HIPAA-BERT: PII/PHI Column Name Classifier

A fine-tuned BERT model for classifying database column names as PII (Personally Identifiable Information), PHI (Protected Health Information), or Other (O).

Model Details

| Property | Value |
| --- | --- |
| Developer | KronosX AI Labs |
| Model Type | BERT + LoRA (text classification) |
| Base Model | bert-base-uncased |
| Language | English |
| Fine-tuning Method | LoRA (Low-Rank Adaptation) |
| Task | Sequence Classification (3 classes) |

Labels

| Label | Description | Examples |
| --- | --- | --- |
| O | Other/safe columns | id, created_at, status |
| PII | Personally Identifiable Information | email, phone_number, address |
| PHI | Protected Health Information (HIPAA) | diagnosis_code, patient_name, ssn |

Training Details

Hyperparameters

| Parameter | Value |
| --- | --- |
| Learning Rate | 1e-3 |
| Batch Size | 64 |
| Epochs | 10 |
| Weight Decay | 0.01 |
| Max Sequence Length | 64 |
| LoRA Rank (r) | 16 |
| LoRA Alpha | 32 |
| LoRA Dropout | 0.1 |
| Target Modules | query, value |
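
To make the LoRA settings above more concrete, the sketch below works out two quantities they imply: the adapter scaling factor (alpha / r) and the number of adapter weights added to the base model. It assumes the standard bert-base-uncased shape (12 layers, hidden size 768) and that each targeted module is a square 768 x 768 linear layer. The `LoraSettings` class and the `build_peft_config` helper are illustrative names, not part of this repository, and the helper only shows how these values might map onto the `peft` library's `LoraConfig`; the authors' actual training script is not published here.

```python
from dataclasses import dataclass

# Architecture facts for bert-base-uncased.
HIDDEN_SIZE = 768
NUM_LAYERS = 12


@dataclass(frozen=True)
class LoraSettings:
    """Hypothetical container for the LoRA values listed in the table."""
    r: int = 16
    alpha: int = 32
    dropout: float = 0.1
    target_modules: tuple = ("query", "value")


def lora_scaling(s: LoraSettings) -> float:
    # LoRA scales the low-rank update by alpha / r.
    return s.alpha / s.r


def lora_adapter_params(s: LoraSettings,
                        hidden: int = HIDDEN_SIZE,
                        layers: int = NUM_LAYERS) -> int:
    # Each adapted linear layer gains A (r x hidden) and B (hidden x r).
    per_module = s.r * hidden + hidden * s.r
    return per_module * len(s.target_modules) * layers


def build_peft_config(s: LoraSettings):
    # Assumption: training used the `peft` library. Imported lazily so the
    # arithmetic above runs even when `peft` is not installed.
    from peft import LoraConfig, TaskType
    return LoraConfig(
        task_type=TaskType.SEQ_CLS,
        r=s.r,
        lora_alpha=s.alpha,
        lora_dropout=s.dropout,
        target_modules=list(s.target_modules),
    )


settings = LoraSettings()
print(lora_scaling(settings))         # 2.0
print(lora_adapter_params(settings))  # 589824
```

Under these assumptions the adapters add roughly 0.59M weights, not counting the classification head, which is small next to the roughly 110M parameters of the frozen base model.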

Training Data

Custom HIPAA-compliant dataset of roughly 50,000+ labeled column names drawn from healthcare databases.

Hardware

  • GPU: NVIDIA GPU (Kaggle)
  • Mixed Precision: FP16 enabled

Performance Metrics

| Metric | Score |
| --- | --- |
| Accuracy | ~95%+ |
| F1 (weighted) | ~94%+ |
| Precision | ~93%+ |
| Recall | ~94%+ |
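
The card does not say how these scores were computed. As a rough illustration of what "weighted" usually means for a three-class task, the sketch below computes accuracy and support-weighted precision, recall, and F1 in plain Python, following the common convention of weighting each class by its share of the true labels. The function name and the toy labels are made up for illustration and are not the model's evaluation data.

```python
from collections import Counter

LABELS = ("O", "PII", "PHI")


def weighted_metrics(y_true, y_pred, labels=LABELS):
    """Accuracy plus support-weighted precision, recall, and F1."""
    total = len(y_true)
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / total

    precision = recall = f1 = 0.0
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for p in y_pred)
        actual = support[label]
        p_l = tp / predicted if predicted else 0.0
        r_l = tp / actual if actual else 0.0
        f_l = 2 * p_l * r_l / (p_l + r_l) if (p_l + r_l) else 0.0
        weight = actual / total  # class share of the true labels
        precision += weight * p_l
        recall += weight * r_l
        f1 += weight * f_l

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy example (not real evaluation data): one "O" column mislabeled as PII.
y_true = ["O", "O", "PII", "PHI"]
y_pred = ["O", "PII", "PII", "PHI"]
scores = weighted_metrics(y_true, y_pred)
print(scores)
```

With this weighting, weighted recall always equals accuracy, so the two numbers in a report computed this way should match up to rounding.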

Usage

Installation

pip install transformers torch

Quick Start

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model
model_name = "KronosXAI/HIPAA-BERT-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Map class indices to label names
label_map = {0: "O", 1: "PII", 2: "PHI"}

# Classify column names
columns = ["patient_name", "diagnosis_code", "created_at", "email", "status"]
for col in columns:
    inputs = tokenizer(col, return_tensors="pt", truncation=True, max_length=64)
    with torch.no_grad():
        outputs = model(**inputs)
    prediction = torch.argmax(outputs.logits, dim=-1).item()
    print(f"{col}: {label_map[prediction]}")

Expected Output

patient_name: PHI
diagnosis_code: PHI
created_at: O
email: PII
status: O

Intended Use

Primary Use Cases

  • Automatic PII/PHI detection in database schemas
  • Data privacy compliance audits
  • HIPAA compliance automation
  • Healthcare data anonymization pipelines
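
As one possible shape for the schema-scanning use case, the sketch below lists every column in a SQLite database and passes each name to a classifier callable, keeping only the columns flagged as PII or PHI. SQLite is used purely because it ships with Python. `stub_classifier` is a hypothetical stand-in with hard-coded answers; in practice it would be replaced by a function wrapping the model call from Quick Start.

```python
import sqlite3
from typing import Callable, Dict, List, Tuple


def list_columns(conn: sqlite3.Connection) -> Dict[str, List[str]]:
    """Return {table_name: [column_name, ...]} for all user tables."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    ).fetchall()
    columns = {}
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt, pk).
        rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        columns[table] = [row[1] for row in rows]
    return columns


def audit_schema(conn: sqlite3.Connection,
                 classify: Callable[[str], str]
                 ) -> Dict[str, List[Tuple[str, str]]]:
    """Return the columns whose predicted label is not 'O', per table."""
    findings = {}
    for table, cols in list_columns(conn).items():
        flagged = [(c, classify(c)) for c in cols]
        findings[table] = [(c, label) for c, label in flagged if label != "O"]
    return findings


def stub_classifier(column: str) -> str:
    # Hypothetical stand-in for the model; the answers are hard-coded.
    known = {"patient_name": "PHI", "diagnosis_code": "PHI", "email": "PII"}
    return known.get(column, "O")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients (id INTEGER PRIMARY KEY, patient_name TEXT, "
    "email TEXT, diagnosis_code TEXT, created_at TEXT)"
)
findings = audit_schema(conn, stub_classifier)
print(findings)
```

Passing the classifier in as a callable keeps the schema walk independent of the model, so the same loop can be tested with a stub and run against other database drivers with a different `list_columns`.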

Out-of-Scope

  • This model classifies column names, not the actual data content
  • Not suitable for classifying free-text or unstructured data
  • Should be used as part of a larger compliance workflow, not as sole arbiter
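
One way to keep the model from acting as sole arbiter is to send low-confidence predictions to a human reviewer. The sketch below converts raw logits into probabilities with a softmax and flags any prediction whose top probability falls below a threshold. The 0.9 threshold, the `triage` function, and the example logits are illustrative assumptions, not values taken from this model; the label order follows the mapping used in Quick Start.

```python
import math

ID2LABEL = {0: "O", 1: "PII", 2: "PHI"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def triage(column, logits, threshold=0.9):
    """Label a column and flag it for review if confidence is low.

    The threshold is a hypothetical default; it should be tuned on
    held-out data for the deployment at hand.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    return {
        "column": column,
        "label": ID2LABEL[best],
        "confidence": confidence,
        "needs_review": confidence < threshold,
    }


# Made-up logits for illustration only.
confident = triage("diagnosis_code", [0.1, 0.2, 4.0])
uncertain = triage("ref_code", [1.0, 1.2, 0.9])
print(confident)
print(uncertain)
```

Softmax confidence is not a calibrated probability, so a threshold like this is a coarse filter for routing work to reviewers, not a guarantee of correctness.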

Limitations & Bias

  • Trained primarily on English column naming conventions
  • May not generalize to non-standard or domain-specific naming patterns
  • Should be validated with domain experts before production use
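
Because naming conventions vary between schemas, it may help to normalize column names before classification. The sketch below converts camelCase, PascalCase, and mixed separators into lowercase snake_case. This is a hypothetical preprocessing step: the card does not state which naming styles the training data covered, so any such normalization should be checked against the model's behavior on real schemas.

```python
import re


def normalize_column_name(name: str) -> str:
    """Convert a column name to lowercase snake_case."""
    name = name.strip()
    # patientName -> patient_Name
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # SSNNumber -> SSN_Number (acronym followed by a capitalized word)
    name = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", "_", name)
    # Spaces, hyphens, and dots become underscores.
    name = re.sub(r"[\s\-.]+", "_", name)
    # Collapse repeated underscores and trim them from the ends.
    name = re.sub(r"_+", "_", name)
    return name.strip("_").lower()


for raw in ["patientName", "DiagnosisCode", "SSNNumber",
            " Date-Of-Birth ", "email.address", "created_at"]:
    print(f"{raw!r} -> {normalize_column_name(raw)}")
```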

Model Card Authors

Abishek - KronosX AI Labs

Citation

@misc{hipaa-bert-2024,
  author = {KronosX AI Labs},
  title = {HIPAA-BERT: PII/PHI Column Name Classifier},
  year = {2026},
  url = {https://huggingface.co/KronosXAI/HIPAA-BERT-v0.1}
}

Links

  • Organization: KronosX AI Labs