license: mit
language:
  - en
tags:
  - bert
  - pii-detection
  - phi-detection
  - hipaa
  - healthcare
  - nlp
  - text-classification
  - sequence-classification
  - lora
  - peft
datasets:
  - custom
base_model: bert-base-uncased
pipeline_tag: text-classification
library_name: transformers

HIPAA-BERT: PII/PHI Column Name Classifier

A fine-tuned BERT model for classifying database column names as PII (Personally Identifiable Information), PHI (Protected Health Information), or Other (O).

Model Details

Property	Value
Developer	KronosX AI Labs
Model Type	BERT + LoRA (text classification)
Base Model	`bert-base-uncased`
Language	English
Fine-tuning Method	LoRA (Low-Rank Adaptation)
Task	Sequence Classification (3 classes)

Labels

Label	Description	Examples
`O`	Other/Safe columns	`id`, `created_at`, `status`
`PII`	Personally Identifiable Information	`email`, `phone_number`, `address`
`PHI`	Protected Health Information (HIPAA)	`diagnosis_code`, `patient_name`, `ssn`

Training Details

Hyperparameters

Parameter	Value
Learning Rate	1e-3
Batch Size	64
Epochs	10
Weight Decay	0.01
Max Sequence Length	64
LoRA Rank (r)	16
LoRA Alpha	32
LoRA Dropout	0.1
Target Modules	query, value
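
To give a sense of what these LoRA settings imply, the sketch below estimates the number of adapter parameters. It assumes the standard `bert-base-uncased` dimensions (12 encoder layers, hidden size 768) and the usual LoRA formulation, in which each targeted weight matrix gets two low-rank factors of shapes `r x d_in` and `d_out x r`. The classification head and any other trained parameters are not counted, and all function and constant names are illustrative.

```python
# Rough estimate of LoRA adapter size for the hyperparameters listed above.
# Assumes standard bert-base-uncased dimensions; names here are illustrative.

HIDDEN_SIZE = 768                      # bert-base-uncased hidden size
NUM_LAYERS = 12                        # bert-base-uncased encoder layers
TARGET_MODULES = ("query", "value")    # from the hyperparameter table
LORA_RANK = 16
LORA_ALPHA = 32


def lora_params_per_module(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


def total_lora_params() -> int:
    """Adapter parameters summed over every targeted projection in every layer."""
    per_module = lora_params_per_module(HIDDEN_SIZE, HIDDEN_SIZE, LORA_RANK)
    return per_module * len(TARGET_MODULES) * NUM_LAYERS


adapter_params = total_lora_params()
scaling = LORA_ALPHA / LORA_RANK  # factor applied to the low-rank update

print(f"LoRA adapter parameters: {adapter_params:,}")
print(f"LoRA scaling (alpha / r): {scaling}")
```

Under these assumptions the adapters add roughly 0.59 million parameters, a small fraction of the roughly 110 million in the base model.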

Training Data

Custom HIPAA-compliant dataset of more than 50,000 labeled column names from healthcare databases.

Hardware

GPU: NVIDIA GPU (Kaggle)
Mixed Precision: FP16 enabled

Performance Metrics

Metric	Score
Accuracy	~95%+
F1 (weighted)	~94%+
Precision	~93%+
Recall	~94%+
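
The table does not say how the scores were computed. As one plausible reading, the sketch below shows how support-weighted precision, recall, and F1 are typically calculated for a three-class problem. The small `y_true` and `y_pred` lists are made-up illustration data, not results from this model, and `weighted_scores` is a hypothetical helper name.

```python
from collections import Counter

LABELS = ["O", "PII", "PHI"]


def weighted_scores(y_true, y_pred, labels=LABELS):
    """Return (accuracy, precision, recall, f1); the last three are weighted by class support."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = support[label]
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / actual if actual else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = actual / total  # share of true examples belonging to this class
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    return accuracy, precision, recall, f1


# Made-up labels purely for illustration.
y_true = ["O", "O", "PII", "PII", "PHI", "PHI"]
y_pred = ["O", "PII", "PII", "PII", "PHI", "O"]
acc, prec, rec, f1 = weighted_scores(y_true, y_pred)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

Support-weighted recall equals accuracy by construction in single-label classification, so a gap between the two would point to a different averaging scheme (for example macro) or to rounding.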

Usage

Installation

pip install transformers torch

Quick Start

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model
model_name = "KronosXAI/HIPAA-BERT-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Map class indices to label names (defined once, outside the loop)
label_map = {0: "O", 1: "PII", 2: "PHI"}

# Classify column names
columns = ["patient_name", "diagnosis_code", "created_at", "email", "status"]
for col in columns:
    inputs = tokenizer(col, return_tensors="pt", truncation=True, max_length=64)
    with torch.no_grad():
        outputs = model(**inputs)
    # Pick the class with the highest logit
    prediction = torch.argmax(outputs.logits, dim=-1).item()
    print(f"{col}: {label_map[prediction]}")

Expected Output

patient_name: PHI
diagnosis_code: PHI
created_at: O
email: PII
status: O

Intended Use

Primary Use Cases

Automatic PII/PHI detection in database schemas
Data privacy compliance audits
HIPAA compliance automation
Healthcare data anonymization pipelines
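
As a rough sketch of the schema-scanning use case, the snippet below collects column names from a SQLite database with the standard-library `sqlite3` module and hands them to a classifier callable. The `classify` argument, `scan_schema`, and the `keyword_stub` stand-in are hypothetical names; in practice the callable would wrap the model as in Quick Start.

```python
import sqlite3
from typing import Callable, Dict, List


def list_columns(conn: sqlite3.Connection) -> Dict[str, List[str]]:
    """Map each user table to its column names."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # PRAGMA table_info returns one row per column; index 1 is the column name.
        rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [row[1] for row in rows]
    return schema


def scan_schema(conn: sqlite3.Connection, classify: Callable[[str], str]):
    """Return {table: {column: label}} for every column not labeled "O"."""
    findings = {}
    for table, columns in list_columns(conn).items():
        labeled = {col: classify(col) for col in columns}
        flagged = {col: label for col, label in labeled.items() if label != "O"}
        if flagged:
            findings[table] = flagged
    return findings


def keyword_stub(column: str) -> str:
    """Hypothetical stand-in for the model so the example runs without downloads."""
    if column in {"diagnosis_code", "patient_name"}:
        return "PHI"
    if column in {"email", "phone_number"}:
        return "PII"
    return "O"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients (id INTEGER, patient_name TEXT, diagnosis_code TEXT, created_at TEXT)"
)
conn.execute("CREATE TABLE accounts (id INTEGER, email TEXT, status TEXT)")

findings = scan_schema(conn, keyword_stub)
for table, cols in findings.items():
    print(table, cols)
```

Keeping the classifier behind a plain callable means the same scan can be run against the stub in tests and against the real model in an audit.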

Out-of-Scope

This model classifies column names, not the actual data content
Not suitable for classifying free-text or unstructured data
Should be used as part of a larger compliance workflow, not as sole arbiter
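
Since the model is meant to feed a larger workflow, one option is to turn its logits into probabilities and route low-confidence predictions to a human reviewer. The sketch below does this in plain Python. The `REVIEW_THRESHOLD` value of 0.80 and the example logits are arbitrary illustrative choices, not calibrated numbers from this model; the label order follows the `label_map` in Quick Start.

```python
import math
from typing import List, Tuple

ID2LABEL = {0: "O", 1: "PII", 2: "PHI"}  # same order as label_map in Quick Start
REVIEW_THRESHOLD = 0.80  # hypothetical cutoff; tune on validation data


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def triage(logits: List[float]) -> Tuple[str, float, bool]:
    """Return (label, confidence, needs_review) for one column's logits."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    return ID2LABEL[best], confidence, confidence < REVIEW_THRESHOLD


# Example logits are invented; real ones would come from outputs.logits[0].tolist().
confident = triage([0.2, 0.1, 4.0])
uncertain = triage([1.0, 1.2, 0.9])
print(confident)
print(uncertain)
```

Softmax scores from a fine-tuned classifier are not guaranteed to be well calibrated, so any threshold should be checked against labeled examples before it is relied on.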

Limitations & Bias

Trained primarily on English column naming conventions
May not generalize to non-standard or domain-specific naming patterns
Should be validated with domain experts before production use
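
One way to reduce exposure to unusual naming styles is to normalize column names to snake_case before classification, matching the style of the examples in this card. The helper below is a sketch of that idea; `normalize_column_name` is a hypothetical name, and whether normalization actually improves predictions for this model is an untested assumption.

```python
import re


def normalize_column_name(name: str) -> str:
    """Convert camelCase, PascalCase, or separator-delimited names to snake_case."""
    # Split a lowercase letter or digit from a following capital: patientName -> patient_Name
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Split an acronym from a following capitalized word: SSNNumber -> SSN_Number
    name = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", "_", name)
    # Collapse any run of non-alphanumeric characters into a single underscore
    name = re.sub(r"[^A-Za-z0-9]+", "_", name)
    return name.strip("_").lower()


for raw in ["patientName", "DiagnosisCode", "SSNNumber", "Phone Number", "email-address"]:
    print(f"{raw} -> {normalize_column_name(raw)}")
```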

Model Card Authors

Abishek - KronosX AI Labs

Citation

@misc{hipaa-bert-2024,
  author = {KronosX AI Labs},
  title = {HIPAA-BERT: PII/PHI Column Name Classifier},
  year = {2026},
  url = {https://huggingface.co/KronosXAI/HIPAA-BERT-v0.1}
}
