	---
	license: mit
	language:
	- en
	tags:
	- bert
	- pii-detection
	- phi-detection
	- hipaa
	- healthcare
	- nlp
	- text-classification
	- sequence-classification
	- lora
	- peft
	datasets:
	- custom
	base_model: bert-base-uncased
	pipeline_tag: text-classification
	library_name: transformers
	---

	# HIPAA-BERT: PII/PHI Column Name Classifier

	A fine-tuned BERT model for classifying database column names as PII (Personally Identifiable Information), PHI (Protected Health Information), or Other (O).

	## Model Details

| Property | Value |
|----------|-------|
| Developer | KronosX AI Labs |
| Model Type | BERT + LoRA (text classification) |
| Base Model | `bert-base-uncased` |
| Language | English |
| Fine-tuning Method | LoRA (Low-Rank Adaptation) |
| Task | Sequence Classification (3 classes) |

	## Labels

| Label | Description | Examples |
|-------|-------------|----------|
| `O` | Other/Safe columns | `id`, `created_at`, `status` |
| `PII` | Personally Identifiable Info | `email`, `phone_number`, `address` |
| `PHI` | Protected Health Info (HIPAA) | `diagnosis_code`, `patient_name`, `ssn` |

	## Training Details

	### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Learning Rate | 1e-3 |
| Batch Size | 64 |
| Epochs | 10 |
| Weight Decay | 0.01 |
| Max Sequence Length | 64 |
| LoRA Rank (r) | 16 |
| LoRA Alpha | 32 |
| LoRA Dropout | 0.1 |
| Target Modules | query, value |
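To give a feel for what these LoRA settings imply, the following is a rough back-of-the-envelope sketch of the adapter size. It assumes (the card does not say this explicitly) that LoRA is applied to the query and value projections in all 12 encoder layers of `bert-base-uncased`, each a 768 x 768 linear layer, and it ignores the classification head. The helper name `lora_params_per_module` is made up for illustration.

```python
# Rough count of trainable LoRA parameters for the configuration above.
# Assumptions (not stated in the model card): adapters on query and value in
# every encoder layer, 768 x 768 projections, classification head ignored.

HIDDEN_SIZE = 768     # bert-base-uncased hidden size
NUM_LAYERS = 12       # bert-base-uncased encoder layers
TARGET_MODULES = 2    # query and value
RANK = 16             # LoRA Rank (r) from the table
ALPHA = 32            # LoRA Alpha from the table


def lora_params_per_module(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) beside a frozen weight."""
    return rank * d_in + d_out * rank


per_module = lora_params_per_module(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
total_lora = per_module * TARGET_MODULES * NUM_LAYERS
scaling = ALPHA / RANK  # the low-rank update is scaled by alpha / r

print(per_module)   # 24576
print(total_lora)   # 589824
print(scaling)      # 2.0
```

Under these assumptions the adapters add about 0.59M trainable parameters, a small fraction of the roughly 110M parameters in the frozen base model.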

	### Training Data
Custom HIPAA-compliant dataset of more than 50,000 (approximate count) labeled column names drawn from healthcare databases.

	### Hardware
	- GPU: NVIDIA GPU (Kaggle)
	- Mixed Precision: FP16 enabled

	## Performance Metrics

| Metric | Score |
|--------|-------|
| Accuracy | ~95%+ |
| F1 (weighted) | ~94%+ |
| Precision | ~93%+ |
| Recall | ~94%+ |
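For readers who want to reproduce these kinds of scores on their own labeled columns, here is a minimal sketch of accuracy and support-weighted precision, recall, and F1 in plain Python. The toy `y_true` and `y_pred` lists are invented for illustration and are not the model's evaluation data. The card only marks F1 as weighted, so applying the same averaging to precision and recall is an assumption.

```python
from collections import Counter

LABELS = ["O", "PII", "PHI"]


def weighted_metrics(y_true, y_pred, labels=LABELS):
    """Accuracy plus precision/recall/F1 averaged by each label's support."""
    total = len(y_true)
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / total

    precision = recall = f1 = 0.0
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for p in y_pred)
        actual = support[label]
        p_l = tp / predicted if predicted else 0.0
        r_l = tp / actual if actual else 0.0
        f_l = 2 * p_l * r_l / (p_l + r_l) if (p_l + r_l) else 0.0
        weight = actual / total  # share of true examples with this label
        precision += weight * p_l
        recall += weight * r_l
        f1 += weight * f_l

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, made up for this example only.
y_true = ["O", "O", "PII", "PII", "PHI", "PHI", "PHI", "O"]
y_pred = ["O", "O", "PII", "PHI", "PHI", "PHI", "PII", "O"]

scores = weighted_metrics(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Note that support-weighted recall is mathematically identical to accuracy, so the two only differ when recall is averaged another way (for example, macro averaging).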

	## Usage

	### Installation
```bash
pip install transformers torch
```

	### Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the tokenizer and the fine-tuned classifier from the Hub
model_name = "KronosXAI/HIPAA-BERT-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Map class indices to label names (defined once, outside the loop)
label_map = {0: "O", 1: "PII", 2: "PHI"}

# Classify column names one at a time
columns = ["patient_name", "diagnosis_code", "created_at", "email", "status"]
for col in columns:
    inputs = tokenizer(col, return_tensors="pt", truncation=True, max_length=64)
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    prediction = torch.argmax(outputs.logits, dim=-1).item()
    print(f"{col}: {label_map[prediction]}")
```

	### Expected Output
```text
patient_name: PHI
diagnosis_code: PHI
created_at: O
email: PII
status: O
```

	## Intended Use

	### Primary Use Cases
	* Automatic PII/PHI detection in database schemas
	* Data privacy compliance audits
	* HIPAA compliance automation
	* Healthcare data anonymization pipelines
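As an illustration of the schema-scanning use case, the sketch below walks a SQLite database, collects every column name, and reports the ones flagged as PII or PHI. The function names (`list_columns`, `audit_schema`, `stub_classify`) and the `patients` table are hypothetical. The stub stands in for the model so the example runs offline; in practice the `classify` callable would wrap the Quick Start inference code.

```python
import sqlite3
from typing import Callable, Dict, List, Tuple


def list_columns(conn: sqlite3.Connection) -> Dict[str, List[str]]:
    """Return {table_name: [column_name, ...]} for every user table."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )
    ]
    schema = {}
    for table in tables:
        quoted = table.replace('"', '""')  # escape quotes in identifiers
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        rows = conn.execute(f'PRAGMA table_info("{quoted}")').fetchall()
        schema[table] = [row[1] for row in rows]
    return schema


def audit_schema(
    conn: sqlite3.Connection, classify: Callable[[str], str]
) -> List[Tuple[str, str, str]]:
    """Label each column and keep only those flagged as PII or PHI."""
    flagged = []
    for table, columns in list_columns(conn).items():
        for column in columns:
            label = classify(column)
            if label != "O":
                flagged.append((table, column, label))
    return flagged


def stub_classify(column: str) -> str:
    """Hypothetical stand-in for the model; replace with real inference."""
    lookup = {"patient_name": "PHI", "diagnosis_code": "PHI", "email": "PII"}
    return lookup.get(column, "O")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients (id INTEGER, patient_name TEXT, "
    "diagnosis_code TEXT, email TEXT, created_at TEXT)"
)
flagged = audit_schema(conn, stub_classify)
for table, column, label in flagged:
    print(f"{table}.{column}: {label}")
```

Only column names are read here, never row contents, which matches the scope of the model.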

	### Out-of-Scope
	* This model classifies column names, not the actual data content
	* Not suitable for classifying free-text or unstructured data
	* Should be used as part of a larger compliance workflow, not as sole arbiter
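One way to keep the model from acting as sole arbiter is to turn its logits into probabilities and send low-confidence predictions to a human reviewer. The sketch below shows that idea in plain Python. The `triage` helper, the 0.9 threshold, and the example logits are all assumptions made for illustration; the label order follows the Quick Start `label_map`.

```python
import math

LABELS = ["O", "PII", "PHI"]  # index order taken from the Quick Start label_map


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def triage(logits, threshold=0.9):
    """Return (label, confidence, decision) for one column's logits.

    decision is "auto" when the top probability reaches the threshold,
    otherwise "review" to signal that a person should check the column.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    decision = "auto" if probs[best] >= threshold else "review"
    return LABELS[best], probs[best], decision


# Made-up logits for two hypothetical columns.
examples = {
    "patient_name": [-2.0, 0.5, 4.0],
    "notes_ref": [0.9, 0.4, 1.1],
}
for column, logits in examples.items():
    label, confidence, decision = triage(logits)
    print(f"{column}: {label} ({confidence:.2f}) -> {decision}")
```

The threshold is a placeholder; a suitable value would need to be chosen on validated data for the compliance workflow in question.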

	## Limitations & Bias
	* Trained primarily on English column naming conventions
	* May not generalize to non-standard or domain-specific naming patterns
	* Should be validated with domain experts before production use
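Because `bert-base-uncased` lowercases its input, a camelCase name such as `patientName` loses its word boundary before tokenization. A possible mitigation, offered here as an untested assumption rather than something the card reports, is to normalize names to snake_case first. The `normalize_column_name` helper below is hypothetical.

```python
import re


def normalize_column_name(name: str) -> str:
    """Convert camelCase, PascalCase, and mixed separators to snake_case."""
    # Lowercase letter or digit followed by uppercase: patientName -> patient_Name
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    # Acronym followed by a capitalized word: SSNNumber -> SSN_Number
    name = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", name)
    # Spaces, hyphens, and dots become underscores
    name = re.sub(r"[\s\-.]+", "_", name)
    # Collapse repeated underscores and trim the ends
    name = re.sub(r"_+", "_", name).strip("_")
    return name.lower()


for raw in ["patientName", "PatientDOB", "SSNNumber", "Date Of-Birth"]:
    print(f"{raw} -> {normalize_column_name(raw)}")
```

Whether this improves predictions depends on the naming styles in the training data, so it should be checked against expert-labeled examples before use.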

	## Model Card Authors
	Abishek - KronosX AI Labs

	## Citation
```bibtex
@misc{hipaa-bert-2024,
  author = {KronosX AI Labs},
  title  = {HIPAA-BERT: PII/PHI Column Name Classifier},
  year   = {2026},
  url    = {https://huggingface.co/KronosXAI/HIPAA-BERT-v0.1}
}
```

	## Links
	* Organization: KronosX AI Labs