CliniGuard NER -- PHI/PII De-identification by Genzeon Platform

CliniGuard NER is a clinical Named Entity Recognition model developed by Genzeon Platform for automated detection and de-identification of Protected Health Information (PHI) and Personally Identifiable Information (PII) in clinical text. Built on a domain-specialized BERT architecture fine-tuned on healthcare corpora, CliniGuard delivers production-grade entity recognition across 20 PHI categories.

Model Details

Property              Value
Developed by          Genzeon Platform
Architecture          BertForTokenClassification
Parameters            ~110M
Tagging scheme        BIO (41 labels)
Max sequence length   512 tokens
License               Apache-2.0

Intended Use

CliniGuard NER is designed for enterprise healthcare environments where patient data privacy is critical. Primary use cases include:

  • Clinical text de-identification -- removing or masking patient identifiers before sharing medical records for research.
  • PII detection -- flagging sensitive information in healthcare documents, EHRs, and discharge summaries.
  • Regulatory compliance -- supporting HIPAA Safe Harbor de-identification requirements.
  • Healthcare AI pipelines -- preprocessing clinical text for downstream NLP tasks while ensuring patient privacy.
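
The masking step in the first use case can be sketched as follows. This is a minimal illustration, not Genzeon Platform's implementation: `mask_entities` is a hypothetical helper, and it assumes entity dictionaries carrying `entity_group`, `start`, and `end` keys, which is the shape the transformers token-classification pipeline returns when an aggregation strategy is set. The example spans are hand-written, not real model output.

```python
def mask_entities(text, entities):
    """Replace each detected span with a bracketed placeholder such as [PATIENT_NAME]."""
    masked = text
    # Work right-to-left so earlier character offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        masked = masked[:ent["start"]] + placeholder + masked[ent["end"]:]
    return masked


text = "Patient John Smith, DOB 03/15/1960, was seen by Dr. Jane Doe."
name, dob = "John Smith", "03/15/1960"
# Hand-written example spans (NOT model output), using the pipeline's key names.
entities = [
    {"entity_group": "PATIENT_NAME", "start": text.index(name), "end": text.index(name) + len(name)},
    {"entity_group": "DATE_OF_BIRTH", "start": text.index(dob), "end": text.index(dob) + len(dob)},
]
print(mask_entities(text, entities))
# Patient [PATIENT_NAME], DOB [DATE_OF_BIRTH], was seen by Dr. Jane Doe.
```

The sketch assumes spans do not overlap; overlapping predictions would need to be merged before masking.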

Entity Types

The model recognizes 20 PHI entity types using BIO tagging (41 labels total):

Category              Entity Types
Patient identifiers   PATIENT_NAME, DATE_OF_BIRTH, AGE, GENDER, SSN, MRN
Contact information   PHONE, FAX, EMAIL
Location              ADDRESS, CITY, STATE, ZIP, COUNTRY
Organization          HOSPITAL
Provider              DOCTOR_NAME
Digital identifiers   USERNAME, ID_NUMBER, IP_ADDRESS, URL
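
The count of 41 labels follows from BIO tagging: each of the 20 entity types gets a `B-` (begin) and an `I-` (inside) tag, plus a single `O` tag for non-entity tokens. The sketch below shows that arithmetic. The `B-`/`I-` prefix spelling and the label ordering are assumptions for illustration; the authoritative mapping is whatever `model.config.id2label` contains.

```python
ENTITY_TYPES = [
    "PATIENT_NAME", "DATE_OF_BIRTH", "AGE", "GENDER", "SSN", "MRN",
    "PHONE", "FAX", "EMAIL",
    "ADDRESS", "CITY", "STATE", "ZIP", "COUNTRY",
    "HOSPITAL",
    "DOCTOR_NAME",
    "USERNAME", "ID_NUMBER", "IP_ADDRESS", "URL",
]


def build_bio_labels(entity_types):
    """Return the BIO label list: one O tag plus B-/I- tags per entity type."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")
        labels.append(f"I-{entity_type}")
    return labels


labels = build_bio_labels(ENTITY_TYPES)
print(len(ENTITY_TYPES), len(labels))  # 20 41
```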

Performance

Overall Metrics

Metric      Precision  Recall  F1
Micro avg   0.9659     0.9732  0.9695
Macro avg   0.9609     0.9706  0.9656

Per-Entity Metrics

Entity          Precision  Recall  F1      Support
PATIENT_NAME    0.9817     0.9853  0.9835  14335
DATE_OF_BIRTH   0.9798     0.9740  0.9769  9818
AGE             0.9028     0.9854  0.9423  1508
GENDER          0.9596     0.9885  0.9738  1562
SSN             0.9513     0.9935  0.9719  766
MRN             0.9938     0.9923  0.9930  1943
PHONE           0.9730     0.9869  0.9799  2590
FAX             0.9481     0.9454  0.9468  696
EMAIL           0.9965     0.9936  0.9950  4543
ADDRESS         0.9746     0.9844  0.9794  1985
CITY            0.9086     0.8891  0.8988  2047
STATE           0.9103     0.9060  0.9082  2734
ZIP             0.9770     0.9832  0.9801  951
COUNTRY         0.9485     0.9504  0.9495  2056
HOSPITAL        0.9033     0.9345  0.9186  5267
DOCTOR_NAME     0.9865     1.0000  0.9932  802
USERNAME        0.9689     0.9431  0.9559  1917
ID_NUMBER       0.9724     0.9898  0.9811  8555
IP_ADDRESS      0.9892     0.9924  0.9908  926
URL             0.9910     0.9947  0.9928  3001
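
As a sanity check on how the two tables relate, the macro-average F1 is the unweighted mean of the per-entity F1 scores, and each F1 is the harmonic mean of its precision and recall. The sketch below recomputes both from the rounded figures in the tables above; small differences in the last digit are expected because the inputs are already rounded to four decimals.

```python
from statistics import mean


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Per-entity F1 values copied from the Per-Entity Metrics table.
PER_ENTITY_F1 = {
    "PATIENT_NAME": 0.9835, "DATE_OF_BIRTH": 0.9769, "AGE": 0.9423, "GENDER": 0.9738,
    "SSN": 0.9719, "MRN": 0.9930, "PHONE": 0.9799, "FAX": 0.9468, "EMAIL": 0.9950,
    "ADDRESS": 0.9794, "CITY": 0.8988, "STATE": 0.9082, "ZIP": 0.9801, "COUNTRY": 0.9495,
    "HOSPITAL": 0.9186, "DOCTOR_NAME": 0.9932, "USERNAME": 0.9559, "ID_NUMBER": 0.9811,
    "IP_ADDRESS": 0.9908, "URL": 0.9928,
}

# Macro average: every entity type counts equally, regardless of support.
macro_f1 = mean(PER_ENTITY_F1.values())
# F1 implied by the micro-average precision and recall in the Overall Metrics table.
micro_f1 = f1_score(0.9659, 0.9732)

print(f"macro F1 recomputed from the table: {macro_f1:.5f}")
print(f"micro F1 recomputed from P and R:   {micro_f1:.5f}")
```

Both values agree with the reported 0.9656 and 0.9695 to within rounding. Because the macro average ignores support, lower-scoring types such as CITY and STATE pull it down more than they affect the micro average.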

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "genzeonplatform/cliniguard-ner"

# Option 1: Use the transformers pipeline
nlp = pipeline("token-classification", model=model_name, aggregation_strategy="simple")
text = "Patient John Smith, DOB 03/15/1960, was seen at Springfield General Hospital by Dr. Jane Doe."
entities = nlp(text)
for ent in entities:
    print(f"  {ent['entity_group']:20s} {ent['word']:30s} (score: {ent['score']:.3f})")

# Option 2: Manual inference
import torch

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring label ID for each token
predictions = torch.argmax(outputs.logits, dim=2)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions[0]):
    # id2label is keyed by int once the config is loaded, so index with the int ID
    label = model.config.id2label[pred.item()]
    if label != "O":
        print(f"  {token:20s} -> {label}")
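
Manual inference yields one BIO label per token, so consecutive `B-`/`I-` tags still need to be grouped into entity spans. The helper below is a minimal sketch of that grouping; `bio_to_spans` is a hypothetical name, and the tokens and labels in the example are hand-written rather than real model output. It does not rejoin WordPiece sub-tokens (the `##` pieces), which a full implementation would also handle.

```python
def bio_to_spans(tokens, labels):
    """Group token-level BIO labels into (entity_type, [tokens]) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A B- tag always starts a new span, closing any open one first.
            if current_type is not None:
                spans.append((current_type, current_tokens))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span, closes it.
            if current_type is not None:
                spans.append((current_type, current_tokens))
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, current_tokens))
    return spans


# Hand-written example (NOT model output).
tokens = ["patient", "john", "smith", "was", "seen", "by", "dr", ".", "jane", "doe"]
labels = ["O", "B-PATIENT_NAME", "I-PATIENT_NAME", "O", "O", "O", "O", "O",
          "B-DOCTOR_NAME", "I-DOCTOR_NAME"]
print(bio_to_spans(tokens, labels))
# [('PATIENT_NAME', ['john', 'smith']), ('DOCTOR_NAME', ['jane', 'doe'])]
```

In this sketch a stray `I-` tag with no preceding `B-` of the same type is dropped. For de-identification, where a missed identifier costs more than a false positive, treating such a tag as the start of a new span may be the safer choice.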

Training Details

  • Developed by: Genzeon Platform
  • Architecture: Domain-specialized BERT fine-tuned on clinical corpora
  • Training data: Genzeon Platform's proprietary clinical NER dataset with diverse healthcare note formats
  • Epochs: up to 15 (early stopping, patience=3)
  • Learning rate: 3e-5 (linear schedule with warmup)
  • Batch size: 16 (train) / 32 (eval)
  • Max sequence length: 512 tokens
  • Optimizer: AdamW (weight decay 0.01)
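
For readers unfamiliar with the schedule named above, a linear schedule with warmup ramps the learning rate from zero to its peak over the warmup steps and then decays it linearly back to zero. The sketch below reproduces that shape in plain Python. The peak of 3e-5 comes from the list above; the total and warmup step counts are hypothetical, since the card does not state them.

```python
def linear_warmup_lr(step, total_steps, warmup_steps, peak_lr=3e-5):
    """Learning rate at a given step for a linear warmup / linear decay schedule."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to peak_lr.
        return peak_lr * (step / max(1, warmup_steps))
    # Decay phase: fall linearly from peak_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * (remaining / max(1, total_steps - warmup_steps))


# Hypothetical step counts, chosen only to illustrate the shape of the curve.
TOTAL_STEPS, WARMUP_STEPS = 1000, 100
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_lr(step, TOTAL_STEPS, WARMUP_STEPS))
```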

Limitations

  • English only: Currently optimized for English clinical text. Multilingual support is on the Genzeon Platform roadmap.
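  • Human review recommended: Recall is below 1.0 for every entity type except DOCTOR_NAME on the evaluation set (CITY is lowest at 0.8891), so some identifiers will be missed. For high-stakes de-identification workflows, Genzeon Platform recommends pairing CliniGuard with human review.
  • Entity coverage: Covers 20 common PHI/PII entity types. These overlap with, but are not identical to, the 18 identifier categories listed under HIPAA Safe Harbor (for example, device and vehicle identifiers have no dedicated label). Rare or domain-specific identifiers may require custom fine-tuning -- contact Genzeon Platform for enterprise support.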
  • Context window: Limited to 512 tokens per input. Longer documents should be chunked with overlap for best results.
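
Chunking with overlap, as suggested in the last item, can be sketched as a sliding window over the token sequence. `chunk_with_overlap` is a hypothetical helper, and the overlap of 64 tokens is an assumed value, not a recommendation from Genzeon Platform. The example uses a window of 510 rather than 512 to leave room for the `[CLS]` and `[SEP]` tokens that a BERT tokenizer adds to each input.

```python
def chunk_with_overlap(token_ids, max_len=510, overlap=64):
    """Split a token sequence into windows of at most max_len that share `overlap` tokens.

    Returns a list of (start_index, chunk) pairs so predictions can be mapped
    back to positions in the original sequence.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be >= 0 and smaller than max_len")
    if not token_ids:
        return []
    step = max_len - overlap
    chunks = []
    start = 0
    while True:
        end = min(start + max_len, len(token_ids))
        chunks.append((start, token_ids[start:end]))
        if end >= len(token_ids):
            break
        start += step
    return chunks


# A stand-in for a 1200-token document (integers instead of real token IDs).
document = list(range(1200))
chunks = chunk_with_overlap(document, max_len=510, overlap=64)
print([(start, len(chunk)) for start, chunk in chunks])
# [(0, 510), (446, 510), (892, 308)]
```

Tokens inside an overlap region receive predictions from two windows. A common way to reconcile them is to keep the prediction from the window where the token sits farther from the edge, since it has more surrounding context there.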

About Genzeon Platform

Genzeon Platform is a healthcare technology company specializing in AI-powered solutions for clinical data management, regulatory compliance, and healthcare interoperability. CliniGuard NER is part of Genzeon Platform's suite of healthcare AI tools designed to accelerate clinical research while safeguarding patient privacy.

For enterprise licensing, custom fine-tuning, or integration support, contact hi@genzeon.one.
