CliniGuard NER -- PHI/PII De-identification by Genzeon Platform
CliniGuard NER is a clinical Named Entity Recognition model developed by Genzeon Platform for automated detection and de-identification of Protected Health Information (PHI) and Personally Identifiable Information (PII) in clinical text. It is built on a domain-specialized BERT architecture fine-tuned on healthcare corpora and recognizes 20 PHI categories, with a self-reported micro-averaged F1 of 0.9695.
Model Details
| Property | Value |
|---|---|
| Developed by | Genzeon Platform |
| Architecture | BertForTokenClassification |
| Parameters | ~110M |
| Tagging scheme | BIO (41 labels) |
| Max sequence length | 512 tokens |
| License | Apache-2.0 |
Intended Use
CliniGuard NER is designed for enterprise healthcare environments where patient data privacy is critical. Primary use cases include:
- Clinical text de-identification -- removing or masking patient identifiers before sharing medical records for research.
- PII detection -- flagging sensitive information in healthcare documents, EHRs, and discharge summaries.
- Regulatory compliance -- supporting HIPAA Safe Harbor de-identification requirements.
- Healthcare AI pipelines -- preprocessing clinical text for downstream NLP tasks while ensuring patient privacy.
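As a minimal sketch of the first use case, the function below masks detected spans with `[LABEL]` placeholders. It assumes entity dicts shaped like the output of the transformers token-classification pipeline with an aggregation strategy (`entity_group`, `start`, `end`). The function name `mask_entities` and the example spans are hypothetical and hand-written, not real model output.

```python
from typing import Dict, List


def mask_entities(text: str, entities: List[Dict]) -> str:
    """Replace each detected character span with a [LABEL] placeholder.

    Assumes each entity dict carries "entity_group", "start", and "end"
    (character offsets, end exclusive) and that spans do not overlap.
    """
    masked = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        masked = masked[: ent["start"]] + placeholder + masked[ent["end"]:]
    return masked


# Hand-written example spans for illustration (not real model output).
text = "Patient John Smith, DOB 03/15/1960."
entities = [
    {"entity_group": "PATIENT_NAME", "start": 8, "end": 18},
    {"entity_group": "DATE_OF_BIRTH", "start": 24, "end": 34},
]
print(mask_entities(text, entities))
# prints: Patient [PATIENT_NAME], DOB [DATE_OF_BIRTH].
```

Placeholders keep the entity type visible to downstream NLP tasks. Whether to mask, remove, or substitute surrogate values is a policy decision this sketch does not address.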
Entity Types
The model recognizes 20 PHI entity types using BIO tagging (41 labels total):
| Category | Entity Types |
|---|---|
| Patient identifiers | PATIENT_NAME, DATE_OF_BIRTH, AGE, GENDER, SSN, MRN |
| Contact information | PHONE, FAX, EMAIL |
| Location | ADDRESS, CITY, STATE, ZIP, COUNTRY |
| Organization | HOSPITAL |
| Provider | DOCTOR_NAME |
| Digital identifiers | USERNAME, ID_NUMBER, IP_ADDRESS, URL |
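The label count follows from the BIO scheme: each of the 20 entity types gets a `B-` and an `I-` tag, plus one shared `O` tag. The sketch below rebuilds that label set from the table above. The ordering is an assumption for illustration, and the model's actual `id2label` mapping may order labels differently.

```python
# Entity types copied from the table above.
ENTITY_TYPES = [
    "PATIENT_NAME", "DATE_OF_BIRTH", "AGE", "GENDER", "SSN", "MRN",
    "PHONE", "FAX", "EMAIL",
    "ADDRESS", "CITY", "STATE", "ZIP", "COUNTRY",
    "HOSPITAL",
    "DOCTOR_NAME",
    "USERNAME", "ID_NUMBER", "IP_ADDRESS", "URL",
]

# One "O" tag plus a B-/I- pair per entity type.
# The order here is illustrative, not the model's real id2label order.
labels = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]

print(len(ENTITY_TYPES), len(labels))
# prints: 20 41
```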
Performance
Overall Metrics
| Metric | Precision | Recall | F1 |
|---|---|---|---|
| Micro avg | 0.9659 | 0.9732 | 0.9695 |
| Macro avg | 0.9609 | 0.9706 | 0.9656 |
Per-Entity Metrics
| Entity | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| PATIENT_NAME | 0.9817 | 0.9853 | 0.9835 | 14335 |
| DATE_OF_BIRTH | 0.9798 | 0.9740 | 0.9769 | 9818 |
| AGE | 0.9028 | 0.9854 | 0.9423 | 1508 |
| GENDER | 0.9596 | 0.9885 | 0.9738 | 1562 |
| SSN | 0.9513 | 0.9935 | 0.9719 | 766 |
| MRN | 0.9938 | 0.9923 | 0.9930 | 1943 |
| PHONE | 0.9730 | 0.9869 | 0.9799 | 2590 |
| FAX | 0.9481 | 0.9454 | 0.9468 | 696 |
| EMAIL | 0.9965 | 0.9936 | 0.9950 | 4543 |
| ADDRESS | 0.9746 | 0.9844 | 0.9794 | 1985 |
| CITY | 0.9086 | 0.8891 | 0.8988 | 2047 |
| STATE | 0.9103 | 0.9060 | 0.9082 | 2734 |
| ZIP | 0.9770 | 0.9832 | 0.9801 | 951 |
| COUNTRY | 0.9485 | 0.9504 | 0.9495 | 2056 |
| HOSPITAL | 0.9033 | 0.9345 | 0.9186 | 5267 |
| DOCTOR_NAME | 0.9865 | 1.0000 | 0.9932 | 802 |
| USERNAME | 0.9689 | 0.9431 | 0.9559 | 1917 |
| ID_NUMBER | 0.9724 | 0.9898 | 0.9811 | 8555 |
| IP_ADDRESS | 0.9892 | 0.9924 | 0.9908 | 926 |
| URL | 0.9910 | 0.9947 | 0.9928 | 3001 |
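The reported averages can be cross-checked against the per-entity table. The sketch below assumes, based only on the numbers themselves, that the macro F1 is the unweighted mean of the per-entity F1 scores and that the micro F1 is the harmonic mean of the micro precision and recall. The card does not state how the averages were computed.

```python
# Per-entity F1 scores copied from the table above.
per_entity_f1 = {
    "PATIENT_NAME": 0.9835, "DATE_OF_BIRTH": 0.9769, "AGE": 0.9423,
    "GENDER": 0.9738, "SSN": 0.9719, "MRN": 0.9930, "PHONE": 0.9799,
    "FAX": 0.9468, "EMAIL": 0.9950, "ADDRESS": 0.9794, "CITY": 0.8988,
    "STATE": 0.9082, "ZIP": 0.9801, "COUNTRY": 0.9495, "HOSPITAL": 0.9186,
    "DOCTOR_NAME": 0.9932, "USERNAME": 0.9559, "ID_NUMBER": 0.9811,
    "IP_ADDRESS": 0.9908, "URL": 0.9928,
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Macro F1: every entity type counts equally, regardless of support.
macro_f1 = sum(per_entity_f1.values()) / len(per_entity_f1)

# Micro F1: derived from the pooled (micro-averaged) precision and recall.
micro_f1 = f1_score(0.9659, 0.9732)

print(f"macro F1 ~ {macro_f1:.3f}, micro F1 ~ {micro_f1:.3f}")
```

Both values agree with the reported 0.9656 and 0.9695 to within rounding. The macro average sits slightly below the micro average because lower-scoring types such as CITY and STATE weigh as much as high-support types such as PATIENT_NAME.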
Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "genzeonplatform/cliniguard-ner"

# Option 1: Use the transformers pipeline
# aggregation_strategy="simple" merges word pieces into whole entity spans.
nlp = pipeline("token-classification", model=model_name, aggregation_strategy="simple")
text = "Patient John Smith, DOB 03/15/1960, was seen at Springfield General Hospital by Dr. Jane Doe."
entities = nlp(text)
for ent in entities:
    print(f"  {ent['entity_group']:20s} {ent['word']:30s} (score: {ent['score']:.3f})")

# Option 2: Manual inference
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring label for every token.
predictions = torch.argmax(outputs.logits, dim=2)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions[0]):
    # id2label is keyed by int once the config is loaded, so no str() here.
    label = model.config.id2label[pred.item()]
    if label != "O":
        print(f"  {token:20s} -> {label}")
```
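Manual inference yields one BIO label per token, so the labels still need to be grouped into entities. The sketch below shows one way to do that. The helper name `bio_to_spans` is hypothetical, and the example labels are hand-written rather than model output. With real output the indices refer to word-piece tokens, including special tokens such as `[CLS]`.

```python
from typing import List, Tuple


def bio_to_spans(labels: List[str]) -> List[Tuple[str, int, int]]:
    """Group token-level BIO labels into (entity_type, start, end) spans.

    Indices are token positions and `end` is exclusive. An I- tag that does
    not continue the current entity is treated as the start of a new one.
    """
    spans = []
    current_type = None
    start = 0
    for i, label in enumerate(labels):
        starts_new = label.startswith("B-") or (
            label.startswith("I-") and label[2:] != current_type
        )
        if starts_new:
            if current_type is not None:
                spans.append((current_type, start, i))
            current_type, start = label[2:], i
        elif label == "O":
            if current_type is not None:
                spans.append((current_type, start, i))
            current_type = None
        # Otherwise: an I- tag continuing the current entity, nothing to do.
    if current_type is not None:
        spans.append((current_type, start, len(labels)))
    return spans


example = ["O", "B-PATIENT_NAME", "I-PATIENT_NAME", "O", "B-DATE_OF_BIRTH", "O"]
print(bio_to_spans(example))
# prints: [('PATIENT_NAME', 1, 3), ('DATE_OF_BIRTH', 4, 5)]
```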
Training Details
- Developed by: Genzeon Platform
- Architecture: Domain-specialized BERT fine-tuned on clinical corpora
- Training data: Genzeon Platform's proprietary clinical NER dataset with diverse healthcare note formats
- Epochs: up to 15 (early stopping with patience=3)
- Learning rate: 3e-5 (linear schedule with warmup)
- Batch size: 16 (train) / 32 (eval)
- Max sequence length: 512 tokens
- Optimizer: AdamW (weight decay 0.01)
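To illustrate the learning-rate schedule listed above, the sketch below reproduces the usual shape of a linear schedule with warmup: the rate rises linearly from 0 to the peak (3e-5), then decays linearly to 0. The warmup length and total step count are not given in this card, so `WARMUP_STEPS` and `TOTAL_STEPS` are hypothetical values chosen only for the example.

```python
def linear_warmup_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup phase: scale up from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale down from base_lr to 0 at total_steps.
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


BASE_LR = 3e-5       # from the training details above
WARMUP_STEPS = 100   # hypothetical, not stated in the card
TOTAL_STEPS = 1000   # hypothetical, not stated in the card

for step in (0, 50, 100, 550, 1000):
    lr = linear_warmup_lr(step, BASE_LR, WARMUP_STEPS, TOTAL_STEPS)
    print(f"step {step:4d}: lr = {lr:.2e}")
```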
Limitations
- English only: Currently optimized for English clinical text. Multilingual support is on the Genzeon Platform roadmap.
- Recommended with human-in-the-loop: For high-stakes de-identification workflows, Genzeon Platform recommends pairing CliniGuard with human review for maximum safety.
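- Entity coverage: Covers 20 common PHI/PII types. These overlap with, but are not identical to, the 18 identifier categories of the HIPAA Safe Harbor method (for example, GENDER and COUNTRY are not Safe Harbor identifiers, while biometric and vehicle identifiers are not covered). Rare or domain-specific identifiers may require custom fine-tuning -- contact Genzeon Platform for enterprise support.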
- Context window: Limited to 512 tokens per input. Longer documents should be chunked with overlap for best results.
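One way to chunk a long document with overlap is a sliding window over its token IDs, sketched below. The helper name `chunk_with_overlap` and the default overlap of 64 tokens are assumptions for illustration, not a recommendation from Genzeon Platform. In practice the window should leave room for the special tokens the tokenizer adds to each chunk (typically two for BERT-style models).

```python
from typing import List


def chunk_with_overlap(token_ids: List[int], max_len: int = 512, overlap: int = 64) -> List[List[int]]:
    """Split a token sequence into windows of at most `max_len` tokens.

    Consecutive windows share `overlap` tokens, so an entity cut off at the
    end of one window appears whole in the next.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the last window already reaches the end of the input
    return chunks


# Small numbers so the windows are easy to read.
print(chunk_with_overlap(list(range(10)), max_len=4, overlap=1))
# prints: [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Predictions in overlapping regions then need to be reconciled, for example by keeping the higher-scoring prediction for each token. Transformers tokenizers also offer built-in windowing through the `stride` and `return_overflowing_tokens` arguments.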
About Genzeon Platform
Genzeon Platform is a healthcare technology company specializing in AI-powered solutions for clinical data management, regulatory compliance, and healthcare interoperability. CliniGuard NER is part of Genzeon Platform's suite of healthcare AI tools designed to accelerate clinical research while safeguarding patient privacy.
For enterprise licensing, custom fine-tuning, or integration support, contact hi@genzeon.one.
Evaluation results
- Micro F1 (self-reported): 0.970
- Micro Precision (self-reported): 0.966
- Micro Recall (self-reported): 0.973