	---
	language: en
	license: apache-2.0
	tags:
	- token-classification
	- ner
	- privacy
	- healthcare
	- deidentification
	- security
	- compliance
	pipeline_tag: token-classification
	library_name: transformers
	---

	# PHI Span Detector (BIO NER) — Synthetic

	This model detects Protected Health Information (PHI) spans in clinical-note-like text and log-like text using BIO tagging (token classification). It is intended to power deterministic redaction and zero-trust logging guardrails.

	## PHI Types

	The model predicts spans for the following categories:

	- NAME
	- DATE
	- AGE
	- PHONE
	- EMAIL
	- ADDRESS
	- ID (e.g., MRN/account/record IDs)
	- PROVIDER
	- FACILITY
	- LOCATION

	Output is BIO-formatted per token (e.g., `B-NAME`, `I-NAME`, …).
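The label space implied by the list above can be enumerated as in the sketch below. The ordering and integer ids here are assumptions for illustration; the `id2label` mapping in the model's own `config.json` should be treated as authoritative.

```python
# Entity types as listed in this card; the ordering is an assumption.
PHI_TYPES = ["NAME", "DATE", "AGE", "PHONE", "EMAIL",
             "ADDRESS", "ID", "PROVIDER", "FACILITY", "LOCATION"]

# "O" marks non-PHI tokens; each type gets a B- (begin) and an I- (inside) tag.
labels = ["O"] + [f"{prefix}-{t}" for t in PHI_TYPES for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

print(len(labels))  # 10 types x 2 prefixes + "O" = 21
```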

	---

	## How it works

	The model is a token classifier trained on synthetic examples, which keeps the project openly shareable. The training data is built in three steps:

	1. Synthetic clinical notes and log lines are generated using templates.
	2. PHI-like fields are inserted (names, IDs, phone numbers, dates, addresses, etc.).
	3. Gold labels are produced automatically as character spans and converted to BIO token labels.

	This produces clean supervision without using real patient data.
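The three steps can be sketched roughly as follows. This is not the project's actual generator: the `{TYPE}` template syntax and the helper names `fill_template` and `spans_to_bio` are hypothetical, and whitespace tokenization stands in for the model's subword tokenizer, where labels would normally be aligned through the tokenizer's offset mappings.

```python
import re

def fill_template(template, fields):
    """Fill {TYPE} slots and record the character span of each inserted value."""
    text, spans, pos = "", [], 0
    for m in re.finditer(r"\{(\w+)\}", template):
        text += template[pos:m.start()]
        value = fields[m.group(1)]
        spans.append({"start": len(text), "end": len(text) + len(value),
                      "label": m.group(1)})
        text += value
        pos = m.end()
    text += template[pos:]
    return text, spans

def spans_to_bio(text, spans):
    """Tag whitespace tokens with BIO labels based on character-span overlap."""
    tokens, tags = [], []
    for m in re.finditer(r"\S+", text):
        tag = "O"
        for s in spans:
            if m.start() < s["end"] and m.end() > s["start"]:  # token overlaps span
                # The first token of a span gets B-, later tokens get I-.
                tag = ("B-" if m.start() <= s["start"] else "I-") + s["label"]
                break
        tokens.append(m.group())
        tags.append(tag)
    return tokens, tags

text, spans = fill_template("Patient {NAME} seen on {DATE}.",
                            {"NAME": "Jane Doe", "DATE": "01/02/2024"})
tokens, tags = spans_to_bio(text, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Because the spans are recorded at insertion time, the labels are exact by construction, which is what makes the supervision clean.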

	---

	## Intended Use

	✅ Appropriate uses
	- PHI span detection for research prototypes
	- Pre-log / post-log redaction guardrails
	- De-identification pipelines when paired with deterministic redaction

	❌ Not intended for
	- Medical diagnosis or treatment advice
	- Sole control for compliance (HIPAA/GDPR) decisions
	- High-stakes production usage without additional safeguards and evaluation

	Recommended pipeline: Detect spans → deterministic redaction → secondary leak-check gate.
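One way to wire the three stages together is shown below, as a sketch only. `run_guardrail` and the `toy_*` functions are hypothetical stand-ins so the example runs without downloading a model; in practice `detect` would wrap this model and `leak_check` would be a stricter second detector. The design choice illustrated is failing closed: if the gate still sees possible PHI, the whole message is withheld.

```python
from typing import Callable, Dict, List

Span = Dict[str, object]  # {"start": int, "end": int, "label": str}

def run_guardrail(
    text: str,
    detect: Callable[[str], List[Span]],
    redact: Callable[[str, List[Span]], str],
    leak_check: Callable[[str], bool],
) -> str:
    """Detect spans, redact them, then fail closed if the gate still sees PHI."""
    redacted = redact(text, detect(text))
    if leak_check(redacted):
        # Withhold the whole message rather than emit a partial redaction.
        return "[WITHHELD: possible PHI after redaction]"
    return redacted

# Toy stand-ins (assumptions for illustration only).
def toy_detect(text):
    i = text.find("John Smith")
    return [] if i < 0 else [{"start": i, "end": i + 10, "label": "NAME"}]

def toy_redact(text, spans):
    for s in spans:  # toy version: at most one span, so offsets stay valid
        text = text[:s["start"]] + "[" + s["label"] + "]" + text[s["end"]:]
    return text

def toy_leak_check(text):
    return "@" in text  # crude stand-in for a second detector

ok = run_guardrail("John Smith checked in.", toy_detect, toy_redact, toy_leak_check)
blocked = run_guardrail("John Smith <js@example.com>", toy_detect, toy_redact, toy_leak_check)
print(ok)
print(blocked)
```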

	---

	## Limitations

	- Trained on synthetic text: real-world clinical documentation can include unseen formats and edge cases.
	- May over-redact (false positives) on numeric identifiers or location-like strings.
	- May miss rare PHI patterns not represented in synthetic templates.

	If you use the model in a real system, evaluate it on your organization’s internal test set and consider adding:
	- regex backstops (email/phone/date patterns)
	- human-in-the-loop review for flagged cases
	- a secondary “PHI leak checker” model
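A regex backstop could look like the sketch below. The patterns are deliberately simple and assume US-style formats, and `add_backstops` is a hypothetical helper, not part of this model or its companion project. Regex hits are added only where the model found nothing, so the model's own labels are kept when both agree.

```python
import re

# Simple illustrative patterns (assumptions); tune them to your own data.
BACKSTOPS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

def add_backstops(text, model_spans):
    """Add regex hits that do not overlap any span already in the list."""
    spans = list(model_spans)
    for label, pattern in BACKSTOPS.items():
        for m in pattern.finditer(text):
            overlaps = any(m.start() < s["end"] and m.end() > s["start"] for s in spans)
            if not overlaps:
                spans.append({"start": m.start(), "end": m.end(),
                              "label": label, "source": "regex"})
    return sorted(spans, key=lambda s: s["start"])

text = "Call 617-555-0123 or mail jane.doe@example.org before 12/19/2025."
# Pretend the model found only the date (hand-written, not real model output).
model_spans = [{"start": 54, "end": 64, "label": "DATE", "score": 0.89}]

merged = add_backstops(text, model_spans)
for s in merged:
    print(s["label"], repr(text[s["start"]:s["end"]]))
```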

	---

	## Usage

	### 1) Transformers token-classification pipeline
	```python
	from transformers import pipeline

	ner = pipeline(
	    "token-classification",
	    model="bharathja/phi-span-detector-deberta-v3",
	    aggregation_strategy="simple",  # merge B-/I- tokens into entity spans
	)

	text = "Patient John Smith (MRN: 001-23-4567) visited Boston Medical Center on 12/19/2025."
	print(ner(text))
	```
	### 2) Deterministic redaction (recommended)

	Use detected spans to redact with placeholders such as [NAME], [ID], [DATE], etc.
	(See companion project: PHI Guardrails.)
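A minimal sketch of placeholder redaction follows. It is not the companion project's implementation: the `redact` helper is hypothetical, it assumes the spans do not overlap, and the offsets were computed by hand for this sentence rather than taken from model output.

```python
def redact(text, spans):
    """Replace each span with a [LABEL] placeholder, working right to left."""
    out = text
    # Going right to left keeps earlier offsets valid as the string changes length.
    for s in sorted(spans, key=lambda s: s["start"], reverse=True):
        out = out[:s["start"]] + f"[{s['label']}]" + out[s["end"]:]
    return out

text = "Patient John Smith (MRN: 001-23-4567) visited Boston Medical Center on 12/19/2025."
spans = [
    {"start": 8, "end": 18, "label": "NAME"},
    {"start": 25, "end": 36, "label": "ID"},
    {"start": 46, "end": 67, "label": "FACILITY"},
    {"start": 71, "end": 81, "label": "DATE"},
]
redacted = redact(text, spans)
print(redacted)
# Patient [NAME] (MRN: [ID]) visited [FACILITY] on [DATE].
```

The same input and spans always give the same output, which is what makes the redaction step deterministic and easy to audit.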

	### Output Schema (recommended)

	A practical, production-friendly span format uses character offsets with an exclusive `end`. For the example sentence above (scores are illustrative):

	```json
	[
	  {"start": 8, "end": 18, "label": "NAME", "score": 0.97},
	  {"start": 25, "end": 36, "label": "ID", "score": 0.94},
	  {"start": 46, "end": 67, "label": "FACILITY", "score": 0.91},
	  {"start": 71, "end": 81, "label": "DATE", "score": 0.89}
	]
	```
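Output from the Transformers pipeline with an aggregation strategy can be mapped onto this schema along the following lines. `to_spans` and its `min_score` parameter are hypothetical, and the sample input is hand-written to mimic the pipeline's `entity_group`, `score`, `word`, `start`, `end` keys; it is not real model output.

```python
import json

def to_spans(pipeline_output, min_score=0.0):
    """Convert aggregated pipeline entities into plain, JSON-serializable spans."""
    return [
        {
            "start": int(e["start"]),
            "end": int(e["end"]),
            "label": e["entity_group"],
            # The pipeline returns NumPy floats; cast so json.dumps accepts them.
            "score": round(float(e["score"]), 2),
        }
        for e in pipeline_output
        if e["score"] >= min_score
    ]

# Hand-written sample shaped like pipeline output (illustrative values).
sample = [
    {"entity_group": "NAME", "score": 0.97, "word": "John Smith", "start": 8, "end": 18},
    {"entity_group": "ID", "score": 0.41, "word": "001-23-4567", "start": 25, "end": 36},
]

result = to_spans(sample, min_score=0.5)
print(json.dumps(result))
```

A score threshold trades recall for precision; for redaction, a low threshold is usually the safer default, since a missed span is costlier than an extra placeholder.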

	## Safety & Privacy

	This model is trained on synthetic data and is published for research and tooling purposes.
	Do not upload real PHI to public endpoints or demos. Use private infrastructure for real deployments.

	## Citation

	```bibtex
	@misc{janumpally_phi_span_detector_2025,
	  title = {PHI Span Detector (Synthetic)},
	  author = {Bharath Kumar Reddy Janumpally},
	  year = {2025},
	  publisher = {Hugging Face},
	  howpublished = {Model on Hugging Face}
	}
	```