bharathjanumpally/phi-span-detector-deberta-v3

@@ -2,27 +2,125 @@
 language: en
 license: apache-2.0
 tags:
-- token-classification
-- ner
-- deidentification
-- privacy
-- healthcare
 ---
-# PHI Span Detector (Synthetic)
-This model detects PHI spans (BIO tagging) in clinical-note-like text and log-like text. It is trained on **synthetic** data.
-## Labels
-NAME, DATE, AGE, PHONE, EMAIL, ADDRESS, ID, PROVIDER, FACILITY, LOCATION
-## Intended use
-- Research & prototyping
-- Pre-log redaction guardrails
 ## Limitations
-- Synthetic training data; may miss real-world edge cases.
-- Not a substitute for compliance programs.
-## Inference
-Use `phi_guardrails.PhiGuardrails` (see repo) or standard Transformers token-classification pipeline.

 language: en
 license: apache-2.0
 tags:
+  - token-classification
+  - ner
+  - privacy
+  - healthcare
+  - deidentification
+  - security
+  - compliance
+pipeline_tag: token-classification
+library_name: transformers
 ---
+# PHI Span Detector (BIO NER) — Synthetic
+This model detects **Protected Health Information (PHI)** spans in clinical-note-like text and log-like text using **BIO tagging** (token classification). It is intended to power **deterministic redaction** and **zero-trust logging guardrails**.
+## PHI Types
+The model predicts spans for the following categories:
+- **NAME**
+- **DATE**
+- **AGE**
+- **PHONE**
+- **EMAIL**
+- **ADDRESS**
+- **ID** (e.g., MRN/account/record IDs)
+- **PROVIDER**
+- **FACILITY**
+- **LOCATION**
+Output is BIO-formatted per token (e.g., `B-NAME`, `I-NAME`, …).
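To make the BIO convention concrete, here is a minimal sketch of how per-token tags could be grouped back into spans. The function name `bio_to_spans` and the lenient handling of a stray `I-` tag are assumptions made for illustration; they are not taken from the model's own code.

```python
def bio_to_spans(tags):
    """Group per-token BIO tags into (label, start_token, end_token) spans.

    end_token is exclusive. A stray I- tag with no matching open span
    is treated as the start of a new span (a lenient, assumed choice).
    """
    spans = []
    current = None  # [label, start, end) of the span being built
    for i, tag in enumerate(tags):
        if tag == "O":
            if current:
                spans.append(tuple(current))
                current = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "I" and current and current[0] == label:
            current[2] = i + 1  # extend the open span
        else:
            if current:
                spans.append(tuple(current))
            current = [label, i, i + 1]  # open a new span
    if current:
        spans.append(tuple(current))
    return spans


tags = ["O", "B-NAME", "I-NAME", "O", "B-ID", "O", "B-DATE"]
print(bio_to_spans(tags))
```

When the Transformers pipeline is used with an aggregation strategy, this grouping is already done for you; the sketch is mainly useful when working with raw per-token predictions.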
+---
+## How it works
+This is a **token-classification** model trained on **synthetic** examples to keep the project openly shareable:
+1. Synthetic clinical notes and log lines are generated using templates.
+2. PHI-like fields are inserted (names, IDs, phone numbers, dates, addresses, etc.).
+3. Gold labels are produced automatically as character spans and converted to BIO token labels.
+This produces clean supervision without using real patient data.
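The span-to-BIO conversion in step 3 can be sketched as follows. This is a simplified illustration that assumes whitespace tokenization; the actual training code presumably aligns labels to the model's subword tokenizer, and the helper name `char_spans_to_bio` is hypothetical.

```python
import re


def char_spans_to_bio(text, spans):
    """Convert character spans to per-token BIO tags.

    spans: list of (start, end, label) tuples, end exclusive.
    Tokens are whitespace-delimited here for simplicity (an assumption).
    """
    tokens, tags = [], []
    for m in re.finditer(r"\S+", text):
        tok_start, tok_end = m.span()
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span if the two ranges overlap.
            if tok_start < end and tok_end > start:
                # The first token of a span gets B-, later tokens get I-.
                tag = ("B-" if tok_start <= start else "I-") + label
                break
        tokens.append(m.group())
        tags.append(tag)
    return tokens, tags


text = "Patient John Smith seen on 12/19/2025"
spans = [(8, 18, "NAME"), (27, 37, "DATE")]
tokens, tags = char_spans_to_bio(text, spans)
for tok, tag in zip(tokens, tags):
    print(f"{tok}\t{tag}")
```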
+---
+## Intended Use
+✅ **Appropriate uses**
+- PHI span detection for research prototypes
+- Pre-log / post-log redaction guardrails
+- De-identification pipelines when paired with deterministic redaction
+❌ **Not intended for**
+- Medical diagnosis or treatment advice
+- Sole control for compliance (HIPAA/GDPR) decisions
+- High-stakes production usage without additional safeguards and evaluation
+**Recommended pipeline:** Detect spans → deterministic redaction → secondary leak-check gate.
+---
 ## Limitations
+- Trained on **synthetic** text: real-world clinical documentation can include unseen formats and edge cases.
+- May over-redact (false positives) on numeric identifiers or location-like strings.
+- May miss rare PHI patterns not represented in synthetic templates.
+If you use this model in a real system, evaluate it on your organization’s internal test set and consider adding:
+- regex backstops (email/phone/date patterns)
+- human-in-the-loop review for flagged cases
+- a secondary “PHI leak checker” model
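A regex backstop of the kind listed above might look like the sketch below. The patterns are deliberately narrow illustrations (US-style phone numbers, slash-separated dates) and the names `BACKSTOP_PATTERNS` and `regex_backstop` are assumptions, not part of this repository. The function returns only regex hits that no model span already covers.

```python
import re

# Illustrative patterns only; real deployments need broader, locale-aware rules.
BACKSTOP_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}


def regex_backstop(text, model_spans):
    """Return regex matches that are not fully inside any model span."""
    extra = []
    for label, pattern in BACKSTOP_PATTERNS.items():
        for m in pattern.finditer(text):
            covered = any(
                s["start"] <= m.start() and m.end() <= s["end"]
                for s in model_spans
            )
            if not covered:
                extra.append(
                    {"start": m.start(), "end": m.end(), "label": label, "source": "regex"}
                )
    return sorted(extra, key=lambda s: s["start"])


text = "Call 555-123-4567 or mail jane@example.org by 01/02/2025."
# Pretend the model caught only the phone number.
phone_start = text.index("555-123-4567")
model_spans = [{"start": phone_start, "end": phone_start + 12, "label": "PHONE", "score": 0.9}]
hits = regex_backstop(text, model_spans)
for hit in hits:
    print(hit["label"], text[hit["start"]:hit["end"]])
```

The same function can double as a crude leak check by running it on already-redacted text with an empty `model_spans` list: any hit means something slipped through.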
+---
+## Usage
+### 1) Transformers token-classification pipeline
+```python
+from transformers import pipeline
+ner = pipeline(
+    "token-classification",
+    model="bharathjanumpally/phi-span-detector-deberta-v3",
+    aggregation_strategy="simple"
+)
+text = "Patient John Smith (MRN: 001-23-4567) visited Boston Medical Center on 12/19/2025."
+print(ner(text))
+```
+### 2) Deterministic redaction (recommended)
+Replace each detected span with a typed placeholder such as `[NAME]`, `[ID]`, or `[DATE]`.
+(See companion project: PHI Guardrails.)
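As a rough illustration of deterministic redaction, the sketch below replaces character spans with `[LABEL]` placeholders. It is not the companion project's implementation; the function name `redact` and the choice to reject overlapping spans are assumptions. Spans are dicts with `start`, `end` (exclusive), and `label` keys.

```python
def redact(text, spans):
    """Replace each span with a [LABEL] placeholder.

    Spans are applied right to left so earlier offsets stay valid.
    Overlapping spans are not merged in this sketch; an overlap raises.
    """
    ordered = sorted(spans, key=lambda s: s["start"])
    for prev, nxt in zip(ordered, ordered[1:]):
        if nxt["start"] < prev["end"]:
            raise ValueError("overlapping spans")
    for s in reversed(ordered):
        text = text[: s["start"]] + f"[{s['label']}]" + text[s["end"]:]
    return text


text = "Patient John Smith (MRN: 001-23-4567) visited Boston Medical Center on 12/19/2025."
spans = [
    {"start": 8, "end": 18, "label": "NAME"},
    {"start": 25, "end": 36, "label": "ID"},
    {"start": 46, "end": 67, "label": "FACILITY"},
    {"start": 71, "end": 81, "label": "DATE"},
]
print(redact(text, spans))
# Patient [NAME] (MRN: [ID]) visited [FACILITY] on [DATE].
```

Because the replacement depends only on offsets and labels, the same input spans always produce the same output, which is what makes the step auditable.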
+### Output Schema (recommended)
+A practical production-friendly span format, shown here for the example sentence above with character offsets (end exclusive):
+```json
+[
+  {"start": 8, "end": 18, "label": "NAME", "score": 0.97},
+  {"start": 25, "end": 36, "label": "ID", "score": 0.94},
+  {"start": 46, "end": 67, "label": "FACILITY", "score": 0.91},
+  {"start": 71, "end": 81, "label": "DATE", "score": 0.89}
+]
+```
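One way to get from the pipeline output to this span format is sketched below. With an aggregation strategy set, the Transformers token-classification pipeline returns dicts with `entity_group`, `score`, `word`, `start`, and `end` keys; the helper name `to_span_schema` and its `min_score` parameter are hypothetical additions. The input here is a hand-written stand-in so the sketch runs without downloading the model.

```python
def to_span_schema(entities, min_score=0.0):
    """Map aggregated pipeline entities to {start, end, label, score} dicts."""
    spans = [
        {
            "start": int(e["start"]),
            "end": int(e["end"]),
            "label": e["entity_group"],
            # Cast to a plain float so the result is JSON-serializable.
            "score": round(float(e["score"]), 2),
        }
        for e in entities
        if e["score"] >= min_score
    ]
    return sorted(spans, key=lambda s: s["start"])


# Hand-written stand-in for pipeline output (not real model predictions).
entities = [
    {"entity_group": "ID", "score": 0.94, "word": "001-23-4567", "start": 25, "end": 36},
    {"entity_group": "NAME", "score": 0.97, "word": "John Smith", "start": 8, "end": 18},
]
print(to_span_schema(entities))
```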
+### Safety & Privacy
+This model is trained on synthetic data and is published for research and tooling purposes.
+Do not upload real PHI to public endpoints or demos. Use private infrastructure for real deployments.
+### Citation
+```bibtex
+@misc{janumpally_phi_span_detector_2025,
+  title        = {PHI Span Detector (Synthetic)},
+  author       = {Bharath Kumar Reddy Janumpally},
+  year         = {2025},
+  publisher    = {Hugging Face},
+  howpublished = {Model on Hugging Face}
+}
+```