Janumpally committed
Commit 83dc153 · verified · 1 Parent(s): f726e0a

Update README.md

  1. README.md +114 -16
README.md CHANGED
@@ -2,27 +2,125 @@
2
  language: en
3
  license: apache-2.0
4
  tags:
5
- - token-classification
6
- - ner
7
- - deidentification
8
- - privacy
9
- - healthcare
10
  ---
11
 
12
- # PHI Span Detector (Synthetic)
13
 
14
- This model detects PHI spans (BIO tagging) in clinical-note-like text and log-like text. It is trained on **synthetic** data.
15
 
16
- ## Labels
17
- NAME, DATE, AGE, PHONE, EMAIL, ADDRESS, ID, PROVIDER, FACILITY, LOCATION
18
 
19
- ## Intended use
20
- - Research & prototyping
21
- - Pre-log redaction guardrails
22
 
23
  ## Limitations
24
- - Synthetic training data; may miss real-world edge cases.
25
- - Not a substitute for compliance programs.
26
 
27
- ## Inference
28
- Use `phi_guardrails.PhiGuardrails` (see repo) or standard Transformers token-classification pipeline.
2
  language: en
3
  license: apache-2.0
4
  tags:
5
+ - token-classification
6
+ - ner
7
+ - privacy
8
+ - healthcare
9
+ - deidentification
10
+ - security
11
+ - compliance
12
+ pipeline_tag: token-classification
13
+ library_name: transformers
14
  ---
15
 
16
+ # PHI Span Detector (BIO NER) — Synthetic
17
 
18
+ This model detects **Protected Health Information (PHI)** spans in clinical-note-like text and log-like text using **BIO tagging** (token classification). It is intended to power **deterministic redaction** and **zero-trust logging guardrails**.
19
 
20
+ ## PHI Types
21
 
22
+ The model predicts spans for the following categories:
23
+
24
+ - **NAME**
25
+ - **DATE**
26
+ - **AGE**
27
+ - **PHONE**
28
+ - **EMAIL**
29
+ - **ADDRESS**
30
+ - **ID** (e.g., MRN/account/record IDs)
31
+ - **PROVIDER**
32
+ - **FACILITY**
33
+ - **LOCATION**
34
+
35
+ Output is BIO-formatted per token (e.g., `B-NAME`, `I-NAME`, …).
36
+
37
+ ---
38
+
39
+ ## How it works
40
+
41
+ This is a **token-classification** model trained on **synthetic** examples to keep the project openly shareable:
42
+
43
+ 1. Synthetic clinical notes and log lines are generated using templates.
44
+ 2. PHI-like fields are inserted (names, IDs, phone numbers, dates, addresses, etc.).
45
+ 3. Gold labels are produced automatically as character spans and converted to BIO token labels.
46
+
47
+ This produces clean supervision without using real patient data.
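
Step 3 above (character spans converted to BIO token labels) can be sketched as follows. This is a minimal illustration, not the project's actual labeling code: the helper name `spans_to_bio` is hypothetical, offsets are assumed to be end-exclusive, and tokens are split on whitespace, whereas a real pipeline would align labels to the model tokenizer's subword offsets.

```python
import re


def spans_to_bio(text, spans):
    """Convert character-level spans to per-token BIO labels.

    `spans` is a list of (start, end, label) tuples with `end` exclusive.
    Tokens are whitespace-delimited purely for illustration.
    """
    tokens, labels = [], []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span if the two character ranges overlap.
            if tok_start < end and tok_end > start:
                # The first token of a span gets B-, later tokens get I-.
                tag = ("B-" if tok_start <= start else "I-") + label
                break
        tokens.append(match.group())
        labels.append(tag)
    return tokens, labels


text = "Patient John Smith seen on 12/19/2025."
spans = [(8, 18, "NAME"), (27, 37, "DATE")]
tokens, labels = spans_to_bio(text, spans)
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

Note that the trailing period stays attached to the date token under whitespace splitting, which is one reason subword-level alignment is preferable in practice.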
48
+
49
+ ---
50
+
51
+ ## Intended Use
52
+
53
+ ✅ **Appropriate uses**
54
+ - PHI span detection for research prototypes
55
+ - Pre-log / post-log redaction guardrails
56
+ - De-identification pipelines when paired with deterministic redaction
57
+
58
+ ❌ **Not intended for**
59
+ - Medical diagnosis or treatment advice
60
+ - Sole control for compliance (HIPAA/GDPR) decisions
61
+ - High-stakes production usage without additional safeguards and evaluation
62
+
63
+ **Recommended pipeline:** Detect spans → deterministic redaction → secondary leak-check gate.
64
+
65
+ ---
66
 
67
  ## Limitations
68
 
69
+ - Trained on **synthetic** text: real-world clinical documentation can include unseen formats and edge cases.
70
+ - May over-redact (false positives) on numeric identifiers or location-like strings.
71
+ - May miss rare PHI patterns not represented in synthetic templates.
72
+
73
+ If using in a real system, evaluate on your organization’s internal test set and consider adding:
74
+ - regex backstops (email/phone/date patterns)
75
+ - human-in-the-loop review for flagged cases
76
+ - a secondary “PHI leak checker” model
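
As one possible shape for the regex backstop mentioned above, the sketch below flags email, phone, and date patterns that no model span already covers. The patterns, the `BACKSTOP_PATTERNS` name, and the `regex_backstop` helper are illustrative assumptions; they handle only simple US-style formats and are not part of the released model or its companion project.

```python
import re

# Illustrative patterns only; real deployments need broader, locale-aware rules.
BACKSTOP_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def regex_backstop(text, model_spans=()):
    """Return regex matches that are not fully covered by any model span.

    `model_spans` is an iterable of dicts with "start" and "end" (exclusive).
    """
    extra = []
    for label, pattern in BACKSTOP_PATTERNS.items():
        for m in pattern.finditer(text):
            covered = any(
                s["start"] <= m.start() and m.end() <= s["end"]
                for s in model_spans
            )
            if not covered:
                extra.append({"start": m.start(), "end": m.end(), "label": label})
    return sorted(extra, key=lambda s: s["start"])


text = "Reach Jane at jane.doe@example.org or 617-555-0123 before 12/19/2025."
found = regex_backstop(text)
for span in found:
    print(span["label"], text[span["start"]:span["end"]])
```

Merging these extra spans with the model's spans before redaction trades some over-redaction for fewer missed identifiers.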
77
+
78
+ ---
79
+
80
+ ## Usage
81
+
82
+ ### 1) Transformers token-classification pipeline
83
+ ```python
84
+ from transformers import pipeline
85
+
86
+ ner = pipeline(
87
+     "token-classification",
88
+     model="bharathja/phi-span-detector-deberta-v3",
89
+     aggregation_strategy="simple"
90
+ )
91
+
92
+ text = "Patient John Smith (MRN: 001-23-4567) visited Boston Medical Center on 12/19/2025."
93
+ print(ner(text))
94
+ ```
95
+ ### 2) Deterministic redaction (recommended)
96
+
97
+ Replace each detected span with a fixed placeholder such as `[NAME]`, `[ID]`, or `[DATE]`.
98
+ (See companion project: PHI Guardrails.)
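
A minimal sketch of the redaction step, assuming non-overlapping spans given as dicts with `start`, `end` (exclusive), and `label`. The `redact` helper is hypothetical and is not the companion project's API. With the Transformers pipeline and an aggregation strategy, each result carries `start`, `end`, and `entity_group`, so `entity_group` would map to `label` here.

```python
def redact(text, spans):
    """Replace each span with a [LABEL] placeholder.

    Assumes spans do not overlap; `end` offsets are exclusive.
    """
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        out = out[:span["start"]] + f"[{span['label']}]" + out[span["end"]:]
    return out


text = "Patient John Smith (MRN: 001-23-4567) visited Boston Medical Center on 12/19/2025."
spans = [
    {"start": 8, "end": 18, "label": "NAME"},
    {"start": 25, "end": 36, "label": "ID"},
    {"start": 46, "end": 67, "label": "FACILITY"},
    {"start": 71, "end": 81, "label": "DATE"},
]
print(redact(text, spans))
# Patient [NAME] (MRN: [ID]) visited [FACILITY] on [DATE].
```

Because the replacement depends only on the offsets and labels, the same spans always yield the same redacted text, which is what makes this step deterministic.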
99
+
100
+ ### Output schema (recommended)
101
+
102
+ A practical, production-friendly span format (character offsets into the example sentence above, end-exclusive; scores are illustrative):
103
+ ```json
104
+ [
105
+   {"start": 8, "end": 18, "label": "NAME", "score": 0.97},
106
+   {"start": 25, "end": 36, "label": "ID", "score": 0.94},
107
+   {"start": 46, "end": 67, "label": "FACILITY", "score": 0.91},
108
+   {"start": 71, "end": 81, "label": "DATE", "score": 0.89}
109
+ ]
110
+ ```
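
One way to produce this schema from the pipeline's aggregated output is sketched below. The `to_span_schema` helper and its `min_score` parameter are hypothetical, and the sample `entities` list is hand-written to mimic the pipeline's result shape (`entity_group`, `score`, `word`, `start`, `end`) rather than captured from the model.

```python
import json


def to_span_schema(entities, min_score=0.0):
    """Map aggregated pipeline results to {start, end, label, score} dicts."""
    spans = []
    for ent in entities:
        # Pipeline scores may be NumPy floats; cast so json.dumps accepts them.
        score = float(ent["score"])
        if score < min_score:
            continue
        spans.append({
            "start": int(ent["start"]),
            "end": int(ent["end"]),
            "label": ent["entity_group"],
            "score": round(score, 2),
        })
    return sorted(spans, key=lambda s: s["start"])


# Hand-written sample in the shape the pipeline returns; not real model output.
entities = [
    {"entity_group": "ID", "score": 0.94, "word": "001-23-4567", "start": 25, "end": 36},
    {"entity_group": "NAME", "score": 0.97, "word": "John Smith", "start": 8, "end": 18},
]
schema_spans = to_span_schema(entities, min_score=0.5)
print(json.dumps(schema_spans, indent=2))
```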
111
+
112
+ ## Safety & Privacy
113
+
114
+ This model is trained on synthetic data and is published for research and tooling purposes.
115
+ Do not upload real PHI to public endpoints or demos. Use private infrastructure for real deployments.
116
+
117
+ ## Citation
118
+ ```bibtex
119
+ @misc{janumpally_phi_span_detector_2025,
120
+   title = {PHI Span Detector (Synthetic)},
121
+   author = {Bharath Kumar Reddy Janumpally},
122
+   year = {2025},
123
+   publisher = {Hugging Face},
124
+   howpublished = {Model on Hugging Face}
125
+ }
126
+ ```