Files changed (1)
  1. README.md +267 -0
README.md ADDED
@@ -0,0 +1,267 @@
+ ---
+ language:
+ - en
+ license: mit
+ pipeline_tag: token-classification
+ task_categories:
+ - token-classification
+ tags:
+ - medical
+ - biomedical
+ - ner
+ - named-entity-recognition
+ - biobert
+ - jargon-detection
+ datasets:
+ - tner/bc5cdr
+ base_model: dmis-lab/biobert-v1.1
+ metrics:
+ - f1
+ - precision
+ - recall
+ model-index:
+ - name: BioBERT-BC5CDR-NER
+   results:
+   - task:
+       type: token-classification
+       name: Named Entity Recognition
+     dataset:
+       name: BC5CDR
+       type: tner/bc5cdr
+     metrics:
+     - type: f1
+       value: 0.88
+       name: F1 Score
+     - type: precision
+       value: 0.88
+     - type: recall
+       value: 0.89
+ ---
+
+ # Medical Named Entity Recognition (NER) Model
+
+ ## Model Description
+
+ This model is a fine-tuned version of [dmis-lab/biobert-v1.1](https://huggingface.co/dmis-lab/biobert-v1.1) on the BC5CDR dataset for medical named entity recognition.
+
+ **What it does:** Identifies medical terminology in text, specifically:
+ - **Chemical entities**: Drug names, chemical compounds (e.g., aspirin, metformin)
+ - **Disease entities**: Medical conditions, diseases (e.g., hypertension, diabetes)
+
+ **Intended use:** Assist in reading medical literature by highlighting and explaining technical terminology.
+
+ ## Training Data
+
+ - **Dataset**: [BC5CDR](https://huggingface.co/datasets/tner/bc5cdr) (BioCreative V Chemical Disease Relation)
+ - **Training samples**: 5,228 sentences
+ - **Validation samples**: 5,330 sentences
+ - **Test samples**: 5,865 sentences
+ - **Entity types**: 2 (Chemical, Disease), encoded as 5 labels (O, B-Chemical, I-Chemical, B-Disease, I-Disease)
+
+ ## Model Performance
+
+ Evaluated on the BC5CDR test set:
+
+ | Metric    | Score    |
+ |-----------|----------|
+ | F1 Score  | 0.918555 |
+ | Precision | 0.905610 |
+ | Recall    | 0.931875 |
+
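As a quick sanity check, F1 is the harmonic mean of precision and recall, so the three scores in the table can be cross-checked against each other. The sketch below does that arithmetic; the helper name `f1_from` is illustrative and not part of any library.

```python
def f1_from(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the table above
precision, recall = 0.905610, 0.931875
f1 = f1_from(precision, recall)
print(f"{f1:.4f}")  # 0.9186
```

The result agrees with the reported F1 of 0.918555 to within rounding.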
+ ## Usage
+
+ ### Basic Usage
+ ```python
+ from transformers import pipeline
+
+ # Load the model ("{repo_id}" is a placeholder for the Hub repository ID)
+ ner = pipeline(
+     "token-classification",
+     model="{repo_id}",
+     aggregation_strategy="simple",
+ )
+
+ # Analyze medical text
+ text = "Patient diagnosed with hypertension and prescribed metformin."
+ results = ner(text)
+
+ # Print each detected entity with its type and confidence score
+ for entity in results:
+     print(f"{entity['word']}: {entity['entity_group']} ({entity['score']:.2f})")
+ ```
+
+ **Output:**
+ ```
+ hypertension: Disease (0.99)
+ metformin: Chemical (0.99)
+ ```
+
+ ### Advanced Usage
+ ```python
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
+ import torch
+
+ # Load model and tokenizer ("{repo_id}" is a placeholder for the Hub repository ID)
+ tokenizer = AutoTokenizer.from_pretrained("{repo_id}")
+ model = AutoModelForTokenClassification.from_pretrained("{repo_id}")
+
+ # Tokenize input
+ text = "Patient has diabetes and takes aspirin."
+ inputs = tokenizer(text, return_tensors="pt")
+
+ # Get predictions (one label ID per WordPiece token)
+ with torch.no_grad():
+     outputs = model(**inputs)
+     predictions = torch.argmax(outputs.logits, dim=-1)
+
+ # Decode predictions
+ tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
+ labels = [model.config.id2label[p.item()] for p in predictions[0]]
+
+ # Print every token that is tagged as part of an entity
+ for token, label in zip(tokens, labels):
+     if label != "O":
+         print(f"{token}: {label}")
+ ```
+
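The token-level loop above prints WordPiece pieces, so a word that the tokenizer splits shows up as several `##`-prefixed fragments. The sketch below shows one way to merge those fragments back into words, keeping the label of the first piece. The helper name `merge_wordpieces` is hypothetical, and the tokens and labels are hand-written for illustration rather than actual model output.

```python
from typing import List, Tuple

SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def merge_wordpieces(tokens: List[str], labels: List[str]) -> Tuple[List[str], List[str]]:
    """Merge '##' continuation pieces into whole words, keeping the first piece's label."""
    words: List[str] = []
    word_labels: List[str] = []
    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS:
            continue  # special tokens carry no text
        if token.startswith("##") and words:
            words[-1] += token[2:]  # glue the continuation onto the previous word
        else:
            words.append(token)
            word_labels.append(label)
    return words, word_labels


# Hand-written example; the real split depends on the tokenizer vocabulary
tokens = ["[CLS]", "Patient", "takes", "met", "##form", "##in", ".", "[SEP]"]
labels = ["O", "O", "O", "B-Chemical", "I-Chemical", "I-Chemical", "O", "O"]
words, word_labels = merge_wordpieces(tokens, labels)
print(list(zip(words, word_labels)))
```

The pipeline's `aggregation_strategy="simple"` in Basic Usage already performs a similar grouping, so this helper is only needed when working with raw logits.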
+ ## Label Schema
+
+ The model uses the IOB2 tagging scheme, in which every entity starts with a `B-` tag:
+
+ | Label | Description |
+ |-------|-------------|
+ | `O` | Outside any entity |
+ | `B-Chemical` | Beginning of a chemical/drug entity |
+ | `I-Chemical` | Inside a chemical/drug entity (continuation) |
+ | `B-Disease` | Beginning of a disease entity |
+ | `I-Disease` | Inside a disease entity (continuation) |
+
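To make the scheme concrete, the following sketch groups per-word IOB2 labels into entity spans. It works on plain Python lists rather than model output; the function name `iob2_to_spans` is a hypothetical helper, and the example labels are hand-written.

```python
from typing import List, Optional, Tuple


def iob2_to_spans(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group IOB2-labelled tokens into (entity_text, entity_type) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one first.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-") and current_type == label[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag with no matching open entity, closes the span.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


tokens = ["Patient", "has", "type", "2", "diabetes", "and", "takes", "aspirin", "."]
labels = ["O", "O", "B-Disease", "I-Disease", "I-Disease", "O", "O", "B-Chemical", "O"]
spans = iob2_to_spans(tokens, labels)
print(spans)  # [('type 2 diabetes', 'Disease'), ('aspirin', 'Chemical')]
```

This sketch drops an `I-` tag that has no preceding `B-` tag of the same type. That is one possible choice; a more lenient decoder could treat such a tag as the start of a new entity.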
+ ## Training Details
+
+ ### Training Hyperparameters
+
+ - **Base model**: dmis-lab/biobert-v1.1
+ - **Training regime**: Fine-tuning
+ - **Optimizer**: AdamW
+ - **Learning rate**: 5e-5
+ - **Batch size**: 16 (per device)
+ - **Number of epochs**: 3
+ - **Weight decay**: 0.01
+ - **Learning rate scheduler**: Linear warmup
+ - **Mixed precision**: FP16
+
+ ### Training Environment
+
+ - **Framework**: PyTorch with Transformers library
+ - **Hardware**: NVIDIA T4 GPU (Google Colab)
+ - **Training time**: ~30 minutes
+
+ ### Data Preprocessing
+
+ 1. Tokenization using BioBERT WordPiece tokenizer
+ 2. Maximum sequence length: 128 tokens
+ 3. Label alignment for subword tokens
+ 4. Special tokens: [CLS], [SEP]
+
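Step 3 (label alignment) is needed because WordPiece can split one word into several subword tokens, while the dataset provides one label per word. The sketch below shows one common convention, which is an assumption here and not confirmed by the training script: the first subword keeps the word's label, and the remaining subwords and special tokens get `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. The `word_ids` list mimics what a fast tokenizer's `word_ids()` method returns; its values, the subword split, and the label IDs are hand-written for illustration.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Expand word-level label IDs to token level, labelling only first subwords."""
    aligned: List[int] = []
    previous: Optional[int] = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # special token such as [CLS] or [SEP]
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)  # continuation subword
        previous = word_id
    return aligned


# Hypothetical split of "takes metformin": [CLS] takes met ##form ##in [SEP]
word_ids = [None, 0, 1, 1, 1, None]
word_labels = [0, 1]  # illustrative IDs: 0 = O, 1 = B-Chemical
aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 0, 1, -100, -100, -100]
```

Masking continuation subwords keeps each word from being counted several times in the loss. Propagating the word's label to every subword is an equally plausible alternative.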
+ ## Limitations and Bias
+
+ ### Limitations
+
+ - **Domain-specific**: Trained on biomedical literature; may not perform well on clinical notes or patient records
+ - **Entity types**: Only detects chemicals and diseases; does not identify procedures, anatomical terms, or symptoms
+ - **Language**: English only
+ - **Abbreviations**: May struggle with uncommon medical abbreviations
+ - **Context**: Does not disambiguate terms (e.g., "cold" as temperature vs. illness)
+
+ ### Potential Biases
+
+ - Training data (BC5CDR) comes from scientific publications, whose terminology may differ from that of patient-facing materials
+ - The training data contains more chemical entities than disease entities, which may skew predictions toward the Chemical class
+ - Newer medical terminology that does not appear in the training corpus may not be recognized
+
+ ## Ethical Considerations
+
+ - **Not for medical diagnosis**: This model is for educational/assistive purposes only
+ - **Human oversight required**: Always verify medical information with qualified healthcare professionals
+ - **Privacy**: Do not input personally identifiable information (PII) or protected health information (PHI)
+
+ ## Citation
+
+ If you use this model, please cite:
+ ```bibtex
+ @misc{{repo_id.replace('/', '-')},
+   author = {{YOUR_NAME}},
+   title = {Medical Named Entity Recognition with BioBERT},
+   year = {2024},
+   publisher = {HuggingFace},
+   url = {https://huggingface.co/{repo_id}}
+ }
+ ```
+
+ Also cite the original BC5CDR dataset:
+ ```bibtex
+ @article{wei2016assessing,
+   title={Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task},
+   author={Wei, Chih-Hsuan and Peng, Yifan and Leaman, Robert and Davis, Allan Peter and Mattingly, Carolyn J and Li, Jiao and Wiegers, Thomas C and Lu, Zhiyong},
+   journal={Database},
+   volume={2016},
+   year={2016},
+   publisher={Oxford Academic}
+ }
+ ```
+
+ And the BioBERT model:
+ ```bibtex
+ @article{lee2020biobert,
+   title={BioBERT: a pre-trained biomedical language representation model for biomedical text mining},
+   author={Lee, Jinhyuk and Yoon, Wonjin and Kim, Sungdong and Kim, Donghyeon and Kim, Sunkyu and So, Chan Ho and Kang, Jaewoo},
+   journal={Bioinformatics},
+   volume={36},
+   number={4},
+   pages={1234--1240},
+   year={2020},
+   publisher={Oxford University Press}
+ }
+ ```
+
+ ## Contact
+
+ - **Author**: {YOUR_NAME}
+ - **Email**: {YOUR_EMAIL}
+ - **GitHub**: [Your GitHub Profile](https://github.com/your-username)
+ - **Project Repository**: [Link to your project repo]
+
+ ## Acknowledgments
+
+ - Base model: [dmis-lab/biobert-v1.1](https://huggingface.co/dmis-lab/biobert-v1.1)
+ - Dataset: [BC5CDR](https://biocreative.bioinformatics.udel.edu/)
+ - Built with [HuggingFace Transformers](https://huggingface.co/transformers/)
+
+ ## License
+
+ This model is released under the MIT License. See [LICENSE](LICENSE) for details.
+
+ ---
+
+ *Model card last updated: {__import__('datetime').datetime.now().strftime('%Y-%m-%d')}*
+ """
+
+ from huggingface_hub import HfApi
+
+ # Save to file (model_card and repo_id are assumed to be defined earlier in the script)
+ model_path = "./biobert-ner-final"
+ readme_path = f"{model_path}/README.md"
+
+ with open(readme_path, "w", encoding="utf-8") as f:
+     f.write(model_card)
+
+ print("✓ Model card created!")
+ print(f"Saved to: {readme_path}")
+
+ # Upload to HuggingFace
+ api = HfApi()
+ api.upload_file(
+     path_or_fileobj=readme_path,
+     path_in_repo="README.md",
+     repo_id=repo_id,
+     repo_type="model",
+ )
+
+ print(f"✓ Uploaded to: https://huggingface.co/{repo_id}")