edloginovad committed on
Commit b2c1f9f · verified · 1 Parent(s): 5d33b14

Training in progress, step 6

README.md CHANGED
@@ -1,246 +1,70 @@
1
  ---
2
  license: other
3
  base_model: DedalusHealthCare/tinybert-mlm-de
4
- datasets:
5
- - DedalusHealthCare/ner_demo_de
6
- task_categories:
7
- - token-classification
8
- task_ids:
9
- - named-entity-recognition
10
- language:
11
- - de
12
  tags:
13
- - token-classification
14
- - ner
15
- - named-entity-recognition
16
- - de
17
- - disorder_finding
18
- library_name: transformers
19
- pipeline_tag: token-classification
20
  ---
21
 
22
- # TinyBERT for Demo NER (German)
23
-
24
- ## Model Description
25
-
26
- This model is a fine-tuned TinyBERT model for Named Entity Recognition (NER) of DISORDER_FINDING entities in German medical texts.
27
-
28
- It was fine-tuned from the [DedalusHealthCare/tinybert-mlm-de](https://huggingface.co/DedalusHealthCare/tinybert-mlm-de) masked language model using the [DedalusHealthCare/ner_demo_de](https://huggingface.co/datasets/DedalusHealthCare/ner_demo_de) dataset.
29
-
30
- **Base Model**: [DedalusHealthCare/tinybert-mlm-de](https://huggingface.co/DedalusHealthCare/tinybert-mlm-de)
31
-
32
- **Training Dataset**: [DedalusHealthCare/ner_demo_de](https://huggingface.co/datasets/DedalusHealthCare/ner_demo_de)
33
-
34
- **Task**: Token Classification (Named Entity Recognition)
35
-
36
- **Language**: German (de)
37
-
38
- **Entities**: DISORDER_FINDING
39
-
40
- **Model Format**: PYTORCH+ONNX
41
-
42
- **Please use `max` as aggregation strategy in the NER pipeline (see example below)**.
43
-
44
- ## Training Details
45
-
46
- - **Training epochs**: 1
47
- - **Learning rate**: N/A
48
- - **Training batch size**: 32
49
- - **Evaluation batch size**: 32
50
- - **Max sequence length**: 256
51
- - **Warmup steps**: N/A
52
- - **FP16**: False
53
- - **Gradient accumulation steps**: 2
54
- - **Evaluation accumulation steps**: 2
55
- - **Save steps**: 15000
56
- - **Evaluation steps**: 10000
57
- - **Evaluation strategy**: steps
58
- - **Random seed**: 33
59
- - **Label all tokens**: True
60
- - **Balanced training**: False
61
- - **Chunk mode**: sliding_window
62
- - **Stride**: 16
63
- - **Max training samples**: None
64
- - **Max evaluation samples**: 10000
65
- - **Early stopping patience**: 0
66
- - **Early stopping threshold**: 0.0
67
-
68
- ## Use Case Configuration
69
-
70
- - **Use case name**: demo
71
- - **Language**: German (de)
72
- - **Target entities**: DISORDER_FINDING
73
- - **Text processing max length**: N/A
74
- - **Entity labeling scheme**: N/A
75
-
76
- ## Usage
77
-
78
- ### Using Transformers Pipeline
79
-
80
- ```python
81
- from transformers import pipeline
82
-
83
- # Load the model
84
- ner_pipeline = pipeline(
85
- "ner",
86
- model="DedalusHealthCare/tinybert-demo-de",
87
- tokenizer="DedalusHealthCare/tinybert-demo-de",
88
- aggregation_strategy="max"
89
- )
90
-
91
- # Example text
92
- text = "Der Patient hat Diabetes und Bluthochdruck."
93
-
94
- # Get predictions
95
- entities = ner_pipeline(text)
96
- print(entities)
97
- ```
98
-
99
- ### Using AutoModel and AutoTokenizer
100
-
101
- ```python
102
- from transformers import AutoTokenizer, AutoModelForTokenClassification
103
- import torch
104
-
105
- # Load model and tokenizer
106
- model_name = "DedalusHealthCare/tinybert-demo-de"
107
- tokenizer = AutoTokenizer.from_pretrained(model_name)
108
- model = AutoModelForTokenClassification.from_pretrained(model_name)
109
-
110
- # Tokenize text
111
- text = "Der Patient hat Diabetes und Bluthochdruck."
112
- tokens = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
113
-
114
- # Get predictions
115
- with torch.no_grad():
116
-     outputs = model(**tokens)
117
- predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
118
-
119
- # Get labels
120
- predicted_token_class_ids = predictions.argmax(-1)
121
- labels = [model.config.id2label[id.item()] for id in predicted_token_class_ids[0]]
122
- ```
123
-
124
- ### Using ONNX Runtime (Optimized Inference)
125
-
126
- ```python
127
- from optimum.onnxruntime import ORTModelForTokenClassification
128
- from transformers import AutoTokenizer, pipeline
129
- import torch
130
-
131
- # Load ONNX model for faster inference
132
- model_name = "DedalusHealthCare/tinybert-demo-de"
133
- onnx_model = ORTModelForTokenClassification.from_pretrained(model_name)
134
- tokenizer = AutoTokenizer.from_pretrained(model_name)
135
-
136
- # Create pipeline with ONNX model (recommended)
137
- ner_pipeline = pipeline(
138
- "ner",
139
- model=onnx_model,
140
- tokenizer=tokenizer,
141
- aggregation_strategy="max"
142
- )
143
-
144
- # Example text
145
- text = "Der Patient hat Diabetes und Bluthochdruck."
146
- entities = ner_pipeline(text)
147
- print(entities)
148
-
149
- # Direct model usage
150
- inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
151
- with torch.no_grad():
152
-     outputs = onnx_model(**inputs)
153
- predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
154
-
155
- predicted_token_class_ids = predictions.argmax(-1)
156
- token_labels = [onnx_model.config.id2label[id.item()] for id in predicted_token_class_ids[0]]
157
- ```
158
-
159
- ### Performance Comparison
160
-
161
- - **PyTorch**: Standard format, suitable for training and research
162
- - **ONNX**: Optimized for inference, typically 2-4x faster than PyTorch
163
- - **Recommendation**: Use ONNX for production inference, PyTorch for research
164
-
165
- ## Model Architecture
166
-
167
- This model is based on the TinyBERT architecture with a token classification head for Named Entity Recognition.
168
-
169
- ## Intended Use
170
-
171
- This model is intended for:
172
- - Named Entity Recognition in German medical texts
173
- - Identification of DISORDER_FINDING entities
174
- - Medical text processing and analysis
175
- - Research and development in medical NLP
176
-
177
- ## Limitations
178
-
179
- - Trained specifically for German medical texts
180
- - Performance may vary on texts from different medical domains
181
- - May not generalize well to non-medical texts
182
- - Requires careful evaluation on new datasets
183
-
184
- ## Ethical Considerations
185
-
186
- - This model is trained on medical data and should be used responsibly
187
- - Outputs should be validated by medical professionals
188
- - Patient privacy and data protection regulations must be followed
189
- - The model may have biases present in the training data
190
-
191
-
192
- ## Model Performance
193
 
194
- This model has been evaluated on the **goldset from ner_disorderfinding_de_goldset** using
195
- IO evaluation (sklearn, token level, lenient) with the following results:
196
 
197
- ### Overall Performance
198
 
199
- | Metric | Score |
200
- |--------|-------|
201
- | Precision (Macro) | 0.425502 |
202
- | Recall (Macro) | 0.467986 |
203
- | F1-Score (Macro) | 0.436143 |
204
- | Precision (Weighted) | 0.600423 |
205
- | Recall (Weighted) | 0.698688 |
206
- | F1-Score (Weighted) | 0.641115 |
207
 
208
- **Inference Performance**: 8.36 seconds for evaluation dataset
209
 
210
- ### Entity-Level Performance (IO Evaluation)
211
 
212
- | Entity Type | Precision | Recall | F1-Score | Support |
213
- |-------------|-----------|--------|----------|---------|
214
- | DISORDER_FINDING | 0.097155 | 0.034930 | 0.051386 | N/A |
215
 
216
- ### Evaluation Details
217
 
218
- - **Dataset**: goldset from ner_disorderfinding_de_goldset
219
- - **Dataset Source**: goldset
220
- - **Evaluation Date**: 2025-10-08 12:13:12
221
- - **Language**: de
222
- - **Entities**: DISORDER_FINDING
223
 
224
- *This evaluation section is automatically generated and updated.*
225
 
226
- ## Citation
227
 
228
- If you use this model, please cite:
229
 
230
- ```bibtex
231
- @model{demo_de_ner_model,
232
- title = {TinyBERT for Demo NER (German)},
233
- author = {DH Healthcare GmbH},
234
- year = {2025},
235
- publisher = {Hugging Face},
236
- url = {https://huggingface.co/DedalusHealthCare/tinybert-demo-de}
237
- }
238
- ```
239
 
240
- ## License
241
 
242
- This model is proprietary and owned by DH Healthcare GmbH. All rights reserved.
243
 
244
- ## Contact
245
 
246
- For questions or support, please contact DH Healthcare GmbH.
1
  ---
2
+ library_name: transformers
3
+ language:
4
+ - multilingual
5
  license: other
6
  base_model: DedalusHealthCare/tinybert-mlm-de
7
  tags:
8
+ - generated_from_trainer
9
+ datasets:
10
+ - ner_demo_de
11
+ model-index:
12
+ - name: tinybert-demo-de
13
+   results: []
 
14
  ---
15
 
16
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
17
+ should probably proofread and complete it, then remove this comment. -->
18
 
19
+ # tinybert-demo-de
 
20
 
21
+ This model is a fine-tuned version of [DedalusHealthCare/tinybert-mlm-de](https://huggingface.co/DedalusHealthCare/tinybert-mlm-de) on the ner_demo_de dataset.
22
+ It achieves the following results on the evaluation set:
23
+ - Loss: 0.4069
24
+ - Disorder Finding Precision: 0.25
25
+ - Disorder Finding Recall: 0.1818
26
+ - Disorder Finding F1: 0.2105
27
+ - Disorder Finding Number: 11
28
+ - Overall Precision: 0.25
29
+ - Overall Recall: 0.1818
30
+ - Overall F1: 0.2105
31
+ - Overall Accuracy: 0.9286
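
The reported F1 can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. The sketch below does that in plain Python. The raw counts (`gold`, `predicted`, `correct`) are an assumption: they are one set of integers consistent with the rounded values and the support of 11, and the card itself does not state them.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values as reported in the model card.
reported_precision = 0.25
reported_recall = 0.1818
reported_f1 = 0.2105

computed_f1 = f1_score(reported_precision, reported_recall)

# ASSUMPTION: hypothetical raw counts that reproduce the rounded metrics
# (11 gold entities, 8 predicted entities, 2 correct). Not stated in the source.
gold, predicted, correct = 11, 8, 2
count_f1 = f1_score(correct / predicted, correct / gold)

print(round(computed_f1, 4), round(count_f1, 4))
```

Both routes round to the reported 0.2105, so the three entity-level numbers are at least internally consistent. With only 11 gold entities in the evaluation set, a single additional correct prediction would shift recall by about 9 points.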
32
 
33
+ ## Model description
34
 
35
+ More information needed
36
 
37
+ ## Intended uses & limitations
38
 
39
+ More information needed
 
 
40
 
41
+ ## Training and evaluation data
42
 
43
+ More information needed
 
 
 
 
44
 
45
+ ## Training procedure
46
 
47
+ ### Training hyperparameters
48
 
49
+ The following hyperparameters were used during training:
50
+ - learning_rate: 5e-05
51
+ - train_batch_size: 32
52
+ - eval_batch_size: 32
53
+ - seed: 33
54
+ - gradient_accumulation_steps: 2
55
+ - total_train_batch_size: 64
56
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
57
+ - lr_scheduler_type: linear
58
+ - lr_scheduler_warmup_ratio: 0.1
59
+ - num_epochs: 1
60
 
61
+ ### Training results
62
 
 
63
 
 
64
 
65
+ ### Framework versions
66
 
67
+ - Transformers 4.45.1
68
+ - Pytorch 2.6.0+cu124
69
+ - Datasets 2.16.0
70
+ - Tokenizers 0.20.3
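
As a reading aid for the hyperparameters listed in the new README, the following sketch shows how `total_train_batch_size` follows from `train_batch_size` and `gradient_accumulation_steps`, and what shape a linear schedule with a 0.1 warmup ratio has. It is a plain-Python approximation, not the Trainer's own code. The single-device setting and the `total_steps=100` value are assumptions chosen for illustration; the commit does not state the number of optimizer steps.

```python
import math


def effective_batch_size(per_device: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Samples consumed per optimizer step (num_devices=1 is an assumption)."""
    return per_device * accumulation_steps * num_devices


def linear_warmup_lr(step: int, total_steps: int, base_lr: float = 5e-05,
                     warmup_ratio: float = 0.1) -> float:
    """Linear ramp from 0 to base_lr during warmup, then linear decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# 32 samples per device, 2 accumulation steps, one device (assumed).
print(effective_batch_size(32, 2))

# ASSUMPTION: total_steps=100 is a hypothetical value for illustration only.
for step in (0, 5, 10, 55, 100):
    print(step, linear_warmup_lr(step, total_steps=100))
```

With these inputs the effective batch size matches the 64 reported as `total_train_batch_size`, which is consistent with training on a single device.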
config.json CHANGED
@@ -1,5 +1,5 @@
1
  {
2
- "_name_or_path": "/workspaces/prod/nlp/nlp-tools/data/ner_demo_de/models/tinybert-clinalytix",
3
  "architectures": [
4
  "BertForTokenClassification"
5
  ],
@@ -27,6 +27,7 @@
27
  "pad_token_id": 0,
28
  "position_embedding_type": "absolute",
29
  "pre_trained": "",
 
30
  "training": "",
31
  "transformers_version": "4.45.1",
32
  "type_vocab_size": 2,
 
1
  {
2
+ "_name_or_path": "DedalusHealthCare/tinybert-mlm-de",
3
  "architectures": [
4
  "BertForTokenClassification"
5
  ],
 
27
  "pad_token_id": 0,
28
  "position_embedding_type": "absolute",
29
  "pre_trained": "",
30
+ "torch_dtype": "float32",
31
  "training": "",
32
  "transformers_version": "4.45.1",
33
  "type_vocab_size": 2,
runs/Oct08_14-03-37_ip-172-31-12-22/events.out.tfevents.1759932228.ip-172-31-12-22.98670.0 ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e1576b651b19110a49916b803c87961a384c6d76cfd3ab683a02aa519256a0ba
3
+ size 5889
training_args.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:93f3f52af7c94db82db05a4e6476f75484cf18bcb08979ee8e02724dfe60a95d
3
  size 5368
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:dd10a2402b4fe87094e78084162836edea483ed5d0b1af655837b18c8310db9a
3
  size 5368
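
The `events.out.tfevents…` and `training_args.bin` entries in this commit are Git LFS pointer files rather than the binaries themselves: three `key value` lines giving the spec version, the content hash, and the byte size. A minimal sketch of reading such a pointer is shown below. The function name `parse_lfs_pointer` is hypothetical, and the parser handles only the three keys that appear in this diff.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into version, hash algorithm, digest, size."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>", separated by a single space.
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer content copied from the tfevents file added in this commit.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:e1576b651b19110a49916b803c87961a384c6d76cfd3ab683a02aa519256a0ba
size 5889
"""

info = parse_lfs_pointer(pointer)
print(info["hash_algorithm"], info["size"])
```

For `training_args.bin` the size stays at 5368 bytes while the `oid` changes, which indicates that the file content was rewritten without changing its length.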