edloginovad committed on
Commit c8d953c · verified · 1 Parent(s): a4d261c

Model save

Files changed (1):
  1. README.md +34 -224

README.md CHANGED
@@ -1,246 +1,56 @@
  ---
  license: other
  base_model: DedalusHealthCare/tinybert-mlm-de
- datasets:
- - DedalusHealthCare/ner_demo_de
- task_categories:
- - token-classification
- task_ids:
- - named-entity-recognition
- language:
- - de
  tags:
- - token-classification
- - ner
- - named-entity-recognition
- - de
- - disorder_finding
- library_name: transformers
- pipeline_tag: token-classification
  ---

- # TinyBERT for Demo NER (German)
-
- ## Model Description
-
- This model is a fine-tuned TinyBERT model for Named Entity Recognition (NER) of DISORDER_FINDING entities in German medical texts.
-
- It was fine-tuned from the [DedalusHealthCare/tinybert-mlm-de](https://huggingface.co/DedalusHealthCare/tinybert-mlm-de) masked language model using the [DedalusHealthCare/ner_demo_de](https://huggingface.co/datasets/DedalusHealthCare/ner_demo_de) dataset.
-
- **Base Model**: [DedalusHealthCare/tinybert-mlm-de](https://huggingface.co/DedalusHealthCare/tinybert-mlm-de)
-
- **Training Dataset**: [DedalusHealthCare/ner_demo_de](https://huggingface.co/datasets/DedalusHealthCare/ner_demo_de)
-
- **Task**: Token Classification (Named Entity Recognition)
-
- **Language**: German (de)
-
- **Entities**: DISORDER_FINDING
-
- **Model Format**: PYTORCH+ONNX
-
- **Please use `max` as the aggregation strategy in the NER pipeline (see the example below).**
-
- ## Training Details
-
- - **Training epochs**: 1
- - **Learning rate**: N/A
- - **Training batch size**: 32
- - **Evaluation batch size**: 32
- - **Max sequence length**: 256
- - **Warmup steps**: N/A
- - **FP16**: False
- - **Gradient accumulation steps**: 2
- - **Evaluation accumulation steps**: 2
- - **Save steps**: 15000
- - **Evaluation steps**: 10000
- - **Evaluation strategy**: steps
- - **Random seed**: 33
- - **Label all tokens**: True
- - **Balanced training**: False
- - **Chunk mode**: sliding_window
- - **Stride**: 16
- - **Max training samples**: None
- - **Max evaluation samples**: 10000
- - **Early stopping patience**: 0
- - **Early stopping threshold**: 0.0
-
- ## Use Case Configuration
-
- - **Use case name**: demo
- - **Language**: German (de)
- - **Target entities**: DISORDER_FINDING
- - **Text processing max length**: N/A
- - **Entity labeling scheme**: N/A
-
- ## Usage
-
- ### Using the Transformers Pipeline
-
- ```python
- from transformers import pipeline
-
- # Load the model
- ner_pipeline = pipeline(
-     "ner",
-     model="DedalusHealthCare/tinybert-demo-de",
-     tokenizer="DedalusHealthCare/tinybert-demo-de",
-     aggregation_strategy="max",
- )
-
- # Example text
- text = "Der Patient hat Diabetes und Bluthochdruck."
-
- # Get predictions
- entities = ner_pipeline(text)
- print(entities)
- ```
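The `aggregation_strategy="max"` the card insists on can be illustrated conceptually: word pieces belonging to one word are grouped, and the word takes the label of its highest-scoring piece. A minimal pure-Python sketch with hypothetical tokens and scores (not the pipeline's actual implementation):

```python
# Hypothetical subword tokens with per-label scores; "max" aggregation
# picks, for each word, the label of the highest-scoring piece.
subwords = [
    ("Blut",    {"O": 0.10, "DISORDER_FINDING": 0.90}),
    ("##hoch",  {"O": 0.60, "DISORDER_FINDING": 0.40}),
    ("##druck", {"O": 0.20, "DISORDER_FINDING": 0.80}),
]

def aggregate_max(pieces):
    """Join the pieces into one word and return the single best
    (label, score) pair over all pieces."""
    word = "".join(p[0].removeprefix("##") for p in pieces)
    label, score = max(
        ((lbl, s) for _, scores in pieces for lbl, s in scores.items()),
        key=lambda ls: ls[1],
    )
    return word, label, score

word, label, score = aggregate_max(subwords)
print(word, label, score)  # Bluthochdruck DISORDER_FINDING 0.9
```

With `"simple"` or token-level output, the same word could be split across conflicting labels; `"max"` avoids that by deciding once per word.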
-
- ### Using AutoModel and AutoTokenizer
-
- ```python
- from transformers import AutoTokenizer, AutoModelForTokenClassification
- import torch
-
- # Load model and tokenizer
- model_name = "DedalusHealthCare/tinybert-demo-de"
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- model = AutoModelForTokenClassification.from_pretrained(model_name)
-
- # Tokenize text
- text = "Der Patient hat Diabetes und Bluthochdruck."
- tokens = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
-
- # Get predictions
- with torch.no_grad():
-     outputs = model(**tokens)
-     predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
-
- # Get labels
- predicted_token_class_ids = predictions.argmax(-1)
- labels = [model.config.id2label[id.item()] for id in predicted_token_class_ids[0]]
- ```
-
- ### Using ONNX Runtime (Optimized Inference)
-
- ```python
- from optimum.onnxruntime import ORTModelForTokenClassification
- from transformers import AutoTokenizer, pipeline
- import torch
-
- # Load the ONNX model for faster inference
- model_name = "DedalusHealthCare/tinybert-demo-de"
- onnx_model = ORTModelForTokenClassification.from_pretrained(model_name)
- tokenizer = AutoTokenizer.from_pretrained(model_name)
-
- # Create a pipeline with the ONNX model (recommended)
- ner_pipeline = pipeline(
-     "ner",
-     model=onnx_model,
-     tokenizer=tokenizer,
-     aggregation_strategy="max",
- )
-
- # Example text
- text = "Der Patient hat Diabetes und Bluthochdruck."
- entities = ner_pipeline(text)
- print(entities)
-
- # Direct model usage
- inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
- with torch.no_grad():
-     outputs = onnx_model(**inputs)
-     predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
-
- predicted_token_class_ids = predictions.argmax(-1)
- token_labels = [onnx_model.config.id2label[id.item()] for id in predicted_token_class_ids[0]]
- ```
-
- ### Performance Comparison
-
- - **PyTorch**: Standard format, suitable for training and research
- - **ONNX**: Optimized for inference, typically 2-4x faster than PyTorch
- - **Recommendation**: Use ONNX for production inference and PyTorch for research
-
- ## Model Architecture
-
- This model is based on the TinyBERT architecture with a token classification head for Named Entity Recognition.
-
- ## Intended Use
-
- This model is intended for:
- - Named Entity Recognition in German medical texts
- - Identification of DISORDER_FINDING entities
- - Medical text processing and analysis
- - Research and development in medical NLP
-
- ## Limitations
-
- - Trained specifically on German medical texts
- - Performance may vary on texts from different medical domains
- - May not generalize well to non-medical texts
- - Requires careful evaluation on new datasets
-
- ## Ethical Considerations
-
- - This model is trained on medical data and should be used responsibly
- - Outputs should be validated by medical professionals
- - Patient privacy and data protection regulations must be followed
- - The model may reflect biases present in the training data
-
- ## Model Performance
-
- This model has been evaluated on the **goldset from ner_disorderfinding_de_goldset** using IO evaluation (sklearn, token level, lenient), with the following results:
-
- ### Overall Performance
-
- | Metric | Score |
- |--------|-------|
- | Precision (Macro) | 0.425082 |
- | Recall (Macro) | 0.467785 |
- | F1-Score (Macro) | 0.435900 |
- | Precision (Weighted) | 0.600185 |
- | Recall (Weighted) | 0.698514 |
- | F1-Score (Weighted) | 0.640943 |
-
- **Inference Performance**: 5.65 seconds for the evaluation dataset
-
- ### Entity-Level Performance (IO Evaluation)
-
- | Entity Type | Precision | Recall | F1-Score | Support |
- |-------------|-----------|--------|----------|---------|
- | DISORDER_FINDING | 0.753771 | 0.900890 | 0.820790 | N/A |
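The entity-level F1 in the table can be sanity-checked from the reported precision and recall, since F1 is their harmonic mean:

```python
# Recompute DISORDER_FINDING F1 from the card's precision and recall:
# F1 = 2 * P * R / (P + R). The inputs are rounded to 6 decimals, so
# the result agrees with the reported 0.820790 only to ~5 decimals.
precision, recall = 0.753771, 0.900890
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 5))  # 0.82079
```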
-
- ### Evaluation Details
-
- - **Dataset**: goldset from ner_disorderfinding_de_goldset
- - **Dataset Source**: goldset
- - **Evaluation Date**: 2025-09-25 09:38:17
- - **Language**: de
- - **Entities**: DISORDER_FINDING
-
- *This evaluation section is automatically generated and updated.*
-
- ## Citation
-
- If you use this model, please cite:
-
- ```bibtex
- @misc{demo_de_ner_model,
-   title = {TinyBERT for Demo NER (German)},
-   author = {DH Healthcare GmbH},
-   year = {2025},
-   publisher = {Hugging Face},
-   url = {https://huggingface.co/DedalusHealthCare/tinybert-demo-de}
- }
- ```
-
- ## License
-
- This model is proprietary and owned by DH Healthcare GmbH. All rights reserved.
-
- ## Contact
-
- For questions or support, please contact DH Healthcare GmbH.
  ---
+ library_name: transformers
  license: other
  base_model: DedalusHealthCare/tinybert-mlm-de
  tags:
+ - generated_from_trainer
+ model-index:
+ - name: tinybert-demo-de
+   results: []
  ---

+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->

+ # tinybert-demo-de

+ This model is a fine-tuned version of [DedalusHealthCare/tinybert-mlm-de](https://huggingface.co/DedalusHealthCare/tinybert-mlm-de) on the None dataset.

+ ## Model description

+ More information needed

+ ## Intended uses & limitations

+ More information needed

+ ## Training and evaluation data

+ More information needed

+ ## Training procedure

+ ### Training hyperparameters

+ The following hyperparameters were used during training:
+ - learning_rate: 5e-05
+ - train_batch_size: 32
+ - eval_batch_size: 32
+ - seed: 33
+ - gradient_accumulation_steps: 2
+ - total_train_batch_size: 64
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: linear
+ - lr_scheduler_warmup_ratio: 0.1
+ - num_epochs: 1
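The hyperparameters above imply an effective batch size of 64 (32 per device × 2 accumulation steps) and a linear learning-rate schedule with 10% warmup. A small sketch of that schedule (step counts are hypothetical; this mirrors, but does not call, the `transformers` linear scheduler):

```python
# Sketch of the LR schedule implied by the card's hyperparameters:
# linear warmup over the first 10% of steps, then linear decay to 0.
def linear_schedule_lr(step, total_steps, base_lr=5e-05, warmup_ratio=0.1):
    """Learning rate at a given optimizer step under linear warmup/decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Effective batch size: per-device batch * gradient accumulation steps
total_train_batch_size = 32 * 2

if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps
    print(total_train_batch_size)        # 64, matching the card
    print(linear_schedule_lr(0, total))    # start of warmup
    print(linear_schedule_lr(100, total))  # peak learning rate
    print(linear_schedule_lr(total, total))  # decayed to zero
```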
 
+ ### Training results

+ ### Framework versions

+ - Transformers 4.45.1
+ - Pytorch 2.6.0+cu124
+ - Datasets 2.16.0
+ - Tokenizers 0.20.3