Training in progress, step 6

Files changed (4)

README.md +48 -224
config.json +2 -1
runs/Oct08_14-03-37_ip-172-31-12-22/events.out.tfevents.1759932228.ip-172-31-12-22.98670.0 +3 -0
training_args.bin +1 -1

README.md CHANGED

@@ -1,246 +1,70 @@
 ---
 license: other
 base_model: DedalusHealthCare/tinybert-mlm-de
-datasets:
-- DedalusHealthCare/ner_demo_de
-task_categories:
-- token-classification
-task_ids:
-- named-entity-recognition
-language:
-- de
 tags:
-- token-classification
-- ner
-- named-entity-recognition
-- de
-- disorder_finding
-library_name: transformers
-pipeline_tag: token-classification
 ---
-# TinyBERT for Demo NER (German)
-## Model Description
-This model is a fine-tuned TinyBERT model for Named Entity Recognition (NER) of DISORDER_FINDING entities in German medical texts.
-It was fine-tuned from the [DedalusHealthCare/tinybert-mlm-de](https://huggingface.co/DedalusHealthCare/tinybert-mlm-de) masked language model using the [DedalusHealthCare/ner_demo_de](https://huggingface.co/datasets/DedalusHealthCare/ner_demo_de) dataset.
-**Base Model**: [DedalusHealthCare/tinybert-mlm-de](https://huggingface.co/DedalusHealthCare/tinybert-mlm-de)
-**Training Dataset**: [DedalusHealthCare/ner_demo_de](https://huggingface.co/datasets/DedalusHealthCare/ner_demo_de)
-**Task**: Token Classification (Named Entity Recognition)
-**Language**: German (de)
-**Entities**: DISORDER_FINDING
-**Model Format**: PYTORCH+ONNX
-**Please use `max` as aggregation strategy in the NER pipeline (see example below)**.
-## Training Details
-- **Training epochs**: 1
-- **Learning rate**: N/A
-- **Training batch size**: 32
-- **Evaluation batch size**: 32
-- **Max sequence length**: 256
-- **Warmup steps**: N/A
-- **FP16**: False
-- **Gradient accumulation steps**: 2
-- **Evaluation accumulation steps**: 2
-- **Save steps**: 15000
-- **Evaluation steps**: 10000
-- **Evaluation strategy**: steps
-- **Random seed**: 33
-- **Label all tokens**: True
-- **Balanced training**: False
-- **Chunk mode**: sliding_window
-- **Stride**: 16
-- **Max training samples**: None
-- **Max evaluation samples**: 10000
-- **Early stopping patience**: 0
-- **Early stopping threshold**: 0.0
-## Use Case Configuration
-- **Use case name**: demo
-- **Language**: German (de)
-- **Target entities**: DISORDER_FINDING
-- **Text processing max length**: N/A
-- **Entity labeling scheme**: N/A
-## Usage
-### Using Transformers Pipeline
-```python
-from transformers import pipeline
-# Load the model
-ner_pipeline = pipeline(
-    "ner",
-    model="DedalusHealthCare/tinybert-demo-de",
-    tokenizer="DedalusHealthCare/tinybert-demo-de",
-    aggregation_strategy="max"
-)
-# Example text
-text = "Der Patient hat Diabetes und Bluthochdruck."
-# Get predictions
-entities = ner_pipeline(text)
-print(entities)
-```
-### Using AutoModel and AutoTokenizer
-```python
-from transformers import AutoTokenizer, AutoModelForTokenClassification
-import torch
-# Load model and tokenizer
-model_name = "DedalusHealthCare/tinybert-demo-de"
-tokenizer = AutoTokenizer.from_pretrained(model_name)
-model = AutoModelForTokenClassification.from_pretrained(model_name)
-# Tokenize text
-text = "Der Patient hat Diabetes und Bluthochdruck."
-tokens = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
-# Get predictions
-with torch.no_grad():
-    outputs = model(**tokens)
-    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
-# Get labels
-predicted_token_class_ids = predictions.argmax(-1)
-labels = [model.config.id2label[id.item()] for id in predicted_token_class_ids[0]]
-```
-### Using ONNX Runtime (Optimized Inference)
-```python
-from optimum.onnxruntime import ORTModelForTokenClassification
-from transformers import AutoTokenizer, pipeline
-import torch
-# Load ONNX model for faster inference
-model_name = "DedalusHealthCare/tinybert-demo-de"
-onnx_model = ORTModelForTokenClassification.from_pretrained(model_name)
-tokenizer = AutoTokenizer.from_pretrained(model_name)
-# Create pipeline with ONNX model (recommended)
-ner_pipeline = pipeline(
-    "ner",
-    model=onnx_model,
-    tokenizer=tokenizer,
-    aggregation_strategy="max"
-)
-# Example text
-text = "Der Patient hat Diabetes und Bluthochdruck."
-entities = ner_pipeline(text)
-print(entities)
-# Direct model usage
-inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
-with torch.no_grad():
-    outputs = onnx_model(**inputs)
-    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
-predicted_token_class_ids = predictions.argmax(-1)
-token_labels = [onnx_model.config.id2label[id.item()] for id in predicted_token_class_ids[0]]
-```
-### Performance Comparison
-- **PyTorch**: Standard format, suitable for training and research
-- **ONNX**: Optimized for inference, typically 2-4x faster than PyTorch
-- **Recommendation**: Use ONNX for production inference, PyTorch for research
-## Model Architecture
-This model is based on the TinyBERT architecture with a token classification head for Named Entity Recognition.
-## Intended Use
-This model is intended for:
-- Named Entity Recognition in German medical texts
-- Identification of DISORDER_FINDING entities
-- Medical text processing and analysis
-- Research and development in medical NLP
-## Limitations
-- Trained specifically for German medical texts
-- Performance may vary on texts from different medical domains
-- May not generalize well to non-medical texts
-- Requires careful evaluation on new datasets
-## Ethical Considerations
-- This model is trained on medical data and should be used responsibly
-- Outputs should be validated by medical professionals
-- Patient privacy and data protection regulations must be followed
-- The model may have biases present in the training data
-## Model Performance
-This model has been evaluated on the **goldset from ner_disorderfinding_de_goldset** using
-IO evaluation (sklearn, token level, lenient) with the following results:
-### Overall Performance
-| Metric | Score |
-|--------|-------|
-| Precision (Macro) | 0.425502 |
-| Recall (Macro) | 0.467986 |
-| F1-Score (Macro) | 0.436143 |
-| Precision (Weighted) | 0.600423 |
-| Recall (Weighted) | 0.698688 |
-| F1-Score (Weighted) | 0.641115 |
-**Inference Performance**: 8.36 seconds for evaluation dataset
-### Entity-Level Performance (IO Evaluation)
-| Entity Type | Precision | Recall | F1-Score | Support |
-|-------------|-----------|--------|----------|---------|
-| DISORDER_FINDING | 0.097155 | 0.034930 | 0.051386 | N/A |
-### Evaluation Details
-- **Dataset**: goldset from ner_disorderfinding_de_goldset
-- **Dataset Source**: goldset
-- **Evaluation Date**: 2025-10-08 12:13:12
-- **Language**: de
-- **Entities**: DISORDER_FINDING
-*This evaluation section is automatically generated and updated.*
-## Citation
-If you use this model, please cite:
-```bibtex
-@model{demo_de_ner_model,
-  title = {TinyBERT for Demo NER (German)},
-  author = {DH Healthcare GmbH},
-  year = {2025},
-  publisher = {Hugging Face},
-  url = {https://huggingface.co/DedalusHealthCare/tinybert-demo-de}
-}
-```
-## License
-This model is proprietary and owned by DH Healthcare GmbH. All rights reserved.
-## Contact
-For questions or support, please contact DH Healthcare GmbH.

 ---
+library_name: transformers
+language:
+- multilingual
 license: other
 base_model: DedalusHealthCare/tinybert-mlm-de
 tags:
+- generated_from_trainer
+datasets:
+- ner_demo_de
+model-index:
+- name: tinybert-demo-de
+  results: []
 ---
+<!-- This model card has been generated automatically according to the information the Trainer had access to. You
+should probably proofread and complete it, then remove this comment. -->
+# tinybert-demo-de
+This model is a fine-tuned version of [DedalusHealthCare/tinybert-mlm-de](https://huggingface.co/DedalusHealthCare/tinybert-mlm-de) on the ner_demo_de dataset.
+It achieves the following results on the evaluation set:
+- Loss: 0.4069
+- Disorder Finding Precision: 0.25
+- Disorder Finding Recall: 0.1818
+- Disorder Finding F1: 0.2105
+- Disorder Finding Number: 11
+- Overall Precision: 0.25
+- Overall Recall: 0.1818
+- Overall F1: 0.2105
+- Overall Accuracy: 0.9286
+## Model description
+More information needed
+## Intended uses & limitations
+More information needed
+## Training and evaluation data
+More information needed
+## Training procedure
+### Training hyperparameters
+The following hyperparameters were used during training:
+- learning_rate: 5e-05
+- train_batch_size: 32
+- eval_batch_size: 32
+- seed: 33
+- gradient_accumulation_steps: 2
+- total_train_batch_size: 64
+- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+- lr_scheduler_type: linear
+- lr_scheduler_warmup_ratio: 0.1
+- num_epochs: 1
+### Training results
+### Framework versions
+- Transformers 4.45.1
+- Pytorch 2.6.0+cu124
+- Datasets 2.16.0
+- Tokenizers 0.20.3
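
Reviewer note (not part of the commit): the evaluation figures and hyperparameters in the added README lines can be cross-checked with a few lines of arithmetic. The sketch below recomputes F1 as the harmonic mean of the reported precision and recall, and the effective batch size from `train_batch_size` and `gradient_accumulation_steps`. The single-device assumption and the inferred entity counts are assumptions made for illustration, not something the commit states.

```python
# Values copied from the added README lines in this commit.
precision = 0.25
recall = 0.1818
support = 11  # "Disorder Finding Number"

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Effective batch size per optimizer step.
# Assumption: training ran on a single device.
train_batch_size = 32
gradient_accumulation_steps = 2
total_train_batch_size = train_batch_size * gradient_accumulation_steps

# Assumption: the metrics are entity-level counts, so the rounded
# precision and recall can be mapped back to whole entities.
true_positives = round(recall * support)
predicted_entities = round(true_positives / precision)

print(f"F1 = {f1:.4f}")
print(f"total_train_batch_size = {total_train_batch_size}")
print(f"inferred: {true_positives} correct out of {predicted_entities} predicted")
```

Both recomputed values agree with the README (`Overall F1: 0.2105`, `total_train_batch_size: 64`). With a support of only 11 entities, a single additional correct prediction would change recall by about 9 points, so these scores are best read as a smoke test rather than a benchmark.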

config.json CHANGED

@@ -1,5 +1,5 @@
 {
-  "_name_or_path": "/workspaces/prod/nlp/nlp-tools/data/ner_demo_de/models/tinybert-clinalytix",
   "architectures": [
     "BertForTokenClassification"
   ],
@@ -27,6 +27,7 @@
   "pad_token_id": 0,
   "position_embedding_type": "absolute",
   "pre_trained": "",
   "training": "",
   "transformers_version": "4.45.1",
   "type_vocab_size": 2,

 {
+  "_name_or_path": "DedalusHealthCare/tinybert-mlm-de",
   "architectures": [
     "BertForTokenClassification"
   ],
   "pad_token_id": 0,
   "position_embedding_type": "absolute",
   "pre_trained": "",
+  "torch_dtype": "float32",
   "training": "",
   "transformers_version": "4.45.1",
   "type_vocab_size": 2,

runs/Oct08_14-03-37_ip-172-31-12-22/events.out.tfevents.1759932228.ip-172-31-12-22.98670.0 ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e1576b651b19110a49916b803c87961a384c6d76cfd3ab683a02aa519256a0ba
+size 5889

training_args.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:93f3f52af7c94db82db05a4e6476f75484cf18bcb08979ee8e02724dfe60a95d
 size 5368

 version https://git-lfs.github.com/spec/v1
+oid sha256:dd10a2402b4fe87094e78084162836edea483ed5d0b1af655837b18c8310db9a
 size 5368
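
Reviewer note (not part of the commit): the two binary files above appear only as Git LFS pointer files, which are short text stubs with `version`, `oid`, and `size` fields. For `training_args.bin` the size stays at 5368 bytes while the `oid` changes, so the content changed without changing length. A minimal sketch of reading such a pointer follows; the helper name `parse_lfs_pointer` is hypothetical and used only for illustration.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its fields (hypothetical helper)."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line has the form "<key> <value>".
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# New pointer for training_args.bin, copied from this commit.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:dd10a2402b4fe87094e78084162836edea483ed5d0b1af655837b18c8310db9a
size 5368
"""

info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size"])
```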