---
license: mit
language:
- de
base_model:
- GeistBERT/GeistBERT_base
tags:
- health
- biomedical
- medical
---

# ChristBERT

ChristBERT (Clinical- and Healthcare-Related Issues and Subjects Tuned BERT) is a family of domain-adapted German biomedical RoBERTa models. It was developed to address the lack of high-quality German-language models for clinical and healthcare NLP tasks.

## Model Variants

- [`ChristBERT`](https://huggingface.co/ChristBERT/ChristBERT_base): Continued pretraining from GeistBERT on biomedical data.
- [`ChristBERT_scratch`](https://huggingface.co/ChristBERT/ChristBERT_scratch_base): Trained from scratch on biomedical data using GeistBERT's vocabulary.
- [`ChristBERT_BPE`](https://huggingface.co/ChristBERT/ChristBERT_bpe_base): Trained from scratch on biomedical data using a new byte-level BPE vocabulary (52k tokens) trained on the biomedical domain.

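All three variants share the RoBERTa base architecture and load interchangeably via `transformers`; a minimal sketch, using the repo ids listed above:

```python
from transformers import AutoModel, AutoTokenizer

# Any variant can be swapped in by repo id
for repo in (
    "ChristBERT/ChristBERT_base",
    "ChristBERT/ChristBERT_scratch_base",
    "ChristBERT/ChristBERT_bpe_base",
):
    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModel.from_pretrained(repo)
    print(repo, model.config.vocab_size)
```
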
## Model Architecture

All ChristBERT variants are based on the **RoBERTa base** architecture:

- 12 transformer layers
- Hidden size of 768
- 12 attention heads
- ~125M parameters
- Sequence length: 512 tokens

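For reference, these hyperparameters correspond roughly to the following `transformers` configuration; this is a sketch only, since the authoritative values live in each checkpoint's `config.json`:

```python
from transformers import RobertaConfig

# Approximate ChristBERT architecture; read exact values from the
# released config.json rather than relying on this sketch
config = RobertaConfig(
    vocab_size=52_000,            # byte-level BPE vocabulary
    num_hidden_layers=12,         # transformer layers
    hidden_size=768,
    num_attention_heads=12,
    max_position_embeddings=514,  # RoBERTa convention: 512 tokens + 2 offsets
)
```
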
## Pretraining Data

ChristBERT was trained on a 13.5 GB biomedical corpus consisting of:

- Hpsmedia medical journals
- Springer Nature biomedical publications
- PubMed Central abstracts and full texts
- German medical PhD theses
- German medical Wikipedia
- German-translated MIMIC-IV notes (using LLaMA 3.1 8B)
- Crawled German health web content (filtered via a fine-tuned classifier)

The corpus breaks down as follows:

| Dataset            | Documents | Words  | Size (MB) |
|--------------------|-----------|--------|-----------|
| Hpsmedia           | 277,357   | 405M   | 3,117     |
| Springer Nature    | 258,000   | 259M   | 1,984     |
| PubMed Central     | 90,273    | 220M   | 1,609     |
| PhD Theses         | 7,486     | 90M    | 646       |
| Medical Wikipedia  | 75,585    | 49M    | 362       |
| MIMIC-IV Notes     | 330,486   | 734M   | 5,310     |
| Web Crawl          | 93,642    | 69M    | 512       |
| **Total**          | 1.1M+     | ~1.8B  | ~13,540   |

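The web-crawl slice was kept only where a fine-tuned classifier judged a page health-related. That filtering model is not released, so the sketch below only illustrates the general pattern, with a hypothetical checkpoint name:

```python
from transformers import pipeline

# Hypothetical checkpoint; the actual filtering classifier is not public
clf = pipeline("text-classification", model="my-org/german-health-filter")

def keep(document: str) -> bool:
    """Retain a crawled page only if it is confidently health-related."""
    pred = clf(document[:2000], truncation=True)[0]
    return pred["label"] == "health" and pred["score"] > 0.9
```
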
## Pretraining Setup

- Framework: [Fairseq](https://github.com/facebookresearch/fairseq)
- Objective: Masked Language Modeling (Whole Word Masking)
- Optimizer: AdamW
- Learning Rate Schedule: Linear warmup (10k steps) + polynomial decay (sketched below)
- Max LR:
  - 7e-4 (ChristBERT)
  - 6e-4 (ChristBERT_scratch & ChristBERT_BPE)
- Batch Size: 8,192 tokens
- Sequence Length: 512
- Steps: 100,000
- Hardware: 4× NVIDIA A100 or 2× NVIDIA H100
- Total compute time: ~21.7 GPU days

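With Fairseq's `polynomial_decay` scheduler, the settings above behave roughly as follows, assuming the default decay power of 1.0 and an end learning rate of 0:

```python
def learning_rate(step: int, peak_lr: float = 7e-4,
                  warmup: int = 10_000, total: int = 100_000) -> float:
    """Linear warmup to peak_lr, then linear (power=1.0) decay to 0."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * (total - step) / (total - warmup)

assert abs(learning_rate(10_000) - 7e-4) < 1e-12  # peak right after warmup
```
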
## Tokenizer

- Type: **Byte-level BPE**
- Vocabulary size: 52,000
- Compatible with RoBERTa/GPT-2 tokenizer conventions

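Because the tokenizer follows RoBERTa/GPT-2 conventions, it behaves like any byte-level BPE tokenizer in `transformers`; the exact splits depend on the learned merges, so treat the output as illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ChristBERT/ChristBERT_bpe_base")

# Byte-level BPE encodes leading spaces into the tokens ("Ġ" prefix)
tokens = tokenizer.tokenize("Akutes Koronarsyndrom mit Dyspnoe")
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokens))
```
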
## Intended Use

- Named Entity Recognition (NER)
- Clinical and biomedical text classification
- German medical text mining and information retrieval

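For NER, the encoder pairs with a token-classification head in the usual `transformers` way. A minimal fine-tuning setup might start like this; the label set here is purely illustrative:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Illustrative BIO label set; substitute the labels of your target corpus
labels = ["O", "B-Diagnosis", "I-Diagnosis", "B-Medication", "I-Medication"]

tokenizer = AutoTokenizer.from_pretrained("ChristBERT/ChristBERT_base")
model = AutoModelForTokenClassification.from_pretrained(
    "ChristBERT/ChristBERT_base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
```
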
## Evaluation

ChristBERT was evaluated on:

- 3 medical NER benchmarks (BRONCO150, CARDIO:DE, GGPONC)
- 2 clinical text classification benchmarks (CLEF, JSynCC)

Metrics: micro-averaged precision, recall, and F1

**The ChristBERT models outperformed existing German medical and general-purpose LMs on 4 out of 5 tasks**, with continued pretraining on general medical text proving especially strong.

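Micro-averaging pools true and false positives across all entity types or classes before computing each score. As a quick illustration with scikit-learn, excluding the "O" class as is common in NER evaluation:

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy gold and predicted labels; micro-averaging pools counts across classes
y_true = ["Diagnosis", "Medication", "O", "Diagnosis"]
y_pred = ["Diagnosis", "O", "O", "Diagnosis"]

p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro", labels=["Diagnosis", "Medication"]
)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```
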
### Named Entity Recognition

| Model              | BRONCO150 Prec. | BRONCO150 Rec. | BRONCO150 F1 | CARDIO:DE Prec. | CARDIO:DE Rec. | CARDIO:DE F1 | GGPONC Prec. | GGPONC Rec. | GGPONC F1 |
|--------------------|-----------------|----------------|--------------|-----------------|----------------|--------------|--------------|-------------|-----------|
| ChristBERT         | 81.42           | 81.77          | 81.87        | 85.58           | 89.65          | 87.57        | 75.65        | **79.83**   | **77.69** |
| ChristBERT_scratch | *81.87*         | *82.32*        | *82.09*      | 88.38           | 89.89          | 89.13        | *76.54*      | *77.56*     | *77.05*   |
| ChristBERT_BPE     | **85.71**       | **83.78**      | **84.74**    | *89.50*         | **91.31**      | **90.40**    | **76.59**    | 77.42       | 77.00     |
| medBERT.de         | 78.67           | 79.58          | 79.12        | 87.66           | 90.02          | 88.83        | 73.89        | 75.78       | 74.73     |
| BioGottBERT        | 76.96           | 78.45          | 77.70        | 88.37           | *90.74*        | 89.54        | 75.24        | 75.40       | 75.32     |
| GeistBERT          | 75.65           | 79.83          | 77.69        | 85.58           | 89.65          | 87.57        | 74.57        | 75.36       | 74.96     |
| GeBERTa            | 78.67           | 79.58          | 79.12        | **90.51**       | 90.23          | *90.37*      | 75.96        | 76.93       | 76.45     |

**Bold** marks the best and *italics* the second-best score per column, here and in the following table.

### Text Classification

| Model              | CLEF Prec. | CLEF Rec. | CLEF F1   | JSynCC Prec. | JSynCC Rec. | JSynCC F1 |
|--------------------|------------|-----------|-----------|--------------|-------------|-----------|
| ChristBERT         | 78.12      | 75.34     | 76.03     | 89.01        | **100**     | *94.19*   |
| ChristBERT_scratch | **93.68**  | 85.17     | *89.22*   | *91.86*      | 97.53       | **94.61** |
| ChristBERT_BPE     | 88.22      | *88.35*   | 88.28     | 89.53        | 95.06       | 92.22     |
| medBERT.de         | 89.21      | 87.59     | 88.40     | 91.25        | 90.12       | 90.68     |
| BioGottBERT        | 88.30      | 87.90     | 88.10     | 88.89        | *98.77*     | 93.57     |
| GeistBERT          | *90.43*    | 72.92     | 80.74     | **92.59**    | 92.59       | 92.59     |
| GeBERTa            | 88.91      | **89.71** | **89.31** | **92.59**    | 92.59       | 92.59     |

## How to Use

```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and encoder weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("ChristBERT/ChristBERT_base")
model = AutoModel.from_pretrained("ChristBERT/ChristBERT_base")

# Encode a German clinical sentence and run a forward pass
inputs = tokenizer("Der Patient leidet an Diabetes mellitus.", return_tensors="pt")
outputs = model(**inputs)
```

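A quick sanity check for the pretrained MLM objective is the fill-mask pipeline; this sketch assumes the RoBERTa-style `<mask>` token:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="ChristBERT/ChristBERT_base")

# Top predictions for the masked word in a German clinical sentence
for pred in fill("Der Patient leidet an Diabetes <mask>."):
    print(pred["token_str"], round(pred["score"], 3))
```
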
## Limitations

- Focused on the German biomedical domain; may not generalize well to other domains
- Trained on publicly available or de-identified data; not suitable for sensitive clinical decisions

# Terms of Use

By downloading and using any of the ChristBERT models from the Hugging Face Hub, you agree to abide by the following terms and conditions:

**Purpose and Scope:** All of the ChristBERT models are intended for research and informational purposes only and must not be used as the sole basis for making medical decisions or diagnosing patients. The models should be used as a supplementary tool alongside professional medical advice and clinical judgment.

**Proper Usage:** Users agree to use the ChristBERT models in a responsible manner, complying with all applicable laws, regulations, and ethical guidelines. The models must not be used for any unlawful, harmful, or malicious purposes, nor for clinical decision making or patient treatment.

**Data Privacy and Security:** Users are responsible for ensuring the privacy and security of any sensitive or confidential data processed using the ChristBERT models. Personally identifiable information (PII) should be anonymized before being processed by the models, and users must implement appropriate measures to protect data privacy.

**Prohibited Activities:** Users are strictly prohibited from attempting adversarial attacks, information retrieval from the models, or any other actions that may compromise the security and integrity of the ChristBERT models. Violators may face legal consequences and the retraction of the models' publication.

By downloading and using one of the ChristBERT models, you confirm that you have read, understood, and agree to abide by these terms of use.

## Legal Disclaimer

By using one of the *ChristBERT* models, you agree not to engage in any attempts to perform adversarial attacks or information retrieval from the model. Such activities are strictly prohibited and constitute a violation of the terms of use. Violators may face legal consequences, and any discovered violations may result in the immediate retraction of the model's publication. By continuing to use one of the *ChristBERT* models, you acknowledge and accept the responsibility to adhere to these terms and conditions.

## Citation

```bibtex
@misc{christbert,
  title      = {The Word and the Way: Strategies for Domain-Specific {BERT} Pre-Training in German Medical NLP},
  shorttitle = {The Word and the Way},
  author     = {Henry He and Johann Frei and Raphael Scheible-Schmitt},
  year       = {2025},
  month      = sep,
  publisher  = {Research Square},
  doi        = {10.21203/rs.3.rs-7332811/v1},
  url        = {https://www.researchsquare.com/article/rs-7332811/v1},
  urldate    = {2025-09-23},
  note       = {ISSN: 2693-5015}
}
```

## License

MIT