---
license: mit
language:
- de
base_model:
- GeistBERT/GeistBERT_base
tags:
- health
- biomedical
- medical
---

# ChristBERT

ChristBERT (Clinical- and Healthcare-Related Issues and Subjects Tuned BERT) is a family of domain-adapted German biomedical RoBERTa models. It was developed to address the lack of high-quality German-language models for clinical and healthcare NLP tasks.

## Model Variants

- [`ChristBERT`](https://huggingface.co/ChristBERT/ChristBERT_base): Continued pretraining from GeistBERT on biomedical data.
- [`ChristBERT_scratch`](https://huggingface.co/ChristBERT/ChristBERT_scratch_base): Trained from scratch on biomedical data using GeistBERT's vocabulary.
- [`ChristBERT_BPE`](https://huggingface.co/ChristBERT/ChristBERT_bpe_base): Trained from scratch on biomedical data using a new byte-level BPE vocabulary (52k tokens) trained on the biomedical domain.

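All three variants share the RoBERTa base architecture and load interchangeably via `transformers`; a minimal sketch, using the repo ids listed above:

```python
from transformers import AutoModel, AutoTokenizer

# Any variant can be swapped in by repo id
for repo in (
    "ChristBERT/ChristBERT_base",
    "ChristBERT/ChristBERT_scratch_base",
    "ChristBERT/ChristBERT_bpe_base",
):
    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModel.from_pretrained(repo)
    print(repo, model.config.vocab_size)
```
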
## Model Architecture

All ChristBERT variants are based on the **RoBERTa base** architecture:

- 12 transformer layers
- Hidden size of 768
- 12 attention heads
- ~125M parameters
- Sequence length: 512 tokens

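For reference, these hyperparameters correspond roughly to the following `transformers` configuration; this is a sketch only, since the authoritative values live in each checkpoint's `config.json`:

```python
from transformers import RobertaConfig

# Approximate ChristBERT architecture; read exact values from the
# released config.json rather than relying on this sketch
config = RobertaConfig(
    vocab_size=52_000,            # byte-level BPE vocabulary
    num_hidden_layers=12,         # transformer layers
    hidden_size=768,
    num_attention_heads=12,
    max_position_embeddings=514,  # RoBERTa convention: 512 tokens + 2 offsets
)
```
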
## Pretraining Data

ChristBERT was trained on a 13.5 GB biomedical corpus consisting of:

- Hpsmedia medical journals
- Springer Nature biomedical publications
- PubMed Central abstracts and full texts
- German medical PhD theses
- German medical Wikipedia
- German-translated MIMIC-IV notes (using LLaMA 3.1 8B)
- Crawled German health web content (filtered via a fine-tuned classifier)

The corpus breaks down as follows:

| Dataset            | Documents | Words  | Size (MB) |
|--------------------|-----------|--------|-----------|
| Hpsmedia           | 277,357   | 405M   | 3,117     |
| Springer Nature    | 258,000   | 259M   | 1,984     |
| PubMed Central     | 90,273    | 220M   | 1,609     |
| PhD Theses         | 7,486     | 90M    | 646       |
| Medical Wikipedia  | 75,585    | 49M    | 362       |
| MIMIC-IV Notes     | 330,486   | 734M   | 5,310     |
| Web Crawl          | 93,642    | 69M    | 512       |
| **Total**          | 1.1M+     | ~1.8B  | ~13,540   |

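The web-crawl slice was kept only where a fine-tuned classifier judged a page health-related. That filtering model is not released, so the sketch below only illustrates the general pattern, with a hypothetical checkpoint name:

```python
from transformers import pipeline

# Hypothetical checkpoint; the actual filtering classifier is not public
clf = pipeline("text-classification", model="my-org/german-health-filter")

def keep(document: str) -> bool:
    """Retain a crawled page only if it is confidently health-related."""
    pred = clf(document[:2000], truncation=True)[0]
    return pred["label"] == "health" and pred["score"] > 0.9
```
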
## Pretraining Setup

- Framework: [Fairseq](https://github.com/facebookresearch/fairseq)
- Objective: Masked Language Modeling (Whole Word Masking)
- Optimizer: AdamW
- Learning Rate Schedule: Linear warmup (10k steps) + polynomial decay (sketched below)
- Max LR:
  - 7e-4 (ChristBERT)
  - 6e-4 (ChristBERT_scratch & ChristBERT_BPE)
- Batch Size: 8,192 tokens
- Sequence Length: 512
- Steps: 100,000
- Hardware: 4× NVIDIA A100 or 2× NVIDIA H100
- Total compute time: ~21.7 GPU days

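With Fairseq's `polynomial_decay` scheduler, the settings above behave roughly as follows, assuming the default decay power of 1.0 and an end learning rate of 0:

```python
def learning_rate(step: int, peak_lr: float = 7e-4,
                  warmup: int = 10_000, total: int = 100_000) -> float:
    """Linear warmup to peak_lr, then linear (power=1.0) decay to 0."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * (total - step) / (total - warmup)

assert abs(learning_rate(10_000) - 7e-4) < 1e-12  # peak right after warmup
```
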
## Tokenizer

- Type: **Byte-level BPE**
- Vocabulary size: 52,000
- Compatible with RoBERTa/GPT-2 tokenizer conventions

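Because the tokenizer follows RoBERTa/GPT-2 conventions, it behaves like any byte-level BPE tokenizer in `transformers`; the exact splits depend on the learned merges, so treat the output as illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ChristBERT/ChristBERT_bpe_base")

# Byte-level BPE encodes leading spaces into the tokens ("Ġ" prefix)
tokens = tokenizer.tokenize("Akutes Koronarsyndrom mit Dyspnoe")
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokens))
```
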
## Intended Use

- Named Entity Recognition (NER)
- Clinical and biomedical text classification
- German medical text mining and information retrieval

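For NER, the encoder pairs with a token-classification head in the usual `transformers` way. A minimal fine-tuning setup might start like this; the label set here is purely illustrative:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Illustrative BIO label set; substitute the labels of your target corpus
labels = ["O", "B-Diagnosis", "I-Diagnosis", "B-Medication", "I-Medication"]

tokenizer = AutoTokenizer.from_pretrained("ChristBERT/ChristBERT_base")
model = AutoModelForTokenClassification.from_pretrained(
    "ChristBERT/ChristBERT_base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
```
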
## Evaluation

ChristBERT was evaluated on:

- 3 medical NER benchmarks (BRONCO150, CARDIO:DE, GGPONC)
- 2 clinical text classification benchmarks (CLEF, JSynCC)

Metrics: micro-averaged precision, recall, and F1

**The ChristBERT models outperformed existing German medical and general-purpose LMs on 4 out of 5 tasks**, with continued pretraining on general medical text proving especially strong.

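Micro-averaging pools true and false positives across all entity types or classes before computing each score. As a quick illustration with scikit-learn, excluding the "O" class as is common in NER evaluation:

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy gold and predicted labels; micro-averaging pools counts across classes
y_true = ["Diagnosis", "Medication", "O", "Diagnosis"]
y_pred = ["Diagnosis", "O", "O", "Diagnosis"]

p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro", labels=["Diagnosis", "Medication"]
)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```
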
### Named Entity Recognition

| Model              | BRONCO150 Prec. | BRONCO150 Rec. | BRONCO150 F1 | CARDIO:DE Prec. | CARDIO:DE Rec. | CARDIO:DE F1 | GGPONC Prec. | GGPONC Rec. | GGPONC F1 |
|--------------------|-----------------|----------------|--------------|-----------------|----------------|--------------|--------------|-------------|-----------|
| ChristBERT         | 81.42           | 81.77          | 81.87        | 85.58           | 89.65          | 87.57        | 75.65        | **79.83**   | **77.69** |
| ChristBERT_scratch | *81.87*         | *82.32*        | *82.09*      | 88.38           | 89.89          | 89.13        | *76.54*      | *77.56*     | *77.05*   |
| ChristBERT_BPE     | **85.71**       | **83.78**      | **84.74**    | *89.50*         | **91.31**      | **90.40**    | **76.59**    | 77.42       | 77.00     |
| medBERT.de         | 78.67           | 79.58          | 79.12        | 87.66           | 90.02          | 88.83        | 73.89        | 75.78       | 74.73     |
| BioGottBERT        | 76.96           | 78.45          | 77.70        | 88.37           | *90.74*        | 89.54        | 75.24        | 75.40       | 75.32     |
| GeistBERT          | 75.65           | 79.83          | 77.69        | 85.58           | 89.65          | 87.57        | 74.57        | 75.36       | 74.96     |
| GeBERTa            | 78.67           | 79.58          | 79.12        | **90.51**       | 90.23          | *90.37*      | 75.96        | 76.93       | 76.45     |

**Bold** marks the best and *italics* the second-best score per column, here and in the following table.

### Text Classification

| Model              | CLEF Prec. | CLEF Rec. | CLEF F1   | JSynCC Prec. | JSynCC Rec. | JSynCC F1 |
|--------------------|------------|-----------|-----------|--------------|-------------|-----------|
| ChristBERT         | 78.12      | 75.34     | 76.03     | 89.01        | **100**     | *94.19*   |
| ChristBERT_scratch | **93.68**  | 85.17     | *89.22*   | *91.86*      | 97.53       | **94.61** |
| ChristBERT_BPE     | 88.22      | *88.35*   | 88.28     | 89.53        | 95.06       | 92.22     |
| medBERT.de         | 89.21      | 87.59     | 88.40     | 91.25        | 90.12       | 90.68     |
| BioGottBERT        | 88.30      | 87.90     | 88.10     | 88.89        | *98.77*     | 93.57     |
| GeistBERT          | *90.43*    | 72.92     | 80.74     | **92.59**    | 92.59       | 92.59     |
| GeBERTa            | 88.91      | **89.71** | **89.31** | **92.59**    | 92.59       | 92.59     |

## How to Use

```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and encoder weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("ChristBERT/ChristBERT_base")
model = AutoModel.from_pretrained("ChristBERT/ChristBERT_base")

# Encode a German clinical sentence and run a forward pass
inputs = tokenizer("Der Patient leidet an Diabetes mellitus.", return_tensors="pt")
outputs = model(**inputs)
```

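A quick sanity check for the pretrained MLM objective is the fill-mask pipeline; this sketch assumes the RoBERTa-style `<mask>` token:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="ChristBERT/ChristBERT_base")

# Top predictions for the masked word in a German clinical sentence
for pred in fill("Der Patient leidet an Diabetes <mask>."):
    print(pred["token_str"], round(pred["score"], 3))
```
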
## Limitations

- Focused on the German biomedical domain; may not generalize well to other domains
- Trained on publicly available or de-identified data; not suitable for sensitive clinical decisions

# Terms of Use

By downloading and using any of the ChristBERT models from the Hugging Face Hub, you agree to abide by the following terms and conditions:

**Purpose and Scope:** All of the ChristBERT models are intended for research and informational purposes only and must not be used as the sole basis for making medical decisions or diagnosing patients. The models should be used as a supplementary tool alongside professional medical advice and clinical judgment.

**Proper Usage:** Users agree to use the ChristBERT models in a responsible manner, complying with all applicable laws, regulations, and ethical guidelines. The models must not be used for any unlawful, harmful, or malicious purposes, nor for clinical decision making or patient treatment.

**Data Privacy and Security:** Users are responsible for ensuring the privacy and security of any sensitive or confidential data processed using the ChristBERT models. Personally identifiable information (PII) should be anonymized before being processed by the models, and users must implement appropriate measures to protect data privacy.

**Prohibited Activities:** Users are strictly prohibited from attempting adversarial attacks, information retrieval from the models, or any other actions that may compromise the security and integrity of the ChristBERT models. Violators may face legal consequences and the retraction of the models' publication.

By downloading and using one of the ChristBERT models, you confirm that you have read, understood, and agree to abide by these terms of use.

## Legal Disclaimer

By using one of the *ChristBERT* models, you agree not to engage in any attempts to perform adversarial attacks or information retrieval from the model. Such activities are strictly prohibited and constitute a violation of the terms of use. Violators may face legal consequences, and any discovered violations may result in the immediate retraction of the model's publication. By continuing to use one of the *ChristBERT* models, you acknowledge and accept the responsibility to adhere to these terms and conditions.

## Citation

```bibtex
@misc{christbert,
  title      = {The Word and the Way: Strategies for Domain-Specific {BERT} Pre-Training in German Medical NLP},
  shorttitle = {The Word and the Way},
  author     = {Henry He and Johann Frei and Raphael Scheible-Schmitt},
  year       = {2025},
  month      = sep,
  publisher  = {Research Square},
  doi        = {10.21203/rs.3.rs-7332811/v1},
  url        = {https://www.researchsquare.com/article/rs-7332811/v1},
  urldate    = {2025-09-23},
  note       = {ISSN: 2693-5015}
}
```

## License

MIT