---
license: mit
language:
- de
base_model:
- TUM/GottBERT_filtered_base_best
---
# GeistBERT
GeistBERT is a **German language model** trained on a **largely deduplicated corpus** comprising **OSCAR23, OPUS, and MC4**. It builds on **GottBERT** while introducing **Whole Word Masking (WWM)** to improve contextual language representation. The model achieves **state-of-the-art (SOTA) performance** on multiple German NLP benchmarks.
GeistBERT comes in **three versions**:
- GeistBERT (Standard, this repo)
- [GeistBERT-Nyströmformer](https://huggingface.co/GeistBERT/GeistBERT_base_nystromformer) (Efficient self-attention)
- [GeistBERT-Longformer](https://huggingface.co/GeistBERT/GeistBERT_base_longformer) (Extended context length)
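The idea behind Whole Word Masking is that when a word is split into several subword pieces, all of its pieces are masked together rather than independently. The sketch below illustrates this with BERT-style `##` continuation markers purely for readability; the actual training used fairseq's implementation, and `whole_word_mask` is a hypothetical helper, not code from this project:

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Illustrative WWM: group subword pieces into words, then mask
    every piece of a selected word together."""
    rng = random.Random(seed)
    # Group token indices by word ("##" marks a continuation piece).
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    for group in words:
        if rng.random() < mask_prob:
            for i in group:  # mask the whole word, never a lone piece
                masked[i] = mask_token
    return masked
```

With plain token-level masking, `##modell` could be masked while `Sprach` stays visible; under WWM the pair `["Sprach", "##modell"]` is always masked or kept as a unit.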
## Training Data
GeistBERT was trained on a **diverse German corpus** combining:
- **OSCAR23, OPUS, and MC4** (largely deduplicated)
- **German Wikipedia**
- **OpenLegalData**
- **Europarl, EUbookshop, ECB, and EuroPat**
- **OpenSubtitles and TildeMODEL**
The combined dataset amounts to **approximately 1.3T tokens** and was shuffled to improve sample diversity during training.
## Training Procedure
### Hardware
- Training was conducted on **multiple GPUs**, including **NVIDIA RTX 3090 (24GB VRAM)**.
- **Gradient accumulation** was used for the **Longformer** variant, which requires **more VRAM** than the Nyströmformer and RoBERTa variants, both of which fit on a single RTX 3090.
### Hyperparameters
| Parameter | Value |
|--------------------|------------------------|
| **Model Architecture** | RoBERTa (Base) |
| **Batch Size** | 8,000 |
| **Training Steps** | 100k |
| **Weight Initialization** | [GottBERT filtered base](https://huggingface.co/TUM/GottBERT_filtered_base_best) |
| **Warmup Iterations** | 10k |
| **Peak Learning Rate** | 0.0007 |
| **Learning Rate Decay** | Polynomial to zero |
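The warmup and decay rows above describe a standard fairseq-style schedule: linear warmup to the peak learning rate over 10k iterations, then polynomial decay to zero by step 100k. A minimal sketch of that schedule (`lr_at` is a hypothetical helper, not training code from the project; `power=1.0` assumes linear decay, fairseq's default):

```python
def lr_at(step, peak_lr=0.0007, warmup=10_000, total=100_000, power=1.0):
    """Linear warmup to peak_lr, then polynomial decay to zero."""
    if step < warmup:
        # Warmup phase: learning rate rises linearly from 0 to peak_lr.
        return peak_lr * step / warmup
    # Decay phase: remaining fraction of the decay window, raised to `power`.
    remaining = max(0.0, (total - step) / (total - warmup))
    return peak_lr * remaining ** power
```

For example, halfway through the decay window (step 55k) the learning rate has fallen to half the peak value, 0.00035.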
## Performance
GeistBERT achieves **SOTA results** on multiple tasks:
- **NER**: CoNLL 2003, GermEval 2014
- **Text Classification**: GermEval 2018 (coarse & fine), 10kGNAD
- **NLI**: German subset of XNLI
Metrics:
- **NER and Text Classification**: F1 Score
- **NLI**: Accuracy
Details:
- **Bold** values indicate the best-performing model within one architecture class (base, large); <ins>underscored</ins> values indicate the second best.
| Model | Accuracy NLI | GermEval\_14 F1 | CoNLL F1 | Coarse F1 | Fine F1 | 10kGNAD F1 |
|-------------------------------------|--------------|----------------|----------|-----------|---------|------------|
| [GeistBERT](https://huggingface.co/GeistBERT/GeistBERT_base) | **82.67** | **88.47** | _86.17_ | _79.67_ | 66.42 | **90.89** |
| [GeistBERT-Nyströmformer](https://huggingface.co/GeistBERT/GeistBERT_base_nystromformer) | 82.50 | 88.23 | 85.76 | 79.17 | **78.57** | 90.33 |
| [GeistBERT-Longformer](https://huggingface.co/GeistBERT/GeistBERT_base_longformer) | _82.51_ | _88.45_ | **86.71** | **80.56** | _66.76_ | 90.32 |
| [GottBERT_base_best](https://huggingface.co/TUM/GottBERT_base_best) | 80.82 | 87.55 | 85.93 | 78.17 | 53.30 | 89.64 |
| [GottBERT_base_last](https://huggingface.co/TUM/GottBERT_base_last) | 81.04 | 87.48 | 85.61 | 78.18 | 53.92 | 90.27 |
| [GottBERT_filtered_base_best](https://huggingface.co/TUM/GottBERT_filtered_base_best) | 80.56 | 87.57 | 86.14 | 78.65 | 52.82 | 89.79 |
| [GottBERT_filtered_base_last](https://huggingface.co/TUM/GottBERT_filtered_base_last) | 80.74 | 87.59 | 85.66 | 78.08 | 52.39 | 89.92 |
| GELECTRA_base | 81.70 | 86.91 | 85.37 | 77.26 | 50.07 | 89.02 |
| GBERT_base | 80.06 | 87.24 | 85.16 | 77.37 | 51.51 | 90.30 |
| dbmdzBERT | 68.12 | 86.82 | 85.15 | 77.46 | 52.07 | _90.34_ |
| GermanBERT | 78.16 | 86.53 | 83.87 | 74.81 | 47.78 | 90.18 |
| XLM-R_base | 79.76 | 86.14 | 84.46 | 77.13 | 50.54 | 89.81 |
| mBERT | 77.03 | 86.67 | 83.18 | 73.54 | 48.32 | 88.90 |
| [GottBERT_large](https://huggingface.co/TUM/GottBERT_large) | 82.46 | 88.20 | _86.78_ | 79.40 | 54.61 | 90.24 |
| [GottBERT_filtered_large_best](https://huggingface.co/TUM/GottBERT_filtered_large_best) | 83.31 | 88.13 | 86.30 | 79.32 | 54.70 | 90.31 |
| [GottBERT_filtered_large_last](https://huggingface.co/TUM/GottBERT_filtered_large_last) | 82.79 | _88.27_ | 86.28 | 78.96 | 54.72 | 90.17 |
| GELECTRA_large | **86.33** | _88.72_ | _86.78_ | **81.28** | _56.17_ | **90.97** |
| GBERT_large | _84.21_ | _88.72_ | **87.19** | _80.84_ | **57.37** | _90.74_ |
| XLM-R_large | 84.07 | **88.83** | 86.54 | 79.05 | 55.06 | 90.17 |
## Intended Use
This model is designed for **German NLP tasks**, including:
- **Text classification**
- **Named Entity Recognition (NER)**
- **Machine Translation Pre-training**
- **Document Understanding**
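A minimal usage sketch for these tasks, assuming the repository provides Hugging Face `transformers`-compatible weights and that the RoBERTa-style `<mask>` token is used (both are assumptions, not confirmed by this card):

```python
from transformers import pipeline

# Load GeistBERT as a fill-mask pipeline (downloads the checkpoint on first use).
unmasker = pipeline("fill-mask", model="GeistBERT/GeistBERT_base")

# Predict the masked word in a German sentence.
predictions = unmasker("Die Hauptstadt von Deutschland ist <mask>.")
for p in predictions:
    print(p["token_str"], round(p["score"], 3))
```

For downstream tasks such as NER or text classification, the same checkpoint would typically be fine-tuned via `AutoModelForTokenClassification` or `AutoModelForSequenceClassification`.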
## Limitations
- Trained on **unfiltered data**, meaning some **redundant or lower-quality samples** may be present.
- Longformer **requires more VRAM**, making it less accessible for smaller GPU setups.
- While deduplication was applied to **specific subcorpora**, the full corpus **was not manually curated**.
## Fairseq Checkpoints
Get the fairseq checkpoints [here](https://drive.proton.me/urls/P83GCPNM40#2f0f87XEIrQP).