---
license: mit
language:
- de
base_model:
- TUM/GottBERT_filtered_base_best
---
# GeistBERT

GeistBERT is a German language model trained on a largely deduplicated corpus including OSCAR23, OPUS, and MC4. It builds on GottBERT and introduces Whole Word Masking (WWM) to improve contextual language representation. The model achieves state-of-the-art (SOTA) performance on multiple German NLP benchmarks.
GeistBERT comes in three versions:
- GeistBERT (Standard, this repo)
- GeistBERT-Nyströmformer (Efficient self-attention)
- GeistBERT-Longformer (Extended context length)
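For quick experimentation, the standard variant can be queried through the `transformers` fill-mask pipeline. The following is a minimal sketch: the repo id `TUM/GeistBERT_base` is only a placeholder for this repository's actual Hub id, and `<mask>` is the RoBERTa-style mask token.

```python
# Minimal fill-mask sketch for GeistBERT (RoBERTa-style masked language model).
# NOTE: "TUM/GeistBERT_base" is a placeholder repo id, not confirmed by this card.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="TUM/GeistBERT_base")

# RoBERTa-based models use "<mask>" as the mask token.
for candidate in fill_mask("Die Hauptstadt von Deutschland ist <mask>."):
    print(candidate["token_str"], round(candidate["score"], 3))
```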
## Training Data

GeistBERT was trained on a diverse German corpus combining:
- OSCAR23, OPUS, and MC4 (largely deduplicated)
- German Wikipedia
- OpenLegalData
- Europarl, EUbookshop, ECB, and EuroPat
- OpenSubtitles and TildeMODEL
The dataset amounts to approximately 1.3T tokens, shuffled for improved variance.
## Training Procedure

### Hardware

- Training was conducted on multiple GPUs, including NVIDIA RTX 3090 (24 GB VRAM).
- Gradient accumulation was used for the Longformer variant, which requires more VRAM; the Nyströmformer and standard RoBERTa variants each fit on a single RTX 3090.

### Hyperparameters
| Parameter | Value |
|---|---|
| Model Architecture | RoBERTa (Base) |
| Batch Size | 8,000 |
| Training Steps | 100k |
| Weight Initialization | GottBERT filtered base |
| Warmup Iterations | 10k |
| Peak Learning Rate | 0.0007 |
| Learning Rate Decay | Polynomial to zero |
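As a rough illustration of the schedule in the table (linear warmup over 10k updates to the peak learning rate of 7e-4, then polynomial decay to zero at 100k updates), the sketch below computes the per-step learning rate. The decay power of 1.0 (i.e. linear decay) is an assumption, not stated in this card.

```python
# Illustrative sketch of the warmup + polynomial-decay-to-zero schedule from the table.
# The decay power (1.0 = linear) is an assumption, not stated in the card.
def learning_rate(step, peak_lr=7e-4, warmup=10_000, total=100_000, power=1.0):
    if step < warmup:
        return peak_lr * step / warmup                # linear warmup to the peak LR
    remaining = (total - step) / (total - warmup)     # fraction of the decay phase left
    return peak_lr * max(remaining, 0.0) ** power     # polynomial decay to zero

for s in (0, 5_000, 10_000, 55_000, 100_000):
    print(f"step {s:>7}: lr = {learning_rate(s):.6f}")
```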
## Performance
GeistBERT achieves SOTA results on multiple tasks:
- NER: CoNLL 2003, GermEval 2014
- Text Classification: GermEval 2018 (coarse & fine), 10kGNAD
- NLI: German subset of XNLI
Metrics:
- NER and Text Classification: F1 Score
- NLI: Accuracy
Details:
- Bold values indicate the best-performing model within an architecture class (base, large); underscored values indicate the second best.
| Model | NLI Accuracy (XNLI) | GermEval 2014 F1 | CoNLL 2003 F1 | GermEval 2018 Coarse F1 | GermEval 2018 Fine F1 | 10kGNAD F1 |
|---|---|---|---|---|---|---|
| GeistBERT | 82.67 | 88.47 | 86.17 | 79.67 | 66.42 | 90.89 |
| GeistBERT-Nyströmformer | 82.50 | 88.23 | 85.76 | 79.17 | 78.57 | 90.33 |
| GeistBERT-Longformer | 82.51 | 88.45 | 86.71 | 80.56 | 66.76 | 90.32 |
| GottBERT_base_best | 80.82 | 87.55 | 85.93 | 78.17 | 53.30 | 89.64 |
| GottBERT_base_last | 81.04 | 87.48 | 85.61 | 78.18 | 53.92 | 90.27 |
| GottBERT_filtered_base_best | 80.56 | 87.57 | 86.14 | 78.65 | 52.82 | 89.79 |
| GottBERT_filtered_base_last | 80.74 | 87.59 | 85.66 | 78.08 | 52.39 | 89.92 |
| GELECTRA_base | 81.70 | 86.91 | 85.37 | 77.26 | 50.07 | 89.02 |
| GBERT_base | 80.06 | 87.24 | 85.16 | 77.37 | 51.51 | 90.30 |
| dbmdzBERT | 68.12 | 86.82 | 85.15 | 77.46 | 52.07 | 90.34 |
| GermanBERT | 78.16 | 86.53 | 83.87 | 74.81 | 47.78 | 90.18 |
| XLM-R_base | 79.76 | 86.14 | 84.46 | 77.13 | 50.54 | 89.81 |
| mBERT | 77.03 | 86.67 | 83.18 | 73.54 | 48.32 | 88.90 |
| GottBERT_large | 82.46 | 88.20 | 86.78 | 79.40 | 54.61 | 90.24 |
| GottBERT_filtered_large_best | 83.31 | 88.13 | 86.30 | 79.32 | 54.70 | 90.31 |
| GottBERT_filtered_large_last | 82.79 | 88.27 | 86.28 | 78.96 | 54.72 | 90.17 |
| GELECTRA_large | 86.33 | 88.72 | 86.78 | 81.28 | 56.17 | 90.97 |
| GBERT_large | 84.21 | 88.72 | 87.19 | 80.84 | 57.37 | 90.74 |
| XLM-R_large | 84.07 | 88.83 | 86.54 | 79.05 | 55.06 | 90.17 |
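To obtain numbers of this kind, the model is fine-tuned per task and scored with the metric listed above. The sketch below shows a minimal sequence-classification fine-tuning loop with an F1 metric, using toy in-memory data in place of the GermEval 2018 coarse split; the repo id and the macro averaging are assumptions made for illustration.

```python
# Hedged fine-tuning sketch for a GermEval 2018-style coarse classification task.
# Assumptions: repo id "TUM/GeistBERT_base", macro-averaged F1, toy in-memory data.
import numpy as np
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_ID = "TUM/GeistBERT_base"  # placeholder, not confirmed by this card

# Toy stand-in for the GermEval 2018 coarse split (0 = OTHER, 1 = OFFENSE).
data = Dataset.from_dict({
    "text": ["Das ist ein harmloser Satz.", "Das ist eine üble Beleidigung!"],
    "label": [0, 1],
})

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

data = data.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds, average="macro")}  # F1, as in the table above

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1, per_device_train_batch_size=8),
    train_dataset=data,
    eval_dataset=data,           # toy data only; use the real dev/test split in practice
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
```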
## Intended Use
This model is designed for German NLP tasks, including:
- Text classification
- Named Entity Recognition (NER)
- Machine Translation Pre-training
- Document Understanding
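The released checkpoint is an encoder only, so each downstream use attaches its own head. A hedged loading sketch follows, again with a placeholder repo id; the freshly initialized heads still require fine-tuning on task data.

```python
# Hedged sketch: attaching task heads to the GeistBERT encoder for the uses listed above.
# "TUM/GeistBERT_base" is a placeholder repo id; the new heads are randomly initialized
# and must be fine-tuned before use.
from transformers import (AutoModel, AutoModelForSequenceClassification,
                          AutoModelForTokenClassification, AutoTokenizer)

MODEL_ID = "TUM/GeistBERT_base"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
encoder = AutoModel.from_pretrained(MODEL_ID)                                     # embeddings / document understanding
ner = AutoModelForTokenClassification.from_pretrained(MODEL_ID, num_labels=9)     # e.g. a CoNLL-style NER tag set
clf = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)  # text classification
```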
## Limitations
- Trained on unfiltered data, meaning some redundant or lower-quality samples may be present.
- Longformer requires more VRAM, making it less accessible for smaller GPU setups.
- While deduplication was applied to specific subcorpora, the full corpus was not manually curated.
## Fairseq Checkpoints
Get the fairseq checkpoints here.
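For fairseq users, a checkpoint directory can be loaded with fairseq's `RobertaModel.from_pretrained`. The sketch below uses placeholder paths; the dictionary and BPE/tokenizer files shipped with the checkpoint archive are assumed to be in the same directory.

```python
# Hedged sketch for loading a GeistBERT fairseq checkpoint. Paths are placeholders;
# the dict.txt and BPE files shipped with the checkpoint are assumed to sit alongside it.
from fairseq.models.roberta import RobertaModel

geistbert = RobertaModel.from_pretrained(
    "path/to/geistbert_checkpoint_dir",   # directory containing model.pt and dict.txt
    checkpoint_file="model.pt",
)
geistbert.eval()  # disable dropout for inference

tokens = geistbert.encode("Die Hauptstadt von Deutschland ist Berlin.")
features = geistbert.extract_features(tokens)   # (1, seq_len, hidden) representations
print(features.shape)
```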