---
license: mit
language:
- de
base_model:
- TUM/GottBERT_filtered_base_best
---
|
|
|
|
|
# GeistBERT |
|
|
GeistBERT is a **German language model** trained on a **largely deduplicated corpus** comprising **OSCAR23, OPUS, and MC4**. It builds on **GottBERT** and introduces **Whole Word Masking (WWM)** to improve contextual language representation. The model achieves **state-of-the-art (SOTA) performance** on multiple German NLP benchmarks.
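For intuition, here is a simplified sketch of whole word masking over RoBERTa-style BPE pieces. The "Ġ" word-boundary convention and the masking rate are assumptions of this illustration, not a description of the exact training pipeline:

```python
import random

def whole_word_mask(pieces, mask_prob=0.15, mask_token="<mask>"):
    """Mask all BPE pieces of a sampled word instead of independent pieces."""
    words = []
    for p in pieces:
        # RoBERTa-style BPE marks word starts with "Ġ"; group pieces into words.
        if p.startswith("Ġ") or not words:
            words.append([p])
        else:
            words[-1].append(p)
    out = []
    for word in words:
        if random.random() < mask_prob:
            out.extend([mask_token] * len(word))  # the whole word is masked together
        else:
            out.extend(word)
    return out

print(whole_word_mask(["ĠSprach", "modelle", "Ġlernen", "ĠKontext"], mask_prob=0.5))
```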
|
|
|
|
|
GeistBERT comes in **three versions**: |
|
|
- GeistBERT (Standard, this repo) |
|
|
- [GeistBERT-Nyströmformer](https://huggingface.co/GeistBERT/GeistBERT_base_nystromformer) (Efficient self-attention) |
|
|
- [GeistBERT-Longformer](https://huggingface.co/GeistBERT/GeistBERT_base_longformer) (Extended context length) |
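A minimal usage sketch for the standard model, assuming it loads through the `transformers` fill-mask pipeline with RoBERTa's `<mask>` token:

```python
from transformers import pipeline

# Fill-mask with the standard GeistBERT checkpoint.
unmasker = pipeline("fill-mask", model="GeistBERT/GeistBERT_base")
print(unmasker("Die Hauptstadt von Deutschland ist <mask>."))
```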
|
|
|
|
|
## Training Data |
|
|
GeistBERT was trained on a **diverse German corpus** combining: |
|
|
- **OSCAR23, OPUS, and MC4** (largely deduplicated)
|
|
- **German Wikipedia** |
|
|
- **OpenLegalData** |
|
|
- **Europarl, EUbookshop, ECB, and EuroPat** |
|
|
- **OpenSubtitles and TildeMODEL** |
|
|
|
|
|
The combined dataset amounts to **approximately 1.3T tokens** and was shuffled to improve sample variance during training.
|
|
|
|
|
## Training Procedure |
|
|
### Hardware |
|
|
- Training was conducted on **multiple GPUs**, including **NVIDIA RTX 3090 (24 GB VRAM)** cards.
- **Gradient accumulation** was used for the **Longformer** variant, which requires **more VRAM**; **Nyströmformer** and the standard **RoBERTa** model fit on a single RTX 3090 (see the sketch below).
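For context, a toy sketch of gradient accumulation: gradients from several micro-batches are summed before a single optimizer step, trading training time for memory. All numbers here are hypothetical:

```python
import torch

# Toy setup: effective batch = accum_steps x micro-batch size.
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=7e-4)
accum_steps = 4  # hypothetical; the card does not state the exact factor

optimizer.zero_grad()
for i in range(100):
    x = torch.randn(8, 16)
    y = torch.randint(0, 2, (8,))
    loss = torch.nn.functional.cross_entropy(model(x), y) / accum_steps  # average over micro-batches
    loss.backward()                    # gradients accumulate in .grad buffers
    if (i + 1) % accum_steps == 0:
        optimizer.step()               # one optimizer step per accumulated "large" batch
        optimizer.zero_grad()
```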
|
|
|
|
|
### Hyperparameters |
|
|
| Parameter | Value |
|--------------------|------------------------|
| **Model Architecture** | RoBERTa (Base) |
| **Batch Size** | 8,000 |
| **Training Steps** | 100k |
| **Weight Initialization** | [GottBERT filtered base](https://huggingface.co/TUM/GottBERT_filtered_base_best) |
| **Warmup Iterations** | 10k |
| **Peak Learning Rate** | 0.0007 |
| **Learning Rate Decay** | Polynomial to zero |
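The resulting learning-rate schedule, as a small sketch: linear warmup to the peak over the first 10k steps, then polynomial decay to zero by step 100k. The decay power is an assumption of this illustration (fairseq's polynomial scheduler defaults to 1.0, i.e. linear decay):

```python
def lr_at(step, peak_lr=7e-4, warmup=10_000, total=100_000, power=1.0):
    """Learning rate at a given update under warmup + polynomial decay to zero."""
    if step < warmup:
        return peak_lr * step / warmup                 # linear warmup to the peak
    frac_left = (total - step) / (total - warmup)      # 1.0 right after warmup, 0.0 at the end
    return peak_lr * max(frac_left, 0.0) ** power

print(lr_at(10_000), lr_at(55_000), lr_at(100_000))    # 0.0007, 0.00035, 0.0
```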
|
|
|
|
|
## Performance |
|
|
GeistBERT achieves **SOTA results** on multiple tasks: |
|
|
- **NER**: CoNLL 2003, GermEval 2014 |
|
|
- **Text Classification**: GermEval 2018 (coarse & fine), 10kGNAD |
|
|
- **NLI**: German subset of XNLI |
|
|
|
|
|
Metrics (see the sketch below):
|
|
- **NER and Text Classification**: F1 Score |
|
|
- **NLI**: Accuracy |
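Purely illustrative scoring with hypothetical predictions, showing which metric applies to which task (entity-level F1 via `seqeval` is the usual convention for CoNLL-style NER):

```python
from seqeval.metrics import f1_score                  # entity-level F1, standard for NER
from sklearn.metrics import accuracy_score, f1_score as cls_f1

# Hypothetical predictions, only to show which score applies to which task.
ner_true = [["B-PER", "I-PER", "O", "B-LOC"]]
ner_pred = [["B-PER", "I-PER", "O", "O"]]
print(f1_score(ner_true, ner_pred))                   # NER: F1

cls_true, cls_pred = [0, 1, 1, 0], [0, 1, 0, 0]
print(cls_f1(cls_true, cls_pred))                     # text classification: F1

nli_true, nli_pred = [0, 2, 1], [0, 2, 2]
print(accuracy_score(nli_true, nli_pred))             # NLI: accuracy
```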
|
|
|
|
|
|
|
|
Details: |
|
|
- **bold** values indicate the best-performing model within one architecture class (base, large); <ins>underscored</ins> values mark the second best.
|
|
|
|
|
| Model | Accuracy NLI | GermEval\_14 F1 | CoNLL F1 | Coarse F1 | Fine F1 | 10kGNAD F1 |
|-------------------------------------|--------------|----------------|----------|-----------|---------|------------|
| [GeistBERT](https://huggingface.co/GeistBERT/GeistBERT_base) | **82.67** | **88.47** | _86.17_ | _79.67_ | 66.42 | **90.89** |
| [GeistBERT-Nyströmformer](https://huggingface.co/GeistBERT/GeistBERT_base_nystromformer) | 82.50 | 88.23 | 85.76 | 79.17 | **78.57** | 90.33 |
| [GeistBERT-Longformer](https://huggingface.co/GeistBERT/GeistBERT_base_longformer) | _82.51_ | _88.45_ | **86.71** | **80.56** | _66.76_ | 90.32 |
| [GottBERT_base_best](https://huggingface.co/TUM/GottBERT_base_best) | 80.82 | 87.55 | 85.93 | 78.17 | 53.30 | 89.64 |
| [GottBERT_base_last](https://huggingface.co/TUM/GottBERT_base_last) | 81.04 | 87.48 | 85.61 | 78.18 | 53.92 | 90.27 |
| [GottBERT_filtered_base_best](https://huggingface.co/TUM/GottBERT_filtered_base_best) | 80.56 | 87.57 | 86.14 | 78.65 | 52.82 | 89.79 |
| [GottBERT_filtered_base_last](https://huggingface.co/TUM/GottBERT_filtered_base_last) | 80.74 | 87.59 | 85.66 | 78.08 | 52.39 | 89.92 |
| GELECTRA_base | 81.70 | 86.91 | 85.37 | 77.26 | 50.07 | 89.02 |
| GBERT_base | 80.06 | 87.24 | 85.16 | 77.37 | 51.51 | 90.30 |
| dbmdzBERT | 68.12 | 86.82 | 85.15 | 77.46 | 52.07 | _90.34_ |
| GermanBERT | 78.16 | 86.53 | 83.87 | 74.81 | 47.78 | 90.18 |
| XLM-R_base | 79.76 | 86.14 | 84.46 | 77.13 | 50.54 | 89.81 |
| mBERT | 77.03 | 86.67 | 83.18 | 73.54 | 48.32 | 88.90 |
| [GottBERT_large](https://huggingface.co/TUM/GottBERT_large) | 82.46 | 88.20 | _86.78_ | 79.40 | 54.61 | 90.24 |
| [GottBERT_filtered_large_best](https://huggingface.co/TUM/GottBERT_filtered_large_best) | 83.31 | 88.13 | 86.30 | 79.32 | 54.70 | 90.31 |
| [GottBERT_filtered_large_last](https://huggingface.co/TUM/GottBERT_filtered_large_last) | 82.79 | _88.27_ | 86.28 | 78.96 | 54.72 | 90.17 |
| GELECTRA_large | **86.33** | _88.72_ | _86.78_ | **81.28** | _56.17_ | **90.97** |
| GBERT_large | _84.21_ | _88.72_ | **87.19** | _80.84_ | **57.37** | _90.74_ |
| XLM-R_large | 84.07 | **88.83** | 86.54 | 79.05 | 55.06 | 90.17 |
|
|
|
|
|
|
|
|
## Intended Use |
|
|
This model is designed for **German NLP tasks**, including (a fine-tuning sketch follows the list):
|
|
- **Text classification** |
|
|
- **Named Entity Recognition (NER)** |
|
|
- **Machine Translation Pre-training** |
|
|
- **Document Understanding** |
|
|
|
|
|
## Limitations |
|
|
- Trained on **unfiltered data**, meaning some **redundant or lower-quality samples** may be present. |
|
|
- Longformer **requires more VRAM**, making it less accessible for smaller GPU setups. |
|
|
- While deduplication was applied to **specific subcorpora**, the full corpus **was not manually curated**. |
|
|
|
|
|
## Fairseq Checkpoints |
|
|
Get the fairseq checkpoints [here](https://drive.proton.me/urls/P83GCPNM40#2f0f87XEIrQP). |
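A sketch for loading such a checkpoint with fairseq's RoBERTa wrapper; the directory and checkpoint file name are assumptions and should be adjusted to the extracted archive:

```python
from fairseq.models.roberta import RobertaModel

# Directory and file name are assumptions; adjust to the extracted archive.
geistbert = RobertaModel.from_pretrained(
    "/path/to/geistbert_checkpoints",
    checkpoint_file="checkpoint_best.pt",
)
geistbert.eval()
print(geistbert.fill_mask("Die Hauptstadt von Deutschland ist <mask>.", topk=3))
```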
|
|
|
|
|
|