---
license: mit
language:
- de
base_model:
- TUM/GottBERT_filtered_base_best
---

# GeistBERT
GeistBERT is a **German language model** trained on a **largely deduplicated corpus** including **OSCAR23, OPUS, and MC4**. It builds on **GottBERT** and introduces **Whole Word Masking (WWM)** to improve contextual language representation. The model achieves **state-of-the-art (SOTA) performance** on multiple German NLP benchmarks.

GeistBERT comes in **three versions**:
- GeistBERT (Standard, this repo)
- [GeistBERT-Nyströmformer](https://huggingface.co/GeistBERT/GeistBERT_base_nystromformer) (Efficient self-attention)
- [GeistBERT-Longformer](https://huggingface.co/GeistBERT/GeistBERT_base_longformer) (Extended context length)
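
A minimal usage sketch with the Hugging Face `transformers` library, assuming the standard RoBERTa-style tokenizer and masked-language-modeling head exposed by this repo (the example sentence is just an illustration):

```python
from transformers import pipeline

# Load GeistBERT (this repo) as a fill-mask pipeline; downloads tokenizer and weights from the Hub.
fill_mask = pipeline("fill-mask", model="GeistBERT/GeistBERT_base")

# Use the tokenizer's own mask token so the example works regardless of the exact special tokens.
masked = f"Die Hauptstadt von Deutschland ist {fill_mask.tokenizer.mask_token}."
for prediction in fill_mask(masked):
    print(prediction["token_str"], round(prediction["score"], 3))
```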

## Training Data
GeistBERT was trained on a **diverse German corpus** combining:
- **OSCAR23, OPUS, and MC4** (largely deduplicated)
- **German Wikipedia**
- **OpenLegalData**
- **Europarl, EUbookshop, ECB, and EuroPat**
- **OpenSubtitles and TildeMODEL**

The dataset amounts to **approximately 1.3T tokens**, shuffled for improved variance.

## Training Procedure
### Hardware
- Training was conducted on **multiple GPUs**, including **NVIDIA RTX 3090 (24 GB VRAM)** cards.
- **Gradient accumulation** was used for the **Longformer** variant, which requires **more VRAM** than Nyströmformer and RoBERTa; the latter two fit on a single RTX 3090.

### Hyperparameters
| Parameter          | Value                  |
|--------------------|------------------------|
| **Model Architecture** | RoBERTa (Base)      |
| **Batch Size**     | 8,000                  |
| **Training Steps** | 100k                   |
| **Weight Initialization** | [GottBERT filtered base](https://huggingface.co/TUM/GottBERT_filtered_base_best) |
| **Warmup Iterations** | 10k                  |
| **Peak Learning Rate** | 0.0007              |
| **Learning Rate Decay** | Polynomial to zero |
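
The warmup and decay settings in the table can be written out explicitly. A minimal sketch, assuming a degree-1 polynomial (i.e. linear) decay to zero, which is the common default for this scheduler; the `power` parameter is an assumption, not stated above:

```python
def learning_rate(step, peak_lr=7e-4, warmup=10_000, total=100_000, power=1.0):
    """Linear warmup to the peak LR, then polynomial decay to zero at the final update."""
    if step < warmup:
        return peak_lr * step / warmup               # warmup phase
    progress = (step - warmup) / (total - warmup)    # fraction of the decay phase completed
    return peak_lr * (1.0 - progress) ** power       # reaches zero at step == total

# Example: LR at a few points of the 100k-step run.
for s in (0, 5_000, 10_000, 55_000, 100_000):
    print(s, f"{learning_rate(s):.6f}")
```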

## Performance
GeistBERT achieves **SOTA results** on multiple tasks:
- **NER**: CoNLL 2003, GermEval 2014
- **Text Classification**: GermEval 2018 (coarse & fine), 10kGNAD
- **NLI**: German subset of XNLI

Metrics:
- **NER and Text Classification**: F1 Score
- **NLI**: Accuracy

Details:
- **Bold** values indicate the best-performing model within one architecture class (base, large); <ins>underscored</ins> values mark the second best.

| Model                               | Accuracy NLI | GermEval\_14 F1 | CoNLL F1 | Coarse F1 | Fine F1 | 10kGNAD F1 |
|-------------------------------------|--------------|----------------|----------|-----------|---------|------------|
| [GeistBERT](https://huggingface.co/GeistBERT/GeistBERT_base)                | **82.67**   | **88.47**   | _86.17_  | _79.67_  | 66.42   | **90.89**  |
| [GeistBERT-Nyströmformer](https://huggingface.co/GeistBERT/GeistBERT_base_nystromformer) | 82.50       | 88.23          | 85.76    | 79.17     | **78.57** | 90.33      |
| [GeistBERT-Longformer](https://huggingface.co/GeistBERT/GeistBERT_base_longformer) | _82.51_     | _88.45_        | **86.71** | **80.56** | _66.76_ | 90.32      |
| [GottBERT_base_best](https://huggingface.co/TUM/GottBERT_base_best)                | 80.82       | 87.55          | 85.93  | 78.17     | 53.30   | 89.64      |
| [GottBERT_base_last](https://huggingface.co/TUM/GottBERT_base_last)                | 81.04       | 87.48          | 85.61    | 78.18   | 53.92 | 90.27  |
| [GottBERT_filtered_base_best](https://huggingface.co/TUM/GottBERT_filtered_base_best)         | 80.56       | 87.57 | 86.14 | 78.65 | 52.82   | 89.79      |
| [GottBERT_filtered_base_last](https://huggingface.co/TUM/GottBERT_filtered_base_last)         | 80.74       | 87.59      | 85.66    | 78.08     | 52.39   | 89.92      |
| GELECTRA_base                   | 81.70   | 86.91          | 85.37    | 77.26     | 50.07   | 89.02      |
| GBERT_base                        | 80.06       | 87.24          | 85.16    | 77.37     | 51.51   | 90.30  |
| dbmdzBERT                          | 68.12       | 86.82          | 85.15    | 77.46     | 52.07   | _90.34_  |
| GermanBERT                        | 78.16       | 86.53          | 83.87    | 74.81     | 47.78   | 90.18      |
| XLM-R_base                        | 79.76       | 86.14          | 84.46    | 77.13     | 50.54   | 89.81      |
| mBERT                              | 77.03       | 86.67          | 83.18    | 73.54     | 48.32   | 88.90      |
| [GottBERT_large](https://huggingface.co/TUM/GottBERT_large)                | 82.46       | 88.20          | _86.78_  | 79.40     | 54.61   | 90.24      |
| [GottBERT_filtered_large_best](https://huggingface.co/TUM/GottBERT_filtered_large_best)     | 83.31       | 88.13          | 86.30    | 79.32     | 54.70   | 90.31      |
| [GottBERT_filtered_large_last](https://huggingface.co/TUM/GottBERT_filtered_large_last)     | 82.79       | _88.27_ | 86.28    | 78.96     | 54.72   | 90.17      |
| GELECTRA_large                | **86.33**   | _88.72_ | _86.78_  | **81.28** | _56.17_ | **90.97**  |
| GBERT_large                      | _84.21_     | _88.72_        | **87.19** | _80.84_   | **57.37** | _90.74_   |
| XLM-R_large                      | 84.07       | **88.83**      | 86.54    | 79.05     | 55.06   | 90.17      |


## Intended Use
This model is designed for **German NLP tasks**, including:
- **Text classification** (see the fine-tuning sketch after this list)
- **Named Entity Recognition (NER)**
- **Machine Translation Pre-training**
- **Document Understanding**
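
As an illustration of downstream use, a minimal sketch for loading GeistBERT with a classification head via `transformers`; the label count and example sentence are placeholders, not the settings used for the benchmark results above:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "GeistBERT/GeistBERT_base"  # this repo
tokenizer = AutoTokenizer.from_pretrained(model_name)

# num_labels is task-specific (e.g. 2 for GermEval 2018 coarse); the classification head is
# randomly initialized and needs fine-tuning on labeled German data before use.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Forward pass on a toy example; fine-tuning would update the head (and optionally the encoder).
inputs = tokenizer("Das ist ein Beispielsatz.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))
```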

## Limitations
- Trained on **unfiltered data**, meaning some **redundant or lower-quality samples** may be present.
- Longformer **requires more VRAM**, making it less accessible for smaller GPU setups.
- While deduplication was applied to **specific subcorpora**, the full corpus **was not manually curated**.

## Fairseq Checkpoints
Get the fairseq checkpoints [here](https://drive.proton.me/urls/P83GCPNM40#2f0f87XEIrQP).
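
A sketch for loading the fairseq checkpoint directly with fairseq's RoBERTa wrapper; the directory and file names below are placeholders for wherever the downloaded archive is extracted, and the directory is assumed to contain the dictionary and BPE files the checkpoint expects:

```python
from fairseq.models.roberta import RobertaModel

# Placeholder path: point this at the extracted checkpoint directory.
geistbert = RobertaModel.from_pretrained(
    "/path/to/geistbert_checkpoint_dir",
    checkpoint_file="model.pt",
)
geistbert.eval()  # disable dropout for inference

tokens = geistbert.encode("GeistBERT ist ein deutsches Sprachmodell.")
features = geistbert.extract_features(tokens)  # last-layer token representations
print(features.shape)
```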