---
license: mit
language:
- de
base_model:
- TUM/GottBERT_filtered_base_best
---
# GeistBERT

GeistBERT is a German language model trained on a largely deduplicated corpus including OSCAR23, OPUS, and MC4. It builds on GottBERT and introduces Whole Word Masking (WWM) to improve contextual language representation. The model achieves state-of-the-art (SOTA) performance on multiple German NLP benchmarks.
GeistBERT comes in three versions:
- GeistBERT (Standard, this repo)
- GeistBERT-Nyströmformer (Efficient self-attention)
- GeistBERT-Longformer (Extended context length)
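For quick experimentation, the standard variant can be queried through the `transformers` fill-mask pipeline. The following is a minimal sketch: the repo id `TUM/GeistBERT_base` is only a placeholder for this repository's actual Hub id, and `<mask>` is the RoBERTa-style mask token.

```python
# Minimal fill-mask sketch for GeistBERT (RoBERTa-style masked language model).
# NOTE: "TUM/GeistBERT_base" is a placeholder repo id, not confirmed by this card.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="TUM/GeistBERT_base")

# RoBERTa-based models use "<mask>" as the mask token.
for candidate in fill_mask("Die Hauptstadt von Deutschland ist <mask>."):
    print(candidate["token_str"], round(candidate["score"], 3))
```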
## Training Data

GeistBERT was trained on a diverse German corpus combining:
- OSCAR23, OPUS, and MC4 (largely deduplicated)
- German Wikipedia
- OpenLegalData
- Europarl, EUbookshop, ECB, and EuroPat
- OpenSubtitles and TildeMODEL
The dataset amounts to approximately 1.3T tokens, shuffled for improved variance.
## Training Procedure

### Hardware

- Training was conducted on multiple GPUs, including NVIDIA RTX 3090 (24 GB VRAM).
- Gradient accumulation was used for the Longformer variant, which requires more VRAM; the Nyströmformer and standard RoBERTa variants each fit on a single RTX 3090.

### Hyperparameters
| Parameter | Value |
|---|---|
| Model Architecture | RoBERTa (Base) |
| Batch Size | 8,000 |
| Training Steps | 100k |
| Weight Initialization | GottBERT filtered base |
| Warmup Iterations | 10k |
| Peak Learning Rate | 0.0007 |
| Learning Rate Decay | Polynomial to zero |
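As a rough illustration of the schedule in the table (linear warmup over 10k updates to the peak learning rate of 7e-4, then polynomial decay to zero at 100k updates), the sketch below computes the per-step learning rate. The decay power of 1.0 (i.e. linear decay) is an assumption, not stated in this card.

```python
# Illustrative sketch of the warmup + polynomial-decay-to-zero schedule from the table.
# The decay power (1.0 = linear) is an assumption, not stated in the card.
def learning_rate(step, peak_lr=7e-4, warmup=10_000, total=100_000, power=1.0):
    if step < warmup:
        return peak_lr * step / warmup                # linear warmup to the peak LR
    remaining = (total - step) / (total - warmup)     # fraction of the decay phase left
    return peak_lr * max(remaining, 0.0) ** power     # polynomial decay to zero

for s in (0, 5_000, 10_000, 55_000, 100_000):
    print(f"step {s:>7}: lr = {learning_rate(s):.6f}")
```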
## Performance
GeistBERT achieves SOTA results on multiple tasks:
- NER: CoNLL 2003, GermEval 2014
- Text Classification: GermEval 2018 (coarse & fine), 10kGNAD
- NLI: German subset of XNLI
Metrics:
- NER and Text Classification: F1 Score
- NLI: Accuracy
Details:
- Bold values indicate the best-performing model within an architecture class (base, large); underscored values indicate the second best.
| Model | NLI Accuracy (XNLI) | GermEval 2014 F1 | CoNLL 2003 F1 | GermEval 2018 Coarse F1 | GermEval 2018 Fine F1 | 10kGNAD F1 |
|---|---|---|---|---|---|---|
| GeistBERT | 82.67 | 88.47 | 86.17 | 79.67 | 66.42 | 90.89 |
| GeistBERT-Nyströmformer | 82.50 | 88.23 | 85.76 | 79.17 | 78.57 | 90.33 |
| GeistBERT-Longformer | 82.51 | 88.45 | 86.71 | 80.56 | 66.76 | 90.32 |
| GottBERT_base_best | 80.82 | 87.55 | 85.93 | 78.17 | 53.30 | 89.64 |
| GottBERT_base_last | 81.04 | 87.48 | 85.61 | 78.18 | 53.92 | 90.27 |
| GottBERT_filtered_base_best | 80.56 | 87.57 | 86.14 | 78.65 | 52.82 | 89.79 |
| GottBERT_filtered_base_last | 80.74 | 87.59 | 85.66 | 78.08 | 52.39 | 89.92 |
| GELECTRA_base | 81.70 | 86.91 | 85.37 | 77.26 | 50.07 | 89.02 |
| GBERT_base | 80.06 | 87.24 | 85.16 | 77.37 | 51.51 | 90.30 |
| dbmdzBERT | 68.12 | 86.82 | 85.15 | 77.46 | 52.07 | 90.34 |
| GermanBERT | 78.16 | 86.53 | 83.87 | 74.81 | 47.78 | 90.18 |
| XLM-R_base | 79.76 | 86.14 | 84.46 | 77.13 | 50.54 | 89.81 |
| mBERT | 77.03 | 86.67 | 83.18 | 73.54 | 48.32 | 88.90 |
| GottBERT_large | 82.46 | 88.20 | 86.78 | 79.40 | 54.61 | 90.24 |
| GottBERT_filtered_large_best | 83.31 | 88.13 | 86.30 | 79.32 | 54.70 | 90.31 |
| GottBERT_filtered_large_last | 82.79 | 88.27 | 86.28 | 78.96 | 54.72 | 90.17 |
| GELECTRA_large | 86.33 | 88.72 | 86.78 | 81.28 | 56.17 | 90.97 |
| GBERT_large | 84.21 | 88.72 | 87.19 | 80.84 | 57.37 | 90.74 |
| XLM-R_large | 84.07 | 88.83 | 86.54 | 79.05 | 55.06 | 90.17 |
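To obtain numbers of this kind, the model is fine-tuned per task and scored with the metric listed above. The sketch below shows a minimal sequence-classification fine-tuning loop with an F1 metric, using toy in-memory data in place of the GermEval 2018 coarse split; the repo id and the macro averaging are assumptions made for illustration.

```python
# Hedged fine-tuning sketch for a GermEval 2018-style coarse classification task.
# Assumptions: repo id "TUM/GeistBERT_base", macro-averaged F1, toy in-memory data.
import numpy as np
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_ID = "TUM/GeistBERT_base"  # placeholder, not confirmed by this card

# Toy stand-in for the GermEval 2018 coarse split (0 = OTHER, 1 = OFFENSE).
data = Dataset.from_dict({
    "text": ["Das ist ein harmloser Satz.", "Das ist eine üble Beleidigung!"],
    "label": [0, 1],
})

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

data = data.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds, average="macro")}  # F1, as in the table above

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1, per_device_train_batch_size=8),
    train_dataset=data,
    eval_dataset=data,           # toy data only; use the real dev/test split in practice
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
```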
## Intended Use
This model is designed for German NLP tasks, including:
- Text classification
- Named Entity Recognition (NER)
- Machine Translation Pre-training
- Document Understanding
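The released checkpoint is an encoder only, so each downstream use attaches its own head. A hedged loading sketch follows, again with a placeholder repo id; the freshly initialized heads still require fine-tuning on task data.

```python
# Hedged sketch: attaching task heads to the GeistBERT encoder for the uses listed above.
# "TUM/GeistBERT_base" is a placeholder repo id; the new heads are randomly initialized
# and must be fine-tuned before use.
from transformers import (AutoModel, AutoModelForSequenceClassification,
                          AutoModelForTokenClassification, AutoTokenizer)

MODEL_ID = "TUM/GeistBERT_base"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
encoder = AutoModel.from_pretrained(MODEL_ID)                                     # embeddings / document understanding
ner = AutoModelForTokenClassification.from_pretrained(MODEL_ID, num_labels=9)     # e.g. a CoNLL-style NER tag set
clf = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)  # text classification
```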
## Limitations
- Trained on unfiltered data, meaning some redundant or lower-quality samples may be present.
- Longformer requires more VRAM, making it less accessible for smaller GPU setups.
- While deduplication was applied to specific subcorpora, the full corpus was not manually curated.
## Fairseq Checkpoints
Get the fairseq checkpoints here.
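For fairseq users, a checkpoint directory can be loaded with fairseq's `RobertaModel.from_pretrained`. The sketch below uses placeholder paths; the dictionary and BPE/tokenizer files shipped with the checkpoint archive are assumed to be in the same directory.

```python
# Hedged sketch for loading a GeistBERT fairseq checkpoint. Paths are placeholders;
# the dict.txt and BPE files shipped with the checkpoint are assumed to sit alongside it.
from fairseq.models.roberta import RobertaModel

geistbert = RobertaModel.from_pretrained(
    "path/to/geistbert_checkpoint_dir",   # directory containing model.pt and dict.txt
    checkpoint_file="model.pt",
)
geistbert.eval()  # disable dropout for inference

tokens = geistbert.encode("Die Hauptstadt von Deutschland ist Berlin.")
features = geistbert.extract_features(tokens)   # (1, seq_len, hidden) representations
print(features.shape)
```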