---
language:
- de
base_model:
- TUM/GottBERT_filtered_base_best
---

# GeistBERT

GeistBERT is a **German language model** trained on a **largely deduplicated corpus** including **OSCAR23, OPUS, and MC4**. It builds on **GottBERT** while introducing **Whole Word Masking (WWM)** to improve contextual language representation. The model achieves **state-of-the-art (SOTA) performance** on multiple German NLP benchmarks.

GeistBERT comes in **three versions**:
- GeistBERT (Standard, this repo; see the usage sketch below)
- [GeistBERT-Nyströmformer](https://huggingface.co/GeistBERT/GeistBERT_base_nystromformer) (Efficient self-attention)
- [GeistBERT-Longformer](https://huggingface.co/GeistBERT/GeistBERT_base_longformer) (Extended context length)
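
A minimal usage sketch with the Hugging Face `transformers` library. The model id `GeistBERT/GeistBERT_base` is a placeholder for this repository, and the snippet assumes the checkpoint loads as a RoBERTa-style masked language model, as its GottBERT lineage suggests.

```python
from transformers import pipeline

# Placeholder model id; substitute the actual id of this repository.
model_id = "GeistBERT/GeistBERT_base"

# RoBERTa-style checkpoints can be used directly in a fill-mask pipeline.
fill_mask = pipeline("fill-mask", model=model_id)

# RoBERTa-based tokenizers use "<mask>" as the mask token.
for prediction in fill_mask("Die Hauptstadt von Deutschland ist <mask>."):
    print(prediction["token_str"], round(prediction["score"], 3))
```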

## Training Data
GeistBERT was trained on a **diverse German corpus** combining:
- **OSCAR23, OPUS, and MC4** (largely deduplicated)
- **German Wikipedia**
- **OpenLegalData**
- **Europarl, EUbookshop, ECB, and EuroPat**
- **OpenSubtitles and TildeMODEL**

The combined dataset amounts to **approximately 1.3T tokens** and was shuffled to improve sample variance during training.

## Training Procedure
### Hardware
- Training was conducted on **multiple GPUs**, including **NVIDIA RTX 3090 (24 GB VRAM)** cards.
- **Gradient accumulation** was used for **Longformer**, which requires **more VRAM** than Nyströmformer and RoBERTa, both of which fit on a single RTX 3090 (see the sketch below).
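
For readers unfamiliar with the technique, here is a minimal, generic PyTorch sketch of gradient accumulation (the actual pretraining used fairseq; this only illustrates the idea): gradients from several micro-batches are summed before a single optimizer step, emulating a larger effective batch size on limited VRAM.

```python
import torch
from torch import nn

# Toy model and data, purely for illustration.
model = nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
loss_fn = nn.CrossEntropyLoss()

accumulation_steps = 4  # effective batch size = 4 x micro-batch size
micro_batches = [(torch.randn(8, 16), torch.randint(0, 2, (8,))) for _ in range(8)]

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(micro_batches):
    loss = loss_fn(model(inputs), labels)
    # Scale the loss so the accumulated gradient matches one large-batch step.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```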

### Hyperparameters
- Training steps: **100k**
- Learning rate: **2e-4**
- Warmup steps: **10k**
- Batch sizes: **48 / 64** (using gradient accumulation for Longformer)
- Optimizer: **AdamW**
- Weight Initialization: **GottBERT** (see the sketch below for how these settings fit together)
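
A rough illustration of how these settings fit together in PyTorch/`transformers`. The actual pretraining used fairseq, and the decay schedule is not stated here, so the linear decay below is an assumption; the GottBERT initialization reuses the base model id from this card's metadata.

```python
import torch
from transformers import AutoModelForMaskedLM, get_linear_schedule_with_warmup

# Initialize weights from GottBERT, as listed above.
model = AutoModelForMaskedLM.from_pretrained("TUM/GottBERT_filtered_base_best")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

# 10k warmup steps out of 100k total updates; linear decay is an assumption.
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=10_000, num_training_steps=100_000
)
```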

## Performance
GeistBERT achieves **SOTA results** on multiple tasks:
- **Sentiment classification** (GermEval 2018)
- **News categorization** (10kGNAD)
- **Named Entity Recognition (NER)**
- **Machine Translation Adaptation**

## Intended Use
This model is designed for **German NLP tasks**, including:
- **Text classification** (a fine-tuning sketch follows this list)
- **Named Entity Recognition (NER)**
- **Machine Translation Pre-training**
- **Document Understanding**
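
A hedged fine-tuning sketch for text classification with the Hugging Face `Trainer`. The model id is the same placeholder as above, and the two-example dataset merely stands in for a real German classification corpus such as GermEval 2018; none of this reflects the configuration behind the reported benchmarks.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Placeholder model id; substitute the actual id of this repository.
model_id = "GeistBERT/GeistBERT_base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Tiny in-memory stand-in for a German sentiment dataset (e.g. GermEval 2018).
raw = Dataset.from_dict({
    "text": ["Das Produkt ist großartig.", "Der Service war enttäuschend."],
    "label": [1, 0],
})
train_ds = raw.map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length", max_length=64)
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="geistbert-classification", num_train_epochs=1),
    train_dataset=train_ds,
)
trainer.train()
```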

## Limitations
- Trained on **unfiltered data**, meaning some **redundant or lower-quality samples** may be present.
- Longformer **requires more VRAM**, making it less accessible for smaller GPU setups.
- While deduplication was applied to **specific subcorpora**, the full corpus **was not manually curated**.

## Fairseq Checkpoints
Get the fairseq checkpoints [here](https://drive.proton.me/urls/P83GCPNM40#2f0f87XEIrQP).
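
If you work with the fairseq checkpoints directly, they can typically be loaded through fairseq's RoBERTa wrapper. The paths below are placeholders for wherever you extract the download, and the snippet assumes the archive bundles the dictionary and BPE files fairseq expects.

```python
from fairseq.models.roberta import RobertaModel

# Placeholder paths: point these at the extracted checkpoint directory and file.
roberta = RobertaModel.from_pretrained(
    "/path/to/geistbert_checkpoints",
    checkpoint_file="model.pt",
)
roberta.eval()

# Standard fairseq RoBERTa usage: encode a sentence and extract features.
tokens = roberta.encode("GeistBERT ist ein deutsches Sprachmodell.")
features = roberta.extract_features(tokens)
print(features.shape)  # (1, sequence_length, hidden_size)
```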