Raphael Scheible committed on
Commit 3a8c121 · verified · 1 Parent(s): 74cc6a7

Update README.md

Files changed (1)
  1. README.md +55 -1
README.md CHANGED
@@ -4,4 +4,58 @@ language:
  - de
  base_model:
  - TUM/GottBERT_filtered_base_best
- ---
+ ---
+
+ # GeistBERT
+ GeistBERT is a **German language model** trained on a **largely deduplicated corpus** including **OSCAR23, OPUS, and MC4**. It builds on **GottBERT** while introducing **Whole Word Masking (WWM)** to improve contextual language representation. The model achieves **state-of-the-art (SOTA) performance** on multiple German NLP benchmarks.
+
+ GeistBERT comes in **three versions**:
+ - GeistBERT (Standard, this repo)
+ - [GeistBERT-Nyströmformer](https://huggingface.co/GeistBERT/GeistBERT_base_nystromformer) (Efficient self-attention)
+ - [GeistBERT-Longformer](https://huggingface.co/GeistBERT/GeistBERT_base_longformer) (Extended context length)
+
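+ A minimal usage sketch with the Hugging Face `transformers` fill-mask pipeline; the repo ID `GeistBERT/GeistBERT_base` below is an assumption for illustration and should be replaced with this repository's actual model ID.
+
+ ```python
+ from transformers import pipeline
+
+ # Hypothetical repo ID for the standard model; replace with this repo's actual ID.
+ fill_mask = pipeline("fill-mask", model="GeistBERT/GeistBERT_base")
+
+ # RoBERTa-style models use <mask> as the mask token.
+ print(fill_mask("Die Hauptstadt von Deutschland ist <mask>."))
+ ```
+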
+ ## Training Data
+ GeistBERT was trained on a **diverse German corpus** combining:
+ - **OSCAR23, OPUS, and MC4** (largely deduplicated)
+ - **German Wikipedia**
+ - **OpenLegalData**
+ - **Europarl, EUbookshop, ECB, and EuroPat**
+ - **OpenSubtitles and TildeMODEL**
+
+ The dataset amounts to **approximately 1.3T tokens** and was shuffled for improved variance.
+
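+ As a rough illustration only, such a mixed corpus could be streamed, interleaved, and shuffled with the `datasets` library as sketched below; the dataset IDs are placeholders and this is not the exact source list or pipeline used for GeistBERT.
+
+ ```python
+ from datasets import interleave_datasets, load_dataset
+
+ # Placeholder dataset IDs; not the exact GeistBERT sources or preprocessing.
+ oscar = load_dataset("oscar-corpus/OSCAR-2301", "de", split="train", streaming=True)
+ wiki = load_dataset("wikimedia/wikipedia", "20231101.de", split="train", streaming=True)
+
+ # Interleave the subcorpora and shuffle with a buffer for better sample variance.
+ corpus = interleave_datasets([oscar, wiki]).shuffle(seed=42, buffer_size=10_000)
+ ```
+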
+ ## Training Procedure
+ ### Hardware
+ - Training was conducted on **multiple GPUs**, including **NVIDIA RTX 3090 (24 GB VRAM)** cards.
+ - **Gradient accumulation** was used for the **Longformer** variant, which requires **more VRAM** than Nyströmformer and RoBERTa; the latter two fit on a single RTX 3090.
+
+ ### Hyperparameters
+ - Training steps: **100k**
+ - Learning rate: **2e-4**
+ - Warmup steps: **10k**
+ - Batch sizes: **48 / 64** (using gradient accumulation for Longformer)
+ - Optimizer: **AdamW**
+ - Weight initialization: **GottBERT**
+
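+ Purely as an illustration, the sketch below maps these hyperparameters onto Hugging Face `TrainingArguments` together with a Whole Word Masking data collator; GeistBERT itself was pretrained with fairseq (see the checkpoints below), and the repo ID and 15% masking rate are assumptions, not values from this card.
+
+ ```python
+ from transformers import (
+     AutoTokenizer,
+     DataCollatorForWholeWordMask,
+     TrainingArguments,
+ )
+
+ # Illustration only: the actual pretraining used fairseq, not this API.
+ tokenizer = AutoTokenizer.from_pretrained("GeistBERT/GeistBERT_base")  # hypothetical repo ID
+
+ # Whole Word Masking: all sub-word pieces of a word are masked together.
+ # The 15% masking rate is the usual RoBERTa default, not stated in this card.
+ collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)
+
+ args = TrainingArguments(
+     output_dir="geistbert-pretraining",
+     max_steps=100_000,               # 100k training steps
+     learning_rate=2e-4,
+     warmup_steps=10_000,             # 10k warmup steps
+     per_device_train_batch_size=48,  # 48 / 64 depending on the variant
+     gradient_accumulation_steps=1,   # >1 for the Longformer variant
+     optim="adamw_torch",             # AdamW
+ )
+ ```
+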
+ ## Performance
+ GeistBERT achieves **SOTA results** on multiple tasks:
+ - **Sentiment classification** (GermEval 2018)
+ - **News categorization** (10kGNAD)
+ - **Named Entity Recognition (NER)**
+ - **Machine Translation Adaptation**
+
+ ## Intended Use
+ This model is designed for **German NLP tasks**, including:
+ - **Text classification**
+ - **Named Entity Recognition (NER)**
+ - **Machine Translation Pre-training**
+ - **Document Understanding**
+
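+ For fine-tuning, a task-specific head can be attached to the pretrained encoder; in the sketch below the repo ID and the binary label count are illustrative assumptions.
+
+ ```python
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ # Hypothetical repo ID; replace with this repository's actual model ID.
+ model_id = "GeistBERT/GeistBERT_base"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ # Adds a randomly initialized classification head on top of the pretrained
+ # encoder; num_labels=2 is just an example (e.g. binary sentiment labels).
+ model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
+ ```
+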
+ ## Limitations
+ - Trained on **unfiltered data**, meaning some **redundant or lower-quality samples** may be present.
+ - Longformer **requires more VRAM**, making it less accessible for smaller GPU setups.
+ - While deduplication was applied to **specific subcorpora**, the full corpus **was not manually curated**.
+
+ ## Fairseq Checkpoints
+ Get the fairseq checkpoints [here](https://drive.proton.me/urls/P83GCPNM40#2f0f87XEIrQP).
+
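+ A minimal sketch for loading such a checkpoint with fairseq's RoBERTa interface; the directory and file names below are assumptions about how the downloaded archive is laid out.
+
+ ```python
+ from fairseq.models.roberta import RobertaModel
+
+ # Assumed checkpoint layout; adjust the directory and filename as needed.
+ geistbert = RobertaModel.from_pretrained(
+     "geistbert_checkpoint_dir",
+     checkpoint_file="model.pt",
+ )
+ geistbert.eval()
+
+ # Fill-mask via fairseq's RoBERTa hub interface.
+ print(geistbert.fill_mask("Die Hauptstadt von Deutschland ist <mask>.", topk=3))
+ ```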