Raphael Scheible-Schmitt committed
Commit 91fbf3b · verified · 1 Parent(s): eca5322

Update README.md

Files changed (1): README.md (+2 -11)
README.md CHANGED

@@ -7,12 +7,7 @@ base_model:
 ---
 
 # GeistBERT
-GeistBERT is a **German language model** trained on a **for the most part deduplicated corpus** including **OSCAR23, OPUS, and MC4**. It builds on **GottBERT** while introducing **Whole Word Masking (WWM)** to improve contextual language representation. The model achieves **state-of-the-art (SOTA) performance** on multiple German NLP benchmarks.
-
-GeistBERT comes in **three versions**:
-- GeistBERT (Standard, this repo)
-- [GeistBERT-Nyströmformer](https://huggingface.co/GeistBERT/GeistBERT_base_nystromformer) (Efficient self-attention)
-- [GeistBERT-Longformer](https://huggingface.co/GeistBERT/GeistBERT_base_longformer) (Extended context length)
+GeistBERT is a **German language model** trained on a **largely deduplicated corpus** including **OSCAR23, OPUS, and MC4**. It builds on **GottBERT** while introducing **Whole Word Masking (WWM)** to improve contextual language representation. Achieving state-of-the-art results among base models, it also performs competitively with larger models on several German NLP benchmarks.
 
 ## Training Data
 GeistBERT was trained on a **diverse German corpus** combining:
@@ -27,7 +22,6 @@ The dataset amounts to **approximately 1.3T tokens**, shuffled for improved vari
 ## Training Procedure
 ### Hardware
 - Training was conducted on **multiple GPUs**, including **NVIDIA RTX 3090 (24GB VRAM)**.
-- **Gradient accumulation** was used for **Longformer**, requiring **more VRAM** compared to Nyströmformer and RoBERTa, which fit on a single RTX 3090.
 
 ### Hyperparameters
 | Parameter | Value |
@@ -56,9 +50,7 @@ Details:
 
 | Model | Accuracy NLI | GermEval\_14 F1 | CoNLL F1 | Coarse F1 | Fine F1 | 10kGNAD F1 |
 |-------------------------------------|--------------|----------------|----------|-----------|---------|------------|
-| [GeistBERT](https://huggingface.co/GeistBERT/GeistBERT_base) | **82.67** | **88.47** | _86.17_ | _79.67_ | 66.42 | **90.89** |
-| [GeistBERT-Nyströmformer](https://huggingface.co/GeistBERT/GeistBERT_base_nystromformer) | 82.50 | 88.23 | 85.76 | 79.17 | **78.57** | 90.33 |
-| [GeistBERT-Longformer](https://huggingface.co/GeistBERT/GeistBERT_base_longformer) | _82.51_ | _88.45_ | **86.71** | **80.56** | _66.76_ | 90.32 |
+| [GeistBERT](https://huggingface.co/GeistBERT/GeistBERT_base) | **82.67** | **88.47** | **86.17** | **79.67** | **66.42** | **90.89** |
 | [GottBERT_base_best](https://huggingface.co/TUM/GottBERT_base_best) | 80.82 | 87.55 | 85.93 | 78.17 | 53.30 | 89.64 |
 | [GottBERT_base_last](https://huggingface.co/TUM/GottBERT_base_last) | 81.04 | 87.48 | 85.61 | 78.18 | 53.92 | 90.27 |
 | [GottBERT_filtered_base_best](https://huggingface.co/TUM/GottBERT_filtered_base_best) | 80.56 | 87.57 | 86.14 | 78.65 | 52.82 | 89.79 |
@@ -86,7 +78,6 @@ This model is designed for **German NLP tasks**, including:
 
 ## Limitations
 - Trained on **unfiltered data**, meaning some **redundant or lower-quality samples** may be present.
-- Longformer **requires more VRAM**, making it less accessible for smaller GPU setups.
 - While deduplication was applied to **specific subcorpora**, the full corpus **was not manually curated**.
 
 ## Fairseq Checkpoints
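The README above describes GeistBERT as a RoBERTa-style masked language model, so a minimal usage sketch may help. It assumes the Hugging Face model id `GeistBERT/GeistBERT_base` (taken from the benchmark-table links) and the stock `transformers` fill-mask pipeline; the `<mask>` token is assumed to follow RoBERTa conventions.

```python
# Minimal usage sketch: masked-token prediction with GeistBERT through the
# Hugging Face transformers fill-mask pipeline. The model id
# GeistBERT/GeistBERT_base is taken from the benchmark-table links; <mask>
# is assumed to be the mask token, following RoBERTa conventions.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="GeistBERT/GeistBERT_base")

# "The capital of Germany is <mask>."
for prediction in fill_mask("Die Hauptstadt von Deutschland ist <mask>."):
    print(f"{prediction['token_str']!r}: {prediction['score']:.3f}")
```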
 
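The headline modeling change, Whole Word Masking, masks every sub-word piece of a sampled word together instead of sampling pieces independently. The following hand-rolled sketch illustrates the idea with a fast `transformers` tokenizer; it is illustrative only, not the fairseq pre-training setup actually used, and the 15% rate is the common RoBERTa default assumed for the example.

```python
# Illustrative sketch of Whole Word Masking (WWM): when a word is sampled for
# masking, every sub-word piece belonging to it is masked together. This is
# NOT the fairseq pre-training code actually used for GeistBERT, and the 15%
# rate is the common RoBERTa default, assumed here for illustration.
import random

from transformers import AutoTokenizer

# Assumes the repo ships a fast tokenizer, so that word_ids() is available.
tokenizer = AutoTokenizer.from_pretrained("GeistBERT/GeistBERT_base")
enc = tokenizer("Bundesverfassungsgericht entscheidet heute.")

# Group token positions by the word they belong to (None marks special tokens).
word_to_positions = {}
for pos, word_id in enumerate(enc.word_ids()):
    if word_id is not None:
        word_to_positions.setdefault(word_id, []).append(pos)

input_ids = list(enc["input_ids"])
for positions in word_to_positions.values():
    if random.random() < 0.15:       # sample whole words, not single pieces
        for pos in positions:        # mask every piece of the sampled word
            input_ids[pos] = tokenizer.mask_token_id

print(tokenizer.decode(input_ids))
```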