Raphael Scheible-Schmitt committed: Update README.md

README.md CHANGED
@@ -7,12 +7,7 @@ base_model:
 ---
 
 # GeistBERT
-GeistBERT is a **German language model** trained on a **largely deduplicated corpus** including **OSCAR23, OPUS, and MC4**. It builds on **GottBERT** while introducing **Whole Word Masking (WWM)** to improve contextual language representation.
-
-GeistBERT comes in **three versions**:
-- GeistBERT (Standard, this repo)
-- [GeistBERT-Nyströmformer](https://huggingface.co/GeistBERT/GeistBERT_base_nystromformer) (Efficient self-attention)
-- [GeistBERT-Longformer](https://huggingface.co/GeistBERT/GeistBERT_base_longformer) (Extended context length)
+GeistBERT is a **German language model** trained on a **largely deduplicated corpus** including **OSCAR23, OPUS, and MC4**. It builds on **GottBERT** while introducing **Whole Word Masking (WWM)** to improve contextual language representation. It achieves state-of-the-art results among base models and performs competitively with larger models on several German NLP benchmarks.
 
 ## Training Data
 GeistBERT was trained on a **diverse German corpus** combining:
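For orientation, a minimal usage sketch of the model described above: it assumes the checkpoint is RoBERTa-compatible (as its GottBERT lineage suggests) and loads through the `transformers` fill-mask pipeline under the Hub ID `GeistBERT/GeistBERT_base`; the German example sentence is illustrative only.

```python
# Minimal usage sketch (assumption: the checkpoint is RoBERTa-compatible and
# published on the Hugging Face Hub as "GeistBERT/GeistBERT_base").
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="GeistBERT/GeistBERT_base")

# RoBERTa-style tokenizers use "<mask>" as the mask token.
for prediction in fill_mask("Die Hauptstadt von Deutschland ist <mask>."):
    print(prediction["token_str"], round(prediction["score"], 4))
```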
@@ -27,7 +22,6 @@ The dataset amounts to **approximately 1.3T tokens**, shuffled for improved variability.
 ## Training Procedure
 ### Hardware
 - Training was conducted on **multiple GPUs**, including **NVIDIA RTX 3090 (24GB VRAM)**.
-- **Gradient accumulation** was used for **Longformer**, which requires **more VRAM** than Nyströmformer and RoBERTa; those fit on a single RTX 3090.
 
 ### Hyperparameters
 | Parameter | Value |
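The gradient accumulation mentioned in the removed hardware bullet can be sketched generically as below (plain PyTorch, not the actual fairseq training code; `accumulation_steps`, the model, and the loss are placeholders): gradients from several micro-batches are summed before a single optimizer step, trading wall-clock time for VRAM.

```python
# Generic illustration of gradient accumulation in PyTorch (not the actual
# fairseq training code): gradients from several micro-batches are summed
# before one optimizer step, so a large effective batch fits in limited VRAM.
import torch
import torch.nn.functional as F

def train_epoch(model, loader, optimizer, accumulation_steps=8):
    model.train()
    optimizer.zero_grad()
    for step, (inputs, labels) in enumerate(loader):
        loss = F.cross_entropy(model(inputs), labels)
        (loss / accumulation_steps).backward()  # scale so the accumulated sum matches one large batch
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()       # one parameter update per effective batch
            optimizer.zero_grad()
```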
@@ -56,9 +50,7 @@ Details:
 
 | Model | Accuracy NLI | GermEval\_14 F1 | CoNLL F1 | Coarse F1 | Fine F1 | 10kGNAD F1 |
 |-------------------------------------|--------------|----------------|----------|-----------|---------|------------|
-| [GeistBERT](https://huggingface.co/GeistBERT/GeistBERT_base) | **82.67** | **88.47** |
-| [GeistBERT-Nyströmformer](https://huggingface.co/GeistBERT/GeistBERT_base_nystromformer) | 82.50 | 88.23 | 85.76 | 79.17 | **78.57** | 90.33 |
-| [GeistBERT-Longformer](https://huggingface.co/GeistBERT/GeistBERT_base_longformer) | _82.51_ | _88.45_ | **86.71** | **80.56** | _66.76_ | 90.32 |
+| [GeistBERT](https://huggingface.co/GeistBERT/GeistBERT_base) | **82.67** | **88.47** | **86.17** | **79.67** | **66.42** | **90.89** |
 | [GottBERT_base_best](https://huggingface.co/TUM/GottBERT_base_best) | 80.82 | 87.55 | 85.93 | 78.17 | 53.30 | 89.64 |
 | [GottBERT_base_last](https://huggingface.co/TUM/GottBERT_base_last) | 81.04 | 87.48 | 85.61 | 78.18 | 53.92 | 90.27 |
 | [GottBERT_filtered_base_best](https://huggingface.co/TUM/GottBERT_filtered_base_best) | 80.56 | 87.57 | 86.14 | 78.65 | 52.82 | 89.79 |
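The scores above come from standard downstream fine-tuning. A hedged sketch of such a setup with the `transformers` Trainer is shown below; the tiny in-memory dataset, label count, and hyperparameters are placeholders and do not reproduce the benchmark configuration.

```python
# Hedged fine-tuning sketch with the transformers Trainer; the toy data,
# label count, and hyperparameters are placeholders, not the benchmark setup.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "GeistBERT/GeistBERT_base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Toy data standing in for a German classification corpus such as 10kGNAD.
data = Dataset.from_dict({
    "text": ["Der FC Bayern gewinnt das Spiel.", "Die Inflation steigt weiter."],
    "label": [0, 1],
})
data = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=64),
    batched=True,
)

args = TrainingArguments(output_dir="geistbert-clf", num_train_epochs=1,
                         per_device_train_batch_size=2, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=data).train()
```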
@@ -86,7 +78,6 @@ This model is designed for **German NLP tasks**, including:
 
 ## Limitations
 - Trained on **unfiltered data**, meaning some **redundant or lower-quality samples** may be present.
-- Longformer **requires more VRAM**, making it less accessible for smaller GPU setups.
 - While deduplication was applied to **specific subcorpora**, the full corpus **was not manually curated**.
 
 ## Fairseq Checkpoints
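For the checkpoints referenced in the Fairseq Checkpoints section, loading would presumably go through fairseq's RoBERTa hub interface, sketched below; the directory and file names are placeholders, and the exact BPE/dictionary setup depends on how the checkpoints are packaged.

```python
# Hedged sketch: loading a GeistBERT fairseq checkpoint via fairseq's RoBERTa
# hub interface. Paths are placeholders; the checkpoint directory is assumed
# to contain the .pt file plus the matching dict.txt / BPE files.
from fairseq.models.roberta import RobertaModel

geistbert = RobertaModel.from_pretrained(
    "path/to/geistbert_fairseq",   # placeholder checkpoint directory
    checkpoint_file="model.pt",    # placeholder checkpoint file name
)
geistbert.eval()  # disable dropout for deterministic inference

# Masked-token prediction through the hub interface (RoBERTa-style <mask>).
print(geistbert.fill_mask("Die Hauptstadt von Deutschland ist <mask>.", topk=3))
```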