The dataset amounts to **approximately 1.3T tokens**, shuffled for improved variety.

- **Gradient accumulation** was used for **Longformer**, which requires **more VRAM** than Nyströmformer and RoBERTa; the latter two fit on a single RTX 3090 (see the sketch below).
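
A minimal sketch of the gradient accumulation pattern, assuming PyTorch; the model, optimizer, and data are illustrative placeholders, not the actual GeistBERT training code:

```python
import torch

# Illustrative only: emulate a large effective batch on a single GPU by
# summing gradients over several micro-batches before each optimizer step.
model = torch.nn.Linear(768, 2)                    # placeholder, not Longformer
optimizer = torch.optim.AdamW(model.parameters(), lr=7e-4)
accumulation_steps = 8                             # micro-batches per update

# Synthetic micro-batches purely for demonstration.
micro_batches = [(torch.randn(4, 768), torch.randint(0, 2, (4,)))
                 for _ in range(16)]

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(micro_batches):
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    (loss / accumulation_steps).backward()         # scale so the sum is a mean
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                           # one update per N micro-batches
        optimizer.zero_grad()
```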
### Hyperparameters

| Parameter | Value |
|--------------------|------------------------|
| **Model Architecture** | RoBERTa (Base) |
| **Batch Size** | 8,000 |
| **Training Steps** | 100k |
| **Weight Initialization** | [GottBERT filtered base](https://huggingface.co/TUM/GottBERT_filtered_base_best) |
| **Warmup Iterations** | 10k |
| **Peak Learning Rate** | 0.0007 |
| **Learning Rate Decay** | Polynomial to zero |
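
For intuition, here is a minimal sketch of the schedule these values imply, assuming linear warmup and a polynomial decay power of 1 (the actual degree is not stated in this section):

```python
def learning_rate(step: int,
                  peak_lr: float = 0.0007,   # peak learning rate from the table
                  warmup: int = 10_000,      # warmup iterations
                  total: int = 100_000,      # training steps
                  power: float = 1.0) -> float:
    """Linear warmup, then polynomial decay to zero (power is assumed)."""
    if step < warmup:
        return peak_lr * step / warmup                 # ramp up to the peak
    progress = (step - warmup) / (total - warmup)      # 0 -> 1 after warmup
    return peak_lr * (1.0 - progress) ** power         # reaches zero at `total`

# learning_rate(10_000) -> 0.0007 (peak); learning_rate(100_000) -> 0.0
```
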
## Performance
GeistBERT achieves **state-of-the-art (SOTA) results** on multiple tasks: