# DE-T5-Base-15k

Continued pretraining of `GermanT5/t5-efficient-gc4-german-base-nl36` for **15 000 steps** on the German portion of the scientific corpus (same preprocessing as the English data). Checkpoint: `cross_lingual_transfer/logs/native_baseline/.../step-step=015000.ckpt`.

## Model Details

- Base: GermanT5 (same architecture as T5-base, German tokenizer)
- Optimizer: Adafactor, lr=1e-3, inverse-sqrt schedule, warmup=1.5k steps, grad clip=1.0
- Effective batch size: 48 (per-GPU 48, no gradient accumulation)
- Objective: span corruption (15 % masking, mean span length 3)
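To make the objective concrete, here is a minimal, illustrative sketch of T5-style span corruption with the stated settings (15 % noise density, mean span length 3). The function and its simplified span-sampling scheme are assumptions for illustration, not the actual training code; only the `<extra_id_N>` sentinel convention is standard T5.

```python
import random

def span_corrupt(tokens, noise_density=0.15, mean_span_length=3, seed=0):
    """Replace random spans with T5-style <extra_id_N> sentinels.

    Returns (inputs, targets): each masked span becomes one sentinel in
    the inputs, and the targets list each sentinel followed by the
    tokens it hid.
    """
    rng = random.Random(seed)
    n = len(tokens)
    num_noise = max(1, round(n * noise_density))  # tokens to mask in total

    # Sample span start positions until enough tokens are masked
    # (simplified: real T5 samples span lengths around mean_span_length).
    masked = set()
    starts = list(range(n))
    rng.shuffle(starts)
    for start in starts:
        if len(masked) >= num_noise:
            break
        for i in range(start, min(n, start + mean_span_length)):
            if len(masked) < num_noise:
                masked.add(i)

    inputs, targets, sentinel = [], [], 0
    i = 0
    while i < n:
        if i in masked:
            tok = f"<extra_id_{sentinel}>"
            inputs.append(tok)
            targets.append(tok)
            while i < n and i in masked:  # consume the whole span
                targets.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets
```

At 15 % density, a 512-token window yields roughly 77 masked tokens across roughly 26 spans.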

## Training Data

German split of the Unpaywall-derived corpus (continued-pretraining windows of 512 tokens, 50 % overlap).
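The 512-token, 50 %-overlap windowing can be sketched as a simple stride-based chunker. This is an illustrative reconstruction, not the project's preprocessing code; in particular, aligning the final window to the document end so no tokens are dropped is an assumption.

```python
def window_document(token_ids, window=512, overlap=0.5):
    """Split a token-ID sequence into fixed-size overlapping windows.

    With window=512 and overlap=0.5 the stride is 256, so consecutive
    windows share half their tokens.
    """
    stride = max(1, int(window * (1 - overlap)))
    if len(token_ids) <= window:
        return [token_ids]  # short documents become a single window
    windows = []
    start = 0
    while start + window < len(token_ids):
        windows.append(token_ids[start:start + window])
        start += stride
    windows.append(token_ids[-window:])  # last window aligned to the end
    return windows
```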

## Evaluation (Global-MMLU, zero-shot)

| Metric | EN | DE |
| --- | --- | --- |
| Overall accuracy | 0.2295 | 0.2295 |
| Humanities | 0.2421 | 0.2421 |
| STEM | 0.2125 | 0.2125 |
| Social Sciences | 0.2171 | 0.2171 |
| Other | 0.2398 | 0.2398 |

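For context: Global-MMLU questions are four-way multiple choice, so chance accuracy is 0.25 and the scores above sit near chance level. Zero-shot multiple-choice accuracy of this kind is just the fraction of questions where the top-scoring choice is the gold answer; the sketch below assumes per-choice scores (e.g. length-normalized log-likelihoods) have already been computed, and is not the actual evaluation harness.

```python
def mc_accuracy(choice_scores, gold):
    """Fraction of questions where the top-scoring choice is correct.

    choice_scores: one list of per-choice scores per question.
    gold: the gold answer index for each question.
    """
    correct = sum(
        1
        for scores, g in zip(choice_scores, gold)
        if max(range(len(scores)), key=scores.__getitem__) == g
    )
    return correct / len(gold)
```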
## Intended Use

German scientific-domain NLP baseline; intended for comparison against WECHSEL-based models or for further fine-tuning on German datasets.

## Limitations

- Only 15k steps of continued pretraining, so improvements over the base GermanT5 are modest.
- Uses a German SentencePiece vocab; incompatible with an English tokenizer out of the box.