---
language:
- de
license: mit
library_name: transformers
pipeline_tag: text2text-generation
tags:
- t5
- german
- wechsel
- cross-lingual
datasets:
- unpaywall-scientific
---

# DE-T5-Sci-Transfer-Init

WECHSEL-initialized checkpoint: the English EN-T5-Sci weights paired with the German tokenizer `GermanT5/t5-efficient-gc4-german-base-nl36`, with the embedding matrix re-initialized via WECHSEL (static embeddings plus a bilingual dictionary). **No additional German training** was performed after the transfer. The folder includes `transfer_metadata.pt` with alignment diagnostics.
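
The contents of `transfer_metadata.pt` are not documented here, so a quick inspection is the simplest way to see which diagnostics were recorded; a minimal sketch:

```python
import torch

# Load the bundled alignment diagnostics on CPU; the schema is whatever
# the transfer script stored, so list the keys first.
meta = torch.load("transfer_metadata.pt", map_location="cpu")
print(sorted(meta.keys()))
```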

## Model Details

- Embedding init: Orthogonal Procrustes map over fastText n-gram embeddings, plus temperature-weighted mixtures of the k nearest source subwords (see the sketch after this list)
- Special tokens: `<extra_id_0>`..`<extra_id_99>` aligned, sentinel behavior preserved
- Tokenizer: GermanT5 SentencePiece (files bundled here)
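
A minimal sketch of those two steps, assuming pre-loaded fastText subword matrices and bilingual-dictionary word pairs; this illustrates the WECHSEL recipe rather than reproducing the exact transfer script:

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal map W minimizing ||X @ W - Y||_F (classic SVD solution).

    X, Y: (n_pairs, d_static) static embeddings of dictionary-aligned
    English/German word pairs."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def init_target_embeddings(src_emb, src_static, tgt_static, k=10, temperature=0.1):
    """Initialize German subword embeddings as temperature-weighted
    mixtures of the k nearest English subwords in the aligned static space.

    src_emb:    (V_en, d_model) trained English model embeddings
    src_static: (V_en, d_static) English subword fastText vectors, already
                mapped into the shared space with the Procrustes matrix above
    tgt_static: (V_de, d_static) German subword fastText vectors
    """
    src_n = src_static / np.linalg.norm(src_static, axis=1, keepdims=True)
    tgt_n = tgt_static / np.linalg.norm(tgt_static, axis=1, keepdims=True)
    sim = tgt_n @ src_n.T                         # cosine similarity (V_de, V_en)

    tgt_emb = np.empty((tgt_static.shape[0], src_emb.shape[1]), src_emb.dtype)
    for i, row in enumerate(sim):
        nn = np.argpartition(row, -k)[-k:]        # k nearest English subwords
        w = np.exp(row[nn] / temperature)
        tgt_emb[i] = (w / w.sum()) @ src_emb[nn]  # softmax-weighted mixture
    return tgt_emb
```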

## Evaluation (Global-MMLU, zero-shot)

| Category | EN accuracy | DE accuracy |
| --- | --- | --- |
| Overall | 0.2434 | 0.2463 |
| Humanities | 0.2485 | 0.2559 |
| STEM | 0.2391 | 0.2445 |
| Social Sciences | 0.2317 | 0.2307 |
| Other | 0.2517 | 0.2491 |

German accuracy matches the English baseline immediately after the transfer, demonstrating cross-lingual transfer without a single German gradient step.
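
For context, a minimal sketch of how a zero-shot multiple-choice score can be computed with this checkpoint; the repo path is a placeholder and the actual evaluation harness may differ:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "path/to/DE-T5-Sci-Transfer-Init"  # placeholder path
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id).eval()

@torch.no_grad()
def answer_logprob(question: str, answer: str) -> float:
    enc = tok(question, return_tensors="pt", truncation=True, max_length=512)
    labels = tok(answer, return_tensors="pt").input_ids
    # out.loss is the mean per-token NLL of the answer tokens;
    # multiply by length to recover the summed log-probability.
    out = model(**enc, labels=labels)
    return -out.loss.item() * labels.shape[1]

def predict(question: str, options: list[str]) -> int:
    # Choose the option to which the model assigns the highest likelihood.
    scores = [answer_logprob(question, o) for o in options]
    return max(range(len(options)), key=scores.__getitem__)
```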

## Intended Use

A starting point for German continued pretraining or fine-tuning where English scientific knowledge should be retained but a German tokenizer is required.
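
Loading uses the standard transformers API, and because the sentinel tokens survived the transfer, the usual T5 span-corruption objective applies unchanged; a minimal sketch (repo path is a placeholder):

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

path = "path/to/DE-T5-Sci-Transfer-Init"  # placeholder path
tok = AutoTokenizer.from_pretrained(path)
model = T5ForConditionalGeneration.from_pretrained(path)

# Span corruption on German text: masked spans are replaced by sentinel
# tokens in the input, and the target reconstructs them in order.
inputs = tok("Photosynthese wandelt <extra_id_0> in chemische Energie um.",
             return_tensors="pt")
labels = tok("<extra_id_0> Lichtenergie <extra_id_1>",
             return_tensors="pt").input_ids
loss = model(input_ids=inputs.input_ids, labels=labels).loss
loss.backward()  # ready for a continued-pretraining step
```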

## Limitations

- No German data exposure beyond the embedding alignment; run additional continued pretraining (see the next model) for best performance.
- Context length is still limited to 512 tokens.