---
language:
  - de
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
  - t5
  - german
  - wechsel
  - cross-lingual
datasets:
  - unpaywall-scientific
---

# DE-T5-Sci-Transfer-Init

This is a WECHSEL-initialized checkpoint: the English EN-T5-Sci weights are paired with the German tokenizer `GermanT5/t5-efficient-gc4-german-base-nl36`, with embeddings aligned using WECHSEL (static embeddings plus a bilingual dictionary). No additional German training was performed after the transfer. The folder includes `transfer_metadata.pt` with alignment diagnostics.

## Model Details

- **Embedding init:** Orthogonal Procrustes map (fastText n-gram embeddings) + temperature-weighted mixtures (k-nearest neighbors)
- **Special tokens:** `<extra_id_0>`…`<extra_id_99>` aligned; sentinel behavior preserved
- **Tokenizer:** GermanT5 SentencePiece (files bundled here)
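The two-step initialization above can be sketched in NumPy. This is a toy illustration, not the WECHSEL implementation: the embedding matrices are random stand-ins for fastText static vectors and the English T5 embedding table, and the bilingual dictionary is reduced to three hand-picked index pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Toy stand-ins: static (fastText-style) vectors for 6 English and
# 3 German subwords, plus a fake pretrained English T5 embedding table.
en_static = rng.standard_normal((6, d))
de_static = rng.standard_normal((3, d))
en_t5 = rng.standard_normal((6, 8))

# 1) Orthogonal Procrustes: rotate the German static space onto the
#    English one using a small bilingual dictionary of paired rows.
pairs = [(0, 0), (1, 1), (2, 2)]          # (de_idx, en_idx) dictionary entries
A = de_static[[p[0] for p in pairs]]
B = en_static[[p[1] for p in pairs]]
U, _, Vt = np.linalg.svd(A.T @ B)
W = U @ Vt                                # orthogonal map: de -> en space

# 2) Temperature-weighted k-NN mixture: each German token's new T5
#    embedding is a softmax-weighted average of the T5 embeddings of
#    its nearest English neighbours in the aligned static space.
def init_de_embeddings(k=3, temperature=0.1):
    mapped = de_static @ W
    sims = mapped @ en_static.T / (
        np.linalg.norm(mapped, axis=1, keepdims=True)
        * np.linalg.norm(en_static, axis=1)
    )                                     # cosine similarities, shape (3, 6)
    out = np.zeros((len(de_static), en_t5.shape[1]))
    for i, row in enumerate(sims):
        nn = np.argsort(row)[-k:]         # k nearest English neighbours
        w = np.exp(row[nn] / temperature)
        w /= w.sum()
        out[i] = w @ en_t5[nn]
    return out

de_t5 = init_de_embeddings()
print(de_t5.shape)                        # (3, 8)
```

The Procrustes solution keeps the map orthogonal, so distances in the static space are preserved; the temperature controls how sharply the mixture concentrates on the single nearest neighbour.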

## Evaluation (Global-MMLU, zero-shot)

| Metric           | EN     | DE     |
|------------------|--------|--------|
| Overall accuracy | 0.2434 | 0.2463 |
| Humanities       | 0.2485 | 0.2559 |
| STEM             | 0.2391 | 0.2445 |
| Social Sciences  | 0.2317 | 0.2307 |
| Other            | 0.2517 | 0.2491 |

German scores match the English scores almost exactly, demonstrating immediate cross-lingual transfer without any German gradient steps.

## Intended Use

Starting point for German continued pretraining or fine-tuning where English scientific knowledge should be retained but a German tokenizer is required.
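Continued pretraining from this checkpoint would use the standard T5 span-corruption objective, which relies on the sentinel tokens preserved by the transfer. A minimal sketch of how input/target pairs are constructed (token strings and mask positions are illustrative; a real pipeline samples spans over SentencePiece IDs):

```python
def span_corrupt(tokens, spans):
    """Replace each (start, length) span with a sentinel token; the target
    lists the removed spans, each introduced by its sentinel, followed by
    a final closing sentinel -- the T5 span-corruption format."""
    inputs, targets = [], []
    cursor, sid = 0, 0
    for start, length in sorted(spans):
        inputs.extend(tokens[cursor:start])
        sentinel = f"<extra_id_{sid}>"
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:start + length])
        cursor = start + length
        sid += 1
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{sid}>")   # closing sentinel
    return inputs, targets

toks = "Die Forschung zeigt deutliche Effekte".split()
inp, tgt = span_corrupt(toks, [(1, 1), (3, 2)])
print(inp)  # ['Die', '<extra_id_0>', 'zeigt', '<extra_id_1>']
print(tgt)  # ['<extra_id_0>', 'Forschung', '<extra_id_1>', 'deutliche', 'Effekte', '<extra_id_2>']
```

Because the transfer keeps `<extra_id_0>`…`<extra_id_99>` aligned, these pairs can be fed to the model as-is when resuming pretraining on German text.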

## Limitations

- No German data exposure beyond embedding alignment; run additional continued pretraining (see the next model in this series) for best performance.
- Still limited to a 512-token context.