---
language:
  - de
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
  - t5
  - german
  - wechsel
  - cross-lingual
datasets:
  - unpaywall-scientific
---

# DE-T5-Sci-Transfer-Init

This is a WECHSEL-initialized checkpoint: the English EN-T5-Sci weights are paired with the German tokenizer `GermanT5/t5-efficient-gc4-german-base-nl36`, with embeddings aligned using WECHSEL (static embeddings plus a bilingual dictionary). No additional German training was performed after the transfer. The folder includes `transfer_metadata.pt` with alignment diagnostics.

## Model Details

- **Embedding init:** Orthogonal Procrustes map (fastText n-gram embeddings) + temperature-weighted mixtures (k-nearest neighbors)
- **Special tokens:** `<extra_id_0>`…`<extra_id_99>` aligned; sentinel behavior preserved
- **Tokenizer:** GermanT5 SentencePiece (files bundled here)
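The two-step initialization above can be sketched in NumPy. This is a toy illustration, not the WECHSEL implementation: the embedding matrices are random stand-ins for fastText static vectors and the English T5 embedding table, and the bilingual dictionary is reduced to three hand-picked index pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Toy stand-ins: static (fastText-style) vectors for 6 English and
# 3 German subwords, plus a fake pretrained English T5 embedding table.
en_static = rng.standard_normal((6, d))
de_static = rng.standard_normal((3, d))
en_t5 = rng.standard_normal((6, 8))

# 1) Orthogonal Procrustes: rotate the German static space onto the
#    English one using a small bilingual dictionary of paired rows.
pairs = [(0, 0), (1, 1), (2, 2)]          # (de_idx, en_idx) dictionary entries
A = de_static[[p[0] for p in pairs]]
B = en_static[[p[1] for p in pairs]]
U, _, Vt = np.linalg.svd(A.T @ B)
W = U @ Vt                                # orthogonal map: de -> en space

# 2) Temperature-weighted k-NN mixture: each German token's new T5
#    embedding is a softmax-weighted average of the T5 embeddings of
#    its nearest English neighbours in the aligned static space.
def init_de_embeddings(k=3, temperature=0.1):
    mapped = de_static @ W
    sims = mapped @ en_static.T / (
        np.linalg.norm(mapped, axis=1, keepdims=True)
        * np.linalg.norm(en_static, axis=1)
    )                                     # cosine similarities, shape (3, 6)
    out = np.zeros((len(de_static), en_t5.shape[1]))
    for i, row in enumerate(sims):
        nn = np.argsort(row)[-k:]         # k nearest English neighbours
        w = np.exp(row[nn] / temperature)
        w /= w.sum()
        out[i] = w @ en_t5[nn]
    return out

de_t5 = init_de_embeddings()
print(de_t5.shape)                        # (3, 8)
```

The Procrustes solution keeps the map orthogonal, so distances in the static space are preserved; the temperature controls how sharply the mixture concentrates on the single nearest neighbour.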

## Evaluation (Global-MMLU, zero-shot)

| Metric           | EN     | DE     |
|------------------|--------|--------|
| Overall accuracy | 0.2434 | 0.2463 |
| Humanities       | 0.2485 | 0.2559 |
| STEM             | 0.2391 | 0.2445 |
| Social Sciences  | 0.2317 | 0.2307 |
| Other            | 0.2517 | 0.2491 |

German scores match the English scores almost exactly, demonstrating immediate cross-lingual transfer without any German gradient steps.

## Intended Use

Starting point for German continued pretraining or fine-tuning where English scientific knowledge should be retained but a German tokenizer is required.
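Continued pretraining from this checkpoint would use the standard T5 span-corruption objective, which relies on the sentinel tokens preserved by the transfer. A minimal sketch of how input/target pairs are constructed (token strings and mask positions are illustrative; a real pipeline samples spans over SentencePiece IDs):

```python
def span_corrupt(tokens, spans):
    """Replace each (start, length) span with a sentinel token; the target
    lists the removed spans, each introduced by its sentinel, followed by
    a final closing sentinel -- the T5 span-corruption format."""
    inputs, targets = [], []
    cursor, sid = 0, 0
    for start, length in sorted(spans):
        inputs.extend(tokens[cursor:start])
        sentinel = f"<extra_id_{sid}>"
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:start + length])
        cursor = start + length
        sid += 1
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{sid}>")   # closing sentinel
    return inputs, targets

toks = "Die Forschung zeigt deutliche Effekte".split()
inp, tgt = span_corrupt(toks, [(1, 1), (3, 2)])
print(inp)  # ['Die', '<extra_id_0>', 'zeigt', '<extra_id_1>']
print(tgt)  # ['<extra_id_0>', 'Forschung', '<extra_id_1>', 'deutliche', 'Effekte', '<extra_id_2>']
```

Because the transfer keeps `<extra_id_0>`…`<extra_id_99>` aligned, these pairs can be fed to the model as-is when resuming pretraining on German text.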

## Limitations

- No German data exposure beyond embedding alignment; run additional continued pretraining (see the next model in this series) for best performance.
- Still limited to a 512-token context.