rausch committed on
Commit 79cddb6 · verified · 1 Parent(s): 8b5e5f4

Create README.md

Files changed (1)
  1. README.md +41 -0
README.md ADDED
@@ -0,0 +1,41 @@
+ ---
+ language:
+ - de
+ license: mit
+ library_name: transformers
+ pipeline_tag: text-generation
+ tags:
+ - t5
+ - german
+ - wechsel
+ - cross-lingual
+ datasets:
+ - unpaywall-scientific
+ ---
+
+ # DE-T5-Sci-Transfer-Init
+
+ WECHSEL-initialized checkpoint that combines the English EN-T5-Sci weights with the German tokenizer from `GermanT5/t5-efficient-gc4-german-base-nl36`. The German token embeddings are initialized with WECHSEL, which aligns static embeddings across the two languages using a bilingual dictionary. **No additional German training** was performed after the transfer. The folder includes `transfer_metadata.pt`, which contains alignment diagnostics.
+
+ ## Model Details
+ - Embedding init: orthogonal Procrustes mapping between fastText n-gram embeddings, followed by temperature-weighted mixtures over the k nearest neighbors
+ - Special tokens: `<extra_id_0>` through `<extra_id_99>` aligned, sentinel behavior preserved
+ - Tokenizer: GermanT5 SentencePiece (files bundled here)
+
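The embedding initialization listed above can be illustrated with a minimal NumPy sketch. It is not the code used to build this checkpoint: the function names, the default `k` and `temperature`, and the use of cosine similarity are assumptions made for illustration, and the static embeddings are assumed to be already loaded as arrays with dictionary pairs in matching rows.

```python
import numpy as np


def orthogonal_procrustes(src, tgt):
    """Return the orthogonal matrix W that minimizes ||src @ W - tgt||_F.

    src, tgt: (n_pairs, dim) static embeddings of bilingual dictionary pairs.
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


def init_target_embeddings(src_static, tgt_static, src_model_emb, w,
                           k=10, temperature=0.1):
    """Build each target-token embedding as a temperature-weighted mixture
    of the model embeddings of its k nearest source tokens.

    src_static:    (n_src, dim) static embeddings of source-tokenizer tokens
    tgt_static:    (n_tgt, dim) static embeddings of target-tokenizer tokens
    src_model_emb: (n_src, d_model) embedding matrix of the source model
    w:             (dim, dim) Procrustes map from source to target space
    """
    # Move source tokens into the target static space, then L2-normalize
    # both sides so that the dot product is cosine similarity.
    src_aligned = src_static @ w
    src_n = src_aligned / np.linalg.norm(src_aligned, axis=1, keepdims=True)
    tgt_n = tgt_static / np.linalg.norm(tgt_static, axis=1, keepdims=True)
    sims = tgt_n @ src_n.T  # (n_tgt, n_src)

    # Indices and similarities of the k most similar source tokens.
    nn_idx = np.argsort(-sims, axis=1)[:, :k]
    nn_sims = np.take_along_axis(sims, nn_idx, axis=1)

    # Softmax over the neighbors with a temperature (numerically stabilized).
    logits = nn_sims / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Weighted sum of the neighbors' model embeddings: (n_tgt, d_model).
    return np.einsum("tk,tkd->td", weights, src_model_emb[nn_idx])
```

A lower `temperature` concentrates the mixture on the single closest source token, while a higher one spreads the weight more evenly across all `k` neighbors.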
+ ## Evaluation (Global-MMLU, zero-shot)
+ | Metric | EN | DE |
+ | --- | --- | --- |
+ | Overall accuracy | 0.2434 | 0.2463 |
+ | Humanities | 0.2485 | 0.2559 |
+ | STEM | 0.2391 | 0.2445 |
+ | Social Sciences | 0.2317 | 0.2307 |
+ | Other | 0.2517 | 0.2491 |
+
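+ German zero-shot accuracy is within one percentage point of English accuracy in every category, which indicates that the transfer carries over the English model's behavior without any German gradient steps.
+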
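For readers who want to check the table, the short sketch below computes the DE minus EN gap per category and the distance of each German score from a chance baseline. The baseline of 0.25 is an assumption (four answer options per question), and `summarize` is a hypothetical helper written for this illustration.

```python
# Accuracies copied from the table above as (EN, DE) pairs.
scores = {
    "Overall accuracy": (0.2434, 0.2463),
    "Humanities": (0.2485, 0.2559),
    "STEM": (0.2391, 0.2445),
    "Social Sciences": (0.2317, 0.2307),
    "Other": (0.2517, 0.2491),
}

CHANCE = 0.25  # assumption: four answer options per question


def summarize(scores, chance=CHANCE):
    """Return (category, DE - EN, DE - chance) rows, rounded to 4 decimals."""
    rows = []
    for name, (en, de) in scores.items():
        rows.append((name, round(de - en, 4), round(de - chance, 4)))
    return rows


for name, gap, above_chance in summarize(scores):
    print(f"{name:18s} DE-EN={gap:+.4f}  DE-chance={above_chance:+.4f}")
```

Under that assumption, the gap to the chance baseline is the more informative column when judging how much German capability the checkpoint has before any continued pretraining.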
+ ## Intended Use
+ Starting point for German continued pretraining or fine-tuning when the English scientific knowledge should be retained but a German tokenizer is required.
+
+ ## Limitations
+ - No exposure to German data beyond the embedding alignment; run additional continued pretraining (see next model) for best performance.
+ - Context length is still limited to 512 tokens.