bumblelbee committed
Commit ed6a244 · verified · 1 Parent(s): a70a571

Upload Script Reproduction checkpoint for NLP4DH 2026

Files changed (4)
  1. README.md +70 -0
  2. config.json +37 -0
  3. generation_config.json +11 -0
  4. model.safetensors +3 -0
README.md ADDED
@@ -0,0 +1,70 @@
+ ---
+ language:
+ - egy
+ - de
+ tags:
+ - translation
+ - ancient-egyptian
+ - hieroglyphics
+ - contamination-study
+ - nlp4dh
+ license: mit
+ base_model: facebook/m2m100_418M
+ pipeline_tag: translation
+ ---
+
+ # Script Reproduction: Hieroglyphic-to-German Translation
+
+ This model accompanies the paper **"Data Contamination in Neural Machine Translation of Ancient Egyptian Hieroglyphics"** (NLP4DH 2026).
+
+ ## Model Description
+
+ M2M-100 (418M) retrained with the original `train.py` script from the hiero-transformer repository, using its default hyperparameters (epochs=20, batch_size=16, lr=3e-5). This is the closest replication of the original training procedure.
+
+ **Task:** Ancient Egyptian hieroglyphics (Gardiner notation) → German translation
+
+ ## Performance
+
+ | Subset | BLEU |
+ |--------|------|
+ | All (n=50) | 42.2 |
+ | Contaminated (n=16) | 77.5 |
+ | Clean (n=34) | 33.8 |
+
+ > **Important:** The "All" and "Contaminated" BLEU scores are inflated by target-side data contamination: 16 of the 50 test targets (32%) also appear in the training data. The **Clean** score is the more reliable estimate of translation quality, since it is computed on the decontaminated samples only.
+
+ ## Usage
+
+ ```python
+ from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
+
+ repo_id = "bumblelbee/hiero-m2m100-script-reproduction"
+ model = M2M100ForConditionalGeneration.from_pretrained(repo_id)
+ tokenizer = M2M100Tokenizer.from_pretrained(repo_id)
+
+ # Input: hieroglyphs encoded as Gardiner sign codes
+ source = "D36 N35 G17 D21 X1 O34"
+
+ # "ea" is the source-language code used by this card; this assumes the
+ # loaded tokenizer defines it.
+ tokenizer.src_lang = "ea"
+ inputs = tokenizer(source, return_tensors="pt")
+ # Force the German language token as the first generated token.
+ generated = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("de"))
+ output = tokenizer.decode(generated[0], skip_special_tokens=True)
+ print(output)
+ ```
+
+ ## Training Data
+
+ Fine-tuned on 18,669 ea→de pairs from the Thesaurus Linguae Aegyptiae (TLA), maintained by the Berlin-Brandenburg Academy of Sciences and Humanities.
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{contamination2026nlp4dh,
+   title={Data Contamination in Neural Machine Translation of Ancient Egyptian Hieroglyphics},
+   booktitle={Proceedings of the Workshop on Natural Language Processing for Digital Humanities (NLP4DH 2026)},
+   year={2026}
+ }
+ ```
+
+ ## Paper Repository
+
+ See the full paper, scripts, and results: [GitHub repository](https://github.com/[repository])
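The Contaminated/Clean split reported in the README's Performance table depends on checking whether a test target also occurs among the training targets. The sketch below shows one way such a split could be computed. It is not the paper's script: the matching rule (exact match after whitespace normalization), the function names, and the toy data are all assumptions made for illustration.

```python
def normalize(text):
    """Collapse runs of whitespace so trivially different strings compare equal."""
    return " ".join(text.split())


def split_by_target_overlap(train_targets, test_pairs):
    """Partition (source, target) test pairs by whether the target was seen in training."""
    seen = {normalize(target) for target in train_targets}
    contaminated, clean = [], []
    for source, target in test_pairs:
        bucket = contaminated if normalize(target) in seen else clean
        bucket.append((source, target))
    return contaminated, clean


# Made-up toy data for illustration only, not taken from the TLA corpus.
train_targets = ["der König", "das Haus  des Gottes"]
test_pairs = [
    ("A1 Z1", "der König"),
    ("G17 D21", "er sagt"),
    ("O1 Z1", "das Haus des Gottes"),
]

contaminated, clean = split_by_target_overlap(train_targets, test_pairs)
rate = len(contaminated) / len(test_pairs)
print(f"contaminated={len(contaminated)} clean={len(clean)} rate={rate:.0%}")
```

With the counts from the Performance table, the same ratio is 16 / 50 = 32%. A stricter or looser matching rule (for example, near-duplicate detection) would change which samples count as contaminated.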
config.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "_name_or_path": "facebook/m2m100_418M",
+   "activation_dropout": 0.0,
+   "activation_function": "relu",
+   "architectures": [
+     "M2M100ForConditionalGeneration"
+   ],
+   "attention_dropout": 0.1,
+   "bos_token_id": 0,
+   "d_model": 1024,
+   "decoder_attention_heads": 16,
+   "decoder_ffn_dim": 4096,
+   "decoder_layerdrop": 0.05,
+   "decoder_layers": 12,
+   "decoder_start_token_id": 2,
+   "dropout": 0.1,
+   "early_stopping": true,
+   "encoder_attention_heads": 16,
+   "encoder_ffn_dim": 4096,
+   "encoder_layerdrop": 0.05,
+   "encoder_layers": 12,
+   "eos_token_id": 2,
+   "gradient_checkpointing": false,
+   "init_std": 0.02,
+   "is_encoder_decoder": true,
+   "max_length": 200,
+   "max_position_embeddings": 1024,
+   "model_type": "m2m_100",
+   "num_beams": 5,
+   "num_hidden_layers": 12,
+   "pad_token_id": 1,
+   "scale_embedding": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.44.0",
+   "use_cache": true,
+   "vocab_size": 128112
+ }
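As a sanity check on the dimensions in `config.json`, the trainable parameter count can be estimated from the config values alone. The sketch below makes several assumptions that are not stated in the config: every linear layer has a bias, the token embedding is shared between encoder, decoder, and output projection, positional embeddings are sinusoidal and therefore not trainable, and each stack ends in a final layer norm. All function names are hypothetical.

```python
# Values copied from config.json above.
CONFIG = {
    "d_model": 1024,
    "vocab_size": 128112,
    "encoder_layers": 12,
    "decoder_layers": 12,
    "encoder_ffn_dim": 4096,
    "decoder_ffn_dim": 4096,
}


def linear(n_in, n_out):
    """Weights plus bias of a dense layer (bias assumed present)."""
    return n_in * n_out + n_out


def layer_norm(d):
    """Scale and shift vectors."""
    return 2 * d


def attention(d):
    """Query, key, value, and output projections."""
    return 4 * linear(d, d)


def feed_forward(d, ffn):
    return linear(d, ffn) + linear(ffn, d)


def estimate_parameters(cfg):
    d = cfg["d_model"]
    embedding = cfg["vocab_size"] * d  # assumed shared and tied to the output layer
    enc_layer = attention(d) + feed_forward(d, cfg["encoder_ffn_dim"]) + 2 * layer_norm(d)
    # Decoder layers add cross-attention and a third layer norm.
    dec_layer = 2 * attention(d) + feed_forward(d, cfg["decoder_ffn_dim"]) + 3 * layer_norm(d)
    encoder = cfg["encoder_layers"] * enc_layer + layer_norm(d)
    decoder = cfg["decoder_layers"] * dec_layer + layer_norm(d)
    return embedding + encoder + decoder


total = estimate_parameters(CONFIG)
print(f"{total:,} parameters, {4 * total:,} bytes as float32")
```

Under these assumptions the estimate comes to 483,905,536 parameters, or 1,935,622,144 bytes at 4 bytes per float32 value. That is within about 60 KB of the 1,935,681,888-byte size recorded in the `model.safetensors` pointer, which is consistent with a float32 checkpoint plus a small file header.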
generation_config.json ADDED
@@ -0,0 +1,11 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 0,
+   "decoder_start_token_id": 2,
+   "early_stopping": true,
+   "eos_token_id": 2,
+   "max_length": 200,
+   "num_beams": 5,
+   "pad_token_id": 1,
+   "transformers_version": "4.44.0"
+ }
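Because `_from_model_config` is `true`, the generation defaults here are expected to mirror the corresponding entries in `config.json`. The sketch below checks that with the standard library only. The model config is an excerpt limited to the overlapping keys, and the helper name `generation_mismatches` is hypothetical.

```python
import json

# Excerpt of config.json: only the keys that also appear in generation_config.json.
MODEL_CONFIG = json.loads("""
{
  "bos_token_id": 0,
  "decoder_start_token_id": 2,
  "early_stopping": true,
  "eos_token_id": 2,
  "max_length": 200,
  "num_beams": 5,
  "pad_token_id": 1,
  "transformers_version": "4.44.0"
}
""")

GENERATION_CONFIG = json.loads("""
{
  "_from_model_config": true,
  "bos_token_id": 0,
  "decoder_start_token_id": 2,
  "early_stopping": true,
  "eos_token_id": 2,
  "max_length": 200,
  "num_beams": 5,
  "pad_token_id": 1,
  "transformers_version": "4.44.0"
}
""")


def generation_mismatches(model_cfg, gen_cfg):
    """Return {key: (model value, generation value)} for every differing key."""
    mismatches = {}
    for key, value in gen_cfg.items():
        if key == "_from_model_config":  # bookkeeping flag, no model-side counterpart
            continue
        if model_cfg.get(key) != value:
            mismatches[key] = (model_cfg.get(key), value)
    return mismatches


print(generation_mismatches(MODEL_CONFIG, GENERATION_CONFIG))
```

An empty result means beam search with 5 beams, early stopping, and a maximum length of 200 tokens are set identically in both files.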
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6c37c09c606c59e49b8f9aa214cc87d8f2e7ee9ab2c3dde5b0616ff00d74a9e1
+ size 1935681888
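The three lines above are a Git LFS pointer, not the weights themselves: `oid` is the SHA-256 digest of the real file and `size` is its length in bytes. A downloaded `model.safetensors` can be checked against the pointer as in the sketch below. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the local file path in the usage line is an assumption.

```python
import hashlib
import os

POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:6c37c09c606c59e49b8f9aa214cc87d8f2e7ee9ab2c3dde5b0616ff00d74a9e1
size 1935681888
"""


def parse_lfs_pointer(text):
    """Parse 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Check a local file's size and SHA-256 digest against a parsed pointer."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so a multi-gigabyte checkpoint is never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algorithm"], pointer["size"])
```

Calling `matches_pointer("model.safetensors", pointer)` on a local copy should return `True` when the download is complete and uncorrupted.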