odegiber committed · Commit 7ac3dd5 · verified · 1 Parent(s): 8ab0ff7

Added README

Files changed (1)
  1. README.md +80 -0
README.md ADDED
@@ -0,0 +1,80 @@
+ ---
+ language:
+ - eo
+ - en
+ - es
+ - ca
+ tags:
+ - translation
+ - machine-translation
+ - marian
+ - opus-mt
+ - multilingual
+ license: cc-by-4.0
+ pipeline_tag: translation
+ metrics:
+ - bleu
+ - chrf
+ ---
+
+ # Esperanto -> Catalan, English, Spanish MT Model
+
+ ## Model description
+
+ This repository contains a **multilingual MarianMT** model for **Esperanto → (English, Spanish, Catalan)** translation. The target language is selected with a language tag prefixed to each source sentence.
+
+ ## Usage
+
+ Load and run the model with `transformers`:
+
+ ```python
+ from transformers import MarianMTModel, MarianTokenizer
+ import torch
+
+ model_name = "models/hf/eo_esenca_shuf"  # point this at your copy of the model
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model = MarianMTModel.from_pretrained(model_name).to(device)
+ tokenizer = MarianTokenizer.from_pretrained(model_name)
+
+ source_texts = [  # each input starts with the tag of its target language
+     ">>spa<< Saluton, kiel vi fartas?",
+     ">>eng<< Saluton, kiel vi fartas?",
+     ">>cat<< Saluton, kiel vi fartas?",
+ ]
+
+ inputs = tokenizer(source_texts, return_tensors="pt", padding=True, truncation=True)
+ inputs = {k: v.to(device) for k, v in inputs.items()}
+
+ translated_ids = model.generate(**inputs)  # also passes the attention mask for the padded batch
+ translated_texts = tokenizer.batch_decode(translated_ids, skip_special_tokens=True)
+
+ for src, tgt in zip(source_texts, translated_texts):
+     print(f"Source: {src} => Translated: {tgt}")
+ ```
+
+ ### Supported target languages (via tags)
+
+ You control the target language by prefixing the source sentence with one of the following tags:
+
+ * `>>eng<<` → English
+ * `>>spa<<` → Spanish
+ * `>>cat<<` → Catalan
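
As a small illustration of the tagging convention above, the sketch below prepends the tag for a requested target language to a source sentence. The helper name `add_target_tag` and the `TARGET_TAGS` mapping are hypothetical and are not part of this repository or of `transformers`; only the three tags come from the list above.

```python
# Hypothetical helper, not part of the model repository or of transformers.
# Only the three tags listed above are taken from the README.
TARGET_TAGS = {
    "eng": ">>eng<<",  # English
    "spa": ">>spa<<",  # Spanish
    "cat": ">>cat<<",  # Catalan
}


def add_target_tag(text: str, target: str) -> str:
    """Prefix an Esperanto sentence with the tag of the target language."""
    if target not in TARGET_TAGS:
        raise ValueError(f"Unsupported target language: {target!r}")
    return f"{TARGET_TAGS[target]} {text.strip()}"


sentence = "Saluton, kiel vi fartas?"
tagged = [add_target_tag(sentence, lang) for lang in ("spa", "eng", "cat")]
for line in tagged:
    print(line)
```

The resulting strings have the same form as the entries of `source_texts` in the usage example, so they can be passed to the tokenizer unchanged.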
+
+ ## Training data
+
+ The model was trained using **Tatoeba** parallel data, with **FLORES-200** used as the development set.
+
+ Training sentence-pair counts:
+
+ * **ca-eo**: 672,931
+ * **es-eo**: 4,677,945
+ * **eo-en**: 5,000,000
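
To make the relative sizes of the three training sets easier to see, the following sketch computes each pair's share of the total from the counts listed above. The variable names are illustrative. The README does not say whether any resampling or balancing was applied during training, so the shares describe the raw counts only.

```python
# Sentence-pair counts copied from the list above.
pair_counts = {
    "ca-eo": 672_931,
    "es-eo": 4_677_945,
    "eo-en": 5_000_000,
}

# Share of each language pair in the combined training data.
total = sum(pair_counts.values())
shares = {pair: count / total for pair, count in pair_counts.items()}

print(f"total: {total:,}")
for pair, share in shares.items():
    print(f"{pair}: {share:.1%}")
```

On these counts, the Catalan pair makes up roughly 6.5% of the training data.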
+
+ ## Evaluation on FLORES
+
+ | Language Pair | BLEU | ChrF++ |
+ | ------------- | ----: | ----: |
+ | epo-spa | 19.98 | 49.11 |
+ | epo-cat | 28.35 | 55.42 |
+ | epo-eng | 37.47 | 63.09 |