---
language:
- eo
- en
- es
- ca
tags:
- translation
- machine-translation
- marian
- opus-mt
- multilingual
license: cc-by-4.0
pipeline_tag: translation
metrics:
- bleu
- chrf
---

# Esperanto -> Catalan, English, Spanish MT Model

## Model description

This repository contains a **multilingual MarianMT** model for **Esperanto → (English, Spanish, Catalan)** translation using language tags.

## Usage

Load and run the model with `transformers`:

```python
from transformers import MarianMTModel, MarianTokenizer
import torch

model_name = "Helsinki-NLP/opus-mt-eo-caenes"

device = "cuda" if torch.cuda.is_available() else "cpu"
model = MarianMTModel.from_pretrained(model_name).to(device)
tokenizer = MarianTokenizer.from_pretrained(model_name)

source_texts = [
    ">>spa<< Saluton, kiel vi fartas?",
    ">>eng<< Saluton, kiel vi fartas?",
    ">>cat<< Saluton, kiel vi fartas?"
]

inputs = tokenizer(source_texts, return_tensors="pt", padding=True, truncation=True)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Pass the full encoding so the attention mask is applied to the padded batch.
translated_ids = model.generate(**inputs)
translated_texts = tokenizer.batch_decode(translated_ids, skip_special_tokens=True)

for src, tgt in zip(source_texts, translated_texts):
    print(f"Source: {src} => Translated: {tgt}")
```

### Supported target languages (via tags)

You control the target language by prefixing the source sentence with one of the following tags:

* `>>eng<<` → English
* `>>spa<<` → Spanish
* `>>cat<<` → Catalan
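
As a minimal sketch of the tagging convention above, the helper below prepends the tag for a chosen target language and rejects codes not listed in this card. The names `SUPPORTED_TAGS` and `add_target_tag` are hypothetical and are not part of `transformers` or of this model.

```python
# Target-language codes listed in this model card (hypothetical helper table).
SUPPORTED_TAGS = {"eng": "English", "spa": "Spanish", "cat": "Catalan"}


def add_target_tag(text: str, target: str) -> str:
    """Prefix an Esperanto sentence with the >>xxx<< tag for `target`."""
    if target not in SUPPORTED_TAGS:
        raise ValueError(f"Unsupported target language: {target!r}")
    return f">>{target}<< {text.strip()}"


sentence = "Saluton, kiel vi fartas?"
# One tagged copy of the sentence per target language.
tagged = [add_target_tag(sentence, code) for code in ("spa", "eng", "cat")]
for line in tagged:
    print(line)
```

The resulting list can be passed to the tokenizer in place of hand-written `source_texts`.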

## Training data

The model was trained using **Tatoeba** parallel data, with **FLORES-200** used as the development set.

Training sentence-pair counts:

* **ca-eo**: 672,931
* **es-eo**: 4,677,945
* **eo-en**: 5,000,000

## Evaluation on FLORES

| Language Pair |  BLEU |  ChrF++ |
| ------------- | ----: | ----: |
| epo-spa       | 19.98 | 49.11 |
| epo-cat       | 28.35 | 55.42 |
| epo-eng       | 37.47 | 63.09 |