---
language:
- eo
- en
- es
- ca
tags:
- translation
- machine-translation
- marian
- opus-mt
- multilingual
license: cc-by-4.0
pipeline_tag: translation
metrics:
- bleu
- chrf
---

# Esperanto → Catalan, English, Spanish MT Model

## Model description

This repository contains a **multilingual MarianMT** model that translates from **Esperanto** into **English, Spanish, or Catalan**. The target language is selected with a language tag prefixed to the source sentence.

## Usage

Load the model and tokenizer with `transformers`, then translate a batch of tagged sentences:

```python
from transformers import MarianMTModel, MarianTokenizer
import torch

model_name = "Helsinki-NLP/opus-mt-eo-caenes"

# Use a GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = MarianMTModel.from_pretrained(model_name).to(device)
tokenizer = MarianTokenizer.from_pretrained(model_name)

# Each source sentence starts with the tag of the desired target language
source_texts = [
    ">>spa<< Saluton, kiel vi fartas?",
    ">>eng<< Saluton, kiel vi fartas?",
    ">>cat<< Saluton, kiel vi fartas?",
]

inputs = tokenizer(source_texts, return_tensors="pt", padding=True, truncation=True)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Pass the attention mask together with the input IDs so padding is ignored
translated_ids = model.generate(**inputs)
translated_texts = tokenizer.batch_decode(translated_ids, skip_special_tokens=True)

for src, tgt in zip(source_texts, translated_texts):
    print(f"Source: {src} => Translated: {tgt}")
```

### Supported target languages (via tags)

You control the target language by prefixing the source sentence with one of the following tags:

* `>>eng<<` → English
* `>>spa<<` → Spanish
* `>>cat<<` → Catalan

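The tagging convention described here can be wrapped in a small helper so that an unsupported target fails early instead of reaching the model. This is only a sketch: `TARGET_TAGS` and `add_target_tag` are hypothetical names chosen for illustration, not part of `transformers` or of this repository.

```python
# Tags taken from this model card; the names below are hypothetical helpers.
TARGET_TAGS = {"eng": "English", "spa": "Spanish", "cat": "Catalan"}


def add_target_tag(text: str, target: str) -> str:
    """Prefix an Esperanto sentence with the tag for the target language."""
    if target not in TARGET_TAGS:
        raise ValueError(
            f"Unsupported target {target!r}; expected one of {sorted(TARGET_TAGS)}"
        )
    return f">>{target}<< {text.strip()}"


sentence = "Saluton, kiel vi fartas?"
# One tagged copy of the sentence per supported target language
tagged = [add_target_tag(sentence, code) for code in TARGET_TAGS]
print(tagged)
```

The resulting list can be passed to the tokenizer in place of a hand-written list of tagged source sentences.
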
## Training data

The model was trained using **Tatoeba** parallel data, with **FLORES-200** used as the development set.

Training sentence-pair counts:

* **ca-eo**: 672,931
* **es-eo**: 4,677,945
* **eo-en**: 5,000,000

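To see how the three language pairs are balanced, the counts above can be turned into shares of the total. The snippet is a minimal sketch that assumes these three figures make up the whole training mix; the variable names are illustrative.

```python
# Sentence-pair counts as listed in this model card
pair_counts = {"ca-eo": 672_931, "es-eo": 4_677_945, "eo-en": 5_000_000}

total = sum(pair_counts.values())
# Fraction of the training mix contributed by each language pair
shares = {pair: count / total for pair, count in pair_counts.items()}

for pair, share in shares.items():
    print(f"{pair}: {pair_counts[pair]:>9,} pairs ({share:.1%})")
print(f"total: {total:,} pairs")
```

Under that assumption the total is 10,350,876 sentence pairs, with the Catalan pair the smallest of the three by a wide margin.
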
## Evaluation on FLORES

| Language Pair | BLEU | ChrF++ |
| ------------- | ----: | ----: |
| epo-spa | 19.98 | 49.11 |
| epo-cat | 28.35 | 55.42 |
| epo-eng | 37.47 | 63.09 |