---
language:
- cs
- pl
- sk
- sl
library_name: transformers
license: cc-by-4.0
tags:
- translation
- mt
- marian
- pytorch
- sentence-piece
- one2many
- multilingual
- pivot
- allegro
- laniqo
---

# MultiSlav P4-pol2many
This repository contains the model described in the paper *MultiSlav: Multilingual Translation of Slavic Languages with Pivoting and Cross-lingual Data*.
## Multilingual Polish-to-Many MT Model
P4-pol2many is an Encoder-Decoder vanilla Transformer model trained on a sentence-level Machine Translation task. The model supports translation from Polish to 3 languages: Czech, Slovak, and Slovene. It is part of the MultiSlav collection; more information is available in our MultiSlav paper.
Experiments were conducted as part of a research project by the Machine Learning Research lab for Allegro.com. Big thanks to laniqo.com for cooperation on this research.
P4-pol2many is a Polish-to-Many model translating from Polish to other Slavic languages. Together with P4-many2pol, it forms the P4-pol pivot system, which translates between 4 Slavic languages. P4-pol first translates a sentence from any supported Slavic language into a Polish bridge sentence using the Many2One model, and then translates the Polish bridge sentence into the target Slavic language using this One2Many model.
### Model description
- Model name: P4-pol2many
- Source Language: Polish
- Target Languages: Czech, Slovak, Slovene
- Model Collection: MultiSlav
- Model type: MarianMTModel Encoder-Decoder
- License: CC BY 4.0 (commercial use allowed)
- Developed by: MLR @ Allegro & Laniqo.com
### Supported languages
To use the model, you must specify the target language for translation. Target language tokens are represented as 3-letter ISO 639-3 language codes embedded in the format >>xxx<<. All accepted directions and their respective tokens are listed below; each of them was added as a special token to the SentencePiece tokenizer (see the short sketch after the table).
| Target Language | First token |
|---|---|
| Czech | >>ces<< |
| Slovak | >>slk<< |
| Slovene | >>slv<< |
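For illustration, here is a minimal sketch of how an input is prepared; the Polish sentence is our own example, and the repository id is the one used in the quickstart below:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Allegro/P4-pol2many")

# Prefix the target-language token to the Polish source sentence to request Czech output.
model_input = ">>ces<< " + "Dzień dobry!"

# Because >>ces<< was added as a special token, the tokenizer keeps it as a single piece.
print(tokenizer.tokenize(model_input))
```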
### Use case quickstart
Example code snippet showing how to use the model. Due to a bug, the MarianMTModel class must be used explicitly.
```python
from transformers import AutoTokenizer, MarianMTModel

o2m_model_name = "Allegro/P4-pol2many"

o2m_tokenizer = AutoTokenizer.from_pretrained(o2m_model_name)
o2m_model = MarianMTModel.from_pretrained(o2m_model_name)

text = "Allegro to internetowa platforma e-commerce, na której swoje produkty sprzedają średnie i małe firmy, jak również duże marki."

# Prefix one copy of the source sentence with each target-language token.
target_languages = ["ces", "slk", "slv"]
batch_to_translate = [f">>{lang}<< {text}" for lang in target_languages]

# Tokenize the batch (padding keeps the inputs aligned) and translate.
translations = o2m_model.generate(
    **o2m_tokenizer.batch_encode_plus(batch_to_translate, padding=True, return_tensors="pt")
)
bridge_translations = o2m_tokenizer.batch_decode(
    translations, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
for trans in bridge_translations:
    print(trans)
```
Generated Czech output:
Allegro je online platforma pro e-commerce, na které své produkty prodávají střední a malé firmy, stejně jako velké značky.
Generated Slovak output:
Allegro je online platforma elektronického obchodu, na ktorej svoje produkty predávajú stredné a malé podniky, ako aj veľké značky.
Generated Slovene output:
Allegro je spletna platforma za e-poslovanje, kjer svoje izdelke prodajajo srednje velika in mala podjetja ter velike blagovne znamke.
To pivot-translate between other languages via a Polish bridge sentence, we also need the Many2One model (P4-many2pol). The Many2One model requires an explicit source language token as well. Example of translating from Czech to Slovak:
```python
from transformers import AutoTokenizer, MarianMTModel

m2o_model_name = "Allegro/P4-many2pol"
o2m_model_name = "Allegro/P4-pol2many"

# Many2One model: any supported Slavic language -> Polish bridge sentence.
m2o_tokenizer = AutoTokenizer.from_pretrained(m2o_model_name)
m2o_model = MarianMTModel.from_pretrained(m2o_model_name)

# One2Many model (this repository): Polish bridge sentence -> target Slavic language.
o2m_tokenizer = AutoTokenizer.from_pretrained(o2m_model_name)
o2m_model = MarianMTModel.from_pretrained(o2m_model_name)

# Czech source sentence, prefixed with its source-language token.
text = ">>ces<< " + "Allegro je on-line e-commerce platforma, na které své produkty prodávají střední a malé firmy, stejně jako velké značky."

# Step 1: translate the Czech sentence into a Polish bridge sentence.
translation = m2o_model.generate(**m2o_tokenizer.batch_encode_plus([text], return_tensors="pt"))
bridge_translations = m2o_tokenizer.batch_decode(translation, skip_special_tokens=True, clean_up_tokenization_spaces=True)

# Step 2: prefix the bridge sentence with the target-language token and translate to Slovak.
post_edited_bridge = ">>slk<< " + bridge_translations[0]
translation = o2m_model.generate(**o2m_tokenizer.batch_encode_plus([post_edited_bridge], return_tensors="pt"))
decoded_translations = o2m_tokenizer.batch_decode(translation, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(decoded_translations[0])
```
Generated Czech to Slovak pivot translation via Polish:
Allegro je online platforma elektronického obchodu, na ktorej svoje produkty predávajú stredné a malé podniky, ako aj veľké značky.
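For convenience, the two-step pivot can be wrapped in a small helper. The sketch below is illustrative only: the function name `pivot_translate` and the Czech example sentence are our own, not part of the released code, and it reuses the `m2o_*` and `o2m_*` objects loaded in the snippet above.

```python
def pivot_translate(text: str, src_lang: str, tgt_lang: str) -> str:
    """Translate src_lang -> Polish -> tgt_lang with the P4-pol pivot system (hypothetical helper)."""
    # Step 1: source language -> Polish bridge sentence (Many2One model).
    m2o_input = f">>{src_lang}<< {text}"
    bridge_ids = m2o_model.generate(**m2o_tokenizer(m2o_input, return_tensors="pt"))
    bridge = m2o_tokenizer.batch_decode(bridge_ids, skip_special_tokens=True)[0]

    # Step 2: Polish bridge sentence -> target language (this One2Many model).
    o2m_input = f">>{tgt_lang}<< {bridge}"
    target_ids = o2m_model.generate(**o2m_tokenizer(o2m_input, return_tensors="pt"))
    return o2m_tokenizer.batch_decode(target_ids, skip_special_tokens=True)[0]

print(pivot_translate("Dobré ráno!", src_lang="ces", tgt_lang="slk"))
```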
## Training
The SentencePiece tokenizer has a vocabulary of 64k tokens in total (16k per language). The tokenizer was trained on a randomly sampled part of the training corpus. For training we used the MarianNMT framework. The base Marian configuration used was transformer-big. All training parameters are listed in the table below.
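Purely as an illustration of such a setup (the file name, symbol set, and parameters below are placeholders, not the project's actual pipeline, which used MarianNMT), a shared 64k SentencePiece vocabulary with the language tokens registered as user-defined symbols could be trained along these lines:

```python
import sentencepiece as spm

# Illustrative sketch: train a shared 64k-token SentencePiece model on a random sample
# of the corpus. "corpus_sample.txt" is a placeholder path, and the symbol list only
# covers the target tokens from the table above; the project's actual set may differ.
spm.SentencePieceTrainer.train(
    input="corpus_sample.txt",
    model_prefix="p4_pol_spm",
    vocab_size=64_000,
    user_defined_symbols=[">>ces<<", ">>slk<<", ">>slv<<"],
)
```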