---
language:
- cs
- pl
- sk
- sl
- en
library_name: transformers
license: cc-by-4.0
tags:
- translation
- mt
- marian
- pytorch
- sentence-piece
- multilingual
- allegro
- laniqo
pipeline_tag: translation
paper: https://hf.co/papers/2502.14509
---
# MultiSlav BiDi Models

## Multilingual BiDi MT Models
BiDi is a collection of vanilla encoder-decoder transformer models trained on the sentence-level machine translation task. Each model supports bi-directional translation.
BiDi models are part of the MultiSlav collection, described in the [MultiSlav paper](https://hf.co/papers/2502.14509).
Experiments were conducted as part of a research project by the Machine Learning Research Lab at Allegro.com. Many thanks to laniqo.com for their cooperation in this research.
The graphic above shows an example BiDi model, BiDi-ces-pol, translating from Polish to Czech. BiDi-ces-pol is a bi-directional model supporting translation in both directions: from Czech to Polish and from Polish to Czech.
## Supported languages
To use a BiDi model, you must provide the target language for translation. Target language tokens are 3-letter ISO 639-3 language codes in the format `>>xxx<<`. All accepted directions and their respective tokens are listed below. Note that each model supports only two directions; each target token was added as a special token to the SentencePiece tokenizer.
| Target Language | First token |
|---|---|
| Czech | >>ces<< |
| English | >>eng<< |
| Polish | >>pol<< |
| Slovak | >>slk<< |
| Slovene | >>slv<< |
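For illustration, the prefixing convention above can be sketched as a small helper. Note that `with_target_token` is a hypothetical function written for this example, not part of the released models or the transformers library:

```python
# Hypothetical helper (not part of the released models): prepend the
# >>xxx<< target-language token that BiDi tokenizers expect.
SUPPORTED_LANGS = {"ces", "eng", "pol", "slk", "slv"}

def with_target_token(text: str, target_lang: str) -> str:
    """Return `text` prefixed with the target-language token."""
    if target_lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported target language: {target_lang!r}")
    return f">>{target_lang}<< {text}"

print(with_target_token("Dzień dobry!", "ces"))  # → >>ces<< Dzień dobry!
```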
## BiDi models available
We provide 10 BiDi models, covering all 20 translation directions between the 5 supported languages.
| BiDi model | Languages supported | HF repository |
|---|---|---|
| BiDi-ces-eng | Czech ↔ English | allegro/BiDi-ces-eng |
| BiDi-ces-pol | Czech ↔ Polish | allegro/BiDi-ces-pol |
| BiDi-ces-slk | Czech ↔ Slovak | allegro/BiDi-ces-slk |
| BiDi-ces-slv | Czech ↔ Slovene | allegro/BiDi-ces-slv |
| BiDi-eng-pol | English ↔ Polish | allegro/BiDi-eng-pol |
| BiDi-eng-slk | English ↔ Slovak | allegro/BiDi-eng-slk |
| BiDi-eng-slv | English ↔ Slovene | allegro/BiDi-eng-slv |
| BiDi-pol-slk | Polish ↔ Slovak | allegro/BiDi-pol-slk |
| BiDi-pol-slv | Polish ↔ Slovene | allegro/BiDi-pol-slv |
| BiDi-slk-slv | Slovak ↔ Slovene | allegro/BiDi-slk-slv |
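Since each repository name lists the two language codes in alphabetical order, the repo for any supported pair can be derived programmatically. The `bidi_repo` helper below is our own illustration, not part of any published API:

```python
# Hypothetical helper: map a (source, target) pair to the shared
# bi-directional repository name; codes are sorted alphabetically,
# so both directions resolve to the same repo.
def bidi_repo(source_lang: str, target_lang: str) -> str:
    first, second = sorted([source_lang, target_lang])
    return f"allegro/BiDi-{first}-{second}"

print(bidi_repo("pol", "ces"))  # → allegro/BiDi-ces-pol
print(bidi_repo("ces", "pol"))  # → allegro/BiDi-ces-pol (same repo, both directions)
```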
## Use case quickstart
Example code snippet showing how to use a model. Due to a bug, the `MarianMTModel` class must be used explicitly. Remember to adjust the source and target languages to your use case.
```python
from transformers import AutoTokenizer, MarianMTModel

source_lang = "pol"
target_lang = "ces"
# Repository names list the two language codes in alphabetical order.
first_lang, second_lang = sorted([source_lang, target_lang])

model_name = f"allegro/BiDi-{first_lang}-{second_lang}"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Prepend the target-language token to the source sentence.
text = f">>{target_lang}<<" + " " + "Allegro to internetowa platforma e-commerce, na której swoje produkty sprzedają średnie i małe firmy, jak również duże marki."

batch_to_translate = [text]
translations = model.generate(**tokenizer.batch_encode_plus(batch_to_translate, return_tensors="pt"))
decoded_translation = tokenizer.batch_decode(translations, skip_special_tokens=True, clean_up_tokenization_spaces=True)[0]

print(decoded_translation)
```
Generated Czech output:

```
Allegro je online e-commerce platforma, na které své výrobky prodávají střední a malé firmy, stejně jako velké značky.
```
## Training
The SentencePiece tokenizer has a vocabulary of 32k tokens in total (16k per language). The tokenizer was trained on a randomly sampled part of the training corpus. Training used the MarianNMT framework with the base `transformer-big` configuration. All training parameters are listed in the table below.