---
language:
  - cs
  - pl
  - sk
  - sl
  - en
library_name: transformers
license: cc-by-4.0
tags:
  - translation
  - mt
  - marian
  - pytorch
  - sentence-piece
  - multilingual
  - allegro
  - laniqo
pipeline_tag: translation
paper: https://hf.co/papers/2502.14509
---

# MultiSlav BiDi Models

*MLR @ Allegro.com*

## Multilingual BiDi MT Models

BiDi is a collection of encoder-decoder vanilla transformer models trained on the sentence-level Machine Translation task. Each model supports bi-directional translation.

BiDi models are part of the MultiSlav collection. More information is available in our MultiSlav paper: https://hf.co/papers/2502.14509

Experiments were conducted under a research project by the Machine Learning Research lab at Allegro.com. Many thanks to laniqo.com for cooperation in the research.

As an example, BiDi-ces-pol is a bi-directional model supporting translation both from Czech to Polish and from Polish to Czech.

## Supported languages

To use a BiDi model, you must provide the target language for translation. Target language tokens are represented as 3-letter ISO 639-3 language codes embedded in the format `>>xxx<<`. All accepted directions and their respective tokens are listed below. Note that for each model, only two directions are available. Each of them was added as a special token to the SentencePiece tokenizer.

| Target Language | First token |
|---|---|
| Czech   | `>>ces<<` |
| English | `>>eng<<` |
| Polish  | `>>pol<<` |
| Slovak  | `>>slk<<` |
| Slovene | `>>slv<<` |
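As a small illustration of the token format above, prepending the target-language token can be wrapped in a helper. The `with_target_token` function below is a hypothetical convenience, not part of the released models:

```python
# Language codes accepted by the BiDi collection (ISO 639-3).
SUPPORTED = {"ces", "eng", "pol", "slk", "slv"}


def with_target_token(target_lang: str, text: str) -> str:
    """Prepend the >>xxx<< target token expected by the BiDi tokenizers."""
    if target_lang not in SUPPORTED:
        raise ValueError(f"Unsupported target language: {target_lang}")
    return f">>{target_lang}<< {text}"


print(with_target_token("ces", "Dzień dobry"))  # >>ces<< Dzień dobry
```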

### Bi-Di models available

We provide 10 BiDi models, covering 20 translation directions between 5 languages.

| Bi-Di model | Languages supported | HF repository |
|---|---|---|
| BiDi-ces-eng | Czech ↔ English   | allegro/BiDi-ces-eng |
| BiDi-ces-pol | Czech ↔ Polish    | allegro/BiDi-ces-pol |
| BiDi-ces-slk | Czech ↔ Slovak    | allegro/BiDi-ces-slk |
| BiDi-ces-slv | Czech ↔ Slovene   | allegro/BiDi-ces-slv |
| BiDi-eng-pol | English ↔ Polish  | allegro/BiDi-eng-pol |
| BiDi-eng-slk | English ↔ Slovak  | allegro/BiDi-eng-slk |
| BiDi-eng-slv | English ↔ Slovene | allegro/BiDi-eng-slv |
| BiDi-pol-slk | Polish ↔ Slovak   | allegro/BiDi-pol-slk |
| BiDi-pol-slv | Polish ↔ Slovene  | allegro/BiDi-pol-slv |
| BiDi-slk-slv | Slovak ↔ Slovene  | allegro/BiDi-slk-slv |
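Because the two language codes in a repository name always appear in alphabetical order, the repository for any supported pair can be derived mechanically. The `bidi_repo` helper below is a hypothetical sketch of that naming convention:

```python
def bidi_repo(lang_a: str, lang_b: str) -> str:
    """Build the HF repository id for a BiDi language pair.

    Codes appear in alphabetical order in the repo name, so the
    same repository serves both translation directions.
    """
    first, second = sorted([lang_a, lang_b])
    return f"allegro/BiDi-{first}-{second}"


print(bidi_repo("pol", "ces"))  # allegro/BiDi-ces-pol
```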

## Use case quickstart

Example code snippet for using a model. Due to a bug, the `MarianMTModel` class must be used explicitly. Remember to adjust the source and target languages to your use case.

```python
from transformers import AutoTokenizer, MarianMTModel

source_lang = "pol"
target_lang = "ces"
first_lang, second_lang = sorted([source_lang, target_lang])
model_name = f"Allegro/BiDi-{first_lang}-{second_lang}"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Prepend the target-language token to the source sentence
text = f">>{target_lang}<< " + "Allegro to internetowa platforma e-commerce, na której swoje produkty sprzedają średnie i małe firmy, jak również duże marki."

batch_to_translate = [text]
translations = model.generate(**tokenizer(batch_to_translate, return_tensors="pt"))
decoded_translation = tokenizer.batch_decode(translations, skip_special_tokens=True, clean_up_tokenization_spaces=True)[0]

print(decoded_translation)
```

Generated Czech output:

```
Allegro je online e-commerce platforma, na které své výrobky prodávají střední a malé firmy, stejně jako velké značky.
```
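Several sentences can be translated in one call by padding the batch. The sketch below assumes the same `allegro/BiDi-ces-pol` model as above; the `prefix_batch` helper is hypothetical, and the model download only runs when the script is executed directly:

```python
from typing import List


def prefix_batch(target_lang: str, sentences: List[str]) -> List[str]:
    # Prepend the >>xxx<< target token to every sentence in the batch.
    return [f">>{target_lang}<< {s}" for s in sentences]


if __name__ == "__main__":
    from transformers import AutoTokenizer, MarianMTModel

    model_name = "allegro/BiDi-ces-pol"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    batch = prefix_batch("ces", [
        "Dzień dobry.",
        "Jak się masz?",
    ])
    # padding=True lets sentences of different lengths share one tensor
    inputs = tokenizer(batch, return_tensors="pt", padding=True)
    outputs = model.generate(**inputs)
    for line in tokenizer.batch_decode(outputs, skip_special_tokens=True):
        print(line)
```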

## Training

The SentencePiece tokenizer has a vocabulary size of 32k in total (16k per language). The tokenizer was trained on a randomly sampled part of the training corpus. During training we used the MarianNMT framework with the base `transformer-big` Marian configuration. All training parameters are listed in the table below.

Training hyperparameters: