---
language:
  - cs
  - pl
  - sk
  - sl
  - en
library_name: transformers
license: cc-by-4.0
tags:
  - translation
  - mt
  - marian
  - pytorch
  - sentence-piece
  - multilingual
  - allegro
  - laniqo
pipeline_tag: translation
paper: https://hf.co/papers/2502.14509
---

# MultiSlav BiDi Models

*MLR @ Allegro.com*

## Multilingual BiDi MT Models

BiDi is a collection of encoder-decoder vanilla transformer models trained on the sentence-level Machine Translation task. Each model supports bi-directional translation.

BiDi models are part of the MultiSlav collection. More information is available in our MultiSlav paper: https://hf.co/papers/2502.14509

Experiments were conducted under a research project by the Machine Learning Research lab at Allegro.com. Many thanks to laniqo.com for cooperation in the research.

As an example, BiDi-ces-pol is a bi-directional model supporting translation both from Czech to Polish and from Polish to Czech.

## Supported languages

To use a BiDi model, you must provide the target language for translation. Target language tokens are represented as 3-letter ISO 639-3 language codes embedded in the format `>>xxx<<`. All accepted directions and their respective tokens are listed below. Note that for each model, only two directions are available. Each of them was added as a special token to the SentencePiece tokenizer.

| Target Language | First token |
|---|---|
| Czech   | `>>ces<<` |
| English | `>>eng<<` |
| Polish  | `>>pol<<` |
| Slovak  | `>>slk<<` |
| Slovene | `>>slv<<` |
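As a small illustration of the token format above, prepending the target-language token can be wrapped in a helper. The `with_target_token` function below is a hypothetical convenience, not part of the released models:

```python
# Language codes accepted by the BiDi collection (ISO 639-3).
SUPPORTED = {"ces", "eng", "pol", "slk", "slv"}


def with_target_token(target_lang: str, text: str) -> str:
    """Prepend the >>xxx<< target token expected by the BiDi tokenizers."""
    if target_lang not in SUPPORTED:
        raise ValueError(f"Unsupported target language: {target_lang}")
    return f">>{target_lang}<< {text}"


print(with_target_token("ces", "Dzień dobry"))  # >>ces<< Dzień dobry
```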

### Bi-Di models available

We provide 10 BiDi models, covering 20 translation directions between 5 languages.

| Bi-Di model | Languages supported | HF repository |
|---|---|---|
| BiDi-ces-eng | Czech ↔ English   | allegro/BiDi-ces-eng |
| BiDi-ces-pol | Czech ↔ Polish    | allegro/BiDi-ces-pol |
| BiDi-ces-slk | Czech ↔ Slovak    | allegro/BiDi-ces-slk |
| BiDi-ces-slv | Czech ↔ Slovene   | allegro/BiDi-ces-slv |
| BiDi-eng-pol | English ↔ Polish  | allegro/BiDi-eng-pol |
| BiDi-eng-slk | English ↔ Slovak  | allegro/BiDi-eng-slk |
| BiDi-eng-slv | English ↔ Slovene | allegro/BiDi-eng-slv |
| BiDi-pol-slk | Polish ↔ Slovak   | allegro/BiDi-pol-slk |
| BiDi-pol-slv | Polish ↔ Slovene  | allegro/BiDi-pol-slv |
| BiDi-slk-slv | Slovak ↔ Slovene  | allegro/BiDi-slk-slv |
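Because the two language codes in a repository name always appear in alphabetical order, the repository for any supported pair can be derived mechanically. The `bidi_repo` helper below is a hypothetical sketch of that naming convention:

```python
def bidi_repo(lang_a: str, lang_b: str) -> str:
    """Build the HF repository id for a BiDi language pair.

    Codes appear in alphabetical order in the repo name, so the
    same repository serves both translation directions.
    """
    first, second = sorted([lang_a, lang_b])
    return f"allegro/BiDi-{first}-{second}"


print(bidi_repo("pol", "ces"))  # allegro/BiDi-ces-pol
```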

## Use case quickstart

Example code snippet for using a model. Due to a bug, the `MarianMTModel` class must be used explicitly. Remember to adjust the source and target languages to your use case.

```python
from transformers import AutoTokenizer, MarianMTModel

source_lang = "pol"
target_lang = "ces"
first_lang, second_lang = sorted([source_lang, target_lang])
model_name = f"Allegro/BiDi-{first_lang}-{second_lang}"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Prepend the target-language token to the source sentence
text = f">>{target_lang}<< " + "Allegro to internetowa platforma e-commerce, na której swoje produkty sprzedają średnie i małe firmy, jak również duże marki."

batch_to_translate = [text]
translations = model.generate(**tokenizer(batch_to_translate, return_tensors="pt"))
decoded_translation = tokenizer.batch_decode(translations, skip_special_tokens=True, clean_up_tokenization_spaces=True)[0]

print(decoded_translation)
```

Generated Czech output:

```
Allegro je online e-commerce platforma, na které své výrobky prodávají střední a malé firmy, stejně jako velké značky.
```
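Several sentences can be translated in one call by padding the batch. The sketch below assumes the same `allegro/BiDi-ces-pol` model as above; the `prefix_batch` helper is hypothetical, and the model download only runs when the script is executed directly:

```python
from typing import List


def prefix_batch(target_lang: str, sentences: List[str]) -> List[str]:
    # Prepend the >>xxx<< target token to every sentence in the batch.
    return [f">>{target_lang}<< {s}" for s in sentences]


if __name__ == "__main__":
    from transformers import AutoTokenizer, MarianMTModel

    model_name = "allegro/BiDi-ces-pol"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    batch = prefix_batch("ces", [
        "Dzień dobry.",
        "Jak się masz?",
    ])
    # padding=True lets sentences of different lengths share one tensor
    inputs = tokenizer(batch, return_tensors="pt", padding=True)
    outputs = model.generate(**inputs)
    for line in tokenizer.batch_decode(outputs, skip_special_tokens=True):
        print(line)
```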

## Training

The SentencePiece tokenizer has a vocabulary size of 32k in total (16k per language). The tokenizer was trained on a randomly sampled part of the training corpus. During training we used the MarianNMT framework with the base `transformer-big` Marian configuration. All training parameters are listed in the table below.

Training hyperparameters: