opus-mt-mul-en-big

Multilingual-to-English neural machine translation model (Transformer Big). Translates from many source languages into English.

Model description

  • Architecture: Marian NMT (Transformer Big)
  • Direction: Many languages (mul) to English (eng)
  • Parameters: ~0.3B
  • Format: Safetensors (F16)
  • Preprocessing: Normalization + SentencePiece (spm32k)
  • License: Apache 2.0

Supported source languages include widely used ones such as Japanese (jpn), Chinese (cmn), French (fra), German (deu), Spanish (spa), Russian (rus), Arabic (ara), and many more (hundreds of language/variant codes in total).

Source and origin

This repository re-hosts the original OPUS-MT mul-eng model (credit: the Helsinki-NLP OPUS-MT project and vonjack's Safetensors conversion) for convenience. No weights or training code have been modified; this copy only adds this model card.

Intended use

  • Translating text from a large set of source languages into English
  • Research and product prototyping for multilingual-to-English MT
  • Use with the Hugging Face transformers library (MarianMT)

Limitations

  • Output is English only; quality varies by source language and domain
  • Best used with clear, sentence-level input; very long or noisy text may degrade quality
  • No additional fine-tuning has been applied in this repository
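Because quality degrades on very long or multi-sentence input, one simple mitigation is to segment text into sentences before translating. The sketch below is a hypothetical, regex-based splitter for illustration only; production pipelines should use a proper segmenter that handles abbreviations and CJK punctuation.

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after ., !, or ? when followed by whitespace.
    # Hypothetical helper, not part of this repository.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sentences = split_sentences("Bonjour. Comment allez-vous ? Très bien !")
# Each sentence can then be prefixed with >>fra<< and translated separately.
```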

Usage

Install transformers (plus sentencepiece, which MarianTokenizer requires) and prepend the source-language prefix >>{lang}<< as required by Marian mul-eng models:

from transformers import MarianMTModel, MarianTokenizer

model_name = "aoiandroid/opus-mt-mul-en-big"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Example: Japanese (jpn) to English
src_text = ">>jpn<< ใ“ใ‚Œใฏใƒ†ใ‚นใƒˆใงใ™ใ€‚"
inputs = tokenizer(src_text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# This is a test.

# Example: French (fra) to English
src_text = ">>fra<< Bonjour, comment allez-vous ?"
inputs = tokenizer(src_text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Use the appropriate ISO 639-3 code (e.g. jpn, fra, deu, cmn) in the >>code<< prefix.
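The prefixing step can be wrapped in a small helper. `to_marian_input` below is a hypothetical convenience function, not part of the library; the model only requires that the input string begin with a >>code<< token the tokenizer recognizes (the tokenizer's `supported_language_codes` attribute lists them).

```python
def to_marian_input(text: str, lang_code: str) -> str:
    # Prepend the Marian source-language token,
    # e.g. ("Bonjour", "fra") -> ">>fra<< Bonjour".
    # Hypothetical helper for illustration only.
    if not lang_code or not lang_code.replace("_", "").isalnum():
        raise ValueError(f"invalid language code: {lang_code!r}")
    return f">>{lang_code}<< {text}"

print(to_marian_input("Guten Morgen!", "deu"))
# >>deu<< Guten Morgen!
```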

Benchmarks

Evaluated on Tatoeba-test-v2023-09-26.multi-eng:

testset                              BLEU   chr-F     #sent   #words   BP
Tatoeba-test-v2023-09-26.multi-eng   39.7   0.60108   10000   76940    1.000
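For reference, the BP column is BLEU's brevity penalty: it is 1 when the system output is at least as long as the reference, and exp(1 − ref_len/sys_len) otherwise, so a BP of 1.000 indicates the outputs were not shorter than the references overall. A minimal sketch of the standard formula:

```python
import math

def brevity_penalty(sys_len: int, ref_len: int) -> float:
    # BLEU's brevity penalty: no penalty when the output is at
    # least as long as the reference; otherwise an exponential
    # penalty on the length ratio.
    if sys_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / sys_len)

print(brevity_penalty(100, 90))   # 1.0 (output longer than reference)
print(brevity_penalty(90, 100))   # < 1.0 (output too short, penalized)
```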

License

Apache 2.0, the same license as the original OPUS-MT / vonjack model.

Citation

If you use this model or the OPUS-MT project, consider citing the original work:

Tiedemann, J. and Thottingal, S. (2020). OPUS-MT – Building open translation services for the World. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT).
