# opus-mt-mul-en-big
Multilingual-to-English neural machine translation model (Transformer Big). Translates from many source languages into English.
## Model description
- Architecture: Marian NMT (Transformer Big)
- Direction: Many languages (mul) to English (eng)
- Parameters: ~0.3B
- Format: Safetensors (F16)
- Preprocessing: Normalization + SentencePiece (spm32k)
- License: Apache 2.0
Supported source languages include widely used ones such as Japanese (jpn), Chinese (cmn), French (fra), German (deu), Spanish (spa), Russian (rus), Arabic (ara), and many more (hundreds of language/variant codes in total).
## Source and origin
This repository is a mirror of the original model, re-hosted for convenience. Credit and origin:
- Direct source (Hugging Face): vonjack/opus-mt-mul-en-big
- Original project: OPUS-MT by the Helsinki-NLP group (University of Helsinki, Language Technology Research Group)
- Training: Models are trained with the OPUS-MT-train pipeline using OPUS and Tatoeba data
- Original weights: Built from the Tatoeba-MT release; the underlying artifact is:
  - Dataset: opusTCv20230926max50+bt+jhubc
  - Model: transformer-big
  - Release: opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17
  - Download (CSC): opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.zip
  - Test set: test translations, eval scores
No weights or training code have been modified; this is a re-hosted copy with this model card added.
## Intended use
- Translating text from a large set of source languages into English
- Research and product prototyping for multilingual-to-English MT
- Use with the Hugging Face `transformers` library (MarianMT)
## Limitations
- Output is English only; quality varies by source language and domain
- Best used with clear, sentence-level input; very long or noisy text may degrade quality
- No additional fine-tuning has been applied in this repository
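Since quality degrades on very long or noisy input, a simple mitigation is to segment text into sentences before translating. A minimal sketch using a naive regex splitter (the helper name and pattern are illustrative, not part of the model; real pipelines should use a proper segmenter for the source language):

```python
import re

def split_sentences(text):
    """Naively split text into sentences on ., !, or ? followed by whitespace.
    Illustrative only; use a language-aware segmenter in production."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

chunks = split_sentences("Bonjour. Comment allez-vous ? Merci !")
# Each chunk can then be prefixed with >>lang<< and translated independently.
```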
## Usage
Install `transformers` and prepend the source-language token `>>{lang}<<`, as required by Marian mul-en models:
```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "aoiandroid/opus-mt-mul-en-big"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Example: Japanese (jpn) to English
src_text = ">>jpn<< これはテストです。"
inputs = tokenizer(src_text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# This is a test.

# Example: French (fra) to English
src_text = ">>fra<< Bonjour, comment allez-vous ?"
inputs = tokenizer(src_text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Use the appropriate ISO 639-3 code (e.g. `jpn`, `fra`, `deu`, `cmn`) in the `>>code<<` prefix.
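The prefix can also be added programmatically when translating batches, including batches that mix source languages. A minimal sketch (the helper name is illustrative, not part of the library):

```python
def add_lang_prefix(text, lang):
    """Prepend the Marian mul-en source-language token, e.g. >>jpn<<."""
    return f">>{lang}<< {text}"

batch = [
    add_lang_prefix("Bonjour, comment allez-vous ?", "fra"),
    add_lang_prefix("Guten Morgen!", "deu"),
]
# batch can then be passed to tokenizer(batch, return_tensors="pt", padding=True)
# and model.generate(...) for batched translation.
```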
## Benchmarks
Evaluated on Tatoeba-test-v2023-09-26.multi-eng:
| testset | BLEU | chr-F | #sent | #words | BP |
|---|---|---|---|---|---|
| Tatoeba-test-v2023-09-26.multi-eng | 39.7 | 0.60108 | 10000 | 76940 | 1.000 |
## License
Apache 2.0, the same license as the original OPUS-MT / vonjack model.
## Citation
If you use this model or the OPUS-MT project, consider citing the original work:
- OPUS-MT: Helsinki-NLP/Opus-MT
- This copy: aoiandroid/opus-mt-mul-en-big (from vonjack/opus-mt-mul-en-big)