# opus-mt-mul-en-big
Multilingual-to-English neural machine translation model (Transformer Big). Translates from many source languages into English.
## Model description
- Architecture: Marian NMT (Transformer Big)
- Direction: Many languages (mul) to English (eng)
- Parameters: ~0.3B
- Format: Safetensors (F16)
- Preprocessing: Normalization + SentencePiece (spm32k)
- License: Apache 2.0
Supported source languages include widely used ones such as Japanese (jpn), Chinese (cmn), French (fra), German (deu), Spanish (spa), Russian (rus), Arabic (ara), and many more (hundreds of language/variant codes in total).
## Source and origin
This repository is a mirror of the original model, re-hosted for convenience. Credit and origin:
- Direct source (Hugging Face): vonjack/opus-mt-mul-en-big
- Original project: OPUS-MT by the Helsinki-NLP group (University of Helsinki, Language Technology Research Group)
- Training: Models are trained with the OPUS-MT-train pipeline using OPUS and Tatoeba data
- Original weights: Built from the Tatoeba-MT release; the underlying artifact is:
  - Dataset: opusTCv20230926max50+bt+jhubc
  - Model: transformer-big
  - Release: opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17
  - Download (CSC): opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.zip
  - Test set: test translations, eval scores
No weights or training code have been modified; this is a re-hosted copy with this model card added.
## Intended use
- Translating text from a large set of source languages into English
- Research and product prototyping for multilingual-to-English MT
- Use with the Hugging Face `transformers` library (MarianMT)
## Limitations
- Output is English only; quality varies by source language and domain
- Best used with clear, sentence-level input; very long or noisy text may degrade quality
- No additional fine-tuning has been applied in this repository
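Since quality degrades on very long or noisy input, a simple mitigation is to segment text into sentences before translating. A minimal sketch using a naive regex splitter (the helper name and pattern are illustrative, not part of the model; real pipelines should use a proper segmenter for the source language):

```python
import re

def split_sentences(text):
    """Naively split text into sentences on ., !, or ? followed by whitespace.
    Illustrative only; use a language-aware segmenter in production."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

chunks = split_sentences("Bonjour. Comment allez-vous ? Merci !")
# Each chunk can then be prefixed with >>lang<< and translated independently.
```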
## Usage
Install `transformers` and prepend the source-language token `>>{lang}<<`, as required by Marian mul-en models:
```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "aoiandroid/opus-mt-mul-en-big"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Example: Japanese (jpn) to English
src_text = ">>jpn<< これはテストです。"
inputs = tokenizer(src_text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# This is a test.

# Example: French (fra) to English
src_text = ">>fra<< Bonjour, comment allez-vous ?"
inputs = tokenizer(src_text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Use the appropriate ISO 639-3 code (e.g. `jpn`, `fra`, `deu`, `cmn`) in the `>>code<<` prefix.
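The prefix can also be added programmatically when translating batches, including batches that mix source languages. A minimal sketch (the helper name is illustrative, not part of the library):

```python
def add_lang_prefix(text, lang):
    """Prepend the Marian mul-en source-language token, e.g. >>jpn<<."""
    return f">>{lang}<< {text}"

batch = [
    add_lang_prefix("Bonjour, comment allez-vous ?", "fra"),
    add_lang_prefix("Guten Morgen!", "deu"),
]
# batch can then be passed to tokenizer(batch, return_tensors="pt", padding=True)
# and model.generate(...) for batched translation.
```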
## Benchmarks
Evaluated on Tatoeba-test-v2023-09-26.multi-eng:
| testset | BLEU | chr-F | #sent | #words | BP |
|---|---|---|---|---|---|
| Tatoeba-test-v2023-09-26.multi-eng | 39.7 | 0.60108 | 10000 | 76940 | 1.000 |
## License
Apache 2.0, the same license as the original OPUS-MT / vonjack model.
## Citation
If you use this model or the OPUS-MT project, consider citing the original work:
- OPUS-MT: Helsinki-NLP/Opus-MT
- This copy: aoiandroid/opus-mt-mul-en-big (from vonjack/opus-mt-mul-en-big)