opus-mt-en-de

English to German translation model based on the MarianMT architecture.

Description

This is an English-to-German translation model trained on the OPUS corpus. It uses the MarianMT architecture, a Transformer encoder-decoder that originates from the Marian neural machine translation framework, which is designed for efficient training and inference.

The model was trained on parallel corpora from various sources including Europarl, Common Crawl, and other publicly available datasets. It achieves competitive BLEU scores on standard translation benchmarks.

Intended use

This model is intended for translating text from English to German. Primary use cases include:

Document Translation: Translating documents, articles, and web content from English to German
Communication Assistance: Helping users communicate in German by translating their English text
Content Localization: Adapting English content for German-speaking audiences
Research and Development: Using as a baseline for translation research or fine-tuning for specific domains

The model works best with standard written English and produces natural-sounding German output. It handles various text types including news, literature, and technical content.

Limitations

While this model provides good quality translations, users should be aware of the following limitations:

Domain Specificity: The model may not perform optimally on highly specialized domains (medical, legal, technical) without additional fine-tuning
Context Length: The model has a maximum sequence length of 512 tokens, which may limit translation of very long documents
Idiomatic Expressions: Some idioms and cultural references may not translate accurately
Named Entities: Proper nouns and named entities may not always be preserved correctly
Low-Resource Scenarios: Performance may degrade on rare words or uncommon sentence structures
No Real-time Learning: The model does not learn from user interactions or corrections

Users should always review automated translations, especially for critical applications.
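
One way to work around the 512-token context limit is to split long input into sentence groups that each fit the budget and translate them one at a time. The following is a minimal sketch, not part of the model's API: the helper name `split_into_chunks`, the `count_tokens` callback, and the naive regex sentence splitter are all assumptions made for illustration.

```python
import re
from typing import Callable, List


def split_into_chunks(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 512,
) -> List[str]:
    """Group sentences into chunks that stay within max_tokens where possible."""
    # Naive split after ., ! or ? followed by whitespace (an assumption;
    # a real sentence segmenter handles abbreviations and quotes better).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Note: a single sentence longer than max_tokens is kept as its own
    # chunk and would still need truncation or further splitting.
    return chunks
```

With the tokenizer from the How to use section, `count_tokens` could be `lambda s: len(tokenizer(s)["input_ids"])`, so that the budget is measured in the model's own tokens rather than in words. Splitting on sentence boundaries loses cross-sentence context between chunks, so the joined output should still be reviewed.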

How to use

Using Transformers

from transformers import MarianMTModel, MarianTokenizer

# Load model and tokenizer
model_name = "Jiaao/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Translate
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt", padding=True)
translated = model.generate(**inputs)
result = tokenizer.decode(translated[0], skip_special_tokens=True)
print(result)  # e.g. "Hallo, wie geht es Ihnen?" (exact output may vary)

Using Pipeline

from transformers import pipeline

translator = pipeline("translation", model="Jiaao/opus-mt-en-de")
result = translator("Hello, how are you?")
print(result[0]['translation_text'])

Batch Translation

texts = [
    "Good morning!",
    "Thank you for your help.",
    "Where is the nearest station?"
]

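# Reuses the tokenizer and model loaded in "Using Transformers"
inputs = tokenizer(texts, return_tensors="pt", padding=True)
translated = model.generate(**inputs)
results = tokenizer.batch_decode(translated, skip_special_tokens=True)

# Print each source sentence next to its translation
for source, target in zip(texts, results):
    print(f"{source} -> {target}")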
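
For lists much larger than the three sentences above, passing everything to `generate` in one call can exhaust memory, so it may help to translate in fixed-size mini-batches. The sketch below is an illustration only: `iter_batches` and the default `batch_size` of 8 are hypothetical choices, not something the model card prescribes.

```python
from typing import Iterator, List, Sequence


def iter_batches(texts: Sequence[str], batch_size: int = 8) -> Iterator[List[str]]:
    """Yield consecutive slices of texts with at most batch_size items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


# Hypothetical usage with the tokenizer and model loaded earlier:
# results = []
# for batch in iter_batches(texts, batch_size=8):
#     inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
#     translated = model.generate(**inputs)
#     results.extend(tokenizer.batch_decode(translated, skip_special_tokens=True))
```

A suitable batch size depends on available memory and sentence length; because padding extends every item to the longest one in its batch, grouping sentences of similar length tends to waste less computation.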

Training Data

The model was trained on the OPUS corpus, which includes:

Europarl parallel corpus
Common Crawl corpus
OpenSubtitles corpus
Various other publicly available parallel datasets

License

This model is released under the Apache-2.0 license.

Citation

@inproceedings{tiedemann-2012-parallel,
  title = "Parallel Data, Tools and Interfaces in OPUS",
  author = "Tiedemann, J{\"o}rg",
  booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)",
  year = "2012"
}
