<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
# MarianMT[[MarianMT]]

<div class="flex flex-wrap space-x-1">
<a href="https://huggingface.co/models?filter=marian">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-marian-blueviolet">
</a>
<a href="https://huggingface.co/spaces/docs-demos/opus-mt-zh-en">
<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
</a>
</div>

## Overview[[Overview]]

MarianMT is a framework for translation models that uses the same model architecture as BART. Translations should be similar to the test set output linked in each model card, but they may not match it exactly. This model was contributed by [sshleifer](https://huggingface.co/sshleifer).
## Implementation Notes[[Implementation Notes]]

- Each model takes up about 298 MB on disk, and more than 1,000 models are available.
- The list of supported language pairs can be found [here](https://huggingface.co/Helsinki-NLP).
- The models were originally trained by [Jörg Tiedemann](https://researchportal.helsinki.fi/en/persons/j%C3%B6rg-tiedemann) using the [Marian](https://marian-nmt.github.io/) C++ library, which supports fast training and translation.
- All models are Transformer encoder-decoders with 6 layers in each component. Each model's performance is documented in its model card.
- The 80 OPUS models that require BPE preprocessing are not supported.
- The modeling code is based on [`BartForConditionalGeneration`] with a few minor modifications:
  - static (sinusoidal) position embeddings (`MarianConfig.static_position_embeddings=True`)
  - no embedding layer normalization (`MarianConfig.normalize_embedding=False`)
  - the model starts generating with `pad_token_id` (whose token embedding is 0) as the prefix (Bart uses `<s/>`)
- Code to bulk convert Marian models to PyTorch can be found in `convert_marian_to_pytorch.py`.
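To make the note on static position embeddings more concrete, here is a minimal pure-Python sketch of a sinusoidal position table. The function name `sinusoidal_positions` and the layout (sine features in the first half of each vector, cosine features in the second half) are assumptions made for illustration. This is not a copy of the library's implementation.

```python
import math


def sinusoidal_positions(n_positions, dim):
    """Build a fixed (non-learned) table of position vectors.

    Hypothetical helper: `dim` is assumed to be even. Each row holds
    `dim // 2` sine features followed by `dim // 2` cosine features.
    """
    half = dim // 2
    table = []
    for pos in range(n_positions):
        # One angle per frequency; higher indices give lower frequencies.
        angles = [pos / (10000 ** (2 * i / dim)) for i in range(half)]
        row = [math.sin(a) for a in angles] + [math.cos(a) for a in angles]
        table.append(row)
    return table


table = sinusoidal_positions(n_positions=4, dim=8)
print(len(table), len(table[0]))  # 4 8
print(table[0])  # position 0: all sines are 0.0, all cosines are 1.0
```

Because the table depends only on the position index and the dimension, it can be computed once and never updated during training, which is what "static" refers to here.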
## Naming[[Naming]]

- All model names use the format `Helsinki-NLP/opus-mt-{src}-{tgt}`.
- The language codes used to name models are inconsistent. Two-letter codes can usually be found [here](https://developers.google.com/admin-sdk/directory/v1/languages), while three-letter codes usually require a web search for "language code {code}".
- Codes formatted like `es_AR` follow the pattern `code_{region}`. This example means Spanish from Argentina.
- The models were converted in two stages. The first 1,000 models use ISO-639-2 codes to identify languages, and the second group uses a combination of ISO-639-5 and ISO-639-2 codes.
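The naming rules above can be sketched with two small helpers. Both `opus_mt_name` and `split_language_code` are hypothetical names introduced for this illustration; they only do string formatting and do not check that a model exists on the Hub.

```python
ORG = "Helsinki-NLP"


def opus_mt_name(src, tgt):
    """Format a checkpoint name following `Helsinki-NLP/opus-mt-{src}-{tgt}`."""
    return f"{ORG}/opus-mt-{src}-{tgt}"


def split_language_code(code):
    """Split a `code_{region}` style code into (language, region).

    The region is None when the code has no underscore.
    """
    language, _, region = code.partition("_")
    return language, region or None


print(opus_mt_name("en", "de"))      # Helsinki-NLP/opus-mt-en-de
print(split_language_code("es_AR"))  # ('es', 'AR')
print(split_language_code("fr"))     # ('fr', None)
```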
## Examples[[Examples]]

- Marian models are smaller than many other translation models in the library, which makes them useful for fine-tuning experiments and integration tests.
- [Fine-tune on GPU](https://github.com/huggingface/transformers/blob/master/examples/legacy/seq2seq/train_distil_marian_enro.sh)
## Multilingual Models[[Multilingual Models]]

- All model names use the format `Helsinki-NLP/opus-mt-{src}-{tgt}`.
- If a model can output multiple languages, you must specify the desired output language by prepending its language code to the `src_text`.
- You can find the list of supported language codes in the model card, for example in [opus-mt-en-roa](https://huggingface.co/Helsinki-NLP/opus-mt-en-roa).
- If a model is multilingual only on the source side, like `Helsinki-NLP/opus-mt-roa-en`, no language code is required.

New multilingual models from the [Tatoeba-Challenge repository](https://github.com/Helsinki-NLP/Tatoeba-Challenge) use 3-character language codes:
```python
>>> from transformers import MarianMTModel, MarianTokenizer

>>> src_text = [
...     ">>fra<< this is a sentence in english that we want to translate to french",
...     ">>por<< This should go to portuguese",
...     ">>esp<< And this to Spanish",
... ]

>>> model_name = "Helsinki-NLP/opus-mt-en-roa"
>>> tokenizer = MarianTokenizer.from_pretrained(model_name)
>>> print(tokenizer.supported_language_codes)
['>>zlm_Latn<<', '>>mfe<<', '>>hat<<', '>>pap<<', '>>ast<<', '>>cat<<', '>>ind<<', '>>glg<<', '>>wln<<', '>>spa<<', '>>fra<<', '>>ron<<', '>>por<<', '>>ita<<', '>>oci<<', '>>arg<<', '>>min<<']

>>> model = MarianMTModel.from_pretrained(model_name)
>>> translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
>>> [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
["c'est une phrase en anglais que nous voulons traduire en français",
 'Isto deve ir para o português.',
 'Y esto al español']
```
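The prefixing convention used in the example above is plain string formatting, so it can be factored into a small helper when the same batch needs to target different languages. The name `with_target_language` is hypothetical, and the helper does not verify that the code is listed in `tokenizer.supported_language_codes`.

```python
def with_target_language(texts, code):
    """Prepend the `>>code<<` target-language token to every source sentence."""
    return [f">>{code}<< {text}" for text in texts]


batch = with_target_language(["This should go to portuguese"], "por")
print(batch)  # ['>>por<< This should go to portuguese']
```

The resulting list can be passed to the tokenizer in place of a hand-written `src_text`.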
Here is the code to list all the pretrained models available on the Hub:

```python
from huggingface_hub import list_models

org = "Helsinki-NLP"
# Restrict the query to one organization instead of iterating over the whole Hub.
model_list = list_models(author=org)
model_ids = [x.id for x in model_list if x.id.startswith(org)]
# Keep only the part after "Helsinki-NLP/".
suffix = [x.split("/")[1] for x in model_ids]
# Old-style multilingual models have uppercase group names in their suffix.
old_style_multi_models = [f"{org}/{s}" for s in suffix if s != s.lower()]
```
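The last line of the snippet above relies on a simple rule: a name counts as old-style when its suffix contains uppercase characters. The sketch below applies the same rule to a short hard-coded list so it can be tried without network access. The helper name `is_old_style` and the sample list are introduced here for illustration only.

```python
def is_old_style(model_id):
    """Return True when the part after the organization has uppercase letters."""
    suffix = model_id.split("/")[1]
    return suffix != suffix.lower()


sample_ids = [
    "Helsinki-NLP/opus-mt-en-ROMANCE",
    "Helsinki-NLP/opus-mt-en-roa",
    "Helsinki-NLP/opus-mt-de-ZH",
]
old_style = [m for m in sample_ids if is_old_style(m)]
print(old_style)
# ['Helsinki-NLP/opus-mt-en-ROMANCE', 'Helsinki-NLP/opus-mt-de-ZH']
```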
## Old Style Multi-Lingual Models[[Old Style Multi-Lingual Models]]

These are the old-style multilingual models ported from the OPUS-MT-Train repository, followed by the languages that belong to each language group:
```python no-style
['Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU',
 'Helsinki-NLP/opus-mt-ROMANCE-en',
 'Helsinki-NLP/opus-mt-SCANDINAVIA-SCANDINAVIA',
 'Helsinki-NLP/opus-mt-de-ZH',
 'Helsinki-NLP/opus-mt-en-CELTIC',
 'Helsinki-NLP/opus-mt-en-ROMANCE',
 'Helsinki-NLP/opus-mt-es-NORWAY',
 'Helsinki-NLP/opus-mt-fi-NORWAY',
 'Helsinki-NLP/opus-mt-fi-ZH',
 'Helsinki-NLP/opus-mt-fi_nb_no_nn_ru_sv_en-SAMI',
 'Helsinki-NLP/opus-mt-sv-NORWAY',
 'Helsinki-NLP/opus-mt-sv-ZH']
GROUP_MEMBERS = {
 'ZH': ['cmn', 'cn', 'yue', 'ze_zh', 'zh_cn', 'zh_CN', 'zh_HK', 'zh_tw', 'zh_TW', 'zh_yue', 'zhs', 'zht', 'zh'],
 'ROMANCE': ['fr', 'fr_BE', 'fr_CA', 'fr_FR', 'wa', 'frp', 'oc', 'ca', 'rm', 'lld', 'fur', 'lij', 'lmo', 'es', 'es_AR', 'es_CL', 'es_CO', 'es_CR', 'es_DO', 'es_EC', 'es_ES', 'es_GT', 'es_HN', 'es_MX', 'es_NI', 'es_PA', 'es_PE', 'es_PR', 'es_SV', 'es_UY', 'es_VE', 'pt', 'pt_br', 'pt_BR', 'pt_PT', 'gl', 'lad', 'an', 'mwl', 'it', 'it_IT', 'co', 'nap', 'scn', 'vec', 'sc', 'ro', 'la'],
 'NORTH_EU': ['de', 'nl', 'fy', 'af', 'da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
 'SCANDINAVIA': ['da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
 'SAMI': ['se', 'sma', 'smj', 'smn', 'sms'],
 'NORWAY': ['nb_NO', 'nb', 'nn_NO', 'nn', 'nog', 'no_nb', 'no'],
 'CELTIC': ['ga', 'cy', 'br', 'gd', 'kw', 'gv']
}
```
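A mapping like `GROUP_MEMBERS` can be queried to find which old-style groups cover a given language code. The sketch below copies four of the groups listed above into a local dictionary so it runs on its own. The function name `groups_containing` and the variable `GROUPS_SUBSET` are hypothetical and exist only for this illustration.

```python
# Subset of the group table shown above, copied verbatim for a standalone example.
GROUPS_SUBSET = {
    "NORTH_EU": ["de", "nl", "fy", "af", "da", "fo", "is", "no", "nb", "nn", "sv"],
    "SCANDINAVIA": ["da", "fo", "is", "no", "nb", "nn", "sv"],
    "NORWAY": ["nb_NO", "nb", "nn_NO", "nn", "nog", "no_nb", "no"],
    "CELTIC": ["ga", "cy", "br", "gd", "kw", "gv"],
}


def groups_containing(code, groups=GROUPS_SUBSET):
    """Return the names of all groups whose member list includes `code`."""
    return [name for name, members in groups.items() if code in members]


print(groups_containing("nb"))  # ['NORTH_EU', 'SCANDINAVIA', 'NORWAY']
print(groups_containing("cy"))  # ['CELTIC']
print(groups_containing("fr"))  # [] (ROMANCE is not part of this subset)
```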
The following example translates English into several Romance languages, using the old-style 2-character language codes:
```python
>>> from transformers import MarianMTModel, MarianTokenizer

>>> src_text = [
...     ">>fr<< this is a sentence in english that we want to translate to french",
...     ">>pt<< This should go to portuguese",
...     ">>es<< And this to Spanish",
... ]

>>> model_name = "Helsinki-NLP/opus-mt-en-ROMANCE"
>>> tokenizer = MarianTokenizer.from_pretrained(model_name)

>>> model = MarianMTModel.from_pretrained(model_name)
>>> translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
>>> [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
["c'est une phrase en anglais que nous voulons traduire en français",
 'Isto deve ir para o português.',
 'Y esto al español']
```
## Resources[[Resources]]

- [Translation task guide](../tasks/translation)
- [Summarization task guide](../tasks/summarization)
- [Language modeling task guide](../tasks/language_modeling)
## MarianConfig

[[autodoc]] MarianConfig

## MarianTokenizer

[[autodoc]] MarianTokenizer
    - build_inputs_with_special_tokens

## MarianModel

[[autodoc]] MarianModel
    - forward

## MarianMTModel

[[autodoc]] MarianMTModel
    - forward

## MarianForCausalLM

[[autodoc]] MarianForCausalLM
    - forward