
MarianMT[[MarianMT]]

๊ฐœ์š”[[Overview]]

BART์™€ ๋™์ผํ•œ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜๋Š” ๋ฒˆ์—ญ ๋ชจ๋ธ ํ”„๋ ˆ์ž„์›Œํฌ์ž…๋‹ˆ๋‹ค. ๋ฒˆ์—ญ ๊ฒฐ๊ณผ๋Š” ๊ฐ ๋ชจ๋ธ ์นด๋“œ์˜ ํ…Œ์ŠคํŠธ ์„ธํŠธ์™€ ์œ ์‚ฌํ•˜์ง€๋งŒ, ์ •ํ™•ํžˆ ์ผ์น˜ํ•˜์ง€๋Š” ์•Š์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๋ชจ๋ธ์€ sshleifer๊ฐ€ ์ œ๊ณตํ–ˆ์Šต๋‹ˆ๋‹ค.

๊ตฌํ˜„ ๋…ธํŠธ[[Implementation Notes]]

  • ๊ฐ ๋ชจ๋ธ์€ ์•ฝ 298 MB๋ฅผ ์ฐจ์ง€ํ•˜๋ฉฐ, 1,000๊ฐœ ์ด์ƒ์˜ ๋ชจ๋ธ์ด ์ œ๊ณต๋ฉ๋‹ˆ๋‹ค.

  • ์ง€์›๋˜๋Š” ์–ธ์–ด ์Œ ๋ชฉ๋ก์€ ์—ฌ๊ธฐ์—์„œ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  • ๋ชจ๋ธ๋“ค์€ Jรถrg Tiedemann์— ์˜ํ•ด Marian C++ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์ด์šฉํ•˜์—ฌ ํ•™์Šต๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์ด ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋Š” ๋น ๋ฅธ ํ•™์Šต๊ณผ ๋ฒˆ์—ญ์„ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค.

  • ๋ชจ๋“  ๋ชจ๋ธ์€ 6๊ฐœ ๋ ˆ์ด์–ด๋กœ ์ด๋ฃจ์–ด์ง„ Transformer ๊ธฐ๋ฐ˜์˜ ์ธ์ฝ”๋”-๋””์ฝ”๋” ๊ตฌ์กฐ์ž…๋‹ˆ๋‹ค. ๊ฐ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์€ ๋ชจ๋ธ ์นด๋“œ์— ๊ธฐ์ž…๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

  • BPE ์ „์ฒ˜๋ฆฌ๊ฐ€ ํ•„์š”ํ•œ 80๊ฐœ์˜ OPUS ๋ชจ๋ธ์€ ์ง€์›๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

  • ๋ชจ๋ธ๋ง ์ฝ”๋“œ๋Š” [BartForConditionalGeneration]์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋ฉฐ, ์ผ๋ถ€ ์ˆ˜์ •์‚ฌํ•ญ์ด ๋ฐ˜์˜๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค:

    • ์ •์  (์‚ฌ์ธ ํ•จ์ˆ˜ ๊ธฐ๋ฐ˜) ์œ„์น˜ ์ž„๋ฒ ๋”ฉ ์‚ฌ์šฉ (MarianConfig.static_position_embeddings=True)
    • ์ž„๋ฒ ๋”ฉ ๋ ˆ์ด์–ด ์ •๊ทœํ™” ์ƒ๋žต (MarianConfig.normalize_embedding=False)
    • ๋ชจ๋ธ์€ ์ƒ์„ฑ ์‹œ ํ”„๋ฆฌํ”ฝ์Šค๋กœ pad_token_id (ํ•ด๋‹น ํ† ํฐ ์ž„๋ฒ ๋”ฉ ๊ฐ’์€ 0)๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค (Bart๋Š” <s/>๋ฅผ ์‚ฌ์šฉ),
  • Marian ๋ชจ๋ธ์„ PyTorch๋กœ ๋Œ€๋Ÿ‰ ๋ณ€ํ™˜ํ•˜๋Š” ์ฝ”๋“œ๋Š” convert_marian_to_pytorch.py์—์„œ ์ฐพ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๋ชจ๋ธ ์ด๋ฆ„ ๊ทœ์น™[[Naming]]

  • ๋ชจ๋“  ๋ชจ๋ธ ์ด๋ฆ„์€ Helsinki-NLP/opus-mt-{src}-{tgt} ํ˜•์‹์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค.
  • ๋ชจ๋ธ์˜ ์–ธ์–ด ์ฝ”๋“œ ํ‘œ๊ธฐ๋Š” ์ผ๊ด€๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ๋‘ ์ž๋ฆฌ ์ฝ”๋“œ๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ ์—ฌ๊ธฐ์—์„œ ์ฐพ์„ ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์„ธ ์ž๋ฆฌ ์ฝ”๋“œ๋Š” "์–ธ์–ด ์ฝ”๋“œ {code}"๋กœ ๊ตฌ๊ธ€ ๊ฒ€์ƒ‰์„ ํ†ตํ•ด ์ฐพ์Šต๋‹ˆ๋‹ค.
  • es_AR๊ณผ ๊ฐ™์€ ํ˜•ํƒœ์˜ ์ฝ”๋“œ๋Š” code_{region} ํ˜•์‹์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ์˜ ์˜ˆ์‹œ๋Š” ์•„๋ฅดํ—จํ‹ฐ๋‚˜์˜ ์ŠคํŽ˜์ธ์–ด๋ฅผ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.
  • ๋ชจ๋ธ ๋ณ€ํ™˜์€ ๋‘ ๋‹จ๊ณ„๋กœ ์ด๋ฃจ์–ด์กŒ์Šต๋‹ˆ๋‹ค. ์ฒ˜์Œ 1,000๊ฐœ ๋ชจ๋ธ์€ ISO-639-2 ์ฝ”๋“œ๋ฅผ ์‚ฌ์šฉํ•˜๊ณ , ๋‘ ๋ฒˆ์งธ ๊ทธ๋ฃน์€ ISO-639-5์™€ ISO-639-2 ์ฝ”๋“œ๋ฅผ ์กฐํ•ฉํ•˜์—ฌ ์–ธ์–ด๋ฅผ ์‹๋ณ„ํ•ฉ๋‹ˆ๋‹ค.

์˜ˆ์‹œ[[Examples]]

  • Marian ๋ชจ๋ธ์€ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์˜ ๋‹ค๋ฅธ ๋ฒˆ์—ญ ๋ชจ๋ธ๋“ค๋ณด๋‹ค ํฌ๊ธฐ๊ฐ€ ์ž‘์•„ ํŒŒ์ธํŠœ๋‹ ์‹คํ—˜๊ณผ ํ†ตํ•ฉ ํ…Œ์ŠคํŠธ์— ์œ ์šฉํ•ฉ๋‹ˆ๋‹ค.
  • GPU์—์„œ ํŒŒ์ธํŠœ๋‹ํ•˜๊ธฐ

๋‹ค๊ตญ์–ด ๋ชจ๋ธ ์‚ฌ์šฉ๋ฒ•[[Multilingual Models]]

  • ๋ชจ๋“  ๋ชจ๋ธ ์ด๋ฆ„์€Helsinki-NLP/opus-mt-{src}-{tgt} ํ˜•์‹์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค.
  • ๋‹ค์ค‘ ์–ธ์–ด ์ถœ๋ ฅ์„ ์ง€์›ํ•˜๋Š” ๋ชจ๋ธ์˜ ๊ฒฝ์šฐ, ์ถœ๋ ฅ์„ ์›ํ•˜๋Š” ์–ธ์–ด์˜ ์–ธ์–ด ์ฝ”๋“œ๋ฅผ src_text์˜ ์‹œ์ž‘ ๋ถ€๋ถ„์— ์ถ”๊ฐ€ํ•˜์—ฌ ์ง€์ •ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
  • ๋ชจ๋ธ ์นด๋“œ์—์„œ ์ง€์›๋˜๋Š” ์–ธ์–ด ์ฝ”๋“œ์˜ ๋ชฉ๋ก์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค! ์˜ˆ๋ฅผ ๋“ค์–ด opus-mt-en-roa์—์„œ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • Helsinki-NLP/opus-mt-roa-en์ฒ˜๋Ÿผ ์†Œ์Šค ์ธก์—์„œ๋งŒ ๋‹ค๊ตญ์–ด๋ฅผ ์ง€์›ํ•˜๋Š” ๋ชจ๋ธ์˜ ๊ฒฝ์šฐ, ๋ณ„๋„์˜ ์–ธ์–ด ์ฝ”๋“œ ์ง€์ •์ด ํ•„์š”ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

New multilingual models from the Tatoeba-Challenge repository use 3-character language codes:

>>> from transformers import MarianMTModel, MarianTokenizer

>>> src_text = [
...     ">>fra<< this is a sentence in english that we want to translate to french",
...     ">>por<< This should go to portuguese",
...     ">>esp<< And this to Spanish",
... ]

>>> model_name = "Helsinki-NLP/opus-mt-en-roa"
>>> tokenizer = MarianTokenizer.from_pretrained(model_name)
>>> print(tokenizer.supported_language_codes)
['>>zlm_Latn<<', '>>mfe<<', '>>hat<<', '>>pap<<', '>>ast<<', '>>cat<<', '>>ind<<', '>>glg<<', '>>wln<<', '>>spa<<', '>>fra<<', '>>ron<<', '>>por<<', '>>ita<<', '>>oci<<', '>>arg<<', '>>min<<']

>>> model = MarianMTModel.from_pretrained(model_name)
>>> translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
>>> [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
["c'est une phrase en anglais que nous voulons traduire en franรงais",
 'Isto deve ir para o portuguรชs.',
 'Y esto al espaรฑol']

ํ—ˆ๋ธŒ์— ์žˆ๋Š” ๋ชจ๋“  ์‚ฌ์ „ ํ•™์Šต๋œ ๋ชจ๋ธ์„ ํ™•์ธํ•˜๋Š” ์ฝ”๋“œ์ž…๋‹ˆ๋‹ค:

from huggingface_hub import list_models

model_list = list_models()
org = "Helsinki-NLP"
model_ids = [x.id for x in model_list if x.id.startswith(org)]
suffix = [x.split("/")[1] for x in model_ids]
old_style_multi_models = [f"{org}/{s}" for s in suffix if s != s.lower()]

๊ตฌํ˜• ๋‹ค๊ตญ์–ด ๋ชจ๋ธ[[Old Style Multi-Lingual Models]]

์ด ๋ชจ๋ธ๋“ค์€ OPUS-MT-Train ๋ฆฌํฌ์ง€ํ† ๋ฆฌ์˜ ๊ตฌํ˜• ๋‹ค๊ตญ์–ด ๋ชจ๋ธ๋“ค์ž…๋‹ˆ๋‹ค. ๊ฐ ์–ธ์–ด ๊ทธ๋ฃน์— ํฌํ•จ๋œ ์–ธ์–ด๋“ค์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

['Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU',
 'Helsinki-NLP/opus-mt-ROMANCE-en',
 'Helsinki-NLP/opus-mt-SCANDINAVIA-SCANDINAVIA',
 'Helsinki-NLP/opus-mt-de-ZH',
 'Helsinki-NLP/opus-mt-en-CELTIC',
 'Helsinki-NLP/opus-mt-en-ROMANCE',
 'Helsinki-NLP/opus-mt-es-NORWAY',
 'Helsinki-NLP/opus-mt-fi-NORWAY',
 'Helsinki-NLP/opus-mt-fi-ZH',
 'Helsinki-NLP/opus-mt-fi_nb_no_nn_ru_sv_en-SAMI',
 'Helsinki-NLP/opus-mt-sv-NORWAY',
 'Helsinki-NLP/opus-mt-sv-ZH']
GROUP_MEMBERS = {
 'ZH': ['cmn', 'cn', 'yue', 'ze_zh', 'zh_cn', 'zh_CN', 'zh_HK', 'zh_tw', 'zh_TW', 'zh_yue', 'zhs', 'zht', 'zh'],
 'ROMANCE': ['fr', 'fr_BE', 'fr_CA', 'fr_FR', 'wa', 'frp', 'oc', 'ca', 'rm', 'lld', 'fur', 'lij', 'lmo', 'es', 'es_AR', 'es_CL', 'es_CO', 'es_CR', 'es_DO', 'es_EC', 'es_ES', 'es_GT', 'es_HN', 'es_MX', 'es_NI', 'es_PA', 'es_PE', 'es_PR', 'es_SV', 'es_UY', 'es_VE', 'pt', 'pt_br', 'pt_BR', 'pt_PT', 'gl', 'lad', 'an', 'mwl', 'it', 'it_IT', 'co', 'nap', 'scn', 'vec', 'sc', 'ro', 'la'],
 'NORTH_EU': ['de', 'nl', 'fy', 'af', 'da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
 'SCANDINAVIA': ['da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
 'SAMI': ['se', 'sma', 'smj', 'smn', 'sms'],
 'NORWAY': ['nb_NO', 'nb', 'nn_NO', 'nn', 'nog', 'no_nb', 'no'],
 'CELTIC': ['ga', 'cy', 'br', 'gd', 'kw', 'gv']
}
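Given the GROUP_MEMBERS mapping above, finding which group(s) contain a particular language code is straightforward. The helper below is a hypothetical convenience function, shown with an abbreviated copy of the mapping:

```python
# Abbreviated copy of the GROUP_MEMBERS mapping shown above.
GROUP_MEMBERS = {
    "NORTH_EU": ["de", "nl", "fy", "af", "da", "fo", "is", "no", "nb", "nn", "sv"],
    "SCANDINAVIA": ["da", "fo", "is", "no", "nb", "nn", "sv"],
    "CELTIC": ["ga", "cy", "br", "gd", "kw", "gv"],
}

def groups_for(code: str) -> list[str]:
    """Return every old-style group whose member list contains `code`."""
    return [group for group, members in GROUP_MEMBERS.items() if code in members]

print(groups_for("sv"))  # ['NORTH_EU', 'SCANDINAVIA']
print(groups_for("ga"))  # ['CELTIC']
```

Note that a code can appear in more than one group (the Scandinavian languages are also in NORTH_EU, for instance), so the helper returns a list rather than a single name.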

์˜์–ด๋ฅผ ์—ฌ๋Ÿฌ ๋กœ๋ง์Šค ์–ธ์–ด๋กœ ๋ฒˆ์—ญํ•˜๋Š” ์˜ˆ์ œ์ž…๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ๋Š” ๊ตฌํ˜• 2์ž๋ฆฌ ์–ธ์–ด ์ฝ”๋“œ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

>>> from transformers import MarianMTModel, MarianTokenizer

>>> src_text = [
...     ">>fr<< this is a sentence in english that we want to translate to french",
...     ">>pt<< This should go to portuguese",
...     ">>es<< And this to Spanish",
... ]

>>> model_name = "Helsinki-NLP/opus-mt-en-ROMANCE"
>>> tokenizer = MarianTokenizer.from_pretrained(model_name)

>>> model = MarianMTModel.from_pretrained(model_name)
>>> translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
>>> tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
["c'est une phrase en anglais que nous voulons traduire en franรงais", 
 'Isto deve ir para o portuguรชs.',
 'Y esto al espaรฑol']

์ž๋ฃŒ[[Resources]]

MarianConfig

[[autodoc]] MarianConfig

MarianTokenizer

[[autodoc]] MarianTokenizer
    - build_inputs_with_special_tokens

MarianModel

[[autodoc]] MarianModel
    - forward

MarianMTModel

[[autodoc]] MarianMTModel
    - forward

MarianForCausalLM

[[autodoc]] MarianForCausalLM
    - forward

TFMarianModel

[[autodoc]] TFMarianModel
    - call

TFMarianMTModel

[[autodoc]] TFMarianMTModel
    - call

FlaxMarianModel

[[autodoc]] FlaxMarianModel
    - __call__

FlaxMarianMTModel

[[autodoc]] FlaxMarianMTModel
    - __call__