
# BARTpho [[bartpho]]

## Overview [[overview]]

BARTpho ๋ชจ๋ธ์€ Nguyen Luong Tran, Duong Minh Le, Dat Quoc Nguyen์— ์˜ํ•ด BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese์—์„œ ์ œ์•ˆ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

์ด ๋…ผ๋ฌธ์˜ ์ดˆ๋ก์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

We present BARTpho in two versions, BARTpho_word and BARTpho_syllable, the first large-scale monolingual sequence-to-sequence models pre-trained for Vietnamese. Our BARTpho uses the "large" architecture and pre-training scheme of the sequence-to-sequence denoising model BART, making it especially suitable for generative NLP tasks. In experiments on a downstream task of Vietnamese text summarization, BARTpho outperforms the strong baseline mBART in both automatic and human evaluations and improves the state of the art. We release BARTpho to facilitate future research and applications of generative Vietnamese NLP tasks.

์ด ๋ชจ๋ธ์€ dqnguyen์ด ๊ธฐ์—ฌํ–ˆ์Šต๋‹ˆ๋‹ค. ์›๋ณธ ์ฝ”๋“œ๋Š” ์—ฌ๊ธฐ์—์„œ ์ฐพ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

## Usage example [[usage-example]]

```python
>>> import torch
>>> from transformers import AutoModel, AutoTokenizer

>>> bartpho = AutoModel.from_pretrained("vinai/bartpho-syllable")

>>> tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")

>>> line = "Chúng tôi là những nghiên cứu viên."

>>> input_ids = tokenizer(line, return_tensors="pt")

>>> with torch.no_grad():
...     features = bartpho(**input_ids)  # Model outputs are now tuples

>>> # With TensorFlow 2.0+:
>>> from transformers import TFAutoModel

>>> bartpho = TFAutoModel.from_pretrained("vinai/bartpho-syllable")
>>> input_ids = tokenizer(line, return_tensors="tf")
>>> features = bartpho(**input_ids)
```

## Usage tips [[usage-tips]]

- Following mBART, BARTpho uses the "large" architecture of BART with an additional layer-normalization layer on top of both the encoder and the decoder. Thus, when adapting usage examples from the BART documentation to BARTpho, they should be adjusted by replacing the BART-specialized classes with their mBART-specialized counterparts. For example:
```python
>>> from transformers import MBartForConditionalGeneration

>>> bartpho = MBartForConditionalGeneration.from_pretrained("vinai/bartpho-syllable")
>>> TXT = "Chúng tôi là <mask> nghiên cứu viên."
>>> input_ids = tokenizer([TXT], return_tensors="pt")["input_ids"]
>>> logits = bartpho(input_ids).logits
>>> masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
>>> probs = logits[0, masked_index].softmax(dim=0)
>>> values, predictions = probs.topk(5)
>>> print(tokenizer.decode(predictions).split())
```
  • ์ด ๊ตฌํ˜„์€ ํ† ํฐํ™”๋งŒ์„ ์œ„ํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค: "monolingual_vocab_file"์€ ๋‹ค๊ตญ์–ด XLM-RoBERTa์—์„œ ์ œ๊ณต๋˜๋Š” ์‚ฌ์ „ํ›ˆ๋ จ๋œ SentencePiece ๋ชจ๋ธ "vocab_file"์—์„œ ์ถ”์ถœ๋œ ๋ฒ ํŠธ๋‚จ์–ด ์ „์šฉ ์œ ํ˜•์œผ๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค. ๋‹ค๋ฅธ ์–ธ์–ด๋“ค๋„ ์ด ์‚ฌ์ „ํ›ˆ๋ จ๋œ ๋‹ค๊ตญ์–ด SentencePiece ๋ชจ๋ธ "vocab_file"์„ ํ•˜์œ„ ๋‹จ์–ด ๋ถ„ํ• ์— ์‚ฌ์šฉํ•˜๋ฉด, ์ž์‹ ์˜ ์–ธ์–ด ์ „์šฉ "monolingual_vocab_file"๊ณผ ํ•จ๊ป˜ BartphoTokenizer๋ฅผ ์žฌ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

## BartphoTokenizer [[bartphotokenizer]]

[[autodoc]] BartphoTokenizer