<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
# BARTpho [[bartpho]]

## Overview [[overview]]

The BARTpho model was proposed in [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://huggingface.co/papers/2109.09701) by Nguyen Luong Tran, Duong Minh Le, and Dat Quoc Nguyen.

The abstract from the paper is the following:

*We present BARTpho in two versions, BARTpho_word and BARTpho_syllable, the first large-scale monolingual sequence-to-sequence models pre-trained for Vietnamese. Our BARTpho uses the "large" architecture and pre-training scheme of the sequence-to-sequence denoising model BART, making it especially suitable for generative NLP tasks. Experiments on a downstream task of Vietnamese text summarization show that, in both automatic and human evaluations, BARTpho outperforms the strong baseline mBART and improves the state of the art. We release BARTpho to facilitate future research and applications of generative Vietnamese NLP tasks.*

This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/BARTpho).
## Usage example [[usage-example]]

```python
>>> import torch
>>> from transformers import AutoModel, AutoTokenizer

>>> bartpho = AutoModel.from_pretrained("vinai/bartpho-syllable")

>>> tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")

>>> line = "Chúng tôi là những nghiên cứu viên."

>>> input_ids = tokenizer(line, return_tensors="pt")

>>> with torch.no_grad():
...     features = bartpho(**input_ids)  # The model output is now a set of features

>>> # With TensorFlow 2.0+:
>>> from transformers import TFAutoModel

>>> bartpho = TFAutoModel.from_pretrained("vinai/bartpho-syllable")
>>> input_ids = tokenizer(line, return_tensors="tf")
>>> features = bartpho(**input_ids)
```
## Usage tips [[usage-tips]]

- Following mBART, BARTpho uses the "large" architecture of BART with an additional layer-normalization layer on top of both the encoder and decoder. Thus, to adapt the usage examples in the [BART documentation](bart) for BARTpho, replace the BART-specialized classes with their mBART-specialized counterparts. For example:
```python
>>> from transformers import AutoTokenizer, MBartForConditionalGeneration

>>> bartpho = MBartForConditionalGeneration.from_pretrained("vinai/bartpho-syllable")
>>> tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")
>>> TXT = "Chúng tôi là <mask> nghiên cứu viên."
>>> input_ids = tokenizer([TXT], return_tensors="pt")["input_ids"]
>>> logits = bartpho(input_ids).logits
>>> masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
>>> probs = logits[0, masked_index].softmax(dim=0)
>>> values, predictions = probs.topk(5)
>>> print(tokenizer.decode(predictions).split())
```
- This implementation is only for tokenization: the "monolingual_vocab_file" consists of Vietnamese-specialized types extracted from the pre-trained SentencePiece model "vocab_file" that is available from the multilingual XLM-RoBERTa. Other languages that use this pre-trained multilingual SentencePiece model "vocab_file" for subword segmentation can reuse BartphoTokenizer with their own language-specialized "monolingual_vocab_file".
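The two-vocabulary scheme above can be sketched in plain Python. This is a toy illustration of the idea, not the actual BartphoTokenizer internals: the subword pieces and vocabulary entries below are made up, standing in for pieces produced by the multilingual SentencePiece "vocab_file" and entries of the language-specialized "monolingual_vocab_file".

```python
# Hypothetical subword pieces produced by a shared multilingual segmentation
# model (in the real tokenizer, XLM-RoBERTa's SentencePiece "vocab_file").
segmented = ["▁Chúng", "▁tôi", "▁là", "▁nghiên", "▁cứu", "▁viên", "."]

# Hypothetical small language-specialized vocabulary ("monolingual_vocab_file"):
# only the language-specific pieces get their own IDs.
monolingual_vocab = {
    "<unk>": 0,
    "▁Chúng": 1,
    "▁tôi": 2,
    "▁là": 3,
    "▁nghiên": 4,
    "▁cứu": 5,
    "▁viên": 6,
}


def pieces_to_ids(pieces, vocab):
    """Map subword pieces to IDs, sending out-of-vocabulary pieces to <unk>."""
    return [vocab.get(piece, vocab["<unk>"]) for piece in pieces]


print(pieces_to_ids(segmented, monolingual_vocab))
# [1, 2, 3, 4, 5, 6, 0]  ("." is absent from the monolingual vocabulary)
```

The segmentation step and the ID-assignment step are decoupled, which is why another language can keep the multilingual "vocab_file" and swap in its own monolingual vocabulary.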
## BartphoTokenizer [[bartphotokenizer]]

[[autodoc]] BartphoTokenizer