BERTweet [[bertweet]]

Overview [[overview]]

The BERTweet model was proposed in BERTweet: A pre-trained language model for English Tweets by Dat Quoc Nguyen, Thanh Vu, and Anh Tuan Nguyen.

The abstract from the paper is the following:

We present BERTweet, the first public large-scale pre-trained language model for English Tweets. BERTweet has the same architecture as BERT-base (Devlin et al., 2019) and is trained using the RoBERTa pre-training procedure (Liu et al., 2019). Experiments show that BERTweet outperforms the strong baselines RoBERTa-base and XLM-R-base (Conneau et al., 2020), producing better results than the previous state-of-the-art models on three Tweet NLP tasks: part-of-speech tagging, named-entity recognition, and text classification.

This model was contributed by dqnguyen. The original code can be found here.

Usage example [[usage-example]]

>>> import torch
>>> from transformers import AutoModel, AutoTokenizer

>>> bertweet = AutoModel.from_pretrained("vinai/bertweet-base")

>>> # For transformers v4.x+:
>>> tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)

>>> # For transformers v3.x:
>>> # tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")

>>> # The input tweet is already normalized!
>>> line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"

>>> input_ids = torch.tensor([tokenizer.encode(line)])

>>> with torch.no_grad():
...     features = bertweet(input_ids)  # Models outputs are now tuples

>>> # With TensorFlow 2.0+:
>>> # from transformers import TFAutoModel
>>> # bertweet = TFAutoModel.from_pretrained("vinai/bertweet-base")
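
The example tweet above already contains the placeholders HTTPURL and @USER in place of a real link and a real user mention. The following is a minimal sketch of how a raw tweet could be brought into that form before encoding. The function names normalize_token and normalize_tweet are hypothetical and are not part of the transformers API, and the sketch assumes simple whitespace splitting; the normalization used for the released model may apply further rules (for example, converting emoji characters into text strings such as :cry:).

```python
def normalize_token(token: str) -> str:
    """Map one whitespace-delimited token to its normalized form (illustrative only)."""
    lowered = token.lower()
    if token.startswith("@"):
        return "@USER"  # mask user mentions
    if lowered.startswith(("http", "www")):
        return "HTTPURL"  # mask web links
    return token  # leave every other token unchanged


def normalize_tweet(tweet: str) -> str:
    """Normalize a raw tweet by masking mentions and links token by token."""
    return " ".join(normalize_token(tok) for tok in tweet.split())


# Hypothetical raw tweet; the URL and user name are made up for illustration.
raw = (
    "SC has first two presumptive cases of coronavirus , DHEC confirms "
    "https://example.com/article via @some_user :cry:"
)
print(normalize_tweet(raw))
```

The resulting string can then be passed to tokenizer.encode in the same way as the variable line in the usage example.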

This implementation is the same as BERT, except for the tokenization method. Refer to the BERT documentation for API reference information.

BertweetTokenizer [[transformers.BertweetTokenizer]]

[[autodoc]] BertweetTokenizer