| # BERTweet [[bertweet]] | |
| ## Overview [[overview]] | |
| The BERTweet model was proposed in [BERTweet: A pre-trained language model for English Tweets](https://www.aclweb.org/anthology/2020.emnlp-demos.2.pdf) by Dat Quoc Nguyen, Thanh Vu, and Anh Tuan Nguyen. | |
| The abstract from the paper is the following: | |
| *We present BERTweet, the first public large-scale pre-trained language model for English Tweets. | |
| BERTweet has the same architecture as BERT-base (Devlin et al., 2019) and is trained using the RoBERTa pre-training procedure (Liu et al., 2019). | |
| Experiments show that BERTweet outperforms the strong baselines RoBERTa-base and XLM-R-base (Conneau et al., 2020), producing better results than the previous state-of-the-art models on three Tweet NLP tasks: part-of-speech tagging, named-entity recognition, and text classification.* | |
| This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/BERTweet). | |
| ## Usage example [[usage-example]] | |
| ```python | |
| >>> import torch | |
| >>> from transformers import AutoModel, AutoTokenizer | |
| >>> bertweet = AutoModel.from_pretrained("vinai/bertweet-base") | |
| >>> # For transformers v4.x+: | |
| >>> tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False) | |
| >>> # For transformers v3.x: | |
| >>> # tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base") | |
| >>> # The input tweet is already normalized! | |
| >>> line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:" | |
| >>> input_ids = torch.tensor([tokenizer.encode(line)]) | |
| >>> with torch.no_grad(): | |
| ...     features = bertweet(input_ids)  # Model outputs are now tuples | |
| >>> # With TensorFlow 2.0+: | |
| >>> # from transformers import TFAutoModel | |
| >>> # bertweet = TFAutoModel.from_pretrained("vinai/bertweet-base") | |
| ``` | |
| This implementation is the same as BERT, except for the tokenization method. Refer to the [BERT documentation](bert) for API reference information. | |
| ## BertweetTokenizer [[transformers.BertweetTokenizer]] | |
| #### transformers.BertweetTokenizer[[transformers.BertweetTokenizer]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bertweet/tokenization_bertweet.py#L51) | |
| Constructs a BERTweet tokenizer, using Byte-Pair-Encoding. | |
| This tokenizer inherits from [PreTrainedTokenizer](/docs/transformers/main/ko/main_classes/tokenizer#transformers.PythonBackend) which contains most of the main methods. Users should refer to | |
| this superclass for more information regarding those methods. | |
| #### add_from_file[[transformers.BertweetTokenizer.add_from_file]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bertweet/tokenization_bertweet.py#L332) | |
| Loads a pre-existing dictionary from a text file (argument `f`) and adds its symbols to this instance. | |
| **Parameters (`BertweetTokenizer`):** | |
| vocab_file (`str`) : Path to the vocabulary file. | |
| merges_file (`str`) : Path to the merges file. | |
| normalization (`bool`, *optional*, defaults to `False`) : Whether or not to apply a normalization preprocess. | |
| bos_token (`str`, *optional*, defaults to `"<s>"`) : The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token. When building a sequence using special tokens, this is not the token that is used for the beginning of sequence. The token used is the `cls_token`. | |
| eos_token (`str`, *optional*, defaults to `"</s>"`) : The end of sequence token. When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the `sep_token`. | |
| sep_token (`str`, *optional*, defaults to `"</s>"`) : The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens. | |
| cls_token (`str`, *optional*, defaults to `"<s>"`) : The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens. | |
| unk_token (`str`, *optional*, defaults to `"<unk>"`) : The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. | |
| pad_token (`str`, *optional*, defaults to `"<pad>"`) : The token used for padding, for example when batching sequences of different lengths. | |
| mask_token (`str`, *optional*, defaults to `"<mask>"`) : The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict. | |
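The `cls_token` and `sep_token` descriptions above imply a RoBERTa-style layout: a sequence built with special tokens starts with `<s>` and ends with `</s>`. The sketch below illustrates that layout in plain Python. The helper name `wrap_with_special_tokens`, the token IDs, and the doubled separator between the two segments of a pair are assumptions made for illustration; the real IDs come from the vocabulary file, and the tokenizer itself should be used in practice.

```python
# Hypothetical IDs for illustration only; real values come from the vocabulary file.
CLS_ID = 0  # stands in for "<s>"
SEP_ID = 2  # stands in for "</s>"


def wrap_with_special_tokens(ids_a, ids_b=None):
    """Wrap one or two ID sequences in RoBERTa-style special tokens.

    Single sequence: <s> A </s>
    Pair (assumed):  <s> A </s></s> B </s>
    """
    if ids_b is None:
        return [CLS_ID] + ids_a + [SEP_ID]
    return [CLS_ID] + ids_a + [SEP_ID, SEP_ID] + ids_b + [SEP_ID]


single = wrap_with_special_tokens([10, 11, 12])
pair = wrap_with_special_tokens([10, 11], [20, 21])
print(single)
print(pair)
```

Because `bos_token` and `cls_token` share the default `"<s>"`, and `eos_token` and `sep_token` share `"</s>"`, the distinction between them only matters when the defaults are overridden.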
| #### convert_tokens_to_string[[transformers.BertweetTokenizer.convert_tokens_to_string]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bertweet/tokenization_bertweet.py#L291) | |
| Converts a sequence of tokens (strings) into a single string. | |
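As a rough illustration of what joining Byte-Pair-Encoding tokens involves, the sketch below assumes that a trailing `@@` marks a token that continues into the next one, which is the convention used by fastBPE-style vocabularies. The function name `join_bpe_tokens` and the example tokens are hypothetical and not taken from the library or the real vocabulary.

```python
def join_bpe_tokens(tokens):
    """Join BPE tokens into text, merging pieces that end with the '@@' marker."""
    # "corona@@ virus" -> "coronavirus": drop the marker together with the space after it.
    return " ".join(tokens).replace("@@ ", "").strip()


tokens = ["corona@@", "virus", "cases"]  # hypothetical BPE split
text = join_bpe_tokens(tokens)
print(text)
```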
| #### normalizeToken[[transformers.BertweetTokenizer.normalizeToken]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bertweet/tokenization_bertweet.py#L264) | |
| Normalize tokens in a Tweet | |
| #### normalizeTweet[[transformers.BertweetTokenizer.normalizeTweet]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bertweet/tokenization_bertweet.py#L230) | |
| Normalize a raw Tweet | |
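The normalized tweet in the usage example contains the placeholders `@USER` and `HTTPURL`. The sketch below shows one simplified way such placeholders could be produced from a raw tweet. It is an assumption-laden illustration, not the library's implementation: it splits on whitespace only and ignores emoji and punctuation handling, and the function names `normalize_token` and `normalize_tweet` are hypothetical stand-ins.

```python
def normalize_token(token):
    """Map user mentions and links to the placeholders seen in normalized tweets."""
    lowered = token.lower()
    if token.startswith("@"):
        return "@USER"
    if lowered.startswith("http") or lowered.startswith("www"):
        return "HTTPURL"
    return token


def normalize_tweet(tweet):
    """Normalize a raw tweet token by token (whitespace split is a simplification)."""
    return " ".join(normalize_token(tok) for tok in tweet.split())


raw = "DHEC confirms https://example.com/news via @dhec_official"
normalized = normalize_tweet(raw)
print(normalized)
```

For real inputs, the tokenizer's `normalization` parameter described above is the supported route; this sketch only conveys the idea.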
| #### save_vocabulary[[transformers.BertweetTokenizer.save_vocabulary]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bertweet/tokenization_bertweet.py#L302) | |
| Save the vocabulary and merges files to a directory. | |