| # BERTweet | |
| <div style="float: right;"> | |
| <div class="flex flex-wrap space-x-1"> | |
| <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white"> | |
| </div> | |
| </div> | |
| ## BERTweet | |
| [BERTweet](https://huggingface.co/papers/2005.10200) shares the same architecture as [BERT-base](./bert), but it is pretrained on English Tweets with the [RoBERTa](./roberta) pretraining procedure. It performs well on Tweet-related tasks like part-of-speech tagging, named entity recognition, and text classification. | |
| You can find all the original BERTweet checkpoints under the [VinAI Research](https://huggingface.co/vinai?search_models=BERTweet) organization. | |
| > [!TIP] | |
| > Refer to the [BERT](./bert) docs for more examples of how to apply BERTweet to different language tasks. | |
| The example below demonstrates how to predict the `<mask>` token with [Pipeline](/docs/transformers/pr_33892/en/main_classes/pipelines#transformers.Pipeline), [AutoModel](/docs/transformers/pr_33892/en/model_doc/auto#transformers.AutoModel), and from the command line. | |
| <hfoptions id="usage"> | |
| <hfoption id="Pipeline"> | |
| ```py | |
| import torch | |
| from transformers import pipeline | |
| # Avoid shadowing the imported `pipeline` function | |
| fill_mask = pipeline( | |
|     task="fill-mask", | |
|     model="vinai/bertweet-base", | |
|     dtype=torch.float16, | |
|     device=0 | |
| ) | |
| fill_mask("Plants create <mask> through a process known as photosynthesis.") | |
| ``` | |
| </hfoption> | |
| <hfoption id="AutoModel"> | |
| ```py | |
| import torch | |
| from transformers import AutoModelForMaskedLM, AutoTokenizer | |
| tokenizer = AutoTokenizer.from_pretrained( | |
| "vinai/bertweet-base", | |
| ) | |
| model = AutoModelForMaskedLM.from_pretrained( | |
| "vinai/bertweet-base", | |
| dtype=torch.float16, | |
| device_map="auto" | |
| ) | |
| inputs = tokenizer("Plants create <mask> through a process known as photosynthesis.", return_tensors="pt").to(model.device) | |
| with torch.no_grad(): | |
|     outputs = model(**inputs) | |
|     predictions = outputs.logits | |
| masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1] | |
| predicted_token_id = predictions[0, masked_index].argmax(dim=-1) | |
| predicted_token = tokenizer.decode(predicted_token_id) | |
| print(f"The predicted token is: {predicted_token}") | |
| ``` | |
| </hfoption> | |
| <hfoption id="transformers CLI"> | |
| ```bash | |
| echo -e "Plants create <mask> through a process known as photosynthesis." | transformers run --task fill-mask --model vinai/bertweet-base --device 0 | |
| ``` | |
| </hfoption> | |
| </hfoptions> | |
| ## Notes | |
| - Use [AutoTokenizer](/docs/transformers/pr_33892/en/model_doc/auto#transformers.AutoTokenizer) or [BertweetTokenizer](/docs/transformers/pr_33892/en/model_doc/bertweet#transformers.BertweetTokenizer) because they load a custom vocabulary adapted to tweet-specific tokens like hashtags (#), mentions (@), emojis, and common abbreviations. Install the [emoji](https://pypi.org/project/emoji/) library as well, which the tokenizer uses to normalize emojis. | |
| - Pad inputs on the right (the tokenizer's default `padding_side`, for example with `padding="max_length"`) because BERTweet, like BERT, uses absolute position embeddings. | |
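The right-padding note above can be illustrated without loading the model. The snippet below is a minimal sketch: `pad_right` is a hypothetical helper (not part of Transformers), the token IDs are made up, and `pad_id=1` is an assumption for illustration (read the real value from `tokenizer.pad_token_id`). With right padding, real tokens keep positions `0..n-1`, which is what a model with absolute position embeddings saw during pretraining.

```py
# Hypothetical helper, not part of transformers: pads ID lists on the right.
def pad_right(batch, pad_id, max_length=None):
    """Pad every ID list in `batch` on the right and build attention masks."""
    target = max_length or max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = target - len(ids)
        if n_pad < 0:
            raise ValueError("sequence is longer than max_length")
        # Real tokens stay at positions 0..len(ids)-1; padding goes at the end.
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up IDs; pad_id=1 is an assumption for illustration only.
ids, mask = pad_right([[0, 11, 12, 2], [0, 21, 2]], pad_id=1)
print(ids)   # [[0, 11, 12, 2], [0, 21, 2, 1]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

In practice the tokenizer does this for you when you pass `padding=True` or `padding="max_length"`.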
| ## BertweetTokenizer[[transformers.BertweetTokenizer]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class transformers.BertweetTokenizer</name><anchor>transformers.BertweetTokenizer</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/bertweet/tokenization_bertweet.py#L54</source><parameters>[{"name": "vocab_file", "val": ""}, {"name": "merges_file", "val": ""}, {"name": "normalization", "val": " = False"}, {"name": "bos_token", "val": " = '<s>'"}, {"name": "eos_token", "val": " = '</s>'"}, {"name": "sep_token", "val": " = '</s>'"}, {"name": "cls_token", "val": " = '<s>'"}, {"name": "unk_token", "val": " = '<unk>'"}, {"name": "pad_token", "val": " = '<pad>'"}, {"name": "mask_token", "val": " = '<mask>'"}, {"name": "**kwargs", "val": ""}]</parameters><paramsdesc>- **vocab_file** (`str`) -- | |
| Path to the vocabulary file. | |
| - **merges_file** (`str`) -- | |
| Path to the merges file. | |
| - **normalization** (`bool`, *optional*, defaults to `False`) -- | |
| Whether or not to apply a normalization preprocess. | |
| - **bos_token** (`str`, *optional*, defaults to `"<s>"`) -- | |
| The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token. | |
| <Tip> | |
| When building a sequence using special tokens, this is not the token that is used for the beginning of | |
| sequence. The token used is the `cls_token`. | |
| </Tip> | |
| - **eos_token** (`str`, *optional*, defaults to `"</s>"`) -- | |
| The end of sequence token. | |
| <Tip> | |
| When building a sequence using special tokens, this is not the token that is used for the end of sequence. | |
| The token used is the `sep_token`. | |
| </Tip> | |
| - **sep_token** (`str`, *optional*, defaults to `"</s>"`) -- | |
| The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for | |
| sequence classification or for a text and a question for question answering. It is also used as the last | |
| token of a sequence built with special tokens. | |
| - **cls_token** (`str`, *optional*, defaults to `"<s>"`) -- | |
| The classifier token which is used when doing sequence classification (classification of the whole sequence | |
| instead of per-token classification). It is the first token of the sequence when built with special tokens. | |
| - **unk_token** (`str`, *optional*, defaults to `"<unk>"`) -- | |
| The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this | |
| token instead. | |
| - **pad_token** (`str`, *optional*, defaults to `"<pad>"`) -- | |
| The token used for padding, for example when batching sequences of different lengths. | |
| - **mask_token** (`str`, *optional*, defaults to `"<mask>"`) -- | |
| The token used for masking values. This is the token used when training this model with masked language | |
| modeling. This is the token which the model will try to predict.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Constructs a BERTweet tokenizer, using Byte-Pair-Encoding. | |
| This tokenizer inherits from [PreTrainedTokenizer](/docs/transformers/pr_33892/en/main_classes/tokenizer#transformers.PreTrainedTokenizer) which contains most of the main methods. Users should refer to | |
| this superclass for more information regarding those methods. | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>add_from_file</name><anchor>transformers.BertweetTokenizer.add_from_file</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/bertweet/tokenization_bertweet.py#L402</source><parameters>[{"name": "f", "val": ""}]</parameters></docstring> | |
| Loads a pre-existing dictionary from a text file and adds its symbols to this instance. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>build_inputs_with_special_tokens</name><anchor>transformers.BertweetTokenizer.build_inputs_with_special_tokens</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/bertweet/tokenization_bertweet.py#L167</source><parameters>[{"name": "token_ids_0", "val": ": list"}, {"name": "token_ids_1", "val": ": typing.Optional[list[int]] = None"}]</parameters><paramsdesc>- **token_ids_0** (`list[int]`) -- | |
| List of IDs to which the special tokens will be added. | |
| - **token_ids_1** (`list[int]`, *optional*) -- | |
| Optional second list of IDs for sequence pairs.</paramsdesc><paramgroups>0</paramgroups><rettype>`list[int]`</rettype><retdesc>List of [input IDs](../glossary#input-ids) with the appropriate special tokens.</retdesc></docstring> | |
| Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and | |
| adding special tokens. A BERTweet sequence has the following format: | |
| - single sequence: `<s> X </s>` | |
| - pair of sequences: `<s> A </s></s> B </s>` | |
| </div> | |
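The two formats described for `build_inputs_with_special_tokens` can be sketched in plain Python. This is an illustration of the layout only, not the library's implementation: `build_inputs_sketch` is a hypothetical name, and `CLS_ID` and `SEP_ID` are placeholder values (read the real ones from `tokenizer.cls_token_id` and `tokenizer.sep_token_id`).

```py
from typing import Optional

# Placeholder IDs for `<s>` and `</s>`, chosen for illustration only.
CLS_ID, SEP_ID = 0, 2


def build_inputs_sketch(
    token_ids_0: list[int], token_ids_1: Optional[list[int]] = None
) -> list[int]:
    """Mirror the documented layouts `<s> X </s>` and `<s> A </s></s> B </s>`."""
    if token_ids_1 is None:
        return [CLS_ID] + token_ids_0 + [SEP_ID]
    return [CLS_ID] + token_ids_0 + [SEP_ID, SEP_ID] + token_ids_1 + [SEP_ID]


print(build_inputs_sketch([10, 11]))        # [0, 10, 11, 2]
print(build_inputs_sketch([10], [20, 21]))  # [0, 10, 2, 2, 20, 21, 2]
```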
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>convert_tokens_to_string</name><anchor>transformers.BertweetTokenizer.convert_tokens_to_string</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/bertweet/tokenization_bertweet.py#L368</source><parameters>[{"name": "tokens", "val": ""}]</parameters></docstring> | |
| Converts a sequence of tokens (strings) into a single string. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>create_token_type_ids_from_sequences</name><anchor>transformers.BertweetTokenizer.create_token_type_ids_from_sequences</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/bertweet/tokenization_bertweet.py#L221</source><parameters>[{"name": "token_ids_0", "val": ": list"}, {"name": "token_ids_1", "val": ": typing.Optional[list[int]] = None"}]</parameters><paramsdesc>- **token_ids_0** (`list[int]`) -- | |
| List of IDs. | |
| - **token_ids_1** (`list[int]`, *optional*) -- | |
| Optional second list of IDs for sequence pairs.</paramsdesc><paramgroups>0</paramgroups><rettype>`list[int]`</rettype><retdesc>List of zeros.</retdesc></docstring> | |
| Create a mask from the two sequences passed to be used in a sequence-pair classification task. BERTweet does | |
| not make use of token type ids, therefore a list of zeros is returned. | |
| </div> | |
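Because BERTweet does not use token type IDs, the result of `create_token_type_ids_from_sequences` is all zeros. The sketch below assumes the list covers the full sequence including special tokens (two for a single sequence, four for a pair, following the `<s> X </s>` and `<s> A </s></s> B </s>` formats); `token_type_ids_sketch` is a hypothetical name used only for this illustration.

```py
from typing import Optional


def token_type_ids_sketch(
    token_ids_0: list[int], token_ids_1: Optional[list[int]] = None
) -> list[int]:
    """Return one zero per position of the sequence built with special tokens."""
    # Assumption: 2 special tokens for a single sequence, 4 for a pair.
    n_special = 2 if token_ids_1 is None else 4
    length = len(token_ids_0) + len(token_ids_1 or []) + n_special
    return [0] * length


print(token_type_ids_sketch([10, 11]))        # [0, 0, 0, 0]
print(token_type_ids_sketch([10], [20, 21]))  # [0, 0, 0, 0, 0, 0, 0]
```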
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>get_special_tokens_mask</name><anchor>transformers.BertweetTokenizer.get_special_tokens_mask</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/bertweet/tokenization_bertweet.py#L193</source><parameters>[{"name": "token_ids_0", "val": ": list"}, {"name": "token_ids_1", "val": ": typing.Optional[list[int]] = None"}, {"name": "already_has_special_tokens", "val": ": bool = False"}]</parameters><paramsdesc>- **token_ids_0** (`list[int]`) -- | |
| List of IDs. | |
| - **token_ids_1** (`list[int]`, *optional*) -- | |
| Optional second list of IDs for sequence pairs. | |
| - **already_has_special_tokens** (`bool`, *optional*, defaults to `False`) -- | |
| Whether or not the token list is already formatted with special tokens for the model.</paramsdesc><paramgroups>0</paramgroups><rettype>`list[int]`</rettype><retdesc>A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.</retdesc></docstring> | |
| Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding | |
| special tokens using the tokenizer `prepare_for_model` method. | |
| </div> | |
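The mask returned by `get_special_tokens_mask` marks special tokens with 1 and sequence tokens with 0. The sketch below shows the expected shape for inputs without special tokens, assuming the `<s> X </s>` and `<s> A </s></s> B </s>` layouts. `special_tokens_mask_sketch` is a hypothetical name, and the `already_has_special_tokens=True` case is not covered.

```py
from typing import Optional


def special_tokens_mask_sketch(
    token_ids_0: list[int], token_ids_1: Optional[list[int]] = None
) -> list[int]:
    """1 where a special token would be inserted, 0 for sequence tokens."""
    if token_ids_1 is None:
        return [1] + [0] * len(token_ids_0) + [1]
    return [1] + [0] * len(token_ids_0) + [1, 1] + [0] * len(token_ids_1) + [1]


print(special_tokens_mask_sketch([10, 11]))        # [1, 0, 0, 1]
print(special_tokens_mask_sketch([10], [20, 21]))  # [1, 0, 1, 1, 0, 0, 1]
```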
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>normalizeToken</name><anchor>transformers.BertweetTokenizer.normalizeToken</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/bertweet/tokenization_bertweet.py#L341</source><parameters>[{"name": "token", "val": ""}]</parameters></docstring> | |
| Normalize tokens in a Tweet. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>normalizeTweet</name><anchor>transformers.BertweetTokenizer.normalizeTweet</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/bertweet/tokenization_bertweet.py#L307</source><parameters>[{"name": "tweet", "val": ""}]</parameters></docstring> | |
| Normalize a raw Tweet. | |
| </div></div> | |
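As a rough illustration of what Tweet normalization means here, the sketch below replaces user mentions and links with placeholder tokens. It is a simplified approximation and not the tokenizer's actual code: the function names are hypothetical, the `@USER` and `HTTPURL` placeholders are assumptions based on the BERTweet paper's description of its preprocessing, and the sketch splits on whitespace and ignores emoji and punctuation handling.

```py
def normalize_token_sketch(token: str) -> str:
    """Map mentions and links to placeholder tokens (assumed: @USER, HTTPURL)."""
    lowered = token.lower()
    if token.startswith("@"):
        return "@USER"
    if lowered.startswith(("http", "www")):
        return "HTTPURL"
    return token


def normalize_tweet_sketch(tweet: str) -> str:
    """Normalize each whitespace-separated token; a simplification for illustration."""
    return " ".join(normalize_token_sketch(t) for t in tweet.split())


print(normalize_tweet_sketch("@jane check https://example.com now"))
# @USER check HTTPURL now
```

To apply the tokenizer's own normalization, set `normalization=True` when loading `BertweetTokenizer`.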