	# BERTweet

	<div style="float: right;">
	<div class="flex flex-wrap space-x-1">
	<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
	</div>
	</div>

	[BERTweet](https://huggingface.co/papers/2005.10200) uses the same architecture as [BERT-base](./bert), but it's pretrained with the [RoBERTa](./roberta) procedure on English Tweets. It performs well on Tweet-related tasks such as part-of-speech tagging, named entity recognition, and text classification.

	You can find all the original BERTweet checkpoints under the [VinAI Research](https://huggingface.co/vinai?search_models=BERTweet) organization.

	> [!TIP]
	> Refer to the [BERT](./bert) docs for more examples of how to apply BERTweet to different language tasks.

	The example below demonstrates how to predict the `<mask>` token with [Pipeline](/docs/transformers/pr_33892/en/main_classes/pipelines#transformers.Pipeline), [AutoModel](/docs/transformers/pr_33892/en/model_doc/auto#transformers.AutoModel), and from the command line.

	<hfoptions id="usage">
	<hfoption id="Pipeline">

	```py
	import torch
	from transformers import pipeline

	# Use a distinct variable name so the imported `pipeline` function isn't shadowed
	fill_mask = pipeline(
	    task="fill-mask",
	    model="vinai/bertweet-base",
	    dtype=torch.float16,
	    device=0,
	)
	fill_mask("Plants create <mask> through a process known as photosynthesis.")
	```

	</hfoption>
	<hfoption id="AutoModel">

	```py
	import torch
	from transformers import AutoModelForMaskedLM, AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained(
	    "vinai/bertweet-base",
	)
	model = AutoModelForMaskedLM.from_pretrained(
	    "vinai/bertweet-base",
	    dtype=torch.float16,
	    device_map="auto"
	)
	inputs = tokenizer("Plants create <mask> through a process known as photosynthesis.", return_tensors="pt").to(model.device)

	# Run inference without tracking gradients
	with torch.no_grad():
	    outputs = model(**inputs)
	    predictions = outputs.logits

	masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1]
	predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
	predicted_token = tokenizer.decode(predicted_token_id)

	print(f"The predicted token is: {predicted_token}")
	```

	</hfoption>
	<hfoption id="transformers CLI">

	```bash
	echo -e "Plants create <mask> through a process known as photosynthesis." | transformers run --task fill-mask --model vinai/bertweet-base --device 0
	```

	</hfoption>
	</hfoptions>

	## Notes

	- Use the [AutoTokenizer](/docs/transformers/pr_33892/en/model_doc/auto#transformers.AutoTokenizer) or [BertweetTokenizer](/docs/transformers/pr_33892/en/model_doc/bertweet#transformers.BertweetTokenizer) because it's preloaded with a custom vocabulary adapted to tweet-specific tokens like hashtags (#), mentions (@), emojis, and common abbreviations. Make sure to also install the [emoji](https://pypi.org/project/emoji/) library.
	- Inputs should be padded on the right (`padding="max_length"`) because BERTweet, like BERT, uses absolute position embeddings.
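
The right-padding note can be illustrated without loading the model. The sketch below is plain Python, not the library's implementation: `pad_right` is a hypothetical helper, and the pad ID of `1` and the token IDs are assumed placeholders (with a real tokenizer, read the value from `tokenizer.pad_token_id`). It shows that real tokens keep their positions at the start of the sequence, while padding is appended at the end and masked out.

```python
def pad_right(input_ids, max_length, pad_id=1):
    """Hypothetical helper: pad a list of token IDs on the right.

    Returns the padded IDs and an attention mask (1 = real token, 0 = padding).
    pad_id=1 is an assumed placeholder, not read from a real tokenizer.
    """
    if len(input_ids) > max_length:
        raise ValueError("sequence is longer than max_length")
    n_pad = max_length - len(input_ids)
    padded = list(input_ids) + [pad_id] * n_pad
    attention_mask = [1] * len(input_ids) + [0] * n_pad
    return padded, attention_mask


ids, mask = pad_right([0, 11, 12, 2], max_length=6)
print(ids)   # [0, 11, 12, 2, 1, 1]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

With the actual tokenizer, passing `padding="max_length"` together with a `max_length` value requests this kind of padding, and the tokenizer's `padding_side` attribute controls which side is padded.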

	## BertweetTokenizer[[transformers.BertweetTokenizer]]

	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>class transformers.BertweetTokenizer</name><anchor>transformers.BertweetTokenizer</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/bertweet/tokenization_bertweet.py#L54</source><parameters>[{"name": "vocab_file", "val": ""}, {"name": "merges_file", "val": ""}, {"name": "normalization", "val": " = False"}, {"name": "bos_token", "val": " = '<s>'"}, {"name": "eos_token", "val": " = '</s>'"}, {"name": "sep_token", "val": " = '</s>'"}, {"name": "cls_token", "val": " = '<s>'"}, {"name": "unk_token", "val": " = '<unk>'"}, {"name": "pad_token", "val": " = '<pad>'"}, {"name": "mask_token", "val": " = '<mask>'"}, {"name": "kwargs", "val": ""}]</parameters><paramsdesc>- vocab_file (`str`) --
	Path to the vocabulary file.
	- merges_file (`str`) --
	Path to the merges file.
	- normalization (`bool`, optional, defaults to `False`) --
	Whether or not to apply a normalization preprocess.
	- bos_token (`str`, optional, defaults to `"<s>"`) --
	The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.

	<Tip>

	When building a sequence using special tokens, this is not the token that is used for the beginning of
	sequence. The token used is the `cls_token`.

	</Tip>

	- eos_token (`str`, optional, defaults to `"</s>"`) --
	The end of sequence token.

	<Tip>

	When building a sequence using special tokens, this is not the token that is used for the end of sequence.
	The token used is the `sep_token`.

	</Tip>

	- sep_token (`str`, optional, defaults to `"</s>"`) --
	The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
	sequence classification or for a text and a question for question answering. It is also used as the last
	token of a sequence built with special tokens.
	- cls_token (`str`, optional, defaults to `"<s>"`) --
	The classifier token which is used when doing sequence classification (classification of the whole sequence
	instead of per-token classification). It is the first token of the sequence when built with special tokens.
	- unk_token (`str`, optional, defaults to `"<unk>"`) --
	The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
	token instead.
	- pad_token (`str`, optional, defaults to `"<pad>"`) --
	The token used for padding, for example when batching sequences of different lengths.
	- mask_token (`str`, optional, defaults to `"<mask>"`) --
	The token used for masking values. This is the token used when training this model with masked language
	modeling. This is the token which the model will try to predict.</paramsdesc><paramgroups>0</paramgroups></docstring>

	Constructs a BERTweet tokenizer, using Byte-Pair-Encoding.

	This tokenizer inherits from [PreTrainedTokenizer](/docs/transformers/pr_33892/en/main_classes/tokenizer#transformers.PreTrainedTokenizer) which contains most of the main methods. Users should refer to
	this superclass for more information regarding those methods.





	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>add_from_file</name><anchor>transformers.BertweetTokenizer.add_from_file</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/bertweet/tokenization_bertweet.py#L402</source><parameters>[{"name": "f", "val": ""}]</parameters></docstring>

	Loads a pre-existing dictionary from a text file and adds its symbols to this instance.


	</div>
	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>build_inputs_with_special_tokens</name><anchor>transformers.BertweetTokenizer.build_inputs_with_special_tokens</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/bertweet/tokenization_bertweet.py#L167</source><parameters>[{"name": "token_ids_0", "val": ": list"}, {"name": "token_ids_1", "val": ": typing.Optional[list[int]] = None"}]</parameters><paramsdesc>- token_ids_0 (`list[int]`) --
	List of IDs to which the special tokens will be added.
	- token_ids_1 (`list[int]`, optional) --
	Optional second list of IDs for sequence pairs.</paramsdesc><paramgroups>0</paramgroups><rettype>`list[int]`</rettype><retdesc>List of [input IDs](../glossary#input-ids) with the appropriate special tokens.</retdesc></docstring>

	Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and
	adding special tokens. A BERTweet sequence has the following format:

	- single sequence: `<s> X </s>`
	- pair of sequences: `<s> A </s></s> B </s>`
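
The two layouts above can be sketched in plain Python. This is an illustration of the format only, not the tokenizer's own code: `build_inputs` is a hypothetical function, and the IDs `0` for `<s>` and `2` for `</s>` are assumed placeholders rather than values read from the real vocabulary.

```python
from typing import Optional

CLS_ID = 0  # assumed placeholder ID for <s>
SEP_ID = 2  # assumed placeholder ID for </s>


def build_inputs(token_ids_0: list[int], token_ids_1: Optional[list[int]] = None) -> list[int]:
    """Sketch of the `<s> X </s>` and `<s> A </s></s> B </s>` layouts."""
    if token_ids_1 is None:
        # Single sequence: <s> X </s>
        return [CLS_ID] + token_ids_0 + [SEP_ID]
    # Pair of sequences: <s> A </s></s> B </s>
    return [CLS_ID] + token_ids_0 + [SEP_ID, SEP_ID] + token_ids_1 + [SEP_ID]


print(build_inputs([11, 12]))        # [0, 11, 12, 2]
print(build_inputs([11, 12], [21]))  # [0, 11, 12, 2, 2, 21, 2]
```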








	</div>
	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>convert_tokens_to_string</name><anchor>transformers.BertweetTokenizer.convert_tokens_to_string</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/bertweet/tokenization_bertweet.py#L368</source><parameters>[{"name": "tokens", "val": ""}]</parameters></docstring>
	Converts a sequence of tokens (strings) into a single string.

	</div>
	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>create_token_type_ids_from_sequences</name><anchor>transformers.BertweetTokenizer.create_token_type_ids_from_sequences</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/bertweet/tokenization_bertweet.py#L221</source><parameters>[{"name": "token_ids_0", "val": ": list"}, {"name": "token_ids_1", "val": ": typing.Optional[list[int]] = None"}]</parameters><paramsdesc>- token_ids_0 (`list[int]`) --
	List of IDs.
	- token_ids_1 (`list[int]`, optional) --
	Optional second list of IDs for sequence pairs.</paramsdesc><paramgroups>0</paramgroups><rettype>`list[int]`</rettype><retdesc>List of zeros.</retdesc></docstring>

	Create a mask from the two sequences passed to be used in a sequence-pair classification task. BERTweet does
	not make use of token type ids, therefore a list of zeros is returned.
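
As a rough illustration of the returned value, the sketch below computes a list of zeros whose length matches the sequence after special tokens are added. It assumes the `<s> X </s>` and `<s> A </s></s> B </s>` layouts used by `build_inputs_with_special_tokens`; `token_type_ids_sketch` is a hypothetical name, not the library method.

```python
def token_type_ids_sketch(token_ids_0, token_ids_1=None):
    """Return all-zero token type IDs sized for the final sequence."""
    if token_ids_1 is None:
        # <s> X </s> adds two special tokens
        length = len(token_ids_0) + 2
    else:
        # <s> A </s></s> B </s> adds four special tokens
        length = len(token_ids_0) + len(token_ids_1) + 4
    return [0] * length


print(token_type_ids_sketch([11, 12]))        # [0, 0, 0, 0]
print(token_type_ids_sketch([11, 12], [21]))  # [0, 0, 0, 0, 0, 0, 0]
```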








	</div>
	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>get_special_tokens_mask</name><anchor>transformers.BertweetTokenizer.get_special_tokens_mask</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/bertweet/tokenization_bertweet.py#L193</source><parameters>[{"name": "token_ids_0", "val": ": list"}, {"name": "token_ids_1", "val": ": typing.Optional[list[int]] = None"}, {"name": "already_has_special_tokens", "val": ": bool = False"}]</parameters><paramsdesc>- token_ids_0 (`list[int]`) --
	List of IDs.
	- token_ids_1 (`list[int]`, optional) --
	Optional second list of IDs for sequence pairs.
	- already_has_special_tokens (`bool`, optional, defaults to `False`) --
	Whether or not the token list is already formatted with special tokens for the model.</paramsdesc><paramgroups>0</paramgroups><rettype>`list[int]`</rettype><retdesc>A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.</retdesc></docstring>

	Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
	special tokens using the tokenizer `prepare_for_model` method.
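
The mask can be pictured with a small sketch for the case where `already_has_special_tokens` is `False`. It assumes the `<s> X </s>` and `<s> A </s></s> B </s>` layouts described for `build_inputs_with_special_tokens`; `special_tokens_mask_sketch` is a hypothetical function written for illustration, not the library's implementation.

```python
def special_tokens_mask_sketch(token_ids_0, token_ids_1=None):
    """Return 1 where a special token will be placed and 0 for sequence tokens."""
    if token_ids_1 is None:
        # <s> X </s>
        return [1] + [0] * len(token_ids_0) + [1]
    # <s> A </s></s> B </s>
    return [1] + [0] * len(token_ids_0) + [1, 1] + [0] * len(token_ids_1) + [1]


print(special_tokens_mask_sketch([11, 12]))        # [1, 0, 0, 1]
print(special_tokens_mask_sketch([11, 12], [21]))  # [1, 0, 0, 1, 1, 0, 1]
```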








	</div>
	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>normalizeToken</name><anchor>transformers.BertweetTokenizer.normalizeToken</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/bertweet/tokenization_bertweet.py#L341</source><parameters>[{"name": "token", "val": ""}]</parameters></docstring>

	Normalize tokens in a Tweet.
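
To give a feel for token-level normalization, here is a minimal sketch based on the preprocessing described in the BERTweet paper, where user mentions become `@USER` and links become `HTTPURL`. It is an assumption-laden illustration rather than the code in `tokenization_bertweet.py`: `normalize_token_sketch` is a hypothetical function, and emoji handling (which relies on the `emoji` library) is left out.

```python
def normalize_token_sketch(token: str) -> str:
    """Hypothetical per-token normalization: mask user mentions and links."""
    lowered = token.lower()
    if token.startswith("@"):
        return "@USER"
    if lowered.startswith("http") or lowered.startswith("www"):
        return "HTTPURL"
    return token


tweet = "@remy check https://example.com now"
normalized = " ".join(normalize_token_sketch(t) for t in tweet.split())
print(normalized)  # @USER check HTTPURL now
```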


	</div>
	<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


	<docstring><name>normalizeTweet</name><anchor>transformers.BertweetTokenizer.normalizeTweet</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/bertweet/tokenization_bertweet.py#L307</source><parameters>[{"name": "tweet", "val": ""}]</parameters></docstring>

	Normalize a raw Tweet.


	</div></div>

