lsmpp's picture
Add files using upload-large-folder tool
4cef5ec verified

BERTweet

PyTorch TensorFlow Flax

BERTweet

BERTweet shares the same architecture as BERT-base, but it’s pretrained like RoBERTa on English Tweets. It performs really well on Tweet-related tasks like part-of-speech tagging, named entity recognition, and text classification.

You can find all the original BERTweet checkpoints under the VinAI Research organization.

Refer to the BERT docs for more examples of how to apply BERTweet to different language tasks.

The example below demonstrates how to predict the <mask> token with [Pipeline], [AutoModel], and from the command line.

import torch
from transformers import pipeline

pipeline = pipeline(
    task="fill-mask",
    model="vinai/bertweet-base",
    torch_dtype=torch.float16,
    device=0
)
pipeline("Plants create <mask> through a process known as photosynthesis.")
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
   "vinai/bertweet-base",
)
model = AutoModelForMaskedLM.from_pretrained(
    "vinai/bertweet-base",
    torch_dtype=torch.float16,
    device_map="auto"
)
inputs = tokenizer("Plants create <mask> through a process known as photosynthesis.", return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model(**inputs)
    predictions = outputs.logits

masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1]
predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)

print(f"The predicted token is: {predicted_token}")
echo -e "Plants create <mask> through a process known as photosynthesis." | transformers-cli run --task fill-mask --model vinai/bertweet-base --device 0

Notes

  • Use the [AutoTokenizer] or [BertweetTokenizer] because it’s preloaded with a custom vocabulary adapted to tweet-specific tokens like hashtags (#), mentions (@), emojis, and common abbreviations. Make sure to also install the emoji library.
  • Inputs should be padded on the right (padding="max_length") because BERT uses absolute position embeddings.

BertweetTokenizer

[[autodoc]] BertweetTokenizer