	# BERTweet

	## Overview

	The BERTweet model was proposed in [BERTweet: A pre-trained language model for English Tweets](https://www.aclweb.org/anthology/2020.emnlp-demos.2.pdf) by Dat Quoc Nguyen, Thanh Vu, and Anh Tuan Nguyen.

	The abstract from the paper is the following:

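	*We present BERTweet, the first public large-scale pre-trained language model for English Tweets. Our BERTweet, having
	the same architecture as BERT-base (Devlin et al., 2019), is trained using the RoBERTa pre-training procedure (Liu et
	al., 2019). Experiments show that BERTweet outperforms strong baselines RoBERTa-base and XLM-R-base (Conneau et al.,
	2020), producing better performance results than the previous state-of-the-art models on three Tweet NLP tasks:
	part-of-speech tagging, named-entity recognition and text classification.*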

	## Usage example

	```python
	>>> import torch
	>>> from transformers import AutoModel, AutoTokenizer

	>>> bertweet = AutoModel.from_pretrained("vinai/bertweet-base")

	>>> # For transformers v4.x+:
	>>> tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)

	>>> # For transformers v3.x:
	>>> # tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")

	>>> # INPUT TWEET IS ALREADY NORMALIZED!
	>>> line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"

	>>> input_ids = torch.tensor([tokenizer.encode(line)])

	>>> with torch.no_grad():
	...     features = bertweet(input_ids)  # Model outputs are now tuples

	>>> # With TensorFlow 2.0+:
	>>> # from transformers import TFAutoModel
	>>> # bertweet = TFAutoModel.from_pretrained("vinai/bertweet-base")
	```

	This implementation is the same as BERT, except for the tokenization method. Refer to the [BERT documentation](bert) for
	API reference information.

	This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/BERTweet).

	## BertweetTokenizer[[transformers.BertweetTokenizer]]

	#### transformers.BertweetTokenizer[[transformers.BertweetTokenizer]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bertweet/tokenization_bertweet.py#L51)

	Constructs a BERTweet tokenizer, using Byte-Pair-Encoding.

	This tokenizer inherits from [PreTrainedTokenizer](/docs/transformers/main/ja/main_classes/tokenizer#transformers.PythonBackend) which contains most of the main methods. Users should refer to
	this superclass for more information regarding those methods.

	#### add_from_file[[transformers.BertweetTokenizer.add_from_file]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bertweet/tokenization_bertweet.py#L332)

	Loads a pre-existing dictionary from a text file and adds its symbols to this instance.
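As a rough illustration of what loading such a dictionary involves, the sketch below parses text with one `<token> <count>` entry per line and gives each new token the next free ID. The file format, the helper name `load_dictionary`, and the sample entries are assumptions made for this example. Consult the linked source for the actual behavior.

```python
import io


def load_dictionary(lines, encoder=None):
    """Add the symbols listed in `lines` to `encoder` (token -> id).

    Hypothetical helper: assumes one "<token> <count>" entry per line.
    """
    encoder = {} if encoder is None else encoder
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        idx = line.rfind(" ")
        if idx == -1:
            raise ValueError("expected '<token> <count>' on each line")
        token = line[:idx]
        # Only new tokens get an ID; existing entries keep theirs.
        encoder.setdefault(token, len(encoder))
    return encoder


# Start from some special tokens, then add symbols from an in-memory "file".
vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
sample = io.StringIO("the 1000\ncorona@@ 42\nvirus 40\n")
vocab = load_dictionary(sample, vocab)
print(vocab["corona@@"])
```

The count column is ignored in this sketch. Only the order of the entries determines the assigned IDs.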

	Parameters of `BertweetTokenizer`:

	vocab_file (`str`) : Path to the vocabulary file.

	merges_file (`str`) : Path to the merges file.

	normalization (`bool`, optional, defaults to `False`) : Whether or not to apply a normalization preprocess.

	bos_token (`str`, optional, defaults to `"<s>"`) : The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token. When building a sequence using special tokens, this is not the token that is used for the beginning of sequence. The token used is the `cls_token`.

	eos_token (`str`, optional, defaults to `"</s>"`) : The end of sequence token. When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the `sep_token`.

	sep_token (`str`, optional, defaults to `"</s>"`) : The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.

	cls_token (`str`, optional, defaults to `"<s>"`) : The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.

	unk_token (`str`, optional, defaults to `"<unk>"`) : The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

	pad_token (`str`, optional, defaults to `"<pad>"`) : The token used for padding, for example when batching sequences of different lengths.

	mask_token (`str`, optional, defaults to `"<mask>"`) : The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.
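To make the roles of `cls_token` and `sep_token` concrete, here is a minimal sketch of how a sequence could be wrapped in special tokens. The single-sequence layout follows the parameter descriptions above. The pair layout with a doubled separator is an assumption borrowed from the RoBERTa convention, and the function name and token IDs are hypothetical.

```python
from typing import List, Optional

CLS_ID = 0  # hypothetical ID for "<s>"
SEP_ID = 2  # hypothetical ID for "</s>"


def build_inputs(ids_a: List[int], ids_b: Optional[List[int]] = None) -> List[int]:
    """Wrap one or two token-ID sequences in special tokens."""
    if ids_b is None:
        # Single sequence: <s> A </s>
        return [CLS_ID] + ids_a + [SEP_ID]
    # Pair (assumed RoBERTa-style layout): <s> A </s></s> B </s>
    return [CLS_ID] + ids_a + [SEP_ID, SEP_ID] + ids_b + [SEP_ID]


print(build_inputs([10, 11]))        # [0, 10, 11, 2]
print(build_inputs([10, 11], [20]))  # [0, 10, 11, 2, 2, 20, 2]
```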

	#### convert_tokens_to_string[[transformers.BertweetTokenizer.convert_tokens_to_string]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bertweet/tokenization_bertweet.py#L291)

	Converts a sequence of tokens (strings) into a single string.

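As an illustration, assuming subword pieces carry a trailing `@@` continuation marker (the fastBPE convention, treated here as an assumption), joining tokens back into text could look like the sketch below. The function name `join_bpe_tokens` is hypothetical.

```python
from typing import List


def join_bpe_tokens(tokens: List[str]) -> str:
    """Join BPE pieces, merging any piece that ends with '@@' into the next one."""
    return " ".join(tokens).replace("@@ ", "").strip()


print(join_bpe_tokens(["corona@@", "virus", "cases"]))  # coronavirus cases
```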
	#### normalizeToken[[transformers.BertweetTokenizer.normalizeToken]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bertweet/tokenization_bertweet.py#L264)

	Normalize tokens in a Tweet.

	#### normalizeTweet[[transformers.BertweetTokenizer.normalizeTweet]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bertweet/tokenization_bertweet.py#L230)

	Normalize a raw Tweet.

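The usage example earlier on this page passes an already-normalized Tweet, in which the user mention and the link appear as `@USER` and `HTTPURL`. The sketch below illustrates that kind of replacement on whitespace-separated tokens. It is a simplified stand-in, not the library's implementation: the function names are hypothetical, and the real methods cover further cases (such as emoji) that are not shown here.

```python
def normalize_token(token: str) -> str:
    """Map user mentions and links to placeholder tokens."""
    lowered = token.lower()
    if token.startswith("@"):
        return "@USER"
    if lowered.startswith("http") or lowered.startswith("www"):
        return "HTTPURL"
    return token


def normalize_tweet(tweet: str) -> str:
    """Normalize each whitespace-separated token of a raw Tweet."""
    return " ".join(normalize_token(tok) for tok in tweet.split())


raw = "DHEC confirms https://example.com/news via @postandcourier"
print(normalize_tweet(raw))  # DHEC confirms HTTPURL via @USER
```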
	#### save_vocabulary[[transformers.BertweetTokenizer.save_vocabulary]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bertweet/tokenization_bertweet.py#L302)

	Save the vocabulary and merges files to a directory.
