	<!--Copyright 2021 The HuggingFace Team. All rights reserved.

	Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
	the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
	an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
	specific language governing permissions and limitations under the License.

	⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
	rendered properly in your Markdown viewer.

	-->

	# ByT5

	## Overview

	The ByT5 model was presented in [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626) by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir
	Kale, Adam Roberts, Colin Raffel.

	The abstract from the paper is the following:

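	*Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units.
	Encoding text as a sequence of tokens requires a tokenizer, which is typically created as an independent artifact from
	the model. Token-free models that instead operate directly on raw text (bytes or characters) have many benefits: they
	can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by
	removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token
	sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of
	operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with
	minimal modifications to process byte sequences. We carefully characterize the trade-offs in terms of parameter count,
	training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level
	counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on
	tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of
	pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our
	experiments.*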

	This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The original code can be
	found [here](https://github.com/google-research/byt5).

	<Tip>

	ByT5's architecture is based on the T5v1.1 model. Refer to the [T5v1.1 documentation page](t5v1.1) for the API reference. The two
	models differ only in how inputs are prepared for the model; see the code examples below.

	</Tip>

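	Because ByT5 was pre-trained in an unsupervised fashion, there is no real advantage to using a task prefix during single-task
	fine-tuning. If you are doing multi-task fine-tuning, you should use a prefix.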
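As a rough illustration of the multi-task case, the sketch below prepends a prefix to each input string before it is encoded. The prefix strings and the `add_task_prefix` helper are hypothetical: ByT5 does not prescribe particular prefixes, so any wording works as long as it is used consistently across the fine-tuning data.

```python
# Hypothetical task prefixes (assumption: not defined by ByT5 or Transformers).
TASK_PREFIXES = {
    "translate_en_fr": "translate English to French: ",
    "summarize": "summarize: ",
}


def add_task_prefix(task, text):
    """Prepend the prefix registered for `task` so the model can tell tasks apart."""
    return TASK_PREFIXES[task] + text


examples = [
    ("translate_en_fr", "Life is like a box of chocolates."),
    ("summarize", "ByT5 operates directly on UTF-8 bytes."),
]
prefixed = [add_task_prefix(task, text) for task, text in examples]
print(prefixed[0])
```

Because ByT5 consumes one input id per UTF-8 byte, a prefix adds its full byte length to every sequence, so short prefixes may be preferable.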

	## Usage Examples

	ByT5 works on raw UTF-8 bytes, so it can be used without a tokenizer:

	```python
	>>> from transformers import T5ForConditionalGeneration
	>>> import torch

	>>> model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

	>>> num_special_tokens = 3
	>>> # Model has 3 special tokens which take up the input ids 0,1,2 of ByT5.
	>>> # => Need to shift utf-8 character encodings by 3 before passing ids to model.

	>>> input_ids = torch.tensor([list("Life is like a box of chocolates.".encode("utf-8"))]) + num_special_tokens

	>>> labels = torch.tensor([list("La vie est comme une boîte de chocolat.".encode("utf-8"))]) + num_special_tokens

	>>> loss = model(input_ids, labels=labels).loss
	>>> loss.item()
	2.66
	```
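The shift by 3 used above can be wrapped in a pair of small helpers. This is a minimal sketch in plain Python, not part of the Transformers API; the names `text_to_ids` and `ids_to_text` are made up for illustration. It also shows that a non-ASCII character such as `î` takes more than one id, because it occupies two bytes in UTF-8.

```python
NUM_SPECIAL_TOKENS = 3  # ids 0, 1 and 2 are reserved, as in the example above


def text_to_ids(text):
    """Map each UTF-8 byte of `text` to an input id by shifting it past the special tokens."""
    return [byte + NUM_SPECIAL_TOKENS for byte in text.encode("utf-8")]


def ids_to_text(ids):
    """Invert `text_to_ids`, skipping any id that does not correspond to a byte."""
    raw = bytes(
        i - NUM_SPECIAL_TOKENS
        for i in ids
        if NUM_SPECIAL_TOKENS <= i < 256 + NUM_SPECIAL_TOKENS
    )
    # errors="ignore" drops incomplete byte sequences instead of raising.
    return raw.decode("utf-8", errors="ignore")


ids = text_to_ids("boîte")
print(ids)               # 6 ids for 5 characters
print(ids_to_text(ids))  # boîte
```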

	For batched inference and training, however, it is recommended to use the tokenizer:


	```python
	>>> from transformers import T5ForConditionalGeneration, AutoTokenizer

	>>> model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")
	>>> tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")

	>>> model_inputs = tokenizer(
	...     ["Life is like a box of chocolates.", "Today is Monday."], padding="longest", return_tensors="pt"
	... )
	>>> labels_dict = tokenizer(
	...     ["La vie est comme une boîte de chocolat.", "Aujourd'hui c'est lundi."], padding="longest", return_tensors="pt"
	... )
	>>> labels = labels_dict.input_ids

	>>> loss = model(**model_inputs, labels=labels).loss
	>>> loss.item()
	17.9
	```
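To make it clearer what `padding="longest"` contributes, here is a plain-Python sketch of the same idea: encode each string as shifted bytes, append the end-of-sequence id, and pad every row to the length of the longest one. It assumes pad id 0 and end-of-sequence id 1, and the `encode` and `pad_batch` helpers are hypothetical; in practice the tokenizer does this for you.

```python
PAD_ID = 0  # assumed pad id; decodes to '<pad>'
EOS_ID = 1  # assumed end-of-sequence id, appended to every sequence
NUM_SPECIAL_TOKENS = 3


def encode(text):
    """Shifted UTF-8 bytes followed by the end-of-sequence id."""
    return [byte + NUM_SPECIAL_TOKENS for byte in text.encode("utf-8")] + [EOS_ID]


def pad_batch(texts):
    """Pad all sequences to the longest one and build the matching attention mask."""
    encoded = [encode(text) for text in texts]
    longest = max(len(ids) for ids in encoded)
    input_ids = [ids + [PAD_ID] * (longest - len(ids)) for ids in encoded]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in encoded]
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch(
    ["Life is like a box of chocolates.", "Today is Monday."]
)
print([len(row) for row in input_ids])
print(attention_mask[1])
```

The attention mask marks real positions with 1 and padding with 0, which is how the model knows to ignore the padded tail of the shorter sentence.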

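	Similar to [T5](t5), ByT5 was trained on the span-mask denoising task. However,
	since the model works directly on characters, the pretraining task is a bit
	different. Let's corrupt some characters of the
	input sentence `"The dog chases a ball in the park."` and ask ByT5 to predict them
	for us.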

	```python
	>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
	>>> import torch

	>>> tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
	>>> model = AutoModelForSeq2SeqLM.from_pretrained("google/byt5-base")

	>>> input_ids_prompt = "The dog chases a ball in the park."
	>>> input_ids = tokenizer(input_ids_prompt).input_ids

	>>> # Note that we cannot add "{extra_id_...}" to the string directly
	>>> # as the Byte tokenizer would incorrectly merge the tokens
	>>> # For ByT5, we need to work directly on the character level
	>>> # Contrary to T5, ByT5 does not use sentinel tokens for masking, but instead
	>>> # uses final utf character ids.
	>>> # A byte takes one of 2**8 = 256 values and ByT5 has 3 special tokens.
	>>> # => There are 2**8+3 = 259 input ids and mask tokens count down from index 258.
	>>> # => mask to "The dog [258]a ball [257]park."

	>>> input_ids = torch.tensor([input_ids[:8] + [258] + input_ids[14:21] + [257] + input_ids[28:]])
	>>> input_ids
	tensor([[ 87, 107, 104, 35, 103, 114, 106, 35, 258, 35, 100, 35, 101, 100, 111, 111, 257, 35, 115, 100, 117, 110, 49, 1]])

	>>> # ByT5 produces only one char at a time so we need to produce many more output characters here -> set `max_length=100`.
	>>> output_ids = model.generate(input_ids, max_length=100)[0].tolist()
	>>> output_ids
	[0, 258, 108, 118, 35, 119, 107, 104, 35, 114, 113, 104, 35, 122, 107, 114, 35, 103, 114, 104, 118, 257, 35, 108, 113, 35, 119, 107, 104, 35, 103, 108, 118, 102, 114, 256, 108, 113, 35, 119, 107, 104, 35, 115, 100, 117, 110, 49, 35, 87, 107, 104, 35, 103, 114, 106, 35, 108, 118, 35, 119, 107, 104, 35, 114, 113, 104, 35, 122, 107, 114, 35, 103, 114, 104, 118, 35, 100, 35, 101, 100, 111, 111, 35, 108, 113, 255, 35, 108, 113, 35, 119, 107, 104, 35, 115, 100, 117, 110, 49]

	>>> # ^- Note how 258 descends to 257, 256, 255

	>>> # Now we need to split on the sentinel tokens, let's write a short loop for this
	>>> output_ids_list = []
	>>> start_token = 0
	>>> sentinel_token = 258
	>>> while sentinel_token in output_ids:
	...     split_idx = output_ids.index(sentinel_token)
	...     output_ids_list.append(output_ids[start_token:split_idx])
	...     start_token = split_idx
	...     sentinel_token -= 1

	>>> output_ids_list.append(output_ids[start_token:])
	>>> output_string = tokenizer.batch_decode(output_ids_list)
	>>> output_string
	['<pad>', 'is the one who does', ' in the disco', 'in the park. The dog is the one who does a ball in', ' in the park.']
	```
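The masking and splitting steps above can also be written as reusable functions. The following is a sketch in plain Python under the same assumptions as the example (3 special tokens, end-of-sequence id 1, sentinels counting down from 258); `mask_spans`, `split_on_sentinels`, and `decode` are hypothetical helpers, not Transformers functions. Unlike the loop above, `split_on_sentinels` drops the sentinel ids and anything before the first sentinel.

```python
NUM_SPECIAL_TOKENS = 3
EOS_ID = 1            # assumed end-of-sequence id
FIRST_SENTINEL = 258  # highest input id; later sentinels count down from here


def encode(text):
    """Shifted UTF-8 bytes followed by the end-of-sequence id."""
    return [byte + NUM_SPECIAL_TOKENS for byte in text.encode("utf-8")] + [EOS_ID]


def decode(ids):
    """Turn byte ids back into text, skipping special tokens."""
    raw = bytes(i - NUM_SPECIAL_TOKENS for i in ids if 3 <= i < 259)
    return raw.decode("utf-8", errors="ignore")


def mask_spans(ids, spans):
    """Replace each (start, end) slice of `ids` with a single descending sentinel id."""
    out, cursor, sentinel = [], 0, FIRST_SENTINEL
    for start, end in sorted(spans):
        out.extend(ids[cursor:start])
        out.append(sentinel)
        sentinel -= 1
        cursor = end
    out.extend(ids[cursor:])
    return out


def split_on_sentinels(output_ids):
    """Collect the ids that follow each sentinel, in the order the sentinels appear."""
    chunks, sentinel, current = [], FIRST_SENTINEL, None
    for token in output_ids:
        if token == sentinel:
            if current is not None:
                chunks.append(current)
            current, sentinel = [], sentinel - 1
        elif current is not None:
            current.append(token)
    if current is not None:
        chunks.append(current)
    return chunks


ids = encode("The dog chases a ball in the park.")
masked = mask_spans(ids, [(8, 14), (21, 28)])  # same spans as the example above
print(masked)

# A shortened, made-up model output: two sentinels, each followed by a prediction.
chunks = split_on_sentinels([0, 258, 108, 118, 257, 35, 108, 113])
print([decode(chunk) for chunk in chunks])
```

Passing the `masked` list (wrapped in `torch.tensor([...])`) to `model.generate` reproduces the input used above.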

	## ByT5Tokenizer

	[[autodoc]] ByT5Tokenizer

	See [`ByT5Tokenizer`] for all details.