deepspeed / transformers /docs /source /ko /tasks /token_classification.md
xingzhikb's picture
init
002bd9b
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
โš ๏ธ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# ํ† ํฐ ๋ถ„๋ฅ˜[[token-classification]]
[[open-in-colab]]
<Youtube id="wVHdVlPScxA"/>
ํ† ํฐ ๋ถ„๋ฅ˜๋Š” ๋ฌธ์žฅ์˜ ๊ฐœ๋ณ„ ํ† ํฐ์— ๋ ˆ์ด๋ธ”์„ ํ• ๋‹นํ•ฉ๋‹ˆ๋‹ค. ๊ฐ€์žฅ ์ผ๋ฐ˜์ ์ธ ํ† ํฐ ๋ถ„๋ฅ˜ ์ž‘์—… ์ค‘ ํ•˜๋‚˜๋Š” ๊ฐœ์ฒด๋ช… ์ธ์‹(Named Entity Recognition, NER)์ž…๋‹ˆ๋‹ค. ๊ฐœ์ฒด๋ช… ์ธ์‹์€ ๋ฌธ์žฅ์—์„œ ์‚ฌ๋žŒ, ์œ„์น˜ ๋˜๋Š” ์กฐ์ง๊ณผ ๊ฐ™์€ ๊ฐ ๊ฐœ์ฒด์˜ ๋ ˆ์ด๋ธ”์„ ์ฐพ์œผ๋ ค๊ณ  ์‹œ๋„ํ•ฉ๋‹ˆ๋‹ค.
์ด ๊ฐ€์ด๋“œ์—์„œ ํ•™์Šตํ•  ๋‚ด์šฉ์€:
1. [WNUT 17](https://huggingface.co/datasets/wnut_17) ๋ฐ์ดํ„ฐ ์„ธํŠธ์—์„œ [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased)๋ฅผ ํŒŒ์ธ ํŠœ๋‹ํ•˜์—ฌ ์ƒˆ๋กœ์šด ๊ฐœ์ฒด๋ฅผ ํƒ์ง€ํ•ฉ๋‹ˆ๋‹ค.
2. ์ถ”๋ก ์„ ์œ„ํ•ด ํŒŒ์ธ ํŠœ๋‹ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
<Tip>
์ด ํŠœํ† ๋ฆฌ์–ผ์—์„œ ์„ค๋ช…ํ•˜๋Š” ์ž‘์—…์€ ๋‹ค์Œ ๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜์— ์˜ํ•ด ์ง€์›๋ฉ๋‹ˆ๋‹ค:
<!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->
[ALBERT](../model_doc/albert), [BERT](../model_doc/bert), [BigBird](../model_doc/big_bird), [BioGpt](../model_doc/biogpt), [BLOOM](../model_doc/bloom), [CamemBERT](../model_doc/camembert), [CANINE](../model_doc/canine), [ConvBERT](../model_doc/convbert), [Data2VecText](../model_doc/data2vec-text), [DeBERTa](../model_doc/deberta), [DeBERTa-v2](../model_doc/deberta-v2), [DistilBERT](../model_doc/distilbert), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [ErnieM](../model_doc/ernie_m), [ESM](../model_doc/esm), [FlauBERT](../model_doc/flaubert), [FNet](../model_doc/fnet), [Funnel Transformer](../model_doc/funnel), [GPT-Sw3](../model_doc/gpt-sw3), [OpenAI GPT-2](../model_doc/gpt2), [GPTBigCode](../model_doc/gpt_bigcode), [I-BERT](../model_doc/ibert), [LayoutLM](../model_doc/layoutlm), [LayoutLMv2](../model_doc/layoutlmv2), [LayoutLMv3](../model_doc/layoutlmv3), [LiLT](../model_doc/lilt), [Longformer](../model_doc/longformer), [LUKE](../model_doc/luke), [MarkupLM](../model_doc/markuplm), [MEGA](../model_doc/mega), [Megatron-BERT](../model_doc/megatron-bert), [MobileBERT](../model_doc/mobilebert), [MPNet](../model_doc/mpnet), [Nezha](../model_doc/nezha), [Nystrรถmformer](../model_doc/nystromformer), [QDQBert](../model_doc/qdqbert), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [SqueezeBERT](../model_doc/squeezebert), [XLM](../model_doc/xlm), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod), [YOSO](../model_doc/yoso)
<!--End of the generated tip-->
</Tip>
์‹œ์ž‘ํ•˜๊ธฐ ์ „์—, ํ•„์š”ํ•œ ๋ชจ๋“  ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๊ฐ€ ์„ค์น˜๋˜์–ด ์žˆ๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”:
```bash
pip install transformers datasets evaluate seqeval
```
Hugging Face ๊ณ„์ •์— ๋กœ๊ทธ์ธํ•˜์—ฌ ๋ชจ๋ธ์„ ์—…๋กœ๋“œํ•˜๊ณ  ์ปค๋ฎค๋‹ˆํ‹ฐ์— ๊ณต์œ ํ•˜๋Š” ๊ฒƒ์„ ๊ถŒ์žฅํ•ฉ๋‹ˆ๋‹ค. ๋ฉ”์‹œ์ง€๊ฐ€ ํ‘œ์‹œ๋˜๋ฉด, ํ† ํฐ์„ ์ž…๋ ฅํ•˜์—ฌ ๋กœ๊ทธ์ธํ•˜์„ธ์š”:
```py
>>> from huggingface_hub import notebook_login
>>> notebook_login()
```
## WNUT 17 ๋ฐ์ดํ„ฐ ์„ธํŠธ ๊ฐ€์ ธ์˜ค๊ธฐ[[load-wnut-17-dataset]]
๋จผ์ € ๐Ÿค— Datasets ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์—์„œ WNUT 17 ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค:
```py
>>> from datasets import load_dataset
>>> wnut = load_dataset("wnut_17")
```
๋‹ค์Œ ์˜ˆ์ œ๋ฅผ ์‚ดํŽด๋ณด์„ธ์š”:
```py
>>> wnut["train"][0]
{'id': '0',
'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0],
'tokens': ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.']
}
```
`ner_tags`์˜ ๊ฐ ์ˆซ์ž๋Š” ๊ฐœ์ฒด๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. ์ˆซ์ž๋ฅผ ๋ ˆ์ด๋ธ” ์ด๋ฆ„์œผ๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ ๊ฐœ์ฒด๊ฐ€ ๋ฌด์—‡์ธ์ง€ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค:
```py
>>> label_list = wnut["train"].features[f"ner_tags"].feature.names
>>> label_list
[
"O",
"B-corporation",
"I-corporation",
"B-creative-work",
"I-creative-work",
"B-group",
"I-group",
"B-location",
"I-location",
"B-person",
"I-person",
"B-product",
"I-product",
]
```
๊ฐ `ner_tag`์˜ ์•ž์— ๋ถ™์€ ๋ฌธ์ž๋Š” ๊ฐœ์ฒด์˜ ํ† ํฐ ์œ„์น˜๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค:
- `B-`๋Š” ๊ฐœ์ฒด์˜ ์‹œ์ž‘์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
- `I-`๋Š” ํ† ํฐ์ด ๋™์ผํ•œ ๊ฐœ์ฒด ๋‚ด๋ถ€์— ํฌํ•จ๋˜์–ด ์žˆ์Œ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค(์˜ˆ๋ฅผ ๋“ค์–ด `State` ํ† ํฐ์€ `Empire State Building`์™€ ๊ฐ™์€ ๊ฐœ์ฒด์˜ ์ผ๋ถ€์ž…๋‹ˆ๋‹ค).
- `0`๋Š” ํ† ํฐ์ด ์–ด๋–ค ๊ฐœ์ฒด์—๋„ ํ•ด๋‹นํ•˜์ง€ ์•Š์Œ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
## ์ „์ฒ˜๋ฆฌ[[preprocess]]
<Youtube id="iY2AZYdZAr0"/>
๋‹ค์Œ์œผ๋กœ `tokens` ํ•„๋“œ๋ฅผ ์ „์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด DistilBERT ํ† ํฌ๋‚˜์ด์ €๋ฅผ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค:
```py
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
```
์œ„์˜ ์˜ˆ์ œ `tokens` ํ•„๋“œ๋ฅผ ๋ณด๋ฉด ์ž…๋ ฅ์ด ์ด๋ฏธ ํ† ํฐํ™”๋œ ๊ฒƒ์ฒ˜๋Ÿผ ๋ณด์ž…๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์‹ค์ œ๋กœ ์ž…๋ ฅ์€ ์•„์ง ํ† ํฐํ™”๋˜์ง€ ์•Š์•˜์œผ๋ฏ€๋กœ ๋‹จ์–ด๋ฅผ ํ•˜์œ„ ๋‹จ์–ด๋กœ ํ† ํฐํ™”ํ•˜๊ธฐ ์œ„ํ•ด `is_split_into_words=True`๋ฅผ ์„ค์ •ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ์ œ๋กœ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค:
```py
>>> example = wnut["train"][0]
>>> tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
>>> tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
>>> tokens
['[CLS]', '@', 'paul', '##walk', 'it', "'", 's', 'the', 'view', 'from', 'where', 'i', "'", 'm', 'living', 'for', 'two', 'weeks', '.', 'empire', 'state', 'building', '=', 'es', '##b', '.', 'pretty', 'bad', 'storm', 'here', 'last', 'evening', '.', '[SEP]']
```
๊ทธ๋Ÿฌ๋‚˜ ์ด๋กœ ์ธํ•ด `[CLS]`๊ณผ `[SEP]`๋ผ๋Š” ํŠน์ˆ˜ ํ† ํฐ์ด ์ถ”๊ฐ€๋˜๊ณ , ํ•˜์œ„ ๋‹จ์–ด ํ† ํฐํ™”๋กœ ์ธํ•ด ์ž…๋ ฅ๊ณผ ๋ ˆ์ด๋ธ” ๊ฐ„์— ๋ถˆ์ผ์น˜๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค. ํ•˜๋‚˜์˜ ๋ ˆ์ด๋ธ”์— ํ•ด๋‹นํ•˜๋Š” ๋‹จ์ผ ๋‹จ์–ด๋Š” ์ด์ œ ๋‘ ๊ฐœ์˜ ํ•˜์œ„ ๋‹จ์–ด๋กœ ๋ถ„ํ• ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ† ํฐ๊ณผ ๋ ˆ์ด๋ธ”์„ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์žฌ์ •๋ ฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:
1. [`word_ids`](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.BatchEncoding.word_ids) ๋ฉ”์†Œ๋“œ๋กœ ๋ชจ๋“  ํ† ํฐ์„ ํ•ด๋‹น ๋‹จ์–ด์— ๋งคํ•‘ํ•ฉ๋‹ˆ๋‹ค.
2. ํŠน์ˆ˜ ํ† ํฐ `[CLS]`์™€ `[SEP]`์— `-100` ๋ ˆ์ด๋ธ”์„ ํ• ๋‹นํ•˜์—ฌ, PyTorch ์†์‹ค ํ•จ์ˆ˜๊ฐ€ ํ•ด๋‹น ํ† ํฐ์„ ๋ฌด์‹œํ•˜๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.
3. ์ฃผ์–ด์ง„ ๋‹จ์–ด์˜ ์ฒซ ๋ฒˆ์งธ ํ† ํฐ์—๋งŒ ๋ ˆ์ด๋ธ”์„ ์ง€์ •ํ•ฉ๋‹ˆ๋‹ค. ๊ฐ™์€ ๋‹จ์–ด์˜ ๋‹ค๋ฅธ ํ•˜์œ„ ํ† ํฐ์— `-100`์„ ํ• ๋‹นํ•ฉ๋‹ˆ๋‹ค.
๋‹ค์Œ์€ ํ† ํฐ๊ณผ ๋ ˆ์ด๋ธ”์„ ์žฌ์ •๋ ฌํ•˜๊ณ  DistilBERT์˜ ์ตœ๋Œ€ ์ž…๋ ฅ ๊ธธ์ด๋ณด๋‹ค ๊ธธ์ง€ ์•Š๋„๋ก ์‹œํ€€์Šค๋ฅผ ์ž˜๋ผ๋‚ด๋Š” ํ•จ์ˆ˜๋ฅผ ๋งŒ๋“œ๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค:
```py
>>> def tokenize_and_align_labels(examples):
... tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
... labels = []
... for i, label in enumerate(examples[f"ner_tags"]):
... word_ids = tokenized_inputs.word_ids(batch_index=i) # Map tokens to their respective word.
... previous_word_idx = None
... label_ids = []
... for word_idx in word_ids: # Set the special tokens to -100.
... if word_idx is None:
... label_ids.append(-100)
... elif word_idx != previous_word_idx: # Only label the first token of a given word.
... label_ids.append(label[word_idx])
... else:
... label_ids.append(-100)
... previous_word_idx = word_idx
... labels.append(label_ids)
... tokenized_inputs["labels"] = labels
... return tokenized_inputs
```
์ „์ฒด ๋ฐ์ดํ„ฐ ์„ธํŠธ์— ์ „์ฒ˜๋ฆฌ ํ•จ์ˆ˜๋ฅผ ์ ์šฉํ•˜๋ ค๋ฉด, ๐Ÿค— Datasets [`~datasets.Dataset.map`] ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์„ธ์š”. `batched=True`๋กœ ์„ค์ •ํ•˜์—ฌ ๋ฐ์ดํ„ฐ ์„ธํŠธ์˜ ์—ฌ๋Ÿฌ ์š”์†Œ๋ฅผ ํ•œ ๋ฒˆ์— ์ฒ˜๋ฆฌํ•˜๋ฉด `map` ํ•จ์ˆ˜์˜ ์†๋„๋ฅผ ๋†’์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:
```py
>>> tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)
```
์ด์ œ [`DataCollatorWithPadding`]๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์˜ˆ์ œ ๋ฐฐ์น˜๋ฅผ ๋งŒ๋“ค์–ด๋ด…์‹œ๋‹ค. ๋ฐ์ดํ„ฐ ์„ธํŠธ ์ „์ฒด๋ฅผ ์ตœ๋Œ€ ๊ธธ์ด๋กœ ํŒจ๋”ฉํ•˜๋Š” ๋Œ€์‹ , *๋™์  ํŒจ๋”ฉ*์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ฐฐ์น˜์—์„œ ๊ฐ€์žฅ ๊ธด ๊ธธ์ด์— ๋งž๊ฒŒ ๋ฌธ์žฅ์„ ํŒจ๋”ฉํ•˜๋Š” ๊ฒƒ์ด ํšจ์œจ์ ์ž…๋‹ˆ๋‹ค.
<frameworkcontent>
<pt>
```py
>>> from transformers import DataCollatorForTokenClassification
>>> data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
```
</pt>
<tf>
```py
>>> from transformers import DataCollatorForTokenClassification
>>> data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer, return_tensors="tf")
```
</tf>
</frameworkcontent>
## ํ‰๊ฐ€[[evaluation]]
ํ›ˆ๋ จ ์ค‘ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•ด ํ‰๊ฐ€ ์ง€ํ‘œ๋ฅผ ํฌํ•จํ•˜๋Š” ๊ฒƒ์ด ์œ ์šฉํ•ฉ๋‹ˆ๋‹ค. ๐Ÿค— [Evaluate](https://huggingface.co/docs/evaluate/index) ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋น ๋ฅด๊ฒŒ ํ‰๊ฐ€ ๋ฐฉ๋ฒ•์„ ๊ฐ€์ ธ์˜ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ์ž‘์—…์—์„œ๋Š” [seqeval](https://huggingface.co/spaces/evaluate-metric/seqeval) ํ‰๊ฐ€ ์ง€ํ‘œ๋ฅผ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค. (ํ‰๊ฐ€ ์ง€ํ‘œ๋ฅผ ๊ฐ€์ ธ์˜ค๊ณ  ๊ณ„์‚ฐํ•˜๋Š” ๋ฐฉ๋ฒ•์— ๋Œ€ํ•ด์„œ๋Š” ๐Ÿค— Evaluate [๋น ๋ฅธ ๋‘˜๋Ÿฌ๋ณด๊ธฐ](https://huggingface.co/docs/evaluate/a_quick_tour)๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”). Seqeval์€ ์‹ค์ œ๋กœ ์ •๋ฐ€๋„, ์žฌํ˜„๋ฅ , F1 ๋ฐ ์ •ํ™•๋„์™€ ๊ฐ™์€ ์—ฌ๋Ÿฌ ์ ์ˆ˜๋ฅผ ์‚ฐ์ถœํ•ฉ๋‹ˆ๋‹ค.
```py
>>> import evaluate
>>> seqeval = evaluate.load("seqeval")
```
๋จผ์ € NER ๋ ˆ์ด๋ธ”์„ ๊ฐ€์ ธ์˜จ ๋‹ค์Œ, [`~evaluate.EvaluationModule.compute`]์— ์‹ค์ œ ์˜ˆ์ธก๊ณผ ์‹ค์ œ ๋ ˆ์ด๋ธ”์„ ์ „๋‹ฌํ•˜์—ฌ ์ ์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ํ•จ์ˆ˜๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค:
```py
>>> import numpy as np
>>> labels = [label_list[i] for i in example[f"ner_tags"]]
>>> def compute_metrics(p):
... predictions, labels = p
... predictions = np.argmax(predictions, axis=2)
... true_predictions = [
... [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
... for prediction, label in zip(predictions, labels)
... ]
... true_labels = [
... [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
... for prediction, label in zip(predictions, labels)
... ]
... results = seqeval.compute(predictions=true_predictions, references=true_labels)
... return {
... "precision": results["overall_precision"],
... "recall": results["overall_recall"],
... "f1": results["overall_f1"],
... "accuracy": results["overall_accuracy"],
... }
```
์ด์ œ `compute_metrics` ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•  ์ค€๋น„๊ฐ€ ๋˜์—ˆ์œผ๋ฉฐ, ํ›ˆ๋ จ์„ ์„ค์ •ํ•˜๋ฉด ์ด ํ•จ์ˆ˜๋กœ ๋˜๋Œ์•„์˜ฌ ๊ฒƒ์ž…๋‹ˆ๋‹ค.
## ํ›ˆ๋ จ[[train]]
๋ชจ๋ธ์„ ํ›ˆ๋ จํ•˜๊ธฐ ์ „์—, `id2label`์™€ `label2id`๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์˜ˆ์ƒ๋˜๋Š” id์™€ ๋ ˆ์ด๋ธ”์˜ ๋งต์„ ์ƒ์„ฑํ•˜์„ธ์š”:
```py
>>> id2label = {
... 0: "O",
... 1: "B-corporation",
... 2: "I-corporation",
... 3: "B-creative-work",
... 4: "I-creative-work",
... 5: "B-group",
... 6: "I-group",
... 7: "B-location",
... 8: "I-location",
... 9: "B-person",
... 10: "I-person",
... 11: "B-product",
... 12: "I-product",
... }
>>> label2id = {
... "O": 0,
... "B-corporation": 1,
... "I-corporation": 2,
... "B-creative-work": 3,
... "I-creative-work": 4,
... "B-group": 5,
... "I-group": 6,
... "B-location": 7,
... "I-location": 8,
... "B-person": 9,
... "I-person": 10,
... "B-product": 11,
... "I-product": 12,
... }
```
<frameworkcontent>
<pt>
<Tip>
[`Trainer`]๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ชจ๋ธ์„ ํŒŒ์ธ ํŠœ๋‹ํ•˜๋Š” ๋ฐฉ๋ฒ•์— ์ต์ˆ™ํ•˜์ง€ ์•Š์€ ๊ฒฝ์šฐ, [์—ฌ๊ธฐ](../training#train-with-pytorch-trainer)์—์„œ ๊ธฐ๋ณธ ํŠœํ† ๋ฆฌ์–ผ์„ ํ™•์ธํ•˜์„ธ์š”!
</Tip>
์ด์ œ ๋ชจ๋ธ์„ ํ›ˆ๋ จ์‹œํ‚ฌ ์ค€๋น„๊ฐ€ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค! [`AutoModelForSequenceClassification`]๋กœ DistilBERT๋ฅผ ๊ฐ€์ ธ์˜ค๊ณ  ์˜ˆ์ƒ๋˜๋Š” ๋ ˆ์ด๋ธ” ์ˆ˜์™€ ๋ ˆ์ด๋ธ” ๋งคํ•‘์„ ์ง€์ •ํ•˜์„ธ์š”:
```py
>>> from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
>>> model = AutoModelForTokenClassification.from_pretrained(
... "distilbert/distilbert-base-uncased", num_labels=13, id2label=id2label, label2id=label2id
... )
```
์ด์ œ ์„ธ ๋‹จ๊ณ„๋งŒ ๊ฑฐ์น˜๋ฉด ๋์ž…๋‹ˆ๋‹ค:
1. [`TrainingArguments`]์—์„œ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ •์˜ํ•˜์„ธ์š”. `output_dir`๋Š” ๋ชจ๋ธ์„ ์ €์žฅํ•  ์œ„์น˜๋ฅผ ์ง€์ •ํ•˜๋Š” ์œ ์ผํ•œ ๋งค๊ฐœ๋ณ€์ˆ˜์ž…๋‹ˆ๋‹ค. ์ด ๋ชจ๋ธ์„ ํ—ˆ๋ธŒ์— ์—…๋กœ๋“œํ•˜๊ธฐ ์œ„ํ•ด `push_to_hub=True`๋ฅผ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค(๋ชจ๋ธ์„ ์—…๋กœ๋“œํ•˜๊ธฐ ์œ„ํ•ด Hugging Face์— ๋กœ๊ทธ์ธํ•ด์•ผํ•ฉ๋‹ˆ๋‹ค.) ๊ฐ ์—ํญ์ด ๋๋‚  ๋•Œ๋งˆ๋‹ค, [`Trainer`]๋Š” seqeval ์ ์ˆ˜๋ฅผ ํ‰๊ฐ€ํ•˜๊ณ  ํ›ˆ๋ จ ์ฒดํฌํฌ์ธํŠธ๋ฅผ ์ €์žฅํ•ฉ๋‹ˆ๋‹ค.
2. [`Trainer`]์— ํ›ˆ๋ จ ์ธ์ˆ˜์™€ ๋ชจ๋ธ, ๋ฐ์ดํ„ฐ ์„ธํŠธ, ํ† ํฌ๋‚˜์ด์ €, ๋ฐ์ดํ„ฐ ์ฝœ๋ ˆ์ดํ„ฐ ๋ฐ `compute_metrics` ํ•จ์ˆ˜๋ฅผ ์ „๋‹ฌํ•˜์„ธ์š”.
3. [`~Trainer.train`]๋ฅผ ํ˜ธ์ถœํ•˜์—ฌ ๋ชจ๋ธ์„ ํŒŒ์ธ ํŠœ๋‹ํ•˜์„ธ์š”.
```py
>>> training_args = TrainingArguments(
... output_dir="my_awesome_wnut_model",
... learning_rate=2e-5,
... per_device_train_batch_size=16,
... per_device_eval_batch_size=16,
... num_train_epochs=2,
... weight_decay=0.01,
... evaluation_strategy="epoch",
... save_strategy="epoch",
... load_best_model_at_end=True,
... push_to_hub=True,
... )
>>> trainer = Trainer(
... model=model,
... args=training_args,
... train_dataset=tokenized_wnut["train"],
... eval_dataset=tokenized_wnut["test"],
... tokenizer=tokenizer,
... data_collator=data_collator,
... compute_metrics=compute_metrics,
... )
>>> trainer.train()
```
ํ›ˆ๋ จ์ด ์™„๋ฃŒ๋˜๋ฉด, [`~transformers.Trainer.push_to_hub`] ๋ฉ”์†Œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ชจ๋ธ์„ ํ—ˆ๋ธŒ์— ๊ณต์œ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
```py
>>> trainer.push_to_hub()
```
</pt>
<tf>
<Tip>
Keras๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ชจ๋ธ์„ ํŒŒ์ธ ํŠœ๋‹ํ•˜๋Š” ๋ฐฉ๋ฒ•์— ์ต์ˆ™ํ•˜์ง€ ์•Š์€ ๊ฒฝ์šฐ, [์—ฌ๊ธฐ](../training#train-a-tensorflow-model-with-keras)์˜ ๊ธฐ๋ณธ ํŠœํ† ๋ฆฌ์–ผ์„ ํ™•์ธํ•˜์„ธ์š”!
</Tip>
TensorFlow์—์„œ ๋ชจ๋ธ์„ ํŒŒ์ธ ํŠœ๋‹ํ•˜๋ ค๋ฉด, ๋จผ์ € ์˜ตํ‹ฐ๋งˆ์ด์ € ํ•จ์ˆ˜์™€ ํ•™์Šต๋ฅ  ์Šค์ผ€์ฅด, ๊ทธ๋ฆฌ๊ณ  ์ผ๋ถ€ ํ›ˆ๋ จ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์„ค์ •ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:
```py
>>> from transformers import create_optimizer
>>> batch_size = 16
>>> num_train_epochs = 3
>>> num_train_steps = (len(tokenized_wnut["train"]) // batch_size) * num_train_epochs
>>> optimizer, lr_schedule = create_optimizer(
... init_lr=2e-5,
... num_train_steps=num_train_steps,
... weight_decay_rate=0.01,
... num_warmup_steps=0,
... )
```
๊ทธ๋Ÿฐ ๋‹ค์Œ [`TFAutoModelForSequenceClassification`]์„ ์‚ฌ์šฉํ•˜์—ฌ DistilBERT๋ฅผ ๊ฐ€์ ธ์˜ค๊ณ , ์˜ˆ์ƒ๋˜๋Š” ๋ ˆ์ด๋ธ” ์ˆ˜์™€ ๋ ˆ์ด๋ธ” ๋งคํ•‘์„ ์ง€์ •ํ•ฉ๋‹ˆ๋‹ค:
```py
>>> from transformers import TFAutoModelForTokenClassification
>>> model = TFAutoModelForTokenClassification.from_pretrained(
... "distilbert/distilbert-base-uncased", num_labels=13, id2label=id2label, label2id=label2id
... )
```
[`~transformers.TFPreTrainedModel.prepare_tf_dataset`]์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ `tf.data.Dataset` ํ˜•์‹์œผ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค:
```py
>>> tf_train_set = model.prepare_tf_dataset(
... tokenized_wnut["train"],
... shuffle=True,
... batch_size=16,
... collate_fn=data_collator,
... )
>>> tf_validation_set = model.prepare_tf_dataset(
... tokenized_wnut["validation"],
... shuffle=False,
... batch_size=16,
... collate_fn=data_collator,
... )
```
[`compile`](https://keras.io/api/models/model_training_apis/#compile-method)๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ›ˆ๋ จํ•  ๋ชจ๋ธ์„ ๊ตฌ์„ฑํ•ฉ๋‹ˆ๋‹ค:
```py
>>> import tensorflow as tf
>>> model.compile(optimizer=optimizer)
```
ํ›ˆ๋ จ์„ ์‹œ์ž‘ํ•˜๊ธฐ ์ „์— ์„ค์ •ํ•ด์•ผํ•  ๋งˆ์ง€๋ง‰ ๋‘ ๊ฐ€์ง€๋Š” ์˜ˆ์ธก์—์„œ seqeval ์ ์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ , ๋ชจ๋ธ์„ ํ—ˆ๋ธŒ์— ์—…๋กœ๋“œํ•  ๋ฐฉ๋ฒ•์„ ์ œ๊ณตํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋ชจ๋‘ [Keras callbacks](../main_classes/keras_callbacks)๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ˆ˜ํ–‰๋ฉ๋‹ˆ๋‹ค.
[`~transformers.KerasMetricCallback`]์— `compute_metrics` ํ•จ์ˆ˜๋ฅผ ์ „๋‹ฌํ•˜์„ธ์š”:
```py
>>> from transformers.keras_callbacks import KerasMetricCallback
>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)
```
[`~transformers.PushToHubCallback`]์—์„œ ๋ชจ๋ธ๊ณผ ํ† ํฌ๋‚˜์ด์ €๋ฅผ ์—…๋กœ๋“œํ•  ์œ„์น˜๋ฅผ ์ง€์ •ํ•ฉ๋‹ˆ๋‹ค:
```py
>>> from transformers.keras_callbacks import PushToHubCallback
>>> push_to_hub_callback = PushToHubCallback(
... output_dir="my_awesome_wnut_model",
... tokenizer=tokenizer,
... )
```
๊ทธ๋Ÿฐ ๋‹ค์Œ ์ฝœ๋ฐฑ์„ ํ•จ๊ป˜ ๋ฌถ์Šต๋‹ˆ๋‹ค:
```py
>>> callbacks = [metric_callback, push_to_hub_callback]
```
๋“œ๋””์–ด, ๋ชจ๋ธ ํ›ˆ๋ จ์„ ์‹œ์ž‘ํ•  ์ค€๋น„๊ฐ€ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค! [`fit`](https://keras.io/api/models/model_training_apis/#fit-method)์— ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ ์„ธํŠธ, ๊ฒ€์ฆ ๋ฐ์ดํ„ฐ ์„ธํŠธ, ์—ํญ์˜ ์ˆ˜ ๋ฐ ์ฝœ๋ฐฑ์„ ์ „๋‹ฌํ•˜์—ฌ ํŒŒ์ธ ํŠœ๋‹ํ•ฉ๋‹ˆ๋‹ค:
```py
>>> model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3, callbacks=callbacks)
```
ํ›ˆ๋ จ์ด ์™„๋ฃŒ๋˜๋ฉด, ๋ชจ๋ธ์ด ์ž๋™์œผ๋กœ ํ—ˆ๋ธŒ์— ์—…๋กœ๋“œ๋˜์–ด ๋ˆ„๊ตฌ๋‚˜ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค!
</tf>
</frameworkcontent>
<Tip>
ํ† ํฐ ๋ถ„๋ฅ˜๋ฅผ ์œ„ํ•œ ๋ชจ๋ธ์„ ํŒŒ์ธ ํŠœ๋‹ํ•˜๋Š” ์ž์„ธํ•œ ์˜ˆ์ œ๋Š” ๋‹ค์Œ
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb)
๋˜๋Š” [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification-tf.ipynb)๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”.
</Tip>
## ์ถ”๋ก [[inference]]
์ข‹์•„์š”, ์ด์ œ ๋ชจ๋ธ์„ ํŒŒ์ธ ํŠœ๋‹ํ–ˆ์œผ๋‹ˆ ์ถ”๋ก ์— ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค!
์ถ”๋ก ์„ ์ˆ˜ํ–‰ํ•˜๊ณ ์ž ํ•˜๋Š” ํ…์ŠคํŠธ๋ฅผ ๊ฐ€์ ธ์™€๋ด…์‹œ๋‹ค:
```py
>>> text = "The Golden State Warriors are an American professional basketball team based in San Francisco."
```
ํŒŒ์ธ ํŠœ๋‹๋œ ๋ชจ๋ธ๋กœ ์ถ”๋ก ์„ ์‹œ๋„ํ•˜๋Š” ๊ฐ€์žฅ ๊ฐ„๋‹จํ•œ ๋ฐฉ๋ฒ•์€ [`pipeline`]๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋ชจ๋ธ๋กœ NER์˜ `pipeline`์„ ์ธ์Šคํ„ด์Šคํ™”ํ•˜๊ณ , ํ…์ŠคํŠธ๋ฅผ ์ „๋‹ฌํ•ด๋ณด์„ธ์š”:
```py
>>> from transformers import pipeline
>>> classifier = pipeline("ner", model="stevhliu/my_awesome_wnut_model")
>>> classifier(text)
[{'entity': 'B-location',
'score': 0.42658573,
'index': 2,
'word': 'golden',
'start': 4,
'end': 10},
{'entity': 'I-location',
'score': 0.35856336,
'index': 3,
'word': 'state',
'start': 11,
'end': 16},
{'entity': 'B-group',
'score': 0.3064001,
'index': 4,
'word': 'warriors',
'start': 17,
'end': 25},
{'entity': 'B-location',
'score': 0.65523505,
'index': 13,
'word': 'san',
'start': 80,
'end': 83},
{'entity': 'B-location',
'score': 0.4668663,
'index': 14,
'word': 'francisco',
'start': 84,
'end': 93}]
```
์›ํ•œ๋‹ค๋ฉด, `pipeline`์˜ ๊ฒฐ๊ณผ๋ฅผ ์ˆ˜๋™์œผ๋กœ ๋ณต์ œํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค:
<frameworkcontent>
<pt>
ํ…์ŠคํŠธ๋ฅผ ํ† ํฐํ™”ํ•˜๊ณ  PyTorch ํ…์„œ๋ฅผ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค:
```py
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_wnut_model")
>>> inputs = tokenizer(text, return_tensors="pt")
```
์ž…๋ ฅ์„ ๋ชจ๋ธ์— ์ „๋‹ฌํ•˜๊ณ  `logits`์„ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค:
```py
>>> from transformers import AutoModelForTokenClassification
>>> model = AutoModelForTokenClassification.from_pretrained("stevhliu/my_awesome_wnut_model")
>>> with torch.no_grad():
... logits = model(**inputs).logits
```
๊ฐ€์žฅ ๋†’์€ ํ™•๋ฅ ์„ ๊ฐ€์ง„ ํด๋ž˜์Šค๋ฅผ ๋ชจ๋ธ์˜ `id2label` ๋งคํ•‘์„ ์‚ฌ์šฉํ•˜์—ฌ ํ…์ŠคํŠธ ๋ ˆ์ด๋ธ”๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค:
```py
>>> predictions = torch.argmax(logits, dim=2)
>>> predicted_token_class = [model.config.id2label[t.item()] for t in predictions[0]]
>>> predicted_token_class
['O',
'O',
'B-location',
'I-location',
'B-group',
'O',
'O',
'O',
'O',
'O',
'O',
'O',
'O',
'B-location',
'B-location',
'O',
'O']
```
</pt>
<tf>
ํ…์ŠคํŠธ๋ฅผ ํ† ํฐํ™”ํ•˜๊ณ  TensorFlow ํ…์„œ๋ฅผ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค:
```py
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_wnut_model")
>>> inputs = tokenizer(text, return_tensors="tf")
```
์ž…๋ ฅ๊ฐ’์„ ๋ชจ๋ธ์— ์ „๋‹ฌํ•˜๊ณ  `logits`์„ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค:
```py
>>> from transformers import TFAutoModelForTokenClassification
>>> model = TFAutoModelForTokenClassification.from_pretrained("stevhliu/my_awesome_wnut_model")
>>> logits = model(**inputs).logits
```
๊ฐ€์žฅ ๋†’์€ ํ™•๋ฅ ์„ ๊ฐ€์ง„ ํด๋ž˜์Šค๋ฅผ ๋ชจ๋ธ์˜ `id2label` ๋งคํ•‘์„ ์‚ฌ์šฉํ•˜์—ฌ ํ…์ŠคํŠธ ๋ ˆ์ด๋ธ”๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค:
```py
>>> predicted_token_class_ids = tf.math.argmax(logits, axis=-1)
>>> predicted_token_class = [model.config.id2label[t] for t in predicted_token_class_ids[0].numpy().tolist()]
>>> predicted_token_class
['O',
'O',
'B-location',
'I-location',
'B-group',
'O',
'O',
'O',
'O',
'O',
'O',
'O',
'O',
'B-location',
'B-location',
'O',
'O']
```
</tf>
</frameworkcontent>