<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Processors[[processors]]
In the Transformers library, "processor" can mean two different things:
- objects that preprocess the inputs of multimodal models such as [Wav2Vec2](../model_doc/wav2vec2) (speech and text) or [CLIP](../model_doc/clip) (text and vision)
- deprecated objects that were used in older versions of the library to preprocess GLUE or SQuAD data
## Multimodal processors[[transformers.ProcessorMixin]]
Every multimodal model needs an object that encodes or decodes data combining several modalities (text, vision, audio). This is handled by objects called processors. A processor bundles two or more processing objects into one, such as a tokenizer (for the text modality), an image processor (for vision), and a feature extractor (for audio).
These processors inherit from the following base class, which implements the saving and loading functionality:
[[autodoc]] ProcessorMixin
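The bundling idea can be illustrated with a small standard-library sketch. Everything here (`ToyTokenizer`, `ToyFeatureExtractor`, `ToyProcessor`, the `toy_processor.json` file name) is hypothetical and is not how `ProcessorMixin` is implemented; in the library, saving and loading go through `save_pretrained` and `from_pretrained`.

```python
import json
import tempfile
from pathlib import Path


class ToyTokenizer:
    """Hypothetical text component: maps whitespace-separated words to ids."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # Unknown words map to id 0.
        return [self.vocab.get(word, 0) for word in text.lower().split()]


class ToyFeatureExtractor:
    """Hypothetical audio component: rescales raw samples by a fixed factor."""

    def __init__(self, scale):
        self.scale = scale

    def __call__(self, samples):
        return [sample * self.scale for sample in samples]


class ToyProcessor:
    """Bundles both components and saves/loads them as one unit."""

    CONFIG_NAME = "toy_processor.json"  # hypothetical file name

    def __init__(self, tokenizer, feature_extractor):
        self.tokenizer = tokenizer
        self.feature_extractor = feature_extractor

    def __call__(self, text, audio):
        # Each modality is routed to the component that understands it.
        return {
            "input_ids": self.tokenizer(text),
            "input_values": self.feature_extractor(audio),
        }

    def save(self, directory):
        config = {"vocab": self.tokenizer.vocab, "scale": self.feature_extractor.scale}
        Path(directory, self.CONFIG_NAME).write_text(json.dumps(config))

    @classmethod
    def load(cls, directory):
        config = json.loads(Path(directory, cls.CONFIG_NAME).read_text())
        return cls(ToyTokenizer(config["vocab"]), ToyFeatureExtractor(config["scale"]))


processor = ToyProcessor(ToyTokenizer({"hello": 1, "world": 2}), ToyFeatureExtractor(0.5))
with tempfile.TemporaryDirectory() as tmp_dir:
    processor.save(tmp_dir)
    reloaded = ToyProcessor.load(tmp_dir)

batch = reloaded(text="Hello world again", audio=[2.0, 4.0])
print(batch)
```

The caller only ever handles one object, and a single save or load call round-trips the state of every component.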
## Deprecated processors[[transformers.DataProcessor]]
All processors follow the same architecture, that of [`~data.processors.utils.DataProcessor`]. A processor returns a list of [`~data.processors.utils.InputExample`]. These [`~data.processors.utils.InputExample`] can then be converted to [`~data.processors.utils.InputFeatures`] so they can be fed to the model.
[[autodoc]] data.processors.utils.DataProcessor
[[autodoc]] data.processors.utils.InputExample
[[autodoc]] data.processors.utils.InputFeatures
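The example-to-features step can be sketched with plain dataclasses. `ToyInputExample` and `ToyInputFeatures` are hypothetical stand-ins that reuse the field names `guid`, `text_a`, `text_b`, `label`, `input_ids`, and `attention_mask`; the whitespace tokenizer, the vocabulary, and `convert_example` are assumptions for illustration, and `token_type_ids` is left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyInputExample:
    """Stand-in for InputExample: raw text plus an optional label."""

    guid: str
    text_a: str
    text_b: Optional[str] = None
    label: Optional[str] = None


@dataclass
class ToyInputFeatures:
    """Stand-in for InputFeatures: numeric inputs a model can consume."""

    input_ids: List[int]
    attention_mask: List[int]
    label: Optional[int] = None


def convert_example(example, vocab, label_list, max_length):
    # Toy whitespace "tokenizer": unknown words map to id 1, padding uses id 0.
    words = example.text_a.lower().split()
    if example.text_b is not None:
        words += example.text_b.lower().split()
    ids = [vocab.get(word, 1) for word in words][:max_length]
    mask = [1] * len(ids)
    padding = max_length - len(ids)
    # The string label becomes its index in `label_list`.
    label = label_list.index(example.label) if example.label is not None else None
    return ToyInputFeatures(ids + [0] * padding, mask + [0] * padding, label)


vocab = {"the": 2, "cat": 3, "sat": 4, "a": 5, "slept": 6}
example = ToyInputExample(guid="dev-0", text_a="The cat sat", text_b="A cat slept", label="1")
features = convert_example(example, vocab, label_list=["0", "1"], max_length=8)
print(features)
```

The example keeps human-readable text and labels, while the features hold fixed-length numeric arrays.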
## GLUE[[transformers.glue_convert_examples_to_features]]
[General Language Understanding Evaluation (GLUE)](https://gluebenchmark.com/) is a benchmark that evaluates model performance across a diverse set of existing NLU tasks. It was released together with the paper [GLUE: A multi-task benchmark and analysis platform for natural language understanding](https://openreview.net/pdf?id=rJ4km2R5t7).
This library provides processors for a total of 10 tasks: MRPC, MNLI, MNLI (mismatched), CoLA, SST2, STSB, QQP, QNLI, RTE, and WNLI.
Those processors are:
- [`~data.processors.utils.MrpcProcessor`]
- [`~data.processors.utils.MnliProcessor`]
- [`~data.processors.utils.MnliMismatchedProcessor`]
- [`~data.processors.utils.ColaProcessor`]
- [`~data.processors.utils.Sst2Processor`]
- [`~data.processors.utils.StsbProcessor`]
- [`~data.processors.utils.QqpProcessor`]
- [`~data.processors.utils.QnliProcessor`]
- [`~data.processors.utils.RteProcessor`]
- [`~data.processors.utils.WnliProcessor`]
λ˜ν•œ, μ•„λž˜μ˜ λ©”μ†Œλ“œλ“€μ„ μ‚¬μš©ν•˜μ—¬ 데이터 νŒŒμΌλ‘œλΆ€ν„° 값을 가져와 [`~data.processors.utils.InputExample`] λͺ©λ‘μœΌλ‘œ λ³€ν™˜ν•  수 μžˆμŠ΅λ‹ˆλ‹€.
[[autodoc]] data.processors.glue.glue_convert_examples_to_features
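As a rough illustration of going from a data file to a list of examples, the sketch below parses a tab-separated file with the standard library. The column layout (`label`, `sentence1`, `sentence2`), the sample rows, and the `read_examples` helper are all hypothetical; real GLUE files use task-specific columns, and plain dicts stand in for `InputExample` here.

```python
import csv
import io

# Hypothetical TSV content; real GLUE files have task-specific columns.
TSV_DATA = (
    "label\tsentence1\tsentence2\n"
    "1\tHe went home.\tHe returned home.\n"
    "0\tIt rained.\tThe sun was out.\n"
)


def read_examples(tsv_text, set_type):
    """Parse TSV rows into dicts shaped like InputExample (guid, text_a, text_b, label)."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    examples = []
    for index, row in enumerate(reader):
        if index == 0:
            continue  # skip the header row
        label, sentence1, sentence2 = row
        examples.append(
            {
                "guid": f"{set_type}-{index}",
                "text_a": sentence1,
                "text_b": sentence2,
                "label": label,
            }
        )
    return examples


examples = read_examples(TSV_DATA, set_type="dev")
for example in examples:
    print(example)
```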
## XNLI[[xnli]]
[The Cross-Lingual NLI Corpus (XNLI)](https://www.nyu.edu/projects/bowman/xnli/) is a benchmark that evaluates the quality of cross-lingual text representations. XNLI is a crowd-sourced dataset based on [*MultiNLI*](http://www.nyu.edu/projects/bowman/multinli/): pairs of text are labeled with textual entailment annotations for 15 languages, ranging from high-resource languages such as English to low-resource languages such as Swahili.
It was released together with the paper [XNLI: Evaluating Cross-lingual Sentence Representations](https://huggingface.co/papers/1809.05053).
This library provides a processor to load the XNLI data:
- [`~data.processors.utils.XnliProcessor`]
Because gold labels are available for the test set, evaluation is performed on the test set.
An example using this processor is given in the [run_xnli.py](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification/run_xnli.py) script.
## SQuAD[[squad]]
[The Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer/) is a benchmark that evaluates model performance on question answering. Two versions are available, v1.1 and v2.0. The first version (v1.1) was released together with the paper [SQuAD: 100,000+ Questions for Machine Comprehension of Text](https://huggingface.co/papers/1606.05250). The second version (v2.0) was released together with the paper [Know What You Don't Know: Unanswerable Questions for SQuAD](https://huggingface.co/papers/1806.03822).
This library hosts a processor for each of the two versions:
### Processors[[transformers.data.processors.squad.SquadProcessor]]
Those processors are:
- [`~data.processors.utils.SquadV1Processor`]
- [`~data.processors.utils.SquadV2Processor`]
Both inherit from the abstract class [`~data.processors.utils.SquadProcessor`].
[[autodoc]] data.processors.squad.SquadProcessor
- all
λ˜ν•œ, λ‹€μŒ λ©”μ†Œλ“œλ₯Ό μ‚¬μš©ν•˜μ—¬ SQuAD μ˜ˆμ‹œλ₯Ό λͺ¨λΈ μž…λ ₯으둜 μ‚¬μš©ν•  수 μžˆλŠ” [`~data.processors.utils.SquadFeatures`]둜 λ³€ν™˜ν•  수 μžˆμŠ΅λ‹ˆλ‹€.
[[autodoc]] data.processors.squad.squad_convert_examples_to_features
μ΄λŸ¬ν•œ ν”„λ‘œμ„Έμ„œλ“€κ³Ό μ•žμ„œ μ–ΈκΈ‰ν•œ λ©”μ†Œλ“œλŠ” 데이터가 ν¬ν•¨λœ 파일뿐만 μ•„λ‹ˆλΌ *tensorflow_datasets* νŒ¨ν‚€μ§€μ™€λ„ ν•¨κ»˜ μ‚¬μš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€. μ˜ˆμ‹œλŠ” μ•„λž˜μ— μ œκ³΅λ©λ‹ˆλ‹€.
### μ‚¬μš© μ˜ˆμ‹œ[[example-usage]]
λ‹€μŒμ€ 데이터 νŒŒμΌμ„ μ‚¬μš©ν•˜μ—¬ ν”„λ‘œμ„Έμ„œμ™€ λ³€ν™˜ λ©”μ†Œλ“œλ₯Ό μ‚¬μš©ν•˜λŠ” μ˜ˆμ‹œμž…λ‹ˆλ‹€:
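```python
from transformers.data.processors.squad import (
    SquadV1Processor,
    SquadV2Processor,
    squad_convert_examples_to_features,
)

# Load the V2 processor and read the dev examples from `squad_v2_data_dir`
processor = SquadV2Processor()
examples = processor.get_dev_examples(squad_v2_data_dir)

# Load the V1 processor (this overwrites `examples` from the V2 call above)
processor = SquadV1Processor()
examples = processor.get_dev_examples(squad_v1_data_dir)

# `tokenizer`, `max_seq_length`, `doc_stride`, `max_query_length`, and
# `evaluate` are assumed to be defined by the calling script.
features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    doc_stride=doc_stride,
    max_query_length=max_query_length,
    is_training=not evaluate,
)
```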
Using *tensorflow_datasets* is as easy as using a data file:
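```python
import tensorflow_datasets as tfds
from transformers.data.processors.squad import (
    SquadV1Processor,
    squad_convert_examples_to_features,
)

# tensorflow_datasets only handles SQuAD V1.
tfds_examples = tfds.load("squad")
examples = SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=evaluate)

# `tokenizer`, `max_seq_length`, `doc_stride`, `max_query_length`, and
# `evaluate` are assumed to be defined by the calling script.
features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    doc_stride=doc_stride,
    max_query_length=max_query_length,
    is_training=not evaluate,
)
```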
μ΄λŸ¬ν•œ ν”„λ‘œμ„Έμ„œλ₯Ό μ‚¬μš©ν•˜λŠ” 또 λ‹€λ₯Έ μ˜ˆμ‹œλŠ” [run_squad.py](https://github.com/huggingface/transformers/tree/main/examples/legacy/question-answering/run_squad.py) μŠ€ν¬λ¦½νŠΈμ— μ œκ³΅λ˜μ–΄ μžˆμŠ΅λ‹ˆλ‹€.