<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
# Processors[[processors]]

Processors can mean two different things in the Transformers library:

- objects that preprocess inputs for multi-modal models such as [Wav2Vec2](../model_doc/wav2vec2) (speech and text) or [CLIP](../model_doc/clip) (text and vision)
- deprecated objects that were used in older versions of the library to preprocess data for GLUE or SQUAD
## Multi-modal processors[[transformers.ProcessorMixin]]

Any multi-modal model requires an object to encode or decode data that groups several modalities (among text, vision, and audio). This is handled by objects called processors, which group together two or more processing objects such as tokenizers (for the text modality), image processors (for vision), and feature extractors (for audio).

These processors inherit from the following base class, which implements the saving and loading functionality:

[[autodoc]] ProcessorMixin
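
The grouping idea described above can be illustrated with a minimal, library-free sketch. Everything in it (`ToyProcessor`, `toy_tokenizer`, `toy_feature_extractor`) is a hypothetical stand-in written for illustration only; it is not the `ProcessorMixin` API and it omits saving and loading entirely.

```python
from typing import Callable, Dict, List, Optional


class ToyProcessor:
    """Hypothetical stand-in for a multi-modal processor (not the library class)."""

    def __init__(
        self,
        tokenizer: Callable[[str], Dict[str, list]],
        feature_extractor: Callable[[List[float]], Dict[str, list]],
    ):
        # One processing object per modality, bundled behind a single entry point.
        self.tokenizer = tokenizer
        self.feature_extractor = feature_extractor

    def __call__(
        self, text: Optional[str] = None, audio: Optional[List[float]] = None
    ) -> Dict[str, list]:
        if text is None and audio is None:
            raise ValueError("Provide at least one of `text` or `audio`.")
        encoded: Dict[str, list] = {}
        if text is not None:
            encoded.update(self.tokenizer(text))
        if audio is not None:
            encoded.update(self.feature_extractor(audio))
        return encoded


def toy_tokenizer(text: str) -> Dict[str, list]:
    # Toy whitespace tokenizer: each "id" is simply the token length.
    return {"input_ids": [len(token) for token in text.split()]}


def toy_feature_extractor(samples: List[float]) -> Dict[str, list]:
    # Toy audio features: scale samples by the peak absolute value.
    peak = max((abs(s) for s in samples), default=0.0) or 1.0
    return {"input_values": [s / peak for s in samples]}


processor = ToyProcessor(toy_tokenizer, toy_feature_extractor)
batch = processor(text="hello world", audio=[0.5, -2.0, 1.0])
print(batch)
```

The point of the design is that callers hand every modality to one object and receive a single dictionary of model inputs, rather than invoking each preprocessing object separately.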
## Deprecated processors[[transformers.DataProcessor]]

All processors follow the same architecture, which is that of [`~data.processors.utils.DataProcessor`]. A processor returns a list of [`~data.processors.utils.InputExample`]. These [`~data.processors.utils.InputExample`] can be converted to [`~data.processors.utils.InputFeatures`] in order to be fed to the model.

[[autodoc]] data.processors.utils.DataProcessor

[[autodoc]] data.processors.utils.InputExample

[[autodoc]] data.processors.utils.InputFeatures
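
As a rough illustration of this architecture, the sketch below reads tab-separated rows and returns a list of example objects. `ToyInputExample` and `ToyPairProcessor` are hypothetical names that only mirror the general shape of the pattern (a `guid`, one or two text fields, and a label); they are not the library classes, and the file layout is an assumption.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyInputExample:
    """Hypothetical example record: an id, one or two texts, and a label."""

    guid: str
    text_a: str
    text_b: Optional[str] = None
    label: Optional[str] = None


class ToyPairProcessor:
    """Hypothetical processor for a tab-separated sentence-pair task."""

    def get_labels(self) -> List[str]:
        return ["0", "1"]

    def get_examples(self, tsv_text: str, set_type: str) -> List[ToyInputExample]:
        reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
        next(reader)  # skip the header row
        return [
            ToyInputExample(
                guid=f"{set_type}-{i}", text_a=row[0], text_b=row[1], label=row[2]
            )
            for i, row in enumerate(reader)
        ]


# Assumed layout: header row, then sentence1 <TAB> sentence2 <TAB> label.
tsv = (
    "sentence1\tsentence2\tlabel\n"
    "A cat sat.\tA cat was sitting.\t1\n"
    "It rains.\tThe sun shines.\t0\n"
)
examples = ToyPairProcessor().get_examples(tsv, set_type="dev")
print(examples[0])
```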
## GLUE[[transformers.glue_convert_examples_to_features]]

[General Language Understanding Evaluation (GLUE)](https://gluebenchmark.com/) is a benchmark that evaluates the performance of models across a diverse set of existing NLU tasks. It was released together with the paper [GLUE: A multi-task benchmark and analysis platform for natural language understanding](https://openreview.net/pdf?id=rJ4km2R5t7).

This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched), CoLA, SST2, STSB, QQP, QNLI, RTE, and WNLI.

Those processors are:

- [`~data.processors.utils.MrpcProcessor`]
- [`~data.processors.utils.MnliProcessor`]
- [`~data.processors.utils.MnliMismatchedProcessor`]
- [`~data.processors.utils.Sst2Processor`]
- [`~data.processors.utils.StsbProcessor`]
- [`~data.processors.utils.QqpProcessor`]
- [`~data.processors.utils.QnliProcessor`]
- [`~data.processors.utils.RteProcessor`]
- [`~data.processors.utils.WnliProcessor`]

Additionally, the following method can be used to convert examples loaded from a data file into a list of [`~data.processors.utils.InputFeatures`].

[[autodoc]] data.processors.glue.glue_convert_examples_to_features
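
To make the conversion step concrete, here is a minimal sketch of turning labeled text into fixed-length features. It assumes a toy whitespace tokenizer and a tiny hand-written vocabulary; `ToyFeatures` and `toy_convert_examples_to_features` are hypothetical names and do not reproduce the behavior of `glue_convert_examples_to_features`.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ToyFeatures:
    """Hypothetical feature record: padded ids, attention mask, integer label."""

    input_ids: List[int]
    attention_mask: List[int]
    label: int


def toy_convert_examples_to_features(
    examples: List[Tuple[str, str]],
    label_list: List[str],
    vocab: Dict[str, int],
    max_length: int,
    pad_id: int = 0,
    unk_id: int = 1,
) -> List[ToyFeatures]:
    # Map string labels to integer class indices.
    label_map = {label: i for i, label in enumerate(label_list)}
    features = []
    for text, label in examples:
        # Toy tokenization: lowercase, split on whitespace, truncate.
        ids = [vocab.get(tok, unk_id) for tok in text.lower().split()][:max_length]
        mask = [1] * len(ids)
        padding = max_length - len(ids)
        features.append(
            ToyFeatures(
                input_ids=ids + [pad_id] * padding,
                attention_mask=mask + [0] * padding,
                label=label_map[label],
            )
        )
    return features


vocab = {"good": 2, "movie": 3, "bad": 4}
toy_examples = [
    ("Good movie", "positive"),
    ("bad boring long movie indeed", "negative"),
]
toy_features = toy_convert_examples_to_features(
    toy_examples, label_list=["negative", "positive"], vocab=vocab, max_length=4
)
print(toy_features[0])
```

The attention mask marks real tokens with 1 and padding with 0, so every feature has the same length regardless of the original sentence length.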
## XNLI[[xnli]]

[The Cross-Lingual NLI Corpus (XNLI)](https://www.nyu.edu/projects/bowman/xnli/) is a benchmark that evaluates the quality of cross-lingual text representations. XNLI is a crowd-sourced dataset based on [*MultiNLI*](http://www.nyu.edu/projects/bowman/multinli/): pairs of text are labeled with textual entailment annotations for 15 different languages (ranging from high-resource languages such as English to low-resource languages such as Swahili).

It was released together with the paper [XNLI: Evaluating Cross-lingual Sentence Representations](https://huggingface.co/papers/1809.05053).

This library hosts the processor to load the XNLI data:

- [`~data.processors.utils.XnliProcessor`]

Note that because the gold labels are available on the test set, evaluation is performed on the test set.

An example using these processors is given in the [run_xnli.py](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification/run_xnli.py) script.
## SQuAD[[squad]]

[The Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer/) is a benchmark that evaluates the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version (v1.1) was released together with the paper [SQuAD: 100,000+ Questions for Machine Comprehension of Text](https://huggingface.co/papers/1606.05250). The second version (v2.0) was released alongside the paper [Know What You Don't Know: Unanswerable Questions for SQuAD](https://huggingface.co/papers/1806.03822).

This library hosts a processor for each of the two versions:

### Processors[[transformers.data.processors.squad.SquadProcessor]]

Those processors are:

- [`~data.processors.utils.SquadV1Processor`]
- [`~data.processors.utils.SquadV2Processor`]

They both inherit from the abstract class [`~data.processors.utils.SquadProcessor`].

[[autodoc]] data.processors.squad.SquadProcessor
    - all

Additionally, the following method can be used to convert SQuAD examples into [`~data.processors.utils.SquadFeatures`] that can be used as model inputs.

[[autodoc]] data.processors.squad.squad_convert_examples_to_features

These processors, as well as the aforementioned method, can be used with files containing the data as well as with the *tensorflow_datasets* package. Examples are given below.
### Example usage[[example-usage]]

Here is an example that uses the processors and the conversion method with data files:
```python
from transformers.data.processors.squad import (
    SquadV1Processor,
    SquadV2Processor,
    squad_convert_examples_to_features,
)

# `squad_v1_data_dir`, `squad_v2_data_dir`, `tokenizer`, `max_seq_length`, `doc_stride`,
# `max_query_length` and `evaluate` are assumed to be defined by the caller.

# Loading a V2 processor
processor = SquadV2Processor()
examples = processor.get_dev_examples(squad_v2_data_dir)

# Loading a V1 processor
processor = SquadV1Processor()
examples = processor.get_dev_examples(squad_v1_data_dir)

features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    doc_stride=doc_stride,
    max_query_length=max_query_length,
    is_training=not evaluate,
)
```
Using *tensorflow_datasets* is as easy as using a data file:

```python
import tensorflow_datasets as tfds
from transformers.data.processors.squad import SquadV1Processor, squad_convert_examples_to_features

# `tokenizer`, `max_seq_length`, `doc_stride`, `max_query_length` and `evaluate`
# are assumed to be defined by the caller.

# tensorflow_datasets only handles SQuAD V1.
tfds_examples = tfds.load("squad")
examples = SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=evaluate)

features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    doc_stride=doc_stride,
    max_query_length=max_query_length,
    is_training=not evaluate,
)
```
Another example using these processors is given in the [run_squad.py](https://github.com/huggingface/transformers/tree/main/examples/legacy/question-answering/run_squad.py) script.
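
The `max_seq_length` and `doc_stride` arguments in the calls above control how a long context is split into several features. The sketch below shows one common sliding-window scheme, assuming that `doc_stride` is the step between the starts of consecutive windows, as in the original BERT SQuAD script. `sliding_windows` is a hypothetical helper, and the library's own conversion code may differ in details such as how the query length and special tokens reduce the available window size.

```python
from typing import List, Tuple


def sliding_windows(num_tokens: int, max_len: int, doc_stride: int) -> List[Tuple[int, int]]:
    """Return (start, length) spans covering `num_tokens` context tokens.

    Hypothetical helper: each window holds at most `max_len` tokens and the
    next window starts `doc_stride` tokens after the previous one.
    """
    if max_len <= 0 or doc_stride <= 0:
        raise ValueError("`max_len` and `doc_stride` must be positive.")
    spans = []
    start = 0
    while start < num_tokens:
        length = min(num_tokens - start, max_len)
        spans.append((start, length))
        if start + length == num_tokens:
            break  # the last window reached the end of the context
        start += min(length, doc_stride)
    return spans


spans = sliding_windows(num_tokens=10, max_len=4, doc_stride=2)
print(spans)
```

With these toy numbers, consecutive windows overlap by `max_len - doc_stride` tokens, so an answer that straddles a window boundary still appears whole in at least one window as long as it is no longer than the overlap.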