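| # Processors[[processors]] | |
| In the Transformers library, "processor" can mean two different things: | |
| - Objects that preprocess inputs for multimodal models such as [Wav2Vec2](../model_doc/wav2vec2) (speech and text) or [CLIP](../model_doc/clip) (text and vision) | |
| - Deprecated objects that were used in older versions of the library to preprocess data for GLUE or SQuAD | |
| ## Multimodal processors[[transformers.ProcessorMixin]] | |
| Every multimodal model needs an object that encodes or decodes data combining several modalities (text, vision, audio). This is handled by objects called processors, which bundle two or more processing objects such as tokenizers (for the text modality), image processors (for vision) and feature extractors (for audio). | |
| These processors inherit from the following base class, which implements the saving and loading functionality: | |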
| #### transformers.ProcessorMixin[[transformers.ProcessorMixin]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L586) | |
| This is a mixin used to provide saving/loading functionality for all processor classes. | |
| #### apply_chat_template[[transformers.ProcessorMixin.apply_chat_template]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1679) | |
| `apply_chat_template(conversation, chat_template=None, tools=None, documents=None, add_generation_prompt=False, continue_final_message=False, return_assistant_tokens_mask=False, tokenize=False, return_tensors=None, return_dict=False, load_audio_from_video=False, processor_kwargs=None, **kwargs)` | |
| Similar to the `apply_chat_template` method on tokenizers, this method applies a Jinja template to input | |
| conversations to turn them into a single tokenizable string. | |
| The input is expected to be in the following format, where each message's content is a list of text items and, | |
| optionally, image or video items. Images and videos can be supplied directly or as a URL or local path, and are used to form | |
| `pixel_values` when `return_dict=True`. If none are provided, only the formatted text is returned, tokenized if requested. | |
| conversation = [ | |
| { | |
| "role": "user", | |
| "content": [ | |
| {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"}, | |
| {"type": "text", "text": "Please describe this image in detail."}, | |
| ], | |
| }, | |
| ] | |
| **Parameters:** | |
| conversation (`Union[list[dict[str, str]], list[list[dict[str, str]]]]`) : The conversation to format. | |
| chat_template (`Optional[str]`, *optional*) : The Jinja template to use for formatting the conversation. If not provided, the tokenizer's chat template is used. | |
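To make the conversation format concrete, here is a minimal, library-free sketch of what applying a chat template amounts to: flattening structured messages into one string. The `render_conversation` helper, the `<|role|>` markers and the `<image>` placeholder are hypothetical stand-ins chosen for illustration. Real processors use the Jinja template shipped with each checkpoint, so their output will differ.

```python
# Toy stand-in for a chat template. This is NOT the Transformers implementation.
def render_conversation(conversation, add_generation_prompt=False):
    """Flatten a list of chat messages into a single prompt string."""
    parts = []
    for message in conversation:
        pieces = []
        for item in message["content"]:
            if item["type"] == "text":
                pieces.append(item["text"])
            else:
                # Images and videos become a placeholder token in the text;
                # the pixel data itself would be handled separately.
                pieces.append(f"<{item['type']}>")
        parts.append(f"<|{message['role']}|>" + "".join(pieces))
    if add_generation_prompt:
        # Cue the model to start the assistant's reply.
        parts.append("<|assistant|>")
    return "\n".join(parts)


conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
            {"type": "text", "text": "Please describe this image in detail."},
        ],
    },
]
prompt = render_conversation(conversation, add_generation_prompt=True)
print(prompt)
```

The point of the sketch is only the shape of the data: each message has a `role` and a list of typed content items, and non-text items are represented in the string by a marker rather than by their raw data.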
| #### batch_decode[[transformers.ProcessorMixin.batch_decode]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1625) | |
| This method forwards all its arguments to PreTrainedTokenizer's [batch_decode()](/docs/transformers/main/ko/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.batch_decode). Please | |
| refer to the docstring of this method for more information. | |
| #### check_argument_for_proper_class[[transformers.ProcessorMixin.check_argument_for_proper_class]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L706) | |
| Checks the passed argument's class against the expected transformers class. In case of an unexpected | |
| mismatch between expected and actual class, an error is raised. Otherwise, the proper retrieved class | |
| is returned. | |
| #### decode[[transformers.ProcessorMixin.decode]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1634) | |
| This method forwards all its arguments to PreTrainedTokenizer's [decode()](/docs/transformers/main/ko/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.decode). Please refer to | |
| the docstring of this method for more information. | |
| #### from_args_and_dict[[transformers.ProcessorMixin.from_args_and_dict]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1164) | |
| Instantiates a type of `~processing_utils.ProcessorMixin` from a Python dictionary of parameters. | |
| **Parameters:** | |
| processor_dict (`dict[str, Any]`) : Dictionary that will be used to instantiate the processor object. Such a dictionary can be retrieved from a pretrained checkpoint by leveraging the `~processing_utils.ProcessorMixin.to_dict` method. | |
| kwargs (`dict[str, Any]`) : Additional parameters from which to initialize the processor object. | |
| **Returns:** | |
| ``~processing_utils.ProcessorMixin`` | |
| The processor object instantiated from those parameters. | |
| #### from_pretrained[[transformers.ProcessorMixin.from_pretrained]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1388) | |
| Instantiate a processor associated with a pretrained model. | |
| This class method simply calls the loading methods of its components: the feature extractor's | |
| [from_pretrained()](/docs/transformers/main/ko/main_classes/feature_extractor#transformers.FeatureExtractionMixin.from_pretrained), the image processor's | |
| ([ImageProcessingMixin](/docs/transformers/main/ko/internal/image_processing_utils#transformers.ImageProcessingMixin)) and the tokenizer's | |
| `~tokenization_utils_base.PreTrainedTokenizer.from_pretrained`. Please refer to the docstrings of the | |
| methods above for more information. | |
| **Parameters:** | |
| pretrained_model_name_or_path (`str` or `os.PathLike`) : This can be either: | |
| - a string, the *model id* of a pretrained feature_extractor hosted inside a model repo on huggingface.co. | |
| - a path to a *directory* containing a feature extractor file saved using the [save_pretrained()](/docs/transformers/main/ko/main_classes/feature_extractor#transformers.FeatureExtractionMixin.save_pretrained) method, e.g., `./my_model_directory/`. | |
| - a path to a saved feature extractor JSON *file*, e.g., `./my_model_directory/preprocessor_config.json`. | |
| `**kwargs` : Additional keyword arguments passed along to both [from_pretrained()](/docs/transformers/main/ko/main_classes/feature_extractor#transformers.FeatureExtractionMixin.from_pretrained) and `~tokenization_utils_base.PreTrainedTokenizer.from_pretrained`. | |
| #### get_processor_dict[[transformers.ProcessorMixin.get_processor_dict]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L925) | |
| From a `pretrained_model_name_or_path`, resolve to a dictionary of parameters, to be used for instantiating a | |
| processor of type `~processing_utils.ProcessorMixin` using `from_args_and_dict`. | |
| **Parameters:** | |
| pretrained_model_name_or_path (`str` or `os.PathLike`) : The identifier of the pre-trained checkpoint from which we want the dictionary of parameters. | |
| subfolder (`str`, *optional*, defaults to `""`) : In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can specify the folder name here. | |
| **Returns:** | |
| ``tuple[Dict, Dict]`` | |
| The dictionary(ies) that will be used to instantiate the processor object. | |
| #### parse_response[[transformers.ProcessorMixin.parse_response]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1947) | |
| Converts an output string created by generating text from a model into a parsed message dictionary. | |
| This method is intended for use with chat models, and will read the tokenizer's `response_schema` attribute to | |
| control parsing, although this can be overridden by passing a `schema` argument directly. | |
| **Parameters:** | |
| response (`str`) : The output string generated by the model. This can be either a decoded string or list of strings, or token IDs as a list/array. | |
| schema (`Union[list, dict]`, *optional*) : A response schema that indicates the expected output format and how parsing should be performed. If not provided, the tokenizer's `response_schema` attribute will be used. | |
| #### post_process_image_text_to_text[[transformers.ProcessorMixin.post_process_image_text_to_text]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1998) | |
| Post-process the output of a vision-language model (VLM) to decode the text. | |
| **Parameters:** | |
| generated_outputs (`torch.Tensor` or `np.ndarray`) : The output of the model `generate` function. The output is expected to be a tensor of shape `(batch_size, sequence_length)` or `(sequence_length,)`. | |
| skip_special_tokens (`bool`, *optional*, defaults to `True`) : Whether or not to remove special tokens in the output. Argument passed to the tokenizer's `decode` method. | |
| `**kwargs` : Additional arguments to be passed to the tokenizer's `decode` method. | |
| **Returns:** | |
| ``list[str]`` | |
| The decoded text. | |
| #### post_process_multimodal_output[[transformers.ProcessorMixin.post_process_multimodal_output]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1969) | |
| Post-process the output of a multimodal model to return the requested modality output. | |
| If the model cannot generate the requested modality, an error will be raised. | |
| **Parameters:** | |
| generated_outputs (`torch.Tensor` or `np.ndarray`) : The output of the model `generate` function. The output is expected to be a tensor of shape `(batch_size, sequence_length)` or `(sequence_length,)`. | |
| skip_special_tokens (`bool`, *optional*, defaults to `True`) : Whether or not to remove special tokens in the output. Argument passed to the tokenizer's `batch_decode` method. | |
| generation_mode (`str`, *optional*) : Generation mode indicating which modality to output. Can be one of `["text", "image", "audio"]`. | |
| `**kwargs` : Additional arguments to be passed to the tokenizer's `batch_decode` method. | |
| **Returns:** | |
| ``list[str]`` | |
| The decoded text. | |
| #### push_to_hub[[transformers.ProcessorMixin.push_to_hub]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/utils/hub.py#L720) | |
| Upload the processor files to the 🤗 Model Hub. | |
| Examples: | |
| ```python | |
| from transformers import AutoProcessor | |
| processor = AutoProcessor.from_pretrained("google-bert/bert-base-cased") | |
| # Push the processor to your namespace with the name "my-finetuned-bert". | |
| processor.push_to_hub("my-finetuned-bert") | |
| # Push the processor to an organization with the name "my-finetuned-bert". | |
| processor.push_to_hub("huggingface/my-finetuned-bert") | |
| ``` | |
| **Parameters:** | |
| repo_id (`str`) : The name of the repository you want to push your processor to. It should contain your organization name when pushing to a given organization. | |
| commit_message (`str`, *optional*) : Message to commit while pushing. Will default to `"Upload processor"`. | |
| commit_description (`str`, *optional*) : The description of the commit that will be created. | |
| private (`bool`, *optional*) : Whether to make the repo private. If `None` (default), the repo will be public unless the organization's default is private. This value is ignored if the repo already exists. | |
| token (`bool` or `str`, *optional*) : The token to use as HTTP bearer authorization for remote files. If `True` (default), will use the token generated when running `hf auth login` (stored in `~/.huggingface`). | |
| revision (`str`, *optional*) : Branch to push the uploaded files to. | |
| create_pr (`bool`, *optional*, defaults to `False`) : Whether or not to create a PR with the uploaded files or directly commit. | |
| max_shard_size (`int` or `str`, *optional*, defaults to `"50GB"`) : Only applicable for models. The maximum size for a checkpoint before being sharded. Each checkpoint shard will then be smaller than this size. If expressed as a string, needs to be digits followed by a unit (like `"5MB"`). | |
| tags (`list[str]`, *optional*) : List of tags to push on the Hub. | |
| #### register_for_auto_class[[transformers.ProcessorMixin.register_for_auto_class]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1467) | |
| Register this class with a given auto class. This should only be used for custom processors, as the ones | |
| in the library are already mapped with `AutoProcessor`. | |
| **Parameters:** | |
| auto_class (`str` or `type`, *optional*, defaults to `"AutoProcessor"`) : The auto class to register this new processor with. | |
| #### save_pretrained[[transformers.ProcessorMixin.save_pretrained]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L819) | |
| Saves the attributes of this processor (feature extractor, tokenizer...) in the specified directory so that it | |
| can be reloaded using the [from_pretrained()](/docs/transformers/main/ko/main_classes/processors#transformers.ProcessorMixin.from_pretrained) method. | |
| This method simply calls the feature extractor's [save_pretrained()](/docs/transformers/main/ko/main_classes/feature_extractor#transformers.FeatureExtractionMixin.save_pretrained) and | |
| the tokenizer's [save_pretrained()](/docs/transformers/main/ko/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.save_pretrained). Please refer to the docstrings of the | |
| methods above for more information. | |
| **Parameters:** | |
| save_directory (`str` or `os.PathLike`) : Directory where the feature extractor JSON file and the tokenizer files will be saved (directory will be created if it does not exist). | |
| push_to_hub (`bool`, *optional*, defaults to `False`) : Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the repository you want to push to with `repo_id` (will default to the name of `save_directory` in your namespace). | |
| kwargs (`dict[str, Any]`, *optional*) : Additional keyword arguments passed along to the [push_to_hub()](/docs/transformers/main/ko/main_classes/model#transformers.utils.PushToHubMixin.push_to_hub) method. | |
| #### to_dict[[transformers.ProcessorMixin.to_dict]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L729) | |
| Serializes this instance to a Python dictionary. | |
| **Returns:** | |
| ``dict[str, Any]`` | |
| Dictionary of all the attributes that make up this processor instance. | |
| #### to_json_file[[transformers.ProcessorMixin.to_json_file]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L803) | |
| Save this instance to a JSON file. | |
| **Parameters:** | |
| json_file_path (`str` or `os.PathLike`) : Path to the JSON file in which this processor instance's parameters will be saved. | |
| #### to_json_string[[transformers.ProcessorMixin.to_json_string]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L792) | |
| Serializes this instance to a JSON string. | |
| **Returns:** | |
| ``str`` | |
| String containing all the attributes that make up this processor instance in JSON format. | |
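The serialization methods above (`to_dict`, `to_json_string`, `to_json_file`) and loading from a dictionary form a simple round trip. The sketch below illustrates that pattern with the standard library only. `ToySerializableProcessor`, its two attributes, its `from_dict` helper and the file name are hypothetical; this is not how `ProcessorMixin` is implemented, only the general idea.

```python
import json
import os
import tempfile


class ToySerializableProcessor:
    """Hypothetical stand-in for illustration; not the library's ProcessorMixin."""

    def __init__(self, image_size=224, do_lower_case=True):
        self.image_size = image_size
        self.do_lower_case = do_lower_case

    def to_dict(self):
        # Copy the instance attributes into a plain dictionary.
        return dict(vars(self))

    def to_json_string(self):
        # Sorted keys keep the output stable across runs.
        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"

    def to_json_file(self, json_file_path):
        with open(json_file_path, "w", encoding="utf-8") as writer:
            writer.write(self.to_json_string())

    @classmethod
    def from_dict(cls, processor_dict):
        # Rebuild an instance from the dictionary of parameters.
        return cls(**processor_dict)


processor = ToySerializableProcessor(image_size=336)
with tempfile.TemporaryDirectory() as tmp_dir:
    # The file name is arbitrary in this sketch.
    path = os.path.join(tmp_dir, "toy_processor.json")
    processor.to_json_file(path)
    with open(path, encoding="utf-8") as reader:
        restored = ToySerializableProcessor.from_dict(json.load(reader))

print(restored.to_dict())
```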
| ## Deprecated processors[[transformers.DataProcessor]] | |
| All processors follow the same architecture, which is that of the [DataProcessor](/docs/transformers/main/ko/main_classes/processors#transformers.DataProcessor). The processor returns a list of [InputExample](/docs/transformers/main/ko/main_classes/processors#transformers.InputExample). These [InputExample](/docs/transformers/main/ko/main_classes/processors#transformers.InputExample) can be converted to [InputFeatures](/docs/transformers/main/ko/main_classes/processors#transformers.InputFeatures) in order to be fed to the model. | |
| #### transformers.DataProcessor[[transformers.DataProcessor]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L78) | |
| Base class for data converters for sequence classification data sets. | |
| #### get_dev_examples[[transformers.DataProcessor.get_dev_examples]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L95) | |
| `get_dev_examples(data_dir)` | |
| Gets a collection of [InputExample](/docs/transformers/main/ko/main_classes/processors#transformers.InputExample) for the dev set. | |
| #### get_example_from_tensor_dict[[transformers.DataProcessor.get_example_from_tensor_dict]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L81) | |
| Gets an example from a dict. | |
| **Parameters:** | |
| tensor_dict : Keys and values should match the corresponding GLUE `tensorflow_datasets` examples. | |
| #### get_labels[[transformers.DataProcessor.get_labels]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L103) | |
| Gets the list of labels for this data set. | |
| #### get_test_examples[[transformers.DataProcessor.get_test_examples]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L99) | |
| Gets a collection of [InputExample](/docs/transformers/main/ko/main_classes/processors#transformers.InputExample) for the test set. | |
| #### get_train_examples[[transformers.DataProcessor.get_train_examples]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L91) | |
| Gets a collection of [InputExample](/docs/transformers/main/ko/main_classes/processors#transformers.InputExample) for the train set. | |
| #### tfds_map[[transformers.DataProcessor.tfds_map]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L107) | |
| Some tensorflow_datasets datasets are not formatted the same way the GLUE datasets are. This method converts | |
| examples to the correct format. | |
| #### transformers.InputExample[[transformers.InputExample]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L28) | |
| A single training/test example for simple sequence classification. | |
| **Parameters:** | |
| guid : Unique id for the example. | |
| text_a : string. The untokenized text of the first sequence. For single sequence tasks, only this sequence must be specified. | |
| text_b : (Optional) string. The untokenized text of the second sequence. Only must be specified for sequence pair tasks. | |
| label : (Optional) string. The label of the example. This should be specified for train and dev examples, but not for test examples. | |
| #### to_json_string[[transformers.InputExample.to_json_string]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L47) | |
| Serializes this instance to a JSON string. | |
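As a rough illustration of the fields documented above, the following sketch defines a stand-in dataclass with the same four attributes and a JSON serializer. `ToyInputExample` is a hypothetical name and the sample sentences and ids are made up; the real `InputExample` should be imported from the library when it is available.

```python
import dataclasses
import json
from typing import Optional


@dataclasses.dataclass
class ToyInputExample:
    """Stand-in with the documented fields; not the library class itself."""

    guid: str
    text_a: str
    text_b: Optional[str] = None  # only needed for sequence pair tasks
    label: Optional[str] = None  # omitted for test examples

    def to_json_string(self):
        return json.dumps(dataclasses.asdict(self), indent=2) + "\n"


# Single-sequence task: only text_a is set.
single = ToyInputExample(guid="train-0", text_a="A great movie.", label="1")

# Sequence-pair task: both text_a and text_b are set.
pair = ToyInputExample(
    guid="dev-7",
    text_a="He is asleep.",
    text_b="He is awake.",
    label="contradiction",
)

print(single.to_json_string())
```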
| #### transformers.InputFeatures[[transformers.InputFeatures]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L53) | |
| A single set of features of data. Property names are the same names as the corresponding inputs to a model. | |
| **Parameters:** | |
| input_ids : Indices of input sequence tokens in the vocabulary. | |
| attention_mask : Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: Usually `1` for tokens that are NOT MASKED, `0` for MASKED (padded) tokens. | |
| token_type_ids : (Optional) Segment token indices to indicate first and second portions of the inputs. Only some models use them. | |
| label : (Optional) Label corresponding to the input. Int for classification problems, float for regression problems. | |
| #### to_json_string[[transformers.InputFeatures.to_json_string]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L73) | |
| Serializes this instance to a JSON string. | |
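The relationship between `input_ids` and `attention_mask` can be sketched in a few lines: real tokens get a mask value of `1`, padding positions get `0`. The `pad_features` helper, the token ids and the pad id of `0` are assumptions made for this example, not values taken from any particular tokenizer.

```python
def pad_features(input_ids, max_length, pad_token_id=0):
    """Right-pad input_ids to max_length and build the matching attention mask."""
    padding = max_length - len(input_ids)
    if padding < 0:
        raise ValueError("input_ids is longer than max_length")
    return {
        # Padding positions are filled with the pad token id.
        "input_ids": input_ids + [pad_token_id] * padding,
        # 1 marks real tokens, 0 marks padding that attention should ignore.
        "attention_mask": [1] * len(input_ids) + [0] * padding,
    }


# Arbitrary token ids, chosen only for illustration.
features = pad_features([101, 7592, 2088, 102], max_length=6)
print(features)
```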
| ## GLUE[[transformers.glue_convert_examples_to_features]] | |
| [General Language Understanding Evaluation (GLUE)](https://gluebenchmark.com/) is a benchmark that evaluates the performance of models across a diverse set of existing NLU tasks. It was released together with the paper [GLUE: A multi-task benchmark and analysis platform for natural language understanding](https://openreview.net/pdf?id=rJ4km2R5t7). | |
| This library provides processors for a total of 10 tasks: MRPC, MNLI, MNLI (mismatched), CoLA, SST2, STSB, QQP, QNLI, RTE and WNLI. | |
| Those processors are: | |
| - `~data.processors.utils.MrpcProcessor` | |
| - `~data.processors.utils.MnliProcessor` | |
| - `~data.processors.utils.MnliMismatchedProcessor` | |
| - `~data.processors.utils.Sst2Processor` | |
| - `~data.processors.utils.StsbProcessor` | |
| - `~data.processors.utils.QqpProcessor` | |
| - `~data.processors.utils.QnliProcessor` | |
| - `~data.processors.utils.RteProcessor` | |
| - `~data.processors.utils.WnliProcessor` | |
| Additionally, the following method can be used to load values from a data file and convert them to a list of [InputExample](/docs/transformers/main/ko/main_classes/processors#transformers.InputExample). | |
| #### transformers.glue_convert_examples_to_features[[transformers.glue_convert_examples_to_features]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/glue.py#L35) | |
| Loads a data file into a list of `InputFeatures` | |
| **Parameters:** | |
| examples : List of `InputExamples` containing the examples. | |
| tokenizer : Instance of a tokenizer that will tokenize the examples | |
| max_length : Maximum example length. Defaults to the tokenizer's max_len | |
| task : GLUE task | |
| label_list : List of labels. Can be obtained from the processor using the `processor.get_labels()` method | |
| output_mode : String indicating the output mode. Either `regression` or `classification` | |
| **Returns:** | |
| Will return a list of task-specific `InputFeatures` which can be fed to the model. | |
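The `label_list` and `output_mode` parameters determine how an example's string label becomes a numeric target. The sketch below shows one plausible way to do that mapping: an index into `label_list` for classification, a float for regression. `encode_label` is a hypothetical helper written for this illustration and is not part of the library.

```python
def encode_label(label, output_mode, label_list=None):
    """Turn a string label into a numeric target (illustrative helper only)."""
    if output_mode == "classification":
        # Each label name is mapped to its position in label_list.
        label_map = {name: index for index, name in enumerate(label_list)}
        return label_map[label]
    if output_mode == "regression":
        # Regression tasks such as STS-B use a real-valued score.
        return float(label)
    raise KeyError(output_mode)


nli_labels = ["contradiction", "entailment", "neutral"]
class_id = encode_label("entailment", "classification", nli_labels)
score = encode_label("3.5", "regression")
print(class_id, score)
```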
| ## XNLI[[xnli]] | |
| [The Cross-Lingual NLI Corpus (XNLI)](https://www.nyu.edu/projects/bowman/xnli/) is a benchmark that evaluates the quality of cross-lingual text representations. XNLI is a crowd-sourced dataset based on [*MultiNLI*](http://www.nyu.edu/projects/bowman/multinli/): pairs of text are labeled with textual entailment annotations for 15 languages (ranging from high-resource languages such as English to low-resource languages such as Swahili). | |
| It was released together with the paper [XNLI: Evaluating Cross-lingual Sentence Representations](https://huggingface.co/papers/1809.05053). | |
| This library provides the processor to load the XNLI data: | |
| - `~data.processors.utils.XnliProcessor` | |
| Since gold labels are available for the test set, evaluation is performed on the test set. | |
| An example using these processors is given in the [run_xnli.py](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification/run_xnli.py) script. | |
| ## SQuAD[[squad]] | |
| [The Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer//) is a benchmark that evaluates the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version (v1.1) was released together with the paper [SQuAD: 100,000+ Questions for Machine Comprehension of Text](https://huggingface.co/papers/1606.05250). The second version (v2.0) was released together with the paper [Know What You Don't Know: Unanswerable Questions for SQuAD](https://huggingface.co/papers/1806.03822). | |
| This library hosts a processor for each of the two versions: | |
| ### Processors[[transformers.data.processors.squad.SquadProcessor]] | |
| Those processors are: | |
| - `~data.processors.utils.SquadV1Processor` | |
| - `~data.processors.utils.SquadV2Processor` | |
| They both inherit from the abstract class `~data.processors.utils.SquadProcessor`. | |
| #### transformers.data.processors.squad.SquadProcessor[[transformers.data.processors.squad.SquadProcessor]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/squad.py#L433) | |
| Processor for the SQuAD data set. Overridden by SquadV1Processor and SquadV2Processor, used by version 1.1 and | |
| version 2.0 of SQuAD, respectively. | |
| #### get_dev_examples[[transformers.data.processors.squad.SquadProcessor.get_dev_examples]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/squad.py#L521) | |
| `get_dev_examples(data_dir, filename=None)` | |
| Returns the evaluation examples from the data directory. | |
| **Parameters:** | |
| data_dir : Directory containing the data files used for training and evaluating. | |
| filename : None by default, specify this if the evaluation file has a different name than the original one which is `dev-v1.1.json` and `dev-v2.0.json` for squad versions 1.1 and 2.0 respectively. | |
| #### get_examples_from_dataset[[transformers.data.processors.squad.SquadProcessor.get_examples_from_dataset]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/squad.py#L466) | |
| Creates a list of `SquadExample` using a TFDS dataset. | |
| Examples: | |
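| ```python | |
| >>> import tensorflow_datasets as tfds | |
| >>> from transformers.data.processors.squad import SquadV1Processor | |
| >>> dataset = tfds.load("squad") | |
| >>> # get_examples_from_dataset is a method, so it needs a processor instance | |
| >>> processor = SquadV1Processor() | |
| >>> training_examples = processor.get_examples_from_dataset(dataset, evaluate=False) | |
| >>> evaluation_examples = processor.get_examples_from_dataset(dataset, evaluate=True) | |
| ``` | |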
| **Parameters:** | |
| dataset : The tfds dataset loaded from *tensorflow_datasets.load("squad")* | |
| evaluate : Boolean specifying if in evaluation mode or in training mode | |
| **Returns:** | |
| List of SquadExample | |
| #### get_train_examples[[transformers.data.processors.squad.SquadProcessor.get_train_examples]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/squad.py#L499) | |
| Returns the training examples from the data directory. | |
| **Parameters:** | |
| data_dir : Directory containing the data files used for training and evaluating. | |
| filename : None by default, specify this if the training file has a different name than the original one which is `train-v1.1.json` and `train-v2.0.json` for squad versions 1.1 and 2.0 respectively. | |
| Additionally, the following method can be used to convert SQuAD examples into `~data.processors.utils.SquadFeatures` that can be used as model inputs. | |
| #### transformers.squad_convert_examples_to_features[[transformers.squad_convert_examples_to_features]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/squad.py#L313) | |
| Converts a list of examples into a list of features that can be directly given as input to a model. It is | |
| model-dependent and takes advantage of many of the tokenizer's features to create the model's inputs. | |
| Example: | |
| ```python | |
| processor = SquadV2Processor() | |
| examples = processor.get_dev_examples(data_dir) | |
| features = squad_convert_examples_to_features( | |
| examples=examples, | |
| tokenizer=tokenizer, | |
| max_seq_length=args.max_seq_length, | |
| doc_stride=args.doc_stride, | |
| max_query_length=args.max_query_length, | |
| is_training=not evaluate, | |
| ) | |
| ``` | |
| **Parameters:** | |
| examples : list of `SquadExample` | |
| tokenizer : an instance of a child of [PreTrainedTokenizer](/docs/transformers/main/ko/main_classes/tokenizer#transformers.PythonBackend) | |
| max_seq_length : The maximum sequence length of the inputs. | |
| doc_stride : The stride used when the context is too large and is split across several features. | |
| max_query_length : The maximum length of the query. | |
| is_training : whether to create features for model evaluation or model training. | |
| padding_strategy : Defaults to "max_length". Which padding strategy to use. | |
| return_dataset : Defaults to False. Can also be 'pt', in which case a `torch.utils.data.TensorDataset` is returned. | |
| threads : multiple processing threads. | |
| **Returns:** | |
| list of `SquadFeatures` | |
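The role of `doc_stride` is easiest to see on a toy input. When a context is longer than the model can take, it is cut into overlapping windows, and the stride controls how far each new window advances. The sketch below is a simplified sliding window over a plain token list. `split_into_windows` and its parameter names are hypothetical, and the library's actual windowing also accounts for the query length and special tokens, so treat this as an approximation of the idea only.

```python
def split_into_windows(doc_tokens, max_tokens_per_window, doc_stride):
    """Cut a long token list into overlapping (start, tokens) windows."""
    if max_tokens_per_window <= 0 or doc_stride <= 0:
        raise ValueError("window size and stride must be positive")
    windows = []
    start = 0
    while start < len(doc_tokens):
        length = min(max_tokens_per_window, len(doc_tokens) - start)
        windows.append((start, doc_tokens[start : start + length]))
        if start + length == len(doc_tokens):
            break  # the last window reached the end of the document
        # Advance by the stride, so consecutive windows overlap whenever
        # the stride is smaller than the window size.
        start += min(length, doc_stride)
    return windows


tokens = [f"t{i}" for i in range(10)]
windows = split_into_windows(tokens, max_tokens_per_window=6, doc_stride=4)
for start, window in windows:
    print(start, window)
```

With 10 tokens, a window of 6 and a stride of 4, the context is covered by two windows starting at positions 0 and 4, which share tokens `t4` and `t5`. An answer that straddles a window boundary therefore still appears whole in at least one window, as long as it fits in the overlap.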
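| These processors, as well as the aforementioned method, can be used with files containing the data as well as with the *tensorflow_datasets* package. Examples are given below. | |
| ### Example usage[[example-usage]] | |
| Here is an example that uses the processors and the conversion method with data files: | |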
| ```python | |
| from transformers.data.processors.squad import SquadV1Processor, SquadV2Processor, squad_convert_examples_to_features | |
| # Load the V2 processor | |
| processor = SquadV2Processor() | |
| examples = processor.get_dev_examples(squad_v2_data_dir) | |
| # Or load the V1 processor instead (this overwrites the V2 examples above) | |
| processor = SquadV1Processor() | |
| examples = processor.get_dev_examples(squad_v1_data_dir) | |
| features = squad_convert_examples_to_features( | |
| examples=examples, | |
| tokenizer=tokenizer, | |
| max_seq_length=max_seq_length, | |
| doc_stride=doc_stride, | |
| max_query_length=max_query_length, | |
| is_training=not evaluate, | |
| ) | |
| ``` | |
| Using *tensorflow_datasets* is as easy as using a data file: | |
| ```python | |
| import tensorflow_datasets as tfds | |
| # tensorflow_datasets only handles SQuAD V1. | |
| tfds_examples = tfds.load("squad") | |
| examples = SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=evaluate) | |
| features = squad_convert_examples_to_features( | |
| examples=examples, | |
| tokenizer=tokenizer, | |
| max_seq_length=max_seq_length, | |
| doc_stride=doc_stride, | |
| max_query_length=max_query_length, | |
| is_training=not evaluate, | |
| ) | |
| ``` | |
| Another example using these processors is given in the [run_squad.py](https://github.com/huggingface/transformers/tree/main/examples/legacy/question-answering/run_squad.py) script. | |