
	# Processors[[processors]]

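	Processors can mean two different things in the Transformers library:
	- objects that preprocess inputs for multimodal models such as [Wav2Vec2](../model_doc/wav2vec2) (speech and text) or [CLIP](../model_doc/clip) (text and vision)
	- deprecated objects that were used in older versions of the library to preprocess data for GLUE or SQuAD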

	## Multimodal processors[[transformers.ProcessorMixin]]

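	Any multimodal model requires an object that encodes or decodes data spanning several modalities (text, vision, and audio). This is handled by objects called processors, which group two or more processing objects, such as tokenizers (for the text modality), image processors (for vision), and feature extractors (for audio), into a single object.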
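
The grouping described above can be illustrated with a minimal, library-free sketch. `ToyTokenizer`, `ToyImageProcessor`, and `ToyProcessor` are hypothetical stand-ins written for this example, not Transformers classes; the only point is that one object wraps several preprocessing objects and merges their outputs into a single dictionary.

```python
from typing import Any


class ToyTokenizer:
    """Stand-in for a tokenizer: maps whitespace-separated words to ids."""

    def __init__(self) -> None:
        self.vocab: dict[str, int] = {}

    def __call__(self, text: str) -> dict[str, list[int]]:
        # Assign a fresh id to each unseen word, reuse the id otherwise.
        ids = [self.vocab.setdefault(word, len(self.vocab)) for word in text.split()]
        return {"input_ids": ids, "attention_mask": [1] * len(ids)}


class ToyImageProcessor:
    """Stand-in for an image processor: rescales 8-bit pixels to [0, 1]."""

    def __call__(self, image: list[list[int]]) -> dict[str, list[list[float]]]:
        return {"pixel_values": [[pixel / 255 for pixel in row] for row in image]}


class ToyProcessor:
    """Bundles both objects and merges their outputs into one dict."""

    def __init__(self, tokenizer: ToyTokenizer, image_processor: ToyImageProcessor) -> None:
        self.tokenizer = tokenizer
        self.image_processor = image_processor

    def __call__(self, text: str, image: list[list[int]]) -> dict[str, Any]:
        encoding: dict[str, Any] = {}
        encoding.update(self.tokenizer(text))
        encoding.update(self.image_processor(image))
        return encoding


processor = ToyProcessor(ToyTokenizer(), ToyImageProcessor())
batch = processor(text="a photo of a cat", image=[[0, 255], [51, 102]])
print(sorted(batch))
```

A real processor does considerably more (batching, tensor conversion, model-specific placeholders), so this should be read as a picture of the composition pattern only.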

	These processors inherit from the following base class, which implements the saving and loading functionality:

	#### transformers.ProcessorMixin[[transformers.ProcessorMixin]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L586)

	This is a mixin used to provide saving/loading functionality for all processor classes.

	#### apply_chat_template[[transformers.ProcessorMixin.apply_chat_template]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1679)

	`apply_chat_template(conversation: list[dict[str, str]] | list[list[dict[str, str]]], chat_template: str | None = None, tools: list[dict] | None = None, documents: list[dict[str, str]] | None = None, add_generation_prompt: bool = False, continue_final_message: bool | str = False, return_assistant_tokens_mask: bool = False, tokenize: bool = False, return_tensors: str | transformers.utils.generic.TensorType | None = None, return_dict: bool = False, load_audio_from_video: bool = False, processor_kwargs: dict | None = None, **kwargs)`

	Similar to the `apply_chat_template` method on tokenizers, this method applies a Jinja template to input
	conversations to turn them into a single tokenizable string.

	The input is expected to be in the following format, where each message's content is a list consisting of text and,
	optionally, image or video inputs. One can also provide an image, video, URL, or local path, which will be used to form
	`pixel_values` when `return_dict=True`. If none is provided, only the formatted text, optionally tokenized, is returned.

	```python
	conversation = [
	    {
	        "role": "user",
	        "content": [
	            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
	            {"type": "text", "text": "Please describe this image in detail."},
	        ],
	    },
	]
	```
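
To make the expected structure concrete, here is a library-free sketch that walks a conversation in this format, renders a plain-text prompt, and collects the image URLs. The `render_conversation` helper, the `<image>` placeholder, and the `role: text` layout are assumptions made for illustration; actual processors apply a model-specific Jinja template instead.

```python
def render_conversation(conversation, add_generation_prompt=False):
    """Toy stand-in for a chat template: returns (prompt, image_urls)."""
    lines, image_urls = [], []
    for message in conversation:
        parts = []
        for item in message["content"]:
            if item["type"] == "text":
                parts.append(item["text"])
            elif item["type"] == "image":
                # Keep a placeholder in the text and remember where the image lives.
                parts.append("<image>")
                image_urls.append(item["url"])
        lines.append(f"{message['role']}: {' '.join(parts)}")
    if add_generation_prompt:
        # Cue the model to answer as the assistant.
        lines.append("assistant:")
    return "\n".join(lines), image_urls


conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
            {"type": "text", "text": "Please describe this image in detail."},
        ],
    },
]

prompt, image_urls = render_conversation(conversation, add_generation_prompt=True)
print(prompt)
```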

	Parameters:

	conversation (`Union[list[dict[str, str]], list[list[dict[str, str]]]]`) : The conversation to format.

	chat_template (`Optional[str]`, optional) : The Jinja template to use for formatting the conversation. If not provided, the tokenizer's chat template is used.
	#### batch_decode[[transformers.ProcessorMixin.batch_decode]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1625)

	This method forwards all its arguments to PreTrainedTokenizer's [batch_decode()](/docs/transformers/main/ko/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.batch_decode). Please
	refer to the docstring of this method for more information.
	#### check_argument_for_proper_class[[transformers.ProcessorMixin.check_argument_for_proper_class]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L706)

	Checks the passed argument's class against the expected transformers class. If the actual class does not
	match the expected one, an error is raised. Otherwise, the proper retrieved class
	is returned.
	#### decode[[transformers.ProcessorMixin.decode]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1634)

	This method forwards all its arguments to PreTrainedTokenizer's [decode()](/docs/transformers/main/ko/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.decode). Please refer to
	the docstring of this method for more information.
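
The forwarding behavior of `decode` and `batch_decode` amounts to plain delegation, which the following library-free sketch illustrates. `ToyTokenizer` and `ToyProcessor` are hypothetical classes with a three-entry vocabulary, used here only to show that positional and keyword arguments reach the tokenizer untouched.

```python
class ToyTokenizer:
    """Stand-in tokenizer with a tiny fixed vocabulary."""

    id_to_token = {0: "<pad>", 1: "hello", 2: "world"}

    def decode(self, token_ids, skip_special_tokens=False):
        tokens = [self.id_to_token[i] for i in token_ids]
        if skip_special_tokens:
            tokens = [t for t in tokens if t != "<pad>"]
        return " ".join(tokens)

    def batch_decode(self, sequences, **kwargs):
        return [self.decode(seq, **kwargs) for seq in sequences]


class ToyProcessor:
    """Holds a tokenizer and forwards decoding calls to it."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def decode(self, *args, **kwargs):
        # Pass everything through untouched.
        return self.tokenizer.decode(*args, **kwargs)

    def batch_decode(self, *args, **kwargs):
        return self.tokenizer.batch_decode(*args, **kwargs)


processor = ToyProcessor(ToyTokenizer())
texts = processor.batch_decode([[1, 2, 0], [2, 1]], skip_special_tokens=True)
print(texts)
```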
	#### from_args_and_dict[[transformers.ProcessorMixin.from_args_and_dict]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1164)

	Instantiates a type of `~processing_utils.ProcessorMixin` from a Python dictionary of parameters.

	Parameters:

	processor_dict (`dict[str, Any]`) : Dictionary that will be used to instantiate the processor object. Such a dictionary can be retrieved from a pretrained checkpoint by leveraging the `~processing_utils.ProcessorMixin.to_dict` method.

	kwargs (`dict[str, Any]`) : Additional parameters from which to initialize the processor object.

	Returns:

	``~processing_utils.ProcessorMixin``

	The processor object instantiated from those
	parameters.
	#### from_pretrained[[transformers.ProcessorMixin.from_pretrained]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1388)

	Instantiate a processor associated with a pretrained model.

	This class method simply calls the feature extractor's
	[from_pretrained()](/docs/transformers/main/ko/main_classes/feature_extractor#transformers.FeatureExtractionMixin.from_pretrained), the image processor's
	[ImageProcessingMixin](/docs/transformers/main/ko/internal/image_processing_utils#transformers.ImageProcessingMixin) and the tokenizer's
	`~tokenization_utils_base.PreTrainedTokenizer.from_pretrained` methods. Please refer to the docstrings of the
	methods above for more information.

	Parameters:

	pretrained_model_name_or_path (`str` or `os.PathLike`) : This can be either:
	- a string, the model id of a pretrained feature_extractor hosted inside a model repo on huggingface.co.
	- a path to a directory containing a feature extractor file saved using the [save_pretrained()](/docs/transformers/main/ko/main_classes/feature_extractor#transformers.FeatureExtractionMixin.save_pretrained) method, e.g., `./my_model_directory/`.
	- a path to a saved feature extractor JSON file, e.g., `./my_model_directory/preprocessor_config.json`.

	- **kwargs : Additional keyword arguments passed along to both [from_pretrained()](/docs/transformers/main/ko/main_classes/feature_extractor#transformers.FeatureExtractionMixin.from_pretrained) and `~tokenization_utils_base.PreTrainedTokenizer.from_pretrained`.
	#### get_processor_dict[[transformers.ProcessorMixin.get_processor_dict]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L925)

	From a `pretrained_model_name_or_path`, resolve to a dictionary of parameters, to be used for instantiating a
	processor of type `~processing_utils.ProcessorMixin` using `from_args_and_dict`.

	Parameters:

	pretrained_model_name_or_path (`str` or `os.PathLike`) : The identifier of the pre-trained checkpoint from which we want the dictionary of parameters.

	subfolder (`str`, optional, defaults to `""`) : In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can specify the folder name here.

	Returns:

	``tuple[Dict, Dict]``

	The dictionary(ies) that will be used to instantiate the processor object.
	#### parse_response[[transformers.ProcessorMixin.parse_response]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1947)

	Converts an output string created by generating text from a model into a parsed message dictionary.
	This method is intended for use with chat models, and will read the tokenizer's `response_schema` attribute to
	control parsing, although this can be overridden by passing a `schema` argument directly.

	Parameters:

	response (`str`) : The output string generated by the model. This can be either a decoded string or list of strings, or token IDs as a list/array.

	schema (`Union[list, dict]`, optional) : A response schema that indicates the expected output format and how parsing should be performed. If not provided, the tokenizer's `response_schema` attribute will be used.
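
The idea of schema-driven parsing can be pictured with a small library-free sketch. The schema layout used here (field name mapped to a regular expression) and the `<think>` tag convention are made-up simplifications for illustration and are not the format the library expects; `toy_parse_response` is a hypothetical helper, not the method documented above.

```python
import re


def toy_parse_response(response: str, schema: dict[str, str]) -> dict[str, str]:
    """Extract one field per regex; fields whose pattern does not match are skipped."""
    message = {"role": "assistant"}
    for field, pattern in schema.items():
        match = re.search(pattern, response, flags=re.DOTALL)
        if match:
            message[field] = match.group(1).strip()
    return message


# Hypothetical output format: a <think>...</think> block followed by the answer.
toy_schema = {
    "thinking": r"<think>(.*?)</think>",
    "content": r"</think>\s*(.*)$",
}

parsed = toy_parse_response("<think>2 + 2 is 4</think>The answer is 4.", toy_schema)
print(parsed)
```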
	#### post_process_image_text_to_text[[transformers.ProcessorMixin.post_process_image_text_to_text]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1998)

	Post-process the output of a vision-language model (VLM) to decode the text.

	Parameters:

	generated_outputs (`torch.Tensor` or `np.ndarray`) : The output of the model `generate` function. The output is expected to be a tensor of shape `(batch_size, sequence_length)` or `(sequence_length,)`.

	skip_special_tokens (`bool`, optional, defaults to `True`) : Whether or not to remove special tokens in the output. Argument passed to the tokenizer's `decode` method.

	- **kwargs : Additional arguments to be passed to the tokenizer's `decode` method.

	Returns:

	``list[str]``

	The decoded text.
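
As a rough picture of what this post-processing involves, the sketch below decodes token ids of either accepted shape and optionally drops special tokens. It uses nested Python lists in place of tensors, and the vocabulary, the special-token set, and the `toy_post_process` name are all assumptions made for the example.

```python
SPECIAL_TOKENS = {"<bos>", "<eos>", "<pad>"}
ID_TO_TOKEN = {0: "<pad>", 1: "<bos>", 2: "<eos>", 3: "a", 4: "stop", 5: "sign"}


def toy_post_process(generated_outputs, skip_special_tokens=True):
    """Decode ids shaped (batch_size, sequence_length) or (sequence_length,)."""
    # Promote a single sequence to a batch of one so both shapes share a path.
    if generated_outputs and isinstance(generated_outputs[0], int):
        generated_outputs = [generated_outputs]
    texts = []
    for sequence in generated_outputs:
        tokens = [ID_TO_TOKEN[token_id] for token_id in sequence]
        if skip_special_tokens:
            tokens = [token for token in tokens if token not in SPECIAL_TOKENS]
        texts.append(" ".join(tokens))
    return texts


batch_texts = toy_post_process([[1, 3, 4, 5, 2, 0], [1, 4, 2, 0, 0, 0]])
single_text = toy_post_process([1, 3, 5, 2])
print(batch_texts, single_text)
```

Either input shape yields a list of strings, which matches the `list[str]` return type documented above.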
	#### post_process_multimodal_output[[transformers.ProcessorMixin.post_process_multimodal_output]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1969)

	Post-process the output of a multimodal model to return the requested modality output.
	If the model cannot generate the requested modality, an error will be raised.

	Parameters:

	generated_outputs (`torch.Tensor` or `np.ndarray`) : The output of the model `generate` function. The output is expected to be a tensor of shape `(batch_size, sequence_length)` or `(sequence_length,)`.

	skip_special_tokens (`bool`, optional, defaults to `True`) : Whether or not to remove special tokens in the output. Argument passed to the tokenizer's `batch_decode` method.

	generation_mode (`str`, optional) : Generation mode indicating which modality to output. Can be one of `["text", "image", "audio"]`.

	- **kwargs : Additional arguments to be passed to the tokenizer's `batch_decode` method.

	Returns:

	``list[str]``

	The decoded text.
	#### push_to_hub[[transformers.ProcessorMixin.push_to_hub]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/utils/hub.py#L720)

	Upload the processor files to the 🤗 Model Hub.

	Examples:

	```python
	from transformers import AutoProcessor

	processor = AutoProcessor.from_pretrained("google-bert/bert-base-cased")

	# Push the processor to your namespace with the name "my-finetuned-bert".
	processor.push_to_hub("my-finetuned-bert")

	# Push the processor to an organization with the name "my-finetuned-bert".
	processor.push_to_hub("huggingface/my-finetuned-bert")
	```

	Parameters:

	repo_id (`str`) : The name of the repository you want to push your processor to. It should contain your organization name when pushing to a given organization.

	commit_message (`str`, optional) : Message to commit while pushing. Will default to `"Upload processor"`.

	commit_description (`str`, optional) : The description of the commit that will be created.

	private (`bool`, optional) : Whether to make the repo private. If `None` (default), the repo will be public unless the organization's default is private. This value is ignored if the repo already exists.

	token (`bool` or `str`, optional) : The token to use as HTTP bearer authorization for remote files. If `True` (default), will use the token generated when running `hf auth login` (stored in `~/.huggingface`).

	revision (`str`, optional) : Branch to push the uploaded files to.

	create_pr (`bool`, optional, defaults to `False`) : Whether or not to create a PR with the uploaded files or directly commit.

	max_shard_size (`int` or `str`, optional, defaults to `"50GB"`) : Only applicable for models. The maximum size for a checkpoint before being sharded. Each checkpoint shard will then be smaller than this size. If expressed as a string, needs to be digits followed by a unit (like `"5MB"`).

	tags (`list[str]`, optional) : List of tags to push on the Hub.
	#### register_for_auto_class[[transformers.ProcessorMixin.register_for_auto_class]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1467)

	Register this class with a given auto class. This should only be used for custom processors as the ones
	in the library are already mapped with `AutoProcessor`.

	Parameters:

	auto_class (`str` or `type`, optional, defaults to `"AutoProcessor"`) : The auto class to register this new processor with.
	#### save_pretrained[[transformers.ProcessorMixin.save_pretrained]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L819)

	Saves the attributes of this processor (feature extractor, tokenizer...) in the specified directory so that it
	can be reloaded using the [from_pretrained()](/docs/transformers/main/ko/main_classes/processors#transformers.ProcessorMixin.from_pretrained) method.

	This method simply calls [save_pretrained()](/docs/transformers/main/ko/main_classes/feature_extractor#transformers.FeatureExtractionMixin.save_pretrained) and
	[save_pretrained()](/docs/transformers/main/ko/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.save_pretrained). Please refer to the docstrings of the
	methods above for more information.

	Parameters:

	save_directory (`str` or `os.PathLike`) : Directory where the feature extractor JSON file and the tokenizer files will be saved (directory will be created if it does not exist).

	push_to_hub (`bool`, optional, defaults to `False`) : Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the repository you want to push to with `repo_id` (will default to the name of `save_directory` in your namespace).

	kwargs (`dict[str, Any]`, optional) : Additional keyword arguments passed along to the [push_to_hub()](/docs/transformers/main/ko/main_classes/model#transformers.utils.PushToHubMixin.push_to_hub) method.
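
The save and reload cycle described above, where the processor delegates to each of its components, can be sketched without the library as follows. `ToyComponent`, `ToyProcessor`, and the file names `toy_tokenizer.json` and `toy_feature_extractor.json` are hypothetical and chosen only for this example; the real methods write the library's own file layout.

```python
import json
import tempfile
from pathlib import Path


class ToyComponent:
    """Stand-in for a tokenizer or feature extractor with its own config file."""

    def __init__(self, filename: str, **config):
        self.filename = filename
        self.config = config

    def save_pretrained(self, save_directory: Path) -> None:
        (save_directory / self.filename).write_text(json.dumps(self.config))

    @classmethod
    def from_pretrained(cls, directory: Path, filename: str) -> "ToyComponent":
        return cls(filename, **json.loads((directory / filename).read_text()))


class ToyProcessor:
    def __init__(self, tokenizer: ToyComponent, feature_extractor: ToyComponent):
        self.tokenizer = tokenizer
        self.feature_extractor = feature_extractor

    def save_pretrained(self, save_directory: str) -> None:
        path = Path(save_directory)
        path.mkdir(parents=True, exist_ok=True)  # created if it does not exist
        # Each component writes its own file into the shared directory.
        self.tokenizer.save_pretrained(path)
        self.feature_extractor.save_pretrained(path)

    @classmethod
    def from_pretrained(cls, directory: str) -> "ToyProcessor":
        path = Path(directory)
        return cls(
            ToyComponent.from_pretrained(path, "toy_tokenizer.json"),
            ToyComponent.from_pretrained(path, "toy_feature_extractor.json"),
        )


processor = ToyProcessor(
    ToyComponent("toy_tokenizer.json", lowercase=True),
    ToyComponent("toy_feature_extractor.json", sampling_rate=16000),
)
with tempfile.TemporaryDirectory() as tmp:
    save_dir = Path(tmp) / "my_processor"
    processor.save_pretrained(str(save_dir))
    saved_files = sorted(p.name for p in save_dir.iterdir())
    reloaded = ToyProcessor.from_pretrained(str(save_dir))
```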
	#### to_dict[[transformers.ProcessorMixin.to_dict]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L729)

	Serializes this instance to a Python dictionary.

	Returns:

	``dict[str, Any]``

	Dictionary of all the attributes that make up this processor instance.
	#### to_json_file[[transformers.ProcessorMixin.to_json_file]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L803)

	Save this instance to a JSON file.

	Parameters:

	json_file_path (`str` or `os.PathLike`) : Path to the JSON file in which this processor instance's parameters will be saved.
	#### to_json_string[[transformers.ProcessorMixin.to_json_string]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L792)

	Serializes this instance to a JSON string.

	Returns:

	``str``

	String containing all the attributes that make up this processor instance in JSON format.
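
The three serialization helpers (`to_dict`, `to_json_string`, `to_json_file`) build on one another. A minimal sketch of that layering, using only the standard library, might look like the following; `ToyProcessor`, its two attributes, the `from_dict` helper, and the file name are assumptions for illustration, not the library's implementation.

```python
import json
import tempfile
from pathlib import Path


class ToyProcessor:
    def __init__(self, image_size: int = 224, do_lower_case: bool = True):
        self.image_size = image_size
        self.do_lower_case = do_lower_case

    def to_dict(self) -> dict:
        # Copy so callers cannot mutate the instance through the result.
        return dict(vars(self))

    def to_json_string(self) -> str:
        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"

    def to_json_file(self, json_file_path) -> None:
        Path(json_file_path).write_text(self.to_json_string(), encoding="utf-8")

    @classmethod
    def from_dict(cls, processor_dict: dict) -> "ToyProcessor":
        return cls(**processor_dict)


processor = ToyProcessor(image_size=336)
with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "toy_processor_config.json"
    processor.to_json_file(json_path)
    restored = ToyProcessor.from_dict(json.loads(json_path.read_text(encoding="utf-8")))
```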

	## Deprecated processors[[transformers.DataProcessor]]

	All processors follow the same architecture, which is that of the [DataProcessor](/docs/transformers/main/ko/main_classes/processors#transformers.DataProcessor). The processor returns a list of [InputExample](/docs/transformers/main/ko/main_classes/processors#transformers.InputExample). These [InputExample](/docs/transformers/main/ko/main_classes/processors#transformers.InputExample) can be converted to [InputFeatures](/docs/transformers/main/ko/main_classes/processors#transformers.InputFeatures) in order to be fed to the model.

	#### transformers.DataProcessor[[transformers.DataProcessor]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L78)

	Base class for data converters for sequence classification data sets.

	#### get_dev_examples[[transformers.DataProcessor.get_dev_examples]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L95)

	Gets a collection of [InputExample](/docs/transformers/main/ko/main_classes/processors#transformers.InputExample) for the dev set.
	#### get_example_from_tensor_dict[[transformers.DataProcessor.get_example_from_tensor_dict]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L81)

	Gets an example from a dict.

	Parameters:

	tensor_dict : Keys and values should match the corresponding Glue tensorflow_dataset examples.
	#### get_labels[[transformers.DataProcessor.get_labels]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L103)

	Gets the list of labels for this data set.
	#### get_test_examples[[transformers.DataProcessor.get_test_examples]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L99)

	Gets a collection of [InputExample](/docs/transformers/main/ko/main_classes/processors#transformers.InputExample) for the test set.
	#### get_train_examples[[transformers.DataProcessor.get_train_examples]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L91)

	Gets a collection of [InputExample](/docs/transformers/main/ko/main_classes/processors#transformers.InputExample) for the train set.
	#### tfds_map[[transformers.DataProcessor.tfds_map]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L107)

	Some tensorflow_datasets datasets are not formatted the same way the GLUE datasets are. This method converts
	examples to the correct format.

	#### transformers.InputExample[[transformers.InputExample]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L28)

	A single training/test example for simple sequence classification.

	#### to_json_string[[transformers.InputExample.to_json_string]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L47)

	Serializes this instance to a JSON string.

	Parameters:

	guid : Unique id for the example.

	text_a : string. The untokenized text of the first sequence. For single sequence tasks, only this sequence must be specified.

	text_b : (Optional) string. The untokenized text of the second sequence. Must only be specified for sequence pair tasks.

	label : (Optional) string. The label of the example. This should be specified for train and dev examples, but not for test examples.

	#### transformers.InputFeatures[[transformers.InputFeatures]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L53)

	A single set of features of data. Property names are the same names as the corresponding inputs to a model.

	#### to_json_string[[transformers.InputFeatures.to_json_string]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L73)

	Serializes this instance to a JSON string.

	Parameters:

	input_ids : Indices of input sequence tokens in the vocabulary.

	attention_mask : Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: Usually `1` for tokens that are NOT MASKED, `0` for MASKED (padded) tokens.

	token_type_ids : (Optional) Segment token indices to indicate first and second portions of the inputs. Only some models use them.

	label : (Optional) Label corresponding to the input. Int for classification problems, float for regression problems.
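
The path from an `InputExample` to an `InputFeatures` can be pictured with a small library-free sketch. The `Toy*` dataclasses mirror the field names documented above but are hypothetical stand-ins, and the whitespace tokenizer, the `[SEP]` separator, and the padding id `0` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyInputExample:
    guid: str
    text_a: str
    text_b: Optional[str] = None
    label: Optional[str] = None


@dataclass
class ToyInputFeatures:
    input_ids: list
    attention_mask: list
    label: Optional[int] = None


def convert_example(example, vocab, label_list, max_length):
    """Whitespace-tokenize, truncate, then pad with id 0 up to max_length."""
    words = example.text_a.split()
    if example.text_b is not None:
        words += ["[SEP]"] + example.text_b.split()
    ids = [vocab.setdefault(word, len(vocab)) for word in words][:max_length]
    padding = max_length - len(ids)
    return ToyInputFeatures(
        input_ids=ids + [0] * padding,
        # 1 marks real tokens, 0 marks padding.
        attention_mask=[1] * len(ids) + [0] * padding,
        label=None if example.label is None else label_list.index(example.label),
    )


vocab = {"[PAD]": 0}
example = ToyInputExample(guid="train-0", text_a="the cat sat", text_b="a cat sat", label="entailment")
features = convert_example(example, vocab, ["contradiction", "entailment"], max_length=10)
print(features)
```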

	## GLUE[[transformers.glue_convert_examples_to_features]]

	[General Language Understanding Evaluation (GLUE)](https://gluebenchmark.com/) is a benchmark that evaluates the performance of models across a diverse set of existing NLU tasks. It was released together with the paper [GLUE: A multi-task benchmark and analysis platform for natural language understanding](https://openreview.net/pdf?id=rJ4km2R5t7).

	This library provides processors for a total of 10 tasks: MRPC, MNLI, MNLI (mismatched), CoLA, SST2, STSB, QQP, QNLI, RTE, and WNLI.

	Those processors are:

	- `~data.processors.utils.MrpcProcessor`
	- `~data.processors.utils.MnliProcessor`
	- `~data.processors.utils.MnliMismatchedProcessor`
	- `~data.processors.utils.Sst2Processor`
	- `~data.processors.utils.StsbProcessor`
	- `~data.processors.utils.QqpProcessor`
	- `~data.processors.utils.QnliProcessor`
	- `~data.processors.utils.RteProcessor`
	- `~data.processors.utils.WnliProcessor`

	Additionally, the following method can be used to load values from a data file and convert them to a list of [InputExample](/docs/transformers/main/ko/main_classes/processors#transformers.InputExample).

	#### transformers.glue_convert_examples_to_features[[transformers.glue_convert_examples_to_features]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/glue.py#L35)

	Loads a data file into a list of `InputFeatures`

	Parameters:

	examples : List of `InputExamples` containing the examples.

	tokenizer : Instance of a tokenizer that will tokenize the examples

	max_length : Maximum example length. Defaults to the tokenizer's max_len

	task : GLUE task

	label_list : List of labels. Can be obtained from the processor using the `processor.get_labels()` method

	output_mode : String indicating the output mode. Either `regression` or `classification`

	Returns:

	Will return a list of task-specific `InputFeatures` which can be fed to the model.
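
The role of `label_list` and `output_mode` can be illustrated with a short sketch of the usual convention: class names are mapped to integer indices for classification, and label strings are parsed as floats for regression. `label_from_example` is a hypothetical helper written for this example, not a function of the library.

```python
def label_from_example(label, label_list, output_mode):
    """Turn a raw string label into a model target."""
    if label is None:
        return None  # test examples carry no label
    if output_mode == "classification":
        label_map = {name: index for index, name in enumerate(label_list)}
        return label_map[label]
    if output_mode == "regression":
        return float(label)
    raise ValueError(f"Unknown output_mode: {output_mode}")


class_target = label_from_example("not_entailment", ["entailment", "not_entailment"], "classification")
score_target = label_from_example("3.8", [None], "regression")
print(class_target, score_target)
```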

	## XNLI[[xnli]]

	[The Cross-Lingual NLI Corpus (XNLI)](https://www.nyu.edu/projects/bowman/xnli/) is a benchmark that evaluates the quality of cross-lingual text representations. XNLI is a crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/): pairs of text are labeled with textual entailment annotations for 15 languages (ranging from high-resource languages such as English to low-resource languages such as Swahili).

	It was released together with the paper [XNLI: Evaluating Cross-lingual Sentence Representations](https://huggingface.co/papers/1809.05053).

	This library provides the processor to load the XNLI data:

	- `~data.processors.utils.XnliProcessor`

	Since gold labels are available for the test set, evaluation is performed on the test set.

	An example using these processors is given in the [run_xnli.py](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification/run_xnli.py) script.

	## SQuAD[[squad]]

	[The Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer//) is a benchmark that evaluates the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version (v1.1) was released together with the paper [SQuAD: 100,000+ Questions for Machine Comprehension of Text](https://huggingface.co/papers/1606.05250). The second version (v2.0) was released together with the paper [Know What You Don't Know: Unanswerable Questions for SQuAD](https://huggingface.co/papers/1806.03822).

	This library hosts a processor for each of the two versions:

	### Processors[[transformers.data.processors.squad.SquadProcessor]]

	Those processors are:

	- `~data.processors.utils.SquadV1Processor`
	- `~data.processors.utils.SquadV2Processor`

	They both inherit from the abstract class `~data.processors.utils.SquadProcessor`.

	#### transformers.data.processors.squad.SquadProcessor[[transformers.data.processors.squad.SquadProcessor]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/squad.py#L433)

	Processor for the SQuAD data set. Overridden by SquadV1Processor and SquadV2Processor, used by version 1.1 and
	version 2.0 of SQuAD, respectively.

	#### get_dev_examples[[transformers.data.processors.squad.SquadProcessor.get_dev_examples]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/squad.py#L521)

	Returns the evaluation examples from the data directory.

	Parameters:

	data_dir : Directory containing the data files used for training and evaluating.

	filename : None by default, specify this if the evaluation file has a different name than the original one which is `dev-v1.1.json` and `dev-v2.0.json` for squad versions 1.1 and 2.0 respectively.
	#### get_examples_from_dataset[[transformers.data.processors.squad.SquadProcessor.get_examples_from_dataset]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/squad.py#L466)

	Creates a list of `SquadExample` using a TFDS dataset.

	Examples:

	```python
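	>>> import tensorflow_datasets as tfds
	>>> from transformers.data.processors.squad import SquadV1Processor

	>>> dataset = tfds.load("squad")
	>>> processor = SquadV1Processor()

	>>> training_examples = processor.get_examples_from_dataset(dataset, evaluate=False)
	>>> evaluation_examples = processor.get_examples_from_dataset(dataset, evaluate=True)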
	```

	Parameters:

	dataset : The tfds dataset loaded from tensorflow_datasets.load("squad")

	evaluate : Boolean specifying if in evaluation mode or in training mode

	Returns:

	List of SquadExample
	#### get_train_examples[[transformers.data.processors.squad.SquadProcessor.get_train_examples]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/squad.py#L499)

	Returns the training examples from the data directory.

	Parameters:

	data_dir : Directory containing the data files used for training and evaluating.

	filename : None by default, specify this if the training file has a different name than the original one which is `train-v1.1.json` and `train-v2.0.json` for squad versions 1.1 and 2.0 respectively.

	Additionally, the following method can be used to convert SQuAD examples into `~data.processors.utils.SquadFeatures` that can be used as model inputs.

	#### transformers.squad_convert_examples_to_features[[transformers.squad_convert_examples_to_features]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/squad.py#L313)

	Converts a list of examples into a list of features that can be directly given as input to a model. It is
	model-dependent and takes advantage of many of the tokenizer's features to create the model's inputs.

	Example:

	```python
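	from transformers.data.processors.squad import SquadV2Processor, squad_convert_examples_to_features

	# `data_dir`, `tokenizer`, `args` and `evaluate` are assumed to be defined by the calling script.
	processor = SquadV2Processor()
	examples = processor.get_dev_examples(data_dir)

	features = squad_convert_examples_to_features(
	    examples=examples,
	    tokenizer=tokenizer,
	    max_seq_length=args.max_seq_length,
	    doc_stride=args.doc_stride,
	    max_query_length=args.max_query_length,
	    is_training=not evaluate,
	)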
	```

	Parameters:

	examples : list of `SquadExample`

	tokenizer : an instance of a child of [PreTrainedTokenizer](/docs/transformers/main/ko/main_classes/tokenizer#transformers.PythonBackend)

	max_seq_length : The maximum sequence length of the inputs.

	doc_stride : The stride used when the context is too large and is split across several features.

	max_query_length : The maximum length of the query.

	is_training : Whether to create features for model training rather than model evaluation.

	padding_strategy : Which padding strategy to use. Defaults to "max_length".

	return_dataset : Defaults to `False`. Can also be `'pt'`, in which case a `torch.utils.data.TensorDataset` is returned.

	threads : Number of threads to use for processing.

	Returns:

	list of `SquadFeatures`
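
The effect of `doc_stride` on a long context can be sketched without the library. The example assumes that `doc_stride` is the step between the starts of consecutive windows, which is how the original BERT SQuAD script defines it; `split_context` and `max_tokens_for_doc` (the room left for context tokens once the question and special tokens are accounted for) are hypothetical names introduced for illustration.

```python
def split_context(context_tokens, max_tokens_for_doc, doc_stride):
    """Return (start, tokens) windows that together cover the whole context."""
    windows = []
    start = 0
    while start < len(context_tokens):
        window = context_tokens[start : start + max_tokens_for_doc]
        windows.append((start, window))
        if start + len(window) >= len(context_tokens):
            break  # the last token is covered
        # Advance by the stride, so consecutive windows overlap when the
        # stride is smaller than the window size.
        start += min(len(window), doc_stride)
    return windows


tokens = [f"t{i}" for i in range(10)]
windows = split_context(tokens, max_tokens_for_doc=6, doc_stride=4)
for start, window in windows:
    print(start, window)
```

With these settings each window shares its first two tokens with the end of the previous one, so an answer that straddles a window boundary still appears whole in at least one feature as long as it is no longer than the overlap.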

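	These processors, as well as the aforementioned method, can be used with files containing the data as well as with the tensorflow_datasets package. Examples are given below.

	### Example usage[[example-usage]]

	Here is an example using the processors and the conversion method with data files: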

	```python
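	from transformers.data.processors.squad import (
	    SquadV1Processor,
	    SquadV2Processor,
	    squad_convert_examples_to_features,
	)

	# The data directories, `tokenizer`, the length settings and `evaluate` are assumed to be defined elsewhere.
	# Loading a V2 processor
	processor = SquadV2Processor()
	examples = processor.get_dev_examples(squad_v2_data_dir)

	# Loading a V1 processor
	processor = SquadV1Processor()
	examples = processor.get_dev_examples(squad_v1_data_dir)

	features = squad_convert_examples_to_features(
	    examples=examples,
	    tokenizer=tokenizer,
	    max_seq_length=max_seq_length,
	    doc_stride=doc_stride,
	    max_query_length=max_query_length,
	    is_training=not evaluate,
	)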
	```

	Using tensorflow_datasets is as easy as using a data file:

	```python
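	import tensorflow_datasets as tfds
	from transformers.data.processors.squad import SquadV1Processor, squad_convert_examples_to_features

	# tensorflow_datasets only handles SQuAD v1.
	# `tokenizer`, the length settings and `evaluate` are assumed to be defined elsewhere.
	tfds_examples = tfds.load("squad")
	examples = SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=evaluate)

	features = squad_convert_examples_to_features(
	    examples=examples,
	    tokenizer=tokenizer,
	    max_seq_length=max_seq_length,
	    doc_stride=doc_stride,
	    max_query_length=max_query_length,
	    is_training=not evaluate,
	)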
	```

	Another example using these processors is given in the [run_squad.py](https://github.com/huggingface/transformers/tree/main/examples/legacy/question-answering/run_squad.py) script.
