	# Processors

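	In the Transformers library, "processors" can mean two different things:
	- objects that pre-process inputs for multimodal models such as [Wav2Vec2](../model_doc/wav2vec2) (speech and text) or [CLIP](../model_doc/clip) (text and vision)
	- deprecated objects that were used in older versions of the library to preprocess data for GLUE or SQUAD.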

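	## Multimodal processors[[transformers.ProcessorMixin]]

	Any multimodal model requires an object to encode or decode data that groups several modalities (among text, vision and audio). This is handled by objects called processors, which group together two or more processing objects such as tokenizers (for the text modality), image processors (for vision) and feature extractors (for audio).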
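The grouping idea can be illustrated with a minimal sketch. The classes `ToyTokenizer`, `ToyImageProcessor` and `ToyProcessor` below are hypothetical stand-ins written for this illustration; they are not part of Transformers, and real processors do considerably more work.

```python
from dataclasses import dataclass


class ToyTokenizer:
    """Hypothetical stand-in for a text tokenizer."""

    def __call__(self, text: str) -> dict:
        # Map each whitespace-separated word to a fake integer ID (its length).
        return {"input_ids": [len(word) for word in text.split()]}


class ToyImageProcessor:
    """Hypothetical stand-in for an image processor."""

    def __call__(self, image: list) -> dict:
        # Scale 0-255 pixel values into the 0.0-1.0 range.
        return {"pixel_values": [[p / 255 for p in row] for row in image]}


@dataclass
class ToyProcessor:
    """Bundles one handler per modality behind a single call."""

    tokenizer: ToyTokenizer
    image_processor: ToyImageProcessor

    def __call__(self, text: str, image: list) -> dict:
        # Merge the per-modality outputs into one dict of model inputs.
        return {**self.tokenizer(text), **self.image_processor(image)}


processor = ToyProcessor(ToyTokenizer(), ToyImageProcessor())
inputs = processor(text="a stop sign", image=[[0, 255], [51, 102]])
print(sorted(inputs))
```

The caller only deals with one object and receives one dictionary, which is the convenience a multimodal processor is meant to provide.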

	Those processors inherit from the following base class that implements the saving and loading functionality:

	#### transformers.ProcessorMixin[[transformers.ProcessorMixin]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L588)

	This is a mixin used to provide saving/loading functionality for all processor classes.

	#### apply_chat_template[[transformers.ProcessorMixin.apply_chat_template]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1918)

	`apply_chat_template(conversation: list[dict[str, str]] | list[list[dict[str, str]]], chat_template: str | None = None, tools: list[dict] | None = None, documents: list[dict[str, str]] | None = None, add_generation_prompt: bool = False, continue_final_message: bool | str = False, return_assistant_tokens_mask: bool = False, tokenize: bool = False, return_tensors: str | transformers.utils.generic.TensorType | None = None, return_dict: bool = False, load_audio_from_video: bool = False, processor_kwargs: dict | None = None, **kwargs)`

	Similar to the `apply_chat_template` method on tokenizers, this method applies a Jinja template to input
	conversations to turn them into a single tokenizable string.

	The input is expected to be in the following format, where each message content is a list consisting of text and
	optionally image or video inputs. One can also provide an image, video, URL or local path which will be used to form
	`pixel_values` when `return_dict=True`. If not provided, one will get only the formatted text, optionally tokenized text.

	conversation = [
	    {
	        "role": "user",
	        "content": [
	            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
	            {"type": "text", "text": "Please describe this image in detail."},
	        ],
	    },
	]

	Parameters:

	conversation (`Union[list[dict[str, str]], list[list[dict[str, str]]]]`) : The conversation to format.

	chat_template (`Optional[str]`, optional) : The Jinja template to use for formatting the conversation. If not provided, the tokenizer's chat template is used.
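To make the conversation format more concrete, here is a minimal sketch of how such a nested structure could be flattened into a single string. It is only an illustration: the real method renders a model-specific Jinja template, whereas the helper `render_conversation`, the `role: text` layout and the `<image>` style placeholders below are assumptions made up for this example.

```python
def render_conversation(conversation, add_generation_prompt=False):
    """Flatten a chat into one string using a made-up, fixed layout."""
    lines = []
    for message in conversation:
        pieces = []
        for item in message["content"]:
            if item["type"] == "text":
                pieces.append(item["text"])
            else:
                # Non-text items become a placeholder such as "<image>".
                pieces.append(f"<{item['type']}>")
        lines.append(f"{message['role']}: {' '.join(pieces)}")
    if add_generation_prompt:
        # Cue the model to continue as the assistant.
        lines.append("assistant:")
    return "\n".join(lines)


conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
            {"type": "text", "text": "Please describe this image in detail."},
        ],
    },
]
prompt = render_conversation(conversation, add_generation_prompt=True)
print(prompt)
```

The image entry contributes only a placeholder to the text; producing `pixel_values` from the URL is a separate step that this sketch does not attempt.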
	#### batch_decode[[transformers.ProcessorMixin.batch_decode]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1872)

	This method forwards all its arguments to PreTrainedTokenizer's [batch_decode()](/docs/transformers/main/zh/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.batch_decode). Please
	refer to the docstring of this method for more information.
	#### check_argument_for_proper_class[[transformers.ProcessorMixin.check_argument_for_proper_class]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L950)

	Checks the passed argument's class against the expected transformers class. In case of an unexpected
	mismatch between the expected and actual class, an error is raised. Otherwise, the proper retrieved class
	is returned.
	#### create_mm_token_type_ids[[transformers.ProcessorMixin.create_mm_token_type_ids]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L897)

	Build per-token modality type IDs for a batch of token_id sequences.

	Each position is assigned an integer indicating which modality it belongs to:
	`0` for regular text, `1` for image tokens, `2` for video tokens, and
	`3` for audio tokens. Membership is determined by comparing against
	`self.image_token_ids`, `self.video_token_ids`, and `self.audio_token_ids`.

	Parameters:

	input_ids (list[list[int]]) : Batch of token ID sequences. May be unpadded (variable length), so a plain Python list of lists is expected rather than a tensor or uniformly-shaped array.

	Returns:

	`list[list[int]]`

	A list of the same structure as `input_ids`, where each
	integer is the modality type ID for the corresponding token.
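The mapping described above can be sketched in a few lines of plain Python. This is an illustrative reimplementation, not the library code: the function name `modality_type_ids` and the token IDs in the example are invented, and the token ID collections are passed as arguments here instead of being read from `self`.

```python
TEXT, IMAGE, VIDEO, AUDIO = 0, 1, 2, 3


def modality_type_ids(input_ids, image_token_ids=(), video_token_ids=(), audio_token_ids=()):
    """Return one modality code per token, mirroring the shape of input_ids."""
    # Build a single lookup table from special token ID to modality code.
    lookup = {}
    lookup.update({token_id: IMAGE for token_id in image_token_ids})
    lookup.update({token_id: VIDEO for token_id in video_token_ids})
    lookup.update({token_id: AUDIO for token_id in audio_token_ids})
    # Any token ID that is not in the table is treated as regular text.
    return [[lookup.get(token_id, TEXT) for token_id in sequence] for sequence in input_ids]


# Two unpadded sequences of different lengths (hypothetical token IDs).
batch = [[101, 9001, 9001, 7], [9002, 5, 9003]]
type_ids = modality_type_ids(
    batch, image_token_ids=[9001], video_token_ids=[9002], audio_token_ids=[9003]
)
print(type_ids)
```

Because the result is a list of lists, sequences of different lengths need no padding, which matches the note on `input_ids` above.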
	#### decode[[transformers.ProcessorMixin.decode]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1881)

	This method forwards all its arguments to PreTrainedTokenizer's [decode()](/docs/transformers/main/zh/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.decode). Please refer to
	the docstring of this method for more information.
	#### from_args_and_dict[[transformers.ProcessorMixin.from_args_and_dict]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1411)

	Instantiates a type of `ProcessorMixin` from a Python dictionary of parameters.

	Parameters:

	processor_dict (`dict[str, Any]`) : Dictionary that will be used to instantiate the processor object. Such a dictionary can be retrieved from a pretrained checkpoint by leveraging the `ProcessorMixin.to_dict` method.

	kwargs (`dict[str, Any]`) : Additional parameters from which to initialize the processor object.

	Returns:

	`ProcessorMixin`

	The processor object instantiated from those
	parameters.
	#### from_pretrained[[transformers.ProcessorMixin.from_pretrained]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1635)

	Instantiate a processor associated with a pretrained model.

	This class method simply calls the `from_pretrained` methods of the feature extractor
	([from_pretrained()](/docs/transformers/main/zh/main_classes/feature_extractor#transformers.FeatureExtractionMixin.from_pretrained)), the image processor
	([ImageProcessingMixin](/docs/transformers/main/zh/main_classes/image_processor#transformers.ImageProcessingMixin)) and the tokenizer
	(`~tokenization_utils_base.PreTrainedTokenizer.from_pretrained`). Please refer to the docstrings of the
	methods above for more information.

	Parameters:

	pretrained_model_name_or_path (`str` or `os.PathLike`) : This can be either:
	- a string, the model id of a pretrained feature_extractor hosted inside a model repo on huggingface.co.
	- a path to a directory containing a feature extractor file saved using the [save_pretrained()](/docs/transformers/main/zh/main_classes/feature_extractor#transformers.FeatureExtractionMixin.save_pretrained) method, e.g., `./my_model_directory/`.
	- a path to a saved feature extractor JSON file, e.g., `./my_model_directory/preprocessor_config.json`.

	`**kwargs` : Additional keyword arguments passed along to both [from_pretrained()](/docs/transformers/main/zh/main_classes/feature_extractor#transformers.FeatureExtractionMixin.from_pretrained) and `~tokenization_utils_base.PreTrainedTokenizer.from_pretrained`.
	#### get_processor_dict[[transformers.ProcessorMixin.get_processor_dict]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1172)

	From a `pretrained_model_name_or_path`, resolve to a dictionary of parameters, to be used for instantiating a
	processor of type `ProcessorMixin` using `from_args_and_dict`.

	Parameters:

	pretrained_model_name_or_path (`str` or `os.PathLike`) : The identifier of the pre-trained checkpoint from which we want the dictionary of parameters.

	subfolder (`str`, optional, defaults to `""`) : In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can specify the folder name here.

	Returns:

	``tuple[Dict, Dict]``

	The dictionary(ies) that will be used to instantiate the processor object.
	#### get_text_with_replacements[[transformers.ProcessorMixin.get_text_with_replacements]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L801)

	Replace multimodal placeholder tokens in a batch of text strings with their
	expanded representations, and return the modified texts alongside offset metadata.

	This method is the core text-side preprocessing step for multimodal inputs. It
	scans each text in the batch for special tokens (image, video, audio) and replaces
	them in-order with the pre-computed replacement strings produced by
	self.replace_image_token / self.replace_video_token / self.replace_audio_token.
	Replacements are consumed from each modality's list sequentially, so the i-th
	occurrence of e.g. `self.image_token` is replaced by `images_replacements[i]`.

	To add a new multimodal processor with placeholder tokens, you need to define a correct
	self.image_token which is the same token that is embedded in input text and also used as
	placeholder and repeated many times. Then you need to override self.replace_image_token
	to return the correct replacement string for a given image at index i. Same goes for all
	other supported modalities.

	Parameters:

	text (list[str]) : Batch of raw text strings, each potentially containing multimodal placeholder tokens. Note that it will be modified in-place and returned.

	images_replacements (list[str], optional, defaults to []) : Expanded replacement strings for each image, in the order they appear across the batch. Produced by self._process_images.

	videos_replacements (list[str], optional, defaults to []) : Expanded replacement strings for each video. Produced by self._process_videos.

	audio_replacements (list[str], optional, defaults to []) : Expanded replacement strings for each audio input. Produced by self._process_audio.

	Returns:

	`tuple[list[str], list[dict[str, Any]]]`

	A tuple of:
	- The modified text batch with all placeholder tokens expanded.
	- batch_replacement_offsets: one entry per batch item, each being a
	list of dicts with keys:
	- "type" (str): modality name — "image", "video", or "audio"
	- "span" (tuple[int, int]): original (start, end) char offsets of the placeholder token
	- "new_span" (tuple[int, int]): (start, end) offsets of placeholder in the expanded string
	- "text" (str): the original placeholder token string that was matched
	- "replacement" (str): the string it was replaced with
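The replacement and offset bookkeeping described above can be sketched as follows. This is an illustration under stated assumptions, not the library implementation: the helper `expand_placeholders` and its `placeholders` argument (modality name mapped to a token and its list of replacement strings) are invented for the example, and the sketch returns new strings instead of modifying the input list in place.

```python
import re


def expand_placeholders(texts, placeholders):
    """Expand placeholder tokens in order and record where each one ended up."""
    token_to_type = {token: name for name, (token, _) in placeholders.items()}
    # One iterator per token, so the i-th occurrence gets the i-th replacement.
    queues = {token: iter(reps) for token, reps in placeholders.values()}
    pattern = re.compile("|".join(re.escape(token) for token in token_to_type))

    new_texts, batch_offsets = [], []
    for text in texts:
        pieces, offsets, cursor, new_pos = [], [], 0, 0
        for match in pattern.finditer(text):
            token = match.group(0)
            replacement = next(queues[token])
            # Copy the untouched text that precedes this placeholder.
            pieces.append(text[cursor:match.start()])
            new_pos += match.start() - cursor
            offsets.append(
                {
                    "type": token_to_type[token],
                    "span": match.span(),
                    "new_span": (new_pos, new_pos + len(replacement)),
                    "text": token,
                    "replacement": replacement,
                }
            )
            pieces.append(replacement)
            new_pos += len(replacement)
            cursor = match.end()
        pieces.append(text[cursor:])
        new_texts.append("".join(pieces))
        batch_offsets.append(offsets)
    return new_texts, batch_offsets


texts = ["<image> What is this?", "Compare <image> with <audio>."]
placeholders = {
    "image": ("<image>", ["<image><image>", "<image><image><image>"]),
    "audio": ("<audio>", ["<audio><audio>"]),
}
new_texts, offsets = expand_placeholders(texts, placeholders)
print(new_texts)
```

Slicing an expanded string with a recorded `new_span` gives back the corresponding `replacement`, which is a convenient sanity check on the offsets.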
	#### parse_response[[transformers.ProcessorMixin.parse_response]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L2186)

	Converts an output string created by generating text from a model into a parsed message dictionary.
	This method is intended for use with chat models, and will read the tokenizer's `response_schema` attribute to
	control parsing, although this can be overridden by passing a `schema` argument directly.

	Parameters:

	response (`str`) : The output string generated by the model. This can be either a decoded string or list of strings, or token IDs as a list/array.

	schema (`Union[list, dict]`, optional) : A response schema that indicates the expected output format and how parsing should be performed. If not provided, the tokenizer's `response_schema` attribute will be used.
	#### post_process_image_text_to_text[[transformers.ProcessorMixin.post_process_image_text_to_text]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L2237)

	Post-process the output of a vision-language model (VLM) to decode the text.

	Parameters:

	generated_outputs (`torch.Tensor` or `np.ndarray`) : The output of the model `generate` function. The output is expected to be a tensor of shape `(batch_size, sequence_length)` or `(sequence_length,)`.

	skip_special_tokens (`bool`, optional, defaults to `True`) : Whether or not to remove special tokens in the output. Argument passed to the tokenizer's `decode` method.

	`**kwargs` : Additional arguments to be passed to the tokenizer's `decode` method.

	Returns:

	``list[str]``

	The decoded text.
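A minimal sketch of this decoding step, assuming plain Python lists in place of tensors and a toy vocabulary: the helper `decode_generated`, the vocabulary and the special tokens are all hypothetical, and a real tokenizer's `decode` does far more than join tokens with spaces.

```python
def decode_generated(generated, id_to_token, special_tokens=(), skip_special_tokens=True):
    """Turn generated token IDs into a list of strings, one per sequence."""
    # Accept a single sequence of shape (sequence_length,) by wrapping it
    # into a batch of one, so the loop below always sees a batch.
    if generated and not isinstance(generated[0], list):
        generated = [generated]
    texts = []
    for sequence in generated:
        tokens = [id_to_token[token_id] for token_id in sequence]
        if skip_special_tokens:
            tokens = [token for token in tokens if token not in special_tokens]
        texts.append(" ".join(tokens))
    return texts


vocab = {0: "<pad>", 1: "<eos>", 2: "a", 3: "red", 4: "stop", 5: "sign"}
special = {"<pad>", "<eos>"}

batch_texts = decode_generated([[2, 3, 4, 5, 1, 0]], vocab, special_tokens=special)
single_text = decode_generated([2, 4, 5, 1], vocab, special_tokens=special)
print(batch_texts, single_text)
```

Both input shapes yield a list of strings, matching the `list[str]` return type documented above.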
	#### post_process_multimodal_output[[transformers.ProcessorMixin.post_process_multimodal_output]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L2208)

	Post-process the output of a multimodal model to return the requested modality output.
	If the model cannot generate the requested modality, an error will be raised.

	Parameters:

	generated_outputs (`torch.Tensor` or `np.ndarray`) : The output of the model `generate` function. The output is expected to be a tensor of shape `(batch_size, sequence_length)` or `(sequence_length,)`.

	skip_special_tokens (`bool`, optional, defaults to `True`) : Whether or not to remove special tokens in the output. Argument passed to the tokenizer's `batch_decode` method.

	generation_mode (`str`, optional) : Generation mode indicating which modality to output. Can be one of `["text", "image", "audio"]`.

	`**kwargs` : Additional arguments to be passed to the tokenizer's `batch_decode` method.

	Returns:

	``list[str]``

	The decoded text.
	#### prepare_inputs_layout[[transformers.ProcessorMixin.prepare_inputs_layout]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L703)

	Normalize and prefetch inputs before processing. Wraps text in a list for multimodal
	processors, fetches remote images and audio if URLs are provided, and ensures audio
	is properly batched. Returns the normalized `(images, text, videos, audio)` tuple.
	#### push_to_hub[[transformers.ProcessorMixin.push_to_hub]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/utils/hub.py#L720)

	Upload the processor files to the 🤗 Model Hub.

	Examples:

	```python
	from transformers import AutoProcessor

	processor = AutoProcessor.from_pretrained("google-bert/bert-base-cased")

	# Push the processor to your namespace with the name "my-finetuned-bert".
	processor.push_to_hub("my-finetuned-bert")

	# Push the processor to an organization with the name "my-finetuned-bert".
	processor.push_to_hub("huggingface/my-finetuned-bert")
	```

	Parameters:

	repo_id (`str`) : The name of the repository you want to push your processor to. It should contain your organization name when pushing to a given organization.

	commit_message (`str`, optional) : Message to commit while pushing. Will default to `"Upload processor"`.

	commit_description (`str`, optional) : The description of the commit that will be created

	private (`bool`, optional) : Whether to make the repo private. If `None` (default), the repo will be public unless the organization's default is private. This value is ignored if the repo already exists.

	token (`bool` or `str`, optional) : The token to use as HTTP bearer authorization for remote files. If `True` (default), will use the token generated when running `hf auth login` (stored in `~/.huggingface`).

	revision (`str`, optional) : Branch to push the uploaded files to.

	create_pr (`bool`, optional, defaults to `False`) : Whether or not to create a PR with the uploaded files or directly commit.

	max_shard_size (`int` or `str`, optional, defaults to `"50GB"`) : Only applicable for models. The maximum size for a checkpoint before being sharded. Each checkpoint shard will then be smaller than this size. If expressed as a string, needs to be digits followed by a unit (like `"5MB"`).

	tags (`list[str]`, optional) : List of tags to push on the Hub.
	#### register_for_auto_class[[transformers.ProcessorMixin.register_for_auto_class]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1714)

	Register this class with a given auto class. This should only be used for custom processors, as the ones
	in the library are already mapped with `AutoProcessor`.

	Parameters:

	auto_class (`str` or `type`, optional, defaults to `"AutoProcessor"`) : The auto class to register this new processor with.
	#### save_pretrained[[transformers.ProcessorMixin.save_pretrained]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1063)

	Saves the attributes of this processor (feature extractor, tokenizer...) in the specified directory so that it
	can be reloaded using the [from_pretrained()](/docs/transformers/main/zh/main_classes/processors#transformers.ProcessorMixin.from_pretrained) method.

	This method simply calls [save_pretrained()](/docs/transformers/main/zh/main_classes/feature_extractor#transformers.FeatureExtractionMixin.save_pretrained) and
	[save_pretrained()](/docs/transformers/main/zh/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.save_pretrained). Please refer to the docstrings of the
	methods above for more information.

	Parameters:

	save_directory (`str` or `os.PathLike`) : Directory where the feature extractor JSON file and the tokenizer files will be saved (directory will be created if it does not exist).

	push_to_hub (`bool`, optional, defaults to `False`) : Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the repository you want to push to with `repo_id` (will default to the name of `save_directory` in your namespace).

	kwargs (`dict[str, Any]`, optional) : Additional key word arguments passed along to the [push_to_hub()](/docs/transformers/main/zh/main_classes/model#transformers.utils.PushToHubMixin.push_to_hub) method.
	#### to_dict[[transformers.ProcessorMixin.to_dict]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L973)

	Serializes this instance to a Python dictionary.

	Returns:

	``dict[str, Any]``

	Dictionary of all the attributes that make up this processor instance.
	#### to_json_file[[transformers.ProcessorMixin.to_json_file]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1047)

	Save this instance to a JSON file.

	Parameters:

	json_file_path (`str` or `os.PathLike`) : Path to the JSON file in which this processor instance's parameters will be saved.
	#### to_json_string[[transformers.ProcessorMixin.to_json_string]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1036)

	Serializes this instance to a JSON string.

	Returns:

	``str``

	String containing all the attributes that make up this processor instance in JSON format.
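The relationship between `to_dict`, `to_json_string`, `to_json_file` and loading from a dictionary can be shown with a small round trip. The class `ToyProcessorConfig`, its attributes, its `from_dict` helper and the file name used here are assumptions made for this sketch; the real methods serialize many more attributes and handle nested objects.

```python
import json
import os
import tempfile


class ToyProcessorConfig:
    """Hypothetical object mimicking the to_dict / to_json_* trio."""

    def __init__(self, image_token="<image>", image_seq_length=4):
        self.image_token = image_token
        self.image_seq_length = image_seq_length

    def to_dict(self):
        # Return a copy so callers cannot mutate the instance by accident.
        return dict(vars(self))

    def to_json_string(self):
        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"

    def to_json_file(self, json_file_path):
        with open(json_file_path, "w", encoding="utf-8") as writer:
            writer.write(self.to_json_string())

    @classmethod
    def from_dict(cls, processor_dict, **kwargs):
        # Explicit keyword arguments take precedence over stored values.
        return cls(**{**processor_dict, **kwargs})


config = ToyProcessorConfig(image_seq_length=16)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "processor_config.json")  # illustrative file name
    config.to_json_file(path)
    with open(path, encoding="utf-8") as reader:
        restored = ToyProcessorConfig.from_dict(json.load(reader))

print(restored.to_json_string())
```

Saving and reloading should give back an object with the same attributes, which is the property that `save_pretrained` and `from_pretrained` rely on at a larger scale.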
	#### validate_inputs[[transformers.ProcessorMixin.validate_inputs]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L733)

	Validate that at least one input is provided and that no deprecated keyword arguments
	are used. Raises `ValueError` otherwise.

	Override when the processor needs additional validation on the input args.
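As a rough sketch of the two checks described above, assuming a made-up set of deprecated argument names: the function `check_inputs` and the name `legacy_flag` are hypothetical and only illustrate the control flow.

```python
DEPRECATED_KWARGS = {"legacy_flag"}  # hypothetical deprecated argument name


def check_inputs(images=None, text=None, videos=None, audio=None, **kwargs):
    """Raise ValueError if no input is given or a deprecated kwarg is used."""
    if all(value is None for value in (images, text, videos, audio)):
        raise ValueError("You must provide at least one input.")
    used = DEPRECATED_KWARGS.intersection(kwargs)
    if used:
        raise ValueError(f"Deprecated keyword arguments: {sorted(used)}")


check_inputs(text="Describe the image.")  # passes silently

try:
    check_inputs()
except ValueError as error:
    print(error)
```

A subclass that needs stricter rules, for example requiring text whenever images are passed, could add its own checks after calling the shared ones.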

	## Deprecated processors[[transformers.DataProcessor]]

	All processors follow the same architecture, which is that of the [DataProcessor](/docs/transformers/main/zh/main_classes/processors#transformers.DataProcessor). The processor returns a list of [InputExample](/docs/transformers/main/zh/main_classes/processors#transformers.InputExample). These [InputExample](/docs/transformers/main/zh/main_classes/processors#transformers.InputExample) can be converted to [InputFeatures](/docs/transformers/main/zh/main_classes/processors#transformers.InputFeatures) in order to be fed to the model.

	#### transformers.DataProcessor[[transformers.DataProcessor]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L78)

	Base class for data converters for sequence classification data sets.

	#### get_dev_examples[[transformers.DataProcessor.get_dev_examples]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L95)

	`get_dev_examples(data_dir)`

	Gets a collection of [InputExample](/docs/transformers/main/zh/main_classes/processors#transformers.InputExample) for the dev set.
	#### get_example_from_tensor_dict[[transformers.DataProcessor.get_example_from_tensor_dict]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L81)

	Gets an example from a dict.

	Parameters:

	tensor_dict : Keys and values should match the corresponding Glue tensorflow_dataset examples.
	#### get_labels[[transformers.DataProcessor.get_labels]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L103)

	Gets the list of labels for this data set.
	#### get_test_examples[[transformers.DataProcessor.get_test_examples]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L99)

	Gets a collection of [InputExample](/docs/transformers/main/zh/main_classes/processors#transformers.InputExample) for the test set.
	#### get_train_examples[[transformers.DataProcessor.get_train_examples]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L91)

	Gets a collection of [InputExample](/docs/transformers/main/zh/main_classes/processors#transformers.InputExample) for the train set.
	#### tfds_map[[transformers.DataProcessor.tfds_map]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L107)

	Some tensorflow_datasets datasets are not formatted the same way the GLUE datasets are. This method converts
	examples to the correct format.

	#### transformers.InputExample[[transformers.InputExample]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L28)

	A single training/test example for simple sequence classification.

	#### to_json_string[[transformers.InputExample.to_json_string]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L47)

	Serializes this instance to a JSON string.

	Parameters:

	guid : Unique id for the example.

	text_a : string. The untokenized text of the first sequence. For single sequence tasks, only this sequence must be specified.

	text_b : (Optional) string. The untokenized text of the second sequence. Only must be specified for sequence pair tasks.

	label : (Optional) string. The label of the example. This should be specified for train and dev examples, but not for test examples.
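The fields listed above map naturally onto a small data container. The sketch below uses a hypothetical `Example` dataclass with the same four fields; it is a stand-in for illustration and not the library's `InputExample` class.

```python
import dataclasses
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Example:
    """Hypothetical stand-in with the same four fields as described above."""

    guid: str
    text_a: str
    text_b: Optional[str] = None  # only needed for sequence-pair tasks
    label: Optional[str] = None  # omitted for test examples

    def to_json_string(self) -> str:
        return json.dumps(dataclasses.asdict(self), indent=2) + "\n"


# A single-sequence example and a sequence-pair example.
single = Example(guid="train-1", text_a="The movie was great.", label="positive")
pair = Example(
    guid="dev-7",
    text_a="A man is eating.",
    text_b="Someone is eating.",
    label="entailment",
)
print(pair.to_json_string())
```

Single-sequence tasks leave `text_b` as `None`, while pair tasks such as entailment fill both text fields.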

	#### transformers.InputFeatures[[transformers.InputFeatures]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L53)

	A single set of features of data. Property names are the same names as the corresponding inputs to a model.

	#### to_json_string[[transformers.InputFeatures.to_json_string]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L73)

	Serializes this instance to a JSON string.

	Parameters:

	input_ids : Indices of input sequence tokens in the vocabulary.

	attention_mask : Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: Usually `1` for tokens that are NOT MASKED, `0` for MASKED (padded) tokens.

	token_type_ids : (Optional) Segment token indices to indicate first and second portions of the inputs. Only some models use them.

	label : (Optional) Label corresponding to the input. Int for classification problems, float for regression problems.

	## GLUE[[transformers.glue_convert_examples_to_features]]

	[General Language Understanding Evaluation (GLUE)](https://gluebenchmark.com/) is a benchmark that evaluates the performance of models across a diverse set of existing natural language understanding tasks. It was released together with the paper [GLUE: A multi-task benchmark and analysis platform for natural language understanding](https://openreview.net/pdf?id=rJ4km2R5t7).

	This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched), CoLA, SST2, STSB, QQP, QNLI, RTE and WNLI.

	Those processors are:

	- `~data.processors.utils.MrpcProcessor`
	- `~data.processors.utils.MnliProcessor`
	- `~data.processors.utils.MnliMismatchedProcessor`
	- `~data.processors.utils.ColaProcessor`
	- `~data.processors.utils.Sst2Processor`
	- `~data.processors.utils.StsbProcessor`
	- `~data.processors.utils.QqpProcessor`
	- `~data.processors.utils.QnliProcessor`
	- `~data.processors.utils.RteProcessor`
	- `~data.processors.utils.WnliProcessor`

	Additionally, the following method can be used to load values from a data file and convert them to a list of [InputExample](/docs/transformers/main/zh/main_classes/processors#transformers.InputExample).

	#### transformers.glue_convert_examples_to_features[[transformers.glue_convert_examples_to_features]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/glue.py#L35)

	Loads a data file into a list of `InputFeatures`

	Parameters:

	examples : List of `InputExamples` containing the examples.

	tokenizer : Instance of a tokenizer that will tokenize the examples

	max_length : Maximum example length. Defaults to the tokenizer's max_len

	task : GLUE task

	label_list : List of labels. Can be obtained from the processor using the `processor.get_labels()` method

	output_mode : String indicating the output mode. Either `regression` or `classification`

	Returns:

	Will return a list of task-specific `InputFeatures` which can be fed to the model.
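To show what such a conversion involves, here is a minimal sketch that tokenizes, truncates, pads and maps labels. Everything in it is an assumption made for illustration: the function `convert_examples_to_features`, the whitespace "tokenizer", the toy vocabulary and the `(text, label)` tuples stand in for a real tokenizer and real `InputExample` objects.

```python
def convert_examples_to_features(examples, vocab, max_length, label_list=None,
                                 output_mode="classification"):
    """Turn (text, label) pairs into fixed-length feature dictionaries."""
    label_map = {label: index for index, label in enumerate(label_list or [])}
    features = []
    for text, label in examples:
        # Toy tokenization: lowercase, split on whitespace, truncate.
        words = text.lower().split()[:max_length]
        input_ids = [vocab.get(word, vocab["[UNK]"]) for word in words]
        attention_mask = [1] * len(input_ids)
        # Pad up to max_length; padded positions get attention mask 0.
        padding = max_length - len(input_ids)
        input_ids += [vocab["[PAD]"]] * padding
        attention_mask += [0] * padding
        # Classification maps labels to indices; regression parses a float.
        target = label_map[label] if output_mode == "classification" else float(label)
        features.append(
            {"input_ids": input_ids, "attention_mask": attention_mask, "label": target}
        )
    return features


vocab = {"[PAD]": 0, "[UNK]": 1, "a": 2, "great": 3, "movie": 4, "boring": 5}
examples = [("A great movie", "1"), ("A boring and slow movie", "0")]
features = convert_examples_to_features(examples, vocab, max_length=4, label_list=["0", "1"])
print(features)
```

The first example is padded to four tokens and the second is truncated to four, so every feature has the same length regardless of the input text.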

	## XNLI

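	[The Cross-Lingual NLI Corpus (XNLI)](https://www.nyu.edu/projects/bowman/xnli/) is a benchmark that evaluates the quality of cross-lingual text representations. XNLI is a crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/): pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource languages such as English and low-resource languages such as Swahili).

	It was released together with the paper [XNLI: Evaluating Cross-lingual Sentence Representations](https://huggingface.co/papers/1809.05053).

	This library hosts the processor to load the XNLI data: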

	- `~data.processors.utils.XnliProcessor`

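	Please note that since the gold labels are available on the test set, evaluation is performed on the test set.

	An example using these processors is given in the [run_xnli.py](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification/run_xnli.py) script.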

	## SQuAD

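	[The Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer//) is a benchmark that evaluates the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version (v1.1) was released together with the paper [SQuAD: 100,000+ Questions for Machine Comprehension of Text](https://huggingface.co/papers/1606.05250). The second version (v2.0) was released alongside the paper [Know What You Don't Know: Unanswerable Questions for SQuAD](https://huggingface.co/papers/1806.03822).

	This library hosts a processor for each of the two versions: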

	### Processors[[transformers.data.processors.squad.SquadProcessor]]

	Those processors are:

	- `~data.processors.utils.SquadV1Processor`
	- `~data.processors.utils.SquadV2Processor`

	They both inherit from the abstract class `~data.processors.utils.SquadProcessor`.

	#### transformers.data.processors.squad.SquadProcessor[[transformers.data.processors.squad.SquadProcessor]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/squad.py#L433)

	Processor for the SQuAD data set. Overridden by SquadV1Processor and SquadV2Processor, used by version 1.1 and
	version 2.0 of SQuAD, respectively.

	#### get_dev_examples[[transformers.data.processors.squad.SquadProcessor.get_dev_examples]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/squad.py#L521)

	`get_dev_examples(data_dir, filename=None)`

	Returns the evaluation example from the data directory.

	Parameters:

	data_dir : Directory containing the data files used for training and evaluating.

	filename : None by default, specify this if the evaluation file has a different name than the original one which is `dev-v1.1.json` and `dev-v2.0.json` for squad versions 1.1 and 2.0 respectively.
	#### get_examples_from_dataset[[transformers.data.processors.squad.SquadProcessor.get_examples_from_dataset]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/squad.py#L466)

	Creates a list of `SquadExample` using a TFDS dataset.

	Examples:

	```python
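	>>> import tensorflow_datasets as tfds
	>>> from transformers.data.processors.squad import SquadV1Processor

	>>> dataset = tfds.load("squad")
	>>> processor = SquadV1Processor()

	>>> # get_examples_from_dataset is a method, so it is called on a processor instance
	>>> training_examples = processor.get_examples_from_dataset(dataset, evaluate=False)
	>>> evaluation_examples = processor.get_examples_from_dataset(dataset, evaluate=True)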
	```

	Parameters:

	dataset : The tfds dataset loaded from tensorflow_datasets.load("squad")

	evaluate : Boolean specifying if in evaluation mode or in training mode

	Returns:

	List of SquadExample
	#### get_train_examples[[transformers.data.processors.squad.SquadProcessor.get_train_examples]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/squad.py#L499)

	Returns the training examples from the data directory.

	Parameters:

	data_dir : Directory containing the data files used for training and evaluating.

	filename : None by default, specify this if the training file has a different name than the original one which is `train-v1.1.json` and `train-v2.0.json` for squad versions 1.1 and 2.0 respectively.

	Additionally, the following method can be used to convert SQuAD examples into `~data.processors.utils.SquadFeatures` that can be used as model inputs.

	#### transformers.squad_convert_examples_to_features[[transformers.squad_convert_examples_to_features]]

	[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/squad.py#L313)

	Converts a list of examples into a list of features that can be directly given as input to a model. It is
	model-dependent and takes advantage of many of the tokenizer's features to create the model's inputs.

	Example:

	```python
	processor = SquadV2Processor()
	examples = processor.get_dev_examples(data_dir)

	features = squad_convert_examples_to_features(
	    examples=examples,
	    tokenizer=tokenizer,
	    max_seq_length=args.max_seq_length,
	    doc_stride=args.doc_stride,
	    max_query_length=args.max_query_length,
	    is_training=not evaluate,
	)
	```

	Parameters:

	examples : list of `SquadExample`

	tokenizer : an instance of a child of [PreTrainedTokenizer](/docs/transformers/main/zh/main_classes/tokenizer#transformers.PythonBackend)

	max_seq_length : The maximum sequence length of the inputs.

	doc_stride : The stride used when the context is too large and is split across several features.

	max_query_length : The maximum length of the query.

	is_training : whether to create features for model evaluation or model training.

	padding_strategy : Which padding strategy to use. Defaults to "max_length".

	return_dataset : Defaults to False. Can also be 'pt', in which case a `torch.utils.data.TensorDataset` is returned.

	threads : multiple processing threads.

	Returns:

	list of `SquadFeatures`
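The role of `doc_stride` can be illustrated with a common sliding-window scheme, in which a long context is cut into overlapping spans whose start positions are at most `doc_stride` tokens apart. This is a sketch of that general idea only: the helper `split_into_spans` and its `max_tokens_per_span` argument (the room left for context after the query and special tokens) are assumptions, and the library function delegates the actual splitting to the tokenizer, so details may differ.

```python
def split_into_spans(num_context_tokens, max_tokens_per_span, doc_stride):
    """Return (start, length) windows that together cover the whole context."""
    if doc_stride <= 0:
        raise ValueError("doc_stride must be positive")
    spans, start = [], 0
    while start < num_context_tokens:
        length = min(num_context_tokens - start, max_tokens_per_span)
        spans.append((start, length))
        if start + length == num_context_tokens:
            break  # the last window reached the end of the context
        # Advance by at most doc_stride so consecutive windows overlap.
        start += min(length, doc_stride)
    return spans


spans = split_into_spans(num_context_tokens=10, max_tokens_per_span=4, doc_stride=3)
print(spans)
```

With these numbers each window shares one token with the previous one, so an answer that straddles a window boundary still has a chance of appearing whole in at least one feature.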

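	These processors as well as the aforementioned method can be used with files containing the data as well as with the tensorflow_datasets package. Examples are given below.

	### Example usage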

	Here is an example using the processors as well as the conversion method using data files:

	```python
	# Loading a V2 processor
	processor = SquadV2Processor()
	examples = processor.get_dev_examples(squad_v2_data_dir)

	# Loading a V1 processor
	processor = SquadV1Processor()
	examples = processor.get_dev_examples(squad_v1_data_dir)

	features = squad_convert_examples_to_features(
	    examples=examples,
	    tokenizer=tokenizer,
	    max_seq_length=max_seq_length,
	    doc_stride=args.doc_stride,
	    max_query_length=max_query_length,
	    is_training=not evaluate,
	)
	```

	Using tensorflow_datasets is as easy as using a data file:

	```python
	# tensorflow_datasets only handles SQuAD V1.
	tfds_examples = tfds.load("squad")
	examples = SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=evaluate)

	features = squad_convert_examples_to_features(
	    examples=examples,
	    tokenizer=tokenizer,
	    max_seq_length=max_seq_length,
	    doc_stride=args.doc_stride,
	    max_query_length=max_query_length,
	    is_training=not evaluate,
	)
	```

	Another example using these processors is given in the [run_squad.py](https://github.com/huggingface/transformers/tree/main/examples/legacy/question-answering/run_squad.py) script.
