| # Processors | |
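| In the Transformers library, "processors" can mean two different things: | |
| - Objects that preprocess inputs for multimodal models such as [Wav2Vec2](../model_doc/wav2vec2) (speech and text) or [CLIP](../model_doc/clip) (text and vision). | |
| - Deprecated objects that were used in older versions of the library to preprocess data for GLUE or SQuAD. | |
| ## Multimodal processors[[transformers.ProcessorMixin]] | |
| Any multimodal model needs an object to encode or decode data that combines several modalities (among text, vision, and audio). This is handled by objects called processors, which group together two or more processing objects such as tokenizers (for the text modality), image processors (for vision), and feature extractors (for audio). | |
| These processors inherit from the following base class, which implements the saving and loading functionality: | |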
| #### transformers.ProcessorMixin[[transformers.ProcessorMixin]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L588) | |
| This is a mixin used to provide saving/loading functionality for all processor classes. | |
| #### apply_chat_template[[transformers.ProcessorMixin.apply_chat_template]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1918) | |
| `apply_chat_template(conversation, chat_template=None, tools=None, documents=None, add_generation_prompt=False, continue_final_message=False, return_assistant_tokens_mask=False, tokenize=False, return_tensors=None, return_dict=False, load_audio_from_video=False, processor_kwargs=None, **kwargs)` | |
| Similar to the `apply_chat_template` method on tokenizers, this method applies a Jinja template to input | |
| conversations to turn them into a single tokenizable string. | |
| The input is expected to be in the following format, where each message content is a list consisting of text and | |
| optionally image or video inputs. One can also provide an image, video, URL or local path which will be used to form | |
| `pixel_values` when `return_dict=True`. If not provided, one will get only the formatted text, optionally tokenized text. | |
| ```python | |
| conversation = [ | |
|     { | |
|         "role": "user", | |
|         "content": [ | |
|             {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"}, | |
|             {"type": "text", "text": "Please describe this image in detail."}, | |
|         ], | |
|     }, | |
| ] | |
| ``` | |
| **Parameters:** | |
| conversation (`Union[list[dict[str, str]], list[list[dict[str, str]]]]`) : The conversation to format. | |
| chat_template (`Optional[str]`, *optional*) : The Jinja template to use for formatting the conversation. If not provided, the tokenizer's chat template is used. | |
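As an illustration of the message layout described above, the following minimal sketch builds one user turn in plain Python. `build_user_message` is a hypothetical helper, not part of Transformers; only the dictionary layout (a `role` plus a `content` list of typed parts) comes from the description above.

```python
def build_user_message(text, image_url=None):
    """Return one user turn whose content is a list of typed parts.

    Hypothetical helper for illustration only; it is not a Transformers API.
    """
    content = []
    if image_url is not None:
        # Image parts come first, mirroring the documented example.
        content.append({"type": "image", "url": image_url})
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


conversation = [
    build_user_message(
        "Please describe this image in detail.",
        image_url="https://www.ilankelman.org/stopsigns/australia.jpg",
    )
]
```

A list built this way has the shape expected for the `conversation` argument; a text-only turn simply omits the image part.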
| #### batch_decode[[transformers.ProcessorMixin.batch_decode]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1872) | |
| This method forwards all its arguments to PreTrainedTokenizer's [batch_decode()](/docs/transformers/main/zh/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.batch_decode). Please | |
| refer to the docstring of this method for more information. | |
| #### check_argument_for_proper_class[[transformers.ProcessorMixin.check_argument_for_proper_class]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L950) | |
| Checks the passed argument's class against the expected transformers class. In case of an unexpected | |
| mismatch between expected and actual class, an error is raised. Otherwise, the proper retrieved class | |
| is returned. | |
| #### create_mm_token_type_ids[[transformers.ProcessorMixin.create_mm_token_type_ids]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L897) | |
| Build per-token modality type IDs for a batch of token_id sequences. | |
| Each position is assigned an integer indicating which modality it belongs to: | |
| `0` for regular text, `1` for image tokens, `2` for video tokens, and | |
| `3` for audio tokens. Membership is determined by comparing against | |
| `self.image_token_ids`, `self.video_token_ids`, and `self.audio_token_ids`. | |
| **Parameters:** | |
| input_ids (*list[list[int]]*) : Batch of token ID sequences. May be unpadded (variable length), so a plain Python list of lists is expected rather than a tensor or uniformly-shaped array. | |
| **Returns:** | |
| `list[list[int]]` | |
| A list of the same structure as `input_ids`, where each | |
| integer is the modality type ID for the corresponding token. | |
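The mapping described for `create_mm_token_type_ids` can be sketched in plain Python. This is a minimal stand-alone illustration, not the library's code: `sketch_mm_token_type_ids` and the token IDs used below are hypothetical, and the ID sets are passed as arguments here rather than read from `self`.

```python
def sketch_mm_token_type_ids(input_ids, image_token_ids=(), video_token_ids=(), audio_token_ids=()):
    """Assign 0 (text), 1 (image), 2 (video), or 3 (audio) to every token ID."""
    image_ids = set(image_token_ids)
    video_ids = set(video_token_ids)
    audio_ids = set(audio_token_ids)

    def modality_of(token_id):
        if token_id in image_ids:
            return 1
        if token_id in video_ids:
            return 2
        if token_id in audio_ids:
            return 3
        return 0

    # Sequences may have different lengths, so work on a plain list of lists.
    return [[modality_of(token_id) for token_id in sequence] for sequence in input_ids]


# Hypothetical vocabulary: 32000 = image, 32001 = video, 32002 = audio placeholder.
batch = [[101, 32000, 32000, 7592], [101, 32001, 2088, 32002, 102]]
type_ids = sketch_mm_token_type_ids(
    batch, image_token_ids=[32000], video_token_ids=[32001], audio_token_ids=[32002]
)
print(type_ids)  # [[0, 1, 1, 0], [0, 2, 0, 3, 0]]
```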
| #### decode[[transformers.ProcessorMixin.decode]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1881) | |
| This method forwards all its arguments to PreTrainedTokenizer's [decode()](/docs/transformers/main/zh/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.decode). Please refer to | |
| the docstring of this method for more information. | |
| #### from_args_and_dict[[transformers.ProcessorMixin.from_args_and_dict]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1411) | |
| Instantiates a type of `~processing_utils.ProcessorMixin` from a Python dictionary of parameters. | |
| **Parameters:** | |
| processor_dict (`dict[str, Any]`) : Dictionary that will be used to instantiate the processor object. Such a dictionary can be retrieved from a pretrained checkpoint by leveraging the `~processing_utils.ProcessorMixin.to_dict` method. | |
| kwargs (`dict[str, Any]`) : Additional parameters from which to initialize the processor object. | |
| **Returns:** | |
| `~processing_utils.ProcessorMixin` | |
| The processor object instantiated from those | |
| parameters. | |
| #### from_pretrained[[transformers.ProcessorMixin.from_pretrained]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1635) | |
| Instantiate a processor associated with a pretrained model. | |
| This class method simply calls the loading methods of the feature extractor | |
| ([from_pretrained()](/docs/transformers/main/zh/main_classes/feature_extractor#transformers.FeatureExtractionMixin.from_pretrained)), the image processor | |
| ([ImageProcessingMixin](/docs/transformers/main/zh/main_classes/image_processor#transformers.ImageProcessingMixin)), and the tokenizer | |
| (`~tokenization_utils_base.PreTrainedTokenizer.from_pretrained`). Please refer to the docstrings of the | |
| methods above for more information. | |
| **Parameters:** | |
| pretrained_model_name_or_path (`str` or `os.PathLike`) : This can be either: | |
| - a string, the *model id* of a pretrained feature_extractor hosted inside a model repo on huggingface.co. | |
| - a path to a *directory* containing a feature extractor file saved using the [save_pretrained()](/docs/transformers/main/zh/main_classes/feature_extractor#transformers.FeatureExtractionMixin.save_pretrained) method, e.g., `./my_model_directory/`. | |
| - a path to a saved feature extractor JSON *file*, e.g., `./my_model_directory/preprocessor_config.json`. | |
| `**kwargs` : Additional keyword arguments passed along to both [from_pretrained()](/docs/transformers/main/zh/main_classes/feature_extractor#transformers.FeatureExtractionMixin.from_pretrained) and `~tokenization_utils_base.PreTrainedTokenizer.from_pretrained`. | |
| #### get_processor_dict[[transformers.ProcessorMixin.get_processor_dict]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1172) | |
| From a `pretrained_model_name_or_path`, resolve to a dictionary of parameters, to be used for instantiating a | |
| processor of type `~processing_utils.ProcessorMixin` using `from_args_and_dict`. | |
| **Parameters:** | |
| pretrained_model_name_or_path (`str` or `os.PathLike`) : The identifier of the pre-trained checkpoint from which we want the dictionary of parameters. | |
| subfolder (`str`, *optional*, defaults to `""`) : In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can specify the folder name here. | |
| **Returns:** | |
| `tuple[dict, dict]` | |
| The dictionary(ies) that will be used to instantiate the processor object. | |
| #### get_text_with_replacements[[transformers.ProcessorMixin.get_text_with_replacements]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L801) | |
| Replace multimodal placeholder tokens in a batch of text strings with their | |
| expanded representations, and return the modified texts alongside offset metadata. | |
| This method is the core text-side preprocessing step for multimodal inputs. It | |
| scans each text in the batch for special tokens (image, video, audio) and replaces | |
| them in-order with the pre-computed replacement strings produced by | |
| *self.replace_image_token* / *self.replace_video_token* / *self.replace_audio_token*. | |
| Replacements are consumed from each modality's list sequentially, so the i-th | |
| occurrence of e.g. `self.image_token` is replaced by `images_replacements[i]`. | |
| To add a new multimodal processor with placeholder tokens, define *self.image_token* as the | |
| placeholder token that appears in the input text and is repeated in the expanded string. Then override | |
| *self.replace_image_token* to return the correct replacement string for the image at index *i*. The same | |
| applies to all other supported modalities. | |
| **Parameters:** | |
| text (*list[str]*) : Batch of raw text strings, each potentially containing multimodal placeholder tokens. Note that it will be modified in-place and returned. | |
| images_replacements (*list[str]*, *optional*, defaults to *[]*) : Expanded replacement strings for each image, in the order they appear across the batch. Produced by *self._process_images*. | |
| videos_replacements (*list[str]*, *optional*, defaults to *[]*) : Expanded replacement strings for each video. Produced by *self._process_videos*. | |
| audio_replacements (*list[str]*, *optional*, defaults to *[]*) : Expanded replacement strings for each audio input. Produced by *self._process_audio*. | |
| **Returns:** | |
| `tuple[list[str], list[dict[str, Any]]]` | |
| A tuple of: | |
| - The modified *text* batch with all placeholder tokens expanded. | |
| - *batch_replacement_offsets*: one entry per batch item, each being a | |
| list of dicts with keys: | |
| - *"type"* (*str*): modality name — *"image"*, *"video"*, or *"audio"* | |
| - *"span"* (*tuple[int, int]*): original *(start, end)* char offsets of the placeholder token | |
| - *"new_span"* (*tuple[int, int]*): *(start, end)* offsets of placeholder in the expanded string | |
| - *"text"* (*str*): the original placeholder token string that was matched | |
| - *"replacement"* (*str*): the string it was replaced with | |
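The sequential replacement and the offset metadata described above can be sketched as follows. This is a simplified stand-alone illustration under stated assumptions, not the library's implementation: `sketch_replace_placeholders`, its argument layout, and the `<image>`/`<audio>` placeholder strings are hypothetical, at least one placeholder is assumed to be defined, and a new list is returned instead of modifying `text` in place.

```python
import re


def sketch_replace_placeholders(texts, placeholders, replacements):
    """Expand placeholder tokens in order and record where each one ended up.

    placeholders: modality name -> placeholder token, e.g. {"image": "<image>"}
    replacements: modality name -> replacement strings, consumed across the batch
    """
    token_to_type = {token: name for name, token in placeholders.items()}
    pattern = re.compile("|".join(re.escape(token) for token in token_to_type))
    queues = {name: iter(replacements.get(name, [])) for name in placeholders}

    new_texts, batch_offsets = [], []
    for text in texts:
        pieces, offsets, cursor, new_length = [], [], 0, 0
        for match in pattern.finditer(text):
            modality = token_to_type[match.group()]
            replacement = next(queues[modality])  # the i-th occurrence gets the i-th replacement
            prefix = text[cursor:match.start()]
            new_start = new_length + len(prefix)
            new_length = new_start + len(replacement)
            pieces += [prefix, replacement]
            offsets.append({
                "type": modality,
                "span": (match.start(), match.end()),
                "new_span": (new_start, new_length),
                "text": match.group(),
                "replacement": replacement,
            })
            cursor = match.end()
        pieces.append(text[cursor:])
        new_texts.append("".join(pieces))
        batch_offsets.append(offsets)
    return new_texts, batch_offsets


new_texts, offsets = sketch_replace_placeholders(
    ["<image> and <image>", "hear <audio>"],
    placeholders={"image": "<image>", "audio": "<audio>"},
    replacements={"image": ["<image>" * 2, "<image>" * 3], "audio": ["<audio>" * 2]},
)
```

In this sketch each `new_span` indexes into the expanded string, so slicing `new_texts[0]` with the second entry's `new_span` yields exactly the second replacement.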
| #### parse_response[[transformers.ProcessorMixin.parse_response]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L2186) | |
| Converts an output string created by generating text from a model into a parsed message dictionary. | |
| This method is intended for use with chat models, and will read the tokenizer's `response_schema` attribute to | |
| control parsing, although this can be overridden by passing a `schema` argument directly. | |
| **Parameters:** | |
| response (`str`) : The output string generated by the model. This can be either a decoded string or list of strings, or token IDs as a list/array. | |
| schema (`Union[list, dict]`, *optional*) : A response schema that indicates the expected output format and how parsing should be performed. If not provided, the tokenizer's `response_schema` attribute will be used. | |
| #### post_process_image_text_to_text[[transformers.ProcessorMixin.post_process_image_text_to_text]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L2237) | |
| Post-process the output of a vision-language model (VLM) to decode the text. | |
| **Parameters:** | |
| generated_outputs (`torch.Tensor` or `np.ndarray`) : The output of the model `generate` function. The output is expected to be a tensor of shape `(batch_size, sequence_length)` or `(sequence_length,)`. | |
| skip_special_tokens (`bool`, *optional*, defaults to `True`) : Whether or not to remove special tokens in the output. Argument passed to the tokenizer's `decode` method. | |
| `**kwargs` : Additional arguments to be passed to the tokenizer's `decode` method. | |
| **Returns:** | |
| `list[str]` | |
| The decoded text. | |
| #### post_process_multimodal_output[[transformers.ProcessorMixin.post_process_multimodal_output]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L2208) | |
| Post-process the output of a multimodal model to return the requested modality output. | |
| If the model cannot generate the requested modality, an error will be raised. | |
| **Parameters:** | |
| generated_outputs (`torch.Tensor` or `np.ndarray`) : The output of the model `generate` function. The output is expected to be a tensor of shape `(batch_size, sequence_length)` or `(sequence_length,)`. | |
| skip_special_tokens (`bool`, *optional*, defaults to `True`) : Whether or not to remove special tokens in the output. Argument passed to the tokenizer's `batch_decode` method. | |
| generation_mode (`str`, *optional*) : Generation mode indicating which modality to output. Can be one of `["text", "image", "audio"]`. | |
| `**kwargs` : Additional arguments to be passed to the tokenizer's `batch_decode` method. | |
| **Returns:** | |
| `list[str]` | |
| The decoded text. | |
| #### prepare_inputs_layout[[transformers.ProcessorMixin.prepare_inputs_layout]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L703) | |
| Normalize and prefetch inputs before processing. Wraps text in a list for multimodal | |
| processors, fetches remote images and audio if URLs are provided, and ensures audio | |
| is properly batched. Returns the normalized `(images, text, videos, audio)` tuple. | |
| #### push_to_hub[[transformers.ProcessorMixin.push_to_hub]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/utils/hub.py#L720) | |
| Upload the processor files to the 🤗 Model Hub. | |
| Examples: | |
| ```python | |
| from transformers import AutoProcessor | |
| processor = AutoProcessor.from_pretrained("google-bert/bert-base-cased") | |
| # Push the processor to your namespace with the name "my-finetuned-bert". | |
| processor.push_to_hub("my-finetuned-bert") | |
| # Push the processor to an organization with the name "my-finetuned-bert". | |
| processor.push_to_hub("huggingface/my-finetuned-bert") | |
| ``` | |
| **Parameters:** | |
| repo_id (`str`) : The name of the repository you want to push your processor to. It should contain your organization name when pushing to a given organization. | |
| commit_message (`str`, *optional*) : Message to commit while pushing. Will default to `"Upload processor"`. | |
| commit_description (`str`, *optional*) : The description of the commit that will be created. | |
| private (`bool`, *optional*) : Whether to make the repo private. If `None` (default), the repo will be public unless the organization's default is private. This value is ignored if the repo already exists. | |
| token (`bool` or `str`, *optional*) : The token to use as HTTP bearer authorization for remote files. If `True` (default), will use the token generated when running `hf auth login` (stored in `~/.huggingface`). | |
| revision (`str`, *optional*) : Branch to push the uploaded files to. | |
| create_pr (`bool`, *optional*, defaults to `False`) : Whether or not to create a PR with the uploaded files or directly commit. | |
| max_shard_size (`int` or `str`, *optional*, defaults to `"50GB"`) : Only applicable for models. The maximum size for a checkpoint before being sharded. Each checkpoint shard will then be smaller than this size. If expressed as a string, needs to be digits followed by a unit (like `"5MB"`). | |
| tags (`list[str]`, *optional*) : List of tags to push on the Hub. | |
| #### register_for_auto_class[[transformers.ProcessorMixin.register_for_auto_class]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1714) | |
| Register this class with a given auto class. This should only be used for custom processors, as the ones | |
| in the library are already mapped with `AutoProcessor`. | |
| **Parameters:** | |
| auto_class (`str` or `type`, *optional*, defaults to `"AutoProcessor"`) : The auto class to register this new processor with. | |
| #### save_pretrained[[transformers.ProcessorMixin.save_pretrained]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1063) | |
| Saves the attributes of this processor (feature extractor, tokenizer...) in the specified directory so that it | |
| can be reloaded using the [from_pretrained()](/docs/transformers/main/zh/main_classes/processors#transformers.ProcessorMixin.from_pretrained) method. | |
| This method simply calls the feature extractor's [save_pretrained()](/docs/transformers/main/zh/main_classes/feature_extractor#transformers.FeatureExtractionMixin.save_pretrained) and the tokenizer's | |
| [save_pretrained()](/docs/transformers/main/zh/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.save_pretrained). Please refer to the docstrings of the | |
| methods above for more information. | |
| **Parameters:** | |
| save_directory (`str` or `os.PathLike`) : Directory where the feature extractor JSON file and the tokenizer files will be saved (directory will be created if it does not exist). | |
| push_to_hub (`bool`, *optional*, defaults to `False`) : Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the repository you want to push to with `repo_id` (will default to the name of `save_directory` in your namespace). | |
| kwargs (`dict[str, Any]`, *optional*) : Additional key word arguments passed along to the [push_to_hub()](/docs/transformers/main/zh/main_classes/model#transformers.utils.PushToHubMixin.push_to_hub) method. | |
| #### to_dict[[transformers.ProcessorMixin.to_dict]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L973) | |
| Serializes this instance to a Python dictionary. | |
| **Returns:** | |
| `dict[str, Any]` | |
| Dictionary of all the attributes that make up this processor instance. | |
| #### to_json_file[[transformers.ProcessorMixin.to_json_file]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1047) | |
| Save this instance to a JSON file. | |
| **Parameters:** | |
| json_file_path (`str` or `os.PathLike`) : Path to the JSON file in which this processor instance's parameters will be saved. | |
| #### to_json_string[[transformers.ProcessorMixin.to_json_string]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L1036) | |
| Serializes this instance to a JSON string. | |
| **Returns:** | |
| `str` | |
| String containing all the attributes that make up this processor instance in JSON format. | |
| #### validate_inputs[[transformers.ProcessorMixin.validate_inputs]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/processing_utils.py#L733) | |
| Validate that at least one input is provided and that no deprecated keyword arguments | |
| are used. Raises `ValueError` otherwise. | |
| Override when the processor needs additional validation on the input args. | |
| ## Deprecated processors[[transformers.DataProcessor]] | |
| All processors follow the same architecture, which is that of the [DataProcessor](/docs/transformers/main/zh/main_classes/processors#transformers.DataProcessor). The processor returns a list of [InputExample](/docs/transformers/main/zh/main_classes/processors#transformers.InputExample). These [InputExample](/docs/transformers/main/zh/main_classes/processors#transformers.InputExample) can be converted to [InputFeatures](/docs/transformers/main/zh/main_classes/processors#transformers.InputFeatures) in order to be fed to the model. | |
| #### transformers.DataProcessor[[transformers.DataProcessor]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L78) | |
| Base class for data converters for sequence classification data sets. | |
| #### get_dev_examples[[transformers.DataProcessor.get_dev_examples]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L95) | |
| `get_dev_examples(data_dir)` | |
| Gets a collection of [InputExample](/docs/transformers/main/zh/main_classes/processors#transformers.InputExample) for the dev set. | |
| #### get_example_from_tensor_dict[[transformers.DataProcessor.get_example_from_tensor_dict]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L81) | |
| Gets an example from a dict. | |
| **Parameters:** | |
| tensor_dict : Keys and values should match the corresponding Glue tensorflow_dataset examples. | |
| #### get_labels[[transformers.DataProcessor.get_labels]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L103) | |
| Gets the list of labels for this data set. | |
| #### get_test_examples[[transformers.DataProcessor.get_test_examples]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L99) | |
| Gets a collection of [InputExample](/docs/transformers/main/zh/main_classes/processors#transformers.InputExample) for the test set. | |
| #### get_train_examples[[transformers.DataProcessor.get_train_examples]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L91) | |
| Gets a collection of [InputExample](/docs/transformers/main/zh/main_classes/processors#transformers.InputExample) for the train set. | |
| #### tfds_map[[transformers.DataProcessor.tfds_map]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L107) | |
| Some tensorflow_datasets datasets are not formatted the same way the GLUE datasets are. This method converts | |
| examples to the correct format. | |
| #### transformers.InputExample[[transformers.InputExample]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L28) | |
| A single training/test example for simple sequence classification. | |
| **Parameters:** | |
| guid : Unique id for the example. | |
| text_a : string. The untokenized text of the first sequence. For single sequence tasks, only this sequence must be specified. | |
| text_b : (Optional) string. The untokenized text of the second sequence. Only must be specified for sequence pair tasks. | |
| label : (Optional) string. The label of the example. This should be specified for train and dev examples, but not for test examples. | |
| #### to_json_string[[transformers.InputExample.to_json_string]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L47) | |
| Serializes this instance to a JSON string. | |
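The `InputExample` fields and JSON serialization can be illustrated with a small stand-alone dataclass. `SketchInputExample` is a hypothetical stand-in written for this page, not the library class; the exact JSON formatting of the real `to_json_string` is an assumption.

```python
import dataclasses
import json
from typing import Optional


@dataclasses.dataclass
class SketchInputExample:
    """Hypothetical stand-in mirroring the documented InputExample fields."""

    guid: str
    text_a: str
    text_b: Optional[str] = None  # only needed for sequence-pair tasks
    label: Optional[str] = None   # omitted for test examples

    def to_json_string(self) -> str:
        # Assumed formatting: indented JSON followed by a newline.
        return json.dumps(dataclasses.asdict(self), indent=2) + "\n"


example = SketchInputExample(
    guid="train-1",
    text_a="The cat sat on the mat.",
    text_b="A cat is sitting.",
    label="entailment",
)
serialized = example.to_json_string()
```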
| #### transformers.InputFeatures[[transformers.InputFeatures]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L53) | |
| A single set of features of data. Property names are the same names as the corresponding inputs to a model. | |
| **Parameters:** | |
| input_ids : Indices of input sequence tokens in the vocabulary. | |
| attention_mask : Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: Usually `1` for tokens that are NOT MASKED, `0` for MASKED (padded) tokens. | |
| token_type_ids : (Optional) Segment token indices to indicate first and second portions of the inputs. Only some models use them. | |
| label : (Optional) Label corresponding to the input. Int for classification problems, float for regression problems. | |
| #### to_json_string[[transformers.InputFeatures.to_json_string]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/utils.py#L73) | |
| Serializes this instance to a JSON string. | |
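To make the relationship between the `input_ids` and `attention_mask` fields of `InputFeatures` concrete, here is a minimal padding sketch. `pad_with_attention_mask`, the token IDs, and the padding ID `0` are hypothetical choices for illustration; in practice a tokenizer produces both fields.

```python
def pad_with_attention_mask(input_ids, max_length, pad_token_id=0):
    """Right-pad a sequence and build the matching attention mask."""
    if len(input_ids) > max_length:
        raise ValueError("sequence is longer than max_length")
    num_padding = max_length - len(input_ids)
    # 1 marks real tokens, 0 marks padding positions that attention should ignore.
    attention_mask = [1] * len(input_ids) + [0] * num_padding
    padded_ids = list(input_ids) + [pad_token_id] * num_padding
    return padded_ids, attention_mask


padded_ids, attention_mask = pad_with_attention_mask([101, 7592, 102], max_length=5)
print(padded_ids)      # [101, 7592, 102, 0, 0]
print(attention_mask)  # [1, 1, 1, 0, 0]
```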
| ## GLUE[[transformers.glue_convert_examples_to_features]] | |
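| [General Language Understanding Evaluation (GLUE)](https://gluebenchmark.com/) is a benchmark that evaluates the performance of models across a diverse set of existing natural language understanding tasks. It was released together with the paper [GLUE: A multi-task benchmark and analysis platform for natural language understanding](https://openreview.net/pdf?id=rJ4km2R5t7). | |
| This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched), CoLA, SST2, STSB, QQP, QNLI, RTE, and WNLI. | |
| Those processors are: | |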
| - `~data.processors.utils.MrpcProcessor` | |
| - `~data.processors.utils.MnliProcessor` | |
| - `~data.processors.utils.MnliMismatchedProcessor` | |
| - `~data.processors.utils.ColaProcessor` | |
| - `~data.processors.utils.Sst2Processor` | |
| - `~data.processors.utils.StsbProcessor` | |
| - `~data.processors.utils.QqpProcessor` | |
| - `~data.processors.utils.QnliProcessor` | |
| - `~data.processors.utils.RteProcessor` | |
| - `~data.processors.utils.WnliProcessor` | |
| Additionally, the following method can be used to load values from a data file and convert them to a list of [InputExample](/docs/transformers/main/zh/main_classes/processors#transformers.InputExample). | |
| #### transformers.glue_convert_examples_to_features[[transformers.glue_convert_examples_to_features]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/glue.py#L35) | |
| Loads a data file into a list of `InputFeatures`. | |
| **Parameters:** | |
| examples : List of `InputExamples` containing the examples. | |
| tokenizer : Instance of a tokenizer that will tokenize the examples | |
| max_length : Maximum example length. Defaults to the tokenizer's max_len | |
| task : GLUE task | |
| label_list : List of labels. Can be obtained from the processor using the `processor.get_labels()` method | |
| output_mode : String indicating the output mode. Either `regression` or `classification` | |
| **Returns:** | |
| Will return a list of task-specific `InputFeatures` which can be fed to the model. | |
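The roles of `label_list` and `output_mode` can be sketched as a small label-encoding function. This is a hedged illustration of the idea only: `encode_label` is a hypothetical helper, and the real function additionally tokenizes the examples.

```python
def encode_label(label, label_list, output_mode):
    """Turn a raw string label into the value stored in the features."""
    if label is None:
        # Test examples carry no label.
        return None
    if output_mode == "classification":
        # Each label string is mapped to its position in label_list.
        label_map = {name: index for index, name in enumerate(label_list)}
        return label_map[label]
    if output_mode == "regression":
        return float(label)
    raise ValueError(f"unknown output_mode: {output_mode!r}")


class_id = encode_label("1", ["0", "1"], "classification")
score = encode_label("3.5", None, "regression")
```

Here `class_id` is an integer index into `label_list`, while `score` stays a float, which matches the "Int for classification problems, float for regression problems" convention of `InputFeatures`.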
| ## XNLI | |
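| [The Cross-Lingual NLI Corpus (XNLI)](https://www.nyu.edu/projects/bowman/xnli/) is a benchmark that evaluates the quality of cross-lingual text representations. XNLI is a crowd-sourced dataset based on [*MultiNLI*](http://www.nyu.edu/projects/bowman/multinli/): pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource languages such as English and low-resource languages such as Swahili). | |
| It was released together with the paper [XNLI: Evaluating Cross-lingual Sentence Representations](https://huggingface.co/papers/1809.05053). | |
| This library hosts the processor to load the XNLI data: | |
| - `~data.processors.utils.XnliProcessor` | |
| Please note that since the gold labels are available on the test set, evaluation is performed on the test set. | |
| An example using these processors is given in the [run_xnli.py](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification/run_xnli.py) script. | |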
| ## SQuAD | |
| [The Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer//) is a benchmark that evaluates the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version (v1.1) was released together with the paper [SQuAD: 100,000+ Questions for Machine Comprehension of Text](https://huggingface.co/papers/1606.05250). The second version (v2.0) was released alongside the paper [Know What You Don't Know: Unanswerable Questions for SQuAD](https://huggingface.co/papers/1806.03822). | |
| This library hosts a processor for each of the two versions: | |
| ### Processors[[transformers.data.processors.squad.SquadProcessor]] | |
| Those processors are: | |
| - `~data.processors.utils.SquadV1Processor` | |
| - `~data.processors.utils.SquadV2Processor` | |
| They both inherit from the abstract class `~data.processors.utils.SquadProcessor`. | |
| #### transformers.data.processors.squad.SquadProcessor[[transformers.data.processors.squad.SquadProcessor]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/squad.py#L433) | |
| Processor for the SQuAD data set. Overridden by SquadV1Processor and SquadV2Processor, used by version 1.1 and | |
| version 2.0 of SQuAD, respectively. | |
| #### get_dev_examples[[transformers.data.processors.squad.SquadProcessor.get_dev_examples]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/squad.py#L521) | |
| `get_dev_examples(data_dir, filename=None)` | |
| Returns the evaluation examples from the data directory. | |
| **Parameters:** | |
| data_dir : Directory containing the data files used for training and evaluating. | |
| filename : None by default, specify this if the evaluation file has a different name than the original one which is `dev-v1.1.json` and `dev-v2.0.json` for squad versions 1.1 and 2.0 respectively. | |
| #### get_examples_from_dataset[[transformers.data.processors.squad.SquadProcessor.get_examples_from_dataset]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/squad.py#L466) | |
| Creates a list of `SquadExample` using a TFDS dataset. | |
| Examples: | |
| ```python | |
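| >>> import tensorflow_datasets as tfds | |
| >>> from transformers.data.processors.squad import SquadV1Processor | |
| >>> dataset = tfds.load("squad") | |
| >>> processor = SquadV1Processor() | |
| >>> training_examples = processor.get_examples_from_dataset(dataset, evaluate=False) | |
| >>> evaluation_examples = processor.get_examples_from_dataset(dataset, evaluate=True) | |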
| ``` | |
| **Parameters:** | |
| dataset : The tfds dataset loaded from *tensorflow_datasets.load("squad")* | |
| evaluate : Boolean specifying if in evaluation mode or in training mode | |
| **Returns:** | |
| List of SquadExample | |
| #### get_train_examples[[transformers.data.processors.squad.SquadProcessor.get_train_examples]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/squad.py#L499) | |
| Returns the training examples from the data directory. | |
| **Parameters:** | |
| data_dir : Directory containing the data files used for training and evaluating. | |
| filename : None by default, specify this if the training file has a different name than the original one which is `train-v1.1.json` and `train-v2.0.json` for squad versions 1.1 and 2.0 respectively. | |
| Additionally, the following method can be used to convert SQuAD examples into `~data.processors.utils.SquadFeatures` that can be used as model inputs. | |
| #### transformers.squad_convert_examples_to_features[[transformers.squad_convert_examples_to_features]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/squad.py#L313) | |
| Converts a list of examples into a list of features that can be directly given as input to a model. It is | |
| model-dependent and takes advantage of many of the tokenizer's features to create the model's inputs. | |
| Example: | |
| ```python | |
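| from transformers.data.processors.squad import SquadV2Processor, squad_convert_examples_to_features | |
| # `data_dir`, `tokenizer`, `args`, and `evaluate` are assumed to be defined by the caller. | |
| processor = SquadV2Processor() | |
| examples = processor.get_dev_examples(data_dir) | |
| features = squad_convert_examples_to_features( | |
|     examples=examples, | |
|     tokenizer=tokenizer, | |
|     max_seq_length=args.max_seq_length, | |
|     doc_stride=args.doc_stride, | |
|     max_query_length=args.max_query_length, | |
|     is_training=not evaluate, | |
| ) | |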
| ``` | |
| **Parameters:** | |
| examples : list of `SquadExample` | |
| tokenizer : an instance of a child of [PreTrainedTokenizer](/docs/transformers/main/zh/main_classes/tokenizer#transformers.PythonBackend) | |
| max_seq_length : The maximum sequence length of the inputs. | |
| doc_stride : The stride used when the context is too large and is split across several features. | |
| max_query_length : The maximum length of the query. | |
| is_training : Whether to create features for model training (`True`) or model evaluation (`False`). | |
| padding_strategy : Defaults to "max_length". Which padding strategy to use. | |
| return_dataset : Defaults to `False`. Can also be `'pt'`, in which case a `torch.utils.data.TensorDataset` is returned. | |
| threads : Number of processing threads. | |
| **Returns:** | |
| list of `SquadFeatures` | |
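The effect of `doc_stride` on a long context can be sketched as a sliding window over context tokens. This follows the windowing style of the original BERT SQuAD preprocessing and is an assumption for illustration: `sketch_doc_spans` is hypothetical, and the library function delegates the splitting to the tokenizer, so its exact windows may differ.

```python
def sketch_doc_spans(num_context_tokens, max_tokens_per_span, doc_stride):
    """Return (start, length) windows covering a context that is too long for one feature."""
    if doc_stride <= 0 or max_tokens_per_span <= 0:
        raise ValueError("doc_stride and max_tokens_per_span must be positive")
    spans = []
    start = 0
    while start < num_context_tokens:
        length = min(num_context_tokens - start, max_tokens_per_span)
        spans.append((start, length))
        if start + length == num_context_tokens:
            break  # the last window reaches the end of the context
        # Advance by the stride, so consecutive windows overlap.
        start += min(length, doc_stride)
    return spans


spans = sketch_doc_spans(num_context_tokens=10, max_tokens_per_span=4, doc_stride=2)
print(spans)  # [(0, 4), (2, 4), (4, 4), (6, 4)]
```

Each window would become one feature, which is why a single `SquadExample` can yield several `SquadFeatures`.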
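| These processors, as well as the aforementioned method, can be used with files containing the data as well as with the tensorflow_datasets package. Examples are given below. | |
| ### Example usage | |
| Here is an example using the processors as well as the conversion method with data files: | |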
| ```python | |
| from transformers.data.processors.squad import SquadV1Processor, SquadV2Processor, squad_convert_examples_to_features | |
| # The data directories, `tokenizer`, the length settings, and `evaluate` are assumed to be defined by the caller. | |
| # Loading a V2 processor | |
| processor = SquadV2Processor() | |
| examples = processor.get_dev_examples(squad_v2_data_dir) | |
| # Loading a V1 processor | |
| processor = SquadV1Processor() | |
| examples = processor.get_dev_examples(squad_v1_data_dir) | |
| features = squad_convert_examples_to_features( | |
|     examples=examples, | |
|     tokenizer=tokenizer, | |
|     max_seq_length=max_seq_length, | |
|     doc_stride=doc_stride, | |
|     max_query_length=max_query_length, | |
|     is_training=not evaluate, | |
| ) | |
| ``` | |
| Using *tensorflow_datasets* is as easy as using a data file: | |
| ```python | |
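| import tensorflow_datasets as tfds | |
| from transformers.data.processors.squad import SquadV1Processor, squad_convert_examples_to_features | |
| # tensorflow_datasets only handles SQuAD v1. | |
| # `tokenizer`, the length settings, and `evaluate` are assumed to be defined by the caller. | |
| tfds_examples = tfds.load("squad") | |
| examples = SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=evaluate) | |
| features = squad_convert_examples_to_features( | |
|     examples=examples, | |
|     tokenizer=tokenizer, | |
|     max_seq_length=max_seq_length, | |
|     doc_stride=doc_stride, | |
|     max_query_length=max_query_length, | |
|     is_training=not evaluate, | |
| ) | |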
| ``` | |
| Another example using these processors is given in the [run_squad.py](https://github.com/huggingface/transformers/tree/main/examples/legacy/question-answering/run_squad.py) script. | |