# Data Utilities
## prepare_multimodal_messages[[trl.prepare_multimodal_messages]]
#### trl.prepare_multimodal_messages[[trl.prepare_multimodal_messages]]
[Source](https://github.com/huggingface/trl/blob/vr_4949/trl/data_utils.py#L32)
Convert messages into a structured multimodal format and inject the provided images into the message contents.
Notes:
- When the input `messages` isn't already in the structured format (i.e., all `"content"` values are strings), the function transforms it into the structured format by wrapping text in `{"type": "text", "text": ...}` and inserting `{"type": "image"}` placeholders for the images at the start of the first user message's content.
- When the input `messages` is already in the structured format (i.e., all `"content"` values are lists of structured blocks), the function only fills in the actual images in the existing `{"type": "image"}` placeholders. If the number of placeholders does not match the number of provided images, an error is raised.
Example:
```python
# Input
[
    {"role": "user", "content": "What's in this image?"},
    {"role": "assistant", "content": "It looks like a cat."},
]
# Output, one image provided
[
    {"role": "user", "content": [{"type": "image", "image": <Image>}, {"type": "text", "text": "What's in this image?"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "It looks like a cat."}]},
]
```
**Parameters:**
messages (`list[dict[str, Any]]`) : List of messages, where each message is a dictionary with a `"role"` key (`"system"`, `"user"`, or `"assistant"`) and a `"content"` key containing either a raw string or, if already prepared, a list of structured blocks.
images (`list`) : List of image objects to insert.
**Returns:**
``list[dict[str, Any]]``
A deep-copied list of messages where every `"content"` value is a list of structured content blocks, and all `"image"` placeholders are populated with the corresponding image objects.
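As a rough illustration of the first note above, here is a minimal sketch of wrapping string contents and injecting image placeholders. This is not TRL's implementation, and the helper name `to_structured` is hypothetical:

```python
import copy

def to_structured(messages, num_images):
    """Hypothetical sketch: wrap string contents in text blocks and prepend
    image placeholders to the first user message (not trl's actual code)."""
    messages = copy.deepcopy(messages)
    for message in messages:
        blocks = [{"type": "text", "text": message["content"]}]
        if message["role"] == "user" and num_images:
            # Only the first user message receives the placeholders.
            blocks = [{"type": "image"}] * num_images + blocks
            num_images = 0
        message["content"] = blocks
    return messages
```

A second pass would then replace each `{"type": "image"}` placeholder with the corresponding image object.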
## prepare_multimodal_messages_vllm[[trl.prepare_multimodal_messages_vllm]]
#### trl.prepare_multimodal_messages_vllm[[trl.prepare_multimodal_messages_vllm]]
[Source](https://github.com/huggingface/trl/blob/vr_4949/trl/data_utils.py#L112)
Convert structured multimodal messages into a format compatible with vLLM. Replaces `"type": "image"` blocks with `"type": "image_pil"` blocks, and `"image": Image` with `"image_pil": Image`.
Example:
```python
# Input
[{"role": "user", "content": [{"type": "image", "image": <Image>}, {"type": "text", "text": "What's in this image?"}]}]
# Output
[{"role": "user", "content": [{"type": "image_pil", "image_pil": <Image>}, {"type": "text", "text": "What's in this image?"}]}]
```
**Parameters:**
messages (`list[dict[str, Any]]`) : Messages with `"role"` and `"content"`. Content is expected to be a list of structured blocks.
**Returns:**
``list[dict[str, Any]]``
A deep-copied list of messages compatible with vLLM's expected input format.
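The renaming described above can be sketched as follows (a simplified stand-in, not TRL's code; the helper name is hypothetical and a string stands in for the image object):

```python
import copy

def to_vllm_format(messages):
    """Hypothetical sketch: rename "image" blocks to vLLM's "image_pil"
    convention, leaving the input untouched via a deep copy."""
    messages = copy.deepcopy(messages)
    for message in messages:
        for block in message["content"]:
            if block.get("type") == "image":
                block["type"] = "image_pil"
                block["image_pil"] = block.pop("image")
    return messages
```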
## is_conversational[[trl.is_conversational]]
#### trl.is_conversational[[trl.is_conversational]]
[Source](https://github.com/huggingface/trl/blob/vr_4949/trl/data_utils.py#L145)
Check if the example is in a conversational format.
Examples:
```python
>>> example = {"prompt": [{"role": "user", "content": "What color is the sky?"}]}
>>> is_conversational(example)
True
>>> example = {"prompt": "The sky is"}
>>> is_conversational(example)
False
```
**Parameters:**
example (`dict[str, Any]`) : A single data entry of a dataset. The example can have different keys depending on the dataset type.
**Returns:**
``bool``
`True` if the data is in a conversational format, `False` otherwise.
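The check can be approximated with a simple heuristic like the one below. This is a sketch only; `trl.is_conversational` inspects specific dataset keys rather than scanning all values:

```python
def looks_conversational(example):
    """Simplified heuristic: any value that is a non-empty list of dicts
    carrying a "role" key counts as conversational (sketch, not trl's code)."""
    for value in example.values():
        if (isinstance(value, list) and value
                and isinstance(value[0], dict) and "role" in value[0]):
            return True
    return False
```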
## is_conversational_from_value[[trl.is_conversational_from_value]]
#### trl.is_conversational_from_value[[trl.is_conversational_from_value]]
[Source](https://github.com/huggingface/trl/blob/vr_4949/trl/data_utils.py#L897)
Check if the example is in a conversational format (from/value). Note that this format isn't recommended; prefer the ChatML format (role/content).
Examples:
```python
>>> example = {"conversations": [{"from": "user", "value": "What color is the sky?"}]}
>>> is_conversational_from_value(example)
True
>>> example = {"conversations": [{"role": "user", "content": "What color is the sky?"}]}
>>> is_conversational_from_value(example)
False
>>> example = {"conversations": "The sky is"}
>>> is_conversational_from_value(example)
False
```
**Parameters:**
example (`dict[str, Any]`) : A single data entry of a dataset. The example can have different keys depending on the dataset type.
**Returns:**
``bool``
`True` if the data is in the conversational from/value format, `False` otherwise.
## apply_chat_template[[trl.apply_chat_template]]
#### trl.apply_chat_template[[trl.apply_chat_template]]
[Source](https://github.com/huggingface/trl/blob/vr_4949/trl/data_utils.py#L186)
Apply a chat template to a conversational example along with the schema for a list of functions in `tools`.
For more details, see [maybe_apply_chat_template()](/docs/trl/pr_4949/en/data_utils#trl.maybe_apply_chat_template).
## maybe_apply_chat_template[[trl.maybe_apply_chat_template]]
#### trl.maybe_apply_chat_template[[trl.maybe_apply_chat_template]]
[Source](https://github.com/huggingface/trl/blob/vr_4949/trl/data_utils.py#L319)
If the example is in a conversational format, apply a chat template to it.
Notes:
- This function does not alter the keys, except for language modeling datasets, where `"messages"` is replaced by `"text"`.
- In the case of prompt-only data, if the last role is `"user"`, the generation prompt is added to the prompt. Otherwise, if the last role is `"assistant"`, the final message is continued.
Example:
```python
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
>>> example = {
...     "prompt": [{"role": "user", "content": "What color is the sky?"}],
...     "completion": [{"role": "assistant", "content": "It is blue."}],
... }
>>> apply_chat_template(example, tokenizer)
{'prompt': '<|user|>\nWhat color is the sky?<|end|>\n<|assistant|>\n', 'completion': 'It is blue.<|end|>\n<|endoftext|>'}
```
**Parameters:**
example (`dict[str, list[dict[str, str]]]`) : Dictionary representing a single data entry of a conversational dataset. Each data entry can have different keys depending on the dataset type. The supported dataset types are:
  - Language modeling dataset: `"messages"`.
  - Prompt-only dataset: `"prompt"`.
  - Prompt-completion dataset: `"prompt"` and `"completion"`.
  - Preference dataset: `"prompt"`, `"chosen"`, and `"rejected"`.
  - Preference dataset with implicit prompt: `"chosen"` and `"rejected"`.
  - Unpaired preference dataset: `"prompt"`, `"completion"`, and `"label"`.

  For keys `"messages"`, `"prompt"`, `"chosen"`, `"rejected"`, and `"completion"`, the values are lists of messages, where each message is a dictionary with keys `"role"` and `"content"`. Additionally, the example may contain a `"chat_template_kwargs"` key, which is a dictionary of additional keyword arguments to pass to the chat template renderer.
tokenizer ([PreTrainedTokenizerBase](https://huggingface.co/docs/transformers/main/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase)) : Tokenizer to apply the chat template with.
tools (`list[dict | Callable]`, *optional*) : A list of tools (callable functions) that will be accessible to the model. If the template does not support function calling, this argument will have no effect.
**template_kwargs (`Any`, *optional*) : Additional kwargs to pass to the template renderer. Will be accessible by the chat template.
**Returns:**
``dict[str, str]``
Formatted example with the chat template applied.
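The prompt-only behaviour described in the notes above can be mimicked with a toy renderer. The `<|...|>` markers below are illustrative only and do not correspond to any real model's template:

```python
def render_prompt(prompt_messages):
    """Toy renderer for the prompt-only branch: append a generation prompt
    after a final user turn, or leave a final assistant turn open so
    generation continues it. Illustrative sketch, not trl's code."""
    text = ""
    for message in prompt_messages[:-1]:
        text += f"<|{message['role']}|>{message['content']}<|end|>"
    last = prompt_messages[-1]
    if last["role"] == "user":
        # Last turn is a user turn: add the generation prompt.
        text += f"<|user|>{last['content']}<|end|><|assistant|>"
    else:
        # Last turn is an assistant turn: leave it open (no end marker).
        text += f"<|assistant|>{last['content']}"
    return text
```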
## maybe_convert_to_chatml[[trl.maybe_convert_to_chatml]]
#### trl.maybe_convert_to_chatml[[trl.maybe_convert_to_chatml]]
[Source](https://github.com/huggingface/trl/blob/vr_4949/trl/data_utils.py#L937)
Convert a conversational dataset with fields `from` and `value` to ChatML format.
This function modifies conversational data to align with OpenAI's ChatML format:
- Replaces the key `"from"` with `"role"` in message dictionaries.
- Replaces the key `"value"` with `"content"` in message dictionaries.
- Renames `"conversations"` to `"messages"` for consistency with ChatML.
Example:
```python
>>> from trl import maybe_convert_to_chatml
>>> example = {
...     "conversations": [
...         {"from": "user", "value": "What color is the sky?"},
...         {"from": "assistant", "value": "It is blue."},
...     ]
... }
>>> maybe_convert_to_chatml(example)
{'messages': [{'role': 'user', 'content': 'What color is the sky?'},
 {'role': 'assistant', 'content': 'It is blue.'}]}
```
**Parameters:**
example (`dict[str, list]`) : A single data entry containing a list of messages.
**Returns:**
``dict[str, list]``
Example reformatted to ChatML style.
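The renaming steps listed above amount to a small dictionary rewrite, sketched here under the assumption of a single `"conversations"` column (not TRL's full implementation, which preserves any other keys):

```python
def convert_to_chatml(example):
    """Sketch of the ChatML conversion: from/value -> role/content,
    and "conversations" -> "messages"."""
    return {"messages": [{"role": m["from"], "content": m["value"]}
                         for m in example["conversations"]]}
```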
## extract_prompt[[trl.extract_prompt]]
#### trl.extract_prompt[[trl.extract_prompt]]
[Source](https://github.com/huggingface/trl/blob/vr_4949/trl/data_utils.py#L488)
Extracts the shared prompt from a preference data example, where the prompt is implicit within both the chosen and rejected completions.
For more details, see [maybe_extract_prompt()](/docs/trl/pr_4949/en/data_utils#trl.maybe_extract_prompt).
## maybe_extract_prompt[[trl.maybe_extract_prompt]]
#### trl.maybe_extract_prompt[[trl.maybe_extract_prompt]]
[Source](https://github.com/huggingface/trl/blob/vr_4949/trl/data_utils.py#L507)
Extracts the shared prompt from a preference data example, where the prompt is implicit within both the chosen and rejected completions.
If the example already contains a `"prompt"` key, the function returns the example as is. Else, the function identifies the longest common sequence (prefix) of conversation turns between the `"chosen"` and `"rejected"` completions and extracts this as the prompt. It then removes this prompt from the respective `"chosen"` and `"rejected"` completions.
Examples:
```python
>>> example = {
...     "chosen": [
...         {"role": "user", "content": "What color is the sky?"},
...         {"role": "assistant", "content": "It is blue."},
...     ],
...     "rejected": [
...         {"role": "user", "content": "What color is the sky?"},
...         {"role": "assistant", "content": "It is green."},
...     ],
... }
>>> extract_prompt(example)
{'prompt': [{'role': 'user', 'content': 'What color is the sky?'}],
 'chosen': [{'role': 'assistant', 'content': 'It is blue.'}],
 'rejected': [{'role': 'assistant', 'content': 'It is green.'}]}
```
Or, with the `map` method of [Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset):
```python
>>> from trl import extract_prompt
>>> from datasets import Dataset
>>> dataset_dict = {
...     "chosen": [
...         [
...             {"role": "user", "content": "What color is the sky?"},
...             {"role": "assistant", "content": "It is blue."},
...         ],
...         [
...             {"role": "user", "content": "Where is the sun?"},
...             {"role": "assistant", "content": "In the sky."},
...         ],
...     ],
...     "rejected": [
...         [
...             {"role": "user", "content": "What color is the sky?"},
...             {"role": "assistant", "content": "It is green."},
...         ],
...         [
...             {"role": "user", "content": "Where is the sun?"},
...             {"role": "assistant", "content": "In the sea."},
...         ],
...     ],
... }
>>> dataset = Dataset.from_dict(dataset_dict)
>>> dataset = dataset.map(extract_prompt)
>>> dataset[0]
{'prompt': [{'role': 'user', 'content': 'What color is the sky?'}],
 'chosen': [{'role': 'assistant', 'content': 'It is blue.'}],
 'rejected': [{'role': 'assistant', 'content': 'It is green.'}]}
```
**Parameters:**
example (`dict[str, list]`) : A dictionary representing a single data entry in the preference dataset. It must contain the keys `"chosen"` and `"rejected"`, where each value is either conversational or standard (`str`).
**Returns:**
``dict[str, list]``
A dictionary containing:
- `"prompt"`: The longest common prefix between the "chosen" and "rejected" completions.
- `"chosen"`: The remainder of the "chosen" completion, with the prompt removed.
- `"rejected"`: The remainder of the "rejected" completion, with the prompt removed.
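The prefix-splitting logic can be sketched in a few lines (an illustrative helper, not TRL's implementation, which also handles string-typed and already-prompted examples):

```python
def split_common_prefix(chosen, rejected):
    """Sketch of prompt extraction: the longest common prefix of turns
    becomes the prompt; the remainders stay as completions."""
    idx = 0
    while (idx < min(len(chosen), len(rejected))
           and chosen[idx] == rejected[idx]):
        idx += 1
    return {"prompt": chosen[:idx],
            "chosen": chosen[idx:],
            "rejected": rejected[idx:]}
```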
## unpair_preference_dataset[[trl.unpair_preference_dataset]]
#### trl.unpair_preference_dataset[[trl.unpair_preference_dataset]]
[Source](https://github.com/huggingface/trl/blob/vr_4949/trl/data_utils.py#L394)
Unpair a preference dataset.
Example:
```python
>>> from datasets import Dataset
>>> dataset_dict = {
...     "prompt": ["The sky is", "The sun is"],
...     "chosen": [" blue.", " in the sky."],
...     "rejected": [" green.", " in the sea."],
... }
>>> dataset = Dataset.from_dict(dataset_dict)
>>> dataset = unpair_preference_dataset(dataset)
>>> dataset
Dataset({
    features: ['prompt', 'completion', 'label'],
    num_rows: 4
})
>>> dataset[0]
{'prompt': 'The sky is', 'completion': ' blue.', 'label': True}
```
**Parameters:**
dataset ([Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)) : Preference dataset to unpair. The dataset must have columns `"chosen"`, `"rejected"` and optionally `"prompt"`.
num_proc (`int`, *optional*) : Number of processes to use for processing the dataset.
desc (`str`, *optional*) : Meaningful description to be displayed alongside the progress bar while mapping examples.
**Returns:**
`[Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset)`
The unpaired preference dataset.
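Conceptually, unpairing maps each paired row to two unpaired rows, as sketched below. The row ordering here is illustrative; TRL performs the transformation via `Dataset.map`:

```python
def unpair(rows):
    """Sketch of unpairing: each paired row yields two unpaired rows,
    the chosen completion labeled True and the rejected one False."""
    out = []
    for row in rows:
        out.append({"prompt": row["prompt"], "completion": row["chosen"], "label": True})
        out.append({"prompt": row["prompt"], "completion": row["rejected"], "label": False})
    return out
```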
## maybe_unpair_preference_dataset[[trl.maybe_unpair_preference_dataset]]
#### trl.maybe_unpair_preference_dataset[[trl.maybe_unpair_preference_dataset]]
[Source](https://github.com/huggingface/trl/blob/vr_4949/trl/data_utils.py#L437)
Unpair a preference dataset if it is paired.
Example:
```python
>>> from datasets import Dataset
>>> dataset_dict = {
...     "prompt": ["The sky is", "The sun is"],
...     "chosen": [" blue.", " in the sky."],
...     "rejected": [" green.", " in the sea."],
... }
>>> dataset = Dataset.from_dict(dataset_dict)
>>> dataset = maybe_unpair_preference_dataset(dataset)
>>> dataset
Dataset({
    features: ['prompt', 'completion', 'label'],
    num_rows: 4
})
>>> dataset[0]
{'prompt': 'The sky is', 'completion': ' blue.', 'label': True}
```
**Parameters:**
dataset ([Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)) : Preference dataset to unpair. The dataset must have columns `"chosen"`, `"rejected"` and optionally `"prompt"`.
num_proc (`int`, *optional*) : Number of processes to use for processing the dataset.
desc (`str`, *optional*) : Meaningful description to be displayed alongside the progress bar while mapping examples.
**Returns:**
`[Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)`
The unpaired preference dataset if it was paired, otherwise the original dataset.
## pack_dataset[[trl.pack_dataset]]
#### trl.pack_dataset[[trl.pack_dataset]]
[Source](https://github.com/huggingface/trl/blob/vr_4949/trl/data_utils.py#L778)
Pack sequences in a dataset into chunks of size `seq_length`.
Example:
```python
>>> from datasets import Dataset
>>> from trl import pack_dataset
>>> examples = {
...     "input_ids": [[1, 2, 3, 4, 5], [6, 7], [8, 9, 10], [11]],
...     "attention_mask": [[1, 1, 1, 0, 0], [1, 0], [1, 1, 0], [1]],
... }
>>> dataset = Dataset.from_dict(examples)
>>> packed_dataset = pack_dataset(dataset, seq_length=4, strategy="bfd")
>>> packed_dataset[:]
{'input_ids': [[1, 2, 3, 4], [8, 9, 10, 5], [6, 7, 11]],
 'attention_mask': [[1, 1, 1, 0], [0, 1, 1, 0], [1, 1, 1]],
 'seq_lengths': [[4], [3, 1], [2, 1]]}
```
**Parameters:**
dataset ([Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)) : Dataset to pack.
seq_length (`int`) : Target sequence length to pack to.
strategy (`str`, *optional*, defaults to `"bfd"`) : Packing strategy to use. Can be either:
  - `"bfd"` (Best Fit Decreasing): Slower but preserves sequence boundaries inside each packed sample. If a single sequence exceeds `seq_length`, it is split into multiple samples.
  - `"wrapped"`: Faster but more aggressive. Ignores sequence boundaries and will cut sequences in the middle to completely fill each packed sequence with data.
map_kwargs (`dict`, *optional*) : Additional keyword arguments to pass to the dataset's map method when packing examples.
**Returns:**
`[Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)`
The dataset with packed sequences. The number of examples may decrease as sequences are combined.
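The simpler `"wrapped"` strategy can be sketched for a single token column (an assumption for illustration; TRL applies this across all sequence columns, and `"bfd"` additionally bin-packs whole sequences):

```python
def pack_wrapped(sequences, seq_length):
    """Sketch of the "wrapped" strategy: concatenate all tokens and cut
    fixed-size chunks, ignoring sequence boundaries. The final chunk may
    be shorter than seq_length."""
    flat = [token for seq in sequences for token in seq]
    return [flat[i:i + seq_length] for i in range(0, len(flat), seq_length)]
```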
## truncate_dataset[[trl.truncate_dataset]]
#### trl.truncate_dataset[[trl.truncate_dataset]]
[Source](https://github.com/huggingface/trl/blob/vr_4949/trl/data_utils.py#L834)
Truncate sequences in a dataset to a specified `max_length`.
Example:
```python
>>> from datasets import Dataset
>>> examples = {
...     "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8]],
...     "attention_mask": [[0, 1, 1], [0, 0, 1, 1], [1]],
... }
>>> dataset = Dataset.from_dict(examples)
>>> truncated_dataset = truncate_dataset(dataset, max_length=2)
>>> truncated_dataset[:]
{'input_ids': [[1, 2], [4, 5], [8]],
 'attention_mask': [[0, 1], [0, 0], [1]]}
```
**Parameters:**
dataset ([Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)) : Dataset to truncate.
max_length (`int`) : Maximum sequence length to truncate to.
map_kwargs (`dict`, *optional*) : Additional keyword arguments to pass to the dataset's map method when truncating examples.
**Returns:**
`[Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)`
The dataset with truncated sequences.
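The truncation itself reduces to clipping every sequence column, as in this sketch over a plain dict of columns (TRL applies the same idea through the dataset's map method):

```python
def truncate_columns(columns, max_length):
    """Sketch of truncation: clip every sequence in every column to at
    most max_length items."""
    return {name: [seq[:max_length] for seq in seqs]
            for name, seqs in columns.items()}
```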