# Data Utilities

## prepare_multimodal_messages[[trl.prepare_multimodal_messages]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>trl.prepare_multimodal_messages</name><anchor>trl.prepare_multimodal_messages</anchor><source>https://github.com/huggingface/trl/blob/vr_4331/trl/data_utils.py#L32</source><parameters>[{"name": "messages", "val": ": list"}, {"name": "images", "val": ": list"}]</parameters><paramsdesc>- **messages** (`list[dict[str, Any]]`) --
  List of messages, where each message is a dictionary with a `"role"` key (`"system"`, `"user"`, or `"assistant"`) and a `"content"` key containing either a string or, if already prepared, a list of structured content blocks.
- **images** (`list`) --
  List of image objects to insert.</paramsdesc><paramgroups>0</paramgroups><rettype>`list[dict[str, Any]]`</rettype><retdesc>A deep-copied list of messages where every `"content"` value is a list of structured content blocks, and all `"image"` placeholders are populated with the corresponding image objects.</retdesc></docstring>

Convert messages into a structured multimodal format and inject the provided images into the message contents.

Notes:

- When the input `messages` isn't already in the structured format (i.e., all `"content"` values are strings), the function transforms it into the structured format by wrapping text in `{"type": "text", "text": ...}` and inserting `{"type": "image"}` placeholders for the images at the beginning of the first user message's content.
- When the input `messages` is already in the structured format (i.e., all `"content"` values are lists of structured blocks), the function only fills the actual images into the existing `{"type": "image"}` placeholders. If the number of placeholders does not match the number of provided images, an error is raised.
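The wrapping behavior described in the notes can be sketched in plain Python. This is a simplified illustration, not TRL's actual implementation; the helper name `to_structured` and the placeholder string standing in for a real `PIL.Image` are assumptions of the sketch:

```python
import copy

def to_structured(messages, images):
    """Simplified sketch: wrap plain-string contents in {"type": "text", ...}
    blocks and prepend one {"type": "image", ...} block per image to the
    first user message's content."""
    messages = copy.deepcopy(messages)  # never mutate the caller's messages
    for message in messages:
        if isinstance(message["content"], str):
            message["content"] = [{"type": "text", "text": message["content"]}]
    for message in messages:
        if message["role"] == "user":
            image_blocks = [{"type": "image", "image": image} for image in images]
            message["content"] = image_blocks + message["content"]
            break
    return messages

messages = [{"role": "user", "content": "What's in this image?"}]
structured = to_structured(messages, ["<image-0>"])
# structured[0]["content"] now starts with the image block, followed by the text block
```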
<ExampleCodeBlock anchor="trl.prepare_multimodal_messages.example">

Example:

```python
# Input
[
    {"role": "user", "content": "What's in this image?"},
    {"role": "assistant", "content": "It looks like a cat."},
]

# Output, one image provided
[
    {"role": "user", "content": [{"type": "image", "image": <PIL.Image.Image>}, {"type": "text", "text": "What's in this image?"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "It looks like a cat."}]},
]
```

</ExampleCodeBlock>
</div>
## prepare_multimodal_messages_vllm[[trl.prepare_multimodal_messages_vllm]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>trl.prepare_multimodal_messages_vllm</name><anchor>trl.prepare_multimodal_messages_vllm</anchor><source>https://github.com/huggingface/trl/blob/vr_4331/trl/data_utils.py#L112</source><parameters>[{"name": "messages", "val": ": list"}]</parameters><paramsdesc>- **messages** (`list[dict[str, Any]]`) --
  Messages with `"role"` and `"content"`. Content is expected to be a list of structured blocks.</paramsdesc><paramgroups>0</paramgroups><rettype>`list[dict[str, Any]]`</rettype><retdesc>A deep-copied list of messages compatible with vLLM's expected input format.</retdesc></docstring>

Convert structured multimodal messages into a format compatible with vLLM. Replaces `"type": "image"` blocks with `"type": "image_pil"` blocks, and `"image": Image` entries with `"image_pil": Image`.
<ExampleCodeBlock anchor="trl.prepare_multimodal_messages_vllm.example">

Example:

```python
# Input
[{"role": "user", "content": [{"type": "image", "image": <PIL.Image.Image>}, {"type": "text", "text": "What's in this image?"}]}]

# Output
[{"role": "user", "content": [{"type": "image_pil", "image_pil": <PIL.Image.Image>}, {"type": "text", "text": "What's in this image?"}]}]
```

</ExampleCodeBlock>
</div>
## is_conversational[[trl.is_conversational]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>trl.is_conversational</name><anchor>trl.is_conversational</anchor><source>https://github.com/huggingface/trl/blob/vr_4331/trl/data_utils.py#L145</source><parameters>[{"name": "example", "val": ": dict"}]</parameters><paramsdesc>- **example** (`dict[str, Any]`) --
  A single data entry of a dataset. The example can have different keys depending on the dataset type.</paramsdesc><paramgroups>0</paramgroups><rettype>`bool`</rettype><retdesc>`True` if the data is in a conversational format, `False` otherwise.</retdesc></docstring>

Check if the example is in a conversational format.

<ExampleCodeBlock anchor="trl.is_conversational.example">

Examples:

```python
>>> example = {"prompt": [{"role": "user", "content": "What color is the sky?"}]}
>>> is_conversational(example)
True

>>> example = {"prompt": "The sky is"}
>>> is_conversational(example)
False
```

</ExampleCodeBlock>
</div>
## is_conversational_from_value[[trl.is_conversational_from_value]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>trl.is_conversational_from_value</name><anchor>trl.is_conversational_from_value</anchor><source>https://github.com/huggingface/trl/blob/vr_4331/trl/data_utils.py#L850</source><parameters>[{"name": "example", "val": ": dict"}]</parameters><paramsdesc>- **example** (`dict[str, Any]`) --
  A single data entry of a dataset. The example can have different keys depending on the dataset type.</paramsdesc><paramgroups>0</paramgroups><rettype>`bool`</rettype><retdesc>`True` if the data is in a conversational (from/value) format, `False` otherwise.</retdesc></docstring>

Check if the example is in a conversational format (from/value). Note that this format isn't recommended; prefer the ChatML format (role/content).

<ExampleCodeBlock anchor="trl.is_conversational_from_value.example">

Examples:

```python
>>> example = {"conversations": [{"from": "user", "value": "What color is the sky?"}]}
>>> is_conversational_from_value(example)
True

>>> example = {"conversations": [{"role": "user", "content": "What color is the sky?"}]}
>>> is_conversational_from_value(example)
False

>>> example = {"conversations": "The sky is"}
>>> is_conversational_from_value(example)
False
```

</ExampleCodeBlock>
</div>
## apply_chat_template[[trl.apply_chat_template]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>trl.apply_chat_template</name><anchor>trl.apply_chat_template</anchor><source>https://github.com/huggingface/trl/blob/vr_4331/trl/data_utils.py#L186</source><parameters>[{"name": "example", "val": ": dict"}, {"name": "tokenizer", "val": ": transformers.tokenization_utils_base.PreTrainedTokenizerBase | transformers.processing_utils.ProcessorMixin"}, {"name": "tools", "val": ": list[dict | collections.abc.Callable] | None = None"}, {"name": "**template_kwargs", "val": ""}]</parameters></docstring>

Apply a chat template to a conversational example along with the schema for a list of functions in `tools`.

For more details, see [maybe_apply_chat_template()](/docs/trl/pr_4331/en/data_utils#trl.maybe_apply_chat_template).
</div>
## maybe_apply_chat_template[[trl.maybe_apply_chat_template]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>trl.maybe_apply_chat_template</name><anchor>trl.maybe_apply_chat_template</anchor><source>https://github.com/huggingface/trl/blob/vr_4331/trl/data_utils.py#L319</source><parameters>[{"name": "example", "val": ": dict"}, {"name": "tokenizer", "val": ": PreTrainedTokenizerBase"}, {"name": "tools", "val": ": list[dict | collections.abc.Callable] | None = None"}, {"name": "**template_kwargs", "val": ": typing.Any"}]</parameters><paramsdesc>- **example** (`dict[str, list[dict[str, str]]]`) --
  Dictionary representing a single data entry of a conversational dataset. Each data entry can have different keys depending on the dataset type. The supported dataset types are:

  - Language modeling dataset: `"messages"`.
  - Prompt-only dataset: `"prompt"`.
  - Prompt-completion dataset: `"prompt"` and `"completion"`.
  - Preference dataset: `"prompt"`, `"chosen"`, and `"rejected"`.
  - Preference dataset with implicit prompt: `"chosen"` and `"rejected"`.
  - Unpaired preference dataset: `"prompt"`, `"completion"`, and `"label"`.

  For keys `"messages"`, `"prompt"`, `"chosen"`, `"rejected"`, and `"completion"`, the values are lists of messages, where each message is a dictionary with keys `"role"` and `"content"`. Additionally, the example may contain a `"chat_template_kwargs"` key, which is a dictionary of additional keyword arguments to pass to the chat template renderer.
- **tokenizer** ([PreTrainedTokenizerBase](https://huggingface.co/docs/transformers/main/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase)) --
  Tokenizer to apply the chat template with.
- **tools** (`list[dict | Callable]`, *optional*) --
  A list of tools (callable functions) that will be accessible to the model. If the template does not support function calling, this argument will have no effect.
- ****template_kwargs** (`Any`, *optional*) --
  Additional kwargs to pass to the template renderer. Will be accessible by the chat template.</paramsdesc><paramgroups>0</paramgroups><rettype>`dict[str, str]`</rettype><retdesc>Formatted example with the chat template applied.</retdesc></docstring>

If the example is in a conversational format, apply a chat template to it.

Notes:

- This function does not alter the keys, except for language modeling datasets, where `"messages"` is replaced by `"text"`.
- For prompt-only data, if the last role is `"user"`, the generation prompt is added to the prompt. Otherwise, if the last role is `"assistant"`, the final message is continued.

<ExampleCodeBlock anchor="trl.maybe_apply_chat_template.example">

Example:

```python
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
>>> example = {
...     "prompt": [{"role": "user", "content": "What color is the sky?"}],
...     "completion": [{"role": "assistant", "content": "It is blue."}],
... }
>>> maybe_apply_chat_template(example, tokenizer)
{'prompt': '<|user|>\nWhat color is the sky?<|end|>\n<|assistant|>\n', 'completion': 'It is blue.<|end|>\n'}
```

</ExampleCodeBlock>
</div>
## maybe_convert_to_chatml[[trl.maybe_convert_to_chatml]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>trl.maybe_convert_to_chatml</name><anchor>trl.maybe_convert_to_chatml</anchor><source>https://github.com/huggingface/trl/blob/vr_4331/trl/data_utils.py#L890</source><parameters>[{"name": "example", "val": ": dict"}]</parameters><paramsdesc>- **example** (`dict[str, list]`) --
  A single data entry containing a list of messages.</paramsdesc><paramgroups>0</paramgroups><rettype>`dict[str, list]`</rettype><retdesc>Example reformatted to ChatML style.</retdesc></docstring>

Convert a conversational dataset with fields `from` and `value` to ChatML format.

This function modifies conversational data to align with OpenAI's ChatML format:

- Replaces the key `"from"` with `"role"` in message dictionaries.
- Replaces the key `"value"` with `"content"` in message dictionaries.
- Renames `"conversations"` to `"messages"` for consistency with ChatML.

<ExampleCodeBlock anchor="trl.maybe_convert_to_chatml.example">

Example:

```python
>>> from trl import maybe_convert_to_chatml
>>> example = {
...     "conversations": [
...         {"from": "user", "value": "What color is the sky?"},
...         {"from": "assistant", "value": "It is blue."},
...     ]
... }
>>> maybe_convert_to_chatml(example)
{'messages': [{'role': 'user', 'content': 'What color is the sky?'},
 {'role': 'assistant', 'content': 'It is blue.'}]}
```

</ExampleCodeBlock>
</div>
## extract_prompt[[trl.extract_prompt]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>trl.extract_prompt</name><anchor>trl.extract_prompt</anchor><source>https://github.com/huggingface/trl/blob/vr_4331/trl/data_utils.py#L488</source><parameters>[{"name": "example", "val": ": dict"}]</parameters></docstring>

Extracts the shared prompt from a preference data example, where the prompt is implicit within both the chosen and rejected completions.

For more details, see [maybe_extract_prompt()](/docs/trl/pr_4331/en/data_utils#trl.maybe_extract_prompt).
</div>
## maybe_extract_prompt[[trl.maybe_extract_prompt]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>trl.maybe_extract_prompt</name><anchor>trl.maybe_extract_prompt</anchor><source>https://github.com/huggingface/trl/blob/vr_4331/trl/data_utils.py#L507</source><parameters>[{"name": "example", "val": ": dict"}]</parameters><paramsdesc>- **example** (`dict[str, list]`) --
  A dictionary representing a single data entry in the preference dataset. It must contain the keys `"chosen"` and `"rejected"`, where each value is either conversational or standard (`str`).</paramsdesc><paramgroups>0</paramgroups><rettype>`dict[str, list]`</rettype><retdesc>A dictionary containing:

- `"prompt"`: The longest common prefix between the "chosen" and "rejected" completions.
- `"chosen"`: The remainder of the "chosen" completion, with the prompt removed.
- `"rejected"`: The remainder of the "rejected" completion, with the prompt removed.</retdesc></docstring>

Extracts the shared prompt from a preference data example, where the prompt is implicit within both the chosen and rejected completions.

If the example already contains a `"prompt"` key, the function returns the example as is. Else, the function identifies the longest common sequence (prefix) of conversation turns between the "chosen" and "rejected" completions and extracts this as the prompt. It then removes this prompt from the respective "chosen" and "rejected" completions.
<ExampleCodeBlock anchor="trl.maybe_extract_prompt.example">

Examples:

```python
>>> example = {
...     "chosen": [
...         {"role": "user", "content": "What color is the sky?"},
...         {"role": "assistant", "content": "It is blue."},
...     ],
...     "rejected": [
...         {"role": "user", "content": "What color is the sky?"},
...         {"role": "assistant", "content": "It is green."},
...     ],
... }
>>> maybe_extract_prompt(example)
{'prompt': [{'role': 'user', 'content': 'What color is the sky?'}],
 'chosen': [{'role': 'assistant', 'content': 'It is blue.'}],
 'rejected': [{'role': 'assistant', 'content': 'It is green.'}]}
```

</ExampleCodeBlock>

<ExampleCodeBlock anchor="trl.maybe_extract_prompt.example-2">

Or, with the `map` method of [Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset):

```python
>>> from trl import maybe_extract_prompt
>>> from datasets import Dataset
>>> dataset_dict = {
...     "chosen": [
...         [
...             {"role": "user", "content": "What color is the sky?"},
...             {"role": "assistant", "content": "It is blue."},
...         ],
...         [
...             {"role": "user", "content": "Where is the sun?"},
...             {"role": "assistant", "content": "In the sky."},
...         ],
...     ],
...     "rejected": [
...         [
...             {"role": "user", "content": "What color is the sky?"},
...             {"role": "assistant", "content": "It is green."},
...         ],
...         [
...             {"role": "user", "content": "Where is the sun?"},
...             {"role": "assistant", "content": "In the sea."},
...         ],
...     ],
... }
>>> dataset = Dataset.from_dict(dataset_dict)
>>> dataset = dataset.map(maybe_extract_prompt)
>>> dataset[0]
{'prompt': [{'role': 'user', 'content': 'What color is the sky?'}],
 'chosen': [{'role': 'assistant', 'content': 'It is blue.'}],
 'rejected': [{'role': 'assistant', 'content': 'It is green.'}]}
```

</ExampleCodeBlock>
</div>
## unpair_preference_dataset[[trl.unpair_preference_dataset]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>trl.unpair_preference_dataset</name><anchor>trl.unpair_preference_dataset</anchor><source>https://github.com/huggingface/trl/blob/vr_4331/trl/data_utils.py#L394</source><parameters>[{"name": "dataset", "val": ": ~DatasetType"}, {"name": "num_proc", "val": ": int | None = None"}, {"name": "desc", "val": ": str | None = None"}]</parameters><paramsdesc>- **dataset** ([Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)) --
  Preference dataset to unpair. The dataset must have columns `"chosen"`, `"rejected"`, and optionally `"prompt"`.
- **num_proc** (`int`, *optional*) --
  Number of processes to use for processing the dataset.
- **desc** (`str`, *optional*) --
  Meaningful description to be displayed alongside the progress bar while mapping examples.</paramsdesc><paramgroups>0</paramgroups><rettype>[Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset)</rettype><retdesc>The unpaired preference dataset.</retdesc></docstring>

Unpair a preference dataset.
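Conceptually, each paired row produces two unpaired rows: the chosen completion labeled `True` and the rejected one labeled `False`. A minimal sketch of this transformation on a batched example dict (the function name `unpair_batch` and the chosen-first row ordering are assumptions of the sketch, not guarantees of TRL's output order):

```python
def unpair_batch(batch):
    """Sketch: turn {"prompt", "chosen", "rejected"} columns into
    {"prompt", "completion", "label"} columns, doubling the row count."""
    n = len(batch["prompt"])
    return {
        "prompt": batch["prompt"] + batch["prompt"],
        "completion": batch["chosen"] + batch["rejected"],
        "label": [True] * n + [False] * n,  # chosen rows first, then rejected
    }

rows = unpair_batch({
    "prompt": ["The sky is", "The sun is"],
    "chosen": [" blue.", " in the sky."],
    "rejected": [" green.", " in the sea."],
})
```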
<ExampleCodeBlock anchor="trl.unpair_preference_dataset.example">

Example:

```python
>>> from datasets import Dataset
>>> dataset_dict = {
...     "prompt": ["The sky is", "The sun is"],
...     "chosen": [" blue.", " in the sky."],
...     "rejected": [" green.", " in the sea."],
... }
>>> dataset = Dataset.from_dict(dataset_dict)
>>> dataset = unpair_preference_dataset(dataset)
>>> dataset
Dataset({
    features: ['prompt', 'completion', 'label'],
    num_rows: 4
})
>>> dataset[0]
{'prompt': 'The sky is', 'completion': ' blue.', 'label': True}
```

</ExampleCodeBlock>
</div>
## maybe_unpair_preference_dataset[[trl.maybe_unpair_preference_dataset]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>trl.maybe_unpair_preference_dataset</name><anchor>trl.maybe_unpair_preference_dataset</anchor><source>https://github.com/huggingface/trl/blob/vr_4331/trl/data_utils.py#L437</source><parameters>[{"name": "dataset", "val": ": ~DatasetType"}, {"name": "num_proc", "val": ": int | None = None"}, {"name": "desc", "val": ": str | None = None"}]</parameters><paramsdesc>- **dataset** ([Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)) --
  Preference dataset to unpair. The dataset must have columns `"chosen"`, `"rejected"`, and optionally `"prompt"`.
- **num_proc** (`int`, *optional*) --
  Number of processes to use for processing the dataset.
- **desc** (`str`, *optional*) --
  Meaningful description to be displayed alongside the progress bar while mapping examples.</paramsdesc><paramgroups>0</paramgroups><rettype>[Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)</rettype><retdesc>The unpaired preference dataset if it was paired, otherwise the original dataset.</retdesc></docstring>

Unpair a preference dataset if it is paired.

<ExampleCodeBlock anchor="trl.maybe_unpair_preference_dataset.example">

Example:

```python
>>> from datasets import Dataset
>>> dataset_dict = {
...     "prompt": ["The sky is", "The sun is"],
...     "chosen": [" blue.", " in the sky."],
...     "rejected": [" green.", " in the sea."],
... }
>>> dataset = Dataset.from_dict(dataset_dict)
>>> dataset = maybe_unpair_preference_dataset(dataset)
>>> dataset
Dataset({
    features: ['prompt', 'completion', 'label'],
    num_rows: 4
})
>>> dataset[0]
{'prompt': 'The sky is', 'completion': ' blue.', 'label': True}
```

</ExampleCodeBlock>
</div>
## pack_dataset[[trl.pack_dataset]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>trl.pack_dataset</name><anchor>trl.pack_dataset</anchor><source>https://github.com/huggingface/trl/blob/vr_4331/trl/data_utils.py#L731</source><parameters>[{"name": "dataset", "val": ": ~DatasetType"}, {"name": "seq_length", "val": ": int"}, {"name": "strategy", "val": ": str = 'bfd'"}, {"name": "map_kwargs", "val": ": dict[str, typing.Any] | None = None"}]</parameters><paramsdesc>- **dataset** ([Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)) --
  Dataset to pack.
- **seq_length** (`int`) --
  Target sequence length to pack to.
- **strategy** (`str`, *optional*, defaults to `"bfd"`) --
  Packing strategy to use. Can be either:

  - `"bfd"` (Best Fit Decreasing): Slower but preserves sequence boundaries. Sequences are never cut in the middle.
  - `"wrapped"`: Faster but more aggressive. Ignores sequence boundaries and will cut sequences in the middle to completely fill each packed sequence with data.
- **map_kwargs** (`dict`, *optional*) --
  Additional keyword arguments to pass to the dataset's map method when packing examples.</paramsdesc><paramgroups>0</paramgroups><rettype>[Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)</rettype><retdesc>The dataset with packed sequences. The number of examples may decrease as sequences are combined.</retdesc></docstring>

Pack sequences in a dataset into chunks of size `seq_length`.
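The `"bfd"` strategy is classic Best Fit Decreasing bin packing over sequence lengths, sketched below in plain Python (the function name `pack_bfd` is illustrative, not part of TRL's API):

```python
def pack_bfd(seq_lengths, capacity):
    """Sketch of Best Fit Decreasing: sort sequences by length (descending),
    then place each one into the fullest bin it still fits in, opening a
    new bin when none fits. Sequences are never split."""
    bins = []  # each bin: [remaining_capacity, [sequence indices]]
    order = sorted(range(len(seq_lengths)), key=lambda i: -seq_lengths[i])
    for i in order:
        length = seq_lengths[i]
        # best fit = the bin with the smallest remaining capacity >= length
        fitting = [b for b in bins if b[0] >= length]
        if fitting:
            best = min(fitting, key=lambda b: b[0])
            best[0] -= length
            best[1].append(i)
        else:
            bins.append([capacity - length, [i]])
    return [indices for _, indices in bins]

# Same lengths as the example below: [1, 2, 3], [4, 5], [6, 7, 8], [9]
groups = pack_bfd([3, 2, 3, 1], capacity=4)
print(groups)  # [[0, 3], [2], [1]] -- matching the grouping shown below
```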
<ExampleCodeBlock anchor="trl.pack_dataset.example">

Example:

```python
>>> from datasets import Dataset
>>> from trl import pack_dataset
>>> examples = {
...     "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8], [9]],
...     "attention_mask": [[1, 1, 0], [1, 0], [1, 0, 0], [1]],
... }
>>> dataset = Dataset.from_dict(examples)
>>> packed_dataset = pack_dataset(dataset, seq_length=4, strategy="bfd")
>>> packed_dataset[:]
{'input_ids': [[1, 2, 3, 9], [6, 7, 8], [4, 5]],
 'attention_mask': [[1, 1, 0, 1], [1, 0, 0], [1, 0]],
 'seq_lengths': [[3, 1], [3], [2]]}
```

</ExampleCodeBlock>
</div>
## truncate_dataset[[trl.truncate_dataset]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>trl.truncate_dataset</name><anchor>trl.truncate_dataset</anchor><source>https://github.com/huggingface/trl/blob/vr_4331/trl/data_utils.py#L787</source><parameters>[{"name": "dataset", "val": ": ~DatasetType"}, {"name": "max_length", "val": ": int"}, {"name": "map_kwargs", "val": ": dict[str, typing.Any] | None = None"}]</parameters><paramsdesc>- **dataset** ([Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)) --
  Dataset to truncate.
- **max_length** (`int`) --
  Maximum sequence length to truncate to.
- **map_kwargs** (`dict`, *optional*) --
  Additional keyword arguments to pass to the dataset's map method when truncating examples.</paramsdesc><paramgroups>0</paramgroups><rettype>[Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)</rettype><retdesc>The dataset with truncated sequences.</retdesc></docstring>

Truncate sequences in a dataset to a specified `max_length`.

<ExampleCodeBlock anchor="trl.truncate_dataset.example">

Example:

```python
>>> from datasets import Dataset
>>> examples = {
...     "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8]],
...     "attention_mask": [[0, 1, 1], [0, 0, 1, 1], [1]],
... }
>>> dataset = Dataset.from_dict(examples)
>>> truncated_dataset = truncate_dataset(dataset, max_length=2)
>>> truncated_dataset[:]
{'input_ids': [[1, 2], [4, 5], [8]],
 'attention_mask': [[0, 1], [0, 0], [1]]}
```

</ExampleCodeBlock>
</div>