| # Data Utilities | |
| ## prepare_multimodal_messages[[trl.prepare_multimodal_messages]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>trl.prepare_multimodal_messages</name><anchor>trl.prepare_multimodal_messages</anchor><source>https://github.com/huggingface/trl/blob/vr_4305/trl/data_utils.py#L31</source><parameters>[{"name": "messages", "val": ": list"}, {"name": "num_images", "val": ": int"}]</parameters><paramsdesc>- **messages** (`list[dict[str, Any]]`) -- | |
| Messages with `"role"` and `"content"`. Content may be a raw string before transformation. | |
| - **num_images** (`int`) -- | |
| Number of images to include in the first user message. This is used to determine how many image | |
| placeholders to add.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Convert messages into a structured multimodal format if needed. | |
| Each message's content is transformed from a raw string into a list of typed parts. The first user message is | |
| prefixed with an image placeholder, while all other user and assistant messages are wrapped as text entries. | |
| <ExampleCodeBlock anchor="trl.prepare_multimodal_messages.example"> | |
| Example: | |
| ```python | |
| # Input | |
| [ | |
| {"role": "user", "content": "What's in this image?"}, | |
| {"role": "assistant", "content": "It looks like a cat."}, | |
| ] | |
| # Output (num_images=1) | |
| [ | |
| {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What's in this image?"}]}, | |
| {"role": "assistant", "content": [{"type": "text", "text": "It looks like a cat."}]}, | |
| ] | |
| ``` | |
| </ExampleCodeBlock> | |
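The transformation described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the name `sketch_prepare_multimodal_messages` is hypothetical, and leaving already-structured content untouched is an assumption drawn from the "if needed" wording.

```python
from typing import Any


def sketch_prepare_multimodal_messages(
    messages: list[dict[str, Any]], num_images: int
) -> list[dict[str, Any]]:
    """Illustrative sketch only; not the code in trl/data_utils.py."""
    converted = []
    seen_user = False
    for message in messages:
        is_first_user = message["role"] == "user" and not seen_user
        if message["role"] == "user":
            seen_user = True

        content = message["content"]
        if not isinstance(content, str):
            # Assumption: content that is already a list of parts is kept as is.
            converted.append(message)
            continue

        parts = [{"type": "text", "text": content}]
        if is_first_user:
            # Only the first user message receives the image placeholders.
            parts = [{"type": "image"} for _ in range(num_images)] + parts
        converted.append({"role": message["role"], "content": parts})
    return converted


messages = [
    {"role": "user", "content": "What's in this image?"},
    {"role": "assistant", "content": "It looks like a cat."},
]
structured = sketch_prepare_multimodal_messages(messages, num_images=1)
```

With `num_images=1`, `structured` has the same shape as the output shown in the example above.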
| </div> | |
| ## is_conversational[[trl.is_conversational]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>trl.is_conversational</name><anchor>trl.is_conversational</anchor><source>https://github.com/huggingface/trl/blob/vr_4305/trl/data_utils.py#L79</source><parameters>[{"name": "example", "val": ": dict"}]</parameters><paramsdesc>- **example** (`dict[str, Any]`) -- | |
| A single data entry of a dataset. The example can have different keys depending on the dataset type.</paramsdesc><paramgroups>0</paramgroups><rettype>`bool`</rettype><retdesc>`True` if the data is in a conversational format, `False` otherwise.</retdesc></docstring> | |
| Check if the example is in a conversational format. | |
| <ExampleCodeBlock anchor="trl.is_conversational.example"> | |
| Examples: | |
| ```python | |
| >>> from trl import is_conversational | |
| >>> example = {"prompt": [{"role": "user", "content": "What color is the sky?"}]} | |
| >>> is_conversational(example) | |
| True | |
| >>> example = {"prompt": "The sky is"} | |
| >>> is_conversational(example) | |
| False | |
| ``` | |
| </ExampleCodeBlock> | |
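One way to picture the check is the sketch below. It is an assumption about the logic, not the library's code: the helper name `sketch_is_conversational` is hypothetical, and the candidate keys are taken from the dataset types listed under `maybe_apply_chat_template` on this page.

```python
from typing import Any

# Assumed candidate keys, borrowed from the dataset types documented on this page.
CANDIDATE_KEYS = ("prompt", "chosen", "rejected", "completion", "messages")


def sketch_is_conversational(example: dict[str, Any]) -> bool:
    """Illustrative sketch only: True if a known key holds role/content messages."""
    for key in CANDIDATE_KEYS:
        value = example.get(key)
        if isinstance(value, list) and value:
            first = value[0]
            if isinstance(first, dict) and "role" in first and "content" in first:
                return True
    return False


chat_example = {"prompt": [{"role": "user", "content": "What color is the sky?"}]}
text_example = {"prompt": "The sky is"}
results = (sketch_is_conversational(chat_example), sketch_is_conversational(text_example))
```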
| </div> | |
| ## is_conversational_from_value[[trl.is_conversational_from_value]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>trl.is_conversational_from_value</name><anchor>trl.is_conversational_from_value</anchor><source>https://github.com/huggingface/trl/blob/vr_4305/trl/data_utils.py#L782</source><parameters>[{"name": "example", "val": ": dict"}]</parameters><paramsdesc>- **example** (`dict[str, Any]`) -- | |
| A single data entry of a dataset. The example can have different keys depending on the dataset type.</paramsdesc><paramgroups>0</paramgroups><rettype>`bool`</rettype><retdesc>`True` if the data is in a conversational from/value format, `False` otherwise.</retdesc></docstring> | |
| Check if the example is in a conversational format (from/value). Note that this format isn't recommended. Prefer | |
| the ChatML format (role/content). | |
| <ExampleCodeBlock anchor="trl.is_conversational_from_value.example"> | |
| Examples: | |
| ```python | |
| >>> from trl import is_conversational_from_value | |
| >>> example = {"conversations": [{"from": "user", "value": "What color is the sky?"}]} | |
| >>> is_conversational_from_value(example) | |
| True | |
| >>> example = {"conversations": [{"role": "user", "content": "What color is the sky?"}]} | |
| >>> is_conversational_from_value(example) | |
| False | |
| >>> example = {"conversations": "The sky is"} | |
| >>> is_conversational_from_value(example) | |
| False | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| ## apply_chat_template[[trl.apply_chat_template]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>trl.apply_chat_template</name><anchor>trl.apply_chat_template</anchor><source>https://github.com/huggingface/trl/blob/vr_4305/trl/data_utils.py#L120</source><parameters>[{"name": "example", "val": ": dict"}, {"name": "tokenizer", "val": ": typing.Union[transformers.tokenization_utils_base.PreTrainedTokenizerBase, transformers.processing_utils.ProcessorMixin]"}, {"name": "tools", "val": ": typing.Optional[list[typing.Union[dict, typing.Callable]]] = None"}, {"name": "**template_kwargs", "val": ""}]</parameters></docstring> | |
| Apply a chat template to a conversational example along with the schema for a list of functions in `tools`. | |
| For more details, see [maybe_apply_chat_template()](/docs/trl/pr_4305/en/data_utils#trl.maybe_apply_chat_template). | |
| </div> | |
| ## maybe_apply_chat_template[[trl.maybe_apply_chat_template]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>trl.maybe_apply_chat_template</name><anchor>trl.maybe_apply_chat_template</anchor><source>https://github.com/huggingface/trl/blob/vr_4305/trl/data_utils.py#L249</source><parameters>[{"name": "example", "val": ": dict"}, {"name": "tokenizer", "val": ": PreTrainedTokenizerBase"}, {"name": "tools", "val": ": typing.Optional[list[typing.Union[dict, typing.Callable]]] = None"}, {"name": "**template_kwargs", "val": ": typing.Any"}]</parameters><paramsdesc>- **example** (`dict[str, list[dict[str, str]]]`) -- | |
| Dictionary representing a single data entry of a conversational dataset. Each data entry can have different | |
| keys depending on the dataset type. The supported dataset types are: | |
| - Language modeling dataset: `"messages"`. | |
| - Prompt-only dataset: `"prompt"`. | |
| - Prompt-completion dataset: `"prompt"` and `"completion"`. | |
| - Preference dataset: `"prompt"`, `"chosen"`, and `"rejected"`. | |
| - Preference dataset with implicit prompt: `"chosen"` and `"rejected"`. | |
| - Unpaired preference dataset: `"prompt"`, `"completion"`, and `"label"`. | |
| For keys `"messages"`, `"prompt"`, `"chosen"`, `"rejected"`, and `"completion"`, the values are lists of | |
| messages, where each message is a dictionary with keys `"role"` and `"content"`. Additionally, the example | |
| may contain a `"chat_template_kwargs"` key, which is a dictionary of additional keyword arguments to pass | |
| to the chat template renderer. | |
| - **tokenizer** ([PreTrainedTokenizerBase](https://huggingface.co/docs/transformers/main/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase)) -- | |
| Tokenizer to apply the chat template with. | |
| - **tools** (`list[Union[dict, Callable]]`, *optional*) -- | |
| A list of tools (callable functions) that will be accessible to the model. If the template does not support | |
| function calling, this argument will have no effect. | |
| - **\*\*template_kwargs** (`Any`, *optional*) -- | |
| Additional kwargs to pass to the template renderer. Will be accessible by the chat template.</paramsdesc><paramgroups>0</paramgroups><rettype>`dict[str, str]`</rettype><retdesc>Formatted example with the chat template applied.</retdesc></docstring> | |
| If the example is in a conversational format, apply a chat template to it. | |
| Notes: | |
| - This function does not alter the keys, except for language modeling datasets, where `"messages"` is replaced | |
| by `"text"`. | |
| - For prompt-only data, if the last role is `"user"`, the generation prompt is appended to the prompt. | |
| If the last role is `"assistant"`, the final message is continued instead. | |
| <ExampleCodeBlock anchor="trl.maybe_apply_chat_template.example"> | |
| Example: | |
| ```python | |
| >>> from transformers import AutoTokenizer | |
| >>> from trl import maybe_apply_chat_template | |
| >>> tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct") | |
| >>> example = { | |
| ... "prompt": [{"role": "user", "content": "What color is the sky?"}], | |
| ... "completion": [{"role": "assistant", "content": "It is blue."}], | |
| ... } | |
| >>> maybe_apply_chat_template(example, tokenizer) | |
| {'prompt': '<|user|>\nWhat color is the sky?<|end|>\n<|assistant|>\n', 'completion': 'It is blue.<|end|>\n'} | |
| ``` | |
| </ExampleCodeBlock> | |
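The second note above (generation prompt versus continuing the final message) can be sketched as a small decision function. The helper name `sketch_prompt_template_flags` is hypothetical; the returned flag names mirror the `add_generation_prompt` and `continue_final_message` keyword arguments of the tokenizer's `apply_chat_template` method in Transformers. Raising on any other role is a choice made for this sketch, not documented behavior.

```python
def sketch_prompt_template_flags(prompt: list[dict[str, str]]) -> dict[str, bool]:
    """Illustrative sketch: choose how a prompt should be rendered based on its last role."""
    last_role = prompt[-1]["role"]
    if last_role == "user":
        # The model is expected to answer next, so the generation prompt is added.
        return {"add_generation_prompt": True, "continue_final_message": False}
    if last_role == "assistant":
        # The assistant turn is left open so the model can continue it.
        return {"add_generation_prompt": False, "continue_final_message": True}
    # Assumption of this sketch: other last roles are rejected.
    raise ValueError(f"Unsupported last role: {last_role!r}")


user_last = sketch_prompt_template_flags(
    [{"role": "user", "content": "What color is the sky?"}]
)
assistant_last = sketch_prompt_template_flags(
    [
        {"role": "user", "content": "What color is the sky?"},
        {"role": "assistant", "content": "It is"},
    ]
)
```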
| </div> | |
| ## maybe_convert_to_chatml[[trl.maybe_convert_to_chatml]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>trl.maybe_convert_to_chatml</name><anchor>trl.maybe_convert_to_chatml</anchor><source>https://github.com/huggingface/trl/blob/vr_4305/trl/data_utils.py#L822</source><parameters>[{"name": "example", "val": ": dict"}]</parameters><paramsdesc>- **example** (`dict[str, list]`) -- | |
| A single data entry containing a list of messages.</paramsdesc><paramgroups>0</paramgroups><rettype>`dict[str, list]`</rettype><retdesc>Example reformatted to ChatML style.</retdesc></docstring> | |
| Convert a conversational dataset with fields `from` and `value` to ChatML format. | |
| This function modifies conversational data to align with OpenAI's ChatML format: | |
| - Replaces the key `"from"` with `"role"` in message dictionaries. | |
| - Replaces the key `"value"` with `"content"` in message dictionaries. | |
| - Renames `"conversations"` to `"messages"` for consistency with ChatML. | |
| <ExampleCodeBlock anchor="trl.maybe_convert_to_chatml.example"> | |
| Example: | |
| ```python | |
| >>> from trl import maybe_convert_to_chatml | |
| >>> example = { | |
| ... "conversations": [ | |
| ... {"from": "user", "value": "What color is the sky?"}, | |
| ... {"from": "assistant", "value": "It is blue."}, | |
| ... ] | |
| ... } | |
| >>> maybe_convert_to_chatml(example) | |
| {'messages': [{'role': 'user', 'content': 'What color is the sky?'}, | |
| {'role': 'assistant', 'content': 'It is blue.'}]} | |
| ``` | |
| </ExampleCodeBlock> | |
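The three renaming steps listed above can be sketched as follows. This is a simplified illustration with a hypothetical name (`sketch_convert_to_chatml`); it only handles the `"conversations"` and `"messages"` columns and may not cover everything the library function does.

```python
from typing import Any

# Key renames applied inside each message dictionary.
RENAMES = {"from": "role", "value": "content"}


def sketch_convert_to_chatml(example: dict[str, Any]) -> dict[str, Any]:
    """Illustrative sketch only; returns a new dict and leaves the input untouched."""
    converted = dict(example)  # shallow copy
    if "conversations" in converted:
        converted["messages"] = converted.pop("conversations")

    messages = converted.get("messages")
    if isinstance(messages, list):
        renamed = []
        for message in messages:
            if isinstance(message, dict) and "from" in message and "value" in message:
                message = {RENAMES.get(key, key): val for key, val in message.items()}
            renamed.append(message)
        converted["messages"] = renamed
    return converted


example = {
    "conversations": [
        {"from": "user", "value": "What color is the sky?"},
        {"from": "assistant", "value": "It is blue."},
    ]
}
chatml = sketch_convert_to_chatml(example)
```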
| </div> | |
| ## extract_prompt[[trl.extract_prompt]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>trl.extract_prompt</name><anchor>trl.extract_prompt</anchor><source>https://github.com/huggingface/trl/blob/vr_4305/trl/data_utils.py#L418</source><parameters>[{"name": "example", "val": ": dict"}]</parameters></docstring> | |
| Extracts the shared prompt from a preference data example, where the prompt is implicit within both the chosen and | |
| rejected completions. | |
| For more details, see [maybe_extract_prompt()](/docs/trl/pr_4305/en/data_utils#trl.maybe_extract_prompt). | |
| </div> | |
| ## maybe_extract_prompt[[trl.maybe_extract_prompt]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>trl.maybe_extract_prompt</name><anchor>trl.maybe_extract_prompt</anchor><source>https://github.com/huggingface/trl/blob/vr_4305/trl/data_utils.py#L437</source><parameters>[{"name": "example", "val": ": dict"}]</parameters><paramsdesc>- **example** (`dict[str, list]`) -- | |
| A dictionary representing a single data entry in the preference dataset. It must contain the keys | |
| `"chosen"` and `"rejected"`, where each value is either conversational or standard (`str`).</paramsdesc><paramgroups>0</paramgroups><rettype>`dict[str, list]`</rettype><retdesc>A dictionary containing: | |
| - `"prompt"`: The longest common prefix between the "chosen" and "rejected" completions. | |
| - `"chosen"`: The remainder of the "chosen" completion, with the prompt removed. | |
| - `"rejected"`: The remainder of the "rejected" completion, with the prompt removed.</retdesc></docstring> | |
| Extracts the shared prompt from a preference data example, where the prompt is implicit within both the chosen and | |
| rejected completions. | |
| If the example already contains a `"prompt"` key, the function returns the example as is. Else, the function | |
| identifies the longest common sequence (prefix) of conversation turns between the "chosen" and "rejected" | |
| completions and extracts this as the prompt. It then removes this prompt from the respective "chosen" and | |
| "rejected" completions. | |
| <ExampleCodeBlock anchor="trl.maybe_extract_prompt.example"> | |
| Examples: | |
| ```python | |
| >>> from trl import extract_prompt | |
| >>> example = { | |
| ... "chosen": [ | |
| ... {"role": "user", "content": "What color is the sky?"}, | |
| ... {"role": "assistant", "content": "It is blue."}, | |
| ... ], | |
| ... "rejected": [ | |
| ... {"role": "user", "content": "What color is the sky?"}, | |
| ... {"role": "assistant", "content": "It is green."}, | |
| ... ], | |
| ... } | |
| >>> extract_prompt(example) | |
| {'prompt': [{'role': 'user', 'content': 'What color is the sky?'}], | |
| 'chosen': [{'role': 'assistant', 'content': 'It is blue.'}], | |
| 'rejected': [{'role': 'assistant', 'content': 'It is green.'}]} | |
| ``` | |
| </ExampleCodeBlock> | |
| <ExampleCodeBlock anchor="trl.maybe_extract_prompt.example-2"> | |
| Or, with the `map` method of [Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset): | |
| ```python | |
| >>> from trl import extract_prompt | |
| >>> from datasets import Dataset | |
| >>> dataset_dict = { | |
| ... "chosen": [ | |
| ... [ | |
| ... {"role": "user", "content": "What color is the sky?"}, | |
| ... {"role": "assistant", "content": "It is blue."}, | |
| ... ], | |
| ... [ | |
| ... {"role": "user", "content": "Where is the sun?"}, | |
| ... {"role": "assistant", "content": "In the sky."}, | |
| ... ], | |
| ... ], | |
| ... "rejected": [ | |
| ... [ | |
| ... {"role": "user", "content": "What color is the sky?"}, | |
| ... {"role": "assistant", "content": "It is green."}, | |
| ... ], | |
| ... [ | |
| ... {"role": "user", "content": "Where is the sun?"}, | |
| ... {"role": "assistant", "content": "In the sea."}, | |
| ... ], | |
| ... ], | |
| ... } | |
| >>> dataset = Dataset.from_dict(dataset_dict) | |
| >>> dataset = dataset.map(extract_prompt) | |
| >>> dataset[0] | |
| {'prompt': [{'role': 'user', 'content': 'What color is the sky?'}], | |
| 'chosen': [{'role': 'assistant', 'content': 'It is blue.'}], | |
| 'rejected': [{'role': 'assistant', 'content': 'It is green.'}]} | |
| ``` | |
| </ExampleCodeBlock> | |
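For conversational data, the longest-common-prefix step described above can be sketched like this. The function name `sketch_extract_prompt` is hypothetical, and the sketch compares whole turns only; any special handling the library applies to plain-string examples is not reproduced here.

```python
from typing import Any


def sketch_extract_prompt(example: dict[str, Any]) -> dict[str, Any]:
    """Illustrative sketch: split off the turns shared by "chosen" and "rejected"."""
    if "prompt" in example:
        # An explicit prompt is already present: return the example unchanged.
        return example

    chosen, rejected = example["chosen"], example["rejected"]
    split = 0
    for chosen_turn, rejected_turn in zip(chosen, rejected):
        if chosen_turn != rejected_turn:
            break
        split += 1

    return {
        **example,
        "prompt": chosen[:split],
        "chosen": chosen[split:],
        "rejected": rejected[split:],
    }


example = {
    "chosen": [
        {"role": "user", "content": "What color is the sky?"},
        {"role": "assistant", "content": "It is blue."},
    ],
    "rejected": [
        {"role": "user", "content": "What color is the sky?"},
        {"role": "assistant", "content": "It is green."},
    ],
}
extracted = sketch_extract_prompt(example)
```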
| </div> | |
| ## unpair_preference_dataset[[trl.unpair_preference_dataset]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>trl.unpair_preference_dataset</name><anchor>trl.unpair_preference_dataset</anchor><source>https://github.com/huggingface/trl/blob/vr_4305/trl/data_utils.py#L324</source><parameters>[{"name": "dataset", "val": ": ~DatasetType"}, {"name": "num_proc", "val": ": typing.Optional[int] = None"}, {"name": "desc", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **dataset** ([Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)) -- | |
| Preference dataset to unpair. The dataset must have columns `"chosen"`, `"rejected"` and optionally | |
| `"prompt"`. | |
| - **num_proc** (`int`, *optional*) -- | |
| Number of processes to use for processing the dataset. | |
| - **desc** (`str`, *optional*) -- | |
| Meaningful description to be displayed alongside the progress bar while mapping examples.</paramsdesc><paramgroups>0</paramgroups><rettype>[Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset)</rettype><retdesc>The unpaired preference dataset.</retdesc></docstring> | |
| Unpair a preference dataset: each paired row is split into one `"chosen"` completion labeled `True` and one `"rejected"` completion labeled `False`. | |
| <ExampleCodeBlock anchor="trl.unpair_preference_dataset.example"> | |
| Example: | |
| ```python | |
| >>> from datasets import Dataset | |
| >>> from trl import unpair_preference_dataset | |
| >>> dataset_dict = { | |
| ... "prompt": ["The sky is", "The sun is"], | |
| ... "chosen": [" blue.", "in the sky."], | |
| ... "rejected": [" green.", " in the sea."], | |
| ... } | |
| >>> dataset = Dataset.from_dict(dataset_dict) | |
| >>> dataset = unpair_preference_dataset(dataset) | |
| >>> dataset | |
| Dataset({ | |
| features: ['prompt', 'completion', 'label'], | |
| num_rows: 4 | |
| }) | |
| >>> dataset[0] | |
| {'prompt': 'The sky is', 'completion': ' blue.', 'label': True} | |
| ``` | |
| </ExampleCodeBlock> | |
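The column transformation can be sketched on a plain dictionary of lists, without the `datasets` library. The name `sketch_unpair` is hypothetical, and the row order (all chosen completions first, then all rejected ones) is an assumption of this sketch.

```python
def sketch_unpair(batch: dict[str, list]) -> dict[str, list]:
    """Illustrative sketch: turn chosen/rejected pairs into labeled completions."""
    num_pairs = len(batch["chosen"])
    unpaired = {
        "completion": batch["chosen"] + batch["rejected"],
        "label": [True] * num_pairs + [False] * num_pairs,
    }
    if "prompt" in batch:
        # Each prompt is repeated once for its chosen and once for its rejected completion.
        unpaired = {"prompt": batch["prompt"] + batch["prompt"], **unpaired}
    return unpaired


paired = {
    "prompt": ["The sky is", "The sun is"],
    "chosen": [" blue.", "in the sky."],
    "rejected": [" green.", " in the sea."],
}
unpaired = sketch_unpair(paired)
first_row = {column: values[0] for column, values in unpaired.items()}
```

Two paired rows become four unpaired rows, matching the `num_rows: 4` shown in the example above.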
| </div> | |
| ## maybe_unpair_preference_dataset[[trl.maybe_unpair_preference_dataset]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>trl.maybe_unpair_preference_dataset</name><anchor>trl.maybe_unpair_preference_dataset</anchor><source>https://github.com/huggingface/trl/blob/vr_4305/trl/data_utils.py#L367</source><parameters>[{"name": "dataset", "val": ": ~DatasetType"}, {"name": "num_proc", "val": ": typing.Optional[int] = None"}, {"name": "desc", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **dataset** ([Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)) -- | |
| Preference dataset to unpair. The dataset must have columns `"chosen"`, `"rejected"` and optionally | |
| `"prompt"`. | |
| - **num_proc** (`int`, *optional*) -- | |
| Number of processes to use for processing the dataset. | |
| - **desc** (`str`, *optional*) -- | |
| Meaningful description to be displayed alongside the progress bar while mapping examples.</paramsdesc><paramgroups>0</paramgroups><rettype>[Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)</rettype><retdesc>The unpaired preference dataset if it was paired, otherwise | |
| the original dataset.</retdesc></docstring> | |
| Unpair a preference dataset if it is paired. | |
| <ExampleCodeBlock anchor="trl.maybe_unpair_preference_dataset.example"> | |
| Example: | |
| ```python | |
| >>> from datasets import Dataset | |
| >>> from trl import maybe_unpair_preference_dataset | |
| >>> dataset_dict = { | |
| ... "prompt": ["The sky is", "The sun is"], | |
| ... "chosen": [" blue.", "in the sky."], | |
| ... "rejected": [" green.", " in the sea."], | |
| ... } | |
| >>> dataset = Dataset.from_dict(dataset_dict) | |
| >>> dataset = maybe_unpair_preference_dataset(dataset) | |
| >>> dataset | |
| Dataset({ | |
| features: ['prompt', 'completion', 'label'], | |
| num_rows: 4 | |
| }) | |
| >>> dataset[0] | |
| {'prompt': 'The sky is', 'completion': ' blue.', 'label': True} | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| ## pack_dataset[[trl.pack_dataset]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>trl.pack_dataset</name><anchor>trl.pack_dataset</anchor><source>https://github.com/huggingface/trl/blob/vr_4305/trl/data_utils.py#L661</source><parameters>[{"name": "dataset", "val": ": ~DatasetType"}, {"name": "seq_length", "val": ": int"}, {"name": "strategy", "val": ": str = 'bfd'"}, {"name": "map_kwargs", "val": ": typing.Optional[dict[str, typing.Any]] = None"}]</parameters><paramsdesc>- **dataset** ([Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)) -- | |
| Dataset to pack. | |
| - **seq_length** (`int`) -- | |
| Target sequence length to pack to. | |
| - **strategy** (`str`, *optional*, defaults to `"bfd"`) -- | |
| Packing strategy to use. Can be either: | |
| - `"bfd"` (Best Fit Decreasing): Slower but preserves sequence boundaries. Sequences are never cut in the | |
| middle. | |
| - `"wrapped"`: Faster but more aggressive. Ignores sequence boundaries and will cut sequences in the middle | |
| to completely fill each packed sequence with data. | |
| - **map_kwargs** (`dict`, *optional*) -- | |
| Additional keyword arguments to pass to the dataset's map method when packing examples.</paramsdesc><paramgroups>0</paramgroups><rettype>[Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)</rettype><retdesc>The dataset with packed sequences. The number of examples | |
| may decrease as sequences are combined.</retdesc></docstring> | |
| Pack sequences in a dataset into chunks of size `seq_length`. | |
| <ExampleCodeBlock anchor="trl.pack_dataset.example"> | |
| Example: | |
| ```python | |
| >>> from datasets import Dataset | |
| >>> from trl import pack_dataset | |
| >>> examples = { | |
| ... "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8], [9]], | |
| ... "attention_mask": [[1, 1, 0], [1, 0], [1, 0, 0], [1]], | |
| ... } | |
| >>> dataset = Dataset.from_dict(examples) | |
| >>> packed_dataset = pack_dataset(dataset, seq_length=4, strategy="bfd") | |
| >>> packed_dataset[:] | |
| {'input_ids': [[1, 2, 3, 9], [6, 7, 8], [4, 5]], | |
| 'attention_mask': [[1, 1, 0, 1], [1, 0, 0], [1, 0]], | |
| 'seq_lengths': [[3, 1], [3], [2]]} | |
| ``` | |
| </ExampleCodeBlock> | |
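To make the two strategies concrete, here is a minimal pure-Python sketch of each, operating on a list of token-id lists. The function names are hypothetical, and the sketch is a textbook Best Fit Decreasing rather than the library's implementation. It assumes every sequence fits within `seq_length` and raises otherwise, which is a choice made for this sketch.

```python
def sketch_pack_bfd(sequences: list[list[int]], seq_length: int) -> dict[str, list]:
    """Best Fit Decreasing: place each sequence, longest first, in the tightest bin."""
    # sorted() is stable, so sequences of equal length keep their input order.
    ordered = sorted(sequences, key=len, reverse=True)
    bins: list[dict[str, list[int]]] = []
    for seq in ordered:
        if len(seq) > seq_length:
            raise ValueError("sequence longer than seq_length")
        fitting = [b for b in bins if seq_length - len(b["input_ids"]) >= len(seq)]
        if fitting:
            # Best fit: the bin with the least remaining space that still fits.
            target = min(fitting, key=lambda b: seq_length - len(b["input_ids"]))
        else:
            target = {"input_ids": [], "seq_lengths": []}
            bins.append(target)
        target["input_ids"].extend(seq)
        target["seq_lengths"].append(len(seq))
    return {
        "input_ids": [b["input_ids"] for b in bins],
        "seq_lengths": [b["seq_lengths"] for b in bins],
    }


def sketch_pack_wrapped(sequences: list[list[int]], seq_length: int) -> list[list[int]]:
    """Wrapped: concatenate everything, then cut into fixed-size chunks."""
    flat = [token for seq in sequences for token in seq]
    return [flat[i : i + seq_length] for i in range(0, len(flat), seq_length)]


input_ids = [[1, 2, 3], [4, 5], [6, 7, 8], [9]]
bfd = sketch_pack_bfd(input_ids, seq_length=4)
wrapped = sketch_pack_wrapped(input_ids, seq_length=4)
```

On this input the BFD sketch never splits a sequence, while the wrapped sketch cuts `[4, 5]` across two chunks to fill the first one completely.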
| </div> | |
| ## truncate_dataset[[trl.truncate_dataset]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>trl.truncate_dataset</name><anchor>trl.truncate_dataset</anchor><source>https://github.com/huggingface/trl/blob/vr_4305/trl/data_utils.py#L717</source><parameters>[{"name": "dataset", "val": ": ~DatasetType"}, {"name": "max_length", "val": ": int"}, {"name": "map_kwargs", "val": ": typing.Optional[dict[str, typing.Any]] = None"}]</parameters><paramsdesc>- **dataset** ([Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)) -- | |
| Dataset to truncate. | |
| - **max_length** (`int`) -- | |
| Maximum sequence length to truncate to. | |
| - **map_kwargs** (`dict`, *optional*) -- | |
| Additional keyword arguments to pass to the dataset's map method when truncating examples.</paramsdesc><paramgroups>0</paramgroups><rettype>[Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)</rettype><retdesc>The dataset with truncated sequences.</retdesc></docstring> | |
| Truncate sequences in a dataset to a specified `max_length`. | |
| <ExampleCodeBlock anchor="trl.truncate_dataset.example"> | |
| Example: | |
| ```python | |
| >>> from datasets import Dataset | |
| >>> from trl import truncate_dataset | |
| >>> examples = { | |
| ... "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8]], | |
| ... "attention_mask": [[0, 1, 1], [0, 0, 1, 1], [1]], | |
| ... } | |
| >>> dataset = Dataset.from_dict(examples) | |
| >>> truncated_dataset = truncate_dataset(dataset, max_length=2) | |
| >>> truncated_dataset[:] | |
| {'input_ids': [[1, 2], [4, 5], [8]], | |
| 'attention_mask': [[0, 1], [0, 0], [1]]} | |
| ``` | |
| </ExampleCodeBlock> | |
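Truncation amounts to slicing every sequence column to its first `max_length` items. The sketch below shows this on a plain dictionary of lists; the name `sketch_truncate` is hypothetical, and passing non-list values through unchanged is an assumption of the sketch.

```python
from typing import Any


def sketch_truncate(batch: dict[str, list[Any]], max_length: int) -> dict[str, list[Any]]:
    """Illustrative sketch: keep the first max_length items of each list value."""
    return {
        column: [
            value[:max_length] if isinstance(value, list) else value
            for value in values
        ]
        for column, values in batch.items()
    }


examples = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8]],
    "attention_mask": [[0, 1, 1], [0, 0, 1, 1], [1]],
}
truncated = sketch_truncate(examples, max_length=2)
```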
| </div> | |