# Data Utilities
## prepare_multimodal_messages[[trl.prepare_multimodal_messages]]
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>trl.prepare_multimodal_messages</name><anchor>trl.prepare_multimodal_messages</anchor><source>https://github.com/huggingface/trl/blob/vr_4331/trl/data_utils.py#L32</source><parameters>[{"name": "messages", "val": ": list"}, {"name": "images", "val": ": list"}]</parameters><paramsdesc>- **messages** (`list[dict[str, Any]]`) --
List of messages, where each message is a dictionary with a `"role"` key (`"system"`, `"user"`, or `"assistant"`) and
a `"content"` key containing either a raw string or, if already prepared, a list of structured content blocks.
- **images** (`list`) --
List of image objects to insert.</paramsdesc><paramgroups>0</paramgroups><rettype>`list[dict[str, Any]]`</rettype><retdesc>A deep-copied list of messages where every `"content"` value is a list of structured
content blocks, and all `"image"` placeholders are populated with the corresponding image objects.</retdesc></docstring>
Convert messages into a structured multimodal format and inject the provided images into the message contents.
Notes:
- When the input `messages` isn't already in the structured format (i.e., all `"content"` values are strings),
the function transforms them into the structured format by wrapping text in `{"type": "text", "text": ...}`
and inserting `{"type": "image"}` placeholders for the images *before* the first user message.
- When the input `messages` is already in the structured format (i.e., all `"content"` values are lists of
structured blocks), the function only fills in the actual images in the existing `{"type": "image"}`
placeholders. If the number of placeholders does not match the number of provided images, an error is raised.
<ExampleCodeBlock anchor="trl.prepare_multimodal_messages.example">
Example:
```python
# Input
[
{"role": "user", "content": "What's in this image?"},
{"role": "assistant", "content": "It looks like a cat."},
]
# Output, one image provided
[
{"role": "user", "content": [{"type": "image", "image": <PIL.Image.Image>}, {"type": "text", "text": "What's in this image?"}]},
{"role": "assistant", "content": [{"type": "text", "text": "It looks like a cat."}]},
]
```
</ExampleCodeBlock>
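The placeholder-filling path described in the notes can be sketched with plain dictionaries. This is a simplified illustration of the documented behavior (including the count-mismatch error), not trl's actual implementation:

```python
import copy

def fill_image_placeholders(messages, images):
    # Simplified sketch: fill existing {"type": "image"} blocks in order,
    # raising when the placeholder and image counts disagree.
    messages = copy.deepcopy(messages)
    placeholders = [
        block
        for message in messages
        for block in message["content"]
        if block["type"] == "image"
    ]
    if len(placeholders) != len(images):
        raise ValueError(
            f"Got {len(images)} images for {len(placeholders)} placeholders"
        )
    for block, image in zip(placeholders, images):
        block["image"] = image
    return messages

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What's in this image?"},
    ]}
]
filled = fill_image_placeholders(messages, ["<image object>"])
```

The deep copy mirrors the documented return value: the input list is left untouched.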
</div>
## prepare_multimodal_messages_vllm[[trl.prepare_multimodal_messages_vllm]]
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>trl.prepare_multimodal_messages_vllm</name><anchor>trl.prepare_multimodal_messages_vllm</anchor><source>https://github.com/huggingface/trl/blob/vr_4331/trl/data_utils.py#L112</source><parameters>[{"name": "messages", "val": ": list"}]</parameters><paramsdesc>- **messages** (`list[dict[str, Any]]`) --
Messages with `"role"` and `"content"`. Content is expected to be a list of structured blocks.</paramsdesc><paramgroups>0</paramgroups><rettype>`list[dict[str, Any]]`</rettype><retdesc>A deep-copied list of messages compatible with vLLM's expected input format.</retdesc></docstring>
Convert structured multimodal messages into a format compatible with vLLM. Replaces `"type": "image"` blocks with
`"type": "image_pil"` blocks, and `"image": Image` with `"image_pil": Image`.
<ExampleCodeBlock anchor="trl.prepare_multimodal_messages_vllm.example">
Example:
```python
# Input
[{"role": "user", "content": [{"type": "image", "image": <PIL.Image.Image>}, {"type": "text", "text": "What's in this image?"}]}]
# Output
[{"role": "user", "content": [{"type": "image_pil", "image_pil": <PIL.Image.Image>}, {"type": "text", "text": "What's in this image?"}]}]
```
</ExampleCodeBlock>
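The conversion is a per-block key rename; a minimal sketch of the documented behavior (not trl's actual code):

```python
import copy

def to_vllm_blocks(messages):
    # Sketch of the documented renaming: "image" blocks become "image_pil"
    # blocks, and the "image" key becomes "image_pil".
    messages = copy.deepcopy(messages)
    for message in messages:
        for block in message["content"]:
            if block.get("type") == "image":
                block["type"] = "image_pil"
                block["image_pil"] = block.pop("image")
    return messages

msgs = [{"role": "user", "content": [
    {"type": "image", "image": "img"},
    {"type": "text", "text": "Hi"},
]}]
converted = to_vllm_blocks(msgs)
```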
</div>
## is_conversational[[trl.is_conversational]]
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>trl.is_conversational</name><anchor>trl.is_conversational</anchor><source>https://github.com/huggingface/trl/blob/vr_4331/trl/data_utils.py#L145</source><parameters>[{"name": "example", "val": ": dict"}]</parameters><paramsdesc>- **example** (`dict[str, Any]`) --
A single data entry of a dataset. The example can have different keys depending on the dataset type.</paramsdesc><paramgroups>0</paramgroups><rettype>`bool`</rettype><retdesc>`True` if the data is in a conversational format, `False` otherwise.</retdesc></docstring>
Check if the example is in a conversational format.
<ExampleCodeBlock anchor="trl.is_conversational.example">
Examples:
```python
>>> example = {"prompt": [{"role": "user", "content": "What color is the sky?"}]}
>>> is_conversational(example)
True
>>> example = {"prompt": "The sky is"}
>>> is_conversational(example)
False
```
</ExampleCodeBlock>
</div>
## is_conversational_from_value[[trl.is_conversational_from_value]]
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>trl.is_conversational_from_value</name><anchor>trl.is_conversational_from_value</anchor><source>https://github.com/huggingface/trl/blob/vr_4331/trl/data_utils.py#L850</source><parameters>[{"name": "example", "val": ": dict"}]</parameters><paramsdesc>- **example** (`dict[str, Any]`) --
A single data entry of a dataset. The example can have different keys depending on the dataset type.</paramsdesc><paramgroups>0</paramgroups><rettype>`bool`</rettype><retdesc>`True` if the data is in the conversational (from/value) format, `False` otherwise.</retdesc></docstring>
Check if the example is in a conversational format (from/value). Note that this format isn't recommended; prefer
the ChatML format (role/content).
<ExampleCodeBlock anchor="trl.is_conversational_from_value.example">
Examples:
```python
>>> example = {"conversations": [{"from": "user", "value": "What color is the sky?"}]}
>>> is_conversational_from_value(example)
True
>>> example = {"conversations": [{"role": "user", "content": "What color is the sky?"}]}
>>> is_conversational_from_value(example)
False
>>> example = {"conversations": "The sky is"}
>>> is_conversational_from_value(example)
False
```
</ExampleCodeBlock>
</div>
## apply_chat_template[[trl.apply_chat_template]]
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>trl.apply_chat_template</name><anchor>trl.apply_chat_template</anchor><source>https://github.com/huggingface/trl/blob/vr_4331/trl/data_utils.py#L186</source><parameters>[{"name": "example", "val": ": dict"}, {"name": "tokenizer", "val": ": transformers.tokenization_utils_base.PreTrainedTokenizerBase | transformers.processing_utils.ProcessorMixin"}, {"name": "tools", "val": ": list[dict | collections.abc.Callable] | None = None"}, {"name": "**template_kwargs", "val": ""}]</parameters></docstring>
Apply a chat template to a conversational example along with the schema for a list of functions in `tools`.
For more details, see [maybe_apply_chat_template()](/docs/trl/pr_4331/en/data_utils#trl.maybe_apply_chat_template).
</div>
## maybe_apply_chat_template[[trl.maybe_apply_chat_template]]
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>trl.maybe_apply_chat_template</name><anchor>trl.maybe_apply_chat_template</anchor><source>https://github.com/huggingface/trl/blob/vr_4331/trl/data_utils.py#L319</source><parameters>[{"name": "example", "val": ": dict"}, {"name": "tokenizer", "val": ": PreTrainedTokenizerBase"}, {"name": "tools", "val": ": list[dict | collections.abc.Callable] | None = None"}, {"name": "**template_kwargs", "val": ": typing.Any"}]</parameters><paramsdesc>- **example** (`dict[str, list[dict[str, str]]]`) --
Dictionary representing a single data entry of a conversational dataset. Each data entry can have different
keys depending on the dataset type. The supported dataset types are:
- Language modeling dataset: `"messages"`.
- Prompt-only dataset: `"prompt"`.
- Prompt-completion dataset: `"prompt"` and `"completion"`.
- Preference dataset: `"prompt"`, `"chosen"`, and `"rejected"`.
- Preference dataset with implicit prompt: `"chosen"` and `"rejected"`.
- Unpaired preference dataset: `"prompt"`, `"completion"`, and `"label"`.
For keys `"messages"`, `"prompt"`, `"chosen"`, `"rejected"`, and `"completion"`, the values are lists of
messages, where each message is a dictionary with keys `"role"` and `"content"`. Additionally, the example
may contain a `"chat_template_kwargs"` key, which is a dictionary of additional keyword arguments to pass
to the chat template renderer.
- **tokenizer** ([PreTrainedTokenizerBase](https://huggingface.co/docs/transformers/main/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase)) --
Tokenizer to apply the chat template with.
- **tools** (`list[dict | Callable]`, *optional*) --
A list of tools (callable functions) that will be accessible to the model. If the template does not support
function calling, this argument will have no effect.
- ****template_kwargs** (`Any`, *optional*) --
Additional kwargs to pass to the template renderer. Will be accessible by the chat template.</paramsdesc><paramgroups>0</paramgroups><rettype>`dict[str, str]`</rettype><retdesc>Formatted example with the chat template applied.</retdesc></docstring>
If the example is in a conversational format, apply a chat template to it.
Notes:
- This function does not alter the keys, except for language modeling datasets, where `"messages"` is replaced
by `"text"`.
- For prompt-only data, if the last role is `"user"`, the generation prompt is added to the prompt.
Otherwise, if the last role is `"assistant"`, the final message is continued.
<ExampleCodeBlock anchor="trl.maybe_apply_chat_template.example">
Example:
```python
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
>>> example = {
... "prompt": [{"role": "user", "content": "What color is the sky?"}],
... "completion": [{"role": "assistant", "content": "It is blue."}],
... }
>>> maybe_apply_chat_template(example, tokenizer)
{'prompt': '<|user|>\nWhat color is the sky?<|end|>\n<|assistant|>\n', 'completion': 'It is blue.<|end|>\n'}
```
</ExampleCodeBlock>
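The prompt-only note above amounts to choosing between two template flags. A sketch of that decision, using the flag names that `transformers`' `apply_chat_template` accepts (`add_generation_prompt`, `continue_final_message`); the helper itself is hypothetical, not trl's code:

```python
def template_flags(prompt):
    # If the last message is from the user, ask the template to append the
    # generation prompt; if it is from the assistant, continue that message.
    last_role = prompt[-1]["role"]
    return {
        "add_generation_prompt": last_role == "user",
        "continue_final_message": last_role == "assistant",
    }

print(template_flags([{"role": "user", "content": "Hi"}]))
# {'add_generation_prompt': True, 'continue_final_message': False}
```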
</div>
## maybe_convert_to_chatml[[trl.maybe_convert_to_chatml]]
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>trl.maybe_convert_to_chatml</name><anchor>trl.maybe_convert_to_chatml</anchor><source>https://github.com/huggingface/trl/blob/vr_4331/trl/data_utils.py#L890</source><parameters>[{"name": "example", "val": ": dict"}]</parameters><paramsdesc>- **example** (`dict[str, list]`) --
A single data entry containing a list of messages.</paramsdesc><paramgroups>0</paramgroups><rettype>`dict[str, list]`</rettype><retdesc>Example reformatted to ChatML style.</retdesc></docstring>
Convert a conversational dataset with fields `from` and `value` to ChatML format.
This function modifies conversational data to align with OpenAI's ChatML format:
- Replaces the key `"from"` with `"role"` in message dictionaries.
- Replaces the key `"value"` with `"content"` in message dictionaries.
- Renames `"conversations"` to `"messages"` for consistency with ChatML.
<ExampleCodeBlock anchor="trl.maybe_convert_to_chatml.example">
Example:
```python
>>> from trl import maybe_convert_to_chatml
>>> example = {
... "conversations": [
... {"from": "user", "value": "What color is the sky?"},
... {"from": "assistant", "value": "It is blue."},
... ]
... }
>>> maybe_convert_to_chatml(example)
{'messages': [{'role': 'user', 'content': 'What color is the sky?'},
{'role': 'assistant', 'content': 'It is blue.'}]}
```
</ExampleCodeBlock>
</div>
## extract_prompt[[trl.extract_prompt]]
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>trl.extract_prompt</name><anchor>trl.extract_prompt</anchor><source>https://github.com/huggingface/trl/blob/vr_4331/trl/data_utils.py#L488</source><parameters>[{"name": "example", "val": ": dict"}]</parameters></docstring>
Extracts the shared prompt from a preference data example, where the prompt is implicit within both the chosen and
rejected completions.
For more details, see [maybe_extract_prompt()](/docs/trl/pr_4331/en/data_utils#trl.maybe_extract_prompt).
</div>
## maybe_extract_prompt[[trl.maybe_extract_prompt]]
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>trl.maybe_extract_prompt</name><anchor>trl.maybe_extract_prompt</anchor><source>https://github.com/huggingface/trl/blob/vr_4331/trl/data_utils.py#L507</source><parameters>[{"name": "example", "val": ": dict"}]</parameters><paramsdesc>- **example** (`dict[str, list]`) --
A dictionary representing a single data entry in the preference dataset. It must contain the keys
`"chosen"` and `"rejected"`, where each value is either conversational or standard (`str`).</paramsdesc><paramgroups>0</paramgroups><rettype>`dict[str, list]`</rettype><retdesc>A dictionary containing:
- `"prompt"`: The longest common prefix between the "chosen" and "rejected" completions.
- `"chosen"`: The remainder of the "chosen" completion, with the prompt removed.
- `"rejected"`: The remainder of the "rejected" completion, with the prompt removed.</retdesc></docstring>
Extracts the shared prompt from a preference data example, where the prompt is implicit within both the chosen and
rejected completions.
If the example already contains a `"prompt"` key, the function returns the example as is. Else, the function
identifies the longest common sequence (prefix) of conversation turns between the "chosen" and "rejected"
completions and extracts this as the prompt. It then removes this prompt from the respective "chosen" and
"rejected" completions.
<ExampleCodeBlock anchor="trl.maybe_extract_prompt.example">
Examples:
```python
>>> example = {
... "chosen": [
... {"role": "user", "content": "What color is the sky?"},
... {"role": "assistant", "content": "It is blue."},
... ],
... "rejected": [
... {"role": "user", "content": "What color is the sky?"},
... {"role": "assistant", "content": "It is green."},
... ],
... }
>>> maybe_extract_prompt(example)
{'prompt': [{'role': 'user', 'content': 'What color is the sky?'}],
'chosen': [{'role': 'assistant', 'content': 'It is blue.'}],
'rejected': [{'role': 'assistant', 'content': 'It is green.'}]}
```
</ExampleCodeBlock>
<ExampleCodeBlock anchor="trl.maybe_extract_prompt.example-2">
Or, with the `map` method of [Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset):
```python
>>> from trl import extract_prompt
>>> from datasets import Dataset
>>> dataset_dict = {
... "chosen": [
... [
... {"role": "user", "content": "What color is the sky?"},
... {"role": "assistant", "content": "It is blue."},
... ],
... [
... {"role": "user", "content": "Where is the sun?"},
... {"role": "assistant", "content": "In the sky."},
... ],
... ],
... "rejected": [
... [
... {"role": "user", "content": "What color is the sky?"},
... {"role": "assistant", "content": "It is green."},
... ],
... [
... {"role": "user", "content": "Where is the sun?"},
... {"role": "assistant", "content": "In the sea."},
... ],
... ],
... }
>>> dataset = Dataset.from_dict(dataset_dict)
>>> dataset = dataset.map(extract_prompt)
>>> dataset[0]
{'prompt': [{'role': 'user', 'content': 'What color is the sky?'}],
'chosen': [{'role': 'assistant', 'content': 'It is blue.'}],
'rejected': [{'role': 'assistant', 'content': 'It is green.'}]}
```
</ExampleCodeBlock>
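Under the hood, the extraction described above is a longest-common-prefix split over conversation turns; a minimal sketch (not trl's actual code):

```python
def split_common_prefix(chosen, rejected):
    # Walk both conversations in lockstep until the turns diverge;
    # everything before the divergence is the shared prompt.
    i = 0
    while i < min(len(chosen), len(rejected)) and chosen[i] == rejected[i]:
        i += 1
    return {"prompt": chosen[:i], "chosen": chosen[i:], "rejected": rejected[i:]}

chosen = [
    {"role": "user", "content": "Sky color?"},
    {"role": "assistant", "content": "Blue."},
]
rejected = [
    {"role": "user", "content": "Sky color?"},
    {"role": "assistant", "content": "Green."},
]
out = split_common_prefix(chosen, rejected)
print(out["prompt"])  # [{'role': 'user', 'content': 'Sky color?'}]
```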
</div>
## unpair_preference_dataset[[trl.unpair_preference_dataset]]
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>trl.unpair_preference_dataset</name><anchor>trl.unpair_preference_dataset</anchor><source>https://github.com/huggingface/trl/blob/vr_4331/trl/data_utils.py#L394</source><parameters>[{"name": "dataset", "val": ": ~DatasetType"}, {"name": "num_proc", "val": ": int | None = None"}, {"name": "desc", "val": ": str | None = None"}]</parameters><paramsdesc>- **dataset** ([Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)) --
Preference dataset to unpair. The dataset must have columns `"chosen"`, `"rejected"` and optionally
`"prompt"`.
- **num_proc** (`int`, *optional*) --
Number of processes to use for processing the dataset.
- **desc** (`str`, *optional*) --
Meaningful description to be displayed alongside the progress bar while mapping examples.</paramsdesc><paramgroups>0</paramgroups><rettype>[Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset)</rettype><retdesc>The unpaired preference dataset.</retdesc></docstring>
Unpair a preference dataset.
<ExampleCodeBlock anchor="trl.unpair_preference_dataset.example">
Example:
```python
>>> from datasets import Dataset
>>> dataset_dict = {
... "prompt": ["The sky is", "The sun is"],
...     "chosen": [" blue.", " in the sky."],
... "rejected": [" green.", " in the sea."],
... }
>>> dataset = Dataset.from_dict(dataset_dict)
>>> dataset = unpair_preference_dataset(dataset)
>>> dataset
Dataset({
features: ['prompt', 'completion', 'label'],
num_rows: 4
})
>>> dataset[0]
{'prompt': 'The sky is', 'completion': ' blue.', 'label': True}
```
</ExampleCodeBlock>
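Conceptually, unpairing turns each paired row into two `(prompt, completion, label)` rows: the chosen completion labeled `True` and the rejected one `False`. A minimal sketch over plain column dictionaries (one possible row ordering; not trl's actual code):

```python
def unpair(rows):
    # Each paired row yields two unpaired rows, doubling the row count.
    out = {"prompt": [], "completion": [], "label": []}
    for prompt, chosen, rejected in zip(
        rows["prompt"], rows["chosen"], rows["rejected"]
    ):
        out["prompt"] += [prompt, prompt]
        out["completion"] += [chosen, rejected]
        out["label"] += [True, False]
    return out

rows = {"prompt": ["The sky is"], "chosen": [" blue."], "rejected": [" green."]}
unpaired = unpair(rows)
print(unpaired["label"])  # [True, False]
```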
</div>
## maybe_unpair_preference_dataset[[trl.maybe_unpair_preference_dataset]]
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>trl.maybe_unpair_preference_dataset</name><anchor>trl.maybe_unpair_preference_dataset</anchor><source>https://github.com/huggingface/trl/blob/vr_4331/trl/data_utils.py#L437</source><parameters>[{"name": "dataset", "val": ": ~DatasetType"}, {"name": "num_proc", "val": ": int | None = None"}, {"name": "desc", "val": ": str | None = None"}]</parameters><paramsdesc>- **dataset** ([Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)) --
Preference dataset to unpair. The dataset must have columns `"chosen"`, `"rejected"` and optionally
`"prompt"`.
- **num_proc** (`int`, *optional*) --
Number of processes to use for processing the dataset.
- **desc** (`str`, *optional*) --
Meaningful description to be displayed alongside the progress bar while mapping examples.</paramsdesc><paramgroups>0</paramgroups><rettype>[Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)</rettype><retdesc>The unpaired preference dataset if it was paired, otherwise
the original dataset.</retdesc></docstring>
Unpair a preference dataset if it is paired.
<ExampleCodeBlock anchor="trl.maybe_unpair_preference_dataset.example">
Example:
```python
>>> from datasets import Dataset
>>> dataset_dict = {
... "prompt": ["The sky is", "The sun is"],
...     "chosen": [" blue.", " in the sky."],
... "rejected": [" green.", " in the sea."],
... }
>>> dataset = Dataset.from_dict(dataset_dict)
>>> dataset = maybe_unpair_preference_dataset(dataset)
>>> dataset
Dataset({
features: ['prompt', 'completion', 'label'],
num_rows: 4
})
>>> dataset[0]
{'prompt': 'The sky is', 'completion': ' blue.', 'label': True}
```
</ExampleCodeBlock>
</div>
## pack_dataset[[trl.pack_dataset]]
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>trl.pack_dataset</name><anchor>trl.pack_dataset</anchor><source>https://github.com/huggingface/trl/blob/vr_4331/trl/data_utils.py#L731</source><parameters>[{"name": "dataset", "val": ": ~DatasetType"}, {"name": "seq_length", "val": ": int"}, {"name": "strategy", "val": ": str = 'bfd'"}, {"name": "map_kwargs", "val": ": dict[str, typing.Any] | None = None"}]</parameters><paramsdesc>- **dataset** ([Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)) --
Dataset to pack.
- **seq_length** (`int`) --
Target sequence length to pack to.
- **strategy** (`str`, *optional*, defaults to `"bfd"`) --
Packing strategy to use. Can be either:
- `"bfd"` (Best Fit Decreasing): Slower but preserves sequence boundaries. Sequences are never cut in the
middle.
- `"wrapped"`: Faster but more aggressive. Ignores sequence boundaries and will cut sequences in the middle
to completely fill each packed sequence with data.
- **map_kwargs** (`dict`, *optional*) --
Additional keyword arguments to pass to the dataset's map method when packing examples.</paramsdesc><paramgroups>0</paramgroups><rettype>[Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)</rettype><retdesc>The dataset with packed sequences. The number of examples
may decrease as sequences are combined.</retdesc></docstring>
Pack sequences in a dataset into chunks of size `seq_length`.
<ExampleCodeBlock anchor="trl.pack_dataset.example">
Example:
```python
>>> from datasets import Dataset
>>> from trl import pack_dataset
>>> examples = {
... "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8], [9]],
... "attention_mask": [[1, 1, 0], [1, 0], [1, 0, 0], [1]],
... }
>>> dataset = Dataset.from_dict(examples)
>>> packed_dataset = pack_dataset(dataset, seq_length=4, strategy="bfd")
>>> packed_dataset[:]
{'input_ids': [[1, 2, 3, 9], [6, 7, 8], [4, 5]],
'attention_mask': [[1, 1, 0, 1], [1, 0, 0], [1, 0]],
'seq_lengths': [[3, 1], [3], [2]]}
```
</ExampleCodeBlock>
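The `"bfd"` strategy is classic Best Fit Decreasing bin packing applied to whole sequences. A minimal sketch on plain token lists (illustration only, not trl's implementation), which reproduces the packing in the example above:

```python
def pack_bfd(seqs, seq_length):
    # Best Fit Decreasing: sort sequences longest-first, then place each one
    # into the fullest bin that still has room; open a new bin if none fits.
    # Sequences are never split.
    bins = []
    for seq in sorted(seqs, key=len, reverse=True):
        best = None
        for b in bins:
            if b["remaining"] >= len(seq) and (
                best is None or b["remaining"] < best["remaining"]
            ):
                best = b
        if best is None:
            best = {"remaining": seq_length, "tokens": []}
            bins.append(best)
        best["tokens"] += seq
        best["remaining"] -= len(seq)
    return [b["tokens"] for b in bins]

packed = pack_bfd([[1, 2, 3], [4, 5], [6, 7, 8], [9]], seq_length=4)
print(packed)  # [[1, 2, 3, 9], [6, 7, 8], [4, 5]]
```

Because `[9]` is placed last, it back-fills the single free slot left by `[1, 2, 3]`, which is why the first packed sequence in the example is `[1, 2, 3, 9]`.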
</div>
## truncate_dataset[[trl.truncate_dataset]]
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>trl.truncate_dataset</name><anchor>trl.truncate_dataset</anchor><source>https://github.com/huggingface/trl/blob/vr_4331/trl/data_utils.py#L787</source><parameters>[{"name": "dataset", "val": ": ~DatasetType"}, {"name": "max_length", "val": ": int"}, {"name": "map_kwargs", "val": ": dict[str, typing.Any] | None = None"}]</parameters><paramsdesc>- **dataset** ([Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)) --
Dataset to truncate.
- **max_length** (`int`) --
Maximum sequence length to truncate to.
- **map_kwargs** (`dict`, *optional*) --
Additional keyword arguments to pass to the dataset's map method when truncating examples.</paramsdesc><paramgroups>0</paramgroups><rettype>[Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)</rettype><retdesc>The dataset with truncated sequences.</retdesc></docstring>
Truncate sequences in a dataset to a specified `max_length`.
<ExampleCodeBlock anchor="trl.truncate_dataset.example">
Example:
```python
>>> from datasets import Dataset
>>> examples = {
... "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8]],
... "attention_mask": [[0, 1, 1], [0, 0, 1, 1], [1]],
... }
>>> dataset = Dataset.from_dict(examples)
>>> truncated_dataset = truncate_dataset(dataset, max_length=2)
>>> truncated_dataset[:]
{'input_ids': [[1, 2], [4, 5], [8]],
'attention_mask': [[0, 1], [0, 0], [1]]}
```
</ExampleCodeBlock>
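Truncation is a per-sequence head slice applied to every column; a minimal sketch over plain column dictionaries (not trl's actual code):

```python
def truncate(rows, max_length):
    # Keep only the first max_length tokens of each sequence, in every column.
    return {
        key: [seq[:max_length] for seq in values]
        for key, values in rows.items()
    }

rows = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8]],
    "attention_mask": [[0, 1, 1], [0, 0, 1, 1], [1]],
}
truncated = truncate(rows, max_length=2)
print(truncated["input_ids"])  # [[1, 2], [4, 5], [8]]
```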
</div>