# Scripts Utilities
## ScriptArguments[[trl.ScriptArguments]]
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>class trl.ScriptArguments</name><anchor>trl.ScriptArguments</anchor><source>https://github.com/huggingface/trl/blob/vr_4305/trl/scripts/utils.py#L156</source><parameters>[{"name": "dataset_name", "val": ": typing.Optional[str] = None"}, {"name": "dataset_config", "val": ": typing.Optional[str] = None"}, {"name": "dataset_train_split", "val": ": str = 'train'"}, {"name": "dataset_test_split", "val": ": str = 'test'"}, {"name": "dataset_streaming", "val": ": bool = False"}, {"name": "gradient_checkpointing_use_reentrant", "val": ": bool = False"}, {"name": "ignore_bias_buffers", "val": ": bool = False"}]</parameters><paramsdesc>- **dataset_name** (`str`, *optional*) --
Path or name of the dataset to load. If `datasets` is provided, this will be ignored.
- **dataset_config** (`str`, *optional*) --
Dataset configuration name. Corresponds to the `name` argument of the [load_dataset](https://huggingface.co/docs/datasets/main/en/package_reference/loading_methods#datasets.load_dataset) function.
If `datasets` is provided, this will be ignored.
- **dataset_train_split** (`str`, *optional*, defaults to `"train"`) --
Dataset split to use for training. If `datasets` is provided, this will be ignored.
- **dataset_test_split** (`str`, *optional*, defaults to `"test"`) --
Dataset split to use for evaluation. If `datasets` is provided, this will be ignored.
- **dataset_streaming** (`bool`, *optional*, defaults to `False`) --
Whether to stream the dataset. If True, the dataset will be loaded in streaming mode. If `datasets` is
provided, this will be ignored.
- **gradient_checkpointing_use_reentrant** (`bool`, *optional*, defaults to `False`) --
Whether to apply `use_reentrant` for gradient checkpointing.
- **ignore_bias_buffers** (`bool`, *optional*, defaults to `False`) --
Debug argument for distributed training. Fix for DDP issues with LM bias/mask buffers - invalid scalar
type, inplace operation. See
https://github.com/huggingface/transformers/issues/22482#issuecomment-1595790992.</paramsdesc><paramgroups>0</paramgroups></docstring>
Arguments common to all scripts.
</div>
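The dataset fields above line up with the arguments of `load_dataset`: `dataset_name` is the path, `dataset_config` is the `name` argument, and `dataset_streaming` toggles streaming. A minimal sketch of that mapping follows. `DatasetArgs` and `load_dataset_kwargs` are hypothetical stand-ins written for this illustration (they are not part of TRL), so the snippet runs without `trl` or `datasets` installed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetArgs:
    """Hypothetical stand-in mirroring the dataset fields of ScriptArguments."""

    dataset_name: Optional[str] = None
    dataset_config: Optional[str] = None
    dataset_train_split: str = "train"
    dataset_test_split: str = "test"
    dataset_streaming: bool = False


def load_dataset_kwargs(args):
    """Build keyword arguments for `datasets.load_dataset` (helper name is an assumption)."""
    return {
        "path": args.dataset_name,
        "name": args.dataset_config,  # `dataset_config` corresponds to `name`
        "streaming": args.dataset_streaming,
    }


args = DatasetArgs(dataset_name="trl-lib/tldr")
kwargs = load_dataset_kwargs(args)
print(kwargs)
# With `datasets` installed, a script could then call:
#   dataset = load_dataset(**kwargs)
#   train_dataset = dataset[args.dataset_train_split]
```

The split names are not passed to `load_dataset` in this sketch; they are used afterwards to index the loaded dataset.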
## TrlParser[[trl.TrlParser]]
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>class trl.TrlParser</name><anchor>trl.TrlParser</anchor><source>https://github.com/huggingface/trl/blob/vr_4305/trl/scripts/utils.py#L248</source><parameters>[{"name": "dataclass_types", "val": ": typing.Union[transformers.hf_argparser.DataClassType, collections.abc.Iterable[transformers.hf_argparser.DataClassType], NoneType] = None"}, {"name": "**kwargs", "val": ""}]</parameters><paramsdesc>- **dataclass_types** (`Union[DataClassType, Iterable[DataClassType]]`, *optional*) --
Dataclass types to use for argument parsing.
- ****kwargs** --
Additional keyword arguments passed to the [transformers.HfArgumentParser](https://huggingface.co/docs/transformers/main/en/internal/trainer_utils#transformers.HfArgumentParser) constructor.</paramsdesc><paramgroups>0</paramgroups></docstring>
A subclass of [transformers.HfArgumentParser](https://huggingface.co/docs/transformers/main/en/internal/trainer_utils#transformers.HfArgumentParser) designed for parsing command-line arguments with dataclass-backed
configurations, while also supporting configuration file loading and environment variable management.
<ExampleCodeBlock anchor="trl.TrlParser.example">
Examples:
```yaml
# config.yaml
env:
  VAR1: value1
arg1: 23
```
</ExampleCodeBlock>
<ExampleCodeBlock anchor="trl.TrlParser.example-2">
```python
# main.py
import os
from dataclasses import dataclass

from trl import TrlParser


@dataclass
class MyArguments:
    arg1: int
    arg2: str = "alpha"


parser = TrlParser(dataclass_types=[MyArguments])
training_args = parser.parse_args_and_config()

print(training_args, os.environ.get("VAR1"))
```
</ExampleCodeBlock>
<ExampleCodeBlock anchor="trl.TrlParser.example-3">
```bash
$ python main.py --config config.yaml
(MyArguments(arg1=23, arg2='alpha'),) value1
$ python main.py --arg1 5 --arg2 beta
(MyArguments(arg1=5, arg2='beta'),) None
```
</ExampleCodeBlock>
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>parse_args_and_config</name><anchor>trl.TrlParser.parse_args_and_config</anchor><source>https://github.com/huggingface/trl/blob/vr_4305/trl/scripts/utils.py#L317</source><parameters>[{"name": "args", "val": ": typing.Optional[collections.abc.Iterable[str]] = None"}, {"name": "return_remaining_strings", "val": ": bool = False"}, {"name": "fail_with_unknown_args", "val": ": bool = True"}]</parameters></docstring>
Parse command-line args and config file into instances of the specified dataclass types.
This method wraps [transformers.HfArgumentParser.parse_args_into_dataclasses](https://huggingface.co/docs/transformers/main/en/internal/trainer_utils#transformers.HfArgumentParser.parse_args_into_dataclasses) and also parses the config file
specified with the `--config` flag. The config file (in YAML format) provides argument values that replace the
default values in the dataclasses. Command line arguments can override values set by the config file. The
method also sets any environment variables specified in the `env` field of the config file.
</div>
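The precedence described above (dataclass defaults, then config file values, then command-line values) can be pictured as a layered dictionary merge. The sketch below is a simplified illustration using only the standard library, not TRL's implementation: `resolve` is a hypothetical helper, the `config` dict stands in for the parsed YAML file, and `cli_overrides` stands in for values already parsed from the command line.

```python
import os
from dataclasses import MISSING, dataclass, fields


@dataclass
class MyArguments:
    arg1: int
    arg2: str = "alpha"


def resolve(config, cli_overrides):
    """Layer values: dataclass defaults < config file < command line."""
    config = dict(config)  # copy so the caller's dict is not mutated
    # The `env` section is exported to the environment rather than treated as an argument.
    for key, value in config.pop("env", {}).items():
        os.environ[key] = str(value)
    # Only fields that declare a default contribute to the lowest layer.
    defaults = {f.name: f.default for f in fields(MyArguments) if f.default is not MISSING}
    return MyArguments(**{**defaults, **config, **cli_overrides})


config = {"env": {"VAR1": "value1"}, "arg1": 23}  # stand-in for config.yaml
from_config = resolve(config, {})
from_cli = resolve(config, {"arg1": 5, "arg2": "beta"})
print(from_config, from_cli, os.environ.get("VAR1"))
```

Later layers win because Python's dict unpacking keeps the last value seen for each key.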
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>parse_args_into_dataclasses</name><anchor>trl.TrlParser.parse_args_into_dataclasses</anchor><source>https://github.com/huggingface/trl/blob/vr_4305/transformers/hf_argparser.py#L272</source><parameters>[{"name": "args", "val": " = None"}, {"name": "return_remaining_strings", "val": " = False"}, {"name": "look_for_args_file", "val": " = True"}, {"name": "args_filename", "val": " = None"}, {"name": "args_file_flag", "val": " = None"}]</parameters><paramsdesc>- **args** --
List of strings to parse. The default is taken from sys.argv. (same as argparse.ArgumentParser)
- **return_remaining_strings** --
If true, also return a list of remaining argument strings.
- **look_for_args_file** --
If true, will look for a ".args" file with the same base name as the entry point script for this
process, and will append its potential content to the command line args.
- **args_filename** --
If not None, will use this file instead of the ".args" file specified in the previous argument.
- **args_file_flag** --
If not None, will look for a file in the command-line args specified with this flag. The flag can be
specified multiple times and precedence is determined by the order (last one wins).</paramsdesc><paramgroups>0</paramgroups><rettype>Tuple consisting of</rettype><retdesc>- the dataclass instances in the same order as they were passed to the initializer.
- if applicable, an additional namespace for more (non-dataclass backed) arguments added to the parser
after initialization.
- The potential list of remaining argument strings. (same as argparse.ArgumentParser.parse_known_args)</retdesc></docstring>
Parse command-line args into instances of the specified dataclass types.
This relies on argparse's `ArgumentParser.parse_known_args`. See the documentation at:
https://docs.python.org/3/library/argparse.html#argparse.ArgumentParser.parse_args
</div>
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>set_defaults_with_config</name><anchor>trl.TrlParser.set_defaults_with_config</anchor><source>https://github.com/huggingface/trl/blob/vr_4305/trl/scripts/utils.py#L368</source><parameters>[{"name": "**kwargs", "val": ""}]</parameters></docstring>
Overrides the parser's default values with those provided via keyword arguments, including for subparsers.
Any argument with an updated default will also be marked as not required if it was previously required.
Returns a list of strings that were not consumed by the parser.
</div></div>
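One way to picture `set_defaults_with_config` is as a walk over a plain `argparse` parser and its subparsers. The sketch below is an assumption-laden illustration, not TRL's code: `apply_defaults` and `set_defaults_sketch` are hypothetical names, it relies on argparse's private `_actions` and `_SubParsersAction`, and the `--key value` form of the returned strings is an assumption about what "not consumed" looks like.

```python
import argparse


def apply_defaults(parser, overrides):
    """Set new defaults on `parser` and its subparsers; return the keys that matched."""
    used = set()
    for action in parser._actions:  # private argparse attribute, used for illustration
        if isinstance(action, argparse._SubParsersAction):
            for subparser in action.choices.values():
                used |= apply_defaults(subparser, overrides)
        elif action.dest in overrides:
            action.default = overrides[action.dest]
            action.required = False  # a value with a default no longer has to be passed
            used.add(action.dest)
    return used


def set_defaults_sketch(parser, **overrides):
    used = apply_defaults(parser, overrides)
    # Keys that matched no argument are handed back as CLI-style strings (assumed format).
    return [s for key, value in overrides.items() if key not in used for s in (f"--{key}", str(value))]


parser = argparse.ArgumentParser()
parser.add_argument("--arg1", type=int, required=True)
train = parser.add_subparsers(dest="command").add_parser("train")
train.add_argument("--lr", type=float, required=True)

remaining = set_defaults_sketch(parser, arg1=23, lr=0.1, unknown_key="x")
args = parser.parse_args(["train"])
print(args, remaining)
```

Both `--arg1` and `--lr` were declared as required, yet parsing succeeds without them because each received a default.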
## get_dataset[[trl.get_dataset]]
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>trl.get_dataset</name><anchor>trl.get_dataset</anchor><source>https://github.com/huggingface/trl/blob/vr_4305/trl/scripts/utils.py#L421</source><parameters>[{"name": "mixture_config", "val": ": DatasetMixtureConfig"}]</parameters><paramsdesc>- **mixture_config** ([DatasetMixtureConfig](/docs/trl/pr_4305/en/script_utils#trl.DatasetMixtureConfig)) --
Script arguments containing dataset configuration.</paramsdesc><paramgroups>0</paramgroups><rettype>[DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)</rettype><retdesc>Combined dataset(s) from the mixture configuration, with optional train/test split if `test_split_size` is
set.</retdesc></docstring>
Load a mixture of datasets based on the configuration.
<ExampleCodeBlock anchor="trl.get_dataset.example">
Example:
```python
from trl import DatasetMixtureConfig, get_dataset
from trl.scripts.utils import DatasetConfig
mixture_config = DatasetMixtureConfig(datasets=[DatasetConfig(path="trl-lib/tldr")])
dataset = get_dataset(mixture_config)
print(dataset)
```
</ExampleCodeBlock>
<ExampleCodeBlock anchor="trl.get_dataset.example-2">
```
DatasetDict({
    train: Dataset({
        features: ['prompt', 'completion'],
        num_rows: 116722
    })
})
```
</ExampleCodeBlock>
</div>
## DatasetConfig[[trl.scripts.utils.DatasetConfig]]
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>class trl.scripts.utils.DatasetConfig</name><anchor>trl.scripts.utils.DatasetConfig</anchor><source>https://github.com/huggingface/trl/blob/vr_4305/trl/scripts/utils.py#L58</source><parameters>[{"name": "path", "val": ": str"}, {"name": "name", "val": ": typing.Optional[str] = None"}, {"name": "data_dir", "val": ": typing.Optional[str] = None"}, {"name": "data_files", "val": ": typing.Union[str, list[str], dict[str, str], NoneType] = None"}, {"name": "split", "val": ": str = 'train'"}, {"name": "columns", "val": ": typing.Optional[list[str]] = None"}]</parameters><paramsdesc>- **path** (`str`) --
Path or name of the dataset.
- **name** (`str`, *optional*) --
Name of the dataset configuration.
- **data_dir** (`str`, *optional*) --
The `data_dir` of the dataset configuration. If specified for the generic builders (csv, text, etc.)
or for Hub datasets while `data_files` is `None`, the behavior is equivalent to passing
`os.path.join(data_dir, **)` as `data_files`, which references all the files in a directory.
- **data_files** (`str` or `Sequence` or `Mapping`, *optional*) --
Path(s) to source data file(s).
- **split** (`str`, *optional*, defaults to `"train"`) --
Which split of the data to load.
- **columns** (`list[str]`, *optional*) --
List of column names to select from the dataset. If `None`, all columns are selected.</paramsdesc><paramgroups>0</paramgroups></docstring>
Configuration for a dataset.
This class matches the signature of [load_dataset](https://huggingface.co/docs/datasets/main/en/package_reference/loading_methods#datasets.load_dataset) and the arguments are used directly in the
[load_dataset](https://huggingface.co/docs/datasets/main/en/package_reference/loading_methods#datasets.load_dataset) function. You can refer to the [load_dataset](https://huggingface.co/docs/datasets/main/en/package_reference/loading_methods#datasets.load_dataset) documentation for more
details.
</div>
## DatasetMixtureConfig[[trl.DatasetMixtureConfig]]
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>class trl.DatasetMixtureConfig</name><anchor>trl.DatasetMixtureConfig</anchor><source>https://github.com/huggingface/trl/blob/vr_4305/trl/scripts/utils.py#L92</source><parameters>[{"name": "datasets", "val": ": list = <factory>"}, {"name": "streaming", "val": ": bool = False"}, {"name": "test_split_size", "val": ": typing.Optional[float] = None"}]</parameters><paramsdesc>- **datasets** (`list[DatasetConfig]`) --
List of dataset configurations to include in the mixture.
- **streaming** (`bool`, *optional*, defaults to `False`) --
Whether to stream the datasets. If `True`, the datasets will be loaded in streaming mode.
- **test_split_size** (`float`, *optional*) --
Size of the test split. Refer to the `test_size` parameter in the `train_test_split` function
for more details. If `None`, the dataset will not be split into train and test sets.</paramsdesc><paramgroups>0</paramgroups></docstring>
Configuration class for a mixture of datasets.
Using [HfArgumentParser](https://huggingface.co/docs/transformers/main/en/internal/trainer_utils#transformers.HfArgumentParser) we can turn this class into
[argparse](https://docs.python.org/3/library/argparse#module-argparse) arguments that can be specified on the
command line.
Usage:
<ExampleCodeBlock anchor="trl.DatasetMixtureConfig.example">
When using the CLI, you can add the following section to your YAML config file:
```yaml
datasets:
  - path: ...
    name: ...
    data_dir: ...
    data_files: ...
    split: ...
    columns: ...
  - path: ...
    name: ...
    data_dir: ...
    data_files: ...
    split: ...
    columns: ...
streaming: ...
test_split_size: ...
```
</ExampleCodeBlock>
</div>
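To make the shape of that YAML section concrete, the sketch below parses an equivalent Python dict into dataclasses. `DatasetConfigSketch` and `MixtureSketch` are hypothetical stand-ins that only mirror the documented fields, and the dict-to-dataclass conversion in `__post_init__` is an assumption made for illustration. The second dataset path is a made-up placeholder.

```python
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass
class DatasetConfigSketch:
    """Hypothetical stand-in mirroring the documented DatasetConfig fields."""

    path: str
    name: Optional[str] = None
    data_dir: Optional[str] = None
    data_files: Union[str, list, dict, None] = None
    split: str = "train"
    columns: Optional[list] = None


@dataclass
class MixtureSketch:
    """Hypothetical stand-in mirroring the documented DatasetMixtureConfig fields."""

    datasets: list = field(default_factory=list)
    streaming: bool = False
    test_split_size: Optional[float] = None

    def __post_init__(self):
        # Accept plain dicts (as a YAML loader would produce) and convert them.
        self.datasets = [
            d if isinstance(d, DatasetConfigSketch) else DatasetConfigSketch(**d) for d in self.datasets
        ]


# Stand-in for the parsed YAML section
yaml_section = {
    "datasets": [
        {"path": "trl-lib/tldr", "columns": ["prompt", "completion"]},
        {"path": "my-org/my-dataset"},  # placeholder name, not a real dataset
    ],
    "test_split_size": 0.1,
}
mixture = MixtureSketch(**yaml_section)
print(mixture)
```

Keys omitted from an entry, such as `split` in both entries here, fall back to the field defaults.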
