| # Run training on Amazon SageMaker | |
| This guide shows you how to train a 🤗 Transformers model with the `HuggingFace` SageMaker Python SDK. Learn how to: | |
| - [Install and set up your training environment](#installation-and-setup). | |
| - [Prepare a training script](#prepare-a-transformers-fine-tuning-script). | |
| - [Create a Hugging Face Estimator](#create-a-hugging-face-estimator). | |
| - [Run training with the `fit` method](#execute-training). | |
| - [Access your trained model](#access-trained-model). | |
| - [Perform distributed training](#distributed-training). | |
| - [Create a spot instance](#spot-instances). | |
| - [Load a training script from a GitHub repository](#git-repository). | |
| - [Collect training metrics](#sagemaker-metrics). | |
| ## Installation and setup | |
| Before you can train a 🤗 Transformers model with SageMaker, you need to sign up for an AWS account. If you don't have an AWS account yet, learn more [here](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-set-up.html). | |
| Once you have an AWS account, get started using one of the following: | |
| - [SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-studio-onboard.html) | |
| - [SageMaker notebook instance](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-console.html) | |
| - Local environment | |
| To start training locally, you need to set up an appropriate [IAM role](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html). | |
| Upgrade to the latest `sagemaker` version: | |
| ```bash | |
| pip install 'sagemaker<3' --upgrade | |
| ``` | |
| > [!WARNING] | |
| > [SageMaker Python SDK v3 has been recently released](https://github.com/aws/sagemaker-python-sdk), so unless specified otherwise, all the documentation and tutorials still use the [SageMaker Python SDK v2](https://github.com/aws/sagemaker-python-sdk/tree/master-v2). We are actively working on updating all the tutorials and examples, but in the meantime make sure to install the SageMaker SDK as `pip install "sagemaker<3"`. | |
| ## Prepare a 🤗 Transformers fine-tuning script | |
| … (>5GB), which can slow down deployment for Amazon SageMaker Inference. | |
| You can control how checkpoints, logs, and artifacts are saved by customizing the [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). For example, setting `save_total_limit` in `TrainingArguments` caps the total number of checkpoints: when a new checkpoint is saved and the limit is reached, the oldest checkpoints in `output_dir` are deleted. | |
| Another option is to save the training artifacts during the training session. Amazon SageMaker supports [Checkpointing](https://docs.aws.amazon.com/sagemaker/latest/dg/model-checkpoints.html), which continuously saves your artifacts to Amazon S3 during training rather than only at the end. To enable it, provide the `checkpoint_s3_uri` parameter pointing to an Amazon S3 location in the `HuggingFace` estimator and set `output_dir` to `/opt/ml/checkpoints`. | |
| _Note: If you set `output_dir` to `/opt/ml/checkpoints`, make sure to call `trainer.save_model("/opt/ml/model")` or `model.save_pretrained("/opt/ml/model")`/`tokenizer.save_pretrained("/opt/ml/model")` at the end of your training so that you can deploy your model seamlessly to Amazon SageMaker for Inference._ | |
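When `output_dir` points to the checkpoint directory, a restarted job can continue from the newest checkpoint found there instead of starting over. The following is a minimal sketch using only the standard library, assuming the Trainer's usual `checkpoint-<step>` folder naming; `find_last_checkpoint` is a hypothetical helper written for illustration, and the demo uses a temporary directory in place of `/opt/ml/checkpoints`.

```python
import os
import re
import tempfile

# Assumption: checkpoints are saved as folders named "checkpoint-<step>"
CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint folder, or None."""
    if not os.path.isdir(output_dir):
        return None
    steps = []
    for name in os.listdir(output_dir):
        match = CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            steps.append((int(match.group(1)), name))
    if not steps:
        return None
    # max() compares the step numbers first, so the latest step wins
    return os.path.join(output_dir, max(steps)[1])


# Demo with a temporary directory standing in for /opt/ml/checkpoints
with tempfile.TemporaryDirectory() as demo_dir:
    for step in (500, 1000, 1500):
        os.makedirs(os.path.join(demo_dir, f"checkpoint-{step}"))
    last = find_last_checkpoint(demo_dir)
    last_name = os.path.basename(last)
    print(last_name)  # checkpoint-1500
```

In a training script, the returned path could be passed to `trainer.train(resume_from_checkpoint=...)`. The `transformers.trainer_utils.get_last_checkpoint` function offers similar behavior.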
| ## Create a Hugging Face Estimator | |
| Run 🤗 Transformers training scripts on SageMaker by creating a [Hugging Face Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html#huggingface-estimator). The Estimator handles end-to-end SageMaker training. There are several parameters you should define in the Estimator: | |
| 1. `entry_point` specifies which fine-tuning script to use. | |
| 2. `instance_type` specifies an Amazon instance to launch. Refer [here](https://aws.amazon.com/sagemaker/pricing/) for a complete list of instance types. | |
| 3. `hyperparameters` specifies training hyperparameters. View additional available hyperparameters in the [train.py file](https://github.com/huggingface/notebooks/blob/main/sagemaker/01_getting_started_pytorch/scripts/train.py). | |
| The following code sample shows how to train with a custom script `train.py` with three hyperparameters (`epochs`, `per_device_train_batch_size`, and `model_name_or_path`): | |
| ```python | |
| from sagemaker.huggingface import HuggingFace | |
| # hyperparameters which are passed to the training job | |
| hyperparameters = { | |
|     'epochs': 1, | |
|     'per_device_train_batch_size': 32, | |
|     'model_name_or_path': 'distilbert-base-uncased', | |
| } | |
| # create the Estimator | |
| huggingface_estimator = HuggingFace( | |
|     entry_point='train.py', | |
|     source_dir='./scripts', | |
|     instance_type='ml.g6.12xlarge', | |
|     instance_count=1, | |
|     role=role,  # IAM role with SageMaker permissions | |
|     transformers_version='4.26', | |
|     pytorch_version='1.13', | |
|     py_version='py39', | |
|     hyperparameters=hyperparameters, | |
| ) | |
| ``` | |
| If you are running a `TrainingJob` locally, define `instance_type='local'` or `instance_type='local_gpu'` for GPU usage. Note that this will not work with SageMaker Studio. | |
| ## Execute training | |
| Start your `TrainingJob` by calling `fit` on a Hugging Face Estimator. Specify your input training data in `fit`. The input training data can be a: | |
| - S3 URI such as `s3://my-bucket/my-training-data`. | |
| - `FileSystemInput` for Amazon Elastic File System or FSx for Lustre. See [here](https://sagemaker.readthedocs.io/en/stable/overview.html?highlight=FileSystemInput#use-file-systems-as-training-inputs) for more details about using these file systems as input. | |
| Call `fit` to begin training: | |
| ```python | |
| huggingface_estimator.fit( | |
|     {'train': 's3://sagemaker-us-east-1-558105141721/samples/datasets/imdb/train', | |
|      'test': 's3://sagemaker-us-east-1-558105141721/samples/datasets/imdb/test'} | |
| ) | |
| ``` | |
| SageMaker starts and manages all the required EC2 instances and initiates the `TrainingJob` by running: | |
| ```bash | |
| /opt/conda/bin/python train.py --epochs 1 --model_name_or_path distilbert-base-uncased --per_device_train_batch_size 32 | |
| ``` | |
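Because the hyperparameters arrive as command-line flags, the training script has to parse them itself. The sketch below shows one way `train.py` could do this with `argparse`. It assumes the `SM_MODEL_DIR`, `SM_CHANNEL_TRAIN`, and `SM_CHANNEL_TEST` environment variables that SageMaker training containers set for the model directory and the `train`/`test` input channels. The `parse_args` function name, the extra flag names, and the default values are illustrative, not taken from the original script.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the hyperparameters that SageMaker passes as command-line flags."""
    parser = argparse.ArgumentParser()

    # Hyperparameters from the Estimator's `hyperparameters` dict
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--per_device_train_batch_size", type=int, default=32)
    parser.add_argument("--model_name_or_path", type=str)

    # Locations provided by the container through environment variables;
    # the fallbacks are the conventional paths inside the container.
    parser.add_argument(
        "--model_dir", type=str,
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
    parser.add_argument(
        "--training_dir", type=str,
        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"),
    )
    parser.add_argument(
        "--test_dir", type=str,
        default=os.environ.get("SM_CHANNEL_TEST", "/opt/ml/input/data/test"),
    )

    # Ignore flags the script does not know about instead of failing
    args, _ = parser.parse_known_args(argv)
    return args


# The same arguments as in the command shown above
args = parse_args([
    "--epochs", "1",
    "--model_name_or_path", "distilbert-base-uncased",
    "--per_device_train_batch_size", "32",
])
print(args.epochs, args.model_name_or_path, args.per_device_train_batch_size)
```

Inside a real job, calling `parse_args()` without arguments reads the flags from `sys.argv`.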
| ## Access trained model | |
| Once training is complete, you can access your model through the [AWS console](https://console.aws.amazon.com/console/home?nc2=h_ct&src=header-signin) or download it directly from S3. | |
| ```python | |
| from sagemaker.s3 import S3Downloader | |
| S3Downloader.download( | |
|     s3_uri=huggingface_estimator.model_data,  # S3 URI where the trained model is located | |
|     local_path='.',                           # local path where *.tar.gz is saved | |
|     sagemaker_session=sess                    # SageMaker session used for training the model | |
| ) | |
| ``` | |
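The downloaded artifact is a `model.tar.gz` archive, so it has to be unpacked before the model files can be loaded. A minimal sketch with the standard `tarfile` module follows. `extract_model` is a hypothetical helper, and the demo builds a tiny stand-in archive in a temporary directory so the snippet runs without an actual training job.

```python
import os
import tarfile
import tempfile


def extract_model(archive_path, target_dir):
    """Unpack a gzipped model archive and return the extracted file names."""
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        # Only extract archives you trust, such as your own training output
        tar.extractall(path=target_dir)
    return sorted(os.listdir(target_dir))


with tempfile.TemporaryDirectory() as tmp:
    # Build a stand-in model.tar.gz containing a single config.json
    config_path = os.path.join(tmp, "config.json")
    with open(config_path, "w") as f:
        f.write("{}")
    archive = os.path.join(tmp, "model.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(config_path, arcname="config.json")

    extracted = extract_model(archive, os.path.join(tmp, "model"))
    print(extracted)  # ['config.json']
```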
| ## Distributed training | |
| SageMaker provides two strategies for distributed training: data parallelism and model parallelism. Data parallelism splits a training set across several GPUs, while model parallelism splits a model across several GPUs. | |
| ### Data parallelism | |
| The Hugging Face [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) supports SageMaker's data parallelism library. If your training script uses the Trainer API, you only need to define the distribution parameter in the Hugging Face Estimator: | |
| ```python | |
| # configuration for running training on smdistributed data parallel | |
| distribution = {'smdistributed': {'dataparallel': {'enabled': True}}} | |
| # create the Estimator | |
| huggingface_estimator = HuggingFace( | |
|     entry_point='train.py', | |
|     source_dir='./scripts', | |
|     instance_type='ml.p3dn.24xlarge', | |
|     instance_count=2, | |
|     role=role, | |
|     transformers_version='4.26.0', | |
|     pytorch_version='1.13.1', | |
|     py_version='py39', | |
|     hyperparameters=hyperparameters, | |
|     distribution=distribution, | |
| ) | |
| ``` | |
| 📓 Open the [sagemaker-notebook.ipynb notebook](https://github.com/huggingface/notebooks/blob/main/sagemaker/07_tensorflow_distributed_training_data_parallelism/sagemaker-notebook.ipynb) for an example of how to run the data parallelism library with TensorFlow. | |
| ### Model parallelism | |
| The Hugging Face [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) also supports SageMaker's model parallelism library. If your training script uses the Trainer API, you only need to define the distribution parameter in the Hugging Face Estimator (see [here](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_general.html?highlight=modelparallel#required-sagemaker-python-sdk-parameters) for more detailed information about using model parallelism): | |
| ```python | |
| # configuration for running training on smdistributed model parallel | |
| mpi_options = { | |
|     "enabled": True, | |
|     "processes_per_host": 8, | |
| } | |
| smp_options = { | |
|     "enabled": True, | |
|     "parameters": { | |
|         "microbatches": 4, | |
|         "placement_strategy": "spread", | |
|         "pipeline": "interleaved", | |
|         "optimize": "speed", | |
|         "partitions": 4, | |
|         "ddp": True, | |
|     }, | |
| } | |
| distribution = { | |
|     "smdistributed": {"modelparallel": smp_options}, | |
|     "mpi": mpi_options, | |
| } | |
| # create the Estimator | |
| huggingface_estimator = HuggingFace( | |
|     entry_point='train.py', | |
|     source_dir='./scripts', | |
|     instance_type='ml.p3dn.24xlarge', | |
|     instance_count=2, | |
|     role=role, | |
|     transformers_version='4.26.0', | |
|     pytorch_version='1.13.1', | |
|     py_version='py39', | |
|     hyperparameters=hyperparameters, | |
|     distribution=distribution, | |
| ) | |
| ``` | |
| 📓 Open the [sagemaker-notebook.ipynb notebook](https://github.com/huggingface/notebooks/blob/main/sagemaker/04_distributed_training_model_parallelism/sagemaker-notebook.ipynb) for an example of how to run the model parallelism library. | |
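It can help to sanity-check how the model parallelism settings above relate to each other before launching a job. The sketch below rests on two assumptions that you should verify against the library documentation: that each MPI process drives one GPU, and that with `ddp` enabled the processes are grouped into model replicas of `partitions` processes each. `parallel_layout` is a hypothetical helper, not part of the SageMaker SDK.

```python
def parallel_layout(instance_count, processes_per_host, partitions):
    """Derive the process and replica counts implied by the configuration."""
    total_processes = instance_count * processes_per_host
    if total_processes % partitions != 0:
        raise ValueError(
            "total number of processes must be divisible by `partitions`"
        )
    return {
        "total_processes": total_processes,
        # Assumption: each model replica spans `partitions` processes
        "model_replicas": total_processes // partitions,
    }


# Values from the configuration above: 2 instances, 8 processes each, 4 partitions
layout = parallel_layout(instance_count=2, processes_per_host=8, partitions=4)
print(layout)  # {'total_processes': 16, 'model_replicas': 4}
```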
| ## Spot instances | |
| The Hugging Face extension for the SageMaker Python SDK lets you use [fully-managed EC2 spot instances](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html). This can help you save up to 90% of training costs! | |
| _Note: Unless your training job completes quickly, we recommend you use [checkpointing](https://docs.aws.amazon.com/sagemaker/latest/dg/model-checkpoints.html) with managed spot training. In this case, you need to define the `checkpoint_s3_uri`._ | |
| Set `use_spot_instances=True` and define your `max_wait` and `max_run` time in the Estimator to use spot instances: | |
| ```python | |
| # hyperparameters which are passed to the training job | |
| hyperparameters = { | |
|     'epochs': 1, | |
|     'train_batch_size': 32, | |
|     'model_name': 'distilbert-base-uncased', | |
|     'output_dir': '/opt/ml/checkpoints', | |
| } | |
| # create the Estimator | |
| huggingface_estimator = HuggingFace( | |
|     entry_point='train.py', | |
|     source_dir='./scripts', | |
|     instance_type='ml.g6.12xlarge', | |
|     instance_count=1, | |
|     checkpoint_s3_uri=f's3://{sess.default_bucket()}/checkpoints', | |
|     use_spot_instances=True, | |
|     # max_wait should be equal to or greater than max_run in seconds | |
|     max_wait=3600, | |
|     max_run=1000, | |
|     role=role, | |
|     transformers_version='4.26', | |
|     pytorch_version='1.13', | |
|     py_version='py39', | |
|     hyperparameters=hyperparameters, | |
| ) | |
| # Training seconds: 874 | |
| # Billable seconds: 262 | |
| # Managed Spot Training savings: 70.0% | |
| ``` | |
| 📓 Open the [sagemaker-notebook.ipynb notebook](https://github.com/huggingface/notebooks/blob/main/sagemaker/05_spot_instances/sagemaker-notebook.ipynb) for an example of how to use spot instances. | |
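The savings figure reported at the end of a spot training job follows from the training and billable seconds, and the `max_wait`/`max_run` constraint can be checked before the job is submitted. This is a small sketch of both calculations, assuming savings are computed as the share of training time that was not billed. `check_spot_limits` and `spot_savings` are hypothetical helper names.

```python
def check_spot_limits(max_run, max_wait):
    """Raise if max_wait (seconds) is smaller than max_run (seconds)."""
    if max_wait < max_run:
        raise ValueError("max_wait must be equal to or greater than max_run")


def spot_savings(training_seconds, billable_seconds):
    """Return the percentage of training time that was not billed."""
    return round((1 - billable_seconds / training_seconds) * 100, 1)


# Values from the example above
check_spot_limits(max_run=1000, max_wait=3600)
savings = spot_savings(training_seconds=874, billable_seconds=262)
print(f"Managed Spot Training savings: {savings}%")  # 70.0%
```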
| ## Git repository | |
| The Hugging Face Estimator can load a training script [stored in a GitHub repository](https://sagemaker.readthedocs.io/en/stable/overview.html#use-scripts-stored-in-a-git-repository). Provide the relative path to the training script in `entry_point` and the relative path to the directory in `source_dir`. | |
| If you are using `git_config` to run the [🤗 Transformers example scripts](https://github.com/huggingface/transformers/tree/main/examples), the `'branch'` in `git_config` must match the `transformers_version` of the Estimator (e.g. if you use `transformers_version='4.4.2'`, you have to use `'branch':'v4.4.2'`). | |
| _Tip: Save your model to S3 by setting `output_dir=/opt/ml/model` in the hyperparameters of your training script._ | |
| ```python | |
| # configure git settings | |
| git_config = {'repo': 'https://github.com/huggingface/transformers.git', 'branch': 'v4.26.0'}  # the branch matches the transformers_version used in the estimator | |
| # create the Estimator | |
| huggingface_estimator = HuggingFace( | |
|     entry_point='run_glue.py', | |
|     source_dir='./examples/pytorch/text-classification', | |
|     git_config=git_config, | |
|     instance_type='ml.g6.12xlarge', | |
|     instance_count=1, | |
|     role=role, | |
|     transformers_version='4.26', | |
|     pytorch_version='1.13', | |
|     py_version='py39', | |
|     hyperparameters=hyperparameters, | |
| ) | |
| ``` | |
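To keep the branch and the `transformers_version` in sync, the branch name can be derived from the version string rather than typed twice. The sketch below assumes release tags follow the `v<major>.<minor>.<patch>` pattern and that a two-part version such as `4.26` corresponds to the `.0` patch release. `branch_for_version` and `build_git_config` are hypothetical helpers.

```python
def branch_for_version(transformers_version):
    """Map a version such as '4.26' or '4.4.2' to a tag name such as 'v4.26.0'."""
    parts = transformers_version.split(".")
    if not 2 <= len(parts) <= 3 or not all(part.isdigit() for part in parts):
        raise ValueError(f"unexpected version format: {transformers_version!r}")
    # Assumption: a missing patch component means the '.0' release
    while len(parts) < 3:
        parts.append("0")
    return "v" + ".".join(parts)


def build_git_config(transformers_version):
    """Build a git_config dict whose branch matches the given version."""
    return {
        "repo": "https://github.com/huggingface/transformers.git",
        "branch": branch_for_version(transformers_version),
    }


git_config = build_git_config("4.26")
print(git_config["branch"])  # v4.26.0
```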
| ## SageMaker metrics | |
| [SageMaker metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/training-metrics.html#define-train-metrics) automatically parse training job logs for metrics and send them to CloudWatch. If you want SageMaker to parse the logs, you must specify each metric's name and a regular expression that SageMaker uses to find the metric. | |
| ```python | |
| # define metric definitions (raw strings keep the regex backslashes intact) | |
| metric_definitions = [ | |
|     {"Name": "train_runtime", "Regex": r"train_runtime.*=\D*(.*?)$"}, | |
|     {"Name": "eval_accuracy", "Regex": r"eval_accuracy.*=\D*(.*?)$"}, | |
|     {"Name": "eval_loss", "Regex": r"eval_loss.*=\D*(.*?)$"}, | |
| ] | |
| # create the Estimator | |
| huggingface_estimator = HuggingFace( | |
|     entry_point='train.py', | |
|     source_dir='./scripts', | |
|     instance_type='ml.g6.12xlarge', | |
|     instance_count=1, | |
|     role=role, | |
|     transformers_version='4.26', | |
|     pytorch_version='1.13', | |
|     py_version='py39', | |
|     metric_definitions=metric_definitions, | |
|     hyperparameters=hyperparameters, | |
| ) | |
| ``` | |
| 📓 Open the [notebook](https://github.com/huggingface/notebooks/blob/main/sagemaker/06_sagemaker_metrics/sagemaker-notebook.ipynb) for an example of how to capture metrics in SageMaker. | |
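Before submitting a job, you can try the metric regular expressions locally to see what they capture. The sketch below applies them with Python's `re` module to a few made-up log lines. The log format shown is an assumption, and SageMaker performs the real matching on its side, so treat this only as a rough check. `extract_metrics` is a hypothetical helper.

```python
import re

metric_definitions = [
    {"Name": "train_runtime", "Regex": r"train_runtime.*=\D*(.*?)$"},
    {"Name": "eval_accuracy", "Regex": r"eval_accuracy.*=\D*(.*?)$"},
    {"Name": "eval_loss", "Regex": r"eval_loss.*=\D*(.*?)$"},
]

# Assumption: the training script logs metrics as "<name> = <value>"
sample_log_lines = [
    "train_runtime = 874.5",
    "eval_accuracy = 0.9312",
    "eval_loss = 0.2541",
]


def extract_metrics(lines, definitions):
    """Return {metric name: value} for every definition that matches a line."""
    found = {}
    for line in lines:
        for definition in definitions:
            match = re.search(definition["Regex"], line)
            if match:
                # The first capture group holds the metric value
                found[definition["Name"]] = float(match.group(1))
    return found


metrics = extract_metrics(sample_log_lines, metric_definitions)
print(metrics)
```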