# Train and deploy a Hugging Face model on Amazon SageMaker with the SDK
This quickstart guide shows you how to use Hugging Face on Amazon SageMaker with the SageMaker Python SDK. You will fine-tune and deploy a pretrained 🤗 Transformers model on SageMaker for a binary text classification task.
<iframe width="560" height="315" src="https://www.youtube.com/embed/pYqjCzoyWyo" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
📓 Open the [sagemaker-notebook.ipynb file](https://github.com/huggingface/notebooks/blob/main/sagemaker/01_getting_started_pytorch/sagemaker-notebook.ipynb) to follow along!
## Installation and setup
Get started by installing the necessary Hugging Face libraries and the SageMaker Python SDK. You will also need [PyTorch](https://pytorch.org/get-started/locally/) if it is not already installed. If you run this example in SageMaker Studio, PyTorch is already available in the notebook kernel.
```python
pip install "sagemaker>=2.140.0" "transformers==4.26.1" "datasets[s3]==2.10.1" --upgrade
```
If you want to run this example in [SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html), upgrade [ipywidgets](https://ipywidgets.readthedocs.io/en/latest/) for the 🤗 Datasets library and restart the kernel:
```python
%%capture
import IPython
!conda install -c conda-forge ipywidgets -y
IPython.Application.instance().kernel.do_shutdown(True)
```
Next, you should set up your environment: a SageMaker session and an S3 bucket. The S3 bucket will store data, models, and logs. You will need access to an [IAM execution role](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) with the required permissions.
If you are planning on using SageMaker in a local environment, you need to provide the `role` yourself. Learn more about how to set this up [here](https://huggingface.co/docs/sagemaker/train#installation-and-setup).
โš ๏ธ The execution role is only available when you run a notebook within SageMaker. If you try to run `get_execution_role` in a notebook not on SageMaker, you will get a region error.
```python
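import sagemaker

sess = sagemaker.Session()
# bucket for data, models, and logs; leave as None to use the session default
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    sagemaker_session_bucket = sess.default_bucket()

# IAM execution role (only resolvable when running inside SageMaker)
role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)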
```
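When running outside SageMaker, one way to supply the role is to look it up by name with boto3. The following is a minimal sketch, not part of the original tutorial: the helper name `resolve_role_arn` and the role name `my-sagemaker-execution-role` are hypothetical, and it assumes your local AWS credentials are allowed to call `iam:GetRole`.

```python
def resolve_role_arn(iam_client, role_name):
    """Return the ARN of an IAM role, given a boto3 IAM client.

    `iam_client` is expected to behave like `boto3.client("iam")`, whose
    `get_role` call returns a dict shaped like {"Role": {"Arn": "..."}}.
    """
    response = iam_client.get_role(RoleName=role_name)
    return response["Role"]["Arn"]


# Usage outside SageMaker (requires boto3 and configured AWS credentials):
#
#   import boto3
#   role = resolve_role_arn(boto3.client("iam"), "my-sagemaker-execution-role")
```

The client is passed in as an argument so the helper can be exercised with a stub before it is pointed at a real AWS account.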
## Preprocess
The 🤗 Datasets library makes it easy to download and preprocess a dataset for training. Download and tokenize the [IMDb](https://huggingface.co/datasets/imdb) dataset:
```python
from datasets import load_dataset
from transformers import AutoTokenizer
# load dataset
train_dataset, test_dataset = load_dataset("imdb", split=["train", "test"])
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
# create tokenization function
def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)
# tokenize train and test datasets
train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)
# set dataset format for PyTorch
train_dataset = train_dataset.rename_column("label", "labels")
train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
test_dataset = test_dataset.rename_column("label", "labels")
test_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
```
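To make the `padding="max_length"` and `truncation=True` arguments concrete, here is a toy, pure-Python illustration of what they do to a list of token IDs. It is not the tokenizer's implementation: `pad_or_truncate`, the pad ID `0`, the sample IDs, and the tiny `max_length` are assumptions chosen for readability, whereas the real tokenizer uses the model's own maximum length and pad token.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation: drop the overflow
    n_pad = max_length - len(kept)                   # padding: fill the remainder
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad   # 0 marks padding positions
    return input_ids, attention_mask


# A sequence shorter than max_length is padded; a longer one is cut off.
short_ids, short_mask = pad_or_truncate([101, 2023, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 2023, 3185, 2001, 2307, 102], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every example ends up with the same length, which is what lets the dataset be batched into fixed-shape tensors after `set_format("torch", ...)`.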
## Upload dataset to S3 bucket
Next, upload the preprocessed dataset to your S3 session bucket using the 🤗 Datasets S3 [filesystem](https://huggingface.co/docs/datasets/filesystems.html) implementation:
```python
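# key prefix inside the session bucket (example value; choose your own)
s3_prefix = "samples/datasets/imdb"

# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'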
train_dataset.save_to_disk(training_input_path)
# save test_dataset to s3
test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test'
test_dataset.save_to_disk(test_input_path)
```
## Start a training job
Create a Hugging Face Estimator to handle end-to-end SageMaker training and deployment. The most important parameters to pay attention to are:
* `entry_point` refers to the fine-tuning script, which you can find in the [train.py file](https://github.com/huggingface/notebooks/blob/main/sagemaker/01_getting_started_pytorch/scripts/train.py).
* `instance_type` refers to the SageMaker instance that will be launched. Take a look [here](https://aws.amazon.com/sagemaker/pricing/) for a complete list of instance types.
* `hyperparameters` refers to the training hyperparameters the model will be fine-tuned with.
```python
from sagemaker.huggingface import HuggingFace
hyperparameters = {
    "epochs": 1,                                        # number of training epochs
    "train_batch_size": 32,                             # training batch size
    "model_name": "distilbert/distilbert-base-uncased"  # name of pretrained model
}

huggingface_estimator = HuggingFace(
    entry_point="train.py",          # fine-tuning script to use in training job
    source_dir="./scripts",          # directory where fine-tuning script is stored
    instance_type="ml.p3.2xlarge",   # instance type
    instance_count=1,                # number of instances
    role=role,                       # IAM role used in training job to access AWS resources (S3)
    transformers_version="4.36",     # Transformers version
    pytorch_version="2.1.0",         # PyTorch version
    py_version="py310",              # Python version
    hyperparameters=hyperparameters  # hyperparameters to use in training job
)
```
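SageMaker passes each entry of `hyperparameters` to the entry-point script as a command-line argument (`--epochs 1`, and so on), so the script typically reads them with `argparse`. The sketch below shows that pattern. It is an illustration rather than the contents of the linked `train.py`; the function name `parse_hyperparameters` and the default values are assumptions.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse the hyperparameters that arrive as command-line arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--model_name", type=str,
                        default="distilbert/distilbert-base-uncased")
    # parse_known_args tolerates extra arguments the script does not declare
    args, _ = parser.parse_known_args(argv)
    return args


# Simulate the arguments produced by the hyperparameters dictionary above
args = parse_hyperparameters([
    "--epochs", "1",
    "--train_batch_size", "32",
    "--model_name", "distilbert/distilbert-base-uncased",
])
print(args.epochs, args.train_batch_size, args.model_name)
```

Declaring a `type` for each argument matters because every value arrives as a string on the command line.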
Begin training with one line of code:
```python
huggingface_estimator.fit({"train": training_input_path, "test": test_input_path})
```
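The keys of the dictionary passed to `fit()` are channel names. Inside the training container, SageMaker downloads each channel to a local directory and exposes its path through an environment variable named `SM_CHANNEL_<NAME>`. Here is a minimal sketch of how a training script might locate its data; `channel_dir` is a hypothetical helper, and the fallback path assumes SageMaker's conventional `/opt/ml/input/data/<channel>` layout.

```python
import os


def channel_dir(name, environ=None):
    """Return the local directory for a SageMaker input channel."""
    environ = os.environ if environ is None else environ
    default = f"/opt/ml/input/data/{name}"
    return environ.get(f"SM_CHANNEL_{name.upper()}", default)


# Simulate the container environment with an explicit mapping
fake_env = {"SM_CHANNEL_TRAIN": "/opt/ml/input/data/train"}
print(channel_dir("train", fake_env))
print(channel_dir("test", fake_env))  # variable not set, so the fallback is used
```

In the training script, the returned directory could then be passed to `datasets.load_from_disk` to restore the dataset saved in the upload step.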
## Deploy model
Once the training job is complete, deploy your fine-tuned model by calling `deploy()` with the number of instances and instance type:
```python
predictor = huggingface_estimator.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")
```
Call `predict()` on your data:
```python
sentiment_input = {"inputs": "It feels like a curtain closing...there was an elegance in the way they moved toward conclusion. No fan is going to watch and feel short-changed."}
predictor.predict(sentiment_input)
```
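For a text classification model, the endpoint is assumed here to respond with a list of `{"label": ..., "score": ...}` dictionaries, which is the shape the Transformers text-classification pipeline produces. The hypothetical helper `top_prediction` below picks the highest-scoring entry. The sample response is made up for illustration and is not real model output.

```python
def top_prediction(response):
    """Return (label, score) for the highest-scoring entry of a response."""
    if not response:
        raise ValueError("empty response")
    best = max(response, key=lambda item: item["score"])
    return best["label"], best["score"]


# Made-up response in the assumed shape (not real model output)
sample_response = [
    {"label": "LABEL_0", "score": 0.02},
    {"label": "LABEL_1", "score": 0.98},
]
label, score = top_prediction(sample_response)
print(label, score)
```

With a live endpoint, the same helper could be applied to the return value of `predictor.predict(sentiment_input)`, provided the response has this shape.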
After running your request, delete the endpoint:
```python
predictor.delete_endpoint()
```
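Because the endpoint keeps running until it is deleted, it can help to wrap the request in `try`/`finally` so that cleanup happens even if the prediction raises. This is a minimal sketch; `predict_and_cleanup` is a hypothetical helper that works with any object exposing `predict()` and `delete_endpoint()`, such as the `predictor` above.

```python
def predict_and_cleanup(predictor, payload):
    """Run one prediction, then delete the endpoint whether or not it succeeded."""
    try:
        return predictor.predict(payload)
    finally:
        # Runs on success and on error, so the endpoint is never left behind
        predictor.delete_endpoint()
```

This pattern suits one-off experiments. For an endpoint that serves repeated requests, delete it once at the end of the session instead.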
## What's next?
Congratulations, you've just fine-tuned and deployed a pretrained 🤗 Transformers model on SageMaker for binary text classification! 🎉
