	# Train and deploy a Hugging Face model on Amazon SageMaker with the SDK

	This getting-started guide shows you how to quickly use Hugging Face on Amazon SageMaker with the SDK. You will fine-tune and deploy a pretrained 🤗 Transformers model on SageMaker for a binary text classification task.

	<iframe width="560" height="315" src="https://www.youtube.com/embed/pYqjCzoyWyo" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

	📓 Open the [sagemaker-notebook.ipynb file](https://github.com/huggingface/notebooks/blob/main/sagemaker/01_getting_started_pytorch/sagemaker-notebook.ipynb) to follow along!

	## Installation and setup

	Get started by installing the necessary Hugging Face libraries and SageMaker. You will also need to install [PyTorch](https://pytorch.org/get-started/locally/) if you don't already have it installed. If you run this example in SageMaker Studio, it is already installed in the notebook kernel!

	```python
	%pip install "sagemaker>=2.140.0" "transformers==4.26.1" "datasets[s3]==2.10.1" --upgrade
	```

	If you want to run this example in [SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html), upgrade [ipywidgets](https://ipywidgets.readthedocs.io/en/latest/) for the 🤗 Datasets library and restart the kernel:

	```python
	%%capture
	import IPython
	!conda install -c conda-forge ipywidgets -y
	IPython.Application.instance().kernel.do_shutdown(True)
	```

	Next, you should set up your environment: a SageMaker session and an S3 bucket. The S3 bucket will store data, models, and logs. You will need access to an [IAM execution role](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) with the required permissions.

	If you are planning on using SageMaker in a local environment, you need to provide the `role` yourself. Learn more about how to set this up [here](https://huggingface.co/docs/sagemaker/train#installation-and-setup).

	⚠️ The execution role is only available when you run a notebook within SageMaker. If you call `get_execution_role` from a notebook outside SageMaker, you will get a region error.

	```python
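	import sagemaker

	sess = sagemaker.Session()
	# set this to an existing bucket name to use it instead of the session default
	sagemaker_session_bucket = None
	if sagemaker_session_bucket is None and sess is not None:
	    # fall back to the session's default bucket
	    sagemaker_session_bucket = sess.default_bucket()

	# IAM execution role (only resolvable when running inside SageMaker)
	role = sagemaker.get_execution_role()
	sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)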
	```

	## Preprocess

	The 🤗 Datasets library makes it easy to download and preprocess a dataset for training. Download and tokenize the [IMDb](https://huggingface.co/datasets/imdb) dataset:

	```python
	from datasets import load_dataset
	from transformers import AutoTokenizer

	# load dataset
	train_dataset, test_dataset = load_dataset("imdb", split=["train", "test"])

	# load tokenizer
	tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

	# create tokenization function
	def tokenize(batch):
	    return tokenizer(batch["text"], padding="max_length", truncation=True)

	# tokenize train and test datasets
	train_dataset = train_dataset.map(tokenize, batched=True)
	test_dataset = test_dataset.map(tokenize, batched=True)

	# set dataset format for PyTorch
	train_dataset = train_dataset.rename_column("label", "labels")
	train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
	test_dataset = test_dataset.rename_column("label", "labels")
	test_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
	```
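
To make the `padding="max_length"` and `truncation=True` arguments concrete, here is a toy sketch in plain Python of what they do to a single sequence of token IDs. The `pad_and_truncate` helper, the `max_length=8` default, the `pad_id=0` default, and the token IDs are illustrative assumptions, not part of 🤗 Transformers. The real tokenizer pads to the model's maximum length (512 tokens for DistilBERT) and also takes care of special tokens, which this sketch ignores.

```python
def pad_and_truncate(input_ids, max_length=8, pad_id=0):
    """Toy version of padding="max_length" plus truncation=True for one sequence."""
    # Truncation: drop any tokens beyond max_length
    ids = list(input_ids[:max_length])
    # Attention mask: 1 marks a real token, 0 marks padding
    mask = [1] * len(ids)
    # Padding: extend both lists to exactly max_length
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": mask + [0] * n_pad,
    }


short_example = pad_and_truncate([101, 2023, 3185, 102])  # shorter than max_length
long_example = pad_and_truncate(list(range(1, 13)))  # longer than max_length
```

Every sequence ends up with the same length, which is what lets the datasets above be converted to fixed-shape PyTorch tensors.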

	## Upload dataset to S3 bucket

	Next, upload the preprocessed dataset to your S3 session bucket with the 🤗 Datasets S3 [filesystem](https://huggingface.co/docs/datasets/filesystems.html) implementation:

	```python
	# S3 key prefix for the dataset (example value; any prefix works)
	s3_prefix = "samples/datasets/imdb"

	# save train_dataset to s3
	training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'
	train_dataset.save_to_disk(training_input_path)

	# save test_dataset to s3
	test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test'
	test_dataset.save_to_disk(test_input_path)
	```

	## Start a training job

	Create a Hugging Face Estimator to handle end-to-end SageMaker training and deployment. The most important parameters to pay attention to are:

	* `entry_point` refers to the fine-tuning script, which you can find in the [train.py file](https://github.com/huggingface/notebooks/blob/main/sagemaker/01_getting_started_pytorch/scripts/train.py).
	* `instance_type` refers to the SageMaker instance that will be launched. See the [SageMaker pricing page](https://aws.amazon.com/sagemaker/pricing/) for a complete list of instance types.
	* `hyperparameters` refers to the training hyperparameters the model will be fine-tuned with.

	```python
	from sagemaker.huggingface import HuggingFace

	hyperparameters = {
	    "epochs": 1,                                         # number of training epochs
	    "train_batch_size": 32,                              # training batch size
	    "model_name": "distilbert/distilbert-base-uncased",  # name of pretrained model
	}

	huggingface_estimator = HuggingFace(
	    entry_point="train.py",           # fine-tuning script to use in training job
	    source_dir="./scripts",           # directory where fine-tuning script is stored
	    instance_type="ml.p3.2xlarge",    # instance type
	    instance_count=1,                 # number of instances
	    role=role,                        # IAM role used in training job to access AWS resources (S3)
	    transformers_version="4.36",      # Transformers version
	    pytorch_version="2.1.0",          # PyTorch version
	    py_version="py310",               # Python version
	    hyperparameters=hyperparameters,  # hyperparameters to use in training job
	)
	```
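
For context on how `hyperparameters` reaches the script: in SageMaker script mode, each entry is passed to the entry point as a command-line argument (for example `--epochs 1`), and locations such as the model output directory and the input channels are exposed through `SM_*` environment variables. The sketch below shows how a script could read them with `argparse`. It is a simplified illustration, not a copy of the linked `train.py`. The `parse_args` helper, the argument defaults, and the fallback paths are assumptions chosen so that the example also runs outside SageMaker.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as CLI arguments, e.g. --epochs 1 --train_batch_size 32
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--model_name", type=str)
    # Data and model locations come from environment variables set by SageMaker;
    # the fallback paths are assumptions for running outside a training job
    parser.add_argument("--model_dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--training_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    parser.add_argument("--test_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TEST", "/opt/ml/input/data/test"))
    # parse_known_args tolerates extra arguments this parser does not define
    args, _ = parser.parse_known_args(argv)
    return args


# Simulate the arguments generated from the hyperparameters dictionary
args = parse_args([
    "--epochs", "1",
    "--train_batch_size", "32",
    "--model_name", "distilbert/distilbert-base-uncased",
])
```

The channel names in `SM_CHANNEL_TRAIN` and `SM_CHANNEL_TEST` correspond to the `"train"` and `"test"` keys passed to `fit()`.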

	Begin training with one line of code:

	```python
	huggingface_estimator.fit({"train": training_input_path, "test": test_input_path})
	```

	## Deploy model

	Once the training job is complete, deploy your fine-tuned model by calling `deploy()` with the number of instances and instance type:

	```python
	predictor = huggingface_estimator.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")
	```

	Call `predict()` on your data:

	```python
	sentiment_input = {"inputs": "It feels like a curtain closing...there was an elegance in the way they moved toward conclusion. No fan is going to watch and feel short-changed."}

	predictor.predict(sentiment_input)
	```
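
The endpoint returns a text-classification result, typically a list of dictionaries that each hold a `label` and a `score`. As a sketch of post-processing, the helper below maps such a result to a readable sentiment. The `to_sentiment` helper, the sample response, and the `LABEL_0`/`LABEL_1` names are assumptions: the actual label strings depend on the `id2label` mapping saved with your model, and the negative/positive mapping follows the IMDb convention (0 = negative, 1 = positive).

```python
# Assumed mapping; adjust it to the label names your model actually returns
LABEL_TO_SENTIMENT = {"LABEL_0": "negative", "LABEL_1": "positive"}


def to_sentiment(response):
    """Return (sentiment, score) for the highest-scoring prediction."""
    best = max(response, key=lambda prediction: prediction["score"])
    # Fall back to the raw label if it is not in the mapping
    return LABEL_TO_SENTIMENT.get(best["label"], best["label"]), best["score"]


# Hypothetical response, shaped like the result of predictor.predict(sentiment_input)
sample_response = [{"label": "LABEL_1", "score": 0.98}]
sentiment, score = to_sentiment(sample_response)
```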

	After running your request, delete the endpoint:

	```python
	predictor.delete_endpoint()
	```
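
Because a running endpoint keeps incurring charges until it is deleted, it can help to make cleanup automatic. The sketch below wraps a predictor in a context manager that calls `delete_endpoint()` even if a request raises an exception. `temporary_endpoint` and the `FakePredictor` stand-in are hypothetical names used only so that the example runs without AWS; in practice you would pass the `predictor` returned by `deploy()`.

```python
from contextlib import contextmanager


@contextmanager
def temporary_endpoint(predictor):
    """Yield the predictor and always delete its endpoint afterwards."""
    try:
        yield predictor
    finally:
        predictor.delete_endpoint()


class FakePredictor:
    """Hypothetical stand-in for a SageMaker predictor (makes no AWS calls)."""

    def __init__(self):
        self.deleted = False

    def predict(self, data):
        return [{"label": "LABEL_1", "score": 0.98}]

    def delete_endpoint(self):
        self.deleted = True


fake = FakePredictor()
with temporary_endpoint(fake) as endpoint:
    result = endpoint.predict({"inputs": "A great film."})
# At this point fake.delete_endpoint() has been called
```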

	## What's next?

	Congratulations, you've just fine-tuned and deployed a pretrained 🤗 Transformers model on SageMaker for binary text classification! 🎉

