| # Deploy Mixtral 8x7B on AWS Inferentia2 | |
| Mixtral 8x7B is an open-source LLM from Mistral AI. It is a Sparse Mixture of Experts and has a similar architecture to Mistral 7B, but comes with a twist: it’s actually 8 “expert” models in one. If you want to learn more about MoEs check out [Mixture of Experts Explained](https://huggingface.co/blog/moe). | |
| In this tutorial you will learn how to deploy the [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) model on AWS Inferentia2 with Hugging Face Optimum Neuron on Amazon SageMaker. We are going to use the Hugging Face vLLM Neuron Container, a purpose-built Inference Container to easily deploy LLMs on AWS Inferentia2 powered by [vLLM](https://github.com/vllm-project/vllm.git) and [Optimum Neuron](https://huggingface.co/docs/optimum-neuron/index). | |
| We will cover how to: | |
| 1. [Set up a development environment](#1-setup-development-environment) | |
| 2. [Retrieve the latest Hugging Face vLLM Neuron DLC](#2-retrieve-the-latest-hugging-face-vllm-neuron-dlc) | |
| 3. [Deploy Mixtral 8x7B to Inferentia2](#3-deploy-mixtral-8x7b-to-inferentia2) | |
| 4. [Clean up](#4-clean-up) | |
| Let's get started! 🚀 | |
| [AWS Inferentia2 (Inf2)](https://aws.amazon.com/ec2/instance-types/inf2/) instances are purpose-built EC2 instances for deep learning (DL) inference workloads. The table below lists the instance sizes of the Inferentia2 family. | |
| | instance size | accelerators | Neuron Cores | accelerator memory (GB) | vCPU | CPU memory (GB) | on-demand price ($/h) | |
| | ------------- | ------------ | ------------ | ------------------ | ---- | ---------- | --------------------- | | |
| | inf2.xlarge | 1 | 2 | 32 | 4 | 16 | 0.76 | | |
| | inf2.8xlarge | 1 | 2 | 32 | 32 | 128 | 1.97 | | |
| | inf2.24xlarge | 6 | 12 | 192 | 96 | 384 | 6.49 | | |
| | inf2.48xlarge | 12 | 24 | 384 | 192 | 768 | 12.98 | | |
| ## 1. Setup development environment | |
| For this tutorial, we are going to use a Notebook Instance in Amazon SageMaker with the Python 3 (ipykernel) kernel and the `sagemaker` Python SDK to deploy Mixtral 8x7B to a SageMaker inference endpoint. | |
| Make sure you have the latest version of the SageMaker SDK installed. | |
| ```python | |
| !pip install sagemaker --upgrade --quiet | |
| ``` | |
| Then, retrieve the SageMaker execution role. | |
| ```python | |
| import boto3 | |
| from sagemaker.core.helper.session_helper import get_execution_role | |
| try: | |
|     role = get_execution_role() | |
| except ValueError: | |
|     iam = boto3.client("iam") | |
|     role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"] | |
| print(f"sagemaker role arn: {role}") | |
| ``` | |
| ## 2. Retrieve the latest Hugging Face vLLM Neuron DLC | |
| The latest Hugging Face vLLM Neuron DLCs can be used to run inference on AWS Inferentia2. To retrieve the image URI, you can use the `image_uris.retrieve` method of the SageMaker SDK. However, if you have the Optimum Neuron package installed, you can use the `ecr.image_uri` function to retrieve the appropriate Hugging Face vLLM Neuron DLC URI based on your desired `region` and `version`. Default values can be deduced from your AWS credentials. For more details see the [containers](https://huggingface.co/docs/optimum-neuron/containers) documentation. | |
| ```python | |
| !pip install optimum-neuron[neuronx] | |
| from optimum.neuron.utils import ecr | |
| REGION = "us-east-1" | |
| llm_image = ecr.image_uri("vllm", region=REGION) | |
| # print image uri | |
| print(f"llm image uri: {llm_image}") | |
| ``` | |
| ## 3. Deploy Mixtral 8x7B to Inferentia2 | |
| At the time of writing, [AWS Inferentia2 does not support dynamic shapes for inference](https://awsdocs-neuron.readthedocs-hosted.com/en/v2.6.0/general/arch/neuron-features/dynamic-shapes.html#neuron-dynamic-shapes), which means that we need to specify our sequence length and batch size ahead of time. | |
| To make it easier for customers to utilize the full power of Inferentia2, we created a [neuron model cache](https://huggingface.co/docs/optimum-neuron/guides/cache_system), which contains pre-compiled configurations for the most popular LLMs, including Mixtral 8x7B. | |
| This means we don't need to compile the model ourselves, but we can use the pre-compiled model from the cache. You can find compiled/cached configurations on the | |
| [Hugging Face Hub](https://huggingface.co/aws-neuron/optimum-neuron-cache/tree/main/inference-cache-config). If your desired configuration is not yet cached, you can compile it yourself using the [Optimum CLI](https://huggingface.co/docs/optimum-neuron/guides/export_model) or open a request at the [Cache repository](https://huggingface.co/aws-neuron/optimum-neuron-cache/discussions). | |
| Let's check the different configurations that are in the cache. For that you first need to log in to the Hugging Face Hub, using a [User Access Token](https://huggingface.co/docs/hub/en/security-tokens) with read access. | |
| Make sure you have the necessary permissions to access the model. You can request access to the model [here](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1). | |
| ```python | |
| from huggingface_hub import notebook_login | |
| notebook_login() | |
| ``` | |
| Then, we need to install the latest version of Optimum Neuron. | |
| ```python | |
| !pip install optimum-neuron --upgrade --quiet | |
| ``` | |
| Finally, we can query the cache and retrieve the existing set of configurations for which we maintained a compiled version of the model. | |
| ```python | |
| !optimum-cli neuron cache lookup "mistralai/Mixtral-8x7B-Instruct-v0.1" | |
| ``` | |
| You should retrieve two entries in the cache: | |
| ```code | |
| *** 2 entrie(s) found in cache for mistralai/Mixtral-8x7B-Instruct-v0.1 for inference.*** | |
| auto_cast_type: bf16 | |
| batch_size: 1 | |
| checkpoint_id: mistralai/Mixtral-8x7B-Instruct-v0.1 | |
| checkpoint_revision: 41bd4c9e7e4fb318ca40e721131d4933966c2cc1 | |
| compiler_type: neuronx-cc | |
| compiler_version: 2.16.372.0+4a9b2326 | |
| num_cores: 24 | |
| sequence_length: 4096 | |
| task: text-generation | |
| auto_cast_type: bf16 | |
| batch_size: 4 | |
| checkpoint_id: mistralai/Mixtral-8x7B-Instruct-v0.1 | |
| checkpoint_revision: 41bd4c9e7e4fb318ca40e721131d4933966c2cc1 | |
| compiler_type: neuronx-cc | |
| compiler_version: 2.16.372.0+4a9b2326 | |
| num_cores: 24 | |
| sequence_length: 4096 | |
| task: text-generation | |
| ``` | |
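The lookup output above can feed a small selection step. The sketch below is only an illustration: it assumes the two entries have been copied by hand into Python dictionaries, and `select_cached_config` is a hypothetical helper, not part of Optimum Neuron.

```python
# The two cached entries listed above, transcribed by hand.
# Only the fields needed for selection are kept.
CACHED_CONFIGS = [
    {"batch_size": 1, "sequence_length": 4096, "num_cores": 24, "auto_cast_type": "bf16"},
    {"batch_size": 4, "sequence_length": 4096, "num_cores": 24, "auto_cast_type": "bf16"},
]


def select_cached_config(configs, batch_size, min_sequence_length=0):
    """Return the first entry with the requested batch size whose
    sequence length is at least min_sequence_length, or None."""
    for config in configs:
        if (
            config["batch_size"] == batch_size
            and config["sequence_length"] >= min_sequence_length
        ):
            return config
    return None


selected = select_cached_config(CACHED_CONFIGS, batch_size=1)
print(selected)
```

If the helper returns `None`, the requested shape is not cached and the model would have to be compiled first.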
| **Deploying Mixtral 8x7B to a SageMaker Endpoint** | |
| All we need when deploying the model to Amazon SageMaker is to set the Hugging Face model ID and token. | |
| - `SM_ON_MODEL`: The Hugging Face model ID. | |
| - `HF_TOKEN`: The Hugging Face API token to access gated models. | |
| Note: even if your model is not gated, we recommend setting your Hugging Face token to avoid rate limitations when fetching weights or pre-compiled neuron artifacts. | |
| Optionally, you can specify some deployment parameters to select a specific cached configuration (otherwise a default one will be selected). | |
| - `SM_ON_TENSOR_PARALLEL_SIZE`: Number of Neuron Cores used for the compilation. | |
| - `SM_ON_BATCH_SIZE`: The batch size that was used to compile the model. | |
| - `SM_ON_SEQUENCE_LENGTH`: The sequence length that was used to compile the model. | |
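As a minimal sketch of how one cached configuration maps onto these variables, the helper below converts the fields of a cache entry into the string values that container environment variables require. `build_environment` is a hypothetical name introduced for illustration, and the mapping of `num_cores` to `SM_ON_TENSOR_PARALLEL_SIZE` simply follows the descriptions above.

```python
def build_environment(model_id, hf_token, cached_config=None):
    """Build the container environment dictionary.

    The optional cached_config pins one specific cached configuration;
    without it, only the model ID and token are set.
    """
    if not hf_token:
        raise ValueError("A Hugging Face token is required for gated models")
    environment = {"SM_ON_MODEL": model_id, "HF_TOKEN": hf_token}
    if cached_config is not None:
        # Environment variable values must be strings.
        environment.update(
            {
                "SM_ON_TENSOR_PARALLEL_SIZE": str(cached_config["num_cores"]),
                "SM_ON_BATCH_SIZE": str(cached_config["batch_size"]),
                "SM_ON_SEQUENCE_LENGTH": str(cached_config["sequence_length"]),
            }
        )
    return environment


example_environment = build_environment(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "hf_xxx",  # placeholder value, replace with a real token
    {"num_cores": 24, "batch_size": 1, "sequence_length": 4096},
)
print(example_environment)
```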
| **Select the right instance type** | |
| Mixtral 8x7B is a large model and requires a lot of memory. We are going to use the `inf2.48xlarge` instance type, which has 192 vCPUs and 384 GB of accelerator memory. The `inf2.48xlarge` instance comes with 12 Inferentia2 accelerators that include 24 Neuron Cores, which matches the `num_cores: 24` of the cached configurations. In our case we will use a batch size of 1 and a sequence length of 4096. | |
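The reasoning behind this choice can be sketched as a rough back-of-the-envelope check. The instance figures come from the table at the top of this tutorial; the total parameter count of about 46.7 billion is an assumption taken from Mistral AI's public description of the model, and the estimate ignores the KV cache and runtime overhead, so treat it as a lower bound rather than a sizing guarantee.

```python
# Inferentia2 instance specs copied from the table at the top of this tutorial.
INF2_INSTANCES = {
    "inf2.xlarge": {"neuron_cores": 2, "accelerator_memory_gb": 32},
    "inf2.8xlarge": {"neuron_cores": 2, "accelerator_memory_gb": 32},
    "inf2.24xlarge": {"neuron_cores": 12, "accelerator_memory_gb": 192},
    "inf2.48xlarge": {"neuron_cores": 24, "accelerator_memory_gb": 384},
}

# Assumption: roughly 46.7 billion parameters in total (all experts included).
MIXTRAL_8X7B_PARAMS = 46.7e9


def weights_size_gb(num_params, bytes_per_param=2):
    """Size of the weights in GB; bf16 uses 2 bytes per parameter."""
    return num_params * bytes_per_param / 1e9


def candidate_instances(num_cores, num_params):
    """Instances with enough Neuron Cores and enough memory for the weights."""
    needed_gb = weights_size_gb(num_params)
    return [
        name
        for name, spec in INF2_INSTANCES.items()
        if spec["neuron_cores"] >= num_cores
        and spec["accelerator_memory_gb"] > needed_gb
    ]


print(f"bf16 weights: {weights_size_gb(MIXTRAL_8X7B_PARAMS):.1f} GB")
print(candidate_instances(num_cores=24, num_params=MIXTRAL_8X7B_PARAMS))
```

Under these assumptions the weights alone would fit on a smaller instance, but the cached configurations were compiled for 24 Neuron Cores, which only `inf2.48xlarge` provides.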
| After that we can create our endpoint configuration and deploy the model to Amazon SageMaker. It will be fully compatible with the OpenAI Chat Completion API. | |
| ```python | |
| from sagemaker.core.resources import Model, ContainerDefinition | |
| # Define Model and Endpoint configuration parameter | |
| environment = { | |
| "SM_ON_MODEL": "mistralai/Mixtral-8x7B-Instruct-v0.1", | |
| "SM_ON_BATCH_SIZE": "1", # Select the configuration with batch size 1 | |
| "HF_TOKEN": "", | |
| } | |
| assert environment["HF_TOKEN"] != "", ( | |
| "Please replace '' with your Hugging Face Hub API token" | |
| ) | |
| container = ContainerDefinition(image=llm_image, environment=environment) | |
| # create Model with the container definition | |
| model = Model.create( | |
| model_name="mixtral-8x7b-neuronx-model", primary_container=container, execution_role_arn=role, region=REGION | |
| ) | |
| ``` | |
| After we have created the `Model` we need to define a deployment configuration. We will deploy the model with the `ml.inf2.48xlarge` instance type. vLLM will automatically distribute and shard the model across all Inferentia devices. | |
| ```python | |
| from sagemaker.core.resources import EndpointConfig, ProductionVariant | |
| # sagemaker config | |
| instance_type = "ml.inf2.48xlarge" | |
| health_check_timeout = 3600 # additional time to load the model | |
| volume_size = 512 # size in GB of the EBS volume | |
| # create EndpointConfig | |
| endpoint_config = EndpointConfig.create( | |
| endpoint_config_name="mixtral-8x7b-endpoint-config", | |
| production_variants=[ | |
| ProductionVariant( | |
| model_name=model.model_name, | |
| instance_type=instance_type, | |
| initial_instance_count=1, | |
| container_startup_health_check_timeout=health_check_timeout, | |
| volume_size=volume_size, | |
| environment=environment, | |
| image_uri=llm_image, | |
| ) | |
| ], | |
| ) | |
| ``` | |
| We can now deploy the `Model` to an `Endpoint`. | |
| ```python | |
| from sagemaker.core.resources import Endpoint | |
| endpoint = Endpoint.create( | |
| endpoint_name="mixtral-8x7b-neuronx-endpoint", | |
| endpoint_config_name=endpoint_config.endpoint_config_name, | |
| ) | |
| ``` | |
| SageMaker will now create our endpoint and deploy the model to it. It takes around 15 minutes for deployment. | |
| After our endpoint is deployed we can run inference on it using the `invoke` method of the endpoint. | |
| The endpoint supports the Messages API, which is fully compatible with the OpenAI Chat Completion API. The Messages API allows us to interact with the model in a conversational way. Each message has a role and a content. The role can be `system`, `assistant` or `user`. The `system` role is used to provide context to the model, the `user` role is used to ask questions or provide input to the model, and the `assistant` role carries the model's own replies. | |
| Parameters can be defined as separate attributes of the payload. Check out the chat completion [documentation](https://platform.openai.com/docs/api-reference/chat/create) to find supported parameters. | |
| ```python | |
| # Prompt to generate | |
| messages = [ | |
| {"role": "system", "content": "You are a helpful assistant."}, | |
| {"role": "user", "content": "What is deep learning in one sentence?"}, | |
| ] | |
| ``` | |
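Before sending a request, it can help to validate the messages locally. The sketch below is a minimal illustration: `build_chat_payload` and `ALLOWED_ROLES` are hypothetical names, and the role check only mirrors the three roles described above rather than the full Chat Completion schema.

```python
import json

# Roles described in this tutorial.
ALLOWED_ROLES = {"system", "assistant", "user"}


def build_chat_payload(messages, **parameters):
    """Validate the messages and serialize them, together with the
    generation parameters, into a JSON request body."""
    for message in messages:
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"Unsupported role: {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            raise ValueError("Each message needs a string 'content'")
    return json.dumps({"messages": messages, **parameters})


example_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is deep learning in one sentence?"},
]
payload = build_chat_payload(example_messages, max_tokens=50, temperature=0.7)
print(payload)
```

The resulting string has the same shape as the `body` passed to the endpoint in the next step.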
| Okay, let's test it. | |
| ```python | |
| import json | |
| # Generation arguments https://platform.openai.com/docs/api-reference/chat/create | |
| result = endpoint.invoke( | |
| body=json.dumps( | |
| { | |
| "messages": messages, | |
| "max_tokens": 50, | |
| "top_k": 50, | |
| "top_p": 0.9, | |
| "temperature": 0.7, | |
| } | |
| ), | |
| content_type="application/json", | |
| ) | |
| output = json.loads(result.body.read().decode("utf-8")) | |
| message = output["choices"][0]["message"] | |
| assert message["role"] == "assistant" | |
| print("Generated response:", message["content"]) | |
| ``` | |
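The parsing step above can be made slightly more defensive. The following sketch assumes the response follows the OpenAI Chat Completion layout; `extract_reply` is a hypothetical helper and `sample_output` is a hand-written stand-in for a real response, not output captured from the endpoint.

```python
def extract_reply(output):
    """Return (content, finish_reason) from a chat completion response."""
    choices = output.get("choices") or []
    if not choices:
        raise ValueError("Response contains no choices")
    first = choices[0]
    message = first["message"]
    if message.get("role") != "assistant":
        raise ValueError(f"Unexpected role: {message.get('role')!r}")
    return message["content"], first.get("finish_reason")


# Hand-written sample in the Chat Completion response layout.
sample_output = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Deep learning is ..."},
            "finish_reason": "length",
        }
    ]
}

content, finish_reason = extract_reply(sample_output)
print(content)
# A finish_reason of "length" means generation stopped at max_tokens.
print(finish_reason)
```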
| ## 4. Clean up | |
| To clean up, we can delete the model, the endpoint configuration, and the endpoint. | |
| ```python | |
| model.delete() | |
| endpoint_config.delete() | |
| endpoint.delete() | |
| ``` |