Sagemaker: CUDA out of memory

#66

by zkrider - opened Jul 21, 2023

zkrider

Jul 21, 2023

edited Jul 21, 2023

Has anyone had luck getting this to work with SageMaker?

I'm getting the following errors in CloudWatch, even with the instance type ml.g5.8xlarge:

Error 1:

Error: ShardCannotStart
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
    model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 134, in get_model
    return santacoder_cls(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_santacoder.py", line 62, in __init__
    self.load_weights(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_santacoder.py", line 96, in load_weights
    value = value.to(device if quantize is None else "cpu").to(dtype)

Error 2:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 22.20 GiB total capacity; 19.72 GiB already allocated; 143.12 MiB free; 21.11 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Using the following for the deployment:

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID':'bigcode/starcoder',
    'SM_NUM_GPUS': json.dumps(1),
    'HF_API_TOKEN': '<TOKEN>'
}


# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface",version="0.8.2"),
    env=hub,
    role=role, 
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.8xlarge",
    container_startup_health_check_timeout=400,
    endpoint_name="Starcoder"
)

# send request
predictor.predict({
    "inputs": "def print_hello_world():",
})
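
A rough estimate suggests why the deployment above hits Error 2. This is only a sketch: it assumes bigcode/starcoder has roughly 15.5 billion parameters loaded in 16-bit precision (2 bytes each), and it counts the weights alone, ignoring activations and the KV cache. The GPU capacity is the 22.20 GiB figure from the error message; the helper name `weights_gib` is made up for illustration.

```python
def weights_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, in GiB (no activations, no KV cache)."""
    return num_params * bytes_per_param / 2**30


STARCODER_PARAMS = 15.5e9  # assumption: approximate parameter count of bigcode/starcoder
GPU_CAPACITY_GIB = 22.20   # "total capacity" reported in Error 2

needed = weights_gib(STARCODER_PARAMS)
fits_on_one_gpu = needed <= GPU_CAPACITY_GIB
per_gpu_if_sharded = needed / 4  # share of the weights per GPU when split 4 ways

print(f"weights: {needed:.1f} GiB, fits on one GPU: {fits_on_one_gpu}")
print(f"per GPU when sharded 4 ways: {per_gpu_if_sharded:.1f} GiB")
```

Under these assumptions the weights do not fit on the single GPU of an ml.g5.8xlarge, so the load fails partway through `load_weights`, which matches the traceback in Error 1.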

zkrider

Jul 31, 2023

It worked after switching to the AWS instance type ml.g4dn.12xlarge and setting SM_NUM_GPUS to "4".
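
For anyone landing here later, the change amounts to two values in the original script. The sketch below only builds the settings and does not call AWS; `build_deploy_config` is a hypothetical helper, not part of the SageMaker SDK. As far as I know, ml.g4dn.12xlarge has four 16 GB T4 GPUs, so the model gets sharded across them instead of being loaded onto one card.

```python
import json


def build_deploy_config(num_gpus: int, instance_type: str) -> dict:
    """Hypothetical helper: collect the settings that differ from the original post."""
    return {
        "env": {
            "HF_MODEL_ID": "bigcode/starcoder",
            "SM_NUM_GPUS": json.dumps(num_gpus),  # env values must be strings, here "4"
            "HF_API_TOKEN": "<TOKEN>",
        },
        "instance_type": instance_type,
    }


config = build_deploy_config(num_gpus=4, instance_type="ml.g4dn.12xlarge")

# In the original script, config["env"] would replace the `hub` dict passed as
# env= to HuggingFaceModel, and config["instance_type"] would replace the
# instance_type argument of huggingface_model.deploy(...).
print(config)
```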

zkrider changed discussion status to closed Jul 31, 2023
