Errors when deploying to AWS Sagemaker

#60

by djokowsj90 - opened Jul 2, 2023

Jul 2, 2023

I got an error when deploying this model to AWS Sagemaker.

"No safetensors weights found for model bigcode/starcoder at revision None. Converting PyTorch weights to safetensors."

It seems SageMaker expects a single weights file, "model.pth" or "pytorch_model.bin",
but this repo has many sharded bin files like "pytorch_model-00003-of-00007.bin".
I don't think I can simply concatenate those bin files.
Has anyone encountered this issue?

FarziBuilder

Jul 6, 2023

I also faced this and don't know how to solve it.

djokowsj90

Jul 8, 2023

I got past this error.
SageMaker will actually do the conversion for you, but you need to give it more time.

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.8xlarge",
    container_startup_health_check_timeout=1200,
)

Set container_startup_health_check_timeout to a larger value (it is in seconds) and you will get past this error.

But then I hit the next error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 22.20 GiB total capacity; 19.72 GiB already allocated; 143.12 MiB free; 21.11 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I upgraded to a bigger instance type and experimented with the PYTORCH_CUDA_ALLOC_CONF setting, but the error persisted.
Let me know if you see the same error.
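
A rough back-of-the-envelope check may explain the out-of-memory error. The sketch below assumes StarCoder has about 15.5 billion parameters and that the weights are loaded in 16-bit precision; neither figure is stated in this thread, and the function names are made up for illustration. It only counts the weights, so real usage (activations, KV cache, CUDA overhead) will be higher.

```python
# Rough estimate of GPU memory needed just to hold model weights.
# Assumptions (not from this thread): ~15.5e9 parameters, 2 bytes per parameter (fp16/bf16).

GIB = 1024 ** 3  # bytes in one GiB, the unit used in the CUDA error message


def weights_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory taken by the weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


def fits_on_gpus(num_params: float, gpu_capacity_gib: float,
                 num_gpus: int = 1, bytes_per_param: int = 2) -> bool:
    """True if an even split of the weights fits within each GPU's capacity.

    This ignores activations and runtime overhead, so a True result is
    necessary but not sufficient for the model to actually load.
    """
    per_gpu = weights_gib(num_params, bytes_per_param) / num_gpus
    return per_gpu < gpu_capacity_gib


if __name__ == "__main__":
    params = 15.5e9  # assumed parameter count
    print(f"weights alone: {weights_gib(params):.1f} GiB")
    # 22.20 GiB is the total capacity reported in the error message above.
    print("fits on one 22.20 GiB GPU:", fits_on_gpus(params, 22.20, num_gpus=1))
```

Under these assumptions the weights alone exceed the 22.20 GiB reported for GPU 0, so tuning PYTORCH_CUDA_ALLOC_CONF would not be expected to help, and neither would a bigger instance that still has a single GPU of that size. The weights would have to be split across several GPUs.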

zkrider

Jul 31, 2023

It worked for me on the AWS instance type ml.g4dn.12xlarge with SM_NUM_GPUS set to "4".
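
For reference, here is a minimal sketch of how these settings might be passed with the SageMaker Python SDK. It assumes the Hugging Face LLM (text-generation-inference) container, which is not confirmed in this thread; the helper names `build_env`, `build_deploy_kwargs` and `deploy_starcoder` are hypothetical, and reusing the 1200-second timeout from earlier in the thread is also an assumption.

```python
# Sketch only: helper names are hypothetical; the container choice is an assumption.


def build_env(model_id: str, num_gpus: int) -> dict:
    """Container environment; SageMaker expects every value to be a string."""
    return {
        "HF_MODEL_ID": model_id,        # model repo on the Hugging Face Hub
        "SM_NUM_GPUS": str(num_gpus),   # number of GPUs to shard the model across
    }


def build_deploy_kwargs(instance_type: str, timeout_s: int = 1200) -> dict:
    """Keyword arguments for HuggingFaceModel.deploy()."""
    return {
        "initial_instance_count": 1,
        "instance_type": instance_type,
        # Extra startup time so the weight conversion can finish (seconds).
        "container_startup_health_check_timeout": timeout_s,
    }


def deploy_starcoder(role_arn: str):
    """Create the endpoint. Needs the sagemaker package and AWS credentials."""
    # Imported here so the helpers above work without the SDK installed.
    from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

    model = HuggingFaceModel(
        role=role_arn,
        image_uri=get_huggingface_llm_image_uri("huggingface"),
        env=build_env("bigcode/starcoder", num_gpus=4),
    )
    return model.deploy(**build_deploy_kwargs("ml.g4dn.12xlarge"))


if __name__ == "__main__":
    # Print the settings without touching AWS.
    print(build_env("bigcode/starcoder", num_gpus=4))
    print(build_deploy_kwargs("ml.g4dn.12xlarge"))
```

Note that SM_NUM_GPUS goes in as the string "4", not the integer 4. Calling `deploy_starcoder` creates a billable endpoint, so remember to delete it when you are done.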

djokowsj90

Aug 1, 2023

Yes, I got it working with these configs. Thank you so much~

djokowsj90 changed discussion status to closed Aug 1, 2023
