Issue deploying on Runpod

#1
by reedbender - opened

Here is the error I'm receiving when attempting to deploy the One Click Template to Runpod. I have set my HUGGING_FACE_HUB_TOKEN as an environment variable for the deployment. Any help would be appreciated, thanks!

2024-01-02T18:53:27Z create pod network
2024-01-02T18:53:27Z create 500GB volume
2024-01-02T18:53:27Z create container ghcr.io/huggingface/text-generation-inference:latest
2024-01-02T18:53:27Z latest Pulling from huggingface/text-generation-inference
2024-01-02T18:53:27Z Digest: sha256:b68d9f4bd3a4e3978d23ea188bded199dca8fd72ad377b156c8398e92644ae2e
2024-01-02T18:53:27Z Status: Image is up to date for ghcr.io/huggingface/text-generation-inference:latest
2024-01-02T18:53:27Z start container
2024-01-02T18:53:28Z error starting container: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.1, please update your driver to a newer version, or use an earlier cuda container: unknown
[The same "start container" / "error starting container" pair, with the identical nvidia-container-cli requirement error, repeats 9 more times at roughly 16-second intervals, through 2024-01-02T18:55:57Z.]

I'm attempting to run on 2 A100s

Update: this issue goes away when running on a newer-generation H100. Is this a problem with Runpod's hosted CUDA drivers on A100s, or could this template also be made to work with older GPU servers?

Thanks for reporting. Previously I had only tested on an A6000 and assumed A100 would work. I've created an issue here using the Mixtral (non-function calling) template.

Answer via GitHub issue: You probably selected a host with old drivers. Try an A100 host in Secure Cloud in the KS-1 or KS-2 data centers.
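
For anyone who wants to confirm the "old drivers" diagnosis on a given host, here is a minimal Python sketch. It assumes you can run `nvidia-smi` on that host, for example from a pod using an older CUDA image, and that its header contains a `CUDA Version: X.Y` field. The function names are hypothetical helpers for illustration, not part of Runpod or text-generation-inference.

```python
import re
import subprocess


def required_cuda(error_text):
    """Pull the minimum CUDA version out of an nvidia-container-cli
    'unsatisfied condition: cuda>=X.Y' error line. Returns (major, minor) or None."""
    m = re.search(r"unsatisfied condition: cuda>=(\d+)\.(\d+)", error_text)
    return (int(m.group(1)), int(m.group(2))) if m else None


def driver_cuda(smi_text):
    """Pull the highest CUDA version the driver supports out of nvidia-smi's
    header (assumes a 'CUDA Version: X.Y' field). Returns (major, minor) or None."""
    m = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_text)
    return (int(m.group(1)), int(m.group(2))) if m else None


def host_is_new_enough(smi_text, error_text):
    """True if the driver meets the image's requirement, False if not,
    None if either version could not be parsed."""
    need = required_cuda(error_text)
    have = driver_cuda(smi_text)
    if need is None or have is None:
        return None
    return have >= need  # tuple comparison: (12, 0) >= (12, 1) is False


def read_nvidia_smi():
    """Run nvidia-smi on the current host and return its text output.
    Raises FileNotFoundError if nvidia-smi is not installed."""
    return subprocess.run(
        ["nvidia-smi"], capture_output=True, text=True, check=True
    ).stdout
```

Calling `host_is_new_enough(read_nvidia_smi(), error_line)` with the error line from the log above returns `False` on a host whose driver reports a CUDA version below 12.1. In that case, picking a different host (as suggested) or an image built against an earlier CUDA version are the two options the error message itself names.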

RonanMcGovern changed discussion status to closed

Ok will do, thank you!! Your YouTube videos and these gated models, along with the ADVANCED Inference repo, have been an absolute godsend! Thank you for making these resources available and 10x-ing the rate at which I can build right now! 🙏🏼
