Cannot run vLLM on DGX Spark: ImportError: libcudart.so.12

#18
by yyg201708 - opened
vllm serve /home/sie/.cache/huggingface/hub/models--zai-org--GLM-4.7-Flash/snapshots/279ecdf8ee35f17f1939f95d6b113d8b806a7b2b \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash \
  --gpu-memory-utilization 0.8 \
  --served-model-name GB10 \
  --host 0.0.0.0 \
  --port 80 \
  --max-model-len 162144
Traceback (most recent call last):
  File "/home/sie/Downloads/models/glm_47_flash/bin/vllm", line 4, in <module>
    from vllm.entrypoints.cli.main import main
  File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/entrypoints/cli/__init__.py", line 3, in <module>
    from vllm.entrypoints.cli.benchmark.latency import BenchmarkLatencySubcommand
  File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/entrypoints/cli/benchmark/latency.py", line 5, in <module>
    from vllm.benchmarks.latency import add_cli_args, main
  File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/benchmarks/latency.py", line 16, in <module>
    from vllm.engine.arg_utils import EngineArgs
  File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 35, in <module>
    from vllm.config import (
  File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/config/__init__.py", line 5, in <module>
    from vllm.config.cache import CacheConfig
  File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/config/cache.py", line 14, in <module>
    from vllm.utils.mem_utils import format_gib, get_cpu_memory
  File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/utils/mem_utils.py", line 14, in <module>
    from vllm.platforms import current_platform
  File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/platforms/__init__.py", line 257, in __getattr__
    _current_platform = resolve_obj_by_qualname(platform_cls_qualname)()
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/utils/import_utils.py", line 111, in resolve_obj_by_qualname
    module = importlib.import_module(module_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/platforms/cuda.py", line 16, in <module>
    import vllm._C  # noqa
    ^^^^^^^^^^^^^^
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
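For anyone hitting this: the error means the dynamic linker cannot find the CUDA 12 runtime library that vLLM's compiled extension was built against. A quick way to check what your environment can actually load (a diagnostic sketch, not part of the original report):

```python
import ctypes

def can_load(libname: str) -> bool:
    """Return True if the dynamic linker can open the shared library."""
    try:
        ctypes.CDLL(libname)
        return True
    except OSError:
        return False

# vllm._C here was built against the CUDA 12 runtime specifically,
# so libcudart.so.12 must be resolvable even on a CUDA 13 driver.
print("libcudart.so.12 loadable:", can_load("libcudart.so.12"))
print("libcudart.so.13 loadable:", can_load("libcudart.so.13"))
```

If `libcudart.so.12` comes back unloadable while the driver reports CUDA 13 (as in the `nvidia-smi` output below), the fixes later in this thread apply.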
(glm_47_flash) sie@edgexpert-0882:~/Downloads/models$ nvidia-smi
Tue Jan 20 14:10:01 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0 Off |                  N/A |
| N/A   39C    P8              3W /  N/A  | Not Supported          |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2765      G   /usr/lib/xorg/Xorg                       18MiB |
|    0   N/A  N/A            3003      G   /usr/bin/gnome-shell                      6MiB |
+-----------------------------------------------------------------------------------------+
uv pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
Using Python 3.12.3 environment at: glm_47_flash
Resolved 152 packages in 1m 42s
Prepared 152 packages in 6m 52s
Installed 152 packages in 98ms
 + aiohappyeyeballs==2.6.1
 + aiohttp==3.13.3
 + aiosignal==1.4.0
 + annotated-doc==0.0.4
 + annotated-types==0.7.0
 + anthropic==0.76.0
 + anyio==4.12.1
 + apache-tvm-ffi==0.1.8.post2
 + astor==0.8.1
 + attrs==25.4.0
 + blake3==1.0.8
 + cachetools==6.2.4
 + cbor2==5.8.0
 + certifi==2026.1.4
 + cffi==2.0.0
 + charset-normalizer==3.4.4
 + click==8.3.1
 + cloudpickle==3.1.2
 + compressed-tensors==0.13.0
 + cryptography==46.0.3
 + cuda-bindings==13.1.1
 + cuda-pathfinder==1.3.3
 + cuda-python==13.1.1
 + cupy-cuda12x==13.6.0
 + depyf==0.20.0
 + dill==0.4.1
 + diskcache==5.6.3
 + distro==1.9.0
 + dnspython==2.8.0
 + docstring-parser==0.17.0
 + einops==0.8.1
 + email-validator==2.3.0
 + fastapi==0.128.0
 + fastapi-cli==0.0.20
 + fastapi-cloud-cli==0.11.0
 + fastar==0.8.0
 + fastrlock==0.8.3
 + filelock==3.20.3
 + flashinfer-python==0.5.3
 + frozenlist==1.8.0
 + fsspec==2026.1.0
 + gguf==0.17.1
 + grpcio==1.78.0rc2
 + grpcio-reflection==1.78.0rc2
 + h11==0.16.0
 + hf-xet==1.2.1rc0
 + httpcore==1.0.9
 + httptools==0.7.1
 + httpx==0.28.1
 + httpx-sse==0.4.3
 + huggingface-hub==0.36.0
 + idna==3.11
 + ijson==3.4.0.post0
 + importlib-metadata==8.7.1
 + interegular==0.3.3
 + jinja2==3.1.6
 + jiter==0.12.0
 + jmespath==1.0.1
 + jsonschema==4.26.0
 + jsonschema-specifications==2025.9.1
 + lark==1.2.2
 + llguidance==1.3.0
 + llvmlite==0.44.0
 + lm-format-enforcer==0.11.3
 + loguru==0.7.3
 + markdown-it-py==4.0.0
 + markupsafe==3.0.3
 + mcp==1.25.0
 + mdurl==0.1.2
 + mistral-common==1.8.8
 + model-hosting-container-standards==0.1.13
 + mpmath==1.3.0
 + msgpack==1.1.2
 + msgspec==0.20.0
 + multidict==6.7.0
 + networkx==3.6.1
 + ninja==1.13.0
 + numba==0.61.2
 + numpy==2.2.6
 + nvidia-cudnn-frontend==1.17.0
 + nvidia-cutlass-dsl==4.3.5
 + nvidia-ml-py==13.590.44
 + openai==2.15.0
 + openai-harmony==0.0.8
 + opencv-python-headless==4.13.0.90
 + opentelemetry-api==1.39.1
 + opentelemetry-sdk==1.39.1
 + opentelemetry-semantic-conventions==0.60b1
 + outlines-core==0.2.11
 + packaging==26.0rc3
 + partial-json-parser==0.2.1.1.post7
 + pillow==12.1.0
 + prometheus-client==0.24.1
 + prometheus-fastapi-instrumentator==7.1.0
 + propcache==0.4.1
 + protobuf==6.33.4
 + psutil==7.2.1
 + py-cpuinfo==9.0.0
 + pybase64==1.4.3
 + pycountry==24.6.1
 + pycparser==2.23
 + pydantic==2.12.5
 + pydantic-core==2.41.5
 + pydantic-extra-types==2.11.0
 + pydantic-settings==2.12.0
 + pygments==2.19.2
 + pyjwt==2.10.1
 + python-dotenv==1.2.1
 + python-json-logger==4.0.0
 + python-multipart==0.0.21
 + pyyaml==6.0.3
 + pyzmq==27.1.0
 + ray==2.53.0
 + referencing==0.37.0
 + regex==2026.1.15
 + requests==2.32.5
 + rich==14.2.0
 + rich-toolkit==0.17.1
 + rignore==0.7.6
 + rpds-py==0.30.0
 + safetensors==0.7.0
 + sentencepiece==0.2.1
 + sentry-sdk==3.0.0a7
 + setproctitle==1.3.7
 + setuptools==80.9.0
 + shellingham==1.5.4
 + six==1.17.0
 + sniffio==1.3.1
 + sse-starlette==3.2.0
 + starlette==0.50.0
 + supervisor==4.3.0
 + sympy==1.14.0
 + tabulate==0.9.0
 + tiktoken==0.12.0
 + tokenizers==0.22.2
 + torch==2.9.1
 + torchaudio==2.9.1
 + torchvision==0.24.1
 + tqdm==4.67.1
 + transformers==4.57.6
 + typer==0.21.1
 + typing-extensions==4.15.0
 + typing-inspection==0.4.2
 + urllib3==2.6.3
 + uvicorn==0.40.0
 + uvloop==0.22.1
 + vllm==0.14.0rc2.dev163+g4753f3bf6
 + watchfiles==1.1.1
 + websockets==16.0
 + xgrammar==0.1.29
 + yarl==1.22.0
 + zipp==3.23.0

(glm_47_flash) sie@edgexpert-0882:~/Downloads/models$ uv pip install git+https://github.com/huggingface/transformers.git
Using Python 3.12.3 environment at: glm_47_flash
    Updated https://github.com/huggingface/transformers.git (314f10929a2215b74c2ad6ecf7b2f380c9b7468a)
Resolved 25 packages in 3m 35s
      Built transformers @ git+https://github.com/huggingface/transformers.git@314f10929a2215b74c2ad6ecf7b2f380c9b7468a
Prepared 3 packages in 1.37s
Uninstalled 2 packages in 28ms
Installed 3 packages in 33ms
 - huggingface-hub==0.36.0
 + huggingface-hub==1.3.2
 - transformers==4.57.6
 + transformers==5.0.0.dev0 (from git+https://github.com/huggingface/transformers.git@314f10929a2215b74c2ad6ecf7b2f380c9b7468a)
 + typer-slim==0.21.1

Give ik_llama.cpp a try. Some discussion and links to GGUF quants here, including a benchmark on an RTX A6000: https://github.com/ikawrakow/ik_llama.cpp/issues/1167#issuecomment-3775037120

I have the same issue

try this:

docker run \
  --privileged \
  --gpus all \
  -it --rm \
  --network host --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm-node \
  bash -c "pip install git+https://github.com/huggingface/transformers.git && \
    curl -sSL https://raw.githubusercontent.com/eugr/spark-vllm-docker/main/mods/fix-Salyut1-GLM-4.7-NVFP4/glm4_moe.patch | patch -p1 -d / && \
    vllm serve \
      zai-org/GLM-4.7-Flash \
      --port 8000 --host 0.0.0.0 \
      --gpu-memory-utilization 0.9 \
      --max-model-len 32768 \
      --trust-remote-code"

https://github.com/eugr/spark-vllm-docker

Hello! I noticed your CUDA driver is on version 13, but vLLM is trying to load libcudart from CUDA 12. The driver is backwards compatible, but the CUDA 12 runtime library (libcudart.so.12) still needs to be present. You can keep your main CUDA driver as-is and just pull the older runtime into your environment.

Conda

conda install nvidia/label/cuda-12.4.0::cuda-cudart

Or just Torch

pip install torch --index-url https://download.pytorch.org/whl/cu124
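Another option, if you'd rather not reinstall torch: the CUDA 12 runtime also ships as a standalone wheel, so you can install just libcudart.so.12 into the venv and point the linker at it (a sketch; the path assumes the standard layout of the nvidia-cuda-runtime-cu12 wheel):

```shell
# Install only the CUDA 12 runtime library into the active venv
pip install nvidia-cuda-runtime-cu12

# Locate the wheel's lib directory and prepend it to LD_LIBRARY_PATH
CUDART_DIR="$(python -c 'import os, nvidia.cuda_runtime as m; print(os.path.join(os.path.dirname(m.__file__), "lib"))')"
export LD_LIBRARY_PATH="$CUDART_DIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

You'd need to export LD_LIBRARY_PATH in every shell that runs `vllm serve` (or add it to the venv's activate script).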

Hope that helps! :)

Also, if you'd like to read more about CUDA driver compatibility, here are the NVIDIA docs:
CUDA compatibility
