Cannot run vLLM on DGX Spark: ImportError: libcudart.so.12
vllm serve /home/sie/.cache/huggingface/hub/models--zai-org--GLM-4.7-Flash/snapshots/279ecdf8ee35f17f1939f95d6b113d8b806a7b2b \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.7-flash \
--gpu-memory-utilization 0.8 \
--served-model-name GB10 \
--host 0.0.0.0 \
--port 80 \
--max-model-len 162144
Traceback (most recent call last):
File "/home/sie/Downloads/models/glm_47_flash/bin/vllm", line 4, in <module>
from vllm.entrypoints.cli.main import main
File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/entrypoints/cli/__init__.py", line 3, in <module>
from vllm.entrypoints.cli.benchmark.latency import BenchmarkLatencySubcommand
File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/entrypoints/cli/benchmark/latency.py", line 5, in <module>
from vllm.benchmarks.latency import add_cli_args, main
File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/benchmarks/latency.py", line 16, in <module>
from vllm.engine.arg_utils import EngineArgs
File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 35, in <module>
from vllm.config import (
File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/config/__init__.py", line 5, in <module>
from vllm.config.cache import CacheConfig
File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/config/cache.py", line 14, in <module>
from vllm.utils.mem_utils import format_gib, get_cpu_memory
File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/utils/mem_utils.py", line 14, in <module>
from vllm.platforms import current_platform
File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/platforms/__init__.py", line 257, in __getattr__
_current_platform = resolve_obj_by_qualname(platform_cls_qualname)()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/utils/import_utils.py", line 111, in resolve_obj_by_qualname
module = importlib.import_module(module_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/importlib/__init__.py", line 90, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/platforms/cuda.py", line 16, in <module>
import vllm._C # noqa
^^^^^^^^^^^^^^
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
(glm_47_flash) sie@edgexpert-0882:~/Downloads/models$ nvidia-smi
Tue Jan 20 14:10:01 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GB10 On | 0000000F:01:00.0 Off | N/A |
| N/A 39C P8 3W / N/A | Not Supported | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2765 G /usr/lib/xorg/Xorg 18MiB |
| 0 N/A N/A 3003 G /usr/bin/gnome-shell 6MiB |
+-----------------------------------------------------------------------------------------+
uv pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
Using Python 3.12.3 environment at: glm_47_flash
Resolved 152 packages in 1m 42s
Prepared 152 packages in 6m 52s
Installed 152 packages in 98ms
+ aiohappyeyeballs==2.6.1
+ aiohttp==3.13.3
+ aiosignal==1.4.0
+ annotated-doc==0.0.4
+ annotated-types==0.7.0
+ anthropic==0.76.0
+ anyio==4.12.1
+ apache-tvm-ffi==0.1.8.post2
+ astor==0.8.1
+ attrs==25.4.0
+ blake3==1.0.8
+ cachetools==6.2.4
+ cbor2==5.8.0
+ certifi==2026.1.4
+ cffi==2.0.0
+ charset-normalizer==3.4.4
+ click==8.3.1
+ cloudpickle==3.1.2
+ compressed-tensors==0.13.0
+ cryptography==46.0.3
+ cuda-bindings==13.1.1
+ cuda-pathfinder==1.3.3
+ cuda-python==13.1.1
+ cupy-cuda12x==13.6.0
+ depyf==0.20.0
+ dill==0.4.1
+ diskcache==5.6.3
+ distro==1.9.0
+ dnspython==2.8.0
+ docstring-parser==0.17.0
+ einops==0.8.1
+ email-validator==2.3.0
+ fastapi==0.128.0
+ fastapi-cli==0.0.20
+ fastapi-cloud-cli==0.11.0
+ fastar==0.8.0
+ fastrlock==0.8.3
+ filelock==3.20.3
+ flashinfer-python==0.5.3
+ frozenlist==1.8.0
+ fsspec==2026.1.0
+ gguf==0.17.1
+ grpcio==1.78.0rc2
+ grpcio-reflection==1.78.0rc2
+ h11==0.16.0
+ hf-xet==1.2.1rc0
+ httpcore==1.0.9
+ httptools==0.7.1
+ httpx==0.28.1
+ httpx-sse==0.4.3
+ huggingface-hub==0.36.0
+ idna==3.11
+ ijson==3.4.0.post0
+ importlib-metadata==8.7.1
+ interegular==0.3.3
+ jinja2==3.1.6
+ jiter==0.12.0
+ jmespath==1.0.1
+ jsonschema==4.26.0
+ jsonschema-specifications==2025.9.1
+ lark==1.2.2
+ llguidance==1.3.0
+ llvmlite==0.44.0
+ lm-format-enforcer==0.11.3
+ loguru==0.7.3
+ markdown-it-py==4.0.0
+ markupsafe==3.0.3
+ mcp==1.25.0
+ mdurl==0.1.2
+ mistral-common==1.8.8
+ model-hosting-container-standards==0.1.13
+ mpmath==1.3.0
+ msgpack==1.1.2
+ msgspec==0.20.0
+ multidict==6.7.0
+ networkx==3.6.1
+ ninja==1.13.0
+ numba==0.61.2
+ numpy==2.2.6
+ nvidia-cudnn-frontend==1.17.0
+ nvidia-cutlass-dsl==4.3.5
+ nvidia-ml-py==13.590.44
+ openai==2.15.0
+ openai-harmony==0.0.8
+ opencv-python-headless==4.13.0.90
+ opentelemetry-api==1.39.1
+ opentelemetry-sdk==1.39.1
+ opentelemetry-semantic-conventions==0.60b1
+ outlines-core==0.2.11
+ packaging==26.0rc3
+ partial-json-parser==0.2.1.1.post7
+ pillow==12.1.0
+ prometheus-client==0.24.1
+ prometheus-fastapi-instrumentator==7.1.0
+ propcache==0.4.1
+ protobuf==6.33.4
+ psutil==7.2.1
+ py-cpuinfo==9.0.0
+ pybase64==1.4.3
+ pycountry==24.6.1
+ pycparser==2.23
+ pydantic==2.12.5
+ pydantic-core==2.41.5
+ pydantic-extra-types==2.11.0
+ pydantic-settings==2.12.0
+ pygments==2.19.2
+ pyjwt==2.10.1
+ python-dotenv==1.2.1
+ python-json-logger==4.0.0
+ python-multipart==0.0.21
+ pyyaml==6.0.3
+ pyzmq==27.1.0
+ ray==2.53.0
+ referencing==0.37.0
+ regex==2026.1.15
+ requests==2.32.5
+ rich==14.2.0
+ rich-toolkit==0.17.1
+ rignore==0.7.6
+ rpds-py==0.30.0
+ safetensors==0.7.0
+ sentencepiece==0.2.1
+ sentry-sdk==3.0.0a7
+ setproctitle==1.3.7
+ setuptools==80.9.0
+ shellingham==1.5.4
+ six==1.17.0
+ sniffio==1.3.1
+ sse-starlette==3.2.0
+ starlette==0.50.0
+ supervisor==4.3.0
+ sympy==1.14.0
+ tabulate==0.9.0
+ tiktoken==0.12.0
+ tokenizers==0.22.2
+ torch==2.9.1
+ torchaudio==2.9.1
+ torchvision==0.24.1
+ tqdm==4.67.1
+ transformers==4.57.6
+ typer==0.21.1
+ typing-extensions==4.15.0
+ typing-inspection==0.4.2
+ urllib3==2.6.3
+ uvicorn==0.40.0
+ uvloop==0.22.1
+ vllm==0.14.0rc2.dev163+g4753f3bf6
+ watchfiles==1.1.1
+ websockets==16.0
+ xgrammar==0.1.29
+ yarl==1.22.0
+ zipp==3.23.0
(glm_47_flash) sie@edgexpert-0882:~/Downloads/models$ uv pip install git+https://github.com/huggingface/transformers.git
Using Python 3.12.3 environment at: glm_47_flash
Updated https://github.com/huggingface/transformers.git (314f10929a2215b74c2ad6ecf7b2f380c9b7468a)
Resolved 25 packages in 3m 35s
Built transformers @ git+https://github.com/huggingface/transformers.git@314f10929a2215b74c2ad6ecf7b2f380c9b7468a
Prepared 3 packages in 1.37s
Uninstalled 2 packages in 28ms
Installed 3 packages in 33ms
- huggingface-hub==0.36.0
+ huggingface-hub==1.3.2
- transformers==4.57.6
+ transformers==5.0.0.dev0 (from git+https://github.com/huggingface/transformers.git@314f10929a2215b74c2ad6ecf7b2f380c9b7468a)
+ typer-slim==0.21.1
Give ik_llama.cpp a try. Some discussion and links to GGUF quants here, including a benchmark on an RTX A6000: https://github.com/ikawrakow/ik_llama.cpp/issues/1167#issuecomment-3775037120
I have the same issue
try this:
docker run \
  --privileged \
  --gpus all \
  -it --rm \
  --network host --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm-node \
  bash -c "pip install git+https://github.com/huggingface/transformers.git && \
    curl -sSL https://raw.githubusercontent.com/eugr/spark-vllm-docker/main/mods/fix-Salyut1-GLM-4.7-NVFP4/glm4_moe.patch | patch -p1 -d / && \
    vllm serve \
      zai-org/GLM-4.7-Flash \
      --port 8000 --host 0.0.0.0 \
      --gpu-memory-utilization 0.9 \
      --max-model-len 32768 \
      --trust-remote-code"
Hello! I noticed your CUDA driver is on version 13, but vLLM is trying to load libcudart from CUDA 12 (libcudart.so.12). CUDA drivers are backwards compatible with older runtimes, but the version 12 runtime library still needs to be present. You can keep your driver as-is and just pull the older runtime into your environment.
Conda
conda install nvidia/label/cuda-12.4.0::cuda-cudart
Or just via the CUDA 12 build of Torch
pip install torch --index-url https://download.pytorch.org/whl/cu124
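Either way, you can sanity-check the result by asking the dynamic loader to load the runtime directly (a minimal sketch; the `loadable` helper is mine):

```python
import ctypes

def loadable(soname: str) -> bool:
    """Return True if the dynamic loader can find and load the library."""
    try:
        ctypes.CDLL(soname)
        return True
    except OSError:
        return False

# vLLM's compiled extension needs exactly this soname on the loader path.
print("libcudart.so.12:", loadable("libcudart.so.12"))
```

If this prints `False` after installing, the library landed somewhere the loader does not search, and you may need to add its directory to LD_LIBRARY_PATH.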
Hope that helps! :)
Also, if you'd like to read more about CUDA driver compatibility, here are the NVIDIA docs:
CUDA Compatibility