vLLM 0.17.0: Pre-built Wheel for Jetson Orin (SM 8.7)

Pre-built vLLM wheel with Marlin GPTQ tensor core kernels compiled for SM 8.7 (Jetson AGX Orin, Orin Nano, Orin NX).

Official vLLM wheels ship Marlin kernels for SM 8.0, 8.6, 8.9, and 9.0, but not SM 8.7. Without SM 8.7 support, `--quantization gptq_marlin` falls back to generic CUDA core kernels, leaving Orin's tensor cores idle. This wheel fixes that with a single pip install.

Performance

Tested on Jetson AGX Orin 64GB with Qwen3.5-35B-A3B-GPTQ-Int4 (35B parameters, 3.5B active via MoE).

Prefill Throughput

| Engine | Prefill (tok/s) | Speedup |
|---|---|---|
| llama.cpp Q4_K_M | 523 | 1.0x |
| vLLM GPTQ (no Marlin) | 241 | 0.5x |
| vLLM GPTQ Marlin | 2,001 | 3.8x |

Decode Throughput (streaming)

| Context Length | vLLM Marlin | llama.cpp | Speedup |
|---|---|---|---|
| ~38 tokens | 31.4 tok/s | 22.5 tok/s | +40% |
| ~4,000 tokens | 30.9 tok/s | 22.5 tok/s | +37% |
| ~20,000 tokens | 29.2 tok/s | 22.5 tok/s | +30% |

End-to-End (20k context + 200 output tokens)

| Engine | Total Time | Speedup |
|---|---|---|
| llama.cpp | 47s | 1.0x |
| vLLM Marlin | 17s | 2.8x |
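The speedup columns follow directly from the raw numbers in the tables above; a quick arithmetic check:

```python
# Prefill throughput (tok/s) from the benchmark table above.
prefill = {
    "llama.cpp Q4_K_M": 523,
    "vLLM GPTQ (no Marlin)": 241,
    "vLLM GPTQ Marlin": 2001,
}
base = prefill["llama.cpp Q4_K_M"]
for engine, toks in prefill.items():
    print(f"{engine}: {toks / base:.1f}x")  # 1.0x, 0.5x, 3.8x

# End-to-end wall time: 47 s -> 17 s
print(f"end-to-end: {47 / 17:.1f}x")  # 2.8x
```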

Install

```bash
# Create venv and install PyTorch for Jetson
python3 -m venv ~/vllm-venv && source ~/vllm-venv/bin/activate
pip install torch --index-url https://pypi.jetson-ai-lab.io/jp6/cu126

# Install this wheel
pip install https://huggingface.co/thehighnotes/vllm-jetson-orin/resolve/main/vllm-0.17.0+cu126-cp310-cp310-linux_aarch64.whl

# Install Triton
pip install triton
```
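The wheel's filename encodes its compatibility tags, so you can sanity-check them against your interpreter before installing. A small stdlib sketch (the filename is copied from the install command above; pip performs the same check for real):

```python
import platform
import sys

# Filename from the install step above; fields are name-version-python-abi-platform.
wheel = "vllm-0.17.0+cu126-cp310-cp310-linux_aarch64.whl"
name, version, py_tag, abi_tag, plat_tag = wheel[: -len(".whl")].split("-")

expected_py = f"cp{sys.version_info.major}{sys.version_info.minor}"
expected_plat = f"{platform.system().lower()}_{platform.machine()}"

print("python ok:", py_tag == expected_py)        # True only on CPython 3.10
print("platform ok:", plat_tag == expected_plat)  # True only on linux_aarch64
```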

Launch

```bash
# Set required environment variables
export LD_LIBRARY_PATH="$(python -c 'import nvidia.cu12; print(nvidia.cu12.__path__[0])')/lib:${LD_LIBRARY_PATH:-}"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/local/cuda/targets/aarch64-linux/lib:${LD_LIBRARY_PATH}"

# Start serving (any GPTQ-Int4 model)
vllm serve Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 \
    --host 0.0.0.0 --port 8000 \
    --quantization gptq_marlin \
    --dtype half \
    --gpu-memory-utilization 0.8 \
    --max-model-len 4096 \
    --max-num-seqs 1
```
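`--gpu-memory-utilization 0.8` caps how much GPU-visible memory vLLM claims for weights plus KV cache. On the AGX Orin the 64 GB is unified memory shared with the CPU, so leaving headroom matters; the budget works out to:

```python
# 64 GB AGX Orin unified memory, 0.8 utilization as in the launch command above.
total_gb = 64
budget = total_gb * 0.8
print(f"{budget:.1f} GB available to vLLM (weights + KV cache)")  # 51.2 GB
```

Lower the fraction if the rest of the system needs more of the shared pool.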

The server exposes an OpenAI-compatible API at http://localhost:8000/v1/.
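For example, a minimal chat-completion request using only the stdlib (the model name matches the launch command above; the try/except is there so the sketch degrades gracefully when the server isn't running):

```python
import json
import urllib.request

# Request body for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
payload = {
    "model": "Qwen/Qwen3.5-35B-A3B-GPTQ-Int4",
    "messages": [{"role": "user", "content": "Hello from Jetson!"}],
    "max_tokens": 64,
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=5) as resp:
        body = json.loads(resp.read())
        print(body["choices"][0]["message"]["content"])
except OSError as exc:
    print(f"server not reachable: {exc}")
```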

Requirements

| Component | Version |
|---|---|
| Device | Jetson AGX Orin, Orin Nano, or Orin NX (SM 8.7) |
| JetPack | 6.x (tested on 6.2.1) |
| CUDA | 12.6 |
| Python | 3.10 |
| PyTorch | 2.10.0 from Jetson AI Lab |
| NumPy | 1.x (required by Jetson PyTorch) |

What's Included

  • vLLM 0.17.0 with all compiled .so extensions targeting SM 8.7
  • Marlin GPTQ kernels: fused INT4 dequantization + FP16 tensor core matrix multiply
  • Scheduler fast-path patch: reduces per-token Python overhead for single-user decode
  • Saves a 75-minute build-from-source process

Resources

Part of the AIquest Research Lab

This wheel powers the LLM inference backend for the Multi-Machine Dev Hub, a suite of CLI tools connecting Jetson devices into a unified development environment. It also provides the inference layer for Prometheus Mind, a memory architecture for frozen language models.

Explore more at aiquest.info/research.

License

MIT
