# vLLM 0.17.0 - Pre-built Wheel for Jetson Orin (SM 8.7)
Pre-built vLLM wheel with Marlin GPTQ tensor core kernels compiled for SM 8.7 (Jetson AGX Orin, Orin Nano, Orin NX).
Official vLLM wheels ship Marlin kernels for SM 8.0, 8.6, 8.9, and 9.0, but not SM 8.7. Without SM 8.7 support, `--quantization gptq_marlin` falls back to generic CUDA core kernels, leaving Orin's tensor cores idle. This wheel fixes that with a single pip install.
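The gap can be illustrated with a small sketch. The capability sets below mirror the claim in this paragraph; `marlin_available` is a hypothetical helper for illustration, not an actual vLLM API:

```python
# Illustrative only: these SM (compute capability) sets restate this README's
# claim about kernel coverage; they are not queried from vLLM itself.
OFFICIAL_MARLIN_SMS = {(8, 0), (8, 6), (8, 9), (9, 0)}   # upstream wheels
THIS_WHEEL_MARLIN_SMS = OFFICIAL_MARLIN_SMS | {(8, 7)}   # adds Jetson Orin

def marlin_available(capability, supported):
    """True if the GPTQ Marlin tensor-core path exists for this GPU."""
    return capability in supported

orin = (8, 7)  # Jetson Orin devices report compute capability 8.7
print(marlin_available(orin, OFFICIAL_MARLIN_SMS))    # False -> CUDA-core fallback
print(marlin_available(orin, THIS_WHEEL_MARLIN_SMS))  # True  -> tensor-core path
```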
## Performance
Tested on Jetson AGX Orin 64GB with Qwen3.5-35B-A3B-GPTQ-Int4 (35B parameters, 3.5B active via MoE).
### Prefill Throughput
| Engine | Prefill (tok/s) | Speedup |
|---|---|---|
| llama.cpp Q4_K_M | 523 | 1.0x |
| vLLM GPTQ (no Marlin) | 241 | 0.5x |
| vLLM GPTQ Marlin | 2,001 | 3.8x |
### Decode Throughput (streaming)
| Context Length | vLLM Marlin | llama.cpp | Speedup |
|---|---|---|---|
| ~38 tokens | 31.4 tok/s | 22.5 tok/s | +40% |
| ~4,000 tokens | 30.9 tok/s | 22.5 tok/s | +37% |
| ~20,000 tokens | 29.2 tok/s | 22.5 tok/s | +30% |
### End-to-End (20k context + 200 output tokens)
| Engine | Total Time | Speedup |
|---|---|---|
| llama.cpp | 47s | 1.0x |
| vLLM Marlin | 17s | 2.8x |
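The end-to-end row is consistent with the throughput tables above; a quick back-of-envelope check:

```python
# Estimate total time as prefill time + decode time, using the measured rates
# from the tables above (20,000-token context, 200 output tokens).
context, output = 20_000, 200

vllm_total = context / 2001 + output / 29.2    # Marlin prefill + decode @ 20k ctx
llama_total = context / 523 + output / 22.5    # llama.cpp prefill + decode

print(round(vllm_total))   # 17 (seconds, matching the table)
print(round(llama_total))  # 47 (seconds, matching the table)
```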
## Install

```bash
# Create venv and install PyTorch for Jetson
python3 -m venv ~/vllm-venv && source ~/vllm-venv/bin/activate
pip install torch --index-url https://pypi.jetson-ai-lab.io/jp6/cu126

# Install this wheel
pip install https://huggingface.co/thehighnotes/vllm-jetson-orin/resolve/main/vllm-0.17.0+cu126-cp310-cp310-linux_aarch64.whl

# Install Triton
pip install triton
```
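Before installing, it is worth confirming your environment matches the wheel's `cp310-cp310-linux_aarch64` filename tags (CPython 3.10 on 64-bit ARM Linux); a quick stdlib check:

```python
import platform
import sys

# Print what pip sees when matching this wheel's platform tags.
tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
print(tag, platform.system().lower(), platform.machine())
# On a JetPack 6 Orin this should read: cp310 linux aarch64
```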
## Launch

```bash
# Set required environment variables
export LD_LIBRARY_PATH="$(python -c 'import nvidia.cu12; print(nvidia.cu12.__path__[0])')/lib:${LD_LIBRARY_PATH:-}"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/local/cuda/targets/aarch64-linux/lib:${LD_LIBRARY_PATH}"

# Start serving (any GPTQ-Int4 model)
vllm serve Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 \
  --host 0.0.0.0 --port 8000 \
  --quantization gptq_marlin \
  --dtype half \
  --gpu-memory-utilization 0.8 \
  --max-model-len 4096 \
  --max-num-seqs 1
```
The server exposes an OpenAI-compatible API at http://localhost:8000/v1/.
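As a sketch of calling that endpoint from Python with only the standard library (the model name and URL match the serve command above; the helper names are ours, not part of any package):

```python
import json
from urllib import request

def build_payload(prompt, model="Qwen/Qwen3.5-35B-A3B-GPTQ-Int4"):
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }

def chat(prompt, base="http://localhost:8000/v1"):
    """POST the request to the local vLLM server and return the reply text."""
    req = request.Request(
        f"{base}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```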
## Requirements
| Component | Version |
|---|---|
| Device | Jetson AGX Orin, Orin Nano, or Orin NX (SM 8.7) |
| JetPack | 6.x (tested on 6.2.1) |
| CUDA | 12.6 |
| Python | 3.10 |
| PyTorch | 2.10.0 from Jetson AI Lab |
| NumPy | 1.x (required by Jetson PyTorch) |
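A note on the NumPy row: the Jetson PyTorch wheels are built against the NumPy 1.x ABI, so importing them alongside NumPy 2.x fails. A tiny version-string check (the helper is illustrative, not part of any package):

```python
def numpy_compatible(version: str) -> bool:
    """True if a NumPy version string is in the 1.x series required here."""
    return version.split(".", 1)[0] == "1"

# Run against numpy.__version__ in your venv; pin with: pip install "numpy<2"
print(numpy_compatible("1.26.4"))  # True
print(numpy_compatible("2.1.0"))   # False
```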
## What's Included

- vLLM 0.17.0 with all compiled `.so` extensions targeting SM 8.7
- Marlin GPTQ kernels: fused INT4 dequantization + FP16 tensor core matrix multiply
- Scheduler fast-path patch: reduces per-token Python overhead for single-user decode
- Saves a 75-minute build-from-source process
## Resources
- Scripts, patches, systemd template, benchmarks: GitHub: thehighnotes/vllm-jetson-orin
- Build from source instructions: GitHub README
## Part of the AIquest Research Lab
This wheel powers the LLM inference backend for the Multi-Machine Dev Hub, a suite of CLI tools connecting Jetson devices into a unified development environment. It also provides the inference layer for Prometheus Mind, a memory architecture for frozen language models.
Explore more at aiquest.info/research.