vLLM 0.17.0: Pre-built Wheel for Jetson Orin (SM 8.7)

Pre-built vLLM wheel with Marlin GPTQ tensor core kernels compiled for SM 8.7 (Jetson AGX Orin, Orin Nano, Orin NX).

Official vLLM wheels ship Marlin kernels for SM 8.0, 8.6, 8.9, and 9.0, but not SM 8.7. Without SM 8.7 support, `--quantization gptq_marlin` falls back to generic CUDA core kernels, leaving Orin's tensor cores idle. This wheel fixes that with a single pip install.

Performance

Tested on Jetson AGX Orin 64GB with Qwen3.5-35B-A3B-GPTQ-Int4 (35B parameters, 3.5B active via MoE).

Prefill Throughput

| Engine | Prefill (tok/s) | Speedup |
|---|---|---|
| llama.cpp Q4_K_M | 523 | 1.0x |
| vLLM GPTQ (no Marlin) | 241 | 0.5x |
| vLLM GPTQ Marlin | 2,001 | 3.8x |

Decode Throughput (streaming)

| Context Length | vLLM Marlin | llama.cpp | Speedup |
|---|---|---|---|
| ~38 tokens | 31.4 tok/s | 22.5 tok/s | +40% |
| ~4,000 tokens | 30.9 tok/s | 22.5 tok/s | +37% |
| ~20,000 tokens | 29.2 tok/s | 22.5 tok/s | +30% |

End-to-End (20k context + 200 output tokens)

| Engine | Total Time | Speedup |
|---|---|---|
| llama.cpp | 47s | 1.0x |
| vLLM Marlin | 17s | 2.8x |
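The speedup columns follow directly from the raw numbers in the tables above; a quick arithmetic check:

```python
# Prefill throughput (tok/s) from the benchmark table above.
prefill = {
    "llama.cpp Q4_K_M": 523,
    "vLLM GPTQ (no Marlin)": 241,
    "vLLM GPTQ Marlin": 2001,
}
base = prefill["llama.cpp Q4_K_M"]
for engine, toks in prefill.items():
    print(f"{engine}: {toks / base:.1f}x")  # 1.0x, 0.5x, 3.8x

# End-to-end wall time: 47 s -> 17 s
print(f"end-to-end: {47 / 17:.1f}x")  # 2.8x
```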

Install

```bash
# Create venv and install PyTorch for Jetson
python3 -m venv ~/vllm-venv && source ~/vllm-venv/bin/activate
pip install torch --index-url https://pypi.jetson-ai-lab.io/jp6/cu126

# Install this wheel
pip install https://huggingface.co/thehighnotes/vllm-jetson-orin/resolve/main/vllm-0.17.0+cu126-cp310-cp310-linux_aarch64.whl

# Install Triton
pip install triton
```
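The wheel's filename encodes its compatibility tags, so you can sanity-check them against your interpreter before installing. A small stdlib sketch (the filename is copied from the install command above; pip performs the same check for real):

```python
import platform
import sys

# Filename from the install step above; fields are name-version-python-abi-platform.
wheel = "vllm-0.17.0+cu126-cp310-cp310-linux_aarch64.whl"
name, version, py_tag, abi_tag, plat_tag = wheel[: -len(".whl")].split("-")

expected_py = f"cp{sys.version_info.major}{sys.version_info.minor}"
expected_plat = f"{platform.system().lower()}_{platform.machine()}"

print("python ok:", py_tag == expected_py)        # True only on CPython 3.10
print("platform ok:", plat_tag == expected_plat)  # True only on linux_aarch64
```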

Launch

```bash
# Set required environment variables
export LD_LIBRARY_PATH="$(python -c 'import nvidia.cu12; print(nvidia.cu12.__path__[0])')/lib:${LD_LIBRARY_PATH:-}"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/local/cuda/targets/aarch64-linux/lib:${LD_LIBRARY_PATH}"

# Start serving (any GPTQ-Int4 model)
vllm serve Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 \
    --host 0.0.0.0 --port 8000 \
    --quantization gptq_marlin \
    --dtype half \
    --gpu-memory-utilization 0.8 \
    --max-model-len 4096 \
    --max-num-seqs 1
```
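`--gpu-memory-utilization 0.8` caps how much GPU-visible memory vLLM claims for weights plus KV cache. On the AGX Orin the 64 GB is unified memory shared with the CPU, so leaving headroom matters; the budget works out to:

```python
# 64 GB AGX Orin unified memory, 0.8 utilization as in the launch command above.
total_gb = 64
budget = total_gb * 0.8
print(f"{budget:.1f} GB available to vLLM (weights + KV cache)")  # 51.2 GB
```

Lower the fraction if the rest of the system needs more of the shared pool.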

The server exposes an OpenAI-compatible API at http://localhost:8000/v1/.
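For example, a minimal chat-completion request using only the stdlib (the model name matches the launch command above; the try/except is there so the sketch degrades gracefully when the server isn't running):

```python
import json
import urllib.request

# Request body for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
payload = {
    "model": "Qwen/Qwen3.5-35B-A3B-GPTQ-Int4",
    "messages": [{"role": "user", "content": "Hello from Jetson!"}],
    "max_tokens": 64,
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=5) as resp:
        body = json.loads(resp.read())
        print(body["choices"][0]["message"]["content"])
except OSError as exc:
    print(f"server not reachable: {exc}")
```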

Requirements

| Component | Version |
|---|---|
| Device | Jetson AGX Orin, Orin Nano, or Orin NX (SM 8.7) |
| JetPack | 6.x (tested on 6.2.1) |
| CUDA | 12.6 |
| Python | 3.10 |
| PyTorch | 2.10.0 from Jetson AI Lab |
| NumPy | 1.x (required by Jetson PyTorch) |

What's Included

  • vLLM 0.17.0 with all compiled .so extensions targeting SM 8.7
  • Marlin GPTQ kernels: fused INT4 dequantization + FP16 tensor core matrix multiply
  • Scheduler fast-path patch: reduces per-token Python overhead for single-user decode
  • Saves a 75-minute build-from-source process

Resources

Part of the AIquest Research Lab

This wheel powers the LLM inference backend for the Multi-Machine Dev Hub, a suite of CLI tools connecting Jetson devices into a unified development environment. It also provides the inference layer for Prometheus Mind, a memory architecture for frozen language models.

Explore more at aiquest.info/research.

License

MIT
