# Qwen3-VL-8B-Instruct: Precompiled Neuron Artifacts (inf2, TP=2)

Precompiled Neuron (NxD Inference + vLLM) artifacts for Qwen/Qwen3-VL-8B-Instruct on AWS Inferentia2 with TP=2.

These artifacts let you skip the ~55-minute compilation step and run inference immediately on inf2.xlarge or inf2.8xlarge.

## Quick Start

```bash
# On an inf2 instance with the Neuron DLAMI (Ubuntu 24.04, SDK 2.28):
source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate
pip install huggingface_hub

# Download precompiled artifacts
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='jburtoft/Qwen3-VL-8B-Instruct-neuron-inf2-tp2', local_dir='neuron-compiled-artifacts')"

# Set the environment variables so vLLM loads the precompiled model
export NEURON_COMPILED_ARTIFACTS=$PWD/neuron-compiled-artifacts/bs1_tp2
export VLLM_NEURON_FRAMEWORK=neuronx-distributed-inference

# Start the vLLM server (compilation is skipped)
vllm serve Qwen/Qwen3-VL-8B-Instruct \
  --device neuron \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --max-num-seqs 1 \
  --override-neuron-config '{"text_neuron_config": {"batch_size": 1, "ctx_batch_size": 1, "tkg_batch_size": 1, "seq_len": 8192, "max_context_length": 8192, "enable_bucketing": true, "context_encoding_buckets": [1024, 4096, 8192], "token_generation_buckets": [1024, 4096, 8192], "world_size": 2, "tp_degree": 2, "torch_dtype": "bfloat16", "rpl_reduce_dtype": "bfloat16", "attention_dtype": "bfloat16", "cast_type": "as-declared", "logical_neuron_cores": 1, "cc_pipeline_tiling_factor": 1, "fused_qkv": true, "qkv_kernel_enabled": false, "mlp_kernel_enabled": false, "attn_kernel_enabled": false}, "vision_neuron_config": {"batch_size": 1, "seq_len": 8192, "max_context_length": 8192, "enable_bucketing": true, "buckets": [1024, 4096, 8192], "world_size": 2, "tp_degree": 2, "torch_dtype": "bfloat16", "rpl_reduce_dtype": "bfloat16", "cast_type": "as-declared", "logical_neuron_cores": 1, "cc_pipeline_tiling_factor": 1, "fused_qkv": true, "attn_kernel_enabled": false, "mlp_kernel_enabled": false}}' \
  --enable-prefix-caching false \
  --enable-chunked-prefill false
```
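
Once the server is up, you can exercise it through vLLM's OpenAI-compatible HTTP API (port 8000 by default). A minimal stdlib-only sketch; the image path and question are placeholders:

```python
import base64
import json
from urllib import request

def build_chat_payload(image_path: str, question: str) -> dict:
    """Build an OpenAI-style chat payload with one inline base64 image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {
        "model": "Qwen/Qwen3-VL-8B-Instruct",
        "max_tokens": 256,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text", "text": question},
            ],
        }],
    }

def ask(payload: dict,
        url: str = "http://localhost:8000/v1/chat/completions") -> str:
    """POST the payload to the running vLLM server and return the reply text."""
    req = request.Request(url, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (requires the server above to be running):
# print(ask(build_chat_payload("photo.jpg", "Describe this image.")))
```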

## How These Artifacts Were Created

To regenerate these artifacts (e.g., for a different batch size or SDK version):

### 1. Launch an inf2.8xlarge

inf2.xlarge (16 GB RAM) cannot compile this model; it OOMs. Use inf2.8xlarge (128 GB RAM, same 2 NeuronCores and 32 GB HBM).
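
A quick pre-flight check of host RAM before attempting compilation (Linux; the 100 GiB threshold is a rough assumption based on the OOM behaviour above, not a measured requirement):

```python
import os

def host_ram_gib() -> float:
    """Total physical RAM in GiB, via POSIX sysconf."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30

ram = host_ram_gib()
if ram < 100:
    print(f"Only {ram:.0f} GiB RAM; compile on inf2.8xlarge instead.")
else:
    print(f"{ram:.0f} GiB RAM; enough headroom to compile.")
```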

```bash
# Use Deep Learning AMI Neuron (Ubuntu 24.04) 20260227 (SDK 2.28)
aws ec2 run-instances \
  --image-id <ami-id> \
  --instance-type inf2.8xlarge \
  --key-name <your-key> \
  ...
```

### 2. Set up the environment

```bash
source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate
pip install qwen-vl-utils transformers huggingface_hub
```
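
To confirm the venv has everything the later steps need, a small import check can help (the package list combines the pip installs above with the DLAMI's stock stack; names are pip distribution names):

```python
from importlib.metadata import version, PackageNotFoundError

def report(pkgs):
    """Print each package's installed version; return the list of missing ones."""
    missing = []
    for p in pkgs:
        try:
            print(f"{p}: {version(p)}")
        except PackageNotFoundError:
            missing.append(p)
            print(f"{p}: MISSING")
    return missing

report(["vllm", "neuronx-distributed-inference", "transformers",
        "qwen-vl-utils", "huggingface_hub"])
```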

### 3. Apply the NxDI batch fix (required for `batch_size > 1`)

```python
import os, urllib.request
import neuronx_distributed_inference
nxdi_path = os.path.dirname(neuronx_distributed_inference.__file__)

BRANCH = "fix/qwen3-vl-batch-size-gt1-v2"
BASE = f"https://raw.githubusercontent.com/jimburtoft/neuronx-distributed-inference/{BRANCH}/src/neuronx_distributed_inference/models"

urllib.request.urlretrieve(f"{BASE}/image_to_text_model_wrapper.py", f"{nxdi_path}/models/image_to_text_model_wrapper.py")
urllib.request.urlretrieve(f"{BASE}/qwen3_vl/modeling_qwen3_vl.py", f"{nxdi_path}/models/qwen3_vl/modeling_qwen3_vl.py")
```

**CRITICAL:** Do NOT `pip install` the fork; it overwrites the DLAMI's stock NxDI, which carries the `tensor_capture_hook` support that vLLM requires.
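
One way to confirm the patch landed (without reinstalling anything) is to grep the patched files for a marker string. `file_contains` is a generic helper written for this card; which marker to grep for depends on what the fork actually changed, so the string in the example is only an assumption:

```python
def file_contains(path: str, needle: str) -> bool:
    """Return True if `needle` appears in the text file at `path`."""
    with open(path, encoding="utf-8") as f:
        return needle in f.read()

# Example (marker string is an assumption; pick any line you know the
# fork adds, then check the patched file inside the installed package):
# import os, neuronx_distributed_inference as nxdi
# base = os.path.dirname(nxdi.__file__)
# print(file_contains(f"{base}/models/image_to_text_model_wrapper.py", "batch_size"))
```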

### 4. Compile and save artifacts

```python
import os
os.environ["NEURON_COMPILED_ARTIFACTS"] = os.path.expanduser("~/neuron-compiled-artifacts/bs1_tp2")
os.environ["VLLM_NEURON_FRAMEWORK"] = "neuronx-distributed-inference"

from vllm import LLM

# inf2 config: ALL ISA kernels disabled, LNC=1
text_cfg = {
    "batch_size": 1, "ctx_batch_size": 1, "tkg_batch_size": 1,
    "seq_len": 8192, "max_context_length": 8192,
    "enable_bucketing": True,
    "context_encoding_buckets": [1024, 4096, 8192],
    "token_generation_buckets": [1024, 4096, 8192],
    "world_size": 2, "tp_degree": 2,
    "torch_dtype": "bfloat16", "rpl_reduce_dtype": "bfloat16",
    "attention_dtype": "bfloat16", "cast_type": "as-declared",
    "logical_neuron_cores": 1, "cc_pipeline_tiling_factor": 1,
    "fused_qkv": True,
    "qkv_kernel_enabled": False,   # inf2 compiler bug NCC_IBVF032
    "mlp_kernel_enabled": False,
    "attn_kernel_enabled": False,
}
vision_cfg = {
    "batch_size": 1,
    "seq_len": 8192, "max_context_length": 8192,
    "enable_bucketing": True, "buckets": [1024, 4096, 8192],
    "world_size": 2, "tp_degree": 2,
    "torch_dtype": "bfloat16", "rpl_reduce_dtype": "bfloat16",
    "cast_type": "as-declared",
    "logical_neuron_cores": 1, "cc_pipeline_tiling_factor": 1,
    "fused_qkv": True,
    "attn_kernel_enabled": False, "mlp_kernel_enabled": False,
}

llm = LLM(
    model="Qwen/Qwen3-VL-8B-Instruct",
    trust_remote_code=True,
    dtype="bfloat16",
    tensor_parallel_size=2,
    max_num_seqs=1,
    max_model_len=8192,
    additional_config=dict(override_neuron_config=dict(
        text_neuron_config=text_cfg,
        vision_neuron_config=vision_cfg,
    )),
    limit_mm_per_prompt={"image": 1},
    enable_prefix_caching=False,
    enable_chunked_prefill=False,
)
# Compilation takes ~55 min on inf2.8xlarge
# Artifacts are saved to ~/neuron-compiled-artifacts/bs1_tp2/
```
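
After compilation finishes, a quick walk over the output directory confirms the artifacts were actually written (this repo's artifacts total ~266 MB; your total may differ slightly):

```python
import os

def dir_size_mb(root: str) -> float:
    """Total size of all files under `root`, in MB (returns 0.0 if absent)."""
    total = 0
    for dirpath, _, files in os.walk(root):
        for name in files:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total / 1e6

artifacts = os.path.expanduser("~/neuron-compiled-artifacts/bs1_tp2")
print(f"{artifacts}: {dir_size_mb(artifacts):.1f} MB")
```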

### 5. Upload to Hugging Face

```python
import os
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path=os.path.expanduser("~/neuron-compiled-artifacts/bs1_tp2"),
    repo_id="your-username/Qwen3-VL-8B-Instruct-neuron-inf2-tp2",
    path_in_repo="bs1_tp2",
    repo_type="model",
)
```

## Configuration Details

| Parameter | Value |
|---|---|
| Model | Qwen/Qwen3-VL-8B-Instruct (9B BF16) |
| SDK | Neuron SDK 2.28 |
| DLAMI | Deep Learning AMI Neuron (Ubuntu 24.04) 20260227 |
| NxDI | 0.8.16251+f3ca5575 (DLAMI stock) |
| vLLM | 0.13.0 + vllm-neuron 0.4.1 |
| TP degree | 2 |
| Batch size | 1 |
| Seq len | 8192 |
| Vision seq len | 8192 (supports up to 1080p images) |
| ISA kernels | All disabled (inf2 compiler bug NCC_IBVF032) |
| LNC | 1 (no LNC on inf2) |
| Compile time | ~55 min on inf2.8xlarge |
| Artifact size | ~266 MB |

## Performance

| Instance | Throughput | Compile time |
|---|---|---|
| inf2.8xlarge (compile + run) | ~17 tok/s | ~55 min |
| inf2.xlarge (precompiled) | ~17 tok/s | 0 (loads from artifacts) |
| trn2.3xlarge (TP=4) | ~75 tok/s | ~10 min |
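
The throughput figures above can be reproduced by timing a request against the running server and dividing completion tokens by elapsed time. A rough stdlib-only sketch (the `usage` fields follow the OpenAI-compatible response schema; note this measures end-to-end time including prefill, so it slightly understates pure decode throughput):

```python
import json
import time
from urllib import request

def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Throughput in generated tokens per second."""
    return completion_tokens / elapsed_s

def measure_throughput(payload: dict,
                       url: str = "http://localhost:8000/v1/chat/completions") -> float:
    """POST a non-streaming completion request and return completion tok/s."""
    req = request.Request(url, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    t0 = time.monotonic()
    with request.urlopen(req) as resp:
        body = json.load(resp)
    elapsed = time.monotonic() - t0
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)
```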

## Constraints

- Artifacts are specific to SDK 2.28, inf2 (2 NeuronCores), and LNC=1.
- They cannot be used on trn2 (different NeuronCore count and LNC).
- When loading precompiled artifacts, vLLM will not recompile even if you pass different configs.
- `batch_size > 1` requires the NxDI batch fix patch.
