# Qwen3-VL-8B-Instruct: Precompiled Neuron Artifacts (inf2, TP=2)

Precompiled Neuron (NxD Inference + vLLM) artifacts for Qwen/Qwen3-VL-8B-Instruct on AWS Inferentia2 with TP=2.

These artifacts let you skip the ~55-minute compilation step and run inference immediately on inf2.xlarge or inf2.8xlarge.

## Quick Start

```bash
# On an inf2 instance with the Neuron DLAMI (Ubuntu 24.04, SDK 2.28):
source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate
pip install huggingface_hub

# Download precompiled artifacts
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='jburtoft/Qwen3-VL-8B-Instruct-neuron-inf2-tp2', local_dir='neuron-compiled-artifacts')"

# Set the environment variables so vLLM loads the precompiled model
export NEURON_COMPILED_ARTIFACTS=$PWD/neuron-compiled-artifacts/bs1_tp2
export VLLM_NEURON_FRAMEWORK=neuronx-distributed-inference

# Start the vLLM server (compilation is skipped)
vllm serve Qwen/Qwen3-VL-8B-Instruct \
  --device neuron \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --max-num-seqs 1 \
  --override-neuron-config '{"text_neuron_config": {"batch_size": 1, "ctx_batch_size": 1, "tkg_batch_size": 1, "seq_len": 8192, "max_context_length": 8192, "enable_bucketing": true, "context_encoding_buckets": [1024, 4096, 8192], "token_generation_buckets": [1024, 4096, 8192], "world_size": 2, "tp_degree": 2, "torch_dtype": "bfloat16", "rpl_reduce_dtype": "bfloat16", "attention_dtype": "bfloat16", "cast_type": "as-declared", "logical_neuron_cores": 1, "cc_pipeline_tiling_factor": 1, "fused_qkv": true, "qkv_kernel_enabled": false, "mlp_kernel_enabled": false, "attn_kernel_enabled": false}, "vision_neuron_config": {"batch_size": 1, "seq_len": 8192, "max_context_length": 8192, "enable_bucketing": true, "buckets": [1024, 4096, 8192], "world_size": 2, "tp_degree": 2, "torch_dtype": "bfloat16", "rpl_reduce_dtype": "bfloat16", "cast_type": "as-declared", "logical_neuron_cores": 1, "cc_pipeline_tiling_factor": 1, "fused_qkv": true, "attn_kernel_enabled": false, "mlp_kernel_enabled": false}}' \
  --enable-prefix-caching false \
  --enable-chunked-prefill false
```
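
Once the server is up, you can exercise it through vLLM's OpenAI-compatible HTTP API (port 8000 by default). A minimal stdlib-only sketch; the image path and question are placeholders:

```python
import base64
import json
from urllib import request

def build_chat_payload(image_path: str, question: str) -> dict:
    """Build an OpenAI-style chat payload with one inline base64 image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {
        "model": "Qwen/Qwen3-VL-8B-Instruct",
        "max_tokens": 256,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text", "text": question},
            ],
        }],
    }

def ask(payload: dict,
        url: str = "http://localhost:8000/v1/chat/completions") -> str:
    """POST the payload to the running vLLM server and return the reply text."""
    req = request.Request(url, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (requires the server above to be running):
# print(ask(build_chat_payload("photo.jpg", "Describe this image.")))
```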

## How These Artifacts Were Created

To regenerate these artifacts (e.g., for a different batch size or SDK version):

### 1. Launch an inf2.8xlarge

inf2.xlarge (16 GB RAM) cannot compile this model; it OOMs. Use inf2.8xlarge (128 GB RAM, same 2 NeuronCores and 32 GB HBM).
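
A quick pre-flight check of host RAM before attempting compilation (Linux; the 100 GiB threshold is a rough assumption based on the OOM behaviour above, not a measured requirement):

```python
import os

def host_ram_gib() -> float:
    """Total physical RAM in GiB, via POSIX sysconf."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30

ram = host_ram_gib()
if ram < 100:
    print(f"Only {ram:.0f} GiB RAM; compile on inf2.8xlarge instead.")
else:
    print(f"{ram:.0f} GiB RAM; enough headroom to compile.")
```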

```bash
# Use Deep Learning AMI Neuron (Ubuntu 24.04) 20260227 (SDK 2.28)
aws ec2 run-instances \
  --image-id <ami-id> \
  --instance-type inf2.8xlarge \
  --key-name <your-key> \
  ...
```

### 2. Set up the environment

```bash
source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate
pip install qwen-vl-utils transformers huggingface_hub
```
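
To confirm the venv has everything the later steps need, a small import check can help (the package list combines the pip installs above with the DLAMI's stock stack; names are pip distribution names):

```python
from importlib.metadata import version, PackageNotFoundError

def report(pkgs):
    """Print each package's installed version; return the list of missing ones."""
    missing = []
    for p in pkgs:
        try:
            print(f"{p}: {version(p)}")
        except PackageNotFoundError:
            missing.append(p)
            print(f"{p}: MISSING")
    return missing

report(["vllm", "neuronx-distributed-inference", "transformers",
        "qwen-vl-utils", "huggingface_hub"])
```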

### 3. Apply the NxDI batch fix (required for `batch_size > 1`)

```python
import os, urllib.request
import neuronx_distributed_inference
nxdi_path = os.path.dirname(neuronx_distributed_inference.__file__)

BRANCH = "fix/qwen3-vl-batch-size-gt1-v2"
BASE = f"https://raw.githubusercontent.com/jimburtoft/neuronx-distributed-inference/{BRANCH}/src/neuronx_distributed_inference/models"

urllib.request.urlretrieve(f"{BASE}/image_to_text_model_wrapper.py", f"{nxdi_path}/models/image_to_text_model_wrapper.py")
urllib.request.urlretrieve(f"{BASE}/qwen3_vl/modeling_qwen3_vl.py", f"{nxdi_path}/models/qwen3_vl/modeling_qwen3_vl.py")
```

**CRITICAL:** Do NOT `pip install` the fork; it overwrites the DLAMI's stock NxDI, which carries the `tensor_capture_hook` support that vLLM requires.
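
One way to confirm the patch landed (without reinstalling anything) is to grep the patched files for a marker string. `file_contains` is a generic helper written for this card; which marker to grep for depends on what the fork actually changed, so the string in the example is only an assumption:

```python
def file_contains(path: str, needle: str) -> bool:
    """Return True if `needle` appears in the text file at `path`."""
    with open(path, encoding="utf-8") as f:
        return needle in f.read()

# Example (marker string is an assumption; pick any line you know the
# fork adds, then check the patched file inside the installed package):
# import os, neuronx_distributed_inference as nxdi
# base = os.path.dirname(nxdi.__file__)
# print(file_contains(f"{base}/models/image_to_text_model_wrapper.py", "batch_size"))
```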

### 4. Compile and save artifacts

```python
import os
os.environ["NEURON_COMPILED_ARTIFACTS"] = os.path.expanduser("~/neuron-compiled-artifacts/bs1_tp2")
os.environ["VLLM_NEURON_FRAMEWORK"] = "neuronx-distributed-inference"

from vllm import LLM

# inf2 config: ALL ISA kernels disabled, LNC=1
text_cfg = {
    "batch_size": 1, "ctx_batch_size": 1, "tkg_batch_size": 1,
    "seq_len": 8192, "max_context_length": 8192,
    "enable_bucketing": True,
    "context_encoding_buckets": [1024, 4096, 8192],
    "token_generation_buckets": [1024, 4096, 8192],
    "world_size": 2, "tp_degree": 2,
    "torch_dtype": "bfloat16", "rpl_reduce_dtype": "bfloat16",
    "attention_dtype": "bfloat16", "cast_type": "as-declared",
    "logical_neuron_cores": 1, "cc_pipeline_tiling_factor": 1,
    "fused_qkv": True,
    "qkv_kernel_enabled": False,   # inf2 compiler bug NCC_IBVF032
    "mlp_kernel_enabled": False,
    "attn_kernel_enabled": False,
}
vision_cfg = {
    "batch_size": 1,
    "seq_len": 8192, "max_context_length": 8192,
    "enable_bucketing": True, "buckets": [1024, 4096, 8192],
    "world_size": 2, "tp_degree": 2,
    "torch_dtype": "bfloat16", "rpl_reduce_dtype": "bfloat16",
    "cast_type": "as-declared",
    "logical_neuron_cores": 1, "cc_pipeline_tiling_factor": 1,
    "fused_qkv": True,
    "attn_kernel_enabled": False, "mlp_kernel_enabled": False,
}

llm = LLM(
    model="Qwen/Qwen3-VL-8B-Instruct",
    trust_remote_code=True,
    dtype="bfloat16",
    tensor_parallel_size=2,
    max_num_seqs=1,
    max_model_len=8192,
    additional_config=dict(override_neuron_config=dict(
        text_neuron_config=text_cfg,
        vision_neuron_config=vision_cfg,
    )),
    limit_mm_per_prompt={"image": 1},
    enable_prefix_caching=False,
    enable_chunked_prefill=False,
)
# Compilation takes ~55 min on inf2.8xlarge
# Artifacts are saved to ~/neuron-compiled-artifacts/bs1_tp2/
```
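
After compilation finishes, a quick walk over the output directory confirms the artifacts were actually written (this repo's artifacts total ~266 MB; your total may differ slightly):

```python
import os

def dir_size_mb(root: str) -> float:
    """Total size of all files under `root`, in MB (returns 0.0 if absent)."""
    total = 0
    for dirpath, _, files in os.walk(root):
        for name in files:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total / 1e6

artifacts = os.path.expanduser("~/neuron-compiled-artifacts/bs1_tp2")
print(f"{artifacts}: {dir_size_mb(artifacts):.1f} MB")
```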

### 5. Upload to Hugging Face

```python
import os
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path=os.path.expanduser("~/neuron-compiled-artifacts/bs1_tp2"),
    repo_id="your-username/Qwen3-VL-8B-Instruct-neuron-inf2-tp2",
    path_in_repo="bs1_tp2",
    repo_type="model",
)
```

## Configuration Details

| Parameter | Value |
|---|---|
| Model | Qwen/Qwen3-VL-8B-Instruct (9B BF16) |
| SDK | Neuron SDK 2.28 |
| DLAMI | Deep Learning AMI Neuron (Ubuntu 24.04) 20260227 |
| NxDI | 0.8.16251+f3ca5575 (DLAMI stock) |
| vLLM | 0.13.0 + vllm-neuron 0.4.1 |
| TP degree | 2 |
| Batch size | 1 |
| Seq len | 8192 |
| Vision seq len | 8192 (supports up to 1080p images) |
| ISA kernels | All disabled (inf2 compiler bug NCC_IBVF032) |
| LNC | 1 (no LNC on inf2) |
| Compile time | ~55 min on inf2.8xlarge |
| Artifact size | ~266 MB |

## Performance

| Instance | Throughput | Compile time |
|---|---|---|
| inf2.8xlarge (compile + run) | ~17 tok/s | ~55 min |
| inf2.xlarge (precompiled) | ~17 tok/s | 0 (loads from artifacts) |
| trn2.3xlarge (TP=4) | ~75 tok/s | ~10 min |
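
The throughput figures above can be reproduced by timing a request against the running server and dividing completion tokens by elapsed time. A rough stdlib-only sketch (the `usage` fields follow the OpenAI-compatible response schema; note this measures end-to-end time including prefill, so it slightly understates pure decode throughput):

```python
import json
import time
from urllib import request

def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Throughput in generated tokens per second."""
    return completion_tokens / elapsed_s

def measure_throughput(payload: dict,
                       url: str = "http://localhost:8000/v1/chat/completions") -> float:
    """POST a non-streaming completion request and return completion tok/s."""
    req = request.Request(url, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    t0 = time.monotonic()
    with request.urlopen(req) as resp:
        body = json.load(resp)
    elapsed = time.monotonic() - t0
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)
```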

## Constraints

- Artifacts are specific to SDK 2.28, inf2 (2 NeuronCores), and LNC=1.
- They cannot be used on trn2 (different NeuronCore count and LNC).
- When loading precompiled artifacts, vLLM will not recompile even if you pass different configs.
- `batch_size > 1` requires the NxDI batch fix patch.
