# Qwen3-VL-8B-Instruct: Precompiled Neuron Artifacts (inf2, TP=2)
Precompiled Neuron (NxD Inference + vLLM) artifacts for Qwen/Qwen3-VL-8B-Instruct on AWS Inferentia2 with TP=2.
These artifacts allow you to skip the ~55 minute compilation step and run inference immediately on inf2.xlarge or inf2.8xlarge.
## Quick Start
```shell
# On an inf2 instance with the Neuron DLAMI (Ubuntu 24.04, SDK 2.28):
source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate
pip install huggingface_hub

# Download the precompiled artifacts
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='jburtoft/Qwen3-VL-8B-Instruct-neuron-inf2-tp2', local_dir='neuron-compiled-artifacts')"

# Set the environment variables so vLLM loads the precompiled model
export NEURON_COMPILED_ARTIFACTS=$PWD/neuron-compiled-artifacts/bs1_tp2
export VLLM_NEURON_FRAMEWORK=neuronx-distributed-inference

# Start the vLLM server (compilation skipped!)
vllm serve Qwen/Qwen3-VL-8B-Instruct \
    --device neuron \
    --tensor-parallel-size 2 \
    --max-model-len 8192 \
    --max-num-seqs 1 \
    --override-neuron-config '{"text_neuron_config": {"batch_size": 1, "ctx_batch_size": 1, "tkg_batch_size": 1, "seq_len": 8192, "max_context_length": 8192, "enable_bucketing": true, "context_encoding_buckets": [1024, 4096, 8192], "token_generation_buckets": [1024, 4096, 8192], "world_size": 2, "tp_degree": 2, "torch_dtype": "bfloat16", "rpl_reduce_dtype": "bfloat16", "attention_dtype": "bfloat16", "cast_type": "as-declared", "logical_neuron_cores": 1, "cc_pipeline_tiling_factor": 1, "fused_qkv": true, "qkv_kernel_enabled": false, "mlp_kernel_enabled": false, "attn_kernel_enabled": false}, "vision_neuron_config": {"batch_size": 1, "seq_len": 8192, "max_context_length": 8192, "enable_bucketing": true, "buckets": [1024, 4096, 8192], "world_size": 2, "tp_degree": 2, "torch_dtype": "bfloat16", "rpl_reduce_dtype": "bfloat16", "cast_type": "as-declared", "logical_neuron_cores": 1, "cc_pipeline_tiling_factor": 1, "fused_qkv": true, "attn_kernel_enabled": false, "mlp_kernel_enabled": false}}' \
    --enable-prefix-caching false \
    --enable-chunked-prefill false
```
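Once it is up, the server exposes vLLM's standard OpenAI-compatible API (on port 8000 by default). A minimal client sketch, assuming `requests` is installed; the image URL and prompt are placeholders:

```python
def build_chat_request(prompt: str, image_url: str,
                       model: str = "Qwen/Qwen3-VL-8B-Instruct") -> dict:
    """Build an OpenAI-style multimodal chat-completion payload for vLLM."""
    return {
        "model": model,
        "max_tokens": 256,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
    }

if __name__ == "__main__":
    import requests  # pip install requests
    payload = build_chat_request("Describe this image.",
                                 "https://example.com/cat.jpg")  # placeholder URL
    resp = requests.post("http://localhost:8000/v1/chat/completions",
                         json=payload, timeout=120)
    print(resp.json()["choices"][0]["message"]["content"])
```

With `--max-num-seqs 1` and `limit_mm_per_prompt={"image": 1}` at compile time, send one request with at most one image at a time.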
## How These Artifacts Were Created
To regenerate these artifacts (e.g., for a different batch size or SDK version):
### 1. Launch an inf2.8xlarge
inf2.xlarge (16 GB RAM) cannot compile this model; it OOMs. Use inf2.8xlarge (128 GB RAM, same 2 NeuronCores + 32 GB HBM).
```shell
# Use Deep Learning AMI Neuron (Ubuntu 24.04) 20260227 (SDK 2.28)
aws ec2 run-instances \
    --image-id <ami-id> \
    --instance-type inf2.8xlarge \
    --key-name <your-key> \
    ...
```
### 2. Set up the environment
```shell
source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate
pip install qwen-vl-utils transformers huggingface_hub
```
### 3. Apply the NxDI batch fix (required for batch_size > 1)
```python
import os, urllib.request
import neuronx_distributed_inference

# Patch the two affected model files in place inside the installed package
nxdi_path = os.path.dirname(neuronx_distributed_inference.__file__)
BRANCH = "fix/qwen3-vl-batch-size-gt1-v2"
BASE = f"https://raw.githubusercontent.com/jimburtoft/neuronx-distributed-inference/{BRANCH}/src/neuronx_distributed_inference/models"
urllib.request.urlretrieve(f"{BASE}/image_to_text_model_wrapper.py", f"{nxdi_path}/models/image_to_text_model_wrapper.py")
urllib.request.urlretrieve(f"{BASE}/qwen3_vl/modeling_qwen3_vl.py", f"{nxdi_path}/models/qwen3_vl/modeling_qwen3_vl.py")
```
**CRITICAL:** Do NOT `pip install` the fork; it overwrites the DLAMI's stock NxDI, which has the `tensor_capture_hook` support required by vLLM.
### 4. Compile and save artifacts
```python
import os

os.environ["NEURON_COMPILED_ARTIFACTS"] = os.path.expanduser("~/neuron-compiled-artifacts/bs1_tp2")
os.environ["VLLM_NEURON_FRAMEWORK"] = "neuronx-distributed-inference"

from vllm import LLM

# inf2 config: ALL ISA kernels disabled, LNC=1
text_cfg = {
    "batch_size": 1, "ctx_batch_size": 1, "tkg_batch_size": 1,
    "seq_len": 8192, "max_context_length": 8192,
    "enable_bucketing": True,
    "context_encoding_buckets": [1024, 4096, 8192],
    "token_generation_buckets": [1024, 4096, 8192],
    "world_size": 2, "tp_degree": 2,
    "torch_dtype": "bfloat16", "rpl_reduce_dtype": "bfloat16",
    "attention_dtype": "bfloat16", "cast_type": "as-declared",
    "logical_neuron_cores": 1, "cc_pipeline_tiling_factor": 1,
    "fused_qkv": True,
    "qkv_kernel_enabled": False,  # inf2 compiler bug NCC_IBVF032
    "mlp_kernel_enabled": False,
    "attn_kernel_enabled": False,
}
vision_cfg = {
    "batch_size": 1,
    "seq_len": 8192, "max_context_length": 8192,
    "enable_bucketing": True, "buckets": [1024, 4096, 8192],
    "world_size": 2, "tp_degree": 2,
    "torch_dtype": "bfloat16", "rpl_reduce_dtype": "bfloat16",
    "cast_type": "as-declared",
    "logical_neuron_cores": 1, "cc_pipeline_tiling_factor": 1,
    "fused_qkv": True,
    "attn_kernel_enabled": False, "mlp_kernel_enabled": False,
}

llm = LLM(
    model="Qwen/Qwen3-VL-8B-Instruct",
    trust_remote_code=True,
    dtype="bfloat16",
    tensor_parallel_size=2,
    max_num_seqs=1,
    max_model_len=8192,
    additional_config=dict(override_neuron_config=dict(
        text_neuron_config=text_cfg,
        vision_neuron_config=vision_cfg,
    )),
    limit_mm_per_prompt={"image": 1},
    enable_prefix_caching=False,
    enable_chunked_prefill=False,
)

# Compilation takes ~55 min on inf2.8xlarge
# Artifacts are saved to ~/neuron-compiled-artifacts/bs1_tp2/
```
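Before uploading, it can help to confirm that the compiler actually wrote artifacts to the output directory (roughly 266 MB for this configuration). A generic sketch, assuming the default artifact path from the step above:

```python
import os

def artifact_summary(root: str) -> tuple[int, int]:
    """Return (file_count, total_bytes) for everything under root."""
    count = total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            count += 1
            total += os.path.getsize(os.path.join(dirpath, name))
    return count, total

files, size = artifact_summary(os.path.expanduser("~/neuron-compiled-artifacts/bs1_tp2"))
print(f"{files} files, {size / 1e6:.0f} MB")  # expect a nonzero count before uploading
```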
### 5. Upload to Hugging Face
```python
import os
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path=os.path.expanduser("~/neuron-compiled-artifacts/bs1_tp2"),
    repo_id="your-username/Qwen3-VL-8B-Instruct-neuron-inf2-tp2",
    path_in_repo="bs1_tp2",
    repo_type="model",
)
```
## Configuration Details
| Parameter | Value |
|---|---|
| Model | Qwen/Qwen3-VL-8B-Instruct (9B BF16) |
| SDK | Neuron SDK 2.28 |
| DLAMI | Deep Learning AMI Neuron (Ubuntu 24.04) 20260227 |
| NxDI | 0.8.16251+f3ca5575 (DLAMI stock) |
| vLLM | 0.13.0 + vllm-neuron 0.4.1 |
| TP degree | 2 |
| Batch size | 1 |
| Seq len | 8192 |
| Vision seq len | 8192 (supports up to 1080p images) |
| ISA kernels | ALL disabled (inf2 compiler bug NCC_IBVF032) |
| LNC | 1 (no LNC on inf2) |
| Compile time | ~55 min on inf2.8xlarge |
| Artifact size | ~266 MB |
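With `enable_bucketing`, each request is padded up to the smallest compiled bucket that fits it, so a 2,000-token context runs through the 4096 graph rather than the full 8192 one. A minimal sketch of that selection rule (the actual padding logic lives inside NxDI):

```python
def select_bucket(length: int, buckets: list[int]) -> int:
    """Pick the smallest compiled bucket that can hold `length` tokens."""
    for b in sorted(buckets):
        if length <= b:
            return b
    raise ValueError(f"length {length} exceeds largest bucket {max(buckets)}")

buckets = [1024, 4096, 8192]  # context_encoding_buckets from the config above
print(select_bucket(800, buckets))    # -> 1024
print(select_bucket(2000, buckets))   # -> 4096
```

This is why a prompt just over a bucket boundary (e.g. 1,100 tokens) costs about as much as a 4,096-token one.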
## Performance
| Instance | Throughput | Compile Time |
|---|---|---|
| inf2.8xlarge (compile + run) | ~17 tok/s | ~55 min |
| inf2.xlarge (precompiled) | ~17 tok/s | 0 (loads from artifacts) |
| trn2.3xlarge (tp=4) | ~75 tok/s | ~10 min |
## Constraints
- Artifacts are specific to SDK 2.28, inf2 (2 NeuronCores), and LNC=1
- Cannot use these artifacts on trn2 (different NeuronCore count and LNC)
- When loading precompiled artifacts, vLLM will not recompile even if you pass different configs
- Requires the NxDI batch fix patch for batch_size > 1
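Because vLLM will not recompile when artifacts are present, a mismatch between the compile-time config and the serve-time override can go unnoticed. A hypothetical sanity check (a plain dict diff, not an NxDI API):

```python
def config_mismatches(compiled: dict, requested: dict) -> dict:
    """Return keys whose values differ between compile-time and serve-time configs."""
    keys = set(compiled) | set(requested)
    return {k: (compiled.get(k), requested.get(k))
            for k in sorted(keys)
            if compiled.get(k) != requested.get(k)}

# Illustrative values: serve-time seq_len silently differs from the compiled graph
compiled = {"tp_degree": 2, "seq_len": 8192, "batch_size": 1}
requested = {"tp_degree": 2, "seq_len": 4096, "batch_size": 1}
print(config_mismatches(compiled, requested))  # -> {'seq_len': (8192, 4096)}
```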
## Related
- Qwen/Qwen3-VL-8B-Instruct: the base model
- NxDI batch fix branch: required for batch > 1