Thanks! And MTP key question
Thank you for this quantization; it runs very quickly on an NVIDIA DGX Spark (GB10 Grace Blackwell, sm_121, 128 GB unified memory) and has very high accuracy in my local evals.
I noticed the latest revision (7cdbcbb) has 0 MTP keys, while the base model has 785 (mtp.fc.weight, mtp.layers.0.*, etc.), and an older AutoRound snapshot (89766f) also had those 785 keys. Was dropping the MTP heads intentional? I tried importing MTP heads from the upstream Qwen FP8 quant into this AutoRound model, but inference became slower, not faster. My guess is the slowdown comes from running them in FP8 instead of INT4.
Since this build is symmetric INT4 (group size 128) on the GPTQ-Marlin path, do you expect any benefit from keeping/quantizing MTP heads in AutoRound on this hardware, and is that flow currently supported?
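For reference, I counted the MTP keys roughly like this — a small sketch over local snapshots (the directory paths are placeholders):

```python
import json
from pathlib import Path
from safetensors import safe_open

def count_mtp_keys(model_dir: str) -> int:
    """Count mtp.* tensors in a local snapshot (sharded or single-file)."""
    index = Path(model_dir) / "model.safetensors.index.json"
    if index.exists():
        keys = json.loads(index.read_text())["weight_map"].keys()
    else:
        with safe_open(Path(model_dir) / "model.safetensors", framework="pt") as f:
            keys = list(f.keys())
    return sum(k.startswith("mtp.") for k in keys)

# Placeholder local snapshot paths:
print(count_mtp_keys("./Qwen3.5-35B-A3B-int4-AutoRound"))  # 0 on revision 7cdbcbb
print(count_mtp_keys("./Qwen3.5-35B-A3B"))                 # 785 on the base model
```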
What token/sec throughput are you getting?
| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:-------------------------------------|---------------------:|----------------:|-----------------:|--------------:|-----------------:|-------------------:|-------------------:|-------------------:|
| Intel/Qwen3.5-35B-A3B-int4-AutoRound | pp2048 (c1) | 5668.47 ± 16.79 | 5668.47 ± 16.79 | | | 416.16 ± 1.07 | 361.48 ± 1.07 | 416.23 ± 1.07 |
| Intel/Qwen3.5-35B-A3B-int4-AutoRound | tg32 (c1) | 64.42 ± 0.10 | 64.42 ± 0.10 | 66.51 ± 0.11 | 66.51 ± 0.11 | | | |
| Intel/Qwen3.5-35B-A3B-int4-AutoRound | pp2048 (c4) | 6143.34 ± 23.18 | 1714.60 ± 148.88 | | | 1258.18 ± 98.51 | 1203.50 ± 98.51 | 1258.23 ± 98.50 |
| Intel/Qwen3.5-35B-A3B-int4-AutoRound | tg32 (c4) | 142.84 ± 6.44 | 43.31 ± 4.94 | 147.44 ± 6.64 | 44.71 ± 5.10 | | | |
| Intel/Qwen3.5-35B-A3B-int4-AutoRound | ctx_pp @ d4096 (c1) | 6186.42 ± 44.44 | 6186.42 ± 44.44 | | | 716.97 ± 4.76 | 662.29 ± 4.76 | 717.06 ± 4.74 |
| Intel/Qwen3.5-35B-A3B-int4-AutoRound | ctx_tg @ d4096 (c1) | 63.79 ± 0.25 | 63.79 ± 0.25 | 65.87 ± 0.26 | 65.87 ± 0.26 | | | |
| Intel/Qwen3.5-35B-A3B-int4-AutoRound | pp2048 @ d4096 (c1) | 3867.41 ± 57.49 | 3867.41 ± 57.49 | | | 584.35 ± 7.87 | 529.67 ± 7.87 | 584.43 ± 7.85 |
| Intel/Qwen3.5-35B-A3B-int4-AutoRound | tg32 @ d4096 (c1) | 62.62 ± 0.96 | 62.62 ± 0.96 | 64.66 ± 1.00 | 64.66 ± 1.00 | | | |
| Intel/Qwen3.5-35B-A3B-int4-AutoRound | ctx_pp @ d4096 (c4) | 6506.07 ± 1.30 | 1707.74 ± 77.56 | | | 2458.70 ± 103.85 | 2404.02 ± 103.85 | 2458.73 ± 103.84 |
| Intel/Qwen3.5-35B-A3B-int4-AutoRound | ctx_tg @ d4096 (c4) | 141.47 ± 0.40 | 45.59 ± 5.44 | 146.04 ± 0.42 | 47.07 ± 5.62 | | | |
| Intel/Qwen3.5-35B-A3B-int4-AutoRound | pp2048 @ d4096 (c4) | 4189.65 ± 2.12 | 1121.54 ± 58.71 | | | 1885.57 ± 92.30 | 1830.89 ± 92.30 | 1885.61 ± 92.28 |
| Intel/Qwen3.5-35B-A3B-int4-AutoRound | tg32 @ d4096 (c4) | 142.99 ± 5.86 | 42.81 ± 4.54 | 147.60 ± 6.05 | 44.20 ± 4.68 | | | |
| Intel/Qwen3.5-35B-A3B-int4-AutoRound | ctx_pp @ d16384 (c1) | 6232.64 ± 17.00 | 6232.64 ± 17.00 | | | 2683.53 ± 7.25 | 2628.84 ± 7.25 | 2683.58 ± 7.25 |
| Intel/Qwen3.5-35B-A3B-int4-AutoRound | ctx_tg @ d16384 (c1) | 60.11 ± 0.78 | 60.11 ± 0.78 | 62.06 ± 0.81 | 62.06 ± 0.81 | | | |
| Intel/Qwen3.5-35B-A3B-int4-AutoRound | pp2048 @ d16384 (c1) | 3580.03 ± 11.38 | 3580.03 ± 11.38 | | | 626.75 ± 1.82 | 572.07 ± 1.82 | 626.82 ± 1.80 |
| Intel/Qwen3.5-35B-A3B-int4-AutoRound | tg32 @ d16384 (c1) | 59.19 ± 0.19 | 59.19 ± 0.19 | 61.11 ± 0.19 | 61.11 ± 0.19 | | | |
| Intel/Qwen3.5-35B-A3B-int4-AutoRound | ctx_pp @ d16384 (c4) | 6194.68 ± 28.35 | 1822.81 ± 416.81 | | | 9432.55 ± 1702.26 | 9377.87 ± 1702.26 | 9432.60 ± 1702.25 |
| Intel/Qwen3.5-35B-A3B-int4-AutoRound | ctx_tg @ d16384 (c4) | 25.35 ± 0.01 | 26.49 ± 12.40 | 124.00 ± 0.00 | 33.94 ± 4.17 | | | |
| Intel/Qwen3.5-35B-A3B-int4-AutoRound | pp2048 @ d16384 (c4) | 3836.96 ± 45.16 | 1014.88 ± 42.88 | | | 2076.21 ± 84.05 | 2021.53 ± 84.05 | 2076.26 ± 84.04 |
| Intel/Qwen3.5-35B-A3B-int4-AutoRound | tg32 @ d16384 (c4) | 131.95 ± 4.45 | 37.87 ± 2.94 | 136.20 ± 4.59 | 39.10 ± 3.03 | | | |
| Intel/Qwen3.5-35B-A3B-int4-AutoRound | ctx_pp @ d32768 (c1) | 5564.77 ± 3.67 | 5564.77 ± 3.67 | | | 5943.42 ± 3.97 | 5888.74 ± 3.97 | 5943.48 ± 3.98 |
| Intel/Qwen3.5-35B-A3B-int4-AutoRound | ctx_tg @ d32768 (c1) | 55.76 ± 0.03 | 55.76 ± 0.03 | 57.57 ± 0.04 | 57.57 ± 0.04 | | | |
| Intel/Qwen3.5-35B-A3B-int4-AutoRound | pp2048 @ d32768 (c1) | 3385.45 ± 1.61 | 3385.45 ± 1.61 | | | 659.62 ± 0.29 | 604.94 ± 0.29 | 659.69 ± 0.28 |
| Intel/Qwen3.5-35B-A3B-int4-AutoRound | tg32 @ d32768 (c1) | 55.06 ± 0.05 | 55.06 ± 0.05 | 56.85 ± 0.04 | 56.85 ± 0.04 | | | |
| Intel/Qwen3.5-35B-A3B-int4-AutoRound | ctx_pp @ d32768 (c4) | 5584.04 ± 26.33 | 1795.39 ± 469.98 | | | 19414.02 ± 4279.08 | 19359.34 ± 4279.08 | 19414.07 ± 4279.06 |
| Intel/Qwen3.5-35B-A3B-int4-AutoRound | ctx_tg @ d32768 (c4) | 10.71 ± 0.02 | 15.69 ± 12.45 | 118.50 ± 0.50 | 30.34 ± 2.72 | | | |
| Intel/Qwen3.5-35B-A3B-int4-AutoRound | pp2048 @ d32768 (c4) | 3586.35 ± 20.89 | 1050.22 ± 221.27 | | | 2071.57 ± 321.08 | 2016.89 ± 321.08 | 2071.62 ± 321.06 |
| Intel/Qwen3.5-35B-A3B-int4-AutoRound | tg32 @ d32768 (c4) | 77.36 ± 10.65 | 28.58 ± 5.83 | 122.00 ± 3.00 | 32.45 ± 2.00 | | | |
These runs cover 1 and 4 concurrent requests. Best throughput is around 145 t/s at c4, and the single-request max is around 66 t/s of output.
When I tried with the imported FP8 weights, it dropped to something low, around 11 t/s.
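For a quick sanity check of the single-request number without the full harness, something like this against the OpenAI-compatible endpoint works (a sketch; the localhost port and served model name match my setup but are assumptions for anyone else):

```python
import time
import requests

t0 = time.time()
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": "qwen3.5-35b", "prompt": "Hello", "max_tokens": 256},
    timeout=300,
)
usage = resp.json()["usage"]
elapsed = time.time() - t0
# Includes TTFT, so this slightly understates pure decode t/s.
print(f"{usage['completion_tokens'] / elapsed:.1f} tok/s")
```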
We recommend copying the BF16 MTP layer and modifying config.json to mark these layers as 16-bit as a temporary workaround.
> is that flow currently supported?
This is a known issue that we are tracking https://github.com/intel/auto-round/issues/1477. Since we rely on transformers for tuning, the MTP layer is discarded during model loading. In addition, AutoRound does not support FP8 layers yet, but it does support BF16 layers. We plan to fix it in the next release.
Backend is vLLM, built from main (tracking v0.16.x), serving with gptq_marlin quantization. Hardware is an NVIDIA DGX Spark (as above: GB10 Grace Blackwell, sm_121, 128 GB unified memory).
Main serving flags: --load-format fastsafetensors, --enable-chunked-prefill, --enable-prefix-caching, --dtype bfloat16, and compilation mode 0 with FULL_DECODE_ONLY cudagraphs.
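Mapped onto the offline API, the setup looks roughly like this (a sketch; the kwarg names are my best mapping of the CLI flags onto a recent vLLM build and may need adjusting):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Intel/Qwen3.5-35B-A3B-int4-AutoRound",
    quantization="gptq_marlin",
    dtype="bfloat16",
    load_format="fastsafetensors",
    enable_prefix_caching=True,
    enable_chunked_prefill=True,
    # Compilation mode 0 (no torch.compile) with decode-only full cudagraphs.
    compilation_config={"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"},
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```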
On the MTP side, I did try the BF16 graft approach that @wenhuach mentioned. Pulled 785 MTP keys from the base Qwen/Qwen3.5-35B-A3B checkpoint, wrote them into a local overlay directory, updated config.json to mark those layers as 16-bit. It loads and serves fine, but throughput tanked. It went from ~66 t/s down to ~11 t/s at single-request concurrency. Looks like a compute-path mismatch where the main model runs through Marlin int4 kernels while the BF16 draft heads take an unquantized path. Acceptance rate was still around ~85%, but throughput was lower and the compute path is more complex, so I'm thinking this level of MTP just adds overhead without any speculative benefit.
It functions mechanically, it's just not useful on this hardware + quant combination yet. My admittedly sophomore take is that MTP won't provide meaningful acceleration here until the draft heads get the same INT4 AutoRound treatment as the main model.
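For anyone who wants to reproduce the graft, it was roughly this (a minimal sketch with placeholder local paths; the config.json step is only summarized in a comment, since the exact quantization_config schema depends on the format):

```python
import json, shutil
from pathlib import Path
from safetensors.torch import load_file, save_file

# Placeholder local snapshot paths for the BF16 base and the int4 quant.
base = Path("./Qwen3.5-35B-A3B")
quant = Path("./Qwen3.5-35B-A3B-int4-AutoRound")
overlay = Path("./overlay")

# 1. Gather every mtp.* tensor from the BF16 base checkpoint.
index = json.loads((base / "model.safetensors.index.json").read_text())
mtp_shards = {v for k, v in index["weight_map"].items() if k.startswith("mtp.")}
mtp_tensors = {}
for shard in sorted(mtp_shards):
    mtp_tensors.update(
        {k: t for k, t in load_file(base / shard).items() if k.startswith("mtp.")}
    )

# 2. Copy the quantized snapshot and add the MTP tensors as an extra shard.
shutil.copytree(quant, overlay, dirs_exist_ok=True)
save_file(mtp_tensors, overlay / "model-mtp.safetensors")
qindex = json.loads((overlay / "model.safetensors.index.json").read_text())
qindex["weight_map"].update({k: "model-mtp.safetensors" for k in mtp_tensors})
(overlay / "model.safetensors.index.json").write_text(json.dumps(qindex, indent=2))

# 3. config.json must also mark the mtp.* modules as 16-bit / unquantized
#    (the exact quantization_config field depends on the AutoRound format).
```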
@wenhuach I am trying to re-implement and use this model with gptq-marlin but I am facing this issue:
`<class 'auto_round_extension.cuda.gptqmodel_marlin.get_marlin_layer.<locals>.MarlinQuantLinear'>: out_features: 32 must be divisible by [64].`
The out_features of the in_proj_b/a layers is 32, and AutoRound restricts it to be divisible by 64.
How can we still use Marlin (since the vLLM code works) with AutoRound's code?
EDIT: actually I realized that vLLM has the same constraint, so there must be some fallback to another kernel.
Could you share your model? AutoRound can automatically switch backends on Transformers per layer if the current backend does not support the layer, so this may be a bug.
1. The root cause is that we don't own any CUDA kernels. On Transformers, we leverage the Marlin kernel from GPTQModel rather than vLLM for simplicity. To my knowledge, GPTQModel also adapts it from vLLM, so it should support 32 output features. You may submit an issue to GPTQModel directly, as we don't maintain the kernel: https://github.com/ModelCloud/GPTQModel/blob/1c794a48987690230db5c38946a929740aac4afc/gptqmodel/nn_modules/qlinear/marlin.py#L63
2. Uninstall GPTQModel, or manually set the backend as shown in https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#specify-inference-backend (see the sketch after this list) so that AutoRound can use other backends like Triton, which may be much slower.
3. Set the layers to 16-bit precision.
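For option 2, the manual override looks roughly like this on Transformers (a sketch; "triton" is an example backend string, see the doc linked above for the supported values):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRoundConfig

model_name = "Intel/Qwen3.5-35B-A3B-int4-AutoRound"
# Force a non-Marlin backend so the out_features=32 layers avoid the
# Marlin divisibility check; expect lower throughput on such backends.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=AutoRoundConfig(backend="triton"),
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```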
I ended up doing the same (falling back to another kernel for those layers).
My code is here: https://github.com/eole-nlp/eole/blob/main/eole/modules/autoround_linear.py
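The dispatch idea, much simplified (an illustration, not the actual eole code; the in_features constraint here is an assumption based on typical Marlin kernels):

```python
import torch.nn as nn

# Marlin's shape constraint from the error above: out_features % 64 == 0.
# The in_features % 128 check is an assumed additional constraint.
def supports_marlin(in_features: int, out_features: int) -> bool:
    return out_features % 64 == 0 and in_features % 128 == 0

def make_linear(in_features: int, out_features: int) -> nn.Module:
    # Dispatch: a Marlin-backed layer when the shape qualifies, otherwise a
    # generic fallback (both stubbed with plain nn.Linear for illustration).
    kind = "marlin" if supports_marlin(in_features, out_features) else "fallback"
    print(f"{in_features}x{out_features} -> {kind}")
    return nn.Linear(in_features, out_features)

make_linear(4096, 4096)  # -> marlin
make_linear(4096, 32)    # -> fallback (the in_proj_b/a case above)
```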
FYI, for the 35B-A3B I am getting 20 tok/sec in eager mode and 77 tok/sec in torch.compile mode (compared to the 66 t/s above).
My best case is the 9B-int4, at up to 145 tok/sec, which is actually faster than the BF16 model (82 tok/sec).
EDIT: now getting 150 tok/sec with torch.compile + the MoE Marlin CUDA kernel.
Hi @seanthomaswilliams
The RTN W4A16 MTP tensors have now been added. You are welcome to give them a try.
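On the vLLM side, something like this should pick the MTP head up (a sketch; the speculative_config dict keys assume a recent vLLM and mirror the SpeculativeConfig visible in the crash log below):

```python
from vllm import LLM, SamplingParams

# num_speculative_tokens mirrors num_spec_tokens=2 in the engine config below.
llm = LLM(
    model="Intel/Qwen3.5-35B-A3B-int4-AutoRound",
    speculative_config={"method": "mtp", "num_speculative_tokens": 2},
    trust_remote_code=True,
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```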
@xinhe Thank you for the follow-up. I have been refining my vLLM params for the last 4 hours, but I always come back to the same vLLM crash with the latest MTP tensors:
```
(EngineCore_DP0 pid=124) INFO 03-13 21:38:53 [core.py:101] Initializing a V1 LLM engine (v0.16.1rc1.dev209+g6f0dd9380.d20260304) with config: model='Intel/Qwen3.5-35B-A3B-int4-AutoRound', speculative_config=SpeculativeConfig(method='mtp', model='Intel/Qwen3.5-35B-A3B-int4-AutoRound', num_spec_tokens=2), tokenizer='Intel/Qwen3.5-35B-A3B-int4-AutoRound', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=fastsafetensors, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=inc, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=True, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=1337, served_model_name=qwen3.5-35b, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 48, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}
...
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.16.1rc1.dev209+g6f0dd9380.d20260304) with config: model='Intel/Qwen3.5-35B-A3B-int4-AutoRound', speculative_config=SpeculativeConfig(method='mtp', model='Intel/Qwen3.5-35B-A3B-int4-AutoRound', num_spec_tokens=2), tokenizer='Intel/Qwen3.5-35B-A3B-int4-AutoRound', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=fastsafetensors, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=inc, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=True, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=1337, served_model_name=qwen3.5-35b, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '/root/.cache/vllm/torch_compile_cache/3f6fb0b944', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [3, 6, 9, 18, 24, 33, 42, 48], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 48, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': '/root/.cache/vllm/torch_compile_cache/3f6fb0b944/rank_0_0/eagle_head', 'fast_moe_cold_start': False, 'static_all_moe_layers': ['language_model.model.layers.0.mlp.experts', 'language_model.model.layers.1.mlp.experts', 'language_model.model.layers.2.mlp.experts', 'language_model.model.layers.3.mlp.experts', 'language_model.model.layers.4.mlp.experts', 'language_model.model.layers.5.mlp.experts', 'language_model.model.layers.6.mlp.experts', 'language_model.model.layers.7.mlp.experts', 'language_model.model.layers.8.mlp.experts', 'language_model.model.layers.9.mlp.experts', 'language_model.model.layers.10.mlp.experts', 'language_model.model.layers.11.mlp.experts', 'language_model.model.layers.12.mlp.experts', 'language_model.model.layers.13.mlp.experts', 'language_model.model.layers.14.mlp.experts', 'language_model.model.layers.15.mlp.experts', 'language_model.model.layers.16.mlp.experts', 'language_model.model.layers.17.mlp.experts', 'language_model.model.layers.18.mlp.experts', 'language_model.model.layers.19.mlp.experts', 'language_model.model.layers.20.mlp.experts', 'language_model.model.layers.21.mlp.experts', 'language_model.model.layers.22.mlp.experts', 'language_model.model.layers.23.mlp.experts', 'language_model.model.layers.24.mlp.experts', 'language_model.model.layers.25.mlp.experts', 'language_model.model.layers.26.mlp.experts', 'language_model.model.layers.27.mlp.experts', 'language_model.model.layers.28.mlp.experts', 'language_model.model.layers.29.mlp.experts', 'language_model.model.layers.30.mlp.experts', 'language_model.model.layers.31.mlp.experts', 'language_model.model.layers.32.mlp.experts', 'language_model.model.layers.33.mlp.experts', 'language_model.model.layers.34.mlp.experts', 'language_model.model.layers.35.mlp.experts', 'language_model.model.layers.36.mlp.experts', 'language_model.model.layers.37.mlp.experts', 'language_model.model.layers.38.mlp.experts', 'language_model.model.layers.39.mlp.experts', 'mtp.layers.0.mlp.experts']},
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-bdee390853547987-aad99d2b,prompt_token_ids_len=11,prefill_token_ids_len=None,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[248044], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None),block_ids=([41, 42, 43], [44, 45, 46], [47, 48, 49], [50]),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[],num_computed_tokens=[],num_output_tokens=[]), num_scheduled_tokens={chatcmpl-bdee390853547987-aad99d2b: 11}, total_num_scheduled_tokens=11, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 1], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=False, pending_structured_output_tokens=False, num_invalid_spec_tokens=None, kv_connector_metadata=None, ec_connector_metadata=None)
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.0030826140567200566, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=11, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [core.py:1102] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [core.py:1102] Traceback (most recent call last):
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1093, in run_engine_core
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [core.py:1102]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1128, in run_busy_loop
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [core.py:1102]     self._process_engine_step()
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1165, in _process_engine_step
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [core.py:1102]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [core.py:1102]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 497, in step_with_batch_queue
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [core.py:1102]     model_output = future.result()
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [core.py:1102]                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [core.py:1102]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 456, in result
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [core.py:1102]     return self.__get_result()
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [core.py:1102]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [core.py:1102]     raise self._exception
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [core.py:1102]   File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [core.py:1102]     result = self.fn(*self.args, **self.kwargs)
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [core.py:1102]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 250, in get_output
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [core.py:1102]     self.async_copy_ready_event.synchronize()
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [core.py:1102] torch.AcceleratorError: CUDA error: an illegal memory access was encountered
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [core.py:1102] Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [core.py:1102] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [core.py:1102] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=124) ERROR 03-13 21:40:41 [core.py:1102] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
Thank you for your consideration. Have a great weekend!
Note: This is a DGX Spark deployment using https://github.com/eugr/spark-vllm-docker with the tf5 variant.
Hi @seanthomaswilliams
I cannot reproduce this issue with the testing command attached in the model card. I'm using an A100 and nightly vLLM.