ValueError: Unsupported weight strategy=block, supported strategies are [<QuantizationStrategy.CHANNEL: 'channel'>, <QuantizationStrategy.TENSOR: 'tensor'>]
I am getting this unsupported-weight-strategy error when trying to run this model on vLLM 0.13 with A6000s. Is it because the cards are unsupported for this quant?
(EngineCore_DP0 pid=6066) (RayWorkerWrapper pid=520) INFO 01-13 16:37:34 [gpu_model_runner.py:3562] Starting to load model /llm/model...
(EngineCore_DP0 pid=6066) Process EngineCore_DP0:
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] EngineCore failed to start.
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] Traceback (most recent call last):
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 857, in run_engine_core
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 637, in __init__
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] super().__init__(
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] self._init_executor()
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_executor.py", line 97, in _init_executor
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] self._init_workers_ray(placement_group)
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_executor.py", line 371, in _init_workers_ray
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] self.collective_rpc("load_model")
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_executor.py", line 493, in collective_rpc
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] return ray.get(ray_worker_outputs, timeout=timeout)
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] return fn(*args, **kwargs)
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] return func(*args, **kwargs)
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2972, in get
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] values, debugger_breakpoint = worker.get_objects(
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 1031, in get_objects
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] raise value.as_instanceof_cause()
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] ray.exceptions.RayTaskError(ValueError): ray::RayWorkerWrapper.execute_method() (pid=520, ip=vllm-9, actor_id=9049a7c5dd8faf743ff4bc1c04000000, repr=<vllm.v1.executor.ray_utils.RayWorkerWrapper object at 0x7f8837bd9ee0>)
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 345, in execute_method
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] raise e
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 334, in execute_method
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] return run_method(self, method, args, kwargs)
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 461, in run_method
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] return func(*args, **kwargs)
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 289, in load_model
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3581, in load_model
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] self.model = model_loader.load_model(
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 49, in load_model
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] model = initialize_model(
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 48, in initialize_model
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] return model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/minimax_m2.py", line 497, in __init__
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] self.model = MiniMaxM2Model(
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 291, in __init__
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] old_init(self, **kwargs)
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/minimax_m2.py", line 341, in __init__
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] self.start_layer, self.end_layer, self.layers = make_layers(
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] ^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 606, in make_layers
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/minimax_m2.py", line 343, in <lambda>
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] lambda prefix: MiniMaxM2DecoderLayer(
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/minimax_m2.py", line 266, in __init__
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] self.self_attn = MiniMaxM2Attention(
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/minimax_m2.py", line 184, in __init__
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] self.qkv_proj = QKVParallelLinear(
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] ^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 935, in __init__
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] super().__init__(
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 484, in __init__
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] self.quant_method.create_weights(
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 914, in create_weights
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] layer.scheme.create_weights(
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a16_fp8.py", line 108, in create_weights
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] raise ValueError(
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] ValueError: Unsupported weight strategy=block, supported strategies are [<QuantizationStrategy.CHANNEL: 'channel'>, <QuantizationStrategy.TENSOR: 'tensor'>]
(EngineCore_DP0 pid=6066) INFO 01-13 16:37:35 [ray_executor.py:121] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.
Your error is:
ValueError: Unsupported weight strategy=block, supported strategies are [<QuantizationStrategy.CHANNEL: 'channel'>, <QuantizationStrategy.TENSOR: 'tensor'>]
- BLOCK quantization is relatively new: it was introduced in December 2024 with DeepSeek and only added to quantized MoE weights in September 2025 (https://github.com/vllm-project/vllm/commit/f11e3c516be3d88733ea4b0c79f47e33cc319197), so you need a vLLM version from after September 2025.
- Do you have an RTX A6000 or an RTX A6000 Ada? The RTX A6000 from the Ampere generation indeed does not have hardware FP8 support, and I don't think vLLM has a fallback path for it.
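Before launching vLLM, you can check which weight strategy a quant declares by inspecting its `config.json`. A minimal sketch, assuming the compressed-tensors layout where `config_groups` entries carry a `weights.strategy` field (verify the field names against your actual `config.json`):

```python
def weight_strategies(config: dict) -> set:
    """Collect the weight quantization strategies declared in a
    compressed-tensors style quantization_config (field names assumed)."""
    qc = config.get("quantization_config", {})
    strategies = set()
    for group in qc.get("config_groups", {}).values():
        weights = group.get("weights") or {}
        if "strategy" in weights:
            strategies.add(weights["strategy"])
    return strategies

# Load the real thing with: cfg = json.load(open("config.json"))
# Here, a toy config resembling the failing model:
cfg = {"quantization_config": {"config_groups": {
    "group_0": {"weights": {"num_bits": 8, "type": "float",
                            "strategy": "block"}}}}}
print(weight_strategies(cfg))  # {'block'}
```

If this reports `block` and you are on pre-Ada hardware, you can expect the error above before wasting time on a full model download.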
Yep, I'm on 0.13.0, so definitely with support for block quantization. It's the Ampere A6000, so yeah, FP8 is not supported. Thanks!
Hi! Would it be too much to kindly ask you for a version of similar size (to fit on 8 x 3090) that would also work on Ampere (CC 8.6 and 8.9)? I know a lot of people would appreciate such a quality model for coding, instead of or alongside Llama. Thank you.
I can upload the BF16+AWQ base I used to build this FP8+AWQ mixed-precision quant tomorrow. Just be aware that since the base model was FP8, you spend an extra 3 GB for no extra quality.
vLLM documentation states this:
"FP8 computation is supported on NVIDIA GPUs with compute capability > 8.9 (Ada Lovelace, Hopper). FP8 models will run on compute capability > 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin."
So I leave it up to your far better knowledge of quantization to choose which recipe should go in! 🤭
P.S.: I am not able to run this version (I'm still waiting for the 8th 3090 to arrive), so I don't know whether it would work on CC 8.6; my observation was based on the other owner of an Ampere card (though CC 8.0)!
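The capability cutoffs in the docs quoted above can be sketched as a small lookup. This is my paraphrase, not an exhaustive support matrix; the note that the fallback only handles tensor/channel scales is my reading of the error in this thread, not a statement from the docs:

```python
def fp8_path(major: int, minor: int) -> str:
    """Rough mapping from CUDA compute capability to the FP8 path
    described in the vLLM docs quoted above (assumption-laden sketch)."""
    cc = (major, minor)
    if cc >= (8, 9):  # Ada Lovelace, Hopper and newer
        return "native FP8 (W8A8)"
    if cc >= (8, 0):  # Ampere: A100 (8.0), RTX A6000 / RTX 3090 (8.6)
        return "weight-only W8A16 via FP8 Marlin (tensor/channel scales only)"
    return "FP8 unsupported"

# On a live box you would pass torch.cuda.get_device_capability():
print(fp8_path(8, 6))  # RTX 3090 / Ampere RTX A6000
print(fp8_path(8, 9))  # Ada
```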
It's out: https://huggingface.co/mratsim/MiniMax-M2.1-BF16-INT4-AWQ
I'm so very thankful for your effort! 👍
Thanks, that model works very well on Ampere A6000s.