ValueError: Unsupported weight strategy=block, supported strategies are [<QuantizationStrategy.CHANNEL: 'channel'>, <QuantizationStrategy.TENSOR: 'tensor'>]

#5
by ablueleaf - opened

I am getting this unsupported weight strategy error when trying to run this model with vLLM 0.13 on A6000s. Is it because the cards do not support this quant?

(EngineCore_DP0 pid=6066) (RayWorkerWrapper pid=520) INFO 01-13 16:37:34 [gpu_model_runner.py:3562] Starting to load model /llm/model...
(EngineCore_DP0 pid=6066) Process EngineCore_DP0:
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] EngineCore failed to start.
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] Traceback (most recent call last):
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 857, in run_engine_core
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 637, in __init__
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     super().__init__(
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     self._init_executor()
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_executor.py", line 97, in _init_executor
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     self._init_workers_ray(placement_group)
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_executor.py", line 371, in _init_workers_ray
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     self.collective_rpc("load_model")
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_executor.py", line 493, in collective_rpc
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     return ray.get(ray_worker_outputs, timeout=timeout)
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     return fn(*args, **kwargs)
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     return func(*args, **kwargs)
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2972, in get
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     values, debugger_breakpoint = worker.get_objects(
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]                                   ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 1031, in get_objects
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     raise value.as_instanceof_cause()
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] ray.exceptions.RayTaskError(ValueError): ray::RayWorkerWrapper.execute_method() (pid=520, ip=vllm-9, actor_id=9049a7c5dd8faf743ff4bc1c04000000, repr=<vllm.v1.executor.ray_utils.RayWorkerWrapper object at 0x7f8837bd9ee0>)
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 345, in execute_method
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     raise e
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 334, in execute_method
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     return run_method(self, method, args, kwargs)
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 461, in run_method
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     return func(*args, **kwargs)
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 289, in load_model
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3581, in load_model
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     self.model = model_loader.load_model(
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 49, in load_model
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     model = initialize_model(
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]             ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 48, in initialize_model
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     return model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/minimax_m2.py", line 497, in __init__
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     self.model = MiniMaxM2Model(
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]                  ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 291, in __init__
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     old_init(self, **kwargs)
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/minimax_m2.py", line 341, in __init__
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     self.start_layer, self.end_layer, self.layers = make_layers(
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]                                                     ^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 606, in make_layers
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/minimax_m2.py", line 343, in <lambda>
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     lambda prefix: MiniMaxM2DecoderLayer(
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]                    ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/minimax_m2.py", line 266, in __init__
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     self.self_attn = MiniMaxM2Attention(
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]                      ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/minimax_m2.py", line 184, in __init__
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     self.qkv_proj = QKVParallelLinear(
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]                     ^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 935, in __init__
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     super().__init__(
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 484, in __init__
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     self.quant_method.create_weights(
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 914, in create_weights
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     layer.scheme.create_weights(
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a16_fp8.py", line 108, in create_weights
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866]     raise ValueError(
(EngineCore_DP0 pid=6066) ERROR 01-13 16:37:35 [core.py:866] ValueError: Unsupported weight strategy=block, supported strategies are [<QuantizationStrategy.CHANNEL: 'channel'>, <QuantizationStrategy.TENSOR: 'tensor'>]
(EngineCore_DP0 pid=6066) Traceback (most recent call last):
(EngineCore_DP0 pid=6066)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=6066)     self.run()
(EngineCore_DP0 pid=6066)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=6066)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 870, in run_engine_core
(EngineCore_DP0 pid=6066)     raise e
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 857, in run_engine_core
(EngineCore_DP0 pid=6066)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=6066)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 637, in __init__
(EngineCore_DP0 pid=6066)     super().__init__(
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=6066)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=6066)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=6066)     self._init_executor()
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_executor.py", line 97, in _init_executor
(EngineCore_DP0 pid=6066)     self._init_workers_ray(placement_group)
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_executor.py", line 371, in _init_workers_ray
(EngineCore_DP0 pid=6066)     self.collective_rpc("load_model")
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_executor.py", line 493, in collective_rpc
(EngineCore_DP0 pid=6066)     return ray.get(ray_worker_outputs, timeout=timeout)
(EngineCore_DP0 pid=6066)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(EngineCore_DP0 pid=6066)     return fn(*args, **kwargs)
(EngineCore_DP0 pid=6066)            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
(EngineCore_DP0 pid=6066)     return func(*args, **kwargs)
(EngineCore_DP0 pid=6066)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2972, in get
(EngineCore_DP0 pid=6066)     values, debugger_breakpoint = worker.get_objects(
(EngineCore_DP0 pid=6066)                                   ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 1031, in get_objects
(EngineCore_DP0 pid=6066)     raise value.as_instanceof_cause()
(EngineCore_DP0 pid=6066) ray.exceptions.RayTaskError(ValueError): ray::RayWorkerWrapper.execute_method() (pid=520, ip=vllm-9, actor_id=9049a7c5dd8faf743ff4bc1c04000000, repr=<vllm.v1.executor.ray_utils.RayWorkerWrapper object at 0x7f8837bd9ee0>)
(EngineCore_DP0 pid=6066)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 345, in execute_method
(EngineCore_DP0 pid=6066)     raise e
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 334, in execute_method
(EngineCore_DP0 pid=6066)     return run_method(self, method, args, kwargs)
(EngineCore_DP0 pid=6066)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 461, in run_method
(EngineCore_DP0 pid=6066)     return func(*args, **kwargs)
(EngineCore_DP0 pid=6066)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 289, in load_model
(EngineCore_DP0 pid=6066)     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3581, in load_model
(EngineCore_DP0 pid=6066)     self.model = model_loader.load_model(
(EngineCore_DP0 pid=6066)                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 49, in load_model
(EngineCore_DP0 pid=6066)     model = initialize_model(
(EngineCore_DP0 pid=6066)             ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 48, in initialize_model
(EngineCore_DP0 pid=6066)     return model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore_DP0 pid=6066)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/minimax_m2.py", line 497, in __init__
(EngineCore_DP0 pid=6066)     self.model = MiniMaxM2Model(
(EngineCore_DP0 pid=6066)                  ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 291, in __init__
(EngineCore_DP0 pid=6066)     old_init(self, **kwargs)
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/minimax_m2.py", line 341, in __init__
(EngineCore_DP0 pid=6066)     self.start_layer, self.end_layer, self.layers = make_layers(
(EngineCore_DP0 pid=6066)                                                     ^^^^^^^^^^^^
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 606, in make_layers
(EngineCore_DP0 pid=6066)     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(EngineCore_DP0 pid=6066)                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/minimax_m2.py", line 343, in <lambda>
(EngineCore_DP0 pid=6066)     lambda prefix: MiniMaxM2DecoderLayer(
(EngineCore_DP0 pid=6066)                    ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/minimax_m2.py", line 266, in __init__
(EngineCore_DP0 pid=6066)     self.self_attn = MiniMaxM2Attention(
(EngineCore_DP0 pid=6066)                      ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/minimax_m2.py", line 184, in __init__
(EngineCore_DP0 pid=6066)     self.qkv_proj = QKVParallelLinear(
(EngineCore_DP0 pid=6066)                     ^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 935, in __init__
(EngineCore_DP0 pid=6066)     super().__init__(
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 484, in __init__
(EngineCore_DP0 pid=6066)     self.quant_method.create_weights(
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 914, in create_weights
(EngineCore_DP0 pid=6066)     layer.scheme.create_weights(
(EngineCore_DP0 pid=6066)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a16_fp8.py", line 108, in create_weights
(EngineCore_DP0 pid=6066)     raise ValueError(
(EngineCore_DP0 pid=6066) ValueError: Unsupported weight strategy=block, supported strategies are [<QuantizationStrategy.CHANNEL: 'channel'>, <QuantizationStrategy.TENSOR: 'tensor'>]
(EngineCore_DP0 pid=6066) INFO 01-13 16:37:35 [ray_executor.py:121] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.
Owner

Your error is ValueError: Unsupported weight strategy=block, supported strategies are [<QuantizationStrategy.CHANNEL: 'channel'>, <QuantizationStrategy.TENSOR: 'tensor'>]

  1. BLOCK quantization is relatively new: it was introduced in December 2024 with DeepSeek and added to quantized MoE weights in September 2025 (https://github.com/vllm-project/vllm/commit/f11e3c516be3d88733ea4b0c79f47e33cc319197). You need a vLLM version released after September 2025.
  2. Do you have an RTX A6000 or an RTX 6000 Ada? The RTX A6000 from the Ampere generation indeed does not have hardware FP8 support, and I don't think vLLM has a fallback path for it.
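
To see which weight strategy a checkpoint declares before loading it, its `config.json` can be inspected. This is a minimal sketch, assuming the compressed-tensors layout in which `quantization_config.config_groups.<group>.weights.strategy` holds the strategy. The helper name `weight_strategies` and the sample config are hypothetical and not taken from this model.

```python
import json


def weight_strategies(config: dict) -> set[str]:
    """Collect the weight quantization strategies declared in a model config.

    Assumes the compressed-tensors layout:
    config["quantization_config"]["config_groups"][name]["weights"]["strategy"].
    """
    groups = config.get("quantization_config", {}).get("config_groups", {})
    strategies = set()
    for group in groups.values():
        weights = group.get("weights") or {}
        strategy = weights.get("strategy")
        if strategy is not None:
            strategies.add(strategy)
    return strategies


# Hypothetical config fragment for illustration, not copied from this model.
sample = json.loads("""
{
  "quantization_config": {
    "quant_method": "compressed-tensors",
    "config_groups": {
      "group_0": {"weights": {"num_bits": 8, "type": "float", "strategy": "block"}},
      "group_1": {"weights": {"num_bits": 4, "type": "int", "strategy": "group"}}
    }
  }
}
""")

print(sorted(weight_strategies(sample)))
```

For a real checkpoint, the same function can be called on the result of `json.load` applied to the model's `config.json`. If `block` shows up, the scheme vLLM selects for your GPU has to accept block-wise scales.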

Yep, I'm on 0.13.0, so it definitely has support for block quantization. It's the Ampere A6000, so yeah, FP8 is not supported. Thanks!

Hi! Would it be too much to kindly ask you for a version of similar size (to fit on 8 x 3090) that would also work on Ampere (CC 8.6 and 8.9)? I know a lot of people would appreciate such a quality model for coding, instead of or alongside Llama. Thank you.

Owner

I can upload the BF16+AWQ base I used to build this FP8+AWQ mixed-precision quant tomorrow. Just be aware that, as the base model was FP8, you spend an extra 3 GB for no extra quality.

vLLM documentation states this:
"FP8 computation is supported on NVIDIA GPUs with compute capability > 8.9 (Ada Lovelace, Hopper). FP8 models will run on compute capability > 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin."
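
The quoted rule can be sketched as a small check. This assumes the thresholds are inclusive (Ada Lovelace is compute capability 8.9 and Ampere starts at 8.0). `fp8_mode` is a hypothetical helper, not a vLLM function. On a machine with PyTorch and a CUDA GPU, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple to pass in.

```python
def fp8_mode(capability: tuple[int, int]) -> str:
    """Classify FP8 support from a CUDA compute capability (major, minor).

    Thresholds follow the quoted documentation, read as inclusive bounds
    (an assumption of this sketch).
    """
    if capability >= (8, 9):
        return "native FP8 compute"
    if capability >= (8, 0):
        return "weight-only W8A16 via FP8 Marlin"
    return "FP8 not supported"


# Compute capabilities of a few cards relevant to this thread.
cards = [
    ("RTX 3090 / RTX A6000", (8, 6)),
    ("RTX 6000 Ada", (8, 9)),
    ("H100", (9, 0)),
]
for name, cc in cards:
    print(f"{name}: {fp8_mode(cc)}")
```

This is consistent with the traceback above: the error is raised from `compressed_tensors_w8a16_fp8.py`, the weight-only scheme, which lists only `channel` and `tensor` as supported strategies.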

So I leave it up to your far better knowledge in quantization to choose which recipe should go in! 🤭

P.S.: I am not able to run this version yet (I'm still waiting for the 8th 3090 to arrive), so I don't know exactly if it would work on CC 8.6. My observation was based on the other owner of an Ampere card (though CC 8.0)!

Owner

It's out: https://huggingface.co/mratsim/MiniMax-M2.1-BF16-INT4-AWQ

I'm so very thankful for your effort! 👍

Thanks, that model is working very well on Ampere A6000s.
