NotImplementedError: The class UnquantizedLinearMethod must implement the 'embedding' method, see UnquantizedEmbeddingMethod
vllm 0.24.0 (official docker container)
Changing any of the following did not help:
--quantization or
VLLM_USE_V2_MODEL_RUNNER or
VLLM_USE_FLASHINFER_SAMPLER
Docker log:
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] WorkerProc failed to start.
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] Traceback (most recent call last):
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 865, in worker_main
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] worker = WorkerProc(*args, **kwargs)
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] return func(*args, **kwargs)
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] ^^^^^^^^^^^^^^^^^^^^^
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 634, in __init__
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] self.worker.load_model()
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 384, in load_model
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] return func(*args, **kwargs)
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] ^^^^^^^^^^^^^^^^^^^^^
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5176, in load_model
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] self.model = model_loader.load_model(
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] return func(*args, **kwargs)
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] ^^^^^^^^^^^^^^^^^^^^^
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] model = initialize_model(
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] old_init(self, *args, **kwargs)
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py", line 973, in __init__
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] self.embed_tokens = VocabParallelEmbedding(
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] ^^^^^^^^^^^^^^^^^^^^^^^
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 290, in __init__
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] raise NotImplementedError(
(Worker_PP1 pid=58) ERROR 06-30 07:29:23 [multiproc_executor.py:898] NotImplementedError: The class UnquantizedLinearMethod must implement the 'embedding' method, see UnquantizedEmbeddingMethod.
(Worker_PP0 pid=57) ERROR 06-30 07:29:23 [multiproc_executor.py:898] WorkerProc failed to start.
Sorry for the trouble. There's a bug in vllm/model_executor/layers/quantization/inc/inc.py: the unquantized fallback path unconditionally returns UnquantizedLinearMethod(), which is then incorrectly applied to embedding layers (VocabParallelEmbedding) as well. Please apply this temporary patch for now; we'll get it upstreamed ASAP:
git diff vllm/model_executor/layers/quantization/inc/inc.py
diff --git a/vllm/model_executor/layers/quantization/inc/inc.py b/vllm/model_executor/layers/quantization/inc/inc.py
index 86fa7cefc..2eb3fb205 100644
--- a/vllm/model_executor/layers/quantization/inc/inc.py
+++ b/vllm/model_executor/layers/quantization/inc/inc.py
@@ -155,7 +155,12 @@ class INCConfig(QuantizationConfig):
         ) and self.extra_config[layer_name].get("bits", 16) >= 16:
             if isinstance(layer, RoutedExperts):
                 return UnquantizedFusedMoEMethod(layer.moe_config)
-            return UnquantizedLinearMethod()
+            if isinstance(layer, (LinearBase, ParallelLMHead)):
+                return UnquantizedLinearMethod()
+            # Embedding layers (VocabParallelEmbedding) must not get a
+            # linear method; returning None lets vLLM fall back to
+            # UnquantizedEmbeddingMethod.
+            return None
         layer_config = self.config_parser.resolve(layer, prefix)
         if not layer_config.quantized:
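For anyone wondering why returning None helps, here is a rough, self-contained sketch of the dispatch. It does not import vLLM: every class and function below is a simplified stand-in named after the ones in the traceback and the patch, and build_embedding() is a hypothetical helper that only approximates the check in vocab_parallel_embedding.py, so treat it as an illustration rather than the real code path.

```python
# Stand-ins only: none of these are the real vLLM classes.
class LinearBase: ...
class ParallelLMHead: ...
class VocabParallelEmbedding: ...


class UnquantizedLinearMethod:
    """Stand-in linear method: it has no 'embedding' method."""
    def apply(self, layer, x):
        return x


class UnquantizedEmbeddingMethod:
    """Stand-in embedding method: it provides 'embedding'."""
    def embedding(self, layer, ids):
        return ids


def get_quant_method_buggy(layer):
    # Before the patch: every unquantized layer gets a linear method.
    return UnquantizedLinearMethod()


def get_quant_method_patched(layer):
    # After the patch: only linear-like layers get the linear method.
    if isinstance(layer, (LinearBase, ParallelLMHead)):
        return UnquantizedLinearMethod()
    return None  # the caller falls back to its own default


def build_embedding(layer, get_quant_method):
    """Hypothetical helper approximating the embedding layer's setup check."""
    method = get_quant_method(layer)
    if method is None:
        method = UnquantizedEmbeddingMethod()
    if not hasattr(method, "embedding"):
        raise NotImplementedError(
            f"The class {type(method).__name__} must implement "
            "the 'embedding' method, see UnquantizedEmbeddingMethod."
        )
    return method


if __name__ == "__main__":
    emb = VocabParallelEmbedding()
    try:
        build_embedding(emb, get_quant_method_buggy)
    except NotImplementedError as exc:
        print("buggy path:", exc)
    print("patched path:", type(build_embedding(emb, get_quant_method_patched)).__name__)
```

In this sketch the buggy path hands the embedding layer a linear method and fails with the same message as in the log, while the patched path returns None and the embedding layer picks its own default.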
I can confirm that the above patch fixed the NotImplementedError.
I ran into other, unrelated issues, but those should be fixed later.
Thanks!