Can't deploy with vLLM 0.14.1 + transformers

#6
by Butterfly-314 - opened

Environment: A30 + CUDA 12.2

vllm serve ./models/GadflyII/GLM-4.7-Flash-NVFP4/ --max-model-len 4096 --max-num-seqs 1 --port 8080 --served-model-name GLM-4.7-Flash-FP4

ValueError: There is no module or parameter named 'model.layers.39.mlp.gate.e_score_correction_bias' in TransformersMoEForCausalLM

full output:

(APIServer pid=811283) INFO 01-25 22:59:57 [model.py:530] Resolved architecture: TransformersMoEForCausalLM
(APIServer pid=811283) INFO 01-25 22:59:57 [model.py:1545] Using max model len 4096
(APIServer pid=811283) INFO 01-25 22:59:57 [scheduler.py:229] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=811283) INFO 01-25 22:59:57 [vllm.py:630] Asynchronous scheduling is enabled.
(APIServer pid=811283) INFO 01-25 22:59:57 [vllm.py:637] Disabling NCCL for DP synchronization when using async scheduling.
(EngineCore_DP0 pid=811974) INFO 01-25 23:00:09 [core.py:97] Initializing a V1 LLM engine (v0.14.1) with config: model='./models/GadflyII/GLM-4.7-Flash-NVFP4/', speculative_config=None, tokenizer='./models/GadflyII/GLM-4.7-Flash-NVFP4/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=GLM-4.7-Flash-FP4, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 
'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 2, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}

(EngineCore_DP0 pid=811974) INFO 01-25 23:00:09 [parallel_state.py:1214] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.16.0.52:39475 backend=nccl
(EngineCore_DP0 pid=811974) INFO 01-25 23:00:09 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=811974) WARNING 01-25 23:00:10 [utils.py:184] TransformersMoEForCausalLM has no vLLM implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
(EngineCore_DP0 pid=811974) INFO 01-25 23:00:10 [gpu_model_runner.py:3808] Starting to load model ./models/GadflyII/GLM-4.7-Flash-NVFP4/...
(EngineCore_DP0 pid=811974) INFO 01-25 23:00:10 [base.py:134] Using Transformers modeling backend.
(EngineCore_DP0 pid=811974) INFO 01-25 23:00:11 [nvfp4.py:110] Using vLLM MARLIN backend for NvFp4 MoE
(EngineCore_DP0 pid=811974) WARNING 01-25 23:00:11 [compressed_tensors.py:738] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
(EngineCore_DP0 pid=811974) WARNING 01-25 23:00:11 [compressed_tensors.py:617] Current platform does not support cutlass NVFP4. Running CompressedTensorsW4A16Fp4.
(EngineCore_DP0 pid=811974) INFO 01-25 23:00:11 [cuda.py:351] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936] EngineCore failed to start.
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936] Traceback (most recent call last):
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 927, in run_engine_core
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 692, in __init__
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     super().__init__(
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 106, in __init__
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     self._init_executor()
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     self.driver_worker.load_model()
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 274, in load_model
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3827, in load_model
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     self.model = model_loader.load_model(
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/model_executor/model_loader/base_loader.py", line 58, in load_model
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     self.load_weights(model, model_config)
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/model_executor/model_loader/default_loader.py", line 288, in load_weights
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/model_executor/models/transformers/base.py", line 492, in load_weights
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/model_executor/model_loader/online_quantization.py", line 173, in patched_model_load_weights
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     return original_load_weights(auto_weight_loader, weights, mapper=mapper)
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 335, in load_weights
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     yield from self._load_module(
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     yield from self._load_module(
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     yield from self._load_module(
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   [Previous line repeated 2 more times]
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 319, in _load_module
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     raise ValueError(msg)
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936] ValueError: There is no module or parameter named 'model.layers.39.mlp.gate.e_score_correction_bias' in TransformersMoEForCausalLM

I have the same problem, please help.

Shouldn't vLLM be using Glm4MoeLiteForCausalLM instead of TransformersMoEForCausalLM? Check your config.json.
FYI, I had to update my transformers to 5.0.0rc3 to get the model to run. But it's working now.

It is Glm4MoeLiteForCausalLM, any idea?
config.json

{
  "architectures": [
    "Glm4MoeLiteForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "pad_token_id": 154820,
  "eos_token_id": [
    154820,
    154827,
    154829
  ],
  "hidden_act": "silu",
  "hidden_size": 2048,
}
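For anyone double-checking their own download, the architecture name can be read straight out of the model directory; a minimal sketch (the path argument is a placeholder for wherever your copy lives):

```python
import json

def read_architectures(config_path):
    """Return the "architectures" list from a model's config.json."""
    with open(config_path) as f:
        config = json.load(f)
    return config.get("architectures", [])
```

Pointed at the directory above, this should return `['Glm4MoeLiteForCausalLM']`; if it returns something else or the file is missing, the download is incomplete.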

Sounds like your version of transformers needs to be updated:

pip install transformers==5.0.0

The 5.0.0 release just dropped a couple hours ago.

Per the model card:

Requirements:
vLLM: 0.14.0+ (for MXFP4 Marlin backend support)
transformers: 5.0.0+ (for glm4_moe_lite architecture)

Upgrade your transformers to 5.0.0+
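If you want to verify the installed version against the requirement before relaunching, a plain string comparison is enough; a quick sketch (no transformers import needed, and pre-release suffixes like rc3 are deliberately ignored, which is a simplification of real version-ordering rules):

```python
import re

def version_tuple(version: str) -> tuple:
    """Parse "5.0.0" or "5.0.0rc3" into a comparable tuple of ints.

    Pre-release suffixes (rc, beta, ...) are dropped, so "5.0.0rc3"
    compares equal to "5.0.0" -- a simplification, but good enough
    for a quick minimum-version check.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)

def meets_requirement(installed: str, required: str) -> bool:
    return version_tuple(installed) >= version_tuple(required)
```

For example, `meets_requirement("4.57.1", "5.0.0")` is False, which is the situation that produces the error in this thread. The installed version comes from `transformers.__version__`.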

GadflyII changed discussion status to closed
GadflyII changed discussion status to open

Hugging Face has to add support for new models in the transformers library when they're released. If your version of transformers predates the GLM-4.7 release, the necessary torch subclasses won't be in there. The GLM-4.7 classes are in 5.0.0, so use that version or later.
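The warning in the log ("TransformersMoEForCausalLM has no vLLM implementation, falling back to Transformers implementation") is consistent with a registry lookup plus fallback. A hypothetical sketch of that logic follows; the function and registry contents are invented for illustration and are not vLLM's actual internals:

```python
# Hypothetical sketch of how the fallback warning in the log can arise.
# NATIVE_IMPLS contents are invented; vLLM's real registry is much larger.
NATIVE_IMPLS = {"LlamaForCausalLM", "Qwen2ForCausalLM"}

def resolve_backend(architecture: str, in_transformers: bool) -> str:
    """Pick a model class for the architecture named in config.json."""
    if architecture in NATIVE_IMPLS:
        return architecture                  # vLLM-native implementation
    if in_transformers:
        # No native class: fall back to the generic Transformers backend,
        # which is why the engine logs a different class name than config.json.
        return "TransformersMoEForCausalLM"
    raise ValueError(f"Unsupported architecture: {architecture}")
```

Under this reading, config.json saying Glm4MoeLiteForCausalLM while the engine logs TransformersMoEForCausalLM is not itself the bug; the generic backend just reports its own class name. The actual failure is the missing `e_score_correction_bias` parameter, which the fallback backend's weight mapping doesn't know about until transformers ships the real glm4_moe_lite classes.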

vLLM 0.14.1
transformers 5.0.0
same issue

@gbj231321 Exactly the same issue? Same error?

It should be "Glm4MoeLiteForCausalLM", not "TransformersMoEForCausalLM", per the config.json; for some reason vLLM isn't loading it correctly. Verify that you have all the files downloaded, including config.json, etc.

You can try updating vLLM by pulling from main, or you can try my fork of vLLM (which will only exist until the PRs are merged into upstream).

https://github.com/Gadflyii/vllm/

Here is the full launch command I use for single GPU:

NVIDIA_TF32_OVERRIDE=1 \
VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS=1 \
VLLM_FLASH_ATTN_VERSION=2 \
python -m vllm.entrypoints.openai.api_server \
--model PATH \
--tensor-parallel-size 1 \
--trust-remote-code \
--max-model-len NUMBER \
--gpu-memory-utilization 0.95 \
--port 8000

With the new updates from vLLM and transformers 5.0.0 you should see "Glm4MoeLiteForCausalLM", and as of yesterday, it should use the "TRITON_MLA" backend (which makes it a LOT faster).
