Trying to serve with vllm, got this error: ValueError: There is no module or parameter named 'model.layers.1.mlp.gate.e_score_correction_bias' in TransformersMoEForCausalLM

#4
by firow2 - opened

WARNING 01-29 01:49:58 [argparse_utils.py:195] With vllm serve, you should provide the model as a positional argument or in a config file instead of via the --model option. The --model option will be removed in v0.13.
(APIServer pid=273) INFO 01-29 01:49:58 [api_server.py:1272] vLLM API server version 0.14.1
(APIServer pid=273) INFO 01-29 01:49:58 [utils.py:263] non-default args: {'model_tag': '/.cache/huggingface/hub/models--unsloth--GLM-4.7-Flash-FP8-Dynamic/snapshots/1174de1393d38e3d30c2882b98eb54fd8c1e9d1f', 'port': 30007, 'enable_auto_tool_choice': True, 'tool_call_parser': 'glm47', 'model': '/.cache/huggingface/hub/models--unsloth--GLM-4.7-Flash-FP8-Dynamic/snapshots/1174de1393d38e3d30c2882b98eb54fd8c1e9d1f', 'dtype': 'bfloat16', 'seed': 3407, 'max_model_len': 200000, 'served_model_name': ['unsloth/GLM-4.7-Flash'], 'reasoning_parser': 'glm45', 'gpu_memory_utilization': 0.95, 'kv_cache_dtype': 'fp8', 'max_num_batched_tokens': 16384}
(APIServer pid=273) INFO 01-29 01:50:04 [model.py:530] Resolved architecture: TransformersMoEForCausalLM
(APIServer pid=273) INFO 01-29 01:50:04 [model.py:1545] Using max model len 200000
(APIServer pid=273) INFO 01-29 01:50:04 [cache.py:206] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
(APIServer pid=273) INFO 01-29 01:50:05 [scheduler.py:229] Chunked prefill is enabled with max_num_batched_tokens=16384.
(APIServer pid=273) INFO 01-29 01:50:05 [vllm.py:630] Asynchronous scheduling is enabled.
(APIServer pid=273) INFO 01-29 01:50:05 [vllm.py:637] Disabling NCCL for DP synchronization when using async scheduling.
(EngineCore_DP0 pid=368) INFO 01-29 01:50:13 [core.py:97] Initializing a V1 LLM engine (v0.14.1) with config: model='/.cache/huggingface/hub/models--unsloth--GLM-4.7-Flash-FP8-Dynamic/snapshots/1174de1393d38e3d30c2882b98eb54fd8c1e9d1f', speculative_config=None, tokenizer='/.cache/huggingface/hub/models--unsloth--GLM-4.7-Flash-FP8-Dynamic/snapshots/1174de1393d38e3d30c2882b98eb54fd8c1e9d1f', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=200000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='glm45', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=3407, served_model_name=unsloth/GLM-4.7-Flash, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 
'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [16384], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}
(EngineCore_DP0 pid=368) INFO 01-29 01:50:13 [parallel_state.py:1214] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.17.0.3:59363 backend=nccl
(EngineCore_DP0 pid=368) INFO 01-29 01:50:13 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=368) WARNING 01-29 01:50:14 [utils.py:184] TransformersMoEForCausalLM has no vLLM implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
(EngineCore_DP0 pid=368) INFO 01-29 01:50:14 [gpu_model_runner.py:3808] Starting to load model /.cache/huggingface/hub/models--unsloth--GLM-4.7-Flash-FP8-Dynamic/snapshots/1174de1393d38e3d30c2882b98eb54fd8c1e9d1f...
(EngineCore_DP0 pid=368) INFO 01-29 01:50:15 [base.py:134] Using Transformers modeling backend.
(EngineCore_DP0 pid=368) INFO 01-29 01:50:15 [fp8.py:126] DeepGEMM is disabled because the platform does not support it.
(EngineCore_DP0 pid=368) INFO 01-29 01:50:15 [fp8.py:149] Using Triton backend for FP8 MoE
(EngineCore_DP0 pid=368) WARNING 01-29 01:50:15 [compressed_tensors.py:738] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
(EngineCore_DP0 pid=368) INFO 01-29 01:50:16 [cuda.py:351] Using FLASHINFER attention backend out of potential backends: ('FLASHINFER', 'TRITON_ATTN')
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] EngineCore failed to start.
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] Traceback (most recent call last):
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 927, in run_engine_core
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 692, in __init__
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] super().__init__(
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 106, in __init__
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] self._init_executor()
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] self.driver_worker.load_model()
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 274, in load_model
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3827, in load_model
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] self.model = model_loader.load_model(
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 58, in load_model
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] self.load_weights(model, model_config)
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/default_loader.py", line 288, in load_weights
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/models/transformers/base.py", line 492, in load_weights
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/online_quantization.py", line 173, in patched_model_load_weights
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] return original_load_weights(auto_weight_loader, weights, mapper=mapper)
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 335, in load_weights
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] yield from self._load_module(
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] yield from self._load_module(
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] yield from self._load_module(
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] [Previous line repeated 2 more times]
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 319, in _load_module
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] raise ValueError(msg)
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] ValueError: There is no module or parameter named 'model.layers.1.mlp.gate.e_score_correction_bias' in TransformersMoEForCausalLM
(EngineCore_DP0 pid=368) Process EngineCore_DP0:
(EngineCore_DP0 pid=368) Traceback (most recent call last):
(EngineCore_DP0 pid=368) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=368) self.run()
(EngineCore_DP0 pid=368) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=368) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 940, in run_engine_core
(EngineCore_DP0 pid=368) raise e
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 927, in run_engine_core
(EngineCore_DP0 pid=368) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=368) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 692, in __init__
(EngineCore_DP0 pid=368) super().__init__(
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 106, in __init__
(EngineCore_DP0 pid=368) self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=368) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=368) self._init_executor()
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=368) self.driver_worker.load_model()
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 274, in load_model
(EngineCore_DP0 pid=368) self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3827, in load_model
(EngineCore_DP0 pid=368) self.model = model_loader.load_model(
(EngineCore_DP0 pid=368) ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 58, in load_model
(EngineCore_DP0 pid=368) self.load_weights(model, model_config)
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/default_loader.py", line 288, in load_weights
(EngineCore_DP0 pid=368) loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(EngineCore_DP0 pid=368) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/models/transformers/base.py", line 492, in load_weights
(EngineCore_DP0 pid=368) return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
(EngineCore_DP0 pid=368) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/online_quantization.py", line 173, in patched_model_load_weights
(EngineCore_DP0 pid=368) return original_load_weights(auto_weight_loader, weights, mapper=mapper)
(EngineCore_DP0 pid=368) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 335, in load_weights
(EngineCore_DP0 pid=368) autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=368) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=368) yield from self._load_module(
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=368) yield from self._load_module(
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=368) yield from self._load_module(
(EngineCore_DP0 pid=368) [Previous line repeated 2 more times]
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 319, in _load_module
(EngineCore_DP0 pid=368) raise ValueError(msg)
(EngineCore_DP0 pid=368) ValueError: There is no module or parameter named 'model.layers.1.mlp.gate.e_score_correction_bias' in TransformersMoEForCausalLM
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
(EngineCore_DP0 pid=368)
[rank0]:[W129 01:50:17.708180018 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank0]:[W129 01:50:18.261641129 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())
(APIServer pid=273) Traceback (most recent call last):
(APIServer pid=273) File "/vllm-workspace/.venv/bin/vllm", line 10, in <module>
(APIServer pid=273) sys.exit(main())
(APIServer pid=273) ^^^^^^
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=273) args.dispatch_function(args)
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 60, in cmd
(APIServer pid=273) uvloop.run(run_server(args))
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=273) return __asyncio.run(
(APIServer pid=273) ^^^^^^^^^^^^^^
(APIServer pid=273) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=273) return runner.run(main)
(APIServer pid=273) ^^^^^^^^^^^^^^^^
(APIServer pid=273) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=273) return self._loop.run_until_complete(task)
(APIServer pid=273) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=273) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=273) return await main
(APIServer pid=273) ^^^^^^^^^^
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1319, in run_server
(APIServer pid=273) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1338, in run_server_worker
(APIServer pid=273) async with build_async_engine_client(
(APIServer pid=273) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=273) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=273) return await anext(self.gen)
(APIServer pid=273) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 173, in build_async_engine_client
(APIServer pid=273) async with build_async_engine_client_from_engine_args(
(APIServer pid=273) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=273) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=273) return await anext(self.gen)
(APIServer pid=273) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 214, in build_async_engine_client_from_engine_args
(APIServer pid=273) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=273) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 205, in from_vllm_config
(APIServer pid=273) return cls(
(APIServer pid=273) ^^^^
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 132, in __init__
(APIServer pid=273) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=273) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 122, in make_async_mp_client
(APIServer pid=273) return AsyncMPClient(*client_args)
(APIServer pid=273) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 824, in __init__
(APIServer pid=273) super().__init__(
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 479, in __init__
(APIServer pid=273) with launch_core_engines(vllm_config, executor_class, log_stats) as (
(APIServer pid=273) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=273) File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=273) next(self.gen)
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 921, in launch_core_engines
(APIServer pid=273) wait_for_engine_startup(
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 980, in wait_for_engine_startup
(APIServer pid=273) raise RuntimeError(
(APIServer pid=273) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
[W129 01:50:19.889487332 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())

I followed the guide at https://unsloth.ai/docs/models/glm-4.7-flash#glm-4.7-flash-in-vllm
The transformers version in my environment is the latest: 5.0.1.dev0
Did I do something wrong?


My GPU is an NVIDIA L40S.

I had the same issue and fixed it by updating vllm (nightly) and transformers:

pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
pip install git+https://github.com/huggingface/transformers.git
Name: transformers
Version: 5.0.1.dev0
---
Name: vllm
Version: 0.16.0rc1.dev4+g8bfc8d560
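
To confirm that the upgrade took effect in the same environment that launches the server, the installed versions can be read from Python. This is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for illustration. The original report ran vLLM 0.14.1, while the versions listed above are from the nightly build.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("vllm", "transformers")):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Run this with the same interpreter that runs `vllm serve`.
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

If the printed vllm version is still 0.14.1, the upgrade probably went into a different virtual environment or container image.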

Thanks, but now I get a different error:

ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.bfloat16, kv_cache_dtype=fp8, block_size=None, use_mla=True, has_sink=False, use_sparse=False, use_mm_prefix=False, attn_type=AttentionType.DECODER). Reasons: {FLASH_ATTN_MLA: [kv_cache_dtype not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [compute capability not supported, FlashMLA Dense is only supported on Hopper devices.], FLASHINFER_MLA: [compute capability not supported, FlashInfer MLA kernel requires qk_nope_head_dim == 128, but got 192], TRITON_MLA: [kv_cache_dtype not supported], FLASHMLA_SPARSE: [kv_cache_dtype not supported, non-sparse not supported, compute capability not supported]}

It seems fp8 is not yet supported on the L40S. Is there any solution?
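
The "Reasons" part of that error can be read per backend. The sketch below copies the reasons from the message above and looks for backends whose only objection is the KV cache dtype; the helper names `parse_reasons` and `blocked_only_by` are made up for illustration. It is only a reading of the log, not a tested fix.

```python
import re

# Reasons section copied from the ValueError above.
REASONS = (
    "FLASH_ATTN_MLA: [kv_cache_dtype not supported, compute capability not supported, "
    "FlashAttention MLA not supported on this device], "
    "FLASHMLA: [compute capability not supported, "
    "FlashMLA Dense is only supported on Hopper devices.], "
    "FLASHINFER_MLA: [compute capability not supported, "
    "FlashInfer MLA kernel requires qk_nope_head_dim == 128, but got 192], "
    "TRITON_MLA: [kv_cache_dtype not supported], "
    "FLASHMLA_SPARSE: [kv_cache_dtype not supported, non-sparse not supported, "
    "compute capability not supported]"
)


def parse_reasons(text):
    """Map each backend name to the raw text inside its [...] reason list."""
    return {name: body.strip() for name, body in re.findall(r"(\w+): \[([^\]]*)\]", text)}


def blocked_only_by(reasons, reason):
    """Backends whose entire reason list is exactly the given reason."""
    return [name for name, body in reasons.items() if body == reason]


if __name__ == "__main__":
    print(blocked_only_by(parse_reasons(REASONS), "kv_cache_dtype not supported"))
```

TRITON_MLA is the only backend rejected solely because of `kv_cache_dtype`, so leaving `kv_cache_dtype` at its default instead of `fp8` may allow that backend to be selected on this GPU. That is an assumption drawn from the message, and an unquantized KV cache needs more memory, so the 200000-token `max_model_len` might no longer fit.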
