Instructions to use unsloth/GLM-4.7-Flash-FP8-Dynamic with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use unsloth/GLM-4.7-Flash-FP8-Dynamic with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="unsloth/GLM-4.7-Flash-FP8-Dynamic")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("unsloth/GLM-4.7-Flash-FP8-Dynamic")
model = AutoModelForCausalLM.from_pretrained("unsloth/GLM-4.7-Flash-FP8-Dynamic")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use unsloth/GLM-4.7-Flash-FP8-Dynamic with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "unsloth/GLM-4.7-Flash-FP8-Dynamic"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "unsloth/GLM-4.7-Flash-FP8-Dynamic",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
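The same request can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `send_chat_request` are hypothetical, and the response layout (`choices[0].message.content`) is assumed to follow the usual OpenAI-compatible chat completion shape. The default base URL matches the vLLM example above; for the SGLang examples, pass `base_url="http://localhost:30000"`.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="unsloth/GLM-4.7-Flash-FP8-Dynamic"):
    """Build (but do not send) a POST request equivalent to the curl call."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request to a running server and return the reply text.

    Assumes an OpenAI-compatible response body.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, `send_chat_request(build_chat_request("What is the capital of France?"))` should return the model's answer as a string.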

Use Docker

docker model run hf.co/unsloth/GLM-4.7-Flash-FP8-Dynamic

SGLang

How to use unsloth/GLM-4.7-Flash-FP8-Dynamic with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "unsloth/GLM-4.7-Flash-FP8-Dynamic" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "unsloth/GLM-4.7-Flash-FP8-Dynamic",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "unsloth/GLM-4.7-Flash-FP8-Dynamic" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "unsloth/GLM-4.7-Flash-FP8-Dynamic",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Unsloth Studio

How to use unsloth/GLM-4.7-Flash-FP8-Dynamic with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for unsloth/GLM-4.7-Flash-FP8-Dynamic to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for unsloth/GLM-4.7-Flash-FP8-Dynamic to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for unsloth/GLM-4.7-Flash-FP8-Dynamic to start chatting

Load model with FastModel

# Install Unsloth from pip:
pip install unsloth
# Then load the model in Python:
from unsloth import FastModel
model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/GLM-4.7-Flash-FP8-Dynamic",
    max_seq_length=2048,
)

Docker Model Runner
How to use unsloth/GLM-4.7-Flash-FP8-Dynamic with Docker Model Runner:
```
docker model run hf.co/unsloth/GLM-4.7-Flash-FP8-Dynamic
```

Trying to serve with vllm, got this error: ValueError: There is no module or parameter named 'model.layers.1.mlp.gate.e_score_correction_bias' in TransformersMoEForCausalLM

by firow2 - opened Jan 29

Discussion

firow2

Jan 29

WARNING 01-29 01:49:58 [argparse_utils.py:195] With vllm serve, you should provide the model as a positional argument or in a config file instead of via the --model option. The --model option will be removed in v0.13.
(APIServer pid=273) INFO 01-29 01:49:58 [api_server.py:1272] vLLM API server version 0.14.1
(APIServer pid=273) INFO 01-29 01:49:58 [utils.py:263] non-default args: {'model_tag': '/.cache/huggingface/hub/models--unsloth--GLM-4.7-Flash-FP8-Dynamic/snapshots/1174de1393d38e3d30c2882b98eb54fd8c1e9d1f', 'port': 30007, 'enable_auto_tool_choice': True, 'tool_call_parser': 'glm47', 'model': '/.cache/huggingface/hub/models--unsloth--GLM-4.7-Flash-FP8-Dynamic/snapshots/1174de1393d38e3d30c2882b98eb54fd8c1e9d1f', 'dtype': 'bfloat16', 'seed': 3407, 'max_model_len': 200000, 'served_model_name': ['unsloth/GLM-4.7-Flash'], 'reasoning_parser': 'glm45', 'gpu_memory_utilization': 0.95, 'kv_cache_dtype': 'fp8', 'max_num_batched_tokens': 16384}
(APIServer pid=273) INFO 01-29 01:50:04 [model.py:530] Resolved architecture: TransformersMoEForCausalLM
(APIServer pid=273) INFO 01-29 01:50:04 [model.py:1545] Using max model len 200000
(APIServer pid=273) INFO 01-29 01:50:04 [cache.py:206] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
(APIServer pid=273) INFO 01-29 01:50:05 [scheduler.py:229] Chunked prefill is enabled with max_num_batched_tokens=16384.
(APIServer pid=273) INFO 01-29 01:50:05 [vllm.py:630] Asynchronous scheduling is enabled.
(APIServer pid=273) INFO 01-29 01:50:05 [vllm.py:637] Disabling NCCL for DP synchronization when using async scheduling.
(EngineCore_DP0 pid=368) INFO 01-29 01:50:13 [core.py:97] Initializing a V1 LLM engine (v0.14.1) with config: model='/.cache/huggingface/hub/models--unsloth--GLM-4.7-Flash-FP8-Dynamic/snapshots/1174de1393d38e3d30c2882b98eb54fd8c1e9d1f', speculative_config=None, tokenizer='/.cache/huggingface/hub/models--unsloth--GLM-4.7-Flash-FP8-Dynamic/snapshots/1174de1393d38e3d30c2882b98eb54fd8c1e9d1f', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=200000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='glm45', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=3407, served_model_name=unsloth/GLM-4.7-Flash, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 
'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [16384], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}
(EngineCore_DP0 pid=368) INFO 01-29 01:50:13 [parallel_state.py:1214] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.17.0.3:59363 backend=nccl
(EngineCore_DP0 pid=368) INFO 01-29 01:50:13 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=368) WARNING 01-29 01:50:14 [utils.py:184] TransformersMoEForCausalLM has no vLLM implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
(EngineCore_DP0 pid=368) INFO 01-29 01:50:14 [gpu_model_runner.py:3808] Starting to load model /.cache/huggingface/hub/models--unsloth--GLM-4.7-Flash-FP8-Dynamic/snapshots/1174de1393d38e3d30c2882b98eb54fd8c1e9d1f...
(EngineCore_DP0 pid=368) INFO 01-29 01:50:15 [base.py:134] Using Transformers modeling backend.
(EngineCore_DP0 pid=368) INFO 01-29 01:50:15 [fp8.py:126] DeepGEMM is disabled because the platform does not support it.
(EngineCore_DP0 pid=368) INFO 01-29 01:50:15 [fp8.py:149] Using Triton backend for FP8 MoE
(EngineCore_DP0 pid=368) WARNING 01-29 01:50:15 [compressed_tensors.py:738] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
(EngineCore_DP0 pid=368) INFO 01-29 01:50:16 [cuda.py:351] Using FLASHINFER attention backend out of potential backends: ('FLASHINFER', 'TRITON_ATTN')
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] EngineCore failed to start.
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] Traceback (most recent call last):
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 927, in run_engine_core
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 692, in __init__
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] super().__init__(
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 106, in __init__
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] self._init_executor()
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] self.driver_worker.load_model()
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 274, in load_model
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3827, in load_model
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] self.model = model_loader.load_model(
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 58, in load_model
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] self.load_weights(model, model_config)
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/default_loader.py", line 288, in load_weights
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/models/transformers/base.py", line 492, in load_weights
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/online_quantization.py", line 173, in patched_model_load_weights
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] return original_load_weights(auto_weight_loader, weights, mapper=mapper)
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 335, in load_weights
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] yield from self._load_module(
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] yield from self._load_module(
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] yield from self._load_module(
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] [Previous line repeated 2 more times]
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 319, in _load_module
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] raise ValueError(msg)
(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] ValueError: There is no module or parameter named 'model.layers.1.mlp.gate.e_score_correction_bias' in TransformersMoEForCausalLM
(EngineCore_DP0 pid=368) Process EngineCore_DP0:
(EngineCore_DP0 pid=368) Traceback (most recent call last):
(EngineCore_DP0 pid=368) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=368) self.run()
(EngineCore_DP0 pid=368) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=368) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 940, in run_engine_core
(EngineCore_DP0 pid=368) raise e
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 927, in run_engine_core
(EngineCore_DP0 pid=368) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=368) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 692, in __init__
(EngineCore_DP0 pid=368) super().__init__(
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 106, in __init__
(EngineCore_DP0 pid=368) self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=368) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=368) self._init_executor()
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=368) self.driver_worker.load_model()
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 274, in load_model
(EngineCore_DP0 pid=368) self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3827, in load_model
(EngineCore_DP0 pid=368) self.model = model_loader.load_model(
(EngineCore_DP0 pid=368) ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 58, in load_model
(EngineCore_DP0 pid=368) self.load_weights(model, model_config)
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/default_loader.py", line 288, in load_weights
(EngineCore_DP0 pid=368) loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(EngineCore_DP0 pid=368) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/models/transformers/base.py", line 492, in load_weights
(EngineCore_DP0 pid=368) return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
(EngineCore_DP0 pid=368) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/online_quantization.py", line 173, in patched_model_load_weights
(EngineCore_DP0 pid=368) return original_load_weights(auto_weight_loader, weights, mapper=mapper)
(EngineCore_DP0 pid=368) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 335, in load_weights
(EngineCore_DP0 pid=368) autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=368) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=368) yield from self._load_module(
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=368) yield from self._load_module(
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=368) yield from self._load_module(
(EngineCore_DP0 pid=368) [Previous line repeated 2 more times]
(EngineCore_DP0 pid=368) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 319, in _load_module
(EngineCore_DP0 pid=368) raise ValueError(msg)
(EngineCore_DP0 pid=368) ValueError: There is no module or parameter named 'model.layers.1.mlp.gate.e_score_correction_bias' in TransformersMoEForCausalLM
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
(EngineCore_DP0 pid=368)
[rank0]:[W129 01:50:17.708180018 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank0]:[W129 01:50:18.261641129 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())
(APIServer pid=273) Traceback (most recent call last):
(APIServer pid=273) File "/vllm-workspace/.venv/bin/vllm", line 10, in <module>
(APIServer pid=273) sys.exit(main())
(APIServer pid=273) ^^^^^^
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=273) args.dispatch_function(args)
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 60, in cmd
(APIServer pid=273) uvloop.run(run_server(args))
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=273) return __asyncio.run(
(APIServer pid=273) ^^^^^^^^^^^^^^
(APIServer pid=273) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=273) return runner.run(main)
(APIServer pid=273) ^^^^^^^^^^^^^^^^
(APIServer pid=273) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=273) return self._loop.run_until_complete(task)
(APIServer pid=273) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=273) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=273) return await main
(APIServer pid=273) ^^^^^^^^^^
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1319, in run_server
(APIServer pid=273) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1338, in run_server_worker
(APIServer pid=273) async with build_async_engine_client(
(APIServer pid=273) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=273) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=273) return await anext(self.gen)
(APIServer pid=273) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 173, in build_async_engine_client
(APIServer pid=273) async with build_async_engine_client_from_engine_args(
(APIServer pid=273) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=273) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=273) return await anext(self.gen)
(APIServer pid=273) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 214, in build_async_engine_client_from_engine_args
(APIServer pid=273) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=273) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 205, in from_vllm_config
(APIServer pid=273) return cls(
(APIServer pid=273) ^^^^
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 132, in __init__
(APIServer pid=273) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=273) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 122, in make_async_mp_client
(APIServer pid=273) return AsyncMPClient(*client_args)
(APIServer pid=273) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 824, in __init__
(APIServer pid=273) super().__init__(
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 479, in __init__
(APIServer pid=273) with launch_core_engines(vllm_config, executor_class, log_stats) as (
(APIServer pid=273) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=273) File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=273) next(self.gen)
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 921, in launch_core_engines
(APIServer pid=273) wait_for_engine_startup(
(APIServer pid=273) File "/vllm-workspace/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 980, in wait_for_engine_startup
(APIServer pid=273) raise RuntimeError(
(APIServer pid=273) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
[W129 01:50:19.889487332 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())
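Most of the log above is stack trace; a handful of lines carry the diagnosis. The following is a minimal sketch of a helper that pulls them out. The function name `summarize_vllm_log` and the summary keys are hypothetical, and the patterns are written against the exact message wording in this log, so they may need adjusting for other vLLM versions.

```python
import re


def summarize_vllm_log(lines):
    """Collect a few diagnostic facts from vLLM startup log lines."""
    summary = {
        "version": None,
        "architecture": None,
        "transformers_fallback": False,
        "root_error": None,
    }
    for line in lines:
        match = re.search(r"vLLM API server version (\S+)", line)
        if match:
            summary["version"] = match.group(1)
        match = re.search(r"Resolved architecture: (\S+)", line)
        if match:
            summary["architecture"] = match.group(1)
        if "falling back to Transformers implementation" in line:
            summary["transformers_fallback"] = True
        # Keep only the first ValueError; later copies repeat the same message.
        match = re.search(r"\b(ValueError: .*)$", line)
        if match and summary["root_error"] is None:
            summary["root_error"] = match.group(1)
    return summary


# Four lines copied from the log above.
SAMPLE = [
    "(APIServer pid=273) INFO 01-29 01:49:58 [api_server.py:1272] vLLM API server version 0.14.1",
    "(APIServer pid=273) INFO 01-29 01:50:04 [model.py:530] Resolved architecture: TransformersMoEForCausalLM",
    "(EngineCore_DP0 pid=368) WARNING 01-29 01:50:14 [utils.py:184] TransformersMoEForCausalLM has no vLLM implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.",
    "(EngineCore_DP0 pid=368) ERROR 01-29 01:50:16 [core.py:936] ValueError: There is no module or parameter named 'model.layers.1.mlp.gate.e_score_correction_bias' in TransformersMoEForCausalLM",
]

print(summarize_vllm_log(SAMPLE))
```

For this log the summary shows vLLM 0.14.1 resolving the model to the generic `TransformersMoEForCausalLM` fallback rather than a native implementation, and the weight-loading `ValueError` is raised inside that fallback path.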

firow2

Jan 29

I am following the guide at https://unsloth.ai/docs/models/glm-4.7-flash#glm-4.7-flash-in-vllm
The transformers version in my environment is the latest: 5.0.1.dev0
Did I do something wrong?

firow2 changed discussion title from ValueError: There is no module or parameter named 'model.layers.1.mlp.gate.e_score_correction_bias' in TransformersMoEForCausalLM to Trying to serve with vllm, got this error: ValueError: There is no module or parameter named 'model.layers.1.mlp.gate.e_score_correction_bias' in TransformersMoEForCausalLM Jan 29

firow2

Jan 29

My GPU is an NVIDIA L40S.

KT313

Jan 30

Had the same issue; fixed it by updating vLLM (nightly) and transformers:

pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
pip install git+https://github.com/huggingface/transformers.git

KT313

Jan 30

edited Jan 30

Name: transformers
Version: 5.0.1.dev0
---
Name: vllm
Version: 0.16.0rc1.dev4+g8bfc8d560
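The same check can be done from inside the Python environment that will run the server, which helps rule out a mismatch between the shell's `pip` and the interpreter in use. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("vllm", "transformers")):
    """Return {package name: installed version, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


print(installed_versions())
```

A value of `None` means the package is not installed in the interpreter that ran the snippet.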

firow2

Jan 30

Thanks, but I get another error:

ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.bfloat16, kv_cache_dtype=fp8, block_size=None, use_mla=True, has_sink=False, use_sparse=False, use_mm_prefix=False, attn_type=AttentionType.DECODER). Reasons: {FLASH_ATTN_MLA: [kv_cache_dtype not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [compute capability not supported, FlashMLA Dense is only supported on Hopper devices.], FLASHINFER_MLA: [compute capability not supported, FlashInfer MLA kernel requires qk_nope_head_dim == 128, but got 192], TRITON_MLA: [kv_cache_dtype not supported], FLASHMLA_SPARSE: [kv_cache_dtype not supported, non-sparse not supported, compute capability not supported]}

It seems FP8 is not supported on the L40S. Is there any solution?
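The `Reasons` part of the error lists, per attention backend, why it was rejected. The following is a minimal sketch that parses that text and picks out backends whose only listed reason is the KV cache dtype. The function names are hypothetical, and the parsing is written against the exact formatting in this error message.

```python
import re

# The "Reasons" portion of the error above, copied verbatim.
ERROR_REASONS = (
    "{FLASH_ATTN_MLA: [kv_cache_dtype not supported, compute capability not supported, "
    "FlashAttention MLA not supported on this device], "
    "FLASHMLA: [compute capability not supported, "
    "FlashMLA Dense is only supported on Hopper devices.], "
    "FLASHINFER_MLA: [compute capability not supported, "
    "FlashInfer MLA kernel requires qk_nope_head_dim == 128, but got 192], "
    "TRITON_MLA: [kv_cache_dtype not supported], "
    "FLASHMLA_SPARSE: [kv_cache_dtype not supported, non-sparse not supported, "
    "compute capability not supported]}"
)


def parse_reasons(text):
    """Map each backend name to its list of rejection reasons.

    Reasons are split on commas, so a reason that itself contains a comma
    ends up as two fragments. That does not affect the check below.
    """
    reasons = {}
    for name, body in re.findall(r"(\w+): \[([^\]]*)\]", text):
        reasons[name] = [part.strip() for part in body.split(",")]
    return reasons


def blocked_only_by_kv_cache(reasons):
    """Backends whose single listed reason is the KV cache dtype."""
    return [
        name
        for name, items in reasons.items()
        if items == ["kv_cache_dtype not supported"]
    ]


print(blocked_only_by_kv_cache(parse_reasons(ERROR_REASONS)))
```

On this error text, `TRITON_MLA` is the only backend whose sole listed reason is `kv_cache_dtype not supported`. Since the serve arguments in the first log set `'kv_cache_dtype': 'fp8'`, this suggests that the FP8 KV cache setting, rather than the FP8 weights, is what rules that backend out. Whether serving without that setting works on an L40S was not tested in this thread.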
