Can't deploy with vLLM 0.14.1 + transformers

#6
by Butterfly-314 - opened

Environment: A30 + CUDA 12.2

vllm serve ./models/GadflyII/GLM-4.7-Flash-NVFP4/ --max-model-len 4096 --max-num-seqs 1 --port 8080 --served-model-name GLM-4.7-Flash-FP4

ValueError: There is no module or parameter named 'model.layers.39.mlp.gate.e_score_correction_bias' in TransformersMoEForCausalLM

full output:

(APIServer pid=811283) INFO 01-25 22:59:57 [model.py:530] Resolved architecture: TransformersMoEForCausalLM
(APIServer pid=811283) INFO 01-25 22:59:57 [model.py:1545] Using max model len 4096
(APIServer pid=811283) INFO 01-25 22:59:57 [scheduler.py:229] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=811283) INFO 01-25 22:59:57 [vllm.py:630] Asynchronous scheduling is enabled.
(APIServer pid=811283) INFO 01-25 22:59:57 [vllm.py:637] Disabling NCCL for DP synchronization when using async scheduling.
(EngineCore_DP0 pid=811974) INFO 01-25 23:00:09 [core.py:97] Initializing a V1 LLM engine (v0.14.1) with config: model='./models/GadflyII/GLM-4.7-Flash-NVFP4/', speculative_config=None, tokenizer='./models/GadflyII/GLM-4.7-Flash-NVFP4/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=GLM-4.7-Flash-FP4, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 
'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 2, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}

(EngineCore_DP0 pid=811974) INFO 01-25 23:00:09 [parallel_state.py:1214] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.16.0.52:39475 backend=nccl
(EngineCore_DP0 pid=811974) INFO 01-25 23:00:09 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=811974) WARNING 01-25 23:00:10 [utils.py:184] TransformersMoEForCausalLM has no vLLM implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
(EngineCore_DP0 pid=811974) INFO 01-25 23:00:10 [gpu_model_runner.py:3808] Starting to load model ./models/GadflyII/GLM-4.7-Flash-NVFP4/...
(EngineCore_DP0 pid=811974) INFO 01-25 23:00:10 [base.py:134] Using Transformers modeling backend.
(EngineCore_DP0 pid=811974) INFO 01-25 23:00:11 [nvfp4.py:110] Using vLLM MARLIN backend for NvFp4 MoE
(EngineCore_DP0 pid=811974) WARNING 01-25 23:00:11 [compressed_tensors.py:738] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
(EngineCore_DP0 pid=811974) WARNING 01-25 23:00:11 [compressed_tensors.py:617] Current platform does not support cutlass NVFP4. Running CompressedTensorsW4A16Fp4.
(EngineCore_DP0 pid=811974) INFO 01-25 23:00:11 [cuda.py:351] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936] EngineCore failed to start.
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936] Traceback (most recent call last):
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 927, in run_engine_core
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 692, in __init__
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     super().__init__(
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 106, in __init__
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     self._init_executor()
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     self.driver_worker.load_model()
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 274, in load_model
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3827, in load_model
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     self.model = model_loader.load_model(
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/model_executor/model_loader/base_loader.py", line 58, in load_model
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     self.load_weights(model, model_config)
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/model_executor/model_loader/default_loader.py", line 288, in load_weights
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/model_executor/models/transformers/base.py", line 492, in load_weights
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/model_executor/model_loader/online_quantization.py", line 173, in patched_model_load_weights
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     return original_load_weights(auto_weight_loader, weights, mapper=mapper)
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 335, in load_weights
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     yield from self._load_module(
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     yield from self._load_module(
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     yield from self._load_module(
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   [Previous line repeated 2 more times]
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]   File "/home/ma-user/anaconda3/envs/py310/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 319, in _load_module
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936]     raise ValueError(msg)
(EngineCore_DP0 pid=811974) ERROR 01-25 23:00:12 [core.py:936] ValueError: There is no module or parameter named 'model.layers.39.mlp.gate.e_score_correction_bias' in TransformersMoEForCausalLM

I have the same problem, please help.

Shouldn't vLLM be using Glm4MoeLiteForCausalLM instead of TransformersMoEForCausalLM? Check your config.json.
FYI, I had to update my transformers to 5.0.0rc3 to get the model to run. But it's working now.

It is Glm4MoeLiteForCausalLM, any idea?
config.json

{
  "architectures": [
    "Glm4MoeLiteForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "pad_token_id": 154820,
  "eos_token_id": [
    154820,
    154827,
    154829
  ],
  "hidden_act": "silu",
  "hidden_size": 2048,
}
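For anyone double-checking their own download, the architecture name can be read straight out of the model directory; a minimal sketch (the path argument is a placeholder for wherever your copy lives):

```python
import json

def read_architectures(config_path):
    """Return the "architectures" list from a model's config.json."""
    with open(config_path) as f:
        config = json.load(f)
    return config.get("architectures", [])
```

Pointed at the directory above, this should return `['Glm4MoeLiteForCausalLM']`; if it returns something else or the file is missing, the download is incomplete.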

Sounds like your version of transformers needs to be updated:

pip install transformers==5.0.0

The 5.0.0 release just dropped a couple hours ago.

Per the model card:

Requirements:
vLLM: 0.14.0+ (for MXFP4 Marlin backend support)
transformers: 5.0.0+ (for glm4_moe_lite architecture)

Upgrade your transformers to 5.0.0+
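If you want to verify the installed version against the requirement before relaunching, a plain string comparison is enough; a quick sketch (no transformers import needed, and pre-release suffixes like rc3 are deliberately ignored, which is a simplification of real version-ordering rules):

```python
import re

def version_tuple(version: str) -> tuple:
    """Parse "5.0.0" or "5.0.0rc3" into a comparable tuple of ints.

    Pre-release suffixes (rc, beta, ...) are dropped, so "5.0.0rc3"
    compares equal to "5.0.0" -- a simplification, but good enough
    for a quick minimum-version check.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)

def meets_requirement(installed: str, required: str) -> bool:
    return version_tuple(installed) >= version_tuple(required)
```

For example, `meets_requirement("4.57.1", "5.0.0")` is False, which is the situation that produces the error in this thread. The installed version comes from `transformers.__version__`.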

GadflyII changed discussion status to closed
GadflyII changed discussion status to open

Hugging Face has to add support for new models in the transformers library when they're released. If your version of transformers predates the GLM-4.7 release, the necessary torch subclasses won't be in there. The GLM-4.7 classes are in 5.0.0, so use that version or later.
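The warning in the log ("TransformersMoEForCausalLM has no vLLM implementation, falling back to Transformers implementation") is consistent with a registry lookup plus fallback. A hypothetical sketch of that logic follows; the function and registry contents are invented for illustration and are not vLLM's actual internals:

```python
# Hypothetical sketch of how the fallback warning in the log can arise.
# NATIVE_IMPLS contents are invented; vLLM's real registry is much larger.
NATIVE_IMPLS = {"LlamaForCausalLM", "Qwen2ForCausalLM"}

def resolve_backend(architecture: str, in_transformers: bool) -> str:
    """Pick a model class for the architecture named in config.json."""
    if architecture in NATIVE_IMPLS:
        return architecture                  # vLLM-native implementation
    if in_transformers:
        # No native class: fall back to the generic Transformers backend,
        # which is why the engine logs a different class name than config.json.
        return "TransformersMoEForCausalLM"
    raise ValueError(f"Unsupported architecture: {architecture}")
```

Under this reading, config.json saying Glm4MoeLiteForCausalLM while the engine logs TransformersMoEForCausalLM is not itself the bug; the generic backend just reports its own class name. The actual failure is the missing `e_score_correction_bias` parameter, which the fallback backend's weight mapping doesn't know about until transformers ships the real glm4_moe_lite classes.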

vLLM 0.14.1
transformers 5.0.0
same issue

@gbj231321 Exactly the same issue? Same error?

It should be "Glm4MoeLiteForCausalLM", not "TransformersMoEForCausalLM", per the config.json; for some reason vLLM isn't loading it correctly. Verify that you have all the files downloaded, including config.json, etc.

You can try updating vLLM by pulling from main, or you can try my fork of vLLM (which will only exist until the PRs are merged into upstream).

https://github.com/Gadflyii/vllm/

Here is the full launch command I use for single GPU:

NVIDIA_TF32_OVERRIDE=1 \
VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS=1 \
VLLM_FLASH_ATTN_VERSION=2 \
python -m vllm.entrypoints.openai.api_server \
--model PATH \
--tensor-parallel-size 1 \
--trust-remote-code \
--max-model-len NUMBER \
--gpu-memory-utilization 0.95 \
--port 8000

With the new updates from vLLM and transformers 5.0.0 you should see "Glm4MoeLiteForCausalLM", and as of yesterday, it should use the "TRITON_MLA" backend (which makes it a LOT faster).
