---
license: other
license_name: glm-4
license_link: https://huggingface.co/zai-org/GLM-4.6V/blob/main/LICENSE
base_model: zai-org/GLM-4.6V
tags:
- nvfp4
- quantized
- vllm
- vision-language-model
- moe
library_name: vllm
pipeline_tag: image-text-to-text
---
# GLM-4.6V-NVFP4

NVFP4 (4-bit floating point) quantized version of [zai-org/GLM-4.6V](https://huggingface.co/zai-org/GLM-4.6V).

## Model Details
| Property | Value |
|---|---|
| Base Model | zai-org/GLM-4.6V |
| Architecture | Glm4vMoeForConditionalGeneration (108B MoE) |
| Quantization | NVFP4 (E2M1 format) with dynamic activation scaling |
| Model Size | 64 GB (vs 216 GB BF16) |
| Compression | 3.4x |
| Max Context | 131,072 tokens (128K) |
## Benchmark Results

### MMLU (0-shot, 14,042 questions)
| Category | BF16 | NVFP4 | Δ |
|---|---|---|---|
| Overall | 76.01% | 73.56% | -2.45% |
| STEM | 74.72% | 70.25% | -4.47% |
| Humanities | 68.63% | 67.14% | -1.49% |
| Social Sciences | 83.62% | 81.90% | -1.72% |
| Other | 80.98% | 78.37% | -2.61% |
## Usage with vLLM

### Launch Command

```bash
# Single GPU (full 128K context)
python -m vllm.entrypoints.openai.api_server \
  --model GadflyII/GLM-4.6V-NVFP4 \
  --tensor-parallel-size 1 \
  --trust-remote-code \
  --max-model-len 131072 \
  --port 8000

# Two GPUs
python -m vllm.entrypoints.openai.api_server \
  --model GadflyII/GLM-4.6V-NVFP4 \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --max-model-len 131072 \
  --port 8000
```
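Once the server is running, any OpenAI-compatible client can send image-plus-text requests. Below is a minimal sketch using the `openai` Python package; the image URL is a placeholder, and the package is assumed to be installed separately.

```python
from openai import OpenAI

# Point the client at the local vLLM server started above
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="GadflyII/GLM-4.6V-NVFP4",
    messages=[{
        "role": "user",
        "content": [
            # Placeholder URL: swap in a real, reachable image
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            {"type": "text", "text": "Summarize this chart."},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```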
### Python API

```python
from vllm import LLM, SamplingParams

model = LLM(
    "GadflyII/GLM-4.6V-NVFP4",
    tensor_parallel_size=1,
    trust_remote_code=True,
    max_model_len=131072,
)

# Recommended sampling parameters
params = SamplingParams(
    temperature=0.8,
    top_p=0.6,
    top_k=2,
    repetition_penalty=1.1,
    max_tokens=1024,
)

outputs = model.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```
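Since the pipeline is image-text-to-text, the same `model` object also accepts multimodal chat input. A minimal sketch, assuming a recent vLLM with multimodal `chat()` support; the image URL is a placeholder:

```python
# Reuses `model` and `params` from the snippet above
messages = [{
    "role": "user",
    "content": [
        # Placeholder URL: swap in a real, reachable image
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.png"}},
        {"type": "text", "text": "Describe this image."},
    ],
}]
outputs = model.chat(messages, params)
print(outputs[0].outputs[0].text)
```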
## Quantization Details

This model uses dynamic NVFP4 quantization (see the sketch after this list):

- Weights: Quantized to FP4 (E2M1 format) with two-level scaling
- Activations: Dynamically quantized at runtime (`input_global_scale=1.0`, `dynamic=true`)
- Vision encoder: Preserved in original precision
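For intuition, here is a simplified sketch of E2M1 block quantization. It is not the production kernel: real NVFP4 stores each 16-element block's scale in FP8 (E4M3) and applies a second tensor-wide FP32 global scale, while this sketch uses a single FP32 scale per block; all helper names are illustrative.

```python
import numpy as np

# The eight non-negative values representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit)
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 16  # NVFP4 assigns one scale per 16-element block

def quantize_block_e2m1(x: np.ndarray):
    """Quantize one block to the E2M1 grid with a single FP32 scale."""
    scale = np.abs(x).max() / E2M1_GRID[-1]  # map the block max onto the largest FP4 magnitude (6.0)
    if scale == 0.0:
        return np.zeros_like(x), 0.0
    mags = np.abs(x) / scale
    # Snap each magnitude to the nearest representable E2M1 value, keeping the sign
    nearest = E2M1_GRID[np.abs(mags[:, None] - E2M1_GRID[None, :]).argmin(axis=1)]
    return np.sign(x) * nearest, scale

x = np.random.randn(BLOCK).astype(np.float32)
q, scale = quantize_block_e2m1(x)
print("max abs error:", np.abs(q * scale - x).max())
```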
## Hardware Tested

- NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB VRAM)
- Single GPU: 78 tok/s generation throughput (a rough reproduction sketch follows)
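The figure above can be reproduced approximately with a quick timing loop. A rough sketch (the prompt and token budget are arbitrary, and prefill time is included, so expect the number to vary):

```python
import time
from vllm import LLM, SamplingParams

llm = LLM("GadflyII/GLM-4.6V-NVFP4", trust_remote_code=True, max_model_len=131072)
params = SamplingParams(temperature=0.8, top_p=0.6, max_tokens=512)

start = time.perf_counter()
outputs = llm.generate(["Explain how photosynthesis works."], params)
elapsed = time.perf_counter() - start

# Rough tok/s: generated tokens divided by wall-clock time (includes prefill)
n_tokens = len(outputs[0].outputs[0].token_ids)
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```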
If you have trouble running multiple SM120 Blackwell GPUs (RTX 50 series / RTX PRO Blackwell), try my vLLM fork below while the fix is pending merge upstream:

https://github.com/Gadflyii/vllm/
## License

Same as the base model: [GLM-4 License](https://huggingface.co/zai-org/GLM-4.6V/blob/main/LICENSE)
## Acknowledgments

- Original model by Zhipu AI
- Quantization methodology informed by vLLM's compressed-tensors implementation