# GLM-4.6V-NVFP4

NVFP4 (4-bit floating point) quantized version of zai-org/GLM-4.6V.

## Model Details

| Property | Value |
|---|---|
| Base Model | zai-org/GLM-4.6V |
| Architecture | Glm4vMoeForConditionalGeneration (108B MoE) |
| Quantization | NVFP4 (E2M1 format) with dynamic activation scaling |
| Model Size | 64 GB (vs. 216 GB BF16) |
| Compression | 3.4x |
| Max Context | 131,072 tokens (128K) |

## Benchmark Results

### MMLU (0-shot, 14,042 questions)

| Category | BF16 | NVFP4 | Accuracy Loss |
|---|---|---|---|
| Overall | 76.01% | 73.56% | -2.45% |
| STEM | 74.72% | 70.25% | -4.47% |
| Humanities | 68.63% | 67.14% | -1.49% |
| Social Sciences | 83.62% | 81.90% | -1.72% |
| Other | 80.98% | 78.37% | -2.61% |

## Usage with vLLM

### Launch Command

```bash
# Single GPU (full 128K context)
python -m vllm.entrypoints.openai.api_server \
  --model GadflyII/GLM-4.6V-NVFP4 \
  --tensor-parallel-size 1 \
  --trust-remote-code \
  --max-model-len 131072 \
  --port 8000

# Two GPUs
python -m vllm.entrypoints.openai.api_server \
  --model GadflyII/GLM-4.6V-NVFP4 \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --max-model-len 131072 \
  --port 8000
```
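Once the server is up, it speaks the OpenAI-compatible HTTP API. Below is a minimal client sketch using only the Python standard library; it assumes the server is reachable at `localhost:8000` as in the launch command, and the image URL in the example is a placeholder. Since GLM-4.6V is a vision-language model, an image can be attached via the OpenAI-style `image_url` content part:

```python
import json
import urllib.request

def build_chat_request(text, image_url=None, max_tokens=512):
    """Build the JSON body for a /v1/chat/completions call.

    If image_url is given, it is attached as an OpenAI-style
    image_url content part ahead of the text prompt.
    """
    content = [{"type": "text", "text": text}]
    if image_url is not None:
        content.insert(0, {"type": "image_url", "image_url": {"url": image_url}})
    return {
        "model": "GadflyII/GLM-4.6V-NVFP4",
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }

def chat(text, image_url=None, base_url="http://localhost:8000"):
    """POST the request and return the assistant's reply text."""
    body = json.dumps(build_chat_request(text, image_url)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (requires the server from the launch command above):
# print(chat("Describe this image.", image_url="https://example.com/cat.jpg"))
```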

### Python API

```python
from vllm import LLM, SamplingParams

model = LLM(
    "GadflyII/GLM-4.6V-NVFP4",
    tensor_parallel_size=1,
    trust_remote_code=True,
    max_model_len=131072,
)

# Recommended sampling parameters
params = SamplingParams(
    temperature=0.8,
    top_p=0.6,
    top_k=2,
    repetition_penalty=1.1,
    max_tokens=1024,
)

outputs = model.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```

## Quantization Details

This model uses dynamic NVFP4 quantization:

- Weights: quantized to FP4 (E2M1 format) with two-level scaling
- Activations: dynamically quantized at runtime (`input_global_scale=1.0`, `dynamic=true`)
- Vision encoder: preserved in original precision
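To make the E2M1 weight path concrete, here is a toy fake-quantizer. This is a simplified sketch, not the model's actual kernel: it keeps per-block scales in full precision, whereas real NVFP4 stores them in FP8 (E4M3) alongside a per-tensor global scale (the two-level scheme mentioned above):

```python
# Magnitudes representable in FP4 E2M1 (2 exponent bits, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def fake_quantize_nvfp4(values, block_size=16):
    """Round values to the E2M1 grid, one scale per block of 16.

    Each block's scale maps its max magnitude onto 6.0 (the largest
    E2M1 value); scaled values are snapped to the nearest grid point
    and rescaled back, simulating quantize-then-dequantize.
    """
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) / 6.0
        if scale == 0.0:
            scale = 1.0  # all-zero block: any scale works
        for v in block:
            # Nearest representable magnitude, sign preserved.
            mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
            out.append((1.0 if v >= 0 else -1.0) * mag * scale)
    return out
```

Values that already sit on a block's scaled grid survive unchanged; everything else lands within half a grid step, which is the rounding error the benchmark deltas above reflect.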

## Hardware Tested

- NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB VRAM)
- Single GPU: 78 tok/s generation throughput

If you hit issues running multiple SM120 Blackwell GPUs (RTX 50 series / RTX PRO Blackwell), try my vLLM fork below while the fix awaits a PR back into mainline vLLM:

https://github.com/Gadflyii/vllm/

## License

Same as the base model: GLM-4 License

## Acknowledgments

- Original model by Zhipu AI
- Quantization methodology informed by vLLM's compressed-tensors implementation