---
license: other
license_name: glm-4
license_link: https://huggingface.co/zai-org/GLM-4.6V/blob/main/LICENSE
base_model: zai-org/GLM-4.6V
tags:
- nvfp4
- quantized
- vllm
- vision-language-model
- moe
library_name: vllm
pipeline_tag: image-text-to-text
---
# GLM-4.6V-NVFP4

NVFP4 (4-bit floating point) quantized version of [zai-org/GLM-4.6V](https://huggingface.co/zai-org/GLM-4.6V).

## Model Details
| Property | Value |
|---|---|
| Base Model | zai-org/GLM-4.6V |
| Architecture | Glm4vMoeForConditionalGeneration (108B MoE) |
| Quantization | NVFP4 (E2M1 format) with dynamic activation scaling |
| Model Size | 64 GB (vs 216 GB BF16) |
| Compression | 3.4x |
| Max Context | 131,072 tokens (128K) |
## Benchmark Results

### MMLU (0-shot, 14,042 questions)
| Category | BF16 | NVFP4 | Δ |
|---|---|---|---|
| Overall | 76.01% | 73.56% | -2.45% |
| STEM | 74.72% | 70.25% | -4.47% |
| Humanities | 68.63% | 67.14% | -1.49% |
| Social Sciences | 83.62% | 81.90% | -1.72% |
| Other | 80.98% | 78.37% | -2.61% |
## Usage with vLLM

### Launch Command

```bash
# Single GPU (full 128K context)
python -m vllm.entrypoints.openai.api_server \
  --model GadflyII/GLM-4.6V-NVFP4 \
  --tensor-parallel-size 1 \
  --trust-remote-code \
  --max-model-len 131072 \
  --port 8000

# Two GPUs
python -m vllm.entrypoints.openai.api_server \
  --model GadflyII/GLM-4.6V-NVFP4 \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --max-model-len 131072 \
  --port 8000
```
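Once the server is running, any OpenAI-compatible client can send image-plus-text requests. Below is a minimal sketch using the `openai` Python package; the image URL is a placeholder, and the package is assumed to be installed separately.

```python
from openai import OpenAI

# Point the client at the local vLLM server started above
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="GadflyII/GLM-4.6V-NVFP4",
    messages=[{
        "role": "user",
        "content": [
            # Placeholder URL: swap in a real, reachable image
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            {"type": "text", "text": "Summarize this chart."},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```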
### Python API

```python
from vllm import LLM, SamplingParams

model = LLM(
    "GadflyII/GLM-4.6V-NVFP4",
    tensor_parallel_size=1,
    trust_remote_code=True,
    max_model_len=131072,
)

# Recommended sampling parameters
params = SamplingParams(
    temperature=0.8,
    top_p=0.6,
    top_k=2,
    repetition_penalty=1.1,
    max_tokens=1024,
)

outputs = model.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```
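Since the pipeline is image-text-to-text, the same `model` object also accepts multimodal chat input. A minimal sketch, assuming a recent vLLM with multimodal `chat()` support; the image URL is a placeholder:

```python
# Reuses `model` and `params` from the snippet above
messages = [{
    "role": "user",
    "content": [
        # Placeholder URL: swap in a real, reachable image
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.png"}},
        {"type": "text", "text": "Describe this image."},
    ],
}]
outputs = model.chat(messages, params)
print(outputs[0].outputs[0].text)
```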
## Quantization Details

This model uses dynamic NVFP4 quantization (see the sketch after this list):

- Weights: Quantized to FP4 (E2M1 format) with two-level scaling
- Activations: Dynamically quantized at runtime (`input_global_scale=1.0`, `dynamic=true`)
- Vision encoder: Preserved in original precision
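For intuition, here is a simplified sketch of E2M1 block quantization. It is not the production kernel: real NVFP4 stores each 16-element block's scale in FP8 (E4M3) and applies a second tensor-wide FP32 global scale, while this sketch uses a single FP32 scale per block; all helper names are illustrative.

```python
import numpy as np

# The eight non-negative values representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit)
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 16  # NVFP4 assigns one scale per 16-element block

def quantize_block_e2m1(x: np.ndarray):
    """Quantize one block to the E2M1 grid with a single FP32 scale."""
    scale = np.abs(x).max() / E2M1_GRID[-1]  # map the block max onto the largest FP4 magnitude (6.0)
    if scale == 0.0:
        return np.zeros_like(x), 0.0
    mags = np.abs(x) / scale
    # Snap each magnitude to the nearest representable E2M1 value, keeping the sign
    nearest = E2M1_GRID[np.abs(mags[:, None] - E2M1_GRID[None, :]).argmin(axis=1)]
    return np.sign(x) * nearest, scale

x = np.random.randn(BLOCK).astype(np.float32)
q, scale = quantize_block_e2m1(x)
print("max abs error:", np.abs(q * scale - x).max())
```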
## Hardware Tested

- NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB VRAM)
- Single GPU: 78 tok/s generation throughput (a rough reproduction sketch follows)
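The figure above can be reproduced approximately with a quick timing loop. A rough sketch (the prompt and token budget are arbitrary, and prefill time is included, so expect the number to vary):

```python
import time
from vllm import LLM, SamplingParams

llm = LLM("GadflyII/GLM-4.6V-NVFP4", trust_remote_code=True, max_model_len=131072)
params = SamplingParams(temperature=0.8, top_p=0.6, max_tokens=512)

start = time.perf_counter()
outputs = llm.generate(["Explain how photosynthesis works."], params)
elapsed = time.perf_counter() - start

# Rough tok/s: generated tokens divided by wall-clock time (includes prefill)
n_tokens = len(outputs[0].outputs[0].token_ids)
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```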
If you have trouble running multiple SM120 Blackwell GPUs (RTX 50 series / RTX PRO Blackwell), try my vLLM fork below while the fix is pending merge upstream:

https://github.com/Gadflyii/vllm/
## License

Same as the base model: [GLM-4 License](https://huggingface.co/zai-org/GLM-4.6V/blob/main/LICENSE)
## Acknowledgments

- Original model by Zhipu AI
- Quantization methodology informed by vLLM's compressed-tensors implementation