---
license: other
license_name: glm-4
license_link: https://huggingface.co/zai-org/GLM-4.6V/blob/main/LICENSE
base_model: zai-org/GLM-4.6V
tags:
- nvfp4
- quantized
- vllm
- vision-language-model
- moe
library_name: vllm
pipeline_tag: image-text-to-text
---
|
|
|
|
|
# GLM-4.6V-NVFP4 |
|
|
|
|
|
NVFP4 (4-bit floating point) quantized version of [zai-org/GLM-4.6V](https://huggingface.co/zai-org/GLM-4.6V). |
|
|
|
|
|
## Model Details |
|
|
|
|
|
| Property | Value |
|----------|-------|
| Base Model | [zai-org/GLM-4.6V](https://huggingface.co/zai-org/GLM-4.6V) |
| Architecture | Glm4vMoeForConditionalGeneration (108B MoE) |
| Quantization | NVFP4 (E2M1 format) with dynamic activation scaling |
| Model Size | 64 GB (vs 216 GB BF16) |
| Compression | 3.4x |
| Max Context | 131,072 tokens (128K) |
|
|
|
|
|
## Benchmark Results |
|
|
|
|
|
### MMLU (0-shot, 14,042 questions) |
|
|
|
|
|
| Category | BF16 | NVFP4 | Accuracy Loss |
|----------|------|-------|---------------|
| **Overall** | **76.01%** | **73.56%** | **-2.45%** |
| STEM | 74.72% | 70.25% | -4.47% |
| Humanities | 68.63% | 67.14% | -1.49% |
| Social Sciences | 83.62% | 81.90% | -1.72% |
| Other | 80.98% | 78.37% | -2.61% |
|
|
|
|
|
## Usage with vLLM |
|
|
|
|
|
### Launch Command |
|
|
|
|
|
```bash
# Single GPU (full 128K context)
python -m vllm.entrypoints.openai.api_server \
    --model GadflyII/GLM-4.6V-NVFP4 \
    --tensor-parallel-size 1 \
    --trust-remote-code \
    --max-model-len 131072 \
    --port 8000

# Two GPUs
python -m vllm.entrypoints.openai.api_server \
    --model GadflyII/GLM-4.6V-NVFP4 \
    --tensor-parallel-size 2 \
    --trust-remote-code \
    --max-model-len 131072 \
    --port 8000
```
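
### Querying the Server

Once the server is up, it exposes an OpenAI-compatible chat endpoint. Below is a minimal sketch using the `openai` Python client; the image URL and prompt are placeholders, and the `model` value must match the name passed to `--model` at launch.

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server does not require a real API key by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="GadflyII/GLM-4.6V-NVFP4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        }
    ],
    temperature=0.8,
    top_p=0.6,
    max_tokens=512,
)
print(response.choices[0].message.content)
```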
|
|
|
|
|
### Python API |
|
|
|
|
|
```python
from vllm import LLM, SamplingParams

model = LLM(
    "GadflyII/GLM-4.6V-NVFP4",
    tensor_parallel_size=1,
    trust_remote_code=True,
    max_model_len=131072,
)

# Recommended sampling parameters
params = SamplingParams(
    temperature=0.8,
    top_p=0.6,
    top_k=2,
    repetition_penalty=1.1,
    max_tokens=1024,
)

outputs = model.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```
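
Since GLM-4.6V is a vision-language model, image inputs can also be passed offline through `LLM.chat`, which applies the model's chat template to OpenAI-style messages. A minimal sketch, assuming a recent vLLM version with multimodal chat support (the image URL is a placeholder):

```python
from vllm import LLM, SamplingParams

model = LLM(
    "GadflyII/GLM-4.6V-NVFP4",
    tensor_parallel_size=1,
    trust_remote_code=True,
    max_model_len=131072,
)

params = SamplingParams(temperature=0.8, top_p=0.6, max_tokens=512)

# OpenAI-style message; vLLM fetches the image and applies the chat template.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]

outputs = model.chat(messages, params)
print(outputs[0].outputs[0].text)
```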
|
|
|
|
|
## Quantization Details |
|
|
|
|
|
This model uses **dynamic NVFP4 quantization**: |
|
|
- Weights: Quantized to FP4 (E2M1 format) with two-level scaling (see the sketch after this list)
|
|
- Activations: Dynamically quantized at runtime (`input_global_scale=1.0`, `dynamic=true`) |
|
|
- Vision encoder: Preserved in original precision |
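
To make the two-level scaling concrete, here is a toy NumPy sketch of how one 16-element weight group maps onto the E2M1 grid with a per-group scale and a per-tensor global scale. It illustrates the arithmetic only, not vLLM's actual kernels, and it keeps the group scale in FP32 rather than the FP8 (E4M3) storage the real format uses.

```python
import numpy as np

# Magnitudes representable in E2M1 (FP4); the sign is handled separately.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4_group(w, global_scale):
    """Quantize one 16-element group to FP4 values plus a per-group scale."""
    # Level 1: per-group scale mapping the group's max |w| onto the top FP4 value (6.0).
    group_scale = np.abs(w).max() / (6.0 * global_scale)
    scaled = w / (group_scale * global_scale)
    # Round each element to the nearest representable FP4 value of the same sign.
    candidates = np.sign(scaled)[:, None] * FP4_GRID[None, :]
    nearest = np.abs(candidates - scaled[:, None]).argmin(axis=1)
    q = candidates[np.arange(len(scaled)), nearest]
    return q, group_scale

w = np.random.randn(16).astype(np.float32)
global_scale = 1.0  # Level 2: per-tensor FP32 scale (set to 1.0 for simplicity)
q, group_scale = quantize_nvfp4_group(w, global_scale)
dequantized = q * group_scale * global_scale
print("max abs error:", np.abs(w - dequantized).max())
```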
|
|
|
|
|
## Hardware Tested |
|
|
|
|
|
- NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB VRAM) |
|
|
- Single GPU: 78 tok/s generation throughput |
|
|
|
|
|
## Multi-GPU on SM120 Blackwell

If you are having issues running multiple SM120 Blackwell GPUs (RTX 50 series / RTX PRO Blackwell), try my vLLM fork below while the fix is pending a PR back into mainline vLLM:

https://github.com/Gadflyii/vllm/
|
|
|
|
|
## License |
|
|
|
|
|
Same as base model: [GLM-4 License](https://huggingface.co/zai-org/GLM-4.6V/blob/main/LICENSE) |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- Original model by [Zhipu AI](https://huggingface.co/zai-org) |
|
|
- Quantization methodology informed by vLLM's compressed-tensors implementation |
|
|
|