---
license: other
license_name: glm-4
license_link: https://huggingface.co/zai-org/GLM-4.6V/blob/main/LICENSE
base_model: zai-org/GLM-4.6V
tags:
- nvfp4
- quantized
- vllm
- vision-language-model
- moe
library_name: vllm
pipeline_tag: image-text-to-text
---

# GLM-4.6V-NVFP4

NVFP4 (4-bit floating point) quantized version of [zai-org/GLM-4.6V](https://huggingface.co/zai-org/GLM-4.6V).

## Model Details

| Property | Value |
|----------|-------|
| Base Model | [zai-org/GLM-4.6V](https://huggingface.co/zai-org/GLM-4.6V) |
| Architecture | Glm4vMoeForConditionalGeneration (108B MoE) |
| Quantization | NVFP4 (E2M1 format) with dynamic activation scaling |
| Model Size | 64 GB (vs. 216 GB in BF16) |
| Compression | 3.4x |
| Max Context | 131,072 tokens (128K) |

## Benchmark Results

### MMLU (0-shot, 14,042 questions)

| Category | BF16 | NVFP4 | Delta (pp) |
|----------|------|-------|------------|
| **Overall** | **76.01%** | **73.56%** | **-2.45** |
| STEM | 74.72% | 70.25% | -4.47 |
| Humanities | 68.63% | 67.14% | -1.49 |
| Social Sciences | 83.62% | 81.90% | -1.72 |
| Other | 80.98% | 78.37% | -2.61 |

## Usage with vLLM

### Launch Command

```bash
# Single GPU (full 128K context)
python -m vllm.entrypoints.openai.api_server \
    --model GadflyII/GLM-4.6V-NVFP4 \
    --tensor-parallel-size 1 \
    --trust-remote-code \
    --max-model-len 131072 \
    --port 8000

# Two GPUs
python -m vllm.entrypoints.openai.api_server \
    --model GadflyII/GLM-4.6V-NVFP4 \
    --tensor-parallel-size 2 \
    --trust-remote-code \
    --max-model-len 131072 \
    --port 8000
```

### Python API

```python
from vllm import LLM, SamplingParams

model = LLM(
    "GadflyII/GLM-4.6V-NVFP4",
    tensor_parallel_size=1,
    trust_remote_code=True,
    max_model_len=131072,
)

# Recommended sampling parameters
params = SamplingParams(
    temperature=0.8,
    top_p=0.6,
    top_k=2,
    repetition_penalty=1.1,
    max_tokens=1024,
)

outputs = model.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```

## Quantization Details

This model uses **dynamic NVFP4 quantization**:

- Weights: quantized to FP4 (E2M1 format) with two-level scaling
- Activations: dynamically quantized at runtime (`input_global_scale=1.0`, `dynamic=true`)
- Vision encoder: preserved in its original precision

## Hardware Tested

- NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB VRAM)
- Single GPU: 78 tok/s generation throughput

## Multi-GPU on SM120 Blackwell

If you have issues running multiple SM120 Blackwell GPUs (RTX 50 series / RTX PRO Blackwell), try my vLLM fork below while the fix is waiting on a PR back into main:

https://github.com/Gadflyii/vllm/

## License

Same as the base model: [GLM-4 License](https://huggingface.co/zai-org/GLM-4.6V/blob/main/LICENSE)

## Acknowledgments

- Original model by [Zhipu AI](https://huggingface.co/zai-org)
- Quantization methodology informed by vLLM's compressed-tensors implementation
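
## Example: Sending an Image to the Server

Since this is a vision-language model, here is a minimal client-side sketch for querying the OpenAI-compatible server started with the launch command above. It assumes the `openai` Python package is installed and the server is on port 8000; the image URL is a placeholder, and the vLLM-specific sampling knobs (`top_k`, `repetition_penalty`) are passed through the client's `extra_body` escape hatch.

```python
from openai import OpenAI

# Points at the vLLM OpenAI-compatible server from the launch command above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="GadflyII/GLM-4.6V-NVFP4",
    messages=[{
        "role": "user",
        "content": [
            # Placeholder image URL; swap in your own image.
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            {"type": "text", "text": "Describe this image in one paragraph."},
        ],
    }],
    # Recommended sampling parameters, matching the Python API section.
    temperature=0.8,
    top_p=0.6,
    max_tokens=1024,
    # Sampling parameters outside the OpenAI schema go through extra_body.
    extra_body={"top_k": 2, "repetition_penalty": 1.1},
)
print(response.choices[0].message.content)
```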
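
## Appendix: E2M1 Rounding at a Glance

For intuition about what the weight quantization in the Quantization Details section does, the toy sketch below rounds values onto the eight non-negative E2M1 magnitudes with one scale per 16-value block. This is an illustration of the block-scaling idea only, not the production path: real NVFP4 packs 4-bit codes and keeps per-block scales in FP8 (E4M3) beneath a global FP32 scale, all handled by fused kernels inside vLLM.

```python
import numpy as np

# The 8 non-negative values representable in E2M1
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_nvfp4(x: np.ndarray, block_size: int = 16) -> np.ndarray:
    """Quantize-dequantize x on the E2M1 grid with one scale per block.

    Toy version: scales stay in full precision, whereas real NVFP4
    stores them in FP8 (E4M3) under a second, global FP32 scale.
    """
    blocks = x.reshape(-1, block_size)
    # Map each block's largest magnitude onto the grid maximum (6.0).
    scales = np.abs(blocks).max(axis=1, keepdims=True) / E2M1_GRID[-1]
    scales[scales == 0] = 1.0  # avoid dividing an all-zero block by zero
    scaled = blocks / scales
    # Round-to-nearest onto the E2M1 grid, preserving sign.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    return (np.sign(scaled) * E2M1_GRID[idx] * scales).reshape(x.shape)

x = np.random.randn(64).astype(np.float32)
print("max abs round-trip error:", np.abs(x - fake_nvfp4(x)).max())
```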