---
license: other
license_name: glm-4
license_link: https://huggingface.co/zai-org/GLM-4.6V/blob/main/LICENSE
base_model: zai-org/GLM-4.6V
tags:
- nvfp4
- quantized
- vllm
- vision-language-model
- moe
library_name: vllm
pipeline_tag: image-text-to-text
---
|
|
|
|
|
# GLM-4.6V-NVFP4 |
|
|
|
|
|
NVFP4 (4-bit floating point) quantized version of [zai-org/GLM-4.6V](https://huggingface.co/zai-org/GLM-4.6V). |
|
|
|
|
|
## Model Details |
|
|
|
|
|
| Property | Value |
|----------|-------|
| Base Model | [zai-org/GLM-4.6V](https://huggingface.co/zai-org/GLM-4.6V) |
| Architecture | Glm4vMoeForConditionalGeneration (108B MoE) |
| Quantization | NVFP4 (E2M1 format) with dynamic activation scaling |
| Model Size | 64 GB (vs 216 GB BF16) |
| Compression | 3.4x |
| Max Context | 131,072 tokens (128K) |
|
|
|
|
|
## Benchmark Results |
|
|
|
|
|
### MMLU (0-shot, 14,042 questions) |
|
|
|
|
|
| Category | BF16 | NVFP4 | Accuracy Loss |
|----------|------|-------|---------------|
| **Overall** | **76.01%** | **73.56%** | **-2.45%** |
| STEM | 74.72% | 70.25% | -4.47% |
| Humanities | 68.63% | 67.14% | -1.49% |
| Social Sciences | 83.62% | 81.90% | -1.72% |
| Other | 80.98% | 78.37% | -2.61% |
|
|
|
|
|
## Usage with vLLM |
|
|
|
|
|
### Launch Command |
|
|
|
|
|
```bash
# Single GPU (full 128K context)
python -m vllm.entrypoints.openai.api_server \
    --model GadflyII/GLM-4.6V-NVFP4 \
    --tensor-parallel-size 1 \
    --trust-remote-code \
    --max-model-len 131072 \
    --port 8000

# Two GPUs
python -m vllm.entrypoints.openai.api_server \
    --model GadflyII/GLM-4.6V-NVFP4 \
    --tensor-parallel-size 2 \
    --trust-remote-code \
    --max-model-len 131072 \
    --port 8000
```
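
### Querying the Server

Once the server is up, it exposes an OpenAI-compatible chat endpoint. Below is a minimal sketch using the `openai` Python client; the image URL and prompt are placeholders, and the `model` value must match the name passed to `--model` at launch.

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server does not require a real API key by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="GadflyII/GLM-4.6V-NVFP4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        }
    ],
    temperature=0.8,
    top_p=0.6,
    max_tokens=512,
)
print(response.choices[0].message.content)
```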
|
|
|
|
|
### Python API |
|
|
|
|
|
```python
from vllm import LLM, SamplingParams

model = LLM(
    "GadflyII/GLM-4.6V-NVFP4",
    tensor_parallel_size=1,
    trust_remote_code=True,
    max_model_len=131072,
)

# Recommended sampling parameters
params = SamplingParams(
    temperature=0.8,
    top_p=0.6,
    top_k=2,
    repetition_penalty=1.1,
    max_tokens=1024,
)

outputs = model.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```
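
Since GLM-4.6V is a vision-language model, image inputs can also be passed offline through `LLM.chat`, which applies the model's chat template to OpenAI-style messages. A minimal sketch, assuming a recent vLLM version with multimodal chat support (the image URL is a placeholder):

```python
from vllm import LLM, SamplingParams

model = LLM(
    "GadflyII/GLM-4.6V-NVFP4",
    tensor_parallel_size=1,
    trust_remote_code=True,
    max_model_len=131072,
)

params = SamplingParams(temperature=0.8, top_p=0.6, max_tokens=512)

# OpenAI-style message; vLLM fetches the image and applies the chat template.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]

outputs = model.chat(messages, params)
print(outputs[0].outputs[0].text)
```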
|
|
|
|
|
## Quantization Details |
|
|
|
|
|
This model uses **dynamic NVFP4 quantization**: |
|
|
- Weights: Quantized to FP4 (E2M1 format) with two-level scaling (see the sketch after this list)
|
|
- Activations: Dynamically quantized at runtime (`input_global_scale=1.0`, `dynamic=true`) |
|
|
- Vision encoder: Preserved in original precision |
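
To make the two-level scaling concrete, here is a toy NumPy sketch of how one 16-element weight group maps onto the E2M1 grid with a per-group scale and a per-tensor global scale. It illustrates the arithmetic only, not vLLM's actual kernels, and it keeps the group scale in FP32 rather than the FP8 (E4M3) storage the real format uses.

```python
import numpy as np

# Magnitudes representable in E2M1 (FP4); the sign is handled separately.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4_group(w, global_scale):
    """Quantize one 16-element group to FP4 values plus a per-group scale."""
    # Level 1: per-group scale mapping the group's max |w| onto the top FP4 value (6.0).
    group_scale = np.abs(w).max() / (6.0 * global_scale)
    scaled = w / (group_scale * global_scale)
    # Round each element to the nearest representable FP4 value of the same sign.
    candidates = np.sign(scaled)[:, None] * FP4_GRID[None, :]
    nearest = np.abs(candidates - scaled[:, None]).argmin(axis=1)
    q = candidates[np.arange(len(scaled)), nearest]
    return q, group_scale

w = np.random.randn(16).astype(np.float32)
global_scale = 1.0  # Level 2: per-tensor FP32 scale (set to 1.0 for simplicity)
q, group_scale = quantize_nvfp4_group(w, global_scale)
dequantized = q * group_scale * global_scale
print("max abs error:", np.abs(w - dequantized).max())
```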
|
|
|
|
|
## Hardware Tested |
|
|
|
|
|
- NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB VRAM) |
|
|
- Single GPU: 78 tok/s generation throughput |
|
|
|
|
|
## Multi-GPU on SM120 Blackwell

If you are having issues running multiple SM120 Blackwell GPUs (RTX 50 series / RTX PRO Blackwell), try my vLLM fork below while the fix is pending a PR back into mainline vLLM:

https://github.com/Gadflyii/vllm/
|
|
|
|
|
## License |
|
|
|
|
|
Same as base model: [GLM-4 License](https://huggingface.co/zai-org/GLM-4.6V/blob/main/LICENSE) |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- Original model by [Zhipu AI](https://huggingface.co/zai-org) |
|
|
- Quantization methodology informed by vLLM's compressed-tensors implementation |
|
|
|