---
license: other
license_name: glm-4
license_link: https://huggingface.co/zai-org/GLM-4.6V/blob/main/LICENSE
base_model: zai-org/GLM-4.6V
tags:
- nvfp4
- quantized
- vllm
- vision-language-model
- moe
library_name: vllm
pipeline_tag: image-text-to-text
---

# GLM-4.6V-NVFP4

NVFP4 (4-bit floating point) quantized version of [zai-org/GLM-4.6V](https://huggingface.co/zai-org/GLM-4.6V).

## Model Details

| Property | Value |
|----------|-------|
| Base Model | [zai-org/GLM-4.6V](https://huggingface.co/zai-org/GLM-4.6V) |
| Architecture | Glm4vMoeForConditionalGeneration (108B MoE) |
| Quantization | NVFP4 (E2M1 format) with dynamic activation scaling |
| Model Size | 64 GB (vs. 216 GB in BF16) |
| Compression | 3.4x |
| Max Context | 131,072 tokens (128K) |

## Benchmark Results

### MMLU (0-shot, 14,042 questions)

| Category | BF16 | NVFP4 | Delta (pp) |
|----------|------|-------|------------|
| **Overall** | **76.01%** | **73.56%** | **-2.45** |
| STEM | 74.72% | 70.25% | -4.47 |
| Humanities | 68.63% | 67.14% | -1.49 |
| Social Sciences | 83.62% | 81.90% | -1.72 |
| Other | 80.98% | 78.37% | -2.61 |

## Usage with vLLM

### Launch Command

```bash
# Single GPU (full 128K context)
python -m vllm.entrypoints.openai.api_server \
    --model GadflyII/GLM-4.6V-NVFP4 \
    --tensor-parallel-size 1 \
    --trust-remote-code \
    --max-model-len 131072 \
    --port 8000

# Two GPUs
python -m vllm.entrypoints.openai.api_server \
    --model GadflyII/GLM-4.6V-NVFP4 \
    --tensor-parallel-size 2 \
    --trust-remote-code \
    --max-model-len 131072 \
    --port 8000
```

### Python API

```python
from vllm import LLM, SamplingParams

model = LLM(
    "GadflyII/GLM-4.6V-NVFP4",
    tensor_parallel_size=1,
    trust_remote_code=True,
    max_model_len=131072,
)

# Recommended sampling parameters
params = SamplingParams(
    temperature=0.8,
    top_p=0.6,
    top_k=2,
    repetition_penalty=1.1,
    max_tokens=1024,
)

outputs = model.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```

## Quantization Details

This model uses **dynamic NVFP4 quantization**:

- Weights: quantized to FP4 (E2M1 format) with two-level scaling
- Activations: dynamically quantized at runtime (`input_global_scale=1.0`, `dynamic=true`)
- Vision encoder: preserved in its original precision

## Hardware Tested

- NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB VRAM)
- Single GPU: 78 tok/s generation throughput

## Multi-GPU on SM120 Blackwell

If you have issues running multiple SM120 Blackwell GPUs (RTX 50 series / RTX PRO Blackwell), try my vLLM fork below while the fix is waiting on a PR back into main:

https://github.com/Gadflyii/vllm/

## License

Same as the base model: [GLM-4 License](https://huggingface.co/zai-org/GLM-4.6V/blob/main/LICENSE)

## Acknowledgments

- Original model by [Zhipu AI](https://huggingface.co/zai-org)
- Quantization methodology informed by vLLM's compressed-tensors implementation
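
## Example: Sending an Image to the Server

Since this is a vision-language model, here is a minimal client-side sketch for querying the OpenAI-compatible server started with the launch command above. It assumes the `openai` Python package is installed and the server is on port 8000; the image URL is a placeholder, and the vLLM-specific sampling knobs (`top_k`, `repetition_penalty`) are passed through the client's `extra_body` escape hatch.

```python
from openai import OpenAI

# Points at the vLLM OpenAI-compatible server from the launch command above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="GadflyII/GLM-4.6V-NVFP4",
    messages=[{
        "role": "user",
        "content": [
            # Placeholder image URL; swap in your own image.
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            {"type": "text", "text": "Describe this image in one paragraph."},
        ],
    }],
    # Recommended sampling parameters, matching the Python API section.
    temperature=0.8,
    top_p=0.6,
    max_tokens=1024,
    # Sampling parameters outside the OpenAI schema go through extra_body.
    extra_body={"top_k": 2, "repetition_penalty": 1.1},
)
print(response.choices[0].message.content)
```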
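
## Appendix: E2M1 Rounding at a Glance

For intuition about what the weight quantization in the Quantization Details section does, the toy sketch below rounds values onto the eight non-negative E2M1 magnitudes with one scale per 16-value block. This is an illustration of the block-scaling idea only, not the production path: real NVFP4 packs 4-bit codes and keeps per-block scales in FP8 (E4M3) beneath a global FP32 scale, all handled by fused kernels inside vLLM.

```python
import numpy as np

# The 8 non-negative values representable in E2M1
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_nvfp4(x: np.ndarray, block_size: int = 16) -> np.ndarray:
    """Quantize-dequantize x on the E2M1 grid with one scale per block.

    Toy version: scales stay in full precision, whereas real NVFP4
    stores them in FP8 (E4M3) under a second, global FP32 scale.
    """
    blocks = x.reshape(-1, block_size)
    # Map each block's largest magnitude onto the grid maximum (6.0).
    scales = np.abs(blocks).max(axis=1, keepdims=True) / E2M1_GRID[-1]
    scales[scales == 0] = 1.0  # avoid dividing an all-zero block by zero
    scaled = blocks / scales
    # Round-to-nearest onto the E2M1 grid, preserving sign.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    return (np.sign(scaled) * E2M1_GRID[idx] * scales).reshape(x.shape)

x = np.random.randn(64).astype(np.float32)
print("max abs round-trip error:", np.abs(x - fake_nvfp4(x)).max())
```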