Update README.md

README.md CHANGED

@@ -15,7 +15,7 @@ pipeline_tag: image-text-to-text
 
 # GLM-4.6V-NVFP4
 
-NVFP4 (4-bit floating point) quantized version of [zai-org/GLM-4.6V](https://huggingface.co/zai-org/GLM-4.6V)
+NVFP4 (4-bit floating point) quantized version of [zai-org/GLM-4.6V](https://huggingface.co/zai-org/GLM-4.6V).
 
 ## Model Details
 
@@ -40,24 +40,8 @@ NVFP4 (4-bit floating point) quantized version of [zai-org/GLM-4.6V](https://hug
 | Social Sciences | 83.62% | 81.90% | -1.72% |
 | Other | 80.98% | 78.37% | -2.61% |
 
-### Performance Comparison
-
-| Metric | BF16 | NVFP4 | Improvement |
-|--------|------|-------|-------------|
-| Model Size | 216 GB | 64 GB | **3.4x smaller** |
-| Min. VRAM | 192+ GB | 64 GB | **3x less** |
-| Generation Speed | 4 tok/s* | 78 tok/s | **19.5x faster** |
-| MMLU Accuracy | 76.01% | 73.56% | -2.45% |
-
-*BF16 tested with CPU offload due to memory constraints
-
 ## Usage with vLLM
 
-### Requirements
-- vLLM 0.13.0+
-- NVIDIA GPU with 64+ GB VRAM (RTX 6000, A100, H100, etc.)
-- CUDA 12.0+
-
 ### Launch Command
 
 ```bash
@@ -74,7 +58,7 @@ python -m vllm.entrypoints.openai.api_server \
     --model GadflyII/GLM-4.6V-NVFP4 \
     --tensor-parallel-size 1 \
     --trust-remote-code \
-    --max-model-len
+    --max-model-len 131072 \
     --port 8000
 
 # Two GPUs (for 48GB cards)
@@ -82,7 +66,7 @@ python -m vllm.entrypoints.openai.api_server \
    --model GadflyII/GLM-4.6V-NVFP4 \
     --tensor-parallel-size 2 \
     --trust-remote-code \
-    --max-model-len
+    --max-model-len 131072 \
     --port 8000
 ```
 
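Once either launch command above is running, the server exposes the standard OpenAI-compatible `/v1/chat/completions` route. A minimal stdlib client sketch — the helper names, prompt, and local endpoint are illustrative assumptions, not part of this repo:

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-compatible chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(base_url: str, payload: dict) -> str:
    """POST the payload to the vLLM server and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


payload = build_chat_request("GadflyII/GLM-4.6V-NVFP4", "Describe this model.")
# chat("http://localhost:8000", payload) returns the reply once the server is up
```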
|
@@ -95,7 +79,7 @@ model = LLM(
|
|
| 95 |
"GadflyII/GLM-4.6V-NVFP4",
|
| 96 |
tensor_parallel_size=1,
|
| 97 |
trust_remote_code=True,
|
| 98 |
-
max_model_len=
|
| 99 |
)
|
| 100 |
|
| 101 |
# Recommended sampling parameters
|
|
@@ -118,15 +102,13 @@ This model uses **dynamic NVFP4 quantization**:
|
|
| 118 |
- Activations: Dynamically quantized at runtime (`input_global_scale=1.0`, `dynamic=true`)
|
| 119 |
- Vision encoder: Preserved in original precision
|
| 120 |
|
| 121 |
-
### Why Dynamic Quantization?
|
| 122 |
-
|
| 123 |
-
Static calibration for NVFP4 fails due to the alpha scaling chain between layers. When calibrating on the BF16 model, activation magnitudes don't account for the inter-layer scaling that occurs during NVFP4 inference. Dynamic quantization computes scales at runtime, adapting to actual activation values.
|
| 124 |
-
|
| 125 |
## Hardware Tested
|
| 126 |
|
| 127 |
- NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB VRAM)
|
| 128 |
- Single GPU: 78 tok/s generation throughput
|
| 129 |
-
|
|
|
|
|
|
|
| 130 |
|
| 131 |
## License
|
| 132 |
|
|
|
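For intuition, the dynamic scheme described above — scales computed from the actual activation values at runtime rather than from a calibration pass — can be sketched as a toy example. The block size and scale handling here are simplifications (real NVFP4 uses FP8 per-block scales over small blocks and fused hardware kernels); only the e2m1 magnitude grid and the amax-based dynamic scale are the point:

```python
# Representable magnitudes of a 4-bit e2m1 float (sign handled separately)
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block):
    """Quantize one block to the e2m1 grid with a dynamic per-block scale."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the largest magnitude to 6.0
    out = []
    for x in block:
        # Snap the scaled magnitude to the nearest representable grid point
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(mag * scale * (1 if x >= 0 else -1))
    return out, scale


deq, scale = quantize_block([0.1, -0.4, 0.75, -1.5])
print(scale)  # 0.25 (amax 1.5 mapped onto grid point 6.0)
print(deq)    # [0.125, -0.375, 0.75, -1.5]
```

Because the scale is recomputed from each block at inference time, no pre-recorded activation statistics are needed — which is why the inter-layer scaling mismatch that breaks static calibration does not arise here.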
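As a rough sizing check for the two-GPU launch above: under tensor parallelism the 64 GB of NVFP4 weights are sharded roughly evenly, so each 48 GB card holds about 32 GB of weights plus KV cache and activations. The even-split assumption ignores replicated embeddings and runtime overhead:

```python
def per_gpu_weight_gb(model_gb: float, tensor_parallel_size: int) -> float:
    """Approximate weight memory per GPU under an even tensor-parallel shard."""
    return model_gb / tensor_parallel_size


# 64 GB checkpoint (from the model card) split across 2 GPUs
print(per_gpu_weight_gb(64, 2))  # 32.0 -> ~16 GB headroom on a 48 GB card
```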