Update README.md

README.md CHANGED

@@ -15,7 +15,7 @@ pipeline_tag: image-text-to-text
 
 # GLM-4.6V-NVFP4
 
-NVFP4 (4-bit floating point) quantized version of [zai-org/GLM-4.6V](https://huggingface.co/zai-org/GLM-4.6V)
+NVFP4 (4-bit floating point) quantized version of [zai-org/GLM-4.6V](https://huggingface.co/zai-org/GLM-4.6V).
 
 ## Model Details
 
@@ -40,24 +40,8 @@ NVFP4 (4-bit floating point) quantized version of [zai-org/GLM-4.6V](https://hug
 | Social Sciences | 83.62% | 81.90% | -1.72% |
 | Other | 80.98% | 78.37% | -2.61% |
 
-### Performance Comparison
-
-| Metric | BF16 | NVFP4 | Improvement |
-|--------|------|-------|-------------|
-| Model Size | 216 GB | 64 GB | **3.4x smaller** |
-| Min. VRAM | 192+ GB | 64 GB | **3x less** |
-| Generation Speed | 4 tok/s* | 78 tok/s | **19.5x faster** |
-| MMLU Accuracy | 76.01% | 73.56% | -2.45% |
-
-*BF16 tested with CPU offload due to memory constraints
-
 ## Usage with vLLM
 
-### Requirements
-- vLLM 0.13.0+
-- NVIDIA GPU with 64+ GB VRAM (RTX 6000, A100, H100, etc.)
-- CUDA 12.0+
-
 ### Launch Command
 
 ```bash
@@ -74,7 +58,7 @@ python -m vllm.entrypoints.openai.api_server \
     --model GadflyII/GLM-4.6V-NVFP4 \
     --tensor-parallel-size 1 \
     --trust-remote-code \
-    --max-model-len
+    --max-model-len 131072 \
     --port 8000
 
 # Two GPUs (for 48GB cards)
@@ -82,7 +66,7 @@ python -m vllm.entrypoints.openai.api_server \
    --model GadflyII/GLM-4.6V-NVFP4 \
     --tensor-parallel-size 2 \
     --trust-remote-code \
-    --max-model-len
+    --max-model-len 131072 \
     --port 8000
 ```
 
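Once either launch command above is running, the server exposes the standard OpenAI-compatible `/v1/chat/completions` route. A minimal stdlib client sketch — the helper names, prompt, and local endpoint are illustrative assumptions, not part of this repo:

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-compatible chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(base_url: str, payload: dict) -> str:
    """POST the payload to the vLLM server and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


payload = build_chat_request("GadflyII/GLM-4.6V-NVFP4", "Describe this model.")
# chat("http://localhost:8000", payload) returns the reply once the server is up
```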
|
@@ -95,7 +79,7 @@ model = LLM(
|
|
| 95 |
"GadflyII/GLM-4.6V-NVFP4",
|
| 96 |
tensor_parallel_size=1,
|
| 97 |
trust_remote_code=True,
|
| 98 |
-
max_model_len=
|
| 99 |
)
|
| 100 |
|
| 101 |
# Recommended sampling parameters
|
|
@@ -118,15 +102,13 @@ This model uses **dynamic NVFP4 quantization**:
|
|
| 118 |
- Activations: Dynamically quantized at runtime (`input_global_scale=1.0`, `dynamic=true`)
|
| 119 |
- Vision encoder: Preserved in original precision
|
| 120 |
|
| 121 |
-
### Why Dynamic Quantization?
|
| 122 |
-
|
| 123 |
-
Static calibration for NVFP4 fails due to the alpha scaling chain between layers. When calibrating on the BF16 model, activation magnitudes don't account for the inter-layer scaling that occurs during NVFP4 inference. Dynamic quantization computes scales at runtime, adapting to actual activation values.
|
| 124 |
-
|
| 125 |
## Hardware Tested
|
| 126 |
|
| 127 |
- NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB VRAM)
|
| 128 |
- Single GPU: 78 tok/s generation throughput
|
| 129 |
-
|
|
|
|
|
|
|
| 130 |
|
| 131 |
## License
|
| 132 |
|
|
|
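For intuition, the dynamic scheme described above — scales computed from the actual activation values at runtime rather than from a calibration pass — can be sketched as a toy example. The block size and scale handling here are simplifications (real NVFP4 uses FP8 per-block scales over small blocks and fused hardware kernels); only the e2m1 magnitude grid and the amax-based dynamic scale are the point:

```python
# Representable magnitudes of a 4-bit e2m1 float (sign handled separately)
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block):
    """Quantize one block to the e2m1 grid with a dynamic per-block scale."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the largest magnitude to 6.0
    out = []
    for x in block:
        # Snap the scaled magnitude to the nearest representable grid point
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(mag * scale * (1 if x >= 0 else -1))
    return out, scale


deq, scale = quantize_block([0.1, -0.4, 0.75, -1.5])
print(scale)  # 0.25 (amax 1.5 mapped onto grid point 6.0)
print(deq)    # [0.125, -0.375, 0.75, -1.5]
```

Because the scale is recomputed from each block at inference time, no pre-recorded activation statistics are needed — which is why the inter-layer scaling mismatch that breaks static calibration does not arise here.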
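As a rough sizing check for the two-GPU launch above: under tensor parallelism the 64 GB of NVFP4 weights are sharded roughly evenly, so each 48 GB card holds about 32 GB of weights plus KV cache and activations. The even-split assumption ignores replicated embeddings and runtime overhead:

```python
def per_gpu_weight_gb(model_gb: float, tensor_parallel_size: int) -> float:
    """Approximate weight memory per GPU under an even tensor-parallel shard."""
    return model_gb / tensor_parallel_size


# 64 GB checkpoint (from the model card) split across 2 GPUs
print(per_gpu_weight_gb(64, 2))  # 32.0 -> ~16 GB headroom on a 48 GB card
```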