GadflyII committed on
Commit 9dc7907 · verified · Parent: d895f4f

Update README.md

Files changed (1):
  1. README.md +7 -25
README.md CHANGED
@@ -15,7 +15,7 @@ pipeline_tag: image-text-to-text
 
 # GLM-4.6V-NVFP4
 
-NVFP4 (4-bit floating point) quantized version of [zai-org/GLM-4.6V](https://huggingface.co/zai-org/GLM-4.6V) for efficient inference on NVIDIA GPUs.
+NVFP4 (4-bit floating point) quantized version of [zai-org/GLM-4.6V](https://huggingface.co/zai-org/GLM-4.6V).
 
 ## Model Details
 
@@ -40,24 +40,8 @@ NVFP4 (4-bit floating point) quantized version of [zai-org/GLM-4.6V](https://hug
 | Social Sciences | 83.62% | 81.90% | -1.72% |
 | Other | 80.98% | 78.37% | -2.61% |
 
-### Performance Comparison
-
-| Metric | BF16 | NVFP4 | Improvement |
-|--------|------|-------|-------------|
-| Model Size | 216 GB | 64 GB | **3.4x smaller** |
-| Min. VRAM | 192+ GB | 64 GB | **3x less** |
-| Generation Speed | 4 tok/s* | 78 tok/s | **19.5x faster** |
-| MMLU Accuracy | 76.01% | 73.56% | -2.45% |
-
-*BF16 tested with CPU offload due to memory constraints
-
 ## Usage with vLLM
 
-### Requirements
-- vLLM 0.13.0+
-- NVIDIA GPU with 64+ GB VRAM (RTX 6000, A100, H100, etc.)
-- CUDA 12.0+
-
 ### Launch Command
 
 ```bash
@@ -74,7 +58,7 @@ python -m vllm.entrypoints.openai.api_server \
   --model GadflyII/GLM-4.6V-NVFP4 \
   --tensor-parallel-size 1 \
   --trust-remote-code \
-  --max-model-len 32768 \
+  --max-model-len 131072 \
   --port 8000
 
 # Two GPUs (for 48GB cards)
@@ -82,7 +66,7 @@ python -m vllm.entrypoints.openai.api_server \
   --model GadflyII/GLM-4.6V-NVFP4 \
   --tensor-parallel-size 2 \
   --trust-remote-code \
-  --max-model-len 65536 \
+  --max-model-len 131072 \
   --port 8000
 ```
 
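As a rough sanity check on the 64+ GB VRAM requirement, the arithmetic below assumes the standard NVFP4 layout (two packed 4-bit weights per byte plus one FP8 scale per 16-element block) and the 216 GB BF16 checkpoint size quoted in this README's history; it is a back-of-envelope sketch, not an exact accounting, and the gap to the observed 64 GB is mostly the vision encoder kept in original precision:

```python
# Back-of-envelope NVFP4 weight footprint. Assumptions (not from this
# repo's code): one FP8 (1-byte) scale per 16-element block on top of
# 0.5 bytes per packed 4-bit weight; 216 GB BF16 checkpoint size.
BF16_GB = 216.0                      # 2 bytes per parameter in BF16
params_b = BF16_GB / 2.0             # ~108B parameters
bytes_per_param = 0.5 + 1.0 / 16.0   # packed nibble + amortized block scale
nvfp4_gb = params_b * bytes_per_param
ratio = 2.0 / bytes_per_param

print(f"{nvfp4_gb:.1f} GB quantized, {ratio:.2f}x smaller")
```

This lands near the 64 GB figure and the ~3.4x reduction reported for the checkpoint.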
@@ -95,7 +79,7 @@ model = LLM(
     "GadflyII/GLM-4.6V-NVFP4",
     tensor_parallel_size=1,
     trust_remote_code=True,
-    max_model_len=4096
+    max_model_len=131072
 )
 
 # Recommended sampling parameters
@@ -118,15 +102,13 @@ This model uses **dynamic NVFP4 quantization**:
 - Activations: Dynamically quantized at runtime (`input_global_scale=1.0`, `dynamic=true`)
 - Vision encoder: Preserved in original precision
 
-### Why Dynamic Quantization?
-
-Static calibration for NVFP4 fails due to the alpha scaling chain between layers. When calibrating on the BF16 model, activation magnitudes don't account for the inter-layer scaling that occurs during NVFP4 inference. Dynamic quantization computes scales at runtime, adapting to actual activation values.
-
 ## Hardware Tested
 
 - NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB VRAM)
 - Single GPU: 78 tok/s generation throughput
-- Fits entirely in VRAM with room for 4K context
+
+## If you are having issues running multiple SM120 Blackwell GPUs (RTX 50 series / RTX PRO Blackwell), try my vLLM fork below while the fix awaits a PR back into mainline:
+https://github.com/Gadflyii/vllm/
 
 ## License
 
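The dynamic-quantization bullets in the last hunk can be sketched in a few lines: compute a scale from the runtime activation maximum, snap values to the 4-bit E2M1 grid, and dequantize. This is an illustrative sketch only, not vLLM's actual kernel; the grid is the standard set of E2M1 (FP4) magnitudes, and the per-block scale mimics how NVFP4 maps the runtime maximum onto the largest representable value (6.0):

```python
# Illustrative dynamic 4-bit float quantization sketch (not vLLM's kernel).
# E2M1 (FP4) can represent these non-negative magnitudes; a per-block
# scale computed at runtime maps |max| onto the top grid value, 6.0.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_dynamic(block):
    """Quantize-dequantize one block with a scale derived from the block."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0                  # runtime scale, no calibration needed
    out = []
    for x in block:
        # Round the scaled magnitude to the nearest representable FP4 value.
        mag = min(E2M1_GRID, key=lambda g: abs(g - abs(x) / scale))
        out.append(mag * scale * (1 if x >= 0 else -1))
    return out

acts = [0.03, -1.2, 0.8, 2.4]           # pretend runtime activations
deq = quantize_dynamic(acts)
```

Because the scale tracks the actual runtime maximum, the largest activation is always representable exactly, which is the property the removed "Why Dynamic Quantization?" section argued static calibration loses.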