GadflyII committed
Commit 5fbf254 · verified · 1 Parent(s): 43a0605

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +12 -1
README.md CHANGED
@@ -10,6 +10,7 @@ tags:
 - quantized
 - vllm
 - glm
+- 30b
 library_name: transformers
 pipeline_tag: text-generation
 ---
@@ -20,7 +21,7 @@ This is a **mixed precision NVFP4 quantization** of [zai-org/GLM-4.7-Flash](http
 
 ## Quantization Strategy
 
-Custom quantization and full expert calibration scripts methoed Based on NVIDIA's approach for DeepSeek-V3, this model uses **mixed precision** to preserve accuracy:
+Based on NVIDIA's approach for DeepSeek-V3, this model uses **mixed precision** to preserve accuracy:
 
 | Component | Precision | Rationale |
 |-----------|-----------|-----------|
@@ -38,6 +39,8 @@ Custom quantization and full expert calibration scripts methoed Based on NVIDIA'
 | Compression | 1x | 3.3x | **3.1x** |
 | Accuracy Loss | - | -8.0% | **-1.3%** |
 
+**Key Finding**: Uniform FP4 quantization causes 8% accuracy loss on this model due to the sensitive MLA attention layers. Mixed precision reduces this to only 1.3% while maintaining 3.1x compression.
+
 ## Usage
 
 ### Requirements
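To make the strategy this commit describes concrete, here is a minimal sketch of how such a mixed-precision NVFP4 scheme can be expressed with llmcompressor. It assumes a recent llmcompressor build with an `NVFP4` scheme; the regex patterns and calibration plumbing below are illustrative guesses, not the author's actual scripts.

```python
# Sketch only: mixed-precision NVFP4 with the MLA attention kept out of FP4.
# Assumes a recent llmcompressor with NVFP4 support; patterns are illustrative.
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",  # quantize Linear weights; the MoE experts dominate the parameter count
    scheme="NVFP4",    # FP4 weights with FP8 block scales
    ignore=[
        "lm_head",
        # keep the sensitive MLA projections in higher precision
        "re:.*q_a_proj.*",
        "re:.*q_b_proj.*",
        "re:.*kv_a_proj.*",
        "re:.*kv_b_proj.*",
    ],
)

oneshot(
    model="zai-org/GLM-4.7-Flash",
    recipe=recipe,
    # the card reports 128 samples from neuralmagic/calibration
    dataset=load_dataset("neuralmagic/calibration", "LLM", split="train").select(range(128)),
    num_calibration_samples=128,
    output_dir="GLM-4.7-Flash-NVFP4",
)
```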
@@ -98,6 +101,14 @@ vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
 - **Calibration**: 128 samples from neuralmagic/calibration dataset
 - **Full Expert Calibration**: All 64 experts calibrated per sample
 
+## Why Mixed Precision?
+
+GLM-4.7-Flash uses **MLA (Multi-head Latent Attention)** from DeepSeek-V2, which compresses queries and key-values through low-rank projections:
+
+- `q_a_proj`, `q_b_proj` - Query compression
+- `kv_a_proj`, `kv_b_proj` - Key-Value compression
+
+These compressed representations are highly sensitive to quantization noise. NVIDIA's DeepSeek-V3 NVFP4 release explicitly keeps all attention layers in higher precision for this reason.
 
 ## Evaluation
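The sensitivity argument in the new section can be made concrete with a toy sketch of MLA's two-step query projection. The layer names match the list above; the dimensions are invented for illustration and are not GLM-4.7-Flash's real sizes.

```python
# Toy illustration of MLA's low-rank query path; dimensions are made up.
import torch
import torch.nn as nn

hidden_size, q_lora_rank, q_out_dim = 4096, 1536, 6144
q_a_proj = nn.Linear(hidden_size, q_lora_rank, bias=False)  # compress into a narrow latent
q_b_proj = nn.Linear(q_lora_rank, q_out_dim, bias=False)    # expand latent into per-head queries

x = torch.randn(1, hidden_size)
q = q_b_proj(q_a_proj(x))
# Any FP4 rounding error introduced in the narrow latent is amplified when
# q_b_proj re-expands it, which is why these layers stay in higher precision.
```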
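As a sanity check on the uploaded artifact, the exclusions should be visible in the checkpoint itself. A small sketch, assuming the standard compressed-tensors `quantization_config` layout in `config.json`:

```python
# Sketch of a sanity check: the checkpoint's quantization_config (in the
# compressed-tensors format) should list the modules excluded from FP4.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download("GadflyII/GLM-4.7-Flash-NVFP4", "config.json")
with open(path) as f:
    qcfg = json.load(f).get("quantization_config", {})
print(qcfg.get("ignore", []))  # expect lm_head and the MLA projection patterns
```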