GadflyII committed (verified) · Commit 176f59f · Parent: 70a0ab1

Upload README.md with huggingface_hub

Files changed (1): README.md (+14 -3)
README.md CHANGED
@@ -16,11 +16,11 @@ pipeline_tag: text-generation
 
  # GLM-4.7-Flash NVFP4 (Mixed Precision)
 
- This is a **mixed precision NVFP4 quantization** of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a 31B parameter Mixture-of-Experts model.
+ This is a **mixed precision NVFP4 quantization** of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a 30B-A3B (30B total, 3B active) Mixture-of-Experts model.
 
  ## Quantization Strategy
 
- Qunatizied with custom scripts / Calibration based on NVIDIA's approach for DeepSeek-V3, this model uses **mixed precision** to preserve accuracy:
+ Based on NVIDIA's approach for DeepSeek-V3, this model uses **mixed precision** to preserve accuracy:
 
  | Component | Precision | Rationale |
  |-----------|-----------|-----------|
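
The card credits custom quantization scripts; a minimal sketch of what an equivalent mixed-precision recipe could look like with llm-compressor's NVFP4 scheme follows. The `ignore` patterns and calibration arguments are illustrative assumptions, not the author's actual pipeline.

```python
# Hypothetical sketch (not the author's scripts): mixed-precision NVFP4 where
# the sensitive attention projections are excluded from FP4 quantization.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",                 # FP4 values with FP8 per-block scales
    ignore=[
        "lm_head",
        "re:.*self_attn.*",         # assumed pattern: keep MLA attention in BF16
    ],
)

oneshot(
    model="zai-org/GLM-4.7-Flash",
    recipe=recipe,
    dataset="neuralmagic/calibration",  # calibration set named in the card
    num_calibration_samples=128,
    max_seq_length=2048,                # assumed; saving of the result omitted
)
```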
@@ -38,6 +38,8 @@ Qunatizied with custom scripts / Calibration based on NVIDIA's approach for Deep
  | Compression | 1x | 3.3x | **3.1x** |
  | Accuracy Loss | - | -8.0% | **-1.3%** |
 
+ **Key Finding**: Uniform FP4 quantization causes 8% accuracy loss on this model due to the sensitive MLA attention layers. Mixed precision reduces this to only 1.3% while maintaining 3.1x compression.
+
  ## Usage
 
  ### Requirements
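
A quick back-of-envelope shows why preserving the attention layers barely dents compression: expert weights dominate an MoE parameter count. The 95/5 split below is an assumed, illustrative figure, not one measured from this checkpoint.

```python
# Back-of-envelope: blended compression when a small slice stays high precision.
# The 95/5 split is an illustrative assumption, not measured from the weights.
BITS_BF16 = 16
BITS_NVFP4 = 4.5          # ~4 bits/value plus an FP8 scale per 16-value block

frac_fp4 = 0.95           # assumed: experts and other FP4-quantized weights
frac_bf16 = 1 - frac_fp4  # assumed: MLA attention and other preserved layers

avg_bits = frac_fp4 * BITS_NVFP4 + frac_bf16 * BITS_BF16
print(f"blended compression ~ {BITS_BF16 / avg_bits:.1f}x")  # ~3.2x here
```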
@@ -84,7 +86,7 @@ vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
 
  - **Base Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash)
  - **Architecture**: `Glm4MoeLiteForCausalLM`
- - **Parameters**: 31B total (5.5B active per token)
+ - **Parameters**: 30B total, 3B active per token (30B-A3B)
  - **MoE Configuration**: 64 routed experts, 4 active, 1 shared expert
  - **Layers**: 47
  - **Context Length**: 202,752 tokens (max)
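
Once the `vllm serve GadflyII/GLM-4.7-Flash-NVFP4` command from the usage section is running, the endpoint speaks the OpenAI API. A minimal client call, assuming vLLM's default host and port, looks like:

```python
# Minimal client sketch against the vLLM OpenAI-compatible server started with
# `vllm serve GadflyII/GLM-4.7-Flash-NVFP4 ...` (default host/port assumed).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="GadflyII/GLM-4.7-Flash-NVFP4",
    messages=[{"role": "user", "content": "Summarize NVFP4 in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```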
@@ -98,6 +100,15 @@ vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
  - **Calibration**: 128 samples from neuralmagic/calibration dataset
  - **Full Expert Calibration**: All 64 experts calibrated per sample
 
+ ## Why Mixed Precision?
+
+ GLM-4.7-Flash uses **MLA (Multi-head Latent Attention)** from DeepSeek-V2, which compresses queries and key-values through low-rank projections:
+
+ - `q_a_proj`, `q_b_proj` - Query compression
+ - `kv_a_proj`, `kv_b_proj` - Key-Value compression
+
+ These compressed representations are highly sensitive to quantization noise. NVIDIA's DeepSeek-V3 NVFP4 release explicitly keeps all attention layers in higher precision for this reason.
+
  ## Evaluation
 
  MMLU-Pro results by category:
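
The new "Why Mixed Precision?" section can be made concrete with a toy version of MLA's query path. The dimensions below are illustrative, not GLM-4.7-Flash's actual configuration: the point is that the narrow `q_a_proj` bottleneck concentrates information, so quantization noise there is amplified when `q_b_proj` expands it back out.

```python
# Toy MLA-style low-rank query projection (illustrative dims, not the real
# config; the real block also applies a norm and RoPE handling, omitted here).
import torch
import torch.nn as nn

hidden, q_latent, num_heads, head_dim = 2048, 768, 16, 128

q_a_proj = nn.Linear(hidden, q_latent, bias=False)                # compress: 2048 -> 768
q_b_proj = nn.Linear(q_latent, num_heads * head_dim, bias=False)  # expand: 768 -> 2048

x = torch.randn(1, 8, hidden)      # (batch, seq, hidden)
q = q_b_proj(q_a_proj(x))          # queries reconstructed from the narrow latent
print(q.shape)                     # torch.Size([1, 8, 2048])
```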