Upload README.md with huggingface_hub

README.md CHANGED

@@ -16,11 +16,11 @@ pipeline_tag: text-generation

# GLM-4.7-Flash NVFP4 (Mixed Precision)

This is a **mixed precision NVFP4 quantization** of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a 30B-A3B (30B total, 3B active) Mixture-of-Experts model.

## Quantization Strategy

Based on NVIDIA's approach for DeepSeek-V3, this model uses **mixed precision** to preserve accuracy:

| Component | Precision | Rationale |
|-----------|-----------|-----------|

@@ -38,6 +38,8 @@ Quantized with custom scripts / Calibration based on NVIDIA's approach for DeepSeek-V3

| Compression | 1x | 3.3x | **3.1x** |
| Accuracy Loss | - | -8.0% | **-1.3%** |

**Key Finding**: Uniform FP4 quantization causes 8% accuracy loss on this model due to the sensitive MLA attention layers. Mixed precision reduces this to only 1.3% while maintaining 3.1x compression.
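
The quantization itself used custom scripts, but the layer split above maps naturally onto an `llmcompressor`-style recipe. Below is a minimal sketch of that idea, not the actual script: the `NVFP4` scheme string, the `oneshot` arguments, and the `re:` ignore patterns are all assumptions.

```python
# Hedged sketch: NVFP4 on the expert MLPs, sensitive layers left unquantized.
# Layer-name patterns are assumptions about this checkpoint's module names.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",              # FP4 weights with per-block scales
    ignore=[
        "lm_head",
        "re:.*self_attn.*",      # keep the MLA attention projections in BF16
        "re:.*mlp.gate$",        # keep the MoE router in BF16 (assumed name)
    ],
)

oneshot(
    model="zai-org/GLM-4.7-Flash",
    dataset="neuralmagic/calibration",  # the calibration set named in this card
    recipe=recipe,
    num_calibration_samples=128,
)
```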

## Usage

### Requirements

@@ -84,7 +86,7 @@ vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \

- **Base Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash)
- **Architecture**: `Glm4MoeLiteForCausalLM`
- **Parameters**: 30B total, 3B active per token (30B-A3B)
- **MoE Configuration**: 64 routed experts, 4 active, 1 shared expert
- **Layers**: 47
- **Context Length**: 202,752 tokens (max)

@@ -98,6 +100,15 @@ vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \

- **Calibration**: 128 samples from the neuralmagic/calibration dataset
- **Full Expert Calibration**: All 64 experts calibrated per sample (see the sketch below)
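
With top-4 routing, a naive calibration pass would rarely activate most of the 64 experts, leaving their quantization scales fit on almost no data. One way to force full coverage, and an assumption about what the custom scripts actually do, is to widen the router's top-k during calibration only. The helper below is hypothetical; the `top_k` attribute name varies by implementation.

```python
import contextlib

@contextlib.contextmanager
def calibrate_all_experts(model, num_experts=64):
    """Temporarily route every calibration token through every expert."""
    # Hypothetical: assumes each MoE router stores its top-k in a `top_k` attribute.
    gates = [m for m in model.modules() if hasattr(m, "top_k")]
    saved = [g.top_k for g in gates]
    for g in gates:
        g.top_k = num_experts   # every expert now sees every calibration token
    try:
        yield model
    finally:
        for g, k in zip(gates, saved):
            g.top_k = k         # restore top-4 routing for inference

# with calibrate_all_experts(model):
#     ...run the 128 calibration samples here...
```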

## Why Mixed Precision?

GLM-4.7-Flash uses **MLA (Multi-head Latent Attention)** from DeepSeek-V2, which compresses queries and key-values through low-rank projections:

- `q_a_proj`, `q_b_proj` - Query compression
- `kv_a_proj`, `kv_b_proj` - Key-Value compression

These compressed representations are highly sensitive to quantization noise. NVIDIA's DeepSeek-V3 NVFP4 release explicitly keeps all attention layers in higher precision for this reason.
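
For intuition, here is a toy version of the low-rank KV path. The dimensions are illustrative, not GLM-4.7-Flash's actual sizes; the point is that every key and value is reconstructed from a narrow latent, so noise there touches all heads at once.

```python
import torch
import torch.nn as nn

hidden, kv_rank = 2048, 512                         # illustrative sizes only

kv_a_proj = nn.Linear(hidden, kv_rank, bias=False)  # compress: hidden -> latent
kv_b_proj = nn.Linear(kv_rank, hidden, bias=False)  # expand: latent -> K/V space

x = torch.randn(1, 8, hidden)                       # (batch, seq, hidden)
latent = kv_a_proj(x)                               # narrow bottleneck
k_and_v = kv_b_proj(latent)                         # all heads read from this

# Quantization error injected into the small latent (or into these two
# projections' weights) is amplified across every downstream head and token,
# which is why this card keeps attention in higher precision.
```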

## Evaluation

MMLU-Pro results by category: