Upload README.md with huggingface_hub
README.md
CHANGED
@@ -10,6 +10,7 @@ tags:
 - quantized
 - vllm
 - glm
+- 30b
 library_name: transformers
 pipeline_tag: text-generation
 ---
@@ -20,7 +21,7 @@ This is a **mixed precision NVFP4 quantization** of [zai-org/GLM-4.7-Flash](http
 
 ## Quantization Strategy
 
-
+Based on NVIDIA's approach for DeepSeek-V3, this model uses **mixed precision** to preserve accuracy:
 
 | Component | Precision | Rationale |
 |-----------|-----------|-----------|
@@ -38,6 +39,8 @@ Custom quantization and full expert calibration scripts methoed Based on NVIDIA'
 | Compression | 1x | 3.3x | **3.1x** |
 | Accuracy Loss | - | -8.0% | **-1.3%** |
 
+**Key Finding**: Uniform FP4 quantization causes 8% accuracy loss on this model due to the sensitive MLA attention layers. Mixed precision reduces this to only 1.3% while maintaining 3.1x compression.
+
 ## Usage
 
 ### Requirements
@@ -98,6 +101,14 @@ vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
 - **Calibration**: 128 samples from neuralmagic/calibration dataset
 - **Full Expert Calibration**: All 64 experts calibrated per sample
 
+## Why Mixed Precision?
+
+GLM-4.7-Flash uses **MLA (Multi-head Latent Attention)** from DeepSeek-V2, which compresses queries and key-values through low-rank projections:
+
+- `q_a_proj`, `q_b_proj` - Query compression
+- `kv_a_proj`, `kv_b_proj` - Key-Value compression
+
+These compressed representations are highly sensitive to quantization noise. NVIDIA's DeepSeek-V3 NVFP4 release explicitly keeps all attention layers in higher precision for this reason.
 
 ## Evaluation
 
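The low-rank MLA projections named in the "Why Mixed Precision?" section can be sketched with toy dimensions to show the bottleneck the diff describes. All sizes here are hypothetical illustration values, not GLM-4.7-Flash's actual configuration:

```python
import numpy as np

# Toy dimensions (hypothetical; the real model's sizes differ).
hidden = 256      # model hidden size
kv_latent = 32    # low-rank KV latent dimension
n_heads = 4
head_dim = 64

rng = np.random.default_rng(0)
x = rng.standard_normal(hidden)

# kv_a_proj: compress the hidden state into a small latent vector.
kv_a = rng.standard_normal((kv_latent, hidden)) / np.sqrt(hidden)
# kv_b_proj: expand the latent back out to per-head key/value features.
kv_b = rng.standard_normal((n_heads * head_dim, kv_latent)) / np.sqrt(kv_latent)

latent = kv_a @ x      # shape (32,)  -- the compressed representation
kv = kv_b @ latent     # shape (256,) -- reconstructed K/V features

assert latent.shape == (kv_latent,)
assert kv.shape == (n_heads * head_dim,)
```

Every key/value feature is reconstructed from the narrow 32-dim latent, so any quantization noise injected into `kv_a_proj`/`kv_b_proj` is amplified across all heads, which is the intuition behind keeping these layers in higher precision.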
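To make the "Key Finding" concrete, here is a minimal fake-quantizer over the FP4 E2M1 value grid with one scale per block. This is a simplification of NVFP4 (which stores an FP8 scale per 16-element block); it only illustrates the kind of rounding noise the sensitive layers would see:

```python
import numpy as np

# Magnitudes representable by FP4 E2M1 (the NVFP4 element format).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(x, block=16):
    """Fake-quantize x to FP4 with one float scale per block
    (simplified: real NVFP4 uses an FP8 scale per 16-element block)."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale[scale == 0] = 1.0
    xs = x / scale
    # Snap each scaled value to the nearest representable magnitude.
    idx = np.abs(np.abs(xs)[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(xs) * FP4_GRID[idx] * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(4096)
err = np.linalg.norm(w - quantize_fp4(w)) / np.linalg.norm(w)
print(f"relative RMS error: {err:.3f}")
```

For weights this rounding error averages out over many output channels; for the narrow MLA latent projections the same relative noise lands on a bottleneck every token passes through, which matches the card's observation that uniform FP4 costs 8% accuracy while sparing attention costs only 1.3%.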
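The compression column in the comparison table can be sanity-checked with simple bit accounting. The bit costs and the 5% fraction below are illustrative assumptions, not the model card's measured layout; the card's 3.3x/3.1x figures include overheads this sketch ignores:

```python
# Hypothetical bit costs: 4-bit FP4 elements plus one 8-bit scale
# shared by each 16-element block (illustrative, not exact NVFP4).
bits_fp4 = 4 + 8 / 16          # ~4.5 bits per weight
bits_bf16 = 16

def compression(frac_bf16):
    """Overall compression vs BF16 when frac_bf16 of weights stay in BF16."""
    mixed = frac_bf16 * bits_bf16 + (1 - frac_bf16) * bits_fp4
    return bits_bf16 / mixed

print(f"uniform FP4:        {compression(0.0):.2f}x")   # 3.56x upper bound
print(f"5% kept in BF16:    {compression(0.05):.2f}x")  # 3.15x
```

The point of the arithmetic: keeping even a small, carefully chosen fraction of weights (here a hypothetical 5%, standing in for the MLA attention layers) in BF16 costs only a few percent of compression, which is the trade the table reports (3.3x down to 3.1x for a 6.7-point accuracy recovery).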