---
license: mit
language:
- en
- zh
base_model: zai-org/GLM-4.7-Flash
pipeline_tag: text-generation
tags:
- quantized
- Mixture of Experts
- 4-bit
- GPTQ
- MMFP4
- glm
- metal-marlin
- moe
library_name: transformers
arxiv: "2508.06471"
---
# GLM-4.7-Flash-Marlin-MMFP4

**MMFP4-quantized GLM-4.7-Flash**: a 30B-A3B MoE model compressed to **4 bits per weight** using GPTQ with actorder and Metal Marlin's E2M1 FP4 format.

| Metric | Value |
|--------|-------|
| **Effective bits** | 4.0 bpw |
| **Compression** | 4× vs FP16 |
| **Model size** | ~16 GB (vs ~60 GB FP16) |
| **Parameters** | 29.3B |
| **Format** | HuggingFace sharded safetensors |
## Model Description
This is a quantized version of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a 30B-class MoE model designed to balance performance and efficiency.
GLM-4.7-Flash features:
- **30B-A3B MoE architecture** (64 routed experts plus a shared expert, 2-4 active per token; a routing sketch follows this list)
- **Multi-head Latent Attention (MLA)** for 8× KV cache compression
- **State-of-the-art reasoning** (91.6% on AIME 2025, 59.2% on SWE-bench Verified)
- **Bilingual** (English + Chinese)
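As a rough illustration of top-k expert routing (the function name, shapes, and k value here are assumptions for exposition, not GLM's actual router code):

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden, router_weight, top_k=4):
    """Hypothetical top-k MoE gating sketch; GLM's production router may differ.

    hidden:        [tokens, d_model] activations
    router_weight: [n_experts, d_model] (kept in FP16 in this checkpoint)
    """
    logits = hidden @ router_weight.T               # [tokens, n_experts]
    probs = F.softmax(logits, dim=-1)
    gates, experts = probs.topk(top_k, dim=-1)      # keep the k highest-scoring experts
    gates = gates / gates.sum(-1, keepdim=True)     # renormalize over the selected k
    return experts, gates                           # the shared expert runs on every token

experts, gates = route_tokens(torch.randn(4, 2048), torch.randn(64, 2048))
```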
## Quantization Details
Quantized using **MR-GPTQ** (Metal Marlin GPTQ) with CUDA acceleration:
### Method
- **Format**: MMFP4 (E2M1 FP4), Metal Marlin's native FP4 format (a per-group quantization sketch follows this list)
- **Quantization**: GPTQ with actorder (activation-order column permutation)
- **Hessian calibration**: Pre-computed Hessians for attention layers
- **Expert quantization**: Identity Hessian with actorder (no calibration data for MoE experts)
- **Group size**: 128
- **Hardware**: NVIDIA RTX 3090 Ti (CUDA-accelerated Cholesky factorization)
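For intuition about the E2M1 grid, here is a minimal per-group round-to-nearest sketch (illustrative only: the real MR-GPTQ pipeline layers actorder permutation and Hessian-driven error feedback on top of this grid):

```python
import torch

# E2M1 FP4 magnitudes: 1 sign bit, 2 exponent bits, 1 mantissa bit
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_group_fp4(w: torch.Tensor):
    """Round-to-nearest FP4 for one group of 128 weights (no GPTQ correction)."""
    grid = torch.cat([-E2M1_GRID.flip(0), E2M1_GRID])  # signed code values
    scale = w.abs().max() / 6.0                        # map the group max onto FP4 max (6.0)
    idx = (w / scale).unsqueeze(-1).sub(grid).abs().argmin(-1)
    return grid[idx] * scale, scale                    # dequantized weights, per-group scale

w = torch.randn(128)
w_q, scale = quantize_group_fp4(w)
print((w - w_q).abs().mean())  # RTN error for this group
```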
### Quantization Statistics
| Component | Bit Width | Notes |
|-----------|-----------|-------|
| Embeddings | FP16 | Full precision |
| LM Head | FP16 | Full precision |
| Attention (q/k/v/o) | 4-bit | GPTQ with Hessians |
| MoE Experts (64×) | 4-bit | GPTQ with actorder |
| Layer Norms | FP16 | Full precision |
| Router Weights | FP16 | Full precision |
- **Total tensors**: 19,066
- **Shards**: 48 safetensors files
- **Quantization time**: ~20 minutes (RTX 3090 Ti)
## Files
```
GLM-4.7-Flash-Marlin-MMFP4/
├── model-00001-of-00048.safetensors   # Layer 0 (embeddings)
├── model-00002-of-00048.safetensors   # Layer 1
├── ...
├── model-00048-of-00048.safetensors   # Layer 47 + lm_head
├── model.safetensors.index.json       # Weight map
├── config.json                        # Model config
├── generation_config.json
├── tokenizer.json                     # Tokenizer
└── tokenizer_config.json
```
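To inspect the shards without loading the whole model, the standard safetensors API works on individual files (paths assume the current directory is the downloaded snapshot):

```python
import json
from safetensors import safe_open

# The index maps every tensor name to the shard file that stores it
with open("model.safetensors.index.json") as f:
    weight_map = json.load(f)["weight_map"]
for name, shard_file in list(weight_map.items())[:5]:
    print(name, "->", shard_file)

# Open one shard lazily and list its tensors and shapes
with safe_open("model-00001-of-00048.safetensors", framework="pt") as shard:
    for name in shard.keys():
        print(name, shard.get_slice(name).get_shape())
```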
## Usage
### With Metal Marlin (Apple Silicon)
```python
from metal_marlin import MarlinForCausalLM
from transformers import AutoTokenizer

model = MarlinForCausalLM.from_pretrained(
    "RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4",
    device="mps",
)

# The tokenizer is unchanged by quantization; load it from the original repo
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")

prompt = "<|user|>\nExplain quantum computing in simple terms.\n<|assistant|>\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("mps")

output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
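If the tokenizer ships a chat template (an assumption here, not verified against this tokenizer config), `apply_chat_template` avoids hand-writing control tokens:

```python
# Assumes tokenizer_config.json includes a chat template
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("mps")
output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
```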
### Tensor Format
Each quantized weight tensor has corresponding scale factors:
- `{name}.weight`: Packed FP4 weights (uint8)
- `{name}.scales`: FP16 per-group scales (group_size=128)
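A hypothetical dequantization sketch, assuming low-nibble-first packing, sign-magnitude FP4 codes (sign in the nibble's MSB), and a flat row-major layout; the actual Marlin kernel tiling may interleave differently:

```python
import torch

# E2M1 decode table indexed by the 4-bit code (sign in the MSB, assumed)
E2M1 = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def dequant_fp4(packed: torch.Tensor, scales: torch.Tensor, group_size: int = 128):
    """Unpack two FP4 codes per uint8 and apply per-group scales."""
    lo = (packed & 0x0F).long()                      # first element of each byte (assumed)
    hi = (packed >> 4).long()                        # second element of each byte
    codes = torch.stack([lo, hi], dim=-1).flatten()  # restore original element order
    vals = E2M1[codes]                               # FP4 code -> real value
    return (vals.view(-1, group_size) * scales.view(-1, 1).float()).flatten()

packed = torch.randint(0, 256, (128,), dtype=torch.uint8)  # 256 packed FP4 values
scales = torch.ones(2, dtype=torch.float16)                # one scale per 128 weights
weights = dequant_fp4(packed, scales)
```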
## Hardware Requirements
| Device | Memory | Notes |
|--------|--------|-------|
| Apple M4 Max | 36 GB+ | Via Metal Marlin |
| Apple M2 Ultra | 36 GB+ | Via Metal Marlin |
## Benchmarks
### Original Model Performance (from Z.AI)
| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B | GPT-OSS-20B |
|-----------|---------------|---------------|-------------|
| AIME 2025 | 91.6 | 85.0 | 91.7 |
| GPQA | **75.2** | 73.4 | 71.5 |
| SWE-bench Verified | **59.2** | 22.0 | 34.0 |
| τ²-Bench | **79.5** | 49.0 | 47.7 |
| BrowseComp | **42.8** | 2.29 | 28.3 |
### Quantized Model Notes
- GPTQ with actorder minimizes quality loss vs RTN
- Expected degradation: ~1-2% on benchmarks vs FP16
- E2M1 FP4 format optimized for Metal Performance Shaders
## Comparison with Trellis Quant
| Model | Format | Size | Bits | Method |
|-------|--------|------|------|--------|
| [GLM-4.7-Flash-Trellis-MM](https://huggingface.co/RESMP-DEV/GLM-4.7-Flash-Trellis-MM) | Trellis | 14 GB | 3.78 bpw | EXL3-style mixed precision |
| **This model** | MMFP4 | 16 GB | 4.0 bpw | GPTQ + actorder |
Choose **Trellis** for the smaller footprint, or **MMFP4** for a simpler tensor format and potentially better compatibility.
## Limitations
- **Metal Marlin required** for optimal inference on Apple Silicon
- **No speculative decoding** yet
- **Quality loss**: ~1-2% on benchmarks vs FP16 (typical for 4-bit quantization)
## Credits
- **Original model**: [Z.AI / GLM Team](https://huggingface.co/zai-org/GLM-4.7-Flash)
- **Quantization method**: GPTQ with actorder
- **Quantization toolkit**: [Metal Marlin](https://github.com/RESMP-DEV/metal-marlin)
## Citation
If you use this model, please cite the original GLM-4.5 paper:
```bibtex
@misc{glm2025glm45,
  title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models},
  author={GLM Team and Aohan Zeng and Xin Lv and others},
  year={2025},
  eprint={2508.06471},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.06471},
}
```
## License
This quantized model inherits the **MIT License** from the original GLM-4.7-Flash model.