---
license: mit
language:
- en
- zh
base_model: zai-org/GLM-4.7-Flash
pipeline_tag: text-generation
tags:
- quantized
- Mixture of Experts
- 4-bit
- GPTQ
- MMFP4
- glm
- metal-marlin
- moe
library_name: transformers
arxiv: "2508.06471"
---
# GLM-4.7-Flash-Marlin-MMFP4
![](https://raw.githubusercontent.com/zai-org/GLM-4.5/refs/heads/main/resources/logo.svg)
**MMFP4-quantized GLM-4.7-Flash** β€” a 30B-A3B MoE model compressed to **4 bits per weight** using GPTQ with actorder and Metal Marlin's E2M1 FP4 format.
| Metric | Value |
|--------|-------|
| **Effective bits** | 4.0 bpw |
| **Compression** | 4Γ— vs FP16 |
| **Model size** | ~16 GB (vs ~60 GB FP16) |
| **Parameters** | 29.3B |
| **Format** | HuggingFace sharded safetensors |
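The size follows from the bit width: 29.3B weights at 4 bits is roughly 29.3e9 Γ— 0.5 bytes β‰ˆ 14.7 GB, with the per-group FP16 scales (one per 128 weights) plus the FP16 embeddings, norms, and router weights accounting for the remaining ~1.3 GB.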
## Model Description
This is a quantized version of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), which Z.AI positions as the strongest 30B-class model for its balance of performance and efficiency.
GLM-4.7-Flash features:
- **30B-A3B MoE architecture** (64 experts + shared expert, 2-4 active per token)
- **Multi-head Latent Attention (MLA)** for 8Γ— KV cache compression
- **State-of-the-art reasoning** (91.6% on AIME 2025, 59.2% on SWE-bench Verified)
- **Bilingual** (English + Chinese)
## Quantization Details
Quantized using **MR-GPTQ** (Metal Marlin GPTQ) with CUDA acceleration:
### Method
- **Format**: MMFP4 (E2M1 FP4) β€” Metal Marlin's native FP4 format
- **Quantization**: GPTQ with actorder (activation-order column permutation)
- **Hessian calibration**: Pre-computed Hessians for attention layers
- **Expert quantization**: Identity Hessian with actorder (no calibration data for MoE experts)
- **Group size**: 128
- **Hardware**: NVIDIA RTX 3090 Ti (CUDA-accelerated Cholesky factorization)
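For intuition, here is a minimal sketch of the GPTQ-with-actorder loop described above: columns are permuted by descending Hessian diagonal, quantized one at a time to the E2M1 grid with per-group FP16 scales, and each column's rounding error is folded into the not-yet-quantized columns. This is a textbook reconstruction, not the MR-GPTQ implementation; passing an identity Hessian reproduces the expert-layer path.
```python
import torch

# The 15 representable E2M1 FP4 values (sign included)
E2M1_GRID = torch.tensor(
    [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0,
     0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(col, scale):
    """Snap each element of col/scale to the nearest E2M1 grid point."""
    idx = torch.argmin(((col / scale)[:, None] - E2M1_GRID).abs(), dim=1)
    return E2M1_GRID[idx] * scale

def gptq_actorder(W, H, group_size=128, damp=0.01):
    """Quantize W (out_features x in_features) column by column.
    H is the calibration Hessian (X @ X.T over calibration activations);
    an identity H gives the expert-layer path described above."""
    n = W.shape[1]
    H = H + damp * H.diagonal().mean() * torch.eye(n, dtype=H.dtype)
    perm = torch.argsort(H.diagonal(), descending=True)  # actorder
    W, H = W[:, perm].clone(), H[perm][:, perm]
    # Upper-triangular Cholesky factor of H^-1, as in standard GPTQ
    Hinv = torch.linalg.cholesky(torch.linalg.inv(H), upper=True)
    Q = torch.zeros_like(W)
    scale = None
    for i in range(n):
        if i % group_size == 0:  # one FP16 scale per group of 128 columns
            group = W[:, i:i + group_size]
            scale = group.abs().amax(dim=1) / 6.0  # map group max onto +-6
            scale[scale == 0] = 1.0
        q = quantize_fp4(W[:, i], scale)
        Q[:, i] = q
        err = (W[:, i] - q) / Hinv[i, i]
        # Fold this column's rounding error into the remaining columns
        W[:, i + 1:] -= err[:, None] * Hinv[i, i + 1:][None, :]
    return Q[:, torch.argsort(perm)]  # undo the actorder permutation
```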
### Quantization Statistics
| Component | Bit Width | Notes |
|-----------|-----------|-------|
| Embeddings | FP16 | Full precision |
| LM Head | FP16 | Full precision |
| Attention (q/k/v/o) | 4-bit | GPTQ with Hessians |
| MoE Experts (64Γ—) | 4-bit | GPTQ with actorder |
| Layer Norms | FP16 | Full precision |
| Router Weights | FP16 | Full precision |
- **Total tensors**: 19,066
- **Shards**: 48 safetensors files
- **Quantization time**: ~20 minutes (RTX 3090 Ti)
## Files
```
GLM-4.7-Flash-Marlin-MMFP4/
β”œβ”€β”€ model-00001-of-00048.safetensors # Layer 0 (embeddings)
β”œβ”€β”€ model-00002-of-00048.safetensors # Layer 1
β”œβ”€β”€ ...
β”œβ”€β”€ model-00048-of-00048.safetensors # Layer 47 + lm_head
β”œβ”€β”€ model.safetensors.index.json # Weight map
β”œβ”€β”€ config.json # Model config
β”œβ”€β”€ generation_config.json
β”œβ”€β”€ tokenizer.json # Tokenizer
└── tokenizer_config.json
```
## Usage
### With Metal Marlin (Apple Silicon)
```python
from metal_marlin import MarlinForCausalLM
from transformers import AutoTokenizer

# Load the quantized weights onto the Apple GPU (MPS backend)
model = MarlinForCausalLM.from_pretrained(
    "RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4",
    device="mps",
)

# The tokenizer is unchanged by quantization, so load it from the original repo
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")

# GLM chat format: a user turn followed by an empty assistant turn
prompt = "<|user|>\nExplain quantum computing in simple terms.\n<|assistant|>\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("mps")

output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
### Tensor Format
Each quantized weight tensor has corresponding scale factors:
- `{name}.weight`: Packed FP4 weights (uint8)
- `{name}.scales`: FP16 per-group scales (group_size=128)
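For illustration, a minimal dequantization sketch is shown below. The nibble order (low nibble first), the E2M1 bit layout, and the scale shape are assumptions made for this example, not the verified Metal Marlin packing; the tensor name is hypothetical.
```python
import torch
from safetensors.torch import load_file

# E2M1 decode table indexed by the 4-bit code; the bit layout
# (sign in the high bit, negatives in codes 8-15) is an assumption here
E2M1 = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def dequantize_mmfp4(packed, scales, group_size=128):
    """packed: uint8, two FP4 codes per byte (low nibble first, assumed).
    scales: FP16, shape (rows, cols // group_size), assumed."""
    lo, hi = packed & 0x0F, packed >> 4
    codes = torch.stack([lo, hi], dim=-1).flatten(-2)  # (rows, cols)
    w = E2M1[codes.long()]
    rows, cols = w.shape
    w = w.view(rows, cols // group_size, group_size)
    return (w * scales.float().unsqueeze(-1)).reshape(rows, cols)

shard = load_file("model-00002-of-00048.safetensors")
name = "model.layers.1.self_attn.q_proj"  # hypothetical tensor name
w = dequantize_mmfp4(shard[f"{name}.weight"], shard[f"{name}.scales"])
```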
## Hardware Requirements
| Device | Memory | Notes |
|--------|--------|-------|
| Apple M4 Max | 36 GB+ | Via Metal Marlin |
| Apple M2 Ultra | 36 GB+ | Via Metal Marlin |
## Benchmarks
### Original Model Performance (from Z.AI)
| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B | GPT-OSS-20B |
|-----------|---------------|---------------|-------------|
| AIME 2025 | 91.6 | 85.0 | **91.7** |
| GPQA | **75.2** | 73.4 | 71.5 |
| SWE-bench Verified | **59.2** | 22.0 | 34.0 |
| τ²-Bench | **79.5** | 49.0 | 47.7 |
| BrowseComp | **42.8** | 2.29 | 28.3 |
### Quantized Model Notes
- GPTQ with actorder reduces quality loss relative to round-to-nearest (RTN) quantization
- Expected degradation: ~1-2% on benchmarks vs FP16
- E2M1 FP4 format optimized for Metal Performance Shaders
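To sanity-check that estimate on your own hardware, a quick perplexity comparison is sketched below. It assumes `MarlinForCausalLM` exposes a transformers-style forward that returns `.logits`, which may not match the actual Metal Marlin API; adjust to the real interface.
```python
import torch
import torch.nn.functional as F
from metal_marlin import MarlinForCausalLM
from transformers import AutoTokenizer

model = MarlinForCausalLM.from_pretrained(
    "RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4", device="mps")
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")

text = open("heldout.txt").read()  # any held-out text, e.g. WikiText
ids = tokenizer(text, return_tensors="pt").input_ids[:, :2048].to("mps")
with torch.no_grad():
    logits = model(ids).logits     # assumed transformers-style output
# Each position predicts the next token
loss = F.cross_entropy(logits[0, :-1].float(), ids[0, 1:])
print(f"perplexity: {loss.exp().item():.2f}")  # compare against an FP16 run
```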
## Comparison with Trellis Quant
| Model | Format | Size | Bits | Method |
|-------|--------|------|------|--------|
| [GLM-4.7-Flash-Trellis-MM](https://huggingface.co/RESMP-DEV/GLM-4.7-Flash-Trellis-MM) | Trellis | 14 GB | 3.78 bpw | EXL3-style mixed precision |
| **This model** | MMFP4 | 16 GB | 4.0 bpw | GPTQ + actorder |
Choose **Trellis** for the smaller footprint, **MMFP4** for a simpler tensor format and potentially broader compatibility.
## Limitations
- **Metal Marlin required** for optimal inference on Apple Silicon
- **No speculative decoding** yet
- **Quality loss**: ~1-2% on benchmarks vs FP16 (typical for 4-bit quantization)
## Credits
- **Original model**: [Z.AI / GLM Team](https://huggingface.co/zai-org/GLM-4.7-Flash)
- **Quantization method**: GPTQ with actorder
- **Quantization toolkit**: [Metal Marlin](https://github.com/RESMP-DEV/metal-marlin)
## Citation
If you use this model, please cite the original GLM-4.5 paper:
```bibtex
@misc{glm2025glm45,
  title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models},
  author={GLM Team and Aohan Zeng and Xin Lv and others},
  year={2025},
  eprint={2508.06471},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.06471},
}
```
## License
This quantized model inherits the **MIT License** from the original GLM-4.7-Flash model.