---
license: mit
language:
- en
- zh
base_model: zai-org/GLM-4.7-Flash
pipeline_tag: text-generation
tags:
- quantized
- Mixture of Experts
- 4-bit
- GPTQ
- MMFP4
- glm
- metal-marlin
- moe
library_name: transformers
arxiv: "2508.06471"
---

# GLM-4.7-Flash-Marlin-MMFP4

**MMFP4-quantized GLM-4.7-Flash**: a 30B-A3B MoE model compressed to **4 bits per weight** using GPTQ with actorder and Metal Marlin's E2M1 FP4 format.

| Metric | Value |
|--------|-------|
| **Effective bits** | 4.0 bpw |
| **Compression** | 4× vs FP16 |
| **Model size** | ~16 GB (vs ~60 GB FP16) |
| **Parameters** | 29.3B |
| **Format** | HuggingFace sharded safetensors |

## Model Description

This is a quantized version of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), which Z.AI positions as the strongest model in the 30B class for its balance of performance and efficiency.

GLM-4.7-Flash features:

- **30B-A3B MoE architecture** (64 experts plus a shared expert, 2-4 active per token)
- **Multi-head Latent Attention (MLA)** for 8× KV cache compression
- **Strong reasoning** (91.6% on AIME 2025, 59.2% on SWE-bench Verified)
- **Bilingual** (English and Chinese)

## Quantization Details

Quantized using **MR-GPTQ** (Metal Marlin GPTQ) with CUDA acceleration:

### Method

- **Format**: MMFP4 (E2M1 FP4), Metal Marlin's native FP4 format (see the sketch after this list)
- **Quantization**: GPTQ with actorder (activation-order column permutation)
- **Hessian calibration**: pre-computed Hessians for attention layers
- **Expert quantization**: identity Hessian with actorder (no calibration data for MoE experts)
- **Group size**: 128
- **Hardware**: NVIDIA RTX 3090 Ti (CUDA-accelerated Cholesky factorization)

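To make the E2M1 target concrete, here is a minimal NumPy sketch of per-group FP4 quantization against the 16-value E2M1 grid. The helper names and the max-abs scale rule are illustrative assumptions, not Metal Marlin's actual kernel code:

```python
import numpy as np

# E2M1 representable magnitudes: 1 sign bit, 2 exponent bits, 1 mantissa bit.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
E2M1_GRID = np.concatenate([E2M1_GRID, -E2M1_GRID])  # 16 values incl. signs

def quantize_group(w: np.ndarray):
    """Quantize one group of 128 weights to E2M1 codes plus an FP16 scale."""
    scale = np.abs(w).max() / 6.0 + 1e-12          # map the group max onto +/-6
    codes = np.abs(w[:, None] / scale - E2M1_GRID[None, :]).argmin(axis=1)
    return codes.astype(np.uint8), np.float16(scale)

def dequantize_group(codes: np.ndarray, scale: np.float16) -> np.ndarray:
    return E2M1_GRID[codes] * np.float32(scale)

w = np.random.randn(128).astype(np.float32)
codes, scale = quantize_group(w)
print("max abs error:", np.abs(dequantize_group(codes, scale) - w).max())
```
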
### Quantization Statistics

| Component | Bit Width | Notes |
|-----------|-----------|-------|
| Embeddings | FP16 | Full precision |
| LM Head | FP16 | Full precision |
| Attention (q/k/v/o) | 4-bit | GPTQ with Hessians |
| MoE Experts (64×) | 4-bit | GPTQ with actorder |
| Layer Norms | FP16 | Full precision |
| Router Weights | FP16 | Full precision |

- **Total tensors**: 19,066
- **Shards**: 48 safetensors files
- **Quantization time**: ~20 minutes (RTX 3090 Ti)

## Files

```
GLM-4.7-Flash-Marlin-MMFP4/
├── model-00001-of-00048.safetensors   # Layer 0 (embeddings)
├── model-00002-of-00048.safetensors   # Layer 1
├── ...
├── model-00048-of-00048.safetensors   # Layer 47 + lm_head
├── model.safetensors.index.json       # Weight map
├── config.json                        # Model config
├── generation_config.json
├── tokenizer.json                     # Tokenizer
└── tokenizer_config.json
```

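The `model.safetensors.index.json` file follows the standard HuggingFace sharded layout, so a tensor's shard can be located without loading any weights. A small sketch; the tensor name below is hypothetical and depends on the actual GLM module layout:

```python
import json

# weight_map maps each tensor name to the shard file that stores it.
with open("model.safetensors.index.json") as f:
    index = json.load(f)

weight_map = index["weight_map"]
print("total tensors:", len(weight_map))

# Hypothetical tensor name, for illustration only.
name = "model.layers.0.self_attn.q_proj.weight"
print(name, "->", weight_map.get(name, "not in this checkpoint"))
```
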
## Usage

### With Metal Marlin (Apple Silicon)

```python
from metal_marlin import MarlinForCausalLM
from transformers import AutoTokenizer

# Load the MMFP4 weights onto the Metal (MPS) device.
model = MarlinForCausalLM.from_pretrained(
    "RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4",
    device="mps",
)
# The tokenizer is unchanged by quantization; reuse the original one.
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")

prompt = "<|user|>\nExplain quantum computing in simple terms.\n<|assistant|>\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("mps")
output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

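If the tokenizer ships a chat template (an assumption; check `tokenizer_config.json`), the prompt above can also be built with `apply_chat_template` instead of hand-writing the special tokens:

```python
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant prefix for generation
    return_tensors="pt",
).to("mps")
```
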
### Tensor Format

Each quantized weight tensor has corresponding scale factors (a dequantization sketch follows):

- `{name}.weight`: packed FP4 weights (uint8, two 4-bit codes per byte)
- `{name}.scales`: FP16 per-group scales (group_size=128)

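A minimal NumPy sketch of how these two tensors combine. The nibble order (low nibble first) and the code-to-value table are assumptions about the packed layout, not a documented spec:

```python
import numpy as np

# Assumed E2M1 code-to-value table: codes 0-7 positive, 8-15 the negated mirror.
E2M1_VALUES = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def dequantize(packed: np.ndarray, scales: np.ndarray, group_size: int = 128) -> np.ndarray:
    """packed: uint8, two FP4 codes per byte; scales: one FP16 value per group."""
    lo = packed & 0x0F           # first code in each byte (assumed low nibble)
    hi = packed >> 4             # second code
    codes = np.stack([lo, hi], axis=-1).reshape(-1)   # interleave nibbles
    values = E2M1_VALUES[codes]
    return values.reshape(-1, group_size) * scales.astype(np.float32)[:, None]
```
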
## Hardware Requirements

| Device | Unified Memory | Notes |
|--------|----------------|-------|
| Apple M4 Max | 36 GB+ | Via Metal Marlin |
| Apple M2 Ultra | 36 GB+ | Via Metal Marlin |

## Benchmarks

### Original Model Performance (from Z.AI)

| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B | GPT-OSS-20B |
|-----------|---------------|---------------|-------------|
| AIME 2025 | 91.6 | 85.0 | **91.7** |
| GPQA | **75.2** | 73.4 | 71.5 |
| SWE-bench Verified | **59.2** | 22.0 | 34.0 |
| τ²-Bench | **79.5** | 49.0 | 47.7 |
| BrowseComp | **42.8** | 2.29 | 28.3 |

### Quantized Model Notes

- GPTQ with actorder minimizes quality loss relative to round-to-nearest (RTN); see the sketch below
- Expected degradation: ~1-2% on benchmarks vs FP16
- E2M1 FP4 format optimized for Metal Performance Shaders

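For intuition: actorder quantizes columns in order of decreasing Hessian diagonal, so the columns that most affect the layer's output are handled first, while the most error-correction headroom remains. A toy illustration of the permutation, not the MR-GPTQ implementation:

```python
import numpy as np

def actorder_permutation(H: np.ndarray) -> np.ndarray:
    """Return column order by decreasing Hessian diagonal (H ~ 2 * X @ X.T)."""
    return np.argsort(-np.diag(H))

# Toy Hessian: column 1 carries the most activation energy, so it goes first.
H = np.diag([0.1, 3.0, 0.5, 2.0])
print(actorder_permutation(H))  # -> [1 3 2 0]
```
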
## Comparison with Trellis Quant

| Model | Format | Size | Bits | Method |
|-------|--------|------|------|--------|
| [GLM-4.7-Flash-Trellis-MM](https://huggingface.co/RESMP-DEV/GLM-4.7-Flash-Trellis-MM) | Trellis | 14 GB | 3.78 bpw | EXL3-style mixed precision |
| **This model** | MMFP4 | 16 GB | 4.0 bpw | GPTQ + actorder |

Choose **Trellis** for the smaller size, or **MMFP4** for a simpler tensor format and potentially better compatibility.

## Limitations

- **Metal Marlin required** for optimal inference on Apple Silicon
- **No speculative decoding** yet
- **Quality loss**: ~1-2% on benchmarks vs FP16 (typical for 4-bit quantization)

## Credits

- **Original model**: [Z.AI / GLM Team](https://huggingface.co/zai-org/GLM-4.7-Flash)
- **Quantization method**: GPTQ with actorder
- **Quantization toolkit**: [Metal Marlin](https://github.com/RESMP-DEV/metal-marlin)

## Citation

If you use this model, please cite the original GLM-4.5 paper:

```bibtex
@misc{glm2025glm45,
  title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models},
  author={GLM Team and Aohan Zeng and Xin Lv and others},
  year={2025},
  eprint={2508.06471},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.06471},
}
```

## License

This quantized model inherits the **MIT License** from the original GLM-4.7-Flash model.