---
license: mit
language:
- en
- zh
base_model: zai-org/GLM-4.7-Flash
pipeline_tag: text-generation
tags:
- quantized
- Mixture of Experts
- 4-bit
- GPTQ
- MMFP4
- glm
- metal-marlin
- moe
library_name: transformers
arxiv: "2508.06471"
---
# GLM-4.7-Flash-Marlin-MMFP4

**MMFP4-quantized GLM-4.7-Flash**: a 30B-A3B MoE model compressed to **4 bits per weight** using GPTQ with actorder and Metal Marlin's E2M1 FP4 format.

| Metric | Value |
|--------|-------|
| **Effective bits** | 4.0 bpw |
| **Compression** | 4× vs FP16 |
| **Model size** | ~16 GB (vs ~60 GB FP16) |
| **Parameters** | 29.3B |
| **Format** | HuggingFace sharded safetensors |
## Model Description
This is a quantized version of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a 30B-class MoE model designed to balance performance and efficiency.
GLM-4.7-Flash features:
- **30B-A3B MoE architecture** (64 routed experts plus a shared expert, 2-4 active per token; a routing sketch follows this list)
- **Multi-head Latent Attention (MLA)** for 8× KV cache compression
- **State-of-the-art reasoning** (91.6% on AIME 2025, 59.2% on SWE-bench Verified)
- **Bilingual** (English + Chinese)
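As a rough illustration of top-k expert routing (the function name, shapes, and k value here are assumptions for exposition, not GLM's actual router code):

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden, router_weight, top_k=4):
    """Hypothetical top-k MoE gating sketch; GLM's production router may differ.

    hidden:        [tokens, d_model] activations
    router_weight: [n_experts, d_model] (kept in FP16 in this checkpoint)
    """
    logits = hidden @ router_weight.T               # [tokens, n_experts]
    probs = F.softmax(logits, dim=-1)
    gates, experts = probs.topk(top_k, dim=-1)      # keep the k highest-scoring experts
    gates = gates / gates.sum(-1, keepdim=True)     # renormalize over the selected k
    return experts, gates                           # the shared expert runs on every token

experts, gates = route_tokens(torch.randn(4, 2048), torch.randn(64, 2048))
```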
## Quantization Details
Quantized using **MR-GPTQ** (Metal Marlin GPTQ) with CUDA acceleration:
### Method
- **Format**: MMFP4 (E2M1 FP4), Metal Marlin's native FP4 format (a per-group quantization sketch follows this list)
- **Quantization**: GPTQ with actorder (activation-order column permutation)
- **Hessian calibration**: Pre-computed Hessians for attention layers
- **Expert quantization**: Identity Hessian with actorder (no calibration data for MoE experts)
- **Group size**: 128
- **Hardware**: NVIDIA RTX 3090 Ti (CUDA-accelerated Cholesky factorization)
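For intuition about the E2M1 grid, here is a minimal per-group round-to-nearest sketch (illustrative only: the real MR-GPTQ pipeline layers actorder permutation and Hessian-driven error feedback on top of this grid):

```python
import torch

# E2M1 FP4 magnitudes: 1 sign bit, 2 exponent bits, 1 mantissa bit
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_group_fp4(w: torch.Tensor):
    """Round-to-nearest FP4 for one group of 128 weights (no GPTQ correction)."""
    grid = torch.cat([-E2M1_GRID.flip(0), E2M1_GRID])  # signed code values
    scale = w.abs().max() / 6.0                        # map the group max onto FP4 max (6.0)
    idx = (w / scale).unsqueeze(-1).sub(grid).abs().argmin(-1)
    return grid[idx] * scale, scale                    # dequantized weights, per-group scale

w = torch.randn(128)
w_q, scale = quantize_group_fp4(w)
print((w - w_q).abs().mean())  # RTN error for this group
```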
### Quantization Statistics
| Component | Bit Width | Notes |
|-----------|-----------|-------|
| Embeddings | FP16 | Full precision |
| LM Head | FP16 | Full precision |
| Attention (q/k/v/o) | 4-bit | GPTQ with Hessians |
| MoE Experts (64×) | 4-bit | GPTQ with actorder |
| Layer Norms | FP16 | Full precision |
| Router Weights | FP16 | Full precision |
- **Total tensors**: 19,066
- **Shards**: 48 safetensors files
- **Quantization time**: ~20 minutes (RTX 3090 Ti)
## Files
```
GLM-4.7-Flash-Marlin-MMFP4/
├── model-00001-of-00048.safetensors   # Layer 0 (embeddings)
├── model-00002-of-00048.safetensors   # Layer 1
├── ...
├── model-00048-of-00048.safetensors   # Layer 47 + lm_head
├── model.safetensors.index.json       # Weight map
├── config.json                        # Model config
├── generation_config.json
├── tokenizer.json                     # Tokenizer
└── tokenizer_config.json
```
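To inspect the shards without loading the whole model, the standard safetensors API works on individual files (paths assume the current directory is the downloaded snapshot):

```python
import json
from safetensors import safe_open

# The index maps every tensor name to the shard file that stores it
with open("model.safetensors.index.json") as f:
    weight_map = json.load(f)["weight_map"]
for name, shard_file in list(weight_map.items())[:5]:
    print(name, "->", shard_file)

# Open one shard lazily and list its tensors and shapes
with safe_open("model-00001-of-00048.safetensors", framework="pt") as shard:
    for name in shard.keys():
        print(name, shard.get_slice(name).get_shape())
```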
## Usage
### With Metal Marlin (Apple Silicon)
```python
from metal_marlin import MarlinForCausalLM
from transformers import AutoTokenizer

model = MarlinForCausalLM.from_pretrained(
    "RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4",
    device="mps",
)

# The tokenizer is unchanged by quantization; load it from the original repo
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")

prompt = "<|user|>\nExplain quantum computing in simple terms.\n<|assistant|>\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("mps")

output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
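If the tokenizer ships a chat template (an assumption here, not verified against this tokenizer config), `apply_chat_template` avoids hand-writing control tokens:

```python
# Assumes tokenizer_config.json includes a chat template
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("mps")
output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
```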
### Tensor Format
Each quantized weight tensor has corresponding scale factors:
- `{name}.weight`: Packed FP4 weights (uint8)
- `{name}.scales`: FP16 per-group scales (group_size=128)
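A hypothetical dequantization sketch, assuming low-nibble-first packing, sign-magnitude FP4 codes (sign in the nibble's MSB), and a flat row-major layout; the actual Marlin kernel tiling may interleave differently:

```python
import torch

# E2M1 decode table indexed by the 4-bit code (sign in the MSB, assumed)
E2M1 = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def dequant_fp4(packed: torch.Tensor, scales: torch.Tensor, group_size: int = 128):
    """Unpack two FP4 codes per uint8 and apply per-group scales."""
    lo = (packed & 0x0F).long()                      # first element of each byte (assumed)
    hi = (packed >> 4).long()                        # second element of each byte
    codes = torch.stack([lo, hi], dim=-1).flatten()  # restore original element order
    vals = E2M1[codes]                               # FP4 code -> real value
    return (vals.view(-1, group_size) * scales.view(-1, 1).float()).flatten()

packed = torch.randint(0, 256, (128,), dtype=torch.uint8)  # 256 packed FP4 values
scales = torch.ones(2, dtype=torch.float16)                # one scale per 128 weights
weights = dequant_fp4(packed, scales)
```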
## Hardware Requirements
| Device | Memory | Notes |
|--------|--------|-------|
| Apple M4 Max | 36 GB+ | Via Metal Marlin |
| Apple M2 Ultra | 36 GB+ | Via Metal Marlin |
## Benchmarks
### Original Model Performance (from Z.AI)
| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B | GPT-OSS-20B |
|-----------|---------------|---------------|-------------|
| AIME 2025 | 91.6 | 85.0 | 91.7 |
| GPQA | **75.2** | 73.4 | 71.5 |
| SWE-bench Verified | **59.2** | 22.0 | 34.0 |
| τ²-Bench | **79.5** | 49.0 | 47.7 |
| BrowseComp | **42.8** | 2.29 | 28.3 |
### Quantized Model Notes
- GPTQ with actorder minimizes quality loss vs RTN
- Expected degradation: ~1-2% on benchmarks vs FP16
- E2M1 FP4 format optimized for Metal Performance Shaders
## Comparison with Trellis Quant
| Model | Format | Size | Bits | Method |
|-------|--------|------|------|--------|
| [GLM-4.7-Flash-Trellis-MM](https://huggingface.co/RESMP-DEV/GLM-4.7-Flash-Trellis-MM) | Trellis | 14 GB | 3.78 bpw | EXL3-style mixed precision |
| **This model** | MMFP4 | 16 GB | 4.0 bpw | GPTQ + actorder |
Choose **Trellis** for the smaller footprint, or **MMFP4** for a simpler tensor format and potentially better compatibility.
## Limitations
- **Metal Marlin required** for optimal inference on Apple Silicon
- **No speculative decoding** yet
- **Quality loss**: ~1-2% on benchmarks vs FP16 (typical for 4-bit quantization)
## Credits
- **Original model**: [Z.AI / GLM Team](https://huggingface.co/zai-org/GLM-4.7-Flash)
- **Quantization method**: GPTQ with actorder
- **Quantization toolkit**: [Metal Marlin](https://github.com/RESMP-DEV/metal-marlin)
## Citation
If you use this model, please cite the original GLM-4.5 paper:
```bibtex
@misc{glm2025glm45,
  title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models},
  author={GLM Team and Aohan Zeng and Xin Lv and others},
  year={2025},
  eprint={2508.06471},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.06471},
}
```
## License
This quantized model inherits the **MIT License** from the original GLM-4.7-Flash model.