---
license: mit
language:
- en
- zh
base_model: zai-org/GLM-4.7-Flash
pipeline_tag: text-generation
tags:
- quantized
- Mixture of Experts
- 4-bit
- GPTQ
- MMFP4
- glm
- metal-marlin
- moe
library_name: transformers
arxiv: "2508.06471"
---
# GLM-4.7-Flash-Marlin-MMFP4
![](https://raw.githubusercontent.com/zai-org/GLM-4.5/refs/heads/main/resources/logo.svg)
**MMFP4-quantized GLM-4.7-Flash** β€” a 30B-A3B MoE model compressed to **4 bits per weight** using GPTQ with actorder and Metal Marlin's E2M1 FP4 format.
| Metric | Value |
|--------|-------|
| **Effective bits** | 4.0 bpw |
| **Compression** | 4Γ— vs FP16 |
| **Model size** | ~16 GB (vs ~60 GB FP16) |
| **Parameters** | 29.3B |
| **Format** | HuggingFace sharded safetensors |
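The size follows from the bit width: 29.3B weights at 4 bits is roughly 29.3e9 Γ— 0.5 bytes β‰ˆ 14.7 GB, with the per-group FP16 scales (one per 128 weights) plus the FP16 embeddings, norms, and router weights accounting for the remaining ~1.3 GB.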
## Model Description
This is a quantized version of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), which Z.AI positions as the strongest 30B-class model for its balance of performance and efficiency.
GLM-4.7-Flash features:
- **30B-A3B MoE architecture** (64 experts + shared expert, 2-4 active per token)
- **Multi-head Latent Attention (MLA)** for 8Γ— KV cache compression
- **State-of-the-art reasoning** (91.6% on AIME 2025, 59.2% on SWE-bench Verified)
- **Bilingual** (English + Chinese)
## Quantization Details
Quantized using **MR-GPTQ** (Metal Marlin GPTQ) with CUDA acceleration:
### Method
- **Format**: MMFP4 (E2M1 FP4) β€” Metal Marlin's native FP4 format
- **Quantization**: GPTQ with actorder (activation-order column permutation)
- **Hessian calibration**: Pre-computed Hessians for attention layers
- **Expert quantization**: Identity Hessian with actorder (no calibration data for MoE experts)
- **Group size**: 128
- **Hardware**: NVIDIA RTX 3090 Ti (CUDA-accelerated Cholesky factorization)
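For intuition, here is a minimal sketch of the GPTQ-with-actorder loop described above: columns are permuted by descending Hessian diagonal, quantized one at a time to the E2M1 grid with per-group FP16 scales, and each column's rounding error is folded into the not-yet-quantized columns. This is a textbook reconstruction, not the MR-GPTQ implementation; passing an identity Hessian reproduces the expert-layer path.
```python
import torch

# The 15 representable E2M1 FP4 values (sign included)
E2M1_GRID = torch.tensor(
    [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0,
     0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(col, scale):
    """Snap each element of col/scale to the nearest E2M1 grid point."""
    idx = torch.argmin(((col / scale)[:, None] - E2M1_GRID).abs(), dim=1)
    return E2M1_GRID[idx] * scale

def gptq_actorder(W, H, group_size=128, damp=0.01):
    """Quantize W (out_features x in_features) column by column.
    H is the calibration Hessian (X @ X.T over calibration activations);
    an identity H gives the expert-layer path described above."""
    n = W.shape[1]
    H = H + damp * H.diagonal().mean() * torch.eye(n, dtype=H.dtype)
    perm = torch.argsort(H.diagonal(), descending=True)  # actorder
    W, H = W[:, perm].clone(), H[perm][:, perm]
    # Upper-triangular Cholesky factor of H^-1, as in standard GPTQ
    Hinv = torch.linalg.cholesky(torch.linalg.inv(H), upper=True)
    Q = torch.zeros_like(W)
    scale = None
    for i in range(n):
        if i % group_size == 0:  # one FP16 scale per group of 128 columns
            group = W[:, i:i + group_size]
            scale = group.abs().amax(dim=1) / 6.0  # map group max onto +-6
            scale[scale == 0] = 1.0
        q = quantize_fp4(W[:, i], scale)
        Q[:, i] = q
        err = (W[:, i] - q) / Hinv[i, i]
        # Fold this column's rounding error into the remaining columns
        W[:, i + 1:] -= err[:, None] * Hinv[i, i + 1:][None, :]
    return Q[:, torch.argsort(perm)]  # undo the actorder permutation
```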
### Quantization Statistics
| Component | Bit Width | Notes |
|-----------|-----------|-------|
| Embeddings | FP16 | Full precision |
| LM Head | FP16 | Full precision |
| Attention (q/k/v/o) | 4-bit | GPTQ with Hessians |
| MoE Experts (64Γ—) | 4-bit | GPTQ with actorder |
| Layer Norms | FP16 | Full precision |
| Router Weights | FP16 | Full precision |
- **Total tensors**: 19,066
- **Shards**: 48 safetensors files
- **Quantization time**: ~20 minutes (RTX 3090 Ti)
## Files
```
GLM-4.7-Flash-Marlin-MMFP4/
β”œβ”€β”€ model-00001-of-00048.safetensors # Layer 0 (embeddings)
β”œβ”€β”€ model-00002-of-00048.safetensors # Layer 1
β”œβ”€β”€ ...
β”œβ”€β”€ model-00048-of-00048.safetensors # Layer 47 + lm_head
β”œβ”€β”€ model.safetensors.index.json # Weight map
β”œβ”€β”€ config.json # Model config
β”œβ”€β”€ generation_config.json
β”œβ”€β”€ tokenizer.json # Tokenizer
└── tokenizer_config.json
```
## Usage
### With Metal Marlin (Apple Silicon)
```python
from metal_marlin import MarlinForCausalLM
from transformers import AutoTokenizer

# Load the quantized weights onto the Apple GPU (MPS backend)
model = MarlinForCausalLM.from_pretrained(
    "RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4",
    device="mps",
)

# The tokenizer is unchanged by quantization, so load it from the original repo
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")

# GLM chat format: a user turn followed by an empty assistant turn
prompt = "<|user|>\nExplain quantum computing in simple terms.\n<|assistant|>\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("mps")

output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
### Tensor Format
Each quantized weight tensor has corresponding scale factors:
- `{name}.weight`: Packed FP4 weights (uint8)
- `{name}.scales`: FP16 per-group scales (group_size=128)
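For illustration, a minimal dequantization sketch is shown below. The nibble order (low nibble first), the E2M1 bit layout, and the scale shape are assumptions made for this example, not the verified Metal Marlin packing; the tensor name is hypothetical.
```python
import torch
from safetensors.torch import load_file

# E2M1 decode table indexed by the 4-bit code; the bit layout
# (sign in the high bit, negatives in codes 8-15) is an assumption here
E2M1 = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def dequantize_mmfp4(packed, scales, group_size=128):
    """packed: uint8, two FP4 codes per byte (low nibble first, assumed).
    scales: FP16, shape (rows, cols // group_size), assumed."""
    lo, hi = packed & 0x0F, packed >> 4
    codes = torch.stack([lo, hi], dim=-1).flatten(-2)  # (rows, cols)
    w = E2M1[codes.long()]
    rows, cols = w.shape
    w = w.view(rows, cols // group_size, group_size)
    return (w * scales.float().unsqueeze(-1)).reshape(rows, cols)

shard = load_file("model-00002-of-00048.safetensors")
name = "model.layers.1.self_attn.q_proj"  # hypothetical tensor name
w = dequantize_mmfp4(shard[f"{name}.weight"], shard[f"{name}.scales"])
```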
## Hardware Requirements
| Device | Memory | Notes |
|--------|--------|-------|
| Apple M4 Max | 36 GB+ | Via Metal Marlin |
| Apple M2 Ultra | 36 GB+ | Via Metal Marlin |
## Benchmarks
### Original Model Performance (from Z.AI)
| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B | GPT-OSS-20B |
|-----------|---------------|---------------|-------------|
| AIME 2025 | 91.6 | 85.0 | **91.7** |
| GPQA | **75.2** | 73.4 | 71.5 |
| SWE-bench Verified | **59.2** | 22.0 | 34.0 |
| τ²-Bench | **79.5** | 49.0 | 47.7 |
| BrowseComp | **42.8** | 2.29 | 28.3 |
### Quantized Model Notes
- GPTQ with actorder reduces quality loss relative to round-to-nearest (RTN) quantization
- Expected degradation: ~1-2% on benchmarks vs FP16
- E2M1 FP4 format optimized for Metal Performance Shaders
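To sanity-check that estimate on your own hardware, a quick perplexity comparison is sketched below. It assumes `MarlinForCausalLM` exposes a transformers-style forward that returns `.logits`, which may not match the actual Metal Marlin API; adjust to the real interface.
```python
import torch
import torch.nn.functional as F
from metal_marlin import MarlinForCausalLM
from transformers import AutoTokenizer

model = MarlinForCausalLM.from_pretrained(
    "RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4", device="mps")
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")

text = open("heldout.txt").read()  # any held-out text, e.g. WikiText
ids = tokenizer(text, return_tensors="pt").input_ids[:, :2048].to("mps")
with torch.no_grad():
    logits = model(ids).logits     # assumed transformers-style output
# Each position predicts the next token
loss = F.cross_entropy(logits[0, :-1].float(), ids[0, 1:])
print(f"perplexity: {loss.exp().item():.2f}")  # compare against an FP16 run
```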
## Comparison with Trellis Quant
| Model | Format | Size | Bits | Method |
|-------|--------|------|------|--------|
| [GLM-4.7-Flash-Trellis-MM](https://huggingface.co/RESMP-DEV/GLM-4.7-Flash-Trellis-MM) | Trellis | 14 GB | 3.78 bpw | EXL3-style mixed precision |
| **This model** | MMFP4 | 16 GB | 4.0 bpw | GPTQ + actorder |
Choose **Trellis** for the smaller footprint, **MMFP4** for a simpler tensor format and potentially broader compatibility.
## Limitations
- **Metal Marlin required** for optimal inference on Apple Silicon
- **No speculative decoding** yet
- **Quality loss**: ~1-2% on benchmarks vs FP16 (typical for 4-bit quantization)
## Credits
- **Original model**: [Z.AI / GLM Team](https://huggingface.co/zai-org/GLM-4.7-Flash)
- **Quantization method**: GPTQ with actorder
- **Quantization toolkit**: [Metal Marlin](https://github.com/RESMP-DEV/metal-marlin)
## Citation
If you use this model, please cite the original GLM-4.5 paper:
```bibtex
@misc{glm2025glm45,
  title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models},
  author={GLM Team and Aohan Zeng and Xin Lv and others},
  year={2025},
  eprint={2508.06471},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.06471},
}
```
## License
This quantized model inherits the **MIT License** from the original GLM-4.7-Flash model.