---
license: mit
language:
- en
- zh
base_model: zai-org/GLM-4.7-Flash
pipeline_tag: text-generation
tags:
- quantized
- Mixture of Experts
- 4-bit
- GPTQ
- MMFP4
- glm
- metal-marlin
- moe
library_name: transformers
arxiv: '2508.06471'
---

# GLM-4.7-Flash-Marlin-MMFP4

MMFP4-quantized GLM-4.7-Flash: a 30B-A3B MoE model compressed to 4 bits per weight using GPTQ with actorder and Metal Marlin's E2M1 FP4 format.
| Metric | Value |
|---|---|
| Effective bits | 4.0 bpw |
| Compression | 4× vs FP16 |
| Model size | ~16 GB (vs ~60 GB FP16) |
| Parameters | 29.3B |
| Format | HuggingFace sharded safetensors |
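These figures can be sanity-checked with simple arithmetic: 4-bit packed weights plus one FP16 scale per 128-weight group account for most of the on-disk size, with the remainder coming from the FP16 tensors listed under Quantization Statistics. A rough, illustrative calculation:

```python
# Back-of-the-envelope size check (illustrative; ignores the FP16 embeddings,
# norms and router weights that push the real total to ~16 GB).
params = 29.3e9                                # total parameters
group_size = 128                               # weights per FP16 scale

packed_gb = params * 4 / 8 / 1e9               # 4 bits per weight    -> ~14.7 GB
scales_gb = params / group_size * 2 / 1e9      # one FP16 scale/group -> ~0.5 GB
fp16_gb = params * 2 / 1e9                     # 16 bits per weight   -> ~58.6 GB

print(f"quantized ~{packed_gb + scales_gb:.1f} GB, FP16 ~{fp16_gb:.1f} GB")
```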
## Model Description

This is a quantized version of zai-org/GLM-4.7-Flash, which Z.AI positions as the strongest model in the 30B class, balancing performance and efficiency.
GLM-4.7-Flash features:
- 30B-A3B MoE architecture (64 experts + shared expert, 2-4 active per token)
- Multi-head Latent Attention (MLA) for 8× KV cache compression
- State-of-the-art reasoning (91.6% on AIME 2025, 59.2% on SWE-bench Verified)
- Bilingual (English + Chinese)
## Quantization Details

Quantized using MR-GPTQ (Metal Marlin GPTQ) with CUDA acceleration.

### Method

- Format: MMFP4 (E2M1 FP4), Metal Marlin's native FP4 format
- Quantization: GPTQ with actorder (activation-order column permutation)
- Hessian calibration: Pre-computed Hessians for attention layers
- Expert quantization: Identity Hessian with actorder (no calibration data for MoE experts)
- Group size: 128
- Hardware: NVIDIA RTX 3090 Ti (CUDA-accelerated Cholesky factorization)
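For intuition, the sketch below shows what rounding a weight matrix onto the E2M1 FP4 grid with one scale per 128-column group looks like. It is plain round-to-nearest, not the MR-GPTQ pipeline: GPTQ additionally visits columns in activation order (actorder) and uses the Hessian to propagate each column's rounding error into the not-yet-quantized columns. Function and variable names here are illustrative, not Metal Marlin APIs.

```python
import torch

# E2M1 FP4: 1 sign, 2 exponent, 1 mantissa bit; representable magnitudes
# are {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_e2m1(w: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Round-to-nearest onto the signed E2M1 grid, one scale per column group.

    GPTQ (as used for this model) would instead process columns in activation
    order and fold each column's quantization error back into the remaining
    columns via the inverse Hessian; that step is omitted in this sketch.
    """
    out = torch.empty_like(w)
    for start in range(0, w.shape[1], group_size):
        g = w[:, start:start + group_size]
        scale = g.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / 6.0
        x = g / scale                                         # now |x| <= 6
        idx = (x.abs().unsqueeze(-1) - E2M1_GRID).abs().argmin(dim=-1)
        out[:, start:start + group_size] = E2M1_GRID[idx] * x.sign() * scale
    return out

w = torch.randn(256, 512) * 0.02
print(f"mean abs error: {(w - fake_quant_e2m1(w)).abs().mean().item():.2e}")
```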
### Quantization Statistics
| Component | Bit Width | Notes |
|---|---|---|
| Embeddings | FP16 | Full precision |
| LM Head | FP16 | Full precision |
| Attention (q/k/v/o) | 4-bit | GPTQ with Hessians |
| MoE Experts (64×) | 4-bit | GPTQ with actorder |
| Layer Norms | FP16 | Full precision |
| Router Weights | FP16 | Full precision |
- Total tensors: 19,066
- Shards: 48 safetensors files
- Quantization time: ~20 minutes (RTX 3090 Ti)
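The mixed-precision policy in the table can be expressed as a simple name-based rule. The substring checks below assume GLM-style parameter names and are illustrative; they are not taken from the quantization toolkit's configuration.

```python
# Tensors kept in FP16 per the table above (key substrings are assumptions
# based on common GLM/transformers naming, not the actual tool's config).
FP16_MARKERS = ("embed_tokens", "lm_head", "norm", "mlp.gate.weight")

def target_precision(param_name: str) -> str:
    """Return 'fp16' for full-precision tensors, 'mmfp4' for quantized ones."""
    if any(marker in param_name for marker in FP16_MARKERS):
        return "fp16"
    return "mmfp4"   # attention q/k/v/o projections and MoE expert weights

print(target_precision("model.layers.0.self_attn.q_proj.weight"))   # mmfp4
print(target_precision("model.layers.0.mlp.gate.weight"))           # fp16 (router)
print(target_precision("model.embed_tokens.weight"))                # fp16
```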
## Files

```
GLM-4.7-Flash-Marlin-MMFP4/
├── model-00001-of-00048.safetensors   # Layer 0 (embeddings)
├── model-00002-of-00048.safetensors   # Layer 1
├── ...
├── model-00048-of-00048.safetensors   # Layer 47 + lm_head
├── model.safetensors.index.json       # Weight map
├── config.json                        # Model config
├── generation_config.json
├── tokenizer.json                     # Tokenizer
└── tokenizer_config.json
```
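The index file maps every tensor name to the shard that contains it, so individual shards can be inspected without downloading the whole model. A minimal example using `huggingface_hub` (the tensor key queried at the end is an assumed GLM-style name):

```python
import json
from huggingface_hub import hf_hub_download

repo_id = "RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4"
index_path = hf_hub_download(repo_id, "model.safetensors.index.json")

with open(index_path) as f:
    weight_map = json.load(f)["weight_map"]   # tensor name -> shard filename

shards = sorted(set(weight_map.values()))
print(f"{len(weight_map)} tensors across {len(shards)} shards")

# Which shard holds the embeddings? (key name assumed)
print(weight_map.get("model.embed_tokens.weight"))
```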
## Usage

### With Metal Marlin (Apple Silicon)

```python
from metal_marlin import MarlinForCausalLM
from transformers import AutoTokenizer

model = MarlinForCausalLM.from_pretrained(
    "RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4",
    device="mps",
)
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")

prompt = "<|user|>\nExplain quantum computing in simple terms.\n<|assistant|>\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("mps")

output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## Tensor Format

Each quantized weight tensor has corresponding scale factors:

- `{name}.weight`: packed FP4 weights (uint8)
- `{name}.scales`: FP16 per-group scales (group_size=128)
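Dequantization therefore amounts to unpacking two 4-bit codes per byte, mapping each code to its E2M1 value, and multiplying by the FP16 scale of its 128-weight group. The sketch below shows the general shape of this on the CPU; the nibble order and the code-to-value layout are assumptions, since the card does not pin them down, and the Metal Marlin kernels perform this step on-GPU.

```python
import numpy as np

# Signed E2M1 lookup table: codes 0-7 positive, 8-15 negative (layout assumed).
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
LUT = np.concatenate([E2M1, -E2M1])

def dequantize(packed: np.ndarray, scales: np.ndarray, group_size: int = 128) -> np.ndarray:
    """packed: uint8, two FP4 codes per byte; scales: FP16, one per group of 128."""
    lo, hi = packed & 0x0F, packed >> 4                   # nibble order assumed
    codes = np.stack([lo, hi], axis=-1).reshape(*packed.shape[:-1], -1)
    values = LUT[codes]                                   # map codes to E2M1 values
    groups = values.reshape(*values.shape[:-1], -1, group_size)
    return (groups * scales[..., None].astype(np.float32)).reshape(values.shape)

packed = np.random.randint(0, 256, size=(4, 64), dtype=np.uint8)  # 128 weights/row
scales = np.random.rand(4, 1).astype(np.float16)                  # 1 group per row
print(dequantize(packed, scales).shape)                           # (4, 128)
```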
## Hardware Requirements
| Device | Memory | Notes |
|---|---|---|
| Apple M4 Max | 36 GB+ | Via Metal Marlin |
| Apple M2 Ultra | 36 GB+ | Via Metal Marlin |
## Benchmarks

### Original Model Performance (from Z.AI)
| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B | GPT-OSS-20B |
|---|---|---|---|
| AIME 2025 | 91.6 | 85.0 | 91.7 |
| GPQA | 75.2 | 73.4 | 71.5 |
| SWE-bench Verified | 59.2 | 22.0 | 34.0 |
| τ²-Bench | 79.5 | 49.0 | 47.7 |
| BrowseComp | 42.8 | 2.29 | 28.3 |
### Quantized Model Notes
- GPTQ with actorder minimizes quality loss vs RTN
- Expected degradation: ~1-2% on benchmarks vs FP16
- E2M1 FP4 format optimized for Metal Performance Shaders
## Comparison with Trellis Quant
| Model | Format | Size | Bits | Method |
|---|---|---|---|---|
| GLM-4.7-Flash-Trellis-MM | Trellis | 14 GB | 3.78 bpw | EXL3-style mixed precision |
| This model | MMFP4 | 16 GB | 4.0 bpw | GPTQ + actorder |
Choose Trellis for the smaller size; choose MMFP4 for a simpler tensor format and potentially better compatibility.
## Limitations
- Metal Marlin required for optimal inference on Apple Silicon
- No speculative decoding yet
- Quality loss: ~1-2% on benchmarks vs FP16 (typical for 4-bit quantization)
## Credits
- Original model: Z.AI / GLM Team
- Quantization method: GPTQ with actorder
- Quantization toolkit: Metal Marlin
## Citation

If you use this model, please cite the original GLM-4.5 paper:

```bibtex
@misc{glm2025glm45,
      title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models},
      author={GLM Team and Aohan Zeng and Xin Lv and others},
      year={2025},
      eprint={2508.06471},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.06471}
}
```
## License
This quantized model inherits the MIT License from the original GLM-4.7-Flash model.