---
license: mit
language:
  - en
  - zh
base_model: zai-org/GLM-4.7-Flash
pipeline_tag: text-generation
tags:
  - quantized
  - Mixture of Experts
  - 4-bit
  - GPTQ
  - MMFP4
  - glm
  - metal-marlin
  - moe
library_name: transformers
arxiv: "2508.06471"
---

# GLM-4.7-Flash-Marlin-MMFP4

![](https://raw.githubusercontent.com/zai-org/GLM-4.5/refs/heads/main/resources/logo.svg)

**MMFP4-quantized GLM-4.7-Flash** — a 30B-A3B MoE model compressed to **4 bits per weight** using GPTQ with actorder and Metal Marlin's E2M1 FP4 format.

| Metric | Value |
|--------|-------|
| **Effective bits** | 4.0 bpw |
| **Compression** | 4× vs FP16 |
| **Model size** | ~16 GB (vs ~60 GB FP16) |
| **Parameters** | 29.3B |
| **Format** | HuggingFace sharded safetensors |

## Model Description

This is a quantized version of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), the strongest model in the 30B class, balancing performance and efficiency. GLM-4.7-Flash features:

- **30B-A3B MoE architecture** (64 experts + shared expert, 2-4 active per token)
- **Multi-head Latent Attention (MLA)** for 8× KV cache compression
- **State-of-the-art reasoning** (91.6% on AIME 2025, 59.2% on SWE-bench Verified)
- **Bilingual** (English + Chinese)

## Quantization Details

Quantized using **MR-GPTQ** (Metal Marlin GPTQ) with CUDA acceleration:

### Method

- **Format**: MMFP4 (E2M1 FP4) — Metal Marlin's native FP4 format
- **Quantization**: GPTQ with actorder (activation-order column permutation)
- **Hessian calibration**: Pre-computed Hessians for attention layers
- **Expert quantization**: Identity Hessian with actorder (no calibration data for MoE experts)
- **Group size**: 128
- **Hardware**: NVIDIA RTX 3090 Ti (CUDA-accelerated Cholesky factorization)

### Quantization Statistics

| Component | Bit Width | Notes |
|-----------|-----------|-------|
| Embeddings | FP16 | Full precision |
| LM Head | FP16 | Full precision |
| Attention (q/k/v/o) | 4-bit | GPTQ with Hessians |
| MoE Experts (64×) | 4-bit | GPTQ with actorder |
| Layer Norms | FP16 | Full precision |
| Router Weights | FP16 | Full precision |

- **Total tensors**: 19,066
- **Shards**: 48 safetensors files
- **Quantization time**: ~20 minutes (RTX 3090 Ti)

## Files

```
GLM-4.7-Flash-Marlin-MMFP4/
├── model-00001-of-00048.safetensors  # Layer 0 (embeddings)
├── model-00002-of-00048.safetensors  # Layer 1
├── ...
├── model-00048-of-00048.safetensors  # Layer 47 + lm_head
├── model.safetensors.index.json      # Weight map
├── config.json                       # Model config
├── generation_config.json
├── tokenizer.json                    # Tokenizer
└── tokenizer_config.json
```

## Usage

### With Metal Marlin (Apple Silicon)

```python
from metal_marlin import MarlinForCausalLM
from transformers import AutoTokenizer

model = MarlinForCausalLM.from_pretrained(
    "RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4",
    device="mps"
)
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")

prompt = "<|user|>\nExplain quantum computing in simple terms.\n<|assistant|>\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("mps")
output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

### Tensor Format

Each quantized weight tensor has corresponding scale factors:

- `{name}.weight`: Packed FP4 weights (uint8)
- `{name}.scales`: FP16 per-group scales (group_size=128)
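To make the layout concrete, here is a minimal NumPy sketch of how such a weight/scale pair could be dequantized back to FP16. The E2M1 magnitude table (±{0, 0.5, 1, 1.5, 2, 3, 4, 6}) is the standard FP4 value set; the low-nibble-first ordering and flat row-major packing shown here are illustrative assumptions, since the actual on-disk layout may be interleaved for kernel efficiency.

```python
import numpy as np

# Standard E2M1 FP4 code table: 1 sign bit, 2 exponent bits, 1 mantissa bit.
# Codes 0-7 are the non-negative magnitudes; codes 8-15 are their negatives.
E2M1_LUT = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float16,
)

def dequantize_mmfp4(packed: np.ndarray, scales: np.ndarray,
                     group_size: int = 128) -> np.ndarray:
    """Dequantize one MMFP4 tensor pair to FP16.

    packed: (out_features, in_features // 2) uint8, two FP4 codes per byte.
    scales: (out_features, in_features // group_size) float16.
    NOTE: the nibble order below is an assumption, not the documented
    Metal Marlin layout.
    """
    lo = packed & 0x0F            # even columns (assumed low nibble first)
    hi = (packed >> 4) & 0x0F     # odd columns
    codes = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.uint8)
    codes[:, 0::2] = lo
    codes[:, 1::2] = hi
    w = E2M1_LUT[codes]                        # FP4 codes -> E2M1 values
    w = w.reshape(w.shape[0], -1, group_size)  # split columns into groups
    w = w * scales[:, :, None]                 # apply per-group FP16 scale
    return w.reshape(w.shape[0], -1)
```

At group_size=128, the FP16 scales add only 16/128 = 0.125 bits per weight of overhead on top of the 4-bit codes.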
## Hardware Requirements

| Device | Memory | Notes |
|--------|--------|-------|
| Apple M4 Max | 36 GB+ | Via Metal Marlin |
| Apple M2 Ultra | 36 GB+ | Via Metal Marlin |

## Benchmarks

### Original Model Performance (from Z.AI)

| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B | GPT-OSS-20B |
|-----------|---------------|---------------|-------------|
| AIME 2025 | **91.6** | 85.0 | 91.7 |
| GPQA | **75.2** | 73.4 | 71.5 |
| SWE-bench Verified | **59.2** | 22.0 | 34.0 |
| τ²-Bench | **79.5** | 49.0 | 47.7 |
| BrowseComp | **42.8** | 2.29 | 28.3 |

### Quantized Model Notes

- GPTQ with actorder minimizes quality loss vs RTN
- Expected degradation: ~1-2% on benchmarks vs FP16
- E2M1 FP4 format optimized for Metal Performance Shaders

## Comparison with Trellis Quant

| Model | Format | Size | Bits | Method |
|-------|--------|------|------|--------|
| [GLM-4.7-Flash-Trellis-MM](https://huggingface.co/RESMP-DEV/GLM-4.7-Flash-Trellis-MM) | Trellis | 14 GB | 3.78 bpw | EXL3-style mixed precision |
| **This model** | MMFP4 | 16 GB | 4.0 bpw | GPTQ + actorder |

Choose **Trellis** for smaller size, **MMFP4** for a simpler tensor format and potentially better compatibility.

## Limitations

- **Metal Marlin required** for optimal inference on Apple Silicon
- **No speculative decoding** yet
- **Quality loss**: ~1-2% on benchmarks vs FP16 (typical for 4-bit quantization)

## Credits

- **Original model**: [Z.AI / GLM Team](https://huggingface.co/zai-org/GLM-4.7-Flash)
- **Quantization method**: GPTQ with actorder
- **Quantization toolkit**: [Metal Marlin](https://github.com/RESMP-DEV/metal-marlin)

## Citation

If you use this model, please cite the original GLM-4.5 paper:

```bibtex
@misc{glm2025glm45,
  title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models},
  author={GLM Team and Aohan Zeng and Xin Lv and others},
  year={2025},
  eprint={2508.06471},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.06471},
}
```

## License

This quantized model inherits the **MIT License** from the original GLM-4.7-Flash model.