Rapid42/GLM-4.7-Flash-MXFP4

GLM-4.7-Flash (~30B) — quantized to MXFP4 for Apple Silicon

Converted and optimized by Rapid42 — engineering tools for fast pipelines.


What This Is

This is GLM-4.7-Flash from Zhipu AI, quantized to MXFP4 format with mlx-lm. GLM-4.7 is the latest generation of Zhipu AI's GLM series: strong in both English and Chinese, with a large context window and fast generation.

  • Parameters: ~30B
  • Quantization: MXFP4 (via mlx-lm 0.31.1)
  • Base model: zai-org/GLM-4.7-Flash
  • Framework: Apple MLX
  • Strengths: English + Chinese bilingual, strong coding, long context

Hardware Requirements

Device             RAM used  Experience
M3 Max (128GB)     ~18GB     ✅ Excellent
M3 Pro (36GB)      ~18GB     ✅ Comfortable
M2 Ultra (192GB)   ~18GB     ✅ Excellent
M2 Pro (16GB)      ~18GB     ❌ Exceeds physical RAM; expect heavy swapping
M1/M2 (24GB)       ~18GB     ✅ Works, with little headroom
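The ~18GB figure is roughly what you would expect from first principles. A quick sanity check (the 4.5 bits/weight below is an assumed average that folds in per-block scale overhead):

```python
# Back-of-envelope check of the ~18GB footprint (all numbers are estimates).
params = 30e9              # ~30B parameters
bits_per_weight = 4.5      # assumed: 4-bit weights plus per-block scale overhead
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"weights alone: ~{weights_gb:.1f} GB")
# KV cache, activations, and runtime overhead account for the remaining few GB.
```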

Quick Start

Install:

pip install mlx-lm

Python:

from mlx_lm import load, generate

model, tokenizer = load("Rapid42/GLM-4.7-Flash-MXFP4")

messages = [{"role": "user", "content": "Explain the difference between EXR and OpenEXR multipart files."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
print(response)

CLI:

mlx_lm.generate \
  --model Rapid42/GLM-4.7-Flash-MXFP4 \
  --prompt "What makes GLM different from Llama-style architectures?" \
  --max-tokens 512
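For multi-turn chat, pass the full conversation history to `apply_chat_template` on every turn. The history is just a growing list of role-tagged dicts; the helper and replies below are illustrative:

```python
# Multi-turn chat is a growing list of role-tagged messages; feed the whole
# history to tokenizer.apply_chat_template (as in Quick Start) each turn.
history = [{"role": "user", "content": "Write a one-line Python EXR size check."}]

def record_turn(history, assistant_reply, next_user_msg):
    # Illustrative helper: append the model's reply, then the next question.
    history.append({"role": "assistant", "content": assistant_reply})
    history.append({"role": "user", "content": next_user_msg})
    return history

record_turn(history, "(model reply here)", "Now explain it in Chinese.")
```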

Why GLM-4.7-Flash?

GLM-4.7-Flash is the "fast" variant of the GLM-4.7 series — optimized for inference speed while retaining most of the quality of the full model. Key differentiators:

  • Bilingual English/Chinese — trained bilingually from the start, not bolted on through fine-tuning
  • Flash inference — optimized attention for faster generation
  • Large context — handles long documents and complex multi-turn conversations
  • Distinct lineage — the GLM series grew out of the General Language Model pretraining objective (autoregressive blank infilling) rather than plain left-to-right causal modeling

A solid alternative to Qwen-family models, especially for Chinese-language tasks or for those who want architectural diversity in their local model stack.


Why MXFP4?

MXFP4 (Microscaling FP4) stores weights as 4-bit floating-point values with a shared per-block scale, preserving more dynamic range than plain int4, and it runs natively on Apple Silicon via MLX, with no special runtime required.
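A toy sketch of the microscaling idea (illustrative only, not MLX's actual MXFP4 kernel): each block of 32 weights shares one power-of-two scale, and each weight snaps to the nearest FP4 (E2M1) grid point.

```python
import numpy as np

# FP4 (E2M1) can represent +/- {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
FP4_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.concatenate([-FP4_POS[:0:-1], FP4_POS])

def mx_quantize_block(block):
    """Quantize one 32-element block with a shared power-of-two scale."""
    # Pick a scale so the block's largest magnitude lands near the grid max (6).
    exp = np.floor(np.log2(np.abs(block).max())) - 2
    scale = 2.0 ** exp
    # Snap each scaled value to the nearest representable FP4 value.
    idx = np.abs(block[:, None] / scale - FP4_GRID[None, :]).argmin(axis=1)
    return FP4_GRID[idx] * scale

rng = np.random.default_rng(0)
block = rng.normal(size=32)
deq = mx_quantize_block(block)
max_err = np.abs(block - deq).max()
```

Because the scale is shared per small block, one outlier weight only costs precision within its own 32 values instead of across the whole tensor, which is where the dynamic-range advantage over plain int4 comes from.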


Conversion

python -m mlx_lm.convert \
  --hf-path zai-org/GLM-4.7-Flash \
  --mlx-path Rapid42/GLM-4.7-Flash-MXFP4 \
  --quantize --q-mode mxfp4

Converted using mlx-lm 0.31.1.
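After conversion, the output directory's config.json records the quantization settings, so you can confirm what was written. The literal below stands in for reading the real file; its keys and values are illustrative:

```python
import json

# Stand-in for json.load(open("GLM-4.7-Flash-MXFP4/config.json")) — the
# converted checkpoint's config carries a "quantization" block like this
# (keys and values are illustrative, not read from the actual repo).
config_text = '{"quantization": {"mode": "mxfp4", "bits": 4, "group_size": 32}}'
quant = json.loads(config_text)["quantization"]
print(f'{quant["bits"]}-bit, group size {quant["group_size"]}, mode {quant["mode"]}')
```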


About Rapid42

Rapid42 builds fast, precise engineering tools — from VFX pipeline utilities to optimized ML model distributions.

rapid42.com · ExrToPsd · Level Careers
