--- license: mit base_model: zai-org/GLM-4.7-Flash pipeline_tag: text-generation library_name: core-ai tags: - core-ai - coreml - apple - moe - mla - on-device - metal --- # GLM-4.7-Flash — Core AI (`gather_qmm` kernel, 2.6× faster) Apple **Core AI** (`.aimodel`) conversion of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) (text decoder): MLA attention + a **64-expert top-4 sparse MoE** (+ non-gated shared expert). ~30B total / **~3B active per token** — a strong local coder. Part of the community Core AI model zoo: **https://github.com/john-rocky/coreai-model-zoo** (full card: [`zoo/glm-4.7-flash.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/glm-4.7-flash.md)). ## The `gather_qmm` kernel — 20.3 → 52.4 tok/s (2.6×) Apple's `GatherMM` reads **all 64 experts' weights every token**; a custom `coreai_torch.TorchMetalKernel` reads **only the 4 routed experts** (4/64) → decode runs at active-param bandwidth: **52.4 tok/s, 2.6×** (the biggest relative gain of the zoo's three MoE gather ports — a 16× over-read removed). **Quality is clean and unchanged.** The kernel reads the **`sym8`** scheme = the same symmetric-linear int8 (per-K-block-32) recipe the standard int8 bundle uses, via a **bit-exact** gather: **0 introduced flips / 18 vs fp16**. Pure speed win at the same quality. | bundle | size | decode tok/s | quality | |---|---:|---:|---| | `gpu-pipelined/glm_4_7_flash_decode_sym8_gather/` | 30 GB | **52.4** | clean (0 flips/18 vs fp16) ✅ | Mac-only (30 GB int8). Remaining speed lever = absorbed-MLA (GLM runs full MLA on all 47 layers). ## Run ``` COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model gpu-pipelined/glm_4_7_flash_decode_sym8_gather -p 128 -g 256 -n 3 ``` Convert your own with [`conversion/export_glm47_moe_metal_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_glm47_moe_metal_decode_pipelined.py). ## License MIT (upstream GLM license). Conversion + `gather_qmm` kernel: community.