File size: 2,036 Bytes
ebf7393 2f8deff ebf7393 2f8deff ebf7393 2f8deff ebf7393 2f8deff ebf7393 2f8deff ebf7393 2f8deff ebf7393 2f8deff ebf7393 2f8deff ebf7393 2f8deff ebf7393 2f8deff ebf7393 2f8deff | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 | ---
license: mit
base_model: zai-org/GLM-4.7-Flash
pipeline_tag: text-generation
library_name: core-ai
tags:
- core-ai
- coreml
- apple
- moe
- mla
- on-device
- metal
---
# GLM-4.7-Flash — Core AI (`gather_qmm` kernel, 2.6× faster)
Apple **Core AI** (`.aimodel`) conversion of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash)
(text decoder): MLA attention + a **64-expert top-4 sparse MoE** (+ non-gated shared expert).
~30B total / **~3B active per token** — a strong local coder.
Part of the community Core AI model zoo: **https://github.com/john-rocky/coreai-model-zoo**
(full card: [`zoo/glm-4.7-flash.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/glm-4.7-flash.md)).
## The `gather_qmm` kernel — 20.3 → 52.4 tok/s (2.6×)
Apple's `GatherMM` reads **all 64 experts' weights every token**; a custom
`coreai_torch.TorchMetalKernel` reads **only the 4 routed experts** (4/64) → decode runs at
active-param bandwidth: **52.4 tok/s, 2.6×** (the biggest relative gain of the zoo's three MoE
gather ports — a 16× over-read removed).
**Quality is clean and unchanged.** The kernel reads the **`sym8`** scheme = the same
symmetric-linear int8 (per-K-block-32) recipe the standard int8 bundle uses, via a **bit-exact**
gather: **0 introduced flips / 18 vs fp16**. Pure speed win at the same quality.
| bundle | size | decode tok/s | quality |
|---|---:|---:|---|
| `gpu-pipelined/glm_4_7_flash_decode_sym8_gather/` | 30 GB | **52.4** | clean (0 flips/18 vs fp16) ✅ |
Mac-only (30 GB int8). Remaining speed lever = absorbed-MLA (GLM runs full MLA on all 47 layers).
## Run
```
COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model gpu-pipelined/glm_4_7_flash_decode_sym8_gather -p 128 -g 256 -n 3
```
Convert your own with [`conversion/export_glm47_moe_metal_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_glm47_moe_metal_decode_pipelined.py).
## License
MIT (upstream GLM license). Conversion + `gather_qmm` kernel: community.
|