mlboydaisuke's picture
GLM-4.7-Flash: gather_qmm kernel card (2.6x, clean)
2f8deff verified
|
Raw
History Blame Contribute Delete
2.04 kB
---
license: mit
base_model: zai-org/GLM-4.7-Flash
pipeline_tag: text-generation
library_name: core-ai
tags:
- core-ai
- coreml
- apple
- moe
- mla
- on-device
- metal
---
# GLM-4.7-Flash β€” Core AI (`gather_qmm` kernel, 2.6Γ— faster)
Apple **Core AI** (`.aimodel`) conversion of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash)
(text decoder): MLA attention + a **64-expert top-4 sparse MoE** (+ non-gated shared expert).
~30B total / **~3B active per token** β€” a strong local coder.
Part of the community Core AI model zoo: **https://github.com/john-rocky/coreai-model-zoo**
(full card: [`zoo/glm-4.7-flash.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/glm-4.7-flash.md)).
## The `gather_qmm` kernel β€” 20.3 β†’ 52.4 tok/s (2.6Γ—)
Apple's `GatherMM` reads **all 64 experts' weights every token**; a custom
`coreai_torch.TorchMetalKernel` reads **only the 4 routed experts** (4/64) β†’ decode runs at
active-param bandwidth: **52.4 tok/s, 2.6Γ—** (the biggest relative gain of the zoo's three MoE
gather ports β€” a 16Γ— over-read removed).
**Quality is clean and unchanged.** The kernel reads the **`sym8`** scheme = the same
symmetric-linear int8 (per-K-block-32) recipe the standard int8 bundle uses, via a **bit-exact**
gather: **0 introduced flips / 18 vs fp16**. Pure speed win at the same quality.
| bundle | size | decode tok/s | quality |
|---|---:|---:|---|
| `gpu-pipelined/glm_4_7_flash_decode_sym8_gather/` | 30 GB | **52.4** | clean (0 flips/18 vs fp16) βœ… |
Mac-only (30 GB int8). Remaining speed lever = absorbed-MLA (GLM runs full MLA on all 47 layers).
## Run
```
COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model gpu-pipelined/glm_4_7_flash_decode_sym8_gather -p 128 -g 256 -n 3
```
Convert your own with [`conversion/export_glm47_moe_metal_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_glm47_moe_metal_decode_pipelined.py).
## License
MIT (upstream GLM license). Conversion + `gather_qmm` kernel: community.