| --- |
| license: mit |
| base_model: zai-org/GLM-4.7-Flash |
| pipeline_tag: text-generation |
| library_name: core-ai |
| tags: |
| - core-ai |
| - coreml |
| - apple |
| - moe |
| - mla |
| - on-device |
| - metal |
| --- |
| |
| # GLM-4.7-Flash: Core AI (`gather_qmm` kernel, 2.6× faster) |
| |
| Apple **Core AI** (`.aimodel`) conversion of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) |
| (text decoder): MLA attention + a **64-expert top-4 sparse MoE** (+ non-gated shared expert). |
| ~30B total / **~3B active per token**, a strong local coder. |
| |
| Part of the community Core AI model zoo: **https://github.com/john-rocky/coreai-model-zoo** |
| (full card: [`zoo/glm-4.7-flash.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/glm-4.7-flash.md)). |
| |
| ## The `gather_qmm` kernel: 20.3 → 52.4 tok/s (2.6×) |
|
|
| Apple's `GatherMM` reads **all 64 experts' weights every token**; a custom |
| `coreai_torch.TorchMetalKernel` reads **only the 4 routed experts** (4/64), so decode runs at |
| active-param bandwidth: **52.4 tok/s, 2.6×** (the biggest relative gain of the zoo's three MoE |
| gather ports, since a 16× over-read is removed). |
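
The 16× figure is just 64 experts touched versus 4 routed. A toy sketch of that difference, in plain Python rather than Metal and with unquantized weights, is below. Everything except the 64/top-4 shape is an assumption for illustration: the toy dimensions, the random weights, the stand-in router and the uniform gate values are all hypothetical, and `moe_dense` / `moe_gather` are not the names of any Core AI or `coreai_torch` API.

```python
import random

NUM_EXPERTS, TOP_K = 64, 4  # from the card: 64 experts, top-4 routing
D_IN, D_OUT = 8, 6          # toy sizes, not the model's real dimensions

rng = random.Random(0)
# One small weight matrix per expert (hypothetical toy data).
experts = [[[rng.uniform(-1, 1) for _ in range(D_IN)] for _ in range(D_OUT)]
           for _ in range(NUM_EXPERTS)]


def matvec(w, x):
    """Plain matrix-vector product over Python lists."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def moe_dense(x, routed, gates):
    """Touch every expert's weights, then keep only the routed outputs."""
    gate_of = dict(zip(routed, gates))
    out, reads = [0.0] * D_OUT, 0
    for e, w in enumerate(experts):
        y = matvec(w, x)  # weights are read even when expert e is unused
        reads += 1
        if e in gate_of:
            out = [o + gate_of[e] * yi for o, yi in zip(out, y)]
    return out, reads


def moe_gather(x, routed, gates):
    """Touch only the routed experts' weights."""
    out, reads = [0.0] * D_OUT, 0
    for e, g in zip(routed, gates):
        y = matvec(experts[e], x)
        reads += 1
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, reads


x = [rng.uniform(-1, 1) for _ in range(D_IN)]
routed = sorted(rng.sample(range(NUM_EXPERTS), TOP_K))  # stand-in for the router
gates = [1.0 / TOP_K] * TOP_K                           # stand-in gate weights

dense_out, dense_reads = moe_dense(x, routed, gates)
gather_out, gather_reads = moe_gather(x, routed, gates)
print("expert reads:", dense_reads, "vs", gather_reads)
print("over-read factor:", dense_reads / gather_reads)
print("same output:", dense_out == gather_out)
```

Both paths accumulate the routed experts in the same order, so the toy outputs match exactly; only the number of expert weight reads changes. Decode is bandwidth-bound in this picture, which is why cutting reads rather than arithmetic is what moves tok/s.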
|
|
| **Quality is clean and unchanged.** The kernel reads the **`sym8`** scheme = the same |
| symmetric-linear int8 (per-K-block-32) recipe the standard int8 bundle uses, via a **bit-exact** |
| gather: **0 introduced flips / 18 vs fp16**. Pure speed win at the same quality. |
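
For readers unfamiliar with the term, "symmetric-linear int8, per-K-block-32" can be sketched as follows. This is a minimal illustration under stated assumptions, not the bundle's actual recipe: the scale rule (`max|w| / 127`), round-to-nearest, and the `[-127, 127]` clamp are common choices assumed here, and `quantize_sym8` / `dequantize_sym8` are hypothetical helper names.

```python
import random

BLOCK = 32  # block length along K, from the card's "per-K-block-32"


def quantize_sym8(row):
    """Return (int8 codes, per-block scales) for one weight row.

    Assumed recipe: one scale per 32-element block, no zero point.
    """
    codes, scales = [], []
    for start in range(0, len(row), BLOCK):
        block = row[start:start + BLOCK]
        amax = max(abs(v) for v in block)
        scale = amax / 127.0 if amax > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-127, min(127, round(v / scale))) for v in block)
    return codes, scales


def dequantize_sym8(codes, scales):
    """Map int8 codes back to floats using each block's scale."""
    return [c * scales[i // BLOCK] for i, c in enumerate(codes)]


rng = random.Random(0)
row = [rng.uniform(-2, 2) for _ in range(2 * BLOCK)]  # toy row with K = 64
codes, scales = quantize_sym8(row)
restored = dequantize_sym8(codes, scales)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print("blocks:", len(scales), "max abs error:", max_err)
```

Under this scheme a gather only selects which experts' stored codes and scales are read; it does not requantize anything, which is consistent with the card reporting no flips introduced by the kernel.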
|
|
| | bundle | size | decode tok/s | quality | |
| |---|---:|---:|---| |
| | `gpu-pipelined/glm_4_7_flash_decode_sym8_gather/` | 30 GB | **52.4** | clean (0 flips/18 vs fp16) ✅ | |
|
|
| Mac-only (30 GB int8). Remaining speed lever = absorbed-MLA (GLM runs full MLA on all 47 layers). |
|
|
| ## Run |
|
|
| ``` |
| COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model gpu-pipelined/glm_4_7_flash_decode_sym8_gather -p 128 -g 256 -n 3 |
| ``` |
|
|
| Convert your own with [`conversion/export_glm47_moe_metal_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_glm47_moe_metal_decode_pipelined.py). |
|
|
| ## License |
|
|
| MIT (upstream GLM license). Conversion + `gather_qmm` kernel: community. |
|
|