File size: 2,036 Bytes
ebf7393
 
 
 
2f8deff
 
 
 
 
 
 
 
 
ebf7393
 
2f8deff
ebf7393
2f8deff
 
 
ebf7393
2f8deff
 
ebf7393
2f8deff
ebf7393
2f8deff
 
 
 
ebf7393
2f8deff
 
 
ebf7393
2f8deff
 
 
ebf7393
2f8deff
ebf7393
 
 
 
2f8deff
 
 
 
ebf7393
 
 
2f8deff
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
---
license: mit
base_model: zai-org/GLM-4.7-Flash
pipeline_tag: text-generation
library_name: core-ai
tags:
- core-ai
- coreml
- apple
- moe
- mla
- on-device
- metal
---

# GLM-4.7-Flash — Core AI (`gather_qmm` kernel, 2.6× faster)

Apple **Core AI** (`.aimodel`) conversion of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash)
(text decoder): MLA attention + a **64-expert top-4 sparse MoE** (+ non-gated shared expert).
~30B total / **~3B active per token** — a strong local coder.

Part of the community Core AI model zoo: **https://github.com/john-rocky/coreai-model-zoo**
(full card: [`zoo/glm-4.7-flash.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/glm-4.7-flash.md)).

## The `gather_qmm` kernel — 20.3 → 52.4 tok/s (2.6×)

Apple's `GatherMM` reads **all 64 experts' weights every token**; a custom
`coreai_torch.TorchMetalKernel` reads **only the 4 routed experts** (4/64) → decode runs at
active-param bandwidth: **52.4 tok/s, 2.6×** (the biggest relative gain of the zoo's three MoE
gather ports — a 16× over-read removed).

**Quality is clean and unchanged.** The kernel reads the **`sym8`** scheme = the same
symmetric-linear int8 (per-K-block-32) recipe the standard int8 bundle uses, via a **bit-exact**
gather: **0 introduced flips / 18 vs fp16**. Pure speed win at the same quality.

| bundle | size | decode tok/s | quality |
|---|---:|---:|---|
| `gpu-pipelined/glm_4_7_flash_decode_sym8_gather/` | 30 GB | **52.4** | clean (0 flips/18 vs fp16) ✅ |

Mac-only (30 GB int8). Remaining speed lever = absorbed-MLA (GLM runs full MLA on all 47 layers).

## Run

```
COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model gpu-pipelined/glm_4_7_flash_decode_sym8_gather -p 128 -g 256 -n 3
```

Convert your own with [`conversion/export_glm47_moe_metal_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_glm47_moe_metal_decode_pipelined.py).

## License

MIT (upstream GLM license). Conversion + `gather_qmm` kernel: community.