GLM-4.7-Flash: gather_qmm kernel card (2.6x, clean)

2f8deff verified 16 days ago

2.04 kB

	---
	license: mit
	base_model: zai-org/GLM-4.7-Flash
	pipeline_tag: text-generation
	library_name: core-ai
	tags:
	- core-ai
	- coreml
	- apple
	- moe
	- mla
	- on-device
	- metal
	---

	# GLM-4.7-Flash — Core AI (`gather_qmm` kernel, 2.6× faster)

	Apple Core AI (`.aimodel`) conversion of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash)
	(text decoder): MLA attention + a 64-expert top-4 sparse MoE (+ non-gated shared expert).
	~30B total / ~3B active per token — a strong local coder.

	Part of the community Core AI model zoo: https://github.com/john-rocky/coreai-model-zoo
	(full card: [`zoo/glm-4.7-flash.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/glm-4.7-flash.md)).

	## The `gather_qmm` kernel — 20.3 → 52.4 tok/s (2.6×)

	Apple's `GatherMM` reads all 64 experts' weights every token; a custom
	`coreai_torch.TorchMetalKernel` reads only the 4 routed experts (4/64) → decode runs at
	active-param bandwidth: 52.4 tok/s, 2.6× (the biggest relative gain of the zoo's three MoE
	gather ports — a 16× over-read removed).

	Quality is clean and unchanged. The kernel reads the `sym8` scheme = the same
	symmetric-linear int8 (per-K-block-32) recipe the standard int8 bundle uses, via a bit-exact
	gather: 0 introduced flips / 18 vs fp16. Pure speed win at the same quality.

	\| bundle \| size \| decode tok/s \| quality \|
	\|---\|---:\|---:\|---\|
	\| `gpu-pipelined/glm_4_7_flash_decode_sym8_gather/` \| 30 GB \| 52.4 \| clean (0 flips/18 vs fp16) ✅ \|

	Mac-only (30 GB int8). Remaining speed lever = absorbed-MLA (GLM runs full MLA on all 47 layers).

	## Run

	```
	COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model gpu-pipelined/glm_4_7_flash_decode_sym8_gather -p 128 -g 256 -n 3
	```

	Convert your own with [`conversion/export_glm47_moe_metal_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_glm47_moe_metal_decode_pipelined.py).

	## License

	MIT (upstream GLM license). Conversion + `gather_qmm` kernel: community.