GLM-5.2-7layer (layer-trimmed, for training/serving infra testing)

A layer-trimmed copy of zai-org/GLM-5.2, reduced from 78 layers (+1 MTP) to 7 layers (+1 MTP) so every structurally distinct layer type can be exercised on a small number of GPUs during early training/serving infra development (LoRA, parallelism, MTP, etc.).

This is NOT a usable language model — most layers are removed, so generations are gibberish. It exists purely to let infra code load, shard, attach LoRA to, and run a forward/backward pass over every distinct layer of GLM-5.2 at a fraction of the size (~100 GB bf16 vs ~1.45 TB).

Why 7 layers (and why a contiguous prefix)

GLM-5.2 differs from GLM-5.1 in its attention: it uses a mixed DSA pattern. Only some layers own a sparse-attention indexer (indexer_types = "full"); the rest reuse a nearby full layer's top-k index ("shared"), governed by index_topk_freq = 4 / index_skip_topk_offset = 3. Crucially, whether a layer owns an indexer is decided by layer-id arithmetic, so the kept layers must keep their original, contiguous ids — renumbering would misalign the indexer weights with that arithmetic. Layers 0–6 are therefore kept verbatim (identity ids); only the MTP layer (78) is renumbered to 7.
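The dependence on layer ids can be illustrated with a minimal sketch. The exact rule used by the GLM-5.2 implementation is not spelled out in this card; `owns_indexer` below is a hypothetical rule, chosen only because it reproduces the "full"/"shared" pattern listed for layers 0–6 with `index_topk_freq = 4` and `index_skip_topk_offset = 3`. The non-contiguous selection `[0, 3, 6]` is likewise an invented example.

```python
INDEX_TOPK_FREQ = 4
INDEX_SKIP_TOPK_OFFSET = 3


def owns_indexer(layer_id, freq=INDEX_TOPK_FREQ, offset=INDEX_SKIP_TOPK_OFFSET):
    """Hypothetical ownership rule (an assumption, not the library's code).

    Layers below `offset` own an indexer; after that, every `freq`-th
    layer does. It matches the pattern this card lists for layers 0-6.
    """
    if layer_id < offset:
        return True
    return (layer_id - offset) % freq == freq - 1


# Contiguous prefix 0..6: the computed pattern follows the original ids.
pattern = ["full" if owns_indexer(i) else "shared" for i in range(7)]

# Invented counter-example: keep source layers 0, 3, 6 and renumber them 0, 1, 2.
# A mismatch means the new id expects indexer weights the source layer lacks
# (or the reverse).
kept_sources = [0, 3, 6]
mismatches = [
    (new_id, src_id)
    for new_id, src_id in enumerate(kept_sources)
    if owns_indexer(new_id) != owns_indexer(src_id)
]
```

Under this assumed rule, source layer 3 carries no indexer weights, yet at new id 1 the arithmetic would expect them. Keeping a contiguous prefix avoids that class of mismatch.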

What was kept (verbatim bf16 weights, original ids 0–6)

| idx | source layer | MLP (`mlp_layer_types`) | Attention (`indexer_types`) | Why kept |
|---|---|---|---|---|
| 0, 1, 2 | 0, 1, 2 | dense | full (own DSA indexer) | the only dense layers (`first_k_dense_replace=3`), which also own an indexer |
| 3, 4, 5 | 3, 4, 5 | sparse (MoE: 256 routed + 1 shared) | shared (reuses a full layer's index; no indexer weights of its own) | the homogeneous MoE-with-shared-index block |
| 6 | 6 | sparse (MoE) | full (own DSA indexer) | a MoE layer that owns an indexer (the other distinct combo) |
| 7 | 78 | sparse (MoE) | MTP/nextn | the multi-token-prediction layer (`enorm`/`hnorm`/`eh_proj`/`shared_head`) |
| n/a | n/a | n/a | n/a | top-level `embed_tokens`, final `norm`, `lm_head` |

This covers all four distinct combinations present in GLM-5.2: dense+full, MoE+shared, MoE+full, and MTP.

What was removed

Original layers 7–77 (71 MoE layers) — duplicates of the kept MoE+shared (3–5) and MoE+full (6) types. Nothing else is changed.

What changed in `config.json`

num_hidden_layers: 78 → 7.
Per-layer pattern lists trimmed to the kept layers (first 7 entries), so HF / transformers builds the correct 7-layer model:
- mlp_layer_types → ["dense","dense","dense","sparse","sparse","sparse","sparse"]
- indexer_types → ["full","full","full","shared","shared","shared","full"]
Everything else is identical to the base (first_k_dense_replace=3, index_topk_freq=4, index_skip_topk_offset=3, num_nextn_predict_layers=1, expert counts, all dims), so the kept layers are bit-for-bit the real GLM-5.2 — equivalent to the full model's first 7 layers + its MTP layer.
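The config edit described above can be sketched as follows. This is a minimal illustration, not the script that produced this repo: `trim_config` is a hypothetical helper, and `base` is a synthetic stand-in for the real `config.json`. Only the first seven `indexer_types` entries come from this card, so the tail is filled with a placeholder string.

```python
import copy

KEEP = 7  # number of leading layers kept


def trim_config(base, keep=KEEP):
    """Return a copy of `base` with the layer count and per-layer lists trimmed."""
    cfg = copy.deepcopy(base)
    cfg["num_hidden_layers"] = keep
    for key in ("mlp_layer_types", "indexer_types"):
        cfg[key] = cfg[key][:keep]
    return cfg


# Synthetic stand-in for the base config (assumption: only the fields shown).
base = {
    "num_hidden_layers": 78,
    "first_k_dense_replace": 3,
    "index_topk_freq": 4,
    "index_skip_topk_offset": 3,
    "num_nextn_predict_layers": 1,
    # 3 dense layers followed by 75 MoE layers, as described in this card.
    "mlp_layer_types": ["dense"] * 3 + ["sparse"] * 75,
    # Entries beyond the first 7 are not listed in this card: placeholder only.
    "indexer_types": ["full", "full", "full", "shared", "shared", "shared", "full"]
    + ["placeholder"] * 71,
}

trimmed = trim_config(base)
```

Every key other than `num_hidden_layers` and the two per-layer lists passes through unchanged, which mirrors the "everything else is identical" claim.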

Coverage checklist (all distinct layer types present ≥ once)

Dense MLP layer (0–2)
MoE layer — routed + shared experts, gate (3–6)
DSA indexer-owning layer ("full": 0,1,2,6)
Index-reusing layer ("shared": 3,4,5)
MTP / nextn layer (7)
embed_tokens / final norm / lm_head
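The per-layer part of this checklist can be recomputed from the two trimmed config lists. The sketch below is only a sanity check over those lists; `coverage_report` is a hypothetical helper, and it does not cover the MTP layer or the top-level tensors, which are not described by these lists.

```python
mlp_layer_types = ["dense", "dense", "dense", "sparse", "sparse", "sparse", "sparse"]
indexer_types = ["full", "full", "full", "shared", "shared", "shared", "full"]


def coverage_report(mlp, idx):
    """Group layer ids by MLP type and indexer type, and list distinct combos."""
    def ids_where(values, wanted):
        return [i for i, v in enumerate(values) if v == wanted]

    return {
        "dense": ids_where(mlp, "dense"),
        "moe": ids_where(mlp, "sparse"),
        "full": ids_where(idx, "full"),
        "shared": ids_where(idx, "shared"),
        "combos": set(zip(mlp, idx)),
    }


report = coverage_report(mlp_layer_types, indexer_types)
```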

Provenance

Produced by selecting the relevant shards of zai-org/GLM-5.2, copying the kept tensors verbatim (bf16) with original layer ids preserved (MTP renumbered 78→7), trimming the per-layer config lists, and rewriting the safetensors index + num_hidden_layers. Verified to load and run a forward pass (base + a LoRA adapter spanning all kept layers) in SGLang (main).
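The tensor selection and renumbering step can be sketched as a pure function over tensor names. This is a hedged illustration, not the actual conversion script: it assumes per-layer tensors are named `model.layers.<id>.<rest>`, and the example names and shard filenames in the usage below are invented. It operates on the `weight_map` dictionary of a safetensors index and keeps the shard filenames as they are; a real conversion would also rewrite the shards themselves.

```python
import re

# Assumption: per-layer tensors follow the "model.layers.<id>.<rest>" naming.
LAYER_RE = re.compile(r"^(model\.layers\.)(\d+)(\..+)$")
KEPT_IDS = range(0, 7)       # layers 0-6 keep their original ids
MTP_SRC, MTP_DST = 78, 7     # the MTP layer is renumbered 78 -> 7


def remap_key(name):
    """Return the new tensor name, or None if the tensor is dropped."""
    m = LAYER_RE.match(name)
    if m is None:
        return name  # top-level tensor (embeddings, final norm, lm_head)
    prefix, layer_id, rest = m.group(1), int(m.group(2)), m.group(3)
    if layer_id in KEPT_IDS:
        return name
    if layer_id == MTP_SRC:
        return f"{prefix}{MTP_DST}{rest}"
    return None  # layers 7-77 are removed


def trim_weight_map(weight_map):
    """Apply remap_key to every entry of a safetensors index weight_map."""
    trimmed = {}
    for name, shard in weight_map.items():
        new_name = remap_key(name)
        if new_name is not None:
            trimmed[new_name] = shard
    return trimmed


# Invented example entries, for illustration only.
example = {
    "model.embed_tokens.weight": "shard-a.safetensors",
    "model.layers.6.mlp.gate.weight": "shard-b.safetensors",
    "model.layers.7.mlp.gate.weight": "shard-c.safetensors",
    "model.layers.78.eh_proj.weight": "shard-d.safetensors",
    "lm_head.weight": "shard-d.safetensors",
}
result = trim_weight_map(example)
```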

Safetensors

Model size: 53B params
Tensor types: BF16, F32
