GLM-5.2-7layer (layer-trimmed, for training/serving infra testing)
A layer-trimmed copy of zai-org/GLM-5.2,
reduced from 78 layers (+1 MTP) to 7 layers (+1 MTP) so every structurally
distinct layer type can be exercised on a small number of GPUs during early
training/serving infra development (LoRA, parallelism, MTP, etc.).
This is NOT a usable language model: most layers are removed, so generations are gibberish. It exists purely to let infra code load, shard, attach LoRA to, and run a forward/backward pass over every distinct layer of GLM-5.2 at a fraction of the size (~100 GB bf16 vs ~1.45 TB).
Why 7 layers (and why a contiguous prefix)
GLM-5.2 differs from GLM-5.1 in its attention: it uses a mixed DSA pattern.
Only some layers own a sparse-attention indexer (indexer_types = "full"); the
rest reuse a nearby full layer's top-k index ("shared"), governed by
index_topk_freq = 4 / index_skip_topk_offset = 3. Crucially, whether a layer
owns an indexer is decided by layer-id arithmetic, so the kept layers must
keep their original, contiguous ids; renumbering would misalign the indexer
weights with that arithmetic. Layers 0–6 are therefore kept verbatim (identity
ids); only the MTP layer (78) is renumbered to 7.
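The id mapping described above can be sketched as follows. This is a minimal illustration, not the script used to build this checkpoint; `build_layer_id_map` and its parameter names are hypothetical, and it assumes the MTP layer sits at index 78 in the base model as stated above.

```python
def build_layer_id_map(num_layers: int = 78, num_kept: int = 7) -> dict:
    """Map original layer ids to ids in the trimmed checkpoint.

    The kept decoder layers are a contiguous prefix and keep their ids.
    The MTP layer, stored at index `num_layers` in the base model, moves
    to index `num_kept`. Any id missing from the map is dropped.
    """
    id_map = {i: i for i in range(num_kept)}  # identity for 0..num_kept-1
    id_map[num_layers] = num_kept             # MTP layer: 78 -> 7
    return id_map


layer_id_map = build_layer_id_map()
# Original ids (decoder layers 0..77 plus MTP at 78) that are not kept.
dropped = [i for i in range(78 + 1) if i not in layer_id_map]

print(layer_id_map)
print(len(dropped), dropped[0], dropped[-1])  # 71 7 77
```

Because the prefix is mapped by identity, any rule that depends on the layer id gives the same answer for a kept layer before and after trimming.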
What was kept (verbatim bf16 weights, original ids 0–6)
| idx | source | MLP (mlp_layer_types) | Attention (indexer_types) | Why kept |
|---|---|---|---|---|
| 0,1,2 | 0,1,2 | dense | full (own DSA indexer) | the only dense layers (first_k_dense_replace=3) + indexer-owning |
| 3,4,5 | 3,4,5 | sparse (MoE: 256 routed + 1 shared) | shared (reuses a full layer's index; no own indexer weights) | the homogeneous MoE-with-shared-index block |
| 6 | 6 | sparse (MoE) | full (own DSA indexer) | a MoE layer that owns an indexer (the other distinct combo) |
| 7 | 78 | sparse (MoE) | MTP/nextn | the multi-token-prediction layer (enorm/hnorm/eh_proj/shared_head) |
| - | - | - | - | top-level embed_tokens, final norm, lm_head |
This covers all four distinct combinations present in GLM-5.2: dense+full, MoE+shared, MoE+full, and MTP.
What was removed
- Original layers 7–77 (71 MoE layers): duplicates of the kept MoE+shared (3–5) and MoE+full (6) types. Nothing else is changed.
What changed in config.json
- num_hidden_layers: 78 → 7.
- Per-layer pattern lists trimmed to the kept layers (first 7 entries), so HF / transformers builds the correct 7-layer model:
  - mlp_layer_types → ["dense","dense","dense","sparse","sparse","sparse","sparse"]
  - indexer_types → ["full","full","full","shared","shared","shared","full"]
- Everything else is identical to the base (first_k_dense_replace=3, index_topk_freq=4, index_skip_topk_offset=3, num_nextn_predict_layers=1, expert counts, all dims), so the kept layers are bit-for-bit the real GLM-5.2: equivalent to the full model's first 7 layers + its MTP layer.
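As a rough sketch of the config edit described in this section, the snippet below trims a stand-in config dict. `trim_config` and `PER_LAYER_KEYS` are hypothetical names, and the `base` dict is not the real config.json: it carries only the fields named in this card, and the tail of `indexer_types` (layers 7–77) is an explicit `"?"` placeholder because this card does not list those entries.

```python
import copy

# Per-layer lists named in this card; both are indexed by layer id.
PER_LAYER_KEYS = ("mlp_layer_types", "indexer_types")


def trim_config(config: dict, num_kept: int) -> dict:
    """Return a copy of `config` reduced to the first `num_kept` layers."""
    trimmed = copy.deepcopy(config)  # leave the caller's dict untouched
    trimmed["num_hidden_layers"] = num_kept
    for key in PER_LAYER_KEYS:
        trimmed[key] = trimmed[key][:num_kept]
    return trimmed


# Stand-in for the base config (assumption: only fields quoted in this card).
base = {
    "num_hidden_layers": 78,
    "first_k_dense_replace": 3,
    "index_topk_freq": 4,
    "index_skip_topk_offset": 3,
    "num_nextn_predict_layers": 1,
    "mlp_layer_types": ["dense"] * 3 + ["sparse"] * 75,
    # First 7 entries are from this card; "?" marks entries it does not list.
    "indexer_types": ["full"] * 3 + ["shared"] * 3 + ["full"] + ["?"] * 71,
}

trimmed = trim_config(base, num_kept=7)
print(trimmed["num_hidden_layers"])
print(trimmed["mlp_layer_types"])
print(trimmed["indexer_types"])
```

Slicing a prefix is only valid because the kept layers are ids 0–6; a non-contiguous selection would need the lists to be re-derived rather than truncated.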
Coverage checklist (all distinct layer types present ≥ once)
- Dense MLP layer (0–2)
- MoE layer: routed + shared experts, gate (3–6)
- DSA indexer-owning layer ("full": 0,1,2,6)
- Index-reusing layer ("shared": 3,4,5)
- MTP / nextn layer (7)
- embed_tokens / final norm / lm_head
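The decoder-layer part of this checklist can be re-derived from the two trimmed config lists. The snippet below is a small sanity-check sketch, not part of the published tooling; `layer_combos` is a hypothetical helper. The MTP layer (7) and the top-level tensors are not described by these lists, so they are not covered here.

```python
mlp_layer_types = ["dense", "dense", "dense", "sparse", "sparse", "sparse", "sparse"]
indexer_types = ["full", "full", "full", "shared", "shared", "shared", "full"]


def layer_combos(mlp: list, indexer: list) -> dict:
    """Group layer ids by their (MLP type, indexer type) pair."""
    if len(mlp) != len(indexer):
        raise ValueError("per-layer lists must have the same length")
    combos = {}
    for layer_id, pair in enumerate(zip(mlp, indexer)):
        combos.setdefault(pair, []).append(layer_id)
    return combos


combos = layer_combos(mlp_layer_types, indexer_types)
for pair, ids in combos.items():
    print(pair, ids)

# Every decoder-layer combination named in this card appears at least once.
expected = {("dense", "full"), ("sparse", "shared"), ("sparse", "full")}
missing = expected - set(combos)
print("missing:", missing)
```

Running the same check against a differently trimmed config is a quick way to see whether a layer type was lost.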
Provenance
Produced by selecting the relevant shards of zai-org/GLM-5.2, copying the kept
tensors verbatim (bf16) with original layer ids preserved (MTP renumbered 78→7),
trimming the per-layer config lists, and rewriting the safetensors index +
num_hidden_layers. Verified to load and run a forward pass (base + a LoRA
adapter spanning all kept layers) in SGLang (main).
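The index-rewriting step can be sketched as below. This is an illustration under assumptions, not the script that produced this checkpoint: `rewrite_weight_map` is a hypothetical helper, the `model.layers.<id>.` prefix is assumed from the usual HF naming, and the tensor and shard names in the toy `weight_map` are made up for the example. It covers only the `weight_map` of the safetensors index; the tensors inside the shard files need the same renaming, and the index's size metadata would have to be recomputed separately.

```python
import re

# Assumed HF-style naming: "model.layers.<id>.<rest>".
LAYER_RE = re.compile(r"^(model\.layers\.)(\d+)(\..*)$")


def rewrite_weight_map(weight_map: dict, layer_id_map: dict) -> dict:
    """Drop removed layers and renumber kept ones in an index weight_map."""
    new_map = {}
    for name, shard in weight_map.items():
        match = LAYER_RE.match(name)
        if match is None:
            new_map[name] = shard  # top-level tensor: keep as-is
            continue
        old_id = int(match.group(2))
        if old_id not in layer_id_map:
            continue  # layer removed from the trimmed checkpoint
        new_id = layer_id_map[old_id]
        new_map[f"{match.group(1)}{new_id}{match.group(3)}"] = shard
    return new_map


# Layers 0..6 keep their ids; the MTP layer moves from 78 to 7.
layer_id_map = {**{i: i for i in range(7)}, 78: 7}

# Toy index entries (hypothetical tensor and shard names).
weight_map = {
    "model.embed_tokens.weight": "shard-a.safetensors",
    "model.layers.6.self_attn.q_proj.weight": "shard-a.safetensors",
    "model.layers.7.self_attn.q_proj.weight": "shard-b.safetensors",
    "model.layers.78.eh_proj.weight": "shard-z.safetensors",
    "lm_head.weight": "shard-z.safetensors",
}

new_weight_map = rewrite_weight_map(weight_map, layer_id_map)
for name, shard in new_weight_map.items():
    print(name, "->", shard)
```

In this toy input, the original layer 7 entry is dropped and the MTP entry takes over index 7, so the two never collide in the output.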
Model tree for jybsuper/GLM-5.2-7layer
Base model
zai-org/GLM-5.2