GLM-5.2-7layer (layer-trimmed, for training/serving infra testing)

A layer-trimmed copy of zai-org/GLM-5.2, reduced from 78 layers (+1 MTP) to 7 layers (+1 MTP) so every structurally distinct layer type can be exercised on a small number of GPUs during early training/serving infra development (LoRA, parallelism, MTP, etc.).

This is NOT a usable language model: most layers are removed, so generations are gibberish. It exists purely to let infra code load, shard, attach LoRA to, and run a forward/backward pass over every distinct layer of GLM-5.2 at a fraction of the size (~100 GB bf16 vs ~1.45 TB).

Why 7 layers (and why a contiguous prefix)

GLM-5.2 differs from GLM-5.1 in its attention: it uses a mixed DSA pattern. Only some layers own a sparse-attention indexer (indexer_types = "full"); the rest reuse a nearby full layer's top-k index ("shared"), governed by index_topk_freq = 4 / index_skip_topk_offset = 3. Crucially, whether a layer owns an indexer is decided by layer-id arithmetic, so the kept layers must keep their original, contiguous ids; renumbering would misalign the indexer weights with that arithmetic. Layers 0–6 are therefore kept verbatim (identity ids); only the MTP layer (78) is renumbered to 7.
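To make the renumbering hazard concrete, here is a minimal sketch. The function `assumed_owns_indexer` is hypothetical: it is one rule that happens to reproduce the pattern listed for layers 0–6 with index_topk_freq = 4 and index_skip_topk_offset = 3, and the actual arithmetic in the GLM-5.2 implementation may differ.

```python
def assumed_owns_indexer(layer_id: int, freq: int = 4, offset: int = 3) -> bool:
    """Hypothetical rule (an assumption, not the library's code):
    layers before `offset` own an indexer, then one layer in every `freq` does."""
    if layer_id < offset:
        return True
    return (layer_id - offset) % freq == freq - 1


def label(layer_id: int) -> str:
    return "full" if assumed_owns_indexer(layer_id) else "shared"


# A contiguous prefix keeps ids and weights aligned.
kept_pattern = [label(i) for i in range(7)]

# A non-contiguous selection, renumbered to 0..2, does not:
# each entry is (new id, source id) where the rule disagrees.
source_ids = [0, 3, 6]
mismatches = [
    (new_id, src)
    for new_id, src in enumerate(source_ids)
    if assumed_owns_indexer(new_id) != assumed_owns_indexer(src)
]

print(kept_pattern)
print(mismatches)
```

Under this assumed rule, source layer 3 carries no indexer weights, yet once renumbered to id 1 the model would expect it to own an indexer. Keeping the prefix 0–6 avoids that class of mismatch.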

What was kept (verbatim bf16 weights, original ids 0–6)

| idx | source | MLP (mlp_layer_types) | Attention (indexer_types) | Why kept |
|---|---|---|---|---|
| 0,1,2 | 0,1,2 | dense | full (own DSA indexer) | the only dense layers (first_k_dense_replace=3) + indexer-owning |
| 3,4,5 | 3,4,5 | sparse (MoE: 256 routed + 1 shared) | shared (reuses a full layer's index; no own indexer weights) | the homogeneous MoE-with-shared-index block |
| 6 | 6 | sparse (MoE) | full (own DSA indexer) | a MoE layer that owns an indexer (the other distinct combo) |
| 7 | 78 | sparse (MoE) | MTP/nextn | the multi-token-prediction layer (enorm/hnorm/eh_proj/shared_head) |
| - | - | - | - | top-level embed_tokens, final norm, lm_head |

This covers all four distinct combinations present in GLM-5.2: dense+full, MoE+shared, MoE+full, and MTP.
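As a quick sanity check, the per-layer combinations can be enumerated from the two pattern lists in the trimmed config. This is only a sketch: the MTP layer is not part of these lists, so it accounts for the fourth combination separately.

```python
# Per-layer pattern lists for the 7 kept layers, as in the trimmed config.
mlp_layer_types = ["dense", "dense", "dense", "sparse", "sparse", "sparse", "sparse"]
indexer_types = ["full", "full", "full", "shared", "shared", "shared", "full"]

# Group layer ids by their (MLP type, indexer type) combination.
combos = {}
for layer_id, combo in enumerate(zip(mlp_layer_types, indexer_types)):
    combos.setdefault(combo, []).append(layer_id)

for combo, layer_ids in combos.items():
    print(combo, layer_ids)
```

Three combinations come from the decoder layers; the MTP layer (id 7) is the fourth.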

What was removed

  • Original layers 7–77 (71 MoE layers): duplicates of the kept MoE+shared (3–5) and MoE+full (6) types. Nothing else is changed.

What changed in config.json

  • num_hidden_layers: 78 → 7.
  • Per-layer pattern lists trimmed to the kept layers (first 7 entries), so HF / transformers builds the correct 7-layer model:
    • mlp_layer_types → ["dense","dense","dense","sparse","sparse","sparse","sparse"]
    • indexer_types → ["full","full","full","shared","shared","shared","full"]
  • Everything else is identical to the base (first_k_dense_replace=3, index_topk_freq=4, index_skip_topk_offset=3, num_nextn_predict_layers=1, expert counts, all dims), so the kept layers are bit-for-bit the real GLM-5.2, equivalent to the full model's first 7 layers + its MTP layer.
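The config edits above can be sketched as a small helper. This is an illustration, not the script that produced this repo: `trim_config` is a hypothetical name, and the tail of `indexer_types` in the stand-in base config is a placeholder because only the first 7 entries are stated here.

```python
import copy


def trim_config(config: dict, keep: int = 7) -> dict:
    """Return a copy with the layer count and per-layer lists cut to `keep`."""
    trimmed = copy.deepcopy(config)
    trimmed["num_hidden_layers"] = keep
    for key in ("mlp_layer_types", "indexer_types"):
        trimmed[key] = trimmed[key][:keep]
    return trimmed


# Stand-in for the base config.json; only the fields touched here are shown.
base = {
    "num_hidden_layers": 78,
    "first_k_dense_replace": 3,
    "mlp_layer_types": ["dense"] * 3 + ["sparse"] * 75,
    # Entries past index 6 are placeholders, not the real base values.
    "indexer_types": ["full", "full", "full", "shared", "shared", "shared", "full"]
    + ["<not shown>"] * 71,
}

trimmed = trim_config(base)
print(trimmed["num_hidden_layers"], trimmed["mlp_layer_types"], trimmed["indexer_types"])
```

Every key other than the three edited ones passes through unchanged, which matches the "everything else is identical" property described above.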

Coverage checklist (all distinct layer types present at least once)

  • Dense MLP layer (0–2)
  • MoE layer: routed + shared experts, gate (3–6)
  • DSA indexer-owning layer ("full": 0,1,2,6)
  • Index-reusing layer ("shared": 3,4,5)
  • MTP / nextn layer (7)
  • embed_tokens / final norm / lm_head

Provenance

Produced by selecting the relevant shards of zai-org/GLM-5.2, copying the kept tensors verbatim (bf16) with original layer ids preserved (MTP renumbered 78→7), trimming the per-layer config lists, and rewriting the safetensors index + num_hidden_layers. Verified to load and run a forward pass (base + a LoRA adapter spanning all kept layers) in SGLang (main).
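The tensor selection and renumbering step might look roughly like the sketch below. It assumes tensor names follow a `model.layers.<id>.` prefix and that the safetensors index maps tensor names to shard files through a `weight_map`; the tensor and shard names in the toy example are hypothetical, and the sketch covers only the index mapping, not the copying of tensor data.

```python
import re

# Assumed naming convention: "model.layers.<id>.<rest>".
LAYER_KEY = re.compile(r"^(model\.layers\.)(\d+)(\..+)$")


def remap_key(name, kept=range(7), mtp_src=78, mtp_dst=7):
    """Return the new tensor name, or None if the tensor is dropped."""
    m = LAYER_KEY.match(name)
    if m is None:
        return name  # top-level tensor (embeddings, final norm, lm_head)
    layer_id = int(m.group(2))
    if layer_id in kept:
        return name  # kept verbatim, original id preserved
    if layer_id == mtp_src:
        return f"{m.group(1)}{mtp_dst}{m.group(3)}"  # MTP layer 78 -> 7
    return None  # removed layers 7..77


def trim_weight_map(weight_map):
    """Apply remap_key to every entry of a name -> shard-file mapping."""
    out = {}
    for name, shard in weight_map.items():
        new_name = remap_key(name)
        if new_name is not None:
            out[new_name] = shard
    return out


# Toy weight_map with hypothetical tensor and shard names.
toy = {
    "model.embed_tokens.weight": "shard-a.safetensors",
    "model.layers.0.mlp.down_proj.weight": "shard-a.safetensors",
    "model.layers.42.mlp.gate.weight": "shard-b.safetensors",
    "model.layers.78.eh_proj.weight": "shard-c.safetensors",
    "lm_head.weight": "shard-c.safetensors",
}
print(trim_weight_map(toy))
```

Because the MTP tensors are renamed, the shard files that hold them would also need to be rewritten under the new names; the index alone is not enough.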
