GLM-5.1-6layer (layer-trimmed, for training/serving infra testing)
A layer-trimmed copy of zai-org/GLM-5.1,
reduced from 78 layers (+1 MTP) to 6 layers (+1 MTP), so the full set of
distinct layer types can be exercised on a small number of GPUs during early
training/serving infra development (LoRA, parallelism, MTP, etc.).
This is NOT a usable language model: most layers are removed, so generations are gibberish. It exists purely to let infra code load, shard, attach LoRA to, and run a forward/backward pass over every structurally distinct layer of GLM-5.1 at a fraction of the size (~80 GB bf16 vs ~1.45 TB).
What was kept (verbatim bf16 weights from the base model)
| Trimmed index | Source layer | Type | Why kept |
|---|---|---|---|
| 0, 1, 2 | 0, 1, 2 | Dense MLP | first_k_dense_replace=3: the only dense layers; unique to the start |
| 3, 4, 5 | 3, 4, 5 | MoE (256 routed + 1 shared) | representative of the homogeneous MoE block (orig layers 3–77) |
| 6 | 78 | MTP / nextn | the multi-token-prediction layer (enorm/hnorm/eh_proj/shared_head) |
| - | - | embed_tokens, final norm, lm_head | top-level, always required |
Every layer carries the same MLA attention + DSA sparse-attention indexer
(q_a_proj/q_b_proj/kv_a_proj_with_mqa/kv_b_proj/o_proj and
indexer.wq_b/indexer.wk/indexer.weights_proj), so attention and the indexer
are covered by any kept layer. The dense vs MoE distinction is the only MLP
difference; MoE layers 3–77 are structurally identical (moe_layer_freq=1), so
three samples fully represent them.
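The table above can be reproduced from the config values alone. The following is a minimal sketch, assuming the usual convention that layers below first_k_dense_replace are dense, the remaining hidden layers are MoE (moe_layer_freq=1), and the MTP layer sits at index num_hidden_layers. The function name `layer_kind` is hypothetical and not part of any library.

```python
def layer_kind(index: int,
               num_hidden_layers: int = 6,
               first_k_dense_replace: int = 3,
               num_nextn_predict_layers: int = 1) -> str:
    """Classify a layer index as 'dense', 'moe', or 'mtp' (hypothetical helper)."""
    total = num_hidden_layers + num_nextn_predict_layers
    if index < 0 or index >= total:
        raise ValueError(f"layer index {index} out of range 0..{total - 1}")
    if index >= num_hidden_layers:
        return "mtp"    # nextn layer(s) follow the regular hidden layers
    if index < first_k_dense_replace:
        return "dense"  # the first k layers use a dense MLP
    return "moe"        # with moe_layer_freq=1, every remaining layer is MoE


# Trimmed model: indices 0..6
trimmed = [layer_kind(i) for i in range(7)]
print(trimmed)
# ['dense', 'dense', 'dense', 'moe', 'moe', 'moe', 'mtp']
```

Calling the same helper with `num_hidden_layers=78` gives the layout of the base model, where index 78 is the MTP layer.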
What was removed
- Original layers 6–77 (72 MoE layers): homogeneous duplicates of 3–5.
- Nothing else: the tokenizer, chat template, generation config, and all non-layer weights are unchanged.
What changed in config.json
Only num_hidden_layers: 78 → 6. Everything else (first_k_dense_replace=3,
num_nextn_predict_layers=1, expert counts, all dims) is identical to the base,
so the per-layer architecture is bit-for-bit the real GLM-5.1. The MTP layer is
renumbered from index 78 to index 6 (= num_hidden_layers), matching how the
nextn layer is addressed.
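One way to confirm that only num_hidden_layers differs is to diff the two configs as plain dictionaries. The sketch below uses small stand-in dicts holding only the values quoted in this card; in practice the two dicts would come from `json.load` on the base and trimmed config.json files. The helper name `config_diff` is hypothetical.

```python
def config_diff(base: dict, trimmed: dict) -> dict:
    """Return {key: (base_value, trimmed_value)} for every key that differs."""
    keys = base.keys() | trimmed.keys()
    return {
        k: (base.get(k), trimmed.get(k))
        for k in sorted(keys)
        if base.get(k) != trimmed.get(k)
    }


# Stand-in configs with only the fields mentioned in this card.
base_cfg = {
    "num_hidden_layers": 78,
    "first_k_dense_replace": 3,
    "num_nextn_predict_layers": 1,
}
trimmed_cfg = dict(base_cfg, num_hidden_layers=6)

diff = config_diff(base_cfg, trimmed_cfg)
print(diff)  # {'num_hidden_layers': (78, 6)}

# The MTP layer is addressed at index num_hidden_layers.
mtp_index = trimmed_cfg["num_hidden_layers"]
print(mtp_index)  # 6
```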
Coverage checklist (all distinct layer types present at least once)
- Dense MLP layer (0–2)
- MoE layer: routed + shared experts, gate (3–5)
- MLA attention + DSA indexer (every layer)
- MTP / nextn layer (6)
- embed_tokens / final norm / lm_head
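Part of this checklist can be checked mechanically against a list of tensor names. The sketch below looks only for the attention, indexer, and MTP submodule names quoted earlier in this card, plus embed_tokens and lm_head. It assumes tensor names are prefixed with `model.layers.<i>.`, which is an assumption about the checkpoint layout, and it does not check the final norm or distinguish dense from MoE MLPs. The helper name `missing_parts` and the example names are hypothetical.

```python
ATTN_PARTS = (
    "q_a_proj", "q_b_proj", "kv_a_proj_with_mqa", "kv_b_proj", "o_proj",
    "indexer.wq_b", "indexer.wk", "indexer.weights_proj",
)
MTP_PARTS = ("enorm", "hnorm", "eh_proj", "shared_head")
TOP_LEVEL = ("embed_tokens", "lm_head")


def missing_parts(names, num_layers=7, mtp_index=6):
    """Return (layer, part) pairs that are expected but absent from names."""
    missing = []
    for i in range(num_layers):
        prefix = f"model.layers.{i}."  # assumed naming convention
        layer_names = [n for n in names if n.startswith(prefix)]
        for part in ATTN_PARTS:
            if not any(part in n for n in layer_names):
                missing.append((i, part))
    mtp_prefix = f"model.layers.{mtp_index}."
    mtp_names = [n for n in names if n.startswith(mtp_prefix)]
    for part in MTP_PARTS:
        if not any(part in n for n in mtp_names):
            missing.append((mtp_index, part))
    for part in TOP_LEVEL:
        if not any(part in n for n in names):
            missing.append((None, part))
    return missing


# Synthetic name list for illustration only.
example_names = [
    f"model.layers.{i}.self_attn.{p}.weight"
    for i in range(7) for p in ATTN_PARTS
]
example_names += [f"model.layers.6.{p}.weight" for p in MTP_PARTS]
example_names += ["model.embed_tokens.weight", "lm_head.weight"]

print(missing_parts(example_names))  # []
```

An empty list means every part named above was found at least once in the expected place.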
Provenance
Produced by selecting the relevant shards of zai-org/GLM-5.1, copying the kept
tensors verbatim (bf16), renumbering only the MTP layer, and rewriting the
safetensors index + num_hidden_layers. Verified to load and run a forward pass
in SGLang (main).
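The renumbering step can be sketched as a rename over the `weight_map` of the safetensors index. This is an illustration rather than the script actually used: it assumes layer tensors are named `model.layers.<i>.<rest>`, the shard filenames are made up, and it leaves each tensor pointing at its original shard name. The helpers `remap_name` and `trim_weight_map` are hypothetical.

```python
import re

LAYER_RE = re.compile(r"^(model\.layers\.)(\d+)(\..+)$")  # assumed naming


def remap_name(name, keep=range(6), mtp_src=78, mtp_dst=6):
    """Return the new tensor name, or None if the tensor is dropped."""
    m = LAYER_RE.match(name)
    if m is None:
        return name  # non-layer tensor (embeddings, final norm, lm_head)
    prefix, idx, rest = m.group(1), int(m.group(2)), m.group(3)
    if idx == mtp_src:
        return f"{prefix}{mtp_dst}{rest}"  # renumber the MTP layer
    if idx in keep:
        return name  # kept layers 0..5 retain their index
    return None  # layers 6..77 are removed


def trim_weight_map(weight_map):
    """Apply remap_name to every entry, dropping removed tensors."""
    out = {}
    for name, shard in weight_map.items():
        new_name = remap_name(name)
        if new_name is not None:
            out[new_name] = shard
    return out


# Hypothetical index fragment for illustration.
weight_map = {
    "model.embed_tokens.weight": "shard-a.safetensors",
    "model.layers.0.self_attn.q_a_proj.weight": "shard-a.safetensors",
    "model.layers.42.mlp.gate.weight": "shard-b.safetensors",
    "model.layers.78.eh_proj.weight": "shard-c.safetensors",
    "lm_head.weight": "shard-c.safetensors",
}
trimmed_map = trim_weight_map(weight_map)
print(sorted(trimmed_map))
```

In the example, the layer-42 entry is dropped and the layer-78 entry reappears under index 6; everything else passes through unchanged.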