---
license: apache-2.0
tags:
- moe
- glm
- prime-rl
- testing
---

# Mini GLM-4 MoE (0.5B)

A small [GLM-4 MoE](https://huggingface.co/THUDM/GLM-4-100B-A10B) model (543M parameters) for testing and development. It uses the same `Glm4MoeForCausalLM` architecture as the full GLM-4-100B-A10B, but with reduced dimensions.

This model is designed for testing MoE training pipelines in [prime-rl](https://github.com/PrimeIntellect-ai/prime-rl) without needing large pretrained checkpoints. It is small enough to run on a single GPU while exercising the same code paths as production models.

## Architecture

| Parameter | Value |
|---|---|
| Parameters | 543M |
| Hidden size | 1024 |
| Layers | 24 |
| Attention heads | 16 (4 KV heads) |
| Routed experts | 8 |
| Experts per token | 4 |
| Shared experts | 1 |
| MoE intermediate size | 256 |
| Dense intermediate size | 2048 |
| Dense layers (first-k) | 1 |
| Vocab size | 151,552 |
| Partial rotary factor | 0.5 |
| Model type | `glm4_moe` |

The architecture mirrors [THUDM/GLM-4-100B-A10B](https://huggingface.co/THUDM/GLM-4-100B-A10B): the first layer is a dense MLP, and all subsequent layers use Mixture-of-Experts with a shared expert. Attention uses Grouped Query Attention (GQA) with partial rotary embeddings.

## How this model was created

**Step 1: Random initialization.** A `Glm4MoeConfig` was instantiated with the small dimensions above, and the HuggingFace `Glm4MoeForCausalLM` model was initialized with random weights. The tokenizer was copied from [THUDM/GLM-4-9B-0414](https://huggingface.co/THUDM/GLM-4-9B-0414).
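
For illustration, a minimal sketch of that initialization, assuming a `transformers` version that ships the `glm4_moe` architecture (the field names below follow the released GLM-4 MoE configs and should be checked against your installed version):

```python
from transformers import Glm4MoeConfig, Glm4MoeForCausalLM

# Dimensions from the architecture table above.
config = Glm4MoeConfig(
    vocab_size=151_552,
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
    num_key_value_heads=4,       # GQA
    intermediate_size=2048,      # dense MLP layers
    moe_intermediate_size=256,   # per-expert MLP
    n_routed_experts=8,
    num_experts_per_tok=4,
    n_shared_experts=1,
    first_k_dense_replace=1,     # only the first layer is dense
    partial_rotary_factor=0.5,
)

model = Glm4MoeForCausalLM(config)  # random weights
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M params")  # ~543M
```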

**Step 2: Roundtrip verification.** Before training, we verified that the HuggingFace and [prime-rl](https://github.com/PrimeIntellect-ai/prime-rl) custom implementations produce identical outputs on the same weights (max logits diff < 0.01), and that `convert_to_hf` / `convert_to_prime` state dict conversions are lossless.
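
A sketch of what such a check can look like. `load_prime_model` is a hypothetical stand-in for however prime-rl loads its custom implementation; only the comparison logic is meant literally:

```python
import torch
from transformers import AutoModelForCausalLM

hf_model = AutoModelForCausalLM.from_pretrained("./mini-glm-moe").eval()
prime_model = load_prime_model("./mini-glm-moe")  # hypothetical prime-rl loader

input_ids = torch.randint(0, 151_552, (2, 128))
with torch.no_grad():
    hf_logits = hf_model(input_ids).logits
    prime_logits = prime_model(input_ids)

# Same weights, two implementations: outputs should agree up to numerical noise.
max_diff = (hf_logits - prime_logits).abs().max().item()
assert max_diff < 0.01, f"implementations diverge: max logits diff {max_diff}"
```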

**Step 3: SFT warmup.** The model was fine-tuned for 200 steps on [PrimeIntellect/Reverse-Text-SFT](https://huggingface.co/datasets/PrimeIntellect/Reverse-Text-SFT) using prime-rl's custom MoE implementation with the following config:

```toml
max_steps = 200

[model]
impl = "custom"
attn = "sdpa"

[data]
name = "PrimeIntellect/Reverse-Text-SFT"
batch_size = 4
seq_len = 1024

[optim]
lr = 1e-4
```

Loss went from ~12 (random init) to ~2.5 after 200 steps. The model is not intended to be useful for generation -- the SFT warmup gives it a non-trivial learned distribution so that KL divergence and other RL metrics are meaningful during testing.
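
As an illustration of that point (not part of the original pipeline), one way to confirm the warmup produced a non-uniform distribution is to measure the per-token KL between the model's next-token distribution and uniform; it is near zero at random init and clearly positive after SFT:

```python
import math

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("samsja/mini-glm-moe")
model = AutoModelForCausalLM.from_pretrained("samsja/mini-glm-moe").eval()

batch = tok("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits  # [1, seq_len, vocab]

# KL(p || uniform) = sum_i p_i * log(p_i) + log(V), per token.
logp = F.log_softmax(logits.float(), dim=-1)
kl_vs_uniform = (logp.exp() * logp).sum(-1) + math.log(logits.size(-1))
print(kl_vs_uniform.mean().item())
```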

**Step 4: Post-training verification.** After SFT, we re-verified the HF <-> PrimeRL roundtrip on the trained checkpoint to confirm that checkpoint saving (which goes through `convert_to_hf`) produced valid weights.

## Reproduction

The scripts used to create this model live in the [prime-rl](https://github.com/PrimeIntellect-ai/prime-rl) repository under `scripts/mini_moe/`:

```bash
# Step 1: Create random-init model
uv run python scripts/mini_moe/create.py --arch glm4_moe --output-dir ./mini-glm-moe

# Step 2: Verify HF <-> PrimeRL roundtrip
uv run python scripts/mini_moe/verify.py --arch glm4_moe --model-dir ./mini-glm-moe

# Step 3: SFT warmup + verify + push
uv run python scripts/mini_moe/sft_warmup.py --arch glm4_moe --model-dir ./mini-glm-moe --sft-steps 200 --push-to-hub samsja/mini-glm-moe
```

To add a new architecture, add a preset to `scripts/mini_moe/presets.py`.

## Intended use

- Testing MoE training pipelines (SFT, RL) in prime-rl
- Validating state dict conversion between HuggingFace and prime-rl formats
- Integration tests that need a real MoE model but cannot afford large checkpoints (see the sketch below)
- Checking RL metrics (KL divergence, reward signals) on a small scale
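
For the integration-test case, a smoke test can stay this small (standard `transformers` API; the repo id matches the push command above):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def test_forward_pass():
    tok = AutoTokenizer.from_pretrained("samsja/mini-glm-moe")
    model = AutoModelForCausalLM.from_pretrained("samsja/mini-glm-moe")
    batch = tok("reverse this text", return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    assert out.logits.shape[-1] == model.config.vocab_size
    assert torch.isfinite(out.logits).all()
```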

This model is **not** intended for inference or any downstream task.