---
license: apache-2.0
tags:
- moe
- glm
- prime-rl
- testing
---

# Mini GLM-4 MoE (0.5B)

A small [GLM-4 MoE](https://huggingface.co/THUDM/GLM-4-100B-A10B) model (543M parameters) for testing and development. It uses the same `Glm4MoeForCausalLM` architecture as the full GLM-4-100B-A10B but with reduced dimensions.

This model is designed for testing MoE training pipelines in [prime-rl](https://github.com/PrimeIntellect-ai/prime-rl) without needing large pretrained checkpoints. It is small enough to run on a single GPU while exercising the same code paths as production models.

## Architecture

| Parameter | Value |
|---|---|
| Parameters | 543M |
| Hidden size | 1024 |
| Layers | 24 |
| Attention heads | 16 (4 KV heads) |
| Routed experts | 8 |
| Experts per token | 4 |
| Shared experts | 1 |
| MoE intermediate size | 256 |
| Dense intermediate size | 2048 |
| Dense layers (first-k) | 1 |
| Vocab size | 151,552 |
| Partial rotary factor | 0.5 |
| Model type | `glm4_moe` |

The architecture mirrors [THUDM/GLM-4-100B-A10B](https://huggingface.co/THUDM/GLM-4-100B-A10B): the first layer is a dense MLP, and all subsequent layers use Mixture-of-Experts with a shared expert. Attention uses Grouped Query Attention (GQA) with partial rotary embeddings.

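The 543M figure follows from the table. A back-of-the-envelope count (assuming untied input/output embeddings and the usual gate/up/down SwiGLU MLP layout, ignoring the small norm parameters) can be sketched as:

```python
# Rough parameter count for the mini GLM-4 MoE, using the dimensions
# from the table above. Assumes untied embeddings and gate/up/down MLPs;
# norm parameters (~50K) are ignored.
vocab, hidden, layers = 151_552, 1024, 24
heads, kv_heads = 16, 4
head_dim = hidden // heads  # 64

embeddings = 2 * vocab * hidden                        # embed_tokens + lm_head
attn = layers * (2 * hidden * heads * head_dim         # q_proj + o_proj
                 + 2 * hidden * kv_heads * head_dim)   # k_proj + v_proj

def mlp3(d_model, d_ff):
    """Gate, up, and down projections of one SwiGLU-style MLP."""
    return 3 * d_model * d_ff

dense = 1 * mlp3(hidden, 2048)                # the single first-k dense layer
moe = (layers - 1) * (8 * mlp3(hidden, 256)   # routed experts
                      + 1 * mlp3(hidden, 256) # shared expert
                      + hidden * 8)           # router gate

total = embeddings + attn + dense + moe
print(f"{total / 1e6:.0f}M parameters")  # ≈ 543M
```

Note that roughly 310M of the 543M parameters sit in the (untied) embedding and output matrices, which is typical for models this small with a large vocabulary.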
## How this model was created

**Step 1: Random initialization.** A `Glm4MoeConfig` was instantiated with the small dimensions above, and the HuggingFace `Glm4MoeForCausalLM` model was initialized with random weights. The tokenizer was copied from [THUDM/GLM-4-9B-0414](https://huggingface.co/THUDM/GLM-4-9B-0414).

**Step 2: Roundtrip verification.** Before training, we verified that the HuggingFace and [prime-rl](https://github.com/PrimeIntellect-ai/prime-rl) custom implementations produce matching outputs on the same weights (max logits difference < 0.01), and that the `convert_to_hf` / `convert_to_prime` state dict conversions are lossless.

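The core of such a check is an element-wise comparison of logits across implementations. A minimal, framework-free sketch of the idea (the real `verify.py` compares actual forward passes; the logit values below are stand-ins):

```python
# Sketch of a roundtrip check: two implementations loaded from the same
# weights should produce (near-)identical logits. The values here are
# illustrative stand-ins, not real model outputs.

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two logit rows."""
    assert len(a) == len(b)
    return max(abs(x - y) for x, y in zip(a, b))

# Hypothetical logits for one token position from the two implementations.
logits_hf    = [0.120, -1.503, 2.871, 0.004]
logits_prime = [0.121, -1.500, 2.869, 0.004]

diff = max_abs_diff(logits_hf, logits_prime)
assert diff < 0.01, f"implementations diverge: max diff {diff}"
```

Small non-zero differences are expected from floating-point non-determinism across kernels, which is why the check uses a tolerance rather than exact equality.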
**Step 3: SFT warmup.** The model was fine-tuned for 200 steps on [PrimeIntellect/Reverse-Text-SFT](https://huggingface.co/datasets/PrimeIntellect/Reverse-Text-SFT) using prime-rl's custom MoE implementation with the following config:

```toml
max_steps = 200

[model]
impl = "custom"
attn = "sdpa"

[data]
name = "PrimeIntellect/Reverse-Text-SFT"
batch_size = 4
seq_len = 1024

[optim]
lr = 1e-4
```

Loss went from ~12 (random init) to ~2.5 after 200 steps. The model is not intended to be useful for generation; the SFT warmup simply gives it a non-trivial learned distribution so that KL divergence and other RL metrics are meaningful during testing.

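For reference, the KL divergence used in such RL metrics measures how far one token distribution has drifted from another; a minimal sketch over toy distributions (illustrative numbers, not model outputs):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A randomly initialized model is near-uniform over its vocabulary;
# after SFT the distribution is peaked, so KL against the init is
# clearly non-zero -- which is what makes the metric testable.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked  = [0.70, 0.10, 0.10, 0.10]

print(kl_divergence(peaked, uniform))   # > 0: distributions differ
print(kl_divergence(uniform, uniform))  # 0.0: identical distributions
```

If the model were left at random init, policy and reference would coincide and KL-based tests would trivially see ~0, hiding bugs; the warmup avoids exactly that.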
**Step 4: Post-training verification.** After SFT, we re-verified the HF <-> PrimeRL roundtrip on the trained checkpoint to confirm that checkpoint saving (which goes through `convert_to_hf`) produced valid weights.

## Reproduction

The scripts used to create this model live in the [prime-rl](https://github.com/PrimeIntellect-ai/prime-rl) repository under `scripts/mini_moe/`:

```bash
# Step 1: Create random-init model
uv run python scripts/mini_moe/create.py --arch glm4_moe --output-dir ./mini-glm-moe

# Step 2: Verify HF <-> PrimeRL roundtrip
uv run python scripts/mini_moe/verify.py --arch glm4_moe --model-dir ./mini-glm-moe

# Step 3: SFT warmup + verify + push
uv run python scripts/mini_moe/sft_warmup.py --arch glm4_moe --model-dir ./mini-glm-moe --sft-steps 200 --push-to-hub samsja/mini-glm-moe
```

To add a new architecture, add a preset to `scripts/mini_moe/presets.py`.

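The actual preset schema is defined in `scripts/mini_moe/presets.py`; purely as an illustration (the field names below are hypothetical, not the real schema), a preset would bundle the architecture dimensions from the table above under one key:

```python
# Hypothetical preset entry for illustration only; the real schema
# lives in scripts/mini_moe/presets.py and may differ.
MINI_GLM4_MOE = {
    "model_type": "glm4_moe",
    "hidden_size": 1024,
    "num_hidden_layers": 24,
    "num_attention_heads": 16,
    "num_key_value_heads": 4,
    "n_routed_experts": 8,
    "num_experts_per_tok": 4,
    "n_shared_experts": 1,
    "moe_intermediate_size": 256,
    "intermediate_size": 2048,
    "first_k_dense_replace": 1,
    "vocab_size": 151_552,
    "partial_rotary_factor": 0.5,
}
```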
## Intended use

- Testing MoE training pipelines (SFT, RL) in prime-rl
- Validating state dict conversion between HuggingFace and prime-rl formats
- Integration tests that need a real MoE model but cannot afford large checkpoints
- Checking RL metrics (KL divergence, reward signals) on a small scale

This model is **not** intended for inference or any downstream task.