PakNin committed on
Commit 16059ba · verified · 1 Parent(s): fa6df28

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ logs/loss_compare.png filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,207 @@
+ ---
+ base_model: microsoft/Phi-mini-MoE-instruct
+ library_name: peft
+ pipeline_tag: text-generation
+ tags:
+ - base_model:adapter:microsoft/Phi-mini-MoE-instruct
+ - lora
+ - transformers
+ ---
+
+ # Model Card for Model ID
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+
+
+ - **Developed by:** [More Information Needed]
+ - **Funded by [optional]:** [More Information Needed]
+ - **Shared by [optional]:** [More Information Needed]
+ - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** [More Information Needed]
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** [More Information Needed]
+
+ ### Model Sources [optional]
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [More Information Needed]
+ - **Paper [optional]:** [More Information Needed]
+ - **Demo [optional]:** [More Information Needed]
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Downstream Use [optional]
+
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+
+ [More Information Needed]
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ [More Information Needed]
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+ ### Recommendations
+
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ [More Information Needed]
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+ #### Speeds, Sizes, Times [optional]
+
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+ [More Information Needed]
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Factors
+
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ #### Summary
+
+
+
+ ## Model Examination [optional]
+
+ <!-- Relevant interpretability work for the model goes here -->
+
+ [More Information Needed]
+
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** [More Information Needed]
+ - **Hours used:** [More Information Needed]
+ - **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** [More Information Needed]
+ - **Carbon Emitted:** [More Information Needed]
+
+ ## Technical Specifications [optional]
+
+ ### Model Architecture and Objective
+
+ [More Information Needed]
+
+ ### Compute Infrastructure
+
+ [More Information Needed]
+
+ #### Hardware
+
+ [More Information Needed]
+
+ #### Software
+
+ [More Information Needed]
+
+ ## Citation [optional]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Glossary [optional]
+
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ [More Information Needed]
+ ### Framework versions
+
+ - PEFT 0.18.1
adapter_config.json ADDED
@@ -0,0 +1,46 @@
+ {
+ "alora_invocation_tokens": null,
+ "alpha_pattern": {},
+ "arrow_config": null,
+ "auto_mapping": null,
+ "base_model_name_or_path": "microsoft/Phi-mini-MoE-instruct",
+ "bias": "none",
+ "corda_config": null,
+ "ensure_weight_tying": false,
+ "eva_config": null,
+ "exclude_modules": null,
+ "fan_in_fan_out": false,
+ "inference_mode": true,
+ "init_lora_weights": true,
+ "layer_replication": null,
+ "layers_pattern": null,
+ "layers_to_transform": null,
+ "loftq_config": {},
+ "lora_alpha": 32,
+ "lora_bias": false,
+ "lora_dropout": 0.0,
+ "megatron_config": null,
+ "megatron_core": "megatron.core",
+ "modules_to_save": null,
+ "peft_type": "LORA",
+ "peft_version": "0.18.1",
+ "qalora_group_size": 16,
+ "r": 16,
+ "rank_pattern": {},
+ "revision": null,
+ "target_modules": [
+ "q_proj",
+ "w3",
+ "o_proj",
+ "w2",
+ "w1",
+ "v_proj",
+ "k_proj"
+ ],
+ "target_parameters": null,
+ "task_type": "CAUSAL_LM",
+ "trainable_token_indices": null,
+ "use_dora": false,
+ "use_qalora": false,
+ "use_rslora": false
+ }
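As a reading aid for the adapter settings above, here is a minimal sketch of what `r`, `lora_alpha`, and `use_rslora` imply for the LoRA update scale. It assumes the usual convention of `lora_alpha / r` for standard LoRA and `lora_alpha / sqrt(r)` for rank-stabilized LoRA. The helper names and the 4096x4096 layer shape are hypothetical and chosen only for illustration; they are not taken from this repository.

```python
import json
import math

# Subset of the adapter_config.json fields shown above.
ADAPTER_CONFIG = json.loads(
    '{"lora_alpha": 32, "r": 16, "use_rslora": false, "lora_dropout": 0.0}'
)


def lora_scaling(cfg):
    """Scale applied to the low-rank update B @ A (hypothetical helper)."""
    r = cfg["r"]
    if cfg.get("use_rslora", False):
        # Rank-stabilized LoRA divides by sqrt(r) instead of r.
        return cfg["lora_alpha"] / math.sqrt(r)
    return cfg["lora_alpha"] / r


def lora_param_count(in_features, out_features, r):
    """Trainable parameters added to one linear layer: A is r x in, B is out x r."""
    return r * (in_features + out_features)


scale = lora_scaling(ADAPTER_CONFIG)
# Illustrative shape only; the real projection sizes are not listed in this diff.
example_params = lora_param_count(4096, 4096, ADAPTER_CONFIG["r"])
print(f"scaling={scale}, params for a 4096x4096 layer={example_params}")
```

With `r = 16` and `lora_alpha = 32`, the update is scaled by 2.0 under this convention.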
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:60d95b10b6e140a9626a7058d5038528f2ff80148dc4569b881db56052046509
+ size 40
chat_template.jinja ADDED
@@ -0,0 +1,4 @@
+ {% for message in messages %}{{'<|' + message['role'] + '|>' + '
+ ' + message['content'] + '<|end|>
+ ' }}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>
+ ' }}{% else %}{{ eos_token }}{% endif %}
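The template above can be hard to read because its newlines sit inside string literals. Here is a plain-Python sketch of the same formatting logic, written without Jinja so it is self-contained. The function name `render_chat` and the default `eos_token` value are assumptions made for illustration; the real end-of-sequence token comes from the tokenizer configuration, which this diff does not show.

```python
def render_chat(messages, add_generation_prompt=False, eos_token="<|endoftext|>"):
    """Mirror the chat template: one '<|role|>\\ncontent<|end|>\\n' block per message."""
    parts = []
    for message in messages:
        parts.append("<|" + message["role"] + "|>\n" + message["content"] + "<|end|>\n")
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        parts.append("<|assistant|>\n")
    else:
        # Otherwise the template closes the conversation with the EOS token.
        parts.append(eos_token)
    return "".join(parts)


prompt = render_chat(
    [{"role": "user", "content": "What is the capital of France?"}],
    add_generation_prompt=True,
)
print(prompt)
```

Passing `add_generation_prompt=True` is the form used for inference; the other branch appends the EOS token and suits complete training examples.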
logs/aux_loss_compare.png ADDED
logs/aux_loss_curve.png ADDED
logs/loss_compare.png ADDED

Git LFS Details

  • SHA256: 5cea9a270f52b8c49a5e00c04c1caddc3f14767989dc6df7182cf2aefcf99410
  • Pointer size: 131 Bytes
  • Size of remote file: 108 kB
logs/loss_curve.png ADDED
logs/rexmoe_training_0304_033137 copy.log ADDED
@@ -0,0 +1,467 @@
+ 2026-04-03 03:31:37 - ReXMoE - INFO - ================================================================================
+ 2026-04-03 03:31:37 - ReXMoE - INFO - ReXMoE Training Log - 0304_033137
+ 2026-04-03 03:31:37 - ReXMoE - INFO - Log file: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/logs/rexmoe_training_0304_033137.log
+ 2026-04-03 03:31:37 - ReXMoE - INFO - ================================================================================
+ 2026-04-03 03:31:37 - ReXMoE - INFO - ================================================================================
+ 2026-04-03 03:31:37 - ReXMoE - INFO - ReXMoE Cross-Layer Expert Reuse Training
+ 2026-04-03 03:31:37 - ReXMoE - INFO - ================================================================================
+ 2026-04-03 03:31:37 - ReXMoE - INFO - MET enabled: False
+ 2026-04-03 03:31:37 - ReXMoE - INFO -
+ Configuration:
+ Model: microsoft/Phi-mini-MoE-instruct
+ Dataset: ../dataset/alpaca_data_cleaned.json
+ Dataset mode: IF_2
+ Reuse Scale (R): 3
+ Prune Ratio (MET): N/A
+ Epochs: 1
+ Num of samples: 20000
+ Batch Size: 4
+ Sequence Length: 1024
+ Learning Rate: 2e-05
+ PSR Enabled: True
+ LR Scheduler: True
+ Save Path: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3
+ Gradient Checkpointing: False
+ LoRA Rank: 16 (Full LoRA: True)
+ LoRA Alpha: 32
+ MET Enabled: False (Mask Ratio: 0.1, Warmup: 0.5)
+ Log File: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/logs/rexmoe_training_0304_033137.log
+ Aux loss weight: 0.05
+
+ 2026-04-03 03:31:37 - ReXMoE - INFO - 💻 Using device: cuda)
+ 2026-04-03 03:31:37 - ReXMoE - INFO - GPU: NVIDIA RTX A6000, Memory: 47.53 GB
+ 2026-04-03 03:31:43 - ReXMoE - INFO - [5/7] Setting up optimizer and dataset...
+ 2026-04-03 03:31:43 - ReXMoE - INFO - Using 8-bit AdamW optimizer
+ 2026-04-03 03:31:43 - ReXMoE - INFO - LR Scheduler: CosineAnnealingLR (2e-05 → 2.0000000000000003e-06)
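The scheduler line above gives only the endpoints of the learning-rate decay. As a rough illustration, the sketch below evaluates the closed-form cosine annealing curve between those endpoints. It assumes one optimizer step per batch and an annealing period equal to the total step count (20000 samples / batch size 4 = 5000, which matches the `[n/5000]` counters in the log). The actual `T_max` passed to `CosineAnnealingLR` is not shown in the log, so treat the period as an assumption.

```python
import math

# Values taken from the logged configuration.
NUM_SAMPLES = 20_000
BATCH_SIZE = 4
LR_MAX = 2e-05
LR_MIN = 2e-06  # the log prints this as 2.0000000000000003e-06

# Assumption: one optimizer step per batch, annealing period = total steps.
TOTAL_STEPS = NUM_SAMPLES // BATCH_SIZE


def cosine_lr(step, total_steps=TOTAL_STEPS, lr_max=LR_MAX, lr_min=LR_MIN):
    """Closed-form cosine annealing from lr_max at step 0 to lr_min at the last step."""
    cosine = math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)


for step in (0, 500, 2500, 5000):
    print(f"step {step:>4}: lr = {cosine_lr(step):.3e}")
```

Under these assumptions the learning rate is still close to its maximum at step 500 and sits at the midpoint of the range at step 2500.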
+ 2026-04-03 03:31:51 - ReXMoE - INFO -
+ First batch statistics:
+ 2026-04-03 03:31:51 - ReXMoE - INFO - LM Loss: 1.0094
+ 2026-04-03 03:31:51 - ReXMoE - INFO - Aux Loss: 0.092773
+ 2026-04-03 03:31:51 - ReXMoE - INFO - Total Loss: 1.1022
+ 2026-04-03 03:31:51 - ReXMoE - INFO - Current R: 2
+ 2026-04-03 03:31:51 - ReXMoE - INFO - Active experts per layer: 32
+ 2026-04-03 03:31:51 - ReXMoE - INFO - Gradient norm: 1.0000
+ 2026-04-03 03:31:51 - ReXMoE - INFO -
+
+ 2026-04-03 03:35:09 - ReXMoE - INFO - [50/5000] loss=1.1939 aux=0.062988 R=2
+ 2026-04-03 03:38:21 - ReXMoE - INFO - [100/5000] loss=1.1803 aux=0.040039 R=2
+ 2026-04-03 03:41:36 - ReXMoE - INFO - [150/5000] loss=1.2968 aux=0.036621 R=2
+ 2026-04-03 03:44:50 - ReXMoE - INFO - [200/5000] loss=1.2447 aux=0.028198 R=2
+ 2026-04-03 03:48:01 - ReXMoE - INFO - [250/5000] loss=1.1971 aux=0.034180 R=2
+ 2026-04-03 03:51:10 - ReXMoE - INFO - [300/5000] loss=2.1766 aux=0.024658 R=2
+ 2026-04-03 03:54:19 - ReXMoE - INFO - [350/5000] loss=1.1092 aux=0.017578 R=2
+ 2026-04-03 03:57:29 - ReXMoE - INFO - [400/5000] loss=0.9343 aux=0.024414 R=2
+ 2026-04-03 04:00:40 - ReXMoE - INFO - [450/5000] loss=1.2180 aux=0.045410 R=2
+ 2026-04-03 04:03:47 - ReXMoE - INFO - Warmup completed at step 500. Enabling FULL QLoRA with r = 16 and alpha = 32 on experts and updating optimizer...
+ 2026-04-03 04:03:51 - ReXMoE - INFO - Trainable params (routers + LoRA): 144179200 (1.8509%)
+ 2026-04-03 04:03:51 - ReXMoE - INFO - Sample trainable params after QLoRA: ['base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.k_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.k_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.o_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.o_proj.lora_B.default.weight', 'base_model.model.model.layers.0.block_sparse_moe.gate.weight', 'base_model.model.model.layers.0.block_sparse_moe.experts.0.w1.lora_A.default.weight']
+ 2026-04-03 04:03:58 - ReXMoE - INFO - [500/5000] loss=1.0733 aux=0.036621 R=2
+ 2026-04-03 04:09:15 - ReXMoE - INFO - [550/5000] loss=0.6253 aux=0.014709 R=2
+ 2026-04-03 04:14:28 - ReXMoE - INFO - [600/5000] loss=1.5688 aux=0.010986 R=2
+ 2026-04-03 04:19:39 - ReXMoE - INFO - [650/5000] loss=0.7864 aux=0.016357 R=2
+ 2026-04-03 04:24:52 - ReXMoE - INFO - [700/5000] loss=1.5303 aux=0.010681 R=2
+ 2026-04-03 04:30:02 - ReXMoE - INFO - [750/5000] loss=1.0098 aux=0.007812 R=2
+ 2026-04-03 04:35:13 - ReXMoE - INFO - [800/5000] loss=1.0523 aux=0.014282 R=2
+ 2026-04-03 04:40:24 - ReXMoE - INFO - [850/5000] loss=0.6447 aux=0.009094 R=2
+ 2026-04-03 04:45:37 - ReXMoE - INFO - [900/5000] loss=0.7665 aux=0.004822 R=2
+ 2026-04-03 04:50:50 - ReXMoE - INFO - [950/5000] loss=0.7762 aux=0.005737 R=2
+ 2026-04-03 04:56:03 - ReXMoE - INFO - [1000/5000] loss=1.0254 aux=0.003571 R=2
+ 2026-04-03 05:01:16 - ReXMoE - INFO - [1050/5000] loss=1.1320 aux=0.005737 R=2
+ 2026-04-03 05:06:28 - ReXMoE - INFO - [1100/5000] loss=0.7519 aux=0.004974 R=2
+ 2026-04-03 05:11:40 - ReXMoE - INFO - [1150/5000] loss=0.8246 aux=0.003204 R=2
+ 2026-04-03 05:16:55 - ReXMoE - INFO - [1200/5000] loss=1.0041 aux=0.006042 R=2
+ 2026-04-03 05:22:09 - ReXMoE - INFO - [1250/5000] loss=0.6804 aux=0.005859 R=2
+ 2026-04-03 05:27:21 - ReXMoE - INFO - [1300/5000] loss=0.9695 aux=0.011108 R=2
+ 2026-04-03 05:32:33 - ReXMoE - INFO - [1350/5000] loss=1.0448 aux=0.012634 R=2
+ 2026-04-03 05:37:45 - ReXMoE - INFO - [1400/5000] loss=0.7468 aux=0.002136 R=2
+ 2026-04-03 05:42:58 - ReXMoE - INFO - [1450/5000] loss=1.6307 aux=0.003510 R=2
+ 2026-04-03 05:48:10 - ReXMoE - INFO - [1500/5000] loss=1.1833 aux=0.002625 R=2
+ 2026-04-03 05:53:21 - ReXMoE - INFO - [1550/5000] loss=0.9216 aux=0.002991 R=2
+ 2026-04-03 05:58:33 - ReXMoE - INFO - [1600/5000] loss=0.5969 aux=0.003708 R=2
+ 2026-04-03 06:03:46 - ReXMoE - INFO - [1650/5000] loss=0.5240 aux=0.002518 R=2
+ 2026-04-03 06:08:58 - ReXMoE - INFO - [1700/5000] loss=0.7681 aux=0.001785 R=2
+ 2026-04-03 06:14:09 - ReXMoE - INFO - [1750/5000] loss=1.0812 aux=0.002899 R=2
+ 2026-04-03 06:19:21 - ReXMoE - INFO - [1800/5000] loss=0.8171 aux=0.010986 R=2
+ 2026-04-03 06:24:34 - ReXMoE - INFO - [1850/5000] loss=0.9029 aux=0.005371 R=2
+ 2026-04-03 06:29:46 - ReXMoE - INFO - [1900/5000] loss=1.0440 aux=0.001839 R=2
+ 2026-04-03 06:35:00 - ReXMoE - INFO - [1950/5000] loss=1.2026 aux=0.005096 R=2
+ 2026-04-03 06:40:13 - ReXMoE - INFO - [2000/5000] loss=0.7174 aux=0.003372 R=2
+ 2026-04-03 06:45:25 - ReXMoE - INFO - [2050/5000] loss=1.5737 aux=0.003571 R=2
+ 2026-04-03 06:50:37 - ReXMoE - INFO - [2100/5000] loss=0.8508 aux=0.003403 R=2
+ 2026-04-03 06:55:51 - ReXMoE - INFO - [2150/5000] loss=0.7965 aux=0.001656 R=2
+ 2026-04-03 07:01:02 - ReXMoE - INFO - [2200/5000] loss=1.3079 aux=0.002747 R=2
+ 2026-04-03 07:06:14 - ReXMoE - INFO - [2250/5000] loss=0.9750 aux=0.002228 R=2
+ 2026-04-03 07:11:28 - ReXMoE - INFO - [2300/5000] loss=0.9549 aux=0.002228 R=2
+ 2026-04-03 07:16:40 - ReXMoE - INFO - [2350/5000] loss=1.2216 aux=0.004089 R=2
+ 2026-04-03 07:21:53 - ReXMoE - INFO - [2400/5000] loss=0.9801 aux=0.002289 R=2
+ 2026-04-03 07:27:07 - ReXMoE - INFO - [2450/5000] loss=1.6587 aux=0.001602 R=2
+ 2026-04-03 07:32:23 - ReXMoE - INFO - [2500/5000] loss=1.7420 aux=0.014648 R=3
+ 2026-04-03 07:39:14 - ReXMoE - INFO - [2550/5000] loss=1.0498 aux=0.001801 R=3
+ 2026-04-03 07:46:08 - ReXMoE - INFO - [2600/5000] loss=0.7848 aux=0.002792 R=3
+ 2026-04-03 07:53:01 - ReXMoE - INFO - [2650/5000] loss=0.6119 aux=0.000992 R=3
+ 2026-04-03 07:59:55 - ReXMoE - INFO - [2700/5000] loss=1.0871 aux=0.002014 R=3
+ 2026-04-03 08:06:48 - ReXMoE - INFO - [2750/5000] loss=1.0422 aux=0.001411 R=3
+ 2026-04-03 08:13:45 - ReXMoE - INFO - [2800/5000] loss=1.0147 aux=0.002762 R=3
+ 2026-04-03 08:20:37 - ReXMoE - INFO - [2850/5000] loss=0.6756 aux=0.001953 R=3
+ 2026-04-03 08:27:28 - ReXMoE - INFO - [2900/5000] loss=0.6243 aux=0.001671 R=3
+ 2026-04-03 08:34:22 - ReXMoE - INFO - [2950/5000] loss=0.8838 aux=0.004974 R=3
+ 2026-04-03 08:41:13 - ReXMoE - INFO - [3000/5000] loss=0.7627 aux=0.002060 R=3
+ 2026-04-03 08:48:05 - ReXMoE - INFO - [3050/5000] loss=0.8120 aux=0.000668 R=3
+ 2026-04-03 08:54:56 - ReXMoE - INFO - [3100/5000] loss=0.9701 aux=0.002121 R=3
+ 2026-04-03 09:01:47 - ReXMoE - INFO - [3150/5000] loss=0.8151 aux=0.001289 R=3
+ 2026-04-03 09:08:39 - ReXMoE - INFO - [3200/5000] loss=0.6943 aux=0.002777 R=3
+ 2026-04-03 09:15:30 - ReXMoE - INFO - [3250/5000] loss=0.9401 aux=0.002350 R=3
+ 2026-04-03 09:22:20 - ReXMoE - INFO - [3300/5000] loss=0.7034 aux=0.007935 R=3
+ 2026-04-03 09:29:11 - ReXMoE - INFO - [3350/5000] loss=1.1980 aux=0.003006 R=3
+ 2026-04-03 09:36:04 - ReXMoE - INFO - [3400/5000] loss=0.6413 aux=0.002045 R=3
+ 2026-04-03 09:43:01 - ReXMoE - INFO - [3450/5000] loss=1.1729 aux=0.001686 R=3
+ 2026-04-03 09:49:52 - ReXMoE - INFO - [3500/5000] loss=1.1667 aux=0.002045 R=3
+ 2026-04-03 09:56:42 - ReXMoE - INFO - [3550/5000] loss=0.3543 aux=0.007324 R=3
+ 2026-04-03 10:03:29 - ReXMoE - INFO - [3600/5000] loss=1.0002 aux=0.002792 R=3
+ 2026-04-03 10:10:20 - ReXMoE - INFO - [3650/5000] loss=0.8748 aux=0.001503 R=3
+ 2026-04-03 10:17:12 - ReXMoE - INFO - [3700/5000] loss=0.9026 aux=0.021118 R=3
+ 2026-04-03 10:24:05 - ReXMoE - INFO - [3750/5000] loss=0.3710 aux=0.002182 R=3
+ 2026-04-03 10:30:57 - ReXMoE - INFO - [3800/5000] loss=1.2199 aux=0.001564 R=3
+ 2026-04-03 10:37:48 - ReXMoE - INFO - [3850/5000] loss=0.4812 aux=0.008057 R=3
+ 2026-04-03 10:44:38 - ReXMoE - INFO - [3900/5000] loss=0.9683 aux=0.002487 R=3
+ 2026-04-03 10:51:31 - ReXMoE - INFO - [3950/5000] loss=0.7649 aux=0.001732 R=3
+ 2026-04-03 10:58:23 - ReXMoE - INFO - [4000/5000] loss=0.7234 aux=0.001839 R=3
+ 2026-04-03 11:05:13 - ReXMoE - INFO - [4050/5000] loss=0.7793 aux=0.001289 R=3
+ 2026-04-03 11:12:02 - ReXMoE - INFO - [4100/5000] loss=1.2237 aux=0.001968 R=3
+ 2026-04-03 11:18:51 - ReXMoE - INFO - [4150/5000] loss=1.0040 aux=0.002701 R=3
+ 2026-04-03 11:25:38 - ReXMoE - INFO - [4200/5000] loss=0.4700 aux=0.001945 R=3
+ 2026-04-03 11:32:25 - ReXMoE - INFO - [4250/5000] loss=0.6833 aux=0.004486 R=3
+ 2026-04-03 11:39:11 - ReXMoE - INFO - [4300/5000] loss=0.8191 aux=0.003754 R=3
+ 2026-04-03 11:45:56 - ReXMoE - INFO - [4350/5000] loss=0.3914 aux=0.001312 R=3
+ 2026-04-03 11:52:41 - ReXMoE - INFO - [4400/5000] loss=0.9623 aux=0.001854 R=3
+ 2026-04-03 11:59:28 - ReXMoE - INFO - [4450/5000] loss=0.6550 aux=0.005615 R=3
+ 2026-04-03 12:06:15 - ReXMoE - INFO - [4500/5000] loss=0.9616 aux=0.002777 R=3
+ 2026-04-03 12:13:01 - ReXMoE - INFO - [4550/5000] loss=0.5557 aux=0.008789 R=3
+ 2026-04-03 12:19:46 - ReXMoE - INFO - [4600/5000] loss=0.6275 aux=0.018555 R=3
+ 2026-04-03 12:26:33 - ReXMoE - INFO - [4650/5000] loss=1.2395 aux=0.001549 R=3
+ 2026-04-03 12:33:18 - ReXMoE - INFO - [4700/5000] loss=0.6769 aux=0.002060 R=3
+ 2026-04-03 12:40:05 - ReXMoE - INFO - [4750/5000] loss=1.1499 aux=0.006348 R=3
+ 2026-04-03 12:46:50 - ReXMoE - INFO - [4800/5000] loss=0.7449 aux=0.001022 R=3
+ 2026-04-03 12:53:34 - ReXMoE - INFO - [4850/5000] loss=0.8246 aux=0.001823 R=3
+ 2026-04-03 13:00:22 - ReXMoE - INFO - [4900/5000] loss=0.9550 aux=0.002029 R=3
+ 2026-04-03 13:07:10 - ReXMoE - INFO - [4950/5000] loss=1.2535 aux=0.001610 R=3
+ 2026-04-03 13:13:50 - ReXMoE - INFO -
+ [Step 5000/5000] Running evaluation at eval_steps...
+ 2026-04-03 13:13:50 - ReXMoE - INFO -
+ Evaluating model with 3 sample prompts...
+ 2026-04-03 13:13:52 - ReXMoE - INFO -
+ --- Prompt 1/3 ---
+ 2026-04-03 13:13:52 - ReXMoE - INFO - Instruction: What is the capital of France?
+ 2026-04-03 13:13:52 - ReXMoE - INFO - Input: None
+ 2026-04-03 13:13:52 - ReXMoE - INFO - Generated completion (len 9): The capital of France is Paris.
+ 2026-04-03 13:14:11 - ReXMoE - INFO -
+ --- Prompt 2/3 ---
+ 2026-04-03 13:14:11 - ReXMoE - INFO - Instruction: High-pressure systems stop air from rising into the colder regions of the atmosphere where water can condense. What will most likely result if a high-pressure system remains in an area for a long period of time?
+ A. fog
+ B. rain
+ C. drought
+ D. tornado
+ Answer:
+ 2026-04-03 13:14:11 - ReXMoE - INFO - Input: None
+ 2026-04-03 13:14:11 - ReXMoE - INFO - Generated completion (len 77): A. fog
+
+ High-pressure systems often lead to fog formation because they can hold moisture and prevent it from evaporating. This can occur when the high-pressure system remains in an area for a long period of time. Fog forms when moist air cools and condenses into water droplets near the surface of the Earth.
+ 2026-04-03 13:14:13 - ReXMoE - INFO -
+ --- Prompt 3/3 ---
+ 2026-04-03 13:14:13 - ReXMoE - INFO - Instruction: Given the fact: predators eat prey
+ Question: Predators eat
+ A. lions
+ B. humans
+ C. bunnies
+ D. grass
+ Answer:
+ 2026-04-03 13:14:13 - ReXMoE - INFO - Input: None
+ 2026-04-03 13:14:13 - ReXMoE - INFO - Generated completion (len 7): C. bunnies
+ 2026-04-03 13:14:13 - ReXMoE - INFO - Evaluation of all 3 prompts complete.
+ 2026-04-03 13:14:13 - ReXMoE - INFO -
+ [Step 5000] Analyzing routing patterns at eval_steps...
+ 2026-04-03 13:14:37 - ReXMoE - INFO -
+ Analyzing ACTUAL routing patterns from 10 batches (15,294 tokens)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Current reuse scale: R=3
+ 2026-04-03 13:14:37 - ReXMoE - INFO -
+ [IG-MET Pruning Report]:
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Global: 0/0 UNIQUE experts pruned (0.0%) | threshold=-1.000000
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Cross-Layer Routing Distribution (ACTUAL selections):
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Same layer (i): 781,056 ( 29.8%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Previous layer (i-1): 965,741 ( 36.8%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Next layer (i+1): 815,206 ( 31.1%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Distant layers: 59,437 ( 2.3%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Sample Layer-Specific Routing Patterns:
+ 2026-04-03 13:14:37 - ReXMoE - INFO -
+ Layer 8:
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 7 from layer 9 ( L9): 5,937 times ( 38.8%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 7 from layer 7 ( L7): 5,895 times ( 38.5%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 14 from layer 9 ( L9): 5,822 times ( 38.1%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 14 from layer 7 ( L7): 5,618 times ( 36.7%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 2 from layer 7 ( L7): 4,200 times ( 27.5%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO -
+ Layer 16:
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 15 ( L15): 8,999 times ( 58.8%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 17 ( L17): 7,847 times ( 51.3%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 10 from layer 15 ( L15): 6,002 times ( 39.2%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 10 from layer 17 ( L17): 5,879 times ( 38.4%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 15 from layer 15 ( L15): 4,030 times ( 26.4%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO -
+ Layer 24:
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 23 ( L23): 9,213 times ( 60.2%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 9 from layer 23 ( L23): 7,912 times ( 51.7%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 25 ( L25): 6,819 times ( 44.6%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 9 from layer 25 ( L25): 6,403 times ( 41.9%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 24 (same): 4,350 times ( 28.4%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - ✅ Cross-layer expert reuse detected: 70.2% of routing uses adjacent layers
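The percentages in the routing report can be re-derived from the logged counts. The short sketch below does that arithmetic. It suggests that the 70.2% figure in the summary line corresponds to all selections outside the token's own layer (previous + next + distant), while the two adjacent layers alone account for roughly 67.9%. The dictionary keys are labels chosen here for readability and are not names used by the training code.

```python
# Selection counts copied from the "Cross-Layer Routing Distribution" lines above.
counts = {
    "same": 781_056,      # same layer (i)
    "previous": 965_741,  # previous layer (i-1)
    "next": 815_206,      # next layer (i+1)
    "distant": 59_437,    # layers further away
}

total = sum(counts.values())
shares = {name: 100.0 * value / total for name, value in counts.items()}

# Share routed to the immediately neighbouring layers only.
adjacent = shares["previous"] + shares["next"]
# Share routed to any layer other than the token's own layer.
cross_layer = 100.0 - shares["same"]

for name, share in shares.items():
    print(f"{name:>8}: {share:5.1f}%")
print(f"adjacent (i-1, i+1): {adjacent:.1f}%")
print(f"all cross-layer:     {cross_layer:.1f}%")
```

Read this way, the logged summary is consistent with the counts, but "adjacent layers" in that line seems to include the small share of distant-layer selections.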
+ 2026-04-03 13:14:37 - ReXMoE - INFO -
+ [Step 5000] Saving checkpoint at eval_steps to ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3...
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.0.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.0.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.1.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.1.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.2.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.2.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.3.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.3.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.4.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.4.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.5.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.5.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.6.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.6.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.7.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.7.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.8.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.8.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.9.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.9.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.10.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.10.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.11.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.11.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.12.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.12.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.13.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.13.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.14.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.14.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.15.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.15.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.16.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.16.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.17.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.17.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.18.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.18.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.19.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.19.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.20.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
260
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.20.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
261
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.21.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
262
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.21.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
263
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.22.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
264
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.22.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
265
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.23.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
266
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.23.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
267
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.24.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
268
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.24.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
269
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.25.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
270
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.25.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
271
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.26.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
272
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.26.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
273
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.27.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
274
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.27.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
275
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.28.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
276
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.28.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
277
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.29.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
278
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.29.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
279
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.30.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
280
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.30.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
281
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.31.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
282
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.31.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
283
+ 2026-04-03 13:14:37 - ReXMoE - INFO - ✓ Saved trained router weights: 96 parameters
284
+ 2026-04-03 13:14:37 - ReXMoE - INFO - File: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/rexmoe_routers.pt
285
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Size: 12.03 MB
286
+ 2026-04-03 13:14:37 - ReXMoE - INFO -
287
+ Also saving full model with ReXMoE architecture...
288
+ 2026-04-03 13:14:39 - ReXMoE - INFO -
289
+ Merging LoRA adapters into base weights and saving to: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/merged
290
+ 2026-04-03 13:15:00 - ReXMoE - INFO - ✓ Saved merged full model (base+routers+LoRA) for one-step loading
291
+ 2026-04-03 13:15:00 - ReXMoE - INFO -
292
+ ============================================================
293
+ 2026-04-03 13:15:00 - ReXMoE - INFO - Epoch 1 Summary:
294
+ 2026-04-03 13:15:00 - ReXMoE - INFO - Average LM Loss: 0.9498
295
+ 2026-04-03 13:15:00 - ReXMoE - INFO - Average Aux Loss: 0.008843
296
+ 2026-04-03 13:15:00 - ReXMoE - INFO - Average Total Loss: 0.9586
297
+ 2026-04-03 13:15:00 - ReXMoE - INFO - Final R: 3
298
+ 2026-04-03 13:15:00 - ReXMoE - INFO -
299
+ Evaluating model with 3 sample prompts...
300
+ 2026-04-03 13:15:02 - ReXMoE - INFO -
301
+ --- Prompt 1/3 ---
302
+ 2026-04-03 13:15:02 - ReXMoE - INFO - Instruction: What is the capital of France?
303
+ 2026-04-03 13:15:02 - ReXMoE - INFO - Input: None
304
+ 2026-04-03 13:15:02 - ReXMoE - INFO - Generated completion (len 9): The capital of France is Paris.
305
+ 2026-04-03 13:15:04 - ReXMoE - INFO -
306
+ --- Prompt 2/3 ---
307
+ 2026-04-03 13:15:04 - ReXMoE - INFO - Instruction: High-pressure systems stop air from rising into the colder regions of the atmosphere where water can condense. What will most likely result if a high-pressure system remains in an area for a long period of time?
308
+ A. fog
309
+ B. rain
310
+ C. drought
311
+ D. tornado
312
+ Answer:
313
+ 2026-04-03 13:15:04 - ReXMoE - INFO - Input: None
314
+ 2026-04-03 13:15:04 - ReXMoE - INFO - Generated completion (len 5): A. fog
315
+ 2026-04-03 13:15:05 - ReXMoE - INFO -
316
+ --- Prompt 3/3 ---
317
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Instruction: Given the fact: predators eat prey
318
+ Question: Predators eat
319
+ A. lions
320
+ B. humans
321
+ C. bunnies
322
+ D. grass
323
+ Answer:
324
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Input: None
325
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Generated completion (len 7): C. bunnies
326
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Evaluation of all 3 prompts complete.
327
+ 2026-04-03 13:15:05 - ReXMoE - INFO - New best epoch 1 with avg LM loss 0.9498 — saving checkpoint to ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3
328
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.0.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
329
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.0.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
330
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.1.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
331
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.1.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
332
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.2.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
333
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.2.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
334
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.3.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
335
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.3.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
336
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.4.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
337
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.4.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
338
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.5.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
339
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.5.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
340
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.6.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
341
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.6.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
342
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.7.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
343
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.7.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
344
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.8.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
345
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.8.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
346
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.9.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
347
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.9.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
348
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.10.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
349
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.10.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
350
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.11.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
351
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.11.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
352
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.12.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
353
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.12.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
354
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.13.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
355
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.13.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
356
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.14.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
357
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.14.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
358
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.15.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
359
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.15.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
360
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.16.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
361
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.16.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
362
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.17.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
363
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.17.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
364
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.18.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
365
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.18.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
366
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.19.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
367
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.19.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
368
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.20.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
369
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.20.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
370
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.21.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
371
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.21.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
372
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.22.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
373
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.22.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
374
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.23.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
375
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.23.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
376
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.24.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
377
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.24.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
378
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.25.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
379
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.25.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
380
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.26.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
381
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.26.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
382
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.27.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
383
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.27.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
384
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.28.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
385
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.28.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
386
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.29.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
387
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.29.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
388
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.30.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
389
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.30.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
390
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.31.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
391
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.31.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
392
+ 2026-04-03 13:15:06 - ReXMoE - INFO - ✓ Saved trained router weights: 96 parameters
393
+ 2026-04-03 13:15:06 - ReXMoE - INFO - File: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/rexmoe_routers.pt
394
+ 2026-04-03 13:15:06 - ReXMoE - INFO - Size: 12.03 MB
395
+ 2026-04-03 13:15:06 - ReXMoE - INFO -
396
+ Also saving full model with ReXMoE architecture...
397
+ 2026-04-03 13:15:06 - ReXMoE - INFO -
398
+ Merging LoRA adapters into base weights and saving to: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/merged
399
+ 2026-04-03 13:15:44 - ReXMoE - INFO - ✓ Saved merged full model (base+routers+LoRA) for one-step loading
400
+ 2026-04-03 13:15:44 - ReXMoE - INFO -
401
+ 📊 Convergence Metrics:
402
+ 2026-04-03 13:15:44 - ReXMoE - INFO - Convergence Metrics:
403
+ 2026-04-03 13:15:44 - ReXMoE - INFO - Avg Router Grad Norm: 0.084278
404
+ 2026-04-03 13:15:44 - ReXMoE - INFO - Current Learning Rate: 2.00e-05
405
+ 2026-04-03 13:15:44 - ReXMoE - INFO - ℹ️ Collecting convergence data (need 5 epochs minimum)...
406
+ 2026-04-03 13:15:44 - ReXMoE - INFO - Routing Pattern Analysis (Epoch 1):
407
+ 2026-04-03 13:15:59 - ReXMoE - INFO -
408
+ Analyzing ACTUAL routing patterns from 10 batches (17,341 tokens)
409
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Current reuse scale: R=3
410
+ 2026-04-03 13:15:59 - ReXMoE - INFO -
411
+ [IG-MET Pruning Report]:
412
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Global: 0/0 UNIQUE experts pruned (0.0%) | threshold=-1.000000
413
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Cross-Layer Routing Distribution (ACTUAL selections):
414
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Same layer (i): 869,591 ( 33.2%)
415
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Previous layer (i-1): 896,913 ( 34.2%)
416
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Next layer (i+1): 797,210 ( 30.4%)
417
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Distant layers: 57,726 ( 2.2%)
418
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Sample Layer-Specific Routing Patterns:
419
+ 2026-04-03 13:15:59 - ReXMoE - INFO -
420
+ Layer 8:
421
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 7 from layer 9 ( L9): 6,917 times ( 39.9%)
422
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 14 from layer 9 ( L9): 6,553 times ( 37.8%)
423
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 7 from layer 7 ( L7): 6,305 times ( 36.4%)
424
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 14 from layer 7 ( L7): 5,503 times ( 31.7%)
425
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 2 from layer 7 ( L7): 3,988 times ( 23.0%)
426
+ 2026-04-03 13:15:59 - ReXMoE - INFO -
427
+ Layer 16:
428
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 8 from layer 15 ( L15): 8,873 times ( 51.2%)
429
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 8 from layer 17 ( L17): 8,226 times ( 47.4%)
430
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 10 from layer 15 ( L15): 5,752 times ( 33.2%)
431
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 10 from layer 17 ( L17): 4,996 times ( 28.8%)
432
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 9 from layer 16 (same): 3,718 times ( 21.4%)
433
+ 2026-04-03 13:15:59 - ReXMoE - INFO -
434
+ Layer 24:
435
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 8 from layer 23 ( L23): 9,676 times ( 55.8%)
436
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 8 from layer 25 ( L25): 7,087 times ( 40.9%)
437
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 9 from layer 23 ( L23): 6,982 times ( 40.3%)
438
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 9 from layer 25 ( L25): 4,908 times ( 28.3%)
439
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 4 from layer 24 (same): 3,902 times ( 22.5%)
440
+ 2026-04-03 13:15:59 - ReXMoE - INFO - ✅ Cross-layer expert reuse detected: 66.8% of routing uses adjacent layers
441
+ 2026-04-03 13:15:59 - ReXMoE - INFO - LR stepped to: 2.00e-05
442
+ 2026-04-03 13:15:59 - ReXMoE - INFO - ================================================================================
443
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Training Convergence Summary
444
+ 2026-04-03 13:15:59 - ReXMoE - INFO - ================================================================================
445
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Router Gradient Norms Over Epochs:
446
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Epoch 1: 0.084278
447
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Auxiliary Loss Over Epochs:
448
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Epoch 1: 0.008843
449
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Convergence Status: Insufficient data (< 5 epochs)
450
+ 2026-04-03 13:15:59 - ReXMoE - INFO -
451
+ Saving trained router weights only...
452
+ 2026-04-03 13:15:59 - ReXMoE - INFO - ✓ Saved trained router weights: 96 parameters
453
+ 2026-04-03 13:15:59 - ReXMoE - INFO - File: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/rexmoe_routers.pt
454
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Size: 12.03 MB
455
+ 2026-04-03 13:15:59 - ReXMoE - INFO -
456
+ Also saving full model with ReXMoE architecture...
457
+ 2026-04-03 13:16:00 - ReXMoE - INFO -
458
+ Merging LoRA adapters into base weights and saving to: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/merged
459
+ 2026-04-03 13:16:32 - ReXMoE - INFO - ✓ Saved merged full model (base+routers+LoRA) for one-step loading
460
+ 2026-04-03 13:16:32 - ReXMoE - INFO - ================================================================================
461
+ 2026-04-03 13:16:32 - ReXMoE - INFO - ✓ Training complete. Two checkpoint formats saved:
462
+ 2026-04-03 13:16:32 - ReXMoE - INFO - 1. Router weights only: rexmoe_routers.pt (portable)
463
+ 2026-04-03 13:16:32 - ReXMoE - INFO - 2. Full model: pytorch_model.bin (requires rexmoe_architecture.py)
464
+ 2026-04-03 13:16:32 - ReXMoE - INFO -
465
+ Checkpoint directory: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3
466
+ 2026-04-03 13:16:32 - ReXMoE - INFO - Full model size: 0.00 GB
467
+ 2026-04-03 13:16:32 - ReXMoE - INFO - ================================================================================
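The percentages in the "Cross-Layer Routing Distribution" block of the log above can be recomputed from the raw selection counts it prints. The following is a minimal sketch, not the training script's own code: the dictionary keys and the `routing_shares` helper are hypothetical names introduced here, and only the four counts are taken from the log.

```python
# Raw selection counts copied from the "Cross-Layer Routing Distribution" log lines.
# The key names are illustrative; they do not come from the training code.
counts = {
    "same": 869_591,      # Same layer (i)
    "previous": 896_913,  # Previous layer (i-1)
    "next": 797_210,      # Next layer (i+1)
    "distant": 57_726,    # Distant layers
}


def routing_shares(counts):
    """Return each category's share of all expert selections, in percent."""
    total = sum(counts.values())
    return {name: 100.0 * value / total for name, value in counts.items()}


shares = routing_shares(counts)

# Share of selections that leave the current layer (previous + next + distant).
non_same = 100.0 - shares["same"]
# Share of selections that go strictly to the previous or next layer.
adjacent_only = shares["previous"] + shares["next"]

for name, pct in shares.items():
    print(f"{name:>8}: {pct:5.1f}%")
print(f"non-same-layer: {non_same:.1f}%")
print(f"previous + next only: {adjacent_only:.1f}%")
```

Under these counts, the 66.8% figure in the "Cross-layer expert reuse detected" line matches the share of all non-same-layer selections (previous, next, and distant combined), while previous and next alone account for about 64.6%.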
logs/rexmoe_training_0304_033137 copy_aux_corrected.log ADDED
@@ -0,0 +1,467 @@
1
+ 2026-04-03 03:31:37 - ReXMoE - INFO - ================================================================================
2
+ 2026-04-03 03:31:37 - ReXMoE - INFO - ReXMoE Training Log - 0304_033137
3
+ 2026-04-03 03:31:37 - ReXMoE - INFO - Log file: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/logs/rexmoe_training_0304_033137.log
4
+ 2026-04-03 03:31:37 - ReXMoE - INFO - ================================================================================
5
+ 2026-04-03 03:31:37 - ReXMoE - INFO - ================================================================================
6
+ 2026-04-03 03:31:37 - ReXMoE - INFO - ReXMoE Cross-Layer Expert Reuse Training
7
+ 2026-04-03 03:31:37 - ReXMoE - INFO - ================================================================================
8
+ 2026-04-03 03:31:37 - ReXMoE - INFO - MET enabled: False
9
+ 2026-04-03 03:31:37 - ReXMoE - INFO -
10
+ Configuration:
11
+ Model: microsoft/Phi-mini-MoE-instruct
12
+ Dataset: ../dataset/alpaca_data_cleaned.json
13
+ Dataset mode: IF_2
14
+ Reuse Scale (R): 3
15
+ Prune Ratio (MET): N/A
16
+ Epochs: 1
17
+ Num of samples: 20000
18
+ Batch Size: 4
19
+ Sequence Length: 1024
20
+ Learning Rate: 2e-05
21
+ PSR Enabled: True
22
+ LR Scheduler: True
23
+ Save Path: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3
24
+ Gradient Checkpointing: False
25
+ LoRA Rank: 16 (Full LoRA: True)
26
+ LoRA Alpha: 32
27
+ MET Enabled: False (Mask Ratio: 0.1, Warmup: 0.5)
28
+ Log File: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/logs/rexmoe_training_0304_033137.log
29
+ Aux loss weight: 0.05
30
+
31
+ 2026-04-03 03:31:37 - ReXMoE - INFO - 💻 Using device: cuda)
32
+ 2026-04-03 03:31:37 - ReXMoE - INFO - GPU: NVIDIA RTX A6000, Memory: 47.53 GB
33
+ 2026-04-03 03:31:43 - ReXMoE - INFO - [5/7] Setting up optimizer and dataset...
34
+ 2026-04-03 03:31:43 - ReXMoE - INFO - Using 8-bit AdamW optimizer
35
+ 2026-04-03 03:31:43 - ReXMoE - INFO - LR Scheduler: CosineAnnealingLR (2e-05 → 2.0000000000000003e-06)
36
+ 2026-04-03 03:31:51 - ReXMoE - INFO -
37
+ First batch statistics:
38
+ 2026-04-03 03:31:51 - ReXMoE - INFO - LM Loss: 1.0094
39
+ 2026-04-03 03:31:51 - ReXMoE - INFO - Aux Loss: 0.092773
40
+ 2026-04-03 03:31:51 - ReXMoE - INFO - Total Loss: 1.1022
41
+ 2026-04-03 03:31:51 - ReXMoE - INFO - Current R: 2
42
+ 2026-04-03 03:31:51 - ReXMoE - INFO - Active experts per layer: 32
43
+ 2026-04-03 03:31:51 - ReXMoE - INFO - Gradient norm: 1.0000
44
+ 2026-04-03 03:31:51 - ReXMoE - INFO -
45
+
46
+ 2026-04-03 03:35:09 - ReXMoE - INFO - [50/5000] loss=1.1939 aux=0.025195 R=2
47
+ 2026-04-03 03:38:21 - ReXMoE - INFO - [100/5000] loss=1.1803 aux=0.016016 R=2
48
+ 2026-04-03 03:41:36 - ReXMoE - INFO - [150/5000] loss=1.2968 aux=0.014648 R=2
49
+ 2026-04-03 03:44:50 - ReXMoE - INFO - [200/5000] loss=1.2447 aux=0.011279 R=2
50
+ 2026-04-03 03:48:01 - ReXMoE - INFO - [250/5000] loss=1.1971 aux=0.013672 R=2
51
+ 2026-04-03 03:51:10 - ReXMoE - INFO - [300/5000] loss=2.1766 aux=0.009863 R=2
52
+ 2026-04-03 03:54:19 - ReXMoE - INFO - [350/5000] loss=1.1092 aux=0.007031 R=2
53
+ 2026-04-03 03:57:29 - ReXMoE - INFO - [400/5000] loss=0.9343 aux=0.009766 R=2
54
+ 2026-04-03 04:00:40 - ReXMoE - INFO - [450/5000] loss=1.2180 aux=0.018164 R=2
55
+ 2026-04-03 04:03:47 - ReXMoE - INFO - Warmup completed at step 500. Enabling FULL QLoRA with r = 16 and alpha = 32 on experts and updating optimizer...
56
+ 2026-04-03 04:03:51 - ReXMoE - INFO - Trainable params (routers + LoRA): 144179200 (1.8509%)
57
+ 2026-04-03 04:03:51 - ReXMoE - INFO - Sample trainable params after QLoRA: ['base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.k_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.k_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.o_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.o_proj.lora_B.default.weight', 'base_model.model.model.layers.0.block_sparse_moe.gate.weight', 'base_model.model.model.layers.0.block_sparse_moe.experts.0.w1.lora_A.default.weight']
58
+ 2026-04-03 04:03:58 - ReXMoE - INFO - [500/5000] loss=1.0733 aux=0.014648 R=2
59
+ 2026-04-03 04:09:15 - ReXMoE - INFO - [550/5000] loss=0.6253 aux=0.005884 R=2
60
+ 2026-04-03 04:14:28 - ReXMoE - INFO - [600/5000] loss=1.5688 aux=0.004394 R=2
+ 2026-04-03 04:19:39 - ReXMoE - INFO - [650/5000] loss=0.7864 aux=0.006543 R=2
+ 2026-04-03 04:24:52 - ReXMoE - INFO - [700/5000] loss=1.5303 aux=0.004272 R=2
+ 2026-04-03 04:30:02 - ReXMoE - INFO - [750/5000] loss=1.0098 aux=0.003125 R=2
+ 2026-04-03 04:35:13 - ReXMoE - INFO - [800/5000] loss=1.0523 aux=0.005713 R=2
+ 2026-04-03 04:40:24 - ReXMoE - INFO - [850/5000] loss=0.6447 aux=0.003638 R=2
+ 2026-04-03 04:45:37 - ReXMoE - INFO - [900/5000] loss=0.7665 aux=0.001929 R=2
+ 2026-04-03 04:50:50 - ReXMoE - INFO - [950/5000] loss=0.7762 aux=0.002295 R=2
+ 2026-04-03 04:56:03 - ReXMoE - INFO - [1000/5000] loss=1.0254 aux=0.001428 R=2
+ 2026-04-03 05:01:16 - ReXMoE - INFO - [1050/5000] loss=1.1320 aux=0.002295 R=2
+ 2026-04-03 05:06:28 - ReXMoE - INFO - [1100/5000] loss=0.7519 aux=0.001990 R=2
+ 2026-04-03 05:11:40 - ReXMoE - INFO - [1150/5000] loss=0.8246 aux=0.001282 R=2
+ 2026-04-03 05:16:55 - ReXMoE - INFO - [1200/5000] loss=1.0041 aux=0.002417 R=2
+ 2026-04-03 05:22:09 - ReXMoE - INFO - [1250/5000] loss=0.6804 aux=0.002344 R=2
+ 2026-04-03 05:27:21 - ReXMoE - INFO - [1300/5000] loss=0.9695 aux=0.001443 R=2
+ 2026-04-03 05:32:33 - ReXMoE - INFO - [1350/5000] loss=1.0448 aux=0.001054 R=2
+ 2026-04-03 05:37:45 - ReXMoE - INFO - [1400/5000] loss=0.7468 aux=0.000854 R=2
+ 2026-04-03 05:42:58 - ReXMoE - INFO - [1450/5000] loss=1.6307 aux=0.001404 R=2
+ 2026-04-03 05:48:10 - ReXMoE - INFO - [1500/5000] loss=1.1833 aux=0.001050 R=2
+ 2026-04-03 05:53:21 - ReXMoE - INFO - [1550/5000] loss=0.9216 aux=0.001196 R=2
+ 2026-04-03 05:58:33 - ReXMoE - INFO - [1600/5000] loss=0.5969 aux=0.001483 R=2
+ 2026-04-03 06:03:46 - ReXMoE - INFO - [1650/5000] loss=0.5240 aux=0.001007 R=2
+ 2026-04-03 06:08:58 - ReXMoE - INFO - [1700/5000] loss=0.7681 aux=0.000714 R=2
+ 2026-04-03 06:14:09 - ReXMoE - INFO - [1750/5000] loss=1.0812 aux=0.001160 R=2
+ 2026-04-03 06:19:21 - ReXMoE - INFO - [1800/5000] loss=0.8171 aux=0.002394 R=2
+ 2026-04-03 06:24:34 - ReXMoE - INFO - [1850/5000] loss=0.9029 aux=0.002148 R=2
+ 2026-04-03 06:29:46 - ReXMoE - INFO - [1900/5000] loss=1.0440 aux=0.000736 R=2
+ 2026-04-03 06:35:00 - ReXMoE - INFO - [1950/5000] loss=1.2026 aux=0.002038 R=2
+ 2026-04-03 06:40:13 - ReXMoE - INFO - [2000/5000] loss=0.7174 aux=0.001349 R=2
+ 2026-04-03 06:45:25 - ReXMoE - INFO - [2050/5000] loss=1.5737 aux=0.001428 R=2
+ 2026-04-03 06:50:37 - ReXMoE - INFO - [2100/5000] loss=0.8508 aux=0.001361 R=2
+ 2026-04-03 06:55:51 - ReXMoE - INFO - [2150/5000] loss=0.7965 aux=0.000662 R=2
+ 2026-04-03 07:01:02 - ReXMoE - INFO - [2200/5000] loss=1.3079 aux=0.001099 R=2
+ 2026-04-03 07:06:14 - ReXMoE - INFO - [2250/5000] loss=0.9750 aux=0.000891 R=2
+ 2026-04-03 07:11:28 - ReXMoE - INFO - [2300/5000] loss=0.9549 aux=0.000891 R=2
+ 2026-04-03 07:16:40 - ReXMoE - INFO - [2350/5000] loss=1.2216 aux=0.001636 R=2
+ 2026-04-03 07:21:53 - ReXMoE - INFO - [2400/5000] loss=0.9801 aux=0.000916 R=2
+ 2026-04-03 07:27:07 - ReXMoE - INFO - [2450/5000] loss=1.6587 aux=0.000641 R=2
+ 2026-04-03 07:32:23 - ReXMoE - INFO - [2500/5000] loss=1.7420 aux=0.003859 R=3
+ 2026-04-03 07:39:14 - ReXMoE - INFO - [2550/5000] loss=1.0498 aux=0.000720 R=3
+ 2026-04-03 07:46:08 - ReXMoE - INFO - [2600/5000] loss=0.7848 aux=0.001117 R=3
+ 2026-04-03 07:53:01 - ReXMoE - INFO - [2650/5000] loss=0.6119 aux=0.000397 R=3
+ 2026-04-03 07:59:55 - ReXMoE - INFO - [2700/5000] loss=1.0871 aux=0.000806 R=3
+ 2026-04-03 08:06:48 - ReXMoE - INFO - [2750/5000] loss=1.0422 aux=0.000564 R=3
+ 2026-04-03 08:13:45 - ReXMoE - INFO - [2800/5000] loss=1.0147 aux=0.001105 R=3
+ 2026-04-03 08:20:37 - ReXMoE - INFO - [2850/5000] loss=0.6756 aux=0.000781 R=3
+ 2026-04-03 08:27:28 - ReXMoE - INFO - [2900/5000] loss=0.6243 aux=0.000668 R=3
+ 2026-04-03 08:34:22 - ReXMoE - INFO - [2950/5000] loss=0.8838 aux=0.000990 R=3
+ 2026-04-03 08:41:13 - ReXMoE - INFO - [3000/5000] loss=0.7627 aux=0.000824 R=3
+ 2026-04-03 08:48:05 - ReXMoE - INFO - [3050/5000] loss=0.8120 aux=0.000267 R=3
+ 2026-04-03 08:54:56 - ReXMoE - INFO - [3100/5000] loss=0.9701 aux=0.000848 R=3
+ 2026-04-03 09:01:47 - ReXMoE - INFO - [3150/5000] loss=0.8151 aux=0.000516 R=3
+ 2026-04-03 09:08:39 - ReXMoE - INFO - [3200/5000] loss=0.6943 aux=0.001111 R=3
+ 2026-04-03 09:15:30 - ReXMoE - INFO - [3250/5000] loss=0.9401 aux=0.000940 R=3
+ 2026-04-03 09:22:20 - ReXMoE - INFO - [3300/5000] loss=0.7034 aux=0.001174 R=3
+ 2026-04-03 09:29:11 - ReXMoE - INFO - [3350/5000] loss=1.1980 aux=0.001202 R=3
+ 2026-04-03 09:36:04 - ReXMoE - INFO - [3400/5000] loss=0.6413 aux=0.000818 R=3
+ 2026-04-03 09:43:01 - ReXMoE - INFO - [3450/5000] loss=1.1729 aux=0.000674 R=3
+ 2026-04-03 09:49:52 - ReXMoE - INFO - [3500/5000] loss=1.1667 aux=0.000818 R=3
+ 2026-04-03 09:56:42 - ReXMoE - INFO - [3550/5000] loss=0.3543 aux=0.002930 R=3
+ 2026-04-03 10:03:29 - ReXMoE - INFO - [3600/5000] loss=1.0002 aux=0.001117 R=3
+ 2026-04-03 10:10:20 - ReXMoE - INFO - [3650/5000] loss=0.8748 aux=0.000601 R=3
+ 2026-04-03 10:17:12 - ReXMoE - INFO - [3700/5000] loss=0.9026 aux=0.002447 R=3
+ 2026-04-03 10:24:05 - ReXMoE - INFO - [3750/5000] loss=0.3710 aux=0.000873 R=3
+ 2026-04-03 10:30:57 - ReXMoE - INFO - [3800/5000] loss=1.2199 aux=0.000626 R=3
+ 2026-04-03 10:37:48 - ReXMoE - INFO - [3850/5000] loss=0.4812 aux=0.001223 R=3
+ 2026-04-03 10:44:38 - ReXMoE - INFO - [3900/5000] loss=0.9683 aux=0.000995 R=3
+ 2026-04-03 10:51:31 - ReXMoE - INFO - [3950/5000] loss=0.7649 aux=0.000693 R=3
+ 2026-04-03 10:58:23 - ReXMoE - INFO - [4000/5000] loss=0.7234 aux=0.000736 R=3
+ 2026-04-03 11:05:13 - ReXMoE - INFO - [4050/5000] loss=0.7793 aux=0.000516 R=3
+ 2026-04-03 11:12:02 - ReXMoE - INFO - [4100/5000] loss=1.2237 aux=0.000787 R=3
+ 2026-04-03 11:18:51 - ReXMoE - INFO - [4150/5000] loss=1.0040 aux=0.001080 R=3
+ 2026-04-03 11:25:38 - ReXMoE - INFO - [4200/5000] loss=0.4700 aux=0.000778 R=3
+ 2026-04-03 11:32:25 - ReXMoE - INFO - [4250/5000] loss=0.6833 aux=0.001794 R=3
+ 2026-04-03 11:39:11 - ReXMoE - INFO - [4300/5000] loss=0.8191 aux=0.001502 R=3
+ 2026-04-03 11:45:56 - ReXMoE - INFO - [4350/5000] loss=0.3914 aux=0.000525 R=3
+ 2026-04-03 11:52:41 - ReXMoE - INFO - [4400/5000] loss=0.9623 aux=0.000742 R=3
+ 2026-04-03 11:59:28 - ReXMoE - INFO - [4450/5000] loss=0.6550 aux=0.002246 R=3
+ 2026-04-03 12:06:15 - ReXMoE - INFO - [4500/5000] loss=0.9616 aux=0.001111 R=3
+ 2026-04-03 12:13:01 - ReXMoE - INFO - [4550/5000] loss=0.5557 aux=0.003516 R=3
+ 2026-04-03 12:19:46 - ReXMoE - INFO - [4600/5000] loss=0.6275 aux=0.002422 R=3
+ 2026-04-03 12:26:33 - ReXMoE - INFO - [4650/5000] loss=1.2395 aux=0.000620 R=3
+ 2026-04-03 12:33:18 - ReXMoE - INFO - [4700/5000] loss=0.6769 aux=0.000824 R=3
+ 2026-04-03 12:40:05 - ReXMoE - INFO - [4750/5000] loss=1.1499 aux=0.002539 R=3
+ 2026-04-03 12:46:50 - ReXMoE - INFO - [4800/5000] loss=0.7449 aux=0.000409 R=3
+ 2026-04-03 12:53:34 - ReXMoE - INFO - [4850/5000] loss=0.8246 aux=0.000729 R=3
+ 2026-04-03 13:00:22 - ReXMoE - INFO - [4900/5000] loss=0.9550 aux=0.000812 R=3
+ 2026-04-03 13:07:10 - ReXMoE - INFO - [4950/5000] loss=1.2535 aux=0.000644 R=3
+ 2026-04-03 13:13:50 - ReXMoE - INFO -
+ [Step 5000/5000] Running evaluation at eval_steps...
+ 2026-04-03 13:13:50 - ReXMoE - INFO -
+ Evaluating model with 3 sample prompts...
+ 2026-04-03 13:13:52 - ReXMoE - INFO -
+ --- Prompt 1/3 ---
+ 2026-04-03 13:13:52 - ReXMoE - INFO - Instruction: What is the capital of France?
+ 2026-04-03 13:13:52 - ReXMoE - INFO - Input: None
+ 2026-04-03 13:13:52 - ReXMoE - INFO - Generated completion (len 9): The capital of France is Paris.
+ 2026-04-03 13:14:11 - ReXMoE - INFO -
+ --- Prompt 2/3 ---
+ 2026-04-03 13:14:11 - ReXMoE - INFO - Instruction: High-pressure systems stop air from rising into the colder regions of the atmosphere where water can condense. What will most likely result if a high-pressure system remains in an area for a long period of time?
+ A. fog
+ B. rain
+ C. drought
+ D. tornado
+ Answer:
+ 2026-04-03 13:14:11 - ReXMoE - INFO - Input: None
+ 2026-04-03 13:14:11 - ReXMoE - INFO - Generated completion (len 77): A. fog
+
+ High-pressure systems often lead to fog formation because they can hold moisture and prevent it from evaporating. This can occur when the high-pressure system remains in an area for a long period of time. Fog forms when moist air cools and condenses into water droplets near the surface of the Earth.
+ 2026-04-03 13:14:13 - ReXMoE - INFO -
+ --- Prompt 3/3 ---
+ 2026-04-03 13:14:13 - ReXMoE - INFO - Instruction: Given the fact: predators eat prey
+ Question: Predators eat
+ A. lions
+ B. humans
+ C. bunnies
+ D. grass
+ Answer:
+ 2026-04-03 13:14:13 - ReXMoE - INFO - Input: None
+ 2026-04-03 13:14:13 - ReXMoE - INFO - Generated completion (len 7): C. bunnies
+ 2026-04-03 13:14:13 - ReXMoE - INFO - Evaluation of all 3 prompts complete.
+ 2026-04-03 13:14:13 - ReXMoE - INFO -
+ [Step 5000] Analyzing routing patterns at eval_steps...
+ 2026-04-03 13:14:37 - ReXMoE - INFO -
+ Analyzing ACTUAL routing patterns from 10 batches (15,294 tokens)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Current reuse scale: R=3
+ 2026-04-03 13:14:37 - ReXMoE - INFO -
+ [IG-MET Pruning Report]:
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Global: 0/0 UNIQUE experts pruned (0.0%) | threshold=-1.000000
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Cross-Layer Routing Distribution (ACTUAL selections):
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Same layer (i): 781,056 ( 29.8%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Previous layer (i-1): 965,741 ( 36.8%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Next layer (i+1): 815,206 ( 31.1%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Distant layers: 59,437 ( 2.3%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Sample Layer-Specific Routing Patterns:
+ 2026-04-03 13:14:37 - ReXMoE - INFO -
+ Layer 8:
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 7 from layer 9 ( L9): 5,937 times ( 38.8%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 7 from layer 7 ( L7): 5,895 times ( 38.5%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 14 from layer 9 ( L9): 5,822 times ( 38.1%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 14 from layer 7 ( L7): 5,618 times ( 36.7%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 2 from layer 7 ( L7): 4,200 times ( 27.5%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO -
+ Layer 16:
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 15 ( L15): 8,999 times ( 58.8%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 17 ( L17): 7,847 times ( 51.3%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 10 from layer 15 ( L15): 6,002 times ( 39.2%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 10 from layer 17 ( L17): 5,879 times ( 38.4%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 15 from layer 15 ( L15): 4,030 times ( 26.4%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO -
+ Layer 24:
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 23 ( L23): 9,213 times ( 60.2%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 9 from layer 23 ( L23): 7,912 times ( 51.7%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 25 ( L25): 6,819 times ( 44.6%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 9 from layer 25 ( L25): 6,403 times ( 41.9%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 24 (same): 4,350 times ( 28.4%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - ✅ Cross-layer expert reuse detected: 70.2% of routing uses adjacent layers
+ 2026-04-03 13:14:37 - ReXMoE - INFO -
+ [Step 5000] Saving checkpoint at eval_steps to ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3...
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.0.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.0.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.1.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.1.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.2.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.2.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.3.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.3.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.4.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.4.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.5.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.5.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.6.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.6.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.7.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.7.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.8.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.8.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.9.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.9.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.10.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.10.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.11.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.11.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.12.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.12.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.13.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.13.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.14.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.14.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.15.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.15.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.16.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.16.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.17.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.17.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.18.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.18.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.19.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.19.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.20.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.20.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.21.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.21.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.22.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.22.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.23.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.23.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.24.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.24.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.25.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.25.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.26.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.26.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.27.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.27.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.28.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.28.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.29.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.29.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.30.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.30.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.31.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.31.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:14:37 - ReXMoE - INFO - ✓ Saved trained router weights: 96 parameters
+ 2026-04-03 13:14:37 - ReXMoE - INFO - File: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/rexmoe_routers.pt
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Size: 12.03 MB
+ 2026-04-03 13:14:37 - ReXMoE - INFO -
+ Also saving full model with ReXMoE architecture...
+ 2026-04-03 13:14:39 - ReXMoE - INFO -
+ Merging LoRA adapters into base weights and saving to: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/merged
+ 2026-04-03 13:15:00 - ReXMoE - INFO - ✓ Saved merged full model (base+routers+LoRA) for one-step loading
+ 2026-04-03 13:15:00 - ReXMoE - INFO -
+ ============================================================
+ 2026-04-03 13:15:00 - ReXMoE - INFO - Epoch 1 Summary:
+ 2026-04-03 13:15:00 - ReXMoE - INFO - Average LM Loss: 0.9498
+ 2026-04-03 13:15:00 - ReXMoE - INFO - Average Aux Loss: 0.008843
+ 2026-04-03 13:15:00 - ReXMoE - INFO - Average Total Loss: 0.9586
+ 2026-04-03 13:15:00 - ReXMoE - INFO - Final R: 3
+ 2026-04-03 13:15:00 - ReXMoE - INFO -
+ Evaluating model with 3 sample prompts...
+ 2026-04-03 13:15:02 - ReXMoE - INFO -
+ --- Prompt 1/3 ---
+ 2026-04-03 13:15:02 - ReXMoE - INFO - Instruction: What is the capital of France?
+ 2026-04-03 13:15:02 - ReXMoE - INFO - Input: None
+ 2026-04-03 13:15:02 - ReXMoE - INFO - Generated completion (len 9): The capital of France is Paris.
+ 2026-04-03 13:15:04 - ReXMoE - INFO -
+ --- Prompt 2/3 ---
+ 2026-04-03 13:15:04 - ReXMoE - INFO - Instruction: High-pressure systems stop air from rising into the colder regions of the atmosphere where water can condense. What will most likely result if a high-pressure system remains in an area for a long period of time?
+ A. fog
+ B. rain
+ C. drought
+ D. tornado
+ Answer:
+ 2026-04-03 13:15:04 - ReXMoE - INFO - Input: None
+ 2026-04-03 13:15:04 - ReXMoE - INFO - Generated completion (len 5): A. fog
+ 2026-04-03 13:15:05 - ReXMoE - INFO -
+ --- Prompt 3/3 ---
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Instruction: Given the fact: predators eat prey
+ Question: Predators eat
+ A. lions
+ B. humans
+ C. bunnies
+ D. grass
+ Answer:
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Input: None
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Generated completion (len 7): C. bunnies
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Evaluation of all 3 prompts complete.
+ 2026-04-03 13:15:05 - ReXMoE - INFO - New best epoch 1 with avg LM loss 0.9498 — saving checkpoint to ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.0.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.0.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.1.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.1.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.2.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.2.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.3.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.3.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.4.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.4.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.5.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.5.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.6.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.6.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.7.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.7.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.8.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.8.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.9.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.9.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.10.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.10.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.11.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.11.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.12.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.12.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.13.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.13.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.14.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.14.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.15.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.15.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
360
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.16.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
361
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.16.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
362
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.17.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
363
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.17.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
364
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.18.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
365
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.18.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
366
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.19.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
367
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.19.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
368
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.20.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
369
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.20.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
370
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.21.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
371
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.21.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
372
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.22.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
373
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.22.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
374
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.23.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
375
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.23.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
376
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.24.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
377
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.24.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
378
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.25.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
379
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.25.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
380
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.26.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
381
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.26.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
382
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.27.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
383
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.27.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
384
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.28.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
385
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.28.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
386
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.29.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
387
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.29.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
388
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.30.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
389
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.30.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
390
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.31.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
391
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.31.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
392
+ 2026-04-03 13:15:06 - ReXMoE - INFO - ✓ Saved trained router weights: 96 parameters
393
+ 2026-04-03 13:15:06 - ReXMoE - INFO - File: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/rexmoe_routers.pt
394
+ 2026-04-03 13:15:06 - ReXMoE - INFO - Size: 12.03 MB
395
+ 2026-04-03 13:15:06 - ReXMoE - INFO -
396
+ Also saving full model with ReXMoE architecture...
397
+ 2026-04-03 13:15:06 - ReXMoE - INFO -
398
+ Merging LoRA adapters into base weights and saving to: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/merged
399
+ 2026-04-03 13:15:44 - ReXMoE - INFO - ✓ Saved merged full model (base+routers+LoRA) for one-step loading
400
+ 2026-04-03 13:15:44 - ReXMoE - INFO -
401
+ 📊 Convergence Metrics:
402
+ 2026-04-03 13:15:44 - ReXMoE - INFO - Convergence Metrics:
403
+ 2026-04-03 13:15:44 - ReXMoE - INFO - Avg Router Grad Norm: 0.084278
404
+ 2026-04-03 13:15:44 - ReXMoE - INFO - Current Learning Rate: 2.00e-05
405
+ 2026-04-03 13:15:44 - ReXMoE - INFO - ℹ️ Collecting convergence data (need 5 epochs minimum)...
406
+ 2026-04-03 13:15:44 - ReXMoE - INFO - Routing Pattern Analysis (Epoch 1):
407
+ 2026-04-03 13:15:59 - ReXMoE - INFO -
408
+ Analyzing ACTUAL routing patterns from 10 batches (17,341 tokens)
409
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Current reuse scale: R=3
410
+ 2026-04-03 13:15:59 - ReXMoE - INFO -
411
+ [IG-MET Pruning Report]:
412
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Global: 0/0 UNIQUE experts pruned (0.0%) | threshold=-1.000000
413
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Cross-Layer Routing Distribution (ACTUAL selections):
414
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Same layer (i): 869,591 ( 33.2%)
415
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Previous layer (i-1): 896,913 ( 34.2%)
416
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Next layer (i+1): 797,210 ( 30.4%)
417
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Distant layers: 57,726 ( 2.2%)
418
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Sample Layer-Specific Routing Patterns:
419
+ 2026-04-03 13:15:59 - ReXMoE - INFO -
420
+ Layer 8:
421
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 7 from layer 9 ( L9): 6,917 times ( 39.9%)
422
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 14 from layer 9 ( L9): 6,553 times ( 37.8%)
423
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 7 from layer 7 ( L7): 6,305 times ( 36.4%)
424
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 14 from layer 7 ( L7): 5,503 times ( 31.7%)
425
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 2 from layer 7 ( L7): 3,988 times ( 23.0%)
426
+ 2026-04-03 13:15:59 - ReXMoE - INFO -
427
+ Layer 16:
428
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 8 from layer 15 ( L15): 8,873 times ( 51.2%)
429
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 8 from layer 17 ( L17): 8,226 times ( 47.4%)
430
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 10 from layer 15 ( L15): 5,752 times ( 33.2%)
431
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 10 from layer 17 ( L17): 4,996 times ( 28.8%)
432
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 9 from layer 16 (same): 3,718 times ( 21.4%)
433
+ 2026-04-03 13:15:59 - ReXMoE - INFO -
434
+ Layer 24:
435
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 8 from layer 23 ( L23): 9,676 times ( 55.8%)
436
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 8 from layer 25 ( L25): 7,087 times ( 40.9%)
437
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 9 from layer 23 ( L23): 6,982 times ( 40.3%)
438
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 9 from layer 25 ( L25): 4,908 times ( 28.3%)
439
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 4 from layer 24 (same): 3,902 times ( 22.5%)
440
+ 2026-04-03 13:15:59 - ReXMoE - INFO - ✅ Cross-layer expert reuse detected: 66.8% of routing uses adjacent layers
441
+ 2026-04-03 13:15:59 - ReXMoE - INFO - LR stepped to: 2.00e-05
442
+ 2026-04-03 13:15:59 - ReXMoE - INFO - ================================================================================
443
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Training Convergence Summary
444
+ 2026-04-03 13:15:59 - ReXMoE - INFO - ================================================================================
445
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Router Gradient Norms Over Epochs:
446
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Epoch 1: 0.084278
447
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Auxiliary Loss Over Epochs:
448
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Epoch 1: 0.008843
449
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Convergence Status: Insufficient data (< 5 epochs)
450
+ 2026-04-03 13:15:59 - ReXMoE - INFO -
451
+ Saving trained router weights only...
452
+ 2026-04-03 13:15:59 - ReXMoE - INFO - ✓ Saved trained router weights: 96 parameters
453
+ 2026-04-03 13:15:59 - ReXMoE - INFO - File: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/rexmoe_routers.pt
454
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Size: 12.03 MB
455
+ 2026-04-03 13:15:59 - ReXMoE - INFO -
456
+ Also saving full model with ReXMoE architecture...
457
+ 2026-04-03 13:16:00 - ReXMoE - INFO -
458
+ Merging LoRA adapters into base weights and saving to: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/merged
459
+ 2026-04-03 13:16:32 - ReXMoE - INFO - ✓ Saved merged full model (base+routers+LoRA) for one-step loading
460
+ 2026-04-03 13:16:32 - ReXMoE - INFO - ================================================================================
461
+ 2026-04-03 13:16:32 - ReXMoE - INFO - ✓ Training complete. Two checkpoint formats saved:
462
+ 2026-04-03 13:16:32 - ReXMoE - INFO - 1. Router weights only: rexmoe_routers.pt (portable)
463
+ 2026-04-03 13:16:32 - ReXMoE - INFO - 2. Full model: pytorch_model.bin (requires rexmoe_architecture.py)
464
+ 2026-04-03 13:16:32 - ReXMoE - INFO -
465
+ Checkpoint directory: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3
466
+ 2026-04-03 13:16:32 - ReXMoE - INFO - Full model size: 0.00 GB
467
+ 2026-04-03 13:16:32 - ReXMoE - INFO - ================================================================================
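The percentages in the "Cross-Layer Routing Distribution" block of the log above can be recomputed from the raw selection counts. The sketch below is only a sanity check, not part of the training code: the counts are copied from the Epoch 1 analysis, while the names `counts` and `routing_shares` are hypothetical and introduced here for illustration. It suggests that the reported 66.8% corresponds to every selection outside the same layer (previous, next, and distant), whereas the strictly adjacent layers (i-1 and i+1) alone account for slightly less.

```python
# Selection counts copied from the "Cross-Layer Routing Distribution" lines
# of the Epoch 1 routing analysis in the log above.
counts = {
    "same": 869_591,      # Same layer (i)
    "previous": 896_913,  # Previous layer (i-1)
    "next": 797_210,      # Next layer (i+1)
    "distant": 57_726,    # Distant layers
}


def routing_shares(counts):
    """Return each category's share of all expert selections, in percent."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


shares = routing_shares(counts)
cross_layer = 100.0 - shares["same"]                 # everything routed outside layer i
adjacent_only = shares["previous"] + shares["next"]  # strictly i-1 and i+1

for name, pct in shares.items():
    print(f"{name:>8}: {pct:5.1f}%")
print(f"cross-layer (incl. distant): {cross_layer:.1f}%")
print(f"adjacent only (i-1, i+1):    {adjacent_only:.1f}%")
```

The same check can be applied to the Step 5000 analysis by swapping in its four counts.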
logs/rexmoe_training_0304_033137.log ADDED
@@ -0,0 +1,467 @@
+ 2026-04-03 03:31:37 - ReXMoE - INFO - ================================================================================
+ 2026-04-03 03:31:37 - ReXMoE - INFO - ReXMoE Training Log - 0304_033137
+ 2026-04-03 03:31:37 - ReXMoE - INFO - Log file: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/logs/rexmoe_training_0304_033137.log
+ 2026-04-03 03:31:37 - ReXMoE - INFO - ================================================================================
+ 2026-04-03 03:31:37 - ReXMoE - INFO - ================================================================================
+ 2026-04-03 03:31:37 - ReXMoE - INFO - ReXMoE Cross-Layer Expert Reuse Training
+ 2026-04-03 03:31:37 - ReXMoE - INFO - ================================================================================
+ 2026-04-03 03:31:37 - ReXMoE - INFO - MET enabled: False
+ 2026-04-03 03:31:37 - ReXMoE - INFO -
+ Configuration:
+ Model: microsoft/Phi-mini-MoE-instruct
+ Dataset: ../dataset/alpaca_data_cleaned.json
+ Dataset mode: IF_2
+ Reuse Scale (R): 3
+ Prune Ratio (MET): N/A
+ Epochs: 1
+ Num of samples: 20000
+ Batch Size: 4
+ Sequence Length: 1024
+ Learning Rate: 2e-05
+ PSR Enabled: True
+ LR Scheduler: True
+ Save Path: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3
+ Gradient Checkpointing: False
+ LoRA Rank: 16 (Full LoRA: True)
+ LoRA Alpha: 32
+ MET Enabled: False (Mask Ratio: 0.1, Warmup: 0.5)
+ Log File: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/logs/rexmoe_training_0304_033137.log
+ Aux loss weight: 0.05
+
+ 2026-04-03 03:31:37 - ReXMoE - INFO - 💻 Using device: cuda)
32
+ 2026-04-03 03:31:37 - ReXMoE - INFO - GPU: NVIDIA RTX A6000, Memory: 47.53 GB
33
+ 2026-04-03 03:31:43 - ReXMoE - INFO - [5/7] Setting up optimizer and dataset...
34
+ 2026-04-03 03:31:43 - ReXMoE - INFO - Using 8-bit AdamW optimizer
35
+ 2026-04-03 03:31:43 - ReXMoE - INFO - LR Scheduler: CosineAnnealingLR (2e-05 → 2.0000000000000003e-06)
36
+ 2026-04-03 03:31:51 - ReXMoE - INFO -
37
+ First batch statistics:
38
+ 2026-04-03 03:31:51 - ReXMoE - INFO - LM Loss: 1.0094
39
+ 2026-04-03 03:31:51 - ReXMoE - INFO - Aux Loss: 0.092773
40
+ 2026-04-03 03:31:51 - ReXMoE - INFO - Total Loss: 1.1022
41
+ 2026-04-03 03:31:51 - ReXMoE - INFO - Current R: 2
42
+ 2026-04-03 03:31:51 - ReXMoE - INFO - Active experts per layer: 32
43
+ 2026-04-03 03:31:51 - ReXMoE - INFO - Gradient norm: 1.0000
44
+ 2026-04-03 03:31:51 - ReXMoE - INFO -
45
+
46
+ 2026-04-03 03:35:09 - ReXMoE - INFO - [50/5000] loss=1.1939 aux=0.062988 R=2
47
+ 2026-04-03 03:38:21 - ReXMoE - INFO - [100/5000] loss=1.1803 aux=0.040039 R=2
48
+ 2026-04-03 03:41:36 - ReXMoE - INFO - [150/5000] loss=1.2968 aux=0.036621 R=2
49
+ 2026-04-03 03:44:50 - ReXMoE - INFO - [200/5000] loss=1.2447 aux=0.028198 R=2
50
+ 2026-04-03 03:48:01 - ReXMoE - INFO - [250/5000] loss=1.1971 aux=0.034180 R=2
51
+ 2026-04-03 03:51:10 - ReXMoE - INFO - [300/5000] loss=2.1766 aux=0.024658 R=2
52
+ 2026-04-03 03:54:19 - ReXMoE - INFO - [350/5000] loss=1.1092 aux=0.017578 R=2
53
+ 2026-04-03 03:57:29 - ReXMoE - INFO - [400/5000] loss=0.9343 aux=0.024414 R=2
54
+ 2026-04-03 04:00:40 - ReXMoE - INFO - [450/5000] loss=1.2180 aux=0.045410 R=2
55
+ 2026-04-03 04:03:47 - ReXMoE - INFO - Warmup completed at step 500. Enabling FULL QLoRA with r = 16 and alpha = 32 on experts and updating optimizer...
56
+ 2026-04-03 04:03:51 - ReXMoE - INFO - Trainable params (routers + LoRA): 144179200 (1.8509%)
57
+ 2026-04-03 04:03:51 - ReXMoE - INFO - Sample trainable params after QLoRA: ['base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.k_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.k_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.o_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.o_proj.lora_B.default.weight', 'base_model.model.model.layers.0.block_sparse_moe.gate.weight', 'base_model.model.model.layers.0.block_sparse_moe.experts.0.w1.lora_A.default.weight']
58
+ 2026-04-03 04:03:58 - ReXMoE - INFO - [500/5000] loss=1.0733 aux=0.036621 R=2
59
+ 2026-04-03 04:09:15 - ReXMoE - INFO - [550/5000] loss=0.6253 aux=0.014709 R=2
60
+ 2026-04-03 04:14:28 - ReXMoE - INFO - [600/5000] loss=1.5688 aux=0.010986 R=2
61
+ 2026-04-03 04:19:39 - ReXMoE - INFO - [650/5000] loss=0.7864 aux=0.016357 R=2
62
+ 2026-04-03 04:24:52 - ReXMoE - INFO - [700/5000] loss=1.5303 aux=0.010681 R=2
63
+ 2026-04-03 04:30:02 - ReXMoE - INFO - [750/5000] loss=1.0098 aux=0.007812 R=2
64
+ 2026-04-03 04:35:13 - ReXMoE - INFO - [800/5000] loss=1.0523 aux=0.014282 R=2
65
+ 2026-04-03 04:40:24 - ReXMoE - INFO - [850/5000] loss=0.6447 aux=0.009094 R=2
66
+ 2026-04-03 04:45:37 - ReXMoE - INFO - [900/5000] loss=0.7665 aux=0.004822 R=2
67
+ 2026-04-03 04:50:50 - ReXMoE - INFO - [950/5000] loss=0.7762 aux=0.005737 R=2
68
+ 2026-04-03 04:56:03 - ReXMoE - INFO - [1000/5000] loss=1.0254 aux=0.003571 R=2
69
+ 2026-04-03 05:01:16 - ReXMoE - INFO - [1050/5000] loss=1.1320 aux=0.005737 R=2
70
+ 2026-04-03 05:06:28 - ReXMoE - INFO - [1100/5000] loss=0.7519 aux=0.004974 R=2
71
+ 2026-04-03 05:11:40 - ReXMoE - INFO - [1150/5000] loss=0.8246 aux=0.003204 R=2
72
+ 2026-04-03 05:16:55 - ReXMoE - INFO - [1200/5000] loss=1.0041 aux=0.006042 R=2
73
+ 2026-04-03 05:22:09 - ReXMoE - INFO - [1250/5000] loss=0.6804 aux=0.005859 R=2
74
+ 2026-04-03 05:27:21 - ReXMoE - INFO - [1300/5000] loss=0.9695 aux=0.011108 R=2
75
+ 2026-04-03 05:32:33 - ReXMoE - INFO - [1350/5000] loss=1.0448 aux=0.012634 R=2
76
+ 2026-04-03 05:37:45 - ReXMoE - INFO - [1400/5000] loss=0.7468 aux=0.002136 R=2
77
+ 2026-04-03 05:42:58 - ReXMoE - INFO - [1450/5000] loss=1.6307 aux=0.003510 R=2
78
+ 2026-04-03 05:48:10 - ReXMoE - INFO - [1500/5000] loss=1.1833 aux=0.002625 R=2
79
+ 2026-04-03 05:53:21 - ReXMoE - INFO - [1550/5000] loss=0.9216 aux=0.002991 R=2
80
+ 2026-04-03 05:58:33 - ReXMoE - INFO - [1600/5000] loss=0.5969 aux=0.003708 R=2
81
+ 2026-04-03 06:03:46 - ReXMoE - INFO - [1650/5000] loss=0.5240 aux=0.002518 R=2
82
+ 2026-04-03 06:08:58 - ReXMoE - INFO - [1700/5000] loss=0.7681 aux=0.001785 R=2
83
+ 2026-04-03 06:14:09 - ReXMoE - INFO - [1750/5000] loss=1.0812 aux=0.002899 R=2
84
+ 2026-04-03 06:19:21 - ReXMoE - INFO - [1800/5000] loss=0.8171 aux=0.010986 R=2
85
+ 2026-04-03 06:24:34 - ReXMoE - INFO - [1850/5000] loss=0.9029 aux=0.005371 R=2
86
+ 2026-04-03 06:29:46 - ReXMoE - INFO - [1900/5000] loss=1.0440 aux=0.001839 R=2
87
+ 2026-04-03 06:35:00 - ReXMoE - INFO - [1950/5000] loss=1.2026 aux=0.005096 R=2
88
+ 2026-04-03 06:40:13 - ReXMoE - INFO - [2000/5000] loss=0.7174 aux=0.003372 R=2
89
+ 2026-04-03 06:45:25 - ReXMoE - INFO - [2050/5000] loss=1.5737 aux=0.003571 R=2
90
+ 2026-04-03 06:50:37 - ReXMoE - INFO - [2100/5000] loss=0.8508 aux=0.003403 R=2
91
+ 2026-04-03 06:55:51 - ReXMoE - INFO - [2150/5000] loss=0.7965 aux=0.001656 R=2
92
+ 2026-04-03 07:01:02 - ReXMoE - INFO - [2200/5000] loss=1.3079 aux=0.002747 R=2
93
+ 2026-04-03 07:06:14 - ReXMoE - INFO - [2250/5000] loss=0.9750 aux=0.002228 R=2
94
+ 2026-04-03 07:11:28 - ReXMoE - INFO - [2300/5000] loss=0.9549 aux=0.002228 R=2
95
+ 2026-04-03 07:16:40 - ReXMoE - INFO - [2350/5000] loss=1.2216 aux=0.004089 R=2
96
+ 2026-04-03 07:21:53 - ReXMoE - INFO - [2400/5000] loss=0.9801 aux=0.002289 R=2
97
+ 2026-04-03 07:27:07 - ReXMoE - INFO - [2450/5000] loss=1.6587 aux=0.001602 R=2
98
+ 2026-04-03 07:32:23 - ReXMoE - INFO - [2500/5000] loss=1.7420 aux=0.014648 R=3
99
+ 2026-04-03 07:39:14 - ReXMoE - INFO - [2550/5000] loss=1.0498 aux=0.001801 R=3
100
+ 2026-04-03 07:46:08 - ReXMoE - INFO - [2600/5000] loss=0.7848 aux=0.002792 R=3
101
+ 2026-04-03 07:53:01 - ReXMoE - INFO - [2650/5000] loss=0.6119 aux=0.000992 R=3
102
+ 2026-04-03 07:59:55 - ReXMoE - INFO - [2700/5000] loss=1.0871 aux=0.002014 R=3
103
+ 2026-04-03 08:06:48 - ReXMoE - INFO - [2750/5000] loss=1.0422 aux=0.001411 R=3
104
+ 2026-04-03 08:13:45 - ReXMoE - INFO - [2800/5000] loss=1.0147 aux=0.002762 R=3
105
+ 2026-04-03 08:20:37 - ReXMoE - INFO - [2850/5000] loss=0.6756 aux=0.001953 R=3
106
+ 2026-04-03 08:27:28 - ReXMoE - INFO - [2900/5000] loss=0.6243 aux=0.001671 R=3
107
+ 2026-04-03 08:34:22 - ReXMoE - INFO - [2950/5000] loss=0.8838 aux=0.004974 R=3
108
+ 2026-04-03 08:41:13 - ReXMoE - INFO - [3000/5000] loss=0.7627 aux=0.002060 R=3
109
+ 2026-04-03 08:48:05 - ReXMoE - INFO - [3050/5000] loss=0.8120 aux=0.000668 R=3
110
+ 2026-04-03 08:54:56 - ReXMoE - INFO - [3100/5000] loss=0.9701 aux=0.002121 R=3
111
+ 2026-04-03 09:01:47 - ReXMoE - INFO - [3150/5000] loss=0.8151 aux=0.001289 R=3
112
+ 2026-04-03 09:08:39 - ReXMoE - INFO - [3200/5000] loss=0.6943 aux=0.002777 R=3
113
+ 2026-04-03 09:15:30 - ReXMoE - INFO - [3250/5000] loss=0.9401 aux=0.002350 R=3
114
+ 2026-04-03 09:22:20 - ReXMoE - INFO - [3300/5000] loss=0.7034 aux=0.007935 R=3
115
+ 2026-04-03 09:29:11 - ReXMoE - INFO - [3350/5000] loss=1.1980 aux=0.003006 R=3
116
+ 2026-04-03 09:36:04 - ReXMoE - INFO - [3400/5000] loss=0.6413 aux=0.002045 R=3
117
+ 2026-04-03 09:43:01 - ReXMoE - INFO - [3450/5000] loss=1.1729 aux=0.001686 R=3
118
+ 2026-04-03 09:49:52 - ReXMoE - INFO - [3500/5000] loss=1.1667 aux=0.002045 R=3
119
+ 2026-04-03 09:56:42 - ReXMoE - INFO - [3550/5000] loss=0.3543 aux=0.007324 R=3
120
+ 2026-04-03 10:03:29 - ReXMoE - INFO - [3600/5000] loss=1.0002 aux=0.002792 R=3
121
+ 2026-04-03 10:10:20 - ReXMoE - INFO - [3650/5000] loss=0.8748 aux=0.001503 R=3
122
+ 2026-04-03 10:17:12 - ReXMoE - INFO - [3700/5000] loss=0.9026 aux=0.021118 R=3
123
+ 2026-04-03 10:24:05 - ReXMoE - INFO - [3750/5000] loss=0.3710 aux=0.002182 R=3
124
+ 2026-04-03 10:30:57 - ReXMoE - INFO - [3800/5000] loss=1.2199 aux=0.001564 R=3
125
+ 2026-04-03 10:37:48 - ReXMoE - INFO - [3850/5000] loss=0.4812 aux=0.008057 R=3
126
+ 2026-04-03 10:44:38 - ReXMoE - INFO - [3900/5000] loss=0.9683 aux=0.002487 R=3
127
+ 2026-04-03 10:51:31 - ReXMoE - INFO - [3950/5000] loss=0.7649 aux=0.001732 R=3
128
+ 2026-04-03 10:58:23 - ReXMoE - INFO - [4000/5000] loss=0.7234 aux=0.001839 R=3
129
+ 2026-04-03 11:05:13 - ReXMoE - INFO - [4050/5000] loss=0.7793 aux=0.001289 R=3
130
+ 2026-04-03 11:12:02 - ReXMoE - INFO - [4100/5000] loss=1.2237 aux=0.001968 R=3
131
+ 2026-04-03 11:18:51 - ReXMoE - INFO - [4150/5000] loss=1.0040 aux=0.002701 R=3
132
+ 2026-04-03 11:25:38 - ReXMoE - INFO - [4200/5000] loss=0.4700 aux=0.001945 R=3
133
+ 2026-04-03 11:32:25 - ReXMoE - INFO - [4250/5000] loss=0.6833 aux=0.004486 R=3
134
+ 2026-04-03 11:39:11 - ReXMoE - INFO - [4300/5000] loss=0.8191 aux=0.003754 R=3
135
+ 2026-04-03 11:45:56 - ReXMoE - INFO - [4350/5000] loss=0.3914 aux=0.001312 R=3
136
+ 2026-04-03 11:52:41 - ReXMoE - INFO - [4400/5000] loss=0.9623 aux=0.001854 R=3
137
+ 2026-04-03 11:59:28 - ReXMoE - INFO - [4450/5000] loss=0.6550 aux=0.005615 R=3
138
+ 2026-04-03 12:06:15 - ReXMoE - INFO - [4500/5000] loss=0.9616 aux=0.002777 R=3
139
+ 2026-04-03 12:13:01 - ReXMoE - INFO - [4550/5000] loss=0.5557 aux=0.008789 R=3
140
+ 2026-04-03 12:19:46 - ReXMoE - INFO - [4600/5000] loss=0.6275 aux=0.018555 R=3
141
+ 2026-04-03 12:26:33 - ReXMoE - INFO - [4650/5000] loss=1.2395 aux=0.001549 R=3
142
+ 2026-04-03 12:33:18 - ReXMoE - INFO - [4700/5000] loss=0.6769 aux=0.002060 R=3
143
+ 2026-04-03 12:40:05 - ReXMoE - INFO - [4750/5000] loss=1.1499 aux=0.006348 R=3
144
+ 2026-04-03 12:46:50 - ReXMoE - INFO - [4800/5000] loss=0.7449 aux=0.001022 R=3
145
+ 2026-04-03 12:53:34 - ReXMoE - INFO - [4850/5000] loss=0.8246 aux=0.001823 R=3
146
+ 2026-04-03 13:00:22 - ReXMoE - INFO - [4900/5000] loss=0.9550 aux=0.002029 R=3
147
+ 2026-04-03 13:07:10 - ReXMoE - INFO - [4950/5000] loss=1.2535 aux=0.001610 R=3
+ 2026-04-03 13:13:50 - ReXMoE - INFO -
+ [Step 5000/5000] Running evaluation at eval_steps...
+ 2026-04-03 13:13:50 - ReXMoE - INFO -
+ Evaluating model with 3 sample prompts...
+ 2026-04-03 13:13:52 - ReXMoE - INFO -
+ --- Prompt 1/3 ---
+ 2026-04-03 13:13:52 - ReXMoE - INFO - Instruction: What is the capital of France?
+ 2026-04-03 13:13:52 - ReXMoE - INFO - Input: None
+ 2026-04-03 13:13:52 - ReXMoE - INFO - Generated completion (len 9): The capital of France is Paris.
+ 2026-04-03 13:14:11 - ReXMoE - INFO -
+ --- Prompt 2/3 ---
+ 2026-04-03 13:14:11 - ReXMoE - INFO - Instruction: High-pressure systems stop air from rising into the colder regions of the atmosphere where water can condense. What will most likely result if a high-pressure system remains in an area for a long period of time?
+ A. fog
+ B. rain
+ C. drought
+ D. tornado
+ Answer:
+ 2026-04-03 13:14:11 - ReXMoE - INFO - Input: None
+ 2026-04-03 13:14:11 - ReXMoE - INFO - Generated completion (len 77): A. fog
+
+ High-pressure systems often lead to fog formation because they can hold moisture and prevent it from evaporating. This can occur when the high-pressure system remains in an area for a long period of time. Fog forms when moist air cools and condenses into water droplets near the surface of the Earth.
+ 2026-04-03 13:14:13 - ReXMoE - INFO -
+ --- Prompt 3/3 ---
+ 2026-04-03 13:14:13 - ReXMoE - INFO - Instruction: Given the fact: predators eat prey
+ Question: Predators eat
+ A. lions
+ B. humans
+ C. bunnies
+ D. grass
+ Answer:
+ 2026-04-03 13:14:13 - ReXMoE - INFO - Input: None
+ 2026-04-03 13:14:13 - ReXMoE - INFO - Generated completion (len 7): C. bunnies
+ 2026-04-03 13:14:13 - ReXMoE - INFO - Evaluation of all 3 prompts complete.
+ 2026-04-03 13:14:13 - ReXMoE - INFO -
+ [Step 5000] Analyzing routing patterns at eval_steps...
+ 2026-04-03 13:14:37 - ReXMoE - INFO -
+ Analyzing ACTUAL routing patterns from 10 batches (15,294 tokens)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Current reuse scale: R=3
+ 2026-04-03 13:14:37 - ReXMoE - INFO -
+ [IG-MET Pruning Report]:
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Global: 0/0 UNIQUE experts pruned (0.0%) | threshold=-1.000000
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Cross-Layer Routing Distribution (ACTUAL selections):
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Same layer (i): 781,056 ( 29.8%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Previous layer (i-1): 965,741 ( 36.8%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Next layer (i+1): 815,206 ( 31.1%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Distant layers: 59,437 ( 2.3%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Sample Layer-Specific Routing Patterns:
+ 2026-04-03 13:14:37 - ReXMoE - INFO -
+ Layer 8:
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 7 from layer 9 ( L9): 5,937 times ( 38.8%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 7 from layer 7 ( L7): 5,895 times ( 38.5%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 14 from layer 9 ( L9): 5,822 times ( 38.1%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 14 from layer 7 ( L7): 5,618 times ( 36.7%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 2 from layer 7 ( L7): 4,200 times ( 27.5%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO -
+ Layer 16:
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 15 ( L15): 8,999 times ( 58.8%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 17 ( L17): 7,847 times ( 51.3%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 10 from layer 15 ( L15): 6,002 times ( 39.2%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 10 from layer 17 ( L17): 5,879 times ( 38.4%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 15 from layer 15 ( L15): 4,030 times ( 26.4%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO -
+ Layer 24:
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 23 ( L23): 9,213 times ( 60.2%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 9 from layer 23 ( L23): 7,912 times ( 51.7%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 25 ( L25): 6,819 times ( 44.6%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 9 from layer 25 ( L25): 6,403 times ( 41.9%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 24 (same): 4,350 times ( 28.4%)
+ 2026-04-03 13:14:37 - ReXMoE - INFO - ✅ Cross-layer expert reuse detected: 70.2% of routing uses adjacent layers
+ 2026-04-03 13:14:37 - ReXMoE - INFO -
+ [Step 5000] Saving checkpoint at eval_steps to ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3...
219
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.0.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
220
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.0.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
221
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.1.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
222
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.1.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
223
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.2.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
224
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.2.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
225
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.3.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
226
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.3.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
227
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.4.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
228
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.4.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
229
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.5.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
230
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.5.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
231
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.6.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
232
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.6.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
233
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.7.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
234
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.7.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
235
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.8.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
236
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.8.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
237
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.9.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
238
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.9.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
239
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.10.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
240
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.10.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
241
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.11.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
242
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.11.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
243
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.12.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
244
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.12.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
245
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.13.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
246
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.13.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
247
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.14.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
248
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.14.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
249
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.15.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
250
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.15.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
251
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.16.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
252
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.16.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
253
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.17.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
254
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.17.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
255
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.18.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
256
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.18.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
257
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.19.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
258
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.19.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
259
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.20.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
260
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.20.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
261
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.21.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
262
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.21.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
263
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.22.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
264
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.22.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
265
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.23.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
266
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.23.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
267
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.24.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
268
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.24.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
269
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.25.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
270
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.25.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
271
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.26.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
272
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.26.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
273
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.27.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
274
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.27.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
275
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.28.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
276
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.28.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
277
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.29.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
278
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.29.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
279
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.30.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
280
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.30.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
281
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.31.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
282
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.31.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
283
+ 2026-04-03 13:14:37 - ReXMoE - INFO - ✓ Saved trained router weights: 96 parameters
284
+ 2026-04-03 13:14:37 - ReXMoE - INFO - File: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/rexmoe_routers.pt
285
+ 2026-04-03 13:14:37 - ReXMoE - INFO - Size: 12.03 MB
286
+ 2026-04-03 13:14:37 - ReXMoE - INFO -
287
+ Also saving full model with ReXMoE architecture...
288
+ 2026-04-03 13:14:39 - ReXMoE - INFO -
289
+ Merging LoRA adapters into base weights and saving to: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/merged
290
+ 2026-04-03 13:15:00 - ReXMoE - INFO - ✓ Saved merged full model (base+routers+LoRA) for one-step loading
291
+ 2026-04-03 13:15:00 - ReXMoE - INFO -
292
+ ============================================================
293
+ 2026-04-03 13:15:00 - ReXMoE - INFO - Epoch 1 Summary:
294
+ 2026-04-03 13:15:00 - ReXMoE - INFO - Average LM Loss: 0.9498
295
+ 2026-04-03 13:15:00 - ReXMoE - INFO - Average Aux Loss: 0.008843
296
+ 2026-04-03 13:15:00 - ReXMoE - INFO - Average Total Loss: 0.9586
297
+ 2026-04-03 13:15:00 - ReXMoE - INFO - Final R: 3
298
+ 2026-04-03 13:15:00 - ReXMoE - INFO -
299
+ Evaluating model with 3 sample prompts...
300
+ 2026-04-03 13:15:02 - ReXMoE - INFO -
301
+ --- Prompt 1/3 ---
302
+ 2026-04-03 13:15:02 - ReXMoE - INFO - Instruction: What is the capital of France?
303
+ 2026-04-03 13:15:02 - ReXMoE - INFO - Input: None
304
+ 2026-04-03 13:15:02 - ReXMoE - INFO - Generated completion (len 9): The capital of France is Paris.
305
+ 2026-04-03 13:15:04 - ReXMoE - INFO -
306
+ --- Prompt 2/3 ---
307
+ 2026-04-03 13:15:04 - ReXMoE - INFO - Instruction: High-pressure systems stop air from rising into the colder regions of the atmosphere where water can condense. What will most likely result if a high-pressure system remains in an area for a long period of time?
308
+ A. fog
309
+ B. rain
310
+ C. drought
311
+ D. tornado
312
+ Answer:
313
+ 2026-04-03 13:15:04 - ReXMoE - INFO - Input: None
314
+ 2026-04-03 13:15:04 - ReXMoE - INFO - Generated completion (len 5): A. fog
315
+ 2026-04-03 13:15:05 - ReXMoE - INFO -
316
+ --- Prompt 3/3 ---
317
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Instruction: Given the fact: predators eat prey
318
+ Question: Predators eat
319
+ A. lions
320
+ B. humans
321
+ C. bunnies
322
+ D. grass
323
+ Answer:
324
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Input: None
325
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Generated completion (len 7): C. bunnies
326
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Evaluation of all 3 prompts complete.
327
+ 2026-04-03 13:15:05 - ReXMoE - INFO - New best epoch 1 with avg LM loss 0.9498 — saving checkpoint to ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3
328
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.0.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
329
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.0.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
330
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.1.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
331
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.1.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
332
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.2.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
333
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.2.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
334
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.3.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
335
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.3.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
336
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.4.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
337
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.4.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
338
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.5.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
339
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.5.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
340
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.6.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
341
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.6.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
342
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.7.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
343
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.7.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
344
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.8.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
345
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.8.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
346
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.9.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
347
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.9.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
348
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.10.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
349
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.10.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
350
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.11.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
351
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.11.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
352
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.12.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
353
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.12.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
354
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.13.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
355
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.13.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
356
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.14.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
357
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.14.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
358
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.15.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
359
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.15.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
360
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.16.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
361
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.16.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
362
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.17.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
363
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.17.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
364
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.18.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
365
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.18.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
366
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.19.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
367
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.19.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
368
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.20.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
369
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.20.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
370
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.21.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
371
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.21.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
372
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.22.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
373
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.22.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
374
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.23.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
375
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.23.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
376
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.24.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
377
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.24.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
378
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.25.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
379
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.25.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
380
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.26.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
381
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.26.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
382
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.27.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
383
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.27.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
384
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.28.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
385
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.28.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
386
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.29.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
387
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.29.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
388
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.30.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
389
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.30.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
390
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.31.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
391
+ 2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.31.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
392
+ 2026-04-03 13:15:06 - ReXMoE - INFO - ✓ Saved trained router weights: 96 parameters
393
+ 2026-04-03 13:15:06 - ReXMoE - INFO - File: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/rexmoe_routers.pt
394
+ 2026-04-03 13:15:06 - ReXMoE - INFO - Size: 12.03 MB
395
+ 2026-04-03 13:15:06 - ReXMoE - INFO -
396
+ Also saving full model with ReXMoE architecture...
397
+ 2026-04-03 13:15:06 - ReXMoE - INFO -
398
+ Merging LoRA adapters into base weights and saving to: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/merged
399
+ 2026-04-03 13:15:44 - ReXMoE - INFO - ✓ Saved merged full model (base+routers+LoRA) for one-step loading
400
+ 2026-04-03 13:15:44 - ReXMoE - INFO -
401
+ 📊 Convergence Metrics:
402
+ 2026-04-03 13:15:44 - ReXMoE - INFO - Convergence Metrics:
403
+ 2026-04-03 13:15:44 - ReXMoE - INFO - Avg Router Grad Norm: 0.084278
404
+ 2026-04-03 13:15:44 - ReXMoE - INFO - Current Learning Rate: 2.00e-05
405
+ 2026-04-03 13:15:44 - ReXMoE - INFO - ℹ️ Collecting convergence data (need 5 epochs minimum)...
406
+ 2026-04-03 13:15:44 - ReXMoE - INFO - Routing Pattern Analysis (Epoch 1):
407
+ 2026-04-03 13:15:59 - ReXMoE - INFO -
408
+ Analyzing ACTUAL routing patterns from 10 batches (17,341 tokens)
409
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Current reuse scale: R=3
410
+ 2026-04-03 13:15:59 - ReXMoE - INFO -
411
+ [IG-MET Pruning Report]:
412
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Global: 0/0 UNIQUE experts pruned (0.0%) | threshold=-1.000000
413
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Cross-Layer Routing Distribution (ACTUAL selections):
414
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Same layer (i): 869,591 ( 33.2%)
415
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Previous layer (i-1): 896,913 ( 34.2%)
416
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Next layer (i+1): 797,210 ( 30.4%)
417
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Distant layers: 57,726 ( 2.2%)
418
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Sample Layer-Specific Routing Patterns:
419
+ 2026-04-03 13:15:59 - ReXMoE - INFO -
420
+ Layer 8:
421
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 7 from layer 9 ( L9): 6,917 times ( 39.9%)
422
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 14 from layer 9 ( L9): 6,553 times ( 37.8%)
423
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 7 from layer 7 ( L7): 6,305 times ( 36.4%)
424
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 14 from layer 7 ( L7): 5,503 times ( 31.7%)
425
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 2 from layer 7 ( L7): 3,988 times ( 23.0%)
426
+ 2026-04-03 13:15:59 - ReXMoE - INFO -
427
+ Layer 16:
428
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 8 from layer 15 ( L15): 8,873 times ( 51.2%)
429
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 8 from layer 17 ( L17): 8,226 times ( 47.4%)
430
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 10 from layer 15 ( L15): 5,752 times ( 33.2%)
431
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 10 from layer 17 ( L17): 4,996 times ( 28.8%)
432
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 9 from layer 16 (same): 3,718 times ( 21.4%)
433
+ 2026-04-03 13:15:59 - ReXMoE - INFO -
434
+ Layer 24:
435
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 8 from layer 23 ( L23): 9,676 times ( 55.8%)
436
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 8 from layer 25 ( L25): 7,087 times ( 40.9%)
437
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 9 from layer 23 ( L23): 6,982 times ( 40.3%)
438
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 9 from layer 25 ( L25): 4,908 times ( 28.3%)
439
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Expert 4 from layer 24 (same): 3,902 times ( 22.5%)
440
+ 2026-04-03 13:15:59 - ReXMoE - INFO - ✅ Cross-layer expert reuse detected: 66.8% of routing uses adjacent layers
441
+ 2026-04-03 13:15:59 - ReXMoE - INFO - LR stepped to: 2.00e-05
442
+ 2026-04-03 13:15:59 - ReXMoE - INFO - ================================================================================
443
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Training Convergence Summary
444
+ 2026-04-03 13:15:59 - ReXMoE - INFO - ================================================================================
445
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Router Gradient Norms Over Epochs:
446
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Epoch 1: 0.084278
447
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Auxiliary Loss Over Epochs:
448
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Epoch 1: 0.008843
449
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Convergence Status: Insufficient data (< 5 epochs)
450
+ 2026-04-03 13:15:59 - ReXMoE - INFO -
451
+ Saving trained router weights only...
452
+ 2026-04-03 13:15:59 - ReXMoE - INFO - ✓ Saved trained router weights: 96 parameters
453
+ 2026-04-03 13:15:59 - ReXMoE - INFO - File: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/rexmoe_routers.pt
454
+ 2026-04-03 13:15:59 - ReXMoE - INFO - Size: 12.03 MB
455
+ 2026-04-03 13:15:59 - ReXMoE - INFO -
456
+ Also saving full model with ReXMoE architecture...
457
+ 2026-04-03 13:16:00 - ReXMoE - INFO -
458
+ Merging LoRA adapters into base weights and saving to: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/merged
459
+ 2026-04-03 13:16:32 - ReXMoE - INFO - ✓ Saved merged full model (base+routers+LoRA) for one-step loading
460
+ 2026-04-03 13:16:32 - ReXMoE - INFO - ================================================================================
461
+ 2026-04-03 13:16:32 - ReXMoE - INFO - ✓ Training complete. Two checkpoint formats saved:
462
+ 2026-04-03 13:16:32 - ReXMoE - INFO - 1. Router weights only: rexmoe_routers.pt (portable)
463
+ 2026-04-03 13:16:32 - ReXMoE - INFO - 2. Full model: pytorch_model.bin (requires rexmoe_architecture.py)
464
+ 2026-04-03 13:16:32 - ReXMoE - INFO -
465
+ Checkpoint directory: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3
466
+ 2026-04-03 13:16:32 - ReXMoE - INFO - Full model size: 0.00 GB
467
+ 2026-04-03 13:16:32 - ReXMoE - INFO - ================================================================================
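Reviewer note: the percentages in the "Cross-Layer Routing Distribution" block of the log above can be rechecked from the logged selection counts alone. The sketch below is not part of the ReXMoE code; the dictionary keys and variable names are illustrative assumptions, and only the four counts are taken from the log.

```python
# Selection counts copied from the epoch 1 routing report in the log.
counts = {
    "same": 869_591,       # Same layer (i)
    "previous": 896_913,   # Previous layer (i-1)
    "next": 797_210,       # Next layer (i+1)
    "distant": 57_726,     # Distant layers
}

total = sum(counts.values())
shares = {name: 100.0 * n / total for name, n in counts.items()}

# Share routed strictly to the two neighbouring layers.
adjacent_share = shares["previous"] + shares["next"]
# Share routed to any layer other than the current one.
non_same_share = 100.0 - shares["same"]

for name, pct in shares.items():
    print(f"{name:>8}: {counts[name]:>9,} ({pct:5.1f}%)")
print(f"adjacent (i-1, i+1): {adjacent_share:.1f}%")
print(f"not same layer:      {non_same_share:.1f}%")
```

Under this arithmetic the strictly adjacent share comes to about 64.6%, while the 66.8% quoted in the "Cross-layer expert reuse detected" line appears to match the share of all selections that are not same-layer, which would include the 2.2% routed to distant layers.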
merged/chat_template.jinja ADDED
@@ -0,0 +1,4 @@
+ {% for message in messages %}{{'<|' + message['role'] + '|>' + '
+ ' + message['content'] + '<|end|>
+ ' }}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>
+ ' }}{% else %}{{ eos_token }}{% endif %}
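Reviewer note: to make the prompt layout produced by this template concrete, here is a hand-written Python mirror of its logic. It is a sketch for illustration only: `render_chat` is a hypothetical helper, not how `transformers` evaluates the template, and the default `eos_token` is assumed from `merged/special_tokens_map.json` in this commit.

```python
def render_chat(messages, add_generation_prompt=True, eos_token="<|endoftext|>"):
    """Mirror of the Jinja template: one '<|role|>\\ncontent<|end|>\\n' chunk per message."""
    parts = [
        f"<|{message['role']}|>\n{message['content']}<|end|>\n"
        for message in messages
    ]
    # The template ends with an assistant header when a reply is expected,
    # and with the EOS token otherwise.
    parts.append("<|assistant|>\n" if add_generation_prompt else eos_token)
    return "".join(parts)


prompt = render_chat([{"role": "user", "content": "What is the capital of France?"}])
print(prompt)
```

For real use, the tokenizer's own `apply_chat_template` method is the safer route, since it evaluates the shipped template directly.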
merged/config.json ADDED
@@ -0,0 +1,41 @@
+ {
+   "architectures": [
+     "PhimoeForCausalLM"
+   ],
+   "attention_bias": true,
+   "attention_dropout": 0.0,
+   "auto_map": {
+     "AutoConfig": "configuration_slimmoe.PhiMoEConfig",
+     "AutoModelForCausalLM": "modeling_slimmoe.PhiMoEForCausalLM"
+   },
+   "bos_token_id": 1,
+   "dtype": "bfloat16",
+   "eos_token_id": 32000,
+   "expert_dropout": 0.0,
+   "head_dim": 128,
+   "hidden_act": "silu",
+   "hidden_dropout": 0.0,
+   "hidden_size": 4096,
+   "initializer_range": 0.02,
+   "input_jitter_noise": 0.01,
+   "intermediate_size": 960,
+   "lm_head_bias": true,
+   "max_position_embeddings": 4096,
+   "model_type": "phimoe",
+   "num_attention_heads": 32,
+   "num_experts_per_tok": 2,
+   "num_hidden_layers": 32,
+   "num_key_value_heads": 8,
+   "num_local_experts": 16,
+   "output_router_logits": false,
+   "rms_norm_eps": 1e-05,
+   "rope_scaling": null,
+   "rope_theta": 10000.0,
+   "router_aux_loss_coef": 0.0,
+   "router_jitter_noise": 0.01,
+   "sliding_window": 2047,
+   "tie_word_embeddings": false,
+   "transformers_version": "4.57.3",
+   "use_cache": true,
+   "vocab_size": 32064
+ }
merged/generation_config.json ADDED
@@ -0,0 +1,11 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 1,
+   "eos_token_id": [
+     32000,
+     32001,
+     32007
+   ],
+   "pad_token_id": 32000,
+   "transformers_version": "4.57.3"
+ }
merged/model-00001-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6484d5015f8ea3efdd33cefc1936368eddd1c2dcbf11e56748ef7479d2d8438d
+ size 4996706662
merged/model-00002-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:327e8d8adf238d2ec2790faccbfc32e82a4c00171648c52ef221ccc458558323
+ size 4997911740
merged/model-00003-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e1e6c4947cb7311b368c5d85243655b096db10bb6daf7432c3527ab912c79986
+ size 4999325054
merged/model-00004-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:097bbe013b37437b336e70844d6910a88d9956f3b292a8310f795d21946e11b4
+ size 309969096
merged/model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
merged/special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
merged/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
merged/tokenizer_config.json ADDED
@@ -0,0 +1,131 @@
+ {
+   "add_bos_token": false,
+   "add_eos_token": false,
+   "add_prefix_space": null,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": false
+     },
+     "32000": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "32001": {
+       "content": "<|assistant|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": true
+     },
+     "32002": {
+       "content": "<|placeholder1|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": true
+     },
+     "32003": {
+       "content": "<|placeholder2|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": true
+     },
+     "32004": {
+       "content": "<|placeholder3|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": true
+     },
+     "32005": {
+       "content": "<|placeholder4|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": true
+     },
+     "32006": {
+       "content": "<|system|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": true
+     },
+     "32007": {
+       "content": "<|end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": true
+     },
+     "32008": {
+       "content": "<|placeholder5|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": true
+     },
+     "32009": {
+       "content": "<|placeholder6|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": true
+     },
+     "32010": {
+       "content": "<|user|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|endoftext|>",
+   "extra_special_tokens": {},
+   "legacy": false,
+   "model_max_length": 4096,
+   "pad_token": "<|endoftext|>",
+   "padding_side": "left",
+   "sp_model_kwargs": {},
+   "tokenizer_class": "LlamaTokenizerFast",
+   "unk_token": "<unk>",
+   "use_default_system_prompt": false
+ }
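This tokenizer config sets `"padding_side": "left"` and reuses `<|endoftext|>` (ID 32000) as both `pad_token` and `eos_token`. Because the two share an ID, only the attention mask tells padding apart from a real end-of-text token. The sketch below illustrates left padding with a matching mask in plain Python. The function name `left_pad` is hypothetical, and this is not the tokenizer's own code path.

```python
# pad_token is "<|endoftext|>", which the added_tokens_decoder maps to ID 32000.
PAD_TOKEN_ID = 32000


def left_pad(batch, pad_id=PAD_TOKEN_ID):
    """Pad every sequence on the left to the longest length in the batch."""
    width = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = width - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        # 0 marks padding, 1 marks real tokens.
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


ids, mask = left_pad([[1, 2, 3], [4]])
print(ids)
print(mask)
```

Left padding keeps the last real token of every prompt in the final position, which is where a decoder-only model continues generating.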
rexmoe_architecture.py ADDED
The diff for this file is too large to render. See raw diff
 
rexmoe_routers.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b5d3ffa393ddbb18257baf74cb112aaa8f83f8291906d7763a9a43ea53b0cd98
+ size 12618290
special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,131 @@
+ {
+   "add_bos_token": false,
+   "add_eos_token": false,
+   "add_prefix_space": null,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": false
+     },
+     "32000": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "32001": {
+       "content": "<|assistant|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": true
+     },
+     "32002": {
+       "content": "<|placeholder1|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": true
+     },
+     "32003": {
+       "content": "<|placeholder2|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": true
+     },
+     "32004": {
+       "content": "<|placeholder3|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": true
+     },
+     "32005": {
+       "content": "<|placeholder4|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": true
+     },
+     "32006": {
+       "content": "<|system|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": true
+     },
+     "32007": {
+       "content": "<|end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": true
+     },
+     "32008": {
+       "content": "<|placeholder5|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": true
+     },
+     "32009": {
+       "content": "<|placeholder6|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": true
+     },
+     "32010": {
+       "content": "<|user|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|endoftext|>",
+   "extra_special_tokens": {},
+   "legacy": false,
+   "model_max_length": 4096,
+   "pad_token": "<|endoftext|>",
+   "padding_side": "left",
+   "sp_model_kwargs": {},
+   "tokenizer_class": "LlamaTokenizerFast",
+   "unk_token": "<unk>",
+   "use_default_system_prompt": false
+ }