Instructions to use PakNin/Reuse-Trained-R3 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- PEFT
How to use PakNin/Reuse-Trained-R3 with PEFT:
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base_model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-mini-MoE-instruct")
    model = PeftModel.from_pretrained(base_model, "PakNin/Reuse-Trained-R3")
- Transformers
How to use PakNin/Reuse-Trained-R3 with Transformers:
    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("text-generation", model="PakNin/Reuse-Trained-R3")
    messages = [
        {"role": "user", "content": "Who are you?"},
    ]
    pipe(messages)

    # Load model directly
    from transformers import AutoModel

    model = AutoModel.from_pretrained("PakNin/Reuse-Trained-R3", dtype="auto")
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use PakNin/Reuse-Trained-R3 with vLLM:
Install from pip and serve model
    # Install vLLM from pip:
    pip install vllm

    # Start the vLLM server:
    vllm serve "PakNin/Reuse-Trained-R3"

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:8000/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "PakNin/Reuse-Trained-R3",
            "messages": [
                {
                    "role": "user",
                    "content": "What is the capital of France?"
                }
            ]
        }'
Use Docker
docker model run hf.co/PakNin/Reuse-Trained-R3
- SGLang
How to use PakNin/Reuse-Trained-R3 with SGLang:
Install from pip and serve model
    # Install SGLang from pip:
    pip install sglang

    # Start the SGLang server:
    python3 -m sglang.launch_server \
        --model-path "PakNin/Reuse-Trained-R3" \
        --host 0.0.0.0 \
        --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "PakNin/Reuse-Trained-R3",
            "messages": [
                {
                    "role": "user",
                    "content": "What is the capital of France?"
                }
            ]
        }'
Use Docker images
    docker run --gpus all \
        --shm-size 32g \
        -p 30000:30000 \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        --env "HF_TOKEN=<secret>" \
        --ipc=host \
        lmsysorg/sglang:latest \
        python3 -m sglang.launch_server \
            --model-path "PakNin/Reuse-Trained-R3" \
            --host 0.0.0.0 \
            --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "PakNin/Reuse-Trained-R3",
            "messages": [
                {
                    "role": "user",
                    "content": "What is the capital of France?"
                }
            ]
        }'
- Docker Model Runner
How to use PakNin/Reuse-Trained-R3 with Docker Model Runner:
docker model run hf.co/PakNin/Reuse-Trained-R3
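The curl calls above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `post_chat` are illustrative and not part of vLLM or SGLang, and the default URL assumes the vLLM server from the instructions above is listening on port 8000 (the SGLang examples use port 30000).

```python
import json
import urllib.request

MODEL_ID = "PakNin/Reuse-Trained-R3"


def build_chat_request(prompt, model=MODEL_ID):
    """Build the JSON body for an OpenAI-compatible chat completion call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def post_chat(prompt, base_url="http://localhost:8000"):
    """POST the request to /v1/chat/completions and return the parsed JSON reply.

    Needs a running server; nothing is sent until this function is called.
    """
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `post_chat("What is the capital of France?")["choices"][0]["message"]["content"]` should hold the reply text, following the OpenAI chat completion response layout.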
Upload folder using huggingface_hub
- .gitattributes +1 -0
- README.md +207 -0
- adapter_config.json +46 -0
- adapter_model.safetensors +3 -0
- chat_template.jinja +4 -0
- logs/aux_loss_compare.png +0 -0
- logs/aux_loss_curve.png +0 -0
- logs/loss_compare.png +3 -0
- logs/loss_curve.png +0 -0
- logs/rexmoe_training_0304_033137 copy.log +467 -0
- logs/rexmoe_training_0304_033137 copy_aux_corrected.log +467 -0
- logs/rexmoe_training_0304_033137.log +467 -0
- merged/chat_template.jinja +4 -0
- merged/config.json +41 -0
- merged/generation_config.json +11 -0
- merged/model-00001-of-00004.safetensors +3 -0
- merged/model-00002-of-00004.safetensors +3 -0
- merged/model-00003-of-00004.safetensors +3 -0
- merged/model-00004-of-00004.safetensors +3 -0
- merged/model.safetensors.index.json +0 -0
- merged/special_tokens_map.json +30 -0
- merged/tokenizer.json +0 -0
- merged/tokenizer_config.json +131 -0
- rexmoe_architecture.py +0 -0
- rexmoe_routers.pt +3 -0
- special_tokens_map.json +30 -0
- tokenizer.json +0 -0
- tokenizer_config.json +131 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+logs/loss_compare.png filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,207 @@
+---
+base_model: microsoft/Phi-mini-MoE-instruct
+library_name: peft
+pipeline_tag: text-generation
+tags:
+- base_model:adapter:microsoft/Phi-mini-MoE-instruct
+- lora
+- transformers
+---
+
+# Model Card for Model ID
+
+<!-- Provide a quick summary of what the model is/does. -->
+
+
+
+## Model Details
+
+### Model Description
+
+<!-- Provide a longer summary of what this model is. -->
+
+
+
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+
+### Model Sources [optional]
+
+<!-- Provide the basic links for the model. -->
+
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+
+## Uses
+
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+### Direct Use
+
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+
+[More Information Needed]
+
+### Downstream Use [optional]
+
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+
+[More Information Needed]
+
+### Out-of-Scope Use
+
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+[More Information Needed]
+
+## Bias, Risks, and Limitations
+
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+[More Information Needed]
+
+### Recommendations
+
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+## How to Get Started with the Model
+
+Use the code below to get started with the model.
+
+[More Information Needed]
+
+## Training Details
+
+### Training Data
+
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+[More Information Needed]
+
+### Training Procedure
+
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+#### Preprocessing [optional]
+
+[More Information Needed]
+
+
+#### Training Hyperparameters
+
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+#### Speeds, Sizes, Times [optional]
+
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+[More Information Needed]
+
+## Evaluation
+
+<!-- This section describes the evaluation protocols and provides the results. -->
+
+### Testing Data, Factors & Metrics
+
+#### Testing Data
+
+<!-- This should link to a Dataset Card if possible. -->
+
+[More Information Needed]
+
+#### Factors
+
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+[More Information Needed]
+
+#### Metrics
+
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+[More Information Needed]
+
+### Results
+
+[More Information Needed]
+
+#### Summary
+
+
+
+## Model Examination [optional]
+
+<!-- Relevant interpretability work for the model goes here -->
+
+[More Information Needed]
+
+## Environmental Impact
+
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+
+## Technical Specifications [optional]
+
+### Model Architecture and Objective
+
+[More Information Needed]
+
+### Compute Infrastructure
+
+[More Information Needed]
+
+#### Hardware
+
+[More Information Needed]
+
+#### Software
+
+[More Information Needed]
+
+## Citation [optional]
+
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+**BibTeX:**
+
+[More Information Needed]
+
+**APA:**
+
+[More Information Needed]
+
+## Glossary [optional]
+
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+[More Information Needed]
+
+## More Information [optional]
+
+[More Information Needed]
+
+## Model Card Authors [optional]
+
+[More Information Needed]
+
+## Model Card Contact
+
+[More Information Needed]
+### Framework versions
+
+- PEFT 0.18.1
adapter_config.json
ADDED
@@ -0,0 +1,46 @@
+{
+  "alora_invocation_tokens": null,
+  "alpha_pattern": {},
+  "arrow_config": null,
+  "auto_mapping": null,
+  "base_model_name_or_path": "microsoft/Phi-mini-MoE-instruct",
+  "bias": "none",
+  "corda_config": null,
+  "ensure_weight_tying": false,
+  "eva_config": null,
+  "exclude_modules": null,
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 32,
+  "lora_bias": false,
+  "lora_dropout": 0.0,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": null,
+  "peft_type": "LORA",
+  "peft_version": "0.18.1",
+  "qalora_group_size": 16,
+  "r": 16,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": [
+    "q_proj",
+    "w3",
+    "o_proj",
+    "w2",
+    "w1",
+    "v_proj",
+    "k_proj"
+  ],
+  "target_parameters": null,
+  "task_type": "CAUSAL_LM",
+  "trainable_token_indices": null,
+  "use_dora": false,
+  "use_qalora": false,
+  "use_rslora": false
+}
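For orientation, the adapter settings in adapter_config.json translate into a concrete LoRA scaling factor. The sketch below assumes the standard LoRA convention (scale = `lora_alpha / r`, or `lora_alpha / sqrt(r)` when `use_rslora` is set); the `lora_scaling` helper is illustrative, and only the fields it reads are copied from the config. Treating `w1`, `w2`, `w3` as the expert projections is an assumption based on the module naming.

```python
import json
import math

# Subset of adapter_config.json, copied from the diff; other keys are omitted.
ADAPTER_CONFIG = json.loads("""
{
  "lora_alpha": 32,
  "r": 16,
  "use_rslora": false,
  "target_modules": ["q_proj", "w3", "o_proj", "w2", "w1", "v_proj", "k_proj"]
}
""")


def lora_scaling(config):
    """Return the factor applied to the low-rank update B @ A."""
    if config.get("use_rslora"):
        return config["lora_alpha"] / math.sqrt(config["r"])
    return config["lora_alpha"] / config["r"]


scaling = lora_scaling(ADAPTER_CONFIG)
# Split the targets by naming pattern: attention projections vs. w1/w2/w3.
attention = [m for m in ADAPTER_CONFIG["target_modules"] if m.endswith("_proj")]
experts = [m for m in ADAPTER_CONFIG["target_modules"] if m.startswith("w")]
print(scaling, attention, experts)
```

With `lora_alpha` 32 and `r` 16 the factor is 2.0, so each adapted weight is the frozen weight plus twice the low-rank product.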
adapter_model.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:60d95b10b6e140a9626a7058d5038528f2ff80148dc4569b881db56052046509
+size 40
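The three lines committed for adapter_model.safetensors are a Git LFS pointer rather than the weights themselves. As a rough sketch of how such a pointer can be read, the hypothetical helper `parse_lfs_pointer` below splits each `key value` line and separates the hash algorithm from the digest; the pointer text is copied from the diff.

```python
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:60d95b10b6e140a9626a7058d5038528f2ff80148dc4569b881db56052046509
size 40
"""


def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict of its key/value lines."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["hash_algorithm"], len(pointer["digest"]), pointer["size_bytes"])
```

The `size` field records the byte size of the real object, so this pointer describes a 40-byte file.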
chat_template.jinja
ADDED
@@ -0,0 +1,4 @@
+{% for message in messages %}{{'<|' + message['role'] + '|>' + '
+' + message['content'] + '<|end|>
+' }}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>
+' }}{% else %}{{ eos_token }}{% endif %}
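To make the template's output concrete, here is a plain-Python sketch of the same formatting logic. It is an illustrative re-implementation, not the code path Transformers uses (in practice `tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)` renders the Jinja file). The function name `render_chat` and the default `eos_token` value are assumptions; the real end-of-sequence token comes from the tokenizer configuration.

```python
def render_chat(messages, add_generation_prompt=False, eos_token="<|endoftext|>"):
    """Plain-Python equivalent of the four-line chat_template.jinja."""
    parts = []
    for message in messages:
        # Each turn: role tag, newline, content, end-of-turn tag, newline.
        parts.append(
            "<|" + message["role"] + "|>" + "\n" + message["content"] + "<|end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        parts.append("<|assistant|>\n")
    else:
        # Otherwise close the sequence with the end-of-sequence token.
        parts.append(eos_token)
    return "".join(parts)


prompt = render_chat(
    [{"role": "user", "content": "What is the capital of France?"}],
    add_generation_prompt=True,
)
print(repr(prompt))
# '<|user|>\nWhat is the capital of France?<|end|>\n<|assistant|>\n'
```

Note that the template appends `eos_token` only when `add_generation_prompt` is false, which is the form typically used for training examples rather than inference prompts.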
logs/aux_loss_compare.png
ADDED
logs/aux_loss_curve.png
ADDED
logs/loss_compare.png
ADDED
logs/loss_curve.png
ADDED
logs/rexmoe_training_0304_033137 copy.log
ADDED
|
@@ -0,0 +1,467 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
2026-04-03 03:31:37 - ReXMoE - INFO - ================================================================================
|
| 2 |
+
2026-04-03 03:31:37 - ReXMoE - INFO - ReXMoE Training Log - 0304_033137
|
| 3 |
+
2026-04-03 03:31:37 - ReXMoE - INFO - Log file: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/logs/rexmoe_training_0304_033137.log
|
| 4 |
+
2026-04-03 03:31:37 - ReXMoE - INFO - ================================================================================
|
| 5 |
+
2026-04-03 03:31:37 - ReXMoE - INFO - ================================================================================
|
| 6 |
+
2026-04-03 03:31:37 - ReXMoE - INFO - ReXMoE Cross-Layer Expert Reuse Training
|
| 7 |
+
2026-04-03 03:31:37 - ReXMoE - INFO - ================================================================================
|
| 8 |
+
2026-04-03 03:31:37 - ReXMoE - INFO - MET enabled: False
|
| 9 |
+
2026-04-03 03:31:37 - ReXMoE - INFO -
|
| 10 |
+
Configuration:
|
| 11 |
+
Model: microsoft/Phi-mini-MoE-instruct
|
| 12 |
+
Dataset: ../dataset/alpaca_data_cleaned.json
|
| 13 |
+
Dataset mode: IF_2
|
| 14 |
+
Reuse Scale (R): 3
|
| 15 |
+
Prune Ratio (MET): N/A
|
| 16 |
+
Epochs: 1
|
| 17 |
+
Num of samples: 20000
|
| 18 |
+
Batch Size: 4
|
| 19 |
+
Sequence Length: 1024
|
| 20 |
+
Learning Rate: 2e-05
|
| 21 |
+
PSR Enabled: True
|
| 22 |
+
LR Scheduler: True
|
| 23 |
+
Save Path: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3
|
| 24 |
+
Gradient Checkpointing: False
|
| 25 |
+
LoRA Rank: 16 (Full LoRA: True)
|
| 26 |
+
LoRA Alpha: 32
|
| 27 |
+
MET Enabled: False (Mask Ratio: 0.1, Warmup: 0.5)
|
| 28 |
+
Log File: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/logs/rexmoe_training_0304_033137.log
|
| 29 |
+
Aux loss weight: 0.05
|
| 30 |
+
|
| 31 |
+
2026-04-03 03:31:37 - ReXMoE - INFO - 💻 Using device: cuda)
|
| 32 |
+
2026-04-03 03:31:37 - ReXMoE - INFO - GPU: NVIDIA RTX A6000, Memory: 47.53 GB
|
| 33 |
+
2026-04-03 03:31:43 - ReXMoE - INFO - [5/7] Setting up optimizer and dataset...
|
| 34 |
+
2026-04-03 03:31:43 - ReXMoE - INFO - Using 8-bit AdamW optimizer
|
| 35 |
+
2026-04-03 03:31:43 - ReXMoE - INFO - LR Scheduler: CosineAnnealingLR (2e-05 → 2.0000000000000003e-06)
|
| 36 |
+
2026-04-03 03:31:51 - ReXMoE - INFO -
|
| 37 |
+
First batch statistics:
|
| 38 |
+
2026-04-03 03:31:51 - ReXMoE - INFO - LM Loss: 1.0094
|
| 39 |
+
2026-04-03 03:31:51 - ReXMoE - INFO - Aux Loss: 0.092773
|
| 40 |
+
2026-04-03 03:31:51 - ReXMoE - INFO - Total Loss: 1.1022
|
| 41 |
+
2026-04-03 03:31:51 - ReXMoE - INFO - Current R: 2
|
| 42 |
+
2026-04-03 03:31:51 - ReXMoE - INFO - Active experts per layer: 32
|
| 43 |
+
2026-04-03 03:31:51 - ReXMoE - INFO - Gradient norm: 1.0000
|
| 44 |
+
2026-04-03 03:31:51 - ReXMoE - INFO -
|
| 45 |
+
|
| 46 |
+
2026-04-03 03:35:09 - ReXMoE - INFO - [50/5000] loss=1.1939 aux=0.062988 R=2
|
| 47 |
+
2026-04-03 03:38:21 - ReXMoE - INFO - [100/5000] loss=1.1803 aux=0.040039 R=2
|
| 48 |
+
2026-04-03 03:41:36 - ReXMoE - INFO - [150/5000] loss=1.2968 aux=0.036621 R=2
|
| 49 |
+
2026-04-03 03:44:50 - ReXMoE - INFO - [200/5000] loss=1.2447 aux=0.028198 R=2
|
| 50 |
+
2026-04-03 03:48:01 - ReXMoE - INFO - [250/5000] loss=1.1971 aux=0.034180 R=2
|
| 51 |
+
2026-04-03 03:51:10 - ReXMoE - INFO - [300/5000] loss=2.1766 aux=0.024658 R=2
|
| 52 |
+
2026-04-03 03:54:19 - ReXMoE - INFO - [350/5000] loss=1.1092 aux=0.017578 R=2
|
| 53 |
+
2026-04-03 03:57:29 - ReXMoE - INFO - [400/5000] loss=0.9343 aux=0.024414 R=2
|
| 54 |
+
2026-04-03 04:00:40 - ReXMoE - INFO - [450/5000] loss=1.2180 aux=0.045410 R=2
|
| 55 |
+
2026-04-03 04:03:47 - ReXMoE - INFO - Warmup completed at step 500. Enabling FULL QLoRA with r = 16 and alpha = 32 on experts and updating optimizer...
|
| 56 |
+
2026-04-03 04:03:51 - ReXMoE - INFO - Trainable params (routers + LoRA): 144179200 (1.8509%)
|
| 57 |
+
2026-04-03 04:03:51 - ReXMoE - INFO - Sample trainable params after QLoRA: ['base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.k_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.k_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.o_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.o_proj.lora_B.default.weight', 'base_model.model.model.layers.0.block_sparse_moe.gate.weight', 'base_model.model.model.layers.0.block_sparse_moe.experts.0.w1.lora_A.default.weight']
|
| 58 |
+
2026-04-03 04:03:58 - ReXMoE - INFO - [500/5000] loss=1.0733 aux=0.036621 R=2
|
| 59 |
+
2026-04-03 04:09:15 - ReXMoE - INFO - [550/5000] loss=0.6253 aux=0.014709 R=2
|
| 60 |
+
2026-04-03 04:14:28 - ReXMoE - INFO - [600/5000] loss=1.5688 aux=0.010986 R=2
|
| 61 |
+
2026-04-03 04:19:39 - ReXMoE - INFO - [650/5000] loss=0.7864 aux=0.016357 R=2
|
| 62 |
+
2026-04-03 04:24:52 - ReXMoE - INFO - [700/5000] loss=1.5303 aux=0.010681 R=2
|
| 63 |
+
2026-04-03 04:30:02 - ReXMoE - INFO - [750/5000] loss=1.0098 aux=0.007812 R=2
|
| 64 |
+
2026-04-03 04:35:13 - ReXMoE - INFO - [800/5000] loss=1.0523 aux=0.014282 R=2
|
| 65 |
+
2026-04-03 04:40:24 - ReXMoE - INFO - [850/5000] loss=0.6447 aux=0.009094 R=2
|
| 66 |
+
2026-04-03 04:45:37 - ReXMoE - INFO - [900/5000] loss=0.7665 aux=0.004822 R=2
|
| 67 |
+
2026-04-03 04:50:50 - ReXMoE - INFO - [950/5000] loss=0.7762 aux=0.005737 R=2
|
| 68 |
+
2026-04-03 04:56:03 - ReXMoE - INFO - [1000/5000] loss=1.0254 aux=0.003571 R=2
|
| 69 |
+
2026-04-03 05:01:16 - ReXMoE - INFO - [1050/5000] loss=1.1320 aux=0.005737 R=2
|
| 70 |
+
2026-04-03 05:06:28 - ReXMoE - INFO - [1100/5000] loss=0.7519 aux=0.004974 R=2
|
| 71 |
+
2026-04-03 05:11:40 - ReXMoE - INFO - [1150/5000] loss=0.8246 aux=0.003204 R=2
|
| 72 |
+
2026-04-03 05:16:55 - ReXMoE - INFO - [1200/5000] loss=1.0041 aux=0.006042 R=2
|
| 73 |
+
2026-04-03 05:22:09 - ReXMoE - INFO - [1250/5000] loss=0.6804 aux=0.005859 R=2
|
| 74 |
+
2026-04-03 05:27:21 - ReXMoE - INFO - [1300/5000] loss=0.9695 aux=0.011108 R=2
|
| 75 |
+
2026-04-03 05:32:33 - ReXMoE - INFO - [1350/5000] loss=1.0448 aux=0.012634 R=2
|
| 76 |
+
2026-04-03 05:37:45 - ReXMoE - INFO - [1400/5000] loss=0.7468 aux=0.002136 R=2
|
| 77 |
+
2026-04-03 05:42:58 - ReXMoE - INFO - [1450/5000] loss=1.6307 aux=0.003510 R=2
|
| 78 |
+
2026-04-03 05:48:10 - ReXMoE - INFO - [1500/5000] loss=1.1833 aux=0.002625 R=2
|
| 79 |
+
2026-04-03 05:53:21 - ReXMoE - INFO - [1550/5000] loss=0.9216 aux=0.002991 R=2
|
| 80 |
+
2026-04-03 05:58:33 - ReXMoE - INFO - [1600/5000] loss=0.5969 aux=0.003708 R=2
|
| 81 |
+
2026-04-03 06:03:46 - ReXMoE - INFO - [1650/5000] loss=0.5240 aux=0.002518 R=2
|
| 82 |
+
2026-04-03 06:08:58 - ReXMoE - INFO - [1700/5000] loss=0.7681 aux=0.001785 R=2
|
| 83 |
+
2026-04-03 06:14:09 - ReXMoE - INFO - [1750/5000] loss=1.0812 aux=0.002899 R=2
|
| 84 |
+
2026-04-03 06:19:21 - ReXMoE - INFO - [1800/5000] loss=0.8171 aux=0.010986 R=2
|
| 85 |
+
2026-04-03 06:24:34 - ReXMoE - INFO - [1850/5000] loss=0.9029 aux=0.005371 R=2
|
| 86 |
+
2026-04-03 06:29:46 - ReXMoE - INFO - [1900/5000] loss=1.0440 aux=0.001839 R=2
|
| 87 |
+
2026-04-03 06:35:00 - ReXMoE - INFO - [1950/5000] loss=1.2026 aux=0.005096 R=2
|
| 88 |
+
2026-04-03 06:40:13 - ReXMoE - INFO - [2000/5000] loss=0.7174 aux=0.003372 R=2
|
| 89 |
+
2026-04-03 06:45:25 - ReXMoE - INFO - [2050/5000] loss=1.5737 aux=0.003571 R=2
|
| 90 |
+
2026-04-03 06:50:37 - ReXMoE - INFO - [2100/5000] loss=0.8508 aux=0.003403 R=2
|
| 91 |
+
2026-04-03 06:55:51 - ReXMoE - INFO - [2150/5000] loss=0.7965 aux=0.001656 R=2
|
| 92 |
+
2026-04-03 07:01:02 - ReXMoE - INFO - [2200/5000] loss=1.3079 aux=0.002747 R=2
|
| 93 |
+
2026-04-03 07:06:14 - ReXMoE - INFO - [2250/5000] loss=0.9750 aux=0.002228 R=2
|
| 94 |
+
2026-04-03 07:11:28 - ReXMoE - INFO - [2300/5000] loss=0.9549 aux=0.002228 R=2
|
| 95 |
+
2026-04-03 07:16:40 - ReXMoE - INFO - [2350/5000] loss=1.2216 aux=0.004089 R=2
|
| 96 |
+
2026-04-03 07:21:53 - ReXMoE - INFO - [2400/5000] loss=0.9801 aux=0.002289 R=2
|
| 97 |
+
2026-04-03 07:27:07 - ReXMoE - INFO - [2450/5000] loss=1.6587 aux=0.001602 R=2
|
| 98 |
+
2026-04-03 07:32:23 - ReXMoE - INFO - [2500/5000] loss=1.7420 aux=0.014648 R=3
|
| 99 |
+
2026-04-03 07:39:14 - ReXMoE - INFO - [2550/5000] loss=1.0498 aux=0.001801 R=3
|
| 100 |
+
2026-04-03 07:46:08 - ReXMoE - INFO - [2600/5000] loss=0.7848 aux=0.002792 R=3
|
| 101 |
+
2026-04-03 07:53:01 - ReXMoE - INFO - [2650/5000] loss=0.6119 aux=0.000992 R=3
|
| 102 |
+
2026-04-03 07:59:55 - ReXMoE - INFO - [2700/5000] loss=1.0871 aux=0.002014 R=3
|
| 103 |
+
2026-04-03 08:06:48 - ReXMoE - INFO - [2750/5000] loss=1.0422 aux=0.001411 R=3
|
| 104 |
+
2026-04-03 08:13:45 - ReXMoE - INFO - [2800/5000] loss=1.0147 aux=0.002762 R=3
|
| 105 |
+
2026-04-03 08:20:37 - ReXMoE - INFO - [2850/5000] loss=0.6756 aux=0.001953 R=3
|
| 106 |
+
2026-04-03 08:27:28 - ReXMoE - INFO - [2900/5000] loss=0.6243 aux=0.001671 R=3
|
| 107 |
+
2026-04-03 08:34:22 - ReXMoE - INFO - [2950/5000] loss=0.8838 aux=0.004974 R=3
|
| 108 |
+
2026-04-03 08:41:13 - ReXMoE - INFO - [3000/5000] loss=0.7627 aux=0.002060 R=3
|
| 109 |
+
2026-04-03 08:48:05 - ReXMoE - INFO - [3050/5000] loss=0.8120 aux=0.000668 R=3
|
| 110 |
+
2026-04-03 08:54:56 - ReXMoE - INFO - [3100/5000] loss=0.9701 aux=0.002121 R=3
|
| 111 |
+
2026-04-03 09:01:47 - ReXMoE - INFO - [3150/5000] loss=0.8151 aux=0.001289 R=3
|
| 112 |
+
2026-04-03 09:08:39 - ReXMoE - INFO - [3200/5000] loss=0.6943 aux=0.002777 R=3
|
| 113 |
+
2026-04-03 09:15:30 - ReXMoE - INFO - [3250/5000] loss=0.9401 aux=0.002350 R=3
|
| 114 |
+
2026-04-03 09:22:20 - ReXMoE - INFO - [3300/5000] loss=0.7034 aux=0.007935 R=3
|
| 115 |
+
2026-04-03 09:29:11 - ReXMoE - INFO - [3350/5000] loss=1.1980 aux=0.003006 R=3
|
| 116 |
+
2026-04-03 09:36:04 - ReXMoE - INFO - [3400/5000] loss=0.6413 aux=0.002045 R=3
|
| 117 |
+
2026-04-03 09:43:01 - ReXMoE - INFO - [3450/5000] loss=1.1729 aux=0.001686 R=3
|
| 118 |
+
2026-04-03 09:49:52 - ReXMoE - INFO - [3500/5000] loss=1.1667 aux=0.002045 R=3
|
| 119 |
+
2026-04-03 09:56:42 - ReXMoE - INFO - [3550/5000] loss=0.3543 aux=0.007324 R=3
|
| 120 |
+
2026-04-03 10:03:29 - ReXMoE - INFO - [3600/5000] loss=1.0002 aux=0.002792 R=3
|
| 121 |
+
2026-04-03 10:10:20 - ReXMoE - INFO - [3650/5000] loss=0.8748 aux=0.001503 R=3
|
| 122 |
+
2026-04-03 10:17:12 - ReXMoE - INFO - [3700/5000] loss=0.9026 aux=0.021118 R=3
|
| 123 |
+
2026-04-03 10:24:05 - ReXMoE - INFO - [3750/5000] loss=0.3710 aux=0.002182 R=3
|
| 124 |
+
2026-04-03 10:30:57 - ReXMoE - INFO - [3800/5000] loss=1.2199 aux=0.001564 R=3
|
| 125 |
+
2026-04-03 10:37:48 - ReXMoE - INFO - [3850/5000] loss=0.4812 aux=0.008057 R=3
|
| 126 |
+
2026-04-03 10:44:38 - ReXMoE - INFO - [3900/5000] loss=0.9683 aux=0.002487 R=3
|
| 127 |
+
2026-04-03 10:51:31 - ReXMoE - INFO - [3950/5000] loss=0.7649 aux=0.001732 R=3
|
| 128 |
+
2026-04-03 10:58:23 - ReXMoE - INFO - [4000/5000] loss=0.7234 aux=0.001839 R=3
|
| 129 |
+
2026-04-03 11:05:13 - ReXMoE - INFO - [4050/5000] loss=0.7793 aux=0.001289 R=3
|
| 130 |
+
2026-04-03 11:12:02 - ReXMoE - INFO - [4100/5000] loss=1.2237 aux=0.001968 R=3
|
| 131 |
+
2026-04-03 11:18:51 - ReXMoE - INFO - [4150/5000] loss=1.0040 aux=0.002701 R=3
|
| 132 |
+
2026-04-03 11:25:38 - ReXMoE - INFO - [4200/5000] loss=0.4700 aux=0.001945 R=3
|
| 133 |
+
2026-04-03 11:32:25 - ReXMoE - INFO - [4250/5000] loss=0.6833 aux=0.004486 R=3
|
| 134 |
+
2026-04-03 11:39:11 - ReXMoE - INFO - [4300/5000] loss=0.8191 aux=0.003754 R=3
|
| 135 |
+
2026-04-03 11:45:56 - ReXMoE - INFO - [4350/5000] loss=0.3914 aux=0.001312 R=3
|
| 136 |
+
2026-04-03 11:52:41 - ReXMoE - INFO - [4400/5000] loss=0.9623 aux=0.001854 R=3
|
| 137 |
+
2026-04-03 11:59:28 - ReXMoE - INFO - [4450/5000] loss=0.6550 aux=0.005615 R=3
|
| 138 |
+
2026-04-03 12:06:15 - ReXMoE - INFO - [4500/5000] loss=0.9616 aux=0.002777 R=3
|
| 139 |
+
2026-04-03 12:13:01 - ReXMoE - INFO - [4550/5000] loss=0.5557 aux=0.008789 R=3
|
| 140 |
+
2026-04-03 12:19:46 - ReXMoE - INFO - [4600/5000] loss=0.6275 aux=0.018555 R=3
|
| 141 |
+
2026-04-03 12:26:33 - ReXMoE - INFO - [4650/5000] loss=1.2395 aux=0.001549 R=3
|
| 142 |
+
2026-04-03 12:33:18 - ReXMoE - INFO - [4700/5000] loss=0.6769 aux=0.002060 R=3
|
| 143 |
+
2026-04-03 12:40:05 - ReXMoE - INFO - [4750/5000] loss=1.1499 aux=0.006348 R=3
|
| 144 |
+
2026-04-03 12:46:50 - ReXMoE - INFO - [4800/5000] loss=0.7449 aux=0.001022 R=3
|
| 145 |
+
2026-04-03 12:53:34 - ReXMoE - INFO - [4850/5000] loss=0.8246 aux=0.001823 R=3
|
| 146 |
+
2026-04-03 13:00:22 - ReXMoE - INFO - [4900/5000] loss=0.9550 aux=0.002029 R=3
|
| 147 |
+
2026-04-03 13:07:10 - ReXMoE - INFO - [4950/5000] loss=1.2535 aux=0.001610 R=3
|
| 148 |
+
2026-04-03 13:13:50 - ReXMoE - INFO -
[Step 5000/5000] Running evaluation at eval_steps...
2026-04-03 13:13:50 - ReXMoE - INFO -
Evaluating model with 3 sample prompts...
2026-04-03 13:13:52 - ReXMoE - INFO -
--- Prompt 1/3 ---
2026-04-03 13:13:52 - ReXMoE - INFO - Instruction: What is the capital of France?
2026-04-03 13:13:52 - ReXMoE - INFO - Input: None
2026-04-03 13:13:52 - ReXMoE - INFO - Generated completion (len 9): The capital of France is Paris.
2026-04-03 13:14:11 - ReXMoE - INFO -
--- Prompt 2/3 ---
2026-04-03 13:14:11 - ReXMoE - INFO - Instruction: High-pressure systems stop air from rising into the colder regions of the atmosphere where water can condense. What will most likely result if a high-pressure system remains in an area for a long period of time?
A. fog
B. rain
C. drought
D. tornado
Answer:
2026-04-03 13:14:11 - ReXMoE - INFO - Input: None
2026-04-03 13:14:11 - ReXMoE - INFO - Generated completion (len 77): A. fog

High-pressure systems often lead to fog formation because they can hold moisture and prevent it from evaporating. This can occur when the high-pressure system remains in an area for a long period of time. Fog forms when moist air cools and condenses into water droplets near the surface of the Earth.
2026-04-03 13:14:13 - ReXMoE - INFO -
--- Prompt 3/3 ---
2026-04-03 13:14:13 - ReXMoE - INFO - Instruction: Given the fact: predators eat prey
Question: Predators eat
A. lions
B. humans
C. bunnies
D. grass
Answer:
2026-04-03 13:14:13 - ReXMoE - INFO - Input: None
2026-04-03 13:14:13 - ReXMoE - INFO - Generated completion (len 7): C. bunnies
2026-04-03 13:14:13 - ReXMoE - INFO - Evaluation of all 3 prompts complete.
2026-04-03 13:14:13 - ReXMoE - INFO -
[Step 5000] Analyzing routing patterns at eval_steps...
2026-04-03 13:14:37 - ReXMoE - INFO -
Analyzing ACTUAL routing patterns from 10 batches (15,294 tokens)
2026-04-03 13:14:37 - ReXMoE - INFO - Current reuse scale: R=3
2026-04-03 13:14:37 - ReXMoE - INFO -
[IG-MET Pruning Report]:
2026-04-03 13:14:37 - ReXMoE - INFO - Global: 0/0 UNIQUE experts pruned (0.0%) | threshold=-1.000000
2026-04-03 13:14:37 - ReXMoE - INFO - Cross-Layer Routing Distribution (ACTUAL selections):
2026-04-03 13:14:37 - ReXMoE - INFO - Same layer (i): 781,056 ( 29.8%)
2026-04-03 13:14:37 - ReXMoE - INFO - Previous layer (i-1): 965,741 ( 36.8%)
2026-04-03 13:14:37 - ReXMoE - INFO - Next layer (i+1): 815,206 ( 31.1%)
2026-04-03 13:14:37 - ReXMoE - INFO - Distant layers: 59,437 ( 2.3%)
2026-04-03 13:14:37 - ReXMoE - INFO - Sample Layer-Specific Routing Patterns:
2026-04-03 13:14:37 - ReXMoE - INFO -
Layer 8:
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 7 from layer 9 ( L9): 5,937 times ( 38.8%)
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 7 from layer 7 ( L7): 5,895 times ( 38.5%)
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 14 from layer 9 ( L9): 5,822 times ( 38.1%)
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 14 from layer 7 ( L7): 5,618 times ( 36.7%)
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 2 from layer 7 ( L7): 4,200 times ( 27.5%)
2026-04-03 13:14:37 - ReXMoE - INFO -
Layer 16:
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 15 ( L15): 8,999 times ( 58.8%)
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 17 ( L17): 7,847 times ( 51.3%)
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 10 from layer 15 ( L15): 6,002 times ( 39.2%)
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 10 from layer 17 ( L17): 5,879 times ( 38.4%)
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 15 from layer 15 ( L15): 4,030 times ( 26.4%)
2026-04-03 13:14:37 - ReXMoE - INFO -
Layer 24:
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 23 ( L23): 9,213 times ( 60.2%)
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 9 from layer 23 ( L23): 7,912 times ( 51.7%)
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 25 ( L25): 6,819 times ( 44.6%)
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 9 from layer 25 ( L25): 6,403 times ( 41.9%)
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 24 (same): 4,350 times ( 28.4%)
2026-04-03 13:14:37 - ReXMoE - INFO - ✅ Cross-layer expert reuse detected: 70.2% of routing uses adjacent layers
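The percentages in the step-5000 routing report can be recomputed from the raw selection counts. The sketch below does only that arithmetic; the variable names are illustrative and nothing here comes from the ReXMoE code itself. One observation it makes visible: the 70.2% in the summary line equals the combined share of previous, next, and distant layers (every selection that is not same-layer), whereas previous and next alone sum to about 67.9%. The per-expert percentages appear to be shares of the 15,294 analyzed tokens, which is an inference from the numbers rather than something the log states.

```python
# Counts copied from the step-5000 "Cross-Layer Routing Distribution" report.
counts = {
    "same": 781_056,
    "previous": 965_741,
    "next": 815_206,
    "distant": 59_437,
}

total = sum(counts.values())
shares = {name: 100.0 * n / total for name, n in counts.items()}

# Strictly adjacent layers (i-1 and i+1) versus every non-same-layer selection.
adjacent_only = shares["previous"] + shares["next"]
cross_layer = adjacent_only + shares["distant"]

for name, pct in shares.items():
    print(f"{name:>8}: {pct:5.1f}%")
print(f"adjacent (i-1, i+1): {adjacent_only:.1f}%")  # 67.9%
print(f"all cross-layer:     {cross_layer:.1f}%")    # 70.2%

# Assumption: per-expert percentages are counts divided by analyzed tokens.
tokens = 15_294
layer8_expert7_from_l9 = 100.0 * 5_937 / tokens
print(f"Layer 8, expert 7 from L9: {layer8_expert7_from_l9:.1f}%")  # 38.8%
```

Because several experts are selected per token, the per-expert shares within one layer are not expected to sum to 100%.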
2026-04-03 13:14:37 - ReXMoE - INFO -
[Step 5000] Saving checkpoint at eval_steps to ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3...
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.0.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.0.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.1.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.1.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.2.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.2.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.3.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.3.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.4.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.4.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.5.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.5.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.6.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.6.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.7.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.7.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.8.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.8.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.9.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.9.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.10.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.10.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.11.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.11.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.12.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.12.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.13.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.13.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.14.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.14.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.15.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.15.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.16.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.16.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.17.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.17.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.18.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.18.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.19.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.19.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.20.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.20.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.21.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.21.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.22.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.22.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.23.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.23.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.24.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.24.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.25.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.25.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.26.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.26.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.27.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.27.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.28.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.28.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.29.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.29.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.30.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.30.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.31.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.31.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - ✓ Saved trained router weights: 96 parameters
2026-04-03 13:14:37 - ReXMoE - INFO - File: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/rexmoe_routers.pt
2026-04-03 13:14:37 - ReXMoE - INFO - Size: 12.03 MB
2026-04-03 13:14:37 - ReXMoE - INFO -
Also saving full model with ReXMoE architecture...
2026-04-03 13:14:39 - ReXMoE - INFO -
Merging LoRA adapters into base weights and saving to: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/merged
2026-04-03 13:15:00 - ReXMoE - INFO - ✓ Saved merged full model (base+routers+LoRA) for one-step loading
2026-04-03 13:15:00 - ReXMoE - INFO -
============================================================
2026-04-03 13:15:00 - ReXMoE - INFO - Epoch 1 Summary:
2026-04-03 13:15:00 - ReXMoE - INFO - Average LM Loss: 0.9498
2026-04-03 13:15:00 - ReXMoE - INFO - Average Aux Loss: 0.008843
2026-04-03 13:15:00 - ReXMoE - INFO - Average Total Loss: 0.9586
2026-04-03 13:15:00 - ReXMoE - INFO - Final R: 3
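The three averages in the epoch summary are consistent with the total being the LM loss plus the auxiliary loss. The check below is a small sketch of that arithmetic. The unit weight on the auxiliary term is an assumption: the log does not state how the two losses are combined, and other weightings could round to the same reported value.

```python
# Values copied from the "Epoch 1 Summary" lines.
avg_lm_loss = 0.9498
avg_aux_loss = 0.008843
reported_total = 0.9586

# Assumption: total = LM loss + 1.0 * aux loss (the weight is not in the log).
assumed_aux_weight = 1.0
assumed_total = avg_lm_loss + assumed_aux_weight * avg_aux_loss

print(f"{assumed_total:.4f}")  # 0.9586
# Allow for the four-decimal rounding used in the log.
assert abs(assumed_total - reported_total) < 5e-4
```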
2026-04-03 13:15:00 - ReXMoE - INFO -
Evaluating model with 3 sample prompts...
2026-04-03 13:15:02 - ReXMoE - INFO -
--- Prompt 1/3 ---
2026-04-03 13:15:02 - ReXMoE - INFO - Instruction: What is the capital of France?
2026-04-03 13:15:02 - ReXMoE - INFO - Input: None
2026-04-03 13:15:02 - ReXMoE - INFO - Generated completion (len 9): The capital of France is Paris.
2026-04-03 13:15:04 - ReXMoE - INFO -
--- Prompt 2/3 ---
2026-04-03 13:15:04 - ReXMoE - INFO - Instruction: High-pressure systems stop air from rising into the colder regions of the atmosphere where water can condense. What will most likely result if a high-pressure system remains in an area for a long period of time?
A. fog
B. rain
C. drought
D. tornado
Answer:
2026-04-03 13:15:04 - ReXMoE - INFO - Input: None
2026-04-03 13:15:04 - ReXMoE - INFO - Generated completion (len 5): A. fog
2026-04-03 13:15:05 - ReXMoE - INFO -
--- Prompt 3/3 ---
2026-04-03 13:15:05 - ReXMoE - INFO - Instruction: Given the fact: predators eat prey
Question: Predators eat
A. lions
B. humans
C. bunnies
D. grass
Answer:
2026-04-03 13:15:05 - ReXMoE - INFO - Input: None
2026-04-03 13:15:05 - ReXMoE - INFO - Generated completion (len 7): C. bunnies
2026-04-03 13:15:05 - ReXMoE - INFO - Evaluation of all 3 prompts complete.
2026-04-03 13:15:05 - ReXMoE - INFO - New best epoch 1 with avg LM loss 0.9498 — saving checkpoint to ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.0.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.0.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.1.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.1.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.2.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.2.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.3.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.3.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.4.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.4.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.5.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.5.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.6.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.6.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.7.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.7.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.8.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.8.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.9.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.9.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.10.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.10.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.11.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.11.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.12.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.12.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.13.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.13.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.14.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.14.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.15.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.15.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.16.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.16.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.17.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.17.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.18.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.18.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.19.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.19.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.20.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.20.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.21.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.21.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.22.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.22.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.23.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.23.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.24.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.24.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.25.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.25.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.26.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.26.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.27.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.27.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.28.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.28.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.29.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.29.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.30.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.30.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.31.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.31.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:06 - ReXMoE - INFO - ✓ Saved trained router weights: 96 parameters
2026-04-03 13:15:06 - ReXMoE - INFO - File: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/rexmoe_routers.pt
2026-04-03 13:15:06 - ReXMoE - INFO - Size: 12.03 MB
2026-04-03 13:15:06 - ReXMoE - INFO -
Also saving full model with ReXMoE architecture...
2026-04-03 13:15:06 - ReXMoE - INFO -
Merging LoRA adapters into base weights and saving to: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/merged
2026-04-03 13:15:44 - ReXMoE - INFO - ✓ Saved merged full model (base+routers+LoRA) for one-step loading
2026-04-03 13:15:44 - ReXMoE - INFO -
📊 Convergence Metrics:
2026-04-03 13:15:44 - ReXMoE - INFO - Convergence Metrics:
2026-04-03 13:15:44 - ReXMoE - INFO - Avg Router Grad Norm: 0.084278
2026-04-03 13:15:44 - ReXMoE - INFO - Current Learning Rate: 2.00e-05
2026-04-03 13:15:44 - ReXMoE - INFO - ℹ️ Collecting convergence data (need 5 epochs minimum)...
2026-04-03 13:15:44 - ReXMoE - INFO - Routing Pattern Analysis (Epoch 1):
2026-04-03 13:15:59 - ReXMoE - INFO -
Analyzing ACTUAL routing patterns from 10 batches (17,341 tokens)
2026-04-03 13:15:59 - ReXMoE - INFO - Current reuse scale: R=3
2026-04-03 13:15:59 - ReXMoE - INFO -
[IG-MET Pruning Report]:
2026-04-03 13:15:59 - ReXMoE - INFO - Global: 0/0 UNIQUE experts pruned (0.0%) | threshold=-1.000000
2026-04-03 13:15:59 - ReXMoE - INFO - Cross-Layer Routing Distribution (ACTUAL selections):
2026-04-03 13:15:59 - ReXMoE - INFO - Same layer (i): 869,591 ( 33.2%)
2026-04-03 13:15:59 - ReXMoE - INFO - Previous layer (i-1): 896,913 ( 34.2%)
2026-04-03 13:15:59 - ReXMoE - INFO - Next layer (i+1): 797,210 ( 30.4%)
2026-04-03 13:15:59 - ReXMoE - INFO - Distant layers: 57,726 ( 2.2%)
2026-04-03 13:15:59 - ReXMoE - INFO - Sample Layer-Specific Routing Patterns:
2026-04-03 13:15:59 - ReXMoE - INFO -
Layer 8:
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 7 from layer 9 ( L9): 6,917 times ( 39.9%)
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 14 from layer 9 ( L9): 6,553 times ( 37.8%)
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 7 from layer 7 ( L7): 6,305 times ( 36.4%)
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 14 from layer 7 ( L7): 5,503 times ( 31.7%)
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 2 from layer 7 ( L7): 3,988 times ( 23.0%)
2026-04-03 13:15:59 - ReXMoE - INFO -
Layer 16:
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 8 from layer 15 ( L15): 8,873 times ( 51.2%)
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 8 from layer 17 ( L17): 8,226 times ( 47.4%)
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 10 from layer 15 ( L15): 5,752 times ( 33.2%)
|
| 431 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 10 from layer 17 ( L17): 4,996 times ( 28.8%)
|
| 432 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 9 from layer 16 (same): 3,718 times ( 21.4%)
|
| 433 |
+
2026-04-03 13:15:59 - ReXMoE - INFO -
|
| 434 |
+
Layer 24:
|
| 435 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 8 from layer 23 ( L23): 9,676 times ( 55.8%)
|
| 436 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 8 from layer 25 ( L25): 7,087 times ( 40.9%)
|
| 437 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 9 from layer 23 ( L23): 6,982 times ( 40.3%)
|
| 438 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 9 from layer 25 ( L25): 4,908 times ( 28.3%)
|
| 439 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 4 from layer 24 (same): 3,902 times ( 22.5%)
|
| 440 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - ✅ Cross-layer expert reuse detected: 66.8% of routing uses adjacent layers
|
| 441 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - LR stepped to: 2.00e-05
|
| 442 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - ================================================================================
|
| 443 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Training Convergence Summary
|
| 444 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - ================================================================================
|
| 445 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Router Gradient Norms Over Epochs:
|
| 446 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Epoch 1: 0.084278
|
| 447 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Auxiliary Loss Over Epochs:
|
| 448 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Epoch 1: 0.008843
|
| 449 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Convergence Status: Insufficient data (< 5 epochs)
|
| 450 |
+
2026-04-03 13:15:59 - ReXMoE - INFO -
|
| 451 |
+
Saving trained router weights only...
|
| 452 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - ✓ Saved trained router weights: 96 parameters
|
| 453 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - File: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/rexmoe_routers.pt
|
| 454 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Size: 12.03 MB
|
| 455 |
+
2026-04-03 13:15:59 - ReXMoE - INFO -
|
| 456 |
+
Also saving full model with ReXMoE architecture...
|
| 457 |
+
2026-04-03 13:16:00 - ReXMoE - INFO -
|
| 458 |
+
Merging LoRA adapters into base weights and saving to: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/merged
|
| 459 |
+
2026-04-03 13:16:32 - ReXMoE - INFO - ✓ Saved merged full model (base+routers+LoRA) for one-step loading
|
| 460 |
+
2026-04-03 13:16:32 - ReXMoE - INFO - ================================================================================
|
| 461 |
+
2026-04-03 13:16:32 - ReXMoE - INFO - ✓ Training complete. Two checkpoint formats saved:
|
| 462 |
+
2026-04-03 13:16:32 - ReXMoE - INFO - 1. Router weights only: rexmoe_routers.pt (portable)
|
| 463 |
+
2026-04-03 13:16:32 - ReXMoE - INFO - 2. Full model: pytorch_model.bin (requires rexmoe_architecture.py)
|
| 464 |
+
2026-04-03 13:16:32 - ReXMoE - INFO -
|
| 465 |
+
Checkpoint directory: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3
|
| 466 |
+
2026-04-03 13:16:32 - ReXMoE - INFO - Full model size: 0.00 GB
|
| 467 |
+
2026-04-03 13:16:32 - ReXMoE - INFO - ================================================================================
|
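A note on the routing summary in the log above: the "Cross-layer expert reuse detected: 66.8%" figure appears to count every selection that is not same-layer (previous + next + distant), since previous and next alone sum to about 64.6%. The minimal sketch below recomputes the shares from the four logged counts. It assumes the message text has already been separated from the timestamp prefix; the names `parse_counts` and `shares` are hypothetical helpers for illustration, not part of the ReXMoE code.

```python
import re

# Message parts of the four "Cross-Layer Routing Distribution" lines,
# copied from the epoch-1 analysis in the log (timestamp prefix removed).
LOG_LINES = [
    "Same layer (i): 869,591 ( 33.2%)",
    "Previous layer (i-1): 896,913 ( 34.2%)",
    "Next layer (i+1): 797,210 ( 30.4%)",
    "Distant layers: 57,726 ( 2.2%)",
]

# Label up to the first colon, then a comma-grouped integer count.
PATTERN = re.compile(r"^(?P<label>[^:]+):\s*(?P<count>[\d,]+)\s*\(")


def parse_counts(lines):
    """Return {label: selection count} for each line that matches."""
    counts = {}
    for line in lines:
        match = PATTERN.match(line.strip())
        if match:
            label = match.group("label").strip()
            counts[label] = int(match.group("count").replace(",", ""))
    return counts


def shares(counts):
    """Convert raw counts to percentages of the total selections."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


counts = parse_counts(LOG_LINES)
pct = shares(counts)

# Strictly adjacent layers (i-1 and i+1) versus everything that is not layer i.
adjacent_only = pct["Previous layer (i-1)"] + pct["Next layer (i+1)"]
non_same = 100.0 - pct["Same layer (i)"]

print(f"total selections: {sum(counts.values()):,}")
print(f"adjacent only (i-1, i+1): {adjacent_only:.1f}%")
print(f"all non-same-layer:       {non_same:.1f}%")
```

The same check can be applied to the step-5000 analysis in the second log, where the logged figure is 70.2%.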
logs/rexmoe_training_0304_033137 copy_aux_corrected.log
ADDED
|
@@ -0,0 +1,467 @@
|
| 1 |
+
2026-04-03 03:31:37 - ReXMoE - INFO - ================================================================================
|
| 2 |
+
2026-04-03 03:31:37 - ReXMoE - INFO - ReXMoE Training Log - 0304_033137
|
| 3 |
+
2026-04-03 03:31:37 - ReXMoE - INFO - Log file: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/logs/rexmoe_training_0304_033137.log
|
| 4 |
+
2026-04-03 03:31:37 - ReXMoE - INFO - ================================================================================
|
| 5 |
+
2026-04-03 03:31:37 - ReXMoE - INFO - ================================================================================
|
| 6 |
+
2026-04-03 03:31:37 - ReXMoE - INFO - ReXMoE Cross-Layer Expert Reuse Training
|
| 7 |
+
2026-04-03 03:31:37 - ReXMoE - INFO - ================================================================================
|
| 8 |
+
2026-04-03 03:31:37 - ReXMoE - INFO - MET enabled: False
|
| 9 |
+
2026-04-03 03:31:37 - ReXMoE - INFO -
|
| 10 |
+
Configuration:
|
| 11 |
+
Model: microsoft/Phi-mini-MoE-instruct
|
| 12 |
+
Dataset: ../dataset/alpaca_data_cleaned.json
|
| 13 |
+
Dataset mode: IF_2
|
| 14 |
+
Reuse Scale (R): 3
|
| 15 |
+
Prune Ratio (MET): N/A
|
| 16 |
+
Epochs: 1
|
| 17 |
+
Num of samples: 20000
|
| 18 |
+
Batch Size: 4
|
| 19 |
+
Sequence Length: 1024
|
| 20 |
+
Learning Rate: 2e-05
|
| 21 |
+
PSR Enabled: True
|
| 22 |
+
LR Scheduler: True
|
| 23 |
+
Save Path: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3
|
| 24 |
+
Gradient Checkpointing: False
|
| 25 |
+
LoRA Rank: 16 (Full LoRA: True)
|
| 26 |
+
LoRA Alpha: 32
|
| 27 |
+
MET Enabled: False (Mask Ratio: 0.1, Warmup: 0.5)
|
| 28 |
+
Log File: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/logs/rexmoe_training_0304_033137.log
|
| 29 |
+
Aux loss weight: 0.05
|
| 30 |
+
|
| 31 |
+
2026-04-03 03:31:37 - ReXMoE - INFO - 💻 Using device: cuda)
|
| 32 |
+
2026-04-03 03:31:37 - ReXMoE - INFO - GPU: NVIDIA RTX A6000, Memory: 47.53 GB
|
| 33 |
+
2026-04-03 03:31:43 - ReXMoE - INFO - [5/7] Setting up optimizer and dataset...
|
| 34 |
+
2026-04-03 03:31:43 - ReXMoE - INFO - Using 8-bit AdamW optimizer
|
| 35 |
+
2026-04-03 03:31:43 - ReXMoE - INFO - LR Scheduler: CosineAnnealingLR (2e-05 → 2.0000000000000003e-06)
|
| 36 |
+
2026-04-03 03:31:51 - ReXMoE - INFO -
|
| 37 |
+
First batch statistics:
|
| 38 |
+
2026-04-03 03:31:51 - ReXMoE - INFO - LM Loss: 1.0094
|
| 39 |
+
2026-04-03 03:31:51 - ReXMoE - INFO - Aux Loss: 0.092773
|
| 40 |
+
2026-04-03 03:31:51 - ReXMoE - INFO - Total Loss: 1.1022
|
| 41 |
+
2026-04-03 03:31:51 - ReXMoE - INFO - Current R: 2
|
| 42 |
+
2026-04-03 03:31:51 - ReXMoE - INFO - Active experts per layer: 32
|
| 43 |
+
2026-04-03 03:31:51 - ReXMoE - INFO - Gradient norm: 1.0000
|
| 44 |
+
2026-04-03 03:31:51 - ReXMoE - INFO -
|
| 45 |
+
|
| 46 |
+
2026-04-03 03:35:09 - ReXMoE - INFO - [50/5000] loss=1.1939 aux=0.025195 R=2
|
| 47 |
+
2026-04-03 03:38:21 - ReXMoE - INFO - [100/5000] loss=1.1803 aux=0.016016 R=2
|
| 48 |
+
2026-04-03 03:41:36 - ReXMoE - INFO - [150/5000] loss=1.2968 aux=0.014648 R=2
|
| 49 |
+
2026-04-03 03:44:50 - ReXMoE - INFO - [200/5000] loss=1.2447 aux=0.011279 R=2
|
| 50 |
+
2026-04-03 03:48:01 - ReXMoE - INFO - [250/5000] loss=1.1971 aux=0.013672 R=2
|
| 51 |
+
2026-04-03 03:51:10 - ReXMoE - INFO - [300/5000] loss=2.1766 aux=0.009863 R=2
|
| 52 |
+
2026-04-03 03:54:19 - ReXMoE - INFO - [350/5000] loss=1.1092 aux=0.007031 R=2
|
| 53 |
+
2026-04-03 03:57:29 - ReXMoE - INFO - [400/5000] loss=0.9343 aux=0.009766 R=2
|
| 54 |
+
2026-04-03 04:00:40 - ReXMoE - INFO - [450/5000] loss=1.2180 aux=0.018164 R=2
|
| 55 |
+
2026-04-03 04:03:47 - ReXMoE - INFO - Warmup completed at step 500. Enabling FULL QLoRA with r = 16 and alpha = 32 on experts and updating optimizer...
|
| 56 |
+
2026-04-03 04:03:51 - ReXMoE - INFO - Trainable params (routers + LoRA): 144179200 (1.8509%)
|
| 57 |
+
2026-04-03 04:03:51 - ReXMoE - INFO - Sample trainable params after QLoRA: ['base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.k_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.k_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.o_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.o_proj.lora_B.default.weight', 'base_model.model.model.layers.0.block_sparse_moe.gate.weight', 'base_model.model.model.layers.0.block_sparse_moe.experts.0.w1.lora_A.default.weight']
|
| 58 |
+
2026-04-03 04:03:58 - ReXMoE - INFO - [500/5000] loss=1.0733 aux=0.014648 R=2
|
| 59 |
+
2026-04-03 04:09:15 - ReXMoE - INFO - [550/5000] loss=0.6253 aux=0.005884 R=2
|
| 60 |
+
2026-04-03 04:14:28 - ReXMoE - INFO - [600/5000] loss=1.5688 aux=0.004394 R=2
|
| 61 |
+
2026-04-03 04:19:39 - ReXMoE - INFO - [650/5000] loss=0.7864 aux=0.006543 R=2
|
| 62 |
+
2026-04-03 04:24:52 - ReXMoE - INFO - [700/5000] loss=1.5303 aux=0.004272 R=2
|
| 63 |
+
2026-04-03 04:30:02 - ReXMoE - INFO - [750/5000] loss=1.0098 aux=0.003125 R=2
|
| 64 |
+
2026-04-03 04:35:13 - ReXMoE - INFO - [800/5000] loss=1.0523 aux=0.005713 R=2
|
| 65 |
+
2026-04-03 04:40:24 - ReXMoE - INFO - [850/5000] loss=0.6447 aux=0.003638 R=2
|
| 66 |
+
2026-04-03 04:45:37 - ReXMoE - INFO - [900/5000] loss=0.7665 aux=0.001929 R=2
|
| 67 |
+
2026-04-03 04:50:50 - ReXMoE - INFO - [950/5000] loss=0.7762 aux=0.002295 R=2
|
| 68 |
+
2026-04-03 04:56:03 - ReXMoE - INFO - [1000/5000] loss=1.0254 aux=0.001428 R=2
|
| 69 |
+
2026-04-03 05:01:16 - ReXMoE - INFO - [1050/5000] loss=1.1320 aux=0.002295 R=2
|
| 70 |
+
2026-04-03 05:06:28 - ReXMoE - INFO - [1100/5000] loss=0.7519 aux=0.001990 R=2
|
| 71 |
+
2026-04-03 05:11:40 - ReXMoE - INFO - [1150/5000] loss=0.8246 aux=0.001282 R=2
|
| 72 |
+
2026-04-03 05:16:55 - ReXMoE - INFO - [1200/5000] loss=1.0041 aux=0.002417 R=2
|
| 73 |
+
2026-04-03 05:22:09 - ReXMoE - INFO - [1250/5000] loss=0.6804 aux=0.002344 R=2
|
| 74 |
+
2026-04-03 05:27:21 - ReXMoE - INFO - [1300/5000] loss=0.9695 aux=0.001443 R=2
|
| 75 |
+
2026-04-03 05:32:33 - ReXMoE - INFO - [1350/5000] loss=1.0448 aux=0.001054 R=2
|
| 76 |
+
2026-04-03 05:37:45 - ReXMoE - INFO - [1400/5000] loss=0.7468 aux=0.000854 R=2
|
| 77 |
+
2026-04-03 05:42:58 - ReXMoE - INFO - [1450/5000] loss=1.6307 aux=0.001404 R=2
|
| 78 |
+
2026-04-03 05:48:10 - ReXMoE - INFO - [1500/5000] loss=1.1833 aux=0.001050 R=2
|
| 79 |
+
2026-04-03 05:53:21 - ReXMoE - INFO - [1550/5000] loss=0.9216 aux=0.001196 R=2
|
| 80 |
+
2026-04-03 05:58:33 - ReXMoE - INFO - [1600/5000] loss=0.5969 aux=0.001483 R=2
|
| 81 |
+
2026-04-03 06:03:46 - ReXMoE - INFO - [1650/5000] loss=0.5240 aux=0.001007 R=2
|
| 82 |
+
2026-04-03 06:08:58 - ReXMoE - INFO - [1700/5000] loss=0.7681 aux=0.000714 R=2
|
| 83 |
+
2026-04-03 06:14:09 - ReXMoE - INFO - [1750/5000] loss=1.0812 aux=0.001160 R=2
|
| 84 |
+
2026-04-03 06:19:21 - ReXMoE - INFO - [1800/5000] loss=0.8171 aux=0.002394 R=2
|
| 85 |
+
2026-04-03 06:24:34 - ReXMoE - INFO - [1850/5000] loss=0.9029 aux=0.002148 R=2
|
| 86 |
+
2026-04-03 06:29:46 - ReXMoE - INFO - [1900/5000] loss=1.0440 aux=0.000736 R=2
|
| 87 |
+
2026-04-03 06:35:00 - ReXMoE - INFO - [1950/5000] loss=1.2026 aux=0.002038 R=2
|
| 88 |
+
2026-04-03 06:40:13 - ReXMoE - INFO - [2000/5000] loss=0.7174 aux=0.001349 R=2
|
| 89 |
+
2026-04-03 06:45:25 - ReXMoE - INFO - [2050/5000] loss=1.5737 aux=0.001428 R=2
|
| 90 |
+
2026-04-03 06:50:37 - ReXMoE - INFO - [2100/5000] loss=0.8508 aux=0.001361 R=2
|
| 91 |
+
2026-04-03 06:55:51 - ReXMoE - INFO - [2150/5000] loss=0.7965 aux=0.000662 R=2
|
| 92 |
+
2026-04-03 07:01:02 - ReXMoE - INFO - [2200/5000] loss=1.3079 aux=0.001099 R=2
|
| 93 |
+
2026-04-03 07:06:14 - ReXMoE - INFO - [2250/5000] loss=0.9750 aux=0.000891 R=2
|
| 94 |
+
2026-04-03 07:11:28 - ReXMoE - INFO - [2300/5000] loss=0.9549 aux=0.000891 R=2
|
| 95 |
+
2026-04-03 07:16:40 - ReXMoE - INFO - [2350/5000] loss=1.2216 aux=0.001636 R=2
|
| 96 |
+
2026-04-03 07:21:53 - ReXMoE - INFO - [2400/5000] loss=0.9801 aux=0.000916 R=2
|
| 97 |
+
2026-04-03 07:27:07 - ReXMoE - INFO - [2450/5000] loss=1.6587 aux=0.000641 R=2
|
| 98 |
+
2026-04-03 07:32:23 - ReXMoE - INFO - [2500/5000] loss=1.7420 aux=0.003859 R=3
|
| 99 |
+
2026-04-03 07:39:14 - ReXMoE - INFO - [2550/5000] loss=1.0498 aux=0.000720 R=3
|
| 100 |
+
2026-04-03 07:46:08 - ReXMoE - INFO - [2600/5000] loss=0.7848 aux=0.001117 R=3
|
| 101 |
+
2026-04-03 07:53:01 - ReXMoE - INFO - [2650/5000] loss=0.6119 aux=0.000397 R=3
|
| 102 |
+
2026-04-03 07:59:55 - ReXMoE - INFO - [2700/5000] loss=1.0871 aux=0.000806 R=3
|
| 103 |
+
2026-04-03 08:06:48 - ReXMoE - INFO - [2750/5000] loss=1.0422 aux=0.000564 R=3
|
| 104 |
+
2026-04-03 08:13:45 - ReXMoE - INFO - [2800/5000] loss=1.0147 aux=0.001105 R=3
|
| 105 |
+
2026-04-03 08:20:37 - ReXMoE - INFO - [2850/5000] loss=0.6756 aux=0.000781 R=3
|
| 106 |
+
2026-04-03 08:27:28 - ReXMoE - INFO - [2900/5000] loss=0.6243 aux=0.000668 R=3
|
| 107 |
+
2026-04-03 08:34:22 - ReXMoE - INFO - [2950/5000] loss=0.8838 aux=0.000990 R=3
|
| 108 |
+
2026-04-03 08:41:13 - ReXMoE - INFO - [3000/5000] loss=0.7627 aux=0.000824 R=3
|
| 109 |
+
2026-04-03 08:48:05 - ReXMoE - INFO - [3050/5000] loss=0.8120 aux=0.000267 R=3
|
| 110 |
+
2026-04-03 08:54:56 - ReXMoE - INFO - [3100/5000] loss=0.9701 aux=0.000848 R=3
|
| 111 |
+
2026-04-03 09:01:47 - ReXMoE - INFO - [3150/5000] loss=0.8151 aux=0.000516 R=3
|
| 112 |
+
2026-04-03 09:08:39 - ReXMoE - INFO - [3200/5000] loss=0.6943 aux=0.001111 R=3
|
| 113 |
+
2026-04-03 09:15:30 - ReXMoE - INFO - [3250/5000] loss=0.9401 aux=0.000940 R=3
|
| 114 |
+
2026-04-03 09:22:20 - ReXMoE - INFO - [3300/5000] loss=0.7034 aux=0.001174 R=3
|
| 115 |
+
2026-04-03 09:29:11 - ReXMoE - INFO - [3350/5000] loss=1.1980 aux=0.001202 R=3
|
| 116 |
+
2026-04-03 09:36:04 - ReXMoE - INFO - [3400/5000] loss=0.6413 aux=0.000818 R=3
|
| 117 |
+
2026-04-03 09:43:01 - ReXMoE - INFO - [3450/5000] loss=1.1729 aux=0.000674 R=3
|
| 118 |
+
2026-04-03 09:49:52 - ReXMoE - INFO - [3500/5000] loss=1.1667 aux=0.000818 R=3
|
| 119 |
+
2026-04-03 09:56:42 - ReXMoE - INFO - [3550/5000] loss=0.3543 aux=0.002930 R=3
|
| 120 |
+
2026-04-03 10:03:29 - ReXMoE - INFO - [3600/5000] loss=1.0002 aux=0.001117 R=3
|
| 121 |
+
2026-04-03 10:10:20 - ReXMoE - INFO - [3650/5000] loss=0.8748 aux=0.000601 R=3
|
| 122 |
+
2026-04-03 10:17:12 - ReXMoE - INFO - [3700/5000] loss=0.9026 aux=0.002447 R=3
|
| 123 |
+
2026-04-03 10:24:05 - ReXMoE - INFO - [3750/5000] loss=0.3710 aux=0.000873 R=3
|
| 124 |
+
2026-04-03 10:30:57 - ReXMoE - INFO - [3800/5000] loss=1.2199 aux=0.000626 R=3
|
| 125 |
+
2026-04-03 10:37:48 - ReXMoE - INFO - [3850/5000] loss=0.4812 aux=0.001223 R=3
|
| 126 |
+
2026-04-03 10:44:38 - ReXMoE - INFO - [3900/5000] loss=0.9683 aux=0.000995 R=3
|
| 127 |
+
2026-04-03 10:51:31 - ReXMoE - INFO - [3950/5000] loss=0.7649 aux=0.000693 R=3
|
| 128 |
+
2026-04-03 10:58:23 - ReXMoE - INFO - [4000/5000] loss=0.7234 aux=0.000736 R=3
|
| 129 |
+
2026-04-03 11:05:13 - ReXMoE - INFO - [4050/5000] loss=0.7793 aux=0.000516 R=3
|
| 130 |
+
2026-04-03 11:12:02 - ReXMoE - INFO - [4100/5000] loss=1.2237 aux=0.000787 R=3
|
| 131 |
+
2026-04-03 11:18:51 - ReXMoE - INFO - [4150/5000] loss=1.0040 aux=0.001080 R=3
|
| 132 |
+
2026-04-03 11:25:38 - ReXMoE - INFO - [4200/5000] loss=0.4700 aux=0.000778 R=3
|
| 133 |
+
2026-04-03 11:32:25 - ReXMoE - INFO - [4250/5000] loss=0.6833 aux=0.001794 R=3
|
| 134 |
+
2026-04-03 11:39:11 - ReXMoE - INFO - [4300/5000] loss=0.8191 aux=0.001502 R=3
|
| 135 |
+
2026-04-03 11:45:56 - ReXMoE - INFO - [4350/5000] loss=0.3914 aux=0.000525 R=3
|
| 136 |
+
2026-04-03 11:52:41 - ReXMoE - INFO - [4400/5000] loss=0.9623 aux=0.000742 R=3
|
| 137 |
+
2026-04-03 11:59:28 - ReXMoE - INFO - [4450/5000] loss=0.6550 aux=0.002246 R=3
|
| 138 |
+
2026-04-03 12:06:15 - ReXMoE - INFO - [4500/5000] loss=0.9616 aux=0.001111 R=3
|
| 139 |
+
2026-04-03 12:13:01 - ReXMoE - INFO - [4550/5000] loss=0.5557 aux=0.003516 R=3
|
| 140 |
+
2026-04-03 12:19:46 - ReXMoE - INFO - [4600/5000] loss=0.6275 aux=0.002422 R=3
|
| 141 |
+
2026-04-03 12:26:33 - ReXMoE - INFO - [4650/5000] loss=1.2395 aux=0.000620 R=3
|
| 142 |
+
2026-04-03 12:33:18 - ReXMoE - INFO - [4700/5000] loss=0.6769 aux=0.000824 R=3
|
| 143 |
+
2026-04-03 12:40:05 - ReXMoE - INFO - [4750/5000] loss=1.1499 aux=0.002539 R=3
|
| 144 |
+
2026-04-03 12:46:50 - ReXMoE - INFO - [4800/5000] loss=0.7449 aux=0.000409 R=3
|
| 145 |
+
2026-04-03 12:53:34 - ReXMoE - INFO - [4850/5000] loss=0.8246 aux=0.000729 R=3
|
| 146 |
+
2026-04-03 13:00:22 - ReXMoE - INFO - [4900/5000] loss=0.9550 aux=0.000812 R=3
|
| 147 |
+
2026-04-03 13:07:10 - ReXMoE - INFO - [4950/5000] loss=1.2535 aux=0.000644 R=3
|
| 148 |
+
2026-04-03 13:13:50 - ReXMoE - INFO -
|
| 149 |
+
[Step 5000/5000] Running evaluation at eval_steps...
|
| 150 |
+
2026-04-03 13:13:50 - ReXMoE - INFO -
|
| 151 |
+
Evaluating model with 3 sample prompts...
|
| 152 |
+
2026-04-03 13:13:52 - ReXMoE - INFO -
|
| 153 |
+
--- Prompt 1/3 ---
|
| 154 |
+
2026-04-03 13:13:52 - ReXMoE - INFO - Instruction: What is the capital of France?
|
| 155 |
+
2026-04-03 13:13:52 - ReXMoE - INFO - Input: None
|
| 156 |
+
2026-04-03 13:13:52 - ReXMoE - INFO - Generated completion (len 9): The capital of France is Paris.
|
| 157 |
+
2026-04-03 13:14:11 - ReXMoE - INFO -
|
| 158 |
+
--- Prompt 2/3 ---
|
| 159 |
+
2026-04-03 13:14:11 - ReXMoE - INFO - Instruction: High-pressure systems stop air from rising into the colder regions of the atmosphere where water can condense. What will most likely result if a high-pressure system remains in an area for a long period of time?
|
| 160 |
+
A. fog
|
| 161 |
+
B. rain
|
| 162 |
+
C. drought
|
| 163 |
+
D. tornado
|
| 164 |
+
Answer:
|
| 165 |
+
2026-04-03 13:14:11 - ReXMoE - INFO - Input: None
|
| 166 |
+
2026-04-03 13:14:11 - ReXMoE - INFO - Generated completion (len 77): A. fog
|
| 167 |
+
|
| 168 |
+
High-pressure systems often lead to fog formation because they can hold moisture and prevent it from evaporating. This can occur when the high-pressure system remains in an area for a long period of time. Fog forms when moist air cools and condenses into water droplets near the surface of the Earth.
|
| 169 |
+
2026-04-03 13:14:13 - ReXMoE - INFO -
|
| 170 |
+
--- Prompt 3/3 ---
|
| 171 |
+
2026-04-03 13:14:13 - ReXMoE - INFO - Instruction: Given the fact: predators eat prey
|
| 172 |
+
Question: Predators eat
|
| 173 |
+
A. lions
|
| 174 |
+
B. humans
|
| 175 |
+
C. bunnies
|
| 176 |
+
D. grass
|
| 177 |
+
Answer:
|
| 178 |
+
2026-04-03 13:14:13 - ReXMoE - INFO - Input: None
|
| 179 |
+
2026-04-03 13:14:13 - ReXMoE - INFO - Generated completion (len 7): C. bunnies
|
| 180 |
+
2026-04-03 13:14:13 - ReXMoE - INFO - Evaluation of all 3 prompts complete.
|
| 181 |
+
2026-04-03 13:14:13 - ReXMoE - INFO -
|
| 182 |
+
[Step 5000] Analyzing routing patterns at eval_steps...
|
| 183 |
+
2026-04-03 13:14:37 - ReXMoE - INFO -
|
| 184 |
+
Analyzing ACTUAL routing patterns from 10 batches (15,294 tokens)
|
| 185 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Current reuse scale: R=3
|
| 186 |
+
2026-04-03 13:14:37 - ReXMoE - INFO -
|
| 187 |
+
[IG-MET Pruning Report]:
|
| 188 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Global: 0/0 UNIQUE experts pruned (0.0%) | threshold=-1.000000
|
| 189 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Cross-Layer Routing Distribution (ACTUAL selections):
|
| 190 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Same layer (i): 781,056 ( 29.8%)
|
| 191 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Previous layer (i-1): 965,741 ( 36.8%)
|
| 192 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Next layer (i+1): 815,206 ( 31.1%)
|
| 193 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Distant layers: 59,437 ( 2.3%)
|
| 194 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Sample Layer-Specific Routing Patterns:
|
| 195 |
+
2026-04-03 13:14:37 - ReXMoE - INFO -
|
| 196 |
+
Layer 8:
|
| 197 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 7 from layer 9 ( L9): 5,937 times ( 38.8%)
|
| 198 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 7 from layer 7 ( L7): 5,895 times ( 38.5%)
|
| 199 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 14 from layer 9 ( L9): 5,822 times ( 38.1%)
|
| 200 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 14 from layer 7 ( L7): 5,618 times ( 36.7%)
|
| 201 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 2 from layer 7 ( L7): 4,200 times ( 27.5%)
|
| 202 |
+
2026-04-03 13:14:37 - ReXMoE - INFO -
|
| 203 |
+
Layer 16:
|
| 204 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 15 ( L15): 8,999 times ( 58.8%)
|
| 205 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 17 ( L17): 7,847 times ( 51.3%)
|
| 206 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 10 from layer 15 ( L15): 6,002 times ( 39.2%)
|
| 207 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 10 from layer 17 ( L17): 5,879 times ( 38.4%)
|
| 208 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 15 from layer 15 ( L15): 4,030 times ( 26.4%)
|
| 209 |
+
2026-04-03 13:14:37 - ReXMoE - INFO -
|
| 210 |
+
Layer 24:
|
| 211 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 23 ( L23): 9,213 times ( 60.2%)
|
| 212 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 9 from layer 23 ( L23): 7,912 times ( 51.7%)
|
| 213 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 25 ( L25): 6,819 times ( 44.6%)
|
| 214 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 9 from layer 25 ( L25): 6,403 times ( 41.9%)
|
| 215 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 24 (same): 4,350 times ( 28.4%)
|
| 216 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - ✅ Cross-layer expert reuse detected: 70.2% of routing uses adjacent layers
|
| 217 |
+
2026-04-03 13:14:37 - ReXMoE - INFO -
|
| 218 |
+
[Step 5000] Saving checkpoint at eval_steps to ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3...
|
| 219 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.0.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 220 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.0.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 221 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.1.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 222 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.1.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 223 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.2.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 224 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.2.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 225 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.3.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 226 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.3.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 227 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.4.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 228 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.4.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 229 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.5.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 230 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.5.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 231 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.6.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 232 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.6.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 233 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.7.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 234 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.7.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 235 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.8.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 236 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.8.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 237 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.9.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 238 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.9.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 239 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.10.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 240 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.10.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 241 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.11.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 242 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.11.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 243 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.12.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 244 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.12.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 245 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.13.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 246 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.13.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 247 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.14.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 248 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.14.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 249 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.15.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 250 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.15.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 251 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.16.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 252 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.16.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 253 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.17.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 254 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.17.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 255 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.18.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 256 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.18.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 257 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.19.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 258 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.19.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 259 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.20.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 260 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.20.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 261 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.21.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 262 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.21.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 263 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.22.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 264 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.22.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 265 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.23.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 266 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.23.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 267 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.24.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 268 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.24.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 269 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.25.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 270 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.25.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 271 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.26.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 272 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.26.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 273 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.27.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 274 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.27.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 275 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.28.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 276 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.28.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 277 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.29.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 278 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.29.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 279 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.30.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 280 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.30.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 281 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.31.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 282 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.31.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 283 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - ✓ Saved trained router weights: 96 parameters
|
| 284 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - File: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/rexmoe_routers.pt
|
| 285 |
+
2026-04-03 13:14:37 - ReXMoE - INFO - Size: 12.03 MB
|
| 286 |
+
2026-04-03 13:14:37 - ReXMoE - INFO -
|
| 287 |
+
Also saving full model with ReXMoE architecture...
|
| 288 |
+
2026-04-03 13:14:39 - ReXMoE - INFO -
|
| 289 |
+
Merging LoRA adapters into base weights and saving to: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/merged
|
| 290 |
+
2026-04-03 13:15:00 - ReXMoE - INFO - ✓ Saved merged full model (base+routers+LoRA) for one-step loading
|
| 291 |
+
2026-04-03 13:15:00 - ReXMoE - INFO -
|
| 292 |
+
============================================================
|
| 293 |
+
2026-04-03 13:15:00 - ReXMoE - INFO - Epoch 1 Summary:
|
| 294 |
+
2026-04-03 13:15:00 - ReXMoE - INFO - Average LM Loss: 0.9498
|
| 295 |
+
2026-04-03 13:15:00 - ReXMoE - INFO - Average Aux Loss: 0.008843
|
| 296 |
+
2026-04-03 13:15:00 - ReXMoE - INFO - Average Total Loss: 0.9586
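
Note on the summary above: the logged totals are consistent with the total loss being the plain sum of the LM loss and the logged auxiliary loss. The sketch below checks this against the epoch averages and against the first-batch statistics from the same log. It is an arithmetic check only; `total_loss` is a hypothetical helper, not a function from the training script, and whether the configured aux loss weight (0.05) is already folded into the logged aux value is an assumption the log cannot confirm.

```python
def total_loss(lm_loss: float, aux_loss: float) -> float:
    """Hypothetical helper: total as the plain sum of the two logged terms."""
    return lm_loss + aux_loss

# Values copied from the log.
first_batch = total_loss(1.0094, 0.092773)   # logged Total Loss: 1.1022
epoch_avg = total_loss(0.9498, 0.008843)     # logged Average Total Loss: 0.9586

# For comparison: applying the configured weight (0.05) to the logged aux value
# does not reproduce the logged first-batch total.
weighted_first_batch = 1.0094 + 0.05 * 0.092773

print(f"{first_batch:.4f} {epoch_avg:.4f} {weighted_first_batch:.4f}")
```

If the weight is applied at all, it therefore seems to be applied before the aux value is logged.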
|
| 297 |
+
2026-04-03 13:15:00 - ReXMoE - INFO - Final R: 3
|
| 298 |
+
2026-04-03 13:15:00 - ReXMoE - INFO -
|
| 299 |
+
Evaluating model with 3 sample prompts...
|
| 300 |
+
2026-04-03 13:15:02 - ReXMoE - INFO -
|
| 301 |
+
--- Prompt 1/3 ---
|
| 302 |
+
2026-04-03 13:15:02 - ReXMoE - INFO - Instruction: What is the capital of France?
|
| 303 |
+
2026-04-03 13:15:02 - ReXMoE - INFO - Input: None
|
| 304 |
+
2026-04-03 13:15:02 - ReXMoE - INFO - Generated completion (len 9): The capital of France is Paris.
|
| 305 |
+
2026-04-03 13:15:04 - ReXMoE - INFO -
|
| 306 |
+
--- Prompt 2/3 ---
|
| 307 |
+
2026-04-03 13:15:04 - ReXMoE - INFO - Instruction: High-pressure systems stop air from rising into the colder regions of the atmosphere where water can condense. What will most likely result if a high-pressure system remains in an area for a long period of time?
|
| 308 |
+
A. fog
|
| 309 |
+
B. rain
|
| 310 |
+
C. drought
|
| 311 |
+
D. tornado
|
| 312 |
+
Answer:
|
| 313 |
+
2026-04-03 13:15:04 - ReXMoE - INFO - Input: None
|
| 314 |
+
2026-04-03 13:15:04 - ReXMoE - INFO - Generated completion (len 5): A. fog
|
| 315 |
+
2026-04-03 13:15:05 - ReXMoE - INFO -
|
| 316 |
+
--- Prompt 3/3 ---
|
| 317 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Instruction: Given the fact: predators eat prey
|
| 318 |
+
Question: Predators eat
|
| 319 |
+
A. lions
|
| 320 |
+
B. humans
|
| 321 |
+
C. bunnies
|
| 322 |
+
D. grass
|
| 323 |
+
Answer:
|
| 324 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Input: None
|
| 325 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Generated completion (len 7): C. bunnies
|
| 326 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Evaluation of all 3 prompts complete.
|
| 327 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - New best epoch 1 with avg LM loss 0.9498 — saving checkpoint to ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3
|
| 328 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.0.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 329 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.0.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 330 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.1.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 331 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.1.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 332 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.2.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 333 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.2.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 334 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.3.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 335 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.3.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 336 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.4.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 337 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.4.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 338 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.5.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 339 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.5.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 340 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.6.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 341 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.6.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 342 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.7.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 343 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.7.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 344 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.8.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 345 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.8.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 346 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.9.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 347 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.9.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 348 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.10.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 349 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.10.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 350 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.11.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 351 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.11.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 352 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.12.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 353 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.12.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 354 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.13.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 355 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.13.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 356 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.14.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 357 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.14.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 358 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.15.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 359 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.15.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 360 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.16.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 361 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.16.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 362 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.17.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 363 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.17.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 364 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.18.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 365 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.18.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 366 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.19.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 367 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.19.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 368 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.20.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 369 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.20.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 370 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.21.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 371 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.21.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 372 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.22.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 373 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.22.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 374 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.23.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 375 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.23.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 376 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.24.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 377 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.24.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 378 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.25.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 379 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.25.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 380 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.26.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 381 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.26.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 382 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.27.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 383 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.27.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 384 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.28.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 385 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.28.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 386 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.29.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 387 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.29.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 388 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.30.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 389 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.30.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 390 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.31.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 391 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.31.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 392 |
+
2026-04-03 13:15:06 - ReXMoE - INFO - ✓ Saved trained router weights: 96 parameters
|
| 393 |
+
2026-04-03 13:15:06 - ReXMoE - INFO - File: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/rexmoe_routers.pt
|
| 394 |
+
2026-04-03 13:15:06 - ReXMoE - INFO - Size: 12.03 MB
|
| 395 |
+
2026-04-03 13:15:06 - ReXMoE - INFO -
|
| 396 |
+
Also saving full model with ReXMoE architecture...
|
| 397 |
+
2026-04-03 13:15:06 - ReXMoE - INFO -
|
| 398 |
+
Merging LoRA adapters into base weights and saving to: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/merged
|
| 399 |
+
2026-04-03 13:15:44 - ReXMoE - INFO - ✓ Saved merged full model (base+routers+LoRA) for one-step loading
|
| 400 |
+
2026-04-03 13:15:44 - ReXMoE - INFO -
|
| 401 |
+
📊 Convergence Metrics:
|
| 402 |
+
2026-04-03 13:15:44 - ReXMoE - INFO - Convergence Metrics:
|
| 403 |
+
2026-04-03 13:15:44 - ReXMoE - INFO - Avg Router Grad Norm: 0.084278
|
| 404 |
+
2026-04-03 13:15:44 - ReXMoE - INFO - Current Learning Rate: 2.00e-05
|
| 405 |
+
2026-04-03 13:15:44 - ReXMoE - INFO - ℹ️ Collecting convergence data (need 5 epochs minimum)...
|
| 406 |
+
2026-04-03 13:15:44 - ReXMoE - INFO - Routing Pattern Analysis (Epoch 1):
|
| 407 |
+
2026-04-03 13:15:59 - ReXMoE - INFO -
|
| 408 |
+
Analyzing ACTUAL routing patterns from 10 batches (17,341 tokens)
|
| 409 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Current reuse scale: R=3
|
| 410 |
+
2026-04-03 13:15:59 - ReXMoE - INFO -
|
| 411 |
+
[IG-MET Pruning Report]:
|
| 412 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Global: 0/0 UNIQUE experts pruned (0.0%) | threshold=-1.000000
|
| 413 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Cross-Layer Routing Distribution (ACTUAL selections):
|
| 414 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Same layer (i): 869,591 ( 33.2%)
|
| 415 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Previous layer (i-1): 896,913 ( 34.2%)
|
| 416 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Next layer (i+1): 797,210 ( 30.4%)
|
| 417 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Distant layers: 57,726 ( 2.2%)
|
| 418 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Sample Layer-Specific Routing Patterns:
|
| 419 |
+
2026-04-03 13:15:59 - ReXMoE - INFO -
|
| 420 |
+
Layer 8:
|
| 421 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 7 from layer 9 ( L9): 6,917 times ( 39.9%)
|
| 422 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 14 from layer 9 ( L9): 6,553 times ( 37.8%)
|
| 423 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 7 from layer 7 ( L7): 6,305 times ( 36.4%)
|
| 424 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 14 from layer 7 ( L7): 5,503 times ( 31.7%)
|
| 425 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 2 from layer 7 ( L7): 3,988 times ( 23.0%)
|
| 426 |
+
2026-04-03 13:15:59 - ReXMoE - INFO -
|
| 427 |
+
Layer 16:
|
| 428 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 8 from layer 15 ( L15): 8,873 times ( 51.2%)
|
| 429 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 8 from layer 17 ( L17): 8,226 times ( 47.4%)
|
| 430 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 10 from layer 15 ( L15): 5,752 times ( 33.2%)
|
| 431 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 10 from layer 17 ( L17): 4,996 times ( 28.8%)
|
| 432 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 9 from layer 16 (same): 3,718 times ( 21.4%)
|
| 433 |
+
2026-04-03 13:15:59 - ReXMoE - INFO -
|
| 434 |
+
Layer 24:
|
| 435 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 8 from layer 23 ( L23): 9,676 times ( 55.8%)
|
| 436 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 8 from layer 25 ( L25): 7,087 times ( 40.9%)
|
| 437 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 9 from layer 23 ( L23): 6,982 times ( 40.3%)
|
| 438 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 9 from layer 25 ( L25): 4,908 times ( 28.3%)
|
| 439 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 4 from layer 24 (same): 3,902 times ( 22.5%)
|
| 440 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - ✅ Cross-layer expert reuse detected: 66.8% of routing uses adjacent layers
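
Note on the routing figures above: the percentages can be recomputed from the four logged selection counts. The sketch below does that and compares two candidate definitions of the 66.8% figure. The dictionary keys are labels chosen here for illustration; only the counts come from the log.

```python
# Selection counts copied from the "Cross-Layer Routing Distribution" lines.
counts = {
    "same": 869_591,      # same layer (i)
    "previous": 896_913,  # previous layer (i-1)
    "next": 797_210,      # next layer (i+1)
    "distant": 57_726,    # distant layers
}

total = sum(counts.values())
shares = {name: 100.0 * n / total for name, n in counts.items()}

# Two ways to read "cross-layer reuse":
adjacent_only = shares["previous"] + shares["next"]  # i-1 and i+1 only
all_non_same = 100.0 - shares["same"]                # everything except layer i

for name, pct in shares.items():
    print(f"{name:>8}: {pct:.1f}%")
print(f"adjacent only: {adjacent_only:.1f}%  all non-same: {all_non_same:.1f}%")
```

Under these counts, the logged 66.8% matches the share of all selections outside the same layer (adjacent plus distant), while the adjacent layers alone account for 64.6%.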
|
| 441 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - LR stepped to: 2.00e-05
|
| 442 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - ================================================================================
|
| 443 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Training Convergence Summary
|
| 444 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - ================================================================================
|
| 445 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Router Gradient Norms Over Epochs:
|
| 446 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Epoch 1: 0.084278
|
| 447 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Auxiliary Loss Over Epochs:
|
| 448 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Epoch 1: 0.008843
|
| 449 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Convergence Status: Insufficient data (< 5 epochs)
|
| 450 |
+
2026-04-03 13:15:59 - ReXMoE - INFO -
|
| 451 |
+
Saving trained router weights only...
|
| 452 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - ✓ Saved trained router weights: 96 parameters
|
| 453 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - File: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/rexmoe_routers.pt
|
| 454 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Size: 12.03 MB
|
| 455 |
+
2026-04-03 13:15:59 - ReXMoE - INFO -
|
| 456 |
+
Also saving full model with ReXMoE architecture...
|
| 457 |
+
2026-04-03 13:16:00 - ReXMoE - INFO -
|
| 458 |
+
Merging LoRA adapters into base weights and saving to: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/merged
|
| 459 |
+
2026-04-03 13:16:32 - ReXMoE - INFO - ✓ Saved merged full model (base+routers+LoRA) for one-step loading
|
| 460 |
+
2026-04-03 13:16:32 - ReXMoE - INFO - ================================================================================
|
| 461 |
+
2026-04-03 13:16:32 - ReXMoE - INFO - ✓ Training complete. Two checkpoint formats saved:
|
| 462 |
+
2026-04-03 13:16:32 - ReXMoE - INFO - 1. Router weights only: rexmoe_routers.pt (portable)
|
| 463 |
+
2026-04-03 13:16:32 - ReXMoE - INFO - 2. Full model: pytorch_model.bin (requires rexmoe_architecture.py)
|
| 464 |
+
2026-04-03 13:16:32 - ReXMoE - INFO -
|
| 465 |
+
Checkpoint directory: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3
|
| 466 |
+
2026-04-03 13:16:32 - ReXMoE - INFO - Full model size: 0.00 GB
|
| 467 |
+
2026-04-03 13:16:32 - ReXMoE - INFO - ================================================================================
|
logs/rexmoe_training_0304_033137.log
ADDED
|
@@ -0,0 +1,467 @@
|
| 1 |
+
2026-04-03 03:31:37 - ReXMoE - INFO - ================================================================================
|
| 2 |
+
2026-04-03 03:31:37 - ReXMoE - INFO - ReXMoE Training Log - 0304_033137
|
| 3 |
+
2026-04-03 03:31:37 - ReXMoE - INFO - Log file: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/logs/rexmoe_training_0304_033137.log
|
| 4 |
+
2026-04-03 03:31:37 - ReXMoE - INFO - ================================================================================
|
| 5 |
+
2026-04-03 03:31:37 - ReXMoE - INFO - ================================================================================
|
| 6 |
+
2026-04-03 03:31:37 - ReXMoE - INFO - ReXMoE Cross-Layer Expert Reuse Training
|
| 7 |
+
2026-04-03 03:31:37 - ReXMoE - INFO - ================================================================================
|
| 8 |
+
2026-04-03 03:31:37 - ReXMoE - INFO - MET enabled: False
|
| 9 |
+
2026-04-03 03:31:37 - ReXMoE - INFO -
|
| 10 |
+
Configuration:
|
| 11 |
+
Model: microsoft/Phi-mini-MoE-instruct
|
| 12 |
+
Dataset: ../dataset/alpaca_data_cleaned.json
|
| 13 |
+
Dataset mode: IF_2
|
| 14 |
+
Reuse Scale (R): 3
|
| 15 |
+
Prune Ratio (MET): N/A
|
| 16 |
+
Epochs: 1
|
| 17 |
+
Num of samples: 20000
|
| 18 |
+
Batch Size: 4
|
| 19 |
+
Sequence Length: 1024
|
| 20 |
+
Learning Rate: 2e-05
|
| 21 |
+
PSR Enabled: True
|
| 22 |
+
LR Scheduler: True
|
| 23 |
+
Save Path: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3
|
| 24 |
+
Gradient Checkpointing: False
|
| 25 |
+
LoRA Rank: 16 (Full LoRA: True)
|
| 26 |
+
LoRA Alpha: 32
|
| 27 |
+
MET Enabled: False (Mask Ratio: 0.1, Warmup: 0.5)
|
| 28 |
+
Log File: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/logs/rexmoe_training_0304_033137.log
|
| 29 |
+
Aux loss weight: 0.05
|
| 30 |
+
|
2026-04-03 03:31:37 - ReXMoE - INFO - 💻 Using device: cuda)
2026-04-03 03:31:37 - ReXMoE - INFO - GPU: NVIDIA RTX A6000, Memory: 47.53 GB
2026-04-03 03:31:43 - ReXMoE - INFO - [5/7] Setting up optimizer and dataset...
2026-04-03 03:31:43 - ReXMoE - INFO - Using 8-bit AdamW optimizer
2026-04-03 03:31:43 - ReXMoE - INFO - LR Scheduler: CosineAnnealingLR (2e-05 → 2.0000000000000003e-06)
2026-04-03 03:31:51 - ReXMoE - INFO -
First batch statistics:
2026-04-03 03:31:51 - ReXMoE - INFO - LM Loss: 1.0094
2026-04-03 03:31:51 - ReXMoE - INFO - Aux Loss: 0.092773
2026-04-03 03:31:51 - ReXMoE - INFO - Total Loss: 1.1022
2026-04-03 03:31:51 - ReXMoE - INFO - Current R: 2
2026-04-03 03:31:51 - ReXMoE - INFO - Active experts per layer: 32
2026-04-03 03:31:51 - ReXMoE - INFO - Gradient norm: 1.0000
2026-04-03 03:31:51 - ReXMoE - INFO -
2026-04-03 03:35:09 - ReXMoE - INFO - [50/5000] loss=1.1939 aux=0.062988 R=2
2026-04-03 03:38:21 - ReXMoE - INFO - [100/5000] loss=1.1803 aux=0.040039 R=2
2026-04-03 03:41:36 - ReXMoE - INFO - [150/5000] loss=1.2968 aux=0.036621 R=2
2026-04-03 03:44:50 - ReXMoE - INFO - [200/5000] loss=1.2447 aux=0.028198 R=2
2026-04-03 03:48:01 - ReXMoE - INFO - [250/5000] loss=1.1971 aux=0.034180 R=2
2026-04-03 03:51:10 - ReXMoE - INFO - [300/5000] loss=2.1766 aux=0.024658 R=2
2026-04-03 03:54:19 - ReXMoE - INFO - [350/5000] loss=1.1092 aux=0.017578 R=2
2026-04-03 03:57:29 - ReXMoE - INFO - [400/5000] loss=0.9343 aux=0.024414 R=2
2026-04-03 04:00:40 - ReXMoE - INFO - [450/5000] loss=1.2180 aux=0.045410 R=2
2026-04-03 04:03:47 - ReXMoE - INFO - Warmup completed at step 500. Enabling FULL QLoRA with r = 16 and alpha = 32 on experts and updating optimizer...
2026-04-03 04:03:51 - ReXMoE - INFO - Trainable params (routers + LoRA): 144179200 (1.8509%)
2026-04-03 04:03:51 - ReXMoE - INFO - Sample trainable params after QLoRA: ['base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.k_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.k_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.o_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.o_proj.lora_B.default.weight', 'base_model.model.model.layers.0.block_sparse_moe.gate.weight', 'base_model.model.model.layers.0.block_sparse_moe.experts.0.w1.lora_A.default.weight']
2026-04-03 04:03:58 - ReXMoE - INFO - [500/5000] loss=1.0733 aux=0.036621 R=2
2026-04-03 04:09:15 - ReXMoE - INFO - [550/5000] loss=0.6253 aux=0.014709 R=2
2026-04-03 04:14:28 - ReXMoE - INFO - [600/5000] loss=1.5688 aux=0.010986 R=2
2026-04-03 04:19:39 - ReXMoE - INFO - [650/5000] loss=0.7864 aux=0.016357 R=2
2026-04-03 04:24:52 - ReXMoE - INFO - [700/5000] loss=1.5303 aux=0.010681 R=2
2026-04-03 04:30:02 - ReXMoE - INFO - [750/5000] loss=1.0098 aux=0.007812 R=2
2026-04-03 04:35:13 - ReXMoE - INFO - [800/5000] loss=1.0523 aux=0.014282 R=2
2026-04-03 04:40:24 - ReXMoE - INFO - [850/5000] loss=0.6447 aux=0.009094 R=2
2026-04-03 04:45:37 - ReXMoE - INFO - [900/5000] loss=0.7665 aux=0.004822 R=2
2026-04-03 04:50:50 - ReXMoE - INFO - [950/5000] loss=0.7762 aux=0.005737 R=2
2026-04-03 04:56:03 - ReXMoE - INFO - [1000/5000] loss=1.0254 aux=0.003571 R=2
2026-04-03 05:01:16 - ReXMoE - INFO - [1050/5000] loss=1.1320 aux=0.005737 R=2
2026-04-03 05:06:28 - ReXMoE - INFO - [1100/5000] loss=0.7519 aux=0.004974 R=2
2026-04-03 05:11:40 - ReXMoE - INFO - [1150/5000] loss=0.8246 aux=0.003204 R=2
2026-04-03 05:16:55 - ReXMoE - INFO - [1200/5000] loss=1.0041 aux=0.006042 R=2
2026-04-03 05:22:09 - ReXMoE - INFO - [1250/5000] loss=0.6804 aux=0.005859 R=2
2026-04-03 05:27:21 - ReXMoE - INFO - [1300/5000] loss=0.9695 aux=0.011108 R=2
2026-04-03 05:32:33 - ReXMoE - INFO - [1350/5000] loss=1.0448 aux=0.012634 R=2
2026-04-03 05:37:45 - ReXMoE - INFO - [1400/5000] loss=0.7468 aux=0.002136 R=2
2026-04-03 05:42:58 - ReXMoE - INFO - [1450/5000] loss=1.6307 aux=0.003510 R=2
2026-04-03 05:48:10 - ReXMoE - INFO - [1500/5000] loss=1.1833 aux=0.002625 R=2
2026-04-03 05:53:21 - ReXMoE - INFO - [1550/5000] loss=0.9216 aux=0.002991 R=2
2026-04-03 05:58:33 - ReXMoE - INFO - [1600/5000] loss=0.5969 aux=0.003708 R=2
2026-04-03 06:03:46 - ReXMoE - INFO - [1650/5000] loss=0.5240 aux=0.002518 R=2
2026-04-03 06:08:58 - ReXMoE - INFO - [1700/5000] loss=0.7681 aux=0.001785 R=2
2026-04-03 06:14:09 - ReXMoE - INFO - [1750/5000] loss=1.0812 aux=0.002899 R=2
2026-04-03 06:19:21 - ReXMoE - INFO - [1800/5000] loss=0.8171 aux=0.010986 R=2
2026-04-03 06:24:34 - ReXMoE - INFO - [1850/5000] loss=0.9029 aux=0.005371 R=2
2026-04-03 06:29:46 - ReXMoE - INFO - [1900/5000] loss=1.0440 aux=0.001839 R=2
2026-04-03 06:35:00 - ReXMoE - INFO - [1950/5000] loss=1.2026 aux=0.005096 R=2
2026-04-03 06:40:13 - ReXMoE - INFO - [2000/5000] loss=0.7174 aux=0.003372 R=2
2026-04-03 06:45:25 - ReXMoE - INFO - [2050/5000] loss=1.5737 aux=0.003571 R=2
2026-04-03 06:50:37 - ReXMoE - INFO - [2100/5000] loss=0.8508 aux=0.003403 R=2
2026-04-03 06:55:51 - ReXMoE - INFO - [2150/5000] loss=0.7965 aux=0.001656 R=2
2026-04-03 07:01:02 - ReXMoE - INFO - [2200/5000] loss=1.3079 aux=0.002747 R=2
2026-04-03 07:06:14 - ReXMoE - INFO - [2250/5000] loss=0.9750 aux=0.002228 R=2
2026-04-03 07:11:28 - ReXMoE - INFO - [2300/5000] loss=0.9549 aux=0.002228 R=2
2026-04-03 07:16:40 - ReXMoE - INFO - [2350/5000] loss=1.2216 aux=0.004089 R=2
2026-04-03 07:21:53 - ReXMoE - INFO - [2400/5000] loss=0.9801 aux=0.002289 R=2
2026-04-03 07:27:07 - ReXMoE - INFO - [2450/5000] loss=1.6587 aux=0.001602 R=2
2026-04-03 07:32:23 - ReXMoE - INFO - [2500/5000] loss=1.7420 aux=0.014648 R=3
2026-04-03 07:39:14 - ReXMoE - INFO - [2550/5000] loss=1.0498 aux=0.001801 R=3
2026-04-03 07:46:08 - ReXMoE - INFO - [2600/5000] loss=0.7848 aux=0.002792 R=3
2026-04-03 07:53:01 - ReXMoE - INFO - [2650/5000] loss=0.6119 aux=0.000992 R=3
2026-04-03 07:59:55 - ReXMoE - INFO - [2700/5000] loss=1.0871 aux=0.002014 R=3
2026-04-03 08:06:48 - ReXMoE - INFO - [2750/5000] loss=1.0422 aux=0.001411 R=3
2026-04-03 08:13:45 - ReXMoE - INFO - [2800/5000] loss=1.0147 aux=0.002762 R=3
2026-04-03 08:20:37 - ReXMoE - INFO - [2850/5000] loss=0.6756 aux=0.001953 R=3
2026-04-03 08:27:28 - ReXMoE - INFO - [2900/5000] loss=0.6243 aux=0.001671 R=3
2026-04-03 08:34:22 - ReXMoE - INFO - [2950/5000] loss=0.8838 aux=0.004974 R=3
2026-04-03 08:41:13 - ReXMoE - INFO - [3000/5000] loss=0.7627 aux=0.002060 R=3
2026-04-03 08:48:05 - ReXMoE - INFO - [3050/5000] loss=0.8120 aux=0.000668 R=3
2026-04-03 08:54:56 - ReXMoE - INFO - [3100/5000] loss=0.9701 aux=0.002121 R=3
2026-04-03 09:01:47 - ReXMoE - INFO - [3150/5000] loss=0.8151 aux=0.001289 R=3
2026-04-03 09:08:39 - ReXMoE - INFO - [3200/5000] loss=0.6943 aux=0.002777 R=3
2026-04-03 09:15:30 - ReXMoE - INFO - [3250/5000] loss=0.9401 aux=0.002350 R=3
2026-04-03 09:22:20 - ReXMoE - INFO - [3300/5000] loss=0.7034 aux=0.007935 R=3
2026-04-03 09:29:11 - ReXMoE - INFO - [3350/5000] loss=1.1980 aux=0.003006 R=3
2026-04-03 09:36:04 - ReXMoE - INFO - [3400/5000] loss=0.6413 aux=0.002045 R=3
2026-04-03 09:43:01 - ReXMoE - INFO - [3450/5000] loss=1.1729 aux=0.001686 R=3
2026-04-03 09:49:52 - ReXMoE - INFO - [3500/5000] loss=1.1667 aux=0.002045 R=3
2026-04-03 09:56:42 - ReXMoE - INFO - [3550/5000] loss=0.3543 aux=0.007324 R=3
2026-04-03 10:03:29 - ReXMoE - INFO - [3600/5000] loss=1.0002 aux=0.002792 R=3
2026-04-03 10:10:20 - ReXMoE - INFO - [3650/5000] loss=0.8748 aux=0.001503 R=3
2026-04-03 10:17:12 - ReXMoE - INFO - [3700/5000] loss=0.9026 aux=0.021118 R=3
2026-04-03 10:24:05 - ReXMoE - INFO - [3750/5000] loss=0.3710 aux=0.002182 R=3
2026-04-03 10:30:57 - ReXMoE - INFO - [3800/5000] loss=1.2199 aux=0.001564 R=3
2026-04-03 10:37:48 - ReXMoE - INFO - [3850/5000] loss=0.4812 aux=0.008057 R=3
2026-04-03 10:44:38 - ReXMoE - INFO - [3900/5000] loss=0.9683 aux=0.002487 R=3
2026-04-03 10:51:31 - ReXMoE - INFO - [3950/5000] loss=0.7649 aux=0.001732 R=3
2026-04-03 10:58:23 - ReXMoE - INFO - [4000/5000] loss=0.7234 aux=0.001839 R=3
2026-04-03 11:05:13 - ReXMoE - INFO - [4050/5000] loss=0.7793 aux=0.001289 R=3
2026-04-03 11:12:02 - ReXMoE - INFO - [4100/5000] loss=1.2237 aux=0.001968 R=3
2026-04-03 11:18:51 - ReXMoE - INFO - [4150/5000] loss=1.0040 aux=0.002701 R=3
2026-04-03 11:25:38 - ReXMoE - INFO - [4200/5000] loss=0.4700 aux=0.001945 R=3
2026-04-03 11:32:25 - ReXMoE - INFO - [4250/5000] loss=0.6833 aux=0.004486 R=3
2026-04-03 11:39:11 - ReXMoE - INFO - [4300/5000] loss=0.8191 aux=0.003754 R=3
2026-04-03 11:45:56 - ReXMoE - INFO - [4350/5000] loss=0.3914 aux=0.001312 R=3
2026-04-03 11:52:41 - ReXMoE - INFO - [4400/5000] loss=0.9623 aux=0.001854 R=3
2026-04-03 11:59:28 - ReXMoE - INFO - [4450/5000] loss=0.6550 aux=0.005615 R=3
2026-04-03 12:06:15 - ReXMoE - INFO - [4500/5000] loss=0.9616 aux=0.002777 R=3
2026-04-03 12:13:01 - ReXMoE - INFO - [4550/5000] loss=0.5557 aux=0.008789 R=3
2026-04-03 12:19:46 - ReXMoE - INFO - [4600/5000] loss=0.6275 aux=0.018555 R=3
2026-04-03 12:26:33 - ReXMoE - INFO - [4650/5000] loss=1.2395 aux=0.001549 R=3
2026-04-03 12:33:18 - ReXMoE - INFO - [4700/5000] loss=0.6769 aux=0.002060 R=3
2026-04-03 12:40:05 - ReXMoE - INFO - [4750/5000] loss=1.1499 aux=0.006348 R=3
2026-04-03 12:46:50 - ReXMoE - INFO - [4800/5000] loss=0.7449 aux=0.001022 R=3
2026-04-03 12:53:34 - ReXMoE - INFO - [4850/5000] loss=0.8246 aux=0.001823 R=3
2026-04-03 13:00:22 - ReXMoE - INFO - [4900/5000] loss=0.9550 aux=0.002029 R=3
2026-04-03 13:07:10 - ReXMoE - INFO - [4950/5000] loss=1.2535 aux=0.001610 R=3
2026-04-03 13:13:50 - ReXMoE - INFO -
[Step 5000/5000] Running evaluation at eval_steps...
2026-04-03 13:13:50 - ReXMoE - INFO -
Evaluating model with 3 sample prompts...
2026-04-03 13:13:52 - ReXMoE - INFO -
--- Prompt 1/3 ---
2026-04-03 13:13:52 - ReXMoE - INFO - Instruction: What is the capital of France?
2026-04-03 13:13:52 - ReXMoE - INFO - Input: None
2026-04-03 13:13:52 - ReXMoE - INFO - Generated completion (len 9): The capital of France is Paris.
2026-04-03 13:14:11 - ReXMoE - INFO -
--- Prompt 2/3 ---
2026-04-03 13:14:11 - ReXMoE - INFO - Instruction: High-pressure systems stop air from rising into the colder regions of the atmosphere where water can condense. What will most likely result if a high-pressure system remains in an area for a long period of time?
A. fog
B. rain
C. drought
D. tornado
Answer:
2026-04-03 13:14:11 - ReXMoE - INFO - Input: None
2026-04-03 13:14:11 - ReXMoE - INFO - Generated completion (len 77): A. fog

High-pressure systems often lead to fog formation because they can hold moisture and prevent it from evaporating. This can occur when the high-pressure system remains in an area for a long period of time. Fog forms when moist air cools and condenses into water droplets near the surface of the Earth.
2026-04-03 13:14:13 - ReXMoE - INFO -
--- Prompt 3/3 ---
2026-04-03 13:14:13 - ReXMoE - INFO - Instruction: Given the fact: predators eat prey
Question: Predators eat
A. lions
B. humans
C. bunnies
D. grass
Answer:
2026-04-03 13:14:13 - ReXMoE - INFO - Input: None
2026-04-03 13:14:13 - ReXMoE - INFO - Generated completion (len 7): C. bunnies
2026-04-03 13:14:13 - ReXMoE - INFO - Evaluation of all 3 prompts complete.
2026-04-03 13:14:13 - ReXMoE - INFO -
[Step 5000] Analyzing routing patterns at eval_steps...
2026-04-03 13:14:37 - ReXMoE - INFO -
Analyzing ACTUAL routing patterns from 10 batches (15,294 tokens)
2026-04-03 13:14:37 - ReXMoE - INFO - Current reuse scale: R=3
2026-04-03 13:14:37 - ReXMoE - INFO -
[IG-MET Pruning Report]:
2026-04-03 13:14:37 - ReXMoE - INFO - Global: 0/0 UNIQUE experts pruned (0.0%) | threshold=-1.000000
2026-04-03 13:14:37 - ReXMoE - INFO - Cross-Layer Routing Distribution (ACTUAL selections):
2026-04-03 13:14:37 - ReXMoE - INFO - Same layer (i): 781,056 ( 29.8%)
2026-04-03 13:14:37 - ReXMoE - INFO - Previous layer (i-1): 965,741 ( 36.8%)
2026-04-03 13:14:37 - ReXMoE - INFO - Next layer (i+1): 815,206 ( 31.1%)
2026-04-03 13:14:37 - ReXMoE - INFO - Distant layers: 59,437 ( 2.3%)
2026-04-03 13:14:37 - ReXMoE - INFO - Sample Layer-Specific Routing Patterns:
2026-04-03 13:14:37 - ReXMoE - INFO -
Layer 8:
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 7 from layer 9 ( L9): 5,937 times ( 38.8%)
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 7 from layer 7 ( L7): 5,895 times ( 38.5%)
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 14 from layer 9 ( L9): 5,822 times ( 38.1%)
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 14 from layer 7 ( L7): 5,618 times ( 36.7%)
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 2 from layer 7 ( L7): 4,200 times ( 27.5%)
2026-04-03 13:14:37 - ReXMoE - INFO -
Layer 16:
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 15 ( L15): 8,999 times ( 58.8%)
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 17 ( L17): 7,847 times ( 51.3%)
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 10 from layer 15 ( L15): 6,002 times ( 39.2%)
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 10 from layer 17 ( L17): 5,879 times ( 38.4%)
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 15 from layer 15 ( L15): 4,030 times ( 26.4%)
2026-04-03 13:14:37 - ReXMoE - INFO -
Layer 24:
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 23 ( L23): 9,213 times ( 60.2%)
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 9 from layer 23 ( L23): 7,912 times ( 51.7%)
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 25 ( L25): 6,819 times ( 44.6%)
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 9 from layer 25 ( L25): 6,403 times ( 41.9%)
2026-04-03 13:14:37 - ReXMoE - INFO - Expert 8 from layer 24 (same): 4,350 times ( 28.4%)
2026-04-03 13:14:37 - ReXMoE - INFO - ✅ Cross-layer expert reuse detected: 70.2% of routing uses adjacent layers
2026-04-03 13:14:37 - ReXMoE - INFO -
[Step 5000] Saving checkpoint at eval_steps to ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3...
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.0.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.0.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.1.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.1.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.2.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.2.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.3.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.3.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.4.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.4.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.5.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.5.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.6.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.6.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.7.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.7.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.8.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.8.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.9.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.9.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.10.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.10.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.11.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.11.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.12.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.12.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.13.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.13.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.14.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.14.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.15.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.15.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.16.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.16.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.17.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.17.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.18.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.18.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.19.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.19.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.20.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.20.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.21.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.21.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.22.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.22.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.23.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.23.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.24.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.24.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.25.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.25.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.26.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.26.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.27.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.27.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.28.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.28.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.29.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.29.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.30.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.30.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.31.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.31.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:14:37 - ReXMoE - INFO - ✓ Saved trained router weights: 96 parameters
2026-04-03 13:14:37 - ReXMoE - INFO - File: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/rexmoe_routers.pt
2026-04-03 13:14:37 - ReXMoE - INFO - Size: 12.03 MB
2026-04-03 13:14:37 - ReXMoE - INFO -
Also saving full model with ReXMoE architecture...
2026-04-03 13:14:39 - ReXMoE - INFO -
Merging LoRA adapters into base weights and saving to: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/merged
2026-04-03 13:15:00 - ReXMoE - INFO - ✓ Saved merged full model (base+routers+LoRA) for one-step loading
2026-04-03 13:15:00 - ReXMoE - INFO -
============================================================
2026-04-03 13:15:00 - ReXMoE - INFO - Epoch 1 Summary:
2026-04-03 13:15:00 - ReXMoE - INFO - Average LM Loss: 0.9498
2026-04-03 13:15:00 - ReXMoE - INFO - Average Aux Loss: 0.008843
2026-04-03 13:15:00 - ReXMoE - INFO - Average Total Loss: 0.9586
2026-04-03 13:15:00 - ReXMoE - INFO - Final R: 3
2026-04-03 13:15:00 - ReXMoE - INFO -
Evaluating model with 3 sample prompts...
2026-04-03 13:15:02 - ReXMoE - INFO -
--- Prompt 1/3 ---
2026-04-03 13:15:02 - ReXMoE - INFO - Instruction: What is the capital of France?
2026-04-03 13:15:02 - ReXMoE - INFO - Input: None
2026-04-03 13:15:02 - ReXMoE - INFO - Generated completion (len 9): The capital of France is Paris.
2026-04-03 13:15:04 - ReXMoE - INFO -
--- Prompt 2/3 ---
2026-04-03 13:15:04 - ReXMoE - INFO - Instruction: High-pressure systems stop air from rising into the colder regions of the atmosphere where water can condense. What will most likely result if a high-pressure system remains in an area for a long period of time?
A. fog
B. rain
C. drought
D. tornado
Answer:
2026-04-03 13:15:04 - ReXMoE - INFO - Input: None
2026-04-03 13:15:04 - ReXMoE - INFO - Generated completion (len 5): A. fog
2026-04-03 13:15:05 - ReXMoE - INFO -
--- Prompt 3/3 ---
2026-04-03 13:15:05 - ReXMoE - INFO - Instruction: Given the fact: predators eat prey
Question: Predators eat
A. lions
B. humans
C. bunnies
D. grass
Answer:
2026-04-03 13:15:05 - ReXMoE - INFO - Input: None
2026-04-03 13:15:05 - ReXMoE - INFO - Generated completion (len 7): C. bunnies
2026-04-03 13:15:05 - ReXMoE - INFO - Evaluation of all 3 prompts complete.
2026-04-03 13:15:05 - ReXMoE - INFO - New best epoch 1 with avg LM loss 0.9498 — saving checkpoint to ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.0.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.0.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.1.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.1.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.2.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.2.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.3.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.3.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.4.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.4.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.5.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.5.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.6.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.6.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.7.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.7.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.8.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 345 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.8.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 346 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.9.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 347 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.9.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 348 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.10.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 349 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.10.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 350 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.11.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 351 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.11.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 352 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.12.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 353 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.12.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 354 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.13.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 355 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.13.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 356 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.14.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 357 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.14.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 358 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.15.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 359 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.15.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 360 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.16.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 361 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.16.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 362 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.17.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 363 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.17.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 364 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.18.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 365 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.18.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 366 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.19.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 367 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.19.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 368 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.20.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 369 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.20.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 370 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.21.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 371 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.21.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 372 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.22.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 373 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.22.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 374 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.23.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 375 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.23.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 376 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.24.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 377 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.24.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 378 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.25.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 379 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.25.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 380 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.26.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 381 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.26.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 382 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.27.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 383 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.27.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 384 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.28.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 385 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.28.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 386 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.29.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 387 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.29.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 388 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.30.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 389 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.30.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 390 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.31.block_sparse_moe.router.ema_utilization with shape torch.Size([48]) for pruning evaluation
|
| 391 |
+
2026-04-03 13:15:05 - ReXMoE - INFO - Saving buffer base_model.model.model.layers.31.block_sparse_moe.router.mask_threshold with shape torch.Size([]) for pruning evaluation
|
| 392 |
+
2026-04-03 13:15:06 - ReXMoE - INFO - ✓ Saved trained router weights: 96 parameters
|
| 393 |
+
2026-04-03 13:15:06 - ReXMoE - INFO - File: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/rexmoe_routers.pt
|
| 394 |
+
2026-04-03 13:15:06 - ReXMoE - INFO - Size: 12.03 MB
|
| 395 |
+
2026-04-03 13:15:06 - ReXMoE - INFO -
|
| 396 |
+
Also saving full model with ReXMoE architecture...
|
| 397 |
+
2026-04-03 13:15:06 - ReXMoE - INFO -
|
| 398 |
+
Merging LoRA adapters into base weights and saving to: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/merged
|
| 399 |
+
2026-04-03 13:15:44 - ReXMoE - INFO - ✓ Saved merged full model (base+routers+LoRA) for one-step loading
|
| 400 |
+
2026-04-03 13:15:44 - ReXMoE - INFO -
|
| 401 |
+
📊 Convergence Metrics:
|
| 402 |
+
2026-04-03 13:15:44 - ReXMoE - INFO - Convergence Metrics:
|
| 403 |
+
2026-04-03 13:15:44 - ReXMoE - INFO - Avg Router Grad Norm: 0.084278
|
| 404 |
+
2026-04-03 13:15:44 - ReXMoE - INFO - Current Learning Rate: 2.00e-05
|
| 405 |
+
2026-04-03 13:15:44 - ReXMoE - INFO - ℹ️ Collecting convergence data (need 5 epochs minimum)...
|
| 406 |
+
2026-04-03 13:15:44 - ReXMoE - INFO - Routing Pattern Analysis (Epoch 1):
|
| 407 |
+
2026-04-03 13:15:59 - ReXMoE - INFO -
|
| 408 |
+
Analyzing ACTUAL routing patterns from 10 batches (17,341 tokens)
|
| 409 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Current reuse scale: R=3
|
| 410 |
+
2026-04-03 13:15:59 - ReXMoE - INFO -
|
| 411 |
+
[IG-MET Pruning Report]:
|
| 412 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Global: 0/0 UNIQUE experts pruned (0.0%) | threshold=-1.000000
|
| 413 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Cross-Layer Routing Distribution (ACTUAL selections):
|
| 414 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Same layer (i): 869,591 ( 33.2%)
|
| 415 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Previous layer (i-1): 896,913 ( 34.2%)
|
| 416 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Next layer (i+1): 797,210 ( 30.4%)
|
| 417 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Distant layers: 57,726 ( 2.2%)
|
| 418 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Sample Layer-Specific Routing Patterns:
|
| 419 |
+
2026-04-03 13:15:59 - ReXMoE - INFO -
|
| 420 |
+
Layer 8:
|
| 421 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 7 from layer 9 ( L9): 6,917 times ( 39.9%)
|
| 422 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 14 from layer 9 ( L9): 6,553 times ( 37.8%)
|
| 423 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 7 from layer 7 ( L7): 6,305 times ( 36.4%)
|
| 424 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 14 from layer 7 ( L7): 5,503 times ( 31.7%)
|
| 425 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 2 from layer 7 ( L7): 3,988 times ( 23.0%)
|
| 426 |
+
2026-04-03 13:15:59 - ReXMoE - INFO -
|
| 427 |
+
Layer 16:
|
| 428 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 8 from layer 15 ( L15): 8,873 times ( 51.2%)
|
| 429 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 8 from layer 17 ( L17): 8,226 times ( 47.4%)
|
| 430 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 10 from layer 15 ( L15): 5,752 times ( 33.2%)
|
| 431 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 10 from layer 17 ( L17): 4,996 times ( 28.8%)
|
| 432 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 9 from layer 16 (same): 3,718 times ( 21.4%)
|
| 433 |
+
2026-04-03 13:15:59 - ReXMoE - INFO -
|
| 434 |
+
Layer 24:
|
| 435 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 8 from layer 23 ( L23): 9,676 times ( 55.8%)
|
| 436 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 8 from layer 25 ( L25): 7,087 times ( 40.9%)
|
| 437 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 9 from layer 23 ( L23): 6,982 times ( 40.3%)
|
| 438 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 9 from layer 25 ( L25): 4,908 times ( 28.3%)
|
| 439 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Expert 4 from layer 24 (same): 3,902 times ( 22.5%)
|
| 440 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - ✅ Cross-layer expert reuse detected: 66.8% of routing uses adjacent layers
|
| 441 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - LR stepped to: 2.00e-05
|
| 442 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - ================================================================================
|
| 443 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Training Convergence Summary
|
| 444 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - ================================================================================
|
| 445 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Router Gradient Norms Over Epochs:
|
| 446 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Epoch 1: 0.084278
|
| 447 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Auxiliary Loss Over Epochs:
|
| 448 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Epoch 1: 0.008843
|
| 449 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Convergence Status: Insufficient data (< 5 epochs)
|
| 450 |
+
2026-04-03 13:15:59 - ReXMoE - INFO -
|
| 451 |
+
Saving trained router weights only...
|
| 452 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - ✓ Saved trained router weights: 96 parameters
|
| 453 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - File: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/rexmoe_routers.pt
|
| 454 |
+
2026-04-03 13:15:59 - ReXMoE - INFO - Size: 12.03 MB
|
| 455 |
+
2026-04-03 13:15:59 - ReXMoE - INFO -
|
| 456 |
+
Also saving full model with ReXMoE architecture...
|
| 457 |
+
2026-04-03 13:16:00 - ReXMoE - INFO -
|
| 458 |
+
Merging LoRA adapters into base weights and saving to: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3/merged
|
| 459 |
+
2026-04-03 13:16:32 - ReXMoE - INFO - ✓ Saved merged full model (base+routers+LoRA) for one-step loading
|
| 460 |
+
2026-04-03 13:16:32 - ReXMoE - INFO - ================================================================================
|
| 461 |
+
2026-04-03 13:16:32 - ReXMoE - INFO - ✓ Training complete. Two checkpoint formats saved:
|
| 462 |
+
2026-04-03 13:16:32 - ReXMoE - INFO - 1. Router weights only: rexmoe_routers.pt (portable)
|
| 463 |
+
2026-04-03 13:16:32 - ReXMoE - INFO - 2. Full model: pytorch_model.bin (requires rexmoe_architecture.py)
|
| 464 |
+
2026-04-03 13:16:32 - ReXMoE - INFO -
|
| 465 |
+
Checkpoint directory: ./0304_033137_10_rexmoe_natural_phi_mini_moe_R3
|
| 466 |
+
2026-04-03 13:16:32 - ReXMoE - INFO - Full model size: 0.00 GB
|
| 467 |
+
2026-04-03 13:16:32 - ReXMoE - INFO - ================================================================================
|
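The routing percentages reported in the log above can be rechecked from the raw counts. The sketch below is not part of the training code: it only redoes the arithmetic on figures copied from the log, and all variable names are illustrative. One tentative observation: the 66.8% reported as adjacent-layer reuse equals the share of all selections outside the same layer (previous + next + distant), while previous + next alone come to 64.6%. The per-expert percentages appear to be counts divided by the 17,341 analyzed tokens.

```python
# Counts copied from the "Cross-Layer Routing Distribution" log lines.
counts = {
    "same": 869_591,
    "previous": 896_913,
    "next": 797_210,
    "distant": 57_726,
}

total = sum(counts.values())

# Share of each category, in percent, rounded as in the log.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

# Two candidate readings of "adjacent layers".
adjacent = round(100 * (counts["previous"] + counts["next"]) / total, 1)
non_same = round(100 * (total - counts["same"]) / total, 1)

# Per-expert figure for "Expert 7 from layer 9" in the Layer 8 sample,
# assuming the denominator is the number of analyzed tokens.
tokens = 17_341
layer8_top = round(100 * 6_917 / tokens, 1)

print(shares)
print("previous + next:", adjacent)
print("everything but same layer:", non_same)
print("layer 8, expert 7 from L9:", layer8_top)
```

Because each token selects more than one expert, the per-expert percentages within a layer are not expected to sum to 100.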
merged/chat_template.jinja
ADDED
|
@@ -0,0 +1,4 @@
|
| 1 |
+
{% for message in messages %}{{'<|' + message['role'] + '|>' + '
|
| 2 |
+
' + message['content'] + '<|end|>
|
| 3 |
+
' }}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>
|
| 4 |
+
' }}{% else %}{{ eos_token }}{% endif %}
|
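To make the template above easier to read, here is a minimal plain-Python sketch of what it appears to render. It is an illustration only, not the code Transformers runs: `render_chat` is a hypothetical helper, and the default `eos_token` value is taken from the tokenizer configuration in this commit.

```python
def render_chat(messages, add_generation_prompt=False, eos_token="<|endoftext|>"):
    """Mirror the Jinja chat template using ordinary string concatenation."""
    parts = []
    for message in messages:
        # Each turn: role tag, newline, content, end tag, newline.
        parts.append(
            "<|" + message["role"] + "|>\n" + message["content"] + "<|end|>\n"
        )
    # Either open an assistant turn for generation or close with the EOS token.
    parts.append("<|assistant|>\n" if add_generation_prompt else eos_token)
    return "".join(parts)


prompt = render_chat(
    [{"role": "user", "content": "Who are you?"}],
    add_generation_prompt=True,
)
print(prompt)
```

With `add_generation_prompt=True` the rendered text ends in an open assistant turn, which is the form normally used for inference. Without it, the text ends in the EOS token.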
merged/config.json
ADDED
|
@@ -0,0 +1,41 @@
|
| 1 |
+
{
|
| 2 |
+
"architectures": [
|
| 3 |
+
"PhimoeForCausalLM"
|
| 4 |
+
],
|
| 5 |
+
"attention_bias": true,
|
| 6 |
+
"attention_dropout": 0.0,
|
| 7 |
+
"auto_map": {
|
| 8 |
+
"AutoConfig": "configuration_slimmoe.PhiMoEConfig",
|
| 9 |
+
"AutoModelForCausalLM": "modeling_slimmoe.PhiMoEForCausalLM"
|
| 10 |
+
},
|
| 11 |
+
"bos_token_id": 1,
|
| 12 |
+
"dtype": "bfloat16",
|
| 13 |
+
"eos_token_id": 32000,
|
| 14 |
+
"expert_dropout": 0.0,
|
| 15 |
+
"head_dim": 128,
|
| 16 |
+
"hidden_act": "silu",
|
| 17 |
+
"hidden_dropout": 0.0,
|
| 18 |
+
"hidden_size": 4096,
|
| 19 |
+
"initializer_range": 0.02,
|
| 20 |
+
"input_jitter_noise": 0.01,
|
| 21 |
+
"intermediate_size": 960,
|
| 22 |
+
"lm_head_bias": true,
|
| 23 |
+
"max_position_embeddings": 4096,
|
| 24 |
+
"model_type": "phimoe",
|
| 25 |
+
"num_attention_heads": 32,
|
| 26 |
+
"num_experts_per_tok": 2,
|
| 27 |
+
"num_hidden_layers": 32,
|
| 28 |
+
"num_key_value_heads": 8,
|
| 29 |
+
"num_local_experts": 16,
|
| 30 |
+
"output_router_logits": false,
|
| 31 |
+
"rms_norm_eps": 1e-05,
|
| 32 |
+
"rope_scaling": null,
|
| 33 |
+
"rope_theta": 10000.0,
|
| 34 |
+
"router_aux_loss_coef": 0.0,
|
| 35 |
+
"router_jitter_noise": 0.01,
|
| 36 |
+
"sliding_window": 2047,
|
| 37 |
+
"tie_word_embeddings": false,
|
| 38 |
+
"transformers_version": "4.57.3",
|
| 39 |
+
"use_cache": true,
|
| 40 |
+
"vocab_size": 32064
|
| 41 |
+
}
|
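A few quantities follow directly from the values in this config. The sketch below derives them. The last calculation is an assumption: multiplying `num_local_experts` by the reuse scale R=3 from the training log gives 48, which matches the `torch.Size([48])` shape of the `ema_utilization` buffers, but the commit does not confirm that this is how ReXMoE sizes its router.

```python
# Values copied from merged/config.json.
config = {
    "hidden_size": 4096,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "head_dim": 128,
    "num_hidden_layers": 32,
    "num_local_experts": 16,
    "num_experts_per_tok": 2,
}

# Width of the query projection; here it equals hidden_size.
q_width = config["num_attention_heads"] * config["head_dim"]

# Grouped-query attention: several query heads share one key/value head.
kv_width = config["num_key_value_heads"] * config["head_dim"]
queries_per_kv = config["num_attention_heads"] // config["num_key_value_heads"]

# Assumption: reuse scale taken from the log line "Current reuse scale: R=3".
reuse_scale = 3
candidate_experts = config["num_local_experts"] * reuse_scale

print(q_width, kv_width, queries_per_kv, candidate_experts)
```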
merged/generation_config.json
ADDED
|
@@ -0,0 +1,11 @@
|
| 1 |
+
{
|
| 2 |
+
"_from_model_config": true,
|
| 3 |
+
"bos_token_id": 1,
|
| 4 |
+
"eos_token_id": [
|
| 5 |
+
32000,
|
| 6 |
+
32001,
|
| 7 |
+
32007
|
| 8 |
+
],
|
| 9 |
+
"pad_token_id": 32000,
|
| 10 |
+
"transformers_version": "4.57.3"
|
| 11 |
+
}
|
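The three `eos_token_id` values can be matched against the `added_tokens_decoder` entries in the tokenizer configuration of this commit. The following sketch shows that mapping and a simple stop check. `should_stop` is a hypothetical helper for illustration, not a Transformers function.

```python
# IDs from merged/generation_config.json.
eos_token_ids = [32000, 32001, 32007]

# Matching entries from added_tokens_decoder in tokenizer_config.json.
added_tokens = {
    32000: "<|endoftext|>",
    32001: "<|assistant|>",
    32007: "<|end|>",
}

stop_tokens = [added_tokens[token_id] for token_id in eos_token_ids]


def should_stop(token_id, stop_ids=frozenset(eos_token_ids)):
    """Return True when a generated token ID is one of the configured stop IDs."""
    return token_id in stop_ids


print(stop_tokens)
```

Note that `pad_token_id` is also 32000, so padding and end-of-text share one token in this configuration.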
merged/model-00001-of-00004.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:6484d5015f8ea3efdd33cefc1936368eddd1c2dcbf11e56748ef7479d2d8438d
|
| 3 |
+
size 4996706662
|
merged/model-00002-of-00004.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:327e8d8adf238d2ec2790faccbfc32e82a4c00171648c52ef221ccc458558323
|
| 3 |
+
size 4997911740
|
merged/model-00003-of-00004.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:e1e6c4947cb7311b368c5d85243655b096db10bb6daf7432c3527ab912c79986
|
| 3 |
+
size 4999325054
|
merged/model-00004-of-00004.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:097bbe013b37437b336e70844d6910a88d9956f3b292a8310f795d21946e11b4
|
| 3 |
+
size 309969096
|
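Each shard above is stored as a Git LFS pointer: three `key value` lines giving the spec version, the SHA-256 object ID, and the size in bytes. As a rough sketch, the pointers can be parsed and the sizes summed to estimate the merged checkpoint size. `parse_lfs_pointer` is a hypothetical helper written for this example. The result may help interpret the "Full model size: 0.00 GB" log line, which possibly measured a different file, though that is only a guess.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash and size fields."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer contents copied from the four merged/model-*.safetensors diffs.
pointers = [
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:6484d5015f8ea3efdd33cefc1936368eddd1c2dcbf11e56748ef7479d2d8438d\n"
    "size 4996706662\n",
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:327e8d8adf238d2ec2790faccbfc32e82a4c00171648c52ef221ccc458558323\n"
    "size 4997911740\n",
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:e1e6c4947cb7311b368c5d85243655b096db10bb6daf7432c3527ab912c79986\n"
    "size 4999325054\n",
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:097bbe013b37437b336e70844d6910a88d9956f3b292a8310f795d21946e11b4\n"
    "size 309969096\n",
]

shards = [parse_lfs_pointer(p) for p in pointers]
total_bytes = sum(shard["size"] for shard in shards)

print(f"{len(shards)} shards, {total_bytes} bytes")
print(f"{total_bytes / 10**9:.2f} GB (decimal), {total_bytes / 2**30:.2f} GiB (binary)")
```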
merged/model.safetensors.index.json
ADDED
|
The diff for this file is too large to render.
|
merged/special_tokens_map.json
ADDED
|
@@ -0,0 +1,30 @@
|
| 1 |
+
{
|
| 2 |
+
"bos_token": {
|
| 3 |
+
"content": "<s>",
|
| 4 |
+
"lstrip": false,
|
| 5 |
+
"normalized": false,
|
| 6 |
+
"rstrip": false,
|
| 7 |
+
"single_word": false
|
| 8 |
+
},
|
| 9 |
+
"eos_token": {
|
| 10 |
+
"content": "<|endoftext|>",
|
| 11 |
+
"lstrip": false,
|
| 12 |
+
"normalized": false,
|
| 13 |
+
"rstrip": false,
|
| 14 |
+
"single_word": false
|
| 15 |
+
},
|
| 16 |
+
"pad_token": {
|
| 17 |
+
"content": "<|endoftext|>",
|
| 18 |
+
"lstrip": false,
|
| 19 |
+
"normalized": false,
|
| 20 |
+
"rstrip": false,
|
| 21 |
+
"single_word": false
|
| 22 |
+
},
|
| 23 |
+
"unk_token": {
|
| 24 |
+
"content": "<unk>",
|
| 25 |
+
"lstrip": false,
|
| 26 |
+
"normalized": false,
|
| 27 |
+
"rstrip": false,
|
| 28 |
+
"single_word": false
|
| 29 |
+
}
|
| 30 |
+
}
|
merged/tokenizer.json
ADDED
|
The diff for this file is too large to render.
|
merged/tokenizer_config.json
ADDED
|
@@ -0,0 +1,131 @@
|
| 1 |
+
{
|
| 2 |
+
"add_bos_token": false,
|
| 3 |
+
"add_eos_token": false,
|
| 4 |
+
"add_prefix_space": null,
|
| 5 |
+
"added_tokens_decoder": {
|
| 6 |
+
"0": {
|
| 7 |
+
"content": "<unk>",
|
| 8 |
+
"lstrip": false,
|
| 9 |
+
"normalized": false,
|
| 10 |
+
"rstrip": false,
|
| 11 |
+
"single_word": false,
|
| 12 |
+
"special": true
|
| 13 |
+
},
|
| 14 |
+
"1": {
|
| 15 |
+
"content": "<s>",
|
| 16 |
+
"lstrip": false,
|
| 17 |
+
"normalized": false,
|
| 18 |
+
"rstrip": false,
|
| 19 |
+
"single_word": false,
|
| 20 |
+
"special": true
|
| 21 |
+
},
|
| 22 |
+
"2": {
|
| 23 |
+
"content": "</s>",
|
| 24 |
+
"lstrip": false,
|
| 25 |
+
"normalized": false,
|
| 26 |
+
"rstrip": true,
|
| 27 |
+
"single_word": false,
|
| 28 |
+
"special": false
|
| 29 |
+
},
|
| 30 |
+
"32000": {
|
| 31 |
+
"content": "<|endoftext|>",
|
| 32 |
+
"lstrip": false,
|
| 33 |
+
"normalized": false,
|
| 34 |
+
"rstrip": false,
|
| 35 |
+
"single_word": false,
|
| 36 |
+
"special": true
|
| 37 |
+
},
|
| 38 |
+
"32001": {
|
| 39 |
+
"content": "<|assistant|>",
|
| 40 |
+
"lstrip": false,
|
| 41 |
+
"normalized": false,
|
| 42 |
+
"rstrip": true,
|
| 43 |
+
"single_word": false,
|
| 44 |
+
"special": true
|
| 45 |
+
},
|
| 46 |
+
"32002": {
|
| 47 |
+
"content": "<|placeholder1|>",
|
| 48 |
+
"lstrip": false,
|
| 49 |
+
"normalized": false,
|
| 50 |
+
"rstrip": true,
|
| 51 |
+
"single_word": false,
|
| 52 |
+
"special": true
|
| 53 |
+
},
|
| 54 |
+
"32003": {
|
| 55 |
+
"content": "<|placeholder2|>",
|
| 56 |
+
"lstrip": false,
|
| 57 |
+
"normalized": false,
|
| 58 |
+
"rstrip": true,
|
| 59 |
+
"single_word": false,
|
| 60 |
+
"special": true
|
| 61 |
+
},
|
| 62 |
+
"32004": {
|
| 63 |
+
"content": "<|placeholder3|>",
|
| 64 |
+
"lstrip": false,
|
| 65 |
+
"normalized": false,
|
| 66 |
+
"rstrip": true,
|
| 67 |
+
"single_word": false,
|
| 68 |
+
"special": true
|
| 69 |
+
},
|
| 70 |
+
"32005": {
|
| 71 |
+
"content": "<|placeholder4|>",
|
| 72 |
+
"lstrip": false,
|
| 73 |
+
"normalized": false,
|
| 74 |
+
"rstrip": true,
|
| 75 |
+
"single_word": false,
|
| 76 |
+
"special": true
|
| 77 |
+
},
|
| 78 |
+
"32006": {
|
| 79 |
+
"content": "<|system|>",
|
| 80 |
+
"lstrip": false,
|
| 81 |
+
"normalized": false,
|
| 82 |
+
"rstrip": true,
|
| 83 |
+
"single_word": false,
|
| 84 |
+
"special": true
|
| 85 |
+
},
|
| 86 |
+
"32007": {
|
| 87 |
+
"content": "<|end|>",
|
| 88 |
+
"lstrip": false,
|
| 89 |
+
"normalized": false,
|
| 90 |
+
"rstrip": true,
|
| 91 |
+
"single_word": false,
|
| 92 |
+
"special": true
|
| 93 |
+
},
|
| 94 |
+
"32008": {
|
| 95 |
+
"content": "<|placeholder5|>",
|
| 96 |
+
"lstrip": false,
|
| 97 |
+
"normalized": false,
|
| 98 |
+
"rstrip": true,
|
| 99 |
+
"single_word": false,
|
| 100 |
+
"special": true
|
| 101 |
+
},
|
| 102 |
+
"32009": {
|
| 103 |
+
"content": "<|placeholder6|>",
|
| 104 |
+
"lstrip": false,
|
| 105 |
+
"normalized": false,
|
| 106 |
+
"rstrip": true,
|
| 107 |
+
"single_word": false,
|
| 108 |
+
"special": true
|
| 109 |
+
},
|
| 110 |
+
"32010": {
|
| 111 |
+
"content": "<|user|>",
|
| 112 |
+
"lstrip": false,
|
| 113 |
+
"normalized": false,
|
| 114 |
+
"rstrip": true,
|
| 115 |
+
"single_word": false,
|
| 116 |
+
"special": true
|
| 117 |
+
}
|
| 118 |
+
},
|
| 119 |
+
"bos_token": "<s>",
|
| 120 |
+
"clean_up_tokenization_spaces": false,
|
| 121 |
+
"eos_token": "<|endoftext|>",
|
| 122 |
+
"extra_special_tokens": {},
|
| 123 |
+
"legacy": false,
|
| 124 |
+
"model_max_length": 4096,
|
| 125 |
+
"pad_token": "<|endoftext|>",
|
| 126 |
+
"padding_side": "left",
|
| 127 |
+
"sp_model_kwargs": {},
|
| 128 |
+
"tokenizer_class": "LlamaTokenizerFast",
|
| 129 |
+
"unk_token": "<unk>",
|
| 130 |
+
"use_default_system_prompt": false
|
| 131 |
+
}
|
rexmoe_architecture.py
ADDED
|
The diff for this file is too large to render.
|
rexmoe_routers.pt
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:b5d3ffa393ddbb18257baf74cb112aaa8f83f8291906d7763a9a43ea53b0cd98
|
| 3 |
+
size 12618290
|
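The pointer size above can be compared with the "Size: 12.03 MB" line in the training log. The short check below suggests the log figure is in binary megabytes (MiB) rather than decimal megabytes. This is an inference from the numbers, not something the commit states.

```python
# Size in bytes from the rexmoe_routers.pt Git LFS pointer.
size_bytes = 12_618_290

size_mib = round(size_bytes / 2**20, 2)   # binary megabytes (1 MiB = 1,048,576 bytes)
size_mb = round(size_bytes / 10**6, 2)    # decimal megabytes (1 MB = 1,000,000 bytes)

print(size_mib, size_mb)
```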
special_tokens_map.json
ADDED
|
@@ -0,0 +1,30 @@
|
| 1 |
+
{
|
| 2 |
+
"bos_token": {
|
| 3 |
+
"content": "<s>",
|
| 4 |
+
"lstrip": false,
|
| 5 |
+
"normalized": false,
|
| 6 |
+
"rstrip": false,
|
| 7 |
+
"single_word": false
|
| 8 |
+
},
|
| 9 |
+
"eos_token": {
|
| 10 |
+
"content": "<|endoftext|>",
|
| 11 |
+
"lstrip": false,
|
| 12 |
+
"normalized": false,
|
| 13 |
+
"rstrip": false,
|
| 14 |
+
"single_word": false
|
| 15 |
+
},
|
| 16 |
+
"pad_token": {
|
| 17 |
+
"content": "<|endoftext|>",
|
| 18 |
+
"lstrip": false,
|
| 19 |
+
"normalized": false,
|
| 20 |
+
"rstrip": false,
|
| 21 |
+
"single_word": false
|
| 22 |
+
},
|
| 23 |
+
"unk_token": {
|
| 24 |
+
"content": "<unk>",
|
| 25 |
+
"lstrip": false,
|
| 26 |
+
"normalized": false,
|
| 27 |
+
"rstrip": false,
|
| 28 |
+
"single_word": false
|
| 29 |
+
}
|
| 30 |
+
}
|
tokenizer.json
ADDED
|
The diff for this file is too large to render.
|
tokenizer_config.json
ADDED
|
@@ -0,0 +1,131 @@
|
| 1 |
+
{
|
| 2 |
+
"add_bos_token": false,
|
| 3 |
+
"add_eos_token": false,
|
| 4 |
+
"add_prefix_space": null,
|
| 5 |
+
"added_tokens_decoder": {
|
| 6 |
+
"0": {
|
| 7 |
+
"content": "<unk>",
|
| 8 |
+
"lstrip": false,
|
| 9 |
+
"normalized": false,
|
| 10 |
+
"rstrip": false,
|
| 11 |
+
"single_word": false,
|
| 12 |
+
"special": true
|
| 13 |
+
},
|
| 14 |
+
"1": {
|
| 15 |
+
"content": "<s>",
|
| 16 |
+
"lstrip": false,
|
| 17 |
+
"normalized": false,
|
| 18 |
+
"rstrip": false,
|
| 19 |
+
"single_word": false,
|
| 20 |
+
"special": true
|
| 21 |
+
},
|
| 22 |
+
"2": {
|
| 23 |
+
"content": "</s>",
|
| 24 |
+
"lstrip": false,
|
| 25 |
+
"normalized": false,
|
| 26 |
+
"rstrip": true,
|
| 27 |
+
"single_word": false,
|
| 28 |
+
"special": false
|
| 29 |
+
},
|
| 30 |
+
"32000": {
|
| 31 |
+
"content": "<|endoftext|>",
|
| 32 |
+
"lstrip": false,
|
| 33 |
+
"normalized": false,
|
| 34 |
+
"rstrip": false,
|
| 35 |
+
"single_word": false,
|
| 36 |
+
"special": true
|
| 37 |
+
},
|
| 38 |
+
"32001": {
|
| 39 |
+
"content": "<|assistant|>",
|
| 40 |
+
"lstrip": false,
|
| 41 |
+
"normalized": false,
|
| 42 |
+
"rstrip": true,
|
| 43 |
+
"single_word": false,
|
| 44 |
+
"special": true
|
| 45 |
+
},
|
| 46 |
+
"32002": {
|
| 47 |
+
"content": "<|placeholder1|>",
|
| 48 |
+
"lstrip": false,
|
| 49 |
+
"normalized": false,
|
| 50 |
+
"rstrip": true,
|
| 51 |
+
"single_word": false,
|
| 52 |
+
"special": true
|
| 53 |
+
},
|
| 54 |
+
"32003": {
|
| 55 |
+
"content": "<|placeholder2|>",
|
| 56 |
+
"lstrip": false,
|
| 57 |
+
"normalized": false,
|
| 58 |
+
"rstrip": true,
|
| 59 |
+
"single_word": false,
|
| 60 |
+
"special": true
|
| 61 |
+
},
|
| 62 |
+
"32004": {
|
| 63 |
+
"content": "<|placeholder3|>",
|
| 64 |
+
"lstrip": false,
|
| 65 |
+
"normalized": false,
|
| 66 |
+
"rstrip": true,
|
| 67 |
+
"single_word": false,
|
| 68 |
+
"special": true
|
| 69 |
+
},
|
| 70 |
+
"32005": {
|
| 71 |
+
"content": "<|placeholder4|>",
|
| 72 |
+
"lstrip": false,
|
| 73 |
+
"normalized": false,
|
| 74 |
+
"rstrip": true,
|
| 75 |
+
"single_word": false,
|
| 76 |
+
"special": true
|
| 77 |
+
},
|
| 78 |
+
"32006": {
|
| 79 |
+
"content": "<|system|>",
|
| 80 |
+
"lstrip": false,
|
| 81 |
+
"normalized": false,
|
| 82 |
+
"rstrip": true,
|
| 83 |
+
"single_word": false,
|
| 84 |
+
"special": true
|
| 85 |
+
},
|
| 86 |
+
"32007": {
|
| 87 |
+
"content": "<|end|>",
|
| 88 |
+
"lstrip": false,
|
| 89 |
+
"normalized": false,
|
| 90 |
+
"rstrip": true,
|
| 91 |
+
"single_word": false,
|
| 92 |
+
"special": true
|
| 93 |
+
},
|
| 94 |
+
"32008": {
|
| 95 |
+
"content": "<|placeholder5|>",
|
| 96 |
+
"lstrip": false,
|
| 97 |
+
"normalized": false,
|
| 98 |
+
"rstrip": true,
|
| 99 |
+
"single_word": false,
|
| 100 |
+
"special": true
|
| 101 |
+
},
|
| 102 |
+
"32009": {
|
| 103 |
+
"content": "<|placeholder6|>",
|
| 104 |
+
"lstrip": false,
|
| 105 |
+
"normalized": false,
|
| 106 |
+
"rstrip": true,
|
| 107 |
+
"single_word": false,
|
| 108 |
+
"special": true
|
| 109 |
+
},
|
| 110 |
+
"32010": {
|
| 111 |
+
"content": "<|user|>",
|
| 112 |
+
"lstrip": false,
|
| 113 |
+
"normalized": false,
|
| 114 |
+
"rstrip": true,
|
| 115 |
+
"single_word": false,
|
| 116 |
+
"special": true
|
| 117 |
+
}
|
| 118 |
+
},
|
| 119 |
+
"bos_token": "<s>",
|
| 120 |
+
"clean_up_tokenization_spaces": false,
|
| 121 |
+
"eos_token": "<|endoftext|>",
|
| 122 |
+
"extra_special_tokens": {},
|
| 123 |
+
"legacy": false,
|
| 124 |
+
"model_max_length": 4096,
|
| 125 |
+
"pad_token": "<|endoftext|>",
|
| 126 |
+
"padding_side": "left",
|
| 127 |
+
"sp_model_kwargs": {},
|
| 128 |
+
"tokenizer_class": "LlamaTokenizerFast",
|
| 129 |
+
"unk_token": "<unk>",
|
| 130 |
+
"use_default_system_prompt": false
|
| 131 |
+
}
|