Update README.md
README.md CHANGED

```diff
@@ -42,7 +42,7 @@ YModel2 is the most powerful Large Language Model (LLM) trained by SnifferCaptai
 - The model inherits the self-distillation structure from YModel1.1, employing an inter-layer cosine similarity loss that encourages the model to compress knowledge into shallower layers.
 - Parallel linear layers have been fused. Experiments with second-order optimizers indicate that this fusion does not degrade the model's final performance compared to the non-fused counterpart.
 - The model was trained using the powerful SiMuon optimizer, an improved version of Muon (https://kellerjordan.github.io/posts/muon), as detailed in https://www.arxiv.org/abs/2507.11005. The SiMuon implementation used in this model differs from the original Muon in the following ways:
-  1. A `sign` operation is applied before the L2 norm in the NS
+  1. A `sign` operation is applied before the L2 norm in the NS iteration.
   2. The number of NS iterations is reduced from 5 to 2.
 - The model's tokenizer and embedding layers are initialized with pre-trained weights from MiniMind2 (https://github.com/jingyaogong/minimind).
 - The model was trained on 0.4 billion tokens during the pre-training phase and 2 billion tokens during the fine-tuning phase.
```
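The two SiMuon changes listed in the diff can be sketched against Muon's published Newton–Schulz loop. This is a minimal numpy illustration, not the actual SiMuon source: the quintic coefficients `(a, b, c)` are Muon's reference constants, and the function name and epsilon are assumptions for the example.

```python
import numpy as np

def simuon_orthogonalize(grad: np.ndarray, steps: int = 2) -> np.ndarray:
    """Sketch of a SiMuon-style Newton-Schulz (NS) orthogonalization.

    Differences from Muon's reference loop, per the README description:
      1. a `sign` operation is applied BEFORE the L2 (Frobenius) norm;
      2. only 2 NS iterations are run instead of 5.
    Coefficients (a, b, c) are Muon's published quintic constants.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = np.sign(grad)                    # SiMuon change 1: sign first
    X = X / (np.linalg.norm(X) + 1e-7)   # then the L2 (Frobenius) normalization
    transposed = X.shape[0] > X.shape[1]
    if transposed:                       # iterate on the wide orientation
        X = X.T
    for _ in range(steps):               # SiMuon change 2: 2 steps, not 5
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X
```

Because `sign` maps every entry to ±1, the normalized input has a flatter singular-value spectrum, which is plausibly why fewer NS iterations suffice; that rationale is an inference, not a claim from the README.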
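The inter-layer cosine similarity loss mentioned in the first bullet could take many forms; YModel1.1's exact formulation is not given here. A hypothetical sketch, assuming the loss pulls each shallow layer's hidden states toward the final layer's representation (function name and averaging scheme are illustrative assumptions):

```python
import numpy as np

def interlayer_cosine_loss(hidden_states: list[np.ndarray]) -> float:
    """Hypothetical self-distillation loss: average (1 - cosine similarity)
    between every shallow layer's activations and the final layer's.

    `hidden_states`: per-layer activations, each of shape (tokens, dim).
    Minimizing this pushes shallow layers to already carry the knowledge
    the deep layers encode, i.e. compression into shallower layers.
    """
    target = hidden_states[-1]
    t_norm = target / (np.linalg.norm(target, axis=-1, keepdims=True) + 1e-7)
    losses = []
    for h in hidden_states[:-1]:
        h_norm = h / (np.linalg.norm(h, axis=-1, keepdims=True) + 1e-7)
        cos = (h_norm * t_norm).sum(axis=-1)  # per-token cosine similarity
        losses.append(1.0 - cos.mean())       # 0 when layers already agree
    return float(np.mean(losses))
```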