SnifferCaptain committed (verified)
Commit 19d9458 · Parent(s): f1dcdc2

Update README.md

Files changed (1): README.md (+1, −1)
README.md CHANGED
@@ -42,7 +42,7 @@ YModel2 is the most powerful Large Language Model (LLM) trained by SnifferCaptai
 - The model inherits the self-distillation structure from YModel1.1, employing an inter-layer cosine similarity loss that encourages the model to compress knowledge into shallower layers.
 - Parallel linear layers have been fused. Experiments with second-order optimizers indicate that this fusion does not degrade the model's final performance compared to the non-fused counterpart.
 - The model was trained using the powerful SiMuon optimizer, an improved version of Muon ( https://kellerjordan.github.io/posts/muon ) as detailed in ( https://www.arxiv.org/abs/2507.11005 ). The SiMuon implementation used in this model differs from the original Muon in the following ways:
-  1. A `sign` operation is applied before the L2 norm in the NS (Neumann Series) operation.
+  1. A `sign` operation is applied before the L2 norm in the NS iteration.
   2. The number of NS iterations is reduced from 5 to 2.
 - The model's tokenizer and embedding layers are initialized with pre-trained weights from MiniMind2 ( https://github.com/jingyaogong/minimind ).
 - The model was trained on 0.4 billion tokens during the 0.4-billion-token pre-training phase and 2 billion tokens during the fine-tuning phase.
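The two SiMuon changes described in the diff can be sketched against the standard Muon update. This is a minimal illustration, not the repository's actual implementation: the function name `simuon_ns` is hypothetical, the quintic coefficients are the ones from Keller Jordan's Muon write-up (where NS denotes the Newton-Schulz iteration), and the exact placement of the `sign` relative to the norm is inferred from the commit text.

```python
import torch

def simuon_ns(G: torch.Tensor, steps: int = 2) -> torch.Tensor:
    """Hedged sketch of the SiMuon-style Newton-Schulz (NS) orthogonalization.

    Differences from Muon's NS step, as described in the README diff:
      1. A sign() is applied to the gradient BEFORE the L2 normalization.
      2. Only 2 NS iterations are run instead of Muon's 5.
    """
    # Quintic NS coefficients from the Muon post (assumed unchanged here).
    a, b, c = 3.4445, -4.7750, 2.0315

    X = torch.sign(G)                 # change 1: sign op before the norm
    X = X / (X.norm() + 1e-7)         # L2 (Frobenius) normalization

    transposed = G.size(0) > G.size(1)
    if transposed:                    # iterate on the wide orientation
        X = X.T
    for _ in range(steps):            # change 2: steps defaults to 2, not 5
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X
```

A caller would apply this to each 2-D parameter's momentum-averaged gradient before the weight update, as in Muon; scalar and 1-D parameters would typically fall back to a standard optimizer.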
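The inter-layer cosine similarity loss mentioned in the first bullet can be sketched as follows. This is an assumed formulation: the function name, the choice of comparing consecutive layers, and the `detach()` on the deeper layer (so the gradient pulls shallow representations toward deep ones, concentrating knowledge in shallower layers) are all illustrative guesses, not the repository's code.

```python
import torch
import torch.nn.functional as F

def interlayer_cosine_loss(hidden_states: list[torch.Tensor]) -> torch.Tensor:
    """Sketch of a self-distillation loss over per-layer hidden states.

    Penalizes 1 - cos(h_l, h_{l+1}) for consecutive layers, encouraging
    shallow layers to already carry the information of deeper ones.
    """
    loss = torch.zeros(())
    for h_shallow, h_deep in zip(hidden_states[:-1], hidden_states[1:]):
        # detach() treats the deeper layer as the teacher (assumed design)
        cos = F.cosine_similarity(h_shallow, h_deep.detach(), dim=-1)
        loss = loss + (1.0 - cos).mean()
    return loss / (len(hidden_states) - 1)
```

In training, this term would be added (with some weight) to the usual language-modeling loss.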