SnifferCaptain committed (verified)
Commit 19d9458 · Parent(s): f1dcdc2

Update README.md

Files changed (1): README.md (+1, −1)
README.md CHANGED
@@ -42,7 +42,7 @@ YModel2 is the most powerful Large Language Model (LLM) trained by SnifferCaptai
 - The model inherits the self-distillation structure from YModel1.1, employing an inter-layer cosine similarity loss that encourages the model to compress knowledge into shallower layers.
 - Parallel linear layers have been fused. Experiments with second-order optimizers indicate that this fusion does not degrade the model's final performance compared to the non-fused counterpart.
 - The model was trained using the powerful SiMuon optimizer, an improved version of Muon ( https://kellerjordan.github.io/posts/muon ) as detailed in ( https://www.arxiv.org/abs/2507.11005 ). The SiMuon implementation used in this model differs from the original Muon in the following ways:
-  1. A `sign` operation is applied before the L2 norm in the NS (Neumann Series) operation.
+  1. A `sign` operation is applied before the L2 norm in the NS iteration.
   2. The number of NS iterations is reduced from 5 to 2.
 - The model's tokenizer and embedding layers are initialized with pre-trained weights from MiniMind2 ( https://github.com/jingyaogong/minimind ).
 - The model was trained on 0.4 billion tokens during the 0.4-billion-token pre-training phase and 2 billion tokens during the fine-tuning phase.
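The two SiMuon changes described in the diff can be sketched against the standard Muon update. This is a minimal illustration, not the repository's actual implementation: the function name `simuon_ns` is hypothetical, the quintic coefficients are the ones from Keller Jordan's Muon write-up (where NS denotes the Newton-Schulz iteration), and the exact placement of the `sign` relative to the norm is inferred from the commit text.

```python
import torch

def simuon_ns(G: torch.Tensor, steps: int = 2) -> torch.Tensor:
    """Hedged sketch of the SiMuon-style Newton-Schulz (NS) orthogonalization.

    Differences from Muon's NS step, as described in the README diff:
      1. A sign() is applied to the gradient BEFORE the L2 normalization.
      2. Only 2 NS iterations are run instead of Muon's 5.
    """
    # Quintic NS coefficients from the Muon post (assumed unchanged here).
    a, b, c = 3.4445, -4.7750, 2.0315

    X = torch.sign(G)                 # change 1: sign op before the norm
    X = X / (X.norm() + 1e-7)         # L2 (Frobenius) normalization

    transposed = G.size(0) > G.size(1)
    if transposed:                    # iterate on the wide orientation
        X = X.T
    for _ in range(steps):            # change 2: steps defaults to 2, not 5
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X
```

A caller would apply this to each 2-D parameter's momentum-averaged gradient before the weight update, as in Muon; scalar and 1-D parameters would typically fall back to a standard optimizer.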
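The inter-layer cosine similarity loss mentioned in the first bullet can be sketched as follows. This is an assumed formulation: the function name, the choice of comparing consecutive layers, and the `detach()` on the deeper layer (so the gradient pulls shallow representations toward deep ones, concentrating knowledge in shallower layers) are all illustrative guesses, not the repository's code.

```python
import torch
import torch.nn.functional as F

def interlayer_cosine_loss(hidden_states: list[torch.Tensor]) -> torch.Tensor:
    """Sketch of a self-distillation loss over per-layer hidden states.

    Penalizes 1 - cos(h_l, h_{l+1}) for consecutive layers, encouraging
    shallow layers to already carry the information of deeper ones.
    """
    loss = torch.zeros(())
    for h_shallow, h_deep in zip(hidden_states[:-1], hidden_states[1:]):
        # detach() treats the deeper layer as the teacher (assumed design)
        cos = F.cosine_similarity(h_shallow, h_deep.detach(), dim=-1)
        loss = loss + (1.0 - cos).mean()
    return loss / (len(hidden_states) - 1)
```

In training, this term would be added (with some weight) to the usual language-modeling loss.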