## Model Description

YModel2 is the most powerful large language model (LLM) trained by SnifferCaptain to date (11/23/2025). Compared with the YModel1.x series, it shows substantial gains in inference speed, mathematical ability, coding ability, and common-sense question answering.

## Model Details
- The model incorporates optimization ideas from MFA (https://arxiv.org/abs/2412.19255), upgrading PEGA (Position Embedding Gate Attention) to PEGA2. This new version achieves performance on par with or even surpassing the original PEGA, while delivering nearly a 3x speedup.
- The model utilizes GeGLU in its Feed-Forward Network (FFN) blocks; a minimal sketch follows.
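
GeGLU itself is a standard GLU variant (https://arxiv.org/abs/2002.05202). As a reference, here is a minimal PyTorch sketch of a GeGLU feed-forward block; the module and dimension names are illustrative and are not taken from the YModel2 code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLUFFN(nn.Module):
    """Feed-forward block with a GELU-gated linear unit (GeGLU).

    Names and sizes are illustrative, not the YModel2 implementation.
    """

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # Gate and up projections read the same input, so they are natural
        # candidates for the fusion described under Training Details.
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # GeGLU: GELU(x @ W_gate) elementwise-multiplied with x @ W_up.
        return self.w_down(F.gelu(self.w_gate(x)) * self.w_up(x))
```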
## Training Details
- The model inherits the self-distillation structure from YModel1.1, employing an inter-layer cosine similarity loss that encourages the model to compress knowledge into shallower layers (one plausible form of this loss is sketched after this list).
- Parallel linear layers have been fused. Experiments with a second-order optimizer indicate that fusion does not degrade the model's final quality compared to the unfused counterpart (see the fusion sketch after this list).
- The model was trained with the SiMuon optimizer, an improvement on Muon (https://kellerjordan.github.io/posts/muon) described in https://www.arxiv.org/abs/2507.11005. The SiMuon variant used here differs from the original Muon in two ways (the modified orthogonalization step is sketched after this list):
  1. A `sign` operation is applied before the L2 norm in the Newton-Schulz (NS) orthogonalization step.
  2. The number of NS iterations is reduced from 5 to 2.
- The model's tokenizer and embedding layers are initialized with pre-trained weights from MiniMind2 (https://github.com/jingyaogong/minimind).
- The model was trained on 0.4 billion tokens during the pre-training phase and 2 billion tokens during the fine-tuning phase.
- Pre-training was conducted at a sequence length of 512, with a learning rate of 1e-3 for SiMuon (scaled per weight matrix by `0.2*sqrt(max(fan_in, fan_out))`; see the LR-scaling sketch after this list) and 1e-4 for AdamW. The process was accelerated with bf16 AMP.
- For Supervised Fine-Tuning (SFT), the model was trained at the following sequence length and learning rate combinations: 512/1e-5, 1024/3e-6, 2048/1e-6, and 2048/5e-7 (Length/LR). This stage was also accelerated with bf16 AMP.
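
The notes above do not spell out exactly how the inter-layer cosine loss is attached, so the following is only one plausible reading: consecutive layers' hidden states are pulled toward each other, which makes deeper layers close to identity maps and concentrates the useful computation in the shallow layers. All names are hypothetical.

```python
import torch
import torch.nn.functional as F

def interlayer_cosine_loss(hidden_states: list[torch.Tensor]) -> torch.Tensor:
    """One plausible form of the self-distillation term (an assumption,
    not the YModel2 code). `hidden_states[l]` is the output of layer l,
    shape (batch, seq, d_model).
    """
    assert len(hidden_states) >= 2
    loss = hidden_states[0].new_zeros(())
    for h_prev, h_next in zip(hidden_states[:-1], hidden_states[1:]):
        cos = F.cosine_similarity(h_prev, h_next, dim=-1)  # (batch, seq)
        loss = loss + (1.0 - cos).mean()  # reward similar consecutive layers
    return loss / (len(hidden_states) - 1)
```

In training, this term would be added to the language-modelling loss with some weight; the weighting is not stated in this README.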
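
"Fusing parallel linear layers" means replacing several projections that read the same input (for example, the GeGLU gate and up projections above, or attention's Q/K/V) with one wider matmul whose output is split. A generic, hypothetical sketch:

```python
import torch
import torch.nn as nn

class FusedParallelLinear(nn.Module):
    """N parallel linear layers over the same input, fused into one matmul.

    Illustrative only; e.g. a GeGLU gate/up pair becomes a single
    (d_in -> 2 * d_hidden) projection whose output is split in two.
    """

    def __init__(self, d_in: int, d_outs: list[int]):
        super().__init__()
        self.d_outs = d_outs
        self.proj = nn.Linear(d_in, sum(d_outs), bias=False)

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, ...]:
        return torch.split(self.proj(x), self.d_outs, dim=-1)

# Usage: gate, up = FusedParallelLinear(1024, [2816, 2816])(x)
```

Fusion matters to a matrix-aware optimizer such as Muon/SiMuon because it changes the shape of the weight matrix being preconditioned, which is presumably why the fused and unfused variants were compared.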
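
Muon orthogonalizes each 2-D gradient with a Newton-Schulz iteration, and the two SiMuon changes listed above slot directly into that routine. The sketch below reuses the quintic coefficients from the public Muon implementation; whether SiMuon keeps those coefficients is an assumption.

```python
import torch

def simuon_orthogonalize(g: torch.Tensor, steps: int = 2) -> torch.Tensor:
    """Newton-Schulz orthogonalization with the two SiMuon modifications.

    Coefficients (a, b, c) follow the public Muon code; everything else
    about the real SiMuon implementation is an assumption.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    x = torch.sign(g)          # change 1: sign() before the L2 norm
    x = x / (x.norm() + 1e-7)  # Frobenius normalization, as in Muon
    transposed = x.size(-2) > x.size(-1)
    if transposed:
        x = x.mT               # iterate on the wide orientation
    for _ in range(steps):     # change 2: 2 iterations instead of 5
        s = x @ x.mT
        x = a * x + (b * s + c * s @ s) @ x
    return x.mT if transposed else x
```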
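
The `0.2*sqrt(max(fan_in, fan_out))` factor reads as a per-matrix scale on the SiMuon learning rate, in the spirit of Muon's shape-aware LR adjustment. Below is a sketch of how such parameter groups might be built; the exact split between SiMuon and AdamW parameters is an assumption.

```python
import math
import torch.nn as nn

def simuon_param_groups(model: nn.Module, base_lr: float = 1e-3):
    """Per-matrix LR scaling by 0.2 * sqrt(max(fan_in, fan_out)), as quoted
    in the pre-training notes. How YModel2 partitions parameters between
    SiMuon and AdamW is an assumption; Muon-style setups usually also
    exclude embeddings and the LM head by name.
    """
    groups = []
    for name, p in model.named_parameters():
        if p.ndim != 2:
            continue  # 1-D params (norms, biases) would go to AdamW (lr 1e-4)
        fan_out, fan_in = p.shape  # nn.Linear stores weight as (out, in)
        scale = 0.2 * math.sqrt(max(fan_in, fan_out))
        groups.append({"params": [p], "lr": base_lr * scale, "name": name})
    return groups
```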