 base_model:
 - SnifferCaptain/YModel2-s0
 pipeline_tag: text-generation
+---
+## Model Description
+YModel2 is the most capable large language model (LLM) that SnifferCaptain has trained to date (as of November 23, 2025). Compared with the YModel1.x versions, it shows substantial improvements in inference speed, mathematical ability, coding ability, and common-sense question answering.
+## Model Details
+- The model incorporates optimization ideas from MFA (https://arxiv.org/abs/2412.19255), upgrading PEGA (Position Embedding Gate Attention) to PEGA2. This new version achieves performance on par with or even surpassing the original PEGA, while delivering nearly a 3x speedup.
+- The model utilizes GeGLU in its Feed-Forward Network (FFN) blocks.
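
As an illustration of the GeGLU feed-forward block named above, here is a minimal pure-Python sketch. It is not YModel2's actual code: the function and weight names (`geglu_ffn`, `w_gate`, `w_up`, `w_down`), the absence of biases, and the use of exact (erf-based) GELU rather than the tanh approximation are all assumptions made for illustration.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight):
    """Bias-free linear map y = W x; `weight` is a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def geglu_ffn(x, w_gate, w_up, w_down):
    """GeGLU feed-forward: w_down @ (GELU(w_gate @ x) * (w_up @ x)).

    Hypothetical layout: which branch carries the GELU and the weight
    names are assumptions, not taken from the YModel2 code.
    """
    gate = [gelu(g) for g in linear(x, w_gate)]    # gated branch
    up = linear(x, w_up)                           # linear branch
    hidden = [g * u for g, u in zip(gate, up)]     # element-wise product
    return linear(hidden, w_down)                  # project back down


# Toy usage: model width 2, hidden width 3 (sizes chosen only for illustration).
x = [0.5, -1.0]
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_up = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
w_down = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
y = geglu_ffn(x, w_gate, w_up, w_down)
```

The output `y` has the same width as the input `x`, as expected of a feed-forward block that sits on the residual stream.
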
+## Training Details
+- The model inherits the self-distillation structure from YModel1.1, employing an inter-layer cosine similarity loss that encourages the model to compress knowledge into shallower layers.
+- Parallel linear layers have been fused. Experiments with second-order optimizers indicate that this fusion does not degrade the model's final performance compared to the non-fused counterpart.
+- The model was trained with the SiMuon optimizer (https://www.arxiv.org/abs/2507.11005), a modification of Muon (https://kellerjordan.github.io/posts/muon). The SiMuon implementation used for this model differs from the original Muon in two ways:
+    1. A `sign` operation is applied before the L2 normalization that precedes the NS (Newton-Schulz) iterations.
+    2. The number of NS iterations is reduced from 5 to 2.
+- The model's tokenizer and embedding layers are initialized with pre-trained weights from MiniMind2 (https://github.com/jingyaogong/minimind).
+- The model was trained on 0.4 billion tokens during the pre-training phase and 2 billion tokens during the fine-tuning phase.
+- Pre-training used a sequence length of 512, with a learning rate of 1e-3 for SiMuon (scaled by `0.2*sqrt(max(fan_in, fan_out))`) and 1e-4 for AdamW. Training was accelerated with bf16 AMP.
+- For Supervised Fine-Tuning (SFT), the model was trained at the following sequence length and learning rate combinations: 512/1e-5, 1024/3e-6, 2048/1e-6, and 2048/5e-7 (Length/LR). This stage was also accelerated with bf16 AMP.
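
The two SiMuon changes and the learning-rate scaling listed above can be sketched in pure Python as follows. This is a sketch under stated assumptions, not the training code: the quintic coefficients are the ones given in the Muon reference post, and it is an assumption that the SiMuon variant used here keeps them. Reading the scaling expression as a multiplier on the base learning rate is likewise an assumption, and all function names are hypothetical.

```python
import math

# Quintic Newton-Schulz coefficients from the Muon reference post.
# Assumption: the SiMuon variant used for YModel2 keeps these values.
NS_COEFFS = (3.4445, -4.7750, 2.0315)


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def simuon_orthogonalize(update, steps=2, eps=1e-7):
    """Sign, then L2 (Frobenius) normalization, then `steps` NS iterations."""
    a, b, c = NS_COEFFS
    # Change 1: element-wise sign before the L2 normalization.
    x = [[float(sign(v)) for v in row] for row in update]
    norm = math.sqrt(sum(v * v for row in x for v in row))
    x = [[v / (norm + eps) for v in row] for row in x]
    # Work on the wide orientation so the Gram matrix is the smaller one.
    tall = len(x) > len(x[0])
    if tall:
        x = transpose(x)
    # Change 2: only 2 iterations by default (original Muon uses 5).
    for _ in range(steps):
        gram = matmul(x, transpose(x))
        gram2 = matmul(gram, gram)
        poly = [[b * g + c * g2 for g, g2 in zip(rg, rg2)]
                for rg, rg2 in zip(gram, gram2)]
        px = matmul(poly, x)
        x = [[a * v + p for v, p in zip(rx, rp)] for rx, rp in zip(x, px)]
    return transpose(x) if tall else x


def scaled_lr(base_lr, fan_in, fan_out):
    """Assumed reading: base LR times 0.2 * sqrt(max(fan_in, fan_out))."""
    return base_lr * 0.2 * math.sqrt(max(fan_in, fan_out))


# Toy usage with a made-up 2x3 momentum matrix and made-up layer sizes.
direction = simuon_orthogonalize([[0.3, -1.2, 0.7], [2.0, 0.1, -0.4]])
lr = scaled_lr(1e-3, fan_in=512, fan_out=2048)
```

Because the sign step discards magnitudes, every nonzero entry contributes equally to the normalized matrix before the Newton-Schulz iterations start.
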
+## Model Performance
+The model has not been evaluated on any benchmark. Its final perplexity (PPL) on the training set is approximately 3.0.
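
For reference, perplexity is the exponential of the mean per-token cross-entropy loss. Assuming the loss is measured in nats and averaged over tokens, a PPL of about 3.0 corresponds to a loss of roughly 1.10:

$$
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\ln p(x_i \mid x_{<i})\right), \qquad \ln 3.0 \approx 1.0986
$$
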
+Sample question-and-answer outputs from the model:
+---
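+[Q]Do you know about the Yangtze River?
+[A]Yes, I know about the Yangtze River. The Yangtze River, located in the eastern coastal region of China, is the largest river in China and also the largest river in Asia, with a total length of about 6,300 km. It originates in the Tanggula Mountains on the Qinghai-Tibet Plateau, flows through 11 provinces of China, and finally empties into the East China Sea. The Yangtze is not only China's mother river but also one of the most important rivers in Asia, with a profound influence on China's climate, water resources, and ecological environment.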
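+[Q]What is the highest mountain peak in the world?
+[A]Mount Everest is the highest peak in the world by elevation, located in the Himalayas, located on the border between Nepal and China. Its elevation is 8,848.86 meters (29,031.7 feet).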
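+[Q]Who proposed universal gravitation?
+[A]Universal gravitation was proposed by Isaac Newton in 1687. In this book, he set out the law of universal gravitation in detail and verified the correctness of the theory through experiments.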
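+[Q]Do you know what the giant panda's main food is?
+[A]Of course. Giant pandas feed mainly on bamboo. They need to consume large amounts of bamboo every day to meet their energy needs. Besides bamboo, giant pandas also eat some other plants, such as wild plants, fruits, and nuts.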
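+[Q]Why is seawater salty?
+[A]Seawater is salty mainly because of the Earth's water cycle. The ocean is the largest ecosystem on Earth. Through processes such as evaporation, precipitation, surface runoff, and groundwater recharge, it converts the large amount of salt dissolved in the water into boundless amounts of salt, thereby maintaining the balance of the marine ecosystem.
+1. **Evaporation**: After absorbing and releasing heat, bodies of water on the Earth's surface (such as the ocean) evaporate from the ocean in the form of water vapor. This process is called evaporation.
+2. **Precipitation**: As water keeps circulating, water vapor condenses into clouds in the atmosphere and eventually returns to the ocean as precipitation. Precipitation is the basis for the growth and reproduction of marine life.
+3. **Surface runoff**: In some cases, dissolved substances in seawater (such as dissolved minerals and organic matter) are carried into the ocean through geographic environments such as rivers, lakes, and the ocean surface, forming the "salt water" of the sea.
+4. **Groundwater recharge**: In some deep-sea regions, because of increased water pressure, the dissolved salts in seawater (such as sodium chloride and sulfates) are extracted and then transported through groundwater systems into freshwater lakes, rivers, and the fresh water in the ocean, and then replenished through rivers or lakes.
+In summary, seawater is salty because of the Earth's water cycle and the accumulation of salts and dissolved substances on the Earth's surface. These processes work together to make the ocean one of the largest ecosystems on Earth.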
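+[Q]Write a quicksort implementation in Python
+[A]Quicksort is an efficient sorting algorithm that uses a divide-and-conquer strategy. Below is a quicksort implementation in Python: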
+```python
+def quick_sort(arr):
+    if len(arr) <= 1:
+        return arr
+    pivot = arr[len(arr) // 2]
+    left = [x for x in arr if x < pivot]
+    middle = [x for x in arr if x == pivot]
+    right = [x for x in arr if x > pivot]
+    return quick_sort(left) + middle + quick_sort(right)
+# Example
+arr = [3, 6, 8, 10, 1, 2, 1]
+print("Original array:", arr)
+sorted_arr = quick_sort(arr)
+print("Sorted array:", sorted_arr)
+```
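+This code first checks the length of the input array. If the length is less than or equal to 1, it returns the array directly (because an array of length 1 or an empty array is already sorted). It then chooses a pivot (here, the middle element of the array) and splits the array into three parts: elements less than the pivot, elements equal to the pivot, and elements greater than the pivot. Next, it recursively quicksorts the less-than and greater-than parts and combines the results with the equal-to part.
+In the example, we create an array `arr` and call the `quick_sort` function to sort it. Finally, the results before and after sorting are printed.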
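+[Q]As of 14:31 on February 9, 2025, the cumulative box office (including presales) of the film 《哪吒之魔童闹海》 (Ne Zha 2) exceeded 7.809 billion yuan, surpassing the box office of 《星球大战：天行者崛起》 (Star Wars: The Rise of Skywalker) and making it the first Asian film to enter the top 40 of the global box office chart. This achievement not only marks a breakthrough for Chinese cinema in the international market but also demonstrates the enormous potential and market appeal of Chinese animated films. Give me a 20-character summary of this passage.
+[A]At 14:31 on February 9, 2025, the cumulative box office of 《哪吒之魔童闹海》 exceeded 7.809 billion yuan, making it the first Asian film to enter the top 40 of the global box office chart.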
+[Q]Justice has 4 ferns, 6 palms, and 8 succulent plants in her house. If she wants a total of 25 plants in her home, how many more plants does she need?
+[A]Justice has 4 ferns, 6 palms, and 8 succulent plants. So, the total number of plants she has is 4 + 6 + 8 = 18. Since she wants a total of 25 plants, she needs 25 - 18 = 7 more plants. The answer is 7.
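+[Q]Please write an editorial on environmental protection. It should be substantive and persuasive.
+[A]Sure, here is the editorial I wrote for you:
+With the rapid development of industrialization and urbanization, environmental problems are receiving more and more attention. We must recognize that protecting the environment is everyone's responsibility. To achieve sustainable development, we need to take action.
+First, we should reduce pollution. This includes reducing the generation and disposal of waste and using more environmentally friendly products and materials. At the same time, we should encourage people to travel by public transport, by bicycle, or on foot to reduce vehicle exhaust emissions.
+Second, we should value the protection of water resources. Water is the source of life, but it is one of the most precious resources on Earth. We should avoid over-extracting groundwater and make sure our water sources are clean. In addition, we should take care to save water, for example by taking shorter showers and turning off taps.
+Finally, we should take waste sorting and recycling seriously. Waste sorting can effectively reduce the amount of garbage and allow recyclable items to be recycled. Meanwhile, recycling materials such as waste paper, plastic, and metal can reduce the consumption of natural resources.
+In summary, protecting the environment is the responsibility of each of us. We should take action to reduce pollution, protect water resources, and take waste sorting and recycling seriously. Only in this way can we together create a better future.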
 ---