---
license: mit
datasets:
- inclusionAI/Ling-Coder-SFT
- amd/Instella-GSM8K-synthetic
- Jackrong/Chinese-Qwen3-235B-Thinking-2507-Distill-100k
- tuanha1305/DeepSeek-R1-Distill
- YeungNLP/school_math_0.25M
- jingyaogong/minimind_dataset
language:
- zh
- en
base_model:
- SnifferCaptain/YModel2-s0
pipeline_tag: text-generation
---
## Model Description
YModel2 is the most capable large language model (LLM) trained by SnifferCaptain so far (as of November 23, 2025). Compared with the YModel1.x versions, it shows clear improvements in inference speed, mathematics, coding, and common-sense question answering.
## Model Details
- The model borrows optimization ideas from MFA ( https://arxiv.org/abs/2412.19255 ) to upgrade PEGA (Position Embedding Gate Attention) to PEGA2. The new version matches or even surpasses the original PEGA in performance while delivering a nearly 3x speedup.
- The model utilizes GeGLU in its Feed-Forward Network (FFN) blocks.
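
As a rough illustration of the GeGLU feed-forward block mentioned above, the following pure-Python sketch computes `down(GELU(gate(x)) * up(x))` for a single token vector. The function names, the weight layout (one row per output unit), the absence of biases, and the use of the exact erf-based GELU are assumptions made for illustration; they are not taken from the YModel2 code.

```python
import math


def gelu(x):
    """Exact (erf-based) GELU; the model may use a different variant."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight):
    """Bias-free linear map: `weight` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def geglu_ffn(x, w_gate, w_up, w_down):
    """GeGLU feed-forward: down(GELU(gate(x)) * up(x)) for one token vector."""
    gate = [gelu(v) for v in linear(x, w_gate)]   # gated branch
    up = linear(x, w_up)                          # plain linear branch
    hidden = [g * u for g, u in zip(gate, up)]    # element-wise product
    return linear(hidden, w_down)                 # project back to model width


# Hypothetical toy shapes: model width 2, hidden width 3.
x = [1.0, -0.5]
w_gate = [[0.2, 0.1], [0.0, 0.3], [-0.4, 0.5]]
w_up = [[0.1, 0.0], [0.3, -0.2], [0.5, 0.5]]
w_down = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
y = geglu_ffn(x, w_gate, w_up, w_down)
```

The gate and up projections read the same input, which is why parallel linear layers of this kind can be fused into a single matrix multiply.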
## Training Details
- The model inherits the self-distillation structure from YModel1.1, employing an inter-layer cosine similarity loss that encourages the model to compress knowledge into shallower layers.
- Parallel linear layers have been fused. Experiments with second-order optimizers indicate that this fusion does not degrade the model's final performance compared to the non-fused counterpart.
- The model was trained with the SiMuon optimizer, an improved version of Muon ( https://kellerjordan.github.io/posts/muon ) as detailed in ( https://www.arxiv.org/abs/2507.11005 ). The SiMuon implementation used in this model differs from the original Muon in two ways:
  1. A `sign` operation is applied before the L2 normalization that precedes the Newton-Schulz (NS) iteration.
  2. The number of NS iterations is reduced from 5 to 2.
- The model's tokenizer and embedding layers are initialized with pre-trained weights from MiniMind2 ( https://github.com/jingyaogong/minimind ).
- The model was trained on 0.4 billion tokens during the pre-training phase and 2 billion tokens during the fine-tuning phase.
- Pre-training used a sequence length of 512, with learning rates of 1e-3 for SiMuon (scaled by `0.2*sqrt(max(fan_in, fan_out))`) and 1e-4 for AdamW. The process was accelerated using bf16 AMP.
- For Supervised Fine-Tuning (SFT), the model was trained at the following sequence length and learning rate combinations: 512/1e-5, 1024/3e-6, 2048/1e-6, and 2048/5e-7 (Length/LR). This stage was also accelerated with bf16 AMP.
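
The two SiMuon changes and the learning-rate scaling listed above can be sketched in pure Python as follows. This is a minimal sketch under stated assumptions, not the training code: the quintic Newton-Schulz coefficients are the ones given in the Muon blog post, the "L2 norm" is taken to be the Frobenius norm of the whole matrix, the scaling factor is assumed to multiply the base learning rate, and momentum is omitted. The names `simuon_direction` and `scaled_lr` are hypothetical.

```python
import math

# Quintic Newton-Schulz coefficients from the Muon blog post (assumed unchanged here).
NS_COEFFS = (3.4445, -4.7750, 2.0315)


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def sign(x):
    return (x > 0) - (x < 0)


def simuon_direction(grad, ns_steps=2, eps=1e-7):
    """Approximately orthogonalize a 2-D gradient (list of rows)."""
    # Change 1: take the element-wise sign before the L2 normalization.
    x = [[float(sign(g)) for g in row] for row in grad]
    norm = math.sqrt(sum(v * v for row in x for v in row)) + eps
    x = [[v / norm for v in row] for row in x]

    # Work on the wide orientation so x @ x^T is the smaller product.
    transposed = len(x) > len(x[0])
    if transposed:
        x = transpose(x)

    a, b, c = NS_COEFFS
    # Change 2: only 2 NS iterations instead of Muon's 5.
    for _ in range(ns_steps):
        A = matmul(x, transpose(x))
        AA = matmul(A, A)
        B = [[b * p + c * q for p, q in zip(ra, raa)] for ra, raa in zip(A, AA)]
        BX = matmul(B, x)
        x = [[a * p + q for p, q in zip(rx, rbx)] for rx, rbx in zip(x, BX)]

    return transpose(x) if transposed else x


def scaled_lr(base_lr, fan_out, fan_in):
    """Assumed reading of the scaling rule: base_lr * 0.2 * sqrt(max(fan_in, fan_out))."""
    return base_lr * 0.2 * math.sqrt(max(fan_in, fan_out))


# Hypothetical 3x2 gradient and a 100x400 weight shape.
direction = simuon_direction([[1.0, -2.0], [0.5, 3.0], [-4.0, 0.1]])
step_size = scaled_lr(1e-3, fan_out=100, fan_in=400)
```

Under these assumptions a 100x400 weight with a base learning rate of 1e-3 gets an effective step size of 1e-3 * 0.2 * 20 = 4e-3.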
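## Model Performance
The model was benchmarked on several datasets; the scores are for entertainment and reference only.
The results below were obtained with the lm_eval framework: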
| Groups |Version|Filter|n-shot| Metric | |Value | |Stderr|
|-----------|------:|------|------|--------|---|-----:|---|-----:|
|ceval-valid| 2|none | 0|acc |↑ |0.2303|± |0.0115|
| cmmlu| 1|none | 5|acc |↑ |0.2515|± | 0.004|
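
To help read the `Stderr` column, the sketch below compares an accuracy with the chance level of a multiple-choice benchmark. It rests on two assumptions that are not stated in this card: that questions have four options (chance accuracy 0.25), and that the reported standard error is the sample standard deviation of the 0/1 scores divided by `sqrt(n)`. The second assumption appears consistent with the per-task tables, where an accuracy of 0.0526 (1 correct out of an inferred 19 questions) is listed with a standard error of 0.0526.

```python
import math


def acc_stderr(correct, total):
    """Standard error of an accuracy, using the n-1 sample variance of 0/1 scores."""
    p = correct / total
    return math.sqrt(p * (1.0 - p) / (total - 1))


def z_vs_chance(acc, stderr, chance=0.25):
    """Distance of an accuracy from the chance level, in standard errors."""
    return (acc - chance) / stderr


# Group-level scores from the table above.
z_ceval = z_vs_chance(0.2303, 0.0115)
z_cmmlu = z_vs_chance(0.2515, 0.0040)

# Inferred example: 1 correct answer out of 19 questions.
se_small_task = acc_stderr(1, 19)
```

Values of `z_vs_chance` within roughly two standard errors of zero mean the score cannot be clearly distinguished from random guessing.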
C-Eval benchmark results (per task):
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|----------------------------------------------------|------:|------|-----:|--------|---|-----:|---|-----:|
|ceval-valid | 2|none | |acc |↑ |0.2303|± |0.0115|
|ceval-valid_accountant | 2|none | 0|acc |↑ |0.2245|± |0.0602|
|ceval-valid_advanced_mathematics | 2|none | 0|acc |↑ |0.3158|± |0.1096|
|ceval-valid_art_studies | 2|none | 0|acc |↑ |0.4545|± |0.0880|
|ceval-valid_basic_medicine | 2|none | 0|acc |↑ |0.0526|± |0.0526|
|ceval-valid_business_administration | 2|none | 0|acc |↑ |0.2424|± |0.0758|
|ceval-valid_chinese_language_and_literature | 2|none | 0|acc |↑ |0.2174|± |0.0879|
|ceval-valid_civil_servant | 2|none | 0|acc |↑ |0.2553|± |0.0643|
|ceval-valid_clinical_medicine | 2|none | 0|acc |↑ |0.2273|± |0.0914|
|ceval-valid_college_chemistry | 2|none | 0|acc |↑ |0.1667|± |0.0777|
|ceval-valid_college_economics | 2|none | 0|acc |↑ |0.2909|± |0.0618|
|ceval-valid_college_physics | 2|none | 0|acc |↑ |0.2105|± |0.0961|
|ceval-valid_college_programming | 2|none | 0|acc |↑ |0.2432|± |0.0715|
|ceval-valid_computer_architecture | 2|none | 0|acc |↑ |0.2857|± |0.1010|
|ceval-valid_computer_network | 2|none | 0|acc |↑ |0.1053|± |0.0723|
|ceval-valid_discrete_mathematics | 2|none | 0|acc |↑ |0.3750|± |0.1250|
|ceval-valid_education_science | 2|none | 0|acc |↑ |0.2414|± |0.0809|
|ceval-valid_electrical_engineer | 2|none | 0|acc |↑ |0.2162|± |0.0686|
|ceval-valid_environmental_impact_assessment_engineer| 2|none | 0|acc |↑ |0.1613|± |0.0672|
|ceval-valid_fire_engineer | 2|none | 0|acc |↑ |0.2581|± |0.0799|
|ceval-valid_high_school_biology | 2|none | 0|acc |↑ |0.3684|± |0.1137|
|ceval-valid_high_school_chemistry | 2|none | 0|acc |↑ |0.2105|± |0.0961|
|ceval-valid_high_school_chinese | 2|none | 0|acc |↑ |0.2105|± |0.0961|
|ceval-valid_high_school_geography | 2|none | 0|acc |↑ |0.2105|± |0.0961|
|ceval-valid_high_school_history | 2|none | 0|acc |↑ |0.3000|± |0.1051|
|ceval-valid_high_school_mathematics | 2|none | 0|acc |↑ |0.2222|± |0.1008|
|ceval-valid_high_school_physics | 2|none | 0|acc |↑ |0.2105|± |0.0961|
|ceval-valid_high_school_politics | 2|none | 0|acc |↑ |0.2105|± |0.0961|
|ceval-valid_ideological_and_moral_cultivation | 2|none | 0|acc |↑ |0.2632|± |0.1038|
|ceval-valid_law | 2|none | 0|acc |↑ |0.2500|± |0.0903|
|ceval-valid_legal_professional | 2|none | 0|acc |↑ |0.0435|± |0.0435|
|ceval-valid_logic | 2|none | 0|acc |↑ |0.1818|± |0.0842|
|ceval-valid_mao_zedong_thought | 2|none | 0|acc |↑ |0.3333|± |0.0983|
|ceval-valid_marxism | 2|none | 0|acc |↑ |0.2632|± |0.1038|
|ceval-valid_metrology_engineer | 2|none | 0|acc |↑ |0.1250|± |0.0690|
|ceval-valid_middle_school_biology | 2|none | 0|acc |↑ |0.1905|± |0.0878|
|ceval-valid_middle_school_chemistry | 2|none | 0|acc |↑ |0.1500|± |0.0819|
|ceval-valid_middle_school_geography | 2|none | 0|acc |↑ |0.0833|± |0.0833|
|ceval-valid_middle_school_history | 2|none | 0|acc |↑ |0.1818|± |0.0842|
|ceval-valid_middle_school_mathematics | 2|none | 0|acc |↑ |0.1579|± |0.0859|
|ceval-valid_middle_school_physics | 2|none | 0|acc |↑ |0.2105|± |0.0961|
|ceval-valid_middle_school_politics | 2|none | 0|acc |↑ |0.2857|± |0.1010|
|ceval-valid_modern_chinese_history | 2|none | 0|acc |↑ |0.1739|± |0.0808|
|ceval-valid_operating_system | 2|none | 0|acc |↑ |0.1579|± |0.0859|
|ceval-valid_physician | 2|none | 0|acc |↑ |0.2653|± |0.0637|
|ceval-valid_plant_protection | 2|none | 0|acc |↑ |0.3182|± |0.1016|
|ceval-valid_probability_and_statistics | 2|none | 0|acc |↑ |0.1111|± |0.0762|
|ceval-valid_professional_tour_guide | 2|none | 0|acc |↑ |0.3448|± |0.0898|
|ceval-valid_sports_science | 2|none | 0|acc |↑ |0.1579|± |0.0859|
|ceval-valid_tax_accountant | 2|none | 0|acc |↑ |0.1633|± |0.0533|
|ceval-valid_teacher_qualification | 2|none | 0|acc |↑ |0.2955|± |0.0696|
|ceval-valid_urban_and_rural_planner | 2|none | 0|acc |↑ |0.2174|± |0.0615|
|ceval-valid_veterinary_medicine | 2|none | 0|acc |↑ |0.2174|± |0.0879|
CMMLU benchmark results (per task):
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|----------------------------------------------|------:|------|-----:|--------|---|-----:|---|-----:|
|cmmlu | 1|none | |acc |↑ |0.2515|± |0.0040|
| - cmmlu_agronomy | 1|none | 5|acc |↑ |0.2544|± |0.0336|
| - cmmlu_anatomy | 1|none | 5|acc |↑ |0.2432|± |0.0354|
| - cmmlu_ancient_chinese | 1|none | 5|acc |↑ |0.2317|± |0.0330|
| - cmmlu_arts | 1|none | 5|acc |↑ |0.2562|± |0.0346|
| - cmmlu_astronomy | 1|none | 5|acc |↑ |0.2485|± |0.0337|
| - cmmlu_business_ethics | 1|none | 5|acc |↑ |0.2344|± |0.0294|
| - cmmlu_chinese_civil_service_exam | 1|none | 5|acc |↑ |0.2562|± |0.0346|
| - cmmlu_chinese_driving_rule | 1|none | 5|acc |↑ |0.2519|± |0.0381|
| - cmmlu_chinese_food_culture | 1|none | 5|acc |↑ |0.2574|± |0.0376|
| - cmmlu_chinese_foreign_policy | 1|none | 5|acc |↑ |0.2710|± |0.0432|
| - cmmlu_chinese_history | 1|none | 5|acc |↑ |0.2508|± |0.0242|
| - cmmlu_chinese_literature | 1|none | 5|acc |↑ |0.2549|± |0.0306|
| - cmmlu_chinese_teacher_qualification | 1|none | 5|acc |↑ |0.2458|± |0.0323|
| - cmmlu_clinical_knowledge | 1|none | 5|acc |↑ |0.2532|± |0.0283|
| - cmmlu_college_actuarial_science | 1|none | 5|acc |↑ |0.3019|± |0.0448|
| - cmmlu_college_education | 1|none | 5|acc |↑ |0.3178|± |0.0452|
| - cmmlu_college_engineering_hydrology | 1|none | 5|acc |↑ |0.3019|± |0.0448|
| - cmmlu_college_law | 1|none | 5|acc |↑ |0.1667|± |0.0360|
| - cmmlu_college_mathematics | 1|none | 5|acc |↑ |0.2381|± |0.0418|
| - cmmlu_college_medical_statistics | 1|none | 5|acc |↑ |0.2547|± |0.0425|
| - cmmlu_college_medicine | 1|none | 5|acc |↑ |0.2234|± |0.0253|
| - cmmlu_computer_science | 1|none | 5|acc |↑ |0.2500|± |0.0304|
| - cmmlu_computer_security | 1|none | 5|acc |↑ |0.2632|± |0.0338|
| - cmmlu_conceptual_physics | 1|none | 5|acc |↑ |0.2721|± |0.0368|
| - cmmlu_construction_project_management | 1|none | 5|acc |↑ |0.2230|± |0.0354|
| - cmmlu_economics | 1|none | 5|acc |↑ |0.2264|± |0.0333|
| - cmmlu_education | 1|none | 5|acc |↑ |0.2454|± |0.0338|
| - cmmlu_electrical_engineering | 1|none | 5|acc |↑ |0.2500|± |0.0331|
| - cmmlu_elementary_chinese | 1|none | 5|acc |↑ |0.2778|± |0.0283|
| - cmmlu_elementary_commonsense | 1|none | 5|acc |↑ |0.2273|± |0.0299|
| - cmmlu_elementary_information_and_technology| 1|none | 5|acc |↑ |0.2731|± |0.0289|
| - cmmlu_elementary_mathematics | 1|none | 5|acc |↑ |0.2304|± |0.0278|
| - cmmlu_ethnology | 1|none | 5|acc |↑ |0.2519|± |0.0375|
| - cmmlu_food_science | 1|none | 5|acc |↑ |0.2448|± |0.0361|
| - cmmlu_genetics | 1|none | 5|acc |↑ |0.2443|± |0.0325|
| - cmmlu_global_facts | 1|none | 5|acc |↑ |0.2617|± |0.0361|
| - cmmlu_high_school_biology | 1|none | 5|acc |↑ |0.2485|± |0.0333|
| - cmmlu_high_school_chemistry | 1|none | 5|acc |↑ |0.2273|± |0.0366|
| - cmmlu_high_school_geography | 1|none | 5|acc |↑ |0.2542|± |0.0403|
| - cmmlu_high_school_mathematics | 1|none | 5|acc |↑ |0.2500|± |0.0339|
| - cmmlu_high_school_physics | 1|none | 5|acc |↑ |0.2545|± |0.0417|
| - cmmlu_high_school_politics | 1|none | 5|acc |↑ |0.2378|± |0.0357|
| - cmmlu_human_sexuality | 1|none | 5|acc |↑ |0.2619|± |0.0393|
| - cmmlu_international_law | 1|none | 5|acc |↑ |0.2432|± |0.0316|
| - cmmlu_journalism | 1|none | 5|acc |↑ |0.2442|± |0.0329|
| - cmmlu_jurisprudence | 1|none | 5|acc |↑ |0.2506|± |0.0214|
| - cmmlu_legal_and_moral_basis | 1|none | 5|acc |↑ |0.2290|± |0.0288|
| - cmmlu_logical | 1|none | 5|acc |↑ |0.2602|± |0.0397|
| - cmmlu_machine_learning | 1|none | 5|acc |↑ |0.2623|± |0.0400|
| - cmmlu_management | 1|none | 5|acc |↑ |0.2667|± |0.0306|
| - cmmlu_marketing | 1|none | 5|acc |↑ |0.2722|± |0.0333|
| - cmmlu_marxist_theory | 1|none | 5|acc |↑ |0.2434|± |0.0313|
| - cmmlu_modern_chinese | 1|none | 5|acc |↑ |0.2500|± |0.0404|
| - cmmlu_nutrition | 1|none | 5|acc |↑ |0.2138|± |0.0342|
| - cmmlu_philosophy | 1|none | 5|acc |↑ |0.2476|± |0.0423|
| - cmmlu_professional_accounting | 1|none | 5|acc |↑ |0.2571|± |0.0331|
| - cmmlu_professional_law | 1|none | 5|acc |↑ |0.2749|± |0.0308|
| - cmmlu_professional_medicine | 1|none | 5|acc |↑ |0.2553|± |0.0225|
| - cmmlu_professional_psychology | 1|none | 5|acc |↑ |0.2500|± |0.0285|
| - cmmlu_public_relations | 1|none | 5|acc |↑ |0.2529|± |0.0330|
| - cmmlu_security_study | 1|none | 5|acc |↑ |0.2370|± |0.0367|
| - cmmlu_sociology | 1|none | 5|acc |↑ |0.2699|± |0.0296|
| - cmmlu_sports_science | 1|none | 5|acc |↑ |0.2485|± |0.0337|
| - cmmlu_traditional_chinese_medicine | 1|none | 5|acc |↑ |0.2595|± |0.0323|
| - cmmlu_virology | 1|none | 5|acc |↑ |0.2485|± |0.0333|
| - cmmlu_world_history | 1|none | 5|acc |↑ |0.2857|± |0.0357|
| - cmmlu_world_religions | 1|none | 5|acc |↑ |0.2625|± |0.0349|
Sample question-and-answer outputs from the model, reproduced verbatim in their original language:
---
[Q]你知道长江吗?
[A]是的,我知道长江。长江,位于中国东部沿海地区,是中国第一大河,也是亚洲第一大河,全长约6300公里。它发源于青藏高原的唐古拉山脉,流经中国11个省份,最终注入东海。长江不仅是中国的母亲河,也是亚洲最重要的河流之一,对中国的气候、水资源和生态环境有着深远的影响。
[Q]世界上最高的山峰是什么?
[A]珠穆朗玛峰是世界上海拔最高的山峰,位于喜马拉雅山脉,位于尼泊尔和中国边境。它的海拔高度为8,848.86米(29,031.7英尺)。
[Q]万有引力是谁提出的?
[A]万有引力是由艾萨克·牛顿在1687年提出的。他在这本书中详细阐述了万有引力定律,并通过实验验证了这一理论的正确性。
[Q]你知道大熊猫的主要食物是什么吗?
[A]当然,大熊猫主要以竹子为食。它们每天需要消耗大量的竹子来满足能量需求。除了竹子,大熊猫也会吃一些其他植物,如野生植物、果实和坚果。
[Q]海水为什么是咸的?
[A]海水之所以是咸的,主要是因为地球上的水循环过程。海洋是地球上最大的生态系统,它通过蒸发、降水、地表径流和地下水的补给等过程,将大量的溶解在水中的盐分转化为无量的盐分,从而维持着海洋生态系统的平衡。
1. **蒸发**:地球表面的水体(如海洋)在吸收和释放热量后,会以水蒸气的形式从海洋中蒸发出来。这个过程称为蒸发。
2. **降水**:随着水体的不断循环,水蒸气在大气中凝结成云,最终以降水的形式返回海洋。降水是海洋生物生长和繁殖的基础。
3. **地表径流**:在某些情况下,海水中的溶解物质(如溶解的矿物质、有机物等)会通过河流、湖泊、海洋表面等地理环境传递到海洋中,形成海水的“咸水”。
4. **地下水补给**:在一些深海区域,由于水压的增加,海水中的溶解盐分(如氯化钠、硫酸盐等)会被提取出来,然后通过地下水系统输送到淡水湖泊、河流、海洋中的淡水中,再通过河流或湖泊进行补给。
综上所述,海水之所以是咸的,是由于地球上的水循环过程,以及地球表面的盐分和溶解物质的积累。这些过程共同作用,使得海洋成为地球上最大的生态系统之一。
[Q]写一段python的快速排序代码
[A]快速排序是一种高效的排序算法,采用分治法策略。以下是使用Python实现的快速排序代码:
```python
def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quick_sort(left) + middle + quick_sort(right)
# 示例
arr = [3, 6, 8, 10, 1, 2, 1]
print("原始数组:", arr)
sorted_arr = quick_sort(arr)
print("排序后的数组:", sorted_arr)
```
这段代码首先检查输入数组的长度,如果长度小于等于1,则直接返回数组(因为长度为1或空的数组已经是排序好的)。然后选择一个基准值(这里选择数组中间的元素),将数组分为三部分:小于基准值的元素、等于基准值的元素和大于基准值的元素。接着对小于和大于基准值的两部分递归地进行快速排序,并将结果与等于基准值的部分合并。
在示例中,我们创建了一个数组`arr`,并调用`quick_sort`函数对其进行排序。最后,打印排序前后的结果。
[Q]截至2025年2月9日14时31分,电影《哪吒之魔童闹海》累计票房(含预售)突破78.09亿元,超过《星球大战:天行者崛起》的票房成绩,成为首部进入全球票房榜前40的亚洲电影。这一成就不仅标志着中国电影在国际市场的突破,也展示了中国动画电影的巨大潜力和市场吸引力。从这段文字总结20字的摘要给我。
[A]《哪吒之魔童闹海》在2025年2月9日14时31分累计票房突破78.09亿元,成为首部进入全球票房榜前40的亚洲电影。
[Q]Justice has 4 ferns, 6 palms, and 8 succulent plants in her house. If she wants a total of 25 plants in her home, how many more plants does she need?
[A]Justice has 4 ferns, 6 palms, and 8 succulent plants. So, the total number of plants she has is 4 + 6 + 8 = 18. Since she wants a total of 25 plants, she needs 25 - 18 = 7 more plants. The answer is 7.
[Q]请写一篇关于环保的社论,要求言之有物、具有说服力。
[A]好的,以下是我为您撰写的社论:
随着工业化和城市化的快速发展,环境问题越来越受到人们的关注。我们必须认识到,保护环境是每个人的责任。为了实现可持续发展,我们需要采取行动。
首先,我们应该减少污染。这包括减少废弃物的产生和处理,以及使用更环保的产品和材料。同时,我们也应该鼓励人们使用公共交通工具、骑自行车或步行等方式出行,以减少汽车尾气的排放。
其次,我们应该重视水资源的保护。水是生命之源,但它是地球上最宝贵的资源之一。我们应该避免过度开采地下水,并确保我们的水源清洁。此外,我们还应该注意节约用水,例如减少洗澡时间、关闭水龙头等。
最后,我们应该重视垃圾分类和回收。垃圾分类可以有效地减少垃圾的数量,并将可回收的物品进行回收利用。同时,回收废纸、塑料和金属等材料可以减少对自然资源的消耗。
综上所述,保护环境是我们每个人的责任。我们应该采取行动,减少污染,保护水资源,并重视垃圾分类和回收。只有这样,我们才能共同创造一个更加美好的未来。
---