SnifferCaptain
/

ymodel3-n1

@@ -13,6 +13,59 @@ datasets:
 - SnifferCaptain/z1-1m
 pipeline_tag: text-generation
 ---
+## Model Description
+YModel3 is the latest large language model trained by SnifferCaptain as of May 4, 2026. Compared with YModel2, it adds hybrid reasoning and adjustable reasoning depth, and its response quality is somewhat improved.
+## Model Details
+- The model incorporates optimization ideas from MLA (DeepSeek, https://arxiv.org/pdf/2405.04434) and Gated Attention (Qwen, https://arxiv.org/pdf/2505.06708), replacing YModel2’s PEGA2 with the larger-capacity, better-scaling MLGA module, significantly improving parameter efficiency.
+- The FFN uses SwiGLU.
+- An SEBlock is added after each RMSNorm layer.
+| Key | Value |
+|:--|:--|
+| Parameters (during training) | 50.149M |
+| Number of layers | 8 |
+| Hidden size | 768 |
+| Vocabulary size | 6400 |
+| FFN activation function | SwiGLU |
+| FFN expansion dimension | 1536 |
+| Normalization layer | RMSNorm + SEBlock |
+| Attention mechanism | MLGA |
+| Number of attention heads | 6 |
+| Attention head dimension | 64 |
+| KV latent dimension | 128 |
+| RoPE embedding dimension | 64 |
+## Training Details
+- The model was trained with the SiMuon optimizer, as YModel2 was, but with the number of Newton-Schulz (NS) iterations increased from 2 to 3. During pre-training, the parameters optimized by SiMuon used a learning rate of 10× the default; in the other stages they used 66× the default. The parameters optimized by AdamW used 1× the default learning rate. Learning-rate scaling followed `0.2 * sqrt(max(fan_in, fan_out))`.
+- The tokenizer and the embedding-layer weights are taken from the pre-trained MiniMind3-v (https://github.com/jingyaogong/minimind).
+- During pre-training, the model was trained on **5B tokens** with a context length of 512, using a learning rate of 1e-4 with warmup and cosine decay to 1e-5.
+- During full fine-tuning, the model was trained on **8B tokens** with context lengths from 1024 to 4096, using a learning rate of 1e-5 with warmup and cosine decay to 1e-7. The final perplexity (PPL) is 4.01, measured with the 6400-token BPE vocabulary.
+- Pre-training batch size: 65536 tokens/step. Fine-tuning batch size: 131072 tokens/step.
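
As a rough illustration of the numbers above, the following sketch derives approximate step counts from the token budgets and batch sizes, and implements a generic warmup plus cosine-decay schedule together with the quoted scaling rule. The warmup length (`warmup_steps = 1000`), the linear warmup shape, and all function names are assumptions made for illustration; the card does not state them, and this is not the author's training code.

```python
import math


def total_steps(total_tokens: int, tokens_per_step: int) -> int:
    """Number of optimizer steps needed to consume a token budget."""
    return math.ceil(total_tokens / tokens_per_step)


def lr_at(step: int, n_steps: int, peak_lr: float, final_lr: float,
          warmup_steps: int = 1000) -> float:
    """Linear warmup to peak_lr, then cosine decay to final_lr.

    warmup_steps and the linear warmup shape are assumptions; the model
    card only says "warmup and cosine decay".
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, n_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine


def lr_scale(fan_in: int, fan_out: int) -> float:
    """Scaling factor quoted in the card: 0.2 * sqrt(max(fan_in, fan_out))."""
    return 0.2 * math.sqrt(max(fan_in, fan_out))


# Token budgets are the rounded figures from the card, so these are estimates.
pretrain_steps = total_steps(5_000_000_000, 65_536)
finetune_steps = total_steps(8_000_000_000, 131_072)

# Pre-training schedule: 1e-4 decaying to 1e-5.
lr_end_of_pretraining = lr_at(pretrain_steps, pretrain_steps, 1e-4, 1e-5)

# Scaling factor for a hypothetical 768 x 1536 weight matrix (FFN-sized).
ffn_scale = lr_scale(768, 1536)
```

With the rounded token budgets this gives roughly 76k pre-training steps and 61k fine-tuning steps. How the scaling factor is combined with the per-stage learning-rate multipliers is not specified in the card, so the sketch leaves that out.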
+## Additional Details
+- Because of the composition of the pre-training dataset, the pre-trained model is likely to generate multiple-choice questions (similar to those found in benchmark tests) when given English input.
+- To adjust the reasoning depth, the conversation must follow a specific reasoning template:
+```
+<|im_start|>user
+[user content]<|im_end|>
+<|im_start|>assistant
+<think>juice = 1.14
+[thinking content]</think>
+[reply content]<|im_end|>
+```
+The `juice` value must be written with two decimal places and with a space on each side of the equals sign; otherwise model performance degrades. If the value is too small or too large, the output length may be unpredictable because such values were rarely covered in the training data. The `juice` value for a target token count can be computed as:
+$$\max\left(0.0,\ \log_2\left(\frac{\text{token\_count}}{128} + 1\right)\right)$$
+Typical values:
+| juice | token count |
+|-------|--------------|
+| 0.58  | 64           |
+| 1.00  | 128          |
+| 2.00  | 384          |
+| 3.00  | 896          |
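
The formula and template above can be sketched in Python as follows. The helper names (`juice_from_tokens`, `tokens_from_juice`, `format_juice`, `build_prompt`) are hypothetical and not part of the released model code. Ending the prompt right after the `juice` line, so that the model continues with its own thinking content, is an assumption based on the template.

```python
import math


def juice_from_tokens(token_count: float) -> float:
    """juice = max(0, log2(token_count / 128 + 1)), as given in the card."""
    if token_count < 0:
        raise ValueError("token_count must be non-negative")
    return max(0.0, math.log2(token_count / 128 + 1))


def tokens_from_juice(juice: float) -> float:
    """Inverse of the formula: token_count = 128 * (2**juice - 1)."""
    return 128 * (2.0 ** juice - 1)


def format_juice(juice: float) -> str:
    """Render the value with two decimals and a space on each side of '='."""
    return f"juice = {juice:.2f}"


def build_prompt(user_content: str, token_count: float) -> str:
    """Fill in the template up to and including the juice line.

    Assumption: generation starts after the juice line, so the model
    writes the thinking content, the closing </think>, and the reply.
    """
    juice_line = format_juice(juice_from_tokens(token_count))
    return (
        "<|im_start|>user\n"
        f"{user_content}<|im_end|>\n"
        "<|im_start|>assistant\n"
        f"<think>{juice_line}\n"
    )


prompt = build_prompt("Hello", 384)
```

Here `build_prompt("Hello", 384)` ends with the line `<think>juice = 2.00`, matching the 384-token row of the table, and `tokens_from_juice(3.0)` recovers 896.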
 ## 模型描述
 YModel3是SnifferCaptain训练的到目前为止（5/4/2026）最新的大语言模型。模型相比YModel2，支持了如混合思考、可调思考深度的功能，在回答的质量上有一定的进步。