+---
+license: mit
+language:
+  - zh
+tags:
+  - moe
+  - chinese
+  - vlm
+  - from-scratch
+  - lora
+pipeline_tag: text-generation
+---
+# MoXin (墨心)
+A Chinese large language model and multimodal vision-language model, implemented from scratch.
+- GitHub: [https://github.com/mozihe/moxin](https://github.com/mozihe/moxin)
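+## Model Overview
+MoXin is a Chinese language model project built entirely from scratch. It covers the full pipeline: tokenizer training, pretraining, multi-stage SFT, LoRA fine-tuning, and a multimodal VLM extension. All components are implemented in native PyTorch and do not depend on any third-party training framework.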
+## Model Architecture
+- **Type**: Decoder-only Transformer + Mixture of Experts (MoE)
+- **Total parameters**: ~270M
+- **Hidden size**: 768
+- **Layers**: 2 dense + 10 MoE = 12 layers
+- **Attention**: GQA (8 Q heads / 2 KV heads)
+- **FFN**: SwiGLU, intermediate size 2048
+- **MoE**: 4 experts, top-2 activation, 1 shared expert, auxiliary load-balancing loss
+- **Position encoding**: RoPE (θ=1e6)
+- **Normalization**: RMSNorm
+- **Vocabulary**: 9600 (BPE, self-trained)
+- **Max sequence length**: 1024
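As a sanity check on the figures above, the following back-of-the-envelope sketch estimates the parameter count from the listed hyperparameters. Several choices are assumptions that the model card does not state: a head dimension of hidden size / Q heads, bias-free linear layers, tied input and output embeddings, 4 routed experts plus 1 separate shared expert per MoE layer (each the same size as the dense FFN), and normalization weights ignored. Under those assumptions the estimate lands close to the stated ~270M.

```python
# Rough parameter count for the architecture listed above.
# Assumptions (not stated in the model card): head_dim = hidden / n_q_heads,
# bias-free linear layers, tied input/output embeddings, 4 routed experts
# plus 1 shared expert per MoE layer, and norm weights ignored.

HIDDEN = 768
VOCAB = 9600
N_Q_HEADS, N_KV_HEADS = 8, 2
FFN_HIDDEN = 2048
N_DENSE_LAYERS, N_MOE_LAYERS = 2, 10
N_ROUTED, N_SHARED, TOP_K = 4, 1, 2

head_dim = HIDDEN // N_Q_HEADS    # 96
kv_dim = N_KV_HEADS * head_dim    # 192

# GQA: full-width Q and O projections, narrower K and V projections.
attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim
# SwiGLU uses three weight matrices: gate, up, and down.
ffn = 3 * HIDDEN * FFN_HIDDEN
# The router scores each routed expert from the hidden state.
router = HIDDEN * N_ROUTED

embedding = VOCAB * HIDDEN
dense_layer = attn + ffn
moe_layer_total = attn + router + (N_ROUTED + N_SHARED) * ffn
moe_layer_active = attn + router + (TOP_K + N_SHARED) * ffn

total = embedding + N_DENSE_LAYERS * dense_layer + N_MOE_LAYERS * moe_layer_total
active = embedding + N_DENSE_LAYERS * dense_layer + N_MOE_LAYERS * moe_layer_active

print(f"total params: {total / 1e6:.1f}M")
print(f"active per token: {active / 1e6:.1f}M")
```

With top-2 routing only two routed experts plus the shared expert run for each token, so the per-token compute corresponds to the smaller "active" figure rather than the total.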
+### VLM Extension
+- **Vision encoder**: CLIP ViT-B/16 (frozen, ~86M)
+- **Projector**: VisionProj (Linear → GELU → Linear, 768 → 768)
+- **Image representation**: 196 patch tokens injected into the text sequence
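The 196 patch tokens follow from the ViT patch grid. The short sketch below works through that arithmetic and the context left for text. It assumes the standard 224x224 input resolution for CLIP ViT-B/16 and that the class token is dropped; the projector size additionally assumes both linear layers are 768 wide and carry biases. None of these details are confirmed by the model card.

```python
# Assumption: standard 224x224 input resolution for CLIP ViT-B/16.
IMAGE_SIZE = 224
PATCH_SIZE = 16
MAX_SEQ_LEN = 1024
HIDDEN = 768

# A 224x224 image split into 16x16 patches gives a 14x14 grid.
patches_per_side = IMAGE_SIZE // PATCH_SIZE
num_patch_tokens = patches_per_side ** 2

# Each image consumes this many positions of the 1024-token context.
text_budget = MAX_SEQ_LEN - num_patch_tokens

# Assumption: Linear(768, 768) -> GELU -> Linear(768, 768), both with biases.
projector_params = 2 * (HIDDEN * HIDDEN + HIDDEN)

print(num_patch_tokens, text_budget, projector_params)
```

Under these assumptions a single image leaves 828 positions for the prompt and the generated answer.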
+## Weight Files
+| File | Description |
+|---|---|
+| `pretrain.pth` | Text pretraining weights |
+| `sft01.pth` | SFT stage 1 (max_seq_len=512) |
+| `sft02.pth` | SFT stage 2 (max_seq_len=1024) |
+| `moxin-lora.pt` | LoRA fine-tuning weights (based on sft02) |
+| `pretrain_vlm.pth` | VLM pretraining weights |
+| `sft_vlm.pth` | VLM SFT weights |
+## Training Pipeline
+```
+Tokenizer training
+      ↓
+Text pretraining → pretrain.pth
+      ↓
+SFT-1 (seq_len=512) → sft01.pth
+      ↓
+SFT-2 (seq_len=1024) → sft02.pth
+      ↓                                  ↓
+LoRA fine-tuning → moxin-lora.pt   VLM pretraining → pretrain_vlm.pth
+                                         ↓
+                                     VLM SFT → sft_vlm.pth
+```
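The diagram above can also be read as a small dependency graph between checkpoints. The sketch below encodes it and walks back from a stage to list every checkpoint on its path. The stage keys are hypothetical names chosen for illustration, and the parent of the VLM pretraining stage is inferred from the branch drawn under SFT-2, so treat it as an assumption.

```python
# Each stage maps to (parent stage, output checkpoint).
# Stage names are illustrative; the VLM branch parent is read off the diagram.
STAGES = {
    "pretrain": (None, "pretrain.pth"),
    "sft1": ("pretrain", "sft01.pth"),
    "sft2": ("sft1", "sft02.pth"),
    "lora": ("sft2", "moxin-lora.pt"),
    "vlm_pretrain": ("sft2", "pretrain_vlm.pth"),
    "vlm_sft": ("vlm_pretrain", "sft_vlm.pth"),
}


def lineage(stage):
    """Return the checkpoints produced on the path from pretraining to `stage`."""
    chain = []
    while stage is not None:
        parent, checkpoint = STAGES[stage]
        chain.append(checkpoint)
        stage = parent
    return chain[::-1]


print(lineage("lora"))
print(lineage("vlm_sft"))
```

Calling `lineage("lora")` lists the three text checkpoints followed by `moxin-lora.pt`, which matches the "based on sft02" note in the weight table.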
+## Quick Start
+```python
+import torch
+from transformers import AutoTokenizer
+# Clone the project code first and run this from the repository root:
+# git clone https://github.com/mozihe/moxin
+# cd moxin
+from config.moxin_config import MoXinConfig
+from model.moxin_model import MoXinModel
+config = MoXinConfig()
+tokenizer = AutoTokenizer.from_pretrained("tokenizer/moxin_tokenizer")
+model = MoXinModel(config)
+# Load the SFT stage 2 weights (expected at out/sft02.pth) onto the CPU
+state_dict = torch.load("out/sft02.pth", map_location="cpu")
+model.load_state_dict(state_dict, strict=False)
+model.eval()
+# Build a chat-formatted prompt; the message reads "Hello, please introduce yourself."
+messages = [{"role": "user", "content": "你好，请介绍一下你自己。"}]
+prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+# Tokenize and add a batch dimension of size 1
+input_ids = torch.tensor(tokenizer(prompt)["input_ids"]).unsqueeze(0)
+# Sample a response without tracking gradients
+with torch.no_grad():
+    output = model.generate(
+        input_ids,
+        eos_token_id=tokenizer.eos_token_id,
+        max_new_tokens=512,
+        temperature=0.85,
+        top_p=0.85,
+    )
+print(tokenizer.decode(output[0], skip_special_tokens=True))
+```
+## Evaluation Results
+### Chinese Language Ability
+| Metric | Value |
+|---|---|
+| Perplexity | 147.36 |
+| Distinct-1 | 0.492 |
+| Distinct-2 | 0.864 |
+| Distinct-3 | 0.943 |
+| Repetition | 0.009 |
+| Empty Rate | 0.0% |
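For readers unfamiliar with the metrics in this table, here is a minimal sketch of the common definitions of perplexity and Distinct-n. It is not the project's evaluation script: whether MoXin computes Distinct-n over characters or tokens, and how it aggregates across samples, is not stated in the model card, so the character-level example is an assumption.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def distinct_n(tokens, n):
    """Ratio of unique n-grams to total n-grams in a token sequence."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# Character-level tokens are used here for illustration only.
sample = list("今天天气很好，今天天气不错")
for n in (1, 2, 3):
    print(f"Distinct-{n}: {distinct_n(sample, n):.3f}")

# A model that assigns probability 0.5 to every token has perplexity 2.
print(perplexity([math.log(0.5)] * 4))
```

Under this definition, a perplexity of 147.36 corresponds to a mean negative log-likelihood of about 4.99 nats per token.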
+### C-Eval (Zero-shot)
+| Category | Accuracy |
+|---|---|
+| STEM | 24.4% |
+| Social Science | 25.0% |
+| Humanities | 24.8% |
+| Other | 22.0% |
+| **Overall** | **24.2%** |
+### VLM Image-Text Understanding
+| Metric | Value |
+|---|---|
+| CharOverlap | 0.410 |
+| BLEU-1 | 0.305 |
+| Distinct-2 | 0.714 |
+| Repetition | 0.036 |
+| Empty Rate | 0.0% |
+## Acknowledgements
+- [MiniMind](https://github.com/jingyaogong/minimind): inspiration for this project
+- [OpenAI CLIP](https://github.com/openai/CLIP): vision encoder
+- [HuggingFace Transformers](https://github.com/huggingface/transformers): tokenizer and model base classes
+- [C-Eval](https://cevalbenchmark.com/): Chinese evaluation benchmark
+## License
+MIT