Commit ·
21b4e01
0
Parent(s):
Duplicate from jingyaogong/minimind-3v
Browse filesCo-authored-by: jingyaogong <jingyaogong@users.noreply.huggingface.co>
- .gitattributes +41 -0
- README.md +699 -0
- README_en.md +716 -0
- chat_template.jinja +85 -0
- config.json +42 -0
- generation_config.json +6 -0
- images/VLM-structure-moe.jpg +3 -0
- images/VLM-structure.jpg +3 -0
- images/llava-structure.png +3 -0
- images/logo.png +3 -0
- images/minimind-3v.gif +3 -0
- images/minimind-v-input.jpg +3 -0
- images/pretrain_loss.jpg +0 -0
- images/sft_loss.jpg +0 -0
- model_minimind.py +279 -0
- model_vlm.py +155 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +52 -0
- tokenizer.json +0 -0
- tokenizer_config.json +335 -0
.gitattributes
ADDED
|
@@ -0,0 +1,41 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
*.7z filter=lfs diff=lfs merge=lfs -text
|
| 2 |
+
*.arrow filter=lfs diff=lfs merge=lfs -text
|
| 3 |
+
*.bin filter=lfs diff=lfs merge=lfs -text
|
| 4 |
+
*.bz2 filter=lfs diff=lfs merge=lfs -text
|
| 5 |
+
*.ckpt filter=lfs diff=lfs merge=lfs -text
|
| 6 |
+
*.ftz filter=lfs diff=lfs merge=lfs -text
|
| 7 |
+
*.gz filter=lfs diff=lfs merge=lfs -text
|
| 8 |
+
*.h5 filter=lfs diff=lfs merge=lfs -text
|
| 9 |
+
*.joblib filter=lfs diff=lfs merge=lfs -text
|
| 10 |
+
*.lfs.* filter=lfs diff=lfs merge=lfs -text
|
| 11 |
+
*.mlmodel filter=lfs diff=lfs merge=lfs -text
|
| 12 |
+
*.model filter=lfs diff=lfs merge=lfs -text
|
| 13 |
+
*.msgpack filter=lfs diff=lfs merge=lfs -text
|
| 14 |
+
*.npy filter=lfs diff=lfs merge=lfs -text
|
| 15 |
+
*.npz filter=lfs diff=lfs merge=lfs -text
|
| 16 |
+
*.onnx filter=lfs diff=lfs merge=lfs -text
|
| 17 |
+
*.ot filter=lfs diff=lfs merge=lfs -text
|
| 18 |
+
*.parquet filter=lfs diff=lfs merge=lfs -text
|
| 19 |
+
*.pb filter=lfs diff=lfs merge=lfs -text
|
| 20 |
+
*.pickle filter=lfs diff=lfs merge=lfs -text
|
| 21 |
+
*.pkl filter=lfs diff=lfs merge=lfs -text
|
| 22 |
+
*.pt filter=lfs diff=lfs merge=lfs -text
|
| 23 |
+
*.pth filter=lfs diff=lfs merge=lfs -text
|
| 24 |
+
*.rar filter=lfs diff=lfs merge=lfs -text
|
| 25 |
+
*.safetensors filter=lfs diff=lfs merge=lfs -text
|
| 26 |
+
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
| 27 |
+
*.tar.* filter=lfs diff=lfs merge=lfs -text
|
| 28 |
+
*.tar filter=lfs diff=lfs merge=lfs -text
|
| 29 |
+
*.tflite filter=lfs diff=lfs merge=lfs -text
|
| 30 |
+
*.tgz filter=lfs diff=lfs merge=lfs -text
|
| 31 |
+
*.wasm filter=lfs diff=lfs merge=lfs -text
|
| 32 |
+
*.xz filter=lfs diff=lfs merge=lfs -text
|
| 33 |
+
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
+
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
+
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 36 |
+
images/llava-structure.png filter=lfs diff=lfs merge=lfs -text
|
| 37 |
+
images/logo.png filter=lfs diff=lfs merge=lfs -text
|
| 38 |
+
images/minimind-3v.gif filter=lfs diff=lfs merge=lfs -text
|
| 39 |
+
images/minimind-v-input.jpg filter=lfs diff=lfs merge=lfs -text
|
| 40 |
+
images/VLM-structure-moe.jpg filter=lfs diff=lfs merge=lfs -text
|
| 41 |
+
images/VLM-structure.jpg filter=lfs diff=lfs merge=lfs -text
|
README.md
ADDED
|
@@ -0,0 +1,699 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
datasets:
|
| 4 |
+
- jingyaogong/minimind-v_dataset
|
| 5 |
+
language:
|
| 6 |
+
- zh
|
| 7 |
+
- en
|
| 8 |
+
pipeline_tag: image-text-to-text
|
| 9 |
+
---
|
| 10 |
+
<div align="center">
|
| 11 |
+
|
| 12 |
+

|
| 13 |
+
|
| 14 |
+
</div>
|
| 15 |
+
|
| 16 |
+
|
| 17 |
+
<div align="center">
|
| 18 |
+
|
| 19 |
+

|
| 20 |
+
[](https://github.com/jingyaogong/minimind-v/stargazers)
|
| 21 |
+
[](LICENSE)
|
| 22 |
+
[](https://github.com/jingyaogong/minimind-v/commits/master)
|
| 23 |
+
[](https://github.com/jingyaogong/minimind-v/pulls)
|
| 24 |
+
[](https://huggingface.co/collections/jingyaogong/minimind-v-67000833fb60b3a2e1f3597d)
|
| 25 |
+
|
| 26 |
+
</div>
|
| 27 |
+
|
| 28 |
+
<div align="center">
|
| 29 |
+
|
| 30 |
+

|
| 31 |
+
|
| 32 |
+
</div>
|
| 33 |
+
|
| 34 |
+
|
| 35 |
+
<div align="center">
|
| 36 |
+
<h3>"大道至简"</h3>
|
| 37 |
+
</div>
|
| 38 |
+
|
| 39 |
+
<div align="center">
|
| 40 |
+
|
| 41 |
+
中文 | [English](./README_en.md)
|
| 42 |
+
|
| 43 |
+
</div>
|
| 44 |
+
|
| 45 |
+
* 此项目旨在从0开始,仅用1.3块钱成本 + 1小时!即可训练出67M参数的超小多模态视觉语言模型**MiniMind-V**。
|
| 46 |
+
* **MiniMind-V**最小版本体积仅为 GPT3 的约 $\frac{1}{2600}$,力求做到个人GPU也可快速推理甚至训练。
|
| 47 |
+
* **MiniMind-V**是[MiniMind](https://github.com/jingyaogong/minimind)纯语言模型的视觉能力额外拓展。
|
| 48 |
+
* 项目同时包含了VLM大模型的极简结构、数据集清洗、预训练(Pretrain)、监督微调(SFT)等全过程代码。
|
| 49 |
+
* 这不仅是一个开源VLM模型的最小实现,也是入门视觉语言模型的简明教程。
|
| 50 |
+
* 希望此项目能为所有人提供一个抛砖引玉的示例,一起感受创造的乐趣!推动更广泛AI社区的进步!
|
| 51 |
+
|
| 52 |
+
> 为防止误解,“1小时” 基于NVIDIA 3090硬件设备(单卡)测试`1 epoch`,“1.3块钱” 指GPU服务器租用成本。
|
| 53 |
+
|
| 54 |
+
|
| 55 |
+
|
| 56 |
+
<div align="center">
|
| 57 |
+
|
| 58 |
+

|
| 59 |
+
|
| 60 |
+
[🔗🤖在线体验](https://www.modelscope.cn/studios/gongjy/MiniMind-V) | [🔗🎞️视频介绍](https://www.bilibili.com/video/BV1Sh1vYBEzY)
|
| 61 |
+
|
| 62 |
+
</div>
|
| 63 |
+
|
| 64 |
+
# 📌 项目介绍
|
| 65 |
+
|
| 66 |
+
“用乐高拼出一架飞机,远比坐在头等舱里飞行更让人兴奋!”
|
| 67 |
+
构建VLM范式的多模态大模型是否真的如想象中那样复杂?它的代码实现到底如何?
|
| 68 |
+
训练过程究竟难不难?那么现在,探索它们的答案,一起感受创造的乐趣吧!
|
| 69 |
+
|
| 70 |
+
> [!TIP]
|
| 71 |
+
> (截至2026-02-15)MiniMind-V 系列已完成了以下型号模型训练,最小仅需67M (0.067B),即可具备识图和对话的能力!
|
| 72 |
+
|
| 73 |
+
| 模型 (大小) | 推理占用 | release |
|
| 74 |
+
|---------------------------|--------|------------|
|
| 75 |
+
| minimind-3v-moe (201M-A67M) | 1.0 GB | 2026.04.01 |
|
| 76 |
+
| minimind-3v (67M) | 0.5 GB | 2026.04.01 |
|
| 77 |
+
| MiniMind2-V (104M) | 1.1 GB | 2025.02.20 |
|
| 78 |
+
| MiniMind2-Small-V (26M) | 0.6 GB | 2025.02.20 |
|
| 79 |
+
| minimind-v-v1-small (27M) | 0.6 GB | 2024.10.04 |
|
| 80 |
+
| minimind-v-v1 (109M) | 1.1 GB | 2024.10.04 |
|
| 81 |
+
|
| 82 |
+
#### 👉 更新日志
|
| 83 |
+
|
| 84 |
+
<details close>
|
| 85 |
+
<summary> <b>2026-02-15</b> </summary>
|
| 86 |
+
|
| 87 |
+
- 新增 minimind-3v (67M) 和 minimind-3v-moe (201M-A67M) 模型
|
| 88 |
+
- 统一使用768+8架构,支持dense和moe两种模式
|
| 89 |
+
- 数据集格式更新为parquet,新增LLaVA-SFT-665K数据源
|
| 90 |
+
- 更新tokenizer,图像占位符改为`<|image_pad|>`
|
| 91 |
+
|
| 92 |
+
</details>
|
| 93 |
+
|
| 94 |
+
<details close>
|
| 95 |
+
<summary> <b>2025-10-24</b> </summary>
|
| 96 |
+
|
| 97 |
+
- bug修复:模型权重不对应
|
| 98 |
+
- 适配[「minimind-1024更新」](https://github.com/jingyaogong/minimind)
|
| 99 |
+
- 代码重构:训练和评估脚本规范化
|
| 100 |
+
- 新增完整的断点续训支持
|
| 101 |
+
|
| 102 |
+
</details>
|
| 103 |
+
|
| 104 |
+
<details close>
|
| 105 |
+
<summary> <b>2025-04-27</b> </summary>
|
| 106 |
+
|
| 107 |
+
- 兼容性更新
|
| 108 |
+
- 适配[「minimind仓库新特性」](https://github.com/jingyaogong/minimind/issues/370)
|
| 109 |
+
- 规范化部分代码
|
| 110 |
+
|
| 111 |
+
</details>
|
| 112 |
+
|
| 113 |
+
<details close>
|
| 114 |
+
|
| 115 |
+
<summary> <b>More...</b> </summary>
|
| 116 |
+
|
| 117 |
+
**2025-02-20**
|
| 118 |
+
|
| 119 |
+
- MiniMind2-V伴随MiniMind2同步更新
|
| 120 |
+
- 大幅减少所有冗余代码,规范代码格式
|
| 121 |
+
- 大幅精简模型冗余结构
|
| 122 |
+
- 更新数据集格式,拓展新的SFT数据集
|
| 123 |
+
- 比前代VLM更优秀的效果!
|
| 124 |
+
|
| 125 |
+
**2024-10-05**
|
| 126 |
+
|
| 127 |
+
- MiniMind-V如期而至,首次开源
|
| 128 |
+
|
| 129 |
+
</details>
|
| 130 |
+
|
| 131 |
+
# 📌 快速开始
|
| 132 |
+
|
| 133 |
+
<details style="color:rgb(128,128,128)">
|
| 134 |
+
<summary>分享本人的软硬件配置(仅供参考)</summary>
|
| 135 |
+
|
| 136 |
+
* CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
|
| 137 |
+
* RAM: 128 GB
|
| 138 |
+
* GPU: NVIDIA GeForce RTX 3090(24GB) * 8
|
| 139 |
+
* Ubuntu==20.04
|
| 140 |
+
* CUDA==12.2
|
| 141 |
+
* Python==3.10.16
|
| 142 |
+
* [requirements.txt](./requirements.txt)
|
| 143 |
+
|
| 144 |
+
</details>
|
| 145 |
+
|
| 146 |
+
### 第0步
|
| 147 |
+
|
| 148 |
+
```bash
|
| 149 |
+
# 克隆代码仓库
|
| 150 |
+
git clone https://github.com/jingyaogong/minimind-v
|
| 151 |
+
```
|
| 152 |
+
|
| 153 |
+
```bash
|
| 154 |
+
# 下载siglip2模型到 ./model 目录下
|
| 155 |
+
git clone https://huggingface.co/jingyaogong/siglip2-base-p16-ve
|
| 156 |
+
# or
|
| 157 |
+
git clone https://modelscope.cn/models/gongjy/siglip2-base-p16-ve
|
| 158 |
+
```
|
| 159 |
+
|
| 160 |
+
```bash
|
| 161 |
+
# 下载minimind语言模型权重到 ./out 目录下(作为训练VLM的基座语言模型)
|
| 162 |
+
# HuggingFace
|
| 163 |
+
https://huggingface.co/jingyaogong/minimind-3v-pytorch/blob/main/llm_768.pth
|
| 164 |
+
# 国内源
|
| 165 |
+
https://modelscope.cn/models/gongjy/minimind-3v-pytorch/resolve/master/llm_768.pth
|
| 166 |
+
```
|
| 167 |
+
|
| 168 |
+
## Ⅰ 测试已有模型效果
|
| 169 |
+
|
| 170 |
+
### 1' 环境准备
|
| 171 |
+
|
| 172 |
+
```bash
|
| 173 |
+
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
|
| 174 |
+
```
|
| 175 |
+
|
| 176 |
+
### 2' 下载模型
|
| 177 |
+
|
| 178 |
+
```bash
|
| 179 |
+
git clone https://huggingface.co/jingyaogong/minimind-3v
|
| 180 |
+
```
|
| 181 |
+
|
| 182 |
+
### 3' 命令行问答
|
| 183 |
+
|
| 184 |
+
```bash
|
| 185 |
+
# load_from='model': 加载原生PyTorch权重, load_from='其他路径': 加载transformers格式
|
| 186 |
+
python eval_vlm.py --load_from model --weight sft_vlm
|
| 187 |
+
|
| 188 |
+
# 或使用transformers格式模型
|
| 189 |
+
python eval_vlm.py --load_from minimind-3v
|
| 190 |
+
```
|
| 191 |
+
|
| 192 |
+
### 4' 启动WebUI(可选)
|
| 193 |
+
|
| 194 |
+
```bash
|
| 195 |
+
# ⚠️ 须先将 transformers 格式模型文件夹复制到 ./scripts/ 目录下(例如:cp -r minimind-3v ./scripts/minimind-3v),web_demo_vlm 脚本会自动扫描该目录下包含权重文件的子文件夹,如不存在则报错
|
| 196 |
+
cd scripts && python web_demo_vlm.py
|
| 197 |
+
```
|
| 198 |
+
|
| 199 |
+
## Ⅱ 从0开始自己训练
|
| 200 |
+
|
| 201 |
+
### 1' 环境准备
|
| 202 |
+
|
| 203 |
+
```bash
|
| 204 |
+
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
|
| 205 |
+
```
|
| 206 |
+
|
| 207 |
+
<details style="color:rgb(128,128,128)">
|
| 208 |
+
<summary>注:提前测试Torch是否可用cuda</summary>
|
| 209 |
+
|
| 210 |
+
```bash
|
| 211 |
+
import torch
|
| 212 |
+
print(torch.cuda.is_available())
|
| 213 |
+
```
|
| 214 |
+
|
| 215 |
+
如果不可用,请自行去[torch_stable](https://download.pytorch.org/whl/torch_stable.html)
|
| 216 |
+
下载whl文件安装。参考[链接](https://blog.csdn.net/weixin_45456738/article/details/141029610?ops_request_misc=&request_id=&biz_id=102&utm_term=%E5%AE%89%E8%A3%85torch&utm_medium=distribute.pc_search_result.none-task-blog-2~all~sobaiduweb~default-2-141029610.nonecase&spm=1018.2226.3001.4187)
|
| 217 |
+
|
| 218 |
+
</details>
|
| 219 |
+
|
| 220 |
+
### 2' 数据下载
|
| 221 |
+
|
| 222 |
+
从下文提供的[数据集链接](https://huggingface.co/datasets/jingyaogong/minimind-v_dataset)
|
| 223 |
+
下载所需内容并放到`./dataset`下。
|
| 224 |
+
|
| 225 |
+
<details style="color:rgb(128,128,128)">
|
| 226 |
+
<summary>注:数据集须知</summary>
|
| 227 |
+
|
| 228 |
+
【注1】之前需解压50万零碎的图像文件可能非常慢。2025-12-27起,数据集格式统一为Parquet,图文一体化存储,体积更小,无需解压,加载更快。
|
| 229 |
+
|
| 230 |
+
【注2】Parquet是列式存储格式,支持高效压缩和快速读取。如果你对它感到陌生,可以预览数据内容,在`dataset/`目录下执行`python lm_dataset.py`可视化前5条图文对
|
| 231 |
+
|
| 232 |
+
Pretrain数据:
|
| 233 |
+
```bash
|
| 234 |
+
wget https://hf-mirror.com/datasets/jingyaogong/minimind-v_dataset/resolve/main/pretrain_i2t.parquet
|
| 235 |
+
```
|
| 236 |
+
|
| 237 |
+
SFT数据:
|
| 238 |
+
```bash
|
| 239 |
+
wget https://hf-mirror.com/datasets/jingyaogong/minimind-v_dataset/resolve/main/sft_i2t.parquet
|
| 240 |
+
```
|
| 241 |
+
|
| 242 |
+
建议预留~2GB空间存放数据集,若无多余空间存放pretrain数据,可尝试跳过pretrain训练步骤直接进行sft训练。
|
| 243 |
+
|
| 244 |
+
</details>
|
| 245 |
+
|
| 246 |
+
### 3' 开始训练
|
| 247 |
+
|
| 248 |
+
**3.1 预训练(学图像描述)**
|
| 249 |
+
|
| 250 |
+
```bash
|
| 251 |
+
# 基础训练命令(从LLM权重开始,仅训练vision_proj)
|
| 252 |
+
python train_pretrain_vlm.py --epochs 4 --from_weight llm
|
| 253 |
+
```
|
| 254 |
+
|
| 255 |
+
> 执行预训练,得到 `pretrain_vlm_*.pth` 作为预训练的输出权重(其中*为模型的dimension,默认为768)
|
| 256 |
+
|
| 257 |
+
|
| 258 |
+
**3.2 监督微调(学看图对话方式)**
|
| 259 |
+
|
| 260 |
+
```bash
|
| 261 |
+
# 基础训练命令(从预训练权重开始,全参数微调)
|
| 262 |
+
python train_sft_vlm.py --epochs 2 --from_weight pretrain_vlm
|
| 263 |
+
```
|
| 264 |
+
|
| 265 |
+
> 执行监督微调,得到 `sft_vlm_*.pth` 作为指令微调的输出权重
|
| 266 |
+
|
| 267 |
+
<details style="color:rgb(128,128,128)">
|
| 268 |
+
<summary>注:训练须知</summary>
|
| 269 |
+
|
| 270 |
+
**训练特性:**
|
| 271 |
+
- 支持断点续训:添加`--from_resume 1`参数可从上次中断处继续训练
|
| 272 |
+
- 支持GPU数量变化:续训时GPU数量改变会自动转换step
|
| 273 |
+
- 原子性保存:使用临时文件+替换机制,防止保存过程中断导致权重损坏
|
| 274 |
+
- 每次保存同时生成`out/**.pth`(模型权重)和`checkpoints/**_resume.pth`(训练状态)文件
|
| 275 |
+
|
| 276 |
+
```bash
|
| 277 |
+
# 训练中断后,使用相同命令并添加 --from_resume 1
|
| 278 |
+
python train_sft_vlm.py --epochs 4 --from_resume 1
|
| 279 |
+
```
|
| 280 |
+
|
| 281 |
+
**参数说明:**
|
| 282 |
+
- `--from_weight`: 基础权重名称(llm, pretrain_vlm, none等)
|
| 283 |
+
- `--save_weight`: 保存权重的前缀名
|
| 284 |
+
- `--from_resume`: 是否续训(0=从头开始,1=从检查点继续)
|
| 285 |
+
- `--freeze_llm`: 是否冻结LLM参数(仅pretrain使用)
|
| 286 |
+
- 更多可直接参考代码
|
| 287 |
+
|
| 288 |
+
</details>
|
| 289 |
+
|
| 290 |
+
|
| 291 |
+
---
|
| 292 |
+
|
| 293 |
+
### 4' 测试模型效果
|
| 294 |
+
|
| 295 |
+
确保需要测试的模型`*.pth`文件位于`./out/`目录下。
|
| 296 |
+
也可以直接去[此处](https://huggingface.co/jingyaogong/minimind-3v-pytorch)下载使用我训练的`*.pth`文件。
|
| 297 |
+
|
| 298 |
+
```bash
|
| 299 |
+
# 测试SFT模型(默认)
|
| 300 |
+
python eval_vlm.py --weight sft_vlm
|
| 301 |
+
|
| 302 |
+
# 测试Pretrain模型
|
| 303 |
+
python eval_vlm.py --weight pretrain_vlm
|
| 304 |
+
```
|
| 305 |
+
|
| 306 |
+
---
|
| 307 |
+
|
| 308 |
+
> [!TIP]
|
| 309 |
+
> 训练脚本均为Pytorch原生框架,均支持多卡加速,假设你的设备有N (N>1) 张显卡:
|
| 310 |
+
|
| 311 |
+
单机N卡启动训练方式 (DDP, 支持多机多卡集群)
|
| 312 |
+
|
| 313 |
+
```bash
|
| 314 |
+
torchrun --nproc_per_node N train_xxx.py
|
| 315 |
+
```
|
| 316 |
+
|
| 317 |
+
<details style="color:rgb(128,128,128)">
|
| 318 |
+
<summary>注:其它须知</summary>
|
| 319 |
+
|
| 320 |
+
<del>
|
| 321 |
+
单机N卡启动训练 (DeepSpeed)
|
| 322 |
+
|
| 323 |
+
```bash
|
| 324 |
+
deepspeed --master_port 29500 --num_gpus=N train_xxx.py
|
| 325 |
+
```
|
| 326 |
+
</del>
|
| 327 |
+
|
| 328 |
+
可根据需要开启wandb记录训练过程
|
| 329 |
+
|
| 330 |
+
```bash
|
| 331 |
+
# 需要登录: wandb login
|
| 332 |
+
torchrun --nproc_per_node N train_xxx.py --use_wandb
|
| 333 |
+
# and
|
| 334 |
+
python train_xxx.py --use_wandb
|
| 335 |
+
```
|
| 336 |
+
|
| 337 |
+
通过添加`--use_wandb`参数,可以记录训练过程,训练完成后,可以在wandb网站上查看训练过程。通过修改`wandb_project`
|
| 338 |
+
和`wandb_run_name`参数,可以指定项目名称和运行名称。
|
| 339 |
+
|
| 340 |
+
【注】:25年6月后,国内网络环境无法直连WandB,MiniMind项目默认转为使用[SwanLab](https://swanlab.cn/)作为训练可视化工具(完全兼容WandB API),即`import wandb`改为`import swanlab as wandb`即可,其他均无需改动。
|
| 341 |
+
|
| 342 |
+
</details>
|
| 343 |
+
|
| 344 |
+
# 📌 模型细节
|
| 345 |
+
|
| 346 |
+
MiniMind-V (VLM)的基座语言模型MiniMind (LLM)来自孪生项目[minimind](https://github.com/jingyaogong/minimind),
|
| 347 |
+
具体的模型结构、训练细节、原理、测试效果等均可移步[minimind](https://github.com/jingyaogong/minimind)项目查阅。
|
| 348 |
+
此处为减少冗余,省略讨论LLM的相关部分,默认您已对MiniMind (LLM)的细节有基本的了解。
|
| 349 |
+
|
| 350 |
+
> 即使您不太了解LLM的细节,也可参考“快速开始”流程训练一个MiniMind-V,
|
| 351 |
+
> 这并不受到影响,仓库致力于最低成本的开箱即用!
|
| 352 |
+
|
| 353 |
+
MiniMind-V的结构仅增加Visual Encoder和特征投影两个子模块,增加模态混合分支,以支持多种模态信息的输入:
|
| 354 |
+

|
| 355 |
+

|
| 356 |
+
|
| 357 |
+
|
| 358 |
+
<details>
|
| 359 |
+
<summary> 【重要】一些有趣的思考 </summary>
|
| 360 |
+
|
| 361 |
+
此处不妨展开想一想两个问题:
|
| 362 |
+
|
| 363 |
+
* 什么叫做**L**arge **L**anguage **M**odel (LLM)?
|
| 364 |
+
* 什么叫做多模态模型?
|
| 365 |
+
|
| 366 |
+
[这篇文章](https://www.jiqizhixin.com/articles/2024-09-15-3)完美吻合本人的想法:
|
| 367 |
+
大语言模型(LLM)名字虽然带有语言二字,但它们其实与语言关系不大,这只是历史问题,更确切的名字应该是自回归 Transformer
|
| 368 |
+
或者其他。LLM 更多是一种统计建模的通用技术,它们主要通过自回归 Transformer 来模拟 token 流,而这些 token
|
| 369 |
+
可以代表文本、图片、音频、动作选择、甚至是分子等任何东西。
|
| 370 |
+
因此,只要能将问题转化为模拟一系列离散 token 的流程,理论上都可以应用 LLM 来解决。
|
| 371 |
+
实际上,随着大型语言模型技术栈的日益成熟,我们可能会看到越来越多的问题被纳入这种建模范式。也就是说,问题固定在使用 LLM
|
| 372 |
+
进行『下一个 token 的预测』,只是每个领域中 token 的用途和含义有所不同。
|
| 373 |
+
|
| 374 |
+
[ZJU-LiXi老师](https://person.zju.edu.cn/xilics#694283)同样谈及过类似观点(原话大意如下):
|
| 375 |
+
文本、视频、语音、动作等在人类看来属于「多模态」信号,但所谓的「模态」其实只是人类在信息存储方式上的一种分类概念。
|
| 376 |
+
就像`.txt`和`.png`文件,虽然在视觉呈现和高级表现形式上有所不同,但它们本质上并没有根本区别。
|
| 377 |
+
之所以出现「多模态」这个概念,仅仅是因为人类在不同的感知层面上对这些信号的分类需求。
|
| 378 |
+
然而,对于机器来说,无论信号来自何种「模态」,最终它们都只是以一串二进制的「单模态」数字序列来呈现。
|
| 379 |
+
机器并不会区分这些信号的模态来源,而只是处理和分析这些序列背后所承载的信息内容。
|
| 380 |
+
|
| 381 |
+
个人认为**G**enerative **P**retrained **T**ransformer (GPT) 比 **L**arge **L**anguage **M**odel (LLM)更为贴切,
|
| 382 |
+
因此本人表达上更习惯用"GPT"去代表LLM/VLM/类GPT架构的系列模型,而非为了蹭OpenAI的热度。
|
| 383 |
+
|
| 384 |
+
至此,我们可以用一句话总结GPT的所作所为:
|
| 385 |
+
|
| 386 |
+
GPT模型根据现有token预测输出下一个下下一个下下下一个token ...,直到模型输出结束符;此处的"token"其实并不需要一定是文本!
|
| 387 |
+
|
| 388 |
+
```text
|
| 389 |
+
> 对于LLM模型,如果需要理解"图片",我们只要把"图片"作为对一种特殊的从来没见过的"外国语言",通过"外语词典"翻译后即可作为特殊的语言输入LLM
|
| 390 |
+
> 对于LLM模型,如果需要理解"音频",我们只要把"音频"作为对一种特殊的从来没见过的"外国语言",通过"外语词典"翻译后即可作为特殊的语言输入LLM
|
| 391 |
+
> ...
|
| 392 |
+
```
|
| 393 |
+
|
| 394 |
+
<u>**为了得到MiniMind-V,我们只需要完成这2件事即可:**</u>
|
| 395 |
+
|
| 396 |
+
1. 借助擅长翻译图片的 **"外语词典"** ,把图片从 **"外国语言"** 翻译为模型便于理解的 **"LLM语言"**
|
| 397 |
+
2. 训练微调LLM,使其和 **"外语词典"** 度过磨合期,从而更好的理解图片
|
| 398 |
+
|
| 399 |
+
"外语词典" 称之为Visual Encoder模型。
|
| 400 |
+
和LlaVA、Qwen-VL等视觉语言模型类似,MiniMind-V当前选用开源SigLIP2系列模型作为Visual Encoder。
|
| 401 |
+
具体使用[siglip2-base-p16-ve](https://huggingface.co/jingyaogong/siglip2-base-p16-ve),
|
| 402 |
+
一种基于 ViT-B/16 架构的Visual Encoder用于描述图像文本信息。
|
| 403 |
+
当前使用的 SigLIP2 NaFlex 视觉编码器会根据预处理结果生成最多256个patch token作为encoder编码层的输入,
|
| 404 |
+
最终产生1×768维的嵌入向量用于和文本对计算误差。
|
| 405 |
+
我们并不需要最终嵌入表示,因此只取encoder层的输出,也就是VIT核心主干的输出特征即可。
|
| 406 |
+
它拿到前一层256×768大小的特征,通过reshape将每4个相邻token拼接为1个(256×768 → 64×3072),再经过2层MLP(Linear→GELU→Linear)投影到LLM的隐藏维度,最终作为64个visual token输入MiniMind-V。
|
| 407 |
+
与LLM的结合在获取图像encoder特征后,一方面需要把视觉特征对齐到LLM的文本token维度,
|
| 408 |
+
另一方面,要将图像特征映射到与文本embedding相同的空间,即文本token和原生的视觉token需要磨合并不能直接地一视同仁,
|
| 409 |
+
可以称之为跨模态的特征对齐。
|
| 410 |
+
|
| 411 |
+
[LlaVA-1](https://arxiv.org/pdf/2304.08485)使用简单的线性变换完成对齐,[LlaVA-1.5](https://arxiv.org/pdf/2310.03744)升级为2层MLP,MiniMind-V采用与LlaVA-1.5相同的MLP Projection方案,并结合reshape进行token压缩。
|
| 412 |
+
|
| 413 |
+

|
| 414 |
+
|
| 415 |
+
MiniMind-V的主要结构已介绍完毕。
|
| 416 |
+
|
| 417 |
+
</details>
|
| 418 |
+
|
| 419 |
+
|
| 420 |
+
---
|
| 421 |
+
|
| 422 |
+
下面,我们简单讨论MiniMind-V的外部输入输出的变化。
|
| 423 |
+
|
| 424 |
+
VLM的输入依然是一段文本,其中包含特殊的`<image>`占位符。
|
| 425 |
+
在计算文本嵌入后,可以将图像编码器生成的向量投影到该占位符对应的嵌入部分,替换掉原先的占位符embedding。
|
| 426 |
+
例如:
|
| 427 |
+
|
| 428 |
+
```text
|
| 429 |
+
<image>\n这个图像中有什么内容?
|
| 430 |
+
```
|
| 431 |
+
|
| 432 |
+
在`minimind-v`中,使用64个`<|image_pad|>`组成的占位符代替图像(SigLIP2输出的256个patch特征经reshape+MLP压缩为64个token),因此`minimind-v`的prompt为:
|
| 433 |
+
|
| 434 |
+
```text
|
| 435 |
+
<|image_pad|><|image_pad|>...<|image_pad|>(×64)\n这个图片描述的是什么内容?
|
| 436 |
+
```
|
| 437 |
+
|
| 438 |
+
计算完embedding和projection,用视觉特征替换掉对应占位符的embedding后,整个计算过程到输出则和LLM部分没有差异。
|
| 439 |
+
|
| 440 |
+

|
| 441 |
+
|
| 442 |
+
至此,`MiniMind-V`的所有细节呈现完毕,VLM模型子类继承自`MiniMind`,仅做**最小**变更而产生,核心算法改动`< 50行`,迁移难度极低,和`LlaVA`等模型具体实现存在区别,但思路一致。
|
| 443 |
+
|
| 444 |
+
# 📌 实验
|
| 445 |
+
|
| 446 |
+
## Ⅰ 数据集
|
| 447 |
+
|
| 448 |
+
原始来源:
|
| 449 |
+
- [Chinese-LLaVA-Vision](https://huggingface.co/datasets/LinkSoul/Chinese-LLaVA-Vision-Instructions):包含约57万张预训练图像,来自CC-3M和COCO 2014
|
| 450 |
+
- [llava-en-zh-300k](https://huggingface.co/datasets/BUAADreamer/llava-en-zh-300k):包含300k条指令微调数据和15万张图像
|
| 451 |
+
- [LLaVA-SFT-665K](https://huggingface.co/datasets/csuhan/LLaVA-SFT-665K):包含665k条指令微调数据
|
| 452 |
+
|
| 453 |
+
其中部分为中文数据,部分为英文数据。问答内容经过翻译,对中文支持更友好,进一步经过整理并`resize`(pretrain分辨率128×128,sft分辨率160×160)。
|
| 454 |
+
|
| 455 |
+
(pretrain_i2t.parquet) 预训练数据集格式:
|
| 456 |
+
|
| 457 |
+
```text
|
| 458 |
+
列名: conversations (json string), image_bytes (binary), image_names (string)
|
| 459 |
+
|
| 460 |
+
conversations 示例:
|
| 461 |
+
[
|
| 462 |
+
{"role": "user", "content": "提供给定图像的简要描述。\n<image>"},
|
| 463 |
+
{"role": "assistant", "content": "橄榄油是自由使用的健康成分。"}
|
| 464 |
+
]
|
| 465 |
+
image_bytes: <图像二进制数据>
|
| 466 |
+
```
|
| 467 |
+
|
| 468 |
+
(sft_i2t.parquet) 单图指令微调数据集格式:
|
| 469 |
+
|
| 470 |
+
```text
|
| 471 |
+
列名: conversations (json string), image_bytes (binary), image_names (string)
|
| 472 |
+
|
| 473 |
+
conversations 示例:
|
| 474 |
+
[
|
| 475 |
+
{"role": "user", "content": "闹钟的位置对睡眠质量有什么影响?<image>"},
|
| 476 |
+
{"role": "assistant", "content": "把数字闹钟放在床头柜..."}
|
| 477 |
+
]
|
| 478 |
+
image_bytes: <图像二进制数据>
|
| 479 |
+
```
|
| 480 |
+
|
| 481 |
+
> 注:sft_i2t.parquet 共约 58 万条数据,其中约 23.6 万条为含图对话(i2t),约 34.6 万条为纯文本对话(t2t),后者用于保持模型的基础语言能力。
|
| 482 |
+
|
| 483 |
+
数据集下载地址:([ModelScope](https://www.modelscope.cn/datasets/gongjy/minimind-v_dataset) | [HuggingFace](https://huggingface.co/datasets/jingyaogong/minimind-v_dataset))
|
| 484 |
+
|
| 485 |
+
## Ⅱ 训练
|
| 486 |
+
|
| 487 |
+
训练分为两个阶段,均冻结Visual Encoder梯度,仅训练Projection和LLM部分。
|
| 488 |
+
训练基于LLM预训练权重初始化,支持DDP多卡训练、混合精度(bfloat16)、torch.compile加速和swanlab日志记录。
|
| 489 |
+
|
| 490 |
+
> train_pretrain_vlm
|
| 491 |
+
|
| 492 |
+
预训练阶段从约113万条图文描述数据中学习图片的通用知识(如鹿是鹿,狗是狗)。
|
| 493 |
+
此阶段采用较高学习率(1e-4),最大序列长度360,冻结LLM主体参数,仅设置Projection和LLM的第0层可学习,
|
| 494 |
+
目的是让模型快速建立视觉特征到语言空间的基础映射,同时避免破坏LLM已有的语言能力。
|
| 495 |
+
|
| 496 |
+
> train_sft_vlm
|
| 497 |
+
|
| 498 |
+
指令微调阶段从约58万条数据中学习真实问答格式,其中约23.6万条为图文多轮对话,约34.6万条为纯文本对话(用于保持LLM基础能力)。
|
| 499 |
+
此阶段采用较低学习率(1e-5~1e-6),最大序列长度768,解冻Projection和LLM全部参数进行全量微调,
|
| 500 |
+
使模型学会根据图片内容进行多轮对话,并通过混入的纯文本数据缓解灾难性遗忘。
|
| 501 |
+
|
| 502 |
+
> 训练时间和Loss走势(仅供参考)
|
| 503 |
+
|
| 504 |
+
Pretrain [768+8] (dense & moe)
|
| 505 |
+

|
| 506 |
+
|
| 507 |
+
SFT [768+8] (dense & moe)
|
| 508 |
+

|
| 509 |
+
|
| 510 |
+
## Ⅲ 模型权重
|
| 511 |
+
|
| 512 |
+
| 模型格式 | ModelScope | HuggingFace |
|
| 513 |
+
|---|---|---|
|
| 514 |
+
| 原生PyTorch (`*.pth`) | [minimind-3v-pytorch](https://www.modelscope.cn/models/gongjy/minimind-3v-pytorch) | [minimind-3v-pytorch](https://huggingface.co/jingyaogong/minimind-3v-pytorch) |
|
| 515 |
+
| Transformers 格式 | [minimind-v collection](https://modelscope.cn/collections/MiniMind-V-42b841dde22d41) | [minimind-v collection](https://huggingface.co/collections/jingyaogong/minimind-v-67000833fb60b3a2e1f3597d) |
|
| 516 |
+
|
| 517 |
+
> 注:Transformers版本均为单图指令微调后的`MiniMind-V`模型
|
| 518 |
+
|
| 519 |
+
# 📌 评估
|
| 520 |
+
|
| 521 |
+
### 效果测试
|
| 522 |
+
|
| 523 |
+
#### 单图对话
|
| 524 |
+
|
| 525 |
+
<table>
|
| 526 |
+
<thead>
|
| 527 |
+
<tr>
|
| 528 |
+
<th>image</th>
|
| 529 |
+
<th>minimind-3v</th>
|
| 530 |
+
<th>minimind-3v-moe</th>
|
| 531 |
+
</tr>
|
| 532 |
+
</thead>
|
| 533 |
+
<tbody>
|
| 534 |
+
<tr>
|
| 535 |
+
<td>
|
| 536 |
+
<img src="./dataset/eval_images/airplane-flying-blue-sky.jpg" alt="airplane">
|
| 537 |
+
|
| 538 |
+
</td>
|
| 539 |
+
<td>在这幅图片中,一架白色的飞机正降落在一片广阔的天空中。天空中飘浮着许多流线型的建筑物,这些建筑物散落在天空中。除了主要的飞机外,还有两辆汽车在场景中各处停放,包括一辆巴士和一辆小型汽车。这架飞机似乎停在地面上,表明它们正在进行商业活动。</td>
|
| 540 |
+
<td>在这张照片中,有一架大型的飞机正在降落,这表明它是为这架飞机而设计的。此外,它停在云层之下,这表明它在移动。天空中有云朵,暗示着这架飞机正在空中飞行。整个场景的背景显示出一种宁静祥和的气氛,暗示这架飞机正在飞行中作为一个机会来进行外交活动或与其他客机接触。</td>
|
| 541 |
+
</tr>
|
| 542 |
+
<tr>
|
| 543 |
+
<td>
|
| 544 |
+
<img src="./dataset/eval_images/birthday-cake-candles-table.jpg" alt="birthday-cake">
|
| 545 |
+
|
| 546 |
+
</td>
|
| 547 |
+
<td>图片显示了一张大生日蛋糕,上面刻有各种生日和装饰物的生日和一束蜡烛。这些甜蜜而独特的气味使其成为这个特殊场合的焦点。蛋糕被放在餐桌上,有蛋糕放在上面,很可能是放在桌上,用蜡烛加上蜡烛来营造出诱人的场景。此外,这张生日蛋糕还配上了蜡烛作为装饰物,进一步丰富了整个生日的氛围。</td>
|
| 548 |
+
<td>这张照片捕捉到了一个生日聚会,各种美味可口的生日蛋糕和一把勺子放在那里。它被放置在一个充满生气的蛋糕盒子里,里面装着各种各样的草莓糖霜糖浆。蛋糕上有一个大号的蜡烛,给人一种温馨和诱人的图案。周围有很多小点心,比如小心地蜡烛和糖霜,让整个蛋糕看起来更加令人放松和诱人。</td>
|
| 549 |
+
</tr>
|
| 550 |
+
<tr>
|
| 551 |
+
<td>
|
| 552 |
+
<img src="./dataset/eval_images/pizza-on-wooden-board.jpg" alt="pizza">
|
| 553 |
+
|
| 554 |
+
</td>
|
| 555 |
+
<td>在这张图片中,有一个装饰着奶酪的比萨饼和一片新鲜的青绿野餐毯。这看起来像是一块沙拉,给人一种清新、诱人、美味的享受体验。披萨的大小和大小暗示着一种随意和随意的用餐体验。画面中,一群人围坐在一块砖块上,其中一些则分散在桌子上,周围摆放着不同种类的青绿野餐毯。此外,桌子上还放着一块披萨片,上面摆放着各种各样的青绿野餐毯,营造出一种轻松愉快的氛围。</td>
|
| 556 |
+
<td>图中,比萨饼在木桌上。它被切成了比萨饼的大小,而且披萨放置在木桌上。比萨有很多配料,包括奶酪和各种酱汁。披萨有多种口味,包括经典的牛排式和意大利式,还有一些加在比萨表面。比萨上有很多新鲜水果,如西红柿和莫吉托,以及奶酪,使这幅图片更加丰富和有吸引力。</td>
|
| 557 |
+
</tr>
|
| 558 |
+
<tr>
|
| 559 |
+
<td>
|
| 560 |
+
<img src="./dataset/eval_images/red-sports-car-road.jpg" alt="red-car">
|
| 561 |
+
|
| 562 |
+
</td>
|
| 563 |
+
<td>在这幅图片中,一辆白色的马车停泊在一条红色的路牌上。这个车辆位于一条绿油油的道路上,很可能是一个购物中心或高速公路。在这辆车的后部,可以看到一个绿色的马车停泊在路上,这是典型的户外场所。这辆马车可能是为了娱乐或观赏车辆而停放。</td>
|
| 564 |
+
<td>画面显示了一辆红色高性能赛车,停在一辆大型汽车后面。这辆车可以看到车身、汽车、汽车及汽车停在里面。它似乎是一辆大型红色汽车,有各种大小的汽车,表明它可能是汽车制造商生产的。此外,车辆周围的环境暗示了一种户外环境,因为一辆汽车也出现在场景中。</td>
|
| 565 |
+
</tr>
|
| 566 |
+
<tr>
|
| 567 |
+
<td>
|
| 568 |
+
<img src="./dataset/eval_images/row-of-colorful-houses.jpg" alt="colorful-houses">
|
| 569 |
+
|
| 570 |
+
</td>
|
| 571 |
+
<td>画面中,一座蓝色白色的大房子位于一条城市街道上,为这个地区增添了一丝自然与奇思妙想。墙上挂着一盏交通灯,为整个场景增添了特色和引人入胜。</td>
|
| 572 |
+
<td>画面中,一座蓝色房子旁边有很多小花瓶。这表明这座房子里可能正在营运一些小型小型花盆或盆栽植物。一些人聚集在房子周围,可能正在享受户外生活中的某种乐趣或美景。有些人站在画面中,而其他人则散布于场景中。总体而言,这幅场景捕捉到了一个美丽而令人愉快的场景,展示出该房子里一个令人放松和宁静的环境。</td>
|
| 573 |
+
</tr>
|
| 574 |
+
<tr>
|
| 575 |
+
<td>
|
| 576 |
+
<img src="./dataset/eval_images/snow-mountain-lake-view.jpg" alt="snow-mountain">
|
| 577 |
+
|
| 578 |
+
</td>
|
| 579 |
+
<td>在这幅图片中,有一个高高的山,它看起来像是大森林中的一片高山。天空中闪烁着不同颜色的星星,给画面增加了一抹红点。天空中有两朵高大的树,树木高高地挂在一起,暗示着森林中的宁静与自然之美。在画面的中心,可以看到一棵高大的松树,树干上覆盖着一层薄薄的苔藓。这种高高的松树与周围的大山构成了一个有趣而引人入胜的背景,为这片自然之美增添了一丝神秘色彩。</td>
|
| 580 |
+
<td>这幅图片展示了一个令人印象深刻的宁静湖面。湖水从天上飘浮着,暗示着一个令人平静和放松的水面。湖边上有几匹高大、形状各异的景象,它们在湖面上显得格外美丽。此外,在湖的边缘,有一座巨大的高山,为整个湖景增添了几分神秘色彩。湖景中的天空也被描述得如画,给整个画面增添了一种纯净和宁静的气氛。</td>
|
| 581 |
+
</tr>
|
| 582 |
+
<tr>
|
| 583 |
+
<td>
|
| 584 |
+
<img src="./dataset/eval_images/street-food-hotpot-table.jpg" alt="street-food">
|
| 585 |
+
|
| 586 |
+
</td>
|
| 587 |
+
<td>图中,一大群人聚集在一张大餐桌旁欣赏着烤肉和热带水果,其中一人站在一碗里摆放着各种各样的盘子。桌子上放着几个碗,上面摆满了肉类和其他配料。有些放在盘子里,其他的则放在桌子上。在这张餐桌周围,有几个盘子,其中两个放着一杯酒,另一个放在靠近餐桌的左侧。</td>
|
| 588 |
+
<td>画面中,一群人聚集在一家大餐馆里,享受着一顿饭。这家餐厅的菜单上有一些生菜和猪肉,但它们已经被切成了四份,上面还有一个碗。他们拿着烤肉准备食用。在背景中,有几个瓶子在场景中。还有一把勺子位于桌子左侧,使盘子看起来更吸引人。一盆盆栽植物放在桌子左侧上方,为空间增添了一抹绿色。场景中的其他元素包括一个碗,里面放着葡萄酒和两根葡萄酒。</td>
|
| 589 |
+
</tr>
|
| 590 |
+
<tr>
|
| 591 |
+
<td>
|
| 592 |
+
<img src="./dataset/eval_images/three-kittens-basket.jpg" alt="kittens">
|
| 593 |
+
|
| 594 |
+
</td>
|
| 595 |
+
<td>图中,一只棕色的小灰猫正坐在篮子里。这只猫身上戴着一顶棕色的帽子,很可能是一个戴帽子的男人。在篮子里,一只棕色的紫色小灰猫正沿着篮子里去休息。这些小猫似乎也在享受这份温暖,但它们似乎并没有完全放过来。此外,背景中还可以看到一把剪刀。这把剪刀看起来是专门为小猫设计的,它可以用作家庭相册或礼物。在篮子的侧边,有几条篮子,其中一条是最亮的,另一条则是最暗的。在篮子中可以看到一只棕色小灰猫,而另一条则是更暗的。</td>
|
| 596 |
+
<td>在这张照片中,一只小猫坐在篮子里,紧挨着它坐在篮子里的那块木篮上。猫的身体上有九条纹毛发。这个场景描绘了它们之间的亲密关系,展示了它们在一起度过时光的不同场合。画面中,一群大的猫坐在篮子里,其中一只小猫也被描述为小猫,这可能表明他们正在享受与猫互动、与它们的互动或一起度过愉快时光。</td>
|
| 597 |
+
</tr>
|
| 598 |
+
<tr>
|
| 599 |
+
<td>
|
| 600 |
+
<img src="./dataset/eval_images/tropical-beach-palm-tree.jpg" alt="tropical-beach">
|
| 601 |
+
|
| 602 |
+
</td>
|
| 603 |
+
<td>在这张图片中,沙滩上有很多椅子,还有一些人站着,可以看到一把遮阳伞。虽然它看起来很大,但却没有任何特别的设计。沙滩上有许多椅子,表明这是一家餐馆或者服务员办公室。其中最引人注目的是一张海边椅子,椅子上放着一只热带海滩椅。这个椅子非常适合放松身心、享受海滩时光。此外,还有一些椅子和其他人在场景中,可能是为了放置食物或其他用途。靠近椅子的椅子表示该位置可供使用的其他人使用,也许也有一人在靠近那个椅子的地方。</td>
|
| 604 |
+
<td>图片显示了一个美丽的海滩场景,有很多椅子散布在天然的棕榈树上。其中一个椅子靠近海滩,而另一个则较小。沙滩上有两把椅子,其中一把靠近中间,另一把则稍微偏左,还有一些则在靠近边缘处。在海边的海滩周围,你可以看到几个人坐在海边的沙滩上,有的靠近海水中,还有一张沙滩椅。其中一张椅子靠近海滩,另一张椅子靠近海边。此外,还可以看到几只遮阳伞,为沙滩上的躺椅提供了遮阴。</td>
|
| 605 |
+
</tr>
|
| 606 |
+
<tr>
|
| 607 |
+
<td>
|
| 608 |
+
<img src="./dataset/eval_images/yellow-school-bus-road.jpg" alt="school-bus">
|
| 609 |
+
|
| 610 |
+
</td>
|
| 611 |
+
<td>画面中,一辆蓝色的黄色公共汽车正从一辆黄色公共汽车驶过,在道路上停泊着。这辆公共汽车看起来是在一个黄色的黄色高速公路上。图中有几个人,其中一些靠近前景,而另一些则靠后一些,但都没有看到。在黄色公共汽车附近,可以看到一辆停在路边,那辆停在路边。此外,还有两辆不同方向的巴士,一辆靠近前景,另一辆靠近前景,另一辆稍微靠后一些。</td>
|
| 612 |
+
<td>画面中,一辆黄色和黄色相间的黄色和蓝色交叉路口的蓝色公共汽车正在一条通往路缘上的红色公交车站。有几辆公共汽车正停在路边,它们离一排车道很近。在背景中,可以看到一些长凳,它们在城市里交叉起来。一个长凳位于图中最左侧,而另一个则稍微靠后一点,为画面增添了一些城市特色。整个场景中有很多人和车辆散落在场景各处,包括黄色和蓝色的交叉路口。整个场景给人一种忙碌和迷茫的感觉,这也突显了公共汽车在市区中的存在和目的。</td>
|
| 613 |
+
</tr>
|
| 614 |
+
</tbody>
|
| 615 |
+
</table>
|
| 616 |
+
|
| 617 |
+
### 效果小结:
|
| 618 |
+
|
| 619 |
+
两个模型均能识别图像主体(飞机、蛋糕、汽车、海滩等),但普遍存在重复表述和幻觉细节。受限于模型和数据规模,整体处于"能看懂大意、细节不准"的阶段。
|
| 620 |
+
|
| 621 |
+
视觉信号对于LLM视作一种特殊的外语,因此"学习外语"的能力高低,很大程度上取决于LLM的能力。LLM性能越强,对应的VLM越强,此时效果增益会很明显。
|
| 622 |
+
|
| 623 |
+
#### 未来值得改进的方面:
|
| 624 |
+
|
| 625 |
+
```text
|
| 626 |
+
> 可引入动态分辨率和Tile-based编码(如LLaVA-NeXT),突破固定分辨率限制。
|
| 627 |
+
> Visual Encoder可升级为更强的视觉编码器,获取更细粒度的图像特征。
|
| 628 |
+
> 拓展多图理解、视频理解和视觉定位(Visual Grounding)能力。
|
| 629 |
+
> ...
|
| 630 |
+
```
|
| 631 |
+
|
| 632 |
+
# 📌 致谢
|
| 633 |
+
|
| 634 |
+
> [!TIP]
|
| 635 |
+
> 如果您觉得 `MiniMind-V`对您有所帮助,可以在 GitHub 上加一个⭐<br/>
|
| 636 |
+
> 水平有限难免存在未知的纰漏,欢迎所有人在Issues交流指正或提交PR改进项目<br/>
|
| 637 |
+
> 您的支持就是持续改进项目的动力,谢谢!
|
| 638 |
+
|
| 639 |
+
## 🤝[贡献者](https://github.com/jingyaogong/minimind-v/graphs/contributors)
|
| 640 |
+
|
| 641 |
+
<a href="https://github.com/jingyaogong/minimind-v/graphs/contributors">
|
| 642 |
+
<img width="200" src="https://contrib.rocks/image?repo=jingyaogong/minimind-v" />
|
| 643 |
+
</a>
|
| 644 |
+
|
| 645 |
+
## 😊鸣谢
|
| 646 |
+
|
| 647 |
+
<a href="https://github.com/xinyanghuang7"><b>@xinyanghuang7</b></a>: <a href="https://github.com/xinyanghuang7/minimind-v/tree/hxy">多图vlm分支</a> | <a href="https://github.com/jingyaogong/minimind-v/tree/32cf4c5c01337231fd907b92d513de8945594263">仓库截至此版本提供</a>
|
| 648 |
+
|
| 649 |
+
<details close>
|
| 650 |
+
<summary> <b>参考链接 & 感谢以下优秀的论文或项目</b> </summary>
|
| 651 |
+
|
| 652 |
+
- 排名不分任何先后顺序
|
| 653 |
+
- [LlaVA](https://arxiv.org/pdf/2304.08485)
|
| 654 |
+
- [LlaVA-VL](https://arxiv.org/pdf/2310.03744)
|
| 655 |
+
- [Chinese-LLaVA-Vision-Instructions](https://huggingface.co/datasets/LinkSoul/Chinese-LLaVA-Vision-Instructions)
|
| 656 |
+
|
| 657 |
+
</details>
|
| 658 |
+
|
| 659 |
+
## 🫶支持者
|
| 660 |
+
|
| 661 |
+
<a href="https://github.com/jingyaogong/minimind-v/stargazers">
|
| 662 |
+
<picture>
|
| 663 |
+
<source media="(prefers-color-scheme: dark)" srcset="https://reporoster.com/stars/dark/jingyaogong/minimind-v"/>
|
| 664 |
+
<source media="(prefers-color-scheme: light)" srcset="https://reporoster.com/stars/jingyaogong/minimind-v"/>
|
| 665 |
+
<img alt="github contribution grid snake animation" src="https://reporoster.com/stars/jingyaogong/minimind-v"/>
|
| 666 |
+
</picture>
|
| 667 |
+
</a>
|
| 668 |
+
|
| 669 |
+
<a href="https://github.com/jingyaogong/minimind-v/network/members">
|
| 670 |
+
<picture>
|
| 671 |
+
<source media="(prefers-color-scheme: dark)" srcset="https://reporoster.com/forks/dark/jingyaogong/minimind-v"/>
|
| 672 |
+
<source media="(prefers-color-scheme: light)" srcset="https://reporoster.com/forks/jingyaogong/minimind-v"/>
|
| 673 |
+
<img alt="github contribution grid snake animation" src="https://reporoster.com/forks/jingyaogong/minimind-v"/>
|
| 674 |
+
</picture>
|
| 675 |
+
</a>
|
| 676 |
+
|
| 677 |
+
<picture>
|
| 678 |
+
<source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=jingyaogong/minimind-v&type=Date&theme=dark"/>
|
| 679 |
+
<source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=jingyaogong/minimind-v&type=Date"/>
|
| 680 |
+
<img alt="Star History Chart" src="https://api.star-history.com/svg?repos=jingyaogong/minimind-v&type=Date"/>
|
| 681 |
+
</picture>
|
| 682 |
+
|
| 683 |
+
# 🎓 引用
|
| 684 |
+
|
| 685 |
+
如果您觉得 MiniMind-V 对您的研究或工作有所帮助,请引用:
|
| 686 |
+
|
| 687 |
+
```bibtex
|
| 688 |
+
@misc{minimind-v,
|
| 689 |
+
title = {MiniMind-V: Train a Tiny VLM from Scratch},
|
| 690 |
+
author = {Jingyao Gong},
|
| 691 |
+
year = {2024},
|
| 692 |
+
url = {https://github.com/jingyaogong/minimind-v},
|
| 693 |
+
note = {GitHub repository, accessed 2026}
|
| 694 |
+
}
|
| 695 |
+
```
|
| 696 |
+
|
| 697 |
+
# 📜 许可协议
|
| 698 |
+
|
| 699 |
+
本仓库遵循 [Apache-2.0 License](LICENSE) 开源协议。
|
README_en.md
ADDED
|
@@ -0,0 +1,716 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
<div align="center">
|
| 2 |
+
|
| 3 |
+

|
| 4 |
+
|
| 5 |
+
</div>
|
| 6 |
+
|
| 7 |
+
|
| 8 |
+
<div align="center">
|
| 9 |
+
|
| 10 |
+
[](https://github.com/jingyaogong/minimind-v/stargazers)
|
| 11 |
+
[](LICENSE)
|
| 12 |
+
[](https://github.com/jingyaogong/minimind-v/commits/master)
|
| 13 |
+
[](https://github.com/jingyaogong/minimind-v/pulls)
|
| 14 |
+
[](https://huggingface.co/collections/jingyaogong/minimind-v-67000833fb60b3a2e1f3597d)
|
| 15 |
+
|
| 16 |
+
</div>
|
| 17 |
+
|
| 18 |
+
<div align="center">
|
| 19 |
+
|
| 20 |
+

|
| 21 |
+
|
| 22 |
+
</div>
|
| 23 |
+
|
| 24 |
+
|
| 25 |
+
<div align="center">
|
| 26 |
+
<h3>"The Greatest Path is the Simplest"</h3>
|
| 27 |
+
</div>
|
| 28 |
+
|
| 29 |
+
<div align="center">
|
| 30 |
+
|
| 31 |
+
[中文](./README.md) | English
|
| 32 |
+
|
| 33 |
+
</div>
|
| 34 |
+
|
| 35 |
+
* This project aims to train a super-small multimodal vision-language model, **MiniMind-V**, with just a cost of 1.3 RMB
|
| 36 |
+
and 1 hour of work, starting from scratch!
|
| 37 |
+
* The smallest version of **MiniMind-V** is only about $\frac{1}{2600}$ the size of GPT-3, designed to enable fast
|
| 38 |
+
inference and even training on personal GPUs.
|
| 39 |
+
* **MiniMind-V** is an extension of the visual capabilities of the [MiniMind](https://github.com/jingyaogong/minimind)
|
| 40 |
+
pure language model.
|
| 41 |
+
* The project includes full code for the minimalist structure of large VLM models, dataset cleaning, pretraining, and
|
| 42 |
+
supervised fine-tuning (SFT).
|
| 43 |
+
* This is not only the smallest implementation of an open-source VLM model but also a concise tutorial for beginners in
|
| 44 |
+
vision-language models.
|
| 45 |
+
* The hope is that this project can provide a useful example to inspire others and share the joy of creation, helping to
|
| 46 |
+
drive progress in the wider AI community!
|
| 47 |
+
|
| 48 |
+
> To avoid misunderstandings, the "1 hour" is based on testing (`1 epoch`) with an NVIDIA 3090 hardware device (single GPU), and
|
| 49 |
+
> the "1.3 RMB" refers to GPU server rental costs.
|
| 50 |
+
|
| 51 |
+
<div align="center">
|
| 52 |
+
|
| 53 |
+

|
| 54 |
+
|
| 55 |
+
[🔗🤖 Online Experience](https://www.modelscope.cn/studios/gongjy/MiniMind-V) | [🔗🎞️ Video Introduction](https://www.bilibili.com/video/BV1Sh1vYBEzY)
|
| 56 |
+
|
| 57 |
+
</div>
|
| 58 |
+
|
| 59 |
+
# 📌 Introduction
|
| 60 |
+
|
| 61 |
+
“Building a plane with Legos is much more exciting than flying in first class!”
|
| 62 |
+
Is it really as complex as imagined to build a VLM-based multimodal large model? How is the code implementation done?
|
| 63 |
+
Is the training process difficult? Now, let's explore the answers and feel the joy of creation together!
|
| 64 |
+
|
| 65 |
+
> [!TIP]
|
| 66 |
+
> (As of 2026-02-15) The MiniMind-V series has completed the training of the following model versions, with the smallest
|
| 67 |
+
> requiring only 67M (0.067B) parameters, capable of both image recognition and conversation!
|
| 68 |
+
|
| 69 |
+
| Model (Size) | Inference Memory | Release |
|
| 70 |
+
|---------------------------|------------------|------------|
|
| 71 |
+
| minimind-3v-moe (201M-A67M) | 1.0 GB | 2026.04.01 |
|
| 72 |
+
| minimind-3v (67M) | 0.5 GB | 2026.04.01 |
|
| 73 |
+
| MiniMind2-V (104M) | 1.1 GB | 2025.02.20 |
|
| 74 |
+
| MiniMind2-Small-V (26M) | 0.6 GB | 2025.02.20 |
|
| 75 |
+
| minimind-v-v1-small (27M) | 0.6 GB | 2024.10.04 |
|
| 76 |
+
| minimind-v-v1 (109M) | 1.1 GB | 2024.10.04 |
|
| 77 |
+
|
| 78 |
+
### 👉**Recent Updates**
|
| 79 |
+
|
| 80 |
+
<details close>
|
| 81 |
+
<summary> <b>2026-02-15</b> </summary>
|
| 82 |
+
|
| 83 |
+
- Added minimind-3v (67M) and minimind-3v-moe (201M-A67M) models
|
| 84 |
+
- Unified 768+8 architecture, supporting both dense and moe modes
|
| 85 |
+
- Dataset format updated to parquet, added LLaVA-SFT-665K data source
|
| 86 |
+
- Updated tokenizer, image placeholder changed to `<|image_pad|>`
|
| 87 |
+
|
| 88 |
+
</details>
|
| 89 |
+
|
| 90 |
+
<details close>
|
| 91 |
+
<summary> <b>2025-10-24</b> </summary>
|
| 92 |
+
|
| 93 |
+
- Bug fix: model weights mismatch
|
| 94 |
+
- Adapted to ["minimind-1024 update"](https://github.com/jingyaogong/minimind)
|
| 95 |
+
- Code refactoring: training and evaluation scripts standardized
|
| 96 |
+
- Added complete checkpoint resumption support
|
| 97 |
+
|
| 98 |
+
</details>
|
| 99 |
+
|
| 100 |
+
<details close>
|
| 101 |
+
<summary> <b>2025-04-27</b> </summary>
|
| 102 |
+
|
| 103 |
+
- Compatibility updates
|
| 104 |
+
- Adapted to the new feature in the "minimind" repository
|
| 105 |
+
- Standardized parts of the code
|
| 106 |
+
|
| 107 |
+
</details>
|
| 108 |
+
|
| 109 |
+
<details close>
|
| 110 |
+
|
| 111 |
+
<summary> <b>More...</b> </summary>
|
| 112 |
+
|
| 113 |
+
**2025-02-20**
|
| 114 |
+
|
| 115 |
+
- MiniMind2-V updated alongside MiniMind2
|
| 116 |
+
- Significant reduction of all redundant code, standardized code format
|
| 117 |
+
- Major simplification of the model's redundant structure
|
| 118 |
+
- Updated dataset format, expanded with new SFT datasets
|
| 119 |
+
- Better performance than the previous VLM version!
|
| 120 |
+
|
| 121 |
+
**2024-10-05**
|
| 122 |
+
|
| 123 |
+
- MiniMind-V released on schedule, first open-source release
|
| 124 |
+
|
| 125 |
+
</details>
|
| 126 |
+
|
| 127 |
+
# 📌 Quick Start
|
| 128 |
+
|
| 129 |
+
<details style="color:rgb(128,128,128)">
|
| 130 |
+
<summary>Sharing my hardware and software configuration (for reference only)</summary>
|
| 131 |
+
|
| 132 |
+
* CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
|
| 133 |
+
* RAM: 128 GB
|
| 134 |
+
* GPU: NVIDIA GeForce RTX 3090(24GB) * 8
|
| 135 |
+
* Ubuntu==20.04
|
| 136 |
+
* CUDA==12.2
|
| 137 |
+
* Python==3.10.16
|
| 138 |
+
* [requirements.txt](./requirements.txt)
|
| 139 |
+
|
| 140 |
+
</details>
|
| 141 |
+
|
| 142 |
+
### Step 0
|
| 143 |
+
|
| 144 |
+
```bash
|
| 145 |
+
# Clone the code repository
|
| 146 |
+
git clone https://github.com/jingyaogong/minimind-v
|
| 147 |
+
```
|
| 148 |
+
|
| 149 |
+
```bash
|
| 150 |
+
# Download the siglip2 model to the ./model directory
|
| 151 |
+
git clone https://huggingface.co/jingyaogong/siglip2-base-p16-ve
|
| 152 |
+
# or
|
| 153 |
+
git clone https://modelscope.cn/models/gongjy/siglip2-base-p16-ve
|
| 154 |
+
```
|
| 155 |
+
|
| 156 |
+
```bash
|
| 157 |
+
# Download the minimind language model to the ./out directory (as the base language model for training VLM):
|
| 158 |
+
# HuggingFace
|
| 159 |
+
https://huggingface.co/jingyaogong/minimind-3v-pytorch/blob/main/llm_768.pth
|
| 160 |
+
# Domestic source
|
| 161 |
+
https://modelscope.cn/models/gongjy/minimind-3v-pytorch/resolve/master/llm_768.pth
|
| 162 |
+
```
|
| 163 |
+
|
| 164 |
+
|
| 165 |
+
## Ⅰ Test an existing model's performance
|
| 166 |
+
|
| 167 |
+
### 1' Environment Preparation
|
| 168 |
+
|
| 169 |
+
```bash
|
| 170 |
+
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple
|
| 171 |
+
```
|
| 172 |
+
|
| 173 |
+
### 2' Download the model
|
| 174 |
+
|
| 175 |
+
```bash
|
| 176 |
+
git clone https://huggingface.co/jingyaogong/minimind-3v
|
| 177 |
+
```
|
| 178 |
+
|
| 179 |
+
### 3' Command-line Q&A
|
| 180 |
+
|
| 181 |
+
```bash
|
| 182 |
+
# load_from='model': load native PyTorch weights, load_from='other path': load transformers format
|
| 183 |
+
python eval_vlm.py --load_from model --weight sft_vlm
|
| 184 |
+
|
| 185 |
+
# Or use transformers format model
|
| 186 |
+
python eval_vlm.py --load_from minimind-3v
|
| 187 |
+
```
|
| 188 |
+
|
| 189 |
+
### 4' Or start the WebUI
|
| 190 |
+
|
| 191 |
+
```bash
|
| 192 |
+
# ⚠️ You must first copy the transformers model folder to the ./scripts/ directory (e.g.: cp -r minimind-3v ./scripts/minimind-3v). The web_demo_vlm script will automatically scan subdirectories containing weight files; it will report an error if none are found.
|
| 193 |
+
cd scripts && python web_demo_vlm.py
|
| 194 |
+
```
|
| 195 |
+
|
| 196 |
+
## Ⅱ Train from scratch
|
| 197 |
+
|
| 198 |
+
### 1' Environment Preparation
|
| 199 |
+
|
| 200 |
+
```bash
|
| 201 |
+
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple
|
| 202 |
+
```
|
| 203 |
+
|
| 204 |
+
<details style="color:rgb(128,128,128)">
|
| 205 |
+
<summary>Note: Test if Torch can use CUDA</summary>
|
| 206 |
+
|
| 207 |
+
```bash
|
| 208 |
+
import torch
|
| 209 |
+
print(torch.cuda.is_available())
|
| 210 |
+
```
|
| 211 |
+
|
| 212 |
+
If unavailable, download the whl file from [torch_stable](https://download.pytorch.org/whl/torch_stable.html) for
|
| 213 |
+
installation. Refer
|
| 214 |
+
to [this link](https://blog.csdn.net/weixin_45456738/article/details/141029610?ops_request_misc=&request_id=&biz_id=102&utm_term=%E5%AE%89%E8%A3%85torch&utm_medium=distribute.pc_search_result.none-task-blog-2~all~sobaiduweb~default-2-141029610.nonecase&spm=1018.2226.3001.4187)
|
| 215 |
+
for help.
|
| 216 |
+
|
| 217 |
+
</details>
|
| 218 |
+
|
| 219 |
+
### 2' Download Data
|
| 220 |
+
|
| 221 |
+
Download the required content from the [dataset link](https://huggingface.co/datasets/jingyaogong/minimind-v_dataset)
|
| 222 |
+
and place it under `./dataset`.
|
| 223 |
+
|
| 224 |
+
<details style="color:rgb(128,128,128)">
|
| 225 |
+
<summary>Note: Dataset Details</summary>
|
| 226 |
+
|
| 227 |
+
**[Note 1]** Previously, extracting 500k fragmented image files could be very slow. From 2025-12-27, dataset format is unified to Parquet with image-text integrated storage, smaller size, no decompression needed, faster loading.
|
| 228 |
+
|
| 229 |
+
**[Note 2]** Parquet is a columnar storage format supporting efficient compression and fast reading. To preview data content, run `python lm_dataset.py` in the `dataset/` directory to visualize the first 5 image-text pairs.
|
| 230 |
+
|
| 231 |
+
Pretrain data:
|
| 232 |
+
```bash
|
| 233 |
+
wget https://hf-mirror.com/datasets/jingyaogong/minimind-v_dataset/resolve/main/pretrain_i2t.parquet
|
| 234 |
+
```
|
| 235 |
+
|
| 236 |
+
SFT data:
|
| 237 |
+
```bash
|
| 238 |
+
wget https://hf-mirror.com/datasets/jingyaogong/minimind-v_dataset/resolve/main/sft_i2t.parquet
|
| 239 |
+
```
|
| 240 |
+
|
| 241 |
+
Please reserve about ~2GB of space for the dataset. If there is insufficient space for pretrain data, you can try skipping the pretrain training step and proceed directly to SFT training.
|
| 242 |
+
|
| 243 |
+
</details>
|
| 244 |
+
|
| 245 |
+
### 3' Start Training
|
| 246 |
+
|
| 247 |
+
**3.1 Pretraining (Learning image description)**
|
| 248 |
+
|
| 249 |
+
```bash
|
| 250 |
+
# Basic training command (start from LLM weights, train vision_proj only)
|
| 251 |
+
python train_pretrain_vlm.py --epochs 4 --from_weight llm
|
| 252 |
+
```
|
| 253 |
+
|
| 254 |
+
> Run pretraining to get `pretrain_vlm_*.pth` as the pretrained model's output weights (* represents the model
|
| 255 |
+
> dimension, default is 768).
|
| 256 |
+
|
| 257 |
+
**3.2 Supervised Fine-Tuning (Learning image-caption dialogue style)**
|
| 258 |
+
|
| 259 |
+
```bash
|
| 260 |
+
# Basic training command (start from pretrain weights, full parameter fine-tuning)
|
| 261 |
+
python train_sft_vlm.py --epochs 2 --from_weight pretrain_vlm
|
| 262 |
+
```
|
| 263 |
+
|
| 264 |
+
> Perform supervised fine-tuning to get `sft_vlm_*.pth` as the output weights for the fine-tuned model.
|
| 265 |
+
|
| 266 |
+
<details style="color:rgb(128,128,128)">
|
| 267 |
+
<summary>Note: Training Details</summary>
|
| 268 |
+
|
| 269 |
+
**Training Features:**
|
| 270 |
+
- Support checkpoint resumption: add `--from_resume 1` parameter to continue from last interruption
|
| 271 |
+
- Support GPU count changes: automatically convert steps when GPU count changes during resumption
|
| 272 |
+
- Atomic saving: use temporary file + replacement mechanism to prevent weight corruption from interruption
|
| 273 |
+
- Each save generates `out/**.pth` (model weights) and `checkpoints/**_resume.pth` (training state) files
|
| 274 |
+
|
| 275 |
+
```bash
|
| 276 |
+
# To resume training after interruption, use the same command and add --from_resume 1
|
| 277 |
+
python train_sft_vlm.py --epochs 4 --from_resume 1
|
| 278 |
+
```
|
| 279 |
+
|
| 280 |
+
**Parameter Description:**
|
| 281 |
+
- `--from_weight`: base weight name (llm, pretrain_vlm, none, etc.)
|
| 282 |
+
- `--save_weight`: save weight prefix name
|
| 283 |
+
- `--from_resume`: whether to resume training (0=start from scratch, 1=continue from checkpoint)
|
| 284 |
+
- `--freeze_llm`: whether to freeze LLM parameters (pretrain use only)
|
| 285 |
+
- More details can be found in the code
|
| 286 |
+
|
| 287 |
+
</details>
|
| 288 |
+
|
| 289 |
+
---
|
| 290 |
+
|
| 291 |
+
### 4' Test the Model's Performance
|
| 292 |
+
|
| 293 |
+
Ensure that the model `*.pth` file you want to test is located in the `./out/` directory.
|
| 294 |
+
You can also directly download the pre-trained `*.pth` file
|
| 295 |
+
from [here](https://huggingface.co/jingyaogong/minimind-3v-pytorch).
|
| 296 |
+
|
| 297 |
+
```bash
|
| 298 |
+
# Test SFT model (default)
|
| 299 |
+
python eval_vlm.py --weight sft_vlm
|
| 300 |
+
|
| 301 |
+
# Test Pretrain model
|
| 302 |
+
python eval_vlm.py --weight pretrain_vlm
|
| 303 |
+
```
|
| 304 |
+
|
| 305 |
+
---
|
| 306 |
+
|
| 307 |
+
> [!TIP]
|
| 308 |
+
> The training scripts are based on PyTorch's native framework and support multi-card acceleration. If your device has
|
| 309 |
+
> N (N>1) GPUs:
|
| 310 |
+
|
| 311 |
+
Single-machine N-card training method (DDP, supports multi-machine multi-card cluster)
|
| 312 |
+
|
| 313 |
+
```bash
|
| 314 |
+
torchrun --nproc_per_node N train_xxx.py
|
| 315 |
+
```
|
| 316 |
+
|
| 317 |
+
<details style="color:rgb(128,128,128)">
|
| 318 |
+
<summary>Note: Other Details</summary>
|
| 319 |
+
|
| 320 |
+
Single-machine N-card training (DeepSpeed)
|
| 321 |
+
|
| 322 |
+
```bash
|
| 323 |
+
deepspeed --master_port 29500 --num_gpus=N train_xxx.py
|
| 324 |
+
```
|
| 325 |
+
|
| 326 |
+
You can enable wandb logging during training:
|
| 327 |
+
|
| 328 |
+
```bash
|
| 329 |
+
# You need to log in: wandb login
|
| 330 |
+
torchrun --nproc_per_node N train_xxx.py --use_wandb
|
| 331 |
+
# and
|
| 332 |
+
python train_xxx.py --use_wandb
|
| 333 |
+
```
|
| 334 |
+
|
| 335 |
+
By adding the `--use_wandb` parameter, you can log the training process, and after training is complete, you can view
|
| 336 |
+
the process on the wandb website. You can specify the project name and run name by modifying the `wandb_project`
|
| 337 |
+
and `wandb_run_name` parameters.
|
| 338 |
+
|
| 339 |
+
[Note]: After June 2025, the domestic network environment cannot directly connect to WandB. The MiniMind project by default switches to using [SwanLab](https://swanlab.cn/) as the training visualization tool (fully compatible with WandB API), that is, just change `import wandb` to `import swanlab as wandb`, no other changes are needed.
|
| 340 |
+
|
| 341 |
+
</details>
|
| 342 |
+
|
| 343 |
+
# 📌 VLM Detail
|
| 344 |
+
|
| 345 |
+
The base language model of MiniMind-V (VLM), MiniMind (LLM), comes from the twin
|
| 346 |
+
project [minimind](https://github.com/jingyaogong/minimind). For detailed information on the model structure, training
|
| 347 |
+
specifics, principles, and testing results, please refer to the [minimind](https://github.com/jingyaogong/minimind)
|
| 348 |
+
project. To reduce redundancy, the discussion on LLM-related topics is omitted here, assuming you have a basic
|
| 349 |
+
understanding of MiniMind (LLM).
|
| 350 |
+
|
| 351 |
+
> Even if you are not very familiar with the details of LLMs, you can still follow the "Quick Start" guide to train a
|
| 352 |
+
> MiniMind-V, as it remains unaffected and the repository focuses on the lowest cost for out-of-the-box use!
|
| 353 |
+
|
| 354 |
+
MiniMind-V's structure adds two submodules, a Visual Encoder and a feature projection, with a modality-mixing branch to
|
| 355 |
+
support inputs from multiple modalities:
|
| 356 |
+

|
| 357 |
+

|
| 358 |
+
|
| 359 |
+
|
| 360 |
+
<details>
|
| 361 |
+
<summary> [Important] Some Interesting Thoughts </summary>
|
| 362 |
+
|
| 363 |
+
Let's take a moment to think about two questions:
|
| 364 |
+
|
| 365 |
+
* What is a **Large Language Model (LLM)**?
|
| 366 |
+
* What is a multimodal model?
|
| 367 |
+
|
| 368 |
+
[This article](https://www.jiqizhixin.com/articles/2024-09-15-3) perfectly aligns with my thoughts:
|
| 369 |
+
Although the name "large language model" (LLM) contains the word "language," they are actually not closely related to
|
| 370 |
+
language; this is just a historical issue. A more accurate name would be autoregressive Transformer or something else.
|
| 371 |
+
LLMs are more of a general statistical modeling technology, mainly using an autoregressive Transformer to simulate token
|
| 372 |
+
flows. These tokens can represent text, images, audio, action choices, and even molecules—anything, really.
|
| 373 |
+
Therefore, as long as the problem can be converted into a process of simulating a series of discrete tokens, LLM can
|
| 374 |
+
theoretically solve it. In fact, with the increasing maturity of large language model technologies, we may see more and
|
| 375 |
+
more problems falling under this modeling paradigm. In other words, the problem is fixed in using LLM to "predict the
|
| 376 |
+
next token," but the role and meaning of the tokens differ in each domain.
|
| 377 |
+
|
| 378 |
+
[ZJU-LiXi](https://person.zju.edu.cn/xilics#694283) has also mentioned a similar viewpoint (roughly stated below):
|
| 379 |
+
Text, video, audio, actions, etc., are considered "multimodal" signals in human perception, but the term "modality" is
|
| 380 |
+
essentially just a classification concept based on how humans store information. Just like `.txt` and `.png` files,
|
| 381 |
+
though they differ in visual presentation and higher-level forms, they are fundamentally the same. The concept of "
|
| 382 |
+
multimodal" arose simply because humans need to categorize these signals based on different sensory dimensions.
|
| 383 |
+
However, for machines, regardless of the signal's "modality," they are ultimately presented as a sequence of binary "
|
| 384 |
+
monomodal" numbers. Machines do not differentiate the origin of these signals; they just process and analyze the
|
| 385 |
+
information contained within these sequences.
|
| 386 |
+
|
| 387 |
+
Personally, I think **Generative Pretrained Transformer (GPT)** is a more fitting term than **Large Language Model (LLM)
|
| 388 |
+
**, and I prefer to use "GPT" to represent models in the LLM/VLM/GPT-like architecture series rather than to ride on
|
| 389 |
+
OpenAI's coattails.
|
| 390 |
+
|
| 391 |
+
To summarize what GPTs do in one sentence:
|
| 392 |
+
|
| 393 |
+
A GPT model predicts the next, next-next, next-next-next token, etc., based on the current token... until the model
|
| 394 |
+
outputs the end token; here, the "token" doesn’t necessarily have to be text!
|
| 395 |
+
|
| 396 |
+
```text
|
| 397 |
+
> For an LLM model, if we need to understand an "image," we just treat the "image" as a special "foreign language" that has never been encountered before, and translate it into the "LLM language" via a "foreign language dictionary."
|
| 398 |
+
> For an LLM model, if we need to understand "audio," we just treat "audio" as a special "foreign language" that has never been encountered before, and translate it into the "LLM language" via a "foreign language dictionary."
|
| 399 |
+
> ...
|
| 400 |
+
```
|
| 401 |
+
|
| 402 |
+
<u>**To obtain MiniMind-V, we only need to do these 2 things:**</u>
|
| 403 |
+
|
| 404 |
+
1. Use the **"foreign language dictionary"** that is good at translating images, to translate the image from the **"
|
| 405 |
+
foreign language"** into a model-understandable **"LLM language."**
|
| 406 |
+
2. Fine-tune the LLM so that it and the **"foreign language dictionary"** go through a period of adaptation, thereby
|
| 407 |
+
better understanding images.
|
| 408 |
+
|
| 409 |
+
The "foreign language dictionary" is referred to as the Visual Encoder model.
|
| 410 |
+
Like LLaVA, Qwen-VL, and other visual language models, MiniMind-V now uses the open-source SigLIP2 series models as the
|
| 411 |
+
Visual Encoder.
|
| 412 |
+
Specifically, we use [siglip2-base-p16-ve](https://huggingface.co/jingyaogong/siglip2-base-p16-ve), a Visual
|
| 413 |
+
Encoder based on the ViT-B/16 architecture for describing image-text information.
|
| 414 |
+
The current SigLIP2 NaFlex vision encoder generates up to 256 patch tokens from the processor output as the input to the
|
| 415 |
+
encoder layer, which produces a 1×768 dimensional embedding vector for calculating error with the text.
|
| 416 |
+
We don’t need the final embedding representation, so we only take the output from the encoder layer, which is the output
|
| 417 |
+
feature from the core ViT backbone.
|
| 418 |
+
It receives 256×768 features from the previous layer, which are then reshaped by concatenating every 4 adjacent tokens into 1 (256×768 → 64×3072), then projected to the LLM's hidden dimension via a 2-layer MLP (Linear→GELU→Linear), resulting in 64 visual tokens input into MiniMind-V.
|
| 419 |
+
After obtaining the image encoder features, the integration with the LLM requires aligning the visual features to the LLM's text token dimension, and mapping the image features into the same space as text embeddings. In other
|
| 420 |
+
words, the image features and native visual tokens cannot be directly treated the same; they require cross-modal feature
|
| 421 |
+
alignment.
|
| 422 |
+
|
| 423 |
+
[LLaVA-1](https://arxiv.org/pdf/2304.08485) achieves good alignment with a simple linear transformation, [LLaVA-1.5](https://arxiv.org/pdf/2310.03744) upgrades to a 2-layer MLP. MiniMind-V adopts the same MLP Projection approach as LLaVA-1.5, combined with reshape for token compression.
|
| 424 |
+
|
| 425 |
+

|
| 426 |
+
|
| 427 |
+
With that, the internal structural changes of MiniMind-V are now fully presented.
|
| 428 |
+
|
| 429 |
+
</details>
|
| 430 |
+
|
| 431 |
+
|
| 432 |
+
---
|
| 433 |
+
|
| 434 |
+
Next, let's briefly discuss the changes in the external input and output of MiniMind-V.
|
| 435 |
+
|
| 436 |
+
The input to the VLM is still a segment of text containing special `<image>` placeholders.
|
| 437 |
+
After computing the text embedding, the vector generated by the image encoder can be projected onto the corresponding
|
| 438 |
+
embedding part of the placeholder, replacing the original placeholder embedding.
|
| 439 |
+
For example:
|
| 440 |
+
|
| 441 |
+
```text
|
| 442 |
+
<image>\nWhat is in this image?
|
| 443 |
+
```
|
| 444 |
+
|
| 445 |
+
In `minimind-v`, the image is replaced by 64 `<|image_pad|>` tokens as placeholder (the 256 SigLIP2 patch features are compressed to 64 tokens via reshape+MLP),
|
| 446 |
+
thus the `minimind-v` prompt becomes:
|
| 447 |
+
|
| 448 |
+
```text
|
| 449 |
+
<|image_pad|><|image_pad|>...<|image_pad|>(×64)\nWhat is this image describing?
|
| 450 |
+
```
|
| 451 |
+
|
| 452 |
+
After calculating the embedding and projection, the vision features replace the corresponding placeholder embeddings, and the rest of the computation is identical to the LLM part.
|
| 453 |
+
|
| 454 |
+

|
| 455 |
+
|
| 456 |
+
At this point, all the details of `MiniMind-V` have been presented. The VLM model subclass inherits from `MiniMind` with only **minimal** changes, core algorithm modifications `< 50 lines`, very low migration difficulty. The specific implementation may differ from `LLaVA` and similar models, but the overall idea is consistent.
|
| 457 |
+
|
| 458 |
+
# 📌 Experiment
|
| 459 |
+
|
| 460 |
+
## Ⅰ Dataset
|
| 461 |
+
|
| 462 |
+
Original Source:
|
| 463 |
+
- [Chinese-LLaVA-Vision](https://huggingface.co/datasets/LinkSoul/Chinese-LLaVA-Vision-Instructions): Contains approximately 570,000 pre-trained images from CC-3M and COCO 2014
|
| 464 |
+
- [llava-en-zh-300k](https://huggingface.co/datasets/BUAADreamer/llava-en-zh-300k): Contains 300k instruction fine-tuning data and 150k images
|
| 465 |
+
- [LLaVA-SFT-665K](https://huggingface.co/datasets/csuhan/LLaVA-SFT-665K): Contains 665k instruction fine-tuning data
|
| 466 |
+
|
| 467 |
+
The dataset contains both Chinese and English data. The Q&A content has been translated, with better support for Chinese, further organized and resized (pretrain resolution 128×128, sft resolution 160×160).
|
| 468 |
+
|
| 469 |
+
(pretrain_i2t.parquet) Pre-training dataset format:
|
| 470 |
+
|
| 471 |
+
```text
|
| 472 |
+
Columns: conversations (json string), image_bytes (binary), image_names (string)
|
| 473 |
+
|
| 474 |
+
conversations example:
|
| 475 |
+
[
|
| 476 |
+
{"role": "user", "content": "Provide a brief description of the given image.\n<image>"},
|
| 477 |
+
{"role": "assistant", "content": "Olive oil is a healthy ingredient for free use."}
|
| 478 |
+
]
|
| 479 |
+
image_bytes: <binary image data>
|
| 480 |
+
```
|
| 481 |
+
|
| 482 |
+
(sft_i2t.parquet) Single image instruction fine-tuning dataset format:
|
| 483 |
+
|
| 484 |
+
```text
|
| 485 |
+
Columns: conversations (json string), image_bytes (binary), image_names (string)
|
| 486 |
+
|
| 487 |
+
conversations example:
|
| 488 |
+
[
|
| 489 |
+
{"role": "user", "content": "What impact does the location of the alarm clock have on sleep quality?<image>"},
|
| 490 |
+
{"role": "assistant", "content": "Place the digital alarm clock on the nightstand..."}
|
| 491 |
+
]
|
| 492 |
+
image_bytes: <binary image data>
|
| 493 |
+
```
|
| 494 |
+
|
| 495 |
+
> Note: sft_i2t.parquet contains ~580K samples in total, of which ~236K are image-text conversations (i2t) and ~346K are pure text conversations (t2t). The latter is used to preserve the model's base language capabilities.
|
| 496 |
+
|
| 497 |
+
Dataset download
|
| 498 |
+
link: ([ModelScope](https://www.modelscope.cn/datasets/gongjy/minimind-v_dataset) | [HuggingFace](https://huggingface.co/datasets/jingyaogong/minimind-v_dataset))
|
| 499 |
+
|
| 500 |
+
## Ⅱ Training
|
| 501 |
+
|
| 502 |
+
Training is divided into two stages, both freezing the Visual Encoder gradients and only training the Projection and LLM parts.
|
| 503 |
+
Training is initialized from LLM pre-trained weights, with support for DDP multi-GPU training, mixed precision (bfloat16), torch.compile acceleration, and swanlab logging.
|
| 504 |
+
|
| 505 |
+
> train_pretrain_vlm
|
| 506 |
+
|
| 507 |
+
The pre-training stage learns general image knowledge from ~1.13M image-text description pairs (e.g., a deer is a deer, a dog is a dog).
|
| 508 |
+
This stage uses a higher learning rate (1e-4), max sequence length of 360, freezes the LLM main parameters, and only sets the Projection and LLM's layer 0 as learnable,
|
| 509 |
+
aiming to quickly establish a basic mapping from visual features to the language space while avoiding damage to the LLM's existing language capabilities.
|
| 510 |
+
|
| 511 |
+
> train_sft_vlm
|
| 512 |
+
|
| 513 |
+
The instruction fine-tuning stage learns real Q&A formats from ~580K samples, of which ~236K are image-text multi-turn conversations and ~346K are pure text conversations (to preserve LLM base capabilities).
|
| 514 |
+
This stage uses a lower learning rate (1e-5~1e-6), max sequence length of 768, unfreezes all Projection and LLM parameters for full fine-tuning,
|
| 515 |
+
enabling the model to conduct multi-turn conversations based on image content, while mitigating catastrophic forgetting through the mixed-in pure text data.
|
| 516 |
+
|
| 517 |
+
> Training Time and Loss Trend (for reference only)
|
| 518 |
+
|
| 519 |
+
Pretrain [768+8] (dense & moe)
|
| 520 |
+

|
| 521 |
+
|
| 522 |
+
SFT [768+8] (dense & moe)
|
| 523 |
+

|
| 524 |
+
|
| 525 |
+
## Ⅲ Model Weights
|
| 526 |
+
|
| 527 |
+
| Format | ModelScope | HuggingFace |
|
| 528 |
+
|---|---|---|
|
| 529 |
+
| Native PyTorch (`*.pth`) | [minimind-3v-pytorch](https://www.modelscope.cn/models/gongjy/minimind-3v-pytorch) | [minimind-3v-pytorch](https://huggingface.co/jingyaogong/minimind-3v-pytorch) |
|
| 530 |
+
| Transformers | [minimind-v collection](https://modelscope.cn/collections/MiniMind-V-42b841dde22d41) | [minimind-v collection](https://huggingface.co/collections/jingyaogong/minimind-v-67000833fb60b3a2e1f3597d) |
|
| 531 |
+
|
| 532 |
+
> Note: The Transformers version is the `MiniMind-V` model after single-image instruction fine-tuning
|
| 533 |
+
|
| 534 |
+
# 📌 Test
|
| 535 |
+
|
| 536 |
+
### Effect Test
|
| 537 |
+
|
| 538 |
+
#### Single Image Dialogue
|
| 539 |
+
|
| 540 |
+
<table>
|
| 541 |
+
<thead>
|
| 542 |
+
<tr>
|
| 543 |
+
<th>image</th>
|
| 544 |
+
<th>minimind-3v</th>
|
| 545 |
+
<th>minimind-3v-moe</th>
|
| 546 |
+
</tr>
|
| 547 |
+
</thead>
|
| 548 |
+
<tbody>
|
| 549 |
+
<tr>
|
| 550 |
+
<td>
|
| 551 |
+
<img src="./dataset/eval_images/airplane-flying-blue-sky.jpg" alt="airplane">
|
| 552 |
+
|
| 553 |
+
</td>
|
| 554 |
+
<td>The image features a white and black airplane parked on a grassy terrain. The layers of a building are likely to be filled with air traffic control, such as the building's tall building, the large building, or the overall pavement. There are also two airplanes parked in the background. The airplanes are displayed on a board, and the airplane is flying through the air while the black and white airplane is parked on the ground.</td>
|
| 555 |
+
<td>The image features a large jetliner with a large airplane sitting on the ground. It is likely an airplane or an aircraft, possibly a flight jet or a runway.</td>
|
| 556 |
+
</tr>
|
| 557 |
+
<tr>
|
| 558 |
+
<td>
|
| 559 |
+
<img src="./dataset/eval_images/birthday-cake-candles-table.jpg" alt="birthday-cake">
|
| 560 |
+
|
| 561 |
+
</td>
|
| 562 |
+
<td>The image features a white cake in an old-fashioned cake with cake placed on it. It is surrounded by a few cooked ingredients, including the wedding cake.</td>
|
| 563 |
+
<td>The image features a white cake in an old-fashioned cake with cake placed on it. It is surrounded by a few cooked ingredients, including the wedding cake.</td>
|
| 564 |
+
</tr>
|
| 565 |
+
<tr>
|
| 566 |
+
<td>
|
| 567 |
+
<img src="./dataset/eval_images/pizza-on-wooden-board.jpg" alt="pizza">
|
| 568 |
+
|
| 569 |
+
</td>
|
| 570 |
+
<td>The image depicts a delicious pizza pizza with fresh toppings, which are likely present in a slice of pizza. The pizza is perfectly crispy, as it has a crispy crust and slightly crispy, making it a delightful pizza presentation. The pizza is filled with fresh toppings, adding to the crispy crust. The pizza is also a bit scrambled, as it has a fresh topping, while the crispy crust is cooked with a pizza pan. The pizza is likely to be a pizza with its crunchy texture and flavorfully. The pizza is also a popular choice for pizza with others, and it is a filling and crispy crust.</td>
|
| 571 |
+
<td>The image features a scenic burning pizza in a pasta-style board, surrounded by a wooden cabinet, a garner topping for cheese and vegetables. There is a small cabinet nearby, with a pan-familt, a pizza flatter topping. The pizza is situated on the left side of the board, with a bowl of olives placed on top of the side.</td>
|
| 572 |
+
</tr>
|
| 573 |
+
<tr>
|
| 574 |
+
<td>
|
| 575 |
+
<img src="./dataset/eval_images/red-sports-car-road.jpg" alt="red-car">
|
| 576 |
+
|
| 577 |
+
</td>
|
| 578 |
+
<td>The image shows a yellow yellow and red turf coastline on a street, with a couple of cars and a yellow and red traffic lights in the background.</td>
|
| 579 |
+
<td>The image features a couple of female vanity van lying on a car. The car is situated on the ground, surrounded by a wet furniture. The couple is seen in the middle of the car, possibly observing the scene.</td>
|
| 580 |
+
</tr>
|
| 581 |
+
<tr>
|
| 582 |
+
<td>
|
| 583 |
+
<img src="./dataset/eval_images/row-of-colorful-houses.jpg" alt="colorful-houses">
|
| 584 |
+
|
| 585 |
+
</td>
|
| 586 |
+
<td>IoT tree is a holiday that is often associated with Christmas culture, history, and celebration. The tree has a unique black and white striped pattern, which features a sweet treat, a budget-friendly chocolate cake with brown spots. The cake is burnt and has a balcony with a sweet treat, with the rich, vibrant colors of the striped pattern. The tree has a rich and burnt color, while the rich and vibrant colors are visually appealing. The tree has a striped pattern, which adds to the overall atmosphere of the image.</td>
|
| 587 |
+
<td>The image is a colorful scene of a colorful vintage house, with a large pink roof of the wall and a red color scheme. The colorful vintage house has a pink color, and the color scheme appears to be fine, with a red color scheme.</td>
|
| 588 |
+
</tr>
|
| 589 |
+
<tr>
|
| 590 |
+
<td>
|
| 591 |
+
<img src="./dataset/eval_images/snow-mountain-lake-view.jpg" alt="snow-mountain">
|
| 592 |
+
|
| 593 |
+
</td>
|
| 594 |
+
<td>The image features a snowy mountain surrounded by a large, dense mountains. The large body of water is quite dense, with the body of water being hot and the water is sandy. The tall trees are swirls, and the tall trees are standing on the snow-covered ground. The body of water is also sink, creating a savanna-like appearance.</td>
|
| 595 |
+
<td>The image is an image of a large mountain visible in a lake, surrounded by the idyllic mountains and the mountains. It appears to be a blanket in the ocean, with the river and idylis watching the sea. The mountains are lined with fresh sand and waves, adding a sense of tranquility to the scene.</td>
|
| 596 |
+
</tr>
|
| 597 |
+
<tr>
|
| 598 |
+
<td>
|
| 599 |
+
<img src="./dataset/eval_images/street-food-hotpot-table.jpg" alt="street-food">
|
| 600 |
+
|
| 601 |
+
</td>
|
| 602 |
+
<td>The image shows a variety of cooking options, including baking meat, cupcakes, and spinach. The baked vegetables are rich in a variety of flavors, including grilled, sautéed, and baked vegetables. The presence of a baked vegetable with a variety of vegetables in different parts suggests a variety of options, including baking, cooking, and baking. The cooking process is highly recommended, with a mix of vegetables and baking times, making it an ideal choice for those who prefer a variety of cooking options. The cooking process is also highly compatible, with a variety of flavors and textures enjoying the cooking process.</td>
|
| 603 |
+
<td>The image features a variety of freshwater salads, glasses, and coworkers displayed on a table. There is a mix of freshwater ingredients, likely a bun, which can be seen in the menu. The freshwater ingredients are placed on the table, and there is a portion of the coworkers displayed in the middle of the room. There is also a bowl filled with various ingredients.</td>
|
| 604 |
+
</tr>
|
| 605 |
+
<tr>
|
| 606 |
+
<td>
|
| 607 |
+
<img src="./dataset/eval_images/three-kittens-basket.jpg" alt="kittens">
|
| 608 |
+
|
| 609 |
+
</td>
|
| 610 |
+
<td>The image features a table filled with people standing together, casing bars, and a pair of brown cats. The table is filled with pink and white cats.</td>
|
| 611 |
+
<td>The image is a black and white detail of a miscellaneous bunch of toys, which is likely to be a part of a group or a similar artistic field.</td>
|
| 612 |
+
</tr>
|
| 613 |
+
<tr>
|
| 614 |
+
<td>
|
| 615 |
+
<img src="./dataset/eval_images/tropical-beach-palm-tree.jpg" alt="tropical-beach">
|
| 616 |
+
|
| 617 |
+
</td>
|
| 618 |
+
<td>The image features a brown wooden coat.</td>
|
| 619 |
+
<td>The image shows a sandy beach with an umbrella on top of a chair, providing a visual appeal for people to sit on.</td>
|
| 620 |
+
</tr>
|
| 621 |
+
<tr>
|
| 622 |
+
<td>
|
| 623 |
+
<img src="./dataset/eval_images/yellow-school-bus-road.jpg" alt="school-bus">
|
| 624 |
+
|
| 625 |
+
</td>
|
| 626 |
+
<td>The image displays a group of people sitting on the bus. They are waiting to be cautious and attentive to their feet, which indicates they are likely to be cautious and followed by the bus.</td>
|
| 627 |
+
<td>The image features a large collection of school buses, a brick-and-middle bus, and a stack of cars visible in the background. The school bus is situated next to a school bus, and there are several people watching the bus. The school bus is visible in the background, with one person standing behind the other, while the other person is watching the bus. The bus is positioned behind the school bus, creating a seamless and dynamic visual effect.</td>
|
| 628 |
+
</tr>
|
| 629 |
+
</tbody>
|
| 630 |
+
</table>
|
| 631 |
+
|
| 632 |
+
### Effect Summary:
|
| 633 |
+
|
| 634 |
+
Both models can identify image subjects (airplane, cake, car, beach, etc.), but commonly exhibit repetitive expressions and hallucinated details. Limited by model and data scale, the overall performance is at a stage of "understanding the gist but inaccurate on details".
|
| 635 |
+
|
| 636 |
+
Visual signals are treated as a special foreign language by LLMs, so the "language learning" ability highly depends on the LLM's capacity. The stronger the LLM, the more powerful the corresponding VLM, and the performance boost becomes significant.
|
| 637 |
+
|
| 638 |
+
#### Future Areas for Improvement:
|
| 639 |
+
|
| 640 |
+
```text
|
| 641 |
+
> Introduce dynamic resolution and Tile-based encoding (like LLaVA-NeXT) to break through the fixed resolution limit.
|
| 642 |
+
> Visual Encoder could be upgraded to stronger vision encoders for finer-grained image features.
|
| 643 |
+
> Extend multi-image understanding, video understanding, and Visual Grounding capabilities.
|
| 644 |
+
> ...
|
| 645 |
+
```
|
| 646 |
+
|
| 647 |
+
# 📌 Acknowledge
|
| 648 |
+
|
| 649 |
+
> [!TIP]
|
| 650 |
+
> If you find `MiniMind-V` helpful, please consider giving it a ⭐ on GitHub. <br/>
|
| 651 |
+
> Given the limited expertise, there may be unknown issues, and we welcome everyone to discuss, correct, or submit PRs
|
| 652 |
+
> to improve the project in Issues. <br/>
|
| 653 |
+
> Your support is the driving force behind continuous improvements to the project. Thank you!
|
| 654 |
+
|
| 655 |
+
## 🤝 [Contributors](https://github.com/jingyaogong/minimind-v/graphs/contributors)
|
| 656 |
+
|
| 657 |
+
<a href="https://github.com/jingyaogong/minimind-v/graphs/contributors">
|
| 658 |
+
<img width="200" src="https://contrib.rocks/image?repo=jingyaogong/minimind-v" />
|
| 659 |
+
</a>
|
| 660 |
+
|
| 661 |
+
## 😊 Acknowledgments
|
| 662 |
+
|
| 663 |
+
<a href="https://github.com/xinyanghuang7"><b>@xinyanghuang7</b></a>: <a href="https://github.com/xinyanghuang7/minimind-v/tree/hxy">Multi-image VLM branch</a> | <a href="https://github.com/jingyaogong/minimind-v/tree/32cf4c5c01337231fd907b92d513de8945594263">Repository provided up to this version</a>
|
| 664 |
+
|
| 665 |
+
<details close>
|
| 666 |
+
<summary> <b>Reference Links & Thanks to the following excellent papers or projects</b> </summary>
|
| 667 |
+
|
| 668 |
+
- No particular order
|
| 669 |
+
- [LLaVA](https://arxiv.org/pdf/2304.08485)
|
| 670 |
+
- [LLaVA-VL](https://arxiv.org/pdf/2310.03744)
|
| 671 |
+
- [Chinese-LLaVA-Vision-Instructions](https://huggingface.co/datasets/LinkSoul/Chinese-LLaVA-Vision-Instructions)
|
| 672 |
+
|
| 673 |
+
</details>
|
| 674 |
+
|
| 675 |
+
## 🫶 Supporters
|
| 676 |
+
|
| 677 |
+
<a href="https://github.com/jingyaogong/minimind-v/stargazers">
|
| 678 |
+
<picture>
|
| 679 |
+
<source media="(prefers-color-scheme: dark)" srcset="https://reporoster.com/stars/dark/jingyaogong/minimind-v"/>
|
| 680 |
+
<source media="(prefers-color-scheme: light)" srcset="https://reporoster.com/stars/jingyaogong/minimind-v"/>
|
| 681 |
+
<img alt="github contribution grid snake animation" src="https://reporoster.com/stars/jingyaogong/minimind-v"/>
|
| 682 |
+
</picture>
|
| 683 |
+
</a>
|
| 684 |
+
|
| 685 |
+
<a href="https://github.com/jingyaogong/minimind-v/network/members">
|
| 686 |
+
<picture>
|
| 687 |
+
<source media="(prefers-color-scheme: dark)" srcset="https://reporoster.com/forks/dark/jingyaogong/minimind-v"/>
|
| 688 |
+
<source media="(prefers-color-scheme: light)" srcset="https://reporoster.com/forks/jingyaogong/minimind-v"/>
|
| 689 |
+
<img alt="github contribution grid snake animation" src="https://reporoster.com/forks/jingyaogong/minimind-v"/>
|
| 690 |
+
</picture>
|
| 691 |
+
</a>
|
| 692 |
+
|
| 693 |
+
<picture>
|
| 694 |
+
<source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=jingyaogong/minimind-v&type=Date&theme=dark"/>
|
| 695 |
+
<source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=jingyaogong/minimind-v&type=Date"/>
|
| 696 |
+
<img alt="Star History Chart" src="https://api.star-history.com/svg?repos=jingyaogong/minimind-v&type=Date"/>
|
| 697 |
+
</picture>
|
| 698 |
+
|
| 699 |
+
# 🎓 Citation
|
| 700 |
+
|
| 701 |
+
If you find MiniMind-V helpful in your research or work, please cite:
|
| 702 |
+
|
| 703 |
+
```bibtex
|
| 704 |
+
@misc{minimind-v,
|
| 705 |
+
title = {MiniMind-V: Train a Tiny VLM from Scratch},
|
| 706 |
+
author = {Jingyao Gong},
|
| 707 |
+
year = {2024},
|
| 708 |
+
url = {https://github.com/jingyaogong/minimind-v},
|
| 709 |
+
note = {GitHub repository, accessed 2026}
|
| 710 |
+
}
|
| 711 |
+
```
|
| 712 |
+
|
| 713 |
+
# 📜 License
|
| 714 |
+
|
| 715 |
+
This repository is licensed under the [Apache-2.0 License](LICENSE).
|
| 716 |
+
|
chat_template.jinja
ADDED
|
@@ -0,0 +1,85 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{%- if tools %}
|
| 2 |
+
{{- '<|im_start|>system\n' }}
|
| 3 |
+
{%- if messages[0].role == 'system' %}
|
| 4 |
+
{{- messages[0].content + '\n\n' }}
|
| 5 |
+
{%- endif %}
|
| 6 |
+
{{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
|
| 7 |
+
{%- for tool in tools %}
|
| 8 |
+
{{- "\n" }}
|
| 9 |
+
{{- tool | tojson }}
|
| 10 |
+
{%- endfor %}
|
| 11 |
+
{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
|
| 12 |
+
{%- else %}
|
| 13 |
+
{%- if messages[0].role == 'system' %}
|
| 14 |
+
{{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
|
| 15 |
+
{%- endif %}
|
| 16 |
+
{%- endif %}
|
| 17 |
+
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
|
| 18 |
+
{%- for message in messages[::-1] %}
|
| 19 |
+
{%- set index = (messages|length - 1) - loop.index0 %}
|
| 20 |
+
{%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
|
| 21 |
+
{%- set ns.multi_step_tool = false %}
|
| 22 |
+
{%- set ns.last_query_index = index %}
|
| 23 |
+
{%- endif %}
|
| 24 |
+
{%- endfor %}
|
| 25 |
+
{%- for message in messages %}
|
| 26 |
+
{%- if message.content is string %}
|
| 27 |
+
{%- set content = message.content %}
|
| 28 |
+
{%- else %}
|
| 29 |
+
{%- set content = '' %}
|
| 30 |
+
{%- endif %}
|
| 31 |
+
{%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
|
| 32 |
+
{{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
|
| 33 |
+
{%- elif message.role == "assistant" %}
|
| 34 |
+
{%- set reasoning_content = '' %}
|
| 35 |
+
{%- if message.reasoning_content is string %}
|
| 36 |
+
{%- set reasoning_content = message.reasoning_content %}
|
| 37 |
+
{%- else %}
|
| 38 |
+
{%- if '</think>' in content %}
|
| 39 |
+
{%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
|
| 40 |
+
{%- set content = content.split('</think>')[-1].lstrip('\n') %}
|
| 41 |
+
{%- endif %}
|
| 42 |
+
{%- endif %}
|
| 43 |
+
{%- if true %}
|
| 44 |
+
{{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
|
| 45 |
+
{%- endif %}
|
| 46 |
+
{%- if message.tool_calls %}
|
| 47 |
+
{%- for tool_call in message.tool_calls %}
|
| 48 |
+
{%- if (loop.first and content) or (not loop.first) %}
|
| 49 |
+
{{- '\n' }}
|
| 50 |
+
{%- endif %}
|
| 51 |
+
{%- if tool_call.function %}
|
| 52 |
+
{%- set tool_call = tool_call.function %}
|
| 53 |
+
{%- endif %}
|
| 54 |
+
{{- '<tool_call>\n{"name": "' }}
|
| 55 |
+
{{- tool_call.name }}
|
| 56 |
+
{{- '", "arguments": ' }}
|
| 57 |
+
{%- if tool_call.arguments is string %}
|
| 58 |
+
{{- tool_call.arguments }}
|
| 59 |
+
{%- else %}
|
| 60 |
+
{{- tool_call.arguments | tojson }}
|
| 61 |
+
{%- endif %}
|
| 62 |
+
{{- '}\n</tool_call>' }}
|
| 63 |
+
{%- endfor %}
|
| 64 |
+
{%- endif %}
|
| 65 |
+
{{- '<|im_end|>\n' }}
|
| 66 |
+
{%- elif message.role == "tool" %}
|
| 67 |
+
{%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
|
| 68 |
+
{{- '<|im_start|>user' }}
|
| 69 |
+
{%- endif %}
|
| 70 |
+
{{- '\n<tool_response>\n' }}
|
| 71 |
+
{{- content }}
|
| 72 |
+
{{- '\n</tool_response>' }}
|
| 73 |
+
{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
|
| 74 |
+
{{- '<|im_end|>\n' }}
|
| 75 |
+
{%- endif %}
|
| 76 |
+
{%- endif %}
|
| 77 |
+
{%- endfor %}
|
| 78 |
+
{%- if add_generation_prompt %}
|
| 79 |
+
{{- '<|im_start|>assistant\n' }}
|
| 80 |
+
{%- if open_thinking is defined and open_thinking is true %}
|
| 81 |
+
{{- '<think>\n' }}
|
| 82 |
+
{%- else %}
|
| 83 |
+
{{- '<think>\n\n</think>\n\n' }}
|
| 84 |
+
{%- endif %}
|
| 85 |
+
{%- endif %}
|
config.json
ADDED
|
@@ -0,0 +1,42 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"architectures": [
|
| 3 |
+
"MiniMindVLM"
|
| 4 |
+
],
|
| 5 |
+
"auto_map": {
|
| 6 |
+
"AutoConfig": "model_vlm.VLMConfig",
|
| 7 |
+
"AutoModelForCausalLM": "model_vlm.MiniMindVLM"
|
| 8 |
+
},
|
| 9 |
+
"bos_token_id": 1,
|
| 10 |
+
"dropout": 0.0,
|
| 11 |
+
"dtype": "bfloat16",
|
| 12 |
+
"eos_token_id": 2,
|
| 13 |
+
"flash_attn": true,
|
| 14 |
+
"head_dim": 96,
|
| 15 |
+
"hidden_act": "silu",
|
| 16 |
+
"hidden_size": 768,
|
| 17 |
+
"image_hidden_size": 768,
|
| 18 |
+
"image_ids": [
|
| 19 |
+
12
|
| 20 |
+
],
|
| 21 |
+
"image_special_token": "<|image_pad|>",
|
| 22 |
+
"image_token_len": 64,
|
| 23 |
+
"inference_rope_scaling": false,
|
| 24 |
+
"intermediate_size": 2432,
|
| 25 |
+
"max_position_embeddings": 32768,
|
| 26 |
+
"max_seq_len": 8192,
|
| 27 |
+
"model_type": "minimind-v",
|
| 28 |
+
"moe_intermediate_size": 2432,
|
| 29 |
+
"norm_topk_prob": true,
|
| 30 |
+
"num_attention_heads": 8,
|
| 31 |
+
"num_experts": 4,
|
| 32 |
+
"num_experts_per_tok": 1,
|
| 33 |
+
"num_hidden_layers": 8,
|
| 34 |
+
"num_key_value_heads": 4,
|
| 35 |
+
"rms_norm_eps": 1e-06,
|
| 36 |
+
"rope_scaling": null,
|
| 37 |
+
"rope_theta": 1000000.0,
|
| 38 |
+
"router_aux_loss_coef": 0.0005,
|
| 39 |
+
"transformers_version": "4.57.6",
|
| 40 |
+
"use_moe": false,
|
| 41 |
+
"vocab_size": 6400
|
| 42 |
+
}
|
generation_config.json
ADDED
|
@@ -0,0 +1,6 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"_from_model_config": true,
|
| 3 |
+
"bos_token_id": 1,
|
| 4 |
+
"eos_token_id": 2,
|
| 5 |
+
"transformers_version": "4.57.6"
|
| 6 |
+
}
|
images/VLM-structure-moe.jpg
ADDED
|
Git LFS Details
|
images/VLM-structure.jpg
ADDED
|
Git LFS Details
|
images/llava-structure.png
ADDED
|
Git LFS Details
|
images/logo.png
ADDED
|
Git LFS Details
|
images/minimind-3v.gif
ADDED
|
Git LFS Details
|
images/minimind-v-input.jpg
ADDED
|
Git LFS Details
|
images/pretrain_loss.jpg
ADDED
|
images/sft_loss.jpg
ADDED
|
model_minimind.py
ADDED
|
@@ -0,0 +1,279 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import math, torch, torch.nn.functional as F
|
| 2 |
+
from torch import nn
|
| 3 |
+
from transformers.activations import ACT2FN
|
| 4 |
+
from transformers import PreTrainedModel, GenerationMixin, PretrainedConfig
|
| 5 |
+
from transformers.modeling_outputs import MoeCausalLMOutputWithPast
|
| 6 |
+
|
| 7 |
+
# 🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏
|
| 8 |
+
# MiniMind Config
|
| 9 |
+
# 🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏
|
| 10 |
+
class MiniMindConfig(PretrainedConfig):
    """Configuration for the MiniMind causal language model.

    Only ``hidden_size``, ``num_hidden_layers`` and ``use_moe`` are explicit
    parameters; every other hyperparameter is read from ``**kwargs`` with a
    default, so extra keys in a saved config.json load transparently.
    """
    model_type = "minimind"
    def __init__(self, hidden_size=768, num_hidden_layers=8, use_moe=False, **kwargs):
        super().__init__(**kwargs)
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.use_moe = use_moe
        self.dropout = kwargs.get("dropout", 0.0)
        self.vocab_size = kwargs.get("vocab_size", 6400)
        self.bos_token_id = kwargs.get("bos_token_id", 1)
        self.eos_token_id = kwargs.get("eos_token_id", 2)
        # Opt into the torch SDPA fast path when the runtime supports it.
        self.flash_attn = kwargs.get("flash_attn", True)
        self.num_attention_heads = kwargs.get("num_attention_heads", 8)
        # Fewer KV heads than query heads => grouped-query attention (GQA).
        self.num_key_value_heads = kwargs.get("num_key_value_heads", 4)
        self.head_dim = kwargs.get("head_dim", self.hidden_size // self.num_attention_heads)
        self.hidden_act = kwargs.get("hidden_act", 'silu')
        # Default FFN width: hidden_size * pi, rounded up to a multiple of 64.
        self.intermediate_size = kwargs.get("intermediate_size", math.ceil(hidden_size * math.pi / 64) * 64)
        self.max_position_embeddings = kwargs.get("max_position_embeddings", 32768)
        self.rms_norm_eps = kwargs.get("rms_norm_eps", 1e-6)
        self.rope_theta = kwargs.get("rope_theta", 1e6)
        # When True, install a fixed YaRN scaling profile for long-context inference.
        self.inference_rope_scaling = kwargs.get("inference_rope_scaling", False)
        self.rope_scaling = {
            "beta_fast": 32,
            "beta_slow": 1,
            "factor": 16,
            "original_max_position_embeddings": 2048,
            "attention_factor": 1.0,
            "type": "yarn"
        } if self.inference_rope_scaling else None
        ### MoE specific configs (ignored if use_moe = False)
        self.num_experts = kwargs.get("num_experts", 4)
        self.num_experts_per_tok = kwargs.get("num_experts_per_tok", 1)
        self.moe_intermediate_size = kwargs.get("moe_intermediate_size", self.intermediate_size)
        self.norm_topk_prob = kwargs.get("norm_topk_prob", True)
        self.router_aux_loss_coef = kwargs.get("router_aux_loss_coef", 5e-4)
|
| 45 |
+
|
| 46 |
+
# 🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏
|
| 47 |
+
# MiniMind Model
|
| 48 |
+
# 🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏🌎🌍🌏
|
| 49 |
+
class RMSNorm(torch.nn.Module):
    """Root-mean-square layer normalization (no mean-centering, no bias)."""

    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        # Small constant added under the square root for numerical stability.
        self.eps = eps
        # Learnable per-channel gain, initialized to the identity.
        self.weight = nn.Parameter(torch.ones(dim))

    def norm(self, x):
        # Scale each vector by the reciprocal RMS of its last dimension.
        mean_square = x.pow(2).mean(dim=-1, keepdim=True)
        return x * torch.rsqrt(mean_square + self.eps)

    def forward(self, x):
        # Normalize in float32 for stability, then cast back to the input dtype.
        normalized = self.norm(x.float())
        return (self.weight * normalized).type_as(x)
|
| 60 |
+
|
| 61 |
+
def precompute_freqs_cis(dim: int, end: int = int(32 * 1024), rope_base: float = 1e6, rope_scaling: dict = None):
    """Precompute rotary-embedding cos/sin tables of shape (end, dim).

    When `rope_scaling` is given, a YaRN-style per-frequency rescale is applied
    (linear ramp between the beta_fast/beta_slow dimension cutoffs) together
    with a global attention factor; otherwise plain RoPE frequencies are used.
    """
    half = dim // 2
    # Base inverse frequencies: 1 / base^(2i/dim) for i in [0, dim/2).
    inv_freq = 1.0 / (rope_base ** (torch.arange(0, dim, 2)[:half].float() / dim))
    attn_factor = 1.0
    if rope_scaling is not None:  # YaRN: f'(i) = f(i)((1-γ) + γ/s), γ∈[0,1] linear ramp
        orig_max = rope_scaling.get("original_max_position_embeddings", 2048)
        factor = rope_scaling.get("factor", 16)
        beta_fast = rope_scaling.get("beta_fast", 32.0)
        beta_slow = rope_scaling.get("beta_slow", 1.0)
        attn_factor = rope_scaling.get("attention_factor", 1.0)
        if end / orig_max > 1.0:
            # Dimension index at which `num_rot` full rotations fit within orig_max.
            def inv_dim(num_rot):
                return (dim * math.log(orig_max / (num_rot * 2 * math.pi))) / (2 * math.log(rope_base))
            low = max(math.floor(inv_dim(beta_fast)), 0)
            high = min(math.ceil(inv_dim(beta_slow)), half - 1)
            # γ ramps 0→1 across [low, high]; interpolate only the slow frequencies.
            ramp = torch.clamp((torch.arange(half, device=inv_freq.device).float() - low) / max(high - low, 0.001), 0, 1)
            inv_freq = inv_freq * (1 - ramp + ramp / factor)
    positions = torch.arange(end, device=inv_freq.device)
    angles = torch.outer(positions, inv_freq).float()
    # Duplicate along the feature axis so the tables span the full head dim.
    freqs_cos = torch.cat([torch.cos(angles), torch.cos(angles)], dim=-1) * attn_factor
    freqs_sin = torch.cat([torch.sin(angles), torch.sin(angles)], dim=-1) * attn_factor
    return freqs_cos, freqs_sin
|
| 78 |
+
|
| 79 |
+
def apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=1):
    """Rotate query/key tensors by the precomputed cos/sin position tables.

    `unsqueeze_dim` is the broadcast axis inserted into cos/sin before they
    are multiplied against q and k.
    """
    def rotate_half(x):
        # Swap the two halves of the last dim, negating the (moved) second half.
        half = x.shape[-1] // 2
        return torch.cat((-x[..., half:], x[..., :half]), dim=-1)

    cos_b = cos.unsqueeze(unsqueeze_dim)
    sin_b = sin.unsqueeze(unsqueeze_dim)
    q_embed = (q * cos_b + rotate_half(q) * sin_b).to(q.dtype)
    k_embed = (k * cos_b + rotate_half(k) * sin_b).to(k.dtype)
    return q_embed, k_embed
|
| 84 |
+
|
| 85 |
+
def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Repeat each KV head `n_rep` times along the head axis (GQA sharing).

    Layout: (batch, seq_len, n_kv_heads, head_dim) ->
            (batch, seq_len, n_kv_heads * n_rep, head_dim).
    """
    batch, seq_len, n_kv_heads, head_dim = x.shape
    if n_rep == 1:
        # Nothing to replicate; hand back the original tensor untouched.
        return x
    expanded = x.unsqueeze(3).expand(batch, seq_len, n_kv_heads, n_rep, head_dim)
    return expanded.reshape(batch, seq_len, n_kv_heads * n_rep, head_dim)
|
| 89 |
+
|
| 90 |
+
class Attention(nn.Module):
    """Multi-head self-attention with grouped-query KV heads, per-head
    QK-RMSNorm, rotary position embeddings and an optional KV cache.
    """
    def __init__(self, config: MiniMindConfig):
        super().__init__()
        # Fall back to full MHA (one KV head per query head) when unset.
        self.num_key_value_heads = config.num_attention_heads if config.num_key_value_heads is None else config.num_key_value_heads
        self.n_local_heads = config.num_attention_heads
        self.n_local_kv_heads = self.num_key_value_heads
        # How many query heads share each KV head (GQA replication factor).
        self.n_rep = self.n_local_heads // self.n_local_kv_heads
        self.head_dim = config.head_dim
        self.q_proj = nn.Linear(config.hidden_size, config.num_attention_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(config.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(config.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(config.num_attention_heads * self.head_dim, config.hidden_size, bias=False)
        # QK-norm: RMSNorm over head_dim applied to queries and keys before RoPE.
        self.q_norm = RMSNorm(self.head_dim, eps=config.rms_norm_eps)
        self.k_norm = RMSNorm(self.head_dim, eps=config.rms_norm_eps)
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.dropout = config.dropout
        # Use fused SDPA only if this torch build provides it and config allows it.
        self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention') and config.flash_attn

    def forward(self, x, position_embeddings, past_key_value=None, use_cache=False, attention_mask=None):
        """Attend over `x`.

        Args:
            x: hidden states, shape (batch, seq_len, hidden_size).
            position_embeddings: (cos, sin) tables for the current positions.
            past_key_value: optional (k, v) tensors cached from earlier steps,
                concatenated in front of the new keys/values.
            use_cache: when True, return the updated (k, v) pair for reuse.
            attention_mask: optional padding mask (1 = attend, 0 = masked).
                # NOTE(review): appears to cover the full key length incl.
                # cached positions — confirm against caller.

        Returns:
            (output, past_kv): output is (batch, seq_len, hidden_size);
            past_kv is the (k, v) tuple or None.
        """
        bsz, seq_len, _ = x.shape
        xq, xk, xv = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Split projections into heads: (batch, seq, heads, head_dim).
        xq = xq.view(bsz, seq_len, self.n_local_heads, self.head_dim)
        xk = xk.view(bsz, seq_len, self.n_local_kv_heads, self.head_dim)
        xv = xv.view(bsz, seq_len, self.n_local_kv_heads, self.head_dim)
        # QK-norm, then rotate by position — both before the cache concat.
        xq, xk = self.q_norm(xq), self.k_norm(xk)
        cos, sin = position_embeddings
        xq, xk = apply_rotary_pos_emb(xq, xk, cos, sin)
        if past_key_value is not None:
            # Prepend cached keys/values along the sequence axis.
            xk = torch.cat([past_key_value[0], xk], dim=1)
            xv = torch.cat([past_key_value[1], xv], dim=1)
        past_kv = (xk, xv) if use_cache else None
        # Move heads to dim 1 and replicate KV heads to match query heads.
        xq, xk, xv = (xq.transpose(1, 2), repeat_kv(xk, self.n_rep).transpose(1, 2), repeat_kv(xv, self.n_rep).transpose(1, 2))
        if self.flash and (seq_len > 1) and (past_key_value is None) and (attention_mask is None or torch.all(attention_mask == 1)):
            # Fused SDPA fast path: prefill only, no padding, built-in causal mask.
            output = F.scaled_dot_product_attention(xq, xk, xv, dropout_p=self.dropout if self.training else 0.0, is_causal=True)
        else:
            scores = (xq @ xk.transpose(-2, -1)) / math.sqrt(self.head_dim)
            # Causal mask only over the trailing seq_len key positions, so any
            # cached (earlier) positions remain fully visible to new queries.
            scores[:, :, :, -seq_len:] += torch.full((seq_len, seq_len), float("-inf"), device=scores.device).triu(1)
            # Additive padding mask: masked key positions get a large negative bias.
            if attention_mask is not None: scores += (1.0 - attention_mask.unsqueeze(1).unsqueeze(2)) * -1e9
            # Softmax in float32 for stability, then cast back to the query dtype.
            output = self.attn_dropout(F.softmax(scores.float(), dim=-1).type_as(xq)) @ xv
        output = output.transpose(1, 2).reshape(bsz, seq_len, -1)
        output = self.resid_dropout(self.o_proj(output))
        return output, past_kv
|
| 133 |
+
|
| 134 |
+
class FeedForward(nn.Module):
    """SwiGLU-style gated MLP: down_proj(act(gate_proj(x)) * up_proj(x))."""

    def __init__(self, config: MiniMindConfig, intermediate_size: int = None):
        super().__init__()
        # Fall back to the config-wide width when no explicit size is given.
        if not intermediate_size:
            intermediate_size = config.intermediate_size
        self.gate_proj = nn.Linear(config.hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, config.hidden_size, bias=False)
        self.up_proj = nn.Linear(config.hidden_size, intermediate_size, bias=False)
        # Activation resolved by name (e.g. 'silu') from the transformers registry.
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        gated = self.act_fn(self.gate_proj(x))
        return self.down_proj(gated * self.up_proj(x))
|
| 145 |
+
|
| 146 |
+
class MOEFeedForward(nn.Module):
    """Mixture-of-Experts feed-forward layer.

    A linear router scores every token against every expert; each token is
    dispatched to its top-k experts and the expert outputs are combined with
    the (optionally renormalized) router weights. During training an auxiliary
    load-balancing loss is stored on ``self.aux_loss`` for the caller to pick up.
    """

    def __init__(self, config: MiniMindConfig):
        super().__init__()
        self.config = config
        # Router: one logit per expert for each token.
        self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False)
        # Each expert is a standard FeedForward with the (smaller) MoE width.
        self.experts = nn.ModuleList([FeedForward(config, intermediate_size=config.moe_intermediate_size) for _ in range(config.num_experts)])
        # NOTE(review): act_fn appears unused here (experts own their activation) — kept for compatibility.
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        batch_size, seq_len, hidden_dim = x.shape
        # Flatten (batch, seq) so routing works on a plain token list.
        x_flat = x.view(-1, hidden_dim)
        # Router probabilities over experts, shape (tokens, num_experts).
        scores = F.softmax(self.gate(x_flat), dim=-1)
        topk_weight, topk_idx = torch.topk(scores, k=self.config.num_experts_per_tok, dim=-1, sorted=False)
        # Optionally renormalize the selected weights so they sum to 1 per token.
        if self.config.norm_topk_prob: topk_weight = topk_weight / (topk_weight.sum(dim=-1, keepdim=True) + 1e-20)
        y = torch.zeros_like(x_flat)
        for i, expert in enumerate(self.experts):
            # Tokens (and top-k slots) routed to expert i.
            mask = (topk_idx == i)
            if mask.any():
                token_idx = mask.any(dim=-1).nonzero().flatten()
                weight = topk_weight[mask].view(-1, 1)
                # Accumulate this expert's weighted contribution for its tokens.
                y.index_add_(0, token_idx, (expert(x_flat[token_idx]) * weight).to(y.dtype))
            elif self.training:
                # Touch the unused expert's parameters with a zero-valued term so
                # they stay in the autograd graph (keeps DDP/grad sync happy).
                y[0, 0] += 0 * sum(p.sum() for p in expert.parameters())
        if self.training and self.config.router_aux_loss_coef > 0:
            # Load-balancing auxiliary loss: fraction of assignments per expert
            # times mean router probability per expert, scaled by num_experts.
            load = F.one_hot(topk_idx, self.config.num_experts).float().mean(0)
            self.aux_loss = (load * scores.mean(0)).sum() * self.config.num_experts * self.config.router_aux_loss_coef
        else:
            self.aux_loss = scores.new_zeros(1).squeeze()
        return y.view(batch_size, seq_len, hidden_dim)
|
| 175 |
+
|
| 176 |
+
class MiniMindBlock(nn.Module):
    """One pre-norm transformer layer: self-attention then (MoE) feed-forward,
    each wrapped in a residual connection."""

    def __init__(self, layer_id: int, config: MiniMindConfig):
        super().__init__()
        self.self_attn = Attention(config)
        self.input_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.post_attention_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        # Dense FFN by default; expert mixture when the config enables MoE.
        self.mlp = MOEFeedForward(config) if config.use_moe else FeedForward(config)

    def forward(self, hidden_states, position_embeddings, past_key_value=None, use_cache=False, attention_mask=None):
        # Attention sub-layer (pre-norm) + residual.
        attn_out, present_key_value = self.self_attn(
            self.input_layernorm(hidden_states), position_embeddings,
            past_key_value, use_cache, attention_mask
        )
        hidden_states = hidden_states + attn_out
        # Feed-forward sub-layer (pre-norm) + residual.
        hidden_states = hidden_states + self.mlp(self.post_attention_layernorm(hidden_states))
        return hidden_states, present_key_value
|
| 193 |
+
|
| 194 |
+
class MiniMindModel(nn.Module):
    """Decoder-only transformer backbone: token embedding, stacked
    MiniMindBlocks, final RMSNorm, and precomputed RoPE tables."""

    def __init__(self, config: MiniMindConfig):
        super().__init__()
        self.config = config
        self.vocab_size, self.num_hidden_layers = config.vocab_size, config.num_hidden_layers
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
        self.dropout = nn.Dropout(config.dropout)
        self.layers = nn.ModuleList([MiniMindBlock(l, config) for l in range(self.num_hidden_layers)])
        self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        # Precompute rotary position-embedding tables for the full context length;
        # non-persistent buffers so they are not written to checkpoints.
        freqs_cos, freqs_sin = precompute_freqs_cis(dim=config.head_dim, end=config.max_position_embeddings, rope_base=config.rope_theta, rope_scaling=config.rope_scaling)
        self.register_buffer("freqs_cos", freqs_cos, persistent=False)
        self.register_buffer("freqs_sin", freqs_sin, persistent=False)

    def forward(self, input_ids, attention_mask=None, past_key_values=None, use_cache=False, **kwargs):
        """Return (hidden_states, per-layer present KV list, summed MoE aux loss)."""
        batch_size, seq_length = input_ids.shape
        # HF Cache objects expose `.layers`; this model uses plain tuples, so
        # discard an incompatible cache rather than misread it.
        if hasattr(past_key_values, 'layers'): past_key_values = None
        past_key_values = past_key_values or [None] * len(self.layers)
        # Offset into the RoPE tables = number of already-cached positions.
        start_pos = past_key_values[0][0].shape[1] if past_key_values[0] is not None else 0
        hidden_states = self.dropout(self.embed_tokens(input_ids))
        # Slice the RoPE tables for the positions covered by this call.
        position_embeddings = (self.freqs_cos[start_pos:start_pos + seq_length], self.freqs_sin[start_pos:start_pos + seq_length])
        presents = []
        for layer, past_key_value in zip(self.layers, past_key_values):
            hidden_states, present = layer(
                hidden_states,
                position_embeddings,
                past_key_value=past_key_value,
                use_cache=use_cache,
                attention_mask=attention_mask
            )
            presents.append(present)
        hidden_states = self.norm(hidden_states)
        # Sum load-balancing losses of all MoE layers (zero scalar if none).
        aux_loss = sum([l.mlp.aux_loss for l in self.layers if isinstance(l.mlp, MOEFeedForward)], hidden_states.new_zeros(1).squeeze())
        return hidden_states, presents, aux_loss
|
| 227 |
+
|
| 228 |
+
class MiniMindForCausalLM(PreTrainedModel, GenerationMixin):
    """Causal-LM head over MiniMindModel with weight-tied embeddings and a
    lightweight custom sampling loop in ``generate``."""
    config_class = MiniMindConfig

    def __init__(self, config: MiniMindConfig = None):
        self.config = config or MiniMindConfig()
        super().__init__(self.config)
        self.model = MiniMindModel(self.config)
        self.lm_head = nn.Linear(self.config.hidden_size, self.config.vocab_size, bias=False)
        # Tie input embedding and output projection weights.
        self.model.embed_tokens.weight = self.lm_head.weight

    def forward(self, input_ids, attention_mask=None, past_key_values=None, use_cache=False, logits_to_keep=0, labels=None, **kwargs):
        """Run the backbone, project to vocab logits, and (if ``labels`` given)
        compute the shifted cross-entropy loss (ignore_index=-100)."""
        hidden_states, past_key_values, aux_loss = self.model(input_ids, attention_mask, past_key_values, use_cache, **kwargs)
        # logits_to_keep=0 -> slice(0, None) keeps all positions; an int k keeps
        # the last k positions; a tensor is used as an index directly.
        slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
        logits = self.lm_head(hidden_states[:, slice_indices, :])
        loss = None
        if labels is not None:
            # Standard next-token shift: predict token t+1 from position t.
            x, y = logits[..., :-1, :].contiguous(), labels[..., 1:].contiguous()
            loss = F.cross_entropy(x.view(-1, x.size(-1)), y.view(-1), ignore_index=-100)
        return MoeCausalLMOutputWithPast(loss=loss, aux_loss=aux_loss, logits=logits, past_key_values=past_key_values, hidden_states=hidden_states)

    # Custom loop instead of GenerationMixin.generate; rationale:
    # https://github.com/jingyaogong/minimind/discussions/611
    @torch.inference_mode()
    def generate(self, inputs=None, attention_mask=None, max_new_tokens=8192, temperature=0.85, top_p=0.85, top_k=50, eos_token_id=2, streamer=None, use_cache=True, num_return_sequences=1, do_sample=True, repetition_penalty=1.0, **kwargs):
        """Sample up to ``max_new_tokens`` tokens with temperature, top-k,
        top-p and repetition-penalty filtering; supports streaming and KV cache."""
        # Accept either `inputs` or keyword `input_ids`; expand for multiple returns.
        input_ids = kwargs.pop("input_ids", inputs).repeat(num_return_sequences, 1)
        attention_mask = attention_mask.repeat(num_return_sequences, 1) if attention_mask is not None else None
        past_key_values = kwargs.pop("past_key_values", None)
        # Tracks which sequences already emitted EOS.
        finished = torch.zeros(input_ids.shape[0], dtype=torch.bool, device=input_ids.device)
        if streamer: streamer.put(input_ids.cpu())
        for _ in range(max_new_tokens):
            # With a cache, feed only the not-yet-processed suffix.
            past_len = past_key_values[0][0].shape[1] if past_key_values else 0
            outputs = self.forward(input_ids[:, past_len:], attention_mask, past_key_values, use_cache=use_cache, **kwargs)
            # Grow the mask by one position for the token about to be appended.
            attention_mask = torch.cat([attention_mask, attention_mask.new_ones(attention_mask.shape[0], 1)], -1) if attention_mask is not None else None
            logits = outputs.logits[:, -1, :] / temperature
            if repetition_penalty != 1.0:
                # Penalize every token already present in each sequence.
                for i in range(input_ids.shape[0]): logits[i, torch.unique(input_ids[i])] /= repetition_penalty
            if top_k > 0:
                # Mask everything below the k-th largest logit.
                logits[logits < torch.topk(logits, top_k)[0][..., -1, None]] = -float('inf')
            if top_p < 1.0:
                # Nucleus filtering: drop tokens beyond cumulative probability top_p,
                # shifting the mask so the first above-threshold token is kept.
                sorted_logits, sorted_indices = torch.sort(logits, descending=True)
                mask = torch.cumsum(torch.softmax(sorted_logits, dim=-1), dim=-1) > top_p
                mask[..., 1:], mask[..., 0] = mask[..., :-1].clone(), 0
                logits[mask.scatter(1, sorted_indices, mask)] = -float('inf')
            next_token = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1) if do_sample else torch.argmax(logits, dim=-1, keepdim=True)
            # Finished sequences keep emitting EOS so batch shapes stay aligned.
            if eos_token_id is not None: next_token = torch.where(finished.unsqueeze(-1), next_token.new_full((next_token.shape[0], 1), eos_token_id), next_token)
            input_ids = torch.cat([input_ids, next_token], dim=-1)
            past_key_values = outputs.past_key_values if use_cache else None
            if streamer: streamer.put(next_token.cpu())
            if eos_token_id is not None:
                finished |= next_token.squeeze(-1).eq(eos_token_id)
                if finished.all(): break
        if streamer: streamer.end()
        # Optionally hand the cache back to the caller for chained generation.
        if kwargs.get("return_kv"): return {'generated_ids': input_ids, 'past_kv': past_key_values}
        return input_ids
|
model_vlm.py
ADDED
|
@@ -0,0 +1,155 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import os
|
| 2 |
+
import torch
|
| 3 |
+
import warnings
|
| 4 |
+
from .model_minimind import *
|
| 5 |
+
from typing import Optional, Tuple, List, Union
|
| 6 |
+
from torch import nn
|
| 7 |
+
from transformers import Siglip2ImageProcessor, Siglip2VisionModel
|
| 8 |
+
from transformers.modeling_outputs import MoeCausalLMOutputWithPast
|
| 9 |
+
|
| 10 |
+
warnings.filterwarnings('ignore')
|
| 11 |
+
|
| 12 |
+
|
| 13 |
+
class VLMConfig(MiniMindConfig):
    """Configuration for MiniMind-V: the language-model config plus the fields
    needed by the vision adapter (special image token, its token id(s), the
    vision encoder's hidden size, and the number of projected image tokens)."""
    model_type = "minimind-v"

    def __init__(self, image_special_token='<|image_pad|>', image_ids=None, **kwargs):
        # Fix: the original used a mutable default (`image_ids=[12]`), which is
        # shared across every instance and silently corrupted by in-place edits.
        # Use a None sentinel and copy the caller's list; default stays [12].
        self.image_special_token = image_special_token
        self.image_ids = list(image_ids) if image_ids is not None else [12]
        self.image_hidden_size = kwargs.get("image_hidden_size", 768)
        self.image_token_len = kwargs.get("image_token_len", 64)
        super().__init__(**kwargs)
|
| 22 |
+
|
| 23 |
+
class MMVisionProjector(nn.Module):
    """Projects vision-encoder token embeddings into the LM hidden space while
    compressing ``source_tokens`` positions down to ``target_tokens`` by fusing
    groups of adjacent tokens along the feature dimension."""

    def __init__(self, in_dim, out_dim, source_tokens=256, target_tokens=64):
        super().__init__()
        self.target_tokens = target_tokens
        # Number of consecutive source tokens fused into each output slot.
        self.merge = source_tokens // target_tokens
        self.mlp = nn.Sequential(
            nn.Linear(in_dim * self.merge, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, x):
        batch, _, feat = x.shape
        # Fold each group of `merge` adjacent tokens into one wider feature vector,
        # then project: (b, n, d) -> (b, target_tokens, d * merge) -> (b, target_tokens, out_dim).
        fused = x.reshape(batch, self.target_tokens, feat * self.merge)
        return self.mlp(fused)
|
| 37 |
+
|
| 38 |
+
# 继承自语言模型
|
| 39 |
+
# Inherits from the language model and adds a frozen vision encoder plus a projector.
class MiniMindVLM(MiniMindForCausalLM):
    """Vision-language model: MiniMindForCausalLM plus a frozen SigLIP2 vision
    encoder and an MMVisionProjector that maps image features into the LM's
    hidden space, spliced over ``<|image_pad|>`` placeholder tokens."""
    config_class = VLMConfig

    def __init__(self, config: VLMConfig = None, vision_model_path="./model/siglip2-base-p16-ve"):
        self.config = config or VLMConfig()
        super().__init__(self.config)
        # vision_encoder/processor are (None, None) when the local path is missing.
        self.vision_encoder, self.processor = self.__class__.get_vision_model(vision_model_path)
        self.vision_proj = MMVisionProjector(self.config.image_hidden_size, self.config.hidden_size, target_tokens=self.config.image_token_len)

    @staticmethod
    def get_vision_model(model_path: str):
        """Load and freeze the SigLIP2 vision encoder + its image processor.

        Returns (None, None) when ``model_path`` does not exist locally.
        """
        from transformers import logging as hf_logging
        hf_logging.set_verbosity_error()
        if not os.path.exists(model_path):
            return None, None
        model = Siglip2VisionModel.from_pretrained(model_path)
        processor = Siglip2ImageProcessor.from_pretrained(model_path)
        # Freeze all vision_encoder parameters — only the projector is trained.
        for param in model.parameters():
            param.requires_grad = False
        return model.eval(), processor

    @staticmethod
    def image2tensor(image, processor):
        """Preprocess a PIL image into model-ready tensors (RGBA/LA -> RGB first)."""
        if image.mode in ['RGBA', 'LA']: image = image.convert('RGB')
        inputs = processor(images=image, return_tensors="pt")
        return inputs
 
    @staticmethod
    def get_image_embeddings(image_inputs, vision_model):
        """Run the frozen vision encoder and return its last hidden state."""
        if hasattr(image_inputs, 'keys'):
            # Drop a stray singleton dim the processor may add: (b, 1, ...) -> (b, ...).
            image_inputs = {k: v.squeeze(1) if v.ndim > 2 and v.shape[1] == 1 else v for k, v in image_inputs.items()}
        with torch.no_grad():
            outputs = vision_model(**image_inputs)
        return outputs.last_hidden_state

    @torch.compiler.disable
    def count_vision_proj(self, tokens, h, vision_tensors=None, seqlen=512):
        """Splice projected image embeddings over runs of the image-pad token.

        For each sequence, every contiguous run of the marker token id is
        replaced (element-wise, up to the run length) by the next unused image's
        projected features; the result is truncated to ``seqlen``.
        Disabled for torch.compile because of the data-dependent Python loop.
        """
        if vision_tensors is None or not self.config.image_ids:
            return h
        marker, vf = self.config.image_ids[0], vision_tensors
        # Normalize to (batch, num_images, tokens, dim).
        if vf.dim() == 3:
            vf = vf.unsqueeze(1)
        out = []
        for b in range(h.size(0)):
            hb, seq, k, i = h[b], tokens[b].tolist(), 0, 0
            while i < len(seq):
                if seq[i] == marker:
                    start = i
                    # Consume the whole run of marker tokens.
                    while i < len(seq) and seq[i] == marker:
                        i += 1
                    if k < vf.size(1):
                        # Replace the run with (up to run-length) image features.
                        hb = torch.cat((hb[:start], vf[b][k][:i - start], hb[i:]), dim=0)[:seqlen]
                        k += 1
                else:
                    i += 1
            out.append(hb)
        return torch.stack(out)

    def forward(self,
                input_ids: Optional[torch.Tensor] = None,
                attention_mask: Optional[torch.Tensor] = None,
                past_key_values: Optional[List[Tuple[torch.Tensor, torch.Tensor]]] = None,
                use_cache: bool = False,
                logits_to_keep: Union[int, torch.Tensor] = 0,
                labels: Optional[torch.Tensor] = None,
                pixel_values: Optional[torch.FloatTensor] = None,
                **args):
        """Multimodal forward: embed text, splice in projected image features on
        the prefill step, then run the standard decoder stack and LM head."""
        batch_size, seq_length = input_ids.shape
        # HF Cache objects expose `.layers`; this model uses plain tuples.
        if hasattr(past_key_values, 'layers'): past_key_values = None
        past_key_values = past_key_values or [None] * len(self.model.layers)
        start_pos = past_key_values[0][0].shape[1] if past_key_values[0] is not None else 0

        hidden_states = self.model.dropout(self.model.embed_tokens(input_ids))

        # Images are only injected on the first (uncached) pass.
        if pixel_values is not None and start_pos == 0:
            if hasattr(pixel_values, 'keys'):
                # Already a processor output dict — encode and project directly.
                img_emb = MiniMindVLM.get_image_embeddings(pixel_values, self.vision_encoder)
                vision_tensors = self.vision_proj(img_emb)
            else:
                # Raw tensor path; assumed (bs, num_images, c, h, w), possibly with
                # an extra singleton dim — TODO confirm against the data pipeline.
                if len(pixel_values.shape) == 6:
                    pixel_values = pixel_values.squeeze(2)
                bs, num, c, im_h, im_w = pixel_values.shape
                stack_dim = 1 if bs > 1 else 0
                vision_tensors = torch.stack([self.vision_proj(MiniMindVLM.get_image_embeddings(pixel_values[:, i, :, :, :], self.vision_encoder)) for i in range(num)], dim=stack_dim)
            hidden_states = self.count_vision_proj(tokens=input_ids, h=hidden_states, vision_tensors=vision_tensors, seqlen=input_ids.shape[1])

        # RoPE tables sliced for the positions covered by this call.
        position_embeddings = (
            self.model.freqs_cos[start_pos:start_pos + seq_length],
            self.model.freqs_sin[start_pos:start_pos + seq_length]
        )

        presents = []
        for layer_idx, (layer, past_key_value) in enumerate(zip(self.model.layers, past_key_values)):
            hidden_states, present = layer(
                hidden_states,
                position_embeddings,
                past_key_value=past_key_value,
                use_cache=use_cache,
                attention_mask=attention_mask
            )
            presents.append(present)

        hidden_states = self.model.norm(hidden_states)

        # Summed MoE load-balancing loss (zero scalar when no MoE layers).
        aux_loss = sum([l.mlp.aux_loss for l in self.model.layers if isinstance(l.mlp, MOEFeedForward)], hidden_states.new_zeros(1).squeeze())
        slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
        logits = self.lm_head(hidden_states[:, slice_indices, :])

        loss = None
        if labels is not None:
            # Standard next-token shift with ignore_index=-100.
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            loss = F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1), ignore_index=-100)

        output = MoeCausalLMOutputWithPast(loss=loss, aux_loss=aux_loss, logits=logits, past_key_values=presents, hidden_states=hidden_states)
        return output
|
pytorch_model.bin
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:05bdaaa8c3dae0ef58d0e0afb4ba8089d3674cafc85f706fe7859d332c1cbb18
|
| 3 |
+
size 133756040
|
special_tokens_map.json
ADDED
|
@@ -0,0 +1,52 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"additional_special_tokens": [
|
| 3 |
+
"<|im_start|>",
|
| 4 |
+
"<|im_end|>",
|
| 5 |
+
"<|object_ref_start|>",
|
| 6 |
+
"<|object_ref_end|>",
|
| 7 |
+
"<|box_start|>",
|
| 8 |
+
"<|box_end|>",
|
| 9 |
+
"<|quad_start|>",
|
| 10 |
+
"<|quad_end|>",
|
| 11 |
+
"<|vision_start|>",
|
| 12 |
+
"<|vision_end|>",
|
| 13 |
+
"<|vision_pad|>",
|
| 14 |
+
"<|image_pad|>",
|
| 15 |
+
"<|video_pad|>",
|
| 16 |
+
"<|audio_start|>",
|
| 17 |
+
"<|audio_end|>",
|
| 18 |
+
"<|audio_pad|>",
|
| 19 |
+
"<tts_pad>",
|
| 20 |
+
"<tts_text_bos>",
|
| 21 |
+
"<tts_text_eod>",
|
| 22 |
+
"<tts_text_bos_single>"
|
| 23 |
+
],
|
| 24 |
+
"bos_token": {
|
| 25 |
+
"content": "<|im_start|>",
|
| 26 |
+
"lstrip": false,
|
| 27 |
+
"normalized": false,
|
| 28 |
+
"rstrip": false,
|
| 29 |
+
"single_word": false
|
| 30 |
+
},
|
| 31 |
+
"eos_token": {
|
| 32 |
+
"content": "<|im_end|>",
|
| 33 |
+
"lstrip": false,
|
| 34 |
+
"normalized": false,
|
| 35 |
+
"rstrip": false,
|
| 36 |
+
"single_word": false
|
| 37 |
+
},
|
| 38 |
+
"pad_token": {
|
| 39 |
+
"content": "<|endoftext|>",
|
| 40 |
+
"lstrip": false,
|
| 41 |
+
"normalized": false,
|
| 42 |
+
"rstrip": false,
|
| 43 |
+
"single_word": false
|
| 44 |
+
},
|
| 45 |
+
"unk_token": {
|
| 46 |
+
"content": "<|endoftext|>",
|
| 47 |
+
"lstrip": false,
|
| 48 |
+
"normalized": false,
|
| 49 |
+
"rstrip": false,
|
| 50 |
+
"single_word": false
|
| 51 |
+
}
|
| 52 |
+
}
|
tokenizer.json
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
tokenizer_config.json
ADDED
|
@@ -0,0 +1,335 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"add_bos_token": false,
|
| 3 |
+
"add_eos_token": false,
|
| 4 |
+
"add_prefix_space": false,
|
| 5 |
+
"added_tokens_decoder": {
|
| 6 |
+
"0": {
|
| 7 |
+
"content": "<|endoftext|>",
|
| 8 |
+
"lstrip": false,
|
| 9 |
+
"normalized": false,
|
| 10 |
+
"rstrip": false,
|
| 11 |
+
"single_word": false,
|
| 12 |
+
"special": true
|
| 13 |
+
},
|
| 14 |
+
"1": {
|
| 15 |
+
"content": "<|im_start|>",
|
| 16 |
+
"lstrip": false,
|
| 17 |
+
"normalized": false,
|
| 18 |
+
"rstrip": false,
|
| 19 |
+
"single_word": false,
|
| 20 |
+
"special": true
|
| 21 |
+
},
|
| 22 |
+
"2": {
|
| 23 |
+
"content": "<|im_end|>",
|
| 24 |
+
"lstrip": false,
|
| 25 |
+
"normalized": false,
|
| 26 |
+
"rstrip": false,
|
| 27 |
+
"single_word": false,
|
| 28 |
+
"special": true
|
| 29 |
+
},
|
| 30 |
+
"3": {
|
| 31 |
+
"content": "<|object_ref_start|>",
|
| 32 |
+
"lstrip": false,
|
| 33 |
+
"normalized": false,
|
| 34 |
+
"rstrip": false,
|
| 35 |
+
"single_word": false,
|
| 36 |
+
"special": true
|
| 37 |
+
},
|
| 38 |
+
"4": {
|
| 39 |
+
"content": "<|object_ref_end|>",
|
| 40 |
+
"lstrip": false,
|
| 41 |
+
"normalized": false,
|
| 42 |
+
"rstrip": false,
|
| 43 |
+
"single_word": false,
|
| 44 |
+
"special": true
|
| 45 |
+
},
|
| 46 |
+
"5": {
|
| 47 |
+
"content": "<|box_start|>",
|
| 48 |
+
"lstrip": false,
|
| 49 |
+
"normalized": false,
|
| 50 |
+
"rstrip": false,
|
| 51 |
+
"single_word": false,
|
| 52 |
+
"special": true
|
| 53 |
+
},
|
| 54 |
+
"6": {
|
| 55 |
+
"content": "<|box_end|>",
|
| 56 |
+
"lstrip": false,
|
| 57 |
+
"normalized": false,
|
| 58 |
+
"rstrip": false,
|
| 59 |
+
"single_word": false,
|
| 60 |
+
"special": true
|
| 61 |
+
},
|
| 62 |
+
"7": {
|
| 63 |
+
"content": "<|quad_start|>",
|
| 64 |
+
"lstrip": false,
|
| 65 |
+
"normalized": false,
|
| 66 |
+
"rstrip": false,
|
| 67 |
+
"single_word": false,
|
| 68 |
+
"special": true
|
| 69 |
+
},
|
| 70 |
+
"8": {
|
| 71 |
+
"content": "<|quad_end|>",
|
| 72 |
+
"lstrip": false,
|
| 73 |
+
"normalized": false,
|
| 74 |
+
"rstrip": false,
|
| 75 |
+
"single_word": false,
|
| 76 |
+
"special": true
|
| 77 |
+
},
|
| 78 |
+
"9": {
|
| 79 |
+
"content": "<|vision_start|>",
|
| 80 |
+
"lstrip": false,
|
| 81 |
+
"normalized": false,
|
| 82 |
+
"rstrip": false,
|
| 83 |
+
"single_word": false,
|
| 84 |
+
"special": true
|
| 85 |
+
},
|
| 86 |
+
"10": {
|
| 87 |
+
"content": "<|vision_end|>",
|
| 88 |
+
"lstrip": false,
|
| 89 |
+
"normalized": false,
|
| 90 |
+
"rstrip": false,
|
| 91 |
+
"single_word": false,
|
| 92 |
+
"special": true
|
| 93 |
+
},
|
| 94 |
+
"11": {
|
| 95 |
+
"content": "<|vision_pad|>",
|
| 96 |
+
"lstrip": false,
|
| 97 |
+
"normalized": false,
|
| 98 |
+
"rstrip": false,
|
| 99 |
+
"single_word": false,
|
| 100 |
+
"special": true
|
| 101 |
+
},
|
| 102 |
+
"12": {
|
| 103 |
+
"content": "<|image_pad|>",
|
| 104 |
+
"lstrip": false,
|
| 105 |
+
"normalized": false,
|
| 106 |
+
"rstrip": false,
|
| 107 |
+
"single_word": false,
|
| 108 |
+
"special": true
|
| 109 |
+
},
|
| 110 |
+
"13": {
|
| 111 |
+
"content": "<|video_pad|>",
|
| 112 |
+
"lstrip": false,
|
| 113 |
+
"normalized": false,
|
| 114 |
+
"rstrip": false,
|
| 115 |
+
"single_word": false,
|
| 116 |
+
"special": true
|
| 117 |
+
},
|
| 118 |
+
"14": {
|
| 119 |
+
"content": "<|audio_start|>",
|
| 120 |
+
"lstrip": false,
|
| 121 |
+
"normalized": false,
|
| 122 |
+
"rstrip": false,
|
| 123 |
+
"single_word": false,
|
| 124 |
+
"special": true
|
| 125 |
+
},
|
| 126 |
+
"15": {
|
| 127 |
+
"content": "<|audio_end|>",
|
| 128 |
+
"lstrip": false,
|
| 129 |
+
"normalized": false,
|
| 130 |
+
"rstrip": false,
|
| 131 |
+
"single_word": false,
|
| 132 |
+
"special": true
|
| 133 |
+
},
|
| 134 |
+
"16": {
|
| 135 |
+
"content": "<|audio_pad|>",
|
| 136 |
+
"lstrip": false,
|
| 137 |
+
"normalized": false,
|
| 138 |
+
"rstrip": false,
|
| 139 |
+
"single_word": false,
|
| 140 |
+
"special": true
|
| 141 |
+
},
|
| 142 |
+
"17": {
|
| 143 |
+
"content": "<tts_pad>",
|
| 144 |
+
"lstrip": false,
|
| 145 |
+
"normalized": false,
|
| 146 |
+
"rstrip": false,
|
| 147 |
+
"single_word": false,
|
| 148 |
+
"special": true
|
| 149 |
+
},
|
| 150 |
+
"18": {
|
| 151 |
+
"content": "<tts_text_bos>",
|
| 152 |
+
"lstrip": false,
|
| 153 |
+
"normalized": false,
|
| 154 |
+
"rstrip": false,
|
| 155 |
+
"single_word": false,
|
| 156 |
+
"special": true
|
| 157 |
+
},
|
| 158 |
+
"19": {
|
| 159 |
+
"content": "<tts_text_eod>",
|
| 160 |
+
"lstrip": false,
|
| 161 |
+
"normalized": false,
|
| 162 |
+
"rstrip": false,
|
| 163 |
+
"single_word": false,
|
| 164 |
+
"special": true
|
| 165 |
+
},
|
| 166 |
+
"20": {
|
| 167 |
+
"content": "<tts_text_bos_single>",
|
| 168 |
+
"lstrip": false,
|
| 169 |
+
"normalized": false,
|
| 170 |
+
"rstrip": false,
|
| 171 |
+
"single_word": false,
|
| 172 |
+
"special": true
|
| 173 |
+
},
|
| 174 |
+
"21": {
|
| 175 |
+
"content": "<tool_call>",
|
| 176 |
+
"lstrip": false,
|
| 177 |
+
"normalized": false,
|
| 178 |
+
"rstrip": false,
|
| 179 |
+
"single_word": false,
|
| 180 |
+
"special": false
|
| 181 |
+
},
|
| 182 |
+
"22": {
|
| 183 |
+
"content": "</tool_call>",
|
| 184 |
+
"lstrip": false,
|
| 185 |
+
"normalized": false,
|
| 186 |
+
"rstrip": false,
|
| 187 |
+
"single_word": false,
|
| 188 |
+
"special": false
|
| 189 |
+
},
|
| 190 |
+
"23": {
|
| 191 |
+
"content": "<tool_response>",
|
| 192 |
+
"lstrip": false,
|
| 193 |
+
"normalized": false,
|
| 194 |
+
"rstrip": false,
|
| 195 |
+
"single_word": false,
|
| 196 |
+
"special": false
|
| 197 |
+
},
|
| 198 |
+
"24": {
|
| 199 |
+
"content": "</tool_response>",
|
| 200 |
+
"lstrip": false,
|
| 201 |
+
"normalized": false,
|
| 202 |
+
"rstrip": false,
|
| 203 |
+
"single_word": false,
|
| 204 |
+
"special": false
|
| 205 |
+
},
|
| 206 |
+
"25": {
|
| 207 |
+
"content": "<think>",
|
| 208 |
+
"lstrip": false,
|
| 209 |
+
"normalized": false,
|
| 210 |
+
"rstrip": false,
|
| 211 |
+
"single_word": false,
|
| 212 |
+
"special": false
|
| 213 |
+
},
|
| 214 |
+
"26": {
|
| 215 |
+
"content": "</think>",
|
| 216 |
+
"lstrip": false,
|
| 217 |
+
"normalized": false,
|
| 218 |
+
"rstrip": false,
|
| 219 |
+
"single_word": false,
|
| 220 |
+
"special": false
|
| 221 |
+
},
|
| 222 |
+
"27": {
|
| 223 |
+
"content": "<|buffer1|>",
|
| 224 |
+
"lstrip": false,
|
| 225 |
+
"normalized": false,
|
| 226 |
+
"rstrip": false,
|
| 227 |
+
"single_word": false,
|
| 228 |
+
"special": false
|
| 229 |
+
},
|
| 230 |
+
"28": {
|
| 231 |
+
"content": "<|buffer2|>",
|
| 232 |
+
"lstrip": false,
|
| 233 |
+
"normalized": false,
|
| 234 |
+
"rstrip": false,
|
| 235 |
+
"single_word": false,
|
| 236 |
+
"special": false
|
| 237 |
+
},
|
| 238 |
+
"29": {
|
| 239 |
+
"content": "<|buffer3|>",
|
| 240 |
+
"lstrip": false,
|
| 241 |
+
"normalized": false,
|
| 242 |
+
"rstrip": false,
|
| 243 |
+
"single_word": false,
|
| 244 |
+
"special": false
|
| 245 |
+
},
|
| 246 |
+
"30": {
|
| 247 |
+
"content": "<|buffer4|>",
|
| 248 |
+
"lstrip": false,
|
| 249 |
+
"normalized": false,
|
| 250 |
+
"rstrip": false,
|
| 251 |
+
"single_word": false,
|
| 252 |
+
"special": false
|
| 253 |
+
},
|
| 254 |
+
"31": {
|
| 255 |
+
"content": "<|buffer5|>",
|
| 256 |
+
"lstrip": false,
|
| 257 |
+
"normalized": false,
|
| 258 |
+
"rstrip": false,
|
| 259 |
+
"single_word": false,
|
| 260 |
+
"special": false
|
| 261 |
+
},
|
| 262 |
+
"32": {
|
| 263 |
+
"content": "<|buffer6|>",
|
| 264 |
+
"lstrip": false,
|
| 265 |
+
"normalized": false,
|
| 266 |
+
"rstrip": false,
|
| 267 |
+
"single_word": false,
|
| 268 |
+
"special": false
|
| 269 |
+
},
|
| 270 |
+
"33": {
|
| 271 |
+
"content": "<|buffer7|>",
|
| 272 |
+
"lstrip": false,
|
| 273 |
+
"normalized": false,
|
| 274 |
+
"rstrip": false,
|
| 275 |
+
"single_word": false,
|
| 276 |
+
"special": false
|
| 277 |
+
},
|
| 278 |
+
"34": {
|
| 279 |
+
"content": "<|buffer8|>",
|
| 280 |
+
"lstrip": false,
|
| 281 |
+
"normalized": false,
|
| 282 |
+
"rstrip": false,
|
| 283 |
+
"single_word": false,
|
| 284 |
+
"special": false
|
| 285 |
+
},
|
| 286 |
+
"35": {
|
| 287 |
+
"content": "<|buffer9|>",
|
| 288 |
+
"lstrip": false,
|
| 289 |
+
"normalized": false,
|
| 290 |
+
"rstrip": false,
|
| 291 |
+
"single_word": false,
|
| 292 |
+
"special": false
|
| 293 |
+
}
|
| 294 |
+
},
|
| 295 |
+
"additional_special_tokens": [
|
| 296 |
+
"<|im_start|>",
|
| 297 |
+
"<|im_end|>",
|
| 298 |
+
"<|object_ref_start|>",
|
| 299 |
+
"<|object_ref_end|>",
|
| 300 |
+
"<|box_start|>",
|
| 301 |
+
"<|box_end|>",
|
| 302 |
+
"<|quad_start|>",
|
| 303 |
+
"<|quad_end|>",
|
| 304 |
+
"<|vision_start|>",
|
| 305 |
+
"<|vision_end|>",
|
| 306 |
+
"<|vision_pad|>",
|
| 307 |
+
"<|image_pad|>",
|
| 308 |
+
"<|video_pad|>",
|
| 309 |
+
"<|audio_start|>",
|
| 310 |
+
"<|audio_end|>",
|
| 311 |
+
"<|audio_pad|>",
|
| 312 |
+
"<tts_pad>",
|
| 313 |
+
"<tts_text_bos>",
|
| 314 |
+
"<tts_text_eod>",
|
| 315 |
+
"<tts_text_bos_single>"
|
| 316 |
+
],
|
| 317 |
+
"audio_bos_token": "<|audio_start|>",
|
| 318 |
+
"audio_eos_token": "<|audio_end|>",
|
| 319 |
+
"audio_token": "<|audio_pad|>",
|
| 320 |
+
"bos_token": "<|im_start|>",
|
| 321 |
+
"clean_up_tokenization_spaces": false,
|
| 322 |
+
"eos_token": "<|im_end|>",
|
| 323 |
+
"extra_special_tokens": {},
|
| 324 |
+
"image_token": "<|image_pad|>",
|
| 325 |
+
"legacy": true,
|
| 326 |
+
"model_max_length": 131072,
|
| 327 |
+
"pad_token": "<|endoftext|>",
|
| 328 |
+
"sp_model_kwargs": {},
|
| 329 |
+
"spaces_between_special_tokens": false,
|
| 330 |
+
"tokenizer_class": "PreTrainedTokenizerFast",
|
| 331 |
+
"unk_token": "<|endoftext|>",
|
| 332 |
+
"video_token": "<|video_pad|>",
|
| 333 |
+
"vision_bos_token": "<|vision_start|>",
|
| 334 |
+
"vision_eos_token": "<|vision_end|>"
|
| 335 |
+
}
|