Commit 7b3952b (verified) by Yuwei-Niu · Parent(s): e73ddc6

Update README.md

Files changed (1): README.md (+1 −113)
---

# NanoVLM Speedrun

> The most striking thing about the [modded-nanogpt](https://github.com/karpathy/modded-nanogpt) experiments is that they expose how much of deep learning is just bloat.
> To apply this to Vision-Language Models (VLMs), you have to stop acting like a researcher and start acting like a hacker. You aren't trying to follow academic standards; you are trying to maximize the movement of bits through silicon.

We introduce **NanoVLM Speedrun**: a minimalist VLM recipe designed to strip away the bloat. We provide the bare-minimum components required to bridge the training and evaluation pipeline, enabling lightning-fast iteration and reproduction.

## The Recipe (2026H1)

- **LLM**: [`Qwen/Qwen3-0.6B`](https://huggingface.co/Qwen/Qwen3-0.6B)
- **Vision Encoder**: [`google/siglip2-so400m-patch16-naflex`](https://huggingface.co/google/siglip2-so400m-patch16-naflex)
- **Projector**: Classic [LLaVA](https://arxiv.org/abs/2310.03744)-style **2-layer MLP**
- **Stage 1**: Pre-training (aligning the projector; the LLM is kept frozen).
- **Stage 2**: End-to-end instruction tuning (tuning both the projector and the LLM).

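As a back-of-the-envelope check, the projector is tiny relative to the rest of the stack. A minimal sketch of its parameter budget, assuming hidden sizes of 1152 for SigLIP2-so400m and 1024 for Qwen3-0.6B (these values are not stated above; verify them against the actual model configs):

```python
def mlp_projector_params(d_vision: int, d_llm: int) -> int:
    """Parameter count of a LLaVA-style 2-layer MLP projector:
    Linear(d_vision -> d_llm) -> GELU -> Linear(d_llm -> d_llm), with biases."""
    layer1 = d_vision * d_llm + d_llm   # weight + bias
    layer2 = d_llm * d_llm + d_llm
    return layer1 + layer2

# Assumed hidden sizes: SigLIP2-so400m ~1152, Qwen3-0.6B ~1024.
print(mlp_projector_params(1152, 1024))  # 2230272, i.e. ~2.2M parameters
```

At roughly 2M trainable parameters, the projector is a rounding error next to the 0.6B-parameter LLM, which is what keeps the Stage 1 alignment step cheap.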
## Data Preparation

We utilize the curated [LMMs-Lab-Speedrun/Data_NanoVLM](https://huggingface.co/datasets/LMMs-Lab-Speedrun/Data_NanoVLM) collection.

- **Stage 1**: From [liuhaotian/LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain)
- **Stage 2**: From [lmms-lab/LLaVA-NeXT-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data) (Note: we explicitly filtered out excessively long samples to maintain training efficiency.)

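The long-sample filter mentioned above can be sketched as follows. This is a hypothetical reconstruction for LLaVA-format records (`{"id", "image", "conversations": [...]}`); the actual cutoff used upstream is not published, so the 8192-character threshold here is an assumption:

```python
MAX_CHARS = 8192  # assumed cutoff; the real filter's threshold is not published

def total_text_len(sample: dict) -> int:
    """Sum the text length over all conversation turns in one sample."""
    return sum(len(turn.get("value", "")) for turn in sample.get("conversations", []))

def filter_long_samples(samples: list, max_chars: int = MAX_CHARS) -> list:
    """Drop samples whose total conversation text exceeds max_chars."""
    return [s for s in samples if total_text_len(s) <= max_chars]

samples = [
    {"id": "short", "conversations": [{"from": "human", "value": "Describe the image."}]},
    {"id": "long", "conversations": [{"from": "gpt", "value": "x" * 20000}]},
]
print([s["id"] for s in filter_long_samples(samples)])  # ['short']
```

A character-count proxy avoids tokenizing the whole corpus twice; filtering on actual token counts would be stricter but slower.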
### Dataset YAML Configuration

Configure your local paths in the YAML files as shown below:

#### Stage 1 YAML (Example)

```yaml
datasets:
  - path: LMMs-Lab-Speedrun/Data_NanoVLM/Stage1-LLaVA-Pretrain/Lmms_format_blip_laion_cc_sbu_558k.json
    data_folder: path/to/Stage1-LLaVA-Pretrain/Image
    data_type: json
```

#### Stage 2 YAML (Example)

```yaml
datasets:
  - path: LMMs-Lab-Speedrun/Data_NanoVLM/Stage2-LLaVA-NeXT-Data/llava_next_Lmms_format_processed.json
    data_folder: path/to/LLaVA-NeXT-Data/Images
    data_type: json
```
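A few lines of stdlib Python can sanity-check a config before launching a run. The parser below handles only the flat two-level layout shown above (it is not a general YAML parser), and the validation checks are illustrative assumptions, not part of lmms-engine:

```python
def parse_datasets_yaml(text: str) -> list:
    """Minimal parser for the flat `datasets:` layout above -- NOT general YAML."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "datasets:":
            continue
        if line.startswith("- "):   # a new dataset entry begins
            entries.append({})
            line = line[2:]
        key, _, value = line.partition(":")
        entries[-1][key.strip()] = value.strip()
    return entries

cfg = """\
datasets:
  - path: LMMs-Lab-Speedrun/Data_NanoVLM/Stage1-LLaVA-Pretrain/Lmms_format_blip_laion_cc_sbu_558k.json
    data_folder: path/to/Stage1-LLaVA-Pretrain/Image
    data_type: json
"""
for entry in parse_datasets_yaml(cfg):
    # Illustrative checks; lmms-engine's own validation may differ.
    assert entry["data_type"] == "json"
    assert entry["path"].endswith(".json")
```

Catching a typo'd `data_folder` here is much cheaper than discovering it minutes into a multi-GPU launch.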

## Execution

### 0. Installation & Initialization

For environment setup, please refer to the [lmms-engine Quick Start](https://github.com/EvolvingLMMs-Lab/lmms-engine?tab=readme-ov-file#-quick-start).

Download and use [NanoVLM_Init](https://huggingface.co/datasets/LMMs-Lab-Speedrun/NanoVLM_Init) for Stage 1 initialization.

### 1. Stage 1: Pre-training

```bash
bash ./examples/nanovlm/stage1_nanovlm_train.sh
```

### 2. Merge Stage 1 Checkpoint

```bash
python -m lmms_engine.merger \
    --checkpoint_path ./output/nanovlm_stage1/checkpoint-2180 \
    --output_path ./output/nanovlm_stage1/checkpoint-2180-merged
```

### 3. Stage 2: Instruction Tuning

```bash
export DATASET_PATH="/path/to/stage2_llava_next.yaml"
bash ./examples/nanovlm/stage2_nanovlm_train.sh
```

### 4. Merge Stage 2 Checkpoint

```bash
python -m lmms_engine.merger \
    --checkpoint_path ./output/nanovlm_stage2/checkpoint-11540 \
    --output_path ./output/nanovlm_stage2/checkpoint-11540-merged
```

## Evaluation (lmms-eval)

```bash
git clone -b dev-v0.7 https://github.com/EvolvingLMMs-Lab/lmms-eval.git
cd lmms-eval
```

Run the evaluation (replace `pretrained=...` with the path to your merged weights):

```bash
# Multi-GPU asynchronous evaluation
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m lmms_eval \
    --model nanovlm \
    --model_args pretrained=./output/nanovlm_stage2/checkpoint-11540-merged \
    --tasks mme \
    --batch_size 1
```

## Results

### Training Overhead

| Stage | Total FLOPs | Energy | CO2 Emissions | GPU Hours (H100) |
|---------|---------------|-------------|---------|-------|
| Stage 1 | 236.79 PFLOPs | 13.5221 kWh | 6.42 kg | 19.32 |
| Stage 2 | 98.23 PFLOPs | 3.1006 kWh | 1.47 kg | 4.43 |
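A quick way to vet the overhead table is to check it for internal consistency: energy divided by GPU-hours gives the average draw per GPU, and emissions divided by energy gives the implied grid carbon intensity. Both should agree across the two stages:

```python
stages = {
    "Stage 1": {"kwh": 13.5221, "gpu_hours": 19.32, "co2_kg": 6.42},
    "Stage 2": {"kwh": 3.1006, "gpu_hours": 4.43, "co2_kg": 1.47},
}
for name, s in stages.items():
    avg_kw = s["kwh"] / s["gpu_hours"]   # average draw per GPU
    intensity = s["co2_kg"] / s["kwh"]   # implied kg CO2 per kWh
    print(f"{name}: {avg_kw:.2f} kW per GPU, {intensity:.2f} kg CO2/kWh")
```

Both stages come out at roughly 0.70 kW per GPU (close to an H100's rated power) and roughly 0.47 kg CO2/kWh, so the two rows were evidently measured the same way.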

### Benchmark Scores

| Benchmark | Score |
|-----------|-------|
| MME | 1204.46 (P: 948.75, C: 255.71) |
| MMMU (val) | TBD |
| MMBench (EN Dev) | TBD |
| OCRBench | TBD |
| BLINK | TBD |
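For reference, MME's headline number is the sum of its Perception and Cognition sub-scores, so the table's entry can be cross-checked directly:

```python
perception, cognition = 948.75, 255.71
total = perception + cognition
print(round(total, 2))  # 1204.46, matching the MME row above
```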

## Launch Preparation & Community Discussion

- [2026.02.27] Initial NanoVLM recipe released.

## List of TODOs

- [x] Publish Stage 1 & Stage 2 training scripts.
- [x] Publish evaluation scripts.
- [ ] Add more benchmark results (MMMU, OCRBench, BLINK).
- [ ] Optimize the training framework.
 
For more information about training, please refer to [NanoVLM Speedrun](https://github.com/EvolvingLMMs-Lab/lmms-engine/tree/main/examples/nanovlm).