Yuwei-Niu committed · Commit e73ddc6 · verified · 1 Parent(s): 3556d7e

Update README.md

Files changed (1): README.md (+131 −1)
pinned: false
---

# NanoVLM Speedrun

> The most striking thing about the [modded-nanogpt](https://github.com/karpathy/modded-nanogpt) experiments is that they expose how much of deep learning is just bloat.
> To apply this to Vision-Language Models (VLMs), you have to stop acting like a researcher and start acting like a hacker. You aren't trying to follow academic standards; you are trying to maximize the movement of bits through silicon.

We introduce **NanoVLM Speedrun**: a minimalist VLM recipe designed to strip away the bloat. We provide the bare-minimum components required to bridge the training and evaluation pipeline, enabling lightning-fast iteration and reproduction.

## The Recipe (2026H1)

- **LLM**: [`Qwen/Qwen3-0.6B`](https://huggingface.co/Qwen/Qwen3-0.6B)
- **Vision Encoder**: [`google/siglip2-so400m-patch16-naflex`](https://huggingface.co/google/siglip2-so400m-patch16-naflex)
- **Projector**: Classic [LLaVA](https://arxiv.org/abs/2310.03744)-style **2-layer MLP**
- **Training Paradigm**: A streamlined two-stage approach:
  - **Stage 1**: Projector-only alignment (tuning only the projector between vision and language; encoder and LLM stay frozen).
  - **Stage 2**: End-to-end instruction tuning (tuning both the projector and the LLM).
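The projector is the only component built from scratch in this recipe. A minimal, dependency-free sketch of a LLaVA-style 2-layer MLP (the toy dimensions and random weights below are illustrative assumptions; the real projector maps SigLIP2 patch features into the Qwen3 embedding space):

```python
import math
import random

def linear(x, w, b):
    # x: input vector; w: [out][in] weight matrix; b: [out] bias vector
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def gelu(v):
    # Exact GELU via the Gauss error function
    return [0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0))) for x in v]

def mlp_projector(vision_feat, w1, b1, w2, b2):
    """LLaVA-style 2-layer MLP: Linear -> GELU -> Linear."""
    return linear(gelu(linear(vision_feat, w1, b1)), w2, b2)

# Toy sizes; real models use much larger dims (e.g. ~1152 -> ~1024).
vision_dim, llm_dim = 4, 6
rng = random.Random(0)
w1 = [[rng.gauss(0, 0.02) for _ in range(vision_dim)] for _ in range(llm_dim)]
b1 = [0.0] * llm_dim
w2 = [[rng.gauss(0, 0.02) for _ in range(llm_dim)] for _ in range(llm_dim)]
b2 = [0.0] * llm_dim

out = mlp_projector([1.0, -0.5, 0.3, 0.7], w1, b1, w2, b2)
print(len(out))  # 6 — one projected feature vector in the LLM's embedding dim
```

In Stage 1 only these two weight matrices (and biases) receive gradients; everything else is frozen.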

## Data Preparation

We use the curated [LMMs-Lab-Speedrun/Data_NanoVLM](https://huggingface.co/datasets/LMMs-Lab-Speedrun/Data_NanoVLM) collection.

- **Stage 1**: From [liuhaotian/LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain)
- **Stage 2**: From [lmms-lab/LLaVA-NeXT-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data) (note: we explicitly filtered out excessively long samples to maintain training efficiency)

### Dataset YAML Configuration

Configure your local paths in the YAML files as shown below:

#### Stage 1 YAML (Example)

```yaml
datasets:
  - path: LMMs-Lab-Speedrun/Data_NanoVLM/Stage1-LLaVA-Pretrain/Lmms_format_blip_laion_cc_sbu_558k.json
    data_folder: path/to/Stage1-LLaVA-Pretrain/Image
    data_type: json
```

#### Stage 2 YAML (Example)

```yaml
datasets:
  - path: LMMs-Lab-Speedrun/Data_NanoVLM/Stage2-LLaVA-NeXT-Data/llava_next_Lmms_format_processed.json
    data_folder: path/to/LLaVA-NeXT-Data/Images
    data_type: json
```
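The `path` entries above point at JSON annotation files. A small sanity-check sketch for such files — the exact schema of the `Lmms_format` JSONs is an assumption here, based on the standard LLaVA convention of a list of samples with `image` and `conversations` fields:

```python
import json

def check_samples(samples):
    """Validate LLaVA-style annotation records (assumed schema)."""
    for i, s in enumerate(samples):
        assert "conversations" in s, f"sample {i}: missing 'conversations'"
        for turn in s["conversations"]:
            assert turn["from"] in ("human", "gpt"), f"sample {i}: bad role"
            assert isinstance(turn["value"], str), f"sample {i}: bad value"
    return len(samples)

# A toy record in the assumed format; real files hold hundreds of thousands.
toy = [{"image": "00000/1.jpg",
        "conversations": [
            {"from": "human", "value": "<image>\nDescribe the image."},
            {"from": "gpt", "value": "A dog running on a beach."}]}]

# Round-trip through JSON, as a loader would see it from disk.
print(check_samples(json.loads(json.dumps(toy))))  # 1
```

Running a check like this before training catches malformed samples early, which matters when Stage 2 filtering has already rewritten the annotation file.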

## Execution

### 0. Installation & Initialization

For environment setup, please refer to the [lmms-engine Quick Start](https://github.com/EvolvingLMMs-Lab/lmms-engine?tab=readme-ov-file#-quick-start).

Download and use [NanoVLM_Init](https://huggingface.co/datasets/LMMs-Lab-Speedrun/NanoVLM_Init) for Stage 1 initialization.

### 1. Stage 1: Pre-training

```bash
bash ./examples/nanovlm/stage1_nanovlm_train.sh
```

### 2. Merge Stage 1 Checkpoint

```bash
python -m lmms_engine.merger \
    --checkpoint_path ./output/nanovlm_stage1/checkpoint-2180 \
    --output_path ./output/nanovlm_stage1/checkpoint-2180-merged
```

### 3. Stage 2: Instruction Tuning

```bash
export DATASET_PATH="/path/to/stage2_llava_next.yaml"
bash ./examples/nanovlm/stage2_nanovlm_train.sh
```

### 4. Merge Stage 2 Checkpoint

```bash
python -m lmms_engine.merger \
    --checkpoint_path ./output/nanovlm_stage2/checkpoint-11540 \
    --output_path ./output/nanovlm_stage2/checkpoint-11540-merged
```

## Evaluation (lmms-eval)

```bash
git clone -b dev-v0.7 https://github.com/EvolvingLMMs-Lab/lmms-eval.git
cd lmms-eval
```

Run the evaluation (replace `pretrained=...` with the path to your merged weights):

```bash
# Multi-GPU asynchronous evaluation
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m lmms_eval \
    --model nanovlm \
    --model_args pretrained=./output/nanovlm_stage2/checkpoint-11540-merged \
    --tasks mme \
    --batch_size 1
```

## Results

### Training Overhead

| Stage | Total FLOPs | Energy | CO₂ Emissions | GPU Hours (H100) |
|---------|---------------|-------------|---------|-------|
| Stage 1 | 236.79 PFLOPs | 13.5221 kWh | 6.42 kg | 19.32 |
| Stage 2 | 98.23 PFLOPs | 3.1006 kWh | 1.47 kg | 4.43 |
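A quick sanity check on the table: energy divided by GPU-hours gives the implied average power draw per H100, which should land near the card's ~700 W TDP for a compute-bound run:

```python
# Implied average power per GPU = energy (kWh) / GPU-hours, from the table above.
overhead = {"Stage 1": (13.5221, 19.32), "Stage 2": (3.1006, 4.43)}
for stage, (kwh, gpu_hours) in overhead.items():
    print(stage, round(kwh / gpu_hours, 3), "kW")  # ~0.7 kW for both stages
```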

### Benchmark Scores

| Benchmark | Score |
|-----------|-------|
| MME | 1204.46 (P: 948.75, C: 255.71) |
| MMMU (val) | TBD |
| MMBench (EN Dev) | TBD |
| OCRBench | TBD |
| BLINK | TBD |
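For reference, the headline MME number is simply the sum of the Perception and Cognition subscores:

```python
# MME reports Perception (P) and Cognition (C) subscores; total = P + C.
perception, cognition = 948.75, 255.71
total = perception + cognition
print(round(total, 2))  # 1204.46
```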

## Release Notes

- [2026.02.27] Initial NanoVLM recipe released.

## TODOs

- [x] Publish Stage 1 & Stage 2 training scripts.
- [x] Publish evaluation scripts.
- [ ] Add more benchmark results (MMMU, OCRBench, BLINK).
- [ ] Optimize the training framework.