---
title: README
emoji: 
colorFrom: purple
colorTo: pink
sdk: static
pinned: false
---

# NanoVLM Speedrun
> The most striking thing about the [modded-nanogpt](https://github.com/karpathy/modded-nanogpt) experiments is that they expose how much of deep learning is just bloat. 
> To apply this to Vision-Language Models (VLMs), you have to stop acting like a researcher and start acting like a hacker. You aren't trying to follow academic standards; you are trying to maximize the movement of bits through silicon.

We introduce **NanoVLM Speedrun**: a minimalist VLM recipe designed to strip away the bloat. We provide the bare-minimum components required to bridge the training and evaluation pipeline, enabling lightning-fast iteration and reproduction.

## The Recipe (2026H1)
- **LLM**: [`Qwen/Qwen3-0.6B`](https://huggingface.co/Qwen/Qwen3-0.6B)
- **Vision Encoder**: [`google/siglip2-so400m-patch16-naflex`](https://huggingface.co/google/siglip2-so400m-patch16-naflex)
- **Projector**: Classic [LLaVA](https://arxiv.org/abs/2310.03744)-style **2-layer MLP**
- **Training Paradigm**: A streamlined two-stage approach:
  - **Stage 1**: Projector-only alignment (tuning the projector between vision and language).
  - **Stage 2**: End-to-end instruction tuning (tuning both the projector and the LLM).
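
The recipe above can be sketched in a few lines of PyTorch. This is a hedged illustration, not the repo's implementation: the hidden sizes (`vision_dim`, `llm_dim`) and the `vision_encoder` / `projector` / `llm` attribute names are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """LLaVA-style 2-layer MLP mapping vision features into the LLM embedding space.
    The default dimensions are illustrative assumptions, not the repo's exact values."""
    def __init__(self, vision_dim=1152, llm_dim=1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x):
        # x: (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(x)

def set_stage(model, stage):
    """Stage 1: train only the projector. Stage 2: train projector + LLM.
    The vision encoder stays frozen in both stages (an assumption of this sketch)."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    for p in model.projector.parameters():
        p.requires_grad = True
    for p in model.llm.parameters():
        p.requires_grad = (stage == 2)
```

With a wrapper model exposing those three submodules, `set_stage(model, 1)` before alignment and `set_stage(model, 2)` before instruction tuning is the whole paradigm.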

## Data Preparation
We utilize the curated [LMMs-Lab-Speedrun/Data_NanoVLM](https://huggingface.co/datasets/LMMs-Lab-Speedrun/Data_NanoVLM) collection.
- **Stage 1**: From [liuhaotian/LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain)
- **Stage 2**: From [lmms-lab/LLaVA-NeXT-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data) (Note: we explicitly filtered out excessively long samples to maintain training efficiency).
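
The length filter can be sketched as a predicate passed to `datasets.Dataset.filter`. This is a hypothetical reconstruction: the token budget (`max_tokens`), the `conversations`/`value` field names, and the token-counting function are assumptions; the repo's actual cutoff and schema may differ.

```python
MAX_TOKENS = 4096  # hypothetical budget; the repo's actual cutoff is not stated here

def short_enough(sample, count_tokens, max_tokens=MAX_TOKENS):
    """Keep a sample only if its full conversation fits within the token budget.
    `count_tokens` is any callable mapping text -> token count."""
    text = " ".join(turn["value"] for turn in sample["conversations"])
    return count_tokens(text) <= max_tokens

# Usage sketch with a real tokenizer (requires network access):
# from transformers import AutoTokenizer
# from datasets import load_dataset
# tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
# ds = load_dataset("lmms-lab/LLaVA-NeXT-Data", split="train")
# ds = ds.filter(lambda s: short_enough(s, lambda t: len(tok(t).input_ids)))
```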

For more information about training, please refer to [NanoVLM Speedrun](https://github.com/EvolvingLMMs-Lab/lmms-engine/tree/main/examples/nanovlm).