---
title: README
emoji: 
colorFrom: purple
colorTo: pink
sdk: static
pinned: false
---

# NanoVLM Speedrun
> The most striking thing about the [modded-nanogpt](https://github.com/karpathy/modded-nanogpt) experiments is that they expose how much of deep learning is just bloat. 
> To apply this to Vision-Language Models (VLMs), you have to stop acting like a researcher and start acting like a hacker. You aren't trying to follow academic standards; you are trying to maximize the movement of bits through silicon.

We introduce **NanoVLM Speedrun**: a minimalist VLM recipe designed to strip away the bloat. We provide the bare-minimum components required to bridge the training and evaluation pipeline, enabling lightning-fast iteration and reproduction.

## The Recipe (2026H1)
- **LLM**: [`Qwen/Qwen3-0.6B`](https://huggingface.co/Qwen/Qwen3-0.6B)
- **Vision Encoder**: [`google/siglip2-so400m-patch16-naflex`](https://huggingface.co/google/siglip2-so400m-patch16-naflex)
- **Projector**: Classic [LLaVA](https://arxiv.org/abs/2310.03744)-style **2-layer MLP**
- **Training Paradigm**: A streamlined two-stage approach:
  - **Stage 1**: Projector-only alignment (tuning the projector between vision and language).
  - **Stage 2**: End-to-end instruction tuning (tuning both the projector and the LLM).
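
The recipe above can be sketched in a few lines of PyTorch. This is a hedged illustration, not the repo's implementation: the hidden sizes (`vision_dim`, `llm_dim`) and the `vision_encoder` / `projector` / `llm` attribute names are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """LLaVA-style 2-layer MLP mapping vision features into the LLM embedding space.
    The default dimensions are illustrative assumptions, not the repo's exact values."""
    def __init__(self, vision_dim=1152, llm_dim=1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x):
        # x: (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(x)

def set_stage(model, stage):
    """Stage 1: train only the projector. Stage 2: train projector + LLM.
    The vision encoder stays frozen in both stages (an assumption of this sketch)."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    for p in model.projector.parameters():
        p.requires_grad = True
    for p in model.llm.parameters():
        p.requires_grad = (stage == 2)
```

With a wrapper model exposing those three submodules, `set_stage(model, 1)` before alignment and `set_stage(model, 2)` before instruction tuning is the whole paradigm.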

## Data Preparation
We utilize the curated [LMMs-Lab-Speedrun/Data_NanoVLM](https://huggingface.co/datasets/LMMs-Lab-Speedrun/Data_NanoVLM) collection.
- **Stage 1**: From [liuhaotian/LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain)
- **Stage 2**: From [lmms-lab/LLaVA-NeXT-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data) (Note: we explicitly filtered out excessively long samples to maintain training efficiency).
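
The length filter can be sketched as a predicate passed to `datasets.Dataset.filter`. This is a hypothetical reconstruction: the token budget (`max_tokens`), the `conversations`/`value` field names, and the token-counting function are assumptions; the repo's actual cutoff and schema may differ.

```python
MAX_TOKENS = 4096  # hypothetical budget; the repo's actual cutoff is not stated here

def short_enough(sample, count_tokens, max_tokens=MAX_TOKENS):
    """Keep a sample only if its full conversation fits within the token budget.
    `count_tokens` is any callable mapping text -> token count."""
    text = " ".join(turn["value"] for turn in sample["conversations"])
    return count_tokens(text) <= max_tokens

# Usage sketch with a real tokenizer (requires network access):
# from transformers import AutoTokenizer
# from datasets import load_dataset
# tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
# ds = load_dataset("lmms-lab/LLaVA-NeXT-Data", split="train")
# ds = ds.filter(lambda s: short_enough(s, lambda t: len(tok(t).input_ids)))
```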

For more information about training, please refer to [NanoVLM Speedrun](https://github.com/EvolvingLMMs-Lab/lmms-engine/tree/main/examples/nanovlm).