# NanoVLM Speedrun
> The most striking thing about the [modded-nanogpt](https://github.com/karpathy/modded-nanogpt) experiments is that they expose how much of deep learning is just bloat.
> To apply this to Vision-Language Models (VLMs), you have to stop acting like a researcher and start acting like a hacker. You aren't trying to follow academic standards; you are trying to maximize the movement of bits through silicon.
We introduce **NanoVLM Speedrun**: a minimalist VLM recipe designed to strip away the bloat. We provide the bare-minimum components needed to connect the training and evaluation pipelines, enabling lightning-fast iteration and reproduction.
## The Recipe (2026H1)
- **LLM**: [`Qwen/Qwen3-0.6B`](https://huggingface.co/Qwen/Qwen3-0.6B)
- **Vision Encoder**: [`google/siglip2-so400m-patch16-naflex`](https://huggingface.co/google/siglip2-so400m-patch16-naflex)
- **Projector**: Classic [LLaVA](https://arxiv.org/abs/2310.03744)-style **2-layer MLP**
- **Stage 2**: End-to-end instruction tuning (tuning both the projector and the LLM).
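As a concrete mental model of the projector, here is a minimal LLaVA-style 2-layer MLP that maps vision-encoder features into the LLM's embedding space. This is only a sketch: the widths (1152 for SigLIP2-so400m features, 1024 for Qwen3-0.6B hidden states) and the GELU activation are our assumptions, not values pinned down by this README.

```python
import numpy as np

# Hypothetical widths: SigLIP2-so400m features (1152) -> Qwen3-0.6B hidden (1024).
VISION_DIM, LLM_DIM = 1152, 1024

rng = np.random.default_rng(0)
w1 = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.02
b1 = np.zeros(LLM_DIM)
w2 = rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.02
b2 = np.zeros(LLM_DIM)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def project(vision_tokens):
    # (num_patches, VISION_DIM) -> (num_patches, LLM_DIM): Linear -> GELU -> Linear
    return gelu(vision_tokens @ w1 + b1) @ w2 + b2

tokens = project(rng.standard_normal((256, VISION_DIM)))  # e.g. 256 image patches
```

The projected `tokens` are what get spliced into the LLM's input sequence alongside text embeddings; in Stage 1 only `w1/b1/w2/b2` would be trained.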
## Data Preparation
We utilize the curated [LMMs-Lab-Speedrun/Data_NanoVLM](https://huggingface.co/datasets/LMMs-Lab-Speedrun/Data_NanoVLM) collection.
- **Stage 1**: From [liuhaotian/LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain)
- **Stage 2**: From [lmms-lab/LLaVA-NeXT-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data) (Note: We explicitly filtered out excessively long samples to maintain training efficiency).
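The long-sample filter mentioned above can be approximated with a single pass over the conversations. This is a sketch under assumptions: `MAX_CHARS` is a hypothetical character budget standing in for whatever token limit the released data actually used, and the `conversations`/`value` fields follow the common LLaVA JSON layout.

```python
# Sketch of the Stage 2 length filter: drop samples whose combined conversation
# text exceeds a budget. MAX_CHARS is a hypothetical threshold, not the one
# used to build the released dataset.
MAX_CHARS = 8192

def sample_length(sample):
    # LLaVA-format samples store turns under "conversations", text under "value".
    return sum(len(turn.get("value", "")) for turn in sample.get("conversations", []))

def filter_long_samples(samples, max_chars=MAX_CHARS):
    return [s for s in samples if sample_length(s) <= max_chars]
```

Filtering by raw character count is cheaper than tokenizing every sample and is usually a close enough proxy for keeping sequence lengths bounded.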
### Dataset YAML Configuration
Configure your local paths in the YAML files as shown below:
#### Stage 1 YAML (Example)
```yaml
datasets:
  - path: LMMs-Lab-Speedrun/Data_NanoVLM/Stage1-LLaVA-Pretrain/Lmms_format_blip_laion_cc_sbu_558k.json
    data_folder: path/to/Stage1-LLaVA-Pretrain/Image
    data_type: json
```
#### Stage 2 YAML (Example)
```yaml
datasets:
  - path: LMMs-Lab-Speedrun/Data_NanoVLM/Stage2-LLaVA-NeXT-Data/llava_next_Lmms_format_processed.json
    data_folder: path/to/LLaVA-NeXT-Data/Images
    data_type: json
```
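Before launching a run, it can help to confirm the YAML parses and that every entry carries the three expected keys. The check below uses PyYAML and is purely our sketch, not part of the official tooling:

```python
import yaml

CONFIG = """
datasets:
  - path: LMMs-Lab-Speedrun/Data_NanoVLM/Stage1-LLaVA-Pretrain/Lmms_format_blip_laion_cc_sbu_558k.json
    data_folder: path/to/Stage1-LLaVA-Pretrain/Image
    data_type: json
"""

def check_dataset_config(text):
    cfg = yaml.safe_load(text)
    for entry in cfg["datasets"]:
        # Each entry needs an annotation file, an image root, and a format tag.
        missing = {"path", "data_folder", "data_type"} - entry.keys()
        if missing:
            raise ValueError(f"dataset entry missing keys: {missing}")
    return cfg

cfg = check_dataset_config(CONFIG)
```

Note the indentation: `data_folder` and `data_type` must sit under the `- path:` list item, or YAML will parse them as unrelated top-level keys.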
## Execution
### 0. Installation & Initialization
For environment setup, please refer to the [lmms-engine Quick Start](https://github.com/EvolvingLMMs-Lab/lmms-engine?tab=readme-ov-file#-quick-start).
Download and use [NanoVLM_Init](https://huggingface.co/datasets/LMMs-Lab-Speedrun/NanoVLM_Init) for Stage 1 initialization.
### 1. Stage 1: Pre-training
```bash
bash ./examples/nanovlm/stage1_nanovlm_train.sh
```
### 2. Merge Stage 1 Checkpoint
```bash
python -m lmms_engine.merger \
    --checkpoint_path ./output/nanovlm_stage1/checkpoint-2180 \
    --output_path ./output/nanovlm_stage1/checkpoint-2180-merged
```
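We have not inspected `lmms_engine.merger` internals; as a rough mental model, merging a sharded checkpoint amounts to folding the per-shard state dicts back into one full state dict, along these (hypothetical) lines:

```python
# Hypothetical sketch of checkpoint merging: union several shard state dicts
# into one, refusing to silently overwrite a parameter that appears twice.
def merge_state_dicts(shards):
    merged = {}
    for shard in shards:
        for name, tensor in shard.items():
            if name in merged:
                raise ValueError(f"duplicate parameter across shards: {name}")
            merged[name] = tensor
    return merged
```

The merged checkpoint is what downstream tools (Stage 2 training, evaluation) load as a single set of weights.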
### 3. Stage 2: Instruction Tuning
```bash
export DATASET_PATH="/path/to/stage2_llava_next.yaml"
bash ./examples/nanovlm/stage2_nanovlm_train.sh
```
### 4. Merge Stage 2 Checkpoint
```bash
python -m lmms_engine.merger \
    --checkpoint_path ./output/nanovlm_stage2/checkpoint-11540 \
    --output_path ./output/nanovlm_stage2/checkpoint-11540-merged
```
## Evaluation (lmms-eval)
```bash
git clone -b dev-v0.7 https://github.com/EvolvingLMMs-Lab/lmms-eval.git
cd lmms-eval
```
Run the evaluation (replace `pretrained=...` with the path to your merged weights):
```bash
# Multi-GPU asynchronous evaluation
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m lmms_eval \
    --model nanovlm \
    --model_args pretrained=./output/nanovlm_stage2/checkpoint-11540-merged \
    --tasks mme \
    --batch_size 1
```
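The `--model_args` flag takes a comma-separated list of `key=value` pairs (here just `pretrained=...`). A toy parser shows the shape of that convention; this is our illustration, not lmms-eval's actual implementation:

```python
def parse_model_args(s):
    # "pretrained=./ckpt,device=cuda" -> {"pretrained": "./ckpt", "device": "cuda"}
    out = {}
    for pair in s.split(","):
        key, _, value = pair.partition("=")
        out[key.strip()] = value.strip()
    return out
```

Anything your model class accepts as a keyword argument can typically be threaded through this string.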
## Results
### Training Overhead
| Stage | Total Compute | Energy | CO2 Emissions | GPU Hours (H100) |
|---------|---------------|-------------|---------|-------|
| Stage 1 | 236.79 PFLOPs | 13.5221 kWh | 6.42 kg | 19.32 |
| Stage 2 | 98.23 PFLOPs | 3.1006 kWh | 1.47 kg | 4.43 |
### Benchmark Scores
| Benchmark | Score |
|-----------|-------|
| MME | 1204.46 (P: 948.75, C: 255.71) |
| MMMU (val) | TBD |
| MMBench (EN Dev) | TBD |
| OCRBench | TBD |
| BLINK | TBD |
## Release Notes
- [2026.02.27] Initial NanoVLM recipe released.
## TODOs
- [x] Publish Stage 1 & Stage 2 training scripts.
- [x] Publish evaluation scripts.
- [ ] Add more benchmark results (MMMU, OCRBench, BLINK).
- [ ] Optimize the training framework.
For more information about training, please refer to [NanoVLM Speedrun](https://github.com/EvolvingLMMs-Lab/lmms-engine/tree/main/examples/nanovlm).