# Nanbeige4.1-VLM-Base

A vision-language model combining SigLIP so400m as the vision encoder and Nanbeige4.1-3B as the language model, connected via a lightweight MLP projector with spatial pooling.

> ⚠️ **Stage 1 Pretrain Checkpoint**
>
> This is a Stage 1 checkpoint in which only the projector was trained. The model has learned visual grounding (associating image regions with language concepts) but has not yet been instruction-tuned. Stage 2 fine-tuning is in progress.
## Architecture

```
Image
  │
  ▼
SigLIP so400m           ← Frozen (400M params)
  │  (B, 729, 1152)
  ▼
Spatial Avg Pool        ← 729 → 196 tokens (27×27 → 14×14)
  │  (B, 196, 1152)
  ▼
2-Layer MLP Projector   ← Trained (10M params)
  │  (B, 196, 3072)
  ▼
Nanbeige4.1-3B          ← Frozen (3B params)
  │
  ▼
Text Output
```
The spatial pooling step reduces image token count from 729 → 196 (73% reduction), significantly lowering memory usage and inference latency without meaningful quality loss at the pretraining stage.
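The pooling and projection steps above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the model's actual code: the exact pooling operator and the projector's hidden widths are assumptions; only the tensor shapes are taken from the diagram.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B = 2                                  # batch size
feats = torch.randn(B, 729, 1152)      # SigLIP so400m output: 27×27 patches, dim 1152

# Restore the 2-D patch grid: (B, 729, 1152) -> (B, 1152, 27, 27)
grid = feats.transpose(1, 2).reshape(B, 1152, 27, 27)

# Spatially pool 27×27 -> 14×14 (729 -> 196 tokens); adaptive pooling is one
# way to reach a 14×14 grid from an odd-sized input (assumed operator)
pooled = F.adaptive_avg_pool2d(grid, (14, 14))

# Back to token-sequence layout: (B, 196, 1152)
tokens = pooled.flatten(2).transpose(1, 2)

# 2-layer MLP projector into the LLM hidden size (layer widths assumed)
projector = nn.Sequential(nn.Linear(1152, 3072), nn.GELU(), nn.Linear(3072, 3072))
print(projector(tokens).shape)  # torch.Size([2, 196, 3072])
```

The reshape-pool-flatten round trip keeps each pooled token aligned with a 2×2 neighborhood of the original patch grid, which is why spatial structure survives the 73% token reduction.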
## Usage

### Installation

```shell
pip install transformers torch accelerate sentencepiece safetensors Pillow requests
```

`flash-attn` is optional but recommended for faster inference on CUDA GPUs:

```shell
pip install flash-attn --no-build-isolation
```
### Basic Inference

```python
from transformers import AutoModel, AutoTokenizer
from PIL import Image

# Load model and tokenizer
model = AutoModel.from_pretrained(
    "SkyAsl/Nanbeige4.1-VLM-Base",
    trust_remote_code=True,
)
model.to("cuda")

tokenizer = AutoTokenizer.from_pretrained(
    "SkyAsl/Nanbeige4.1-VLM-Base",
    trust_remote_code=True,
)
model.set_tokenizer(tokenizer)

# Describe an image
image = Image.open("photo.jpg")
result = model.describe(image)
print(result)
```
### Custom Prompt

```python
result = model.describe(image, prompt="What objects are visible in this image?")
print(result)
```
### Load from URL

```python
import requests
from io import BytesIO

url = "https://upload.wikimedia.org/wikipedia/commons/a/a7/Camponotus_flavomarginatus_ant.jpg"
image = Image.open(BytesIO(requests.get(url).content))
print(model.describe(image))
```
### Creative Sampling

```python
result = model.describe(
    image,
    do_sample=True,
    temperature=0.7,
    max_new_tokens=200,
)
print(result)
```
## Training Details
| Property | Value |
|---|---|
| Dataset | LLaVA-CC3M-Pretrain-595K (595,375 image-caption pairs) |
| Vision encoder | google/siglip-so400m-patch14-384 |
| Language model | Nanbeige/Nanbeige4.1-3B |
| Trainable params | ~10M (projector only) |
| Frozen params | ~3.4B (SigLIP + LLM) |
| Image tokens | 196 (after 2×2 avg pooling from 729) |
| Max text length | 128 tokens |
| Batch size | 32 (effective) |
| Learning rate | 2e-3 (cosine schedule, warmup 3%) |
| Precision | bfloat16 + TF32 |
| Hardware | 1× A100 80GB |
| Training time | ~6 hours |
| Final loss | ~2.47 |
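The learning-rate schedule from the table (2e-3 peak, cosine decay, 3% linear warmup) can be reproduced in a few lines of plain Python. The total step count here is an assumption derived from the table (595,375 pairs at an effective batch size of 32, one epoch ≈ 18,605 steps); the actual training step count is not stated.

```python
import math

def lr_at(step, total_steps=18605, warmup_frac=0.03, peak_lr=2e-3):
    """Linear warmup to peak_lr, then cosine decay to 0."""
    warmup = int(warmup_frac * total_steps)   # ~558 warmup steps at 3%
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

print(lr_at(558))    # 0.002 (peak, end of warmup)
print(lr_at(18605))  # 0.0 (end of training)
```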
## Qualitative Results

Sample output on a bus depot image (Step 9303, end of Stage 1):

> "the bus depot is a major hub for commuters. it has been the main stop in town and serves as an important transport link. buses are used by local government agencies and tourists who want rides throughout this area."
The model consistently identifies objects, scenes, and spatial relationships correctly in the first 1–2 sentences. The drift into unsupported detail in longer outputs is expected at Stage 1 and should improve substantially after Stage 2 instruction tuning.
## Roadmap

- [x] Stage 1 — Projector pretraining on CC3M-595K
- [ ] Stage 2 — Instruction fine-tuning on LLaVA-Instruct-150K (in progress)
- [ ] Stage 3 — Full evaluation on VQA, MMBench, SEED-Bench
## Citation

```bibtex
@misc{zhai2023sigmoid,
      title={Sigmoid Loss for Language Image Pre-Training},
      author={Xiaohua Zhai and Basil Mustafa and Alexander Kolesnikov and Lucas Beyer},
      year={2023},
      eprint={2303.15343},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{yang2026nanbeige413bsmallgeneralmodel,
      title={Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts},
      author={Chen Yang and Guangyue Peng and Jiaying Zhu and Ran Le and Ruixiang Feng and Tao Zhang and Xiyun Xu and Yang Song and Yiming Jia and Yuntao Wen and Yunzhi Xu and Zekai Wang and Zhenwei An and Zhicong Sun and Zongchao Chen},
      year={2026},
      eprint={2602.13367},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.13367}
}
```
## License

Apache 2.0 — see LICENSE for details.