Nanbeige4.1-VLM-Base

A vision-language model that pairs the SigLIP so400m vision encoder with the Nanbeige4.1-3B language model, connecting the two via a lightweight MLP projector with spatial pooling.

⚠️ Stage 1 Pretrain Checkpoint
This is a Stage 1 checkpoint where only the projector was trained. The model has learned visual grounding (associating image regions with language concepts) but has not been instruction-tuned yet. Stage 2 fine-tuning is in progress.


Architecture

Image
  │
  ▼
SigLIP so400m          ← Frozen  (400M params)
  │  (B, 729, 1152)
  ▼
Spatial Avg Pool       ← 729 → 196 tokens  (27×27 → 14×14)
  │  (B, 196, 1152)
  ▼
2-Layer MLP Projector  ← Trained (10M params)
  │  (B, 196, 3072)
  ▼
Nanbeige4.1-3B         ← Frozen  (3B params)
  │
  ▼
Text Output

The spatial pooling step reduces image token count from 729 → 196 (73% reduction), significantly lowering memory usage and inference latency without meaningful quality loss at the pretraining stage.
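The pooling-and-projection path can be sketched in a few lines of PyTorch. This is a shape-level illustration only: the 2×2 average pool with ceil-mode padding (27 → 14 per side) and the 1152 → 3072 projection match the diagram above, but the projector's hidden width and activation are assumptions, not the actual implementation.

```python
import torch
import torch.nn as nn

# Simulated SigLIP so400m output: batch of 1, 27x27 = 729 patch tokens, width 1152
feats = torch.randn(1, 729, 1152)

# Reshape tokens back to a 2-D grid and apply 2x2 average pooling.
# ceil_mode=True pads the odd 27-wide grid so each side pools 27 -> 14.
grid = feats.transpose(1, 2).reshape(1, 1152, 27, 27)
pooled = nn.functional.avg_pool2d(grid, kernel_size=2, ceil_mode=True)  # (1, 1152, 14, 14)
tokens = pooled.flatten(2).transpose(1, 2)                              # (1, 196, 1152)

# 2-layer MLP projector into the LLM embedding width (3072).
# Hidden width and GELU activation are assumptions for illustration.
projector = nn.Sequential(
    nn.Linear(1152, 3072),
    nn.GELU(),
    nn.Linear(3072, 3072),
)
out = projector(tokens)
print(out.shape)  # torch.Size([1, 196, 3072])
```

The 196 pooled tokens are what the frozen language model actually sees, which is where the memory and latency savings come from.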


Usage

Installation

pip install transformers torch accelerate sentencepiece safetensors Pillow requests

flash-attention is optional but recommended for faster inference on CUDA GPUs:

pip install flash-attn --no-build-isolation

Basic Inference

from transformers import AutoModel, AutoTokenizer
from PIL import Image

# Load model and tokenizer
model = AutoModel.from_pretrained(
    "SkyAsl/Nanbeige4.1-VLM-Base",
    trust_remote_code=True,
)
model.to("cuda")

tokenizer = AutoTokenizer.from_pretrained(
    "SkyAsl/Nanbeige4.1-VLM-Base",
    trust_remote_code=True,
)
model.set_tokenizer(tokenizer)

# Describe an image
image  = Image.open("photo.jpg")
result = model.describe(image)
print(result)

Custom Prompt

result = model.describe(image, prompt="What objects are visible in this image?")
print(result)

Load from URL

import requests
from io import BytesIO

url   = "https://upload.wikimedia.org/wikipedia/commons/a/a7/Camponotus_flavomarginatus_ant.jpg"
image = Image.open(BytesIO(requests.get(url).content))
print(model.describe(image))

Creative Sampling

result = model.describe(
    image,
    do_sample=True,
    temperature=0.7,
    max_new_tokens=200,
)
print(result)

Training Details

Property           Value
-----------------  ------------------------------------------------------
Dataset            LLaVA-CC3M-Pretrain-595K (595,375 image-caption pairs)
Vision encoder     google/siglip-so400m-patch14-384
Language model     Nanbeige/Nanbeige4.1-3B
Trainable params   ~10M (projector only)
Frozen params      ~3.4B (SigLIP + LLM)
Image tokens       196 (after 2×2 avg pooling from 729)
Max text length    128 tokens
Batch size         32 (effective)
Learning rate      2e-3 (cosine schedule, 3% warmup)
Precision          bfloat16 + TF32
Hardware           1× A100 80GB
Training time      ~6 hours
Final loss         ~2.47
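
The learning-rate schedule in the table (2e-3 peak, cosine decay, 3% linear warmup) can be sketched as follows. `lr_at` is a hypothetical helper, not taken from the training code; the 9303-step total is borrowed from the Stage 1 checkpoint step below.

```python
import math

def lr_at(step, total_steps, peak_lr=2e-3, warmup_frac=0.03):
    """Linear warmup over the first 3% of steps, then cosine decay to 0."""
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

print(lr_at(0, 9303))     # small value early in warmup
print(lr_at(279, 9303))   # ~2e-3 at the end of warmup
print(lr_at(9302, 9303))  # ~0 at the end of training
```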

Qualitative Results

Sample outputs on a bus depot image (Step 9303, end of Stage 1):

Bus-depot-image

"the bus depot is a major hub for commuters. it has been the main stop in town and serves as an important transport link. buses are used by local government agencies and tourists who want rides throughout this area."

The model consistently identifies objects, scenes, and spatial relationships correctly in the first 1–2 sentences. Hallucination in longer outputs is expected at Stage 1, since only the projector has been trained, and should largely be addressed by Stage 2 instruction tuning.


Roadmap

  • Stage 1 — Projector pretraining on CC3M-595K (complete; this checkpoint)
  • Stage 2 — Instruction fine-tuning on LLaVA-Instruct-150K (in progress)
  • Stage 2 — Full evaluation on VQA, MMBench, SEED-Bench

Citation

@misc{zhai2023sigmoid,
      title={Sigmoid Loss for Language Image Pre-Training}, 
      author={Xiaohua Zhai and Basil Mustafa and Alexander Kolesnikov and Lucas Beyer},
      year={2023},
      eprint={2303.15343},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{yang2026nanbeige413bsmallgeneralmodel,
      title={Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts}, 
      author={Chen Yang and Guangyue Peng and Jiaying Zhu and Ran Le and Ruixiang Feng and Tao Zhang and Xiyun Xu and Yang Song and Yiming Jia and Yuntao Wen and Yunzhi Xu and Zekai Wang and Zhenwei An and Zhicong Sun and Zongchao Chen},
      year={2026},
      eprint={2602.13367},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.13367}, 
}

License

Apache 2.0 — see LICENSE for details.
