# Nanbeige4.1-VLM-Base

A vision-language model combining SigLIP so400m as the vision encoder and Nanbeige4.1-3B as the language model, connected via a lightweight MLP projector with spatial pooling.

> ⚠️ **Stage 1 Pretrain Checkpoint**
>
> This is a Stage 1 checkpoint in which only the projector was trained. The model has learned visual grounding (associating image regions with language concepts) but has not yet been instruction-tuned. Stage 2 fine-tuning is in progress.
## Architecture

```
Image
  │
  ▼
SigLIP so400m           ← Frozen (400M params)
  │  (B, 729, 1152)
  ▼
Spatial Avg Pool        ← 729 → 196 tokens (27×27 → 14×14)
  │  (B, 196, 1152)
  ▼
2-Layer MLP Projector   ← Trained (10M params)
  │  (B, 196, 3072)
  ▼
Nanbeige4.1-3B          ← Frozen (3B params)
  │
  ▼
Text Output
```
The spatial pooling step reduces image token count from 729 → 196 (73% reduction), significantly lowering memory usage and inference latency without meaningful quality loss at the pretraining stage.
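The pooling and projection steps above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the model's actual code: the exact pooling operator and the projector's hidden widths are assumptions; only the tensor shapes are taken from the diagram.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B = 2                                  # batch size
feats = torch.randn(B, 729, 1152)      # SigLIP so400m output: 27×27 patches, dim 1152

# Restore the 2-D patch grid: (B, 729, 1152) -> (B, 1152, 27, 27)
grid = feats.transpose(1, 2).reshape(B, 1152, 27, 27)

# Spatially pool 27×27 -> 14×14 (729 -> 196 tokens); adaptive pooling is one
# way to reach a 14×14 grid from an odd-sized input (assumed operator)
pooled = F.adaptive_avg_pool2d(grid, (14, 14))

# Back to token-sequence layout: (B, 196, 1152)
tokens = pooled.flatten(2).transpose(1, 2)

# 2-layer MLP projector into the LLM hidden size (layer widths assumed)
projector = nn.Sequential(nn.Linear(1152, 3072), nn.GELU(), nn.Linear(3072, 3072))
print(projector(tokens).shape)  # torch.Size([2, 196, 3072])
```

The reshape-pool-flatten round trip keeps each pooled token aligned with a 2×2 neighborhood of the original patch grid, which is why spatial structure survives the 73% token reduction.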
## Usage

### Installation

```shell
pip install transformers torch accelerate sentencepiece safetensors Pillow requests
```

`flash-attn` is optional but recommended for faster inference on CUDA GPUs:

```shell
pip install flash-attn --no-build-isolation
```
### Basic Inference

```python
from transformers import AutoModel, AutoTokenizer
from PIL import Image

# Load model and tokenizer
model = AutoModel.from_pretrained(
    "SkyAsl/Nanbeige4.1-VLM-Base",
    trust_remote_code=True,
)
model.to("cuda")

tokenizer = AutoTokenizer.from_pretrained(
    "SkyAsl/Nanbeige4.1-VLM-Base",
    trust_remote_code=True,
)
model.set_tokenizer(tokenizer)

# Describe an image
image = Image.open("photo.jpg")
result = model.describe(image)
print(result)
```
### Custom Prompt

```python
result = model.describe(image, prompt="What objects are visible in this image?")
print(result)
```
### Load from URL

```python
import requests
from io import BytesIO

url = "https://upload.wikimedia.org/wikipedia/commons/a/a7/Camponotus_flavomarginatus_ant.jpg"
image = Image.open(BytesIO(requests.get(url).content))
print(model.describe(image))
```
### Creative Sampling

```python
result = model.describe(
    image,
    do_sample=True,
    temperature=0.7,
    max_new_tokens=200,
)
print(result)
```
## Training Details
| Property | Value |
|---|---|
| Dataset | LLaVA-CC3M-Pretrain-595K (595,375 image-caption pairs) |
| Vision encoder | google/siglip-so400m-patch14-384 |
| Language model | Nanbeige/Nanbeige4.1-3B |
| Trainable params | ~10M (projector only) |
| Frozen params | ~3.4B (SigLIP + LLM) |
| Image tokens | 196 (after 2×2 avg pooling from 729) |
| Max text length | 128 tokens |
| Batch size | 32 (effective) |
| Learning rate | 2e-3 (cosine schedule, warmup 3%) |
| Precision | bfloat16 + TF32 |
| Hardware | 1× A100 80GB |
| Training time | ~6 hours |
| Final loss | ~2.47 |
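The learning-rate schedule from the table (2e-3 peak, cosine decay, 3% linear warmup) can be reproduced in a few lines of plain Python. The total step count here is an assumption derived from the table (595,375 pairs at an effective batch size of 32, one epoch ≈ 18,605 steps); the actual training step count is not stated.

```python
import math

def lr_at(step, total_steps=18605, warmup_frac=0.03, peak_lr=2e-3):
    """Linear warmup to peak_lr, then cosine decay to 0."""
    warmup = int(warmup_frac * total_steps)   # ~558 warmup steps at 3%
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

print(lr_at(558))    # 0.002 (peak, end of warmup)
print(lr_at(18605))  # 0.0 (end of training)
```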
## Qualitative Results

Sample output on a bus depot image (Step 9303, end of Stage 1):

> "the bus depot is a major hub for commuters. it has been the main stop in town and serves as an important transport link. buses are used by local government agencies and tourists who want rides throughout this area."
The model consistently identifies objects, scenes, and spatial relationships correctly in the first 1–2 sentences. The drift into unsupported detail in longer outputs is expected at Stage 1 and should improve substantially after Stage 2 instruction tuning.
## Roadmap

- [x] Stage 1 — Projector pretraining on CC3M-595K
- [ ] Stage 2 — Instruction fine-tuning on LLaVA-Instruct-150K (in progress)
- [ ] Stage 3 — Full evaluation on VQA, MMBench, SEED-Bench
## Citation

```bibtex
@misc{zhai2023sigmoid,
      title={Sigmoid Loss for Language Image Pre-Training},
      author={Xiaohua Zhai and Basil Mustafa and Alexander Kolesnikov and Lucas Beyer},
      year={2023},
      eprint={2303.15343},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{yang2026nanbeige413bsmallgeneralmodel,
      title={Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts},
      author={Chen Yang and Guangyue Peng and Jiaying Zhu and Ran Le and Ruixiang Feng and Tao Zhang and Xiyun Xu and Yang Song and Yiming Jia and Yuntao Wen and Yunzhi Xu and Zekai Wang and Zhenwei An and Zhicong Sun and Zongchao Chen},
      year={2026},
      eprint={2602.13367},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.13367}
}
```
## License

Apache 2.0 — see LICENSE for details.