Nanbeige4.1-VLM-Base
A vision-language model combining SigLIP so400m as the vision encoder and Nanbeige4.1-3B as the language model, connected via a lightweight MLP projector with spatial pooling.
⚠️ Stage 1 Pretrain Checkpoint
This is a Stage 1 checkpoint where only the projector was trained. The model has learned visual grounding (associating image regions with language concepts) but has not been instruction-tuned yet. Stage 2 fine-tuning is in progress.
Architecture
Image
│
▼
SigLIP so400m ← Frozen (400M params)
│ (B, 729, 1152)
▼
Spatial Avg Pool ← 729 → 196 tokens (27×27 → 14×14)
│ (B, 196, 1152)
▼
2-Layer MLP Projector ← Trained (10M params)
│ (B, 196, 3072)
▼
Nanbeige4.1-3B ← Frozen (3B params)
│
▼
Text Output
The spatial pooling step reduces the image token count from 729 to 196 (a 73% reduction). Fewer image tokens mean a shorter input sequence for the language model, which lowers memory usage and inference latency, without meaningful quality loss at the pretraining stage.
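The token arithmetic above can be checked with a small pure-Python sketch. It assumes the 27×27 → 14×14 reduction behaves like adaptive average pooling, with each output cell averaging the window from floor(i·27/14) to ceil((i+1)·27/14). The card does not show the actual pooling code, so `adaptive_windows` and `pool_grid` are hypothetical helpers for illustration only.

```python
# Illustrative sketch only: the pooling used by the real model is not shown in
# this card. We assume adaptive average pooling from a 27x27 to a 14x14 grid.

def adaptive_windows(in_size: int, out_size: int) -> list[tuple[int, int]]:
    """Return the [start, end) index range averaged into each output cell."""
    windows = []
    for i in range(out_size):
        start = (i * in_size) // out_size                     # floor(i * in / out)
        end = ((i + 1) * in_size + out_size - 1) // out_size  # ceil((i + 1) * in / out)
        windows.append((start, end))
    return windows


def pool_grid(grid: list[list[float]], out_size: int) -> list[list[float]]:
    """Average-pool a square 2D grid of scalars down to out_size x out_size."""
    wins = adaptive_windows(len(grid), out_size)
    pooled = []
    for r0, r1 in wins:
        row = []
        for c0, c1 in wins:
            cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


tokens_before = 27 * 27   # 729 SigLIP patch tokens
tokens_after = 14 * 14    # 196 tokens passed to the projector
reduction = 1 - tokens_after / tokens_before

# One scalar per patch stands in for a 1152-dimensional feature vector.
grid = [[float(r * 27 + c) for c in range(27)] for r in range(27)]
pooled = pool_grid(grid, 14)

print(tokens_before, tokens_after, f"{reduction:.1%}")
print(len(pooled), len(pooled[0]))
```

Note that a plain 2×2, stride-2 average pool without padding would give 13×13 = 169 tokens for a 27×27 input, so reaching 196 tokens implies padding, ceil-mode, or adaptive windows as assumed here.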
Usage
Installation
pip install transformers torch accelerate sentencepiece safetensors Pillow requests
FlashAttention is optional but recommended for faster inference on CUDA GPUs:
pip install flash-attn --no-build-isolation
Basic Inference
from transformers import AutoModel, AutoTokenizer
from PIL import Image

# Load model and tokenizer
model = AutoModel.from_pretrained(
    "SkyAsl/Nanbeige4.1-VLM-Base",
    trust_remote_code=True,
)
model.to("cuda")

tokenizer = AutoTokenizer.from_pretrained(
    "SkyAsl/Nanbeige4.1-VLM-Base",
    trust_remote_code=True,
)
model.set_tokenizer(tokenizer)

# Describe an image
image = Image.open("photo.jpg")
result = model.describe(image)
print(result)
Custom Prompt
result = model.describe(image, prompt="What objects are visible in this image?")
print(result)
Load from URL
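import requests
from io import BytesIO

url = "https://upload.wikimedia.org/wikipedia/commons/a/a7/Camponotus_flavomarginatus_ant.jpg"
# Fetch the image and fail early on HTTP errors instead of passing bad bytes to PIL
response = requests.get(url, timeout=30)
response.raise_for_status()
image = Image.open(BytesIO(response.content))
print(model.describe(image))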
Creative Sampling
result = model.describe(
    image,
    do_sample=True,
    temperature=0.7,
    max_new_tokens=200,
)
print(result)
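For intuition on the `temperature` argument used above, a minimal pure-Python sketch of temperature-scaled softmax follows. It is independent of the model's own `describe` implementation, which the card does not show; the logits and the `softmax_with_temperature` helper are made up for illustration.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical next-token scores
for t in (1.0, 0.7, 0.2):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, so 0.7 samples somewhat more conservatively than the unscaled distribution at 1.0.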
Training Details
| Property | Value |
|---|---|
| Dataset | LLaVA-CC3M-Pretrain-595K (595,375 image-caption pairs) |
| Vision encoder | google/siglip-so400m-patch14-384 |
| Language model | Nanbeige/Nanbeige4.1-3B |
| Trainable params | ~10M (projector only) |
| Frozen params | ~3.4B (SigLIP + LLM) |
| Image tokens | 196 (14×14 grid, after roughly 2×2 average pooling of the 27×27 = 729 patch grid) |
| Max text length | 128 tokens |
| Batch size | 32 (effective) |
| Learning rate | 2e-3 (cosine schedule, warmup 3%) |
| Precision | bfloat16 + TF32 |
| Hardware | 1× A100 80GB |
| Training time | ~6 hours |
| Final loss | ~2.47 |
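
The learning-rate row can be read as a linear warmup over the first 3% of steps followed by cosine decay. The sketch below assumes decay to zero and takes the total from the step count reported for the end of Stage 1 (9303); the exact scheduler is not stated in this card, and `lr_at` is a hypothetical helper.

```python
import math

PEAK_LR = 2e-3        # from the training table
TOTAL_STEPS = 9303    # assumed: the step count reported for the end of Stage 1
WARMUP_STEPS = round(0.03 * TOTAL_STEPS)  # 3% warmup -> 279 steps


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{lr_at(step):.2e}")
```

With only the projector being trained, a relatively high peak rate such as 2e-3 is common in this kind of setup, and the short warmup helps avoid unstable updates in the first few hundred steps.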
Qualitative Results
Sample output on a bus depot image (step 9303, end of Stage 1):
"the bus depot is a major hub for commuters. it has been the main stop in town and serves as an important transport link. buses are used by local government agencies and tourists who want rides throughout this area."
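In the first one to two sentences, the model generally identifies objects, scenes, and spatial relationships correctly. Hallucinated detail in longer outputs, as in the later sentences of the sample above, is expected at Stage 1 because the model has not been instruction-tuned; Stage 2 fine-tuning is intended to address it.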
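Since Stage 1 outputs tend to be most reliable in their opening sentences, one pragmatic workaround is to keep only the first one or two sentences of a generated description. The helper below is a hypothetical post-processing sketch, not part of the model's API, and it uses a simple punctuation-based split that will not handle abbreviations or decimals.

```python
import re


def first_sentences(text: str, n: int = 2) -> str:
    """Keep the first n sentences, splitting on '.', '!' or '?' followed by whitespace."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:n])


sample = (
    "the bus depot is a major hub for commuters. "
    "it has been the main stop in town and serves as an important transport link. "
    "buses are used by local government agencies and tourists who want rides "
    "throughout this area."
)
print(first_sentences(sample))
```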
Roadmap
- Stage 1 — Projector pretraining on CC3M-595K
- Stage 2 — Instruction fine-tuning on LLaVA-Instruct-150K
- Stage 2 — Full evaluation on VQA, MMBench, SEED-Bench
Citation
@misc{zhai2023sigmoid,
  title={Sigmoid Loss for Language Image Pre-Training},
  author={Xiaohua Zhai and Basil Mustafa and Alexander Kolesnikov and Lucas Beyer},
  year={2023},
  eprint={2303.15343},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

@misc{yang2026nanbeige413bsmallgeneralmodel,
  title={Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts},
  author={Chen Yang and Guangyue Peng and Jiaying Zhu and Ran Le and Ruixiang Feng and Tao Zhang and Xiyun Xu and Yang Song and Yiming Jia and Yuntao Wen and Yunzhi Xu and Zekai Wang and Zhenwei An and Zhicong Sun and Zongchao Chen},
  year={2026},
  eprint={2602.13367},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2602.13367},
}
License
Apache 2.0 — see LICENSE for details.
