Qwen3-VL-4B-Instruct — INT4 NF4 Quantized

Alibaba's latest Qwen3-VL-4B-Instruct quantized to 4-bit NF4 with double quantization for high-quality robotic visual reasoning. 3.1x smaller — from 8.5 GB to 2.7 GB — delivering stronger visual understanding than the 2B variant while still fitting on edge GPUs.

This model is part of the RobotFlowLabs model library, built for the ANIMA agentic robotics platform — a modular ROS2-native AI system that brings foundation model intelligence to real robots operating in the real world.

Why This Model Exists

When robotic tasks demand higher visual reasoning quality — complex scene descriptions, multi-step visual planning, or precise spatial grounding — the 4B variant provides a significant accuracy boost over the 2B. Qwen3-VL-4B features a deeper language model (36 layers vs 28) with wider hidden dimensions (2560 vs 2048), delivering better performance on visual grounding, counting, and reasoning benchmarks. At 2.7 GB quantized, it fits on an L4 24GB alongside a vision encoder and action model.

Model Details

| Property | Value |
|---|---|
| Architecture | Qwen3-VL (vision encoder + language decoder) |
| Total Parameters | 4B |
| Text Hidden Dimension | 2560 |
| Text Layers | 36 |
| Text Attention Heads | 32 (8 KV heads, GQA) |
| Text MLP Dimension | 9728 (SiLU activation) |
| Vision Encoder | 24-layer ViT (1024d, 16 heads, patch 16) |
| Vision Features | DeepStack at layers [5, 11, 17] |
| Spatial Merge | 2×2 (4 patches → 1 token) |
| Temporal Patch | 2 frames per token |
| Context Length | 262,144 tokens |
| Vocabulary | 151,936 tokens |
| RoPE | M-RoPE (interleaved, θ = 5,000,000) |
| Quantization | NF4 double quantization (bitsandbytes) |
| Original Model | Qwen/Qwen3-VL-4B-Instruct |
| License | Apache-2.0 |
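
Patch size 16 combined with the 2×2 spatial merge means each final image token covers a 32×32-pixel area. A rough back-of-envelope helper (my own illustrative sketch, not an API from this model card) makes the token budget concrete:

```python
def approx_image_tokens(height: int, width: int,
                        patch: int = 16, merge: int = 2) -> int:
    """Rough visual-token estimate: split the frame into a patch grid,
    then collapse each merge x merge block of patches into one token."""
    rows = height // patch
    cols = width // patch
    return (rows // merge) * (cols // merge)

# A 1024x1024 frame: 64x64 patches -> 32x32 merged tokens = 1024 tokens
print(approx_image_tokens(1024, 1024))  # -> 1024
```

This ignores special tokens and the DeepStack features injected at intermediate layers, but it gives a usable estimate of how much of the 262K context a high-resolution frame consumes.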

Compression Results

Quantized on an NVIDIA L4 24GB GPU using bitsandbytes NF4 with double quantization.

| Metric | Original | INT4 Quantized | Change |
|---|---|---|---|
| Total Size | 8,465 MB | 2,741 MB | 3.1x smaller |
| Quantization | BF16 | NF4 + double quant | 4-bit weights |
| Compute Dtype | BF16 | BF16 | Preserved at inference |
| Format | SafeTensors | SafeTensors | Direct HF loading |
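
The 3.1x figure follows directly from the sizes in the table. A quick sanity check, with a weights-only estimate of the 4-bit portion (the breakdown is my own rough accounting, not from the model card):

```python
original_mb = 8465
quantized_mb = 2741

# Measured compression ratio from the table above
ratio = original_mb / quantized_mb
print(f"{ratio:.1f}x smaller")  # -> 3.1x smaller

# Weights-only lower bound: 4e9 params at 4 bits (0.5 bytes) each.
# The gap up to 2,741 MB is layers kept in higher precision
# (embeddings, norms, vision components) plus NF4 quantization metadata.
four_bit_mib = 4e9 * 0.5 / 2**20
print(round(four_bit_mib), "MiB for the 4-bit linear weights alone")
```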

Quick Start

```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

# The NF4 quantization config ships inside the checkpoint, so loading
# requires bitsandbytes and a CUDA GPU; device_map places layers automatically.
model = AutoModelForImageTextToText.from_pretrained(
    "robotflowlabs/qwen3-vl-4b-instruct-int4",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("robotflowlabs/qwen3-vl-4b-instruct-int4")

image = Image.open("workspace.jpg")
messages = [
    {"role": "system", "content": "You are a robotic vision assistant specialized in manipulation tasks."},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "List all graspable objects, their approximate positions, and suggest a pick order."},
    ]},
]

# Render the chat template, tokenize text and image together, and generate.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=[image], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(outputs[0], skip_special_tokens=True))
```

With FORGE (ANIMA Integration)

```python
from forge.vlm import VLMRegistry

vlm = VLMRegistry.load("qwen3-vl-4b-instruct-int4")
plan = vlm.describe(image, "List all graspable objects and suggest a manipulation sequence.")
```

Use Cases in ANIMA

Qwen3-VL-4B serves as the high-quality visual reasoning engine in ANIMA:

  • Complex Scene Analysis — Detailed spatial reasoning about cluttered workspaces
  • Visual Task Planning — Multi-step manipulation plans from scene observation
  • Precise Grounding — Fine-grained object localization and counting
  • Structured Output — JSON scene graphs, object inventories, spatial relationship maps
  • Video Reasoning — Temporal understanding of task progress from camera feeds
  • Safety Assessment — Visual evaluation of workspace hazards before execution
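
For the structured-output case, model replies often wrap the JSON in prose or a markdown code fence. A small helper for pulling out the first parseable JSON object (my own sketch, not part of ANIMA or FORGE) keeps downstream parsing robust:

```python
import json
import re

def extract_first_json(text: str):
    """Return the first parseable top-level JSON object in `text`,
    tolerating surrounding prose or markdown code fences."""
    # Try to parse starting at each '{' until one decodes cleanly.
    for match in re.finditer(r"\{", text):
        try:
            obj, _ = json.JSONDecoder().raw_decode(text[match.start():])
            return obj
        except json.JSONDecodeError:
            continue
    return None

reply = 'Here is the scene graph:\n```json\n{"objects": [{"name": "mug", "x": 0.4}]}\n```'
print(extract_first_json(reply))  # -> {'objects': [{'name': 'mug', 'x': 0.4}]}
```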

Qwen3-VL Family on RobotFlowLabs

| Model | Params | Quantized Size | Best For |
|---|---|---|---|
| qwen3-vl-2b-instruct-int4 | 2B | 1.5 GB | Edge deployment, real-time |
| qwen3-vl-4b-instruct-int4 | 4B | 2.7 GB | Higher accuracy visual reasoning |

About ANIMA

ANIMA is a modular, ROS2-native agentic robotics platform developed by RobotFlowLabs. It combines 58 specialized AI modules into a unified system for real-world robotic autonomy.

Intended Use

Designed For

  • High-accuracy visual scene understanding for robotic manipulation
  • Complex visual task planning requiring spatial reasoning
  • Precise object grounding and counting in cluttered environments
  • Multi-turn visual dialogue with detailed scene descriptions

Limitations

  • INT4 quantization may slightly reduce fine-grained visual grounding precision
  • 262K context window is generous but may not cover extremely long video sequences
  • Requires GPU (bitsandbytes NF4 does not run on CPU)
  • Inherits biases from Qwen3-VL training data

Out of Scope

  • Safety-critical autonomous decision making without human oversight
  • Medical image analysis
  • Surveillance applications

Technical Details

Compression Pipeline

```
Original Qwen3-VL-4B-Instruct (BF16, 8.5 GB)
    │
    └─→ bitsandbytes NF4 double quantization
        ├─→ bnb_4bit_quant_type: nf4
        ├─→ bnb_4bit_use_double_quant: true
        ├─→ bnb_4bit_compute_dtype: bfloat16
        └─→ model.safetensors (2.7 GB)
```
  • Quantization: NF4 (Normal Float 4-bit) with double quantization via bitsandbytes
  • Compute: BF16 at inference — weights dequantized on-the-fly
  • Hardware: NVIDIA L4 24GB, CUDA 13.0, PyTorch 2.10, Python 3.14
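
Reproducing this pipeline with transformers would look roughly like the following. The settings mirror the options listed above; the live calls need a CUDA GPU plus the bitsandbytes package, so they are shown commented, and the string `"bfloat16"` stands in for `torch.bfloat16` to keep the fragment dependency-free:

```python
# The bitsandbytes settings from the compression pipeline, expressed as
# the kwargs one would pass to transformers.BitsAndBytesConfig.
bnb_kwargs = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_compute_dtype": "bfloat16",  # torch.bfloat16 in real code
}

# With transformers + bitsandbytes on a CUDA machine:
# import torch
# from transformers import AutoModelForImageTextToText, BitsAndBytesConfig
# cfg = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_use_double_quant=True,
#     bnb_4bit_compute_dtype=torch.bfloat16,
# )
# model = AutoModelForImageTextToText.from_pretrained(
#     "Qwen/Qwen3-VL-4B-Instruct",
#     quantization_config=cfg,
#     device_map="auto",
# )
print(sorted(bnb_kwargs))
```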

Citation

```bibtex
@article{qwen3vl,
  title={Qwen3-VL Technical Report},
  author={Qwen Team},
  year={2025}
}
```

Built with FORGE by RobotFlowLabs
Optimizing foundation models for real robots.
