Qwen3-VL-4B-Instruct — INT4 NF4 Quantized

Alibaba's latest Qwen3-VL-4B-Instruct quantized to 4-bit NF4 with double quantization for high-quality robotic visual reasoning. 3.1x smaller — from 8.5 GB to 2.7 GB — delivering stronger visual understanding than the 2B variant while still fitting on edge GPUs.

This model is part of the RobotFlowLabs model library, built for the ANIMA agentic robotics platform — a modular ROS2-native AI system that brings foundation model intelligence to real robots operating in the real world.

Why This Model Exists

When robotic tasks demand higher visual reasoning quality — complex scene descriptions, multi-step visual planning, or precise spatial grounding — the 4B variant provides a significant accuracy boost over the 2B. Qwen3-VL-4B features a deeper language model (36 layers vs 28) with wider hidden dimensions (2560 vs 2048), delivering better performance on visual grounding, counting, and reasoning benchmarks. At 2.7 GB quantized, it fits on an L4 24GB alongside a vision encoder and action model.

Model Details

| Property | Value |
|---|---|
| Architecture | Qwen3-VL (vision encoder + language decoder) |
| Total Parameters | 4B |
| Text Hidden Dimension | 2560 |
| Text Layers | 36 |
| Text Attention Heads | 32 (8 KV heads, GQA) |
| Text MLP Dimension | 9728 (SiLU activation) |
| Vision Encoder | 24-layer ViT (1024d, 16 heads, patch 16) |
| Vision Features | DeepStack at layers [5, 11, 17] |
| Spatial Merge | 2×2 (4 patches → 1 token) |
| Temporal Patch | 2 frames per token |
| Context Length | 262,144 tokens |
| Vocabulary | 151,936 tokens |
| RoPE | M-RoPE (interleaved, θ = 5,000,000) |
| Quantization | NF4 double quantization (bitsandbytes) |
| Original Model | Qwen/Qwen3-VL-4B-Instruct |
| License | Apache-2.0 |
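
Patch size 16 combined with the 2×2 spatial merge means each final image token covers a 32×32-pixel area. A rough back-of-envelope helper (my own illustrative sketch, not an API from this model card) makes the token budget concrete:

```python
def approx_image_tokens(height: int, width: int,
                        patch: int = 16, merge: int = 2) -> int:
    """Rough visual-token estimate: split the frame into a patch grid,
    then collapse each merge x merge block of patches into one token."""
    rows = height // patch
    cols = width // patch
    return (rows // merge) * (cols // merge)

# A 1024x1024 frame: 64x64 patches -> 32x32 merged tokens = 1024 tokens
print(approx_image_tokens(1024, 1024))  # -> 1024
```

This ignores special tokens and the DeepStack features injected at intermediate layers, but it gives a usable estimate of how much of the 262K context a high-resolution frame consumes.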

Compression Results

Quantized on an NVIDIA L4 24GB GPU using bitsandbytes NF4 with double quantization.

| Metric | Original | INT4 Quantized | Change |
|---|---|---|---|
| Total Size | 8,465 MB | 2,741 MB | 3.1x smaller |
| Quantization | BF16 | NF4 + double quant | 4-bit weights |
| Compute Dtype | BF16 | BF16 | Preserved at inference |
| Format | SafeTensors | SafeTensors | Direct HF loading |
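
The 3.1x figure follows directly from the sizes in the table. A quick sanity check, with a weights-only estimate of the 4-bit portion (the breakdown is my own rough accounting, not from the model card):

```python
original_mb = 8465
quantized_mb = 2741

# Measured compression ratio from the table above
ratio = original_mb / quantized_mb
print(f"{ratio:.1f}x smaller")  # -> 3.1x smaller

# Weights-only lower bound: 4e9 params at 4 bits (0.5 bytes) each.
# The gap up to 2,741 MB is layers kept in higher precision
# (embeddings, norms, vision components) plus NF4 quantization metadata.
four_bit_mib = 4e9 * 0.5 / 2**20
print(round(four_bit_mib), "MiB for the 4-bit linear weights alone")
```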

Quick Start

```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

# The NF4 quantization config ships inside the checkpoint, so loading
# requires bitsandbytes and a CUDA GPU; device_map places layers automatically.
model = AutoModelForImageTextToText.from_pretrained(
    "robotflowlabs/qwen3-vl-4b-instruct-int4",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("robotflowlabs/qwen3-vl-4b-instruct-int4")

image = Image.open("workspace.jpg")
messages = [
    {"role": "system", "content": "You are a robotic vision assistant specialized in manipulation tasks."},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "List all graspable objects, their approximate positions, and suggest a pick order."},
    ]},
]

# Render the chat template, tokenize text and image together, and generate.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=[image], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(outputs[0], skip_special_tokens=True))
```

With FORGE (ANIMA Integration)

```python
from forge.vlm import VLMRegistry

vlm = VLMRegistry.load("qwen3-vl-4b-instruct-int4")
plan = vlm.describe(image, "List all graspable objects and suggest a manipulation sequence.")
```

Use Cases in ANIMA

Qwen3-VL-4B serves as the high-quality visual reasoning engine in ANIMA:

  • Complex Scene Analysis — Detailed spatial reasoning about cluttered workspaces
  • Visual Task Planning — Multi-step manipulation plans from scene observation
  • Precise Grounding — Fine-grained object localization and counting
  • Structured Output — JSON scene graphs, object inventories, spatial relationship maps
  • Video Reasoning — Temporal understanding of task progress from camera feeds
  • Safety Assessment — Visual evaluation of workspace hazards before execution
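
For the structured-output case, model replies often wrap the JSON in prose or a markdown code fence. A small helper for pulling out the first parseable JSON object (my own sketch, not part of ANIMA or FORGE) keeps downstream parsing robust:

```python
import json
import re

def extract_first_json(text: str):
    """Return the first parseable top-level JSON object in `text`,
    tolerating surrounding prose or markdown code fences."""
    # Try to parse starting at each '{' until one decodes cleanly.
    for match in re.finditer(r"\{", text):
        try:
            obj, _ = json.JSONDecoder().raw_decode(text[match.start():])
            return obj
        except json.JSONDecodeError:
            continue
    return None

reply = 'Here is the scene graph:\n```json\n{"objects": [{"name": "mug", "x": 0.4}]}\n```'
print(extract_first_json(reply))  # -> {'objects': [{'name': 'mug', 'x': 0.4}]}
```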

Qwen3-VL Family on RobotFlowLabs

| Model | Params | Quantized Size | Best For |
|---|---|---|---|
| qwen3-vl-2b-instruct-int4 | 2B | 1.5 GB | Edge deployment, real-time |
| qwen3-vl-4b-instruct-int4 | 4B | 2.7 GB | Higher accuracy visual reasoning |

About ANIMA

ANIMA is a modular, ROS2-native agentic robotics platform developed by RobotFlowLabs. It combines 58 specialized AI modules into a unified system for real-world robotic autonomy.

Intended Use

Designed For

  • High-accuracy visual scene understanding for robotic manipulation
  • Complex visual task planning requiring spatial reasoning
  • Precise object grounding and counting in cluttered environments
  • Multi-turn visual dialogue with detailed scene descriptions

Limitations

  • INT4 quantization may slightly reduce fine-grained visual grounding precision
  • 262K context window is generous but may not cover extremely long video sequences
  • Requires GPU (bitsandbytes NF4 does not run on CPU)
  • Inherits biases from Qwen3-VL training data

Out of Scope

  • Safety-critical autonomous decision making without human oversight
  • Medical image analysis
  • Surveillance applications

Technical Details

Compression Pipeline

```
Original Qwen3-VL-4B-Instruct (BF16, 8.5 GB)
    │
    └─→ bitsandbytes NF4 double quantization
        ├─→ bnb_4bit_quant_type: nf4
        ├─→ bnb_4bit_use_double_quant: true
        ├─→ bnb_4bit_compute_dtype: bfloat16
        └─→ model.safetensors (2.7 GB)
```
  • Quantization: NF4 (Normal Float 4-bit) with double quantization via bitsandbytes
  • Compute: BF16 at inference — weights dequantized on-the-fly
  • Hardware: NVIDIA L4 24GB, CUDA 13.0, PyTorch 2.10, Python 3.14
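
Reproducing this pipeline with transformers would look roughly like the following. The settings mirror the options listed above; the live calls need a CUDA GPU plus the bitsandbytes package, so they are shown commented, and the string `"bfloat16"` stands in for `torch.bfloat16` to keep the fragment dependency-free:

```python
# The bitsandbytes settings from the compression pipeline, expressed as
# the kwargs one would pass to transformers.BitsAndBytesConfig.
bnb_kwargs = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_compute_dtype": "bfloat16",  # torch.bfloat16 in real code
}

# With transformers + bitsandbytes on a CUDA machine:
# import torch
# from transformers import AutoModelForImageTextToText, BitsAndBytesConfig
# cfg = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_use_double_quant=True,
#     bnb_4bit_compute_dtype=torch.bfloat16,
# )
# model = AutoModelForImageTextToText.from_pretrained(
#     "Qwen/Qwen3-VL-4B-Instruct",
#     quantization_config=cfg,
#     device_map="auto",
# )
print(sorted(bnb_kwargs))
```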

Citation

```bibtex
@article{qwen3vl,
  title={Qwen3-VL Technical Report},
  author={Qwen Team},
  year={2025}
}
```

Built with FORGE by RobotFlowLabs
Optimizing foundation models for real robots.
