rskill-qwen35-4b-nf4

OpenRAL rSkill — Qwen3.5-4B natively-multimodal video-language model packaged as an NF4 bitsandbytes vlm rSkill (ADR-0047). Accepts RGB image or video frames plus a natural-language query; returns a text answer. No actuators. Apache-2.0.

Quick Start

ral skill install hf://OpenRAL/rskill-qwen35-4b-nf4
from openral_core.schemas import RSkillManifest

manifest = RSkillManifest.from_yaml("rskills/qwen35-4b-nf4/rskill.yaml")
assert manifest.kind == "vlm"
assert manifest.role == "s2"
assert manifest.quantization.extra["scheme"] == "nf4"
assert manifest.is_commercial_use_allowed is True

What It Does

Qwen3.5-4B is a natively-multimodal foundation model trained from scratch on interleaved text, image, and video tokens. Given an RGB image or video clip and a natural-language question, it returns a free-form text answer grounded in the visual content.

This rSkill declares kind: vlm and role: s2 because it is a pure perception component operating at S2 (slow-reasoning) rate (~0.2–1 Hz), not an S1 fast policy. It consumes camera frames and natural-language queries, emits text answers, and never drives ros2_control joints.

Representative queries for robot scene understanding:

  • "What objects are on the table?"
  • "Is the gripper clear of obstacles?"
  • "Describe the relative positions of the cup and bowl."
  • "Has the pick-and-place task completed?"

Why Qwen3.5-4B over Qwen2.5-VL-7B

Qwen3.5-4B (this skill) Qwen2.5-VL-7B
Parameters 4B 7B
VideoMME (w/ subs.) 83.5% ~72%
MLVU 82.8% ~73%
VRAM at NF4 ~2.5 GB ~3.3 GB
VRAM at BF16 ~8 GB ~13 GB
Architecture Hybrid linear-attn (3:1) Full quadratic ViT+LLM
License Apache-2.0 Apache-2.0

Qwen3.5-4B beats Qwen2.5-VL-7B on every video benchmark despite being 3B smaller. The 3:1 Gated DeltaNet / full-attention hybrid processes long video sequences far more efficiently — important for continuous robot camera streams. At NF4 it fits well within 8 GB VRAM alongside the S1 skill stack.

Architecture

Qwen3.5 uses a 3:1 hybrid attention stack: three Gated DeltaNet (linear-attention, O(n)) layers for every one full-attention layer. This reduces cost on long sequences significantly. The vision encoder is shared with Qwen3-VL. Key features:

  • Native video support — temporal patch embedding, second-level event localization, up to 256K context (extensible to 1M)
  • Spatial grounding — RefCOCO avg ~80.6; strong for "where is X?" queries
  • 201-language support

Runtime

This rSkill ships a pre-quantized NF4 checkpoint as weights_uri (hf://OpenRAL/rskill-qwen35-4b-nf4): model.safetensors with an embedded bitsandbytes quantization_config (nf4, double-quant, bf16 compute). The sidecar loads it directly as 4-bit (~3.3 GB resident, no bf16 load spike), so it fits an 8 GB GPU with no loader workaround. source_repo records the SHA-pinned upstream Apache-2.0 model it was quantized from (provenance, §8).

Reproduce the checkpoint with tools/build_qwen_vlm_nf4_checkpoint.py (run in the sidecar venv):

$OPENRAL_QWEN_VLM_SIDECAR_VENV/bin/python tools/build_qwen_vlm_nf4_checkpoint.py \
  --source Qwen/Qwen3.5-4B \
  --out ~/.cache/openral/qwen35-4b-nf4-ckpt

It loads the upstream model once (forcing serial materialization so the bf16 pass fits 8 GB), saves the NF4 weights + processor, then verifies the checkpoint reloads directly as 4-bit and answers a smoke query.

The kind: vlm runtime is implemented (ADR-0047) as a read-only reasoner tool, not an ExecuteSkill (a scene VLM produces text, not actions):

  • Sidecar: tools/qwen_vlm_sidecar.py boots the NF4 model in its own venv and serves a ZMQ REQ/REP + msgpack protocol. Provision it separately and point at it with OPENRAL_QWEN_VLM_SIDECAR_VENV (or let the backend auto-spawn it on first query).
  • Backend: openral_runner.backends.gstreamer.qwen_scene_vlm.QwenSceneVlm is the node-side ZMQ client; build_scene_vlm(manifest) builds it from this manifest. The node-side client deps (pyzmq + msgpack) install with uv sync --group qwen-vlm.
  • Service node: openral_perception_ros.scene_vlm_node subscribes the cameras and serves /openral/perception/query_scene (openral_msgs/srv/QueryScene).
  • Reasoner tool: the LLM sees the read-only query_scene tool when the reasoner is launched with scene_query_available:=true. It asks open-ended scene-state questions ("has the robot grasped the mug?", "is the task complete?") and the answer feeds the next reasoning tick.

Validated live

The sidecar + backend + query_scene path was run end-to-end on an NVIDIA RTX 4070 Laptop (8 GB): NF4 Qwen3.5-4B loads to ~3.3 GB resident, and real image queries return correct answers — including the task-verification use case ("Has a robot gripper grasped any object?" → "No", grounded in the frame). Covered by the GPU-gated tests/unit/test_qwen_scene_vlm.py::test_e2e_query_coco_sample (set OPENRAL_QWEN_VLM_SIDECAR_VENV), not asserted blind.

8 GB load note. Deploying the pre-quantized weights_uri loads the 4-bit weights directly (~3.3 GB, ~6 s) with no workaround — the clean 8 GB path. The workaround only matters when quantizing at load from the raw upstream (the build step, or --model Qwen/Qwen3.5-4B): transformers 5.x's parallel loader materializes weights in bf16 on-GPU before bitsandbytes quantizes, and the 4-way-concurrent ~7.4 GB transient OOMs an 8 GB card, so the sidecar forces serial materialization (core_model_loading.GLOBAL_WORKERS = 1) + PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. The sidecar auto-detects which path applies. The Gated-DeltaNet fast kernels (fla / causal-conv1d) are optional — without them transformers uses a slower torch fallback (the model still loads and answers). The model loads via AutoModelForImageTextToText (it registers as Qwen3_5ForConditionalGeneration).

Benchmark Numbers

Benchmarks below are paper-reported (Qwen team, February 2026); reproduced_locally: false in the eval JSON.

Benchmark Qwen3.5-4B Qwen3.5-9B
VideoMME (w/ subtitles) 83.5% 84.5%
VideoMME (w/o subtitles) 76.9% 78.4%
VideoMMMU 74.1% 78.9%
MLVU 82.8% 84.4%
MVBench 71.2% 74.4%
LVBench 66.4% 70.0%
MMMU 77.6% 78.4%
RefCOCO avg 80.6% 81.3%
LingoQA (driving / spatial) 74.4% 80.4%

Supported robots and embodiments

This scene VLM is embodiment-agnostic — it reasons about camera frames and emits text, never actuator commands, so it imposes no kinematic requirement. The only hardware dependency is an RGB camera stream of at least 336×336. All in-tree OpenRAL embodiment tags are therefore listed in rskill.yaml (aloha, franka_panda, g1, google_robot, gr1, h1, mobile_base, openarm, panda_mobile, pusht, rizon4, sawyer, so100_follower, so101_follower, ur10e, ur5e, widowx) so any robot with a compatible camera can install it and expose the reasoner's query_scene tool. It pairs with any S1 VLA policy: the VLA acts, this VLM verifies (e.g. "did the grasp succeed?").

Sensors and Observation Contract

Direction Key Modality Shape / format Notes
in any RGB camera RGB image or video min 336 × 336 vla_feature_key intentionally omitted
in query text natural language scene question, grounding query, or task-completion check
out answer text free-form grounded text response; adapter parses to SceneQueryResult

The model emits no action chunks and has no proprioception contract.

Manifest Summary

Field Value
name OpenRAL/rskill-qwen35-4b-nf4
version 0.1.0
license apache-2.0
role / kind s2 / vlm
runtime pytorch
quantization.dtype int4
quantization.extra.scheme nf4
weights_uri hf://OpenRAL/rskill-qwen35-4b-nf4 (pre-quantized NF4)
min_vram_gb.bf16 8.0 GB
min_vram_gb.int4 2.5 GB
latency_budget.per_chunk_ms 3000 ms
actions query

License

The rSkill package metadata and README are OpenRAL project files under Apache-2.0. The wrapped Qwen3.5 weights are released by the Qwen Team under Apache-2.0, permitting commercial use. No OPENRAL_ALLOW_NONCOMMERCIAL=1 flag is needed.

Downloads last month
-
Safetensors
Model size
5B params
Tensor type
F32
·
BF16
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for OpenRAL/rskill-qwen35-4b-nf4

Finetuned
Qwen/Qwen3.5-4B
Quantized
(235)
this model