RoboFine-VLM

RoboFine-VLM is a fine-tuned Vision-Language Model for robot manipulation video understanding, built on Qwen3.5-VL-397B-A17B. It was introduced in the paper FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies.

Authors: Xintong Hu, Xuhong Huang, Jinyu Zhang, Yutong Yao, Yuchong Sun, Qiuyue Wang, Mingsheng Li, Sicheng Xie, Yitao Liu, Junhao Chen, Yixuan Chen, Yingming Zheng, Shuai Bai, and Tao Yu.

Model Details

Attribute	Value
Base Model	Qwen3.5-VL-397B-A17B
Architecture	Qwen3.5 MoE (Mixture of Experts)
Total Parameters	397B
Active Parameters	~17B (10 of 512 experts per token)
Text Backbone	60 layers, hidden=4096
Vision Encoder	27 layers, hidden=1152, patch=16
Training	Supervised fine-tuning (SFT) on robot manipulation data
Max Context	262,144 tokens
License	Apache 2.0

Capabilities

Robotic Video VQA: Visual question answering on robot manipulation videos (multi-view supported)
Fine-Grained Manipulation Annotation: Step-by-step captioning across 10 dimensions:

Dimension	What it captures
Action Sequence	Step-by-step execution order
Active Actor	Which arm / end-effector to use
Target Object	Object disambiguation
Initial Configuration	Starting state of objects and robot
Final Configuration	End state after manipulation
Contact & Approach	Where and how contact is made
Trajectory & Orientation	Motion path and tool orientation
Body Motion	Full-body or joint-level movement
Object Interaction	How objects relate during manipulation
Failure & Recovery	Error handling and recovery behavior
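
When building annotation prompts or reading the benchmark columns, it can help to keep the ten dimensions in a small lookup table. The sketch below is illustrative only: the abbreviations are those used in the benchmark table headers, `DIMENSIONS` and `build_annotation_prompt` are hypothetical names, and the prompt wording is an assumption, not the prompt used to train or evaluate the model.

```python
# Abbreviation -> (dimension name, what it captures), copied from the table above.
DIMENSIONS = {
    "AS": ("Action Sequence", "Step-by-step execution order"),
    "AA": ("Active Actor", "Which arm / end-effector to use"),
    "TO": ("Target Object", "Object disambiguation"),
    "IC": ("Initial Configuration", "Starting state of objects and robot"),
    "FC": ("Final Configuration", "End state after manipulation"),
    "C&A": ("Contact & Approach", "Where and how contact is made"),
    "T&O": ("Trajectory & Orientation", "Motion path and tool orientation"),
    "BM": ("Body Motion", "Full-body or joint-level movement"),
    "OI": ("Object Interaction", "How objects relate during manipulation"),
    "F&R": ("Failure & Recovery", "Error handling and recovery behavior"),
}


def build_annotation_prompt(keys=None):
    """Return a text prompt requesting captions for the selected dimensions.

    ASSUMPTION: the wording is illustrative, not the official prompt.
    """
    keys = list(DIMENSIONS) if keys is None else list(keys)
    unknown = [k for k in keys if k not in DIMENSIONS]
    if unknown:
        raise KeyError(f"Unknown dimension(s): {unknown}")
    lines = ["Annotate the robot manipulation video along these dimensions:"]
    for k in keys:
        name, desc = DIMENSIONS[k]
        lines.append(f"- {name}: {desc}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Request only two of the ten dimensions.
    print(build_annotation_prompt(["AS", "TO"]))
```

Restricting `keys` to a subset keeps the request focused when only some dimensions matter for a task.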

Benchmark Results

Evaluated on RoboFine-Bench, a comprehensive benchmark for robotic video understanding.

VQA (VLM4Robotics Benchmark)

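Evaluated on 500 robot manipulation samples across 10 datasets. Column abbreviations correspond to the 10 dimensions listed under Capabilities (for example, AA = Active Actor, F&R = Failure & Recovery).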

Model	Overall	Gnd.			Act.				State
		AA	TO	IC	AS	C&A	T&O	BM	OI	FC	F&R
Qwen3-VL-Plus	47.7	57.7	47.1	44.2	56.0	45.2	46.9	60.0	46.2	39.6	42.9
Qwen3.5-Plus	55.9	73.1	60.0	58.4	56.6	49.4	53.8	80.0	38.5	57.1	42.9
Doubao-Seed-2.0-Pro	58.5	63.5	55.3	53.2	62.4	49.7	58.8	70.0	53.8	64.3	50.0
Gemini-3.1-Pro	59.6	84.6	60.0	53.2	65.1	58.7	51.7	80.0	50.0	58.8	57.1
GPT-5.4	60.2	84.6	60.0	49.4	64.7	60.7	53.1	80.0	61.5	59.9	50.0
RoboFine-VLM (Ours)	68.2	82.7	65.9	68.8	70.6	69.0	63.0	100.0	61.5	65.4	78.6

Caption (VLM4Robotics Benchmark)

Fine-grained step decomposition of robot manipulation videos.

Model	Easy				Hard
	Overall	Cons.	Cov.	A-Hal.	Overall	Cons.	Cov.	A-Hal.
Qwen3-VL-Plus	75.4	75.2	58.2	92.8	64.4	67.4	54.3	71.6
Qwen3.5-Plus	76.6	75.3	59.1	95.5	72.4	71.0	55.1	91.2
Doubao-Seed-2.0-Pro	80.2	78.5	68.2	93.8	73.4	72.4	63.7	84.1
Gemini-3.1-Pro	80.1	79.9	62.7	97.7	75.9	75.7	58.5	93.4
GPT-5.4	81.4	79.5	72.1	92.5	78.0	73.8	66.8	93.4
RoboFine-VLM (Ours)	83.2	82.1	72.7	94.8	82.2	80.4	71.6	94.8
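
One way to read the table is to compare how much each model's Overall score drops from the Easy split to the Hard split. The short sketch below does that arithmetic on the numbers copied from the table; `easy_hard_drop` is a hypothetical helper and the comparison is an illustration, not a metric defined by the benchmark.

```python
# Overall caption scores copied from the table above: (Easy, Hard).
overall = {
    "Qwen3-VL-Plus": (75.4, 64.4),
    "Qwen3.5-Plus": (76.6, 72.4),
    "Doubao-Seed-2.0-Pro": (80.2, 73.4),
    "Gemini-3.1-Pro": (80.1, 75.9),
    "GPT-5.4": (81.4, 78.0),
    "RoboFine-VLM": (83.2, 82.2),
}


def easy_hard_drop(scores):
    """Return {model: Easy minus Hard}, rounded to one decimal place."""
    return {model: round(easy - hard, 1) for model, (easy, hard) in scores.items()}


drops = easy_hard_drop(overall)
# Print models from smallest to largest drop.
for model, drop in sorted(drops.items(), key=lambda kv: kv[1]):
    print(f"{model:22s} {drop:4.1f}")
```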

Settings: fps=4, max_frames=512, temperature=0.0, top_p=0.95, thinking enabled for all models except RoboFine-VLM, which is run without thinking.
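
The fps and max_frames settings together determine how many frames of a video reach the model. The exact sampling procedure is not described here, so the following is only a sketch of one plausible client-side approach: sample at roughly the target rate, then cap the count and spread the kept frames evenly. `sample_frame_indices` and its parameters are hypothetical names.

```python
def sample_frame_indices(num_frames, native_fps, target_fps=4.0, max_frames=512):
    """Pick evenly spaced frame indices at about target_fps, capped at max_frames.

    ASSUMPTION: uniform spacing over the whole clip; the evaluation pipeline
    may sample differently.
    """
    if num_frames <= 0 or native_fps <= 0 or target_fps <= 0 or max_frames <= 0:
        raise ValueError("all arguments must be positive")
    duration_s = num_frames / native_fps
    # Frames wanted at the target rate, never more than exist or are allowed.
    wanted = max(1, round(duration_s * target_fps))
    wanted = min(wanted, max_frames, num_frames)
    if wanted == 1:
        return [0]
    # Integer arithmetic keeps the first and last frame and avoids duplicates.
    return [(i * (num_frames - 1)) // (wanted - 1) for i in range(wanted)]


if __name__ == "__main__":
    # A 10-second clip recorded at 30 fps yields 40 frames at 4 fps.
    print(len(sample_frame_indices(300, 30.0)))
```

For long recordings the cap takes over: a 10-minute clip at 30 fps would call for 2,400 frames at 4 fps, so only 512 evenly spaced frames are kept.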

Quick Start

RoboFine-VLM is built on Qwen3.5-VL-397B-A17B. Inference efficiency and throughput vary significantly across frameworks. We recommend using the latest framework versions to ensure optimal performance and compatibility.

Using RoboFine-VLM via the Chat Completions API (OpenAI SDK)

After serving the model with a framework like vLLM or SGLang, you can use the following code:

from openai import OpenAI
import httpx

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
    http_client=httpx.Client(timeout=httpx.Timeout(600.0, connect=60.0)),
)

# Single-view video (frame URLs as video type)
messages = [{"role": "user", "content": [
    {
        "type": "video",
        "video": [
            "https://your-bucket.oss.aliyuncs.com/frame_0000.jpg",
            "https://your-bucket.oss.aliyuncs.com/frame_0001.jpg",
            # ... more frame URLs
        ],
        "max_frames": 512,
    },
    {"type": "text", "text": "Describe the robot manipulation in this video."},
]}]

response = client.chat.completions.create(
    model="RoboFine-VLM-opensource",
    messages=messages,
    temperature=0.0,
    top_p=0.95,
    max_tokens=32768,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)
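
Multi-view input is listed as supported, but the request format for it is not shown here. One plausible layout, assumed rather than confirmed, is one "video" content item per camera view followed by the text prompt. In the sketch below, `build_multiview_messages`, the view names, and the per-view text labels are all hypothetical; check the serving framework's documentation for the schema it actually accepts.

```python
def build_multiview_messages(views, prompt, max_frames=512):
    """Build a chat `messages` list with one video item per camera view.

    views: dict mapping a view name to an ordered list of frame URLs.
    ASSUMPTION: the server accepts several "video" items in one user turn,
    mirroring the single-view request above. Whether max_frames applies per
    view or in total is also an assumption (per view here).
    """
    if not views:
        raise ValueError("at least one view is required")
    content = []
    for name, frame_urls in views.items():
        if not frame_urls:
            raise ValueError(f"view {name!r} has no frames")
        # Label each view in text so the answer can refer to it by name.
        content.append({"type": "text", "text": f"View: {name}"})
        content.append(
            {"type": "video", "video": list(frame_urls), "max_frames": max_frames}
        )
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


if __name__ == "__main__":
    messages = build_multiview_messages(
        {
            "head_camera": ["https://your-bucket.oss.aliyuncs.com/head/frame_0000.jpg"],
            "wrist_camera": ["https://your-bucket.oss.aliyuncs.com/wrist/frame_0000.jpg"],
        },
        "Describe the robot manipulation in this video.",
    )
    print(len(messages[0]["content"]))
```

The returned list can be passed as `messages` to `client.chat.completions.create` in the same way as the single-view request.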

Hardware Requirements

Precision	Model Weights	Minimum GPUs	Total VRAM
BF16	~752GB	8× H200-141GB	1128GB
FP8	~376GB	8× A100/H100-80GB	640GB

8× H200-141GB: BF16 full precision, no quantization needed
8× A100/H100-80GB: Requires FP8 quantization (--quantization fp8)
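
The memory left after loading the weights has to hold the KV cache, activations, and framework overhead, which matters for long video contexts. The sketch below works that headroom out from the table's figures; it assumes the weights are sharded evenly across the GPUs, and `headroom` is a hypothetical helper. Actual usage depends on the serving framework and its settings.

```python
# Figures copied from the table above, in GB.
CONFIGS = {
    "BF16": {"weights_gb": 752, "gpus": 8, "vram_per_gpu_gb": 141},
    "FP8": {"weights_gb": 376, "gpus": 8, "vram_per_gpu_gb": 80},
}


def headroom(cfg):
    """Return (total VRAM, VRAM left after weights, left per GPU), all in GB.

    ASSUMPTION: weights are split evenly across GPUs.
    """
    total = cfg["gpus"] * cfg["vram_per_gpu_gb"]
    free = total - cfg["weights_gb"]
    return total, free, free / cfg["gpus"]


if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        total, free, per_gpu = headroom(cfg)
        print(f"{name}: total={total}GB free={free}GB per_gpu={per_gpu:.1f}GB")
```

With these numbers the BF16 setup leaves about 47GB per GPU and the FP8 setup about 33GB per GPU for everything other than weights.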

Training Data

Fine-tuned on FineVLA-Data: 47,159 human-verified trajectories with fine-grained instructions, generated by FineVLA-Tool and validated through human inspection.

Citation

@article{hu2026finevla,
  title={FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies},
  author={Hu, Xintong and Huang, Xuhong and Zhang, Jinyu and Yao, Yutong and Sun, Yuchong and Wang, Qiuyue and Li, Mingsheng and Xie, Sicheng and Liu, Yitao and Chen, Junhao and others},
  journal={arXiv preprint arXiv:2605.27284},
  year={2026}
}