RoboFine-VLM is a fine-tuned vision-language model for understanding robot manipulation videos, built on Qwen3.5-VL-397B-A17B.
| Attribute | Value |
|---|---|
| Base Model | Qwen3.5-VL-397B-A17B |
| Architecture | Qwen3.5 MoE (Mixture of Experts) |
| Total Parameters | 397B |
| Active Parameters | ~17B (10 of 512 experts per token) |
| Text Backbone | 60 layers, hidden=4096 |
| Vision Encoder | 27 layers, hidden=1152, patch=16 |
| Training | SFT on robot manipulation data |
| Max Context | 262,144 tokens |
| License | Apache 2.0 |
Evaluated on 500 robot manipulation samples across 10 datasets.
| Model | Overall | Gnd. | AA | TO | IC | AS | C&A | T&O | BM | OI | FC | F&R |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-Plus | 50.4 | 68.9 | 51.8 | 55.0 | 62.1 | 43.0 | 43.7 | 63.6 | 50.0 | 46.0 | 50.0 | — |
| Qwen3.5-Plus | 52.6 | 70.5 | 47.1 | 62.5 | 55.0 | 45.5 | 47.4 | 72.7 | 26.9 | 58.4 | 42.9 | — |
| Doubao-Seed-2.0-Pro | 54.9 | 60.7 | 55.3 | 61.3 | 61.4 | 50.0 | 45.1 | 72.7 | 42.3 | 61.6 | 50.0 | — |
| GPT-5.4 | 61.0 | 85.1 | 60.0 | 58.8 | 66.4 | 61.5 | 50.7 | 63.6 | 50.0 | 65.4 | 28.6 | — |
| Gemini-3.1-Pro | 62.1 | 83.6 | 67.1 | 68.8 | 72.9 | 52.6 | 52.1 | 63.6 | 23.1 | 67.6 | 50.0 | — |
| RoboFine-VLM (Ours) | 71.0 | 85.2 | 63.5 | 72.5 | 73.6 | 67.3 | 56.7 | 81.8 | 57.7 | 66.5 | 85.7 | — |
Fine-grained step decomposition of robot manipulation videos.
| Model | Easy Overall | Easy Cons. | Easy Cov. | Easy A-Hal. | Hard Overall | Hard Cons. | Hard Cov. | Hard A-Hal. |
|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-Plus | 76.8 | 75.6 | 60.4 | 94.4 | 65.1 | 68.7 | 57.0 | 69.6 |
| Qwen3.5-Plus | 77.9 | 76.0 | 61.7 | 96.0 | 72.5 | 70.9 | 56.8 | 89.7 |
| Doubao-Seed-2.0-Pro | 80.2 | 79.6 | 72.1 | 88.9 | 68.2 | 72.2 | 65.6 | 66.8 |
| Gemini-3.1-Pro | 81.3 | 80.8 | 69.8 | 93.2 | 77.2 | 77.0 | 61.3 | 93.4 |
| GPT-5.4 | 83.1 | 80.8 | 75.1 | 93.4 | 78.1 | 74.2 | 68.9 | 91.1 |
| RoboFine-VLM (Ours) | 85.2 | 83.9 | 76.7 | 95.1 | 83.6 | 81.9 | 75.3 | 93.7 |
Settings: fps=4, max_frames=512, temperature=0.0, top_p=0.95, thinking=disabled
python -m vllm.entrypoints.openai.api_server \
    --model RoboFine-VLM-opensource \
    --tensor-parallel-size 8 \
    --max-model-len 262144 \
    --dtype bfloat16 \
    --enforce-eager \
    --limit-mm-per-prompt '{"image": 2048}' \
    --reasoning-parser qwen3
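To see why the command above shards the model across 8 GPUs, a back-of-the-envelope estimate of the weight footprint helps. The sketch below is an approximation, not a measured figure: it assumes 2 bytes per parameter for bfloat16, assumes the weights split evenly across tensor-parallel ranks, and ignores the KV cache, activations, and any replicated tensors. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


TOTAL_PARAMS = 397e9   # "Total Parameters" from the model table
TENSOR_PARALLEL = 8    # --tensor-parallel-size in the server command

# bfloat16 stores each parameter in 2 bytes.
total_gb = weight_memory_gb(TOTAL_PARAMS, bytes_per_param=2)

# Assumption: an even split across ranks; real sharding may differ.
per_gpu_gb = total_gb / TENSOR_PARALLEL

print(f"weights: {total_gb:.0f} GB total, {per_gpu_gb:.2f} GB per GPU")
```

Under these assumptions the weights alone come to roughly 794 GB, or about 99 GB per GPU, before any memory for the 262,144-token context is counted.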
from openai import OpenAI
import httpx

# Long timeouts: video requests with many frames can take minutes to complete.
client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
    http_client=httpx.Client(timeout=httpx.Timeout(600.0, connect=60.0)),
)

# Single-view video (frame URLs as video type)
messages = [{"role": "user", "content": [
    {
        "type": "video",
        "video": [
            "https://your-bucket.oss.aliyuncs.com/frame_0000.jpg",
            "https://your-bucket.oss.aliyuncs.com/frame_0001.jpg",
            # ... more frame URLs
        ],
        "max_frames": 512,
    },
    {"type": "text", "text": "Describe the robot manipulation in this video."},
]}]

# Multi-view video (alternative to the single-view example above;
# this assignment replaces it, so keep only the variant you need)
messages = [{"role": "user", "content": [
    {"type": "text", "text": "[View: head_rgb]"},
    {"type": "video", "video": ["head/frame_0000.jpg", "head/frame_0001.jpg"]},
    {"type": "text", "text": "[View: wrist_rgb]"},
    {"type": "video", "video": ["wrist/frame_0000.jpg", "wrist/frame_0001.jpg"]},
    {"type": "text", "text": "Describe the robot actions from all views."},
]}]

response = client.chat.completions.create(
    model="RoboFine-VLM-opensource",
    messages=messages,
    temperature=0.0,
    top_p=0.95,
    max_tokens=32768,
    # Disable thinking mode, matching the evaluation settings.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)
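When there are more than two camera views, writing the message list by hand gets repetitive. The helper below is a minimal sketch that reproduces the multi-view layout from the example above (a `[View: name]` text tag followed by that view's frames, then the prompt). The function name `build_multiview_messages` and its per-view length check are illustrative assumptions, not part of any released RoboFine-VLM tooling.

```python
def build_multiview_messages(views, prompt, max_frames=512):
    """Build a single user message from {view_name: [frame paths or URLs]}.

    Hypothetical helper: mirrors the message layout shown in the usage
    example and rejects views longer than the per-view frame limit.
    """
    content = []
    for name, frames in views.items():
        if len(frames) > max_frames:
            raise ValueError(
                f"view {name!r} has {len(frames)} frames, limit is {max_frames}"
            )
        # Tag each view so the model can tell the cameras apart.
        content.append({"type": "text", "text": f"[View: {name}]"})
        content.append({"type": "video", "video": list(frames)})
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


messages = build_multiview_messages(
    {
        "head_rgb": ["head/frame_0000.jpg", "head/frame_0001.jpg"],
        "wrist_rgb": ["wrist/frame_0000.jpg", "wrist/frame_0001.jpg"],
    },
    prompt="Describe the robot actions from all views.",
)
```

The resulting `messages` list can be passed to `client.chat.completions.create` in the same way as the handwritten version.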
| Parameter | Value | Note |
|---|---|---|
| temperature | 0.0 | Deterministic output |
| top_p | 0.95 | — |
| max_tokens | 32768 | — |
| enable_thinking | False | Disable CoT reasoning for faster inference |
| max_frames | 512 | Per-view frame limit |
| fps | 4.0 | Frame sampling rate |
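The `fps` and `max_frames` settings interact: at 4 frames per second, 512 frames cover 128 seconds of video, so longer clips need thinning. The card does not say how frames are selected beyond these two values, so the sketch below is one plausible reading, not the authors' pipeline: sample at the target rate, then subsample uniformly if the per-view limit is exceeded. The function name `sample_frame_indices` and the uniform-subsampling strategy are assumptions.

```python
def sample_frame_indices(num_frames, native_fps, target_fps=4.0, max_frames=512):
    """Pick source-frame indices at roughly target_fps, capped at max_frames.

    Assumed strategy: fixed-rate sampling followed by uniform subsampling.
    """
    if num_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []

    # Never upsample: if the source is slower than the target, keep every frame.
    step = max(native_fps / target_fps, 1.0)

    indices = []
    i = 0
    while int(i * step) < num_frames:
        indices.append(int(i * step))
        i += 1

    if len(indices) > max_frames:
        # Too many frames for one view: keep max_frames evenly spaced picks.
        stride = len(indices) / max_frames
        indices = [indices[int(k * stride)] for k in range(max_frames)]
    return indices


short_clip = sample_frame_indices(num_frames=300, native_fps=30.0)    # 10 s clip
long_clip = sample_frame_indices(num_frames=18000, native_fps=30.0)   # 10 min clip
print(len(short_clip), len(long_clip))
```

In this sketch a 10-second clip at 30 fps yields 40 frames, while a 10-minute clip is thinned to the 512-frame limit.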
Fine-tuned on curated robot manipulation datasets covering diverse platforms (Franka, UR5, Google Robot, Galaxea, etc.), tasks (pick-and-place, manipulation, navigation), and environments.
- --limit-mm-per-prompt adjustment
- thinking=disabled; thinking mode not specifically tuned

@misc{robofine-vlm-2026,
  title={RoboFine-VLM: Fine-tuned Vision-Language Model for Robot Manipulation Understanding},
  year={2026}
}