RoboFine-VLM

A fine-tuned Vision-Language Model for robot manipulation video understanding, built on Qwen3.5-VL-397B-A17B.

Model Details

| Attribute | Value |
|---|---|
| Base Model | Qwen3.5-VL-397B-A17B |
| Architecture | Qwen3.5 MoE (Mixture of Experts) |
| Total Parameters | 397B |
| Active Parameters | ~17B (10 of 512 experts per token) |
| Text Backbone | 60 layers, hidden=4096 |
| Vision Encoder | 27 layers, hidden=1152, patch=16 |
| Training | SFT on robot manipulation data |
| Max Context | 262,144 tokens |
| License | Apache 2.0 |

Capabilities

  • Fine-grained robot manipulation video analysis
  • Multi-view video understanding (supports 1–3 camera views simultaneously)
  • Visual Question Answering (VQA) on manipulation tasks
  • Detailed step-by-step captioning of robot actions
  • Contact region, trajectory, and object state recognition

Benchmark Results

VQA (VLM4Robotics Benchmark)

Evaluated on 500 robot manipulation samples across 10 datasets.

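| Model | Overall | Gnd. | AA | TO | IC | AS | C&A | T&O | BM | OI | FC | F&R |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-Plus | 50.4 | 68.9 | 51.8 | 55.0 | 62.1 | 43.0 | 43.7 | 63.6 | 50.0 | 46.0 | 50.0 | - |
| Qwen3.5-Plus | 52.6 | 70.5 | 47.1 | 62.5 | 55.0 | 45.5 | 47.4 | 72.7 | 26.9 | 58.4 | 42.9 | - |
| Doubao-Seed-2.0-Pro | 54.9 | 60.7 | 55.3 | 61.3 | 61.4 | 50.0 | 45.1 | 72.7 | 42.3 | 61.6 | 50.0 | - |
| GPT-5.4 | 61.0 | 85.1 | 60.0 | 58.8 | 66.4 | 61.5 | 50.7 | 63.6 | 50.0 | 65.4 | 28.6 | - |
| Gemini-3.1-Pro | 62.1 | 83.6 | 67.1 | 68.8 | 72.9 | 52.6 | 52.1 | 63.6 | 23.1 | 67.6 | 50.0 | - |
| RoboFine-VLM (Ours) | 71.0 | 85.2 | 63.5 | 72.5 | 73.6 | 67.3 | 56.7 | 81.8 | 57.7 | 66.5 | 85.7 | - |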

Caption (VLM4Robotics Benchmark)

Fine-grained step decomposition of robot manipulation videos.

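| Model | Easy Overall | Easy Cons. | Easy Cov. | Easy A-Hal. | Hard Overall | Hard Cons. | Hard Cov. | Hard A-Hal. |
|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-Plus | 76.8 | 75.6 | 60.4 | 94.4 | 65.1 | 68.7 | 57.0 | 69.6 |
| Qwen3.5-Plus | 77.9 | 76.0 | 61.7 | 96.0 | 72.5 | 70.9 | 56.8 | 89.7 |
| Doubao-Seed-2.0-Pro | 80.2 | 79.6 | 72.1 | 88.9 | 68.2 | 72.2 | 65.6 | 66.8 |
| Gemini-3.1-Pro | 81.3 | 80.8 | 69.8 | 93.2 | 77.2 | 77.0 | 61.3 | 93.4 |
| GPT-5.4 | 83.1 | 80.8 | 75.1 | 93.4 | 78.1 | 74.2 | 68.9 | 91.1 |
| RoboFine-VLM (Ours) | 85.2 | 83.9 | 76.7 | 95.1 | 83.6 | 81.9 | 75.3 | 93.7 |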

Settings: fps=4, max_frames=512, temperature=0.0, top_p=0.95, thinking=disabled

Quick Start

vLLM Deployment

python -m vllm.entrypoints.openai.api_server \
    --model RoboFine-VLM-opensource \
    --tensor-parallel-size 8 \
    --max-model-len 262144 \
    --dtype bfloat16 \
    --enforce-eager \
    --limit-mm-per-prompt '{"image": 2048}' \
    --reasoning-parser qwen3

API Call (OpenAI SDK)

from openai import OpenAI
import httpx

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
    http_client=httpx.Client(timeout=httpx.Timeout(600.0, connect=60.0)),
)

# Single-view video (frame URLs as video type)
messages = [{"role": "user", "content": [
    {
        "type": "video",
        "video": [
            "https://your-bucket.oss.aliyuncs.com/frame_0000.jpg",
            "https://your-bucket.oss.aliyuncs.com/frame_0001.jpg",
            # ... more frame URLs
        ],
        "max_frames": 512,
    },
    {"type": "text", "text": "Describe the robot manipulation in this video."},
]}]

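# Multi-view video: label each view with a text tag, then pass that view's frames.
# Note: this reassigns `messages`, so keep only the example you need.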
messages = [{"role": "user", "content": [
    {"type": "text", "text": "[View: head_rgb]"},
    {"type": "video", "video": ["head/frame_0000.jpg", "head/frame_0001.jpg"]},
    {"type": "text", "text": "[View: wrist_rgb]"},
    {"type": "video", "video": ["wrist/frame_0000.jpg", "wrist/frame_0001.jpg"]},
    {"type": "text", "text": "Describe the robot actions from all views."},
]}]

response = client.chat.completions.create(
    model="RoboFine-VLM-opensource",
    messages=messages,
    temperature=0.0,
    top_p=0.95,
    max_tokens=32768,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)
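
The multi-view layout above (a `[View: <name>]` text tag followed by that view's frames, with the instruction last) can be assembled programmatically. The following is a minimal sketch, not part of the released code: `build_multiview_content` is a hypothetical helper name, and attaching `max_frames` to every video item is an assumption carried over from the single-view example.

```python
from typing import Dict, List


def build_multiview_content(
    views: Dict[str, List[str]],
    prompt: str,
    max_frames: int = 512,
) -> List[dict]:
    """Build the `content` list for a multi-view request.

    `views` maps a camera name (e.g. "head_rgb") to its ordered frame URLs.
    Hypothetical helper: the name and signature are illustrative only.
    """
    if not 1 <= len(views) <= 3:
        raise ValueError("expected 1-3 camera views")

    content: List[dict] = []
    for name, frames in views.items():
        if not frames:
            raise ValueError(f"view {name!r} has no frames")
        # A text tag introduces each view, followed by that view's frames.
        content.append({"type": "text", "text": f"[View: {name}]"})
        # Assumption: max_frames is passed per video item, as in the single-view example.
        content.append({"type": "video", "video": list(frames), "max_frames": max_frames})
    # The instruction goes last, after all views.
    content.append({"type": "text", "text": prompt})
    return content


multi_view_messages = [{
    "role": "user",
    "content": build_multiview_content(
        {
            "head_rgb": ["head/frame_0000.jpg", "head/frame_0001.jpg"],
            "wrist_rgb": ["wrist/frame_0000.jpg", "wrist/frame_0001.jpg"],
        },
        "Describe the robot actions from all views.",
    ),
}]
```

The resulting `multi_view_messages` can be passed as `messages=` to `client.chat.completions.create`. Views are emitted in dictionary insertion order, so the order of the input mapping determines the order the model sees.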

Recommended Parameters

| Parameter | Value | Note |
|---|---|---|
| temperature | 0.0 | Deterministic output |
| top_p | 0.95 | - |
| max_tokens | 32768 | - |
| enable_thinking | False | Disable CoT reasoning for faster inference |
| max_frames | 512 | Per-view frame limit |
| fps | 4.0 | Frame sampling rate |
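
To illustrate how `fps` and `max_frames` interact when frames are extracted client-side, here is a minimal sketch of uniform frame sampling. It assumes frames are picked evenly across the clip and that the cap applies per view; the serving stack's actual sampling logic may differ, and `sample_frame_indices` is a hypothetical helper name.

```python
from typing import List


def sample_frame_indices(
    total_frames: int,
    native_fps: float,
    target_fps: float = 4.0,
    max_frames: int = 512,
) -> List[int]:
    """Pick source-frame indices at roughly `target_fps`, capped at `max_frames`."""
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0 or max_frames <= 0:
        raise ValueError("all arguments must be positive")

    duration_s = total_frames / native_fps
    # Frames wanted at the target rate: at least one, never more than available.
    wanted = max(1, min(total_frames, int(duration_s * target_fps)))
    # Enforce the per-view frame limit.
    wanted = min(wanted, max_frames)
    # Spread the picks evenly over the clip.
    step = total_frames / wanted
    return [int(i * step) for i in range(wanted)]


short_clip = sample_frame_indices(total_frames=300, native_fps=30.0)   # 10 s clip
long_clip = sample_frame_indices(total_frames=9000, native_fps=30.0)   # 300 s clip
print(len(short_clip), len(long_clip))  # 40 512
```

A 10-second clip yields 40 frames at 4 fps, while a 300-second clip would yield 1200 and is therefore capped at 512, which lowers its effective sampling rate.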

Hardware Requirements

  • GPU: 8× A100/H100 80GB (tensor parallel)
  • VRAM: ~640GB total (MoE model with 512 experts)
  • Inference: ~15–200s per sample depending on frame count

Training Data

Fine-tuned on curated robot manipulation datasets covering diverse platforms (Franka, UR5, Google Robot, Galaxea, etc.), tasks (pick-and-place, manipulation, navigation), and environments.

Limitations

  • Long multi-view videos (>1000 frames) require raising --limit-mm-per-prompt when launching vLLM
  • Inference speed scales with frame count; 3-view × 100s videos may take 2–5 minutes
  • Performance is best with thinking disabled (enable_thinking=False); thinking mode was not specifically tuned
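
A quick way to see whether a request is likely to hit the frame limits is to work out the frame budget ahead of time. The sketch below assumes each sampled frame counts as one item against the `image` value of `--limit-mm-per-prompt` (2048 in the launch command above), which this card does not confirm; `frame_budget` and `FrameBudget` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class FrameBudget:
    per_view: int   # frames sent for each camera view
    total: int      # frames sent across all views
    fits: bool      # True if total is within the multimodal item limit


def frame_budget(
    num_views: int,
    duration_s: float,
    fps: float = 4.0,
    max_frames: int = 512,
    mm_limit: int = 2048,
) -> FrameBudget:
    """Estimate how many frames a multi-view request would send."""
    # Frames per view at the sampling rate, capped by the per-view limit.
    per_view = min(int(duration_s * fps), max_frames)
    total = per_view * num_views
    # Assumption: every frame counts as one item against mm_limit.
    return FrameBudget(per_view=per_view, total=total, fits=total <= mm_limit)


budget = frame_budget(num_views=3, duration_s=100.0)
print(budget.per_view, budget.total, budget.fits)  # 400 1200 True
```

For the 3-view × 100s case, this gives 400 frames per view and 1200 in total. Under the stated assumption that fits a limit of 2048 but would exceed a limit of 1000.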

Citation

@misc{robofine-vlm-2026,
    title={RoboFine-VLM: Fine-tuned Vision-Language Model for Robot Manipulation Understanding},
    year={2026}
}