GreenVLA-5b-stride-4-R1-fractal

Embodiment-Adapted VLA for Fractal (Google Robot)

Sber Robotics Center · Manipulation Team


Overview

GreenVLA-5b-stride-4-R1-fractal is the R1 (embodiment-adapted) checkpoint of the Green-VLA family, fine-tuned on the Fractal dataset for the Google Robot.

Starting from the GreenVLA-5b-base-stride-4 pretrained checkpoint, this model was adapted to the Fractal embodiment via supervised fine-tuning (R1 stage). It reaches a 71.8% overall average success rate on the SimplerEnv benchmark (see Evaluation).

Evaluation

The model was evaluated on the SimplerEnv Google Robot (Fractal) benchmark with the default episode length:

Visual Matching

| Task            | Success Rate |
|-----------------|--------------|
| Coke Can        | 85.7%        |
| Move Near       | 75.8%        |
| Drawer          | 64.8%        |
| Apple in Drawer | 81.5%        |
| Average         | 77.0%        |

Variant Aggregation

| Task            | Success Rate |
|-----------------|--------------|
| Coke Can        | 92.6%        |
| Move Near       | 71.9%        |
| Drawer          | 35.7%        |
| Apple in Drawer | 66.7%        |
| Average         | 66.7%        |

Overall Average: 71.8%
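
The averages above can be reproduced from the per-task numbers. The sketch below assumes that each setting's average is the unweighted mean over its four tasks and that the overall figure is the mean of the two setting averages. The card does not state the aggregation explicitly, but these assumptions match the reported values to within rounding.

```python
# Per-task success rates (%) copied from the two tables above.
visual_matching = {
    "Coke Can": 85.7,
    "Move Near": 75.8,
    "Drawer": 64.8,
    "Apple in Drawer": 81.5,
}
variant_aggregation = {
    "Coke Can": 92.6,
    "Move Near": 71.9,
    "Drawer": 35.7,
    "Apple in Drawer": 66.7,
}


def mean(values):
    """Unweighted arithmetic mean (assumed aggregation, see lead-in)."""
    values = list(values)
    return sum(values) / len(values)


vm_avg = mean(visual_matching.values())
va_avg = mean(variant_aggregation.values())
# Assumption: the overall average is the mean of the two setting averages.
overall = mean([vm_avg, va_avg])

print(f"Visual Matching average:     {vm_avg:.2f}%")
print(f"Variant Aggregation average: {va_avg:.2f}%")
print(f"Overall average:             {overall:.2f}%")
```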

Training

|                 | Details                                     |
|-----------------|---------------------------------------------|
| Base checkpoint | GreenVLA-5b-base-stride-4                   |
| Stage           | R1 (embodiment-specific adaptation)         |
| Method          | Supervised fine-tuning                      |
| Dataset         | IPEC-COMMUNITY/fractal20220817_data_lerobot |
| Robot           | Google Robot (Fractal)                      |
| Parameters      | ~5B                                         |

Quick Start

Installation

git clone https://github.com/greenvla/GreenVLA.git
cd GreenVLA
uv sync  # or: pip install -e .

Inference

import numpy as np
import torch
from lerobot.common.policies.factory import load_pretrained_policy
from lerobot.common.utils.torch_observation import (
    move_dict_to_batch_for_inference,
    torch_preprocess_dict_inference,
)

# 1. Load policy and transforms.
policy, input_transforms, output_transforms = load_pretrained_policy(
    "SberRoboticsCenter/GreenVLA-5b-stride-4-R1-fractal",
    data_config_name="fractal",
)
policy.to("cuda").eval()

# 2. Build an observation (replace with real sensor data).
raw_obs = {
    "observation/state": np.random.rand(8),  # x, y, z, rx, ry, rz, rw, gripper
    "observation/image": np.random.randint(256, size=(448, 448, 3), dtype=np.uint8),
    "prompt": "move the coke can to the left of the table",
}

# 3. Transform, preprocess, and batch.
obs = input_transforms(raw_obs)
obs = torch_preprocess_dict_inference(obs)
batch = move_dict_to_batch_for_inference(obs, device="cuda")

# 4. Predict actions and post-process.
with torch.inference_mode():
    raw_actions = policy.select_action(batch).cpu().numpy()

actions = output_transforms(
    {"actions": raw_actions, "state": batch["state"].cpu().numpy()}
)["actions"]
# actions shape: (action_horizon, 7) — [x, y, z, roll, pitch, yaw, gripper]

See examples/example_inference_fractal.py for the full runnable script with argument parsing.
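
To make the output layout concrete, the sketch below splits each row of an action chunk into named parts, following the [x, y, z, roll, pitch, yaw, gripper] order given in the inference snippet above. The helper names `split_action` and `split_chunk` and the grouping into position, rotation, and gripper fields are illustrative assumptions, not part of the GreenVLA codebase. Whether the values are deltas or absolute targets is not stated here and should be checked against the repository.

```python
from typing import Dict, List, Sequence

ACTION_DIM = 7  # [x, y, z, roll, pitch, yaw, gripper], per the snippet above


def split_action(action: Sequence[float]) -> Dict[str, object]:
    """Split one 7-dim action row into named parts (hypothetical helper)."""
    values = [float(v) for v in action]
    if len(values) != ACTION_DIM:
        raise ValueError(f"expected {ACTION_DIM} values, got {len(values)}")
    return {
        "position": values[0:3],  # x, y, z
        "rotation": values[3:6],  # roll, pitch, yaw
        "gripper": values[6],
    }


def split_chunk(actions: Sequence[Sequence[float]]) -> List[Dict[str, object]]:
    """Apply split_action to every step of an (action_horizon, 7) chunk."""
    return [split_action(row) for row in actions]


# Made-up stand-in for the `actions` array: two steps of a chunk.
example_chunk = [
    [0.01, 0.00, -0.02, 0.0, 0.1, 0.0, 1.0],
    [0.02, 0.01, -0.01, 0.0, 0.0, 0.0, 0.0],
]
steps = split_chunk(example_chunk)
for i, step in enumerate(steps):
    print(i, step["position"], step["rotation"], step["gripper"])
```

The helpers iterate row by row, so they accept a 2-D NumPy array as well as nested lists. An array with a leading batch dimension would need to be indexed first.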

Note: The Fractal embodiment uses an 8-dim proprioceptive state [x, y, z, rx, ry, rz, rw, gripper] and data_config_name="fractal". This differs from Bridge, which uses data_config_name="bridge" and a different state layout.
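
As a sketch of how the 8-dim state might be assembled from separate readings, the helper below concatenates a position, an orientation, and a gripper value in the order listed in the note. Reading rx, ry, rz, rw as a quaternion with the scalar component last is an assumption based only on the component names. The function name `build_fractal_state`, the units, and the gripper convention are likewise hypothetical and should be checked against the dataset.

```python
from typing import List, Sequence

STATE_DIM = 8  # [x, y, z, rx, ry, rz, rw, gripper], per the note above


def build_fractal_state(
    position: Sequence[float],
    quaternion_xyzw: Sequence[float],
    gripper: float,
) -> List[float]:
    """Assemble the 8-dim state in the documented order (hypothetical helper).

    Assumption: (rx, ry, rz, rw) is a scalar-last quaternion.
    """
    if len(position) != 3:
        raise ValueError("position must have 3 values (x, y, z)")
    if len(quaternion_xyzw) != 4:
        raise ValueError("quaternion must have 4 values (rx, ry, rz, rw)")
    state = (
        [float(v) for v in position]
        + [float(v) for v in quaternion_xyzw]
        + [float(gripper)]
    )
    assert len(state) == STATE_DIM
    return state


# Made-up example: identity orientation, gripper value 1.0.
state = build_fractal_state((0.5, 0.0, 0.3), (0.0, 0.0, 0.0, 1.0), 1.0)
print(state)
```

The resulting list can be wrapped with np.asarray(state) before being placed under the "observation/state" key of raw_obs.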

Citation

@misc{apanasevich2026greenvlastagedvisionlanguageactionmodel,
    title   = {Green-VLA: Staged Vision-Language-Action Model for Generalist Robots},
    author  = {I. Apanasevich and M. Artemyev and R. Babakyan and P. Fedotova and
               D. Grankin and E. Kupryashin and A. Misailidi and D. Nerus and
               A. Nutalapati and G. Sidorov and I. Efremov and M. Gerasyov and
               D. Pikurov and Y. Senchenko and S. Davidenko and D. Kulikov and
               M. Sultankin and K. Askarbek and O. Shamanin and D. Statovoy and
               E. Zalyaev and I. Zorin and A. Letkin and E. Rusakov and
               A. Silchenko and V. Vorobyov and S. Sobolnikov and A. Postnikov},
    year    = {2026},
    eprint  = {2602.00919},
    archivePrefix = {arXiv},
    primaryClass  = {cs.RO},
    url     = {https://arxiv.org/abs/2602.00919},
}

© 2026 Sber Robotics Center · Manipulation Team
