GreenVLA-5b-stride-4-R1-fractal

Embodiment-Adapted VLA for Fractal (Google Robot)

Sber Robotics Center · Manipulation Team

Overview

GreenVLA-5b-stride-4-R1-fractal is the R1 (embodiment-adapted) checkpoint of the Green-VLA family, fine-tuned on the Fractal dataset for the Google Robot.

Starting from the GreenVLA-5b-base-stride-4 pretrained checkpoint, this model was adapted to the Fractal embodiment via supervised fine-tuning (the R1 stage). It reaches a 71.8% overall average success rate on the SimplerEnv benchmark; see Evaluation for the per-task breakdown.

Evaluation

Evaluated on the SimplerEnv Google Robot (Fractal) benchmark with the default episode length:

Visual Matching

Task	Success Rate
Coke Can	85.7%
Move Near	75.8%
Drawer	64.8%
Apple in Drawer	81.5%
Average	77.0%

Variant Aggregation

Task	Success Rate
Coke Can	92.6%
Move Near	71.9%
Drawer	35.7%
Apple in Drawer	66.7%
Average	66.7%

Overall Average: 71.8%
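
The averages above can be re-derived from the per-task rows. The sketch below assumes each setting's average is an unweighted mean over its four tasks and that the overall figure is the unweighted mean of the two setting averages; the model card does not state the aggregation rule, but this assumption reproduces the reported numbers to within rounding. The variable names are illustrative only.

```python
from statistics import mean

# Per-task success rates (%) copied from the two tables above.
visual_matching = {
    "Coke Can": 85.7,
    "Move Near": 75.8,
    "Drawer": 64.8,
    "Apple in Drawer": 81.5,
}
variant_aggregation = {
    "Coke Can": 92.6,
    "Move Near": 71.9,
    "Drawer": 35.7,
    "Apple in Drawer": 66.7,
}

# Assumption: unweighted mean over tasks, then over the two settings.
vm_average = mean(visual_matching.values())
va_average = mean(variant_aggregation.values())
overall_average = mean([vm_average, va_average])

print(f"Visual Matching average:     {vm_average:.3f}")
print(f"Variant Aggregation average: {va_average:.3f}")
print(f"Overall average:             {overall_average:.3f}")
```

The largest gap between the two settings is on the Drawer task (64.8% vs. 35.7%), which accounts for most of the difference between the two averages.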

Training

Field	Details
Base checkpoint	GreenVLA-5b-base-stride-4
Stage	R1 — Embodiment-specific adaptation
Method	Supervised fine-tuning
Dataset	IPEC-COMMUNITY/fractal20220817_data_lerobot
Robot	Google Robot (Fractal)
Parameters	~5B

Quick Start

Installation

git clone https://github.com/greenvla/GreenVLA.git
cd GreenVLA
uv sync  # or: pip install -e .

Inference

import numpy as np
import torch
from lerobot.common.policies.factory import load_pretrained_policy
from lerobot.common.utils.torch_observation import (
    move_dict_to_batch_for_inference,
    torch_preprocess_dict_inference,
)

# 1. Load policy and transforms.
policy, input_transforms, output_transforms = load_pretrained_policy(
    "SberRoboticsCenter/GreenVLA-5b-stride-4-R1-fractal",
    data_config_name="fractal",
)
policy.to("cuda").eval()

# 2. Build an observation (replace with real sensor data).
raw_obs = {
    "observation/state": np.random.rand(8),  # x, y, z, rx, ry, rz, rw, gripper
    "observation/image": np.random.randint(256, size=(448, 448, 3), dtype=np.uint8),
    "prompt": "move the coke can to the left of the table",
}

# 3. Transform, preprocess, and batch.
obs = input_transforms(raw_obs)
obs = torch_preprocess_dict_inference(obs)
batch = move_dict_to_batch_for_inference(obs, device="cuda")

# 4. Predict actions and post-process.
with torch.inference_mode():
    raw_actions = policy.select_action(batch).cpu().numpy()

actions = output_transforms(
    {"actions": raw_actions, "state": batch["state"].cpu().numpy()}
)["actions"]
# actions shape: (action_horizon, 7) — [x, y, z, roll, pitch, yaw, gripper]

See examples/example_inference_fractal.py for the full runnable script with argument parsing.
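
As a rough illustration of consuming the predicted chunk, the sketch below splits each 7-dim row into the components named in the shape comment above ([x, y, z, roll, pitch, yaw, gripper]). `FractalAction` and `split_action_chunk` are hypothetical helpers written for this example and are not part of the GreenVLA or LeRobot code; the demo values are placeholders, not model outputs. Units and whether the values are deltas or absolute targets are not stated here, so the sketch does not interpret them.

```python
from typing import Iterable, List, NamedTuple, Sequence, Tuple


class FractalAction(NamedTuple):
    """One step of the 7-dim action layout (hypothetical container)."""

    position: Tuple[float, float, float]  # x, y, z
    rotation: Tuple[float, float, float]  # roll, pitch, yaw
    gripper: float


def split_action_chunk(actions: Iterable[Sequence[float]]) -> List[FractalAction]:
    """Split an (action_horizon, 7) chunk into named per-step components."""
    steps = []
    for index, row in enumerate(actions):
        values = [float(v) for v in row]
        if len(values) != 7:
            raise ValueError(f"step {index}: expected 7 values, got {len(values)}")
        x, y, z, roll, pitch, yaw, gripper = values
        steps.append(FractalAction((x, y, z), (roll, pitch, yaw), gripper))
    return steps


# Placeholder chunk standing in for the `actions` array (2 steps).
demo_chunk = [
    [0.01, 0.0, -0.02, 0.0, 0.1, 0.0, 1.0],
    [0.02, 0.0, -0.01, 0.0, 0.0, 0.0, 0.0],
]
demo_steps = split_action_chunk(demo_chunk)
for step in demo_steps:
    print(step.position, step.rotation, step.gripper)
```

Because the helper only iterates over rows and converts values with `float`, it accepts a NumPy array as well as nested lists.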

Note: The Fractal embodiment uses an 8-dim proprioceptive state [x, y, z, rx, ry, rz, rw, gripper] and data_config_name="fractal". This differs from Bridge, which uses data_config_name="bridge" and a different state layout.
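
To make the state ordering concrete, here is a minimal sketch that assembles the 8-dim vector in the order given in the note. `build_fractal_state` and `STATE_FIELDS` are hypothetical names introduced for this example. Treating rx, ry, rz, rw as a scalar-last quaternion is an assumption based on the field names, and the gripper value range is not specified in this card, so neither is validated here.

```python
from typing import List, Sequence

# Field order taken from the note above.
STATE_FIELDS = ("x", "y", "z", "rx", "ry", "rz", "rw", "gripper")


def build_fractal_state(
    position: Sequence[float],
    rotation: Sequence[float],
    gripper: float,
) -> List[float]:
    """Concatenate position (3), rotation (4), and gripper (1) into 8 floats."""
    if len(position) != 3:
        raise ValueError(f"position needs 3 values, got {len(position)}")
    if len(rotation) != 4:
        # Assumption: (rx, ry, rz, rw) is a quaternion with the scalar last.
        raise ValueError(f"rotation needs 4 values, got {len(rotation)}")
    state = [float(v) for v in position]
    state += [float(v) for v in rotation]
    state.append(float(gripper))
    return state


# Placeholder pose: identity rotation under the scalar-last assumption.
demo_state = build_fractal_state([0.3, 0.0, 0.5], [0.0, 0.0, 0.0, 1.0], 1.0)
print(dict(zip(STATE_FIELDS, demo_state)))
```

The resulting list can be wrapped with `np.asarray(...)` before being placed under the "observation/state" key used in the inference snippet.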

Citation

@misc{apanasevich2026greenvlastagedvisionlanguageactionmodel,
    title   = {Green-VLA: Staged Vision-Language-Action Model for Generalist Robots},
    author  = {I. Apanasevich and M. Artemyev and R. Babakyan and P. Fedotova and
               D. Grankin and E. Kupryashin and A. Misailidi and D. Nerus and
               A. Nutalapati and G. Sidorov and I. Efremov and M. Gerasyov and
               D. Pikurov and Y. Senchenko and S. Davidenko and D. Kulikov and
               M. Sultankin and K. Askarbek and O. Shamanin and D. Statovoy and
               E. Zalyaev and I. Zorin and A. Letkin and E. Rusakov and
               A. Silchenko and V. Vorobyov and S. Sobolnikov and A. Postnikov},
    year    = {2026},
    eprint  = {2602.00919},
    archivePrefix = {arXiv},
    primaryClass  = {cs.RO},
    url     = {https://arxiv.org/abs/2602.00919},
}