Update README.md
<div align="center">

# GreenVLA-5b-base-stride-4

### Staged Vision-Language-Action Model for Generalist Robots
## Overview

**GreenVLA-5b-base-stride-4** is a base checkpoint of the [Green-VLA](https://arxiv.org/abs/2602.00919) family: a ~5B-parameter Vision-Language-Action model pretrained on both general-domain and robotics data (3,000+ hours of demonstrations across multiple embodiments).

This is the **stride-4** variant: the action expert has **4× fewer transformer layers** than the VLM backbone, resulting in a lighter action head while retaining the full VLM capacity. For the variant with the same number of action-expert layers as the VLM, see [GreenVLA-5b-base-stride-1](https://huggingface.co/SberRoboticsCenter/GreenVLA-5b-base-stride-1).

This checkpoint combines:
|---|---|
| **VLM Backbone** | Qwen3-VL-4B-Instruct (vision encoder + language model) |
| **Action Expert** | Flow-matching transformer operating in a reduced hidden space |
| **Action Expert Depth** | 4× fewer layers than the VLM (stride 4) |
| **Action Tokenizer** | FAST tokenizer for autoregressive action prediction |
| **Total Parameters** | ~5B |
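The action expert generates continuous actions with flow matching: sampling starts from Gaussian noise and integrates a learned velocity field forward in time to produce an action chunk. A minimal numpy sketch of that sampling loop, with a hand-written toy velocity field standing in for the transformer (names like `sample_actions` are illustrative, not part of the released API):

```python
import numpy as np

def sample_actions(velocity_field, horizon=4, action_dim=7, num_steps=10, seed=0):
    """Euler-integrate a velocity field from t=0 (noise) to t=1 (an action chunk)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, action_dim))  # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_field(x, t)  # one Euler step along the flow
    return x

# Toy velocity field: a straight-line flow toward a fixed "action" target.
# In the real model this field is predicted by the action-expert transformer.
target = np.full((4, 7), 0.5)
toy_field = lambda x, t: (target - x) / max(1.0 - t, 1e-6)

actions = sample_actions(toy_field)
print(actions.shape)  # (4, 7)
```

With this straight-line field the integration lands exactly on the target chunk at t=1; the point is only the shape of the sampling loop, not the field itself.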
```python
# 1. Load policy and transforms.
policy, input_transforms, output_transforms = load_pretrained_policy(
    "SberRoboticsCenter/GreenVLA-5b-stride-4-R1-fractal",
    data_config_name="fractal",
)
policy.to("cuda").eval()

# 2. Build an observation (replace with real sensor data).
raw_obs = {
    "observation/state": np.random.rand(8),  # x, y, z, rx, ry, rz, rw, gripper
    "observation/image": np.random.randint(256, size=(448, 448, 3), dtype=np.uint8),
    "prompt": "move the coke can to the left of the table",
}

# 3. Transform, preprocess, and batch.
# ... (elided in this excerpt)

# actions shape: (action_horizon, 7) — [x, y, z, roll, pitch, yaw, gripper]
```

See [`examples/example_inference_fractal.py`](https://github.com/greenvla/GreenVLA/blob/main/examples/example_inference_fractal.py) for the full runnable script with argument parsing.
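Before wiring in real sensors, it helps to sanity-check that an observation dict matches the shapes and dtypes the example expects. A small self-contained checker; `validate_obs` is a hypothetical helper written for this sketch, not part of the released API:

```python
import numpy as np

def validate_obs(obs):
    """Check an observation dict against the shapes/dtypes used in the example above."""
    state = obs["observation/state"]
    image = obs["observation/image"]
    assert state.shape == (8,), f"state must be 8-dim (pose quaternion + gripper), got {state.shape}"
    assert image.shape == (448, 448, 3), f"image must be 448x448 RGB, got {image.shape}"
    assert image.dtype == np.uint8, f"image must be uint8, got {image.dtype}"
    assert isinstance(obs["prompt"], str) and obs["prompt"], "prompt must be a non-empty string"
    return True

obs = {
    "observation/state": np.random.rand(8),
    "observation/image": np.random.randint(256, size=(448, 448, 3), dtype=np.uint8),
    "prompt": "move the coke can to the left of the table",
}
print(validate_obs(obs))  # True
```

Running a check like this before `input_transforms` turns silent shape mismatches into immediate, readable errors.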

### VLM Inference (VQA, Pointing, BBox)

```python
from lerobot.common.policies.factory import load_pretrained_policy

# Load without data transforms
policy, _, _ = load_pretrained_policy(
    "SberRoboticsCenter/GreenVLA-5b-base-stride-4",
    data_config_name=None,
)
policy = policy.to("cuda").eval()

# ... (prompt construction and generation elided in this excerpt)

print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])
```
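In the full script, `generated_ids_trimmed` strips the prompt tokens from each generated sequence so `batch_decode` returns only the model's answer. The pattern in isolation, using toy token lists (no model or processor required):

```python
# Toy stand-ins for tokenized prompts and model outputs.
input_ids = [[101, 7592, 2088], [101, 2054, 2003, 2023]]                      # prompt tokens per sample
generated_ids = [[101, 7592, 2088, 345, 678], [101, 2054, 2003, 2023, 99]]    # prompt + new tokens

# Keep only the tokens generated after each prompt.
generated_ids_trimmed = [
    out[len(inp):] for inp, out in zip(input_ids, generated_ids)
]
print(generated_ids_trimmed)  # [[345, 678], [99]]
```

Without this trim, `batch_decode` would echo the prompt back along with the generated answer.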

## Citation

```bibtex
|