Update README.md
<div align="center">

# GreenVLA-5b-base-stride-4

### Staged Vision-Language-Action Model for Generalist Robots
## Overview

**GreenVLA-5b-base-stride-4** is a base checkpoint of the [Green-VLA](https://arxiv.org/abs/2602.00919) family: a ~5B-parameter Vision-Language-Action model pretrained on both general-domain and robotics data (3,000+ hours of demonstrations across multiple embodiments).

This is the **stride-4** variant: the action expert has **4× fewer transformer layers** than the VLM backbone, resulting in a lighter action head while retaining the full VLM capacity. For the variant with the same number of action-expert layers as the VLM, see [GreenVLA-5b-base-stride-1](https://huggingface.co/SberRoboticsCenter/GreenVLA-5b-base-stride-1).

This checkpoint combines:
|---|---|
| **VLM Backbone** | Qwen3-VL-4B-Instruct (vision encoder + language model) |
| **Action Expert** | Flow-matching transformer operating in a reduced hidden space |
| **Action Expert Depth** | 4× fewer layers than the VLM (stride 4) |
| **Action Tokenizer** | FAST tokenizer for autoregressive action prediction |
| **Total Parameters** | ~5B |
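The action expert generates continuous actions with flow matching: sampling starts from Gaussian noise and integrates a learned velocity field forward in time to produce an action chunk. A minimal numpy sketch of that sampling loop, with a hand-written toy velocity field standing in for the transformer (names like `sample_actions` are illustrative, not part of the released API):

```python
import numpy as np

def sample_actions(velocity_field, horizon=4, action_dim=7, num_steps=10, seed=0):
    """Euler-integrate a velocity field from t=0 (noise) to t=1 (an action chunk)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, action_dim))  # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_field(x, t)  # one Euler step along the flow
    return x

# Toy velocity field: a straight-line flow toward a fixed "action" target.
# In the real model this field is predicted by the action-expert transformer.
target = np.full((4, 7), 0.5)
toy_field = lambda x, t: (target - x) / max(1.0 - t, 1e-6)

actions = sample_actions(toy_field)
print(actions.shape)  # (4, 7)
```

With this straight-line field the integration lands exactly on the target chunk at t=1; the point is only the shape of the sampling loop, not the field itself.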
```python
# 1. Load policy and transforms.
policy, input_transforms, output_transforms = load_pretrained_policy(
    "SberRoboticsCenter/GreenVLA-5b-stride-4-R1-fractal",
    data_config_name="fractal",
)
policy.to("cuda").eval()

# 2. Build an observation (replace with real sensor data).
raw_obs = {
    "observation/state": np.random.rand(8),  # x, y, z, rx, ry, rz, rw, gripper
    "observation/image": np.random.randint(256, size=(448, 448, 3), dtype=np.uint8),
    "prompt": "move the coke can to the left of the table",
}

# 3. Transform, preprocess, and batch.
# ... (elided in this excerpt)

# actions shape: (action_horizon, 7) — [x, y, z, roll, pitch, yaw, gripper]
```

See [`examples/example_inference_fractal.py`](https://github.com/greenvla/GreenVLA/blob/main/examples/example_inference_fractal.py) for the full runnable script with argument parsing.
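Before wiring in real sensors, it helps to sanity-check that an observation dict matches the shapes and dtypes the example expects. A small self-contained checker; `validate_obs` is a hypothetical helper written for this sketch, not part of the released API:

```python
import numpy as np

def validate_obs(obs):
    """Check an observation dict against the shapes/dtypes used in the example above."""
    state = obs["observation/state"]
    image = obs["observation/image"]
    assert state.shape == (8,), f"state must be 8-dim (pose quaternion + gripper), got {state.shape}"
    assert image.shape == (448, 448, 3), f"image must be 448x448 RGB, got {image.shape}"
    assert image.dtype == np.uint8, f"image must be uint8, got {image.dtype}"
    assert isinstance(obs["prompt"], str) and obs["prompt"], "prompt must be a non-empty string"
    return True

obs = {
    "observation/state": np.random.rand(8),
    "observation/image": np.random.randint(256, size=(448, 448, 3), dtype=np.uint8),
    "prompt": "move the coke can to the left of the table",
}
print(validate_obs(obs))  # True
```

Running a check like this before `input_transforms` turns silent shape mismatches into immediate, readable errors.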

### VLM Inference (VQA, Pointing, BBox)

```python
from lerobot.common.policies.factory import load_pretrained_policy

# Load without data transforms
policy, _, _ = load_pretrained_policy(
    "SberRoboticsCenter/GreenVLA-5b-base-stride-4",
    data_config_name=None,
)
policy = policy.to("cuda").eval()

# ... (prompt construction and generation elided in this excerpt)

print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])
```
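In the full script, `generated_ids_trimmed` strips the prompt tokens from each generated sequence so `batch_decode` returns only the model's answer. The pattern in isolation, using toy token lists (no model or processor required):

```python
# Toy stand-ins for tokenized prompts and model outputs.
input_ids = [[101, 7592, 2088], [101, 2054, 2003, 2023]]                      # prompt tokens per sample
generated_ids = [[101, 7592, 2088, 345, 678], [101, 2054, 2003, 2023, 99]]    # prompt + new tokens

# Keep only the tokens generated after each prompt.
generated_ids_trimmed = [
    out[len(inp):] for inp, out in zip(input_ids, generated_ids)
]
print(generated_ids_trimmed)  # [[345, 678], [99]]
```

Without this trim, `batch_decode` would echo the prompt back along with the generated answer.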

## Citation

```bibtex
|