lakomchik committed
Commit 26f9fd3 · verified · 1 parent: 6dd194b

Update README.md

Files changed (1):
1. README.md (+12 −19)
README.md CHANGED
@@ -21,7 +21,7 @@ datasets:
 
 <div align="center">
 
-# GreenVLA-5b-base
+# GreenVLA-5b-base-stride-4
 
 ### Staged Vision-Language-Action Model for Generalist Robots
 
@@ -37,7 +37,9 @@ datasets:
 
 ## Overview
 
-**GreenVLA-5b-base** is the recommended base checkpoint of the [Green-VLA](https://arxiv.org/abs/2602.00919) family — a ~5B-parameter Vision-Language-Action model pretrained on both general-domain and robotics data (3,000+ hours of demonstrations across multiple embodiments).
+**GreenVLA-5b-base-stride-4** is a base checkpoint of the [Green-VLA](https://arxiv.org/abs/2602.00919) family — a ~5B-parameter Vision-Language-Action model pretrained on both general-domain and robotics data (3,000+ hours of demonstrations across multiple embodiments).
+
+This is the **stride-4** variant: the action expert has **4× fewer transformer layers** than the VLM backbone, resulting in a lighter action head while retaining the full VLM capacity. For the variant with the same number of action-expert layers as the VLM, see [GreenVLA-5b-base-stride-1](https://huggingface.co/SberRoboticsCenter/GreenVLA-5b-base-stride-1).
 
 This checkpoint combines:
 
@@ -53,6 +55,7 @@ Use this checkpoint as the starting point for **fine-tuning on your own embodime
 |---|---|
 | **VLM Backbone** | Qwen3-VL-4B-Instruct (vision encoder + language model) |
 | **Action Expert** | Flow-matching transformer operating in a reduced hidden space |
+| **Action Expert Depth** | 4× fewer layers than the VLM (stride 4) |
 | **Action Tokenizer** | FAST tokenizer for autoregressive action prediction |
 | **Total Parameters** | ~5B |
 
@@ -91,16 +94,16 @@ from lerobot.common.utils.torch_observation import (
 
 # 1. Load policy and transforms.
 policy, input_transforms, output_transforms = load_pretrained_policy(
-    "SberRoboticsCenter/GreenVLA-5b-R1-bridge",
-    data_config_name="bridge",
+    "SberRoboticsCenter/GreenVLA-5b-stride-4-R1-fractal",
+    data_config_name="fractal",
 )
 policy.to("cuda").eval()
 
 # 2. Build an observation (replace with real sensor data).
 raw_obs = {
-    "observation/state": np.random.rand(8).astype(np.float32),  # x y z roll pitch yaw _pad_ gripper
-    "observation/image": np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8),
-    "prompt": "pick up the green block and place it on the plate",
+    "observation/state": np.random.rand(8),  # x, y, z, rx, ry, rz, rw, gripper
+    "observation/image": np.random.randint(256, size=(448, 448, 3), dtype=np.uint8),
+    "prompt": "move the coke can to the left of the table",
 }
 
 # 3. Transform, preprocess, and batch.
@@ -118,7 +121,7 @@ actions = output_transforms(
 # actions shape: (action_horizon, 7) — [x, y, z, roll, pitch, yaw, gripper]
 ```
 
-See [`examples/example_inference_bridge.py`](https://github.com/greenvla/GreenVLA/blob/main/examples/example_inference_bridge.py) for the full runnable script with argument parsing.
+See [`examples/example_inference_fractal.py`](https://github.com/greenvla/GreenVLA/blob/main/examples/example_inference_fractal.py) for the full runnable script with argument parsing.
 
 ### VLM Inference (VQA, Pointing, BBox)
 
@@ -130,7 +133,7 @@ from lerobot.common.policies.factory import load_pretrained_policy
 
 # Load without data transforms
 policy, _, _ = load_pretrained_policy(
-    "SberRoboticsCenter/GreenVLA-5b-base",
+    "SberRoboticsCenter/GreenVLA-5b-base-stride-4",
     data_config_name=None,
 )
 policy = policy.to("cuda").eval()
@@ -166,16 +169,6 @@ generated_ids_trimmed = [
 print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])
 ```
 
-## Model Family
-
-| Model | Stage | Params | Description | Link |
-|-------|:-----:|:------:|-------------|:----:|
-| **GreenVLA-2b-base** | Base | 2B | Base pretrained (lightweight) | [Hub](https://huggingface.co/SberRoboticsCenter/GreenVLA-2b-base) |
-| **GreenVLA-5b-base** | Base | 5B | Base pretrained (recommended) | You are here |
-| **GreenVLA-5b-R1-bridge** | R1 | 5B | Fine-tuned on Bridge (WidowX) | [Hub](https://huggingface.co/SberRoboticsCenter/GreenVLA-5b-R1-bridge) |
-| **GreenVLA-5b-R2-bridge** | R2 | 5B | RL-aligned on Bridge (WidowX) | [Hub](https://huggingface.co/SberRoboticsCenter/GreenVLA-5b-R2-bridge) |
-| **GreenVLA-5b-R1-fractal** | R1 | 5B | Fine-tuned on Fractal (Google Robot) | [Hub](https://huggingface.co/SberRoboticsCenter/GreenVLA-5b-R1-fractal) |
-
 ## Citation
 
 ```bibtex
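
The "stride-4" naming introduced in this commit can be sketched numerically: the action expert keeps one transformer layer per `stride` layers of the VLM backbone. The layer counts below are illustrative assumptions for the sketch, not values read from the actual GreenVLA checkpoint config.

```python
# Illustrative sketch of the stride relationship between the VLM backbone
# and the action expert. Layer counts are assumed for illustration only.

def action_expert_depth(vlm_layers: int, stride: int) -> int:
    """Number of action-expert layers: one per `stride` VLM layers."""
    if vlm_layers % stride != 0:
        raise ValueError("VLM depth must be divisible by the stride")
    return vlm_layers // stride

# A hypothetical 36-layer VLM: stride 4 yields a 9-layer action expert,
# while stride 1 keeps the full 36 layers.
print(action_expert_depth(36, 4))  # -> 9
print(action_expert_depth(36, 1))  # -> 36
```

Under this reading, stride 4 trades action-head depth for speed while the VLM capacity is unchanged, which matches the lighter-action-head claim in the new Overview paragraph.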
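
The commit also changes the example state layout from Euler angles with a padding slot to a quaternion, `[x, y, z, rx, ry, rz, rw, gripper]`. A minimal sketch of building such an observation, assuming that layout; `make_dummy_obs` is a hypothetical helper, not part of the GreenVLA API:

```python
import numpy as np

def make_dummy_obs(prompt: str) -> dict:
    """Build a placeholder observation: 8-dim state + 448x448 RGB image."""
    quat = np.array([0.0, 0.0, 0.0, 1.0])               # identity rotation (rx, ry, rz, rw)
    state = np.concatenate([np.zeros(3), quat, [0.0]])  # position + quaternion + gripper
    return {
        "observation/state": state.astype(np.float32),
        "observation/image": np.random.randint(256, size=(448, 448, 3), dtype=np.uint8),
        "prompt": prompt,
    }

obs = make_dummy_obs("move the coke can to the left of the table")
assert obs["observation/state"].shape == (8,)
assert np.isclose(np.linalg.norm(obs["observation/state"][3:7]), 1.0)  # unit quaternion
```

Replacing the random arrays in the README example with real sensor readings in this layout should be the only change needed before the transform step.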