---
library_name: lerobot
license: apache-2.0
language:
- en
base_model:
- SberRoboticsCenter/Qwen3-VL-2B-Instruct-action
pipeline_tag: robotics
tags:
- robotics
- vla
- vision-language-action
- manipulation
- flow-matching
- action-prediction
- green-vla
datasets:
- bridge
- fractal
---

<div align="center">

# GreenVLA-2b-base

### Staged Vision-Language-Action Model for Generalist Robots

**Sber Robotics Center &middot; Manipulation Team**

[![arXiv](https://img.shields.io/badge/arXiv-2602.00919-b31b1b?style=for-the-badge&logo=arxiv&logoColor=white)](https://arxiv.org/abs/2602.00919)
[![Project Page](https://img.shields.io/badge/Project-Page-blue?style=for-the-badge&logo=github&logoColor=white)](https://greenvla.github.io/)
[![Code](https://img.shields.io/badge/Code-GitHub-181717?style=for-the-badge&logo=github&logoColor=white)](https://github.com/greenvla/GreenVLA)

</div>

---

## Overview

**GreenVLA-2b-base** is the lightweight base checkpoint of the [Green-VLA](https://arxiv.org/abs/2602.00919) family: a ~2B-parameter Vision-Language-Action model pretrained on both general-domain and robotics data (3,000+ hours of demonstrations across multiple embodiments).

This checkpoint combines:

- **VLM capabilities** – Visual Question Answering, object pointing, bounding box prediction, and scene description.
- **Autoregressive action prediction** – FAST token-based action generation for discrete control.
- **Flow-matching action expert** – A continuous action head for smooth, high-frequency trajectory generation.

Use this checkpoint when you need a **smaller model footprint** for fine-tuning or deployment on resource-constrained hardware. For best performance, consider [GreenVLA-5b-base](https://huggingface.co/SberRoboticsCenter/GreenVLA-5b-base).
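
A flow-matching action head denoises a Gaussian sample into a continuous action chunk by integrating a learned velocity field from `t = 0` to `t = 1`. A minimal sketch of that sampling loop, with a toy stand-in for the learned network (`velocity_fn`, the shapes, and the step count here are illustrative assumptions, not the actual Green-VLA interface):

```python
import numpy as np

def sample_actions(velocity_fn, horizon=8, action_dim=7, num_steps=10, seed=0):
    """Euler-integrate a velocity field from t=0 (noise) to t=1 (actions).
    Conditioning on images and language is omitted for brevity."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, action_dim))  # start from Gaussian noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step / num_steps
        x = x + dt * velocity_fn(x, t)  # one Euler step along the flow
    return x

# Toy velocity field whose flow transports every sample to zero by t=1
# (a stand-in for the learned transformer head).
def toy_velocity(x, t):
    return -x / max(1.0 - t, 1e-6)

chunk = sample_actions(toy_velocity)  # shape (8, 7)
```

More integration steps give a finer approximation of the flow at the cost of more forward passes per action chunk.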

## Architecture

| Component | Details |
|---|---|
| **VLM Backbone** | Qwen3-VL-2B-Instruct (vision encoder + language model) |
| **Action Expert** | Flow-matching transformer operating in a reduced hidden space |
| **Action Tokenizer** | FAST tokenizer for autoregressive action prediction |
| **Total Parameters** | ~2B |
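
For the autoregressive path, continuous actions must be mapped to discrete token ids. As a simplified illustration of that idea only (uniform per-dimension binning; the real FAST tokenizer additionally compresses action chunks before discretization, so this is not its actual algorithm):

```python
import numpy as np

def tokenize_actions(actions, num_bins=256, low=-1.0, high=1.0):
    """Map continuous actions in [low, high] to integer token ids
    via uniform binning (illustrative; FAST itself works differently)."""
    clipped = np.clip(actions, low, high)
    ids = np.floor((clipped - low) / (high - low) * num_bins).astype(np.int64)
    return np.clip(ids, 0, num_bins - 1)

def detokenize_actions(ids, num_bins=256, low=-1.0, high=1.0):
    """Invert the binning by mapping each token id to its bin center."""
    return low + (ids + 0.5) / num_bins * (high - low)
```

The round-trip error of such a scheme is bounded by half a bin width, which is why compression-based tokenizers like FAST can spend their token budget more efficiently on high-frequency action chunks.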

## Training Curriculum

This checkpoint corresponds to the **Base** stage of the Green-VLA curriculum:

| Stage | Name | Status |
|:---:|---|:---:|
| **L0** | Foundational VLM pretraining | ✓ |
| **L1** | Multimodal grounding (VQA, pointing, bbox) | ✓ |
| **R0** | Multi-embodiment robotics pretraining | ✓ |
| R1 | Embodiment-specific adaptation | – |
| R2 | RL policy alignment | – |

## Quick Start

### Installation

```bash
git clone https://github.com/greenvla/GreenVLA.git
cd GreenVLA
uv sync  # or: pip install -e .
```

### Action Inference

```python
import numpy as np
import torch
from lerobot.common.policies.factory import load_pretrained_policy
from lerobot.common.utils.torch_observation import (
    move_dict_to_batch_for_inference,
    torch_preprocess_dict_inference,
)

# 1. Load policy and transforms.
policy, input_transforms, output_transforms = load_pretrained_policy(
    "SberRoboticsCenter/GreenVLA-2b-base",
    data_config_name="bridge",
)
policy.to("cuda").eval()

# 2. Build an observation (replace with real sensor data).
raw_obs = {
    "observation/state": np.random.rand(8).astype(np.float32),  # x y z roll pitch yaw _pad_ gripper
    "observation/image": np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8),
    "prompt": "pick up the green block and place it on the plate",
}

# 3. Transform, preprocess, and batch.
obs = input_transforms(raw_obs)
obs = torch_preprocess_dict_inference(obs)
batch = move_dict_to_batch_for_inference(obs, device="cuda")

# 4. Predict actions and post-process.
with torch.inference_mode():
    raw_actions = policy.select_action(batch).cpu().numpy()

actions = output_transforms(
    {"actions": raw_actions, "state": batch["state"].cpu().numpy()}
)["actions"]
# actions shape: (action_horizon, 7) – [x, y, z, roll, pitch, yaw, gripper]
```

See [`examples/example_inference_bridge.py`](https://github.com/greenvla/GreenVLA/blob/main/examples/example_inference_bridge.py) for the full runnable script with argument parsing.
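
On a real robot, predicted action chunks are typically consumed in a receding-horizon loop: execute only the first few actions of each chunk, then re-query the policy from a fresh observation. A sketch of such a loop, where `get_observation`, `send_action`, and `predict_chunk` are hypothetical placeholders for your robot I/O and the `select_action` call above:

```python
import numpy as np

def run_episode(get_observation, send_action, predict_chunk,
                max_steps=100, execute_horizon=4):
    """Receding-horizon control loop (placeholder callables, not the
    Green-VLA API): predict a chunk, execute a prefix, re-plan."""
    executed = 0
    while executed < max_steps:
        obs = get_observation()
        chunk = predict_chunk(obs)            # shape: (action_horizon, 7)
        for action in chunk[:execute_horizon]:
            send_action(action)
            executed += 1
            if executed >= max_steps:
                break
    return executed
```

Executing only a prefix of each chunk trades inference compute for reactivity; tune `execute_horizon` to your control rate and the model's action horizon.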

### VLM Inference (VQA, Pointing, BBox)

The base model retains full VLM capabilities:

```python
from PIL import Image
from lerobot.common.policies.factory import load_pretrained_policy

# Load without data transforms.
policy, _, _ = load_pretrained_policy(
    "SberRoboticsCenter/GreenVLA-2b-base",
    data_config_name=None,
)
policy = policy.to("cuda").eval()

# Access the processor and model directly.
processor = policy.model.processor
image = Image.open("scene.jpg")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe what the robot should do next."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=False,
    return_dict=True, return_tensors="pt",
    padding_side="left", padding="max_length", max_length=256,
    images_kwargs={"do_resize": True},
).to("cuda")

generated_ids = policy.model.model.generate(
    **inputs, max_new_tokens=256, do_sample=False, use_cache=False,
)

generated_ids_trimmed = [
    out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])
```

## Model Family

| Model | Stage | Params | Description | Link |
|-------|:-----:|:------:|-------------|:----:|
| **GreenVLA-2b-base** | Base | 2B | Base pretrained (lightweight) | You are here |
| **GreenVLA-5b-base** | Base | 5B | Base pretrained (recommended) | [Hub](https://huggingface.co/SberRoboticsCenter/GreenVLA-5b-base) |
| **GreenVLA-5b-R1-bridge** | R1 | 5B | Fine-tuned on Bridge (WidowX) | [Hub](https://huggingface.co/SberRoboticsCenter/GreenVLA-5b-R1-bridge) |
| **GreenVLA-5b-R2-bridge** | R2 | 5B | RL-aligned on Bridge (WidowX) | [Hub](https://huggingface.co/SberRoboticsCenter/GreenVLA-5b-R2-bridge) |
| **GreenVLA-5b-R1-fractal** | R1 | 5B | Fine-tuned on Fractal (Google Robot) | [Hub](https://huggingface.co/SberRoboticsCenter/GreenVLA-5b-R1-fractal) |

## Citation

```bibtex
@misc{greenvla,
  title         = {Green-VLA: Staged Vision-Language-Action Model for Generalist Robots},
  author        = {I. Apanasevich and M. Artemyev and R. Babakyan and P. Fedotova and
                   D. Grankin and E. Kupryashin and A. Misailidi and D. Nerus and
                   A. Nutalapati and G. Sidorov and I. Efremov and M. Gerasyov and
                   D. Pikurov and Y. Senchenko and S. Davidenko and D. Kulikov and
                   M. Sultankin and K. Askarbek and O. Shamanin and D. Statovoy and
                   E. Zalyaev and I. Zorin and A. Letkin and E. Rusakov and
                   A. Silchenko and V. Vorobyov and S. Sobolnikov and A. Postnikov},
  year          = {2026},
  eprint        = {2602.00919},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2602.00919},
}
```

<div align="center">

&copy; 2026 Sber Robotics Center &middot; Manipulation Team

</div>