---
library_name: lerobot
license: apache-2.0
language:
- en
base_model:
- SberRoboticsCenter/Qwen3-VL-2B-Instruct-action
pipeline_tag: robotics
tags:
- robotics
- vla
- vision-language-action
- manipulation
- flow-matching
- action-prediction
- green-vla
datasets:
- bridge
- fractal
---

<div align="center">

# GreenVLA-2b-base

### Staged Vision-Language-Action Model for Generalist Robots

**Sber Robotics Center &middot; Manipulation Team**

[![arXiv](https://img.shields.io/badge/arXiv-2602.00919-b31b1b?style=for-the-badge&logo=arxiv&logoColor=white)](https://arxiv.org/abs/2602.00919)
[![Project Page](https://img.shields.io/badge/Project-Page-blue?style=for-the-badge&logo=github&logoColor=white)](https://greenvla.github.io/)
[![Code](https://img.shields.io/badge/Code-GitHub-181717?style=for-the-badge&logo=github&logoColor=white)](https://github.com/greenvla/GreenVLA)

</div>

---

## Overview

**GreenVLA-2b-base** is the lightweight base checkpoint of the [Green-VLA](https://arxiv.org/abs/2602.00919) family: a ~2B-parameter Vision-Language-Action model pretrained on both general-domain and robotics data (3,000+ hours of demonstrations across multiple embodiments).

This checkpoint combines:

- **VLM capabilities** – Visual Question Answering, object pointing, bounding box prediction, and scene description.
- **Autoregressive action prediction** – FAST token-based action generation for discrete control.
- **Flow-matching action expert** – A continuous action head for smooth, high-frequency trajectory generation.

Use this checkpoint when you need a **smaller model footprint** for fine-tuning or deployment on resource-constrained hardware. For best performance, consider [GreenVLA-5b-base](https://huggingface.co/SberRoboticsCenter/GreenVLA-5b-base).
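
A flow-matching action head denoises a Gaussian sample into a continuous action chunk by integrating a learned velocity field from `t = 0` to `t = 1`. A minimal sketch of that sampling loop, with a toy stand-in for the learned network (`velocity_fn`, the shapes, and the step count here are illustrative assumptions, not the actual Green-VLA interface):

```python
import numpy as np

def sample_actions(velocity_fn, horizon=8, action_dim=7, num_steps=10, seed=0):
    """Euler-integrate a velocity field from t=0 (noise) to t=1 (actions).
    Conditioning on images and language is omitted for brevity."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, action_dim))  # start from Gaussian noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step / num_steps
        x = x + dt * velocity_fn(x, t)  # one Euler step along the flow
    return x

# Toy velocity field whose flow transports every sample to zero by t=1
# (a stand-in for the learned transformer head).
def toy_velocity(x, t):
    return -x / max(1.0 - t, 1e-6)

chunk = sample_actions(toy_velocity)  # shape (8, 7)
```

More integration steps give a finer approximation of the flow at the cost of more forward passes per action chunk.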

## Architecture

| Component | Details |
|---|---|
| **VLM Backbone** | Qwen3-VL-2B-Instruct (vision encoder + language model) |
| **Action Expert** | Flow-matching transformer operating in a reduced hidden space |
| **Action Tokenizer** | FAST tokenizer for autoregressive action prediction |
| **Total Parameters** | ~2B |
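
For the autoregressive path, continuous actions must be mapped to discrete token ids. As a simplified illustration of that idea only (uniform per-dimension binning; the real FAST tokenizer additionally compresses action chunks before discretization, so this is not its actual algorithm):

```python
import numpy as np

def tokenize_actions(actions, num_bins=256, low=-1.0, high=1.0):
    """Map continuous actions in [low, high] to integer token ids
    via uniform binning (illustrative; FAST itself works differently)."""
    clipped = np.clip(actions, low, high)
    ids = np.floor((clipped - low) / (high - low) * num_bins).astype(np.int64)
    return np.clip(ids, 0, num_bins - 1)

def detokenize_actions(ids, num_bins=256, low=-1.0, high=1.0):
    """Invert the binning by mapping each token id to its bin center."""
    return low + (ids + 0.5) / num_bins * (high - low)
```

The round-trip error of such a scheme is bounded by half a bin width, which is why compression-based tokenizers like FAST can spend their token budget more efficiently on high-frequency action chunks.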

## Training Curriculum

This checkpoint corresponds to the **Base** stage of the Green-VLA curriculum:

| Stage | Name | Status |
|:---:|---|:---:|
| **L0** | Foundational VLM pretraining | ✓ |
| **L1** | Multimodal grounding (VQA, pointing, bbox) | ✓ |
| **R0** | Multi-embodiment robotics pretraining | ✓ |
| R1 | Embodiment-specific adaptation | – |
| R2 | RL policy alignment | – |

## Quick Start

### Installation

```bash
git clone https://github.com/greenvla/GreenVLA.git
cd GreenVLA
uv sync  # or: pip install -e .
```

### Action Inference

```python
import numpy as np
import torch
from lerobot.common.policies.factory import load_pretrained_policy
from lerobot.common.utils.torch_observation import (
    move_dict_to_batch_for_inference,
    torch_preprocess_dict_inference,
)

# 1. Load policy and transforms.
policy, input_transforms, output_transforms = load_pretrained_policy(
    "SberRoboticsCenter/GreenVLA-2b-base",
    data_config_name="bridge",
)
policy.to("cuda").eval()

# 2. Build an observation (replace with real sensor data).
raw_obs = {
    "observation/state": np.random.rand(8).astype(np.float32),  # x y z roll pitch yaw _pad_ gripper
    "observation/image": np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8),
    "prompt": "pick up the green block and place it on the plate",
}

# 3. Transform, preprocess, and batch.
obs = input_transforms(raw_obs)
obs = torch_preprocess_dict_inference(obs)
batch = move_dict_to_batch_for_inference(obs, device="cuda")

# 4. Predict actions and post-process.
with torch.inference_mode():
    raw_actions = policy.select_action(batch).cpu().numpy()

actions = output_transforms(
    {"actions": raw_actions, "state": batch["state"].cpu().numpy()}
)["actions"]
# actions shape: (action_horizon, 7) – [x, y, z, roll, pitch, yaw, gripper]
```

See [`examples/example_inference_bridge.py`](https://github.com/greenvla/GreenVLA/blob/main/examples/example_inference_bridge.py) for the full runnable script with argument parsing.
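
On a real robot, predicted action chunks are typically consumed in a receding-horizon loop: execute only the first few actions of each chunk, then re-query the policy from a fresh observation. A sketch of such a loop, where `get_observation`, `send_action`, and `predict_chunk` are hypothetical placeholders for your robot I/O and the `select_action` call above:

```python
import numpy as np

def run_episode(get_observation, send_action, predict_chunk,
                max_steps=100, execute_horizon=4):
    """Receding-horizon control loop (placeholder callables, not the
    Green-VLA API): predict a chunk, execute a prefix, re-plan."""
    executed = 0
    while executed < max_steps:
        obs = get_observation()
        chunk = predict_chunk(obs)            # shape: (action_horizon, 7)
        for action in chunk[:execute_horizon]:
            send_action(action)
            executed += 1
            if executed >= max_steps:
                break
    return executed
```

Executing only a prefix of each chunk trades inference compute for reactivity; tune `execute_horizon` to your control rate and the model's action horizon.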

### VLM Inference (VQA, Pointing, BBox)

The base model retains full VLM capabilities:

```python
from PIL import Image
from lerobot.common.policies.factory import load_pretrained_policy

# Load without data transforms.
policy, _, _ = load_pretrained_policy(
    "SberRoboticsCenter/GreenVLA-2b-base",
    data_config_name=None,
)
policy = policy.to("cuda").eval()

# Access the processor and model directly.
processor = policy.model.processor
image = Image.open("scene.jpg")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe what the robot should do next."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=False,
    return_dict=True, return_tensors="pt",
    padding_side="left", padding="max_length", max_length=256,
    images_kwargs={"do_resize": True},
).to("cuda")

generated_ids = policy.model.model.generate(
    **inputs, max_new_tokens=256, do_sample=False, use_cache=False,
)

generated_ids_trimmed = [
    out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])
```

## Model Family

| Model | Stage | Params | Description | Link |
|-------|:-----:|:------:|-------------|:----:|
| **GreenVLA-2b-base** | Base | 2B | Base pretrained (lightweight) | You are here |
| **GreenVLA-5b-base** | Base | 5B | Base pretrained (recommended) | [Hub](https://huggingface.co/SberRoboticsCenter/GreenVLA-5b-base) |
| **GreenVLA-5b-R1-bridge** | R1 | 5B | Fine-tuned on Bridge (WidowX) | [Hub](https://huggingface.co/SberRoboticsCenter/GreenVLA-5b-R1-bridge) |
| **GreenVLA-5b-R2-bridge** | R2 | 5B | RL-aligned on Bridge (WidowX) | [Hub](https://huggingface.co/SberRoboticsCenter/GreenVLA-5b-R2-bridge) |
| **GreenVLA-5b-R1-fractal** | R1 | 5B | Fine-tuned on Fractal (Google Robot) | [Hub](https://huggingface.co/SberRoboticsCenter/GreenVLA-5b-R1-fractal) |

## Citation

```bibtex
@misc{greenvla,
  title         = {Green-VLA: Staged Vision-Language-Action Model for Generalist Robots},
  author        = {I. Apanasevich and M. Artemyev and R. Babakyan and P. Fedotova and
                   D. Grankin and E. Kupryashin and A. Misailidi and D. Nerus and
                   A. Nutalapati and G. Sidorov and I. Efremov and M. Gerasyov and
                   D. Pikurov and Y. Senchenko and S. Davidenko and D. Kulikov and
                   M. Sultankin and K. Askarbek and O. Shamanin and D. Statovoy and
                   E. Zalyaev and I. Zorin and A. Letkin and E. Rusakov and
                   A. Silchenko and V. Vorobyov and S. Sobolnikov and A. Postnikov},
  year          = {2026},
  eprint        = {2602.00919},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2602.00919},
}
```

<div align="center">

&copy; 2026 Sber Robotics Center &middot; Manipulation Team

</div>