---
library_name: lerobot
license: apache-2.0
language:
- en
base_model:
- SberRoboticsCenter/Qwen3-VL-4B-Instruct-action
pipeline_tag: robotics
tags:
- robotics
- vla
- vision-language-action
- manipulation
- flow-matching
- action-prediction
- green-vla
datasets:
- bridge
- fractal
---

<div align="center">

# GreenVLA-5b-base-stride-1

### Staged Vision-Language-Action Model for Generalist Robots

**Sber Robotics Center · Manipulation Team**

[Paper (arXiv:2602.00919)](https://arxiv.org/abs/2602.00919)
[Project page](https://greenvla.github.io/)
[Code](https://github.com/greenvla/GreenVLA)

</div>

---

## Overview

**GreenVLA-5b-base-stride-1** is the recommended base checkpoint of the [Green-VLA](https://arxiv.org/abs/2602.00919) family: a ~5B-parameter Vision-Language-Action (VLA) model pretrained on both general-domain and robotics data (3,000+ hours of demonstrations across multiple embodiments).

This is the **stride-1** variant: the action expert has the **same number of transformer layers** as the VLM backbone, giving it the maximum action-prediction capacity. For a lighter-weight alternative with 4× fewer action-expert layers, see [GreenVLA-5b-base-stride-4](https://huggingface.co/SberRoboticsCenter/GreenVLA-5b-base-stride-4).

This checkpoint combines:

- **VLM capabilities**: Visual Question Answering (VQA), object pointing, bounding-box prediction, and scene description, inherited from the [Qwen3-VL-4B](https://huggingface.co/SberRoboticsCenter/Qwen3-VL-4B-Instruct-action) backbone.
- **Autoregressive action prediction**: FAST token-based action generation for discrete control.
- **Flow-matching action expert**: a continuous action head for smooth, high-frequency trajectory generation.

Use this checkpoint as the starting point for **fine-tuning on your own embodiment** (the R1 stage) or for zero-shot VLM inference.

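
As a rough illustration of how a flow-matching action head turns noise into a continuous action, the sketch below integrates a velocity field with plain Euler steps. It is a generic, minimal sampler and not the Green-VLA implementation: `sample_with_euler`, `toy_velocity`, and `TARGET_ACTION` are hypothetical names, the toy velocity stands in for a learned network, and the convention that `t=0` is noise and `t=1` is the action is an assumption.

```python
import random
from typing import Callable, List

Vector = List[float]


def sample_with_euler(
    velocity: Callable[[Vector, float], Vector],
    noise: Vector,
    num_steps: int = 10,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 (action) with Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for a learned velocity network: it pushes every sample along a
# straight line toward one fixed 7-D action (hypothetical values).
TARGET_ACTION: Vector = [0.10, -0.05, 0.20, 0.0, 0.0, 0.3, 1.0]


def toy_velocity(x: Vector, t: float) -> Vector:
    """Velocity of the straight-line path from the current point to TARGET_ACTION."""
    return [(target - xi) / (1.0 - t) for target, xi in zip(TARGET_ACTION, x)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in TARGET_ACTION]
action = sample_with_euler(toy_velocity, noise, num_steps=10)
print([round(a, 3) for a in action])
```

In a real policy the velocity would come from the action expert conditioned on the observation and prompt, and the number of integration steps trades latency against accuracy.
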
## Architecture

| Component | Details |
|---|---|
| **VLM Backbone** | Qwen3-VL-4B-Instruct (vision encoder + language model) |
| **Action Expert** | Flow-matching transformer operating in a reduced hidden space |
| **Action Expert Depth** | Same number of layers as the VLM (stride 1) |
| **Action Tokenizer** | FAST tokenizer for autoregressive action prediction |
| **Total Parameters** | ~5B |

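
To make the stride naming concrete, the sketch below assumes that a stride of `s` means the action expert has one layer for every `s` VLM layers, which is consistent with "same number of layers" for stride 1 and "4× fewer layers" for stride 4. The function name and the 36-layer backbone depth are illustrative assumptions, not values read from the released config.

```python
import math


def action_expert_depth(num_vlm_layers: int, stride: int) -> int:
    """Number of action-expert layers, assuming one expert layer per `stride` VLM layers."""
    if num_vlm_layers <= 0 or stride <= 0:
        raise ValueError("num_vlm_layers and stride must be positive")
    return math.ceil(num_vlm_layers / stride)


# Hypothetical backbone depth, used only for illustration.
NUM_VLM_LAYERS = 36

for stride in (1, 4):
    depth = action_expert_depth(NUM_VLM_LAYERS, stride)
    print(f"stride {stride}: {depth} action-expert layers")
```
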
## Training Curriculum

This checkpoint corresponds to the **Base** stage of the Green-VLA curriculum. Stages marked ✓ are already part of it; R1 and R2 are later stages that have not been applied:

| Stage | Name | Status |
|:---:|---|:---:|
| **L0** | Foundational VLM pretraining | ✓ |
| **L1** | Multimodal grounding (VQA, pointing, bbox) | ✓ |
| **R0** | Multi-embodiment robotics pretraining | ✓ |
| R1 | Embodiment-specific adaptation | not included |
| R2 | RL policy alignment | not included |

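
The same table can be kept as a small data structure, for example to record which stages a fine-tuning pipeline still has to run. This is only a sketch: the `Stage` class and `remaining_stages` helper are hypothetical and are not part of the GreenVLA codebase.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    code: str
    name: str
    included: bool  # True if the stage is already part of this base checkpoint


CURRICULUM: List[Stage] = [
    Stage("L0", "Foundational VLM pretraining", True),
    Stage("L1", "Multimodal grounding (VQA, pointing, bbox)", True),
    Stage("R0", "Multi-embodiment robotics pretraining", True),
    Stage("R1", "Embodiment-specific adaptation", False),
    Stage("R2", "RL policy alignment", False),
]


def remaining_stages(curriculum: List[Stage]) -> List[str]:
    """Return the codes of stages not yet applied, in curriculum order."""
    return [stage.code for stage in curriculum if not stage.included]


print(remaining_stages(CURRICULUM))  # ['R1', 'R2']
```
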
## Quick Start

### Installation

```bash
git clone https://github.com/greenvla/GreenVLA.git
cd GreenVLA
uv sync  # or: pip install -e .
```

### Action Inference

The example below loads the `GreenVLA-5b-stride-1-R1-bridge` checkpoint with the `bridge` data config, rather than the base checkpoint described in this card.

```python
import numpy as np
import torch
from lerobot.common.policies.factory import load_pretrained_policy
from lerobot.common.utils.torch_observation import (
    move_dict_to_batch_for_inference,
    torch_preprocess_dict_inference,
)

# 1. Load policy and transforms.
policy, input_transforms, output_transforms = load_pretrained_policy(
    "SberRoboticsCenter/GreenVLA-5b-stride-1-R1-bridge",
    data_config_name="bridge",
)
policy.to("cuda").eval()

# 2. Build an observation (replace with real sensor data).
raw_obs = {
    "observation/state": np.random.rand(8).astype(np.float32),  # x y z roll pitch yaw _pad_ gripper
    "observation/image": np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8),
    "prompt": "pick up the green block and place it on the plate",
}

# 3. Transform, preprocess, and batch.
obs = input_transforms(raw_obs)
obs = torch_preprocess_dict_inference(obs)
batch = move_dict_to_batch_for_inference(obs, device="cuda")

# 4. Predict actions and post-process.
with torch.inference_mode():
    raw_actions = policy.select_action(batch).cpu().numpy()

actions = output_transforms(
    {"actions": raw_actions, "state": batch["state"].cpu().numpy()}
)["actions"]
# actions shape: (action_horizon, 7), ordered [x, y, z, roll, pitch, yaw, gripper]
```

See [`examples/example_inference_bridge.py`](https://github.com/greenvla/GreenVLA/blob/main/examples/example_inference_bridge.py) for the full runnable script with argument parsing.

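
The policy returns a chunk of `action_horizon` actions per call. One common way to consume such a chunk, sketched here as an assumption and not as the Green-VLA control loop, is to execute only the first few actions and then query the policy again. `split_action`, `execute_chunk`, and the `send_command` callback are hypothetical names; the 7-D layout follows the `[x, y, z, roll, pitch, yaw, gripper]` ordering noted in the inference example.

```python
from typing import Callable, Dict, Sequence


def split_action(action: Sequence[float]) -> Dict[str, object]:
    """Split one 7-D action into position, rotation, and gripper parts."""
    if len(action) != 7:
        raise ValueError(f"expected 7 values, got {len(action)}")
    return {
        "position": [float(v) for v in action[0:3]],  # x, y, z
        "rotation": [float(v) for v in action[3:6]],  # roll, pitch, yaw
        "gripper": float(action[6]),
    }


def execute_chunk(
    actions: Sequence[Sequence[float]],
    send_command: Callable[[Dict[str, object]], None],
    replan_after: int = 4,
) -> int:
    """Send the first `replan_after` actions of a chunk; return how many were sent."""
    steps = min(replan_after, len(actions))
    for action in actions[:steps]:
        send_command(split_action(action))
    return steps


# Usage with a dummy chunk and a callback that just records the commands.
dummy_chunk = [[0.01 * i, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0] for i in range(10)]
sent = []
num_sent = execute_chunk(dummy_chunk, sent.append, replan_after=4)
print(num_sent, sent[-1]["position"])
```

On a real robot, `send_command` would wrap the controller interface, and `replan_after` would be tuned against the control frequency and inference latency.
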
### VLM Inference (VQA, Pointing, BBox)

The base model retains full VLM capabilities:

```python
from PIL import Image
from lerobot.common.policies.factory import load_pretrained_policy

# Load without data transforms.
policy, _, _ = load_pretrained_policy(
    "SberRoboticsCenter/GreenVLA-5b-base-stride-1",
    data_config_name=None,
)
policy = policy.to("cuda").eval()

# Access the processor and model directly.
processor = policy.model.processor
image = Image.open("scene.jpg")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe what the robot should do next."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=False,
    return_dict=True, return_tensors="pt",
    padding_side="left", padding="max_length", max_length=256,
    images_kwargs={"do_resize": True},
).to("cuda")

generated_ids = policy.model.model.generate(
    **inputs, max_new_tokens=256, do_sample=False, use_cache=False,
)

# Drop the prompt tokens so that only the newly generated tokens are decoded.
generated_ids_trimmed = [
    out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])
```

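
When asking several questions about the same image, the message structure can be factored into a small helper. This is a minimal sketch: `build_messages` is a hypothetical helper, the second question is an illustrative prompt, and the card does not specify the prompt formats expected for pointing or bounding-box queries.

```python
from typing import Any, Dict, List


def build_messages(image: Any, question: str) -> List[Dict[str, Any]]:
    """Wrap one image and one question in a single-turn chat-message layout."""
    if not question.strip():
        raise ValueError("question must not be empty")
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": question},
            ],
        }
    ]


# Placeholder image object; in practice this would be a PIL.Image instance.
image = "scene.jpg"
questions = [
    "Describe what the robot should do next.",
    "What objects are on the table?",  # hypothetical prompt
]
conversations = [build_messages(image, q) for q in questions]
print(len(conversations), conversations[1][0]["content"][1]["text"])
```

Each entry of `conversations` can then be passed to `processor.apply_chat_template` in place of `messages`.
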
## Citation

```bibtex
@misc{apanasevich2026greenvlastagedvisionlanguageactionmodel,
  title         = {Green-VLA: Staged Vision-Language-Action Model for Generalist Robots},
  author        = {I. Apanasevich and M. Artemyev and R. Babakyan and P. Fedotova and
                   D. Grankin and E. Kupryashin and A. Misailidi and D. Nerus and
                   A. Nutalapati and G. Sidorov and I. Efremov and M. Gerasyov and
                   D. Pikurov and Y. Senchenko and S. Davidenko and D. Kulikov and
                   M. Sultankin and K. Askarbek and O. Shamanin and D. Statovoy and
                   E. Zalyaev and I. Zorin and A. Letkin and E. Rusakov and
                   A. Silchenko and V. Vorobyov and S. Sobolnikov and A. Postnikov},
  year          = {2026},
  eprint        = {2602.00919},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2602.00919},
}
```

<div align="center">

© 2026 Sber Robotics Center · Manipulation Team

</div>