VLA-0 (3B) — Bridge-v2 INT-ACT-aligned finetune

This repo contains intermediate and final checkpoints for a VLA-0 policy finetuned from Qwen/Qwen2.5-VL-3B-Instruct on the Bridge-v2 dataset (jesbu1/bridge_v2_lerobot), with filtering chosen to mirror the INT-ACT / OpenVLA / Octo baseline (skip_unlabeled=True, single 3rd-person camera, horizon = 4).

The intent is for these checkpoints to be drop-in usable in RLinf SIMPLER / RL4VLA evaluation against INT-ACT and other Bridge-v2-only baselines.

Training summary


Backbone	`Qwen/Qwen2.5-VL-3B-Instruct` (3B)
Method	Full finetune (no LoRA), bf16 + AMP
Dataset	Bridge-v2 LeRobot port (`jesbu1/bridge_v2_lerobot`), 53,192 episodes
Filter	`skip_empty_language=True` → 38,660 labeled episodes (~73%)
Camera	Single 3rd-person view (`observation.images.image_0` → `3p1`)
Image size	224 × 224
Action space	7-DoF, discretized to 1000 bins per dim
Horizon	4 future steps (28 action tokens per sample)
Effective batch	16 / GPU × 7 GPUs = 112
Optimizer	AdamW, lr = 1e-5 (× 7 GPUs → 7e-5 effective), no scheduler
`action_mask_aug_per`	0.4
Iterations	18,000 (≈ 1.4 epochs over the labeled subset)
Hardware	7 × NVIDIA H200
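
The "discretized to 1000 bins per dim" row can be illustrated with a minimal sketch. It assumes a linear min/max normalization per dimension followed by rounding to an integer in [0, n_bins]; the function name `discretize`, the clipping, and the exact top index (n_bins vs. n_bins - 1) are assumptions for illustration, and the VLA-0 training code remains the authority on the real convention.

```python
def discretize(action, lo, hi, n_bins=1000):
    """Map each continuous action dim to an integer bin (assumed linear convention)."""
    bins = []
    for a, lo_i, hi_i in zip(action, lo, hi):
        frac = (a - lo_i) / (hi_i - lo_i)   # position of the value inside [lo, hi]
        frac = min(max(frac, 0.0), 1.0)     # clip values that fall outside the stats range
        bins.append(round(frac * n_bins))   # integer written out as an action token
    return bins


# Placeholder stats for three dims; real values come from dataset_stats.pkl.
example_bins = discretize([0.0, 1.0, 0.5], lo=[0.0, 0.0, 0.0], hi=[1.0, 1.0, 1.0])
print(example_bins)
```

With horizon = 4 and 7 dims, one training sample would carry four such 7-value rows.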

Final per-epoch cumulative cross-entropy on the training set: ~0.94. For reference, a uniform guess over 1000 bins would give ln(1000) ≈ 6.91.
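
The figures quoted in the training summary can be re-derived with a few lines of arithmetic. This is only a consistency check on numbers already stated above; the variable names are illustrative.

```python
import math

per_gpu_batch, num_gpus = 16, 7
effective_batch = per_gpu_batch * num_gpus      # 112 samples per optimizer step

labeled_fraction = 38_660 / 53_192              # share of episodes with a language label

horizon, dof = 4, 7
values_per_sample = horizon * dof               # 28 action values per sample

uniform_ce = math.log(1000)                     # cross-entropy (nats) of a uniform guess over 1000 bins

samples_seen = 18_000 * effective_batch         # samples drawn over the whole run

print(effective_batch, round(labeled_fraction, 3), values_per_sample,
      round(uniform_ce, 2), samples_seen)
```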

Checkpoint layout

The repo is laid out as one subfolder per saved iteration:

iter_2500/    # step  2500
iter_5000/    # step  5000
iter_7500/    # step  7500
iter_10000/   # step 10000
iter_12500/   # step 12500
iter_15000/   # step 15000
iter_17500/   # step 17500
final/        # step 18000 (end of training)

Each subfolder is a complete HF-format checkpoint: config.json, model-0000?-of-0000?.safetensors, model.safetensors.index.json, plus the Qwen2.5-VL tokenizer / processor / chat-template files. The action-token vocabulary is the standard Qwen vocab (no new tokens were added).

Top-level files:

config.yaml — the RoboVerse-style yacs training config dump
dataset_stats.pkl — pickled dict of {"out_ori_act": {"min": np.ndarray(7,), "max": np.ndarray(7,)}} used to denormalize action tokens at inference time.
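
Because each iteration lives in its own subfolder, it is possible to fetch one checkpoint plus the two top-level files instead of the whole repo. The sketch below uses `snapshot_download` from `huggingface_hub` with `allow_patterns`; the helper names (`subfolder_for_step`, `allow_patterns_for`, `download_checkpoint`) are hypothetical and not part of this repo.

```python
def subfolder_for_step(step: int, final_step: int = 18_000) -> str:
    """Return the subfolder name for a saved step, following the layout above."""
    saved = {2500, 5000, 7500, 10000, 12500, 15000, 17500}
    if step == final_step:
        return "final"
    if step in saved:
        return f"iter_{step}"
    raise ValueError(f"no checkpoint was saved at step {step}")


def allow_patterns_for(subfolder: str) -> list:
    """Glob patterns covering one checkpoint folder and the top-level files."""
    return [f"{subfolder}/*", "config.yaml", "dataset_stats.pkl"]


def download_checkpoint(step: int, repo_id: str = "jsiburian/vla0-3b-bridge-v2") -> str:
    """Download a single checkpoint; returns the local snapshot directory."""
    # Imported lazily so the helpers above work without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        allow_patterns=allow_patterns_for(subfolder_for_step(step)),
    )
```

Calling `download_checkpoint(18000)` would then pull only `final/`, `config.yaml` and `dataset_stats.pkl`; it needs network access.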

Loading a single checkpoint

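# Qwen2.5-VL is a vision-language model, so it is loaded through its
# conditional-generation class (requires a transformers release that ships Qwen2.5-VL).
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

ckpt = "jsiburian/vla0-3b-bridge-v2"
subfolder = "final"  # or "iter_17500", "iter_15000", ...

# device_map requires the `accelerate` package.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    ckpt, subfolder=subfolder, torch_dtype="bfloat16", device_map="cuda"
)
processor = AutoProcessor.from_pretrained(ckpt, subfolder=subfolder)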

For the full VLA-0 inference path (system prompt, image tokenization, post-hoc action de-discretization with dataset_stats.pkl), use the rv_train.models.qwen.model.Qwen wrapper from the VLA-0 repo — these checkpoints are a direct drop-in for any qwen_model_id slot in that wrapper.
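
For readers who want to see what the post-hoc de-discretization step amounts to, here is a rough sketch. It assumes the model emits its action chunk as whitespace-separated integers (horizon × 7 values) and that decoding is the linear inverse of a min/max normalization over [0, n_bins]. Both are assumptions for illustration, and `parse_action_text` and `undiscretize` are hypothetical names; the wrapper in the VLA-0 repo is the reference implementation.

```python
def parse_action_text(text, horizon=4, dof=7):
    """Split generated text into `horizon` rows of `dof` integer bins."""
    values = [int(tok) for tok in text.split()]
    if len(values) != horizon * dof:
        raise ValueError(f"expected {horizon * dof} integers, got {len(values)}")
    return [values[i * dof:(i + 1) * dof] for i in range(horizon)]


def undiscretize(bins, lo, hi, n_bins=1000):
    """Map integer bins back to continuous values (assumed linear mapping)."""
    return [lo_i + (b / n_bins) * (hi_i - lo_i) for b, lo_i, hi_i in zip(bins, lo, hi)]


# Placeholder stats; the real min/max live in dataset_stats.pkl under "out_ori_act".
lo, hi = [-1.0] * 7, [1.0] * 7
generated = " ".join(["500"] * 28)          # stand-in for decoded model output
chunk = parse_action_text(generated)
actions = [undiscretize(row, lo, hi) for row in chunk]
print(actions[0])
```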

Caveats

This is an INT-ACT clone training run, not a "best-loss" run: no LR schedule, only ~1.4 epochs, action-mask augmentation at 0.4. We expect RLinf SIMPLER / RL4VLA scores to be the meaningful comparison, not the raw token CE.
Single-camera. The other three Bridge-v2 cameras (observation.images.image_{1,2,3}) are not used.
dataset_stats.pkl carries the action min/max used during training; using different stats at inference will silently miscalibrate gripper / motion.
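
Since mismatched stats fail silently, a cheap guard is to check the structure of the loaded dict before decoding any actions. The sketch below assumes the layout described under "Top-level files" (`out_ori_act` with 7-element `min` and `max`); the function names are hypothetical, and unpickling the real file needs NumPy installed because the values are arrays.

```python
import pickle


def check_action_stats(stats, dof=7):
    """Validate the stats dict and return (min, max) as plain lists."""
    lo = list(stats["out_ori_act"]["min"])
    hi = list(stats["out_ori_act"]["max"])
    if len(lo) != dof or len(hi) != dof:
        raise ValueError(f"expected {dof} dims, got min={len(lo)}, max={len(hi)}")
    if any(hi_i <= lo_i for lo_i, hi_i in zip(lo, hi)):
        raise ValueError("degenerate range: every max must exceed its min")
    return lo, hi


def load_action_stats(path, dof=7):
    """Load dataset_stats.pkl from a trusted source and validate it."""
    with open(path, "rb") as f:
        stats = pickle.load(f)
    return check_action_stats(stats, dof)
```

This only catches structural problems; stats of the right shape from a different dataset would still pass, so the file shipped in this repo should be the one used.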
