VLA-0 (3B): Bridge-v2 INT-ACT-aligned finetune

This repo contains intermediate and final checkpoints for a VLA-0 policy finetuned from Qwen/Qwen2.5-VL-3B-Instruct on the Bridge-v2 dataset (jesbu1/bridge_v2_lerobot), with filtering chosen to mirror the INT-ACT / OpenVLA / Octo baseline (skip_unlabeled=True, single 3rd-person camera, horizon = 4).

The intent is for these checkpoints to be drop-in usable in RLinf SIMPLER / RL4VLA evaluation against INT-ACT and other Bridge-v2-only baselines.

Training summary

Backbone: Qwen/Qwen2.5-VL-3B-Instruct (3B)
Method: Full finetune (no LoRA), bf16 + AMP
Dataset: Bridge-v2 LeRobot port (jesbu1/bridge_v2_lerobot), 53,192 episodes
Filter: skip_empty_language=True → 38,660 labeled episodes (~73%)
Camera: Single 3rd-person view (observation.images.image_0 → 3p1)
Image size: 224 × 224
Action space: 7-DoF, discretized to 1000 bins per dim
Horizon: 4 future steps (28 action tokens per sample)
Effective batch: 16 / GPU × 7 GPUs = 112
Optimizer: AdamW, lr = 1e-5 (× 7 GPUs → 7e-5 effective), no scheduler
action_mask_aug_per: 0.4
Iterations: 18,000 (≈ 1.4 epochs over the labeled subset)
Hardware: 7 × NVIDIA H200
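
The derived numbers in the summary can be re-checked with a few lines of arithmetic. This is only a sanity-check sketch: the variable names are ours, and the last line assumes that one iteration consumes exactly one effective batch, which the summary does not state explicitly.

```python
# Values copied from the training summary above.
per_gpu_batch = 16
num_gpus = 7
action_dims = 7
horizon = 4
total_episodes = 53_192
labeled_episodes = 38_660
iterations = 18_000

effective_batch = per_gpu_batch * num_gpus        # samples per optimizer step
action_values_per_sample = action_dims * horizon  # 7 dims x 4 future steps
labeled_fraction = labeled_episodes / total_episodes

# Assumption: one iteration consumes one effective batch.
samples_seen = iterations * effective_batch

print(effective_batch, action_values_per_sample)
print(f"labeled fraction: {labeled_fraction:.1%}")
print(f"samples seen: {samples_seen:,}")
```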

Final per-epoch cumulative training cross-entropy: ~0.94 (a uniform guess over 1000 bins would give ln(1000) ≈ 6.91).
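
To put the two cross-entropy figures on a common scale, the short sketch below recomputes the uniform baseline and converts the reported loss to a perplexity. It takes the card's framing at face value (a per-token loss over 1000 bins); the variable names are illustrative.

```python
import math

num_bins = 1000
reported_ce = 0.94  # approximate training value quoted above, in nats

# Cross-entropy of a uniform guess over all bins.
uniform_ce = math.log(num_bins)

# Perplexity: the number of equally likely bins that would give the same loss.
perplexity = math.exp(reported_ce)

print(f"uniform baseline: {uniform_ce:.2f} nats")
print(f"perplexity at CE {reported_ce}: {perplexity:.2f} (out of {num_bins} bins)")
```

Read this way, a loss of ~0.94 corresponds to the model spreading its probability over roughly 2.6 bins on average, compared with 1000 for a uniform guess.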

Checkpoint layout

The repo is laid out as one subfolder per saved iteration:

iter_2500/    # step  2500
iter_5000/    # step  5000
iter_7500/    # step  7500
iter_10000/   # step 10000
iter_12500/   # step 12500
iter_15000/   # step 15000
iter_17500/   # step 17500
final/        # step 18000 (end of training)
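
Since each iteration lives in its own subfolder, a single checkpoint can be fetched without pulling the whole repo. The sketch below is one way to do that with huggingface_hub's snapshot_download and its allow_patterns filter; the helper names (checkpoint_subfolders, download_checkpoint) are hypothetical and not part of this repo.

```python
def checkpoint_subfolders(save_every=2500, total_steps=18000):
    """Return the subfolder names listed above: iter_<step> plus final."""
    names = [f"iter_{step}" for step in range(save_every, total_steps, save_every)]
    return names + ["final"]


def download_checkpoint(repo_id, subfolder):
    """Download one checkpoint subfolder plus the two top-level files."""
    # Imported lazily so checkpoint_subfolders() works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        allow_patterns=[f"{subfolder}/*", "config.yaml", "dataset_stats.pkl"],
    )


print(checkpoint_subfolders())
# Example call (needs network access):
# local_dir = download_checkpoint("jsiburian/vla0-3b-bridge-v2", "final")
```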

Each subfolder is a complete HF-format checkpoint: config.json, model-0000?-of-0000?.safetensors, model.safetensors.index.json, plus the Qwen2.5-VL tokenizer / processor / chat-template files. The action-token vocabulary is the standard Qwen vocab (no new tokens were added).

Top-level files:

  • config.yaml: the RoboVerse-style yacs training config dump
  • dataset_stats.pkl: pickled dict of {"out_ori_act": {"min": np.ndarray(7,), "max": np.ndarray(7,)}} used to denormalize action tokens at inference time.
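
As a rough illustration of how dataset_stats.pkl is meant to be used, the sketch below loads the min/max arrays and maps integer bins back to continuous actions. The dict layout comes from the description above, but the linear mapping (bin 0 → min, bin 1000 → max) is an assumption on our part; the exact convention is defined by the VLA-0 wrapper and should be checked there. Function names are hypothetical, and the demo uses toy stats rather than the real file.

```python
import pickle

import numpy as np

NUM_BINS = 1000  # bins per action dimension, from the training summary


def load_action_stats(path="dataset_stats.pkl"):
    """Load the per-dimension action min/max arrays (shape (7,) each)."""
    # pickle can execute arbitrary code: only load files you trust.
    with open(path, "rb") as f:
        stats = pickle.load(f)
    return stats["out_ori_act"]["min"], stats["out_ori_act"]["max"]


def bins_to_actions(bins, act_min, act_max, num_bins=NUM_BINS):
    """ASSUMPTION: linear map with bin 0 -> min and bin num_bins -> max."""
    bins = np.asarray(bins, dtype=np.float64)
    return act_min + (bins / num_bins) * (act_max - act_min)


# Toy stats for illustration only; real values come from load_action_stats().
act_min = np.full(7, -1.0)
act_max = np.full(7, 1.0)

# One predicted chunk: horizon 4 x 7 dims of integer bins.
chunk = np.array([[0, 250, 500, 750, 1000, 500, 500]] * 4)
actions = bins_to_actions(chunk, act_min, act_max)
print(actions.shape)
print(actions[0])
```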

Loading a single checkpoint

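# Qwen2.5-VL checkpoints load through the vision-language model class;
# this needs a transformers release that includes Qwen2.5-VL support.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

ckpt = "jsiburian/vla0-3b-bridge-v2"
subfolder = "final"  # or "iter_17500", "iter_15000", ...

# Weights are loaded in bf16 and placed on the GPU (device_map requires accelerate).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    ckpt, subfolder=subfolder, torch_dtype="bfloat16", device_map="cuda"
)
# The processor bundles the tokenizer, image processor, and chat template.
processor = AutoProcessor.from_pretrained(ckpt, subfolder=subfolder)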

For the full VLA-0 inference path (system prompt, image tokenization, post-hoc action de-discretization with dataset_stats.pkl), use the rv_train.models.qwen.model.Qwen wrapper from the VLA-0 repo. These checkpoints are a direct drop-in for any qwen_model_id slot in that wrapper.

Caveats

  • This is an INT-ACT clone training run, not a "best-loss" run: no LR schedule, only ~1.4 epochs, action-mask augmentation at 0.4. We expect RLinf SIMPLER / RL4VLA scores to be the meaningful comparison, not the raw token CE.
  • Single-camera. The other three Bridge-v2 cameras (observation.images.image_{1,2,3}) are not used.
  • dataset_stats.pkl carries the action min/max used during training; using different stats at inference will silently miscalibrate gripper / motion.