VLA-0 (3B): Bridge-v2 INT-ACT-aligned finetune
This repo contains intermediate and final checkpoints for a VLA-0 policy
finetuned from Qwen/Qwen2.5-VL-3B-Instruct on the Bridge-v2 dataset
(jesbu1/bridge_v2_lerobot), with filtering chosen to mirror the INT-ACT /
OpenVLA / Octo baseline (skip_unlabeled=True, single 3rd-person camera,
horizon = 4).
The intent is for these checkpoints to be drop-in usable in RLinf SIMPLER / RL4VLA evaluation against INT-ACT and other Bridge-v2-only baselines.
Training summary
| Setting | Value |
| --- | --- |
| Backbone | Qwen/Qwen2.5-VL-3B-Instruct (3B) |
| Method | Full finetune (no LoRA), bf16 + AMP |
| Dataset | Bridge-v2 LeRobot port (jesbu1/bridge_v2_lerobot), 53,192 episodes |
| Filter | skip_empty_language=True → 38,660 labeled episodes (~73%) |
| Camera | Single 3rd-person view (observation.images.image_0 → 3p1) |
| Image size | 224 × 224 |
| Action space | 7-DoF, discretized to 1000 bins per dim |
| Horizon | 4 future steps (28 action tokens per sample) |
| Effective batch | 16 / GPU × 7 GPUs = 112 |
| Optimizer | AdamW, lr = 1e-5 (× 7 GPUs → 7e-5 effective), no scheduler |
| action_mask_aug_per | 0.4 |
| Iterations | 18,000 (≈ 1.4 epochs over the labeled subset) |
| Hardware | 7 × NVIDIA H200 |
Final per-epoch cumulative cross-entropy on training: ~0.94 (random over
1000 bins is ln(1000) ≈ 6.91).
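The figures in the training summary can be re-derived with a few lines of arithmetic. The sketch below only recomputes numbers already stated above; the variable names are illustrative and are not taken from the training config.

```python
import math

# Values copied from the training summary.
per_gpu_batch = 16
num_gpus = 7
iterations = 18_000
num_bins = 1000
total_episodes = 53_192
labeled_episodes = 38_660

# Effective batch size across all GPUs.
effective_batch = per_gpu_batch * num_gpus

# Total training samples drawn over the whole run.
samples_seen = iterations * effective_batch

# Fraction of episodes that survive the language-label filter.
labeled_fraction = labeled_episodes / total_episodes

# Cross-entropy of a uniform guess over the action bins (in nats).
uniform_ce = math.log(num_bins)

print(f"effective batch:  {effective_batch}")
print(f"samples seen:     {samples_seen}")
print(f"labeled fraction: {labeled_fraction:.3f}")
print(f"uniform CE:       {uniform_ce:.2f} nats")
```

The uniform baseline is the reference point for the reported ~0.94 training cross-entropy.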
Checkpoint layout
The repo is laid out as one subfolder per saved iteration:
iter_2500/ # step 2500
iter_5000/ # step 5000
iter_7500/ # step 7500
iter_10000/ # step 10000
iter_12500/ # step 12500
iter_15000/ # step 15000
iter_17500/ # step 17500
final/ # step 18000 (end of training)
Each subfolder is a complete HF-format checkpoint:
config.json, model-0000?-of-0000?.safetensors,
model.safetensors.index.json, plus the Qwen2.5-VL tokenizer / processor /
chat-template files. The action-token vocabulary is the standard Qwen vocab
(no new tokens were added).
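When sweeping an evaluation over every saved iteration, it can help to generate the subfolder names rather than type them. The helper below is a hypothetical convenience, not part of the repo; it just encodes the layout listed above (a save every 2,500 steps, plus `final` at step 18,000).

```python
SAVE_EVERY = 2_500
FINAL_STEP = 18_000


def subfolder_for_step(step: int) -> str:
    """Map a training step to the subfolder name used in this repo."""
    if step == FINAL_STEP:
        return "final"
    if 0 < step < FINAL_STEP and step % SAVE_EVERY == 0:
        return f"iter_{step}"
    raise ValueError(f"no checkpoint was saved at step {step}")


def all_subfolders() -> list:
    """Return every checkpoint subfolder in training order."""
    steps = list(range(SAVE_EVERY, FINAL_STEP, SAVE_EVERY)) + [FINAL_STEP]
    return [subfolder_for_step(s) for s in steps]


for name in all_subfolders():
    print(name)
```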
Top-level files:
- config.yaml: the RoboVerse-style yacs training config dump
- dataset_stats.pkl: pickled dict of {"out_ori_act": {"min": np.ndarray(7,), "max": np.ndarray(7,)}} used to denormalize action tokens at inference time.
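To illustrate how the min/max stats are used, here is a minimal sketch of de-discretization. It assumes each predicted integer lies in [0, 1000] and maps linearly onto [min, max] per dimension; that convention, and the function names, are assumptions for illustration. The wrapper in the VLA-0 repo is the authoritative implementation.

```python
import pickle

NUM_BINS = 1000  # from the training summary


def load_action_stats(path):
    """Read the per-dimension action min/max from dataset_stats.pkl.

    Unpickling the real file requires numpy, since the values are ndarrays.
    """
    with open(path, "rb") as f:
        stats = pickle.load(f)
    return stats["out_ori_act"]["min"], stats["out_ori_act"]["max"]


def bins_to_actions(bins, act_min, act_max, num_bins=NUM_BINS):
    """Convert a flat list of bin indices into per-step continuous actions.

    ASSUMPTION: bin b maps to min + (b / num_bins) * (max - min).
    """
    dim = len(act_min)
    if len(bins) % dim != 0:
        raise ValueError("number of bins must be a multiple of the action dim")
    steps = []
    for start in range(0, len(bins), dim):
        step = []
        for d in range(dim):
            lo, hi = float(act_min[d]), float(act_max[d])
            step.append(lo + (bins[start + d] / num_bins) * (hi - lo))
        steps.append(step)
    return steps


# Toy stats for demonstration only; real values come from dataset_stats.pkl.
toy_min = [-1.0] * 7
toy_max = [1.0] * 7
toy_bins = [0] * 7 + [1000] * 7 + [500] * 14  # horizon 4 x 7 dims = 28 values
actions = bins_to_actions(toy_bins, toy_min, toy_max)
print(len(actions), "steps of", len(actions[0]), "dims")
```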
Loading a single checkpoint
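# Qwen2.5-VL is a vision-language model, so load it with its
# conditional-generation class (needs a transformers release that ships Qwen2.5-VL).
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

ckpt = "jsiburian/vla0-3b-bridge-v2"
subfolder = "final"  # or "iter_17500", "iter_15000", ...

# device_map requires the accelerate package.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    ckpt, subfolder=subfolder, torch_dtype="bfloat16", device_map="cuda"
)
processor = AutoProcessor.from_pretrained(ckpt, subfolder=subfolder)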
For the full VLA-0 inference path (system prompt, image tokenization,
post-hoc action de-discretization with dataset_stats.pkl), use the
rv_train.models.qwen.model.Qwen wrapper from the
VLA-0 repo. These checkpoints are a
direct drop-in for any qwen_model_id slot in that wrapper.
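Because the repo holds one full checkpoint per saved iteration, it may be preferable to fetch only the subfolder you need along with the two top-level files. A minimal sketch using `snapshot_download` from `huggingface_hub` with `allow_patterns`; the helper names are hypothetical and the download call is wrapped in a function so nothing is fetched on import.

```python
REPO_ID = "jsiburian/vla0-3b-bridge-v2"


def patterns_for(subfolder: str) -> list:
    """Glob patterns selecting one checkpoint plus the top-level files."""
    return [f"{subfolder}/*", "config.yaml", "dataset_stats.pkl"]


def download_checkpoint(subfolder: str = "final", repo_id: str = REPO_ID) -> str:
    """Download a single checkpoint subfolder and return the local snapshot path."""
    # Imported lazily so the helper above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, allow_patterns=patterns_for(subfolder))


print(patterns_for("iter_17500"))
```

The returned path contains the chosen subfolder, which can then be passed to a local `from_pretrained` call or to the `qwen_model_id` slot mentioned above.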
Caveats
- This is an INT-ACT clone training run, not a "best-loss" run: no LR schedule, only ~1.4 epochs, action-mask augmentation at 0.4. We expect RLinf SIMPLER / RL4VLA scores to be the meaningful comparison, not the raw token CE.
- Single-camera. The other three Bridge-v2 cameras (observation.images.image_{1,2,3}) are not used.
- dataset_stats.pkl carries the action min/max used during training; using different stats at inference will silently miscalibrate gripper / motion.