# OS_libero_traj_strict_rew
OpenSora + OpenVLA-OFT checkpoints trained with GRPO on LIBERO-Spatial using a strict VLM trajectory reward (J_adversarial_fast prompt).
## Training Details
- Base model: RLinf-OpenVLAOFT-LIBERO-130-Base-Lora (OpenVLA-OFT with LoRA)
- RL algorithm: GRPO (Group Relative Policy Optimization)
- Environment: LIBERO-Spatial (10 tasks)
- Reward model: Qwen3-VL-8B-Instruct with J_adversarial_fast prompt
- Reward type: VLM trajectory reward (4 frames sampled from trajectory)
- Training framework: RLinf
- GPUs: 8x NVIDIA B200
- KL penalty (kl_beta): 0.0
- Reward coefficient: 5.0
- VLM max_new_tokens: 32
- enable_thinking: False
- val_check_interval: 5
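The trajectory reward samples 4 frames per episode before sending them to the VLM judge. The exact sampling scheme is not specified here; a minimal sketch, assuming uniform spacing that always includes the first and last frame:

```python
def sample_frames(trajectory, n_frames=4):
    """Pick n_frames evenly spaced frames from a trajectory (a list of images).

    Assumption: RLinf's actual sampling strategy may differ; this shows one
    common approach (uniform spacing, keeping the first and last frame).
    """
    if len(trajectory) <= n_frames:
        return list(trajectory)
    step = (len(trajectory) - 1) / (n_frames - 1)
    return [trajectory[round(i * step)] for i in range(n_frames)]
```

For a 10-step trajectory this selects frames 0, 3, 6, and 9, so the judge always sees the initial scene and the final state.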
## Checkpoints

| Checkpoint | Description |
|---|---|
| `global_step_25/` | Step 25 – peak eval performance window |
| `global_step_30/` | Step 30 – peak eval performance window |

Each checkpoint contains:

- `actor/model_state_dict/full_weights.pt` – consolidated full model weights (~15GB)
- `actor/dcp_checkpoint/` – distributed checkpoint shards for resuming training (~43GB)
## VLM Reward Prompt
See prompt.txt for the full J_adversarial_fast prompt template used during training.
### Prompt Metrics
| VLM Model | Accuracy | Precision | Recall | F1 | FP Rate |
|---|---|---|---|---|---|
| Qwen3-VL-8B | 68.3% | 90.6% | 59.0% | 0.714 | 12.8% |
| Qwen3.5-9B | 85.8% | 97.1% | 81.5% | 0.886 | 5.1% |
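The columns above follow the standard confusion-matrix definitions (positives = trajectories the judge labels as successes). A quick reference implementation for recomputing them from raw counts:

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fp/tn/fn = true/false positives and negatives.
    """
    precision = tp / (tp + fp)          # of predicted successes, how many were real
    recall = tp / (tp + fn)             # of real successes, how many were caught
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fp_rate": fp / (fp + tn),      # of real failures, how many were mislabeled
    }
```

A low FP rate matters most here: false positives reward failed trajectories, which directly corrupts the GRPO training signal.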
## Training Eval Curve (VLM success)
Training ran for 95 steps with periodic validation every 5 steps.
- Peak VLM success ~85% around steps 20-25
- Gradual decline after step 30
## Usage
To load the full weights for inference:

```python
import torch

# Load consolidated weights on CPU (the file is ~15GB)
state_dict = torch.load(
    "global_step_25/actor/model_state_dict/full_weights.pt",
    map_location="cpu",
)
# Apply to your OpenVLA-OFT model
```
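Applying the state dict follows the usual PyTorch pattern. A self-contained sketch with a stand-in module (substitute your actual OpenVLA-OFT model, loaded through your usual pipeline):

```python
import torch
import torch.nn as nn

# Stand-in for the OpenVLA-OFT model; replace with the real model.
model = nn.Linear(4, 2)

# Round-trip: save consolidated weights, then load them back on CPU.
torch.save(model.state_dict(), "full_weights.pt")
state_dict = torch.load("full_weights.pt", map_location="cpu")

# strict=True fails fast on missing or unexpected keys, which is what
# you want when loading a consolidated checkpoint for inference.
model.load_state_dict(state_dict, strict=True)
model.eval()
```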
## License
See the main RLinf repository for license details.