# OS_libero_traj_strict_rew
OpenSora + OpenVLA-OFT checkpoints trained with GRPO on LIBERO-Spatial using a strict VLM trajectory reward (J_adversarial_fast prompt).
## Training Details
- Base model: RLinf-OpenVLAOFT-LIBERO-130-Base-Lora (OpenVLA-OFT with LoRA)
- RL algorithm: GRPO (Group Relative Policy Optimization)
- Environment: LIBERO-Spatial (10 tasks)
- Reward model: Qwen3-VL-8B-Instruct with J_adversarial_fast prompt
- Reward type: VLM trajectory reward (4 frames sampled from trajectory)
- Training framework: RLinf
- GPUs: 8x NVIDIA B200
- KL penalty (kl_beta): 0.0
- Reward coefficient: 5.0
- VLM max_new_tokens: 32
- enable_thinking: False
- val_check_interval: 5
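The trajectory reward samples 4 frames per episode before sending them to the VLM judge. The exact sampling scheme is not specified here; a minimal sketch, assuming uniform spacing that always includes the first and last frame:

```python
def sample_frames(trajectory, n_frames=4):
    """Pick n_frames evenly spaced frames from a trajectory (a list of images).

    Assumption: RLinf's actual sampling strategy may differ; this shows one
    common approach (uniform spacing, keeping the first and last frame).
    """
    if len(trajectory) <= n_frames:
        return list(trajectory)
    step = (len(trajectory) - 1) / (n_frames - 1)
    return [trajectory[round(i * step)] for i in range(n_frames)]
```

For a 10-step trajectory this selects frames 0, 3, 6, and 9, so the judge always sees the initial scene and the final state.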
## Checkpoints

| Checkpoint | Description |
|---|---|
| `global_step_25/` | Step 25 – peak eval performance window |
| `global_step_30/` | Step 30 – peak eval performance window |

Each checkpoint contains:

- `actor/model_state_dict/full_weights.pt` – consolidated full model weights (~15GB)
- `actor/dcp_checkpoint/` – distributed checkpoint shards for resuming training (~43GB)
## VLM Reward Prompt
See prompt.txt for the full J_adversarial_fast prompt template used during training.
### Prompt Metrics
| VLM Model | Accuracy | Precision | Recall | F1 | FP Rate |
|---|---|---|---|---|---|
| Qwen3-VL-8B | 68.3% | 90.6% | 59.0% | 0.714 | 12.8% |
| Qwen3.5-9B | 85.8% | 97.1% | 81.5% | 0.886 | 5.1% |
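The columns above follow the standard confusion-matrix definitions (positives = trajectories the judge labels as successes). A quick reference implementation for recomputing them from raw counts:

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fp/tn/fn = true/false positives and negatives.
    """
    precision = tp / (tp + fp)          # of predicted successes, how many were real
    recall = tp / (tp + fn)             # of real successes, how many were caught
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fp_rate": fp / (fp + tn),      # of real failures, how many were mislabeled
    }
```

A low FP rate matters most here: false positives reward failed trajectories, which directly corrupts the GRPO training signal.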
## Training Eval Curve (VLM success)
Training ran for 95 steps with periodic validation every 5 steps.
- Peak VLM success ~85% around steps 20-25
- Gradual decline after step 30
## Usage
To load the full weights for inference:

```python
import torch

# Load consolidated weights on CPU (the file is ~15GB)
state_dict = torch.load(
    "global_step_25/actor/model_state_dict/full_weights.pt",
    map_location="cpu",
)
# Apply to your OpenVLA-OFT model
```
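Applying the state dict follows the usual PyTorch pattern. A self-contained sketch with a stand-in module (substitute your actual OpenVLA-OFT model, loaded through your usual pipeline):

```python
import torch
import torch.nn as nn

# Stand-in for the OpenVLA-OFT model; replace with the real model.
model = nn.Linear(4, 2)

# Round-trip: save consolidated weights, then load them back on CPU.
torch.save(model.state_dict(), "full_weights.pt")
state_dict = torch.load("full_weights.pt", map_location="cpu")

# strict=True fails fast on missing or unexpected keys, which is what
# you want when loading a consolidated checkpoint for inference.
model.load_state_dict(state_dict, strict=True)
model.eval()
```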
## License
See the main RLinf repository for license details.