moka_pot_RECAP_R1

A pi0 (π₀) RECAP Vision-Language-Action (VLA) model, finetuned on the LIBERO robotic manipulation benchmark using the OpenTau training framework. The model follows natural language instructions to perform manipulation tasks in a simulated tabletop environment. It achieves a ~90% success rate, measured over 320 episodes.

For full documentation, evaluation results, and inference code, please visit the repository:
👉 https://github.com/TensorAuto/OpenTau

Model Details

Description

Model Type: Vision-Language-Action (VLA) Model
Base Architecture: π₀ (pi0) by Physical Intelligence
Backbone: PaliGemma-3B (VLM) + Gemma-300M (Action Expert) + RL indicator
Training Data: Moka Pot Task on LIBERO (Lifelong Robot Learning) Benchmark
Framework: OpenTau

Architecture

The PI0 RECAP architecture combines a flow-matching policy with reinforcement learning (RL) and is designed for open-world generalization. It pairs a Vision-Language Model (VLM) for high-level semantic understanding with a smaller "action expert" model that generates continuous joint trajectories (10-step action chunks) via flow matching. RL is used to learn from both good and bad episodes.
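To make the flow-matching step concrete, the following is a minimal sketch of how a 10-step action chunk can be produced by integrating a velocity field from noise to actions with Euler steps. It is not the OpenTau implementation: `toy_velocity` is a hypothetical stand-in for the action expert, `ACTION_DIM = 7` and the number of integration steps are assumptions, and the real model conditions the velocity on images, language, and robot state.

```python
import random

CHUNK_LEN = 10   # action-chunk length stated in this model card
ACTION_DIM = 7   # assumption: illustrative per-step action dimension


def toy_velocity(x, t, target):
    """Hypothetical stand-in for the action expert.

    For a straight-line path from noise to a single target chunk, the
    velocity that reaches the target at t = 1 is (target - x) / (1 - t).
    """
    return [
        [(g - a) / (1.0 - t) for a, g in zip(row_x, row_g)]
        for row_x, row_g in zip(x, target)
    ]


def sample_chunk(velocity_fn, num_steps=10, seed=0):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 (noise) to t = 1."""
    rng = random.Random(seed)
    # Start from Gaussian noise with the shape of one action chunk.
    x = [[rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)]
         for _ in range(CHUNK_LEN)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k / num_steps  # t stays below 1, so toy_velocity never divides by 0
        v = velocity_fn(x, t)
        x = [[a + dt * b for a, b in zip(row_x, row_v)]
             for row_x, row_v in zip(x, v)]
    return x


# Toy target chunk: each of the 10 steps holds a constant 7-dim action.
target = [[0.1 * step] * ACTION_DIM for step in range(CHUNK_LEN)]
chunk = sample_chunk(lambda x, t: toy_velocity(x, t, target))
```

With this toy velocity field the Euler integration lands on the target chunk; in the real model the velocity comes from the learned action expert rather than a known target.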

Training and Evaluation

The advantage indicator (I_t) was set to True for only 10% of datapoints.
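The card does not say how the 10% of datapoints were selected. One plausible scheme, shown here purely as an assumption, is to threshold per-datapoint advantage estimates so that only the highest-advantage 10% receive a True indicator. The function name `advantage_indicators` and the synthetic advantages are hypothetical.

```python
import random


def advantage_indicators(advantages, positive_fraction=0.10):
    """Mark the top `positive_fraction` of datapoints by advantage as True.

    Assumption: selection is by thresholding advantage estimates; the
    model card only states that 10% of datapoints were set to True.
    Ties at the threshold can push the count slightly above the target.
    """
    k = max(1, round(len(advantages) * positive_fraction))
    threshold = sorted(advantages, reverse=True)[k - 1]
    return [a >= threshold for a in advantages]


# Synthetic, distinct advantage values for illustration only.
rng = random.Random(0)
advantages = [float(i) for i in range(100)]
rng.shuffle(advantages)

indicators = advantage_indicators(advantages)
```

During training, an indicator like this would be fed to the policy as an extra conditioning input alongside the observation and instruction.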

Dataset

This model was finetuned on the Moka Pot task from the LIBERO 10 benchmark dataset, together with autonomous rollouts. The training data consists of around 29 expert teleoperated episodes, 212 autonomous rollouts collected under the moka_pot_libero_sft policy, and 320 autonomous rollouts collected under the moka_pot_RECAP_R0 policy.
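As a quick worked check of the data mix, the episode counts above can be summed and converted to shares. The dictionary keys are illustrative labels, and the expert count is approximate because the card says "around 29".

```python
# Episode counts as stated in the Dataset section.
episode_counts = {
    "expert_teleoperation": 29,            # "around 29", so approximate
    "rollouts_moka_pot_libero_sft": 212,
    "rollouts_moka_pot_RECAP_R0": 320,
}

total_episodes = sum(episode_counts.values())
shares = {name: count / total_episodes
          for name, count in episode_counts.items()}

for name, share in shares.items():
    print(f"{name}: {episode_counts[name]} episodes ({share:.1%})")
print(f"total: {total_episodes} episodes")
```

Autonomous rollouts make up the large majority of the training episodes, with expert demonstrations contributing only a small share.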

Results

The model achieves a ~90% success rate, measured over 320 episodes. For detailed usage instructions, success rates, baseline comparisons, and evaluation protocols, please refer to the OpenTau GitHub repository.
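To give a sense of how much statistical uncertainty a 320-episode evaluation carries, the sketch below computes a 95% Wilson score interval for a binomial success rate. The exact number of successful episodes is not reported here, so `ASSUMED_SUCCESSES = 288` (90% of 320) is an assumption used only for illustration.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials
                      + z * z / (4 * trials * trials))
        / denom
    )
    return center - half_width, center + half_width


EPISODES = 320            # evaluation episodes stated in the model card
ASSUMED_SUCCESSES = 288   # assumption: 90% of 320, exact count not given

low, high = wilson_interval(ASSUMED_SUCCESSES, EPISODES)
print(f"95% Wilson interval: [{low:.3f}, {high:.3f}]")
```

An interval of this kind is useful when comparing against the baselines in the repository, since small differences in success rate may fall within the evaluation noise.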


Model Size: 4B params (Safetensors)
Tensor Types: F32, BF16
