# ACT Model for SO-101 Pick Cube Task

This is an Action Chunking Transformer (ACT) model trained on the SO-101 robot arm for a cube picking task.

## Demo


Visualization showing ground truth (green) vs predicted actions (blue) with mean absolute error per frame.

## Environment


## Model Details

| Parameter | Value |
|---|---|
| Architecture | ACT (Action Chunking Transformer) |
| Vision Backbone | ResNet18 |
| Training Steps | 500,000 |
| Chunk Size | 100 |
| N Action Steps | 1 (with temporal ensembling) |
| Temporal Ensemble Coeff | 0.01 |
| KL Weight | 10.0 |
| Batch Size | 16 |
| Learning Rate | 3e-5 |
| Parameters | 51.6M |
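With `n_action_steps = 1`, each inference step executes a single action that averages the overlapping chunk predictions made at previous timesteps. A minimal NumPy sketch of this temporal ensembling, assuming the exponential weighting scheme from the ACT paper with the coefficient `m = 0.01` listed above (the function name is illustrative, not part of any library API):

```python
import numpy as np

def temporal_ensemble(chunk_preds, m=0.01):
    """Average the overlapping chunk predictions for one timestep.

    chunk_preds: (k, action_dim) array of the k actions predicted for the
    current timestep by the last k chunks, oldest first. Weights follow
    w_i = exp(-m * i), so older predictions get slightly more weight.
    """
    chunk_preds = np.asarray(chunk_preds, dtype=np.float64)
    weights = np.exp(-m * np.arange(len(chunk_preds)))
    weights /= weights.sum()          # normalize to sum to 1
    return weights @ chunk_preds      # weighted average, shape (action_dim,)
```

A small `m` (like 0.01) makes the weights nearly uniform over the 100-step chunk, smoothing the executed trajectory; a large `m` would favor the oldest prediction almost exclusively.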

## Evaluation Metrics

Evaluated on a sample episode from the training set:

| Joint | MAE | MSE |
|---|---|---|
| Joint 0 | 0.0374 | 0.0034 |
| Joint 1 | 0.0342 | 0.0042 |
| Joint 2 | 0.0394 | 0.0025 |
| Joint 3 | 0.0216 | 0.0011 |
| Joint 4 | 0.0264 | 0.0009 |
| Joint 5 (gripper) | 0.0020 | 0.00001 |
| Overall | 0.0268 | 0.0020 |
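The table values are standard per-joint mean absolute error and mean squared error over the episode. A sketch of how they would be computed from recorded and predicted joint trajectories (the helper name is illustrative):

```python
import numpy as np

def joint_errors(ground_truth, predicted):
    """Per-joint MAE and MSE over an episode.

    ground_truth, predicted: (T, num_joints) arrays of joint positions
    for T frames. Returns (mae, mse), each of shape (num_joints,).
    """
    err = np.asarray(predicted, dtype=np.float64) \
        - np.asarray(ground_truth, dtype=np.float64)
    mae = np.abs(err).mean(axis=0)   # mean |error| per joint
    mse = (err ** 2).mean(axis=0)    # mean squared error per joint
    return mae, mse
```

The "Overall" row is then the mean of the per-joint values across the six joints.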

## Training Dataset

Trained on `gpudad/so101_pick_cube_chunked`, a chunked version of the SO-101 pick cube dataset with episode-level video files for efficient loading.

- ~11k episodes
- 3 camera views (front, overhead, wrist)
- 30 FPS

## Camera Views

The model uses 3 camera inputs:

- **Front camera** - Main observation view
- **Overhead camera** - Top-down perspective
- **Wrist camera** - End-effector mounted camera
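At inference time the three camera frames and the joint state are packed into a single observation dict. A minimal sketch of that layout, assuming channel-first float images in `[0, 1]` as LeRobot policies typically expect; the exact key names depend on the dataset/policy config and are assumptions here:

```python
import numpy as np

def make_observation(front, overhead, wrist, joint_state):
    """Pack camera frames and joint state into a policy observation.

    Each image is an (H, W, 3) uint8 array; it is scaled to float32 in
    [0, 1] and transposed to channel-first (C, H, W).
    """
    def to_chw(img):
        return np.transpose(img.astype(np.float32) / 255.0, (2, 0, 1))

    return {
        # Key names below are illustrative, not taken from the model config.
        "observation.images.front": to_chw(front),
        "observation.images.overhead": to_chw(overhead),
        "observation.images.wrist": to_chw(wrist),
        "observation.state": np.asarray(joint_state, dtype=np.float32),
    }
```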

## Training Command

```bash
python -m roboport.train act \
  /path/to/so101_pick_cube_chunked \
  -o /path/to/output \
  --steps 500000 \
  --chunk-size 100 \
  --n-action-steps 1 \
  --temporal-ensemble 0.01 \
  --kl-weight 10.0 \
  --batch-size 16 \
  --lr 3e-5 \
  --vision-backbone resnet18 \
  --save-freq 50000 \
  --gpu 0
```

## Usage

```python
from lerobot.policies.act.modeling_act import ACTPolicy

policy = ACTPolicy.from_pretrained("gpudad/act_so101_pick_cube")
policy.eval()

# Run inference
action = policy.select_action(observation)
```

## Framework

Trained using `roboport` with the LeRobot backend.

