# ACT Model for SO-101 Pick Cube Task

This is an Action Chunking Transformer (ACT) model trained on the SO-101 robot arm for a cube picking task.
## Demo

Visualization showing ground truth (green) vs. predicted actions (blue), with the mean absolute error per frame.
## Model Details
| Parameter | Value |
|---|---|
| Architecture | ACT (Action Chunking Transformer) |
| Vision Backbone | ResNet18 |
| Training Steps | 500,000 |
| Chunk Size | 100 |
| N Action Steps | 1 (with temporal ensembling) |
| Temporal Ensemble Coeff | 0.01 |
| KL Weight | 10.0 |
| Batch Size | 16 |
| Learning Rate | 3e-5 |
| Parameters | 51.6M |
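With a chunk size of 100 but only 1 action step executed per inference, every executed action is an ensemble over up to 100 overlapping chunk predictions. The sketch below illustrates the weighting scheme from the ACT paper, w_i = exp(-m * i) with w_0 on the oldest prediction and m = 0.01 as in the table above; it is an illustrative NumPy reimplementation, not the released code.

```python
import numpy as np

def ensemble_action(predictions, m=0.01):
    """Temporally ensemble overlapping chunk predictions for one timestep.

    predictions: array of shape (k, action_dim) -- the k predictions made
    for this timestep by the last k chunks, ordered oldest first.
    Weights follow w_i = exp(-m * i), with i = 0 for the oldest prediction,
    as in the ACT paper; smaller m weights newer observations more evenly.
    """
    predictions = np.asarray(predictions)
    weights = np.exp(-m * np.arange(len(predictions)))
    weights /= weights.sum()
    return weights @ predictions
```

Because m is small (0.01), the weights decay slowly, so the executed action changes smoothly even when individual chunk predictions are noisy.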
## Evaluation Metrics
Evaluated on a sample episode from the training set:
| Joint | MAE | MSE |
|---|---|---|
| Joint 0 | 0.0374 | 0.0034 |
| Joint 1 | 0.0342 | 0.0042 |
| Joint 2 | 0.0394 | 0.0025 |
| Joint 3 | 0.0216 | 0.0011 |
| Joint 4 | 0.0264 | 0.0009 |
| Joint 5 (gripper) | 0.0020 | 0.00001 |
| Overall | 0.0268 | 0.0020 |
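Per-joint MAE and MSE of this kind can be computed with a few lines of NumPy. The helper below is an illustrative sketch (not part of the released code), assuming ground-truth and predicted joint trajectories stored as `(T, 6)` arrays:

```python
import numpy as np

def per_joint_errors(gt, pred):
    """Per-joint MAE and MSE over an episode.

    gt, pred: arrays of shape (T, 6) -- ground-truth and predicted joint
    positions for the 6-DoF arm (the last joint is the gripper).
    Returns two arrays of shape (6,): MAE and MSE per joint.
    """
    err = np.asarray(pred) - np.asarray(gt)
    mae = np.abs(err).mean(axis=0)
    mse = (err ** 2).mean(axis=0)
    return mae, mse
```

The "Overall" row in the table then corresponds to averaging the returned per-joint values.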
## Training Dataset

Trained on `gpudad/so101_pick_cube_chunked`, a chunked version of the SO-101 pick-cube dataset with episode-level video files for efficient loading.
- ~11k episodes
- 3 camera views (front, overhead, wrist)
- 30 FPS
## Camera Views

The model uses 3 camera inputs:
- Front camera - Main observation view
- Overhead camera - Top-down perspective
- Wrist camera - End-effector mounted camera
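A sketch of how the three camera views plus the joint state might be packed into a LeRobot-style observation dict. The key names and image resolution here are assumptions for illustration (check the policy's config for the exact keys used at training time), and NumPy stands in for the torch tensors a real control loop would use:

```python
import numpy as np

def make_observation(front, overhead, wrist, state):
    """Pack camera frames and joint state into an observation dict.

    Images: (3, H, W) float32 in [0, 1], channel-first.
    state: (6,) float32 joint positions.
    A leading batch dimension of 1 is added to every entry, matching the
    batched input the policy expects at inference time.
    """
    return {
        "observation.images.front": front[None],       # (1, 3, H, W)
        "observation.images.overhead": overhead[None],
        "observation.images.wrist": wrist[None],
        "observation.state": state[None],              # (1, 6)
    }

# Dummy frames at a hypothetical 320x240 resolution:
obs = make_observation(
    np.zeros((3, 240, 320), np.float32),
    np.zeros((3, 240, 320), np.float32),
    np.zeros((3, 240, 320), np.float32),
    np.zeros(6, np.float32),
)
```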
## Training Command

```bash
python -m roboport.train act \
  /path/to/so101_pick_cube_chunked \
  -o /path/to/output \
  --steps 500000 \
  --chunk-size 100 \
  --n-action-steps 1 \
  --temporal-ensemble 0.01 \
  --kl-weight 10.0 \
  --batch-size 16 \
  --lr 3e-5 \
  --vision-backbone resnet18 \
  --save-freq 50000 \
  --gpu 0
```
## Usage

```python
from lerobot.policies.act.modeling_act import ACTPolicy

policy = ACTPolicy.from_pretrained("gpudad/act_so101_pick_cube")
policy.eval()

# Run inference
action = policy.select_action(observation)
```
## Framework

Trained using roboport with a LeRobot backend.