# ACT Model for SO-101 Pick Cube Task

This is an Action Chunking Transformer (ACT) model trained on the SO-101 robot arm for a cube picking task.
## Demo

Visualization showing ground truth (green) vs. predicted actions (blue), with the mean absolute error per frame.
## Model Details
| Parameter | Value |
|---|---|
| Architecture | ACT (Action Chunking Transformer) |
| Vision Backbone | ResNet18 |
| Training Steps | 500,000 |
| Chunk Size | 100 |
| N Action Steps | 1 (with temporal ensembling) |
| Temporal Ensemble Coeff | 0.01 |
| KL Weight | 10.0 |
| Batch Size | 16 |
| Learning Rate | 3e-5 |
| Parameters | 51.6M |
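With a chunk size of 100 but only 1 action step executed per inference, every executed action is an ensemble over up to 100 overlapping chunk predictions. The sketch below illustrates the weighting scheme from the ACT paper, w_i = exp(-m * i) with w_0 on the oldest prediction and m = 0.01 as in the table above; it is an illustrative NumPy reimplementation, not the released code.

```python
import numpy as np

def ensemble_action(predictions, m=0.01):
    """Temporally ensemble overlapping chunk predictions for one timestep.

    predictions: array of shape (k, action_dim) -- the k predictions made
    for this timestep by the last k chunks, ordered oldest first.
    Weights follow w_i = exp(-m * i), with i = 0 for the oldest prediction,
    as in the ACT paper; smaller m weights newer observations more evenly.
    """
    predictions = np.asarray(predictions)
    weights = np.exp(-m * np.arange(len(predictions)))
    weights /= weights.sum()
    return weights @ predictions
```

Because m is small (0.01), the weights decay slowly, so the executed action changes smoothly even when individual chunk predictions are noisy.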
## Evaluation Metrics
Evaluated on a sample episode from the training set:
| Joint | MAE | MSE |
|---|---|---|
| Joint 0 | 0.0374 | 0.0034 |
| Joint 1 | 0.0342 | 0.0042 |
| Joint 2 | 0.0394 | 0.0025 |
| Joint 3 | 0.0216 | 0.0011 |
| Joint 4 | 0.0264 | 0.0009 |
| Joint 5 (gripper) | 0.0020 | 0.00001 |
| Overall | 0.0268 | 0.0020 |
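Per-joint MAE and MSE of this kind can be computed with a few lines of NumPy. The helper below is an illustrative sketch (not part of the released code), assuming ground-truth and predicted joint trajectories stored as `(T, 6)` arrays:

```python
import numpy as np

def per_joint_errors(gt, pred):
    """Per-joint MAE and MSE over an episode.

    gt, pred: arrays of shape (T, 6) -- ground-truth and predicted joint
    positions for the 6-DoF arm (the last joint is the gripper).
    Returns two arrays of shape (6,): MAE and MSE per joint.
    """
    err = np.asarray(pred) - np.asarray(gt)
    mae = np.abs(err).mean(axis=0)
    mse = (err ** 2).mean(axis=0)
    return mae, mse
```

The "Overall" row in the table then corresponds to averaging the returned per-joint values.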
## Training Dataset

Trained on `gpudad/so101_pick_cube_chunked`, a chunked version of the SO-101 pick-cube dataset with episode-level video files for efficient loading.
- ~11k episodes
- 3 camera views (front, overhead, wrist)
- 30 FPS
## Camera Views

The model uses 3 camera inputs:
- Front camera - Main observation view
- Overhead camera - Top-down perspective
- Wrist camera - End-effector mounted camera
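A sketch of how the three camera views plus the joint state might be packed into a LeRobot-style observation dict. The key names and image resolution here are assumptions for illustration (check the policy's config for the exact keys used at training time), and NumPy stands in for the torch tensors a real control loop would use:

```python
import numpy as np

def make_observation(front, overhead, wrist, state):
    """Pack camera frames and joint state into an observation dict.

    Images: (3, H, W) float32 in [0, 1], channel-first.
    state: (6,) float32 joint positions.
    A leading batch dimension of 1 is added to every entry, matching the
    batched input the policy expects at inference time.
    """
    return {
        "observation.images.front": front[None],       # (1, 3, H, W)
        "observation.images.overhead": overhead[None],
        "observation.images.wrist": wrist[None],
        "observation.state": state[None],              # (1, 6)
    }

# Dummy frames at a hypothetical 320x240 resolution:
obs = make_observation(
    np.zeros((3, 240, 320), np.float32),
    np.zeros((3, 240, 320), np.float32),
    np.zeros((3, 240, 320), np.float32),
    np.zeros(6, np.float32),
)
```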
## Training Command

```bash
python -m roboport.train act \
  /path/to/so101_pick_cube_chunked \
  -o /path/to/output \
  --steps 500000 \
  --chunk-size 100 \
  --n-action-steps 1 \
  --temporal-ensemble 0.01 \
  --kl-weight 10.0 \
  --batch-size 16 \
  --lr 3e-5 \
  --vision-backbone resnet18 \
  --save-freq 50000 \
  --gpu 0
```
## Usage

```python
from lerobot.policies.act.modeling_act import ACTPolicy

policy = ACTPolicy.from_pretrained("gpudad/act_so101_pick_cube")
policy.eval()

# Run inference
action = policy.select_action(observation)
```
## Framework

Trained using roboport with a LeRobot backend.