---
license: apache-2.0
tags:
- robotics
- act
- lerobot
- manipulation
- imitation-learning
datasets:
- gpudad/so101_pick_cube_chunked
library_name: lerobot
pipeline_tag: robotics
---
|
|
|
|
|
# ACT Model for SO-101 Pick Cube Task |
|
|
|
|
|
This is an Action Chunking Transformer (ACT) policy trained on demonstrations from the SO-101 robot arm to perform a cube-picking task.
|
|
|
|
|
## Demo |
|
|
|
|
|
 |
|
|
|
|
|
*Visualization showing ground truth (green) vs predicted actions (blue) with mean absolute error per frame.* |
|
|
|
|
|
## Environment |
|
|
|
|
|
 |
|
|
|
|
|
## Model Details |
|
|
|
|
|
| Parameter | Value |
|-----------|-------|
| Architecture | ACT (Action Chunking Transformer) |
| Vision Backbone | ResNet18 |
| Training Steps | 500,000 |
| Chunk Size | 100 |
| N Action Steps | 1 (with temporal ensembling) |
| Temporal Ensemble Coeff | 0.01 |
| KL Weight | 10.0 |
| Batch Size | 16 |
| Learning Rate | 3e-5 |
| Parameters | 51.6M |
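With `n_action_steps = 1` and a temporal ensemble coefficient of 0.01, the policy predicts a full 100-step chunk at every frame and the action actually executed is an exponentially weighted average of all overlapping chunk predictions for the current timestep, as in the original ACT formulation. A minimal sketch of that weighting (the function name here is illustrative, not the LeRobot API):

```python
import numpy as np

def ensemble_action(chunk_predictions, coeff=0.01):
    """Average overlapping chunk predictions for the current timestep.

    chunk_predictions: list of action vectors predicted for *this* timestep,
    ordered oldest chunk first. Weight w_i = exp(-coeff * i), so the oldest
    prediction gets the largest weight; with a small coeff (0.01 here),
    all overlapping chunks contribute almost equally.
    """
    preds = np.stack(chunk_predictions)               # (k, action_dim)
    weights = np.exp(-coeff * np.arange(len(preds)))  # oldest -> heaviest
    weights /= weights.sum()
    return weights @ preds                            # weighted mean action
```

A small coefficient smooths out jitter between successive chunk predictions at the cost of slightly slower reactions, which suits a quasi-static task like cube picking.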
|
|
|
|
|
## Evaluation Metrics |
|
|
|
|
|
Evaluated on a sample episode from the training set: |
|
|
|
|
|
| Joint | MAE | MSE |
|-------|-----|-----|
| Joint 0 | 0.0374 | 0.0034 |
| Joint 1 | 0.0342 | 0.0042 |
| Joint 2 | 0.0394 | 0.0025 |
| Joint 3 | 0.0216 | 0.0011 |
| Joint 4 | 0.0264 | 0.0009 |
| Joint 5 (gripper) | 0.0020 | 0.00001 |
| **Overall** | **0.0268** | **0.0020** |
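The per-joint numbers above are standard elementwise errors between the predicted and ground-truth joint trajectories, averaged over time. A sketch of how they can be computed (trajectory shapes are an assumption, not taken from the evaluation script):

```python
import numpy as np

def joint_errors(pred, gt):
    """Per-joint MAE and MSE for trajectories of shape (T, n_joints)."""
    err = pred - gt
    mae = np.abs(err).mean(axis=0)   # (n_joints,)
    mse = (err ** 2).mean(axis=0)    # (n_joints,)
    return mae, mse
```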
|
|
|
|
|
## Training Dataset |
|
|
|
|
|
Trained on [gpudad/so101_pick_cube_chunked](https://huggingface.co/datasets/gpudad/so101_pick_cube_chunked) - a chunked version of the SO-101 pick cube dataset with episode-level video files for efficient loading. |
|
|
|
|
|
- ~11k episodes
- 3 camera views (front, overhead, wrist)
- 30 FPS
|
|
|
|
|
## Camera Views |
|
|
|
|
|
The model uses 3 camera inputs:

- **Front camera** - Main observation view
- **Overhead camera** - Top-down perspective
- **Wrist camera** - End-effector mounted camera
|
|
|
|
|
## Training Command |
|
|
|
|
|
```bash
python -m roboport.train act \
    /path/to/so101_pick_cube_chunked \
    -o /path/to/output \
    --steps 500000 \
    --chunk-size 100 \
    --n-action-steps 1 \
    --temporal-ensemble 0.01 \
    --kl-weight 10.0 \
    --batch-size 16 \
    --lr 3e-5 \
    --vision-backbone resnet18 \
    --save-freq 50000 \
    --gpu 0
```
|
|
|
|
|
## Usage |
|
|
|
|
|
```python
from lerobot.policies.act.modeling_act import ACTPolicy

policy = ACTPolicy.from_pretrained("gpudad/act_so101_pick_cube")
policy.eval()

# Run inference: `observation` is a dict of batched tensors
# (robot joint state plus the three camera images).
action = policy.select_action(observation)
```
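`select_action` expects a dict of batched `torch` tensors. A sketch of assembling one from raw camera frames and joint positions follows; the exact feature keys (e.g. `observation.images.front`) depend on how the dataset was recorded and are assumptions here, so check them against the dataset's feature names before use:

```python
import numpy as np
import torch

def make_observation(state, front, overhead, wrist, device="cpu"):
    """Pack one timestep into the batched dict the policy expects.

    state: (6,) joint positions; front/overhead/wrist: (H, W, 3) uint8 RGB.
    Feature key names below are assumed, not confirmed by this model card.
    """
    def img(x):
        # HWC uint8 -> 1 x C x H x W float in [0, 1]
        t = torch.from_numpy(x).permute(2, 0, 1).float() / 255.0
        return t.unsqueeze(0).to(device)

    return {
        "observation.state": torch.as_tensor(state, dtype=torch.float32)
                                  .unsqueeze(0).to(device),
        "observation.images.front": img(front),
        "observation.images.overhead": img(overhead),
        "observation.images.wrist": img(wrist),
    }
```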
|
|
|
|
|
## Framework |
|
|
|
|
|
Trained using [roboport](https://github.com/DreamwareInc/roboport) with the LeRobot backend.
|
|
|