---
license: apache-2.0
tags:
  - robotics
  - act
  - lerobot
  - manipulation
  - imitation-learning
datasets:
  - gpudad/so101_pick_cube_chunked
library_name: lerobot
pipeline_tag: robotics
---

# ACT Model for SO-101 Pick Cube Task

This is an Action Chunking Transformer (ACT) policy trained on demonstrations from the SO-101 robot arm for a cube-picking task.

## Demo

![Model Evaluation](https://huggingface.co/gpudad/act_so101_pick_cube/resolve/main/act_eval_500k.gif)

*Visualization showing ground truth (green) vs predicted actions (blue) with mean absolute error per frame.*

## Environment

![Environment Preview](https://huggingface.co/datasets/gpudad/so101_pick_cube_chunked/resolve/main/camera_angles.png)

## Model Details

| Parameter | Value |
|-----------|-------|
| Architecture | ACT (Action Chunking Transformer) |
| Vision Backbone | ResNet18 |
| Training Steps | 500,000 |
| Chunk Size | 100 |
| N Action Steps | 1 (with temporal ensembling) |
| Temporal Ensemble Coeff | 0.01 |
| KL Weight | 10.0 |
| Batch Size | 16 |
| Learning Rate | 3e-5 |
| Parameters | 51.6M |
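With `N Action Steps` set to 1 and temporal ensembling enabled, the policy predicts a fresh 100-step chunk at every control step and blends all overlapping predictions for the current timestep. The sketch below illustrates that weighting as described in the ACT paper (weights `exp(-m * i)`, with `i = 0` for the oldest prediction). The function name `ensemble_action` and the sample values are hypothetical, and this is not the LeRobot implementation.

```python
import numpy as np

def ensemble_action(predictions, coeff=0.01):
    """Blend overlapping chunk predictions for a single timestep.

    predictions: array of shape (k, action_dim), ordered oldest first.
    coeff: temporal ensemble coefficient m (0.01 in the table above).
    """
    predictions = np.asarray(predictions, dtype=np.float64)
    # w_i = exp(-m * i); i = 0 is the oldest prediction and gets the largest weight.
    weights = np.exp(-coeff * np.arange(len(predictions)))
    weights /= weights.sum()
    return weights @ predictions

# Three overlapping predictions for a 2-dimensional action (illustrative values).
preds = np.array([[0.10, 0.20], [0.12, 0.18], [0.14, 0.16]])
blended = ensemble_action(preds)
```

With a coefficient of 0.01 the weights decay slowly: across 100 overlapping predictions, the newest still carries about 37% of the oldest one's weight, so the blend is close to a plain average.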

## Evaluation Metrics

Evaluated on a single sample episode drawn from the training set, so these numbers reflect fit to seen data rather than held-out performance:

| Joint | MAE | MSE |
|-------|-----|-----|
| Joint 0 | 0.0374 | 0.0034 |
| Joint 1 | 0.0342 | 0.0042 |
| Joint 2 | 0.0394 | 0.0025 |
| Joint 3 | 0.0216 | 0.0011 |
| Joint 4 | 0.0264 | 0.0009 |
| Joint 5 (gripper) | 0.0020 | 0.00001 |
| **Overall** | **0.0268** | **0.0020** |
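The evaluation script itself is not included in this card. As a rough sketch of how per-joint figures like these can be computed, the snippet below averages absolute and squared errors over the frames of an episode, then averages across joints for the overall row (the **Overall** values above are consistent with a plain mean over the six joints). The function name `per_joint_errors` and the synthetic arrays are assumptions for illustration.

```python
import numpy as np

def per_joint_errors(ground_truth, predicted):
    """Return (mae, mse) per joint for arrays of shape (num_frames, num_joints)."""
    gt = np.asarray(ground_truth, dtype=np.float64)
    pred = np.asarray(predicted, dtype=np.float64)
    diff = pred - gt
    mae = np.abs(diff).mean(axis=0)   # average over frames, one value per joint
    mse = (diff ** 2).mean(axis=0)
    return mae, mse

# Synthetic stand-in data: 4 frames, 6 joints, constant error of 0.1.
gt = np.zeros((4, 6))
pred = np.full((4, 6), 0.1)
mae, mse = per_joint_errors(gt, pred)
overall_mae, overall_mse = mae.mean(), mse.mean()
```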

## Training Dataset

Trained on [gpudad/so101_pick_cube_chunked](https://huggingface.co/datasets/gpudad/so101_pick_cube_chunked), a chunked version of the SO-101 pick cube dataset that stores episode-level video files for efficient loading.

- ~11k episodes
- 3 camera views (front, overhead, wrist)
- 30 FPS
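For a sense of scale, the chunk size and frame rate together fix how far ahead each prediction reaches. The arithmetic below assumes the policy is run at the dataset's 30 FPS, which this card does not state explicitly.

```python
CHUNK_SIZE = 100   # actions predicted per forward pass (from Model Details)
FPS = 30           # dataset frame rate; assumed to match the control rate

chunk_seconds = CHUNK_SIZE / FPS   # horizon covered by one predicted chunk
step_ms = 1000 / FPS               # time budget per control step

print(f"chunk horizon: {chunk_seconds:.2f} s, step budget: {step_ms:.1f} ms")
```

Under that assumption each chunk spans roughly 3.3 seconds of motion, and a forward pass must finish in about 33 ms to keep up with per-step re-prediction.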

## Camera Views

The model uses 3 camera inputs:
- **Front camera** - Main observation view
- **Overhead camera** - Top-down perspective
- **Wrist camera** - End-effector mounted camera

## Training Command

```bash
python -m roboport.train act \
  /path/to/so101_pick_cube_chunked \
  -o /path/to/output \
  --steps 500000 \
  --chunk-size 100 \
  --n-action-steps 1 \
  --temporal-ensemble 0.01 \
  --kl-weight 10.0 \
  --batch-size 16 \
  --lr 3e-5 \
  --vision-backbone resnet18 \
  --save-freq 50000 \
  --gpu 0
```

## Usage

```python
from lerobot.policies.act.modeling_act import ACTPolicy

policy = ACTPolicy.from_pretrained("gpudad/act_so101_pick_cube")
policy.eval()

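# Clear any queued actions / temporal-ensemble state at the start of each episode
policy.reset()

# Run inference. `observation` is a dict of batched tensors (robot state plus
# the three camera images) keyed by the policy's input feature names.
action = policy.select_action(observation)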
```
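The snippet above leaves `observation` undefined. As a hedged sketch of the preprocessing involved, the code below builds a NumPy version of such a dictionary: camera frames converted from `uint8` height-width-channel layout to `float32` channel-first layout in `[0, 1]` with a leading batch dimension. The key names (`observation.state`, `observation.images.<camera>`), the 480x640 frame size, and the helper `to_policy_image` are assumptions based on common LeRobot naming; check the checkpoint's configuration for the actual input features.

```python
import numpy as np

def to_policy_image(frame_hwc_uint8):
    """Convert an (H, W, 3) uint8 frame to a (1, 3, H, W) float32 array in [0, 1]."""
    frame = np.asarray(frame_hwc_uint8)
    chw = np.transpose(frame, (2, 0, 1)).astype(np.float32) / 255.0
    return chw[None, ...]  # add batch dimension

# Assumed key names and shapes; placeholders stand in for real sensor data.
cameras = ("front", "overhead", "wrist")
observation_np = {"observation.state": np.zeros((1, 6), dtype=np.float32)}
for name in cameras:
    frame = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder camera frame
    observation_np[f"observation.images.{name}"] = to_policy_image(frame)
```

Before calling `select_action`, each array would still need to be converted to a tensor (for example with `torch.from_numpy`) and moved to the same device as the policy.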

## Framework

Trained using [roboport](https://github.com/DreamwareInc/roboport) with the LeRobot backend.