Update README with evaluation metrics and GIF

README.md

This is an Action Chunking Transformer (ACT) model trained on the SO-101 robot arm for a cube picking task.

## Demo



*Visualization showing ground truth (green) vs predicted actions (blue) with mean absolute error per frame.*
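
The overlay in the GIF can be reproduced with a few lines of matplotlib; a minimal sketch, assuming hypothetical `(T, 6)` arrays of ground-truth and predicted joint values saved as `gt_actions.npy` and `pred_actions.npy`:

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical (T, 6) arrays: one row per frame, one column per joint.
gt_actions = np.load("gt_actions.npy")
pred_actions = np.load("pred_actions.npy")

fig, axes = plt.subplots(6, 1, sharex=True, figsize=(8, 10))
for joint, ax in enumerate(axes):
    ax.plot(gt_actions[:, joint], color="green", label="ground truth")
    ax.plot(pred_actions[:, joint], color="blue", label="predicted")
    ax.set_ylabel(f"joint {joint}")
axes[0].legend()
axes[-1].set_xlabel("frame")

# Per-frame mean absolute error across joints, as displayed in the GIF.
mae_per_frame = np.abs(gt_actions - pred_actions).mean(axis=1)
print(f"overall MAE: {mae_per_frame.mean():.4f}")
plt.show()
```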

## Environment



## Model Details

| Parameter | Value |
|-----------|-------|
| Architecture | ACT (Action Chunking Transformer) |
| Vision Backbone | ResNet18 |
| Training Steps | 500,000 |
| Chunk Size | 100 |
| N Action Steps | 1 (with temporal ensembling) |
| Temporal Ensemble Coeff | 0.01 |
| KL Weight | 10.0 |
| Batch Size | 16 |
| Learning Rate | 3e-5 |
| Parameters | 51.6M |
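
For readers who want the same configuration in code, the table maps onto lerobot's `ACTConfig` roughly as below. This is a sketch only: the field names are assumptions based on current lerobot releases, and the actual training ran through `roboport` (see the training command further down).

```python
# Sketch: mirrors the table above using lerobot's ACTConfig.
# Field names are assumptions based on current lerobot releases.
from lerobot.policies.act.configuration_act import ACTConfig

config = ACTConfig(
    vision_backbone="resnet18",
    chunk_size=100,
    n_action_steps=1,              # one step executed per inference call
    temporal_ensemble_coeff=0.01,  # enables temporal ensembling
    kl_weight=10.0,
    optimizer_lr=3e-5,
)
```
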
## Evaluation Metrics

Evaluated on a sample episode from the training set:

| Joint | MAE | MSE |
|-------|-----|-----|
| Joint 0 | 0.0374 | 0.0034 |
| Joint 1 | 0.0342 | 0.0042 |
| Joint 2 | 0.0394 | 0.0025 |
| Joint 3 | 0.0216 | 0.0011 |
| Joint 4 | 0.0264 | 0.0009 |
| Joint 5 (gripper) | 0.0020 | 0.00001 |
| **Overall** | **0.0268** | **0.0020** |
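
The metrics are plain regression errors between predicted and ground-truth joint positions over the episode; a minimal sketch of the computation, again assuming hypothetical `(T, 6)` arrays:

```python
import numpy as np

# Hypothetical (T, 6) arrays: one row per frame, one column per joint.
gt = np.load("gt_actions.npy")
pred = np.load("pred_actions.npy")

err = pred - gt
for joint in range(err.shape[1]):
    mae = np.abs(err[:, joint]).mean()
    mse = (err[:, joint] ** 2).mean()
    print(f"Joint {joint}: MAE={mae:.4f}  MSE={mse:.4f}")

print(f"Overall: MAE={np.abs(err).mean():.4f}  MSE={(err ** 2).mean():.4f}")
```
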
## Training Dataset

Trained on [gpudad/so101_pick_cube_chunked](https://huggingface.co/datasets/gpudad/so101_pick_cube_chunked), a chunked version of the SO-101 pick cube dataset with episode-level video files for efficient loading.

- ~11k episodes
- 3 camera views (front, overhead, wrist)
- 30 FPS
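
A sketch of loading the dataset for inspection with lerobot's `LeRobotDataset`; the import path and attribute names are assumptions matching the package layout used in the Usage section below:

```python
# Sketch: browse the dataset with lerobot's LeRobotDataset.
from lerobot.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("gpudad/so101_pick_cube_chunked")
print(dataset.num_episodes, dataset.fps)  # expect roughly 11k episodes at 30 FPS

sample = dataset[0]  # dict of tensors: camera frames, state, action, timestamps
print(list(sample.keys()))
```
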
## Camera Views

The model uses 3 camera inputs:

- front
- overhead
- wrist

## Training

```bash
python -m roboport.train act \
    /path/to/so101_pick_cube_chunked \
    -o /path/to/output \
    --steps 500000 \
    --chunk-size 100 \
    --n-action-steps 1 \
    ...
```

## Usage

```python
from lerobot.policies.act.modeling_act import ACTPolicy

policy = ACTPolicy.from_pretrained("gpudad/act_so101_pick_cube")
policy.eval()

# Run inference
action = policy.select_action(observation)
```
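
Note that `observation` is left undefined above. A sketch of the structure it is expected to have, assuming lerobot's usual feature-key convention and the three camera names listed earlier; the exact keys and image resolutions are assumptions, so check `policy.config.input_features` on the loaded model:

```python
# Sketch: continues the snippet above and fills in `observation`.
# Key names follow lerobot's usual convention ("observation.images.<camera>",
# "observation.state"); the exact keys and resolutions here are assumptions.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
policy.to(device)

observation = {
    # float32 images in [0, 1], shape (batch, channels, height, width)
    "observation.images.front": torch.rand(1, 3, 480, 640, device=device),
    "observation.images.overhead": torch.rand(1, 3, 480, 640, device=device),
    "observation.images.wrist": torch.rand(1, 3, 480, 640, device=device),
    # 6-dimensional joint state for the SO-101 arm
    "observation.state": torch.zeros(1, 6, device=device),
}

action = policy.select_action(observation)  # one (1, 6) joint command
```
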
## Framework