---
license: apache-2.0
tags:
- robotics
- act
- lerobot
- manipulation
- imitation-learning
datasets:
- gpudad/so101_pick_cube_chunked
library_name: lerobot
pipeline_tag: robotics
---
|
|
|
|
|
# ACT Model for SO-101 Pick Cube Task |
|
|
|
|
|
This is an Action Chunking Transformer (ACT) policy trained on demonstrations from the SO-101 robot arm to perform a cube-picking task.
|
|
|
|
|
## Demo |
|
|
|
|
|
 |
|
|
|
|
|
*Visualization showing ground truth (green) vs predicted actions (blue) with mean absolute error per frame.* |
|
|
|
|
|
## Environment |
|
|
|
|
|
 |
|
|
|
|
|
## Model Details |
|
|
|
|
|
| Parameter | Value |
|-----------|-------|
| Architecture | ACT (Action Chunking Transformer) |
| Vision Backbone | ResNet18 |
| Training Steps | 500,000 |
| Chunk Size | 100 |
| N Action Steps | 1 (with temporal ensembling) |
| Temporal Ensemble Coeff | 0.01 |
| KL Weight | 10.0 |
| Batch Size | 16 |
| Learning Rate | 3e-5 |
| Parameters | 51.6M |
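With `n_action_steps = 1` and a temporal ensemble coefficient of 0.01, the policy predicts a full 100-step chunk at every frame and the action actually executed is an exponentially weighted average of all overlapping chunk predictions for the current timestep, as in the original ACT formulation. A minimal sketch of that weighting (the function name here is illustrative, not the LeRobot API):

```python
import numpy as np

def ensemble_action(chunk_predictions, coeff=0.01):
    """Average overlapping chunk predictions for the current timestep.

    chunk_predictions: list of action vectors predicted for *this* timestep,
    ordered oldest chunk first. Weight w_i = exp(-coeff * i), so the oldest
    prediction gets the largest weight; with a small coeff (0.01 here),
    all overlapping chunks contribute almost equally.
    """
    preds = np.stack(chunk_predictions)               # (k, action_dim)
    weights = np.exp(-coeff * np.arange(len(preds)))  # oldest -> heaviest
    weights /= weights.sum()
    return weights @ preds                            # weighted mean action
```

A small coefficient smooths out jitter between successive chunk predictions at the cost of slightly slower reactions, which suits a quasi-static task like cube picking.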
|
|
|
|
|
## Evaluation Metrics |
|
|
|
|
|
Evaluated on a sample episode from the training set: |
|
|
|
|
|
| Joint | MAE | MSE |
|-------|-----|-----|
| Joint 0 | 0.0374 | 0.0034 |
| Joint 1 | 0.0342 | 0.0042 |
| Joint 2 | 0.0394 | 0.0025 |
| Joint 3 | 0.0216 | 0.0011 |
| Joint 4 | 0.0264 | 0.0009 |
| Joint 5 (gripper) | 0.0020 | 0.00001 |
| **Overall** | **0.0268** | **0.0020** |
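The per-joint numbers above are standard elementwise errors between the predicted and ground-truth joint trajectories, averaged over time. A sketch of how they can be computed (trajectory shapes are an assumption, not taken from the evaluation script):

```python
import numpy as np

def joint_errors(pred, gt):
    """Per-joint MAE and MSE for trajectories of shape (T, n_joints)."""
    err = pred - gt
    mae = np.abs(err).mean(axis=0)   # (n_joints,)
    mse = (err ** 2).mean(axis=0)    # (n_joints,)
    return mae, mse
```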
|
|
|
|
|
## Training Dataset |
|
|
|
|
|
Trained on [gpudad/so101_pick_cube_chunked](https://huggingface.co/datasets/gpudad/so101_pick_cube_chunked) - a chunked version of the SO-101 pick cube dataset with episode-level video files for efficient loading. |
|
|
|
|
|
- ~11k episodes
- 3 camera views (front, overhead, wrist)
- 30 FPS
|
|
|
|
|
## Camera Views |
|
|
|
|
|
The model uses 3 camera inputs:

- **Front camera** - Main observation view
- **Overhead camera** - Top-down perspective
- **Wrist camera** - End-effector mounted camera
|
|
|
|
|
## Training Command |
|
|
|
|
|
```bash
python -m roboport.train act \
    /path/to/so101_pick_cube_chunked \
    -o /path/to/output \
    --steps 500000 \
    --chunk-size 100 \
    --n-action-steps 1 \
    --temporal-ensemble 0.01 \
    --kl-weight 10.0 \
    --batch-size 16 \
    --lr 3e-5 \
    --vision-backbone resnet18 \
    --save-freq 50000 \
    --gpu 0
```
|
|
|
|
|
## Usage |
|
|
|
|
|
```python
from lerobot.policies.act.modeling_act import ACTPolicy

policy = ACTPolicy.from_pretrained("gpudad/act_so101_pick_cube")
policy.eval()

# Run inference: `observation` is a dict of batched tensors
# (robot joint state plus the three camera images).
action = policy.select_action(observation)
```
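`select_action` expects a dict of batched `torch` tensors. A sketch of assembling one from raw camera frames and joint positions follows; the exact feature keys (e.g. `observation.images.front`) depend on how the dataset was recorded and are assumptions here, so check them against the dataset's feature names before use:

```python
import numpy as np
import torch

def make_observation(state, front, overhead, wrist, device="cpu"):
    """Pack one timestep into the batched dict the policy expects.

    state: (6,) joint positions; front/overhead/wrist: (H, W, 3) uint8 RGB.
    Feature key names below are assumed, not confirmed by this model card.
    """
    def img(x):
        # HWC uint8 -> 1 x C x H x W float in [0, 1]
        t = torch.from_numpy(x).permute(2, 0, 1).float() / 255.0
        return t.unsqueeze(0).to(device)

    return {
        "observation.state": torch.as_tensor(state, dtype=torch.float32)
                                  .unsqueeze(0).to(device),
        "observation.images.front": img(front),
        "observation.images.overhead": img(overhead),
        "observation.images.wrist": img(wrist),
    }
```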
|
|
|
|
|
## Framework |
|
|
|
|
|
Trained using [roboport](https://github.com/DreamwareInc/roboport) with the LeRobot backend.
|
|
|