ACT Single-Arm Pick and Place

This is an Action Chunking with Transformers (ACT) model trained to perform a standard pick-and-place task with a single robotic arm in PyBullet. It serves as an architectural baseline for comparison against much larger Vision-Language-Action (VLA) models.

Model Details

  • Architecture: Action Chunking with Transformers (ACT) with a ResNet18 vision backbone (trained from scratch).
  • Task: Single-arm pick and place (transferring an object into a basket).
  • Action Space: 8-D Cartesian end-effector pose + gripper state [x, y, z, qx, qy, qz, qw, gripper].
  • Vision: 2 camera streams (overhead, wrist) at 224x224 resolution.
  • Training Data: 198 expert demonstrations collected at 10 FPS.
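
To make the action layout above concrete, here is a minimal sketch of how one 8-D action vector could be unpacked. The helper name split_action, the channels-first image layout, and the example values are assumptions for illustration and are not taken from the training code; only the field order [x, y, z, qx, qy, qz, qw, gripper], the two 224x224 camera streams, and the 8-D size come from this card. The quaternion order (x, y, z, w) matches PyBullet's convention.

```python
import numpy as np

ACTION_DIM = 8      # [x, y, z, qx, qy, qz, qw, gripper], as listed above
NUM_CAMERAS = 2     # overhead and wrist
IMAGE_SIZE = 224    # pixels per side


def split_action(action):
    """Split one 8-D action into position, unit quaternion, and gripper value.

    Hypothetical helper: the name and the renormalization step are assumptions.
    """
    action = np.asarray(action, dtype=np.float32)
    if action.shape != (ACTION_DIM,):
        raise ValueError(f"expected shape ({ACTION_DIM},), got {action.shape}")

    position = action[:3]        # x, y, z
    quaternion = action[3:7]     # qx, qy, qz, qw
    norm = np.linalg.norm(quaternion)
    if norm == 0.0:
        raise ValueError("quaternion has zero norm")
    # A regressed quaternion is rarely exactly unit length, so renormalize it.
    quaternion = quaternion / norm

    gripper = float(action[7])   # the value range is not stated in this card
    return position, quaternion, gripper


# Placeholder observation: two RGB frames, channels-first (layout is an assumption).
images = np.zeros((NUM_CAMERAS, 3, IMAGE_SIZE, IMAGE_SIZE), dtype=np.float32)

# Example action with a deliberately non-unit quaternion.
position, quaternion, gripper = split_action([0.4, 0.0, 0.2, 0.0, 0.0, 0.0, 2.0, 1.0])
```

Renormalizing the quaternion before sending a pose target to the simulator is a common precaution for regressed orientations; whether the original pipeline does this is not stated here.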

Performance

The model demonstrates basic task-learning capability and achieves very low open-loop mean absolute error against the expert demonstrations. Despite having far fewer parameters than VLM-based alternatives, it solves the in-distribution manipulation trajectory with high precision.
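
For readers unfamiliar with the metric, open-loop mean absolute error compares predicted actions with the recorded expert actions on the same observations, without executing the predictions in the simulator. The sketch below shows one way to compute it. The function names and the synthetic arrays are assumptions for illustration, not the evaluation script used for this model, and the numbers are not results.

```python
import numpy as np


def open_loop_mae(predicted, expert):
    """Mean absolute error over all timesteps and action dimensions."""
    predicted = np.asarray(predicted, dtype=np.float64)
    expert = np.asarray(expert, dtype=np.float64)
    if predicted.shape != expert.shape:
        raise ValueError(f"shape mismatch: {predicted.shape} vs {expert.shape}")
    return float(np.mean(np.abs(predicted - expert)))


def per_dimension_mae(predicted, expert):
    """Mean absolute error per action dimension, averaged over timesteps."""
    predicted = np.asarray(predicted, dtype=np.float64)
    expert = np.asarray(expert, dtype=np.float64)
    if predicted.shape != expert.shape:
        raise ValueError(f"shape mismatch: {predicted.shape} vs {expert.shape}")
    return np.mean(np.abs(predicted - expert), axis=0)


# Synthetic stand-in data: 5 timesteps of 8-D actions (not real model output).
expert_actions = np.zeros((5, 8))
predicted_actions = expert_actions + 0.01

overall = open_loop_mae(predicted_actions, expert_actions)
by_dim = per_dimension_mae(predicted_actions, expert_actions)
```

A per-dimension breakdown is useful because position, quaternion, and gripper components have different scales, so a single averaged number can hide errors in one of them. A low open-loop error also does not by itself guarantee closed-loop task success, since small errors can compound during rollout.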

Checkpoint: Safetensors format, 51.6M parameters, F32 tensors.