ACT Single-Arm Pick and Place

This is an Action Chunking with Transformers (ACT) model trained to perform a standard pick-and-place task with a single robotic arm in PyBullet. It serves as an architectural baseline for comparison against large Vision-Language-Action (VLA) models.

Model Details

Architecture: Action Chunking with Transformers (ACT) with a ResNet18 vision backbone (trained from scratch).
Task: Single-arm pick and place (transferring an object into a basket).
Action Space: 8-D vector: Cartesian end-effector pose (3-D position + 4-D quaternion) plus gripper state, ordered [x, y, z, qx, qy, qz, qw, gripper].
Vision: 2 camera streams (overhead, wrist) at 224x224 resolution.
Training Data: 198 expert demonstrations collected at 10 FPS.
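
To make the action layout above concrete, here is a minimal sketch of how an 8-D action could be unpacked before being sent to a controller. The function name `split_action` and the choice to renormalize the quaternion are assumptions made for illustration; they are not taken from this model's code.

```python
import numpy as np

def split_action(action):
    """Split an 8-D action [x, y, z, qx, qy, qz, qw, gripper] into its parts.

    Hypothetical helper: the quaternion is renormalized because a network
    output is not guaranteed to have unit length.
    """
    action = np.asarray(action, dtype=np.float32)
    if action.shape != (8,):
        raise ValueError(f"expected shape (8,), got {action.shape}")

    position = action[0:3]   # x, y, z
    quat = action[3:7]       # qx, qy, qz, qw (scalar-last order, as listed above)
    norm = np.linalg.norm(quat)
    if norm < 1e-8:
        raise ValueError("quaternion has near-zero norm")
    quat = quat / norm       # project back onto the unit sphere

    gripper = float(action[7])
    return position, quat, gripper


position, quat, gripper = split_action([0.1, 0.2, 0.3, 0.0, 0.0, 0.0, 2.0, 1.0])
```

The scalar-last quaternion order matches the convention PyBullet uses for orientations, so the normalized `quat` can be passed on without reordering.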

Performance

The model demonstrates basic task-learning capability and achieves very low open-loop mean absolute error, meaning its predicted actions closely match the expert actions recorded in the demonstrations. Although it has far fewer parameters than Vision-Language-Action alternatives, it solves the in-distribution manipulation task with high precision.
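
For readers unfamiliar with the metric, the following is a minimal sketch of how an open-loop mean absolute error can be computed, assuming predicted and expert actions are stored as arrays of shape (T, 8). The function name `open_loop_mae` and the array layout are illustrative assumptions, not the evaluation script used for this model.

```python
import numpy as np

def open_loop_mae(predicted, expert):
    """Return (overall MAE, per-dimension MAE) between two action sequences.

    Open-loop here means the policy is fed recorded observations and its
    outputs are compared to the recorded expert actions; nothing is executed
    in the simulator.
    """
    predicted = np.asarray(predicted, dtype=np.float64)
    expert = np.asarray(expert, dtype=np.float64)
    if predicted.shape != expert.shape:
        raise ValueError(f"shape mismatch: {predicted.shape} vs {expert.shape}")

    abs_err = np.abs(predicted - expert)
    # Average over everything, and separately over time for each action dimension.
    return abs_err.mean(), abs_err.mean(axis=0)


# Toy data only: a constant 0.01 offset on every dimension of every step.
expert = np.zeros((5, 8))
predicted = expert + 0.01
overall, per_dim = open_loop_mae(predicted, expert)
```

One caveat with this simple form: an element-wise error on the quaternion components treats q and -q as different even though they encode the same rotation. Low open-loop error also does not guarantee closed-loop task success, since small errors can compound during execution.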


Model size: 51.6M params
Tensor type: F32 (Safetensors)
