Paper: [Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware](https://arxiv.org/abs/2304.13705)
A trained ACT (Action Chunking with Transformers) policy for the ball-in-cup task on the SO-101 robot arm.

- **Goal:** Pick up an orange ball from the table and place it into a pink cup.
- **Robot:** SO-101, a 6-DOF robot arm with gripper
- **Cameras:** Dual camera setup (overhead + wrist-mounted)
| Parameter | Value |
|---|---|
| Dataset | abdul004/so101_ball_in_cup_v5 |
| Episodes | 72 teleoperated demonstrations |
| Frames | 25,045 |
| Training Steps | 100,000 |
| Batch Size | 32 |
| Policy Type | ACT (Action Chunking with Transformers) |
| Hardware | RTX 3080 Ti / RTX 4090 on Vast.ai |
| Training Time | ~8 hours |
| Cost | ~$2-3 USD |
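At inference time, ACT predicts a chunk of future actions at every step and blends the overlapping predictions with an exponential weighting scheme (temporal ensembling). A minimal NumPy sketch of that blending, with an illustrative smoothing constant `m` (the chunk size and `m` used for this policy are not stated here):

```python
import numpy as np

def ensembled_action(t, chunks, m=0.01):
    """Blend all live chunk predictions that cover timestep t.

    chunks: {start_t: (horizon, action_dim) ndarray} of predicted
    action chunks, keyed by the timestep they were produced at.
    Weights follow w_i = exp(-m * i), with i = 0 for the oldest
    still-live prediction (the scheme described in the ACT paper).
    """
    # Keep predictions whose chunk still covers timestep t, oldest first.
    valid = sorted(
        (start, chunk[t - start])
        for start, chunk in chunks.items()
        if 0 <= t - start < len(chunk)
    )
    w = np.exp(-m * np.arange(len(valid)))
    w /= w.sum()
    acts = np.stack([a for _, a in valid])
    return (acts * w[:, None]).sum(axis=0)
```

With `m` small, recent and old predictions are weighted almost equally, which smooths out jitter between consecutive chunks without adding latency.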
Evaluated using custom per-stage metrics plus a VLM (Gemini) visual assessment:
| Session | VLM Score | Grasp | Lift | Transport | Final Position |
|---|---|---|---|---|---|
| s1 | 70 | ✅ Yes | ✅ Yes | ✅ Yes | on_table (dropped) |
| base | 50 | ✅ Yes | ✅ Yes | ✅ Yes | on_table (dropped) |
| s2 | 30 | ✅ Yes | ✅ Yes | ⚠️ Partial | on_table |
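One simple way to turn per-stage outcomes like those above into a single number is a weighted checklist. The weights below are illustrative assumptions, not the actual rubric scored by Gemini:

```python
def stage_score(grasp, lift, transport, in_cup):
    """Illustrative success score from per-stage booleans.

    The stage weights here are assumed for the sketch; the model card
    does not specify how the VLM score was actually computed.
    """
    weights = {"grasp": 20, "lift": 20, "transport": 30, "in_cup": 30}
    stages = {"grasp": grasp, "lift": lift, "transport": transport, "in_cup": in_cup}
    return sum(w for name, w in weights.items() if stages[name])
```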
Key Findings:
Side-by-side: Overhead camera (left) + Wrist camera (right)
5-frame composite showing: Start β Approach β Grasp β Transport β Final
```python
from lerobot.common.policies.act.modeling_act import ACTPolicy

# Load the pretrained policy from the Hugging Face Hub
policy = ACTPolicy.from_pretrained("abdul004/so101_act_policy_v5")

# Run inference: observation is a dict of camera images and joint state
action = policy.select_action(observation)
```
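Wrapping `select_action` in a fixed-rate control loop might look like the sketch below. `get_observation` and `send_action` are hypothetical stand-ins for the SO-101 driver calls, and the control rate is an assumption:

```python
import time

def run_episode(policy, get_observation, send_action, fps=30, steps=300):
    """Closed-loop rollout: query the policy at a fixed rate and stream
    commands to the arm. get_observation/send_action are placeholders
    for the real robot interface (cameras + joint state in, joints out).
    """
    dt = 1.0 / fps
    for _ in range(steps):
        obs = get_observation()           # camera images + joint state
        action = policy.select_action(obs)
        send_action(action)
        time.sleep(dt)                    # crude pacing; a real loop
                                          # would account for compute time
```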
Also trained a DOT (Decoder-Only Transformer) policy on the same dataset:
| Policy | Steps | Grasp | Lift | VLM Score |
|---|---|---|---|---|
| ACT | 100K | ✅ | ✅ | 70 |
| DOT | 14K | ✅ | ✅ | 30 |
DOT training is ongoing; the decoder-only architecture may need more steps to converge.
Cloud Training Setup:
Evaluation Pipeline:
```bibtex
@misc{so101_ball_in_cup,
  author    = {Abdul},
  title     = {SO-101 Ball-in-Cup Policy Training},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/abdul004/so101_act_policy_v5}
}
```