SO-101 Ball-in-Cup ACT Policy

A trained ACT (Action Chunking with Transformers) policy for the ball-in-cup task on the SO-101 robot arm.

Task Description

Goal: Pick up an orange ball from the table and place it into a pink cup.

Robot: SO-101, a 6-servo robot arm (5 arm joints plus gripper)

Cameras: Dual camera setup (overhead + wrist-mounted)

Training Details

| Parameter | Value |
|---|---|
| Dataset | abdul004/so101_ball_in_cup_v5 |
| Episodes | 72 teleoperated demonstrations |
| Frames | 25,045 |
| Training Steps | 100,000 |
| Batch Size | 32 |
| Policy Type | ACT (Action Chunking with Transformers) |
| Hardware | RTX 3080 Ti / RTX 4090 on Vast.ai |
| Training Time | ~8 hours |
| Cost | ~$2-3 USD |
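
For a sense of scale, the figures above can be turned into an average episode length and an approximate number of passes over the dataset. This is a back-of-the-envelope sketch that assumes one training sample per frame and one batch per step, which this card does not state explicitly:

```python
# Figures taken from the Training Details above.
episodes = 72
frames = 25_045
steps = 100_000
batch_size = 32

# Average demonstration length in frames.
avg_episode_len = frames / episodes

# Assumption: one sample per frame and one batch per training step.
samples_seen = steps * batch_size
approx_epochs = samples_seen / frames

print(f"avg episode length: {avg_episode_len:.1f} frames")
print(f"approx. passes over the dataset: {approx_epochs:.1f}")
```

Under that assumption, each demonstration is about 348 frames long and training made roughly 128 passes over the data.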

Evaluation Results

Evaluated with custom metrics plus a VLM (Gemini) visual assessment:

| Session | VLM Score | Grasp | Lift | Transport | Final Position |
|---|---|---|---|---|---|
| s1 | 70 | ✅ Yes | ✅ Yes | ✅ Yes | on_table (dropped) |
| base | 50 | ✅ Yes | ✅ Yes | ✅ Yes | on_table (dropped) |
| s2 | 30 | ✅ Yes | ✅ Yes | ⚠️ Partial | on_table |

Key Findings:

  • Successfully learns grasp and lift behaviors
  • Struggles with final placement (often drops during transport)
  • Low pause rate (~7%) indicates confident movements
  • High gripper activity suggests active grasping attempts

Demo

Evaluation demo (side-by-side): overhead camera (left) + wrist camera (right)

Sample Evaluation

Evaluation composite (5 frames): Start → Approach → Grasp → Transport → Final

Usage

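import torch
# Import path as published with this card; newer LeRobot releases may expose it
# as lerobot.policies.act.modeling_act instead.
from lerobot.common.policies.act.modeling_act import ACTPolicy

# Load the pretrained policy from the Hugging Face Hub
policy = ACTPolicy.from_pretrained("abdul004/so101_act_policy_v5")
policy.eval()
policy.reset()  # clear any queued actions before starting an episode

# Run inference. `observation` must be supplied by the caller: a dict of batched
# tensors (robot state and camera images) keyed as in the training dataset.
with torch.no_grad():
    action = policy.select_action(observation)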
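
ACT predicts a chunk of future actions per forward pass, while `select_action` returns a single action per call. The toy sketch below illustrates that pattern with a plain queue. It is not LeRobot's implementation: `ChunkedController`, `predict_chunk`, and `dummy_predict` are hypothetical names introduced only for illustration.

```python
from collections import deque


class ChunkedController:
    """Toy illustration of action chunking (not LeRobot's implementation)."""

    def __init__(self, predict_chunk, chunk_size):
        self.predict_chunk = predict_chunk  # hypothetical: observation -> list of actions
        self.chunk_size = chunk_size
        self.queue = deque()

    def reset(self):
        # Drop leftover actions, e.g. at the start of a new episode.
        self.queue.clear()

    def select_action(self, observation):
        # Query the model only when the previous chunk has been used up.
        if not self.queue:
            chunk = self.predict_chunk(observation)[: self.chunk_size]
            self.queue.extend(chunk)
        return self.queue.popleft()


# Dummy predictor that records each observation it is asked about.
calls = []


def dummy_predict(observation):
    calls.append(observation)
    return [f"a{observation}_{i}" for i in range(4)]


controller = ChunkedController(dummy_predict, chunk_size=4)
actions = [controller.select_action(t) for t in range(8)]
print(actions)      # eight actions, served from two predicted chunks
print(len(calls))   # number of times the dummy model was queried
```

With a chunk size of 4, eight control steps trigger only two model calls. This is also why resetting the queue between episodes matters: stale actions from a previous episode would otherwise be replayed.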

Comparison with DOT Policy

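A DOT (Decoder-Only Transformer) policy was also trained on the same dataset:

| Policy | Steps | Grasp | Lift | VLM Score |
|---|---|---|---|---|
| ACT | 100K | ✅ | ✅ | 70 |
| DOT | 14K | ❌ | ❌ | 30 |

DOT training is ongoing, and the comparison is at unequal step counts (14K vs. 100K); the decoder-only architecture may need more steps to converge.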

Infrastructure Notes

Cloud Training Setup:

  • Platform: Vast.ai (interruptible instances for cost savings)
  • Checkpoint sync: Automatic upload to HF Hub every 1K steps
  • Resume capability: Training can resume from any checkpoint after interruption
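
The sync-and-resume loop described above could look roughly like the sketch below. The directory layout (one numbered sub-directory per saved step) and the helper names `should_sync`, `latest_checkpoint`, and `sync_to_hub` are assumptions made for illustration, not the scripts actually used for this model. `HfApi.upload_folder` is the standard `huggingface_hub` call for pushing a directory.

```python
from pathlib import Path

SYNC_EVERY = 1_000  # from the notes above: upload every 1K steps


def should_sync(step, every=SYNC_EVERY):
    """True on steps where a checkpoint upload is due."""
    return step > 0 and step % every == 0


def latest_checkpoint(checkpoint_root):
    """Return the highest-numbered checkpoint directory, or None.

    Assumes (hypothetically) one sub-directory per saved step, named by the
    step number, e.g. checkpoints/001000, checkpoints/002000.
    """
    root = Path(checkpoint_root)
    if not root.is_dir():
        return None
    candidates = [p for p in root.iterdir() if p.is_dir() and p.name.isdigit()]
    return max(candidates, key=lambda p: int(p.name), default=None)


def sync_to_hub(checkpoint_dir, repo_id, step):
    """Upload one checkpoint directory to the Hub (needs huggingface_hub and a login)."""
    from huggingface_hub import HfApi  # imported lazily so the helpers above work offline

    HfApi().upload_folder(
        folder_path=str(checkpoint_dir),
        repo_id=repo_id,
        path_in_repo=f"checkpoints/{step:06d}",
        commit_message=f"Checkpoint at step {step}",
    )
```

On an interruptible instance, a restarted job would call `latest_checkpoint` (after pulling checkpoints back from the Hub if local disk was lost) and continue from that step.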

Evaluation Pipeline:

  • Automated metrics from joint/action data (pause%, movement variance)
  • VLM-based visual assessment using composite images
  • Dual-camera frame capture at key moments
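
This card does not define how pause% and movement variance are computed. One plausible reading is sketched below: a step counts as a pause when no joint command changes by more than a threshold, and movement variance is the per-joint variance of the commanded values. The threshold, the function names, and the sample trajectory are all illustrative assumptions.

```python
from statistics import pvariance


def pause_rate(actions, threshold=0.5):
    """Fraction of consecutive steps where no joint command changes by `threshold` or more."""
    if len(actions) < 2:
        return 0.0
    paused = 0
    for prev, curr in zip(actions, actions[1:]):
        largest_change = max(abs(c - p) for p, c in zip(prev, curr))
        if largest_change < threshold:
            paused += 1
    return paused / (len(actions) - 1)


def movement_variance(actions):
    """Per-joint population variance of the commanded values over the episode."""
    return [pvariance(joint) for joint in zip(*actions)]


# Tiny made-up trajectory: 5 steps, 2 joints (values are illustrative only).
trajectory = [
    [0.0, 10.0],
    [2.0, 10.0],
    [2.0, 10.0],  # no change from the previous step, so this counts as a pause
    [4.0, 12.0],
    [6.0, 12.0],
]
print(pause_rate(trajectory))
print(movement_variance(trajectory))
```

In this toy trajectory one of the four transitions is a pause, giving a pause rate of 0.25. The reported figure of ~7% would correspond to far fewer idle steps under a definition like this one.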

Limitations

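  • Task success is not yet reliable: in all three evaluated sessions the ball ended on the table rather than in the cup, typically dropped during the transport phase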
  • Sensitive to ball/cup initial positioning
  • 72 episodes may be insufficient for robust generalization

Citation

@misc{so101_ball_in_cup,
  author = {Abdul},
  title = {SO-101 Ball-in-Cup Policy Training},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/abdul004/so101_act_policy_v5}
}

Acknowledgments
