SO-101 Ball-in-Cup DOT Policy (Experimental)

A trained DOT (Decoder-Only Transformer) policy for the ball-in-cup task using the SO-101 robot arm.

Status: 🔬 Experimental - Training in progress

What is DOT?

DOT (Decoder-Only Transformer) is an alternative to ACT that uses:

  • Decoder-only architecture (GPT-style) instead of encoder-decoder
  • Multi-step observation history (30 lookback steps)
  • LoRA regularization on visual backbone
  • No VAE - simpler architecture

Based on Ilia Larchenko's DOT implementation.
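Decoder-only means each timestep attends only to itself and earlier tokens, as in GPT, rather than cross-attending from a separate decoder to an encoder. A minimal illustration of the causal attention mask this implies (pure Python sketch; LeRobot's DOT builds this inside the transformer, not with this helper):

```python
# Illustrative sketch: the causal (lower-triangular) attention mask used by
# GPT-style decoder-only models such as DOT. Token i may attend to tokens 0..i.
def causal_mask(seq_len):
    """Return a seq_len x seq_len boolean mask: True = attention allowed."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

mask = causal_mask(4)
# Row 0 sees only token 0; row 3 sees tokens 0..3.
```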

Task Description

Goal: Pick up an orange ball from the table and place it into a pink cup.

Robot: SO-101 - 6-DOF robot arm with gripper

Training Details

| Parameter | Value |
|---|---|
| Dataset | abdul004/so101_ball_in_cup_v5 |
| Episodes | 72 teleoperated demonstrations |
| Current Steps | 63,000 / 100,000 (63%) |
| Batch Size | 12 |
| Train Horizon | 150 steps |
| Inference Horizon | 100 steps |
| Lookback Obs Steps | 30 |
| LoRA Rank | 20 |
| Hardware | RTX 3080 Ti on Vast.ai |
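How the history parameters combine is easiest to see with frame indices. This is an assumed scheme for illustration only (the exact sampling logic lives in LeRobot's DOT dataloader): the model sees the `n_obs_steps` most recent frames plus one older "lookback" frame from `lookback_obs_steps` in the past.

```python
# Assumed indexing scheme, for illustration - not LeRobot's exact sampling code.
def history_indices(t, n_obs_steps=3, lookback_obs_steps=30):
    """Frame indices fed to the policy at timestep t (clamped at episode start)."""
    recent = [max(t - k, 0) for k in range(n_obs_steps - 1, -1, -1)]
    lookback = max(t - lookback_obs_steps, 0)
    return [lookback] + recent

print(history_indices(100))  # → [70, 98, 99, 100]
```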

Current Results (63K Steps)

| Checkpoint | VLM Score | Grasp | Lift | Pause % | Notes |
|---|---|---|---|---|---|
| 11K | 30 | ❌ | ❌ | ~80% | Arm positioning only, no ball approach |
| 14K | 30 | ❌ | ❌ | 83% | Similar to 11K, arm barely moves |
| 63K | 30 | ❌ | ❌ | 53% | Actively approaches ball, attempts grasp, fails to secure |

Key Observations (14K → 63K):

  • ✅ Significant improvement in approach behavior - the arm now actively moves toward the ball
  • ✅ Less hesitation - pause rate dropped from 83% to 53%
  • ✅ Grasp attempts - the gripper closes at the correct moment (reaches minimum position 6.3)
  • ❌ Still fails to secure the ball - the grasp attempt doesn't capture the ball in the gripper

The policy has learned the approach phase but struggles with precise grasp execution.
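The pause percentages reported above come from an evaluation metric; a minimal version of such a metric (the threshold and function names here are hypothetical - the actual evaluation pipeline is not shown in this card) would count timesteps where the joints barely move:

```python
def pause_rate(joint_trajectory, threshold=0.01):
    """Fraction of transitions where the largest joint delta is below threshold.

    joint_trajectory: list of per-step joint position vectors.
    threshold: hypothetical motion cutoff distinguishing "paused" steps.
    """
    if len(joint_trajectory) < 2:
        return 0.0
    pauses = 0
    for prev, cur in zip(joint_trajectory, joint_trajectory[1:]):
        if max(abs(a - b) for a, b in zip(prev, cur)) < threshold:
            pauses += 1
    return pauses / (len(joint_trajectory) - 1)

traj = [[0.0, 0.0], [0.0, 0.0], [0.5, 0.1], [0.5, 0.1]]
# Two of three transitions show no motion.
```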

Demo (63K Checkpoint)

DOT 63K grasp attempt - side-by-side overhead (left) and wrist (right) views showing the approach and grasp attempt at ~36s

Sample Evaluation

63K Checkpoint

DOT 63K Evaluation, T1→T5: Start → Pre-Grasp → Grasp (attempt) → Drop → End. Ball approached but not secured.

14K Checkpoint

DOT 14K Evaluation: the ball remains stationary throughout the episode - the arm barely moves toward the target.

Comparison with ACT

| Metric | ACT (100K) | DOT (14K) | DOT (63K) |
|---|---|---|---|
| Grasp | ✅ Yes | ❌ No | ❌ Attempts |
| Lift | ✅ Yes | ❌ No | ❌ No |
| VLM Score | 70 | 30 | 30 |
| Pause % | 7% | 83% | 53% |
| Approach | ✅ Yes | ❌ Minimal | ✅ Active |
| Training Time | ~8 hrs | ~4 hrs | ~18 hrs |

DOT Configuration

```python
from lerobot.policies.dot.configuration_dot import DOTConfig

DOTConfig(
    n_obs_steps=3,
    train_horizon=150,
    inference_horizon=100,
    lookback_obs_steps=30,
    lookback_aug=5,
    lora_rank=20,
    crop_scale=0.8,
    state_noise=0.01,
    optimizer_lr=3e-5,
    optimizer_min_lr=1e-5,
)
```
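With train_horizon=150 and inference_horizon=100, the policy is trained to predict long action chunks but only a shorter horizon is trusted at inference before re-planning. A rough sketch of open-loop chunked execution (DOT actually blends overlapping predictions rather than draining a simple queue; `predict_chunk` and the class name are illustrative):

```python
from collections import deque

class ChunkedController:
    """Execute actions from a predicted chunk, re-querying the policy
    when the queue runs out. `predict_chunk` stands in for the policy."""

    def __init__(self, predict_chunk, inference_horizon=100):
        self.predict_chunk = predict_chunk
        self.inference_horizon = inference_horizon
        self.queue = deque()

    def select_action(self, observation):
        if not self.queue:
            chunk = self.predict_chunk(observation)
            # Keep only the first inference_horizon actions of the chunk.
            self.queue.extend(chunk[: self.inference_horizon])
        return self.queue.popleft()

# Stub policy that "predicts" 150 dummy actions (the train horizon).
controller = ChunkedController(lambda obs: list(range(150)), inference_horizon=100)
first = [controller.select_action(None) for _ in range(101)]
# 100 actions come from the first chunk; the 101st call triggers a re-plan.
```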

Known Issues

  1. Data loading bottleneck: DOT loads 13 observation frames per sample vs ACT's 1, causing ~10x slower data loading (CPU-bound video decoding)
  2. Grasp precision: At 63K, policy approaches ball correctly but fails to secure it in gripper
  3. Slower learning curve: DOT requires more training steps than ACT for comparable behavior
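A common mitigation for CPU-bound repeated video decoding is caching decoded frames across samples, since consecutive samples share most of their observation history. A toy sketch with a stub decoder (cache size and the decode function are hypothetical, not part of LeRobot):

```python
from functools import lru_cache

DECODE_CALLS = {"count": 0}

@lru_cache(maxsize=4096)
def decode_frame(episode, frame_index):
    """Stub for an expensive video-decode call; counts real decodes."""
    DECODE_CALLS["count"] += 1
    return (episode, frame_index)  # placeholder for pixel data

# A DOT-style sample touches many frames; overlapping histories hit the cache.
sample_a = [decode_frame(0, i) for i in range(13)]
sample_b = [decode_frame(0, i + 1) for i in range(13)]  # shifted by one step
# Only one new decode is needed for the second sample.
```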

Usage

```python
from lerobot.policies.dot.modeling_dot import DOTPolicy

# Load policy
policy = DOTPolicy.from_pretrained("abdul004/so101_dot_policy_v5")

# Run inference
action = policy.select_action(observation)
```

Training Infrastructure

Challenges encountered:

  • DataLoader Bus errors with num_workers > 0 (solved with batch_size=12)
  • Resume functionality required patches for DOT's internal normalization
  • Interruptible instances on Vast.ai require checkpoint sync to HF Hub

Checkpoint sync script: Automatically uploads checkpoints every 1K steps to prevent data loss on interruption.
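The sync script itself isn't included here; its core is just an upload hook fired every 1K steps. A sketch with a stubbed uploader (the real script would call the Hugging Face Hub API; the checkpoint naming and `upload` callback are illustrative):

```python
def train_with_sync(total_steps, sync_every=1000, upload=print):
    """Run a dummy training loop, calling `upload` every sync_every steps
    so an interruption loses at most sync_every steps of progress."""
    for step in range(1, total_steps + 1):
        # ... one optimization step would happen here ...
        if step % sync_every == 0:
            upload(f"checkpoint-{step}")

uploads = []
train_with_sync(3500, sync_every=1000, upload=uploads.append)
# → uploads == ["checkpoint-1000", "checkpoint-2000", "checkpoint-3000"]
```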

Next Steps

  • Continue training past the original 50K-step target (done - now at 63K)
  • Evaluate learning curve (14K → 63K shows progress in approach, not grasp)
  • Continue to 100K to see if grasp precision improves
  • Investigate hyperparameter tuning (action horizon, LoRA rank)
  • Try Pi Zero / Pi0.5 as alternative VLA approach

Acknowledgments

DOT architecture and reference implementation by Ilia Larchenko.
