# SO-101 Ball-in-Cup DOT Policy (Experimental)
A trained DOT (Decoder-Only Transformer) policy for the ball-in-cup task using the SO-101 robot arm.
**Status:** 🔬 Experimental - Training in progress
## What is DOT?
DOT (Decoder-Only Transformer) is an alternative to ACT that uses:
- Decoder-only architecture (GPT-style) instead of encoder-decoder
- Multi-step observation history (30 lookback steps)
- LoRA regularization on visual backbone
- No VAE - simpler architecture
Based on Ilia Larchenko's DOT implementation.
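To make the multi-step history concrete, here is a hypothetical indexing scheme that pairs the most recent `n_obs_steps` frames with one "lookback" frame further in the past. The function name and the exact frame selection are illustrative assumptions; the real DOT data loader may choose frames differently (e.g. with `lookback_aug` jitter).

```python
def history_indices(t: int, n_obs_steps: int = 3, lookback_obs_steps: int = 30) -> list[int]:
    """Sketch of which dataset frames form the observation history at step t.

    Hypothetical: the last `n_obs_steps` frames plus one lookback frame
    `lookback_obs_steps` steps in the past, clamped at the episode start.
    """
    recent = [max(0, t - k) for k in range(n_obs_steps - 1, -1, -1)]
    lookback = max(0, t - lookback_obs_steps)
    return [lookback] + recent

print(history_indices(50))  # → [20, 48, 49, 50]
```

Near the start of an episode the indices clamp to frame 0, so early samples repeat the first observation rather than reaching before the episode boundary.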
## Task Description
**Goal:** Pick up an orange ball from the table and place it into a pink cup.
**Robot:** SO-101, a 6-DOF robot arm with gripper.
## Training Details
| Parameter | Value |
|---|---|
| Dataset | abdul004/so101_ball_in_cup_v5 |
| Episodes | 72 teleoperated demonstrations |
| Current Steps | 63,000 / 100,000 (63%) |
| Batch Size | 12 |
| Train Horizon | 150 steps |
| Inference Horizon | 100 steps |
| Lookback Obs Steps | 30 |
| LoRA Rank | 20 |
| Hardware | RTX 3080 Ti on Vast.ai |
## Current Results (63K Steps)
| Checkpoint | VLM Score | Grasp | Lift | Pause % | Notes |
|---|---|---|---|---|---|
| 11K | 30 | ❌ | ❌ | ~80% | Arm positioning only, no ball approach |
| 14K | 30 | ❌ | ❌ | 83% | Similar to 11K, arm barely moves |
| 63K | 30 | ⚠️ | ❌ | 53% | Actively approaches ball, attempts grasp, fails to secure |
**Key observations (14K → 63K):**
- ✅ Significant improvement in approach behavior: the arm now actively moves toward the ball
- ✅ Less hesitation: pause rate dropped from 83% to 53%
- ✅ Grasp attempts: the gripper closes at the correct moment (reaches minimum position 6.3)
- ❌ Still fails to secure the ball: the grasp attempt doesn't capture the ball in the gripper
The policy has learned the approach phase but struggles with precise grasp execution.
## Demo (63K Checkpoint)
Side-by-side: Overhead (left) + Wrist (right) showing approach and grasp attempt at ~36s
## Sample Evaluation

### 63K Checkpoint
T1→T5: Start → Pre-Grasp → Grasp (attempt) → Drop → End. The ball is approached but not secured.

### 14K Checkpoint
The ball remains stationary throughout the episode; the arm barely moves toward the target.
## Comparison with ACT
| Metric | ACT (100K) | DOT (14K) | DOT (63K) |
|---|---|---|---|
| Grasp | ✅ Yes | ❌ No | ⚠️ Attempts |
| Lift | ✅ Yes | ❌ No | ❌ No |
| VLM Score | 70 | 30 | 30 |
| Pause % | 7% | 83% | 53% |
| Approach | ✅ | ❌ Minimal | ✅ Active |
| Training Time | ~8 hrs | ~4 hrs | ~18 hrs |
## DOT Configuration
```python
DOTConfig(
    n_obs_steps=3,
    train_horizon=150,
    inference_horizon=100,
    lookback_obs_steps=30,
    lookback_aug=5,
    lora_rank=20,
    crop_scale=0.8,
    state_noise=0.01,
    optimizer_lr=3e-5,
    optimizer_min_lr=1e-5,
)
```
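For intuition on how a prediction horizon is consumed at run time, here is a deliberately simplified re-planning loop: predict a chunk of future actions, execute a prefix of it, then re-predict from the new state. This is a generic action-chunking sketch, not DOT's exact scheme (DOT blends overlapping per-step predictions rather than executing chunks verbatim); `predict_chunk` and `execute_per_replan` are illustrative names.

```python
def run_chunked(predict_chunk, inference_horizon=100, total_steps=300, execute_per_replan=10):
    """Generic action-chunking loop (illustration only, not DOT's blending scheme).

    `predict_chunk(t)` returns at least `inference_horizon` future actions
    predicted at step t; we execute the first `execute_per_replan` of them,
    then re-plan from the resulting state.
    """
    executed = []
    t = 0
    while t < total_steps:
        chunk = predict_chunk(t)[:inference_horizon]
        step = chunk[:execute_per_replan]
        executed.extend(step)
        t += len(step)
    return executed[:total_steps]
```

The trade-off `execute_per_replan` controls is latency vs. reactivity: longer prefixes amortize inference cost, shorter ones let the policy correct course more often.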
## Known Issues
- Data loading bottleneck: DOT loads 13 observation frames per sample vs ACT's 1, causing ~10x slower data loading (CPU-bound video decoding)
- Grasp precision: At 63K, policy approaches ball correctly but fails to secure it in gripper
- Slower learning curve: DOT requires more training steps than ACT for comparable behavior
## Usage
```python
from lerobot.policies.dot.modeling_dot import DOTPolicy

# Load policy
policy = DOTPolicy.from_pretrained("abdul004/so101_dot_policy_v5")

# Run inference
action = policy.select_action(observation)
```
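In practice `select_action` runs inside a fixed-rate control loop. The sketch below shows one such loop; the `get_observation` and `send_action` callables are placeholders for the robot interface, which depends on how the SO-101 is wired up in LeRobot.

```python
import time

def control_loop(select_action, get_observation, send_action, hz=30, max_steps=600):
    """Run a policy at a fixed control rate (sketch; error handling omitted).

    select_action: callable mapping an observation to an action
    get_observation / send_action: placeholders for the robot I/O layer
    """
    period = 1.0 / hz
    for _ in range(max_steps):
        start = time.perf_counter()
        action = select_action(get_observation())
        send_action(action)
        # sleep off the remainder of the control period to hold the rate
        time.sleep(max(0.0, period - (time.perf_counter() - start)))
```

Holding a steady rate matters here because the policy was trained on demonstrations recorded at a fixed frequency; running inference faster or slower than the training rate shifts the effective action timing.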
## Training Infrastructure
Challenges encountered:
- DataLoader bus errors with `num_workers > 0` (solved with `batch_size=12`)
- Resume functionality required patches for DOT's internal normalization
- Interruptible instances on Vast.ai require checkpoint sync to HF Hub
Checkpoint sync script: Automatically uploads checkpoints every 1K steps to prevent data loss on interruption.
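The sync script itself is not included in this card; a minimal version could gate uploads on the step counter and push the checkpoint folder with `huggingface_hub`'s `upload_folder`. The function names and folder layout below are assumptions for illustration, and the upload requires Hub authentication.

```python
def should_sync(step: int, every: int = 1_000) -> bool:
    """Upload only on every `every`-th training step (hypothetical helper)."""
    return step > 0 and step % every == 0

def sync_checkpoint(step: int, folder: str, repo_id: str = "abdul004/so101_dot_policy_v5"):
    """Push a checkpoint folder to the HF Hub so Vast.ai interruptions lose at most 1K steps."""
    if not should_sync(step):
        return
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import HfApi
    HfApi().upload_folder(
        folder_path=folder,
        repo_id=repo_id,
        path_in_repo=f"checkpoints/{step:06d}",
        commit_message=f"checkpoint @ step {step}",
    )
```

Keying the upload to the step counter (rather than wall-clock time) keeps checkpoint names deterministic, which simplifies resuming from the latest one after an interruption.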
## Next Steps
- [x] Continue training to 50K steps (reached 63K)
- [x] Evaluate learning curve (14K→63K shows progress in approach, not grasp)
- [ ] Continue to 100K to see if grasp precision improves
- [ ] Investigate hyperparameter tuning (action horizon, LoRA rank)
- [ ] Try Pi Zero / Pi0.5 as alternative VLA approach
## Acknowledgments
- DOT Policy by Ilia Larchenko
- LeRobot by Hugging Face