ACT - ALOHA Single-Arm (Left) - Mask REMOVAL via Reversed Data - 40k steps

Action Chunking Transformer (ACT) policy for mask removal, trained on a synthetic dataset derived by time-reversing the placement dataset. Each placement episode, played in reverse, becomes a removal episode (gripper open → closed, mask on face → in arm).
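The reversal can be illustrated with a minimal NumPy sketch. It is not the project's actual conversion script: the function name `reverse_episode` is hypothetical, and it assumes each action is a joint-position target for the next frame, which is one plausible reading of the "1-step action shift" listed in the training config. Camera frames would be reversed in the same order as the states.

```python
import numpy as np

def reverse_episode(states: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Time-reverse one episode of joint states with shape (T, D).

    Assumption: an action is the joint-position target for the next frame,
    so the reversed action at step k is the reversed state at step k + 1.
    """
    rev_states = states[::-1].copy()
    # 1-step shift: action k points at the state the arm should reach next.
    # The final step has no successor, so it repeats the last state (hold).
    rev_actions = np.concatenate([rev_states[1:], rev_states[-1:]], axis=0)
    return rev_states, rev_actions

# Toy episode: 4 frames, 9-dim state (matching the 9 / 9 state / action dim).
states = np.arange(36, dtype=np.float32).reshape(4, 9)
rev_states, rev_actions = reverse_episode(states)
```

In this toy case the first reversed observation is the last placement frame, and its action targets the frame that preceded it in the original recording.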

This is the 40k-step retrain (S006). It matches S003's step count so that it can be compared like-for-like (same architecture, same training budget) against the shipped placement baseline. The 13.4k baseline lives at JHeisler/aloha_solo_left_act_removal_reversed_13k.

Training Config

| Field | Value |
| --- | --- |
| Architecture | ACT (ResNet18 backbone + 4-layer Transformer encoder + VAE chunking head) |
| Dataset | JHeisler/aloha_solo_left_4_6_26_reversed: 50 episodes, 29,735 samples, 30 fps, time-reversed with 1-step action shift |
| State / action dim | 9 / 9 |
| Cameras | cam_high, cam_left_wrist (3×480×640 each) |
| Steps | 40,000 |
| Batch size | 48 |
| Learning rate | 6e-5 (linear warmup over 500 steps → cosine decay) |
| Total samples seen | 1.92M (≈64.6 epochs over the dataset) |
| AMP | enabled |
| torch.compile | enabled |
| Save freq | every 10,000 steps (10k / 20k / 30k / 40k checkpoints) |
| Final loss | 0.016–0.020 |
| Final grad norm | 0.23–0.32 |
| Wall clock | ~6h 10min on RTX A4500 (matches placement S003's ~6h 7min) |
| LeRobot pin | 96c7052777aca85d4e55dfba8f81586103ba8f61 |
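The training-budget figures above follow from simple arithmetic, and the learning-rate schedule can be sketched in a few lines. This is a rough illustration rather than the scheduler used in training: the helper `lr_at` is hypothetical, and it assumes the cosine phase decays to zero at the final step, which the config does not state.

```python
import math

STEPS = 40_000
BATCH_SIZE = 48
DATASET_SAMPLES = 29_735
PEAK_LR = 6e-5
WARMUP_STEPS = 500

# Samples drawn during training and the equivalent number of dataset passes.
samples_seen = STEPS * BATCH_SIZE
epochs = samples_seen / DATASET_SAMPLES

def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay (assumed to reach 0 at STEPS)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print(f"samples seen: {samples_seen:,}")  # samples seen: 1,920,000
print(f"epochs: {epochs:.1f}")            # epochs: 64.6
```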

Project Lineage

| Workstream | Task | Steps | Final loss | HF |
| --- | --- | --- | --- | --- |
| S001 | placement | 13,400 | 0.029 | act_left |
| S005 | removal (reversed) | 13,400 | 0.035 | act_removal_reversed_13k |
| S003 | placement (shipped) | 40,000 | 0.015 | act_left_40k |
| S006 | removal (reversed) | 40,000 | 0.018 | this repo |

S003 vs S006 is the direct comparison: same architecture and step count, placement dataset vs reversed-placement dataset. Final losses differ by only 0.003 (0.015 vs 0.018), suggesting that on the per-timestep imitation objective the reversed-data policy converges to roughly the same quality as the forward-data policy. Training loss alone is not a verdict: that requires an offline action-L1 evaluation on held-out data or a robot rollout.
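As a sketch of what an offline action-L1 evaluation could compute, the function below averages the absolute error between predicted and recorded action chunks while ignoring padded timesteps. The name `action_l1`, the array layout, and the padding mask are assumptions for illustration, not the project's evaluation code.

```python
import numpy as np

def action_l1(pred: np.ndarray, target: np.ndarray, is_pad: np.ndarray) -> float:
    """Mean absolute action error over non-padded timesteps.

    pred, target: (N, chunk, action_dim).
    is_pad: (N, chunk) bool, True where the chunk runs past the episode end.
    """
    valid = ~is_pad[..., None]                  # (N, chunk, 1), broadcasts over dims
    abs_err = np.abs(pred - target) * valid     # zero out padded steps
    n_valid = int(valid.sum()) * pred.shape[-1] # number of scored elements
    return float(abs_err.sum() / max(n_valid, 1))

# Toy check: every valid element is off by 1.0; the padded step is ignored.
pred = np.zeros((2, 3, 9), dtype=np.float32)
target = np.ones((2, 3, 9), dtype=np.float32)
is_pad = np.zeros((2, 3), dtype=bool)
is_pad[1, 2] = True
target[1, 2] = 100.0  # would dominate the mean if padding were not masked
score = action_l1(pred, target, is_pad)
```

Running the same metric on held-out episodes for both S003 and S006 would give a like-for-like number that, unlike the final training loss, does not depend on data the policy has already seen.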

Caveats

  • Synthetic data. The policy was trained on time-reversed placement episodes, not native removal episodes. A policy trained on real removal data will likely outperform this one.
  • Visual transitions are physically backwards (mask materializes on face). This should not affect ACT's per-timestep predictions, because the policy sees a single observation per step (n_obs_steps=1) and has no temporal context input.
  • Use this model as a lower-bound baseline until native removal data is available.

Usage

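# If this import fails, try lerobot.policies.act.modeling_act (the package layout differs across LeRobot versions).
from lerobot.common.policies.act.modeling_act import ACTPolicy
policy = ACTPolicy.from_pretrained("JHeisler/aloha_solo_left_act_removal_reversed_40k")
policy.eval()  # inference mode: disables dropout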
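Before the policy can be queried, camera frames and joint state have to be packed into a batch. The sketch below shows one way to do that with NumPy. The helper names are hypothetical, the `observation.*` keys are assumed from LeRobot's usual naming and the camera names in the training config, and the [0, 1] float scaling is an assumption that should be checked against the pinned LeRobot commit.

```python
import numpy as np

def to_chw_float(frame: np.ndarray) -> np.ndarray:
    """Convert an HxWx3 uint8 frame to a 1x3xHxW float32 array in [0, 1]."""
    chw = np.transpose(frame, (2, 0, 1)).astype(np.float32) / 255.0
    return chw[None]  # add a leading batch dimension

def build_observation(state, cam_high: np.ndarray, cam_left_wrist: np.ndarray) -> dict:
    """Pack one timestep into a batch dict (key names are assumptions)."""
    return {
        "observation.state": np.asarray(state, dtype=np.float32)[None],
        "observation.images.cam_high": to_chw_float(cam_high),
        "observation.images.cam_left_wrist": to_chw_float(cam_left_wrist),
    }

# Dummy inputs with the shapes from the training config (9-dim state, 480x640 RGB).
obs = build_observation(
    np.zeros(9),
    np.zeros((480, 640, 3), dtype=np.uint8),
    np.full((480, 640, 3), 255, dtype=np.uint8),
)
```

After converting each array to a torch tensor on the policy's device, the dict would typically be passed to `policy.select_action(...)`, with `policy.reset()` called at the start of each episode to clear the queued action chunk.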

Citation / Course

EN.525.681 school project, JHU Whiting School of Engineering. Team: Jake Heisler, Laura Kroening, Purushottam Shukla.

Code reference: HuggingFace LeRobot at commit 96c7052.
