Tags: Robotics · LeRobot · Safetensors · so-101 · diffusion-policy · multi-task-dit · towel-folding

Multi-Task DiT — clean_table

Multi-Task Diffusion Transformer policy trained on larsvandorp/clean_table_filtered (SO-101 follower arm, "Pick up the corner of the towel" task).

The repo root holds the step 6000 checkpoint (latest, lowest loss). Earlier checkpoints are archived under checkpoints/step_002000/ and checkpoints/step_004000/.

Checkpoints

| Path | Step | Loss | Epochs | Samples seen |
|---|---|---|---|---|
| `/` (default) | 6000 | 0.018 | 31.6 | ~1.15 M |
| `checkpoints/step_004000/` | 4000 | 0.021 | 21.1 | ~768 k |
| `checkpoints/step_002000/` | 2000 | 0.025 | 10.5 | ~384 k |
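The samples-seen and epoch columns follow directly from step count, batch size (192), and dataset size (36,439 frames); a quick arithmetic check:

```python
# Sanity-check the checkpoint table: samples seen = step * batch_size,
# epochs = samples seen / dataset frames.
FRAMES = 36_439
BATCH = 192

for step in (2000, 4000, 6000):
    samples = step * BATCH
    epochs = samples / FRAMES
    print(f"step {step}: {samples:,} samples, {epochs:.1f} epochs")
```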

Loss is the per-step DDIM Ξ΅-prediction MSE on the action chunk.
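As a rough illustration of that objective (a NumPy sketch with hypothetical names, not the LeRobot internals): noise a clean action chunk at a random timestep via the shared DDPM/DDIM forward process, then regress the injected noise with MSE.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 100                                 # num_train_timesteps
betas = np.linspace(1e-4, 0.02, T)      # illustrative linear beta schedule
alphas_cumprod = np.cumprod(1.0 - betas)

def add_noise(actions, eps, t):
    """Forward process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    abar = alphas_cumprod[t]
    return np.sqrt(abar) * actions + np.sqrt(1.0 - abar) * eps

actions = rng.standard_normal((32, 6))  # horizon x action_dim chunk
eps = rng.standard_normal(actions.shape)
t = rng.integers(0, T)                  # random training timestep
noisy = add_noise(actions, eps, t)

eps_pred = np.zeros_like(eps)           # stand-in for the DiT's eps prediction
loss = np.mean((eps_pred - eps) ** 2)   # the reported "loss" column
```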

Hardware

  • GPU: 1× NVIDIA RTX Pro 6000 (96 GB)
  • CPUs: 4
  • System RAM: 24 GiB (4 × 6 GiB)
  • Cluster: ETH Euler, partition cuda13pr.4h
  • Wall time used: ~1h 55min before manual cancel (4h walltime budget)

Compute nodes had no internet; CLIP weights were pre-cached via the login node into $HF_HOME and HF_HUB_OFFLINE=1 was set.
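A sketch of that offline setup (paths illustrative; `huggingface-cli download` is the stock Hub CLI):

```shell
# On the login node (has internet): pre-download the CLIP weights
# into a shared cache directory.
export HF_HOME=$SCRATCH/hf_cache
huggingface-cli download openai/clip-vit-base-patch16

# In the Slurm job script (compute nodes are offline): point at the
# same cache and forbid any network access to the Hub.
export HF_HOME=$SCRATCH/hf_cache
export HF_HUB_OFFLINE=1
```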

Dataset

larsvandorp/clean_table_filtered

  • 36,439 frames, 212 episodes (after filtering noop frames where the leader arm wasn't moving)
  • 30 Hz, single wrist camera at 600×800 (h264, lossy from recording)
  • 6-DoF SO-101 joint state and action
  • Single task: "Pick up the corner of the towel"
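The noop filtering mentioned above can be sketched as follows; the threshold and helper name are hypothetical, not the actual preprocessing script.

```python
import numpy as np

def filter_noop_frames(actions: np.ndarray, threshold: float = 1e-3) -> np.ndarray:
    """actions: (num_frames, 6) leader-arm joint commands.
    Returns indices of frames kept (those where the arm actually moved)."""
    deltas = np.abs(np.diff(actions, axis=0)).max(axis=1)  # max per-joint motion per step
    keep = np.concatenate(([True], deltas > threshold))    # always keep the first frame
    return np.flatnonzero(keep)

# Tiny example: three static frames followed by motion.
acts = np.array([[0.0] * 6, [0.0] * 6, [0.0] * 6, [0.1] * 6, [0.2] * 6])
kept = filter_noop_frames(acts)  # frames 0, 3, 4 survive
```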

Exact training command

```shell
lerobot-train \
  --policy.type=multi_task_dit \
  --policy.device=cuda \
  --policy.use_amp=true \
  --policy.push_to_hub=true \
  --policy.repo_id=larsvandorp/clean_table_multi_task_dit \
  --policy.objective=diffusion \
  --policy.noise_scheduler_type=DDIM \
  --policy.num_train_timesteps=100 \
  --policy.num_inference_steps=10 \
  --policy.horizon=32 \
  --policy.n_action_steps=24 \
  --policy.num_layers=4 \
  --policy.vision_encoder_name=openai/clip-vit-base-patch16 \
  --policy.text_encoder_name=openai/clip-vit-base-patch16 \
  --dataset.repo_id=larsvandorp/clean_table_filtered \
  --dataset.root=$SLURM_SUBMIT_DIR/data_clean_table_filtered \
  --dataset.video_backend=pyav \
  --output_dir=$OUT \
  --batch_size=192 \
  --steps=30000 \
  --save_freq=2000 \
  --log_freq=200 \
  --eval_freq=0 \
  --num_workers=3 \
  --wandb.enable=false
```

Training was stopped manually at step 6000 because loss plateaued and time was needed for real-robot rollout.

Why these flag values

Every flag that deviates from a config default, plus those we set explicitly despite matching the default, with the reason for each:

Deviations from defaults

| Flag | Value | Default | Why |
|---|---|---|---|
| `--policy.num_layers` | 4 | 6 | Blog's "small dataset (<100 examples)" preset. We have 212 episodes, borderline small; a smaller DiT lowers overfitting risk. |
| `--policy.noise_scheduler_type` | DDIM | DDPM | DDIM and DDPM share the training math (same `add_noise`, same MSE loss). Setting DDIM here bakes the fast inference scheduler into the saved checkpoint config, so M1 deployment uses 10 steps with no override. |
| `--policy.num_inference_steps` | 10 | None (falls back to 100) | Mac inference at 10 deterministic DDIM steps is ≈10× faster than 100 DDPM steps, with negligible quality loss for action-space diffusion at small T. |
| `--policy.use_amp` | true | false | A100/Pro 6000 Tensor Cores: 2–4× faster fp16 matmul and halved activation VRAM. Enabled bs=192 on the 96 GB Pro 6000 with comfortable headroom. |
| `--batch_size` | 192 | n/a | Blog recommends 192–320 for "best training dynamics". 192 is the lower end and fit within the Pro 6000's 96 GB and the CPU RAM budget. |
| `--num_workers` | 3 | 4 | Matches `cpus_per_task − 1` (one CPU reserved for the main training process). |
| `--steps` | 30000 | n/a | Blog recommends ≥ 30k for a single task. We stopped at 6k after the loss flattened. |
| `--dataset.video_backend` | pyav | torchcodec | torchcodec was installed in the venv but crashes at runtime on Euler (missing `libavdevice.so.58` in the system FFmpeg). PyAV is the working fallback. |
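To illustrate the DDIM speedup row: with `num_train_timesteps=100` and `num_inference_steps=10`, a DDIM sampler visits only a 10-step subsequence of the training schedule. A sketch assuming the common "leading" timestep spacing (not the actual scheduler code):

```python
# DDIM subsamples the training timesteps; with a stride of T / S it visits
# every 10th timestep, run in reverse (high noise -> clean).
num_train_timesteps = 100
num_inference_steps = 10

stride = num_train_timesteps // num_inference_steps
timesteps = list(range(0, num_train_timesteps, stride))[::-1]
print(timesteps)  # [90, 80, 70, 60, 50, 40, 30, 20, 10, 0]
```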

Defaults explicitly set (no behaviour change, just documentation)

| Flag | Value | Reason for being explicit |
|---|---|---|
| `--policy.objective` | diffusion | Blog says start with diffusion; switch to flow_matching only if generation quality is poor. |
| `--policy.num_train_timesteps` | 100 | Standard small-T diffusion schedule, matches the blog. |
| `--policy.horizon` | 32 | ~1 sec of motion at 30 Hz. Blog default. |
| `--policy.n_action_steps` | 24 | ~0.8 sec of open-loop execution. Blog warns this knob is sensitive. |
| `--policy.vision_encoder_name` | openai/clip-vit-base-patch16 | The most identity-defining choice; being explicit makes the config readable. |
| `--policy.text_encoder_name` | openai/clip-vit-base-patch16 | Same family for the (frozen) text encoder. |
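How `horizon=32` and `n_action_steps=24` interact at rollout time can be sketched with a toy action queue (illustrative, not the LeRobot queue implementation): the policy predicts a 32-step chunk, the robot executes the first 24 actions open-loop, then the diffusion model is queried again.

```python
from collections import deque

HORIZON = 32         # actions predicted per diffusion inference
N_ACTION_STEPS = 24  # actions actually executed before replanning
assert N_ACTION_STEPS <= HORIZON

def rollout(num_env_steps: int) -> int:
    """Count how many diffusion inferences a rollout of this length needs."""
    queue: deque = deque()
    inferences = 0
    for _ in range(num_env_steps):
        if not queue:
            inferences += 1
            # keep only the first n_action_steps of the predicted horizon
            queue.extend(range(N_ACTION_STEPS))
        queue.popleft()  # execute one action at 30 Hz
    return inferences

# 10 s at 30 Hz = 300 env steps -> ceil(300 / 24) = 13 inferences
print(rollout(300))
```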

Defaults not overridden but worth noting

  • image_resize_shape = None, image_crop_shape = (224, 224). Wrist frames are 600×800; the random 224×224 crop sees only ~28% of the raw frame width during training. Adding --policy.image_resize_shape='[240,320]' would give better field-of-view coverage and ~1.5× faster dataloading; we did not enable it for this run.
  • optimizer_lr = 2e-5, vision_encoder_lr_multiplier = 0.1 (the CLIP backbone gets 0.1× LR), AdamW betas (0.95, 0.999), cosine LR schedule with 0 warmup steps.
  • RoPE on, no absolute positional encoding.
  • use_separate_rgb_encoder_per_camera = false (single camera anyway).
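The crop-coverage numbers in the first bullet are simple arithmetic (a 224-wide crop of the 800-pixel-wide raw frame vs. of the proposed 320-wide resize):

```python
raw_w, resized_w, crop_w = 800, 320, 224

raw_coverage = crop_w / raw_w          # 0.28 -> crop sees ~28% of raw width
resized_coverage = crop_w / resized_w  # 0.70 -> ~70% after a [240,320] resize
print(raw_coverage, resized_coverage)
```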

Architecture (for reference)

| Component | Spec |
|---|---|
| Vision encoder | CLIP ViT-B/16 (~86 M params, trainable, LR × 0.1) |
| Text encoder | CLIP ViT-B/16 text tower (~63 M params, frozen, learnable Linear(512→512) projection) |
| DiT noise predictor | 4 layers × 512 hidden × 8 heads, 4× MLP, AdaLN-Zero conditioning, RoPE (~17 M params) |
| Total trainable | ~105 M params |

Inference (on Mac)

Default load just works — DDIM + 10 inference steps are baked into the saved config.json.

```python
from lerobot.policies.multi_task_dit.modeling_multi_task_dit import MultiTaskDiTPolicy

policy = MultiTaskDiTPolicy.from_pretrained("larsvandorp/clean_table_multi_task_dit")
policy.eval()
# Pass an observation dict:
# {"observation.images.wrist": ..., "observation.state": ..., "task": "Pick up the corner of the towel"}
```

Training metrics summary

| Step | Loss | Grad norm | LR | Epochs |
|---|---|---|---|---|
| 200 | 0.203 | 0.898 | 2.0e-5 | 1.05 |
| 1000 | 0.032 | 0.599 | 2.0e-5 | 5.27 |
| 2000 | 0.026 | 0.458 | 2.0e-5 | 9.48 |
| 4000 | 0.021 | 0.378 | 1.9e-5 | 18.97 |
| 6000 | 0.018 | 0.317 | 1.8e-5 | 31.61 |

updt_s ≈ 0.60 s/step, data_s ≈ 0.50 s/step (47% of wall time was the GPU stalled waiting for the dataloader, despite 3 worker processes — pyav decoding 192 × 600×800 frames per batch is the bottleneck).
