Tags: Robotics · LeRobot · Safetensors · so-101 · diffusion-policy · multi-task-dit · towel-folding

Multi-Task DiT — clean_table

Multi-Task Diffusion Transformer policy trained on larsvandorp/clean_table_filtered (SO-101 follower arm, "Pick up the corner of the towel" task).

The repo root holds the step 6000 checkpoint (latest, lowest loss). Earlier checkpoints are archived under checkpoints/step_002000/ and checkpoints/step_004000/.

Checkpoints

| Path | Step | Loss | Epochs | Samples seen |
|---|---|---|---|---|
| `/` (default) | 6000 | 0.018 | 31.6 | ~1.15 M |
| `checkpoints/step_004000/` | 4000 | 0.021 | 21.1 | ~768 k |
| `checkpoints/step_002000/` | 2000 | 0.025 | 10.5 | ~384 k |
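The samples-seen and epoch columns follow directly from step count, batch size (192), and dataset size (36,439 frames); a quick arithmetic check:

```python
# Sanity-check the checkpoint table: samples seen = step * batch_size,
# epochs = samples seen / dataset frames.
FRAMES = 36_439
BATCH = 192

for step in (2000, 4000, 6000):
    samples = step * BATCH
    epochs = samples / FRAMES
    print(f"step {step}: {samples:,} samples, {epochs:.1f} epochs")
```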

Loss is the per-step DDIM Ξ΅-prediction MSE on the action chunk.
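As a rough illustration of that objective (a NumPy sketch with hypothetical names, not the LeRobot internals): noise a clean action chunk at a random timestep via the shared DDPM/DDIM forward process, then regress the injected noise with MSE.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 100                                 # num_train_timesteps
betas = np.linspace(1e-4, 0.02, T)      # illustrative linear beta schedule
alphas_cumprod = np.cumprod(1.0 - betas)

def add_noise(actions, eps, t):
    """Forward process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    abar = alphas_cumprod[t]
    return np.sqrt(abar) * actions + np.sqrt(1.0 - abar) * eps

actions = rng.standard_normal((32, 6))  # horizon x action_dim chunk
eps = rng.standard_normal(actions.shape)
t = rng.integers(0, T)                  # random training timestep
noisy = add_noise(actions, eps, t)

eps_pred = np.zeros_like(eps)           # stand-in for the DiT's eps prediction
loss = np.mean((eps_pred - eps) ** 2)   # the reported "loss" column
```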

Hardware

  • GPU: 1× NVIDIA RTX Pro 6000 (96 GB)
  • CPUs: 4
  • System RAM: 24 GiB (4 × 6 GiB)
  • Cluster: ETH Euler, partition cuda13pr.4h
  • Wall time used: ~1h 55min before manual cancel (4h walltime budget)

Compute nodes had no internet; CLIP weights were pre-cached via the login node into $HF_HOME and HF_HUB_OFFLINE=1 was set.
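A sketch of that offline setup (paths illustrative; `huggingface-cli download` is the stock Hub CLI):

```shell
# On the login node (has internet): pre-download the CLIP weights
# into a shared cache directory.
export HF_HOME=$SCRATCH/hf_cache
huggingface-cli download openai/clip-vit-base-patch16

# In the Slurm job script (compute nodes are offline): point at the
# same cache and forbid any network access to the Hub.
export HF_HOME=$SCRATCH/hf_cache
export HF_HUB_OFFLINE=1
```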

Dataset

larsvandorp/clean_table_filtered

  • 36,439 frames, 212 episodes (after filtering noop frames where the leader arm wasn't moving)
  • 30 Hz, single wrist camera at 600×800 (h264, lossy from recording)
  • 6-DoF SO-101 joint state and action
  • Single task: "Pick up the corner of the towel"
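The noop filtering mentioned above can be sketched as follows; the threshold and helper name are hypothetical, not the actual preprocessing script.

```python
import numpy as np

def filter_noop_frames(actions: np.ndarray, threshold: float = 1e-3) -> np.ndarray:
    """actions: (num_frames, 6) leader-arm joint commands.
    Returns indices of frames kept (those where the arm actually moved)."""
    deltas = np.abs(np.diff(actions, axis=0)).max(axis=1)  # max per-joint motion per step
    keep = np.concatenate(([True], deltas > threshold))    # always keep the first frame
    return np.flatnonzero(keep)

# Tiny example: three static frames followed by motion.
acts = np.array([[0.0] * 6, [0.0] * 6, [0.0] * 6, [0.1] * 6, [0.2] * 6])
kept = filter_noop_frames(acts)  # frames 0, 3, 4 survive
```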

Exact training command

```shell
lerobot-train \
  --policy.type=multi_task_dit \
  --policy.device=cuda \
  --policy.use_amp=true \
  --policy.push_to_hub=true \
  --policy.repo_id=larsvandorp/clean_table_multi_task_dit \
  --policy.objective=diffusion \
  --policy.noise_scheduler_type=DDIM \
  --policy.num_train_timesteps=100 \
  --policy.num_inference_steps=10 \
  --policy.horizon=32 \
  --policy.n_action_steps=24 \
  --policy.num_layers=4 \
  --policy.vision_encoder_name=openai/clip-vit-base-patch16 \
  --policy.text_encoder_name=openai/clip-vit-base-patch16 \
  --dataset.repo_id=larsvandorp/clean_table_filtered \
  --dataset.root=$SLURM_SUBMIT_DIR/data_clean_table_filtered \
  --dataset.video_backend=pyav \
  --output_dir=$OUT \
  --batch_size=192 \
  --steps=30000 \
  --save_freq=2000 \
  --log_freq=200 \
  --eval_freq=0 \
  --num_workers=3 \
  --wandb.enable=false
```

Training was stopped manually at step 6000 because loss plateaued and time was needed for real-robot rollout.

Why these flag values

Every flag that deviates from a config default, plus those we set explicitly despite matching the default, with the reason for each:

Deviations from defaults

| Flag | Value | Default | Why |
|---|---|---|---|
| `--policy.num_layers` | 4 | 6 | Blog's "small dataset (<100 examples)" preset. We have 212 episodes, borderline small; a smaller DiT lowers overfitting risk. |
| `--policy.noise_scheduler_type` | DDIM | DDPM | DDIM and DDPM share the training math (same `add_noise`, same MSE loss). Setting DDIM here bakes the fast inference scheduler into the saved checkpoint config, so M1 deployment uses 10 steps with no override. |
| `--policy.num_inference_steps` | 10 | None (falls back to 100) | Mac inference at 10 deterministic DDIM steps is ≈10× faster than 100 DDPM steps, with negligible quality loss for action-space diffusion at small T. |
| `--policy.use_amp` | true | false | A100/Pro 6000 Tensor Cores: 2–4× faster fp16 matmul and halved activation VRAM. Enabled bs=192 on the 96 GB Pro 6000 with comfortable headroom. |
| `--batch_size` | 192 | n/a | Blog recommends 192–320 for "best training dynamics". 192 is the lower end and fit within the Pro 6000's 96 GB and the CPU RAM budget. |
| `--num_workers` | 3 | 4 | Matches `cpus_per_task − 1` (one CPU reserved for the main training process). |
| `--steps` | 30000 | n/a | Blog recommends ≥ 30k for a single task. We stopped at 6k after the loss flattened. |
| `--dataset.video_backend` | pyav | torchcodec | torchcodec was installed in the venv but crashes at runtime on Euler (missing `libavdevice.so.58` in the system FFmpeg). PyAV is the working fallback. |
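To illustrate the DDIM speedup row: with `num_train_timesteps=100` and `num_inference_steps=10`, a DDIM sampler visits only a 10-step subsequence of the training schedule. A sketch assuming the common "leading" timestep spacing (not the actual scheduler code):

```python
# DDIM subsamples the training timesteps; with a stride of T / S it visits
# every 10th timestep, run in reverse (high noise -> clean).
num_train_timesteps = 100
num_inference_steps = 10

stride = num_train_timesteps // num_inference_steps
timesteps = list(range(0, num_train_timesteps, stride))[::-1]
print(timesteps)  # [90, 80, 70, 60, 50, 40, 30, 20, 10, 0]
```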

Defaults explicitly set (no behaviour change, just documentation)

| Flag | Value | Reason for being explicit |
|---|---|---|
| `--policy.objective` | diffusion | Blog says start with diffusion; switch to flow_matching only if generation quality is poor. |
| `--policy.num_train_timesteps` | 100 | Standard small-T diffusion schedule, matches the blog. |
| `--policy.horizon` | 32 | ~1 sec of motion at 30 Hz. Blog default. |
| `--policy.n_action_steps` | 24 | ~0.8 sec of open-loop execution. Blog warns this knob is sensitive. |
| `--policy.vision_encoder_name` | openai/clip-vit-base-patch16 | The most identity-defining choice; being explicit makes the config readable. |
| `--policy.text_encoder_name` | openai/clip-vit-base-patch16 | Same family for the (frozen) text encoder. |
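How `horizon=32` and `n_action_steps=24` interact at rollout time can be sketched with a toy action queue (illustrative, not the LeRobot queue implementation): the policy predicts a 32-step chunk, the robot executes the first 24 actions open-loop, then the diffusion model is queried again.

```python
from collections import deque

HORIZON = 32         # actions predicted per diffusion inference
N_ACTION_STEPS = 24  # actions actually executed before replanning
assert N_ACTION_STEPS <= HORIZON

def rollout(num_env_steps: int) -> int:
    """Count how many diffusion inferences a rollout of this length needs."""
    queue: deque = deque()
    inferences = 0
    for _ in range(num_env_steps):
        if not queue:
            inferences += 1
            # keep only the first n_action_steps of the predicted horizon
            queue.extend(range(N_ACTION_STEPS))
        queue.popleft()  # execute one action at 30 Hz
    return inferences

# 10 s at 30 Hz = 300 env steps -> ceil(300 / 24) = 13 inferences
print(rollout(300))
```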

Defaults not overridden but worth noting

  • image_resize_shape = None, image_crop_shape = (224, 224). Wrist frames are 600×800; the random 224×224 crop sees only ~28% of the raw frame width during training. Adding --policy.image_resize_shape='[240,320]' would give better field-of-view coverage and ~1.5× faster dataloading; we did not enable it for this run.
  • optimizer_lr = 2e-5, vision_encoder_lr_multiplier = 0.1 (the CLIP backbone gets 0.1× LR), AdamW betas (0.95, 0.999), cosine LR schedule with 0 warmup steps.
  • RoPE on, no absolute positional encoding.
  • use_separate_rgb_encoder_per_camera = false (single camera anyway).
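The crop-coverage numbers in the first bullet are simple arithmetic (a 224-wide crop of the 800-pixel-wide raw frame vs. of the proposed 320-wide resize):

```python
raw_w, resized_w, crop_w = 800, 320, 224

raw_coverage = crop_w / raw_w          # 0.28 -> crop sees ~28% of raw width
resized_coverage = crop_w / resized_w  # 0.70 -> ~70% after a [240,320] resize
print(raw_coverage, resized_coverage)
```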

Architecture (for reference)

| Component | Spec |
|---|---|
| Vision encoder | CLIP ViT-B/16 (~86 M params, trainable, LR × 0.1) |
| Text encoder | CLIP ViT-B/16 text tower (~63 M params, frozen, learnable Linear(512→512) projection) |
| DiT noise predictor | 4 layers × 512 hidden × 8 heads, 4× MLP, AdaLN-Zero conditioning, RoPE (~17 M params) |
| Total trainable | ~105 M params |

Inference (on Mac)

Default load just works — DDIM + 10 inference steps are baked into the saved config.json.

```python
from lerobot.policies.multi_task_dit.modeling_multi_task_dit import MultiTaskDiTPolicy

policy = MultiTaskDiTPolicy.from_pretrained("larsvandorp/clean_table_multi_task_dit")
policy.eval()
# Pass an observation dict:
# {"observation.images.wrist": ..., "observation.state": ..., "task": "Pick up the corner of the towel"}
```

Training metrics summary

| Step | Loss | Grad norm | LR | Epochs |
|---|---|---|---|---|
| 200 | 0.203 | 0.898 | 2.0e-5 | 1.05 |
| 1000 | 0.032 | 0.599 | 2.0e-5 | 5.27 |
| 2000 | 0.026 | 0.458 | 2.0e-5 | 9.48 |
| 4000 | 0.021 | 0.378 | 1.9e-5 | 18.97 |
| 6000 | 0.018 | 0.317 | 1.8e-5 | 31.61 |

updt_s ≈ 0.60 s/step, data_s ≈ 0.50 s/step (47% of wall time was the GPU stalled waiting for the dataloader, despite 3 worker processes — pyav decoding 192 × 600×800 frames per batch is the bottleneck).
