# Multi-Task DiT – clean_table
Multi-Task Diffusion Transformer policy trained on larsvandorp/clean_table_filtered (SO-101 follower arm, "Pick up the corner of the towel" task).
The repo root holds the step 6000 checkpoint (latest, lowest loss). Earlier checkpoints are archived under checkpoints/step_002000/ and checkpoints/step_004000/.
## Checkpoints
| Path | Step | Loss | Epochs | Samples seen |
|---|---|---|---|---|
| `/` (default) | 6000 | 0.018 | 32.7 | ~1.15 M |
| `checkpoints/step_004000/` | 4000 | 0.021 | 21.1 | ~768 k |
| `checkpoints/step_002000/` | 2000 | 0.025 | 10.5 | ~384 k |
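The epochs and samples-seen columns follow directly from step × batch size over the 36,439-frame dataset; the step-6000 figure comes out to ~31.6 epochs, in line with the 31.61 logged in the metrics table below:

```python
# Sanity-check the table: samples seen = step * batch_size, epochs = samples / frames
batch_size = 192
frames = 36_439  # clean_table_filtered frame count

for step in (2_000, 4_000, 6_000):
    samples = step * batch_size
    epochs = samples / frames
    print(step, samples, round(epochs, 1))
# 2000 384000 10.5
# 4000 768000 21.1
# 6000 1152000 31.6
```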
Loss is the per-step DDIM ε-prediction MSE on the action chunk.
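A minimal NumPy sketch of that objective, with illustrative shapes and a placeholder network; the linear β-schedule here is a standard DDPM/DDIM assumption for the demo, not necessarily the exact LeRobot default:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100                                # --policy.num_train_timesteps
betas = np.linspace(1e-4, 0.02, T)     # assumed linear schedule, for illustration
alphas_cumprod = np.cumprod(1.0 - betas)

actions = rng.normal(size=(32, 6))     # horizon=32 chunk of 6-DoF actions (x_0)
t = rng.integers(0, T)                 # random diffusion timestep
eps = rng.normal(size=actions.shape)   # target noise

# Forward process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps
x_t = np.sqrt(alphas_cumprod[t]) * actions + np.sqrt(1.0 - alphas_cumprod[t]) * eps

# The DiT would predict eps_hat from (x_t, t, conditioning); the loss is plain MSE.
eps_hat = eps + 0.1 * rng.normal(size=eps.shape)  # stand-in for the network
loss = np.mean((eps_hat - eps) ** 2)
```

DDIM changes only the sampler; this training objective is identical under DDPM and DDIM, which is why the scheduler swap below is free at train time.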
## Hardware
- GPU: 1× NVIDIA RTX Pro 6000 (96 GB)
- CPUs: 4
- System RAM: 24 GiB (4 × 6 GiB)
- Cluster: ETH Euler, partition `cuda13pr.4h`
- Wall time used: ~1 h 55 min before manual cancel (4 h walltime budget)
Compute nodes had no internet; CLIP weights were pre-cached via the login node into `$HF_HOME`, and `HF_HUB_OFFLINE=1` was set.
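One way to set that up (a sketch, assuming a recent `huggingface_hub` with the `huggingface-cli download` subcommand; the `$SCRATCH` cache path is a placeholder, not the path used for this run):

```shell
# On the login node (has internet): populate a shared HF cache
export HF_HOME=$SCRATCH/hf_cache
huggingface-cli download openai/clip-vit-base-patch16

# In the sbatch script (compute node, no internet): cache-only lookups
export HF_HOME=$SCRATCH/hf_cache
export HF_HUB_OFFLINE=1
```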
## Dataset
`larsvandorp/clean_table_filtered`
- 36,439 frames, 212 episodes (after filtering noop frames where the leader arm wasn't moving)
- 30 Hz, single wrist camera at 600×800 (h264, lossy from recording)
- 6-DoF SO-101 joint state and action
- Single task: "Pick up the corner of the towel"
## Exact training command
```shell
lerobot-train \
    --policy.type=multi_task_dit \
    --policy.device=cuda \
    --policy.use_amp=true \
    --policy.push_to_hub=true \
    --policy.repo_id=larsvandorp/clean_table_multi_task_dit \
    --policy.objective=diffusion \
    --policy.noise_scheduler_type=DDIM \
    --policy.num_train_timesteps=100 \
    --policy.num_inference_steps=10 \
    --policy.horizon=32 \
    --policy.n_action_steps=24 \
    --policy.num_layers=4 \
    --policy.vision_encoder_name=openai/clip-vit-base-patch16 \
    --policy.text_encoder_name=openai/clip-vit-base-patch16 \
    --dataset.repo_id=larsvandorp/clean_table_filtered \
    --dataset.root=$SLURM_SUBMIT_DIR/data_clean_table_filtered \
    --dataset.video_backend=pyav \
    --output_dir=$OUT \
    --batch_size=192 \
    --steps=30000 \
    --save_freq=2000 \
    --log_freq=200 \
    --eval_freq=0 \
    --num_workers=3 \
    --wandb.enable=false
```
Training was stopped manually at step 6000 because loss plateaued and time was needed for real-robot rollout.
## Why these flag values
Every flag that deviates from a config default (or that we set explicitly despite matching the default), and the reason:
### Deviations from defaults
| Flag | Value | Default | Why |
|---|---|---|---|
| `--policy.num_layers` | 4 | 6 | Blog "small dataset (<100 examples)" preset. We have 212 episodes, borderline small; a smaller DiT lowers overfitting risk. |
| `--policy.noise_scheduler_type` | DDIM | DDPM | DDIM and DDPM share the training math (same `add_noise`, same MSE loss). Setting DDIM here bakes the fast inference scheduler into the saved checkpoint config, so M1 deployment uses 10 steps with no override. |
| `--policy.num_inference_steps` | 10 | None (falls back to 100) | Mac inference at 10 deterministic DDIM steps is ~10× faster than 100 DDPM steps, with negligible quality loss for action-space diffusion at small T. |
| `--policy.use_amp` | true | false | A100/Pro 6000 Tensor Cores: 2–4× faster fp16 matmuls plus halved activation VRAM. Enabled bs=192 on the 96 GB Pro 6000 with comfortable headroom. |
| `--batch_size` | 192 | n/a | Blog recommends 192–320 for "best training dynamics". 192 is the lower end; it fit within the Pro 6000's 96 GB and the CPU RAM budget. |
| `--num_workers` | 3 | 4 | Matches `cpus_per_task − 1` (one CPU reserved for the main training process). |
| `--steps` | 30000 | n/a | Blog recommends ≥ 30k for a single task. We stopped at 6k after the loss flattened. |
| `--dataset.video_backend` | pyav | torchcodec | torchcodec was installed in the venv but crashes at runtime on Euler (missing `libavdevice.so.58` in the system FFmpeg). PyAV is the working fallback. |
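The 10-step DDIM schedule is simply an even stride over the 100 training timesteps, visited in reverse. A pure-Python sketch of the idea (not the actual scheduler implementation, which also handles offsets and clipping):

```python
num_train_timesteps = 100   # --policy.num_train_timesteps
num_inference_steps = 10    # --policy.num_inference_steps

# DDIM denoises on an evenly strided subset of the training timesteps,
# from high noise to low noise.
stride = num_train_timesteps // num_inference_steps
timesteps = list(range(0, num_train_timesteps, stride))[::-1]
print(timesteps)  # [90, 80, 70, 60, 50, 40, 30, 20, 10, 0]
```

Each of the 10 steps costs one DiT forward pass, hence the ~10× inference speedup over 100 DDPM steps.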
### Defaults explicitly set (no behaviour change, just documentation)
| Flag | Value | Reason for being explicit |
|---|---|---|
| `--policy.objective` | diffusion | Blog says start with diffusion; switch to flow_matching only if generation quality is poor. |
| `--policy.num_train_timesteps` | 100 | Standard small-T diffusion schedule, matches the blog. |
| `--policy.horizon` | 32 | ~1 s of motion at 30 Hz. Blog default. |
| `--policy.n_action_steps` | 24 | ~0.8 s of open-loop execution. Blog warns this knob is sensitive. |
| `--policy.vision_encoder_name` | openai/clip-vit-base-patch16 | The most identity-defining choice; being explicit makes the config readable. |
| `--policy.text_encoder_name` | openai/clip-vit-base-patch16 | Same family for the (frozen) text encoder. |
### Defaults not overridden but worth noting
- `image_resize_shape = None`, `image_crop_shape = (224, 224)`. Wrist frames are 600×800; the random 224×224 crop sees only ~28% of the raw frame's width during training. Adding `--policy.image_resize_shape='[240,320]'` would give better field-of-view coverage and ~1.5× faster dataloading; we did not enable it for this run.
- `optimizer_lr = 2e-5`, `vision_encoder_lr_multiplier = 0.1` (the CLIP backbone gets 0.1× LR), AdamW betas `(0.95, 0.999)`, cosine LR schedule with 0 warmup steps.
- RoPE on, no absolute positional encoding.
- `use_separate_rgb_encoder_per_camera = false` (single camera anyway).
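The field-of-view arithmetic above can be checked directly (the `[240,320]` resize is the untested alternative quoted above, not what this run used):

```python
# Wrist camera is 600 x 800 (h x w); the policy crops a 224 x 224 square.
raw_h, raw_w = 600, 800
crop = 224

cov_now = crop / raw_w            # crop taken directly from the raw frame
resized_h, resized_w = 240, 320   # hypothetical --policy.image_resize_shape
cov_resized = crop / resized_w    # crop taken after resizing

print(f"{cov_now:.0%} vs {cov_resized:.0%} of frame width")  # 28% vs 70% of frame width
```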
## Architecture (for reference)
| Component | Spec |
|---|---|
| Vision encoder | CLIP ViT-B/16 (~86 M params, trainable, lr × 0.1) |
| Text encoder | CLIP ViT-B/16 text tower (~63 M, frozen, learnable `Linear(512 → 512)` projection) |
| DiT noise predictor | 4 layers × 512 hidden × 8 heads, 4× MLP, AdaLN-Zero conditioning, RoPE (~17 M params) |
| Total trainable | ~105 M params |
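A back-of-envelope check of the ~17 M DiT figure, using standard-transformer formulas with an assumed per-layer `Linear(d, 6d)` AdaLN-Zero head; it ignores biases, the time/condition embedders, and any weight sharing, so it deliberately only targets the right ballpark:

```python
# DiT spec from the table: d_model=512, 4 layers, 8 heads, 4x MLP, AdaLN-Zero.
d, layers, mlp_mult = 512, 4, 4

attn = 4 * d * d                 # q, k, v, out projections
mlp = 2 * d * (mlp_mult * d)     # two linear layers of the 4x MLP
adaln = d * (6 * d)              # assumed per-layer modulation head (scale/shift/gate x2)
per_layer = attn + mlp + adaln
total = layers * per_layer
print(f"{total / 1e6:.1f} M")    # 18.9 M, same ballpark as the stated ~17 M
```

The gap to ~17 M plausibly comes from the details this estimate ignores (e.g. a shared modulation head or excluded embedding layers).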
## Inference (on Mac)
The default load just works: DDIM and 10 inference steps are baked into the saved `config.json`.

```python
from lerobot.policies.multi_task_dit.modeling_multi_task_dit import MultiTaskDiTPolicy

policy = MultiTaskDiTPolicy.from_pretrained("larsvandorp/clean_table_multi_task_dit")
policy.eval()
# Pass an observation dict:
# {"observation.images.wrist": ..., "observation.state": ..., "task": "Pick up the corner of the towel"}
```
## Training metrics summary
| step | loss | grad norm | lr | epochs |
|---|---|---|---|---|
| 200 | 0.203 | 0.898 | 2.0e-5 | 1.05 |
| 1000 | 0.032 | 0.599 | 2.0e-5 | 5.27 |
| 2000 | 0.026 | 0.458 | 2.0e-5 | 9.48 |
| 4000 | 0.021 | 0.378 | 1.9e-5 | 18.97 |
| 6000 | 0.018 | 0.317 | 1.8e-5 | 31.61 |
`updt_s` ≈ 0.60 s/step, `data_s` ≈ 0.50 s/step (~47% of wall time the GPU sat stalled waiting for the dataloader despite 3 worker processes; pyav decoding 192 × 600×800 frames per batch is the bottleneck).
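These timings roughly reproduce the run's wall clock, assuming the dataloader wait is not overlapped with the update step; with the rounded per-step numbers the stall fraction lands at ~45%, close to the reported ~47%:

```python
# Cross-check wall time and dataloader stall from the logged per-step timings.
updt_s, data_s = 0.60, 0.50
steps = 6000

step_s = updt_s + data_s             # assumed: data wait happens outside updt_s
wall_h = steps * step_s / 3600
stall = data_s / step_s
print(f"{wall_h:.2f} h, {stall:.0%} stalled")  # 1.83 h, 45% stalled
```

1.83 h is consistent with the ~1 h 55 min wall time reported above once logging and checkpointing overhead is added.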