---
license: apache-2.0
library_name: lerobot
tags:
- robotics
- so-101
- diffusion-policy
- multi-task-dit
- towel-folding
base_model: openai/clip-vit-base-patch16
datasets:
- larsvandorp/clean_table_filtered
---

# Multi-Task DiT – clean_table

Multi-Task Diffusion Transformer policy trained on [`larsvandorp/clean_table_filtered`](https://huggingface.co/datasets/larsvandorp/clean_table_filtered) (SO-101 follower arm, "Pick up the corner of the towel" task).

The repo root holds the **step 6000** checkpoint (latest, lowest loss). Earlier checkpoints are archived under `checkpoints/step_002000/` and `checkpoints/step_004000/`.

## Checkpoints

| Path | Step | Loss | Epochs | Samples seen |
|---|---|---|---|---|
| `/` (default) | 6000 | 0.018 | 32.7 | ~1.15M |
| `checkpoints/step_004000/` | 4000 | 0.021 | 21.1 | ~768k |
| `checkpoints/step_002000/` | 2000 | 0.025 | 10.5 | ~384k |

Loss is the per-step DDIM ε-prediction MSE on the action chunk.
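
In sketch form (hedged: `dit` below is a placeholder for the conditioned DiT noise predictor, and the real lerobot training step differs in detail, but the objective is this ε-prediction MSE):

```python
import torch
import torch.nn.functional as F
from diffusers import DDIMScheduler

dit = torch.nn.Linear(6, 6)  # placeholder for the conditioned DiT noise predictor

scheduler = DDIMScheduler(num_train_timesteps=100)
actions = torch.randn(192, 32, 6)               # (batch, horizon=32, action_dim=6)
noise = torch.randn_like(actions)
t = torch.randint(0, scheduler.config.num_train_timesteps, (192,))

noisy = scheduler.add_noise(actions, noise, t)  # forward diffusion q(x_t | x_0)
loss = F.mse_loss(dit(noisy), noise)            # ε-prediction MSE on the action chunk
```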

## Hardware

- **GPU**: 1× NVIDIA RTX Pro 6000 (96 GB)
- **CPUs**: 4
- **System RAM**: 24 GiB (4 × 6 GiB)
- **Cluster**: ETH Euler, partition `cuda13pr.4h`
- **Wall time used**: ~1h 55min before manual cancel (4h walltime budget)

Compute nodes had no internet; CLIP weights were pre-cached via the login node into `$HF_HOME` and `HF_HUB_OFFLINE=1` was set.
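
One way to do that pre-caching, sketched from the login node (it assumes `$HF_HOME` points at storage the compute nodes can read):

```python
# Run once on the login node, which has internet access.
from huggingface_hub import snapshot_download

snapshot_download("openai/clip-vit-base-patch16")  # populates the $HF_HOME cache
```

With the cache populated, the training job runs with `HF_HUB_OFFLINE=1` so every hub lookup resolves locally.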

## Dataset

[`larsvandorp/clean_table_filtered`](https://huggingface.co/datasets/larsvandorp/clean_table_filtered)

- 36,439 frames, 212 episodes (after filtering noop frames where the leader arm wasn't moving)
- 30 Hz, single wrist camera at 600×800 (h264, lossy from recording)
- 6-DoF SO-101 joint state and action
- Single task: `"Pick up the corner of the towel"`
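
A hedged loading sketch (the `LeRobotDataset` import path matches recent lerobot releases and may differ in older ones):

```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset("larsvandorp/clean_table_filtered", video_backend="pyav")
print(ds.num_frames, ds.num_episodes)  # expect 36439, 212
frame = ds[0]  # dict with observation.images.*, observation.state, action, task
```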

## Exact training command

```bash
lerobot-train \
  --policy.type=multi_task_dit \
  --policy.device=cuda \
  --policy.use_amp=true \
  --policy.push_to_hub=true \
  --policy.repo_id=larsvandorp/clean_table_multi_task_dit \
  --policy.objective=diffusion \
  --policy.noise_scheduler_type=DDIM \
  --policy.num_train_timesteps=100 \
  --policy.num_inference_steps=10 \
  --policy.horizon=32 \
  --policy.n_action_steps=24 \
  --policy.num_layers=4 \
  --policy.vision_encoder_name=openai/clip-vit-base-patch16 \
  --policy.text_encoder_name=openai/clip-vit-base-patch16 \
  --dataset.repo_id=larsvandorp/clean_table_filtered \
  --dataset.root=$SLURM_SUBMIT_DIR/data_clean_table_filtered \
  --dataset.video_backend=pyav \
  --output_dir=$OUT \
  --batch_size=192 \
  --steps=30000 \
  --save_freq=2000 \
  --log_freq=200 \
  --eval_freq=0 \
  --num_workers=3 \
  --wandb.enable=false
```

Training was stopped manually at step 6000 because the loss had plateaued and time was needed for real-robot rollouts.

## Why these flag values

Every flag that deviates from a config default (or that we set explicitly despite matching the default), and the reason:

### Deviations from defaults
| Flag | Value | Default | Why |
|---|---|---|---|
| `--policy.num_layers` | 4 | 6 | Blog "small dataset (<100 examples)" preset. We have 212 episodes (borderline small); a smaller DiT lowers overfitting risk. |
| `--policy.noise_scheduler_type` | DDIM | DDPM | DDIM and DDPM share the training math (same `add_noise`, same MSE loss). Setting DDIM here bakes the fast inference scheduler into the saved checkpoint config, so M1 deployment uses 10 steps with no override. |
| `--policy.num_inference_steps` | 10 | None (defaults to 100) | Mac inference at 10 deterministic DDIM steps is ≈10× faster than 100 DDPM steps, with negligible quality loss for action-space diffusion at small T (see the sketch after this table). |
| `--policy.use_amp` | true | false | A100/Pro 6000 Tensor Cores: 2–4× faster fp16 matmul plus halved activation VRAM. Enabled bs=192 on the 96 GB Pro 6000 with comfortable headroom. |
| `--batch_size` | 192 | n/a | Blog recommends 192–320 for "best training dynamics". 192 is the lower end and fit within the Pro 6000's 96 GB and the CPU RAM budget. |
| `--num_workers` | 3 | 4 | Matches `cpus_per_task - 1` (one CPU for the main training process). |
| `--steps` | 30000 | n/a | Blog recommends ≥30k for a single task. We stopped at 6k after the loss flattened. |
| `--dataset.video_backend` | pyav | torchcodec | torchcodec was installed in the venv but crashes at runtime on Euler (missing `libavdevice.so.58` in the system FFmpeg). PyAV is the working fallback. |
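
For reference, a sketch of the 10-step DDIM sampling loop run at inference (`dit` is again a placeholder for the conditioned noise predictor; the real policy wraps this loop internally):

```python
import torch
from diffusers import DDIMScheduler

dit = torch.nn.Linear(6, 6)     # placeholder for the conditioned DiT
scheduler = DDIMScheduler(num_train_timesteps=100)
scheduler.set_timesteps(10)     # 10 evenly spaced, deterministic steps

sample = torch.randn(1, 32, 6)  # action chunk starts as pure noise
for t in scheduler.timesteps:   # descending timesteps
    eps = dit(sample)           # predict the noise at this step
    sample = scheduler.step(eps, t, sample).prev_sample
```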

### Defaults explicitly set (no behaviour change, just documentation)
| Flag | Value | Reason for being explicit |
|---|---|---|
| `--policy.objective` | diffusion | Blog says start with diffusion; switch to flow_matching only if generation quality is poor. |
| `--policy.num_train_timesteps` | 100 | Standard small-T diffusion schedule, matches blog. |
| `--policy.horizon` | 32 | ~1 sec of motion at 30 Hz. Blog default. |
| `--policy.n_action_steps` | 24 | ~0.8 sec of open-loop execution. Blog warns this knob is sensitive. |
| `--policy.vision_encoder_name` | openai/clip-vit-base-patch16 | The most identity-defining choice; being explicit makes the config readable. |
| `--policy.text_encoder_name` | same | Same family for the (frozen) text encoder. |

### Defaults *not* overridden but worth noting
- `image_resize_shape = None`, `image_crop_shape = (224, 224)`. Wrist frames are 600×800, so the random 224×224 training crop covers only ~28% of the raw frame's width. Adding `--policy.image_resize_shape='[240,320]'` would give better field-of-view coverage and ~1.5× faster dataloading; we did not enable it for this run.
- `optimizer_lr = 2e-5`, `vision_encoder_lr_multiplier = 0.1` (the CLIP backbone gets 0.1× the base LR; see the sketch after this list), AdamW betas `(0.95, 0.999)`, cosine LR schedule with 0 warmup steps.
- RoPE on, no absolute positional encoding.
- `use_separate_rgb_encoder_per_camera = false` (single camera anyway).
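
A hedged sketch of that optimizer setup (module names here are placeholders, not the policy's actual attribute names):

```python
import torch

vision_encoder = torch.nn.Linear(8, 8)  # placeholder for the CLIP ViT backbone
dit = torch.nn.Linear(8, 8)             # placeholder for the DiT noise predictor

base_lr = 2e-5
optimizer = torch.optim.AdamW(
    [
        {"params": dit.parameters(), "lr": base_lr},
        {"params": vision_encoder.parameters(), "lr": base_lr * 0.1},  # 0.1× multiplier
    ],
    betas=(0.95, 0.999),
)
```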

## Architecture (for reference)

| Component | Spec |
|---|---|
| Vision encoder | CLIP ViT-B/16 (~86M params, trainable, lr × 0.1) |
| Text encoder | CLIP ViT-B/16 text tower (~63M, **frozen**, learnable `Linear(512→512)` projection) |
| DiT noise predictor | 4 layers × 512 hidden × 8 heads, 4× MLP, AdaLN-Zero conditioning, RoPE (~17M params) |
| Total trainable | ~105M params |
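
A quick way to sanity-check those counts on the loaded policy (a sketch; exact numbers may vary slightly by lerobot version):

```python
from lerobot.policies.multi_task_dit.modeling_multi_task_dit import MultiTaskDiTPolicy

policy = MultiTaskDiTPolicy.from_pretrained("larsvandorp/clean_table_multi_task_dit")
trainable = sum(p.numel() for p in policy.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in policy.parameters() if not p.requires_grad)
print(f"trainable: {trainable / 1e6:.1f}M, frozen: {frozen / 1e6:.1f}M")  # expect ~105M trainable
```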

## Inference (on Mac)

Default load just works: DDIM + 10 inference steps are baked into the saved `config.json`.

```python
from lerobot.policies.multi_task_dit.modeling_multi_task_dit import MultiTaskDiTPolicy

policy = MultiTaskDiTPolicy.from_pretrained("larsvandorp/clean_table_multi_task_dit")
policy.eval()
# Pass observation dict: {"observation.images.wrist": ..., "observation.state": ..., "task": "Pick up the corner of the towel"}
```
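
A hedged example of one control step (key names follow the comment above; `select_action` is the standard lerobot policy interface, though the exact preprocessing expected may differ):

```python
import torch

obs = {
    "observation.images.wrist": torch.rand(1, 3, 600, 800),  # RGB float in [0, 1]
    "observation.state": torch.zeros(1, 6),                  # 6-DoF joint state
    "task": ["Pick up the corner of the towel"],
}
with torch.no_grad():
    action = policy.select_action(obs)  # one action; the 24-step chunk is queued internally
```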

## Training metrics summary

| step | loss | grad norm | lr | epochs |
|---|---|---|---|---|
| 200 | 0.203 | 0.898 | 2.0e-5 | 1.05 |
| 1000 | 0.032 | 0.599 | 2.0e-5 | 5.27 |
| 2000 | 0.026 | 0.458 | 2.0e-5 | 9.48 |
| 4000 | 0.021 | 0.378 | 1.9e-5 | 18.97 |
| 6000 | 0.018 | 0.317 | 1.8e-5 | 31.61 |

`updt_s ≈ 0.60 s/step`, `data_s ≈ 0.50 s/step`: 47% of wall time was the GPU stalled waiting for the dataloader, despite 3 worker processes; pyav decoding 192 × 600×800 frames per batch is the bottleneck. At ≈1.1 s/step, the 6000 steps account for roughly 1h 50min, consistent with the wall time reported above.

## References

- [Multi-Task DiT lerobot docs](https://huggingface.co/docs/lerobot/en/multi_task_dit)
- [TRI LBM paper](https://arxiv.org/abs/2507.05331) (diffusion objective)
- [Boston Dynamics LBM blog](https://bostondynamics.com/blog/large-behavior-models-atlas-find-new-footing/)
- [Bryson Jones – Dissecting and Open-Sourcing Multitask Diffusion Transformer Policy](https://brysonkjones.substack.com/p/dissecting-and-open-sourcing-multitask-diffusion-transformer-policy)
|
|