---
license: apache-2.0
library_name: lerobot
tags:
- robotics
- so-101
- diffusion-policy
- multi-task-dit
- towel-folding
base_model: openai/clip-vit-base-patch16
datasets:
- larsvandorp/clean_table_filtered
---
# Multi-Task DiT – clean_table
Multi-Task Diffusion Transformer policy trained on [`larsvandorp/clean_table_filtered`](https://huggingface.co/datasets/larsvandorp/clean_table_filtered) (SO-101 follower arm, "Pick up the corner of the towel" task).
The repo root holds the **step 6000** checkpoint (latest, lowest loss). Earlier checkpoints are archived under `checkpoints/step_002000/` and `checkpoints/step_004000/`.
## Checkpoints
| Path | Step | Loss | Epochs | Samples seen |
|---|---|---|---|---|
| `/` (default) | 6000 | 0.018 | 31.6 | ~1.15 M |
| `checkpoints/step_004000/` | 4000 | 0.021 | 21.1 | ~768 k |
| `checkpoints/step_002000/` | 2000 | 0.025 | 10.5 | ~384 k |
Loss is the per-step DDIM ε-prediction MSE on the action chunk.
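For intuition, the ε-prediction MSE can be sketched in a few lines of numpy. This is an illustrative toy, not lerobot's implementation: the linear beta schedule, chunk shapes, and the stubbed noise predictor are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed linear beta schedule over the 100 train timesteps.
T = 100
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

# Toy batch: 32-step action chunks with 6 joint dims (horizon=32, 6-DoF).
x0 = rng.standard_normal((192, 32, 6))    # clean action chunks
t = rng.integers(0, T, size=192)          # one diffusion timestep per sample
eps = rng.standard_normal(x0.shape)       # target noise

# Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
ab = alpha_bar[t][:, None, None]
x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

# The DiT would predict eps from (x_t, t, image/state/task conditioning);
# stub it with the target plus a small error so the loss is nonzero.
eps_pred = eps + 0.1 * rng.standard_normal(eps.shape)

loss = np.mean((eps_pred - eps) ** 2)     # per-step MSE on the chunk
print(float(loss))
```

Because DDIM and DDPM share this training objective, the same loss applies regardless of which scheduler is used at inference.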
## Hardware
- **GPU**: 1× NVIDIA RTX Pro 6000 (96 GB)
- **CPUs**: 4
- **System RAM**: 24 GiB (4 × 6 GiB)
- **Cluster**: ETH Euler, partition `cuda13pr.4h`
- **Wall time used**: ~1h 55min before manual cancel (4h walltime budget)
Compute nodes had no internet; CLIP weights were pre-cached via the login node into `$HF_HOME` and `HF_HUB_OFFLINE=1` was set.
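A minimal sketch of that offline workflow. The `huggingface-cli download` command is the standard `huggingface_hub` CLI; the `$SCRATCH` cache location is an assumption, any shared path visible to both login and compute nodes works.

```shell
# On the login node (has internet): pre-download CLIP into a shared cache.
export HF_HOME=$SCRATCH/hf_cache
huggingface-cli download openai/clip-vit-base-patch16

# In the sbatch script (compute node, no internet): force cache-only loads.
export HF_HOME=$SCRATCH/hf_cache
export HF_HUB_OFFLINE=1
```

With `HF_HUB_OFFLINE=1`, any model not already in `$HF_HOME` fails fast instead of hanging on a network call.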
## Dataset
[`larsvandorp/clean_table_filtered`](https://huggingface.co/datasets/larsvandorp/clean_table_filtered)
- 36,439 frames, 212 episodes (after filtering noop frames where the leader arm wasn't moving)
- 30 Hz, single wrist camera at 600×800 (h264, lossy from recording)
- 6-DoF SO-101 joint state and action
- Single task: `"Pick up the corner of the towel"`
## Exact training command
```bash
lerobot-train \
--policy.type=multi_task_dit \
--policy.device=cuda \
--policy.use_amp=true \
--policy.push_to_hub=true \
--policy.repo_id=larsvandorp/clean_table_multi_task_dit \
--policy.objective=diffusion \
--policy.noise_scheduler_type=DDIM \
--policy.num_train_timesteps=100 \
--policy.num_inference_steps=10 \
--policy.horizon=32 \
--policy.n_action_steps=24 \
--policy.num_layers=4 \
--policy.vision_encoder_name=openai/clip-vit-base-patch16 \
--policy.text_encoder_name=openai/clip-vit-base-patch16 \
--dataset.repo_id=larsvandorp/clean_table_filtered \
--dataset.root=$SLURM_SUBMIT_DIR/data_clean_table_filtered \
--dataset.video_backend=pyav \
--output_dir=$OUT \
--batch_size=192 \
--steps=30000 \
--save_freq=2000 \
--log_freq=200 \
--eval_freq=0 \
--num_workers=3 \
--wandb.enable=false
```
Training was stopped manually at step 6000 because loss plateaued and time was needed for real-robot rollout.
## Why these flag values
Every flag that deviates from a config default (or that we set explicitly despite matching the default), and why:
### Deviations from defaults
| Flag | Value | Default | Why |
|---|---|---|---|
| `--policy.num_layers` | 4 | 6 | Blog "small dataset (<100 examples)" preset. We have 212 episodes, which is borderline small; a smaller DiT lowers overfitting risk. |
| `--policy.noise_scheduler_type` | DDIM | DDPM | DDIM and DDPM share the training math (same `add_noise`, same MSE loss). Setting DDIM here bakes the fast inference scheduler into the saved checkpoint config so M1 deployment uses 10 steps with no override. |
| `--policy.num_inference_steps` | 10 | None (defaults to 100) | Mac inference at 10 deterministic DDIM steps is ~10× faster than 100 DDPM steps, with negligible quality loss for action-space diffusion at small T. |
| `--policy.use_amp` | true | false | A100/Pro 6000 Tensor Cores: 2–4× faster fp16 matmul + halved activation VRAM. Enabled bs=192 on the 96 GB Pro 6000 with comfortable headroom. |
| `--batch_size` | 192 | n/a | Blog recommends 192–320 for "best training dynamics". 192 is the lower end, fitting within the Pro 6000's 96 GB and the CPU RAM budget. |
| `--num_workers` | 3 | 4 | Matches `cpus_per_task − 1` (one CPU reserved for the main training process). |
| `--steps` | 30000 | n/a | Blog recommends ≥ 30k for a single task. We stopped at 6k after the loss flattened. |
| `--dataset.video_backend` | pyav | torchcodec | torchcodec was installed in the venv but crashes at runtime on Euler (missing `libavdevice.so.58` in the system FFmpeg). PyAV is the working fallback. |
### Defaults explicitly set (no behaviour change, just documentation)
| Flag | Value | Reason for being explicit |
|---|---|---|
| `--policy.objective` | diffusion | Blog says start with diffusion; switch to flow_matching only if generation quality is poor. |
| `--policy.num_train_timesteps` | 100 | Standard small-T diffusion schedule, matches blog. |
| `--policy.horizon` | 32 | ~1 sec of motion at 30 Hz. Blog default. |
| `--policy.n_action_steps` | 24 | ~0.8 sec open-loop execution. Blog warns this knob is sensitive. |
| `--policy.vision_encoder_name` | openai/clip-vit-base-patch16 | The most identity-defining choice; explicit makes the config readable. |
| `--policy.text_encoder_name` | same | Same family for the (frozen) text encoder. |
### Defaults *not* overridden but worth noting
- `image_resize_shape = None`, `image_crop_shape = (224, 224)`. Wrist frames are 600×800; the random 224×224 crop covers only ~28% of the raw frame's width during training. Adding `--policy.image_resize_shape='[240,320]'` would give better field-of-view coverage and ~1.5× faster dataloading; we did not enable it for this run.
- `optimizer_lr = 2e-5`, `vision_encoder_lr_multiplier = 0.1` (the CLIP backbone gets 0.1× LR), AdamW betas `(0.95, 0.999)`, cosine LR schedule with 0 warmup steps.
- RoPE on, no absolute positional encoding.
- `use_separate_rgb_encoder_per_camera = false` (single camera anyway).
## Architecture (for reference)
| Component | Spec |
|---|---|
| Vision encoder | CLIP ViT-B/16 (~86M params, trainable, lr × 0.1) |
| Text encoder | CLIP ViT-B/16 text tower (~63M, **frozen**, learnable `Linear(512→512)` projection) |
| DiT noise predictor | 4 layers × 512 hidden × 8 heads, 4× MLP, AdaLN-Zero conditioning, RoPE (~17M params) |
| Total trainable | ~105M params |
## Inference (on Mac)
Default load just works: DDIM and 10 inference steps are baked into the saved `config.json`.
```python
from lerobot.policies.multi_task_dit.modeling_multi_task_dit import MultiTaskDiTPolicy
policy = MultiTaskDiTPolicy.from_pretrained("larsvandorp/clean_table_multi_task_dit")
policy.eval()
# Pass observation dict: {"observation.images.wrist": ..., "observation.state": ..., "task": "Pick up the corner of the towel"}
```
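For intuition, the deterministic (eta = 0) DDIM update that the policy runs 10 times per action chunk looks roughly like the numpy sketch below. This is not lerobot's code: the schedule constants, timestep spacing, and the stubbed `predict_eps` (which stands in for the conditioned DiT) are all assumptions.

```python
import numpy as np

T, S = 100, 10                             # train timesteps, inference steps
betas = np.linspace(1e-4, 0.02, T)         # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)
timesteps = np.arange(0, T, T // S)[::-1]  # 90, 80, ..., 0

def predict_eps(x_t, t):
    # Stand-in for the DiT; the real model conditions on image/state/task.
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 6))           # start the action chunk from noise
for i, t in enumerate(timesteps):
    eps = predict_eps(x, t)
    ab_t = alpha_bar[t]
    ab_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
    # Reconstruct the clean-sample estimate, then step toward it (eta = 0,
    # so the update is fully deterministic -- no noise re-injected).
    x0_pred = (x - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)
    x = np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps
```

With only 10 of the 100 train timesteps visited and no stochastic term, this is where the ~10× inference speedup over full DDPM sampling comes from.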
## Training metrics summary
| step | loss | grad norm | lr | epochs |
|---|---|---|---|---|
| 200 | 0.203 | 0.898 | 2.0e-5 | 1.05 |
| 1000 | 0.032 | 0.599 | 2.0e-5 | 5.27 |
| 2000 | 0.026 | 0.458 | 2.0e-5 | 9.48 |
| 4000 | 0.021 | 0.378 | 1.9e-5 | 18.97 |
| 6000 | 0.018 | 0.317 | 1.8e-5 | 31.61 |
`updt_s ≈ 0.60 s/step`, `data_s ≈ 0.50 s/step` (47% of wall time was spent with the GPU stalled waiting for the dataloader, despite 3 worker processes; pyav decoding 192 × 600×800 frames per batch is the bottleneck).
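The epochs and samples-seen figures follow directly from `step × batch_size / dataset_frames`; a quick sanity check against the metrics table (all numbers taken from this card):

```python
frames, batch = 36_439, 192   # dataset frames, training batch size

def epochs(step):
    # One epoch = one pass over all dataset frames.
    return step * batch / frames

for step in (2000, 4000, 6000):
    print(step, round(epochs(step), 2))

# Samples seen by step 6000:
print(6000 * batch)   # 1,152,000 -> the ~1.15 M quoted above
```

Step 6000 works out to ~31.6 epochs, matching the 31.61 reported in the metrics table.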
## References
- [Multi-Task DiT lerobot docs](https://huggingface.co/docs/lerobot/en/multi_task_dit)
- [TRI LBM paper](https://arxiv.org/abs/2507.05331) (diffusion objective)
- [Boston Dynamics LBM blog](https://bostondynamics.com/blog/large-behavior-models-atlas-find-new-footing/)
- [Bryson Jones β Dissecting and Open-Sourcing Multitask Diffusion Transformer Policy](https://brysonkjones.substack.com/p/dissecting-and-open-sourcing-multitask-diffusion-transformer-policy)