---
license: apache-2.0
library_name: lerobot
tags:
- robotics
- so-101
- diffusion-policy
- multi-task-dit
- towel-folding
base_model: openai/clip-vit-base-patch16
datasets:
- larsvandorp/clean_table_filtered
---
# Multi-Task DiT – clean_table
Multi-Task Diffusion Transformer policy trained on [`larsvandorp/clean_table_filtered`](https://huggingface.co/datasets/larsvandorp/clean_table_filtered) (SO-101 follower arm, "Pick up the corner of the towel" task).
The repo root holds the **step 6000** checkpoint (latest, lowest loss). Earlier checkpoints are archived under `checkpoints/step_002000/` and `checkpoints/step_004000/`.
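An archived checkpoint can be loaded by snapshotting just its subfolder and pointing `from_pretrained` at the local copy. A sketch using `huggingface_hub` (verify that your lerobot version accepts local paths this way):
```python
from huggingface_hub import snapshot_download
from lerobot.policies.multi_task_dit.modeling_multi_task_dit import MultiTaskDiTPolicy

# Download only the step-4000 subfolder from the repo
local = snapshot_download(
    "larsvandorp/clean_table_multi_task_dit",
    allow_patterns=["checkpoints/step_004000/*"],
)
policy = MultiTaskDiTPolicy.from_pretrained(f"{local}/checkpoints/step_004000")
```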
## Checkpoints
| Path | Step | Loss | Epochs | Samples seen |
|---|---|---|---|---|
| `/` (default) | 6000 | 0.018 | 32.7 | ~1.15 M |
| `checkpoints/step_004000/` | 4000 | 0.021 | 21.1 | ~768 k |
| `checkpoints/step_002000/` | 2000 | 0.025 | 10.5 | ~384 k |
Loss is the per-step DDIM ε-prediction MSE on the action chunk.
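For reference, that objective in pseudocode (a minimal sketch with a diffusers-style scheduler; `dit` and `cond` are hypothetical stand-ins for the noise predictor and its conditioning, not lerobot's internal names):
```python
import torch
import torch.nn.functional as F
from diffusers import DDIMScheduler

scheduler = DDIMScheduler(num_train_timesteps=100)

def epsilon_mse(dit, actions, cond):
    """actions: (B, 32, 6) ground-truth action chunk."""
    noise = torch.randn_like(actions)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (actions.shape[0],), device=actions.device)
    noisy = scheduler.add_noise(actions, noise, t)  # forward diffusion
    return F.mse_loss(dit(noisy, t, cond), noise)   # ε-prediction MSE
```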
## Hardware
- **GPU**: 1× NVIDIA RTX Pro 6000 (96 GB)
- **CPUs**: 4
- **System RAM**: 24 GiB (4 × 6 GiB)
- **Cluster**: ETH Euler, partition `cuda13pr.4h`
- **Wall time used**: ~1 h 55 min of the 4 h walltime budget before manual cancellation
Compute nodes had no internet; CLIP weights were pre-cached via the login node into `$HF_HOME` and `HF_HUB_OFFLINE=1` was set.
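The pre-caching step can be done like this from the login node (a sketch using `huggingface_hub`; the exact invocation used for this run may have differed):
```python
# Run on the login node (which has internet) so $HF_HOME holds the CLIP
# weights; the compute job then sets HF_HUB_OFFLINE=1 and loads from cache.
from huggingface_hub import snapshot_download

snapshot_download("openai/clip-vit-base-patch16")
```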
## Dataset
[`larsvandorp/clean_table_filtered`](https://huggingface.co/datasets/larsvandorp/clean_table_filtered)
- 36,439 frames, 212 episodes (after filtering noop frames where the leader arm wasn't moving)
- 30 Hz, single wrist camera at 600×800 (H.264, lossy from recording)
- 6-DoF SO-101 joint state and action
- Single task: `"Pick up the corner of the towel"`
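A quick way to sanity-check those numbers locally (a sketch; the import path follows recent lerobot layouts and may differ by version):
```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset("larsvandorp/clean_table_filtered", video_backend="pyav")
print(ds.num_frames, ds.num_episodes)            # expect 36439 frames, 212 episodes
print(ds.features["observation.images.wrist"])   # 600×800 wrist camera stream
```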
## Exact training command
```bash
lerobot-train \
--policy.type=multi_task_dit \
--policy.device=cuda \
--policy.use_amp=true \
--policy.push_to_hub=true \
--policy.repo_id=larsvandorp/clean_table_multi_task_dit \
--policy.objective=diffusion \
--policy.noise_scheduler_type=DDIM \
--policy.num_train_timesteps=100 \
--policy.num_inference_steps=10 \
--policy.horizon=32 \
--policy.n_action_steps=24 \
--policy.num_layers=4 \
--policy.vision_encoder_name=openai/clip-vit-base-patch16 \
--policy.text_encoder_name=openai/clip-vit-base-patch16 \
--dataset.repo_id=larsvandorp/clean_table_filtered \
--dataset.root=$SLURM_SUBMIT_DIR/data_clean_table_filtered \
--dataset.video_backend=pyav \
--output_dir=$OUT \
--batch_size=192 \
--steps=30000 \
--save_freq=2000 \
--log_freq=200 \
--eval_freq=0 \
--num_workers=3 \
--wandb.enable=false
```
Training was stopped manually at step 6000 because the loss had plateaued and the remaining walltime was needed for real-robot rollout.
## Why these flag values
Every flag that deviates from a config default (or that we set explicitly despite matching the default), with the reason:
### Deviations from defaults
| Flag | Value | Default | Why |
|---|---|---|---|
| `--policy.num_layers` | 4 | 6 | The blog's "small dataset (<100 examples)" preset. We have 212 episodes, borderline small; a smaller DiT lowers overfitting risk. |
| `--policy.noise_scheduler_type` | DDIM | DDPM | DDIM and DDPM share the training math (same `add_noise`, same MSE loss). Setting DDIM here bakes the fast inference scheduler into the saved checkpoint config so M1 deployment uses 10 steps with no override. |
| `--policy.num_inference_steps` | 10 | None (defaults to 100) | Mac inference at 10 deterministic DDIM steps is ≈10× faster than 100 DDPM steps, with negligible quality loss for action-space diffusion at small T; see the sampling sketch after this table. |
| `--policy.use_amp` | true | false | A100/Pro 6000 Tensor Cores: 2–4× faster fp16 matmul and halved activation VRAM. Enabled bs=192 on the 96 GB Pro 6000 with comfortable headroom. |
| `--batch_size` | 192 | n/a | Blog recommends 192–320 for "best training dynamics". 192 is the lower end, fit within the Pro 6000's 96 GB and CPU RAM budget. |
| `--num_workers` | 3 | 4 | Matches `cpus_per_task − 1` (one CPU for the main training process). |
| `--steps` | 30000 | n/a | Blog recommends ≥ 30k for a single task. We stopped at 6k after loss flattened. |
| `--dataset.video_backend` | pyav | torchcodec | torchcodec was installed in the venv but crashes at runtime on Euler (missing `libavdevice.so.58` in the system FFmpeg). PyAV is the working fallback. |
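The inference-speed claim above comes down to DDIM being deterministic: a 10-step subset of the 100-step training schedule still denoises well. An illustrative denoising loop (diffusers-style API; `dit` and `cond` as in the training sketch above, not lerobot's internals):
```python
import torch
from diffusers import DDIMScheduler

scheduler = DDIMScheduler(num_train_timesteps=100)
scheduler.set_timesteps(10)            # 10 inference steps instead of 100

sample = torch.randn(1, 32, 6)         # pure noise over the action chunk
for t in scheduler.timesteps:
    eps = dit(sample, t, cond)         # predict the added noise
    sample = scheduler.step(eps, t, sample).prev_sample
# `sample` is now a denoised (horizon=32, 6-DoF) action chunk
```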
### Defaults explicitly set (no behaviour change, just documentation)
| Flag | Value | Reason for being explicit |
|---|---|---|
| `--policy.objective` | diffusion | Blog says start with diffusion; switch to flow_matching only if generation quality is poor. |
| `--policy.num_train_timesteps` | 100 | Standard small-T diffusion schedule, matches blog. |
| `--policy.horizon` | 32 | ~1 sec of motion at 30 Hz. Blog default. |
| `--policy.n_action_steps` | 24 | ~0.8 sec open-loop execution. Blog warns this knob is sensitive; see the execution sketch below this table. |
| `--policy.vision_encoder_name` | openai/clip-vit-base-patch16 | The most identity-defining choice; making it explicit keeps the config readable. |
| `--policy.text_encoder_name` | same | Same family for the (frozen) text encoder. |
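What the `horizon` / `n_action_steps` pair means at rollout time (an illustrative receding-horizon loop; `predict_chunk` and `env` are hypothetical names, not lerobot's rollout API):
```python
def rollout(policy, env, obs):
    done = False
    while not done:                          # replan from the latest observation
        chunk = predict_chunk(policy, obs)   # (32, 6): ~1 s of actions at 30 Hz
        for action in chunk[:24]:            # execute the first 24 (~0.8 s) open loop
            obs, done = env.step(action)
            if done:
                break
```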
### Defaults *not* overridden but worth noting
- `image_resize_shape = None`, `image_crop_shape = (224, 224)`. Wrist frames are 600×800; the random 224×224 crop covers only ~28% of the raw frame's width during training. Adding `--policy.image_resize_shape='[240,320]'` would give better field-of-view coverage and ~1.5× faster dataloading; we did not enable it for this run.
- `optimizer_lr = 2e-5`, `vision_encoder_lr_multiplier = 0.1` (the CLIP backbone gets 0.1× LR; see the sketch after this list), AdamW betas `(0.95, 0.999)`, cosine LR schedule with 0 warmup steps.
- RoPE on, no absolute positional encoding.
- `use_separate_rgb_encoder_per_camera = false` (single camera anyway).
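Concretely, the LR multiplier amounts to AdamW parameter groups along these lines (an illustrative sketch; `policy.vision_encoder` and `policy.dit` are hypothetical attribute names):
```python
import torch

optimizer = torch.optim.AdamW(
    [
        {"params": policy.vision_encoder.parameters(), "lr": 2e-5 * 0.1},  # CLIP backbone
        {"params": policy.dit.parameters(), "lr": 2e-5},                   # everything else
    ],
    betas=(0.95, 0.999),
)
```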
## Architecture (for reference)
| Component | Spec |
|---|---|
| Vision encoder | CLIP ViT-B/16 (~86M params, trainable, LR × 0.1) |
| Text encoder | CLIP ViT-B/16 text tower (~63M, **frozen**, learnable `Linear(512→512)` projection) |
| DiT noise predictor | 4 layers × 512 hidden × 8 heads, 4× MLP, AdaLN-Zero conditioning, RoPE (~17M params) |
| Total trainable | ~105M params |
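For intuition on the AdaLN-Zero row: the conditioning vector is mapped to per-block shift/scale/gate, with the gate zero-initialized so each block starts as the identity. A minimal attention-only block (per the DiT paper's formulation, omitting the MLP branch; not lerobot's exact module):
```python
import torch
import torch.nn as nn

class AdaLNZeroAttnBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mod = nn.Linear(dim, 3 * dim)
        nn.init.zeros_(self.mod.weight)  # "Zero": residual branch is off at init
        nn.init.zeros_(self.mod.bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        shift, scale, gate = self.mod(cond).unsqueeze(1).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale) + shift  # adaptive LayerNorm
        attn_out, _ = self.attn(h, h, h)
        return x + gate * attn_out              # gated residual
```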
## Inference (on Mac)
A default load just works: DDIM and 10 inference steps are baked into the saved `config.json`.
```python
from lerobot.policies.multi_task_dit.modeling_multi_task_dit import MultiTaskDiTPolicy

policy = MultiTaskDiTPolicy.from_pretrained("larsvandorp/clean_table_multi_task_dit")
policy.eval()

# Observation batch keys follow the dataset's features:
# {"observation.images.wrist": (1, 3, H, W) float image, "observation.state": (1, 6) joints,
#  "task": "Pick up the corner of the towel"}  ->  policy.select_action(batch)
```
## Training metrics summary
| step | loss | grad norm | lr | epochs |
|---|---|---|---|---|
| 200 | 0.203 | 0.898 | 2.0e-5 | 1.05 |
| 1000 | 0.032 | 0.599 | 2.0e-5 | 5.27 |
| 2000 | 0.026 | 0.458 | 2.0e-5 | 9.48 |
| 4000 | 0.021 | 0.378 | 1.9e-5 | 18.97 |
| 6000 | 0.018 | 0.317 | 1.8e-5 | 31.61 |
`updt_s ≈ 0.60 s/step`, `data_s ≈ 0.50 s/step`: roughly 47% of wall time was the GPU stalled waiting for the dataloader, despite 3 worker processes; pyav decoding 192 × 600×800 frames per batch is the bottleneck.
## References
- [Multi-Task DiT lerobot docs](https://huggingface.co/docs/lerobot/en/multi_task_dit)
- [TRI LBM paper](https://arxiv.org/abs/2507.05331) (diffusion objective)
- [Boston Dynamics LBM blog](https://bostondynamics.com/blog/large-behavior-models-atlas-find-new-footing/)
- [Bryson Jones, "Dissecting and Open-Sourcing Multitask Diffusion Transformer Policy"](https://brysonkjones.substack.com/p/dissecting-and-open-sourcing-multitask-diffusion-transformer-policy)