---
license: apache-2.0
library_name: lerobot
tags:
- robotics
- so-101
- diffusion-policy
- multi-task-dit
- towel-folding
base_model: openai/clip-vit-base-patch16
datasets:
- larsvandorp/clean_table_filtered
---

# Multi-Task DiT – clean_table

Multi-Task Diffusion Transformer policy trained on [`larsvandorp/clean_table_filtered`](https://huggingface.co/datasets/larsvandorp/clean_table_filtered) (SO-101 follower arm, "Pick up the corner of the towel" task).

The repo root holds the **step 6000** checkpoint (latest, lowest loss). Earlier checkpoints are archived under `checkpoints/step_002000/` and `checkpoints/step_004000/`.

## Checkpoints

| Path | Step | Loss | Epochs | Samples seen |
|---|---|---|---|---|
| `/` (default) | 6000 | 0.018 | 32.7 | ~1.15M |
| `checkpoints/step_004000/` | 4000 | 0.021 | 21.1 | ~768k |
| `checkpoints/step_002000/` | 2000 | 0.025 | 10.5 | ~384k |

Loss is the per-step DDIM ε-prediction MSE on the action chunk.
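For reference, the ε-prediction objective can be sketched in a few lines of NumPy. Everything here is illustrative: the β-schedule values, the toy batch shapes, and the stand-in `eps_model` noise are not the policy's actual numbers; only the `add_noise` formula and the MSE target match the standard DDPM/DDIM training math.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes: batch of 8 action chunks, horizon 32, 6-DoF actions.
B, H, D = 8, 32, 6
T = 100  # num_train_timesteps

# Toy linear beta schedule; a_bar_t is the cumulative product of (1 - beta).
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)

actions = rng.standard_normal((B, H, D))   # clean action chunk x_0
t = rng.integers(0, T, size=B)             # random timestep per sample
eps = rng.standard_normal((B, H, D))       # Gaussian noise

# add_noise: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps
a_bar = alphas_cumprod[t][:, None, None]
noisy = np.sqrt(a_bar) * actions + np.sqrt(1.0 - a_bar) * eps

# The DiT would predict eps_hat from (noisy, t, conditioning); here a
# stand-in "network" returns the true noise plus a small error.
eps_hat = eps + 0.1 * rng.standard_normal((B, H, D))

# Training objective: MSE between predicted and true noise.
loss = np.mean((eps_hat - eps) ** 2)
```

Because DDIM and DDPM share this training math, the scheduler choice only changes sampling at inference time.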

## Hardware

- **GPU**: 1× NVIDIA RTX Pro 6000 (96 GB)
- **CPUs**: 4
- **System RAM**: 24 GiB (4 × 6 GiB)
- **Cluster**: ETH Euler, partition `cuda13pr.4h`
- **Wall time used**: ~1h 55min before manual cancel (4h walltime budget)

Compute nodes had no internet; CLIP weights were pre-cached via the login node into `$HF_HOME` and `HF_HUB_OFFLINE=1` was set.
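A typical way to set this up looks like the fragment below. The `$SCRATCH` cache path and the exact warm-up command are illustrative, not copied from the run; the repo id and `HF_HUB_OFFLINE=1` are from this card.

```shell
# On the login node (has internet): warm the cache once.
export HF_HOME=$SCRATCH/hf_cache   # example path, not from the run
python -c "from transformers import CLIPModel, CLIPTokenizer; \
  CLIPModel.from_pretrained('openai/clip-vit-base-patch16'); \
  CLIPTokenizer.from_pretrained('openai/clip-vit-base-patch16')"

# In the sbatch job script, before lerobot-train:
export HF_HOME=$SCRATCH/hf_cache
export HF_HUB_OFFLINE=1            # fail fast instead of attempting downloads
```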

## Dataset

[`larsvandorp/clean_table_filtered`](https://huggingface.co/datasets/larsvandorp/clean_table_filtered)

- 36,439 frames, 212 episodes (after filtering noop frames where the leader arm wasn't moving)
- 30 Hz, single wrist camera at 600×800 (H.264, lossy from recording)
- 6-DoF SO-101 joint state and action
- Single task: `"Pick up the corner of the towel"`

## Exact training command

```bash
lerobot-train \
  --policy.type=multi_task_dit \
  --policy.device=cuda \
  --policy.use_amp=true \
  --policy.push_to_hub=true \
  --policy.repo_id=larsvandorp/clean_table_multi_task_dit \
  --policy.objective=diffusion \
  --policy.noise_scheduler_type=DDIM \
  --policy.num_train_timesteps=100 \
  --policy.num_inference_steps=10 \
  --policy.horizon=32 \
  --policy.n_action_steps=24 \
  --policy.num_layers=4 \
  --policy.vision_encoder_name=openai/clip-vit-base-patch16 \
  --policy.text_encoder_name=openai/clip-vit-base-patch16 \
  --dataset.repo_id=larsvandorp/clean_table_filtered \
  --dataset.root=$SLURM_SUBMIT_DIR/data_clean_table_filtered \
  --dataset.video_backend=pyav \
  --output_dir=$OUT \
  --batch_size=192 \
  --steps=30000 \
  --save_freq=2000 \
  --log_freq=200 \
  --eval_freq=0 \
  --num_workers=3 \
  --wandb.enable=false
```

Training was stopped manually at step 6000 because loss plateaued and time was needed for real-robot rollout.

## Why these flag values

Every flag value that deviates from a config default, or that we set explicitly even though it matches the default, with the reason:

### Deviations from defaults
| Flag | Value | Default | Why |
|---|---|---|---|
| `--policy.num_layers` | 4 | 6 | Blog "small dataset (<100 examples)" preset. We have 212 episodes, which is borderline small; a smaller DiT lowers overfitting risk. |
| `--policy.noise_scheduler_type` | DDIM | DDPM | DDIM and DDPM share the training math (same `add_noise`, same MSE loss). Setting DDIM here bakes the fast inference scheduler into the saved checkpoint config so M1 deployment uses 10 steps with no override. |
| `--policy.num_inference_steps` | 10 | None (defaults to 100) | Mac inference at 10 deterministic DDIM steps is ≈10× faster than 100 DDPM steps, with negligible quality loss for action-space diffusion at small T. |
| `--policy.use_amp` | true | false | A100/Pro 6000 Tensor Cores: 2–4× faster fp16 matmul and halved activation VRAM. Enabled bs=192 on the 96 GB Pro 6000 with comfortable headroom. |
| `--batch_size` | 192 | n/a | Blog recommends 192–320 for "best training dynamics". 192 is the lower end and fits within the Pro 6000's 96 GB and the CPU RAM budget. |
| `--num_workers` | 3 | 4 | Matches `cpus_per_task - 1` (one CPU for the main training process). |
| `--steps` | 30000 | n/a | Blog recommends ≥ 30k for a single task. We stopped at 6k after loss flattened. |
| `--dataset.video_backend` | pyav | torchcodec | torchcodec was installed in the venv but crashes at runtime on Euler (missing `libavdevice.so.58` in the system FFmpeg). PyAV is the working fallback. |

### Defaults explicitly set (no behaviour change, just documentation)
| Flag | Value | Reason for being explicit |
|---|---|---|
| `--policy.objective` | diffusion | Blog says start with diffusion; switch to flow_matching only if generation quality is poor. |
| `--policy.num_train_timesteps` | 100 | Standard small-T diffusion schedule, matches blog. |
| `--policy.horizon` | 32 | ~1 sec of motion at 30 Hz. Blog default. |
| `--policy.n_action_steps` | 24 | ~0.8 sec open-loop execution. Blog warns this knob is sensitive. |
| `--policy.vision_encoder_name` | openai/clip-vit-base-patch16 | The most identity-defining choice; explicit makes the config readable. |
| `--policy.text_encoder_name` | same | Same family for the (frozen) text encoder. |

### Defaults *not* overridden but worth noting
- `image_resize_shape = None`, `image_crop_shape = (224, 224)`. Wrist frames are 600×800; the random 224×224 crop sees only ~28% of the raw frame's width during training. Adding `--policy.image_resize_shape='[240,320]'` would give better field-of-view coverage and ~1.5× faster dataloading; we did not enable it for this run.
- `optimizer_lr = 2e-5`, `vision_encoder_lr_multiplier = 0.1` (CLIP backbone gets 0.1Γ— LR), AdamW betas `(0.95, 0.999)`, cosine LR schedule with 0 warmup steps.
- RoPE on, no absolute positional encoding.
- `use_separate_rgb_encoder_per_camera = false` (single camera anyway).

## Architecture (for reference)

| Component | Spec |
|---|---|
| Vision encoder | CLIP ViT-B/16 (~86M params, trainable, lr × 0.1) |
| Text encoder | CLIP ViT-B/16 text tower (~63M, **frozen**, learnable `Linear(512→512)` projection) |
| DiT noise predictor | 4 layers × 512 hidden × 8 heads, 4× MLP, AdaLN-Zero conditioning, RoPE (~17M params) |
| Total trainable | ~105M params |
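A back-of-envelope check of the ~17M DiT figure. The block layout here is an assumption (standard pre-LN blocks with a `Linear(512 → 6*512)` AdaLN-Zero modulation, as in the original DiT recipe), and embeddings, biases, and norms are ignored, so the totals only roughly agree:

```python
# Rough weight count for the DiT trunk (weights only, biases/norms ignored).
# Assumed block: self-attention + mlp_ratio*d MLP + AdaLN-Zero modulation.
d, layers, mlp_ratio = 512, 4, 4

attn = 4 * d * d               # q, k, v, out projections
mlp = 2 * d * (mlp_ratio * d)  # up + down projections
adaln = d * 6 * d              # AdaLN-Zero shift/scale/gate for attn + mlp

per_layer = attn + mlp + adaln
total = layers * per_layer
print(f"{total / 1e6:.1f}M")   # ~18.9M, same ballpark as the card's ~17M
```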

## Inference (on Mac)

Default load just works: DDIM and 10 inference steps are baked into the saved `config.json`.

```python
from lerobot.policies.multi_task_dit.modeling_multi_task_dit import MultiTaskDiTPolicy

policy = MultiTaskDiTPolicy.from_pretrained("larsvandorp/clean_table_multi_task_dit")
policy.eval()
# Pass observation dict: {"observation.images.wrist": ..., "observation.state": ..., "task": "Pick up the corner of the towel"}
```
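What those 10 deterministic DDIM steps do at inference can be sketched in NumPy. This is illustrative only: the real policy's ε prediction comes from the conditioned DiT (here a zero-returning stand-in), and the schedule values are toy numbers; only the η = 0 DDIM update rule itself is the real mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

T_train, T_infer = 100, 10            # num_train_timesteps / num_inference_steps
betas = np.linspace(1e-4, 0.02, T_train)
a_bar = np.cumprod(1.0 - betas)

def eps_model(x, t):
    # Stand-in for the DiT; the real one conditions on images, state, and text.
    return 0.0 * x

# 10 evenly spaced timesteps, descending: 90, 80, ..., 10, 0.
steps = np.arange(0, T_train, T_train // T_infer)[::-1]

x = rng.standard_normal((32, 6))      # start from noise: (horizon, action_dim)
for i, t in enumerate(steps):
    eps_hat = eps_model(x, t)
    # Predicted clean sample, inverted from the add_noise formula.
    x0_hat = (x - np.sqrt(1 - a_bar[t]) * eps_hat) / np.sqrt(a_bar[t])
    a_prev = a_bar[steps[i + 1]] if i + 1 < len(steps) else 1.0
    # Deterministic DDIM update (eta = 0): no fresh noise is injected,
    # which is why 10 steps suffice and runs are repeatable.
    x = np.sqrt(a_prev) * x0_hat + np.sqrt(1 - a_prev) * eps_hat
```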

## Training metrics summary

| step | loss | grad norm | lr | epochs |
|---|---|---|---|---|
| 200 | 0.203 | 0.898 | 2.0e-5 | 1.05 |
| 1000 | 0.032 | 0.599 | 2.0e-5 | 5.27 |
| 2000 | 0.026 | 0.458 | 2.0e-5 | 9.48 |
| 4000 | 0.021 | 0.378 | 1.9e-5 | 18.97 |
| 6000 | 0.018 | 0.317 | 1.8e-5 | 31.61 |

`updt_s ≈ 0.60 s/step`, `data_s ≈ 0.50 s/step` (47% of wall time was the GPU stalled waiting for the dataloader despite 3 worker processes; pyav decoding 192 × 600×800 frames per batch is the bottleneck).

## References

- [Multi-Task DiT lerobot docs](https://huggingface.co/docs/lerobot/en/multi_task_dit)
- [TRI LBM paper](https://arxiv.org/abs/2507.05331) (diffusion objective)
- [Boston Dynamics LBM blog](https://bostondynamics.com/blog/large-behavior-models-atlas-find-new-footing/)
- [Bryson Jones – Dissecting and Open-Sourcing Multitask Diffusion Transformer Policy](https://brysonkjones.substack.com/p/dissecting-and-open-sourcing-multitask-diffusion-transformer-policy)