---
license: apache-2.0
library_name: lerobot
tags:
- robotics
- so-101
- diffusion-policy
- multi-task-dit
- towel-folding
base_model: openai/clip-vit-base-patch16
datasets:
- larsvandorp/clean_table_filtered
---
# Multi-Task DiT – clean_table
Multi-Task Diffusion Transformer policy trained on [`larsvandorp/clean_table_filtered`](https://huggingface.co/datasets/larsvandorp/clean_table_filtered) (SO-101 follower arm, "Pick up the corner of the towel" task).
The repo root holds the **step 6000** checkpoint (latest, lowest loss). Earlier checkpoints are archived under `checkpoints/step_002000/` and `checkpoints/step_004000/`.
## Checkpoints
| Path | Step | Loss | Epochs | Samples seen |
|---|---|---|---|---|
| `/` (default) | 6000 | 0.018 | 31.6 | ~1.15 M |
| `checkpoints/step_004000/` | 4000 | 0.021 | 21.1 | ~768 k |
| `checkpoints/step_002000/` | 2000 | 0.025 | 10.5 | ~384 k |
Loss is the per-step DDIM ε-prediction MSE on the action chunk.
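For intuition, the ε-prediction MSE can be sketched in a few lines of numpy. This is an illustrative toy, not lerobot's implementation: the linear beta schedule, chunk shapes, and the stubbed noise predictor are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed linear beta schedule over the 100 train timesteps.
T = 100
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

# Toy batch: 32-step action chunks with 6 joint dims (horizon=32, 6-DoF).
x0 = rng.standard_normal((192, 32, 6))    # clean action chunks
t = rng.integers(0, T, size=192)          # one diffusion timestep per sample
eps = rng.standard_normal(x0.shape)       # target noise

# Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
ab = alpha_bar[t][:, None, None]
x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

# The DiT would predict eps from (x_t, t, image/state/task conditioning);
# stub it with the target plus a small error so the loss is nonzero.
eps_pred = eps + 0.1 * rng.standard_normal(eps.shape)

loss = np.mean((eps_pred - eps) ** 2)     # per-step MSE on the chunk
print(float(loss))
```

Because DDIM and DDPM share this training objective, the same loss applies regardless of which scheduler is used at inference.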
## Hardware
- **GPU**: 1× NVIDIA RTX Pro 6000 (96 GB)
- **CPUs**: 4
- **System RAM**: 24 GiB (4 × 6 GiB)
- **Cluster**: ETH Euler, partition `cuda13pr.4h`
- **Wall time used**: ~1h 55min before manual cancel (4h walltime budget)
Compute nodes had no internet; CLIP weights were pre-cached via the login node into `$HF_HOME` and `HF_HUB_OFFLINE=1` was set.
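A minimal sketch of that offline workflow. The `huggingface-cli download` command is the standard `huggingface_hub` CLI; the `$SCRATCH` cache location is an assumption, any shared path visible to both login and compute nodes works.

```shell
# On the login node (has internet): pre-download CLIP into a shared cache.
export HF_HOME=$SCRATCH/hf_cache
huggingface-cli download openai/clip-vit-base-patch16

# In the sbatch script (compute node, no internet): force cache-only loads.
export HF_HOME=$SCRATCH/hf_cache
export HF_HUB_OFFLINE=1
```

With `HF_HUB_OFFLINE=1`, any model not already in `$HF_HOME` fails fast instead of hanging on a network call.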
## Dataset
[`larsvandorp/clean_table_filtered`](https://huggingface.co/datasets/larsvandorp/clean_table_filtered)
- 36,439 frames, 212 episodes (after filtering noop frames where the leader arm wasn't moving)
- 30 Hz, single wrist camera at 600×800 (h264, lossy from recording)
- 6-DoF SO-101 joint state and action
- Single task: `"Pick up the corner of the towel"`
## Exact training command
```bash
lerobot-train \
--policy.type=multi_task_dit \
--policy.device=cuda \
--policy.use_amp=true \
--policy.push_to_hub=true \
--policy.repo_id=larsvandorp/clean_table_multi_task_dit \
--policy.objective=diffusion \
--policy.noise_scheduler_type=DDIM \
--policy.num_train_timesteps=100 \
--policy.num_inference_steps=10 \
--policy.horizon=32 \
--policy.n_action_steps=24 \
--policy.num_layers=4 \
--policy.vision_encoder_name=openai/clip-vit-base-patch16 \
--policy.text_encoder_name=openai/clip-vit-base-patch16 \
--dataset.repo_id=larsvandorp/clean_table_filtered \
--dataset.root=$SLURM_SUBMIT_DIR/data_clean_table_filtered \
--dataset.video_backend=pyav \
--output_dir=$OUT \
--batch_size=192 \
--steps=30000 \
--save_freq=2000 \
--log_freq=200 \
--eval_freq=0 \
--num_workers=3 \
--wandb.enable=false
```
Training was stopped manually at step 6000 because loss plateaued and time was needed for real-robot rollout.
## Why these flag values
Every flag that deviates from a config default (or that we set explicitly despite matching the default), and why:
### Deviations from defaults
| Flag | Value | Default | Why |
|---|---|---|---|
| `--policy.num_layers` | 4 | 6 | Blog "small dataset (<100 examples)" preset. We have 212 episodes, which is borderline small; a smaller DiT lowers overfitting risk. |
| `--policy.noise_scheduler_type` | DDIM | DDPM | DDIM and DDPM share the training math (same `add_noise`, same MSE loss). Setting DDIM here bakes the fast inference scheduler into the saved checkpoint config so M1 deployment uses 10 steps with no override. |
| `--policy.num_inference_steps` | 10 | None (defaults to 100) | Mac inference at 10 deterministic DDIM steps is ~10× faster than 100 DDPM steps, with negligible quality loss for action-space diffusion at small T. |
| `--policy.use_amp` | true | false | A100/Pro 6000 Tensor Cores: 2–4× faster fp16 matmul + halved activation VRAM. Enabled bs=192 on the 96 GB Pro 6000 with comfortable headroom. |
| `--batch_size` | 192 | n/a | Blog recommends 192–320 for "best training dynamics". 192 is the lower end, fitting within the Pro 6000's 96 GB and the CPU RAM budget. |
| `--num_workers` | 3 | 4 | Matches `cpus_per_task − 1` (one CPU reserved for the main training process). |
| `--steps` | 30000 | n/a | Blog recommends ≥ 30k for a single task. We stopped at 6k after the loss flattened. |
| `--dataset.video_backend` | pyav | torchcodec | torchcodec was installed in the venv but crashes at runtime on Euler (missing `libavdevice.so.58` in the system FFmpeg). PyAV is the working fallback. |
### Defaults explicitly set (no behaviour change, just documentation)
| Flag | Value | Reason for being explicit |
|---|---|---|
| `--policy.objective` | diffusion | Blog says start with diffusion; switch to flow_matching only if generation quality is poor. |
| `--policy.num_train_timesteps` | 100 | Standard small-T diffusion schedule, matches blog. |
| `--policy.horizon` | 32 | ~1 sec of motion at 30 Hz. Blog default. |
| `--policy.n_action_steps` | 24 | ~0.8 sec open-loop execution. Blog warns this knob is sensitive. |
| `--policy.vision_encoder_name` | openai/clip-vit-base-patch16 | The most identity-defining choice; explicit makes the config readable. |
| `--policy.text_encoder_name` | same | Same family for the (frozen) text encoder. |
### Defaults *not* overridden but worth noting
- `image_resize_shape = None`, `image_crop_shape = (224, 224)`. Wrist frames are 600×800; the random 224×224 crop covers only ~28% of the raw frame's width during training. Adding `--policy.image_resize_shape='[240,320]'` would give better field-of-view coverage and ~1.5× faster dataloading; we did not enable it for this run.
- `optimizer_lr = 2e-5`, `vision_encoder_lr_multiplier = 0.1` (the CLIP backbone gets 0.1× LR), AdamW betas `(0.95, 0.999)`, cosine LR schedule with 0 warmup steps.
- RoPE on, no absolute positional encoding.
- `use_separate_rgb_encoder_per_camera = false` (single camera anyway).
## Architecture (for reference)
| Component | Spec |
|---|---|
| Vision encoder | CLIP ViT-B/16 (~86M params, trainable, lr × 0.1) |
| Text encoder | CLIP ViT-B/16 text tower (~63M, **frozen**, learnable `Linear(512→512)` projection) |
| DiT noise predictor | 4 layers × 512 hidden × 8 heads, 4× MLP, AdaLN-Zero conditioning, RoPE (~17M params) |
| Total trainable | ~105M params |
## Inference (on Mac)
Default load just works: DDIM and 10 inference steps are baked into the saved `config.json`.
```python
from lerobot.policies.multi_task_dit.modeling_multi_task_dit import MultiTaskDiTPolicy
policy = MultiTaskDiTPolicy.from_pretrained("larsvandorp/clean_table_multi_task_dit")
policy.eval()
# Pass observation dict: {"observation.images.wrist": ..., "observation.state": ..., "task": "Pick up the corner of the towel"}
```
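For intuition, the deterministic (eta = 0) DDIM update that the policy runs 10 times per action chunk looks roughly like the numpy sketch below. This is not lerobot's code: the schedule constants, timestep spacing, and the stubbed `predict_eps` (which stands in for the conditioned DiT) are all assumptions.

```python
import numpy as np

T, S = 100, 10                             # train timesteps, inference steps
betas = np.linspace(1e-4, 0.02, T)         # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)
timesteps = np.arange(0, T, T // S)[::-1]  # 90, 80, ..., 0

def predict_eps(x_t, t):
    # Stand-in for the DiT; the real model conditions on image/state/task.
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 6))           # start the action chunk from noise
for i, t in enumerate(timesteps):
    eps = predict_eps(x, t)
    ab_t = alpha_bar[t]
    ab_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
    # Reconstruct the clean-sample estimate, then step toward it (eta = 0,
    # so the update is fully deterministic -- no noise re-injected).
    x0_pred = (x - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)
    x = np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps
```

With only 10 of the 100 train timesteps visited and no stochastic term, this is where the ~10× inference speedup over full DDPM sampling comes from.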
## Training metrics summary
| step | loss | grad norm | lr | epochs |
|---|---|---|---|---|
| 200 | 0.203 | 0.898 | 2.0e-5 | 1.05 |
| 1000 | 0.032 | 0.599 | 2.0e-5 | 5.27 |
| 2000 | 0.026 | 0.458 | 2.0e-5 | 9.48 |
| 4000 | 0.021 | 0.378 | 1.9e-5 | 18.97 |
| 6000 | 0.018 | 0.317 | 1.8e-5 | 31.61 |
`updt_s ≈ 0.60 s/step`, `data_s ≈ 0.50 s/step` (47% of wall time was spent with the GPU stalled waiting for the dataloader, despite 3 worker processes; pyav decoding 192 × 600×800 frames per batch is the bottleneck).
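The epochs and samples-seen figures follow directly from `step × batch_size / dataset_frames`; a quick sanity check against the metrics table (all numbers taken from this card):

```python
frames, batch = 36_439, 192   # dataset frames, training batch size

def epochs(step):
    # One epoch = one pass over all dataset frames.
    return step * batch / frames

for step in (2000, 4000, 6000):
    print(step, round(epochs(step), 2))

# Samples seen by step 6000:
print(6000 * batch)   # 1,152,000 -> the ~1.15 M quoted above
```

Step 6000 works out to ~31.6 epochs, matching the 31.61 reported in the metrics table.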
## References
- [Multi-Task DiT lerobot docs](https://huggingface.co/docs/lerobot/en/multi_task_dit)
- [TRI LBM paper](https://arxiv.org/abs/2507.05331) (diffusion objective)
- [Boston Dynamics LBM blog](https://bostondynamics.com/blog/large-behavior-models-atlas-find-new-footing/)
- [Bryson Jones β Dissecting and Open-Sourcing Multitask Diffusion Transformer Policy](https://brysonkjones.substack.com/p/dissecting-and-open-sourcing-multitask-diffusion-transformer-policy)