---
license: apache-2.0
library_name: lerobot
tags:
- robotics
- so-101
- diffusion-policy
- multi-task-dit
- towel-folding
base_model: openai/clip-vit-base-patch16
datasets:
- larsvandorp/clean_table_filtered
---

# Multi-Task DiT – clean_table

Multi-Task Diffusion Transformer policy trained on [`larsvandorp/clean_table_filtered`](https://huggingface.co/datasets/larsvandorp/clean_table_filtered) (SO-101 follower arm, "Pick up the corner of the towel" task).

The repo root holds the **step 6000** checkpoint (latest, lowest loss). Earlier checkpoints are archived under `checkpoints/step_002000/` and `checkpoints/step_004000/`.

## Checkpoints

| Path | Step | Loss | Epochs | Samples seen |
|---|---|---|---|---|
| `/` (default) | 6000 | 0.018 | 32.7 | ~1.15M |
| `checkpoints/step_004000/` | 4000 | 0.021 | 21.1 | ~768k |
| `checkpoints/step_002000/` | 2000 | 0.025 | 10.5 | ~384k |

Loss is the per-step DDIM ε-prediction MSE on the action chunk.
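
In sketch form (hedged: `dit` below is a placeholder for the conditioned DiT noise predictor, and the real lerobot training step differs in detail, but the objective is this ε-prediction MSE):

```python
import torch
import torch.nn.functional as F
from diffusers import DDIMScheduler

dit = torch.nn.Linear(6, 6)  # placeholder for the conditioned DiT noise predictor

scheduler = DDIMScheduler(num_train_timesteps=100)
actions = torch.randn(192, 32, 6)               # (batch, horizon=32, action_dim=6)
noise = torch.randn_like(actions)
t = torch.randint(0, scheduler.config.num_train_timesteps, (192,))

noisy = scheduler.add_noise(actions, noise, t)  # forward diffusion q(x_t | x_0)
loss = F.mse_loss(dit(noisy), noise)            # ε-prediction MSE on the action chunk
```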

## Hardware

- **GPU**: 1× NVIDIA RTX Pro 6000 (96 GB)
- **CPUs**: 4
- **System RAM**: 24 GiB (4 × 6 GiB)
- **Cluster**: ETH Euler, partition `cuda13pr.4h`
- **Wall time used**: ~1h 55min before manual cancel (4h walltime budget)

Compute nodes had no internet; CLIP weights were pre-cached via the login node into `$HF_HOME` and `HF_HUB_OFFLINE=1` was set.
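
One way to do that pre-caching, sketched from the login node (it assumes `$HF_HOME` points at storage the compute nodes can read):

```python
# Run once on the login node, which has internet access.
from huggingface_hub import snapshot_download

snapshot_download("openai/clip-vit-base-patch16")  # populates the $HF_HOME cache
```

With the cache populated, the training job runs with `HF_HUB_OFFLINE=1` so every hub lookup resolves locally.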

## Dataset

[`larsvandorp/clean_table_filtered`](https://huggingface.co/datasets/larsvandorp/clean_table_filtered)

- 36,439 frames, 212 episodes (after filtering noop frames where the leader arm wasn't moving)
- 30 Hz, single wrist camera at 600×800 (h264, lossy from recording)
- 6-DoF SO-101 joint state and action
- Single task: `"Pick up the corner of the towel"`
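
A hedged loading sketch (the `LeRobotDataset` import path matches recent lerobot releases and may differ in older ones):

```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset("larsvandorp/clean_table_filtered", video_backend="pyav")
print(ds.num_frames, ds.num_episodes)  # expect 36439, 212
frame = ds[0]  # dict with observation.images.*, observation.state, action, task
```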

## Exact training command

```bash
lerobot-train \
  --policy.type=multi_task_dit \
  --policy.device=cuda \
  --policy.use_amp=true \
  --policy.push_to_hub=true \
  --policy.repo_id=larsvandorp/clean_table_multi_task_dit \
  --policy.objective=diffusion \
  --policy.noise_scheduler_type=DDIM \
  --policy.num_train_timesteps=100 \
  --policy.num_inference_steps=10 \
  --policy.horizon=32 \
  --policy.n_action_steps=24 \
  --policy.num_layers=4 \
  --policy.vision_encoder_name=openai/clip-vit-base-patch16 \
  --policy.text_encoder_name=openai/clip-vit-base-patch16 \
  --dataset.repo_id=larsvandorp/clean_table_filtered \
  --dataset.root=$SLURM_SUBMIT_DIR/data_clean_table_filtered \
  --dataset.video_backend=pyav \
  --output_dir=$OUT \
  --batch_size=192 \
  --steps=30000 \
  --save_freq=2000 \
  --log_freq=200 \
  --eval_freq=0 \
  --num_workers=3 \
  --wandb.enable=false
```

Training was stopped manually at step 6000 because the loss had plateaued and time was needed for real-robot rollouts.

## Why these flag values

Every flag that deviates from a config default (or that we set explicitly despite matching the default), and the reason:

### Deviations from defaults
| Flag | Value | Default | Why |
|---|---|---|---|
| `--policy.num_layers` | 4 | 6 | Blog "small dataset (<100 examples)" preset. We have 212 episodes (borderline small); a smaller DiT lowers overfitting risk. |
| `--policy.noise_scheduler_type` | DDIM | DDPM | DDIM and DDPM share the training math (same `add_noise`, same MSE loss). Setting DDIM here bakes the fast inference scheduler into the saved checkpoint config, so M1 deployment uses 10 steps with no override. |
| `--policy.num_inference_steps` | 10 | None (defaults to 100) | Mac inference at 10 deterministic DDIM steps is ≈10× faster than 100 DDPM steps, with negligible quality loss for action-space diffusion at small T (see the sketch after this table). |
| `--policy.use_amp` | true | false | A100/Pro 6000 Tensor Cores: 2–4× faster fp16 matmul plus halved activation VRAM. Enabled bs=192 on the 96 GB Pro 6000 with comfortable headroom. |
| `--batch_size` | 192 | n/a | Blog recommends 192–320 for "best training dynamics". 192 is the lower end and fit within the Pro 6000's 96 GB and the CPU RAM budget. |
| `--num_workers` | 3 | 4 | Matches `cpus_per_task - 1` (one CPU for the main training process). |
| `--steps` | 30000 | n/a | Blog recommends ≥30k for a single task. We stopped at 6k after the loss flattened. |
| `--dataset.video_backend` | pyav | torchcodec | torchcodec was installed in the venv but crashes at runtime on Euler (missing `libavdevice.so.58` in the system FFmpeg). PyAV is the working fallback. |
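
For reference, a sketch of the 10-step DDIM sampling loop run at inference (`dit` is again a placeholder for the conditioned noise predictor; the real policy wraps this loop internally):

```python
import torch
from diffusers import DDIMScheduler

dit = torch.nn.Linear(6, 6)     # placeholder for the conditioned DiT
scheduler = DDIMScheduler(num_train_timesteps=100)
scheduler.set_timesteps(10)     # 10 evenly spaced, deterministic steps

sample = torch.randn(1, 32, 6)  # action chunk starts as pure noise
for t in scheduler.timesteps:   # descending timesteps
    eps = dit(sample)           # predict the noise at this step
    sample = scheduler.step(eps, t, sample).prev_sample
```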

### Defaults explicitly set (no behaviour change, just documentation)
| Flag | Value | Reason for being explicit |
|---|---|---|
| `--policy.objective` | diffusion | Blog says start with diffusion; switch to flow_matching only if generation quality is poor. |
| `--policy.num_train_timesteps` | 100 | Standard small-T diffusion schedule, matches blog. |
| `--policy.horizon` | 32 | ~1 sec of motion at 30 Hz. Blog default. |
| `--policy.n_action_steps` | 24 | ~0.8 sec of open-loop execution. Blog warns this knob is sensitive. |
| `--policy.vision_encoder_name` | openai/clip-vit-base-patch16 | The most identity-defining choice; being explicit makes the config readable. |
| `--policy.text_encoder_name` | same | Same family for the (frozen) text encoder. |

### Defaults *not* overridden but worth noting
- `image_resize_shape = None`, `image_crop_shape = (224, 224)`. Wrist frames are 600×800, so the random 224×224 training crop covers only ~28% of the raw frame's width. Adding `--policy.image_resize_shape='[240,320]'` would give better field-of-view coverage and ~1.5× faster dataloading; we did not enable it for this run.
- `optimizer_lr = 2e-5`, `vision_encoder_lr_multiplier = 0.1` (the CLIP backbone gets 0.1× the base LR; see the sketch after this list), AdamW betas `(0.95, 0.999)`, cosine LR schedule with 0 warmup steps.
- RoPE on, no absolute positional encoding.
- `use_separate_rgb_encoder_per_camera = false` (single camera anyway).
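
A hedged sketch of that optimizer setup (module names here are placeholders, not the policy's actual attribute names):

```python
import torch

vision_encoder = torch.nn.Linear(8, 8)  # placeholder for the CLIP ViT backbone
dit = torch.nn.Linear(8, 8)             # placeholder for the DiT noise predictor

base_lr = 2e-5
optimizer = torch.optim.AdamW(
    [
        {"params": dit.parameters(), "lr": base_lr},
        {"params": vision_encoder.parameters(), "lr": base_lr * 0.1},  # 0.1× multiplier
    ],
    betas=(0.95, 0.999),
)
```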

## Architecture (for reference)

| Component | Spec |
|---|---|
| Vision encoder | CLIP ViT-B/16 (~86M params, trainable, lr × 0.1) |
| Text encoder | CLIP ViT-B/16 text tower (~63M, **frozen**, learnable `Linear(512→512)` projection) |
| DiT noise predictor | 4 layers × 512 hidden × 8 heads, 4× MLP, AdaLN-Zero conditioning, RoPE (~17M params) |
| Total trainable | ~105M params |
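
A quick way to sanity-check those counts on the loaded policy (a sketch; exact numbers may vary slightly by lerobot version):

```python
from lerobot.policies.multi_task_dit.modeling_multi_task_dit import MultiTaskDiTPolicy

policy = MultiTaskDiTPolicy.from_pretrained("larsvandorp/clean_table_multi_task_dit")
trainable = sum(p.numel() for p in policy.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in policy.parameters() if not p.requires_grad)
print(f"trainable: {trainable / 1e6:.1f}M, frozen: {frozen / 1e6:.1f}M")  # expect ~105M trainable
```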

## Inference (on Mac)

Default load just works: DDIM + 10 inference steps are baked into the saved `config.json`.

```python
from lerobot.policies.multi_task_dit.modeling_multi_task_dit import MultiTaskDiTPolicy

policy = MultiTaskDiTPolicy.from_pretrained("larsvandorp/clean_table_multi_task_dit")
policy.eval()
# Pass observation dict: {"observation.images.wrist": ..., "observation.state": ..., "task": "Pick up the corner of the towel"}
```
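
A hedged example of one control step (key names follow the comment above; `select_action` is the standard lerobot policy interface, though the exact preprocessing expected may differ):

```python
import torch

obs = {
    "observation.images.wrist": torch.rand(1, 3, 600, 800),  # RGB float in [0, 1]
    "observation.state": torch.zeros(1, 6),                  # 6-DoF joint state
    "task": ["Pick up the corner of the towel"],
}
with torch.no_grad():
    action = policy.select_action(obs)  # one action; the 24-step chunk is queued internally
```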

## Training metrics summary

| step | loss | grad norm | lr | epochs |
|---|---|---|---|---|
| 200 | 0.203 | 0.898 | 2.0e-5 | 1.05 |
| 1000 | 0.032 | 0.599 | 2.0e-5 | 5.27 |
| 2000 | 0.026 | 0.458 | 2.0e-5 | 9.48 |
| 4000 | 0.021 | 0.378 | 1.9e-5 | 18.97 |
| 6000 | 0.018 | 0.317 | 1.8e-5 | 31.61 |

`updt_s ≈ 0.60 s/step`, `data_s ≈ 0.50 s/step`: 47% of wall time was the GPU stalled waiting for the dataloader, despite 3 worker processes; pyav decoding 192 × 600×800 frames per batch is the bottleneck. At ≈1.1 s/step, the 6000 steps account for roughly 1h 50min, consistent with the wall time reported above.

## References

- [Multi-Task DiT lerobot docs](https://huggingface.co/docs/lerobot/en/multi_task_dit)
- [TRI LBM paper](https://arxiv.org/abs/2507.05331) (diffusion objective)
- [Boston Dynamics LBM blog](https://bostondynamics.com/blog/large-behavior-models-atlas-find-new-footing/)
- [Bryson Jones – Dissecting and Open-Sourcing Multitask Diffusion Transformer Policy](https://brysonkjones.substack.com/p/dissecting-and-open-sourcing-multitask-diffusion-transformer-policy)
|
|