	---
	license: apache-2.0
	library_name: lerobot
	pipeline_tag: robotics
	tags:
	- robotics
	- so-101
	- diffusion-policy
	- multi-task-dit
	- dinov3
	- towel-folding
	base_model: facebook/dinov3-vitb16-pretrain-lvd1689m
	datasets:
	- larsvandorp/magic_soup
	---

	# folding_dit — Multi-Task DiT (DINOv3-B) for towel folding

	Diffusion-Transformer policy for autonomous towel folding on a 6-DoF SO-101 follower arm with a single wrist camera. This is the model used at the competition (ETH "Robot Learning: From Fundamentals to Foundation Models", Project 5 — Diffusion Policy).

	The repo root holds the step-28000 checkpoint (the deployed one), so `from_pretrained("larsvandorp/folding_dit")` loads it directly.

	## Architecture

	| Component | Spec |
	|---|---|
	| Vision encoder | DINOv3 ViT-B/16 (~86M parameters, fine-tuned with lr × 0.1) |
	| Text encoder | CLIP ViT-B/16 text tower (frozen, learnable projection) |
	| Noise predictor | 6-layer DiT, 512 hidden, 8 heads, AdaLN-Zero, RoPE |
	| Objective | Diffusion with a DDIM scheduler: 100 train timesteps, 10-step inference |
	| Horizon / action steps | 32 / 24 (≈1.07 s / 0.8 s at 30 Hz) |
	| Augmentation | resize only (no crop), RandomGrayscale (≈50% of samples), color jitter; no rotation |
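
The scheduler and horizon rows above can be made concrete with a little arithmetic. The sketch below is illustrative only and is not code from the fork: the function names are hypothetical, the timestep selection assumes plain uniform ("leading") spacing as used by many DDIM implementations, and the replan count assumes the policy executes all 24 action steps of each predicted chunk before predicting again.

```python
import math


def ddim_inference_timesteps(num_train_timesteps: int = 100,
                             num_inference_steps: int = 10) -> list[int]:
    """Pick an evenly spaced subset of the training timesteps, noisiest first.

    Assumption: uniform "leading" spacing with no offset; the actual
    scheduler config in the fork may differ.
    """
    if not 0 < num_inference_steps <= num_train_timesteps:
        raise ValueError("need 0 < num_inference_steps <= num_train_timesteps")
    stride = num_train_timesteps // num_inference_steps
    return [i * stride for i in reversed(range(num_inference_steps))]


def chunk_timing(horizon: int = 32, n_action_steps: int = 24,
                 fps: float = 30.0) -> tuple[float, float]:
    """Seconds covered by one predicted chunk and by its executed part."""
    return horizon / fps, n_action_steps / fps


def replans_per_rollout(duration_s: float = 60.0, fps: float = 30.0,
                        n_action_steps: int = 24) -> int:
    """Number of policy inferences needed to fill a rollout of duration_s.

    Assumption: every chunk is executed for exactly n_action_steps control
    ticks before the next prediction.
    """
    total_ticks = int(round(duration_s * fps))
    return math.ceil(total_ticks / n_action_steps)


if __name__ == "__main__":
    print(ddim_inference_timesteps())
    print(chunk_timing())
    print(replans_per_rollout())
```

Under these assumptions, the 10 inference steps visit timesteps 90, 80, ..., 0, each prediction covers about 1.07 s of which 0.8 s is executed, and a 60 s rollout at 30 Hz needs 75 predictions.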

	Loading requires the `multi_task_dit` policy and DINOv3 `AutoModel` support, both of which live in the fork
	[`LarsvanDorp/lerobot@dinov3`](https://github.com/LarsvanDorp/lerobot/tree/dinov3) (not yet upstream).

	## Run it

	```bash
	uv venv --python 3.12 .venv
	GIT_LFS_SKIP_SMUDGE=1 uv pip install --python .venv/bin/python \
	"lerobot[multi_task_dit] @ git+https://github.com/LarsvanDorp/lerobot.git@dinov3"
	.venv/bin/hf download facebook/dinov3-vitb16-pretrain-lvd1689m # gated — accept the license first
	.venv/bin/hf download openai/clip-vit-base-patch16

	.venv/bin/lerobot-rollout \
	--strategy.type=base \
	--robot.type=so101_follower --robot.port=/dev/ttyACM0 --robot.id=my_follower \
	--robot.cameras="{wrist: {type: opencv, index_or_path: 0, width: 800, height: 600, fps: 30, fourcc: MJPG}}" \
	--policy.path=larsvandorp/folding_dit \
	--policy.device=cuda --inference.type=sync \
	--task="fold the towel" --duration=60
	```

	Feed the policy color RGB frames: this is the "rgray" model, trained on a mix of color and grayscale images. We always run without `--interpolation_multiplier`. It also runs on Mac MPS (drop `fourcc: MJPG` and set `--policy.device=mps`).
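
The color and grayscale training mix can be pictured with a minimal, dependency-free sketch of a random-grayscale augmentation. This is not the training code: the helper names are hypothetical, the image is a nested list of RGB tuples rather than a tensor, and the luma weights are the common ITU-R BT.601 values, assumed here for illustration.

```python
import random

# Assumed BT.601 luma weights for RGB -> gray conversion.
LUMA_WEIGHTS = (0.299, 0.587, 0.114)


def to_gray(pixel):
    """Convert one (r, g, b) pixel to gray, keeping three channels."""
    r, g, b = pixel
    y = round(LUMA_WEIGHTS[0] * r + LUMA_WEIGHTS[1] * g + LUMA_WEIGHTS[2] * b)
    return (y, y, y)


def random_grayscale(image, p=0.5, rng=None):
    """With probability p return a grayscale copy, else the image unchanged.

    `image` is a list of rows, each row a list of (r, g, b) tuples.
    """
    rng = rng or random
    if rng.random() < p:
        return [[to_gray(px) for px in row] for row in image]
    return image


if __name__ == "__main__":
    rng = random.Random(0)
    frame = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (255, 255, 255)]]
    n_gray = sum(random_grayscale(frame, p=0.5, rng=rng) is not frame
                 for _ in range(2000))
    print(f"grayscale fraction: {n_gray / 2000:.2f}")
```

Keeping three identical channels after conversion means the network input shape is the same for both kinds of sample, which is why color frames can be fed unchanged at inference time.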

	## Training data

	[`larsvandorp/magic_soup`](https://huggingface.co/datasets/larsvandorp/magic_soup): ~430 SO-101 towel-folding episodes, deliberately varied across cloths, rotations, and locations. The demonstrations grasp next to the corner and return to start after the first fold.