folding_dit — Multi-Task DiT (DINOv3-B) for towel folding

Diffusion-Transformer policy for autonomous towel folding on a 6-DoF SO-101 follower arm with a single wrist camera. This is the model used at the competition (ETH "Robot Learning: From Fundamentals to Foundation Models", Project 5 — Diffusion Policy).

The repo root holds the step-28000 checkpoint (the deployed one), so from_pretrained("larsvandorp/folding_dit") loads it directly.

Architecture

Component	Spec
Vision encoder	DINOv3 ViT-B/16 (~86M, fine-tuned, lr × 0.1)
Text encoder	CLIP ViT-B/16 text tower (frozen, learnable projection)
Noise predictor	6-layer DiT, 512 hidden, 8 heads, AdaLN-Zero, RoPE
Objective	Diffusion, DDIM scheduler — 100 train timesteps, 10-step inference
Horizon / action steps	32 / 24 (≈1.07 s / 0.8 s at 30 Hz)
Augmentation	resize-only (no crop) + RandomGrayscale (≈50% of samples) + color jitter, no rotation
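
To make the scheduler row concrete: with 100 training timesteps and 10 inference steps, DDIM denoises along an evenly strided subset of the training timesteps. The sketch below assumes the "leading" spacing that the diffusers DDIMScheduler uses by default; the scheduler settings actually used for this checkpoint are not stated here, and the function name is illustrative only.

```python
def ddim_timesteps(num_train_timesteps: int = 100,
                   num_inference_steps: int = 10) -> list[int]:
    """Descending timesteps visited at inference (assumed "leading" spacing)."""
    if not 0 < num_inference_steps <= num_train_timesteps:
        raise ValueError("need 0 < num_inference_steps <= num_train_timesteps")
    # Integer stride between consecutive inference timesteps.
    stride = num_train_timesteps // num_inference_steps
    # Start from the noisiest visited timestep and walk down to 0.
    return [i * stride for i in reversed(range(num_inference_steps))]


print(ddim_timesteps())  # [90, 80, 70, 60, 50, 40, 30, 20, 10, 0]
```

Under this assumption each inference step skips 10 training timesteps, which is why 10 network evaluations are enough per action chunk.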

Running the model requires the multi_task_dit policy and DINOv3 loading through AutoModel. Both currently live in the fork LarsvanDorp/lerobot@dinov3 (not yet upstream).
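
The "lr × 0.1" entry for the vision encoder in the architecture table is commonly implemented with optimizer parameter groups. The following is a minimal sketch of that idea, not the fork's training code: the "vision_encoder." prefix, the parameter names, and the base learning rate of 1e-4 are all hypothetical, and the function works on names so that it runs without PyTorch.

```python
def split_param_groups(param_names, base_lr,
                       backbone_prefix="vision_encoder.",  # hypothetical prefix
                       backbone_lr_scale=0.1):
    """Split parameters into a full-lr group and a scaled-lr backbone group."""
    backbone = [n for n in param_names if n.startswith(backbone_prefix)]
    rest = [n for n in param_names if not n.startswith(backbone_prefix)]
    return [
        {"params": rest, "lr": base_lr},                          # DiT, projections
        {"params": backbone, "lr": base_lr * backbone_lr_scale},  # fine-tuned encoder
    ]


# Hypothetical parameter names and base learning rate, for illustration only.
names = ["vision_encoder.blocks.0.weight", "dit.layers.0.weight"]
groups = split_param_groups(names, base_lr=1e-4)
print([len(g["params"]) for g in groups])  # [1, 1]
```

With a real model, the same two groups would hold the parameter tensors instead of their names and be passed to the optimizer as its parameter list.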

Run it

uv venv --python 3.12 .venv
GIT_LFS_SKIP_SMUDGE=1 uv pip install --python .venv/bin/python \
  "lerobot[multi_task_dit] @ git+https://github.com/LarsvanDorp/lerobot.git@dinov3"
.venv/bin/hf download facebook/dinov3-vitb16-pretrain-lvd1689m   # gated — accept the license first
.venv/bin/hf download openai/clip-vit-base-patch16

.venv/bin/lerobot-rollout \
  --strategy.type=base \
  --robot.type=so101_follower --robot.port=/dev/ttyACM0 --robot.id=my_follower \
  --robot.cameras="{wrist: {type: opencv, index_or_path: 0, width: 800, height: 600, fps: 30, fourcc: MJPG}}" \
  --policy.path=larsvandorp/folding_dit \
  --policy.device=cuda --inference.type=sync \
  --task="fold the towel" --duration=60

Feed color RGB frames (this is the "rgray" model, trained on a mix of color and grayscale images). We always run without --interpolation_multiplier. The model also runs on Mac MPS: drop fourcc: MJPG and set --policy.device=mps.
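
As a rough guide to what the rollout above does, the sketch below relates the 32 / 24 horizon and action-step settings to the 30 fps camera rate and --duration=60. It assumes the usual chunked execution scheme (predict a full horizon, execute the first action steps, then predict again) and that the control loop holds 30 Hz with inference latency ignored; the function name and returned keys are illustrative.

```python
import math


def chunk_schedule(duration_s: float = 60.0, fps: int = 30,
                   horizon: int = 32, n_action_steps: int = 24) -> dict:
    """Timing of chunked action execution under the stated assumptions."""
    total_steps = round(duration_s * fps)  # control steps in the whole rollout
    return {
        "horizon_s": horizon / fps,                    # time span of one prediction
        "executed_s": n_action_steps / fps,            # time span actually executed
        "unused_per_chunk": horizon - n_action_steps,  # predicted but not executed
        "policy_calls": math.ceil(total_steps / n_action_steps),
    }


print(chunk_schedule()["policy_calls"])  # 75
```

Under these assumptions, a 60 s rollout needs 75 policy calls, each of which runs the full denoising loop once.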

Training data

larsvandorp/magic_soup: about 430 SO-101 towel-folding episodes, deliberately broad (varied cloths, rotations, and locations). The demonstrations grasp next to the corner and return to start after the first fold.
