---
license: apache-2.0
library_name: lerobot
pipeline_tag: robotics
tags:
- robotics
- so-101
- diffusion-policy
- multi-task-dit
- dinov3
- towel-folding
base_model: facebook/dinov3-vitb16-pretrain-lvd1689m
datasets:
- larsvandorp/magic_soup
---
# folding_dit: Multi-Task DiT (DINOv3-B) for towel folding

Diffusion-Transformer policy for autonomous towel folding on a 6-DoF **SO-101** follower arm with a single wrist camera. This is the model used at the competition (ETH "Robot Learning: From Fundamentals to Foundation Models", Project 5: Diffusion Policy).

The repo **root holds the step-28000 checkpoint** (the deployed one), so `from_pretrained("larsvandorp/folding_dit")` loads it directly.

## Architecture

| Component | Spec |
|---|---|
| Vision encoder | **DINOv3 ViT-B/16** (~86M, fine-tuned, lr × 0.1) |
| Text encoder | CLIP ViT-B/16 text tower (frozen, learnable projection) |
| Noise predictor | 6-layer DiT, 512 hidden, 8 heads, AdaLN-Zero, RoPE |
| Objective | Diffusion, **DDIM** scheduler: 100 train timesteps, 10-step inference |
| Horizon / action steps | 32 / 24 (≈1.07 s / 0.8 s at 30 Hz) |
| Augmentation | resize-only (no crop) + RandomGrayscale (≈50% of samples) + color jitter, no rotation |
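
The timing and sampler numbers in the table can be checked with a few lines of Python. This is a minimal sketch, not the fork's code: the constants come from the table, while `chunk_seconds` and `strided_timesteps` are hypothetical helper names, and the evenly strided ("leading") timestep spacing is an assumption, since the card does not say which spacing the DDIM scheduler is configured with.

```python
# Values from the architecture table above.
FPS = 30                # control rate in Hz
HORIZON = 32            # actions predicted per policy call
N_ACTION_STEPS = 24     # actions executed before the next policy call
TRAIN_TIMESTEPS = 100   # diffusion timesteps used during training
INFERENCE_STEPS = 10    # DDIM denoising steps at inference


def chunk_seconds(n_steps: int, fps: int = FPS) -> float:
    """Wall-clock time covered by n_steps actions at the given control rate."""
    return n_steps / fps


def strided_timesteps(train_steps: int, inference_steps: int) -> list[int]:
    """Descending, evenly strided timesteps (ASSUMED 'leading' spacing)."""
    stride = train_steps // inference_steps
    return [i * stride for i in range(inference_steps)][::-1]


print(f"horizon:      {chunk_seconds(HORIZON):.2f} s")         # 1.07 s
print(f"action steps: {chunk_seconds(N_ACTION_STEPS):.2f} s")  # 0.80 s
print(f"unused tail:  {HORIZON - N_ACTION_STEPS} actions")     # 8 actions
print(strided_timesteps(TRAIN_TIMESTEPS, INFERENCE_STEPS))
# [90, 80, 70, 60, 50, 40, 30, 20, 10, 0]
```

Under these assumptions, each policy call predicts about 1.07 s of motion but only the first 0.8 s is executed; the last 8 predicted actions are discarded when the next chunk is generated.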

Requires the `multi_task_dit` policy and DINOv3 `AutoModel` loading, both of which live in the fork
[`LarsvanDorp/lerobot@dinov3`](https://github.com/LarsvanDorp/lerobot/tree/dinov3) (not yet upstream).

## Run it

```bash
# Create a Python 3.12 virtual environment
uv venv --python 3.12 .venv

# Install the LeRobot fork (skip Git LFS file downloads during the clone)
GIT_LFS_SKIP_SMUDGE=1 uv pip install --python .venv/bin/python \
  "lerobot[multi_task_dit] @ git+https://github.com/LarsvanDorp/lerobot.git@dinov3"

# Pre-download the vision and text encoders
.venv/bin/hf download facebook/dinov3-vitb16-pretrain-lvd1689m  # gated: accept the license first
.venv/bin/hf download openai/clip-vit-base-patch16

# Roll out the policy on the SO-101 follower
.venv/bin/lerobot-rollout \
  --strategy.type=base \
  --robot.type=so101_follower --robot.port=/dev/ttyACM0 --robot.id=my_follower \
  --robot.cameras="{wrist: {type: opencv, index_or_path: 0, width: 800, height: 600, fps: 30, fourcc: MJPG}}" \
  --policy.path=larsvandorp/folding_dit \
  --policy.device=cuda --inference.type=sync \
  --task="fold the towel" --duration=60
```

Feed **color** RGB frames (this is the "rgray" model, trained on a mix of color and grayscale images). We always run **without** `--interpolation_multiplier`. The policy also runs on Mac MPS (drop `fourcc: MJPG` and set `--policy.device=mps`).
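
For a rough sense of the compute behind one rollout, the sketch below estimates how many policy calls and denoiser passes the command above implies. It rests on several assumptions that the card does not confirm: `--duration=60` is taken to be seconds, the policy is assumed to re-plan once per executed 24-step chunk, and each DDIM step is counted as a single forward pass of the DiT. `rollout_budget` is a hypothetical helper, not part of LeRobot.

```python
import math


def rollout_budget(duration_s: float, fps: int = 30,
                   n_action_steps: int = 24, ddim_steps: int = 10):
    """Return (control steps, policy calls, denoiser passes) for one rollout.

    ASSUMPTIONS: duration is in seconds, one policy call per executed
    action chunk, one denoiser forward pass per DDIM step.
    """
    control_steps = round(duration_s * fps)
    policy_calls = math.ceil(control_steps / n_action_steps)
    denoiser_passes = policy_calls * ddim_steps
    return control_steps, policy_calls, denoiser_passes


steps, calls, passes = rollout_budget(60)
print(steps, calls, passes)  # 1800 75 750
```

With `--inference.type=sync`, each of those policy calls presumably has to finish before the next chunk can start, so per-call latency matters more than raw throughput.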

## Training data

[`larsvandorp/magic_soup`](https://huggingface.co/datasets/larsvandorp/magic_soup): ~430 SO-101 towel-folding episodes, deliberately broad (varied cloths, rotations, and locations), with grasps next to the corner and a return-to-start after the first fold.