Diffusion Policy with DINOv3 ViT-B/16 backbone: grasp03 (final 25K checkpoint)
Diffusion Policy trained on
yianW/grasp03-sim-real-v2
with the ResNet image backbone in lerobot's DiffusionPolicy swapped for
DINOv3 ViT-B/16 initialized from Meta's LVD-1689M self-supervised
pretraining
(facebook/dinov3-vitb16-pretrain-lvd1689m).
Two training variants in this repo
| Variant | Where | Sampling | Steps |
|---|---|---|---|
| Resampled (90 / 2 / 8), repo root | repo root (final 25 K), and checkpoints/{017500,020000,022500}/ (intermediates) | sim eps 0…393 → 90 %, realA eps 394…399 → 2 %, realB eps 400…407 → 8 % via a WeightedRandomSampler | resumed from 15 K → 25 K (+10 K) |
| Original (uniform-by-frame) | checkpoints/015000/ and checkpoints/025000_uniform/ | Each frame equally likely → ~96.9 % sim / 1.7 % realA / 1.4 % realB | 25 000 from scratch |
Default load (resampled 25 K, the headline checkpoint):
policy = DiffusionPolicy.from_pretrained("yianW/diffusion-dinov3-grasp03-25k")
Load a non-root checkpoint with the subfolder argument, e.g. the original-uniform 25 K:
policy = DiffusionPolicy.from_pretrained(
"yianW/diffusion-dinov3-grasp03-25k", subfolder="checkpoints/025000_uniform"
)
Original-variant intermediate checkpoints at 5 K / 10 K / 20 K are kept locally on the training host (not uploaded); ask if you need any.
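The checkpoint layout from the table above can be wrapped in a small lookup so the right subfolder is picked by variant and step. This is only an illustrative sketch: `CHECKPOINTS` and `load_kwargs` are hypothetical names that are not part of this repo or of lerobot; the subfolder strings are copied from the table.

```python
REPO_ID = "yianW/diffusion-dinov3-grasp03-25k"

# (variant, step) -> subfolder inside the repo; None means the repo root.
CHECKPOINTS = {
    ("resampled", 17_500): "checkpoints/017500",
    ("resampled", 20_000): "checkpoints/020000",
    ("resampled", 22_500): "checkpoints/022500",
    ("resampled", 25_000): None,  # headline checkpoint at the repo root
    ("original", 15_000): "checkpoints/015000",
    ("original", 25_000): "checkpoints/025000_uniform",
}


def load_kwargs(variant: str, step: int) -> dict:
    """Return the extra keyword arguments for from_pretrained (hypothetical helper)."""
    key = (variant, step)
    if key not in CHECKPOINTS:
        raise KeyError(f"no uploaded checkpoint for variant={variant!r}, step={step}")
    subfolder = CHECKPOINTS[key]
    # The root checkpoint needs no subfolder argument at all.
    return {} if subfolder is None else {"subfolder": subfolder}


# Intended use, once lerobot_vit_patch has been imported:
#   policy = DiffusionPolicy.from_pretrained(REPO_ID, **load_kwargs("original", 25_000))
print(load_kwargs("original", 25_000))
```

Requesting a step that was not uploaded (for example the original variant at 5 K) raises a `KeyError` instead of silently falling back to the root checkpoint.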
⚠️ Before you can load this model
The DINOv3 backbone weights are gated on Meta's HF mirror. The HF account
you use must have accepted the gate at
https://huggingface.co/facebook/dinov3-vitb16-pretrain-lvd1689m (one-time
click-through). Then run huggingface-cli login so the token is saved on disk.
Reproducing the resampled variant
- Apply the ViT patch (lerobot_vit_patch.py).
- Apply the weighted-sampler patch and define buckets:

      import lerobot_weighted_sampler as lws
      lws.set_buckets({
          "sim": (range(0, 394), 0.90),
          "realA": (range(394, 400), 0.02),
          "realB": (range(400, 408), 0.08),
      })
      lws.apply()

- Run train_resume_15k_weighted.py against the 15 K checkpoint (set RUN_DIR to your local 15 K dir).
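To make the bucket shares concrete, the sketch below shows one way bucket-level targets can be turned into per-frame sampling weights: each frame gets its bucket's share divided by the number of frames in that bucket. This is not the code in lerobot_weighted_sampler. The helper names and the toy frame counts (10 frames per sim episode, 5 per real episode) are assumptions made purely for illustration.

```python
import random
from collections import Counter

# Mirrors the set_buckets call above: name -> (episode range, target share).
BUCKETS = {
    "sim": (range(0, 394), 0.90),
    "realA": (range(394, 400), 0.02),
    "realB": (range(400, 408), 0.08),
}


def bucket_of(episode: int) -> str:
    """Name of the bucket that contains this episode index."""
    for name, (episodes, _share) in BUCKETS.items():
        if episode in episodes:
            return name
    raise ValueError(f"episode {episode} is not in any bucket")


def frame_weights(frame_episodes: list[int]) -> list[float]:
    """Per-frame weight = bucket share / number of frames in that bucket."""
    names = [bucket_of(ep) for ep in frame_episodes]
    frames_per_bucket = Counter(names)
    return [BUCKETS[name][1] / frames_per_bucket[name] for name in names]


# Toy dataset with made-up episode lengths (NOT the real dataset's frame counts).
frame_episodes = [ep for ep in range(408) for _ in range(10 if ep < 394 else 5)]
weights = frame_weights(frame_episodes)

# Draw frames with replacement and measure how often each bucket is hit.
rng = random.Random(0)
draws = rng.choices(range(len(frame_episodes)), weights=weights, k=20_000)
hits = Counter(bucket_of(frame_episodes[i]) for i in draws)
empirical = {name: hits[name] / len(draws) for name in BUCKETS}
print(empirical)
```

The same `weights` list could be handed to `torch.utils.data.WeightedRandomSampler(weights, num_samples, replacement=True)`, which is the sampler class named in the variants table.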
What's in this repo
| File | Purpose |
|---|---|
| config.json, model.safetensors, train_config.json, policy_*.{json,safetensors} | Standard lerobot pretrained-model files for the diffusion policy. Note: model.safetensors does not include the DINOv3 backbone weights; those are downloaded fresh from the gated HF mirror at load time and never re-saved. |
| lerobot_vit_patch.py | Monkey-patches lerobot to swap DiffusionRgbEncoder for a ViT (torchvision OR DINOv3) and to accept vision_backbone="dinov3_*" in DiffusionConfig. Must be imported before loading. |
| train_diffusion_vit.py | Self-contained training script: builds TrainPipelineConfig, applies the per-dim loss weighting, calls lerobot.scripts.lerobot_train.train. Reference for retraining or fine-tuning. |
How to load
import lerobot_vit_patch # noqa: F401 -- swaps in DINOv3 ViT-B/16
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy
policy = DiffusionPolicy.from_pretrained("yianW/diffusion-dinov3-grasp03-25k")
policy.eval()
If lerobot_vit_patch.py isn't on your PYTHONPATH yet, fetch it from this
repo first:
from huggingface_hub import hf_hub_download
import importlib, os, sys
# Download the patch file and put its directory on sys.path.
patch_path = hf_hub_download(
    "yianW/diffusion-dinov3-grasp03-25k", "lerobot_vit_patch.py", repo_type="model"
)
sys.path.insert(0, os.path.dirname(patch_path))  # portable; no manual split on "/"
importlib.import_module("lerobot_vit_patch") # apply the patch
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy
policy = DiffusionPolicy.from_pretrained("yianW/diffusion-dinov3-grasp03-25k")
The DINOv3 backbone weights are then pulled from
facebook/dinov3-vitb16-pretrain-lvd1689m automatically on first call (cached
under ~/.cache/huggingface/hub/).
Training setup (key choices)
- Backbone: dinov3_vitb16, LVD-1689M self-supervised pretraining, 86 M params. Loaded via transformers.AutoModel with attn_implementation="sdpa" (eager attention OOMs at large batch).
- Image pipeline: 480×640 → resize 240×320 → 224×224 random crop (train) / center crop (eval) → ImageNet normalization → DINOv3 ViT-B/16. Output: CLS token (pooler_output) → Linear(768 → 64) → ReLU. The 64-D feature matches 2 × spatial_softmax_num_keypoints so the Unet's global_cond_dim is unchanged from the upstream lerobot baseline.
- Augmentation: lerobot's ImageTransforms with brightness, contrast, saturation, hue, sharpness, small affine; up to 3 sampled per frame.
- Action loss weighting: the dataset has 27-dim actions split as 7 arm joints + 20 hand joints. Per-element MSE is multiplied by [5, 5, 5, 5, 5, 5, 5, 1, 1, ..., 1] so each arm dimension is weighted 5× relative to each hand dimension. See _weighted_compute_loss in train_diffusion_vit.py.
- Diffusion: DDPM, 100 train timesteps, prediction_type=epsilon, n_obs_steps=2, horizon=16, n_action_steps=8.
- Optimizer: AdamW preset (lr 1e-4 peak, cosine decay to ~0, 500-step warmup), bf16 mixed precision via accelerate.
- Batch / steps: batch 128, 25 000 steps (~85 epochs over 37 304 frames). We tried batch 256 and 192 first; both OOM with HF's DINOv3 + sdpa even on an 80 GB A100 (DINOv3-via-transformers is ~2.5× more activation-hungry than torchvision's ViT-B/16 at the same batch size).
- Hardware: 1× A100-80GB (peak ~63 GB at b=128).
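The action loss weighting described in the list above can be illustrated with a plain-Python sketch. It is not the `_weighted_compute_loss` implementation from train_diffusion_vit.py: the function name `weighted_mse` and the choice to average over the 27 dimensions without renormalizing by the weights are assumptions made for illustration.

```python
ARM_DIMS, HAND_DIMS = 7, 20          # 27-dim action: arm joints first, then hand joints
ARM_WEIGHT, HAND_WEIGHT = 5.0, 1.0   # weights from the training setup above

# [5, 5, 5, 5, 5, 5, 5, 1, 1, ..., 1]
LOSS_WEIGHTS = [ARM_WEIGHT] * ARM_DIMS + [HAND_WEIGHT] * HAND_DIMS


def weighted_mse(pred, target, weights=LOSS_WEIGHTS):
    """Mean over action dims of weight * squared error, for one action vector."""
    if not (len(pred) == len(target) == len(weights)):
        raise ValueError("pred, target and weights must have the same length")
    total = sum(w * (p - t) ** 2 for w, p, t in zip(weights, pred, target))
    return total / len(weights)


# A unit error on an arm joint costs 5x as much as the same error on a hand joint.
target = [0.0] * 27
arm_error = [1.0] + [0.0] * 26
hand_error = [0.0] * 26 + [1.0]
print(weighted_mse(arm_error, target) / weighted_mse(hand_error, target))
```

With epsilon prediction, `pred` and `target` would be the predicted and true noise for each action dimension rather than the actions themselves.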
Loss curve (train)
| Step | Variant | Loss | LR |
|---|---|---|---|
| 100 | original (uniform) | 1.99 | 1.0e-5 |
| 1 000 | original (uniform) | 0.05 | 1.0e-4 |
| 5 000 | original (uniform) | 0.014 | 9.2e-5 |
| 10 000 | original (uniform) | 0.008 | 6.8e-5 |
| 15 000 | original (uniform) | 0.004 | 3.7e-5 |
| 20 000 | original (uniform) | 0.002 | 1.0e-5 |
| 25 000 | original (uniform) | 0.001 | ~0 |
| 17 500 | resampled (90/2/8) (resumed from 15K) | 0.003 | 2.3e-5 |
| 20 000 | resampled (90/2/8) | 0.002 | 9.4e-6 |
| 22 500 | resampled (90/2/8) | 0.001 | 3.1e-6 |
| 25 000 | resampled (90/2/8), repo root | 0.001 | ~0 |
Loss values are MSE-on-noise (epsilon prediction); a single pass through the dataset gives a noisy estimate, so the resampled trajectory looks similar to the original on this aggregate metric. The intended difference is sample coverage (under the 90 / 2 / 8 weights, real frames are drawn about 3× more often overall and realB frames nearly 6× more often than under uniform sampling), not headline loss; eval on real-data trajectories is the meaningful comparison.
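The LR column follows from the optimizer settings (1e-4 peak, 500-step warmup, cosine decay to ~0 over 25 000 steps). The sketch below uses the textbook linear-warmup plus half-cosine form to show how such values arise. It is an approximation under that assumption, not lerobot's scheduler code, so the logged values in the table may differ slightly.

```python
import math

PEAK_LR = 1e-4
WARMUP_STEPS = 500
TOTAL_STEPS = 25_000


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then half-cosine decay to zero at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (1_000, 5_000, 10_000, 15_000, 20_000, 25_000):
    print(f"{step:>6}  {lr_at(step):.1e}")
```

Because the resampled run resumed from the 15 K checkpoint, its LR values continue the same decaying tail instead of restarting the warmup.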
Reproducing the training run on another machine
Environment
| Package | Version |
|---|---|
| Python | 3.12 |
| lerobot | 0.5.1 |
| torch | 2.10.0 (cu128) |
| torchvision | 0.25.0 (cu128) |
| transformers | ≥ 4.51, < 5 (HF DINOv3 module) |
| huggingface_hub | ≥ 0.36 |
uv venv /venv/main --python 3.12
VIRTUAL_ENV=/venv/main uv pip install \
"torch==2.10.0" "torchvision==0.25.0" \
"lerobot==0.5.1" "transformers>=4.51,<5" "huggingface_hub>=0.36"
huggingface-cli login # token must have accepted the DINOv3 gate
GPU: 1× A100-80GB used here. With HF's DINOv3 implementation, batch 128 is the practical maximum on 80 GB; smaller GPUs need a smaller batch.
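When the batch has to shrink for a smaller GPU, one common heuristic is to raise the step count so the run still sees the same number of samples. The sketch below does that arithmetic from the reference run (batch 128, 25 000 steps, 37 304 frames). The helper names are hypothetical, and this rescaling was not validated for this model; learning rate and warmup may also need retuning.

```python
FRAMES = 37_304                       # dataset size from the training setup
REF_BATCH, REF_STEPS = 128, 25_000    # reference run


def epochs(steps: int, batch: int, frames: int = FRAMES) -> float:
    """Approximate number of passes over the dataset."""
    return steps * batch / frames


def steps_for_same_samples(batch: int) -> int:
    """Steps needed at `batch` to see at least as many samples as the reference run."""
    total_samples = REF_BATCH * REF_STEPS
    return -(-total_samples // batch)  # integer ceiling division


print(f"reference run: {epochs(REF_STEPS, REF_BATCH):.1f} epochs")
for batch in (96, 64, 32):
    print(f"batch {batch}: {steps_for_same_samples(batch)} steps")
```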
Train from scratch
huggingface-cli download yianW/diffusion-dinov3-grasp03-25k \
lerobot_vit_patch.py train_diffusion_vit.py --local-dir .
/venv/main/bin/python train_diffusion_vit.py
train_diffusion_vit.py is self-contained: it builds the
TrainPipelineConfig programmatically (dataset, DINOv3 backbone, augmentation,
crop/resize, per-dim loss weighting) and calls lerobot's training loop.
Edit build_train_config() / build_policy_config() to change steps, batch
size, learning rate, etc. To switch to torchvision ViT-B/16 with ImageNet
pretraining, set VISION_BACKBONE = "vit_b_16" and
PRETRAINED = "IMAGENET1K_V1" at the top of the file; the patch supports
both.
Fine-tune from this 25K checkpoint
Pass --policy.path=yianW/diffusion-dinov3-grasp03-25k to lerobot's CLI, or
modify build_policy_config() to use from_pretrained. The patch must still
be applied first (import lerobot_vit_patch).
Resume training (with optimizer/RNG state)
The optimizer / scheduler / RNG state needed to resume training mid-run
(training_state/, ~2.6 GB) is not uploaded here; only inference
weights are. Ask if you need it.
Caveats
- Loss values are MSE-on-noise (epsilon prediction) and are not directly comparable to action-MSE.
- The patch is a runtime monkey-patch, not a fork of lerobot. Pin lerobot to the version trained against (>= the version that has lerobot.policies.diffusion.modeling_diffusion.DiffusionRgbEncoder).
- transformers 5.x broke lerobot's groot policy import; pin to transformers>=4.51,<5.
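A quick way to catch a wrong transformers version before applying the patch is to check the installed version against the pin. The sketch below uses only the standard library. `parse_release` and `transformers_pin_ok` are hypothetical helpers with deliberately simple parsing; `packaging.version` would be the more robust choice if it is available.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(v: str) -> tuple[int, ...]:
    """Leading numeric components of a version, e.g. '4.57.0rc1' -> (4, 57, 0)."""
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if digits != piece:  # stop at a suffix such as 'rc1' or 'dev0'
            break
    return tuple(parts)


def transformers_pin_ok(v: str) -> bool:
    """True if v satisfies the pin transformers>=4.51,<5."""
    release = parse_release(v)
    return (4, 51) <= release[:2] and release[:1] < (5,)


try:
    installed = version("transformers")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("transformers is not installed")
else:
    print(f"transformers {installed}: pin satisfied = {transformers_pin_ok(installed)}")
```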