Diffusion Policy with DINOv3 ViT-B/16 backbone: grasp03 (final 25K checkpoint)

Diffusion Policy trained on yianW/grasp03-sim-real-v2 with the ResNet image backbone in lerobot's DiffusionPolicy swapped for DINOv3 ViT-B/16 initialized from Meta's LVD-1689M self-supervised pretraining (facebook/dinov3-vitb16-pretrain-lvd1689m).

Two training variants in this repo

| Variant | Where | Sampling | Steps |
| --- | --- | --- | --- |
| Resampled (90 / 2 / 8) ← repo root | repo root (final 25 K), and checkpoints/{017500,020000,022500}/ (intermediates) | sim eps 0…393 → 90 %, realA eps 394…399 → 2 %, realB eps 400…407 → 8 % via a WeightedRandomSampler | resumed from 15 K → 25 K (+10 K) |
| Original (uniform-by-frame) | checkpoints/015000/ and checkpoints/025000_uniform/ | Each frame equally likely → ~96.9 % sim / 1.7 % realA / 1.4 % realB | 25 000 from scratch |

Default load (resampled 25 K, the headline checkpoint):

policy = DiffusionPolicy.from_pretrained("yianW/diffusion-dinov3-grasp03-25k")

Load a non-root checkpoint with the subfolder argument, e.g. the original-uniform 25 K:

policy = DiffusionPolicy.from_pretrained(
    "yianW/diffusion-dinov3-grasp03-25k", subfolder="checkpoints/025000_uniform"
)

Original-variant intermediate checkpoints at 5 K / 10 K / 20 K are kept locally on the training host (not uploaded); ask if you need any.

⚠️ Before you can load this model

The DINOv3 backbone weights are gated on Meta's HF mirror. The HF account you use must have accepted the gate at https://huggingface.co/facebook/dinov3-vitb16-pretrain-lvd1689m (one-time click-through). Then run huggingface-cli login so the token is stored on disk.

Reproducing the resampled variant

  1. Apply the ViT patch (lerobot_vit_patch.py).
  2. Apply the weighted-sampler patch and define buckets:
    import lerobot_weighted_sampler as lws
    lws.set_buckets({
        "sim":   (range(0, 394),   0.90),
        "realA": (range(394, 400), 0.02),
        "realB": (range(400, 408), 0.08),
    })
    lws.apply()
    
  3. Run train_resume_15k_weighted.py against the 15 K checkpoint (set RUN_DIR to your local 15 K dir).
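
One way to realize the 90 / 2 / 8 split is to give every frame a weight equal to its bucket's target share divided by the number of frames in that bucket, and then pass those weights to a weighted sampler. The sketch below illustrates that idea in plain Python. It is not the contents of lerobot_weighted_sampler.py; the helper name frame_weights and the toy episode list are hypothetical.

```python
def frame_weights(episode_of_frame, buckets):
    """Return one sampling weight per frame.

    episode_of_frame: list of episode indices, one entry per frame.
    buckets: {name: (range_of_episode_indices, target_share)}.
    Each frame gets target_share / frames_in_bucket, so the weights
    inside a bucket sum to that bucket's target share.
    """
    counts = {name: 0 for name in buckets}
    bucket_of_frame = []
    for ep in episode_of_frame:
        # Find the bucket whose episode range contains this frame's episode.
        for name, (episodes, _share) in buckets.items():
            if ep in episodes:
                counts[name] += 1
                bucket_of_frame.append(name)
                break
        else:
            raise ValueError(f"episode {ep} is not covered by any bucket")
    return [buckets[name][1] / counts[name] for name in bucket_of_frame]


buckets = {
    "sim": (range(0, 394), 0.90),
    "realA": (range(394, 400), 0.02),
    "realB": (range(400, 408), 0.08),
}
# Toy data (hypothetical): 3 sim frames, 1 realA frame, 2 realB frames.
episode_of_frame = [0, 0, 393, 394, 400, 407]
weights = frame_weights(episode_of_frame, buckets)
```

Weights of this form could then be handed to torch.utils.data.WeightedRandomSampler(weights, num_samples=..., replacement=True), which draws indices in proportion to them.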

What's in this repo

| File | Purpose |
| --- | --- |
| config.json, model.safetensors, train_config.json, policy_*.{json,safetensors} | Standard lerobot pretrained-model files for the diffusion policy. Note: model.safetensors does not include the DINOv3 backbone weights; those are downloaded fresh from the gated HF mirror at load time and never re-saved. |
| lerobot_vit_patch.py | Monkey-patches lerobot to swap DiffusionRgbEncoder for a ViT (torchvision OR DINOv3) and to accept vision_backbone="dinov3_*" in DiffusionConfig. Must be imported before loading. |
| train_diffusion_vit.py | Self-contained training script: builds TrainPipelineConfig, applies the per-dim loss weighting, calls lerobot.scripts.lerobot_train.train. Reference for retraining or fine-tuning. |

How to load

import lerobot_vit_patch  # noqa: F401  -- swaps in DINOv3 ViT-B/16
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy

policy = DiffusionPolicy.from_pretrained("yianW/diffusion-dinov3-grasp03-25k")
policy.eval()

If lerobot_vit_patch.py isn't on your PYTHONPATH yet, fetch it from this repo first:

import os
import sys

from huggingface_hub import hf_hub_download

# Download the patch file and put its directory on the import path.
# os.path.dirname works with both "/" and "\" path separators.
patch_path = hf_hub_download(
    "yianW/diffusion-dinov3-grasp03-25k", "lerobot_vit_patch.py", repo_type="model"
)
sys.path.insert(0, os.path.dirname(patch_path))

import lerobot_vit_patch  # noqa: E402,F401  -- importing applies the patch

from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy
policy = DiffusionPolicy.from_pretrained("yianW/diffusion-dinov3-grasp03-25k")

The DINOv3 backbone weights are then pulled from facebook/dinov3-vitb16-pretrain-lvd1689m automatically on first call (cached under ~/.cache/huggingface/hub/).

Training setup (key choices)

  • Backbone: dinov3_vitb16, LVD-1689M self-supervised pretraining, 86 M params. Loaded via transformers.AutoModel with attn_implementation="sdpa" (eager attention OOMs at large batch).
  • Image pipeline: 480×640 → resize 240×320 → 224×224 random crop (train) / center crop (eval) → ImageNet normalization → DINOv3 ViT-B/16. Output: CLS token (pooler_output) → Linear(768 → 64) → ReLU. The 64-D feature matches 2 × spatial_softmax_num_keypoints so the Unet's global_cond_dim is unchanged from the upstream lerobot baseline.
  • Augmentation: lerobot's ImageTransforms with brightness, contrast, saturation, hue, sharpness, and a small affine transform, with up to 3 sampled per frame.
  • Action loss weighting: the dataset has 27-dim actions split as 7 arm joints + 20 hand joints. Per-element MSE is multiplied by [5, 5, 5, 5, 5, 5, 5, 1, 1, ..., 1] so each arm dimension is weighted 5× relative to each hand dimension. See _weighted_compute_loss in train_diffusion_vit.py.
  • Diffusion: DDPM, 100 train timesteps, prediction_type=epsilon, n_obs_steps=2, horizon=16, n_action_steps=8.
  • Optimizer: AdamW preset (lr 1e-4 peak, cosine decay to ~0, 500-step warmup), bf16 mixed precision via accelerate.
  • Batch / steps: batch 128, 25 000 steps (~85 epochs over 37 304 frames). We tried batch 256 and 192 first; both ran out of memory with HF's DINOv3 + sdpa even on an 80 GB A100 (DINOv3-via-transformers is ~2.5× more activation-hungry than torchvision's ViT-B/16 at the same batch size).
  • Hardware: 1× A100-80GB (peak ~63 GB at b=128).
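
The action loss weighting in the list above amounts to an element-wise weighted MSE. The following is a minimal plain-Python sketch of that idea, not the code in _weighted_compute_loss. The function name weighted_mse is hypothetical, and taking a plain mean over the 27 elements (rather than normalizing by the weight sum) is an assumption. Here pred and target stand for the predicted and true noise on a single action vector.

```python
ARM_DIMS, HAND_DIMS = 7, 20
ACTION_WEIGHTS = [5.0] * ARM_DIMS + [1.0] * HAND_DIMS  # 27 entries


def weighted_mse(pred, target, weights=ACTION_WEIGHTS):
    """Mean over action dims of weight * squared error for one action vector."""
    if not (len(pred) == len(target) == len(weights)):
        raise ValueError("pred, target and weights must have the same length")
    total = sum(w * (p - t) ** 2 for p, t, w in zip(pred, target, weights))
    return total / len(weights)


# Same unit error placed on one arm joint vs. one hand joint.
zeros = [0.0] * 27
arm_err = [1.0] + [0.0] * 26               # error on the first arm dim
hand_err = [0.0] * 7 + [1.0] + [0.0] * 19  # error on the first hand dim
arm_loss = weighted_mse(arm_err, zeros)    # 5 / 27
hand_loss = weighted_mse(hand_err, zeros)  # 1 / 27
```

With these weights, an identical error on an arm dimension costs five times as much as on a hand dimension, which is the stated intent of the weighting.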

Loss curve (train)

| Step | Variant | Loss | LR |
| --- | --- | --- | --- |
| 100 | original (uniform) | 1.99 | 1.0e-5 |
| 1 000 | original (uniform) | 0.05 | 1.0e-4 |
| 5 000 | original (uniform) | 0.014 | 9.2e-5 |
| 10 000 | original (uniform) | 0.008 | 6.8e-5 |
| 15 000 | original (uniform) | 0.004 | 3.7e-5 |
| 20 000 | original (uniform) | 0.002 | 1.0e-5 |
| 25 000 | original (uniform) | 0.001 | ~0 |
| 17 500 | resampled (90/2/8) (resumed from 15K) | 0.003 | 2.3e-5 |
| 20 000 | resampled (90/2/8) | 0.002 | 9.4e-6 |
| 22 500 | resampled (90/2/8) | 0.001 | 3.1e-6 |
| 25 000 | resampled (90/2/8), repo root | 0.001 | ~0 |
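
The LR column is roughly what a linear-warmup plus cosine-decay schedule produces. The sketch below assumes a common form of that schedule (linear ramp from 0 over 500 steps, then a half cosine down to 0 at step 25 000). The scheduler lerobot actually uses may differ in details, and lr_at is a hypothetical helper.

```python
import math

PEAK_LR = 1e-4
WARMUP_STEPS = 500
TOTAL_STEPS = 25_000


def lr_at(step):
    """Learning rate at a given optimizer step under warmup + cosine decay."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the post-warmup schedule completed, in [0, 1].
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (1_000, 5_000, 10_000, 15_000, 20_000, 25_000):
    print(f"{step:>6d}  {lr_at(step):.1e}")
```

Under these assumptions the sketch gives about 9.2e-5 at step 5 000 and about 1.0e-5 at step 20 000, in line with the table. The warmup row at step 100 does not match this form exactly, so treat the sketch as an approximation of the logged values.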

Loss values are MSE-on-noise (epsilon prediction); a single pass through the dataset gives a noisy estimate, so the resampled trajectory looks similar to the original on this aggregate metric. The intended difference is sample coverage, not headline loss: going by the sampling shares listed for the two variants, real frames are drawn about 3× more often overall (10 % vs ~3.1 %) and realB frames about 5.7× more often (8 % vs 1.4 %). Eval on real-data trajectories is the meaningful comparison.
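
The change in sample coverage follows from the sampling shares of the two variants. A small arithmetic sketch, using only the percentages quoted in this card (the variable names are illustrative):

```python
# Share of drawn samples per bucket, in percent.
uniform = {"sim": 96.9, "realA": 1.7, "realB": 1.4}    # uniform-by-frame
resampled = {"sim": 90.0, "realA": 2.0, "realB": 8.0}  # weighted sampler

# Factor by which resampling changes how often each bucket is drawn.
ratio = {name: resampled[name] / uniform[name] for name in uniform}

# Same comparison for real data as a whole (realA + realB).
real_uniform = uniform["realA"] + uniform["realB"]
real_resampled = resampled["realA"] + resampled["realB"]
real_ratio = real_resampled / real_uniform
```

This works out to roughly 5.7× for realB, 1.2× for realA, and 3.2× for real data overall.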

Reproducing the training run on another machine

Environment

| Package | Version |
| --- | --- |
| Python | 3.12 |
| lerobot | 0.5.1 |
| torch | 2.10.0 (cu128) |
| torchvision | 0.25.0 (cu128) |
| transformers | ≥ 4.51, < 5 (HF DINOv3 module) |
| huggingface_hub | ≥ 0.36 |

uv venv /venv/main --python 3.12
VIRTUAL_ENV=/venv/main uv pip install \
    "torch==2.10.0" "torchvision==0.25.0" \
    "lerobot==0.5.1" "transformers>=4.51,<5" "huggingface_hub>=0.36"
huggingface-cli login   # token must have accepted the DINOv3 gate

GPU: 1× A100-80GB used here. With HF's DINOv3 implementation, batch 128 is the practical maximum on 80 GB; smaller GPUs need a smaller batch.
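
If a smaller batch is needed, one simple option (an assumption here, not something this run tested) is to scale the step count so the total number of samples drawn stays the same. The helpers below are hypothetical and only do that arithmetic, using the batch size, step count, and frame count quoted in this card.

```python
import math

FRAMES = 37_304
REF_BATCH = 128
REF_STEPS = 25_000


def epochs(steps, batch, frames=FRAMES):
    """Number of passes over the dataset for a given step count and batch size."""
    return steps * batch / frames


def steps_for_same_samples(batch, ref_batch=REF_BATCH, ref_steps=REF_STEPS):
    """Steps needed at `batch` to draw as many samples as the reference run."""
    return math.ceil(ref_steps * ref_batch / batch)


ref_epochs = epochs(REF_STEPS, REF_BATCH)  # about 85.8 passes
steps_b64 = steps_for_same_samples(64)     # twice the reference step count
steps_b32 = steps_for_same_samples(32)     # four times the reference step count
```

The learning rate, warmup length, and schedule length would likely need adjusting along with the step count; this card does not document such a run.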

Train from scratch

huggingface-cli download yianW/diffusion-dinov3-grasp03-25k \
    lerobot_vit_patch.py train_diffusion_vit.py --local-dir .
/venv/main/bin/python train_diffusion_vit.py

train_diffusion_vit.py is self-contained: it builds the TrainPipelineConfig programmatically (dataset, DINOv3 backbone, augmentation, crop/resize, per-dim loss weighting) and calls lerobot's training loop. Edit build_train_config() / build_policy_config() to change steps, batch size, learning rate, etc. To switch to torchvision ViT-B/16 with ImageNet pretraining, set VISION_BACKBONE = "vit_b_16" and PRETRAINED = "IMAGENET1K_V1" at the top of the file; the patch supports both.

Fine-tune from this 25K checkpoint

Pass --policy.path=yianW/diffusion-dinov3-grasp03-25k to lerobot's CLI, or modify build_policy_config() to use from_pretrained. The patch must still be applied first (import lerobot_vit_patch).

Resume training (with optimizer/RNG state)

The optimizer / scheduler / RNG state needed to resume training mid-run (training_state/, ~2.6 GB) is not uploaded here; only inference weights are. Ask if you need it.

Caveats

  • Loss values are MSE-on-noise (epsilon prediction) and are not directly comparable to action-MSE.
  • The patch is a runtime monkey-patch, not a fork of lerobot. Pin lerobot to the version the model was trained against (0.5.1, per the environment table); at minimum, the installed version must provide lerobot.policies.diffusion.modeling_diffusion.DiffusionRgbEncoder.
  • transformers 5.x broke lerobot's groot policy import, so pin to transformers>=4.51,<5.