Diffusion Policy with DINOv3 ViT-B/16 backbone — grasp03 (final 25K checkpoint)

Diffusion Policy trained on yianW/grasp03-sim-real-v2. The ResNet image backbone in lerobot's DiffusionPolicy is replaced by DINOv3 ViT-B/16, initialized from Meta's LVD-1689M self-supervised pretraining (facebook/dinov3-vitb16-pretrain-lvd1689m).

Two training variants in this repo

Variant	Where	Sampling	Steps
Resampled (90 / 2 / 8) ← repo root	repo root (final 25 K), and `checkpoints/{017500,020000,022500}/` (intermediates)	sim eps 0…393 → 90 %, realA eps 394…399 → 2 %, realB eps 400…407 → 8 % via a `WeightedRandomSampler`	resumed from 15 K → 25 K (+10 K)
Original (uniform-by-frame)	`checkpoints/015000/` and `checkpoints/025000_uniform/`	Each frame equally likely → ~96.9 % sim / 1.7 % realA / 1.4 % realB	25 000 from scratch

Default load (resampled 25 K, the headline checkpoint). Import lerobot_vit_patch first, as described under "How to load":

policy = DiffusionPolicy.from_pretrained("yianW/diffusion-dinov3-grasp03-25k")

Load a non-root checkpoint with the subfolder argument, e.g. the original-uniform 25 K:

policy = DiffusionPolicy.from_pretrained(
    "yianW/diffusion-dinov3-grasp03-25k", subfolder="checkpoints/025000_uniform"
)

Original-variant intermediate checkpoints at 5 K / 10 K / 20 K are kept locally on the training host (not uploaded); ask if you need any.

⚠️ Before you can load this model

The DINOv3 backbone weights are gated on Meta's HF mirror. The HF account you use must have accepted the gate at https://huggingface.co/facebook/dinov3-vitb16-pretrain-lvd1689m (one-time click-through). Then run huggingface-cli login so that the account's token is stored on disk.
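A quick pre-flight check can confirm that the stored token actually has access before a long load or training job starts. The sketch below is hypothetical: the helper name `check_gate_access` is not part of this repo, and it assumes the gated repo contains a `config.json`, as transformers-format repos normally do. It tries to fetch that one small file and reports any failure.

```python
def check_gate_access(repo_id="facebook/dinov3-vitb16-pretrain-lvd1689m", download=None):
    """Return (ok, message) after trying to fetch one small file from a gated repo.

    `download` defaults to huggingface_hub.hf_hub_download; it is a parameter
    only so the check can be exercised without network access.
    """
    try:
        if download is None:
            from huggingface_hub import hf_hub_download as download
        # Assumption: the repo ships a config.json (typical for transformers repos).
        download(repo_id, "config.json")
    except Exception as exc:  # gate not accepted, missing token, no network, ...
        return False, f"{type(exc).__name__}: {exc}"
    return True, "access OK"
```

Calling `check_gate_access()` with no arguments performs the real download attempt; a `False` result carries the exception type, which usually distinguishes a missing login from an unaccepted gate.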

Reproducing the resampled variant

Apply the ViT patch (lerobot_vit_patch.py).

Apply the weighted-sampler patch and define buckets:

import lerobot_weighted_sampler as lws
lws.set_buckets({
    "sim":   (range(0, 394),   0.90),
    "realA": (range(394, 400), 0.02),
    "realB": (range(400, 408), 0.08),
})
lws.apply()

Run train_resume_15k_weighted.py against the 15 K checkpoint (set RUN_DIR to your local 15 K dir).
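How bucket probabilities turn into per-frame sampling weights can be sketched as follows. This is a minimal illustration, not the code in `lerobot_weighted_sampler`; the helper `frame_weights` and the toy episode lengths are hypothetical. Each frame's weight is its bucket's target probability divided by the number of frames in that bucket, so the weights inside a bucket sum to the bucket's share.

```python
def frame_weights(episode_of_frame, buckets):
    """Per-frame sampling weights such that each bucket receives its target share.

    episode_of_frame: list with the episode index of every frame in the dataset.
    buckets: {name: (range_of_episode_indices, target_probability)}.
    """
    # Assign every frame to a bucket and count the frames per bucket.
    counts = {name: 0 for name in buckets}
    bucket_of_frame = []
    for ep in episode_of_frame:
        # Raises StopIteration if an episode belongs to no bucket.
        name = next(n for n, (eps, _) in buckets.items() if ep in eps)
        counts[name] += 1
        bucket_of_frame.append(name)
    # Spread each bucket's probability evenly over its frames.
    return [buckets[name][1] / counts[name] for name in bucket_of_frame]


buckets = {
    "sim":   (range(0, 394),   0.90),
    "realA": (range(394, 400), 0.02),
    "realB": (range(400, 408), 0.08),
}
# Toy dataset with made-up lengths: every episode has 10 frames.
episode_of_frame = [ep for ep in range(408) for _ in range(10)]
weights = frame_weights(episode_of_frame, buckets)
print(round(sum(weights), 6))  # total probability mass across all frames
```

Weights of this form can be passed to `torch.utils.data.WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)`, which draws frame indices in proportion to them.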

What's in this repo

File	Purpose
`config.json`, `model.safetensors`, `train_config.json`, `policy_*.{json,safetensors}`	Standard lerobot pretrained-model files for the diffusion policy. Note: `model.safetensors` does not include the DINOv3 backbone weights — those are downloaded fresh from the gated HF mirror at load time and never re-saved.
`lerobot_vit_patch.py`	Monkey-patches lerobot to swap `DiffusionRgbEncoder` for a ViT (torchvision or DINOv3) and to accept `vision_backbone="dinov3_*"` in `DiffusionConfig`. Must be imported before loading.
`train_diffusion_vit.py`	Self-contained training script: builds `TrainPipelineConfig`, applies the per-dim loss weighting, calls `lerobot.scripts.lerobot_train.train`. Reference for retraining or fine-tuning.

How to load

import lerobot_vit_patch  # noqa: F401  -- swaps in DINOv3 ViT-B/16
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy

policy = DiffusionPolicy.from_pretrained("yianW/diffusion-dinov3-grasp03-25k")
policy.eval()

If lerobot_vit_patch.py isn't on your PYTHONPATH yet, fetch it from this repo first:

import importlib
import sys
from pathlib import Path

from huggingface_hub import hf_hub_download

# Download the patch file and put its parent directory on sys.path.
patch_path = hf_hub_download(
    "yianW/diffusion-dinov3-grasp03-25k", "lerobot_vit_patch.py", repo_type="model"
)
sys.path.insert(0, str(Path(patch_path).parent))  # portable, unlike splitting on "/"
importlib.import_module("lerobot_vit_patch")  # apply the patch

from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy
policy = DiffusionPolicy.from_pretrained("yianW/diffusion-dinov3-grasp03-25k")

The DINOv3 backbone weights are then pulled from facebook/dinov3-vitb16-pretrain-lvd1689m automatically on first call (cached under ~/.cache/huggingface/hub/).

Training setup (key choices)

Backbone: dinov3_vitb16, LVD-1689M self-supervised pretraining, 86 M params. Loaded via transformers.AutoModel with attn_implementation="sdpa" (eager attention OOMs at large batch).
Image pipeline: 480×640 → resize 240×320 → 224×224 random crop (train) / center crop (eval) → ImageNet normalization → DINOv3 ViT-B/16. Output: CLS token (pooler_output) → Linear(768 → 64) → ReLU. The 64-D feature matches 2 × spatial_softmax_num_keypoints so the Unet's global_cond_dim is unchanged from the upstream lerobot baseline.
Augmentation: lerobot's ImageTransforms with brightness, contrast, saturation, hue, sharpness, small affine — up to 3 sampled per frame.
Action loss weighting: the dataset has 27-dim actions split as 7 arm joints + 20 hand joints. Per-element MSE is multiplied by [5, 5, 5, 5, 5, 5, 5, 1, 1, ..., 1] so the arm contributes 5× compared to the hand. See _weighted_compute_loss in train_diffusion_vit.py.
Diffusion: DDPM, 100 train timesteps, prediction_type=epsilon, n_obs_steps=2, horizon=16, n_action_steps=8.
Optimizer: AdamW preset (lr 1e-4 peak, cosine decay to ~0, 500-step warmup), bf16 mixed precision via accelerate.
Batch / steps: batch 128, 25 000 steps (~85 epochs over 37 304 frames). We tried batch 256 and 192 first; both ran out of memory with HF's DINOv3 + sdpa even on an 80 GB A100 (DINOv3-via-transformers is ~2.5× more activation-hungry than torchvision's ViT-B/16 at the same batch size).
Hardware: 1× A100-80GB (peak ~63 GB at b=128).
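The action loss weighting can be sketched in plain Python as below. This is an illustration under stated assumptions rather than the `_weighted_compute_loss` implementation: it assumes the 7 arm joints are the first 7 action dimensions (as the weight vector suggests) and that the weighted squared errors are averaged over all elements. The function name `weighted_mse` is hypothetical.

```python
ARM_DIMS, HAND_DIMS = 7, 20
# 5x weight on the arm joints, 1x on the hand joints (27 dims in total).
WEIGHTS = [5.0] * ARM_DIMS + [1.0] * HAND_DIMS


def weighted_mse(pred, target, weights=WEIGHTS):
    """Mean of weights[d] * (pred[t][d] - target[t][d]) ** 2 over all steps t and dims d.

    pred, target: lists of per-timestep rows, each row holding one value per action dim.
    """
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t, w in zip(pred_row, target_row, weights):
            total += w * (p - t) ** 2
            count += 1
    return total / count


# Toy check over a 16-step horizon: a unit error on every dimension.
pred = [[1.0] * 27 for _ in range(16)]
target = [[0.0] * 27 for _ in range(16)]
print(weighted_mse(pred, target))
```

With epsilon prediction, `pred` and `target` would be the predicted and sampled noise rather than raw actions; the per-dimension weights apply in the same way.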

Loss curve (train)

Step	Variant	Loss	LR
100	original (uniform)	1.99	1.0e-5
1 000	original (uniform)	0.05	1.0e-4
5 000	original (uniform)	0.014	9.2e-5
10 000	original (uniform)	0.008	6.8e-5
15 000	original (uniform)	0.004	3.7e-5
20 000	original (uniform)	0.002	1.0e-5
25 000	original (uniform)	0.001	~0
17 500	resampled (90/2/8) (resumed from 15K)	0.003	2.3e-5
20 000	resampled (90/2/8)	0.002	9.4e-6
22 500	resampled (90/2/8)	0.001	3.1e-6
25 000	resampled (90/2/8) — repo root	0.001	~0

Loss values are MSE-on-noise (epsilon prediction); a single pass through the dataset gives a noisy estimate, so the resampled trajectory looks similar to the original on this aggregate metric. The intended difference is sample coverage, not headline loss: going by the sampling shares in the variants table, real frames rise from ~3.1 % to 10 % of draws (about 3× overall), and realB frames are drawn more than 5× as often. Evaluation on real-data trajectories is the meaningful comparison.
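The change in sample coverage can be estimated from the sampling shares in the variants table. The sketch below assumes the 15 K → 25 K phase draws 10 000 steps × 128 frames and that the listed shares hold exactly; it is back-of-the-envelope arithmetic, not a logged measurement.

```python
BATCH_SIZE = 128
RESUMED_STEPS = 10_000  # 15 K -> 25 K

# Sampling shares per source, taken from the variants table.
uniform = {"sim": 0.969, "realA": 0.017, "realB": 0.014}
resampled = {"sim": 0.90, "realA": 0.02, "realB": 0.08}

draws = BATCH_SIZE * RESUMED_STEPS  # frames drawn during the 15 K -> 25 K phase


def expected_draws(shares):
    """Expected number of frames drawn from each source."""
    return {name: share * draws for name, share in shares.items()}


before, after = expected_draws(uniform), expected_draws(resampled)
for name in uniform:
    print(f"{name}: {before[name]:,.0f} -> {after[name]:,.0f} "
          f"({after[name] / before[name]:.1f}x)")

# Ratio for realA and realB combined.
real_ratio = (resampled["realA"] + resampled["realB"]) / (uniform["realA"] + uniform["realB"])
print(f"all real frames: {real_ratio:.1f}x")
```

Under these assumptions the gain is concentrated in realB, while realA changes only slightly and sim drops a little.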

Reproducing the training run on another machine

Environment

Package	Version
Python	3.12
`lerobot`	0.5.1
`torch`	2.10.0 (cu128)
`torchvision`	0.25.0 (cu128)
`transformers`	≥ 4.51, < 5 (HF DINOv3 module)
`huggingface_hub`	≥ 0.36

uv venv /venv/main --python 3.12
VIRTUAL_ENV=/venv/main uv pip install \
    "torch==2.10.0" "torchvision==0.25.0" \
    "lerobot==0.5.1" "transformers>=4.51,<5" "huggingface_hub>=0.36"
huggingface-cli login   # token must have accepted the DINOv3 gate

GPU: 1× A100-80GB used here. With HF's DINOv3 implementation, batch 128 is the practical maximum on 80 GB; smaller GPUs need a smaller batch.
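When the batch size has to shrink, the step count can be scaled to keep the number of frames seen roughly constant. The arithmetic below is a sketch: `steps_for_same_coverage` is a hypothetical helper, and holding total frames constant is only one possible heuristic (the learning-rate schedule may also need adjusting).

```python
import math

NUM_FRAMES = 37_304  # dataset size quoted in the training setup
REF_BATCH, REF_STEPS = 128, 25_000  # the run described in this card


def epochs(batch_size, steps, num_frames=NUM_FRAMES):
    """Number of passes over the dataset for a given batch size and step count."""
    return batch_size * steps / num_frames


def steps_for_same_coverage(batch_size, ref_batch=REF_BATCH, ref_steps=REF_STEPS):
    """Steps needed at `batch_size` to draw at least as many frames as the reference run."""
    return math.ceil(ref_batch * ref_steps / batch_size)


print(f"{epochs(REF_BATCH, REF_STEPS):.1f} epochs at batch 128")
print(steps_for_same_coverage(64), "steps at batch 64")
```

The step count would then be set in build_train_config() alongside the smaller batch size.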

Train from scratch

huggingface-cli download yianW/diffusion-dinov3-grasp03-25k \
    lerobot_vit_patch.py train_diffusion_vit.py --local-dir .
/venv/main/bin/python train_diffusion_vit.py

train_diffusion_vit.py is self-contained — it builds the TrainPipelineConfig programmatically (dataset, DINOv3 backbone, augmentation, crop/resize, per-dim loss weighting) and calls lerobot's training loop. Edit build_train_config() / build_policy_config() to change steps, batch size, learning rate, etc. To switch to torchvision ViT-B/16 with ImageNet pretraining, set VISION_BACKBONE = "vit_b_16" and PRETRAINED = "IMAGENET1K_V1" at the top of the file — the patch supports both.

Fine-tune from this 25K checkpoint

Pass --policy.path=yianW/diffusion-dinov3-grasp03-25k to lerobot's CLI, or modify build_policy_config() to use from_pretrained. The patch must still be applied first (import lerobot_vit_patch).

Resume training (with optimizer/RNG state)

The optimizer / scheduler / RNG state needed to resume training mid-run (training_state/, ~2.6 GB) is not uploaded here — only inference weights. Ask if you need it.

Caveats

Loss values are MSE-on-noise (epsilon prediction) and are not directly comparable to action-MSE.
The patch is a runtime monkey-patch, not a fork of lerobot. Pin lerobot to the version used for training (0.5.1); at a minimum, the installed version must provide lerobot.policies.diffusion.modeling_diffusion.DiffusionRgbEncoder.
transformers 5.x broke lerobot's groot policy import — pin to transformers>=4.51,<5.
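The transformers pin can be checked at start-up, before the patch is imported. The helper below is a hypothetical sketch using only the standard library; the bounds mirror the environment table, and the simple parser ignores pre-release and local suffixes such as `+cu128`.

```python
import re
from importlib import metadata


def parse_version(text):
    """Leading numeric components of a version string, e.g. '2.10.0+cu128' -> (2, 10, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"unrecognised version: {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def transformers_ok(version_text):
    """True if the version satisfies the pin >= 4.51, < 5."""
    return (4, 51) <= parse_version(version_text) < (5,)


def check_environment():
    """Raise RuntimeError if the installed transformers violates the pin."""
    installed = metadata.version("transformers")
    if not transformers_ok(installed):
        raise RuntimeError(f"transformers {installed} is outside >=4.51,<5")
```

Calling `check_environment()` at the top of a training or inference script fails fast with a clear message instead of a later import error.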
