Tags: Robotics · LeRobot · Safetensors · diffusion-policy · vit

Diffusion Policy with ViT-B/16 backbone (grasp03, 20K-step checkpoint)

Diffusion Policy trained on yianW/grasp03-sim-real-halvedreal, with the ResNet image backbone in lerobot's `DiffusionPolicy` swapped for a torchvision ViT-B/16 initialized from ImageNet weights.

This is the checkpoint at training step 20,000 of a planned 25,000.

What's in this repo

| File | Purpose |
| --- | --- |
| `config.json`, `model.safetensors`, `train_config.json`, `policy_*.{json,safetensors}` | Standard lerobot pretrained-model files. |
| `lerobot_vit_patch.py` | Monkey-patches lerobot to swap `DiffusionRgbEncoder` for a ViT and to accept `vision_backbone="vit_*"` in `DiffusionConfig`. Must be imported before loading. |
| `train_diffusion_vit.py` | Self-contained training script: builds `TrainPipelineConfig`, applies the per-dim loss weighting, and calls `lerobot.scripts.lerobot_train.train`. Use this to retrain or fine-tune. |

How to load

```python
import lerobot_vit_patch  # noqa: F401  (applies the ViT swap)
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy

policy = DiffusionPolicy.from_pretrained("yianW/diffusion-vit-grasp03-20k")
policy.eval()
```

`lerobot_vit_patch.py` must be importable before `from_pretrained` runs. The simplest way is to download it alongside the weights and put its directory on `sys.path`, e.g.:

```python
import importlib
import os
import sys

from huggingface_hub import hf_hub_download

patch_path = hf_hub_download(
    "yianW/diffusion-vit-grasp03-20k", "lerobot_vit_patch.py", repo_type="model"
)
sys.path.insert(0, os.path.dirname(patch_path))
importlib.import_module("lerobot_vit_patch")  # apply the ViT patch

from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy

policy = DiffusionPolicy.from_pretrained("yianW/diffusion-vit-grasp03-20k")
```

Training setup (key choices)

  • Backbone: vit_b_16, ImageNet pretrained (IMAGENET1K_V1), 86 M params.
  • Image pipeline: 480×640 → resize to 240×320 → 224×224 random crop (train) / center crop (eval) → ImageNet normalization → ViT-B/16. Output: CLS token → Linear(768 → 64) → ReLU. The 64-D feature matches the original 2 × spatial_softmax_num_keypoints, so the U-Net's global_cond_dim is unchanged.
  • Augmentation: lerobot's ImageTransforms with brightness, contrast, saturation, hue, sharpness, and a small affine, with up to 3 transforms sampled per frame.
  • Action loss weighting: the dataset has 27-dim actions split as 7 arm joints + 20 hand joints. The per-element MSE is multiplied by [5, 5, 5, 5, 5, 5, 5, 1, 1, ..., 1], so each arm dimension contributes 5× as much as each hand dimension. See _weighted_compute_loss in train_diffusion_vit.py.
  • Diffusion: DDPM, 100 train timesteps, prediction_type=epsilon, n_obs_steps=2, horizon=16, n_action_steps=8.
  • Optimizer: AdamW preset (peak lr 1e-4, cosine decay, 500-step warmup), bf16 mixed precision via accelerate.
  • Batch / steps: batch 128, 25,000 steps planned (this checkpoint is step 20,000, ≈ 68 epochs through 37,304 frames).
  • Hardware: 1× A100-80GB.
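The per-dimension weighting can be sketched as below. This is a minimal illustration of the idea, not the actual `_weighted_compute_loss` from `train_diffusion_vit.py`; tensor shapes and the averaging convention are assumptions.

```python
import torch

# 27-dim actions: 7 arm joints weighted 5x, 20 hand joints weighted 1x.
ARM_DIMS, HAND_DIMS = 7, 20
weights = torch.cat([torch.full((ARM_DIMS,), 5.0), torch.ones(HAND_DIMS)])

def weighted_mse(pred_eps, target_eps):
    # Per-element squared error, scaled per action dimension, then averaged.
    per_elem = (pred_eps - target_eps) ** 2  # shape (..., 27)
    return (per_elem * weights).mean()
```

With equal error on every dimension, the arm's 7 dims contribute 35 of the 55 total weight, i.e. roughly 64% of the loss despite being a quarter of the action vector.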

Loss curve (train)

| Step | Loss | LR |
| --- | --- | --- |
| 100 | 1.98 | 1.0e-5 |
| 1,000 | 0.05 | 1.0e-4 |
| 5,000 | 0.014 | 9.2e-5 |
| 10,000 | 0.008 | 6.9e-5 |
| 15,000 | 0.005 | 3.7e-5 |
| 20,000 | 0.002 | 1.1e-5 |
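The LR column is roughly what a linear-warmup + cosine-decay schedule produces for peak 1e-4, 500 warmup steps, and 25,000 total steps. The sketch below is an assumption about the schedule's exact form, not lerobot's scheduler implementation, and reproduces the table values only approximately.

```python
import math

PEAK_LR, WARMUP, TOTAL = 1e-4, 500, 25_000

def lr_at(step):
    # Linear warmup from 0 to the peak over the first WARMUP steps...
    if step < WARMUP:
        return PEAK_LR * step / WARMUP
    # ...then cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - WARMUP) / (TOTAL - WARMUP)
    return PEAK_LR * 0.5 * (1 + math.cos(math.pi * progress))
```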

Reproducing the training run on another machine

Environment

Pinned versions used for this checkpoint:

| Package | Version |
| --- | --- |
| Python | 3.12 |
| lerobot | 0.5.1 |
| torch | 2.10.0 (cu128) |
| torchvision | 0.25.0 (cu128) |
| huggingface_hub | 1.8.0 |

Setup with uv:

```shell
uv venv /venv/main --python 3.12
VIRTUAL_ENV=/venv/main uv pip install \
    "torch==2.10.0" "torchvision==0.25.0" \
    "lerobot==0.5.1" "huggingface_hub>=1.0"
```

(GPU: 1× A100-80GB was used here; any A100-class card with ≥40 GB should work with batch_size=64. On 80 GB use batch_size=128.)

Train from scratch

Pull the patch + training script from this repo, drop them in your working directory, then launch:

```shell
huggingface-cli download yianW/diffusion-vit-grasp03-20k \
    lerobot_vit_patch.py train_diffusion_vit.py \
    --local-dir .
/venv/main/bin/python train_diffusion_vit.py
```

`train_diffusion_vit.py` is self-contained: it builds the `TrainPipelineConfig` programmatically (dataset, ViT backbone, augmentation, crop/resize, per-dim loss weighting) and calls lerobot's training loop. Edit the `build_train_config()` / `build_policy_config()` functions in that file to change steps, batch size, lr, etc.
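One lightweight way to tweak such a config builder is an override helper like the hypothetical sketch below. The field names (`steps`, `batch_size`, `optimizer_lr`) are assumptions for illustration; the actual attributes are defined in `build_train_config()` inside `train_diffusion_vit.py`.

```python
# Hypothetical overrides; check train_diffusion_vit.py for the real fields.
OVERRIDES = {
    "steps": 25_000,       # total optimization steps
    "batch_size": 128,     # drop to 64 on a 40 GB card
    "optimizer_lr": 1e-4,  # peak AdamW learning rate
}

def apply_overrides(cfg, overrides=OVERRIDES):
    # Set only the attributes that exist on the config object; report them.
    applied = {}
    for key, value in overrides.items():
        if hasattr(cfg, key):
            setattr(cfg, key, value)
            applied[key] = value
    return applied
```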

Fine-tune from this 20K checkpoint

Pass --policy.path=yianW/diffusion-vit-grasp03-20k to lerobot's training CLI, or modify `build_policy_config()` to load the policy with `from_pretrained`. Either way, the patch must still be applied first.

Resume training (with optimizer/RNG state)

The optimizer / scheduler / RNG state needed to resume training mid-run (training_state/, ~2.6 GB) is not uploaded here; only the inference weights are. If you need it for an exact resume, ask and it can be added.

Caveats

  • Loss values are MSE on the predicted noise (epsilon prediction) and are not directly comparable to action-space MSE.
  • The patch is a runtime monkey-patch, not a fork of lerobot. Pin lerobot to the version trained against (0.5.1), or at minimum a version that still exposes lerobot.policies.diffusion.modeling_diffusion.DiffusionRgbEncoder.