Tags: Robotics · LeRobot · Safetensors · diffusion-policy · vit

Diffusion Policy with ViT-B/16 backbone (grasp03, 20K-step checkpoint)

Diffusion Policy trained on yianW/grasp03-sim-real-halvedreal, with the ResNet image backbone in lerobot's `DiffusionPolicy` swapped for a torchvision ViT-B/16 initialized from ImageNet weights.

This is the checkpoint at training step 20,000 of a planned 25,000.

What's in this repo

| File | Purpose |
| --- | --- |
| `config.json`, `model.safetensors`, `train_config.json`, `policy_*.{json,safetensors}` | Standard lerobot pretrained-model files. |
| `lerobot_vit_patch.py` | Monkey-patches lerobot to swap `DiffusionRgbEncoder` for a ViT and to accept `vision_backbone="vit_*"` in `DiffusionConfig`. Must be imported before loading. |
| `train_diffusion_vit.py` | Self-contained training script: builds `TrainPipelineConfig`, applies the per-dim loss weighting, and calls `lerobot.scripts.lerobot_train.train`. Use this to retrain or fine-tune. |

How to load

```python
import lerobot_vit_patch  # noqa: F401  (applies the ViT swap)
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy

policy = DiffusionPolicy.from_pretrained("yianW/diffusion-vit-grasp03-20k")
policy.eval()
```

`lerobot_vit_patch.py` must be importable before `from_pretrained` runs. The simplest way is to download it alongside the weights and put its directory on `sys.path`, e.g.:

```python
import importlib
import os
import sys

from huggingface_hub import hf_hub_download

patch_path = hf_hub_download(
    "yianW/diffusion-vit-grasp03-20k", "lerobot_vit_patch.py", repo_type="model"
)
sys.path.insert(0, os.path.dirname(patch_path))
importlib.import_module("lerobot_vit_patch")  # apply the ViT patch

from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy

policy = DiffusionPolicy.from_pretrained("yianW/diffusion-vit-grasp03-20k")
```

Training setup (key choices)

  • Backbone: vit_b_16, ImageNet pretrained (IMAGENET1K_V1), 86 M params.
  • Image pipeline: 480×640 → resize to 240×320 → 224×224 random crop (train) / center crop (eval) → ImageNet normalization → ViT-B/16. Output: CLS token → Linear(768 → 64) → ReLU. The 64-D feature matches the original 2 × spatial_softmax_num_keypoints, so the U-Net's global_cond_dim is unchanged.
  • Augmentation: lerobot's ImageTransforms with brightness, contrast, saturation, hue, sharpness, and a small affine, with up to 3 transforms sampled per frame.
  • Action loss weighting: the dataset has 27-dim actions split as 7 arm joints + 20 hand joints. The per-element MSE is multiplied by [5, 5, 5, 5, 5, 5, 5, 1, 1, ..., 1], so each arm dimension contributes 5× as much as each hand dimension. See _weighted_compute_loss in train_diffusion_vit.py.
  • Diffusion: DDPM, 100 train timesteps, prediction_type=epsilon, n_obs_steps=2, horizon=16, n_action_steps=8.
  • Optimizer: AdamW preset (peak lr 1e-4, cosine decay, 500-step warmup), bf16 mixed precision via accelerate.
  • Batch / steps: batch 128, 25,000 steps planned (this checkpoint is step 20,000, ≈ 68 epochs through 37,304 frames).
  • Hardware: 1× A100-80GB.
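The per-dimension weighting can be sketched as below. This is a minimal illustration of the idea, not the actual `_weighted_compute_loss` from `train_diffusion_vit.py`; tensor shapes and the averaging convention are assumptions.

```python
import torch

# 27-dim actions: 7 arm joints weighted 5x, 20 hand joints weighted 1x.
ARM_DIMS, HAND_DIMS = 7, 20
weights = torch.cat([torch.full((ARM_DIMS,), 5.0), torch.ones(HAND_DIMS)])

def weighted_mse(pred_eps, target_eps):
    # Per-element squared error, scaled per action dimension, then averaged.
    per_elem = (pred_eps - target_eps) ** 2  # shape (..., 27)
    return (per_elem * weights).mean()
```

With equal error on every dimension, the arm's 7 dims contribute 35 of the 55 total weight, i.e. roughly 64% of the loss despite being a quarter of the action vector.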

Loss curve (train)

| Step | Loss | LR |
| --- | --- | --- |
| 100 | 1.98 | 1.0e-5 |
| 1,000 | 0.05 | 1.0e-4 |
| 5,000 | 0.014 | 9.2e-5 |
| 10,000 | 0.008 | 6.9e-5 |
| 15,000 | 0.005 | 3.7e-5 |
| 20,000 | 0.002 | 1.1e-5 |
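The LR column is roughly what a linear-warmup + cosine-decay schedule produces for peak 1e-4, 500 warmup steps, and 25,000 total steps. The sketch below is an assumption about the schedule's exact form, not lerobot's scheduler implementation, and reproduces the table values only approximately.

```python
import math

PEAK_LR, WARMUP, TOTAL = 1e-4, 500, 25_000

def lr_at(step):
    # Linear warmup from 0 to the peak over the first WARMUP steps...
    if step < WARMUP:
        return PEAK_LR * step / WARMUP
    # ...then cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - WARMUP) / (TOTAL - WARMUP)
    return PEAK_LR * 0.5 * (1 + math.cos(math.pi * progress))
```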

Reproducing the training run on another machine

Environment

Pinned versions used for this checkpoint:

| Package | Version |
| --- | --- |
| Python | 3.12 |
| lerobot | 0.5.1 |
| torch | 2.10.0 (cu128) |
| torchvision | 0.25.0 (cu128) |
| huggingface_hub | 1.8.0 |

Setup with uv:

```shell
uv venv /venv/main --python 3.12
VIRTUAL_ENV=/venv/main uv pip install \
    "torch==2.10.0" "torchvision==0.25.0" \
    "lerobot==0.5.1" "huggingface_hub>=1.0"
```

(GPU: 1× A100-80GB was used here; any A100-class card with ≥40 GB should work with batch_size=64. On 80 GB use batch_size=128.)

Train from scratch

Pull the patch + training script from this repo, drop them in your working directory, then launch:

```shell
huggingface-cli download yianW/diffusion-vit-grasp03-20k \
    lerobot_vit_patch.py train_diffusion_vit.py \
    --local-dir .
/venv/main/bin/python train_diffusion_vit.py
```

`train_diffusion_vit.py` is self-contained: it builds the `TrainPipelineConfig` programmatically (dataset, ViT backbone, augmentation, crop/resize, per-dim loss weighting) and calls lerobot's training loop. Edit the `build_train_config()` / `build_policy_config()` functions in that file to change steps, batch size, lr, etc.
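One lightweight way to tweak such a config builder is an override helper like the hypothetical sketch below. The field names (`steps`, `batch_size`, `optimizer_lr`) are assumptions for illustration; the actual attributes are defined in `build_train_config()` inside `train_diffusion_vit.py`.

```python
# Hypothetical overrides; check train_diffusion_vit.py for the real fields.
OVERRIDES = {
    "steps": 25_000,       # total optimization steps
    "batch_size": 128,     # drop to 64 on a 40 GB card
    "optimizer_lr": 1e-4,  # peak AdamW learning rate
}

def apply_overrides(cfg, overrides=OVERRIDES):
    # Set only the attributes that exist on the config object; report them.
    applied = {}
    for key, value in overrides.items():
        if hasattr(cfg, key):
            setattr(cfg, key, value)
            applied[key] = value
    return applied
```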

Fine-tune from this 20K checkpoint

Pass --policy.path=yianW/diffusion-vit-grasp03-20k to lerobot's training CLI, or modify `build_policy_config()` to load the policy with `from_pretrained`. Either way, the patch must still be applied first.

Resume training (with optimizer/RNG state)

The optimizer / scheduler / RNG state needed to resume training mid-run (training_state/, ~2.6 GB) is not uploaded here; only the inference weights are. If you need it for an exact resume, ask and it can be added.

Caveats

  • Loss values are MSE on the predicted noise (epsilon prediction) and are not directly comparable to action-space MSE.
  • The patch is a runtime monkey-patch, not a fork of lerobot. Pin lerobot to the version trained against (0.5.1), or at minimum a version that still exposes lerobot.policies.diffusion.modeling_diffusion.DiffusionRgbEncoder.