# Diffusion Policy with ViT-B/16 backbone – grasp03 (20K-step checkpoint)

Diffusion Policy trained on `yianW/grasp03-sim-real-halvedreal` with the ResNet image backbone in lerobot's `DiffusionPolicy` swapped for a torchvision ViT-B/16 initialized from ImageNet weights. This is the checkpoint at training step 20 000 / 25 000.
## What's in this repo

| File | Purpose |
|---|---|
| `config.json`, `model.safetensors`, `train_config.json`, `policy_*.{json,safetensors}` | Standard lerobot pretrained-model files. |
| `lerobot_vit_patch.py` | Monkey-patches lerobot to swap `DiffusionRgbEncoder` for a ViT and to accept `vision_backbone="vit_*"` in `DiffusionConfig`. Must be imported before loading. |
| `train_diffusion_vit.py` | Self-contained training script: builds `TrainPipelineConfig`, applies the per-dim loss weighting, and calls `lerobot.scripts.lerobot_train.train`. Use this to retrain or fine-tune. |
## How to load

```python
import lerobot_vit_patch  # noqa: F401 -- applies the ViT swap
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy

policy = DiffusionPolicy.from_pretrained("yianW/diffusion-vit-grasp03-20k")
policy.eval()
```

`lerobot_vit_patch.py` must be importable before `from_pretrained` runs. The simplest way is to download it alongside the weights and put it on `PYTHONPATH`, e.g.:

```python
from huggingface_hub import hf_hub_download
import sys, importlib

# Download the patch module and put its directory on the import path.
sys.path.insert(0, hf_hub_download(
    "yianW/diffusion-vit-grasp03-20k", "lerobot_vit_patch.py", repo_type="model"
).rsplit("/", 1)[0])
importlib.import_module("lerobot_vit_patch")  # apply patch

from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy
policy = DiffusionPolicy.from_pretrained("yianW/diffusion-vit-grasp03-20k")
```
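
Once loaded, inference goes through lerobot's standard `select_action` interface. A minimal sketch with dummy inputs; the observation key names and state dimension below are placeholders, so check `train_config.json` for the real feature names and shapes:

```python
import torch

policy.reset()  # clear the observation/action queues at episode start

# Placeholder keys and shapes: consult train_config.json for the actual
# observation features (camera names, state dim) this policy expects.
batch = {
    "observation.images.cam": torch.zeros(1, 3, 480, 640),  # RGB in [0, 1]
    "observation.state": torch.zeros(1, 27),
}
with torch.no_grad():
    action = policy.select_action(batch)  # one action per call
```

With `n_obs_steps=2` and `n_action_steps=8`, the policy buffers the last two observations internally and runs the diffusion sampler once per 8 environment steps.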
## Training setup (key choices)

- Backbone: `vit_b_16`, ImageNet pretrained (`IMAGENET1K_V1`), 86 M params.
- Image pipeline: 480×640 → resize 240×320 → 224×224 random crop (train) / center crop (eval) → ImageNet normalization → ViT-B/16. Output: CLS token → `Linear(768 → 64)` → ReLU. The 64-D feature matches the original `2 × spatial_softmax_num_keypoints`, so the UNet's `global_cond_dim` is unchanged (a sketch of this head follows the list).
- Augmentation: lerobot's `ImageTransforms` with brightness, contrast, saturation, hue, sharpness, and a small affine; up to 3 sampled per frame.
- Action loss weighting: the dataset has 27-dim actions split as 7 arm joints + 20 hand joints. Per-element MSE is multiplied by `[5, 5, 5, 5, 5, 5, 5, 1, 1, ..., 1]`, so the arm contributes 5× compared to the hand. See `_weighted_compute_loss` in `train_diffusion_vit.py` (a simplified sketch also follows the list).
- Diffusion: DDPM, 100 train timesteps, `prediction_type=epsilon`, `n_obs_steps=2`, `horizon=16`, `n_action_steps=8`.
- Optimizer: AdamW preset (lr 1e-4 peak, cosine decay, 500-step warmup), bf16 mixed precision via `accelerate`.
- Batch / steps: batch 128, 25 000 steps planned (this checkpoint is step 20 000, ≈ 68 epochs through 37 304 frames).
- Hardware: 1× A100-80GB.
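
For reference, the encoder head described above amounts to something like the following. This is a minimal sketch; the class name here is hypothetical, and the actual wiring lives in `lerobot_vit_patch.py`:

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

class ViTRgbEncoder(nn.Module):
    """ViT-B/16 backbone whose CLS token is projected to a 64-D feature,
    matching 2 * spatial_softmax_num_keypoints of the original ResNet encoder."""

    def __init__(self, feature_dim: int = 64):
        super().__init__()
        self.vit = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
        self.vit.heads = nn.Identity()  # expose the 768-D CLS embedding
        self.out = nn.Sequential(nn.Linear(768, feature_dim), nn.ReLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, 224, 224), ImageNet-normalized crops
        return self.out(self.vit(x))  # (B, 64)
```

And the per-dim loss weighting reduces to a weighted MSE on the predicted noise (again a sketch; the real version is `_weighted_compute_loss` in `train_diffusion_vit.py`):

```python
# Dummy tensors standing in for the predicted and true noise.
eps_pred = torch.randn(128, 16, 27)    # (batch, horizon, action_dim)
eps_target = torch.randn(128, 16, 27)

weights = torch.tensor([5.0] * 7 + [1.0] * 20)           # 7 arm + 20 hand dims
loss = ((eps_pred - eps_target) ** 2 * weights).mean()   # weighted MSE on noise
```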
## Loss curve (train)
| Step | Loss | LR |
|---|---|---|
| 100 | 1.98 | 1.0e-5 |
| 1 000 | 0.05 | 1.0e-4 |
| 5 000 | 0.014 | 9.2e-5 |
| 10 000 | 0.008 | 6.9e-5 |
| 15 000 | 0.005 | 3.7e-5 |
| 20 000 | 0.002 | 1.1e-5 |
## Reproducing the training run on another machine

### Environment

Pinned versions used for this checkpoint:

| Package | Version |
|---|---|
| Python | 3.12 |
| `lerobot` | 0.5.1 |
| `torch` | 2.10.0 (cu128) |
| `torchvision` | 0.25.0 (cu128) |
| `huggingface_hub` | 1.8.0 |
Setup with uv:

```bash
uv venv /venv/main --python 3.12
VIRTUAL_ENV=/venv/main uv pip install \
    "torch==2.10.0" "torchvision==0.25.0" \
    "lerobot==0.5.1" "huggingface_hub>=1.0"
```

(GPU: 1× A100-80GB used here; any A100-class card with ≥ 40 GB should work with `batch_size=64`. On 80 GB use `batch_size=128`.)
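
A quick sanity check of the pins before launching a long run (uses only the standard library's `importlib.metadata`):

```python
from importlib.metadata import version

# Expect 0.5.1 / 2.10.0 / 0.25.0 / 1.8.0 per the table above.
for pkg in ("lerobot", "torch", "torchvision", "huggingface_hub"):
    print(pkg, version(pkg))
```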
### Train from scratch

Pull the patch + training script from this repo, drop them in your working directory, then launch:

```bash
huggingface-cli download yianW/diffusion-vit-grasp03-20k \
    lerobot_vit_patch.py train_diffusion_vit.py \
    --local-dir .

/venv/main/bin/python train_diffusion_vit.py
```

`train_diffusion_vit.py` is self-contained: it builds the `TrainPipelineConfig` programmatically (dataset, ViT backbone, augmentation, crop/resize, per-dim loss weighting) and calls lerobot's training loop. Edit the `build_train_config()` / `build_policy_config()` functions in that file to change steps, batch size, lr, etc.
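
For example, the kind of edit meant here looks roughly like this (field names follow lerobot's `TrainPipelineConfig`; verify them against your pinned version):

```python
# Inside train_diffusion_vit.py, after build_train_config() returns.
cfg = build_train_config()
cfg.steps = 50_000       # train longer than the 25 000 planned here
cfg.batch_size = 64      # e.g. to fit a 40 GB card
cfg.optimizer.lr = 5e-5  # halve the peak learning rate
```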
### Fine-tune from this 20K checkpoint

Pass `--policy.path=yianW/diffusion-vit-grasp03-20k` to lerobot's CLI, or modify `build_policy_config()` to use `from_pretrained`. Note the patch must still be applied first.
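
A hedged sketch of the CLI route (the `lerobot-train` entry point and the `--dataset.repo_id` flag follow lerobot 0.5.x conventions; double-check against your install, and remember the ViT patch has to be importable before the policy loads):

```bash
lerobot-train \
    --policy.path=yianW/diffusion-vit-grasp03-20k \
    --dataset.repo_id=yianW/grasp03-sim-real-halvedreal \
    --batch_size=128
```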
### Resume training (with optimizer/RNG state)

The optimizer / scheduler / RNG state needed to resume training mid-run (`training_state/`, ~2.6 GB) is not uploaded here; only the inference weights are. If you need it for an exact resume, ask and it can be added.
## Caveats

- Loss values are MSE-on-noise (epsilon prediction) and are not directly comparable to action-MSE.
- The patch is a runtime monkey-patch, not a fork of lerobot. Pinning `lerobot` to the version trained against (or at least a version that has `lerobot.policies.diffusion.modeling_diffusion.DiffusionRgbEncoder`) is recommended.