SmolVLA-UR7e-CaP-StackBlock-50epoch

This repository contains a LeRobot SmolVLA policy fine-tuned for a UR7e Code-as-Policies stack-block task. The policy was trained on demonstrations from CoRL2026-CSI/UR7e-CaP-Stack_Block-100epi_10fps, where the robot builds a stack on a blue dish with the red block on the bottom, the green block in the middle, and the blue block on top.

The checkpoint is intended for research use with LeRobot-compatible inference pipelines. No real-robot or offline success-rate evaluation is included in this model card; the reported metrics are training logs only.

Model Details

  • Model type: SmolVLA vision-language-action policy
  • Base policy: lerobot/smolvla_base
  • VLM backbone: HuggingFaceTB/SmolVLM2-500M-Video-Instruct
  • Robot: UR7e
  • Task: Stack red, green, and blue blocks on a blue dish
  • Training framework: LeRobot
  • Checkpoint format: safetensors
  • License: Apache 2.0

Dataset

The policy was trained on CoRL2026-CSI/UR7e-CaP-Stack_Block-100epi_10fps, a LeRobot v3 dataset collected for the UR7e stack-block task.

Dataset summary:

  • Robot type: ur7e
  • Episodes: 100
  • Frames: 69,932
  • Dataset FPS: 10
  • Tasks: 1
  • Split: train, episodes 0:100
  • Cameras: RealSense wrist and top-view RGB video
  • Camera resolution: 480 × 640 RGB
  • State/action vectors: 7D joint/gripper vector

The dataset includes additional skill annotations such as skill.type, skill.progress, target joint positions, target Cartesian poses, and natural-language skill text. The policy checkpoint uses the LeRobot preprocessing pipeline saved in this repository.
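As a quick sanity check on the summary above, the average episode length follows directly from the frame count and FPS (a back-of-the-envelope sketch; actual per-episode lengths vary):

```python
# Average episode length implied by the dataset summary.
frames = 69_932
episodes = 100
fps = 10

frames_per_episode = frames / episodes           # ~699 frames per episode
seconds_per_episode = frames_per_episode / fps   # ~70 s of robot time per episode
```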

Policy Inputs and Outputs

The saved policy configuration expects the following model features after preprocessing:

Inputs, according to the saved policy config:

  • observation.state: 6D state feature
  • observation.images.camera1: wrist camera, resized/padded for SmolVLA
  • observation.images.camera2: top-view camera, resized/padded for SmolVLA
  • observation.images.camera3: additional visual input slot (no dataset camera is mapped to it by the saved preprocessor)
  • observation.images.empty_camera_0: empty camera placeholder

Output, according to the saved policy config:

  • action: 7D joint/gripper action vector

The included policy_preprocessor.json maps dataset camera names to model camera names:

  • observation.images.realsense_wrist -> observation.images.camera1
  • observation.images.realsense_topview -> observation.images.camera2
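The renaming step amounts to a key remap over the observation dictionary. A stdlib-only sketch of that idea (illustrative, not LeRobot's actual processor code, which applies the mapping inside its pipeline classes):

```python
# Map dataset camera keys to the camera names the policy config expects,
# mirroring the mapping in policy_preprocessor.json.
CAMERA_RENAMES = {
    "observation.images.realsense_wrist": "observation.images.camera1",
    "observation.images.realsense_topview": "observation.images.camera2",
}

def rename_cameras(observation: dict) -> dict:
    """Return a copy of `observation` with dataset camera keys renamed."""
    return {CAMERA_RENAMES.get(key, key): value for key, value in observation.items()}

obs = {
    "observation.state": [0.0] * 7,
    "observation.images.realsense_wrist": "wrist_frame",
    "observation.images.realsense_topview": "top_frame",
}
renamed = rename_cameras(obs)  # keys are now camera1/camera2
```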

State and action features use mean/std normalization. Visual features use identity normalization. The postprocessor unnormalizes the action output and moves it back to CPU.
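A minimal sketch of the mean/std convention (illustrative only; the real statistics live in the saved normalizer safetensors files, and the values below are made up):

```python
# Mean/std normalization for state inputs and the inverse for action outputs.
# The mean/std values here are placeholders; the real ones are stored in
# policy_preprocessor_step_5_normalizer_processor.safetensors.
def normalize(x, mean, std, eps=1e-8):
    return [(xi - mi) / (si + eps) for xi, mi, si in zip(x, mean, std)]

def unnormalize(x, mean, std):
    return [xi * si + mi for xi, mi, si in zip(x, mean, std)]

mean, std = [0.1] * 7, [0.5] * 7
action = [0.2] * 7
roundtrip = unnormalize(normalize(action, mean, std), mean, std)  # ≈ action
```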

Training Details

The final uploaded checkpoint is from step 13700.

  • Training steps: 13,700
  • Approx. epochs: 50.15
  • Batch size: 128
  • Effective batch size: 256
  • Gradient accumulation: 1
  • Seed: 1000
  • Optimizer: AdamW
  • Peak learning rate: 1e-4
  • Weight decay: 1e-10
  • Gradient clipping: 10.0
  • Scheduler: cosine decay with warmup
  • Warmup steps: 1,000
  • Decay steps: 30,000
  • Final decay LR: 2.5e-6
  • AMP: disabled
  • PEFT: disabled
  • Vision encoder: frozen
  • Expert-only training: enabled
  • State projection training: enabled
  • Action chunk size: 50
  • Observation steps: 1
  • Action steps: 50
  • Inference denoising steps: 10
  • Empty camera placeholders: 1
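The epoch figure follows from the step count, effective batch size, and dataset size (a back-of-the-envelope check):

```python
# Approx. epochs = total frames seen during training / dataset size.
steps = 13_700
effective_batch = 256
dataset_frames = 69_932

epochs = steps * effective_batch / dataset_frames  # ≈ 50.15
```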

Image augmentation was enabled during training with up to two randomly ordered transforms per sample:

  • brightness jitter: [0.8, 1.2]
  • contrast jitter: [0.8, 1.2]
  • saturation jitter: [0.5, 1.5]
  • hue jitter: [-0.05, 0.05]
  • sharpness jitter: [0.5, 1.5]
  • random affine rotation: [-5, 5] degrees
  • random affine translation: 0.05
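The "up to two randomly ordered transforms per sample" policy can be sketched as follows (an illustrative re-implementation, not LeRobot's actual augmentation code; the transform callables are stand-ins that tag the input so their effect is visible):

```python
import random

# Stand-in transforms: each tags the "image" with its name. During training
# these were the jitter/affine operations listed above.
TRANSFORMS = {
    "brightness": lambda img: img + "+brightness",
    "contrast": lambda img: img + "+contrast",
    "saturation": lambda img: img + "+saturation",
    "hue": lambda img: img + "+hue",
    "sharpness": lambda img: img + "+sharpness",
    "affine": lambda img: img + "+affine",
}

def augment(img, rng, max_transforms=2):
    """Apply up to `max_transforms` distinct transforms in random order."""
    chosen = rng.sample(list(TRANSFORMS), rng.randint(0, max_transforms))
    for name in chosen:
        img = TRANSFORMS[name](img)
    return img

rng = random.Random(0)
augmented = augment("frame", rng)  # "frame" with at most two suffixes appended
```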

Training logs:

  • Final logged training loss: 0.007
  • Final logged gradient norm: 0.141
  • Final logged learning rate: 2.5e-6

These values are training-loop logs only and should not be interpreted as task success rates.

How to Use

Install LeRobot and load the policy from the Hub:

from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Download the checkpoint from the Hub and load the fine-tuned weights.
policy = SmolVLAPolicy.from_pretrained(
    "CoRL2026-CSI/SmolVLA-UR7e-CaP-StackBlock-50epoch"
)
policy.to("cuda")  # move to GPU; use "cpu" if no CUDA device is available
policy.eval()

For robot rollout or evaluation, use the LeRobot CLI or your existing UR7e control stack with --policy.path pointing to this repository:

lerobot-record \
  --policy.path=CoRL2026-CSI/SmolVLA-UR7e-CaP-StackBlock-50epoch \
  --dataset.repo_id=CoRL2026-CSI/eval_smolvla_ur7e_cap_stack_block_10fps

Adjust the robot, camera, and dataset arguments to match the local UR7e deployment setup.
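At rollout time the policy emits 50-step action chunks while consuming one observation step, and LeRobot's `select_action` manages the resulting action queue internally. A stdlib-only sketch of that pattern (illustrative; `predict_chunk` is a dummy stand-in for the policy forward pass):

```python
from collections import deque

CHUNK_SIZE = 50  # matches the policy's action chunk size and action steps

def predict_chunk(observation):
    """Dummy stand-in for a policy call returning a 50-step chunk of 7D actions."""
    return [[0.0] * 7 for _ in range(CHUNK_SIZE)]

queue = deque()

def select_action(observation):
    """Return the next 7D action, querying the policy only when the queue is empty."""
    if not queue:
        queue.extend(predict_chunk(observation))
    return queue.popleft()

# One policy call serves the next 50 control steps at 10 fps.
actions = [select_action({"dummy": None}) for _ in range(60)]
```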

Files

This repository contains:

  • model.safetensors: policy weights
  • config.json: policy configuration
  • train_config.json: LeRobot training configuration
  • policy_preprocessor.json: saved inference preprocessing pipeline
  • policy_preprocessor_step_5_normalizer_processor.safetensors: normalization state
  • policy_postprocessor.json: saved inference postprocessing pipeline
  • policy_postprocessor_step_0_unnormalizer_processor.safetensors: action unnormalization state

Evaluation

No evaluation run is reported for this checkpoint. The training configuration had eval_freq=0, so no offline evaluation videos, simulated rollouts, or real-robot success metrics are included in this repository.

Limitations

This policy was trained for one UR7e tabletop stack-block task and assumes the camera setup, action/state convention, object set, and workspace distribution represented in the training dataset. Validate in a controlled workspace before any hardware deployment.
