# SmolVLA-UR7e-CaP-StackBlock-50epoch
This repository contains a LeRobot SmolVLA policy fine-tuned for a UR7e Code-as-Policies stack-block task. The policy was trained on demonstrations from `CoRL2026-CSI/UR7e-CaP-Stack_Block-100epi_10fps`, in which the robot builds a stack on a blue dish with the red block on the bottom, the green block in the middle, and the blue block on top.
The checkpoint is intended for research use with LeRobot-compatible inference pipelines. No real-robot or offline success-rate evaluation is included in this model card; the reported metrics are training logs only.
## Model Details
- Model type: SmolVLA vision-language-action policy
- Base policy: `lerobot/smolvla_base`
- VLM backbone: `HuggingFaceTB/SmolVLM2-500M-Video-Instruct`
- Robot: UR7e
- Task: Stack red, green, and blue blocks on a blue dish
- Training framework: LeRobot
- Checkpoint format: `safetensors`
- License: Apache 2.0
## Dataset
The policy was trained on `CoRL2026-CSI/UR7e-CaP-Stack_Block-100epi_10fps`, a LeRobot v3 dataset collected for the UR7e stack-block task.
Dataset summary:
| Field | Value |
|---|---|
| Robot type | ur7e |
| Episodes | 100 |
| Frames | 69,932 |
| Dataset FPS | 10 |
| Tasks | 1 |
| Split | train: 0:100 |
| Cameras | RealSense wrist and top-view RGB video |
| Camera resolution | 480 x 640 RGB video |
| Dataset state/action vectors | 7D joint/gripper vector |
The dataset also includes skill annotations such as `skill.type`, `skill.progress`, target joint positions, target Cartesian poses, and natural-language skill text. The policy checkpoint uses the LeRobot preprocessing pipeline saved in this repository.
## Policy Inputs and Outputs
The saved policy configuration expects the following model features after preprocessing:
Inputs, according to the saved policy config:
- `observation.state`: 6D state feature
- `observation.images.camera1`: wrist camera, resized/padded for SmolVLA
- `observation.images.camera2`: top-view camera, resized/padded for SmolVLA
- `observation.images.camera3`: visual input slot
- `observation.images.empty_camera_0`: empty camera placeholder
Output, according to the saved policy config:
- `action`: 7D joint/gripper action vector
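As an illustrative sketch (not part of this repository), the feature keys from the saved policy config can be captured in a small validation helper; the shapes shown are assumptions based on the dataset's 480 x 640 cameras and the 6D state / 7D action features:

```python
# Hypothetical helper listing the feature keys from the saved policy config.
# Shapes are illustrative assumptions, not values read from the config itself.
EXPECTED_FEATURES = {
    "observation.state": (6,),
    "observation.images.camera1": (3, 480, 640),      # wrist RGB
    "observation.images.camera2": (3, 480, 640),      # top-view RGB
    "observation.images.camera3": (3, 480, 640),      # extra visual slot
    "observation.images.empty_camera_0": (3, 480, 640),  # placeholder
}
ACTION_DIM = 7  # 7D joint/gripper action output

def validate_batch(batch: dict) -> None:
    """Raise if an observation dict is missing any expected feature key."""
    missing = [key for key in EXPECTED_FEATURES if key not in batch]
    if missing:
        raise KeyError(f"missing observation features: {missing}")
```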
The included `policy_preprocessor.json` maps dataset camera names to model camera names:

- `observation.images.realsense_wrist` -> `observation.images.camera1`
- `observation.images.realsense_topview` -> `observation.images.camera2`
State and action features use mean/std normalization. Visual features use identity normalization. The postprocessor unnormalizes the action output and moves it back to CPU.
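A minimal elementwise sketch of what the mean/std normalizer and the unnormalizing postprocessor do; the actual LeRobot processors operate on tensors and load their statistics from the bundled `*_normalizer_processor.safetensors` files:

```python
# Sketch of mean/std (un)normalization, assuming the usual elementwise form.
def normalize(x, mean, std, eps=1e-8):
    """Map raw state/action values to zero-mean, unit-variance space."""
    return [(v - m) / (s + eps) for v, m, s in zip(x, mean, std)]

def unnormalize(x, mean, std):
    """Invert normalization, e.g. for the policy's 7D action output."""
    return [v * s + m for v, m, s in zip(x, mean, std)]
```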
## Training Details
The final uploaded checkpoint is from step 13,700.
| Setting | Value |
|---|---|
| Training steps | 13,700 |
| Approx. epochs | 50.15 |
| Batch size | 128 |
| Effective batch size | 256 |
| Gradient accumulation | 1 |
| Seed | 1000 |
| Optimizer | AdamW |
| Peak learning rate | 1e-4 |
| Weight decay | 1e-10 |
| Gradient clipping | 10.0 |
| Scheduler | Cosine decay with warmup |
| Warmup steps | 1,000 |
| Decay steps | 30,000 |
| Final decay LR | 2.5e-6 |
| AMP | Disabled |
| PEFT | Disabled |
| Vision encoder | Frozen |
| Expert-only training | Enabled |
| State projection training | Enabled |
| Action chunk size | 50 |
| Observation steps | 1 |
| Action steps | 50 |
| Inference denoising steps | 10 |
| Empty camera placeholders | 1 |
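As a rough illustration of the schedule settings in the table, a linear-warmup cosine decay from the peak learning rate to the final decay rate can be sketched as follows; the exact warmup shape and boundary handling in LeRobot's scheduler may differ:

```python
import math

# Values from the training table above.
PEAK_LR, FINAL_LR = 1e-4, 2.5e-6
WARMUP_STEPS, DECAY_STEPS = 1_000, 30_000

def lr_at(step: int) -> float:
    """Cosine decay with linear warmup (assumed shape, not LeRobot's exact code)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    if step >= DECAY_STEPS:
        return FINAL_LR
    progress = (step - WARMUP_STEPS) / (DECAY_STEPS - WARMUP_STEPS)
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1 + math.cos(math.pi * progress))
```

Note that training stopped at step 13,700, well inside the 30,000-step decay horizon, which is consistent with the final logged learning rate of 2.5e-6 only if the logged value reflects the scheduler's configured floor rather than the mid-decay value; treat this sketch as indicative only.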
Image augmentation was enabled during training with up to two randomly ordered transforms per sample:
- brightness jitter: [0.8, 1.2]
- contrast jitter: [0.8, 1.2]
- saturation jitter: [0.5, 1.5]
- hue jitter: [-0.05, 0.05]
- sharpness jitter: [0.5, 1.5]
- random affine rotation: [-5, 5] degrees
- random affine translation: 0.05
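The sampling logic above can be sketched as follows. Drawing parameter values uniformly from each range is an assumption, and the real pipeline applies the sampled transforms to image tensors rather than just sampling parameters:

```python
import random

# Jitter ranges from the training config; each sample applies up to two
# randomly ordered transforms drawn from this pool.
AUG_RANGES = {
    "brightness": (0.8, 1.2),
    "contrast": (0.8, 1.2),
    "saturation": (0.5, 1.5),
    "hue": (-0.05, 0.05),
    "sharpness": (0.5, 1.5),
    "rotation_deg": (-5.0, 5.0),
}

def sample_augmentations(rng: random.Random, max_transforms: int = 2):
    """Pick up to `max_transforms` transforms in random order with sampled params."""
    n = rng.randint(0, max_transforms)
    names = rng.sample(list(AUG_RANGES), n)
    return [(name, rng.uniform(*AUG_RANGES[name])) for name in names]
```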
Training logs:
| Metric | Value |
|---|---|
| Final logged training loss | 0.007 |
| Final logged gradient norm | 0.141 |
| Final logged learning rate | 2.5e-6 |
These values are training-loop logs only and should not be interpreted as task success rates.
## How to Use
Install LeRobot and load the policy from the Hub:
```python
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

policy = SmolVLAPolicy.from_pretrained(
    "CoRL2026-CSI/SmolVLA-UR7e-CaP-StackBlock-50epoch"
)
policy.to("cuda")
policy.eval()
```
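For rollout, LeRobot policies expose a `select_action` method that consumes a batch of observation features and returns the next action. The control loop below is a hypothetical sketch using a stub in place of the real policy; `StubPolicy`, `get_observation`, and `send_action` are illustrative stand-ins, not repository code:

```python
class StubPolicy:
    """Stand-in for SmolVLAPolicy, illustrating the call pattern only.

    The real policy's select_action consumes a dict of observation features
    and serves actions from its 50-step action chunk.
    """
    def select_action(self, batch: dict) -> list:
        return [0.0] * 7  # 7D joint/gripper action

def rollout(policy, get_observation, send_action, n_steps: int = 5):
    """Hypothetical control loop: observe, query the policy, send the action."""
    for _ in range(n_steps):
        batch = get_observation()            # dict of observation.* features
        action = policy.select_action(batch)
        send_action(action)                  # forward to the UR7e controller
```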
For robot rollout or evaluation, use the LeRobot CLI or your existing UR7e control stack with `--policy.path` pointing to this repository:
```shell
lerobot-record \
  --policy.path=CoRL2026-CSI/SmolVLA-UR7e-CaP-StackBlock-50epoch \
  --dataset.repo_id=CoRL2026-CSI/eval_smolvla_ur7e_cap_stack_block_10fps
```
Adjust the robot, camera, and dataset arguments to match the local UR7e deployment setup.
## Files
This repository contains:
- `model.safetensors`: policy weights
- `config.json`: policy configuration
- `train_config.json`: LeRobot training configuration
- `policy_preprocessor.json`: saved inference preprocessing pipeline
- `policy_preprocessor_step_5_normalizer_processor.safetensors`: normalization state
- `policy_postprocessor.json`: saved inference postprocessing pipeline
- `policy_postprocessor_step_0_unnormalizer_processor.safetensors`: action unnormalization state
## Evaluation
No evaluation run is reported for this checkpoint. The training configuration set `eval_freq=0`, so no offline evaluation videos, simulated rollouts, or real-robot success metrics are included in this repository.
## Limitations
This policy was trained for one UR7e tabletop stack-block task and assumes the camera setup, action/state convention, object set, and workspace distribution represented in the training dataset. Validate in a controlled workspace before any hardware deployment.