# SmolVLA-UR7e-CaP-StackBlock-50epoch
This repository contains a LeRobot SmolVLA policy fine-tuned for a UR7e Code-as-Policies stack-block task. The policy was trained on demonstrations from `CoRL2026-CSI/UR7e-CaP-Stack_Block-100epi_10fps`, in which the robot builds a stack on a blue dish with the red block on the bottom, the green block in the middle, and the blue block on top.
The checkpoint is intended for research use with LeRobot-compatible inference pipelines. No real-robot or offline success-rate evaluation is included in this model card; the reported metrics are training logs only.
## Model Details
- Model type: SmolVLA vision-language-action policy
- Base policy: `lerobot/smolvla_base`
- VLM backbone: `HuggingFaceTB/SmolVLM2-500M-Video-Instruct`
- Robot: UR7e
- Task: Stack red, green, and blue blocks on a blue dish
- Training framework: LeRobot
- Checkpoint format: `safetensors`
- License: Apache 2.0
## Dataset
The policy was trained on `CoRL2026-CSI/UR7e-CaP-Stack_Block-100epi_10fps`, a LeRobot v3 dataset collected for the UR7e stack-block task.
Dataset summary:
| Field | Value |
|---|---|
| Robot type | ur7e |
| Episodes | 100 |
| Frames | 69,932 |
| Dataset FPS | 10 |
| Tasks | 1 |
| Split | train: 0:100 |
| Cameras | RealSense wrist and top-view RGB video |
| Camera resolution | 480 x 640 RGB video |
| Dataset state/action vectors | 7D joint/gripper vector |
The dataset also includes skill annotations such as `skill.type`, `skill.progress`, target joint positions, target Cartesian poses, and natural-language skill text. The policy checkpoint uses the LeRobot preprocessing pipeline saved in this repository.
## Policy Inputs and Outputs
The saved policy configuration expects the following model features after preprocessing:
Inputs, according to the saved policy config:
- `observation.state`: 6D state feature
- `observation.images.camera1`: wrist camera, resized/padded for SmolVLA
- `observation.images.camera2`: top-view camera, resized/padded for SmolVLA
- `observation.images.camera3`: visual input slot
- `observation.images.empty_camera_0`: empty camera placeholder
Output, according to the saved policy config:
- `action`: 7D joint/gripper action vector
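As an illustrative sketch (not part of this repository), the feature keys from the saved policy config can be captured in a small validation helper; the shapes shown are assumptions based on the dataset's 480 x 640 cameras and the 6D state / 7D action features:

```python
# Hypothetical helper listing the feature keys from the saved policy config.
# Shapes are illustrative assumptions, not values read from the config itself.
EXPECTED_FEATURES = {
    "observation.state": (6,),
    "observation.images.camera1": (3, 480, 640),      # wrist RGB
    "observation.images.camera2": (3, 480, 640),      # top-view RGB
    "observation.images.camera3": (3, 480, 640),      # extra visual slot
    "observation.images.empty_camera_0": (3, 480, 640),  # placeholder
}
ACTION_DIM = 7  # 7D joint/gripper action output

def validate_batch(batch: dict) -> None:
    """Raise if an observation dict is missing any expected feature key."""
    missing = [key for key in EXPECTED_FEATURES if key not in batch]
    if missing:
        raise KeyError(f"missing observation features: {missing}")
```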
The included `policy_preprocessor.json` maps dataset camera names to model camera names:

- `observation.images.realsense_wrist` -> `observation.images.camera1`
- `observation.images.realsense_topview` -> `observation.images.camera2`
State and action features use mean/std normalization. Visual features use identity normalization. The postprocessor unnormalizes the action output and moves it back to CPU.
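A minimal elementwise sketch of what the mean/std normalizer and the unnormalizing postprocessor do; the actual LeRobot processors operate on tensors and load their statistics from the bundled `*_normalizer_processor.safetensors` files:

```python
# Sketch of mean/std (un)normalization, assuming the usual elementwise form.
def normalize(x, mean, std, eps=1e-8):
    """Map raw state/action values to zero-mean, unit-variance space."""
    return [(v - m) / (s + eps) for v, m, s in zip(x, mean, std)]

def unnormalize(x, mean, std):
    """Invert normalization, e.g. for the policy's 7D action output."""
    return [v * s + m for v, m, s in zip(x, mean, std)]
```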
## Training Details
The final uploaded checkpoint is from step 13,700.
| Setting | Value |
|---|---|
| Training steps | 13,700 |
| Approx. epochs | 50.15 |
| Batch size | 128 |
| Effective batch size | 256 |
| Gradient accumulation | 1 |
| Seed | 1000 |
| Optimizer | AdamW |
| Peak learning rate | 1e-4 |
| Weight decay | 1e-10 |
| Gradient clipping | 10.0 |
| Scheduler | Cosine decay with warmup |
| Warmup steps | 1,000 |
| Decay steps | 30,000 |
| Final decay LR | 2.5e-6 |
| AMP | Disabled |
| PEFT | Disabled |
| Vision encoder | Frozen |
| Expert-only training | Enabled |
| State projection training | Enabled |
| Action chunk size | 50 |
| Observation steps | 1 |
| Action steps | 50 |
| Inference denoising steps | 10 |
| Empty camera placeholders | 1 |
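As a rough illustration of the schedule settings in the table, a linear-warmup cosine decay from the peak learning rate to the final decay rate can be sketched as follows; the exact warmup shape and boundary handling in LeRobot's scheduler may differ:

```python
import math

# Values from the training table above.
PEAK_LR, FINAL_LR = 1e-4, 2.5e-6
WARMUP_STEPS, DECAY_STEPS = 1_000, 30_000

def lr_at(step: int) -> float:
    """Cosine decay with linear warmup (assumed shape, not LeRobot's exact code)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    if step >= DECAY_STEPS:
        return FINAL_LR
    progress = (step - WARMUP_STEPS) / (DECAY_STEPS - WARMUP_STEPS)
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1 + math.cos(math.pi * progress))
```

Note that training stopped at step 13,700, well inside the 30,000-step decay horizon, which is consistent with the final logged learning rate of 2.5e-6 only if the logged value reflects the scheduler's configured floor rather than the mid-decay value; treat this sketch as indicative only.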
Image augmentation was enabled during training with up to two randomly ordered transforms per sample:
- brightness jitter: [0.8, 1.2]
- contrast jitter: [0.8, 1.2]
- saturation jitter: [0.5, 1.5]
- hue jitter: [-0.05, 0.05]
- sharpness jitter: [0.5, 1.5]
- random affine rotation: [-5, 5] degrees
- random affine translation: 0.05
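The sampling logic above can be sketched as follows. Drawing parameter values uniformly from each range is an assumption, and the real pipeline applies the sampled transforms to image tensors rather than just sampling parameters:

```python
import random

# Jitter ranges from the training config; each sample applies up to two
# randomly ordered transforms drawn from this pool.
AUG_RANGES = {
    "brightness": (0.8, 1.2),
    "contrast": (0.8, 1.2),
    "saturation": (0.5, 1.5),
    "hue": (-0.05, 0.05),
    "sharpness": (0.5, 1.5),
    "rotation_deg": (-5.0, 5.0),
}

def sample_augmentations(rng: random.Random, max_transforms: int = 2):
    """Pick up to `max_transforms` transforms in random order with sampled params."""
    n = rng.randint(0, max_transforms)
    names = rng.sample(list(AUG_RANGES), n)
    return [(name, rng.uniform(*AUG_RANGES[name])) for name in names]
```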
Training logs:
| Metric | Value |
|---|---|
| Final logged training loss | 0.007 |
| Final logged gradient norm | 0.141 |
| Final logged learning rate | 2.5e-6 |
These values are training-loop logs only and should not be interpreted as task success rates.
## How to Use
Install LeRobot and load the policy from the Hub:
```python
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

policy = SmolVLAPolicy.from_pretrained(
    "CoRL2026-CSI/SmolVLA-UR7e-CaP-StackBlock-50epoch"
)
policy.to("cuda")
policy.eval()
```
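For rollout, LeRobot policies expose a `select_action` method that consumes a batch of observation features and returns the next action. The control loop below is a hypothetical sketch using a stub in place of the real policy; `StubPolicy`, `get_observation`, and `send_action` are illustrative stand-ins, not repository code:

```python
class StubPolicy:
    """Stand-in for SmolVLAPolicy, illustrating the call pattern only.

    The real policy's select_action consumes a dict of observation features
    and serves actions from its 50-step action chunk.
    """
    def select_action(self, batch: dict) -> list:
        return [0.0] * 7  # 7D joint/gripper action

def rollout(policy, get_observation, send_action, n_steps: int = 5):
    """Hypothetical control loop: observe, query the policy, send the action."""
    for _ in range(n_steps):
        batch = get_observation()            # dict of observation.* features
        action = policy.select_action(batch)
        send_action(action)                  # forward to the UR7e controller
```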
For robot rollout or evaluation, use the LeRobot CLI or your existing UR7e control stack with `--policy.path` pointing to this repository:
```shell
lerobot-record \
  --policy.path=CoRL2026-CSI/SmolVLA-UR7e-CaP-StackBlock-50epoch \
  --dataset.repo_id=CoRL2026-CSI/eval_smolvla_ur7e_cap_stack_block_10fps
```
Adjust the robot, camera, and dataset arguments to match the local UR7e deployment setup.
## Files
This repository contains:
- `model.safetensors`: policy weights
- `config.json`: policy configuration
- `train_config.json`: LeRobot training configuration
- `policy_preprocessor.json`: saved inference preprocessing pipeline
- `policy_preprocessor_step_5_normalizer_processor.safetensors`: normalization state
- `policy_postprocessor.json`: saved inference postprocessing pipeline
- `policy_postprocessor_step_0_unnormalizer_processor.safetensors`: action unnormalization state
## Evaluation
No evaluation run is reported for this checkpoint. The training configuration set `eval_freq=0`, so no offline evaluation videos, simulated rollouts, or real-robot success metrics are included in this repository.
## Limitations
This policy was trained for one UR7e tabletop stack-block task and assumes the camera setup, action/state convention, object set, and workspace distribution represented in the training dataset. Validate in a controlled workspace before any hardware deployment.