---
license: apache-2.0
library_name: lerobot
pipeline_tag: robotics
tags:
- lerobot
- robotics
- smolvla
- vision-language-action
- imitation-learning
- behavior-cloning
- ur7e
- real-robot
- manipulation
- code-as-policies
- cap
- stack-block
- 10fps
- 50epoch
datasets:
- CoRL2026-CSI/UR7e-CaP-Stack_Block-100epi_10fps
base_model:
- lerobot/smolvla_base
model_name: SmolVLA-UR7e-CaP-StackBlock-50epoch
---

# SmolVLA-UR7e-CaP-StackBlock-50epoch

This repository contains a LeRobot SmolVLA policy fine-tuned for a UR7e Code-as-Policies stack-block task. The policy was trained on demonstrations from [`CoRL2026-CSI/UR7e-CaP-Stack_Block-100epi_10fps`](https://huggingface.co/datasets/CoRL2026-CSI/UR7e-CaP-Stack_Block-100epi_10fps), where the robot builds a stack on a blue dish with the red block on the bottom, the green block in the middle, and the blue block on top.

The checkpoint is intended for research use with LeRobot-compatible inference pipelines. No real-robot or offline success-rate evaluation is included in this model card; the reported metrics are training logs only.

## Model Details

- **Model type:** SmolVLA vision-language-action policy
- **Base policy:** [`lerobot/smolvla_base`](https://huggingface.co/lerobot/smolvla_base)
- **VLM backbone:** `HuggingFaceTB/SmolVLM2-500M-Video-Instruct`
- **Robot:** UR7e
- **Task:** Stack red, green, and blue blocks on a blue dish
- **Training framework:** [LeRobot](https://github.com/huggingface/lerobot)
- **Checkpoint format:** `safetensors`
- **License:** Apache 2.0

## Dataset

The policy was trained on [`CoRL2026-CSI/UR7e-CaP-Stack_Block-100epi_10fps`](https://huggingface.co/datasets/CoRL2026-CSI/UR7e-CaP-Stack_Block-100epi_10fps), a LeRobot v3 dataset collected for the UR7e stack-block task.

Dataset summary:

| Field | Value |
| --- | --- |
| Robot type | `ur7e` |
| Episodes | 100 |
| Frames | 69,932 |
| Dataset FPS | 10 |
| Tasks | 1 |
| Split | `train: 0:100` |
| Cameras | RealSense wrist and top-view RGB video |
| Camera resolution | 480 x 640 RGB video |
| Dataset state/action vectors | 7D joint/gripper vector |

The dataset includes additional skill annotations such as `skill.type`, `skill.progress`, target joint positions, target Cartesian poses, and natural-language skill text. The policy checkpoint uses the LeRobot preprocessing pipeline saved in this repository.

## Policy Inputs and Outputs

According to the saved policy config, the model expects the following input features after preprocessing:

- `observation.state`: 6D state feature
- `observation.images.camera1`: wrist camera, resized/padded for SmolVLA
- `observation.images.camera2`: top-view camera, resized/padded for SmolVLA
- `observation.images.camera3`: visual input slot
- `observation.images.empty_camera_0`: empty camera placeholder

Output, according to the saved policy config:

- `action`: 7D joint/gripper action vector

The included `policy_preprocessor.json` maps dataset camera names to model camera names:

- `observation.images.realsense_wrist` -> `observation.images.camera1`
- `observation.images.realsense_topview` -> `observation.images.camera2`

State and action features use mean/std normalization. Visual features use identity normalization. The postprocessor unnormalizes the `action` output and moves it back to CPU.
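
As a rough illustration of what mean/std normalization and the matching unnormalization do, the sketch below uses plain Python lists. The statistics are made-up values, not the ones stored in this repository, and the small `EPS` term is an assumption; the saved processor files remain the source of truth.

```python
EPS = 1e-8  # assumed small constant guarding against division by zero


def normalize(values, mean, std):
    """Map raw values to roughly zero mean and unit variance, per dimension."""
    return [(v - m) / (s + EPS) for v, m, s in zip(values, mean, std)]


def unnormalize(values, mean, std):
    """Invert `normalize`, mapping model outputs back to raw units."""
    return [v * (s + EPS) + m for v, m, s in zip(values, mean, std)]


# Made-up statistics for a 7D joint/gripper action vector.
action_mean = [0.1, -0.5, 1.2, 0.0, 0.3, -0.2, 0.5]
action_std = [0.4, 0.3, 0.6, 0.2, 0.5, 0.1, 0.5]

raw_action = [0.5, -0.2, 1.8, 0.2, -0.2, -0.1, 1.0]
normalized = normalize(raw_action, action_mean, action_std)
recovered = unnormalize(normalized, action_mean, action_std)
print(normalized)
print(recovered)
```

The round trip recovers the raw action up to floating-point error, which is why actions sent to the robot should always pass through the postprocessor rather than being used in normalized form.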

## Training Details

The final uploaded checkpoint is from step `13700`.

| Setting | Value |
| --- | --- |
| Training steps | 13,700 |
| Approx. epochs | 50.15 |
| Batch size | 128 |
| Effective batch size | 256 |
| Gradient accumulation | 1 |
| Seed | 1000 |
| Optimizer | AdamW |
| Peak learning rate | `1e-4` |
| Weight decay | `1e-10` |
| Gradient clipping | `10.0` |
| Scheduler | Cosine decay with warmup |
| Warmup steps | 1,000 |
| Decay steps | 30,000 |
| Final decay LR | `2.5e-6` |
| AMP | Disabled |
| PEFT | Disabled |
| Vision encoder | Frozen |
| Expert-only training | Enabled |
| State projection training | Enabled |
| Action chunk size | 50 |
| Observation steps | 1 |
| Action steps | 50 |
| Inference denoising steps | 10 |
| Empty camera placeholders | 1 |
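
The approximate epoch count follows from the step count, the effective batch size, and the dataset size. The short calculation below reproduces it from the numbers in this card. The chunk-duration line assumes actions are executed at the dataset rate of 10 FPS, which this card does not state explicitly.

```python
training_steps = 13_700
effective_batch_size = 256
dataset_frames = 69_932
dataset_fps = 10
action_chunk_size = 50
episodes = 100

# Samples seen during training divided by the number of frames in the dataset.
epochs = training_steps * effective_batch_size / dataset_frames

# Assumption: one action per dataset frame interval at execution time.
chunk_duration_s = action_chunk_size / dataset_fps

# Average demonstration length implied by the dataset summary.
mean_episode_s = dataset_frames / episodes / dataset_fps

print(f"epochs ~ {epochs:.2f}")
print(f"chunk duration ~ {chunk_duration_s:.1f} s")
print(f"mean episode length ~ {mean_episode_s:.1f} s")
```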

Image augmentation was enabled during training with up to two randomly ordered transforms per sample:

- brightness jitter: `[0.8, 1.2]`
- contrast jitter: `[0.8, 1.2]`
- saturation jitter: `[0.5, 1.5]`
- hue jitter: `[-0.05, 0.05]`
- sharpness jitter: `[0.5, 1.5]`
- random affine rotation: `[-5, 5]` degrees
- random affine translation: `0.05`
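
To make the "up to two randomly ordered transforms" idea concrete, the sketch below samples a random subset of the listed transforms along with a parameter for each. It is a standard-library illustration only, not the LeRobot augmentation code: `sample_augmentations` is a hypothetical helper, the uniform choice of how many transforms to apply is an assumption, and the translation value `0.05` is interpreted here as a symmetric maximum fraction of the image size.

```python
import random

# Ranges copied from the list above; the sampling logic is an assumption.
AUGMENTATION_RANGES = {
    "brightness": (0.8, 1.2),
    "contrast": (0.8, 1.2),
    "saturation": (0.5, 1.5),
    "hue": (-0.05, 0.05),
    "sharpness": (0.5, 1.5),
    "affine_rotation_deg": (-5.0, 5.0),
    "affine_translation_frac": (-0.05, 0.05),
}


def sample_augmentations(rng: random.Random, max_transforms: int = 2):
    """Pick 0..max_transforms distinct transforms in random order.

    Returns a list of (transform_name, sampled_parameter) pairs.
    """
    count = rng.randint(0, max_transforms)
    names = rng.sample(list(AUGMENTATION_RANGES), k=count)
    return [(name, rng.uniform(*AUGMENTATION_RANGES[name])) for name in names]


rng = random.Random(0)
for _ in range(3):
    print(sample_augmentations(rng))
```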

Training logs:

| Metric | Value |
| --- | --- |
| Final logged training loss | `0.007` |
| Final logged gradient norm | `0.141` |
| Final logged learning rate | `2.5e-6` |

These values are training-loop logs only and should not be interpreted as task success rates.
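
For readers who want to relate the logged learning rate to the scheduler settings, the sketch below implements a generic linear-warmup plus cosine-decay schedule with the peak and final values from the table. It is not taken from the LeRobot source, and the decay horizon the run actually used is not stated in this card, so `decay_steps` is left as a parameter; check `train_config.json` and the installed LeRobot version for the exact behavior.

```python
import math

PEAK_LR = 1e-4
FINAL_LR = 2.5e-6
WARMUP_STEPS = 1_000


def learning_rate(step: int, decay_steps: int) -> float:
    """Generic cosine decay with linear warmup (an illustrative assumption)."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 toward the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine interpolation from PEAK_LR (progress 0) to FINAL_LR (progress 1).
    progress = min(step, decay_steps) / decay_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine


# Two hypothetical horizons, evaluated at the final training step.
print(learning_rate(13_700, decay_steps=13_700))
print(learning_rate(13_700, decay_steps=30_000))
```

Under this formula, a horizon equal to the 13,700 training steps ends at the final decay LR of `2.5e-6`, which agrees with the final logged value, while a 30,000-step horizon would give about `5.8e-5` at the same step.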

## How to Use

With LeRobot and its SmolVLA dependencies installed, load the policy from the Hub:

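```python
import torch
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Fall back to CPU when no CUDA device is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

policy = SmolVLAPolicy.from_pretrained(
    "CoRL2026-CSI/SmolVLA-UR7e-CaP-StackBlock-50epoch"
)
policy.to(device)
policy.eval()  # switch to inference mode
```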

For robot rollout or evaluation, use the LeRobot CLI with `--policy.path` pointing to this repository, or load the checkpoint from your existing UR7e control stack:

```bash
lerobot-record \
  --policy.path=CoRL2026-CSI/SmolVLA-UR7e-CaP-StackBlock-50epoch \
  --dataset.repo_id=CoRL2026-CSI/eval_smolvla_ur7e_cap_stack_block_10fps
```

The command above shows only the policy and dataset arguments. Add the robot and camera arguments, and adjust the dataset arguments, to match the local UR7e deployment setup.

## Files

This repository contains:

- `model.safetensors`: policy weights
- `config.json`: policy configuration
- `train_config.json`: LeRobot training configuration
- `policy_preprocessor.json`: saved inference preprocessing pipeline
- `policy_preprocessor_step_5_normalizer_processor.safetensors`: normalization state
- `policy_postprocessor.json`: saved inference postprocessing pipeline
- `policy_postprocessor_step_0_unnormalizer_processor.safetensors`: action unnormalization state

## Evaluation

No evaluation run is reported for this checkpoint. The training configuration had `eval_freq=0`, so no offline evaluation videos, simulated rollouts, or real-robot success metrics are included in this repository.

## Limitations

This policy was trained for one UR7e tabletop stack-block task and assumes the camera setup, action/state convention, object set, and workspace distribution represented in the training dataset. Validate in a controlled workspace before any hardware deployment.