---
license: apache-2.0
library_name: lerobot
pipeline_tag: robotics
tags:
- lerobot
- robotics
- smolvla
- vision-language-action
- imitation-learning
- behavior-cloning
- ur7e
- real-robot
- manipulation
- code-as-policies
- cap
- stack-block
- 10fps
- 50epoch
datasets:
- CoRL2026-CSI/UR7e-CaP-Stack_Block-100epi_10fps
base_model:
- lerobot/smolvla_base
model_name: SmolVLA-UR7e-CaP-StackBlock-50epoch
---

# SmolVLA-UR7e-CaP-StackBlock-50epoch

This repository contains a LeRobot SmolVLA policy fine-tuned for a UR7e Code-as-Policies stack-block task. The policy was trained on demonstrations from [`CoRL2026-CSI/UR7e-CaP-Stack_Block-100epi_10fps`](https://huggingface.co/datasets/CoRL2026-CSI/UR7e-CaP-Stack_Block-100epi_10fps), where the robot builds a stack on a blue dish with the red block on the bottom, the green block in the middle, and the blue block on top.

The checkpoint is intended for research use with LeRobot-compatible inference pipelines. No real-robot or offline success-rate evaluation is included in this model card; the reported metrics are training logs only.

## Model Details

- **Model type:** SmolVLA vision-language-action policy
- **Base policy:** [`lerobot/smolvla_base`](https://huggingface.co/lerobot/smolvla_base)
- **VLM backbone:** `HuggingFaceTB/SmolVLM2-500M-Video-Instruct`
- **Robot:** UR7e
- **Task:** Stack red, green, and blue blocks on a blue dish
- **Training framework:** [LeRobot](https://github.com/huggingface/lerobot)
- **Checkpoint format:** `safetensors`
- **License:** Apache 2.0

## Dataset

The policy was trained on [`CoRL2026-CSI/UR7e-CaP-Stack_Block-100epi_10fps`](https://huggingface.co/datasets/CoRL2026-CSI/UR7e-CaP-Stack_Block-100epi_10fps), a LeRobot v3 dataset collected for the UR7e stack-block task.


Dataset summary:

| Field | Value |
| --- | --- |
| Robot type | `ur7e` |
| Episodes | 100 |
| Frames | 69,932 |
| Dataset FPS | 10 |
| Tasks | 1 |
| Split | `train: 0:100` |
| Cameras | RealSense wrist and top-view RGB video |
| Camera resolution | 480 x 640 RGB video |
| Dataset state/action vectors | 7D joint/gripper vector |

The dataset includes additional skill annotations such as `skill.type`, `skill.progress`, target joint positions, target Cartesian poses, and natural-language skill text. The policy checkpoint uses the LeRobot preprocessing pipeline saved in this repository.

## Policy Inputs and Outputs

The saved policy configuration expects the following model features after preprocessing:

Inputs, according to the saved policy config:

- `observation.state`: 6D state feature
- `observation.images.camera1`: wrist camera, resized/padded for SmolVLA
- `observation.images.camera2`: top-view camera, resized/padded for SmolVLA
- `observation.images.camera3`: visual input slot
- `observation.images.empty_camera_0`: empty camera placeholder

Output, according to the saved policy config:

- `action`: 7D joint/gripper action vector

The included `policy_preprocessor.json` maps dataset camera names to model camera names:

- `observation.images.realsense_wrist` -> `observation.images.camera1`
- `observation.images.realsense_topview` -> `observation.images.camera2`

State and action features use mean/std normalization. Visual features use identity normalization. The postprocessor unnormalizes the `action` output and moves it back to CPU.

## Training Details

The final uploaded checkpoint is from step `13700`.

| Setting | Value |
| --- | --- |
| Training steps | 13,700 |
| Approx. epochs | 50.15 |
| Batch size | 128 |
| Effective batch size | 256 |
| Gradient accumulation | 1 |
| Seed | 1000 |
| Optimizer | AdamW |
| Peak learning rate | `1e-4` |
| Weight decay | `1e-10` |
| Gradient clipping | `10.0` |
| Scheduler | Cosine decay with warmup |
| Warmup steps | 1,000 |
| Decay steps | 30,000 |
| Final decay LR | `2.5e-6` |
| AMP | Disabled |
| PEFT | Disabled |
| Vision encoder | Frozen |
| Expert-only training | Enabled |
| State projection training | Enabled |
| Action chunk size | 50 |
| Observation steps | 1 |
| Action steps | 50 |
| Inference denoising steps | 10 |
| Empty camera placeholders | 1 |

Image augmentation was enabled during training with up to two randomly ordered transforms per sample:

- brightness jitter: `[0.8, 1.2]`
- contrast jitter: `[0.8, 1.2]`
- saturation jitter: `[0.5, 1.5]`
- hue jitter: `[-0.05, 0.05]`
- sharpness jitter: `[0.5, 1.5]`
- random affine rotation: `[-5, 5]` degrees
- random affine translation: `0.05`

Training logs:

| Metric | Value |
| --- | --- |
| Final logged training loss | `0.007` |
| Final logged gradient norm | `0.141` |
| Final logged learning rate | `2.5e-6` |

These values are training-loop logs only and should not be interpreted as task success rates.

## How to Use

Install LeRobot and load the policy from the Hub:

```python
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

policy = SmolVLAPolicy.from_pretrained(
    "CoRL2026-CSI/SmolVLA-UR7e-CaP-StackBlock-50epoch"
)
policy.to("cuda")
policy.eval()
```

For robot rollout or evaluation, use the LeRobot CLI or your existing UR7e control stack with `--policy.path` pointing to this repository:

```bash
lerobot-record \
  --policy.path=CoRL2026-CSI/SmolVLA-UR7e-CaP-StackBlock-50epoch \
  --dataset.repo_id=CoRL2026-CSI/eval_smolvla_ur7e_cap_stack_block_10fps
```

Adjust the robot, camera, and dataset arguments to match the local UR7e deployment setup.
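
When adapting a local camera setup, it can help to see what the saved pre/postprocessing amounts to conceptually. The following is a minimal, dependency-free sketch of the camera-key renaming and mean/std (un)normalization described under Policy Inputs and Outputs. It is not the LeRobot implementation: the function names, the `eps` value, and the example statistics are hypothetical and used only for illustration. The real statistics live in the `.safetensors` normalizer files in this repository.

```python
from typing import Any, Dict, List, Mapping, Sequence

# Camera-name mapping as listed in this model card for policy_preprocessor.json.
CAMERA_RENAME: Dict[str, str] = {
    "observation.images.realsense_wrist": "observation.images.camera1",
    "observation.images.realsense_topview": "observation.images.camera2",
}


def rename_observation_keys(
    obs: Mapping[str, Any], rename_map: Mapping[str, str] = CAMERA_RENAME
) -> Dict[str, Any]:
    """Return a copy of obs with dataset camera keys renamed to model keys.

    Keys that are not in rename_map (for example observation.state) pass through.
    """
    return {rename_map.get(key, key): value for key, value in obs.items()}


def normalize(
    values: Sequence[float],
    mean: Sequence[float],
    std: Sequence[float],
    eps: float = 1e-8,  # hypothetical stabilizer, not taken from the saved config
) -> List[float]:
    """Element-wise mean/std normalization: (x - mean) / (std + eps)."""
    return [(v - m) / (s + eps) for v, m, s in zip(values, mean, std)]


def unnormalize(
    values: Sequence[float],
    mean: Sequence[float],
    std: Sequence[float],
    eps: float = 1e-8,  # must match the eps used in normalize
) -> List[float]:
    """Inverse of normalize: x_norm * (std + eps) + mean."""
    return [v * (s + eps) + m for v, m, s in zip(values, mean, std)]


if __name__ == "__main__":
    # Placeholder observation; real frames would be image tensors.
    raw_obs = {
        "observation.images.realsense_wrist": "wrist_frame",
        "observation.images.realsense_topview": "topview_frame",
        "observation.state": [0.1, -0.2],
    }
    print(sorted(rename_observation_keys(raw_obs)))

    # Made-up statistics purely to show the round trip.
    mean, std = [0.0, 0.5], [1.0, 2.0]
    z = normalize([0.3, 1.5], mean, std)
    print(unnormalize(z, mean, std))
```

In practice, the saved `policy_preprocessor.json` and `policy_postprocessor.json` pipelines are expected to handle these steps, so a sketch like this is mainly useful for checking that local observation keys and value ranges line up with what the checkpoint was trained on.
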

## Files

This repository contains:

- `model.safetensors`: policy weights
- `config.json`: policy configuration
- `train_config.json`: LeRobot training configuration
- `policy_preprocessor.json`: saved inference preprocessing pipeline
- `policy_preprocessor_step_5_normalizer_processor.safetensors`: normalization state
- `policy_postprocessor.json`: saved inference postprocessing pipeline
- `policy_postprocessor_step_0_unnormalizer_processor.safetensors`: action unnormalization state

## Evaluation

No evaluation run is reported for this checkpoint. The training configuration had `eval_freq=0`, so no offline evaluation videos, simulated rollouts, or real-robot success metrics are included in this repository.

## Limitations

This policy was trained for one UR7e tabletop stack-block task and assumes the camera setup, action/state convention, object set, and workspace distribution represented in the training dataset. Validate in a controlled workspace before any hardware deployment.