---
license: apache-2.0
library_name: lerobot
pipeline_tag: robotics
tags:
- lerobot
- robotics
- smolvla
- vision-language-action
- imitation-learning
- behavior-cloning
- ur7e
- real-robot
- manipulation
- code-as-policies
- cap
- stack-block
- 10fps
- 50epoch
datasets:
- CoRL2026-CSI/UR7e-CaP-Stack_Block-100epi_10fps
base_model:
- lerobot/smolvla_base
model_name: SmolVLA-UR7e-CaP-StackBlock-50epoch
---
# SmolVLA-UR7e-CaP-StackBlock-50epoch

This repository contains a LeRobot SmolVLA policy fine-tuned for a UR7e Code-as-Policies stack-block task. The policy was trained on demonstrations from [`CoRL2026-CSI/UR7e-CaP-Stack_Block-100epi_10fps`](https://huggingface.co/datasets/CoRL2026-CSI/UR7e-CaP-Stack_Block-100epi_10fps), where the robot builds a stack on a blue dish with the red block on the bottom, the green block in the middle, and the blue block on top.

The checkpoint is intended for research use with LeRobot-compatible inference pipelines. No real-robot or offline success-rate evaluation is included in this model card; the reported metrics are training logs only.

## Model Details

- **Model type:** SmolVLA vision-language-action policy
- **Base policy:** [`lerobot/smolvla_base`](https://huggingface.co/lerobot/smolvla_base)
- **VLM backbone:** `HuggingFaceTB/SmolVLM2-500M-Video-Instruct`
- **Robot:** UR7e
- **Task:** Stack red, green, and blue blocks on a blue dish
- **Training framework:** [LeRobot](https://github.com/huggingface/lerobot)
- **Checkpoint format:** `safetensors`
- **License:** Apache 2.0

## Dataset

The policy was trained on [`CoRL2026-CSI/UR7e-CaP-Stack_Block-100epi_10fps`](https://huggingface.co/datasets/CoRL2026-CSI/UR7e-CaP-Stack_Block-100epi_10fps), a LeRobot v3 dataset collected for the UR7e stack-block task.

Dataset summary:

| Field | Value |
| --- | --- |
| Robot type | `ur7e` |
| Episodes | 100 |
| Frames | 69,932 |
| Dataset FPS | 10 |
| Tasks | 1 |
| Split | `train: 0:100` |
| Cameras | RealSense wrist and top-view RGB video |
| Camera resolution | 480 x 640 RGB video |
| Dataset state/action vectors | 7D joint/gripper vector |

The dataset includes additional skill annotations such as `skill.type`, `skill.progress`, target joint positions, target Cartesian poses, and natural-language skill text. The policy checkpoint uses the LeRobot preprocessing pipeline saved in this repository.
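
As a quick sanity check on the summary table, the frame count and frame rate imply an average episode length and a total recording time. The sketch below uses only the numbers listed in the table; the variable names are illustrative.

```python
# Values copied from the dataset summary table.
total_frames = 69_932
num_episodes = 100
fps = 10

# Average episode length in frames and in seconds.
frames_per_episode = total_frames / num_episodes
seconds_per_episode = frames_per_episode / fps

# Total recorded time across all episodes, in hours.
total_hours = total_frames / fps / 3600

print(f"{frames_per_episode:.2f} frames per episode")
print(f"{seconds_per_episode:.1f} s per episode")
print(f"{total_hours:.2f} h of demonstrations")
```

Each demonstration therefore lasts roughly 70 seconds on average, for just under two hours of data in total.
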

## Policy Inputs and Outputs

The saved policy configuration expects the following model features after preprocessing.

Inputs:

- `observation.state`: 6D state feature
- `observation.images.camera1`: wrist camera, resized/padded for SmolVLA
- `observation.images.camera2`: top-view camera, resized/padded for SmolVLA
- `observation.images.camera3`: visual input slot
- `observation.images.empty_camera_0`: empty camera placeholder

Output:

- `action`: 7D joint/gripper action vector

The included `policy_preprocessor.json` maps dataset camera names to model camera names:

- `observation.images.realsense_wrist` -> `observation.images.camera1`
- `observation.images.realsense_topview` -> `observation.images.camera2`

State and action features use mean/std normalization. Visual features use identity normalization. The postprocessor unnormalizes the `action` output and moves it back to CPU.
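
The renaming and normalization steps described above can be illustrated with a small standalone sketch. This is not LeRobot's processor code: the helper names, the `eps` term, and the example statistics are all assumptions made for illustration, and the real statistics are stored in the `.safetensors` processor files in this repository.

```python
# Camera-name mapping as listed for policy_preprocessor.json.
RENAME_MAP = {
    "observation.images.realsense_wrist": "observation.images.camera1",
    "observation.images.realsense_topview": "observation.images.camera2",
}


def rename_keys(observation, rename_map):
    """Return a copy of the observation with dataset keys renamed to model keys."""
    return {rename_map.get(key, key): value for key, value in observation.items()}


def normalize(values, mean, std, eps=1e-8):
    """Mean/std normalization: (x - mean) / (std + eps), element-wise."""
    return [(x - m) / (s + eps) for x, m, s in zip(values, mean, std)]


def unnormalize(values, mean, std, eps=1e-8):
    """Inverse of normalize, as a postprocessor would apply to the action."""
    return [x * (s + eps) + m for x, m, s in zip(values, mean, std)]


# Hypothetical 7D action and statistics, for illustration only.
action = [0.10, -1.20, 1.50, -0.30, 0.80, 0.00, 1.00]
mean = [0.0, -1.0, 1.4, -0.5, 0.9, 0.1, 0.5]
std = [0.5, 0.4, 0.3, 0.2, 0.3, 0.6, 0.5]

# Keys in the map are renamed; all other keys pass through unchanged.
obs = rename_keys(
    {"observation.images.realsense_wrist": "wrist_frame", "observation.state": "state"},
    RENAME_MAP,
)
# Normalizing and then unnormalizing recovers the original action.
restored = unnormalize(normalize(action, mean, std), mean, std)
```

The round trip at the end shows why the postprocessor needs the same statistics as the preprocessor: the policy predicts actions in normalized units, and only the inverse transform puts them back into joint/gripper units.
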

## Training Details

The final uploaded checkpoint is from step `13700`.

| Setting | Value |
| --- | --- |
| Training steps | 13,700 |
| Approx. epochs | 50.15 |
| Batch size | 128 |
| Effective batch size | 256 |
| Gradient accumulation | 1 |
| Seed | 1000 |
| Optimizer | AdamW |
| Peak learning rate | `1e-4` |
| Weight decay | `1e-10` |
| Gradient clipping | `10.0` |
| Scheduler | Cosine decay with warmup |
| Warmup steps | 1,000 |
| Decay steps | 30,000 |
| Final decay LR | `2.5e-6` |
| AMP | Disabled |
| PEFT | Disabled |
| Vision encoder | Frozen |
| Expert-only training | Enabled |
| State projection training | Enabled |
| Action chunk size | 50 |
| Observation steps | 1 |
| Action steps | 50 |
| Inference denoising steps | 10 |
| Empty camera placeholders | 1 |
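
The approximate epoch count in the table can be reproduced from the step count, the effective batch size, and the dataset frame count, assuming each training sample corresponds to one dataset frame. The same numbers give the duration covered by one action chunk, assuming actions are executed at the dataset frame rate. Both assumptions are mine and are not stated in the training configuration.

```python
# Values copied from the training and dataset tables.
training_steps = 13_700
effective_batch_size = 256
total_frames = 69_932
dataset_fps = 10
action_steps = 50

# One epoch is one pass over all frames, so epochs = samples seen / frames.
samples_seen = training_steps * effective_batch_size
approx_epochs = samples_seen / total_frames

# Assumption: actions are executed at the dataset frame rate.
seconds_per_chunk = action_steps / dataset_fps

print(f"{approx_epochs:.2f} epochs")
print(f"{seconds_per_chunk:.1f} s per chunk")
```

With these numbers the policy saw about 3.5 million samples, and each 50-step chunk spans about 5 seconds of motion between inference calls.
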

Image augmentation was enabled during training with up to two randomly ordered transforms per sample:

- brightness jitter: `[0.8, 1.2]`
- contrast jitter: `[0.8, 1.2]`
- saturation jitter: `[0.5, 1.5]`
- hue jitter: `[-0.05, 0.05]`
- sharpness jitter: `[0.5, 1.5]`
- random affine rotation: `[-5, 5]` degrees
- random affine translation: `0.05`
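
To make the sampling scheme concrete, the sketch below draws two transforms in random order and samples one parameter for each from the ranges listed above. It is a simplified stand-in, not the LeRobot augmentation code: it assumes exactly two transforms are drawn uniformly without replacement, and it leaves out the translation entry because that value is a single maximum fraction rather than a range. All names are hypothetical.

```python
import random

# Parameter ranges copied from the list above.
TRANSFORM_RANGES = {
    "brightness": (0.8, 1.2),
    "contrast": (0.8, 1.2),
    "saturation": (0.5, 1.5),
    "hue": (-0.05, 0.05),
    "sharpness": (0.5, 1.5),
    "affine_rotation_deg": (-5.0, 5.0),
}
MAX_NUM_TRANSFORMS = 2


def sample_augmentation(rng):
    """Pick transforms in random order and draw one parameter for each.

    Assumption: exactly MAX_NUM_TRANSFORMS transforms are drawn uniformly
    without replacement; the real sampler may weight them or draw fewer.
    """
    names = rng.sample(sorted(TRANSFORM_RANGES), k=MAX_NUM_TRANSFORMS)
    return [(name, rng.uniform(*TRANSFORM_RANGES[name])) for name in names]


# A seeded generator keeps the example reproducible.
rng = random.Random(0)
plan = sample_augmentation(rng)
print(plan)
```

Each call returns a short list of `(transform, parameter)` pairs, so different samples in a batch receive different combinations and orderings.
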

Training logs:

| Metric | Value |
| --- | --- |
| Final logged training loss | `0.007` |
| Final logged gradient norm | `0.141` |
| Final logged learning rate | `2.5e-6` |

These values are training-loop logs only and should not be interpreted as task success rates.
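
For context on the logged learning rate, here is a generic cosine-decay-with-warmup schedule using the peak rate, final rate, and warmup length from the settings table. This is a textbook formulation written for illustration, not LeRobot's scheduler code; the function name and the exact warmup shape are assumptions.

```python
import math

PEAK_LR = 1e-4
DECAY_LR = 2.5e-6
WARMUP_STEPS = 1_000


def learning_rate(step, decay_steps, warmup_steps=WARMUP_STEPS):
    """Linear warmup to PEAK_LR, then cosine decay to DECAY_LR at decay_steps."""
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / (warmup_steps + 1)
    # Fraction of the decay horizon completed, capped at 1.0.
    progress = min(step, decay_steps) / decay_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return DECAY_LR + (PEAK_LR - DECAY_LR) * cosine


# Learning rate at the final step under two possible decay horizons.
lr_if_horizon_is_run_length = learning_rate(13_700, decay_steps=13_700)
lr_if_horizon_is_30k = learning_rate(13_700, decay_steps=30_000)
print(f"{lr_if_horizon_is_run_length:.2e}")
print(f"{lr_if_horizon_is_30k:.2e}")
```

Under this formulation, a decay horizon equal to the 13,700-step run length gives the logged final value of `2.5e-6`, whereas a 30,000-step horizon would leave the rate near `5.8e-5` at step 13,700. That is consistent with the decay having been rescaled to the run length, although the card does not state this explicitly.
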

## How to Use

Install LeRobot and load the policy from the Hub:

```python
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Download the policy weights and configuration from the Hub.
policy = SmolVLAPolicy.from_pretrained(
    "CoRL2026-CSI/SmolVLA-UR7e-CaP-StackBlock-50epoch"
)
# Move the policy to the GPU and switch to inference mode.
policy.to("cuda")
policy.eval()
```

For robot rollout or evaluation, use the LeRobot CLI or your existing UR7e control stack with `--policy.path` pointing to this repository:

```bash
lerobot-record \
  --policy.path=CoRL2026-CSI/SmolVLA-UR7e-CaP-StackBlock-50epoch \
  --dataset.repo_id=CoRL2026-CSI/eval_smolvla_ur7e_cap_stack_block_10fps
```

Adjust the robot, camera, and dataset arguments to match the local UR7e deployment setup.

## Files

This repository contains:

- `model.safetensors`: policy weights
- `config.json`: policy configuration
- `train_config.json`: LeRobot training configuration
- `policy_preprocessor.json`: saved inference preprocessing pipeline
- `policy_preprocessor_step_5_normalizer_processor.safetensors`: normalization state
- `policy_postprocessor.json`: saved inference postprocessing pipeline
- `policy_postprocessor_step_0_unnormalizer_processor.safetensors`: action unnormalization state
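
When working from a local copy of the checkpoint, a small check like the following can confirm that every file listed above is present before loading. The helper name `missing_files` and the example directory path are hypothetical; only the file names come from this repository.

```python
from pathlib import Path

# File names copied from the list above.
EXPECTED_FILES = [
    "model.safetensors",
    "config.json",
    "train_config.json",
    "policy_preprocessor.json",
    "policy_preprocessor_step_5_normalizer_processor.safetensors",
    "policy_postprocessor.json",
    "policy_postprocessor_step_0_unnormalizer_processor.safetensors",
]


def missing_files(checkpoint_dir):
    """Return the expected file names that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_files("path/to/local/checkpoint")` returns an empty list when the copy is complete, and otherwise names the files that still need to be downloaded.
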

## Evaluation

No evaluation run is reported for this checkpoint. The training configuration had `eval_freq=0`, so no offline evaluation videos, simulated rollouts, or real-robot success metrics are included in this repository.

## Limitations

This policy was trained for one UR7e tabletop stack-block task and assumes the camera setup, action/state convention, object set, and workspace distribution represented in the training dataset. Validate in a controlled workspace before any hardware deployment.