	---
	license: apache-2.0
	library_name: lerobot
	pipeline_tag: robotics
	tags:
	- lerobot
	- robotics
	- smolvla
	- vision-language-action
	- imitation-learning
	- behavior-cloning
	- ur7e
	- real-robot
	- manipulation
	- code-as-policies
	- cap
	- stack-block
	- 10fps
	- 50epoch
	datasets:
	- CoRL2026-CSI/UR7e-CaP-Stack_Block-100epi_10fps
	base_model:
	- lerobot/smolvla_base
	model_name: SmolVLA-UR7e-CaP-StackBlock-50epoch
	---

	# SmolVLA-UR7e-CaP-StackBlock-50epoch

	This repository contains a LeRobot SmolVLA policy fine-tuned for a UR7e Code-as-Policies stack-block task. The policy was trained on demonstrations from [`CoRL2026-CSI/UR7e-CaP-Stack_Block-100epi_10fps`](https://huggingface.co/datasets/CoRL2026-CSI/UR7e-CaP-Stack_Block-100epi_10fps), where the robot builds a stack on a blue dish with the red block on the bottom, the green block in the middle, and the blue block on top.

	The checkpoint is intended for research use with LeRobot-compatible inference pipelines. No real-robot or offline success-rate evaluation is included in this model card; the reported metrics are training logs only.

	## Model Details

	- Model type: SmolVLA vision-language-action policy
	- Base policy: [`lerobot/smolvla_base`](https://huggingface.co/lerobot/smolvla_base)
	- VLM backbone: `HuggingFaceTB/SmolVLM2-500M-Video-Instruct`
	- Robot: UR7e
	- Task: Stack red, green, and blue blocks on a blue dish
	- Training framework: [LeRobot](https://github.com/huggingface/lerobot)
	- Checkpoint format: `safetensors`
	- License: Apache 2.0

	## Dataset

	The policy was trained on [`CoRL2026-CSI/UR7e-CaP-Stack_Block-100epi_10fps`](https://huggingface.co/datasets/CoRL2026-CSI/UR7e-CaP-Stack_Block-100epi_10fps), a LeRobot v3 dataset collected for the UR7e stack-block task.

	Dataset summary:

	| Field | Value |
	| --- | --- |
	| Robot type | `ur7e` |
	| Episodes | 100 |
	| Frames | 69,932 |
	| Dataset FPS | 10 |
	| Tasks | 1 |
	| Split | `train: 0:100` |
	| Cameras | RealSense wrist and top-view RGB video |
	| Camera resolution | 480 x 640 RGB video |
	| Dataset state/action vectors | 7D joint/gripper vector |

	The dataset includes additional skill annotations such as `skill.type`, `skill.progress`, target joint positions, target Cartesian poses, and natural-language skill text. The policy checkpoint uses the LeRobot preprocessing pipeline saved in this repository.
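
The dataset summary implies a few derived quantities that help when sizing rollouts. The sketch below only does arithmetic on the table values; the function name is illustrative, and the results are averages, since per-episode lengths are not listed in this card.

```python
# Values copied from the dataset summary table.
EPISODES = 100
FRAMES = 69_932
FPS = 10


def mean_episode_stats(frames: int, episodes: int, fps: int) -> tuple[float, float]:
    """Return (mean frames per episode, mean seconds per episode)."""
    mean_frames = frames / episodes
    return mean_frames, mean_frames / fps


mean_frames, mean_seconds = mean_episode_stats(FRAMES, EPISODES, FPS)
total_minutes = FRAMES / FPS / 60  # total recorded time across all episodes

print(f"{mean_frames:.1f} frames/episode, {mean_seconds:.1f} s/episode, "
      f"{total_minutes:.1f} min total")
```

On these numbers an average demonstration is roughly 70 seconds long, and the dataset covers a little under two hours of robot time.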

	## Policy Inputs and Outputs

	After preprocessing, the saved policy configuration expects the following input features:

	- `observation.state`: 6D state feature
	- `observation.images.camera1`: wrist camera, resized/padded for SmolVLA
	- `observation.images.camera2`: top-view camera, resized/padded for SmolVLA
	- `observation.images.camera3`: visual input slot
	- `observation.images.empty_camera_0`: empty camera placeholder

	Output, according to the saved policy config:

	- `action`: 7D joint/gripper action vector

	The included `policy_preprocessor.json` maps dataset camera names to model camera names:

	- `observation.images.realsense_wrist` -> `observation.images.camera1`
	- `observation.images.realsense_topview` -> `observation.images.camera2`
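
The renaming above can be pictured as a key substitution on the observation dictionary. This is a plain-Python sketch of the idea, not LeRobot's processor implementation; `rename_observation_keys` and the placeholder frame are hypothetical.

```python
# Mapping copied from the list above (dataset key -> model key).
CAMERA_RENAME = {
    "observation.images.realsense_wrist": "observation.images.camera1",
    "observation.images.realsense_topview": "observation.images.camera2",
}


def rename_observation_keys(observation: dict, rename_map: dict = CAMERA_RENAME) -> dict:
    """Return a copy with mapped keys renamed; unmapped keys pass through unchanged."""
    return {rename_map.get(key, key): value for key, value in observation.items()}


# Hypothetical frame: placeholder strings stand in for image tensors.
frame = {
    "observation.state": [0.0] * 6,
    "observation.images.realsense_wrist": "<wrist image>",
    "observation.images.realsense_topview": "<top-view image>",
}
renamed = rename_observation_keys(frame)
print(sorted(renamed))
```

When the saved preprocessor is used, this step is applied automatically; the sketch is only useful for checking that a custom control stack emits the expected dataset-side key names.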

	State and action features use mean/std normalization. Visual features use identity normalization. The postprocessor unnormalizes the `action` output and moves it back to CPU.
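
As a rough illustration of mean/std normalization and its inverse, the sketch below round-trips a 7D action. The statistics and the `EPS` constant are made-up assumptions; the real values are stored in the normalizer and unnormalizer `safetensors` files in this repository.

```python
EPS = 1e-8  # assumed small constant guarding against division by zero


def normalize(values, mean, std, eps=EPS):
    """Map raw values to zero-mean, unit-variance space, element by element."""
    return [(v - m) / (s + eps) for v, m, s in zip(values, mean, std)]


def unnormalize(values, mean, std, eps=EPS):
    """Invert `normalize`, as the postprocessor does for the `action` output."""
    return [v * (s + eps) + m for v, m, s in zip(values, mean, std)]


# Hypothetical statistics for a 7D joint/gripper vector (not the real ones).
mean = [0.1, -0.5, 1.2, 0.0, 0.3, -0.2, 0.5]
std = [0.2, 0.3, 0.4, 0.1, 0.25, 0.15, 0.5]

action = [0.15, -0.45, 1.0, 0.05, 0.2, -0.1, 1.0]
restored = unnormalize(normalize(action, mean, std), mean, std)
```

The policy predicts actions in the normalized space, so skipping the postprocessor would send incorrectly scaled commands to the robot.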

	## Training Details

	The final uploaded checkpoint is from step `13700`.

	| Setting | Value |
	| --- | --- |
	| Training steps | 13,700 |
	| Approx. epochs | 50.15 |
	| Batch size | 128 |
	| Effective batch size | 256 |
	| Gradient accumulation | 1 |
	| Seed | 1000 |
	| Optimizer | AdamW |
	| Peak learning rate | `1e-4` |
	| Weight decay | `1e-10` |
	| Gradient clipping | `10.0` |
	| Scheduler | Cosine decay with warmup |
	| Warmup steps | 1,000 |
	| Decay steps | 30,000 |
	| Final decay LR | `2.5e-6` |
	| AMP | Disabled |
	| PEFT | Disabled |
	| Vision encoder | Frozen |
	| Expert-only training | Enabled |
	| State projection training | Enabled |
	| Action chunk size | 50 |
	| Observation steps | 1 |
	| Action steps | 50 |
	| Inference denoising steps | 10 |
	| Empty camera placeholders | 1 |
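
The approximate epoch count can be reproduced from the table, assuming each optimizer step consumes one effective batch and each dataset frame counts as one training sample. The same arithmetic gives the duration of one action chunk, assuming the policy is run at the dataset's 10 FPS.

```python
# Values copied from the training and dataset tables.
TRAINING_STEPS = 13_700
EFFECTIVE_BATCH_SIZE = 256
DATASET_FRAMES = 69_932
DATASET_FPS = 10
ACTION_CHUNK_SIZE = 50

# Assumption: one effective batch per step, one sample per dataset frame.
samples_seen = TRAINING_STEPS * EFFECTIVE_BATCH_SIZE
epochs = samples_seen / DATASET_FRAMES

# Assumption: actions are executed at the dataset frame rate.
chunk_seconds = ACTION_CHUNK_SIZE / DATASET_FPS

print(f"samples seen: {samples_seen:,}")
print(f"approx. epochs: {epochs:.2f}")
print(f"seconds of motion per action chunk: {chunk_seconds:.1f}")
```

With 50 action steps per chunk, each inference call covers about five seconds of motion before the policy observes the scene again.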

	Image augmentation was enabled during training with up to two randomly ordered transforms per sample:

	- brightness jitter: `[0.8, 1.2]`
	- contrast jitter: `[0.8, 1.2]`
	- saturation jitter: `[0.5, 1.5]`
	- hue jitter: `[-0.05, 0.05]`
	- sharpness jitter: `[0.5, 1.5]`
	- random affine rotation: `[-5, 5]` degrees
	- random affine translation: `0.05`
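
To make the sampling scheme concrete, here is a standard-library sketch that draws an augmentation plan from the ranges above. It does not apply any image operation and is not LeRobot's implementation. Reading "up to two" as drawing zero, one, or two distinct transforms is an assumption, and the translation entry is left out because the card gives a single value rather than a range.

```python
import random

# Ranges copied from the list above.
JITTER_RANGES = {
    "brightness": (0.8, 1.2),
    "contrast": (0.8, 1.2),
    "saturation": (0.5, 1.5),
    "hue": (-0.05, 0.05),
    "sharpness": (0.5, 1.5),
    "affine_rotation_deg": (-5.0, 5.0),
}
MAX_TRANSFORMS = 2


def sample_augmentation_plan(rng: random.Random, max_transforms: int = MAX_TRANSFORMS):
    """Pick distinct transforms in random order and draw a strength for each."""
    # Assumption: "up to two" means 0, 1, or 2 transforms per sample.
    count = rng.randint(0, max_transforms)
    names = rng.sample(sorted(JITTER_RANGES), count)
    return [(name, rng.uniform(*JITTER_RANGES[name])) for name in names]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
plan = sample_augmentation_plan(rng)
print(plan)
```

The exact selection logic used during training is recorded in `train_config.json`, which is the authoritative source.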

	Training logs:

	| Metric | Value |
	| --- | --- |
	| Final logged training loss | `0.007` |
	| Final logged gradient norm | `0.141` |
	| Final logged learning rate | `2.5e-6` |

	These values are training-loop logs only and should not be interpreted as task success rates.

	## How to Use

	Install LeRobot and load the policy from the Hub:

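	```python
	import torch
	from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

	# Download the checkpoint from the Hub and build the policy.
	policy = SmolVLAPolicy.from_pretrained(
	    "CoRL2026-CSI/SmolVLA-UR7e-CaP-StackBlock-50epoch"
	)

	# Use the GPU when one is available; otherwise fall back to CPU.
	device = "cuda" if torch.cuda.is_available() else "cpu"
	policy.to(device)
	policy.eval()  # switch to inference mode
	```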

	For robot rollout or evaluation, use the LeRobot CLI or your existing UR7e control stack with `--policy.path` pointing to this repository:

	```bash
	lerobot-record \
	    --policy.path=CoRL2026-CSI/SmolVLA-UR7e-CaP-StackBlock-50epoch \
	    --dataset.repo_id=CoRL2026-CSI/eval_smolvla_ur7e_cap_stack_block_10fps
	```

	Adjust the robot, camera, and dataset arguments to match the local UR7e deployment setup.

	## Files

	This repository contains:

	- `model.safetensors`: policy weights
	- `config.json`: policy configuration
	- `train_config.json`: LeRobot training configuration
	- `policy_preprocessor.json`: saved inference preprocessing pipeline
	- `policy_preprocessor_step_5_normalizer_processor.safetensors`: normalization state
	- `policy_postprocessor.json`: saved inference postprocessing pipeline
	- `policy_postprocessor_step_0_unnormalizer_processor.safetensors`: action unnormalization state
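
Before loading a locally downloaded copy, it can help to confirm that every file listed above is present. The helper below is a small sketch using only the standard library; `missing_files` and the example path are hypothetical names, not part of LeRobot.

```python
from pathlib import Path

# File names copied from the list above.
EXPECTED_FILES = (
    "model.safetensors",
    "config.json",
    "train_config.json",
    "policy_preprocessor.json",
    "policy_preprocessor_step_5_normalizer_processor.safetensors",
    "policy_postprocessor.json",
    "policy_postprocessor_step_0_unnormalizer_processor.safetensors",
)


def missing_files(checkpoint_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from `checkpoint_dir`."""
    root = Path(checkpoint_dir)
    return [name for name in expected if not (root / name).is_file()]


# Example call (hypothetical path):
# print(missing_files("path/to/local/checkpoint"))
```

An empty list means the local copy contains the weights, both configs, and the saved pre- and postprocessing state.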

	## Evaluation

	No evaluation run is reported for this checkpoint. The training configuration had `eval_freq=0`, so no offline evaluation videos, simulated rollouts, or real-robot success metrics are included in this repository.

	## Limitations

	This policy was trained for one UR7e tabletop stack-block task and assumes the camera setup, action/state convention, object set, and workspace distribution represented in the training dataset. Validate in a controlled workspace before any hardware deployment.