usesmol #16 by boluool - opened

README.md CHANGED
@@ -1,146 +1,60 @@

---
language:
- en
library_name: lerobot
pipeline_tag: robotics
tags:
---

# SmolVLA

SmolVLA is a compact, efficient Vision-Language-Action (VLA) model designed for affordable robotics: it can be trained on a single GPU and deployed on consumer hardware, while matching the performance of much larger VLAs by leveraging community-driven data.

**Original paper:** [SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics](https://arxiv.org/abs/2506.01844)

**Reference implementation:** https://github.com/huggingface/lerobot

- **Outputs:** continuous actions
- **Training objective:** flow matching (see the sketch below)
- **Action representation:** continuous
- **Intended use:** base model to fine-tune on your specific use case
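
To make the flow-matching objective concrete, here is a minimal, illustrative PyTorch sketch (not SmolVLA's actual training code; `model` and its call signature are hypothetical): the network regresses the velocity that transports Gaussian noise to expert action chunks along a straight-line path.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, actions, cond):
    """Illustrative flow-matching loss: `model` predicts a velocity field.

    actions: (B, T, D) expert action chunk; cond: conditioning features.
    """
    noise = torch.randn_like(actions)                            # x_0 ~ N(0, I)
    t = torch.rand(actions.shape[0], 1, 1, device=actions.device)  # time in [0, 1]
    x_t = (1 - t) * noise + t * actions                          # straight-line interpolation
    target_velocity = actions - noise                            # d(x_t)/dt along that path
    pred_velocity = model(x_t, t, cond)                          # hypothetical signature
    return F.mse_loss(pred_velocity, target_velocity)
```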

### Installation

```bash
pip install "lerobot[smolvla]"
```

For full installation details (including optional video dependencies such as ffmpeg for torchcodec), see the official documentation: https://huggingface.co/docs/lerobot/installation

```python
import torch

from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.policies.factory import make_pre_post_processors
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Load a policy
model_id = "lerobot/smolvla_base"  # <- swap checkpoint
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

policy = SmolVLAPolicy.from_pretrained(model_id).to(device).eval()

preprocess, postprocess = make_pre_post_processors(
    policy.config,
    model_id,
    preprocessor_overrides={"device_processor": {"device": str(device)}},
)

# Load a LeRobotDataset
dataset = LeRobotDataset("lerobot/libero")

# Pick an episode
episode_index = 0

# Each episode corresponds to a contiguous range of frame indices
from_idx = dataset.meta.episodes["dataset_from_index"][episode_index]
to_idx = dataset.meta.episodes["dataset_to_index"][episode_index]

# Get a single frame from that episode (e.g. the first frame)
frame_index = from_idx
frame = dict(dataset[frame_index])

batch = preprocess(frame)
with torch.inference_mode():
    pred_action = policy.select_action(batch)
# Postprocess the predicted action, e.g. unnormalize or detokenize it
pred_action = postprocess(pred_action)
```
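
LeRobot policies generally predict a chunk of future actions and `select_action` serves them one step at a time from an internal queue, so reset that state between episodes. A hedged usage sketch building on the variables above (the chunking detail follows the general LeRobot policy API, not SmolVLA-specific documentation):

```python
# Roll the policy over the whole episode, resetting internal state first.
policy.reset()  # clears the policy's queued actions between episodes
for frame_index in range(from_idx, to_idx):
    batch = preprocess(dict(dataset[frame_index]))
    with torch.inference_mode():
        pred_action = postprocess(policy.select_action(batch))
```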

If you’re training / fine-tuning, you typically call `forward(...)` to get a loss and then:

```python
policy.train()
batch = dict(dataset[0])
batch = preprocess(batch)

loss, outputs = policy.forward(batch)
loss.backward()
```
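
To turn that into a complete update step, wrap it with an optimizer. A minimal sketch, where the AdamW choice and learning rate are illustrative rather than the hyperparameters used to train `smolvla_base`:

```python
import torch

optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-4)  # illustrative hyperparameters

optimizer.zero_grad()
loss, outputs = policy.forward(batch)  # same call as above
loss.backward()
optimizer.step()
```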

>
> - Some policies expose `policy(**batch)` or return a dict; keep this snippet aligned with the policy API.
> - Use your trainer script (`lerobot-train`) for full training loops.

## How to train / fine-tune

```bash
lerobot-train \
  ... \
  --policy.device=cuda \
  ... \
  --batch_size=4
```

For instance, run this command to run inference and record 10 evaluation episodes:

```bash
lerobot-record \
  --robot.type=so100_follower \
  --robot.port=/dev/ttyACM1 \
  --robot.cameras="{ up: {type: opencv, index_or_path: /dev/video10, width: 640, height: 480, fps: 30}, side: {type: intelrealsense, serial_number_or_name: 233522074606, width: 640, height: 480, fps: 30}}" \
  --robot.id=my_awesome_follower_arm \
  --display_data=false \
  --dataset.repo_id=${HF_USER}/eval_so100 \
  --dataset.single_task="Put lego brick into the transparent box" \
  # <- Teleop optional if you want to teleoperate in between episodes \
  # --teleop.type=so100_leader \
  # --teleop.port=/dev/ttyACM0 \
  # --teleop.id=my_awesome_leader_arm \
  --policy.path=${HF_USER}/my_policy
```

---
pipeline_tag: robotics
tags:
- smolvla
library_name: lerobot
datasets:
- lerobot/svla_so101_pickplace
---

## SmolVLA: A vision-language-action model for affordable and efficient robotics

Resources and technical documentation:

[SmolVLA Paper](https://huggingface.co/papers/2506.01844)

[SmolVLA Blogpost](https://huggingface.co/blog/smolvla)

[Code](https://github.com/huggingface/lerobot/blob/main/lerobot/common/policies/smolvla/modeling_smolvla.py)

[Train using Google Colab Notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/lerobot/training-smolvla.ipynb#scrollTo=ZO52lcQtxseE)

[SmolVLA HF Documentation](https://huggingface.co/docs/lerobot/smolvla)

Designed by Hugging Face.

This model has 450M parameters in total.
You can use it inside the [LeRobot library](https://github.com/huggingface/lerobot).

Before proceeding to the next steps, you need to properly set up your environment by following the [Installation Guide](https://huggingface.co/docs/lerobot/installation) in the docs.

Install smolvla extra dependencies:
```bash
pip install -e ".[smolvla]"
```
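
With the environment set up, loading the pretrained policy takes a few lines. A minimal sketch mirroring the inference example from the previous revision of this card (`SmolVLAPolicy` is the class linked under "Code" above; `from_pretrained` follows the usual Hugging Face pattern):

```python
import torch
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Load the pretrained 450M-parameter checkpoint, on GPU when available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base").to(device).eval()
```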

Example of fine-tuning the pretrained SmolVLA model (`smolvla_base`):
```bash
python lerobot/scripts/train.py \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=lerobot/svla_so101_pickplace \
  --batch_size=64 \
  --steps=20000 \
  --output_dir=outputs/train/my_smolvla \
  --job_name=my_smolvla_training \
  --policy.device=cuda \
  --wandb.enable=true
```

Example of training the SmolVLA neural network with the pretrained VLM and the action expert initialized from scratch:
```bash
python lerobot/scripts/train.py \
  --policy.type=smolvla \
  --dataset.repo_id=lerobot/svla_so101_pickplace \
  --batch_size=64 \
  --steps=200000 \
  --output_dir=outputs/train/my_smolvla \
  --job_name=my_smolvla_training \
  --policy.device=cuda \
  --wandb.enable=true
```