How to use azhicles/FineTunedSmolVLA with LeRobot:
```bash
# See https://github.com/huggingface/lerobot?tab=readme-ov-file#installation for more details
git clone https://github.com/huggingface/lerobot.git
cd lerobot
pip install -e ".[smolvla]"
```

```bash
# Launch finetuning on your dataset
python lerobot/scripts/train.py \
  --policy.path=azhicles/FineTunedSmolVLA \
  --dataset.repo_id=lerobot/svla_so101_pickplace \
  --batch_size=64 \
  --steps=20000 \
  --output_dir=outputs/train/my_smolvla \
  --job_name=my_smolvla_training \
  --policy.device=cuda \
  --wandb.enable=true
```

```bash
# Run the policy using the record function.
# Adjust these flags for your setup before running:
#   --robot.port           your serial port
#   --robot.id             your robot id
#   --robot.cameras        your cameras
#   --dataset.single_task  the same task description you used in your dataset recording
#   --dataset.repo_id      the dataset name on the HF Hub
python -m lerobot.record \
  --robot.type=so101_follower \
  --robot.port=/dev/ttyACM0 \
  --robot.id=my_blue_follower_arm \
  --robot.cameras="{ front: {type: opencv, index_or_path: 8, width: 640, height: 480, fps: 30}}" \
  --dataset.single_task="Grasp a lego block and put it in the bin." \
  --dataset.repo_id=HF_USER/dataset_name \
  --dataset.episode_time_s=50 \
  --dataset.num_episodes=10 \
  --policy.path=azhicles/FineTunedSmolVLA
```
SmolVLA fine-tuned on SO-100 tic-tac-toe (pick-and-place)
Two full fine-tunes of lerobot/smolvla_base (450M)
on a LeRobot SO-100 "tic-tac-toe" dataset: 180 episodes, 9 pick-and-place tasks
("place the blue cross cube in the <cell> box"), top + wrist cameras (640×480 @ 30 fps), 6-DOF
joint actions. Seed-fixed split: 144 train / 18 val / 18 test.
Camera mapping used in training: top → observation.images.camera1, wrist → observation.images.camera2.
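The seed-fixed 144 / 18 / 18 split described above could be reproduced along these lines. This is a minimal sketch, not the script used for these checkpoints: the function name `split_episodes`, the seed value `0`, and the plain (non-stratified) shuffle are all assumptions.

```python
import random


def split_episodes(num_episodes=180, n_val=18, n_test=18, seed=0):
    """Shuffle episode indices with a fixed seed and cut train/val/test.

    The seed value and the plain shuffle are hypothetical; the model card
    only states that the split is seed-fixed with sizes 144 / 18 / 18.
    """
    indices = list(range(num_episodes))
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(indices)
    test = sorted(indices[:n_test])
    val = sorted(indices[n_test:n_test + n_val])
    train = sorted(indices[n_test + n_val:])
    return train, val, test


train, val, test = split_episodes()
print(len(train), len(val), len(test))  # 144 18 18
```

Because the shuffle uses its own seeded generator, calling the function again with the same seed returns the same three lists.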
A note on the third camera
The underlying dataset has only two cameras (top, wrist). The model config, however, lists
three image inputs (camera1, camera2, camera3); this 3-view layout is inherited from the
lerobot/smolvla_base pretrained model, not from the data:
| config input | source | content |
|---|---|---|
| observation.images.camera1 | dataset top view | real |
| observation.images.camera2 | dataset wrist view | real |
| observation.images.camera3 | none | padded empty/black image (no information) |
camera3 is a placeholder kept only to match the base model's expected input shape; SmolVLA handles
the missing view by substituting a blank image. (Note: the saved config.json shows empty_cameras: 0
because the slot comes straight from the base model's input_features rather than being added as an
explicit empty camera; behaviour is unchanged either way.)
For inference you only need to provide the two real views (camera1=top, camera2=wrist),
observation.state, and the task string; the third slot is padded automatically.
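The slot layout in the table above can be expressed as a small check that reports which image inputs carry real data and which are padded. This is an illustrative sketch only: `plan_camera_slots` is a hypothetical helper, not part of LeRobot, and the actual padding happens inside the policy.

```python
# Image inputs listed in the model config (from the table above).
EXPECTED_IMAGE_KEYS = (
    "observation.images.camera1",  # dataset top view
    "observation.images.camera2",  # dataset wrist view
    "observation.images.camera3",  # no dataset source, padded
)


def plan_camera_slots(provided_keys):
    """Label each expected image input as "real" or "padded".

    Hypothetical helper for illustration; it does not touch any image data.
    """
    provided = set(provided_keys)
    unknown = provided - set(EXPECTED_IMAGE_KEYS)
    if unknown:
        raise KeyError(f"unexpected image keys: {sorted(unknown)}")
    return {
        key: ("real" if key in provided else "padded")
        for key in EXPECTED_IMAGE_KEYS
    }


plan = plan_camera_slots(
    ["observation.images.camera1", "observation.images.camera2"]
)
for key, status in plan.items():
    print(key, status)
```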
Checkpoints
| folder | training data | eval | major-4 joint corr |
|---|---|---|---|
| in_distribution/ | all 9 cells, 20k steps (loss 0.009) | all 9 cells (in-distribution fit) | 0.983 (per-joint all >0.96, MAE ~1.8°) |
| heldout_cells/ | 6 cells, 15k steps (best ckpt) | 3 held-out cells (top-left / center / middle-right) | 0.928 (MAE 3.2°) |
The held-out-cell run demonstrates that a pretrained VLA generalizes the pick-place skill across board positions (0.928 on unseen cells vs 0.983 in-distribution), far above a bespoke Cosmos3 action policy's 0.715 on seen cells. Generalization improves with training, with no overfitting collapse.
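For readers who want to compute comparable numbers, the sketch below shows one way to get per-joint Pearson correlation and mean absolute error between predicted and recorded joint trajectories. It is an assumption-laden illustration, not the evaluation code behind the table: the function names are hypothetical, and `MAJOR_JOINTS = (0, 1, 2, 3)` is only a guess at which joints "major-4" refers to.

```python
import math

# Hypothetical: the card does not say which four joints are the "major" ones.
MAJOR_JOINTS = (0, 1, 2, 3)


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def joint_metrics(pred, target):
    """Return (per-joint correlations, per-joint MAE in the input units).

    pred and target are lists of per-timestep joint vectors of equal shape.
    """
    num_joints = len(target[0])
    corrs, maes = [], []
    for j in range(num_joints):
        p = [row[j] for row in pred]
        t = [row[j] for row in target]
        corrs.append(pearson(p, t))
        maes.append(sum(abs(a - b) for a, b in zip(p, t)) / len(t))
    return corrs, maes


def mean_over(values, joint_indices=MAJOR_JOINTS):
    """Average a per-joint metric over a subset of joints."""
    return sum(values[j] for j in joint_indices) / len(joint_indices)


# Toy data: 5 timesteps, 6 joints, prediction offset by a constant 1 degree.
target = [[float(t * (j + 1)) for j in range(6)] for t in range(5)]
pred = [[v + 1.0 for v in row] for row in target]
corrs, maes = joint_metrics(pred, target)
print(mean_over(corrs), mean_over(maes))
```

A constant offset leaves the correlation at 1 while the MAE equals the offset, which is why the table reports both numbers.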
Usage
Each folder is a standard LeRobot SmolVLA pretrained_model/ (model + pre/post-processors). Load with
LeRobot 0.4.4+:

```python
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

policy = SmolVLAPolicy.from_pretrained("azhicles/smolvla-so100-tictactoe", subfolder="in_distribution")
```
Provide observations as observation.images.camera1 (top), observation.images.camera2 (wrist),
observation.state (6-DOF), and the task language string; the policy returns a chunked 6-DOF action.
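To illustrate what "chunked" means for the control loop, the sketch below consumes one action per step from a queue and runs a new forward pass only when the queue is empty. It is a standalone illustration with a stand-in model: `ChunkedActionRunner`, `fake_predict_chunk`, the chunk size of 4, and the `"task"` key name are assumptions, and LeRobot's policy may already handle this queueing internally.

```python
from collections import deque


class ChunkedActionRunner:
    """Serve one action per step, refilling from the model once per chunk."""

    def __init__(self, predict_chunk):
        # predict_chunk is a hypothetical callable: observation -> list of actions.
        self.predict_chunk = predict_chunk
        self.queue = deque()
        self.num_forwards = 0

    def next_action(self, observation):
        if not self.queue:
            chunk = self.predict_chunk(observation)
            if not chunk:
                raise ValueError("predict_chunk returned an empty chunk")
            self.queue.extend(chunk)
            self.num_forwards += 1
        return self.queue.popleft()


def fake_predict_chunk(observation, chunk_size=4):
    """Stand-in for the policy: returns chunk_size zero-valued 6-DOF actions."""
    return [[0.0] * 6 for _ in range(chunk_size)]


# Observation keys follow the card; the values here are placeholders.
observation = {
    "observation.images.camera1": None,  # top view
    "observation.images.camera2": None,  # wrist view
    "observation.state": [0.0] * 6,
    "task": "place the blue cross cube in the center box",  # key name assumed
}

runner = ChunkedActionRunner(fake_predict_chunk)
for _ in range(10):
    action = runner.next_action(observation)
print(runner.num_forwards)  # 3
```

With a chunk of 4 actions, 10 control steps need only 3 forward passes, which is why per-episode latency is reported in terms of chunked forwards rather than individual steps.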
Inference speed
~4.70 s / episode (16 chunked forwards @ 0.284 s) on 1× NVIDIA GB300; faster and more accurate than the bespoke Cosmos3 action policy baseline.