SmolVLA fine-tuned on SO-100 tic-tac-toe (pick-and-place)

Two full fine-tunes of lerobot/smolvla_base (450M) on a LeRobot SO-100 "tic-tac-toe" dataset — 180 episodes, 9 pick-and-place tasks ("place the blue cross cube in the <cell> box"), top + wrist cameras (640×480 @30 fps), 6-DOF joint actions. Seed-fixed split: 144 train / 18 val / 18 test.
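
A seed-fixed episode split of this shape can be sketched as follows. This is a minimal illustration, not the script used for these runs: the seed value, the shuffling method, and the absence of per-task stratification are all assumptions, and `split_episodes` is a hypothetical helper name.

```python
import random


def split_episodes(num_episodes=180, n_val=18, n_test=18, seed=0):
    """Shuffle episode indices with a fixed seed and cut them into train/val/test.

    The seed (0) and the plain shuffle are assumptions for illustration only.
    """
    indices = list(range(num_episodes))
    # A private RNG keeps the split reproducible regardless of global random state.
    random.Random(seed).shuffle(indices)
    test = sorted(indices[:n_test])
    val = sorted(indices[n_test:n_test + n_val])
    train = sorted(indices[n_test + n_val:])
    return train, val, test


train, val, test = split_episodes()
print(len(train), len(val), len(test))  # 144 18 18
```

Calling `split_episodes()` again with the same seed returns the same three lists, which is what keeps the validation and test episodes out of training across runs.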

Camera mapping used in training: top → observation.images.camera1, wrist → observation.images.camera2.

A note on the third camera

The underlying dataset has only two cameras (top, wrist). The model config, however, lists three image inputs (camera1, camera2, camera3) — this 3-view layout is inherited from the lerobot/smolvla_base pretrained model, not from the data:

| config input | source | content |
| --- | --- | --- |
| `observation.images.camera1` | dataset `top` view | real |
| `observation.images.camera2` | dataset `wrist` view | real |
| `observation.images.camera3` | none | padded empty/black image (no information) |

camera3 is a placeholder kept only to match the base model's expected input shape; the missing view is filled with a blank image, which SmolVLA tolerates. (Note: the saved config.json shows empty_cameras: 0 because the slot comes straight from the base model's input_features rather than being added as an explicit empty camera; behaviour is the same either way.)

For inference you only need to provide the two real views (camera1=top, camera2=wrist), observation.state, and the task string — the third slot is padded automatically.

Checkpoints

| folder | training data | evaluation set | major-4 joint correlation |
| --- | --- | --- | --- |
| `in_distribution/` | all 9 cells, 20k steps (loss 0.009) | all 9 cells (in-distribution fit) | 0.983 (per-joint all >0.96, MAE ~1.8°) |
| `heldout_cells/` | 6 cells, 15k steps (best ckpt) | 3 held-out cells (top-left / center / middle-right) | 0.928 (MAE 3.2°) |

The held-out-cell run shows that the pretrained VLA transfers the pick-and-place skill to board positions it was not trained on: major-4 joint correlation is 0.928 on the 3 unseen cells versus 0.983 in-distribution, well above the 0.715 that a bespoke Cosmos3 action policy reaches on seen cells. Held-out performance kept improving as training progressed, with no overfitting collapse.
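
For readers who want to compute comparable numbers, the sketch below shows one plausible way to get a per-joint Pearson correlation and mean absolute error between predicted and recorded joint trajectories. It is an assumption about the metric, not the evaluation code behind the table: this card does not say which four joints "major-4" covers or how the per-joint values are aggregated, so the joint indices `(0, 1, 2, 3)` and the simple averaging are placeholders, and `joint_metrics` is a hypothetical helper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (undefined if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def joint_metrics(pred, target, joints):
    """Score selected joints of two trajectories given as lists of per-step joint vectors.

    Returns (mean correlation, mean MAE, per-joint correlations).
    Averaging over joints is an assumption about how the headline number is formed.
    """
    corrs, maes = [], []
    for j in joints:
        p = [step[j] for step in pred]
        t = [step[j] for step in target]
        corrs.append(pearson(p, t))
        maes.append(sum(abs(a - b) for a, b in zip(p, t)) / len(p))
    return sum(corrs) / len(corrs), sum(maes) / len(maes), corrs


# Toy data in degrees: the prediction tracks the target with a constant 1 degree offset.
target = [
    [0.0, 10.0, 20.0, 30.0, 0.0, 0.0],
    [10.0, 20.0, 40.0, 20.0, 0.0, 0.0],
    [20.0, 30.0, 60.0, 10.0, 0.0, 0.0],
]
pred = [[value + 1.0 for value in step] for step in target]

mean_corr, mean_mae, per_joint = joint_metrics(pred, target, joints=(0, 1, 2, 3))
print(round(mean_corr, 3), round(mean_mae, 3))  # 1.0 1.0
```

The toy example also illustrates why the table reports both numbers: a constant offset leaves the correlation at 1.0 while the MAE still registers the 1° error.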

Usage

Each folder is a standard LeRobot SmolVLA pretrained_model/ (model + pre/post-processors). Load with LeRobot 0.4.4+:

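```python
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Load the in-distribution checkpoint; use subfolder="heldout_cells" for the held-out-cell run.
policy = SmolVLAPolicy.from_pretrained("azhicles/smolvla-so100-tictactoe", subfolder="in_distribution")
```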
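
If `from_pretrained` in your installed LeRobot version does not accept `subfolder`, one possible workaround is to download only the wanted folder with `huggingface_hub.snapshot_download` and load it from the local path. This is a sketch under that assumption rather than the documented loading path for this repo; `checkpoint_patterns` and `load_policy` are hypothetical helper names.

```python
from pathlib import Path

REPO_ID = "azhicles/smolvla-so100-tictactoe"  # repo id as written in the snippet above


def checkpoint_patterns(subfolder):
    """Glob patterns that restrict a Hub download to one checkpoint folder."""
    return [f"{subfolder}/*"]


def load_policy(subfolder="in_distribution", repo_id=REPO_ID):
    """Download one checkpoint folder and load it as a local pretrained_model directory."""
    # Imported lazily so the pattern helper stays usable without these packages installed.
    from huggingface_hub import snapshot_download
    from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

    # Fetch only the files under the chosen folder instead of the whole repo.
    local_root = snapshot_download(
        repo_id=repo_id,
        allow_patterns=checkpoint_patterns(subfolder),
    )
    policy = SmolVLAPolicy.from_pretrained(Path(local_root) / subfolder)
    policy.eval()  # inference mode
    return policy
```

With this in place, `load_policy("heldout_cells")` would return the held-out-cell checkpoint instead.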

Provide observations as observation.images.camera1 (top), observation.images.camera2 (wrist), observation.state (6-DOF), and the task language string; the policy returns a chunked 6-DOF action.
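
The input layout described above can be sketched as a small helper that assembles and sanity-checks one observation. Treat it as an illustration only: `build_observation` is a hypothetical function, the `"task"` key name is an assumption about what the policy's preprocessor expects, the example task string is an assumed instance of the dataset's task template, and tensor conversion, normalization, and batching are left to the saved pre-processor.

```python
def build_observation(top_image, wrist_image, state, task):
    """Assemble one observation using the camera mapping from this card.

    Images are passed through untouched; only the state length and the task
    string are checked. camera3 is intentionally omitted.
    """
    if len(state) != 6:
        raise ValueError(f"expected a 6-DOF state, got {len(state)} values")
    if not isinstance(task, str) or not task.strip():
        raise ValueError("task must be a non-empty instruction string")
    return {
        "observation.images.camera1": top_image,    # top view
        "observation.images.camera2": wrist_image,  # wrist view
        "observation.state": state,
        "task": task,  # assumed key name for the language instruction
    }


# Placeholder strings stand in for real 640x480 camera frames here.
obs = build_observation(
    top_image="<top frame>",
    wrist_image="<wrist frame>",
    state=[0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    task="place the blue cross cube in the center box",
)
print(sorted(obs))
```

Checking the state length up front gives a clearer error than a shape mismatch raised deep inside the model.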

Inference speed

About 4.70 s per episode (16 chunked forward passes at 0.284 s each) on 1× NVIDIA GB300, which is faster and more accurate than the bespoke Cosmos3 action policy baseline.
