SmolVLA fine-tuned on SO-100 tic-tac-toe (pick-and-place)

Two full fine-tunes of lerobot/smolvla_base (450M) on a LeRobot SO-100 "tic-tac-toe" dataset: 180 episodes, 9 pick-and-place tasks ("place the blue cross cube in the <cell> box"), top and wrist cameras (640×480 at 30 fps), and 6-DOF joint actions. Seed-fixed split: 144 train / 18 val / 18 test.
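The split sizes above can be reproduced with a seeded shuffle over episode indices. The following is a minimal sketch, not the script used for these checkpoints: the seed value, the function name `split_episodes`, and the order in which the subsets are cut are all assumptions.

```python
import random

def split_episodes(num_episodes=180, n_val=18, n_test=18, seed=0):
    """Shuffle episode indices with a fixed seed and cut them into train/val/test.

    The seed (0) is a placeholder; the seed used for the released split is not
    stated in this card.
    """
    indices = list(range(num_episodes))
    random.Random(seed).shuffle(indices)  # local RNG, global random state untouched
    test = sorted(indices[:n_test])
    val = sorted(indices[n_test:n_test + n_val])
    train = sorted(indices[n_test + n_val:])
    return train, val, test

train_eps, val_eps, test_eps = split_episodes()
print(len(train_eps), len(val_eps), len(test_eps))  # 144 18 18
```

Fixing the seed keeps the three subsets identical across runs, which is what makes validation and test numbers comparable between the two fine-tunes.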

Camera mapping used in training: top → observation.images.camera1, wrist → observation.images.camera2.
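If your own recording pipeline names the views differently, the mapping can be applied as a plain key rename before the observation reaches the policy. This is a sketch only: the dataset-side key names `observation.images.top` and `observation.images.wrist` and the helper `rename_camera_keys` are assumptions, so check the actual keys in your dataset.

```python
# Assumed dataset-side key names; only the camera1/camera2 targets come from this card.
CAMERA_KEY_MAP = {
    "observation.images.top": "observation.images.camera1",
    "observation.images.wrist": "observation.images.camera2",
}

def rename_camera_keys(frame, key_map=CAMERA_KEY_MAP):
    """Return a copy of `frame` with dataset camera keys renamed to policy keys."""
    return {key_map.get(key, key): value for key, value in frame.items()}

# Placeholder values stand in for real image tensors and joint states.
frame = {
    "observation.images.top": "top_image",
    "observation.images.wrist": "wrist_image",
    "observation.state": [0.0] * 6,
}
renamed = rename_camera_keys(frame)
print(sorted(renamed))
```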

A note on the third camera

The underlying dataset has only two cameras (top, wrist). The model config, however, lists three image inputs (camera1, camera2, camera3). This 3-view layout is inherited from the lerobot/smolvla_base pretrained model, not from the data:

| config input | source | content |
| --- | --- | --- |
| observation.images.camera1 | dataset top view | real |
| observation.images.camera2 | dataset wrist view | real |
| observation.images.camera3 | none | padded empty/black image (no information) |

camera3 is a placeholder kept only to match the base model's expected input shape; SmolVLA tolerates the missing view by feeding it a blank image. (Note: the saved config.json shows empty_cameras: 0 because the slot comes straight from the base model's input_features rather than being added as an explicit empty camera; behaviour is unchanged either way.)

For inference you only need to provide the two real views (camera1=top, camera2=wrist), observation.state, and the task string; the third slot is padded automatically.
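To make the padding concrete, here is an illustrative sketch of what "filling the missing view with a blank image" amounts to. It is not LeRobot's internal implementation: the helper `pad_missing_cameras`, the channel-first float layout, and the fill value of 0.0 are assumptions chosen for the example.

```python
import numpy as np

EXPECTED_CAMERAS = (
    "observation.images.camera1",
    "observation.images.camera2",
    "observation.images.camera3",
)

def pad_missing_cameras(obs, expected_keys=EXPECTED_CAMERAS, fill_value=0.0):
    """Return a copy of `obs` with an all-black image for each absent camera key."""
    present = [key for key in expected_keys if key in obs]
    if not present:
        raise ValueError("at least one real camera view is required")
    reference = obs[present[0]]  # blank views copy this shape and dtype
    padded = dict(obs)
    for key in expected_keys:
        if key not in padded:
            padded[key] = np.full_like(reference, fill_value)
    return padded

# Two real views at 640x480, channel-first (assumed layout), plus the joint state.
obs = {
    "observation.images.camera1": np.random.rand(3, 480, 640).astype(np.float32),
    "observation.images.camera2": np.random.rand(3, 480, 640).astype(np.float32),
    "observation.state": np.zeros(6, dtype=np.float32),
}
padded = pad_missing_cameras(obs)
print(padded["observation.images.camera3"].shape)
```

You should not need to do this yourself when loading the checkpoints as described; the sketch only shows why the third slot carries no information.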

Checkpoints

| folder | training data | eval | major-4 joint corr |
| --- | --- | --- | --- |
| in_distribution/ | all 9 cells, 20k steps (loss 0.009) | all 9 cells (in-distribution fit) | 0.983 (per-joint all >0.96, MAE ~1.8°) |
| heldout_cells/ | 6 cells, 15k steps (best ckpt) | 3 held-out cells (top-left / center / middle-right) | 0.928 (MAE 3.2°) |

The held-out-cell run shows that a pretrained VLA generalizes the pick-and-place skill across board positions: it reaches 0.928 on the three unseen cells versus 0.983 in-distribution, far above the 0.715 that a bespoke Cosmos3 action policy reaches on seen cells. Held-out performance keeps climbing with training, with no overfitting collapse.
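The card does not define the "major-4 joint corr" metric precisely. One plausible reading, sketched below, is the Pearson correlation between predicted and recorded joint trajectories, computed per joint and averaged over four main joints, with MAE reported in degrees. The choice of joints (indices 0 to 3), the function names, and whether MAE covers the same four joints are all assumptions.

```python
import numpy as np

def per_joint_pearson(pred, target):
    """Pearson correlation per joint for arrays of shape (timesteps, joints).

    A joint that never moves has zero variance and yields NaN here.
    """
    pred_c = pred - pred.mean(axis=0)
    target_c = target - target.mean(axis=0)
    num = (pred_c * target_c).sum(axis=0)
    den = np.sqrt((pred_c ** 2).sum(axis=0) * (target_c ** 2).sum(axis=0))
    return num / den

def summarize(pred, target, major_joints=(0, 1, 2, 3)):
    """Average correlation and MAE over the assumed 'major' joints."""
    idx = list(major_joints)
    corr = per_joint_pearson(pred, target)
    mae = np.abs(pred - target).mean(axis=0)
    return {
        "major_corr": float(corr[idx].mean()),
        "major_mae": float(mae[idx].mean()),
        "per_joint_corr": corr,
    }

# Synthetic trajectories only; these are not the evaluation data behind the table.
rng = np.random.default_rng(0)
target = rng.normal(size=(200, 6))
pred = target + 0.1 * rng.normal(size=(200, 6))
summary = summarize(pred, target)
print(summary["major_corr"], summary["major_mae"])
```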

Usage

Each folder is a standard LeRobot SmolVLA pretrained_model/ (model + pre/post-processors). Load with LeRobot 0.4.4+:

from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy
policy = SmolVLAPolicy.from_pretrained("azhicles/smolvla-so100-tictactoe", subfolder="in_distribution")

Provide observations as observation.images.camera1 (top), observation.images.camera2 (wrist), observation.state (6-DOF), and the task language string; the policy returns a chunked 6-DOF action.

Inference speed

~4.70 s / episode (16 chunked forwards @ 0.284 s) on 1× NVIDIA GB300, which is faster and more accurate than the bespoke Cosmos3 action policy baseline.
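The per-episode figure can be checked against the per-forward figure with a few lines of arithmetic. The numbers below are the ones reported above; reading the leftover time as non-forward overhead (pre/post-processing, data movement) is an assumption, since the card does not break it down.

```python
forwards_per_episode = 16
seconds_per_forward = 0.284
reported_episode_seconds = 4.70

forward_seconds = forwards_per_episode * seconds_per_forward
remainder = reported_episode_seconds - forward_seconds

print(f"forward passes: {forward_seconds:.3f} s")  # forward passes: 4.544 s
print(f"remainder:      {remainder:.3f} s")        # remainder:      0.156 s
```

So the 16 forward passes account for about 4.54 s of the ~4.70 s episode time.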
