SmolVLA-JetBot
A lightweight Vision-Language-Action (VLA) model fine-tuned for controlling JetBot, an NVIDIA Jetson Nano-based differential-drive robot. This model takes camera images and natural language instructions as input and outputs motor commands for robot navigation.
Model Description
SmolVLA-JetBot is designed to bring VLA capabilities to resource-constrained edge devices. It uses a frozen SmolVLM-500M backbone with a custom trainable action head that maps vision-language representations to motor commands.
- Developed by: shraavb
- Model type: Vision-Language-Action (VLA)
- Language: English
- License: Apache 2.0
- Base Model: HuggingFaceTB/SmolVLM-500M-Instruct
- Training Data: 10,000+ samples from NVIDIA Isaac Sim with domain randomization
Architecture
Base Model: HuggingFaceTB/SmolVLM-500M-Instruct (Frozen)
├── Vision Encoder: SigLIP-400M
├── Language Model: SmolLM-360M
└── Hidden Size: 960
Action Head (Trainable, ~123K parameters):
├── Linear(960 → 128)
├── ReLU + Dropout(0.1)
├── Linear(128 → 2)
└── Tanh → outputs in [-1, 1]
Output: [left_motor_speed, right_motor_speed]
Total Parameters: ~500M (frozen backbone) + 123K (trainable action head)
Memory Footprint: ~1.5GB (vs ~15GB for full OpenVLA models)
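The action head listed above can be sketched in a few lines of PyTorch. This is a reconstruction from the layer listing; the class name and forward signature are illustrative, not the repository's actual code:

```python
import torch
import torch.nn as nn

class ActionHead(nn.Module):
    """Maps SmolVLM hidden states (dim 960) to [left, right] motor speeds."""

    def __init__(self, hidden_size: int = 960):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, 128),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(128, 2),
            nn.Tanh(),  # bounds outputs to [-1, 1]
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.net(hidden_states)

head = ActionHead()
n_params = sum(p.numel() for p in head.parameters())
print(n_params)  # 960*128 + 128 + 128*2 + 2 = 123,266 ≈ 123K trainable parameters
```

The ~123K figure in the summary matches this layer stack: the first linear layer dominates (960 × 128 weights plus 128 biases).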
Intended Use
Primary Use Cases
- Real-time VLA-guided navigation on JetBot and similar differential-drive robots
- Natural language control of mobile robots
- Research platform for embodied AI on edge devices
- Sim-to-real transfer experiments
Supported Instructions
The model responds to natural language navigation commands such as:
- "go forward" / "move forward"
- "turn left" / "go left"
- "turn right" / "go right" (see limitations)
- "go towards the [object]" (e.g., "go towards the red ball")
- "stop" / "wait"
Training
Training Data
| Dataset | Samples | Description |
|---|---|---|
| dataset_vla | 10,000 | Primary training data from Isaac Sim with domain randomization |
| dataset_vla_synthetic | 1,000 | Initial synthetic data |
| dataset_vla_synthetic_large | 2,000 | Extended synthetic data |
| dataset_turn_right_10k | 10,000 | Focused on right-turn instructions (~23% right-turn commands) |
Instruction Distribution:
- Forward motion: ~25%
- Left turns: ~18%
- Right turns: ~18%
- Obstacle avoidance: ~18%
- Object approach: ~12%
- Stop/wait: ~6%
- Other: ~3%
Training Procedure
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 5e-5 |
| Weight Decay | 0.01 |
| Batch Size | 2-16 |
| Epochs | 5-20 |
| Warmup | 10% of steps |
| Scheduler | OneCycleLR |
| Loss Function | MSE |
| Gradient Clipping | 1.0 |
| Precision | FP16 (mixed precision) |
Hardware: NVIDIA RTX 4090, 16GB+ VRAM
Training Time: ~2-4 hours for 20 epochs
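The hyperparameter table corresponds to a training loop roughly like the sketch below. The dataset here is a random stand-in for precomputed hidden states and motor targets, and mixed precision is omitted for brevity; the actual `train_smolvla` script may differ:

```python
import torch
import torch.nn as nn

# Stand-in data: SmolVLM last-token hidden states and target motor commands
hidden = torch.randn(64, 960)
targets = torch.rand(64, 2) * 2 - 1  # motor commands in [-1, 1]

action_head = nn.Sequential(
    nn.Linear(960, 128), nn.ReLU(), nn.Dropout(0.1), nn.Linear(128, 2), nn.Tanh()
)

epochs, batch_size = 5, 16
total_steps = epochs * (len(hidden) // batch_size)
optimizer = torch.optim.AdamW(action_head.parameters(), lr=5e-5, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=5e-5, total_steps=total_steps, pct_start=0.1  # 10% warmup
)
loss_fn = nn.MSELoss()

for epoch in range(epochs):
    for i in range(0, len(hidden), batch_size):
        x, y = hidden[i:i + batch_size], targets[i:i + batch_size]
        optimizer.zero_grad()
        loss = loss_fn(action_head(x), y)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(action_head.parameters(), 1.0)  # clip at 1.0
        optimizer.step()
        scheduler.step()
```

Because only the action head receives gradients, the backbone forward pass can be cached offline, which is what makes training fit comfortably on a single GPU.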
Training Command
python -m server.vla_server.fine_tuning.train_smolvla \
--data-dir dataset_vla \
--output-dir models/smolvla_jetbot \
--epochs 20 \
--batch-size 2 \
--lr 5e-5
Evaluation
Results (v2 Model)
| Instruction | Avg Left | Avg Right | R-L Diff | Assessment |
|---|---|---|---|---|
| go forward | 0.85 | 0.94 | +0.09 | ✅ Correct |
| turn left | 0.57 | 0.92 | +0.35 | ✅ Correct |
| turn right | 0.64 | 0.95 | +0.31 | ❌ Incorrect (turns left) |
| go towards red ball | 0.86 | 0.45 | -0.41 | ✅ Correct |
Overall Accuracy: 75% (3/4 correct)
Best Validation Loss: 0.0394
Key Observations
- Visual grounding works: "go towards the red ball" correctly produces right-turning behavior when the ball is on the right
- Language differentiation limited: "turn right" produces similar output to "turn left"
- Forward motion is robust and consistent
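The assessments in the table follow from the sign of the right-left motor difference on a differential drive: a faster right wheel turns the robot left, and vice versa. A small sketch of that check, applied to the reported averages (the 0.1 threshold is an illustrative choice, not taken from the evaluation code):

```python
def turn_direction(avg_left: float, avg_right: float, threshold: float = 0.1) -> str:
    """Classify commanded motion from average motor outputs on a differential drive."""
    diff = avg_right - avg_left
    if diff > threshold:
        return "left"   # right wheel faster -> robot veers left
    if diff < -threshold:
        return "right"  # left wheel faster -> robot veers right
    return "forward"

# Reported v2 averages from the evaluation table
print(turn_direction(0.85, 0.94))  # forward (+0.09, below threshold)
print(turn_direction(0.57, 0.92))  # left
print(turn_direction(0.64, 0.95))  # left  <- the "turn right" failure case
print(turn_direction(0.86, 0.45))  # right
```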
Limitations and Biases
Known Limitations
- Turn Right Confusion: The model does not reliably differentiate "turn right" from "turn left"; both produce left-turning behavior (right motor output exceeds left)
- Limited Visual Grounding: Response to visual targets beyond simple colored objects is limited
- No Speed Control: Does not differentiate between "slow" and "fast" modifiers
- Sim-to-Real Gap: Trained in simulation; real-world performance may vary due to lighting, textures, and motor response differences
Failure Cases
- "turn right" produces left-turning behavior
- Complex instructions like "go around the obstacle" are not well understood
- Model may output similar actions regardless of visual scene changes
Mitigation Strategies for Sim-to-Real Transfer
- Domain randomization during data collection (implemented)
- Real-world fine-tuning with small dataset
- Action scaling/calibration on real robot
- Visual domain adaptation
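Of these, action scaling/calibration is the most mechanical step: the model's [-1, 1] outputs must be mapped onto the real robot's usable motor range, typically with a deadband below which the physical motors stall. A hypothetical calibration helper; the gain, deadband, and trim values are illustrative, not measured on a JetBot:

```python
def calibrate(action: float, gain: float = 0.6, deadband: float = 0.15,
              trim: float = 0.0) -> float:
    """Map a model output in [-1, 1] to a real motor command.

    gain     - caps top speed for safety during sim-to-real transfer
    deadband - minimum command below which the physical motor stalls
    trim     - per-motor offset compensating left/right asymmetry
    """
    a = max(-1.0, min(1.0, action)) * gain + trim
    if abs(a) < 1e-6:
        return 0.0
    # Re-map magnitude so any nonzero command clears the stall deadband
    sign = 1.0 if a > 0 else -1.0
    return sign * (deadband + (1.0 - deadband) * abs(a))

left_cmd = calibrate(0.85)
right_cmd = calibrate(0.94, trim=-0.02)  # slight trim for an asymmetric drivetrain
```

Values like the trim offset are best found empirically by commanding equal speeds and measuring drift.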
How to Use
Installation
pip install transformers torch pillow
Inference Example
from transformers import AutoProcessor, AutoModel
import torch
from PIL import Image

# Load the frozen SmolVLM backbone and its processor
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")
base_model = AutoModel.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")
base_model.eval()

# Load the fine-tuned action head (saved as a full nn.Module, so
# weights_only=False is needed on recent PyTorch versions)
action_head = torch.load("path/to/action_head.pt", weights_only=False)
action_head.eval()

# Prepare inputs
image = Image.open("camera_image.jpg")
instruction = "go forward"
inputs = processor(
    images=image,
    text=f"<image>\n{instruction}",
    return_tensors="pt",
)

# Run the backbone and map the last-token hidden state to motor commands
with torch.no_grad():
    outputs = base_model(**inputs, output_hidden_states=True)
    hidden_states = outputs.hidden_states[-1][:, -1, :]  # last token
    actions = action_head(hidden_states)

left_speed, right_speed = actions[0].tolist()
print(f"Left motor: {left_speed:.3f}, Right motor: {right_speed:.3f}")
Running the VLA Server
# Start the inference server
python -m server.vla_server.server \
--model-type smolvla \
--fine-tuned \
--model models/smolvla_jetbot/best
# The server accepts ZMQ requests with images and instructions
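The exact wire format of the server's ZMQ protocol is not documented here. The client sketch below assumes a REQ/REP pattern with a JSON request carrying a base64-encoded JPEG and an instruction string; a dummy echo server stands in for the real one so the example is self-contained. The `image_b64`/`instruction` keys and the reply schema are assumptions:

```python
import base64
import threading

import zmq

def dummy_server(endpoint: str) -> None:
    """Stand-in for the real VLA server: replies with fixed motor commands."""
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.REP)
    sock.bind(endpoint)
    sock.recv()  # request (ignored by the stand-in)
    sock.send_json({"left": 0.8, "right": 0.9})
    sock.close()

endpoint = "tcp://127.0.0.1:5555"
threading.Thread(target=dummy_server, args=(endpoint,), daemon=True).start()

# Client: send one camera frame plus an instruction, receive motor commands
ctx = zmq.Context.instance()
sock = ctx.socket(zmq.REQ)
sock.connect(endpoint)
request = {
    "image_b64": base64.b64encode(b"<jpeg bytes here>").decode("ascii"),
    "instruction": "go forward",
}
sock.send_json(request)
reply = sock.recv_json()
print(reply["left"], reply["right"])
sock.close()
```

ZMQ queues messages even when a client connects before the server binds, so connection order does not matter here.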
Hardware Requirements
Training
- GPU: NVIDIA RTX 4090 or equivalent
- VRAM: 16GB+
- Training Time: ~2-4 hours for 20 epochs
Inference
- GPU: CUDA-capable GPU
- VRAM: 4GB+
- Latency: ~50-100ms per frame
Target Deployment
- Platform: NVIDIA Jetson Nano/Orin
- Memory: ~1.5GB model footprint
Citation
If you use this model, please cite:
@misc{smolvla-jetbot,
  author = {shraavb},
  title = {SmolVLA-JetBot: Lightweight Vision-Language-Action Model for Robot Navigation},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/shraavb/smolvla-jetbot}
}
Acknowledgments
- HuggingFace SmolVLM for the base vision-language model
- NVIDIA Isaac Sim for simulation and synthetic data generation
- JetBot Project for the robot platform
Model Card Contact
For questions, issues, or contributions, please open an issue on the model repository or contact the developer.