SmolVLA-JetBot

A lightweight Vision-Language-Action (VLA) model fine-tuned for controlling JetBot, an NVIDIA Jetson Nano-based differential-drive robot. This model takes camera images and natural language instructions as input and outputs motor commands for robot navigation.

Model Description

SmolVLA-JetBot is designed to bring VLA capabilities to resource-constrained edge devices. It uses a frozen SmolVLM-500M backbone with a custom trainable action head that maps vision-language representations to motor commands.

  • Developed by: shraavb
  • Model type: Vision-Language-Action (VLA)
  • Language: English
  • License: Apache 2.0
  • Base Model: HuggingFaceTB/SmolVLM-500M-Instruct
  • Training Data: 10,000+ samples from NVIDIA Isaac Sim with domain randomization

Architecture

Base Model: HuggingFaceTB/SmolVLM-500M-Instruct (Frozen)
├── Vision Encoder: SigLIP-400M
├── Language Model: SmolLM-360M
└── Hidden Size: 960

Action Head (Trainable, ~123K parameters):
├── Linear(960 → 128)
├── ReLU + Dropout(0.1)
├── Linear(128 → 2)
└── Tanh → outputs in [-1, 1]

Output: [left_motor_speed, right_motor_speed]

Total Parameters: ~500M (frozen backbone) + 123K (trainable action head)

Memory Footprint: ~1.5GB (vs ~15GB for full OpenVLA models)
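The ~123K trainable-parameter figure can be checked by hand from the layer shapes above (only the two Linear layers carry parameters; ReLU, Dropout, and Tanh add none):

```python
# Parameter count of the action head: weights + biases of each Linear layer
hidden_size = 960   # SmolVLM hidden size
bottleneck = 128    # action-head intermediate width
n_actions = 2       # [left_motor_speed, right_motor_speed]

params_fc1 = hidden_size * bottleneck + bottleneck  # Linear(960 -> 128)
params_fc2 = bottleneck * n_actions + n_actions     # Linear(128 -> 2)
total = params_fc1 + params_fc2

print(total)  # 123266, i.e. ~123K trainable parameters
```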

Intended Use

Primary Use Cases

  • Real-time VLA-guided navigation on JetBot and similar differential-drive robots
  • Natural language control of mobile robots
  • Research platform for embodied AI on edge devices
  • Sim-to-real transfer experiments

Supported Instructions

The model responds to natural language navigation commands such as:

  • "go forward" / "move forward"
  • "turn left" / "go left"
  • "turn right" / "go right" (see limitations)
  • "go towards the [object]" (e.g., "go towards the red ball")
  • "stop" / "wait"

Training

Training Data

| Dataset | Samples | Description |
|---|---|---|
| dataset_vla | 10,000 | Primary training data from Isaac Sim with domain randomization |
| dataset_vla_synthetic | 1,000 | Initial synthetic data |
| dataset_vla_synthetic_large | 2,000 | Extended synthetic data |
| dataset_turn_right_10k | 10,000 | Focused on right-turn instructions (~23% right-turn commands) |

Instruction Distribution:

  • Forward motion: ~25%
  • Left turns: ~18%
  • Right turns: ~18%
  • Obstacle avoidance: ~18%
  • Object approach: ~12%
  • Stop/wait: ~6%
  • Other: ~3%

Training Procedure

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 5e-5 |
| Weight Decay | 0.01 |
| Batch Size | 2-16 |
| Epochs | 5-20 |
| Warmup | 10% of steps |
| Scheduler | OneCycleLR |
| Loss Function | MSE |
| Gradient Clipping | 1.0 |
| Precision | FP16 (mixed precision) |

Hardware: NVIDIA RTX 4090, 16GB+ VRAM

Training Time: ~2-4 hours for 20 epochs
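A minimal sketch of the optimizer, scheduler, and loss setup implied by the table above. Dataset loading, the frozen backbone, and FP16 mixed precision are omitted for brevity, and `total_steps` is a placeholder (in real training it is `epochs * steps_per_epoch`):

```python
import torch
import torch.nn as nn

# Stand-in for the trainable action head (the backbone stays frozen)
action_head = nn.Sequential(
    nn.Linear(960, 128), nn.ReLU(), nn.Dropout(0.1),
    nn.Linear(128, 2), nn.Tanh(),
)

total_steps = 100  # placeholder
optimizer = torch.optim.AdamW(action_head.parameters(), lr=5e-5, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=5e-5, total_steps=total_steps, pct_start=0.1  # 10% warmup
)
criterion = nn.MSELoss()

# One training step on dummy data (hidden state -> target motor commands)
hidden = torch.randn(2, 960)
target = torch.rand(2, 2) * 2 - 1  # motor targets in [-1, 1]

optimizer.zero_grad()
loss = criterion(action_head(hidden), target)
loss.backward()
torch.nn.utils.clip_grad_norm_(action_head.parameters(), max_norm=1.0)  # clip at 1.0
optimizer.step()
scheduler.step()
```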

Training Command

```shell
python -m server.vla_server.fine_tuning.train_smolvla \
    --data-dir dataset_vla \
    --output-dir models/smolvla_jetbot \
    --epochs 20 \
    --batch-size 2 \
    --lr 5e-5
```

Evaluation

Results (v2 Model)

| Instruction | Avg Left | Avg Right | R-L Diff | Assessment |
|---|---|---|---|---|
| go forward | 0.85 | 0.94 | +0.09 | ✅ Correct |
| turn left | 0.57 | 0.92 | +0.35 | ✅ Correct |
| turn right | 0.64 | 0.95 | +0.31 | ❌ Incorrect (turns left) |
| go towards red ball | 0.86 | 0.45 | -0.41 | ✅ Correct |

Overall Accuracy: 75% (3/4 correct)

Best Validation Loss: 0.0394
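The assessments above follow from the sign of the R-L difference on a differential drive (R > L spins the robot left, R < L spins it right). This hypothetical helper reproduces them; the ±0.15 threshold is an illustrative choice, not a value from the training code:

```python
def classify_turn(left, right, threshold=0.15):
    """Classify a differential-drive command by the R-L motor difference."""
    diff = right - left
    if diff > threshold:
        return "left"
    if diff < -threshold:
        return "right"
    return "forward"

# Values from the evaluation table
print(classify_turn(0.85, 0.94))  # forward  ("go forward" is correct)
print(classify_turn(0.57, 0.92))  # left     ("turn left" is correct)
print(classify_turn(0.64, 0.95))  # left     ("turn right" wrongly turns left)
print(classify_turn(0.86, 0.45))  # right    (toward a ball on the right)
```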

Key Observations

  • Visual grounding works: "go towards the red ball" correctly produces right-turning behavior when the ball is on the right
  • Language differentiation limited: "turn right" produces similar output to "turn left"
  • Forward motion is robust and consistent

Limitations and Biases

Known Limitations

  1. Turn Right Confusion: The model does not reliably differentiate "turn right" from "turn left"; both produce left-turning behavior (R > L motor values)
  2. Limited Visual Grounding: Response to visual targets beyond simple colored objects is limited
  3. No Speed Control: Does not differentiate between "slow" and "fast" modifiers
  4. Sim-to-Real Gap: Trained in simulation; real-world performance may vary due to lighting, textures, and motor response differences

Failure Cases

  • "turn right" produces left-turning behavior
  • Complex instructions like "go around the obstacle" are not well understood
  • Model may output similar actions regardless of visual scene changes

Mitigation Strategies for Sim-to-Real Transfer

  1. Domain randomization during data collection (implemented)
  2. Real-world fine-tuning with small dataset
  3. Action scaling/calibration on real robot
  4. Visual domain adaptation
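Strategy 3 (action scaling/calibration) can be as simple as an affine map with deadband compensation per motor; the constants below are illustrative, not measured on a real JetBot:

```python
def calibrate(command, max_speed=0.6, deadband=0.1):
    """Map a model output in [-1, 1] to a real motor command.

    Real DC motors often do not move below a minimum duty cycle
    (the deadband), so small nonzero commands are shifted past it.
    Constants here are illustrative placeholders.
    """
    if abs(command) < 1e-3:
        return 0.0
    sign = 1.0 if command > 0 else -1.0
    # Rescale |command| from (0, 1] onto (deadband, max_speed]
    return sign * (deadband + abs(command) * (max_speed - deadband))

left, right = 0.85, 0.94  # model output for "go forward"
print(calibrate(left), calibrate(right))
```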

How to Use

Installation

```shell
pip install transformers torch pillow
```

Inference Example

```python
from transformers import AutoProcessor, AutoModelForVision2Seq
import torch
import torch.nn as nn
from PIL import Image

# Load the frozen SmolVLM backbone and its processor
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")
base_model = AutoModelForVision2Seq.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")
base_model.eval()

# Rebuild the action head (see Architecture) and load the fine-tuned weights;
# this assumes the checkpoint stores a state dict
action_head = nn.Sequential(
    nn.Linear(960, 128),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(128, 2),
    nn.Tanh(),
)
action_head.load_state_dict(torch.load("path/to/action_head.pt", map_location="cpu"))
action_head.eval()

# Prepare inputs
image = Image.open("camera_image.jpg")
instruction = "go forward"

inputs = processor(
    images=image,
    text=f"<image>\n{instruction}",
    return_tensors="pt"
)

# Get the last layer's hidden state at the last token
with torch.no_grad():
    outputs = base_model(**inputs, output_hidden_states=True)
    hidden_states = outputs.hidden_states[-1][:, -1, :]

    # Map the vision-language representation to motor commands
    actions = action_head(hidden_states)
    left_speed, right_speed = actions[0].tolist()

print(f"Left motor: {left_speed:.3f}, Right motor: {right_speed:.3f}")
```

Running the VLA Server

# Start the inference server
python -m server.vla_server.server \
    --model-type smolvla \
    --fine-tuned \
    --model models/smolvla_jetbot/best

# The server accepts ZMQ requests with images and instructions
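The exact wire format is defined by `server.vla_server.server`; a plausible JSON request carrying a base64-encoded JPEG might be built like this (the field names are assumptions for illustration, not the server's documented schema):

```python
import base64
import json

def build_request(image_bytes, instruction):
    """Pack a camera frame and an instruction into a JSON message
    suitable for a ZMQ REQ socket (field names assumed)."""
    return json.dumps({
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "instruction": instruction,
    })

msg = build_request(b"\xff\xd8\xff\xe0fake-jpeg-bytes", "go forward")
print(json.loads(msg)["instruction"])  # go forward
```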

Hardware Requirements

Training

  • GPU: NVIDIA RTX 4090 or equivalent
  • VRAM: 16GB+
  • Training Time: ~2-4 hours for 20 epochs

Inference

  • GPU: CUDA-capable GPU
  • VRAM: 4GB+
  • Latency: ~50-100ms per frame

Target Deployment

  • Platform: NVIDIA Jetson Nano/Orin
  • Memory: ~1.5GB model footprint

Citation

If you use this model, please cite:

@misc{smolvla-jetbot,
  author = {shraavb},
  title = {SmolVLA-JetBot: Lightweight Vision-Language-Action Model for Robot Navigation},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/shraavb/smolvla-jetbot}
}

Model Card Contact

For questions, issues, or contributions, please open an issue on the model repository or contact the developer.
