SmolVLA-JetBot
A lightweight Vision-Language-Action (VLA) model fine-tuned for controlling JetBot, an NVIDIA Jetson Nano-based differential-drive robot. This model takes camera images and natural language instructions as input and outputs motor commands for robot navigation.
Model Description
SmolVLA-JetBot is designed to bring VLA capabilities to resource-constrained edge devices. It uses a frozen SmolVLM-500M backbone with a custom trainable action head that maps vision-language representations to motor commands.
- Developed by: shraavb
- Model type: Vision-Language-Action (VLA)
- Language: English
- License: Apache 2.0
- Base Model: HuggingFaceTB/SmolVLM-500M-Instruct
- Training Data: 10,000+ samples from NVIDIA Isaac Sim with domain randomization
Architecture
Base Model: HuggingFaceTB/SmolVLM-500M-Instruct (Frozen)
├── Vision Encoder: SigLIP-400M
├── Language Model: SmolLM-360M
└── Hidden Size: 960
Action Head (Trainable, ~123K parameters):
├── Linear(960 → 128)
├── ReLU + Dropout(0.1)
├── Linear(128 → 2)
└── Tanh → outputs in [-1, 1]
Output: [left_motor_speed, right_motor_speed]
Total Parameters: ~500M (frozen backbone) + 123K (trainable action head)
Memory Footprint: ~1.5GB (vs ~15GB for full OpenVLA models)
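The action head listed above can be sketched in a few lines of PyTorch. This is a reconstruction from the layer listing; the class name and forward signature are illustrative, not the repository's actual code:

```python
import torch
import torch.nn as nn

class ActionHead(nn.Module):
    """Maps SmolVLM hidden states (dim 960) to [left, right] motor speeds."""

    def __init__(self, hidden_size: int = 960):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, 128),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(128, 2),
            nn.Tanh(),  # bounds outputs to [-1, 1]
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.net(hidden_states)

head = ActionHead()
n_params = sum(p.numel() for p in head.parameters())
print(n_params)  # 960*128 + 128 + 128*2 + 2 = 123,266 ≈ 123K trainable parameters
```

The ~123K figure in the summary matches this layer stack: the first linear layer dominates (960 × 128 weights plus 128 biases).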
Intended Use
Primary Use Cases
- Real-time VLA-guided navigation on JetBot and similar differential-drive robots
- Natural language control of mobile robots
- Research platform for embodied AI on edge devices
- Sim-to-real transfer experiments
Supported Instructions
The model responds to natural language navigation commands such as:
- "go forward" / "move forward"
- "turn left" / "go left"
- "turn right" / "go right" (see limitations)
- "go towards the [object]" (e.g., "go towards the red ball")
- "stop" / "wait"
Training
Training Data
| Dataset | Samples | Description |
|---|---|---|
| dataset_vla | 10,000 | Primary training data from Isaac Sim with domain randomization |
| dataset_vla_synthetic | 1,000 | Initial synthetic data |
| dataset_vla_synthetic_large | 2,000 | Extended synthetic data |
| dataset_turn_right_10k | 10,000 | Focused on right-turn instructions (~23% right-turn commands) |
Instruction Distribution:
- Forward motion: ~25%
- Left turns: ~18%
- Right turns: ~18%
- Obstacle avoidance: ~18%
- Object approach: ~12%
- Stop/wait: ~6%
- Other: ~3%
Training Procedure
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 5e-5 |
| Weight Decay | 0.01 |
| Batch Size | 2-16 |
| Epochs | 5-20 |
| Warmup | 10% of steps |
| Scheduler | OneCycleLR |
| Loss Function | MSE |
| Gradient Clipping | 1.0 |
| Precision | FP16 (mixed precision) |
Hardware: NVIDIA RTX 4090, 16GB+ VRAM
Training Time: ~2-4 hours for 20 epochs
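The hyperparameter table corresponds to a training loop roughly like the sketch below. The dataset here is a random stand-in for precomputed hidden states and motor targets, and mixed precision is omitted for brevity; the actual `train_smolvla` script may differ:

```python
import torch
import torch.nn as nn

# Stand-in data: SmolVLM last-token hidden states and target motor commands
hidden = torch.randn(64, 960)
targets = torch.rand(64, 2) * 2 - 1  # motor commands in [-1, 1]

action_head = nn.Sequential(
    nn.Linear(960, 128), nn.ReLU(), nn.Dropout(0.1), nn.Linear(128, 2), nn.Tanh()
)

epochs, batch_size = 5, 16
total_steps = epochs * (len(hidden) // batch_size)
optimizer = torch.optim.AdamW(action_head.parameters(), lr=5e-5, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=5e-5, total_steps=total_steps, pct_start=0.1  # 10% warmup
)
loss_fn = nn.MSELoss()

for epoch in range(epochs):
    for i in range(0, len(hidden), batch_size):
        x, y = hidden[i:i + batch_size], targets[i:i + batch_size]
        optimizer.zero_grad()
        loss = loss_fn(action_head(x), y)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(action_head.parameters(), 1.0)  # clip at 1.0
        optimizer.step()
        scheduler.step()
```

Because only the action head receives gradients, the backbone forward pass can be cached offline, which is what makes training fit comfortably on a single GPU.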
Training Command
python -m server.vla_server.fine_tuning.train_smolvla \
--data-dir dataset_vla \
--output-dir models/smolvla_jetbot \
--epochs 20 \
--batch-size 2 \
--lr 5e-5
Evaluation
Results (v2 Model)
| Instruction | Avg Left | Avg Right | R-L Diff | Assessment |
|---|---|---|---|---|
| go forward | 0.85 | 0.94 | +0.09 | ✅ Correct |
| turn left | 0.57 | 0.92 | +0.35 | ✅ Correct |
| turn right | 0.64 | 0.95 | +0.31 | ❌ Incorrect (turns left) |
| go towards red ball | 0.86 | 0.45 | -0.41 | ✅ Correct |
Overall Accuracy: 75% (3/4 correct)
Best Validation Loss: 0.0394
Key Observations
- Visual grounding works: "go towards the red ball" correctly produces right-turning behavior when the ball is on the right
- Language differentiation limited: "turn right" produces similar output to "turn left"
- Forward motion is robust and consistent
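The assessments in the table follow from the sign of the right-left motor difference on a differential drive: a faster right wheel turns the robot left, and vice versa. A small sketch of that check, applied to the reported averages (the 0.1 threshold is an illustrative choice, not taken from the evaluation code):

```python
def turn_direction(avg_left: float, avg_right: float, threshold: float = 0.1) -> str:
    """Classify commanded motion from average motor outputs on a differential drive."""
    diff = avg_right - avg_left
    if diff > threshold:
        return "left"   # right wheel faster -> robot veers left
    if diff < -threshold:
        return "right"  # left wheel faster -> robot veers right
    return "forward"

# Reported v2 averages from the evaluation table
print(turn_direction(0.85, 0.94))  # forward (+0.09, below threshold)
print(turn_direction(0.57, 0.92))  # left
print(turn_direction(0.64, 0.95))  # left  <- the "turn right" failure case
print(turn_direction(0.86, 0.45))  # right
```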
Limitations and Biases
Known Limitations
- Turn Right Confusion: The model does not reliably differentiate "turn right" from "turn left"; both produce left-turning behavior (right motor output exceeds left)
- Limited Visual Grounding: Response to visual targets beyond simple colored objects is limited
- No Speed Control: Does not differentiate between "slow" and "fast" modifiers
- Sim-to-Real Gap: Trained in simulation; real-world performance may vary due to lighting, textures, and motor response differences
Failure Cases
- "turn right" produces left-turning behavior
- Complex instructions like "go around the obstacle" are not well understood
- Model may output similar actions regardless of visual scene changes
Mitigation Strategies for Sim-to-Real Transfer
- Domain randomization during data collection (implemented)
- Real-world fine-tuning with small dataset
- Action scaling/calibration on real robot
- Visual domain adaptation
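Of these, action scaling/calibration is the most mechanical step: the model's [-1, 1] outputs must be mapped onto the real robot's usable motor range, typically with a deadband below which the physical motors stall. A hypothetical calibration helper; the gain, deadband, and trim values are illustrative, not measured on a JetBot:

```python
def calibrate(action: float, gain: float = 0.6, deadband: float = 0.15,
              trim: float = 0.0) -> float:
    """Map a model output in [-1, 1] to a real motor command.

    gain     - caps top speed for safety during sim-to-real transfer
    deadband - minimum command below which the physical motor stalls
    trim     - per-motor offset compensating left/right asymmetry
    """
    a = max(-1.0, min(1.0, action)) * gain + trim
    if abs(a) < 1e-6:
        return 0.0
    # Re-map magnitude so any nonzero command clears the stall deadband
    sign = 1.0 if a > 0 else -1.0
    return sign * (deadband + (1.0 - deadband) * abs(a))

left_cmd = calibrate(0.85)
right_cmd = calibrate(0.94, trim=-0.02)  # slight trim for an asymmetric drivetrain
```

Values like the trim offset are best found empirically by commanding equal speeds and measuring drift.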
How to Use
Installation
pip install transformers torch pillow
Inference Example
from transformers import AutoProcessor, AutoModel
import torch
from PIL import Image

# Load the frozen SmolVLM backbone and its processor
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")
base_model = AutoModel.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")
base_model.eval()

# Load the fine-tuned action head (saved as a full nn.Module, so
# weights_only=False is needed on recent PyTorch versions)
action_head = torch.load("path/to/action_head.pt", weights_only=False)
action_head.eval()

# Prepare inputs
image = Image.open("camera_image.jpg")
instruction = "go forward"
inputs = processor(
    images=image,
    text=f"<image>\n{instruction}",
    return_tensors="pt",
)

# Run the backbone and map the last-token hidden state to motor commands
with torch.no_grad():
    outputs = base_model(**inputs, output_hidden_states=True)
    hidden_states = outputs.hidden_states[-1][:, -1, :]  # last token
    actions = action_head(hidden_states)

left_speed, right_speed = actions[0].tolist()
print(f"Left motor: {left_speed:.3f}, Right motor: {right_speed:.3f}")
Running the VLA Server
# Start the inference server
python -m server.vla_server.server \
--model-type smolvla \
--fine-tuned \
--model models/smolvla_jetbot/best
# The server accepts ZMQ requests with images and instructions
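The exact wire format of the server's ZMQ protocol is not documented here. The client sketch below assumes a REQ/REP pattern with a JSON request carrying a base64-encoded JPEG and an instruction string; a dummy echo server stands in for the real one so the example is self-contained. The `image_b64`/`instruction` keys and the reply schema are assumptions:

```python
import base64
import threading

import zmq

def dummy_server(endpoint: str) -> None:
    """Stand-in for the real VLA server: replies with fixed motor commands."""
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.REP)
    sock.bind(endpoint)
    sock.recv()  # request (ignored by the stand-in)
    sock.send_json({"left": 0.8, "right": 0.9})
    sock.close()

endpoint = "tcp://127.0.0.1:5555"
threading.Thread(target=dummy_server, args=(endpoint,), daemon=True).start()

# Client: send one camera frame plus an instruction, receive motor commands
ctx = zmq.Context.instance()
sock = ctx.socket(zmq.REQ)
sock.connect(endpoint)
request = {
    "image_b64": base64.b64encode(b"<jpeg bytes here>").decode("ascii"),
    "instruction": "go forward",
}
sock.send_json(request)
reply = sock.recv_json()
print(reply["left"], reply["right"])
sock.close()
```

ZMQ queues messages even when a client connects before the server binds, so connection order does not matter here.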
Hardware Requirements
Training
- GPU: NVIDIA RTX 4090 or equivalent
- VRAM: 16GB+
- Training Time: ~2-4 hours for 20 epochs
Inference
- GPU: CUDA-capable GPU
- VRAM: 4GB+
- Latency: ~50-100ms per frame
Target Deployment
- Platform: NVIDIA Jetson Nano/Orin
- Memory: ~1.5GB model footprint
Citation
If you use this model, please cite:
@misc{smolvla-jetbot,
  author = {shraavb},
  title = {SmolVLA-JetBot: Lightweight Vision-Language-Action Model for Robot Navigation},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/shraavb/smolvla-jetbot}
}
Acknowledgments
- HuggingFace SmolVLM for the base vision-language model
- NVIDIA Isaac Sim for simulation and synthetic data generation
- JetBot Project for the robot platform
Model Card Contact
For questions, issues, or contributions, please open an issue on the model repository or contact the developer.