# EdgeVLA-Tiny (FMB)

A 164M-parameter ultra-compact Vision-Language-Action (VLA) model: 64% smaller than SmolVLA, 3x faster, and likely the smallest open-source VLA available. Trained on real-robot data.

EdgeVLA-Tiny pushes the EdgeVLA architecture to its smallest configuration: FastViT-t8 vision, 4 VLM layers (75% pruned), and 0.75x expert width. At 164M parameters, it still beats the 450M SmolVLA baseline on offline action prediction while running 3x faster. The vision encoder is trained end-to-end. The architecture is inspired by DynamicVLA; VLM layer pruning is our contribution.

Trained exclusively on lerobot/fmb (3-camera Franka Panda manipulation). Source code: enfuse/edgevla
## Intended Use & What You Can Do With This Model
This model predicts 7-DoF robot actions (x, y, z, rx, ry, rz, gripper) from 3 camera images. It outputs 50-step action chunks at 10Hz — each inference produces 5 seconds of continuous robot motion.
**Immediate uses:**
- Deploy on a Franka Panda (or compatible 7-DoF arm) with a 3-camera setup for FMB-style tabletop manipulation. Feed camera frames in, execute the predicted delta actions.
- Fine-tune on your own robot data — this is the most practical use. If you have any robot with cameras and data in LeRobot format, this checkpoint is an excellent pretrained starting point. Fine-tuning at LR=3e-5 for 50K steps typically adapts the model well to new setups.
- Edge deployment — the smallest model in the EdgeVLA family, and likely the smallest open-source VLA available. At 164M params / 313MB FP16, it fits on even heavily constrained devices such as the Jetson Orin Nano (8GB). Estimated latency is ~142ms on Jetson Orin AGX; 57ms measured on H200.
- Real-time control — at 17.7 Hz throughput on H200, this model can run closed-loop at well above the 10Hz control frequency, enabling reactive manipulation.
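The 50-step chunking described above can be sketched in a few lines. This is a minimal illustration of the scheduling logic only; `predict_chunk` is a stand-in for the real model call, not the repo's API:

```python
# Sketch of chunked open-loop execution: one inference call yields a
# 50-step chunk consumed at the 10 Hz control rate, so the policy only
# needs to be re-queried every 5 seconds.
CHUNK_STEPS = 50   # actions per inference call
CONTROL_HZ = 10    # actions consumed per second

def predict_chunk(observation):
    """Placeholder for the model call: returns a (50, 7) action chunk."""
    return [[0.0] * 7 for _ in range(CHUNK_STEPS)]

def control_loop(duration_s, observation=None):
    """Yield one 7-DoF action per control tick, re-planning on chunk exhaustion."""
    chunk, idx = [], 0
    for _ in range(int(duration_s * CONTROL_HZ)):
        if idx >= len(chunk):  # chunk exhausted -> run inference again
            chunk, idx = predict_chunk(observation), 0
        yield chunk[idx]
        idx += 1

actions = list(control_loop(duration_s=15))
# 15 s at 10 Hz = 150 actions, i.e. 3 inference calls of 50 steps each
```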
**Important caveats:**
- All metrics below are offline action prediction on held-out FMB samples. There are no closed-loop success rate numbers — the model has not been validated on a physical robot completing full tasks.
- Trained specifically on FMB data (Franka Panda, specific manipulation tasks, 3-camera setup). It will not generalize to different robots, camera configurations, or tasks without fine-tuning.
- The model expects 3 camera inputs (side_1, side_2, wrist). For single-camera setups, you would need to fine-tune with `--empty_cameras` or retrain.
- As the smallest variant, Tiny trades some accuracy for speed: the rz and x dimensions are slightly worse than SmolVLA's (see the per-dimension table below).
## Results (FMB Offline, 500 held-out samples)
| Metric | SmolVLA (450M) | EdgeVLA-Tiny (164M) | Delta |
|---|---|---|---|
| Action MSE | 0.618 | 0.555 | -10% |
| Cosine Similarity | 0.663 | 0.654 | -1% |
| Gripper Accuracy | 94.9% | 95.1% | +0.2pp |
| Inference Latency (H200) | 169ms | 57ms | -66% |
| Memory (FP16) | 858MB | 313MB | -64% |
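The FP16 memory figures follow directly from the parameter counts at 2 bytes per parameter, reported in MiB. A quick sanity check:

```python
# FP16 weights take 2 bytes per parameter; the table reports MiB.
def fp16_mib(params: float) -> float:
    return params * 2 / 2**20

smolvla_mib = round(fp16_mib(450e6))  # 858
tiny_mib = round(fp16_mib(164e6))     # 313
```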
### Per-Dimension MSE
| Dim | SmolVLA | Tiny | Delta |
|---|---|---|---|
| x | 0.538 | 0.541 | +1% |
| y | 0.598 | 0.557 | -7% |
| z | 0.599 | 0.557 | -7% |
| rx | 0.624 | 0.544 | -13% |
| ry | 1.358 | 1.054 | -22% |
| rz | 0.373 | 0.404 | +8% |
| gripper | 0.233 | 0.226 | -3% |
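Assuming the headline Action MSE is the unweighted mean of the seven per-dimension MSEs (an assumption, but the numbers are consistent with it), the two tables check out:

```python
# Per-dimension MSEs from the table above (x, y, z, rx, ry, rz, gripper).
smolvla = [0.538, 0.598, 0.599, 0.624, 1.358, 0.373, 0.233]
tiny = [0.541, 0.557, 0.557, 0.544, 1.054, 0.404, 0.226]

def mean(xs):
    return sum(xs) / len(xs)

smolvla_mse = round(mean(smolvla), 3)  # matches the headline 0.618
tiny_mse = round(mean(tiny), 3)        # matches the headline 0.555
```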
### Latency (H200, FP32)
| Mean | P50 | P95 | Throughput |
|---|---|---|---|
| 57ms | 56ms | 59ms | 17.7 Hz |
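A minimal harness for reproducing these statistics on your own hardware might look like the following. This is a generic sketch, not the repo's benchmark script; in practice `fn` would wrap a full policy forward pass on a fixed batch:

```python
import statistics
import time

def benchmark(fn, n=200, warmup=10):
    """Measure mean / P50 / P95 wall-clock latency of fn() in ms,
    plus throughput in Hz -- the same statistics as the table above."""
    for _ in range(warmup):  # warm up caches / JIT before timing
        fn()
    times_ms = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - t0) * 1e3)
    times_ms.sort()
    mean_ms = statistics.mean(times_ms)
    return {
        "mean_ms": mean_ms,
        "p50_ms": times_ms[n // 2],
        "p95_ms": times_ms[int(n * 0.95)],
        "throughput_hz": 1e3 / mean_ms,
    }

# Dummy workload stands in for a policy forward pass.
stats = benchmark(lambda: sum(range(1000)))
```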
## Architecture

EdgeVLA-Tiny (164M total, 30M trainable):

- Vision (FastViT-t8): 4.0M, trainable (replaces the frozen 98M SigLIP encoder)
- VLM (SmolLM2-360M): 133.9M, frozen, 4 layers (pruned from 16)
- Action expert: 24.6M, trainable (flow matching, 0.75x width)
- Projections: 1.6M, trainable

Key changes from SmolVLA: FastViT-t8 (convolutional, trainable) replaces SigLIP (ViT, frozen); the VLM is pruned from 16 to 4 layers (75%); 64 visual tokens instead of 729 (11x fewer); 256x256 input instead of 384x384.
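The visual token counts follow from simple grid arithmetic. Assuming FastViT-t8 has the usual FastViT effective output stride of 32 (an assumption about this variant's stage layout):

```python
# A 256x256 input at output stride 32 gives an 8x8 feature map = 64 tokens;
# SigLIP at 384x384 produces a 27x27 grid = 729 tokens.
fastvit_tokens = (256 // 32) ** 2  # 64
siglip_tokens = 27 ** 2            # 729
ratio = siglip_tokens / fastvit_tokens  # ~11.4x fewer visual tokens
```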
## Training
| Parameter | Value |
|---|---|
| Dataset | lerobot/fmb |
| Total steps | 100K (50K + 50K fine-tune at LR=5e-5) |
| Batch size | 64 |
| Learning rate | 1e-4 initial, 5e-5 fine-tune (cosine) |
| Warmup | 2,000 / 1,000 steps |
| Augmentation | ColorJitter + RandomSharpness + RandomAffine |
| Cameras | 3 (side_1, side_2, wrist) |
| Actions | 7-dim (x, y, z, rx, ry, rz, gripper) |
| VLM layers | 4 (pruned from 16) |
| Expert width | 0.75x |
| Hardware | 1x NVIDIA H200 |
| Training time | ~10 hours total |
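The schedule in the table (linear warmup into cosine decay) can be sketched as follows. This is a common formulation under the table's hyperparameters; the repo's exact scheduler may differ in its floor value and phase handling:

```python
import math

def lr_at(step, total=50_000, warmup=2_000, peak=1e-4, floor=0.0):
    """Linear warmup to `peak`, then cosine decay to `floor`."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)  # 0 -> 1 after warmup
    return floor + (peak - floor) * 0.5 * (1 + math.cos(math.pi * progress))
```

The fine-tuning round would reuse the same shape with `peak=5e-5` and `warmup=1_000`, per the table.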
## EdgeVLA Family
| Model | Params | MSE | Cosine Sim | Gripper | Latency | HF Repo |
|---|---|---|---|---|---|---|
| SmolVLA | 450M | 0.618 | 0.663 | 94.9% | 169ms | lerobot/smolvla_base |
| Base | 363M | 0.458 | 0.713 | 96.5% | 162ms | enfuse/edgevla-base-fmb |
| Small | 228M | 0.515 | 0.679 | 95.8% | 90ms | enfuse/edgevla-small-fmb |
| Tiny | 164M | 0.555 | 0.654 | 95.1% | 57ms | this repo |
## Quick Start

```python
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

policy = SmolVLAPolicy.from_pretrained("enfuse/edgevla-tiny-fmb")
policy.eval()
```
## Fine-Tuning on Your Own Data

```bash
git clone https://github.com/enfuse/edgevla
cd edgevla

python edgevla/train.py \
  --base_policy enfuse/edgevla-tiny-fmb \
  --dataset your_lerobot_dataset \
  --fastvit_variant fastvit_t8 \
  --num_vlm_layers 4 \
  --expert_width_multiplier 0.75 \
  --lr 3e-5 \
  --steps 50000 \
  --batch_size 64
```
See the training README for full configuration options and multi-round training strategy.
## Attribution
Architecture from DynamicVLA (Xie et al., 2026). VLM layer pruning is our contribution. Built on SmolVLA, FastViT, and LeRobot.
```bibtex
@article{xie2026dynamicvla,
  title={DynamicVLA: Efficient Vision-Language-Action Model via Dynamic Fusion for Robotic Manipulation},
  author={Xie, Yue and others},
  journal={arXiv preprint arXiv:2601.22153},
  year={2026}
}
```