# EdgeVLA-Tiny (FMB)

A 164M-parameter ultra-compact Vision-Language-Action (VLA) model: 64% smaller than SmolVLA, 3x faster, and likely the smallest open-source VLA available. Trained on real-robot data.

EdgeVLA-Tiny pushes the EdgeVLA architecture to its smallest configuration: FastViT-t8 vision, 4 VLM layers (75% pruned), and 0.75x expert width. At 164M parameters, it still beats the 450M SmolVLA baseline on offline action prediction while running 3x faster. The vision encoder is trained end-to-end. The architecture is inspired by DynamicVLA; VLM layer pruning is our contribution.

Trained exclusively on lerobot/fmb (3-camera Franka Panda manipulation). Source code: enfuse/edgevla
## Intended Use & What You Can Do With This Model
This model predicts 7-DoF robot actions (x, y, z, rx, ry, rz, gripper) from 3 camera images. It outputs 50-step action chunks at 10Hz — each inference produces 5 seconds of continuous robot motion.
**Immediate uses:**
- Deploy on a Franka Panda (or compatible 7-DoF arm) with a 3-camera setup for FMB-style tabletop manipulation. Feed camera frames in, execute the predicted delta actions.
- Fine-tune on your own robot data — this is the most practical use. If you have any robot with cameras and data in LeRobot format, this checkpoint is an excellent pretrained starting point. Fine-tuning at LR=3e-5 for 50K steps typically adapts the model well to new setups.
- Edge deployment — the smallest model in the EdgeVLA family, and likely the smallest open-source VLA available. At 164M params / 313MB FP16, it fits on even heavily constrained devices such as the Jetson Orin Nano (8GB). Estimated latency is ~142ms on Jetson Orin AGX; 57ms measured on H200.
- Real-time control — at 17.7 Hz throughput on H200, this model can run closed-loop at well above the 10Hz control frequency, enabling reactive manipulation.
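The 50-step chunking described above can be sketched in a few lines. This is a minimal illustration of the scheduling logic only; `predict_chunk` is a stand-in for the real model call, not the repo's API:

```python
# Sketch of chunked open-loop execution: one inference call yields a
# 50-step chunk consumed at the 10 Hz control rate, so the policy only
# needs to be re-queried every 5 seconds.
CHUNK_STEPS = 50   # actions per inference call
CONTROL_HZ = 10    # actions consumed per second

def predict_chunk(observation):
    """Placeholder for the model call: returns a (50, 7) action chunk."""
    return [[0.0] * 7 for _ in range(CHUNK_STEPS)]

def control_loop(duration_s, observation=None):
    """Yield one 7-DoF action per control tick, re-planning on chunk exhaustion."""
    chunk, idx = [], 0
    for _ in range(int(duration_s * CONTROL_HZ)):
        if idx >= len(chunk):  # chunk exhausted -> run inference again
            chunk, idx = predict_chunk(observation), 0
        yield chunk[idx]
        idx += 1

actions = list(control_loop(duration_s=15))
# 15 s at 10 Hz = 150 actions, i.e. 3 inference calls of 50 steps each
```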
**Important caveats:**
- All metrics below are offline action prediction on held-out FMB samples. There are no closed-loop success rate numbers — the model has not been validated on a physical robot completing full tasks.
- Trained specifically on FMB data (Franka Panda, specific manipulation tasks, 3-camera setup). It will not generalize to different robots, camera configurations, or tasks without fine-tuning.
- The model expects 3 camera inputs (side_1, side_2, wrist). For single-camera setups, you would need to fine-tune with `--empty_cameras` or retrain.
- As the smallest variant, Tiny trades some accuracy for speed: the rz and x dimensions are slightly worse than SmolVLA's (see the per-dimension table below).
## Results (FMB Offline, 500 held-out samples)
| Metric | SmolVLA (450M) | EdgeVLA-Tiny (164M) | Delta |
|---|---|---|---|
| Action MSE | 0.618 | 0.555 | -10% |
| Cosine Similarity | 0.663 | 0.654 | -1% |
| Gripper Accuracy | 94.9% | 95.1% | +0.2pp |
| Inference Latency (H200) | 169ms | 57ms | -66% |
| Memory (FP16) | 858MB | 313MB | -64% |
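The FP16 memory figures follow directly from the parameter counts at 2 bytes per parameter, reported in MiB. A quick sanity check:

```python
# FP16 weights take 2 bytes per parameter; the table reports MiB.
def fp16_mib(params: float) -> float:
    return params * 2 / 2**20

smolvla_mib = round(fp16_mib(450e6))  # 858
tiny_mib = round(fp16_mib(164e6))     # 313
```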
### Per-Dimension MSE
| Dim | SmolVLA | Tiny | Delta |
|---|---|---|---|
| x | 0.538 | 0.541 | +1% |
| y | 0.598 | 0.557 | -7% |
| z | 0.599 | 0.557 | -7% |
| rx | 0.624 | 0.544 | -13% |
| ry | 1.358 | 1.054 | -22% |
| rz | 0.373 | 0.404 | +8% |
| gripper | 0.233 | 0.226 | -3% |
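Assuming the headline Action MSE is the unweighted mean of the seven per-dimension MSEs (an assumption, but the numbers are consistent with it), the two tables check out:

```python
# Per-dimension MSEs from the table above (x, y, z, rx, ry, rz, gripper).
smolvla = [0.538, 0.598, 0.599, 0.624, 1.358, 0.373, 0.233]
tiny = [0.541, 0.557, 0.557, 0.544, 1.054, 0.404, 0.226]

def mean(xs):
    return sum(xs) / len(xs)

smolvla_mse = round(mean(smolvla), 3)  # matches the headline 0.618
tiny_mse = round(mean(tiny), 3)        # matches the headline 0.555
```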
### Latency (H200, FP32)
| Mean | P50 | P95 | Throughput |
|---|---|---|---|
| 57ms | 56ms | 59ms | 17.7 Hz |
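A minimal harness for reproducing these statistics on your own hardware might look like the following. This is a generic sketch, not the repo's benchmark script; in practice `fn` would wrap a full policy forward pass on a fixed batch:

```python
import statistics
import time

def benchmark(fn, n=200, warmup=10):
    """Measure mean / P50 / P95 wall-clock latency of fn() in ms,
    plus throughput in Hz -- the same statistics as the table above."""
    for _ in range(warmup):  # warm up caches / JIT before timing
        fn()
    times_ms = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - t0) * 1e3)
    times_ms.sort()
    mean_ms = statistics.mean(times_ms)
    return {
        "mean_ms": mean_ms,
        "p50_ms": times_ms[n // 2],
        "p95_ms": times_ms[int(n * 0.95)],
        "throughput_hz": 1e3 / mean_ms,
    }

# Dummy workload stands in for a policy forward pass.
stats = benchmark(lambda: sum(range(1000)))
```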
## Architecture

EdgeVLA-Tiny (164M total, 30M trainable):

- Vision (FastViT-t8): 4.0M, trainable (replaces the frozen 98M SigLIP encoder)
- VLM (SmolLM2-360M): 133.9M, frozen, 4 layers (pruned from 16)
- Action expert: 24.6M, trainable (flow matching, 0.75x width)
- Projections: 1.6M, trainable

Key changes from SmolVLA: FastViT-t8 (convolutional, trainable) replaces SigLIP (ViT, frozen); the VLM is pruned from 16 to 4 layers (75%); 64 visual tokens instead of 729 (11x fewer); 256x256 input instead of 384x384.
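The visual token counts follow from simple grid arithmetic. Assuming FastViT-t8 has the usual FastViT effective output stride of 32 (an assumption about this variant's stage layout):

```python
# A 256x256 input at output stride 32 gives an 8x8 feature map = 64 tokens;
# SigLIP at 384x384 produces a 27x27 grid = 729 tokens.
fastvit_tokens = (256 // 32) ** 2  # 64
siglip_tokens = 27 ** 2            # 729
ratio = siglip_tokens / fastvit_tokens  # ~11.4x fewer visual tokens
```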
## Training
| Parameter | Value |
|---|---|
| Dataset | lerobot/fmb |
| Total steps | 100K (50K + 50K fine-tune at LR=5e-5) |
| Batch size | 64 |
| Learning rate | 1e-4 initial, 5e-5 fine-tune (cosine) |
| Warmup | 2,000 / 1,000 steps |
| Augmentation | ColorJitter + RandomSharpness + RandomAffine |
| Cameras | 3 (side_1, side_2, wrist) |
| Actions | 7-dim (x, y, z, rx, ry, rz, gripper) |
| VLM layers | 4 (pruned from 16) |
| Expert width | 0.75x |
| Hardware | 1x NVIDIA H200 |
| Training time | ~10 hours total |
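The schedule in the table (linear warmup into cosine decay) can be sketched as follows. This is a common formulation under the table's hyperparameters; the repo's exact scheduler may differ in its floor value and phase handling:

```python
import math

def lr_at(step, total=50_000, warmup=2_000, peak=1e-4, floor=0.0):
    """Linear warmup to `peak`, then cosine decay to `floor`."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)  # 0 -> 1 after warmup
    return floor + (peak - floor) * 0.5 * (1 + math.cos(math.pi * progress))
```

The fine-tuning round would reuse the same shape with `peak=5e-5` and `warmup=1_000`, per the table.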
## EdgeVLA Family
| Model | Params | MSE | Cosine Sim | Gripper | Latency | HF Repo |
|---|---|---|---|---|---|---|
| SmolVLA | 450M | 0.618 | 0.663 | 94.9% | 169ms | lerobot/smolvla_base |
| Base | 363M | 0.458 | 0.713 | 96.5% | 162ms | enfuse/edgevla-base-fmb |
| Small | 228M | 0.515 | 0.679 | 95.8% | 90ms | enfuse/edgevla-small-fmb |
| Tiny | 164M | 0.555 | 0.654 | 95.1% | 57ms | this repo |
## Quick Start

```python
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

policy = SmolVLAPolicy.from_pretrained("enfuse/edgevla-tiny-fmb")
policy.eval()
```
## Fine-Tuning on Your Own Data

```bash
git clone https://github.com/enfuse/edgevla
cd edgevla

python edgevla/train.py \
  --base_policy enfuse/edgevla-tiny-fmb \
  --dataset your_lerobot_dataset \
  --fastvit_variant fastvit_t8 \
  --num_vlm_layers 4 \
  --expert_width_multiplier 0.75 \
  --lr 3e-5 \
  --steps 50000 \
  --batch_size 64
```
See the training README for full configuration options and multi-round training strategy.
## Attribution
Architecture from DynamicVLA (Xie et al., 2026). VLM layer pruning is our contribution. Built on SmolVLA, FastViT, and LeRobot.
```bibtex
@article{xie2026dynamicvla,
  title={DynamicVLA: Efficient Vision-Language-Action Model via Dynamic Fusion for Robotic Manipulation},
  author={Xie, Yue and others},
  journal={arXiv preprint arXiv:2601.22153},
  year={2026}
}
```