# EdgeVLA-Base (FMB)
363M-parameter edge-optimized Vision-Language-Action (VLA) model: 19% smaller than SmolVLA, with 26% lower action MSE. Trained on real-robot data.

EdgeVLA-Base replaces the frozen SigLIP vision encoder in SmolVLA with a trainable FastViT-SA12 convolutional backbone, reducing vision parameters from 98M to 12M while improving action prediction accuracy. The vision encoder is trained end-to-end, so every parameter contributes at inference. The architecture is inspired by DynamicVLA; the VLM layer pruning is our contribution.
Trained exclusively on lerobot/fmb (3-camera Franka Panda manipulation). Source code: enfuse/edgevla
## Intended Use & What You Can Do With This Model
This model predicts 7-DoF robot actions (x, y, z, rx, ry, rz, gripper) from 3 camera images. It outputs 50-step action chunks at 10Hz — each inference produces 5 seconds of continuous robot motion.
Immediate uses:
- Deploy on a Franka Panda (or compatible 7-DoF arm) with a 3-camera setup for FMB-style tabletop manipulation. Feed camera frames in, execute the predicted delta actions.
- Fine-tune on your own robot data: this is the most practical use. If you have any robot with cameras in LeRobot format, this checkpoint is an excellent pretrained starting point. Fine-tuning at LR=3e-5 for 50K steps typically adapts well to new setups.
- Edge deployment: designed for NVIDIA Jetson (estimated ~368ms per inference on Orin AGX with TensorRT FP16, well within the 5-second action-chunk budget).
- Research baseline: beats the 450M SmolVLA baseline while being 19% smaller. A good starting point for VLA architecture research.
Important caveats:
- All metrics below are offline action prediction on held-out FMB samples. There are no closed-loop success-rate numbers; the model has not been validated on a physical robot completing full tasks.
- Trained specifically on FMB data (Franka Panda, specific manipulation tasks, 3-camera setup). It will not generalize to different robots, camera configurations, or tasks without fine-tuning.
- The model expects 3 camera inputs (side_1, side_2, wrist). For single-camera setups, you would need to fine-tune with `--empty_cameras` or retrain.
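The chunked-execution pattern described above (one inference yields 50 delta actions, streamed at 10Hz) can be sketched as a simple executor loop. This is a minimal illustration, not the EdgeVLA runtime: `send_delta_action` is a hypothetical stand-in for your robot interface, and the zero-filled chunk stands in for real policy output.

```python
import time
import numpy as np

CHUNK_LEN = 50   # action steps per inference
ACTION_DIM = 7   # (x, y, z, rx, ry, rz, gripper)

def execute_chunk(chunk: np.ndarray, send_delta_action, control_hz: float = 10.0) -> None:
    """Stream one predicted action chunk to the robot at the control rate.

    `send_delta_action` is a hypothetical callback that applies one 7-DoF
    delta action (plus gripper command) to the arm. At 10Hz, a 50-step
    chunk covers 5 seconds of motion, leaving time for the next inference.
    """
    assert chunk.shape == (CHUNK_LEN, ACTION_DIM)
    period = 1.0 / control_hz
    for action in chunk:
        start = time.monotonic()
        send_delta_action(action)
        # Sleep off the remainder of the control period.
        time.sleep(max(0.0, period - (time.monotonic() - start)))

# Dummy run: a zero chunk and a recording callback, at a high rate
# so the loop returns quickly.
log = []
execute_chunk(np.zeros((CHUNK_LEN, ACTION_DIM)), log.append, control_hz=1000.0)
```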
## Results (FMB Offline, 500 held-out samples)
| Metric | SmolVLA (450M) | EdgeVLA-Base (363M) | Delta |
|---|---|---|---|
| Action MSE | 0.618 | 0.458 | -26% |
| Cosine Similarity | 0.663 | 0.713 | +8% |
| Gripper Accuracy | 94.9% | 96.5% | +1.6pp |
| Inference Latency (H200) | 169ms | 162ms | -4% |
| Memory (FP16) | 858MB | 693MB | -19% |
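The FP16 memory row is consistent with simple parameter arithmetic. A quick sanity check in Python; it assumes 2 bytes per parameter and that "MB" in the table means MiB, both of which are our assumptions.

```python
# Sanity check on the FP16 memory row: parameters x bytes-per-parameter.
TOTAL_PARAMS = 363e6   # "363M" from the model card
BYTES_FP16 = 2

mib = TOTAL_PARAMS * BYTES_FP16 / 2**20
print(f"{mib:.0f} MiB")  # close to the 693 MB in the table; the exact
                         # value depends on the precise parameter count
```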
### Per-Dimension MSE
| Dim | SmolVLA | Base | Delta |
|---|---|---|---|
| x | 0.538 | 0.464 | -14% |
| y | 0.598 | 0.507 | -15% |
| z | 0.599 | 0.497 | -17% |
| rx | 0.624 | 0.396 | -37% |
| ry | 1.358 | 0.831 | -39% |
| rz | 0.373 | 0.351 | -6% |
| gripper | 0.233 | 0.161 | -31% |
### Latency (H200, FP32)
| Mean | P50 | P95 | Throughput |
|---|---|---|---|
| 162ms | 161ms | 173ms | 6.2 Hz |
## Architecture
EdgeVLA-Base (363M total, 111M trainable):

- FastViT-SA12 vision encoder: 11.5M (trainable; replaces 98M frozen SigLIP)
- VLM (SmolLM2-360M): 251.9M (frozen, 16 layers)
- Action expert: 98.2M (trainable, flow matching)
- Projections: 1.6M (trainable)

Key changes from SmolVLA: FastViT-SA12 (convolutional, trainable) replaces SigLIP (ViT, frozen); 64 visual tokens instead of 729 (11x fewer); 256x256 input instead of 384x384.
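The 64-vs-729 token counts follow directly from each encoder's output grid. A sketch of the arithmetic; the SigLIP patch size (14) and the FastViT-SA12 effective output stride (32) are assumptions based on the standard configurations of those backbones, not values stated in this card.

```python
# SigLIP (ViT): one token per 14x14 patch of a 384x384 image.
siglip_tokens = (384 // 14) ** 2       # 27 x 27 = 729

# FastViT-SA12 (conv): effective output stride of 32 on a 256x256 image.
fastvit_tokens = (256 // 32) ** 2      # 8 x 8 = 64

print(siglip_tokens, fastvit_tokens)             # 729 64
print(round(siglip_tokens / fastvit_tokens, 1))  # 11.4x fewer tokens
```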
## Training
| Parameter | Value |
|---|---|
| Dataset | lerobot/fmb |
| Total steps | 150K (50K + 50K + 50K fine-tune) |
| Batch size | 64 |
| Learning rate | 1e-4 → 3e-5 (cosine) |
| Warmup | 2,000 / 500 steps |
| Augmentation | ColorJitter + RandomSharpness + RandomAffine |
| Cameras | 3 (side_1, side_2, wrist) |
| Actions | 7-dim (x, y, z, rx, ry, rz, gripper) |
| VLM layers | 16 |
| Expert width | 0.75x |
| Hardware | 1x NVIDIA H200 |
| Training time | ~16 hours total |
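The learning-rate entry (1e-4 → 3e-5, cosine, with warmup) is consistent with a standard linear-warmup-plus-cosine-decay schedule. A minimal sketch under that assumption; the actual training code may differ in warmup shape and in how the schedule resets between the three 50K-step rounds.

```python
import math

def lr_at(step, total_steps, max_lr=1e-4, min_lr=3e-5, warmup=2000):
    """Hypothetical schedule: linear warmup to max_lr, cosine decay to min_lr."""
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(2_000, 150_000))    # peak LR right after warmup (~1e-4)
print(lr_at(150_000, 150_000))  # final LR (~3e-5)
```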
## EdgeVLA Family
| Model | Params | MSE | Cosine Sim | Gripper | Latency | HF Repo |
|---|---|---|---|---|---|---|
| SmolVLA | 450M | 0.618 | 0.663 | 94.9% | 169ms | lerobot/smolvla_base |
| Base | 363M | 0.458 | 0.713 | 96.5% | 162ms | this repo |
| Small | 228M | 0.515 | 0.679 | 95.8% | 90ms | enfuse/edgevla-small-fmb |
| Tiny | 164M | 0.555 | 0.654 | 95.1% | 57ms | enfuse/edgevla-tiny-fmb |
## Quick Start

```python
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

policy = SmolVLAPolicy.from_pretrained("enfuse/edgevla-base-fmb")
policy.eval()
```
## Fine-Tuning on Your Own Data

```bash
git clone https://github.com/enfuse/edgevla
cd edgevla

# Fine-tune from this checkpoint
python edgevla/train.py \
    --base_policy enfuse/edgevla-base-fmb \
    --dataset your_lerobot_dataset \
    --fastvit_variant fastvit_sa12 \
    --num_vlm_layers 16 \
    --expert_width_multiplier 0.75 \
    --lr 3e-5 \
    --steps 50000 \
    --batch_size 64
```
See the training README for full configuration options and multi-round training strategy.
## Attribution
Architecture from DynamicVLA (Xie et al., 2026). VLM layer pruning is our contribution. Built on SmolVLA, FastViT, and LeRobot.
```bibtex
@article{xie2026dynamicvla,
  title={DynamicVLA: Efficient Vision-Language-Action Model via Dynamic Fusion for Robotic Manipulation},
  author={Xie, Yue and others},
  journal={arXiv preprint arXiv:2601.22153},
  year={2026}
}
```