---
license: apache-2.0
language:
  - en
library_name: lerobot
pipeline_tag: robotics
tags:
  - robotics
  - vla
  - vision-language-action
  - edge-ai
  - manipulation
  - fastvit
  - smolvla
  - lerobot
  - jetson
  - edge-deployment
datasets:
  - lerobot/fmb
base_model:
  - lerobot/smolvla_base
  - HuggingFaceTB/SmolVLM2-500M-Video-Instruct
model-index:
  - name: EdgeVLA-Small
    results:
      - task:
          type: robotics
          name: Action Prediction (FMB)
        dataset:
          type: lerobot/fmb
          name: Functional Manipulation Benchmark
        metrics:
          - type: mse
            value: 0.515
            name: Action MSE
          - type: accuracy
            value: 95.8
            name: Gripper Accuracy (%)
          - type: cosine_similarity
            value: 0.679
            name: Cosine Similarity
---

# EdgeVLA-Small (FMB)

A 228M-parameter, edge-optimized Vision-Language-Action model: 49% smaller than SmolVLA, 47% faster, with 17% lower action-prediction error (MSE). The best efficiency/accuracy tradeoff in the EdgeVLA family. Trained on real-robot data.

EdgeVLA-Small combines a FastViT-t8 vision encoder with aggressive VLM layer pruning (16 to 8 layers) to reach 228M parameters while beating the 450M SmolVLA baseline on action prediction and running nearly 2x faster. The vision encoder is trained end-to-end, so every vision parameter is adapted to the task rather than carried frozen from pretraining. The architecture is inspired by DynamicVLA; the VLM layer pruning is our contribution.

Trained exclusively on [lerobot/fmb](https://huggingface.co/datasets/lerobot/fmb) (3-camera Franka Panda manipulation). Source code: [enfuse/edgevla](https://github.com/enfuse/edgevla).

## Intended Use & What You Can Do With This Model

This model predicts 7-DoF robot actions (x, y, z, rx, ry, rz, gripper) from 3 camera images. It outputs 50-step action chunks at 10Hz — each inference produces 5 seconds of continuous robot motion.

Immediate uses:

- Deploy on a Franka Panda (or compatible 7-DoF arm) with a 3-camera setup for FMB-style tabletop manipulation. Feed camera frames in, execute the predicted delta actions.
- Fine-tune on your own robot data — this is the most practical use. If you have any robot with cameras in LeRobot format, this checkpoint is an excellent pretrained starting point. Fine-tuning at LR=3e-5 for 50K steps typically adapts well to new setups.
- Edge deployment — the best efficiency/accuracy tradeoff in the EdgeVLA family. Estimated ~184ms on Jetson Orin AGX with TensorRT FP16. At 228M params and 435MB FP16, it fits comfortably on Jetson Orin NX (16GB) and AGX (32–64GB); see the footprint sketch after this list.
- Research baseline — 49% smaller than SmolVLA, nearly 2x faster, and still 17% lower action MSE.
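
A quick back-of-the-envelope check of the FP16 weight footprint quoted above (weights only; activations, KV cache, and runtime buffers add to this):

```python
# Rough FP16 weight footprint for the quoted 228M-parameter count.
# Estimate only: ignores activations, KV cache, and runtime buffers.
params = 228e6
fp16_bytes = params * 2                  # 2 bytes per parameter in FP16
print(f"{fp16_bytes / 2**20:.0f} MiB")   # ~435 MiB, matching the memory figure below
```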

Important caveats:

- All metrics below are offline action prediction on held-out FMB samples. There are no closed-loop success rate numbers — the model has not been validated on a physical robot completing full tasks.
- Trained specifically on FMB data (Franka Panda, specific manipulation tasks, 3-camera setup). It will not generalize to different robots, camera configurations, or tasks without fine-tuning.
- The model expects 3 camera inputs (side_1, side_2, wrist). For single-camera setups, you would need to fine-tune with `--empty_cameras` or retrain.

## Results (FMB Offline, 500 held-out samples)

| Metric | SmolVLA (450M) | EdgeVLA-Small (228M) | Delta |
|---|---|---|---|
| Action MSE | 0.618 | 0.515 | -17% |
| Cosine Similarity | 0.663 | 0.679 | +2% |
| Gripper Accuracy | 94.9% | 95.8% | +0.9pp |
| Inference Latency (H200) | 169ms | 90ms | -47% |
| Memory (FP16) | 858MB | 435MB | -49% |

### Per-Dimension MSE

| Dim | SmolVLA | Small | Delta |
|---|---|---|---|
| x | 0.538 | 0.514 | -4% |
| y | 0.598 | 0.540 | -10% |
| z | 0.599 | 0.529 | -12% |
| rx | 0.624 | 0.462 | -26% |
| ry | 1.358 | 0.974 | -28% |
| rz | 0.373 | 0.395 | +6% |
| gripper | 0.233 | 0.192 | -18% |

### Latency (H200, FP32)

| Mean | P50 | P95 | Throughput |
|---|---|---|---|
| 90ms | 87ms | 105ms | 11.1 Hz |
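
For reference, the offline metrics above can be computed along the following lines. This is a minimal sketch with random tensors standing in for model outputs; the exact evaluation code (including how the gripper channel is thresholded) lives in the GitHub repo, so treat the details as assumptions.

```python
import torch
import torch.nn.functional as F

# Stand-in predictions/targets: (num_samples, chunk_len, action_dim) = (500, 50, 7)
pred = torch.randn(500, 50, 7)
target = torch.randn(500, 50, 7)

action_mse = F.mse_loss(pred, target).item()

# Cosine similarity between predicted and ground-truth 7-DoF action vectors,
# averaged over samples and chunk steps.
cos_sim = F.cosine_similarity(pred, target, dim=-1).mean().item()

# Gripper accuracy: open/close agreement on the last action dimension
# (assumes a zero threshold on normalized gripper values).
gripper_acc = ((pred[..., -1] > 0) == (target[..., -1] > 0)).float().mean().item()

print(f"MSE={action_mse:.3f}  cos={cos_sim:.3f}  gripper={100 * gripper_acc:.1f}%")
```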

## Architecture

```
EdgeVLA-Small (228M total, 55M trainable):
  FastViT-t8 vision:        4.0M  (trainable, replaces SigLIP 98M frozen)
  VLM (SmolLM2-360M):     169.2M  (frozen, 8 layers — pruned from 16)
  Action expert:           49.1M  (trainable, flow matching, 0.75x width)
  Projections:              1.6M  (trainable)
```

Key changes from SmolVLA: FastViT-t8 (convolutional, trainable) replaces SigLIP (ViT, frozen); the VLM is pruned from 16 to 8 layers; 64 visual tokens instead of 729 (11x fewer); 256x256 input instead of 384x384.
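
To sanity-check the parameter breakdown against the released checkpoint, you can count parameters directly after loading. Whether the frozen VLM layers report `requires_grad=False` after `from_pretrained` depends on the policy configuration, so treat the trainable count as indicative rather than authoritative.

```python
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

policy = SmolVLAPolicy.from_pretrained("enfuse/edgevla-small-fmb")

total = sum(p.numel() for p in policy.parameters())
trainable = sum(p.numel() for p in policy.parameters() if p.requires_grad)

# Expected to land near the 228M total / 55M trainable figures quoted above.
print(f"total: {total / 1e6:.1f}M  trainable: {trainable / 1e6:.1f}M")
```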

## Training

| Parameter | Value |
|---|---|
| Dataset | lerobot/fmb |
| Total steps | 150K (50K + 100K continued) |
| Batch size | 64 |
| Learning rate | 1e-4 (cosine) |
| Warmup | 2,000 steps |
| Augmentation | ColorJitter + RandomSharpness + RandomAffine |
| Cameras | 3 (side_1, side_2, wrist) |
| Actions | 7-dim (x, y, z, rx, ry, rz, gripper) |
| VLM layers | 8 (pruned from 16) |
| Expert width | 0.75x |
| Hardware | 1x NVIDIA H200 |
| Training time | ~16 hours total |
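
The learning-rate schedule in the table (peak 1e-4, 2,000-step linear warmup, cosine decay) corresponds roughly to the sketch below. This is illustrative only, not the exact trainer code; in particular, the continued 100K-step run may have restarted or re-warmed the schedule.

```python
import math

def lr_at_step(step: int, total_steps: int = 150_000, warmup_steps: int = 2_000,
               peak_lr: float = 1e-4, min_lr: float = 0.0) -> float:
    """Linear warmup followed by cosine decay, as described in the training table."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```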

## EdgeVLA Family

| Model | Params | MSE | Cosine Sim | Gripper | Latency | HF Repo |
|---|---|---|---|---|---|---|
| SmolVLA | 450M | 0.618 | 0.663 | 94.9% | 169ms | lerobot/smolvla_base |
| Base | 363M | 0.458 | 0.713 | 96.5% | 162ms | enfuse/edgevla-base-fmb |
| Small | 228M | 0.515 | 0.679 | 95.8% | 90ms | this repo |
| Tiny | 164M | 0.555 | 0.654 | 95.1% | 57ms | enfuse/edgevla-tiny-fmb |

## Quick Start

```python
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

policy = SmolVLAPolicy.from_pretrained("enfuse/edgevla-small-fmb")
policy.eval()
```
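
Continuing from the snippet above, you can run a quick smoke test without a robot by feeding the policy dummy tensors through LeRobot's standard `select_action` interface. The observation keys, state dimension, and 256x256 resolution below are assumptions based on the FMB setup described earlier; check the checkpoint's config for the authoritative input spec.

```python
import torch

policy.reset()  # clear the internal action queue before starting an episode

# Dummy observation; key names and shapes are assumptions, not the verified input spec.
batch = {
    "observation.images.side_1": torch.rand(1, 3, 256, 256),
    "observation.images.side_2": torch.rand(1, 3, 256, 256),
    "observation.images.wrist": torch.rand(1, 3, 256, 256),
    "observation.state": torch.rand(1, 7),
    "task": ["pick up the object and place it in the board"],  # dummy instruction
}

with torch.no_grad():
    # Returns one 7-DoF action per call, drawn from the internally cached action chunk.
    action = policy.select_action(batch)

print(action.shape)
```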

## Fine-Tuning on Your Own Data

```bash
git clone https://github.com/enfuse/edgevla
cd edgevla

python edgevla/train.py \
  --base_policy enfuse/edgevla-small-fmb \
  --dataset your_lerobot_dataset \
  --fastvit_variant fastvit_t8 \
  --num_vlm_layers 8 \
  --expert_width_multiplier 0.75 \
  --lr 3e-5 \
  --steps 50000 \
  --batch_size 64
```

See the training README for full configuration options and multi-round training strategy.

## Attribution

Architecture from DynamicVLA (Xie et al., 2026). VLM layer pruning is our contribution. Built on SmolVLA, FastViT, and LeRobot.

```bibtex
@article{xie2026dynamicvla,
  title={DynamicVLA: Efficient Vision-Language-Action Model via Dynamic Fusion for Robotic Manipulation},
  author={Xie, Yue and others},
  journal={arXiv preprint arXiv:2601.22153},
  year={2026}
}
```