FORGE-Nano: Compressed VLA for Real-Time Robot Control

7B VLA teacher → <1B student → 14.1 fps on edge GPU

What is FORGE?

FORGE (Fast Optimized Robot Generation Engine) is a model distillation pipeline that takes any 7B+ Vision-Language-Action (VLA) model and compresses it to <2GB for real-time edge deployment on NVIDIA Jetson and Apple Silicon.

Part of the ANIMA agentic robotics AI stack by Robot Flow Labs.

Architecture

Teacher (7B VLA)
    |
    v
[SigLIP-SO400M] ---> [Bridge Attention] ---> [Qwen2.5-0.5B + LoRA] ---> [Action Head]
   (frozen)          (64 queries, 4L)       (rank=32/64)             (diffusion/flow)
   472.3M params     39.7M params           ~494M params             ~1.7M params

Total: 967.9M params (495.6M trainable, 472.3M frozen)
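The bridge is the only piece connecting the frozen vision tower to the LLM: 64 learnable query vectors cross-attend over the SigLIP patch tokens and emit a fixed-size prefix. A minimal NumPy sketch of one such cross-attention layer is below; the hidden size `d=128`, the 729-token patch grid, and the single-head layout are illustrative assumptions, not the real 39.7M-parameter, 4-layer configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bridge_attention(vision_tokens, queries, Wq, Wk, Wv):
    """One cross-attention layer: 64 learnable queries pool a variable-length
    sequence of frozen vision tokens into a fixed-size prefix for the LLM."""
    q = queries @ Wq           # (64, d)
    k = vision_tokens @ Wk     # (T, d)
    v = vision_tokens @ Wv     # (T, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (64, T)
    return attn @ v            # (64, d)

rng = np.random.default_rng(0)
d = 128                                    # illustrative hidden size
vision_tokens = rng.normal(size=(729, d))  # e.g. a 27x27 patch grid (assumed)
queries = rng.normal(size=(64, d))         # 64 learnable queries, as in the diagram
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = bridge_attention(vision_tokens, queries, Wq, Wk, Wv)
print(out.shape)  # (64, 128)
```

However many vision tokens come in, the LLM always sees exactly 64 prefix tokens, which keeps the student's sequence length (and latency) fixed.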

Benchmark Results (4x NVIDIA L4 24GB)

Student Variants

| Variant | Params | FP32 fps | FP16 fps | FP16 Speedup | Training Loss Reduction |
|---|---|---|---|---|---|
| Nano (diffusion, LoRA=32) | 967.9M | 7.9 | 11.0 | 1.39x | 67.0% |
| Nano (diffusion, LoRA=64) | 972.3M | 7.9 | 10.8 | 1.37x | 76.9% |
| Nano (flow, LoRA=32) | 967.9M | 8.2 | 12.6 | 1.54x | 85.8% |
| Small (diffusion) | 2097.7M | 6.2 | 9.9 | -- | -- |
| Small (flow) | 2097.7M | 6.1 | 11.3 | -- | -- |

Full Pipeline: Build -> Train -> Prune -> Deploy

| Configuration | Post-Prune Params | FP32 fps | FP16 fps | Loss Reduction |
|---|---|---|---|---|
| Diffusion + p75 + INT4 | 830.8M | 10.0 | 12.0 | 41.4% |
| Flow + p50 + INT4 | 739.3M | 14.1 | 7.8 | 76.3% |
| LoRA-64 + p90 + INT4 | 880.8M | 9.1 | 11.2 | 86.3% |
| Flow + LoRA-64 + p60 | 774.1M | 12.7 | 14.1 | 75.7% |
| No prune + INT8 | 922.2M | 8.1 | 11.0 | 59.4% |

Multi-GPU Scaling

| GPUs | FP32 fps (batch=16) | FP16 fps (batch=32) |
|---|---|---|
| 1 | 9.3 | 33.6 |
| 2 | 13.5 | -- |
| 4 | 13.6 | 31.6 |
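A quick way to read this table is scaling efficiency (speedup over 1 GPU divided by GPU count): the FP32 path scales reasonably to 2 GPUs but is essentially flat from 2 to 4, suggesting the workload becomes communication- or host-bound. The numbers below come straight from the table.

```python
# FP32 throughput from the multi-GPU table above (batch=16)
fps = {1: 9.3, 2: 13.5, 4: 13.6}

def scaling_efficiency(n_gpus):
    """Speedup over the 1-GPU baseline, divided by GPU count.
    1.0 would mean perfect linear scaling."""
    speedup = fps[n_gpus] / fps[1]
    return speedup / n_gpus

print(round(scaling_efficiency(2), 2))  # 0.73
print(round(scaling_efficiency(4), 2))  # 0.37
```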

Multi-Teacher Distillation

  • 5 teachers fit across 2 GPUs (22.7 GB total)
  • Router learns optimal teacher weighting in <50 steps
  • Best config: balanced (alpha_task=0.3) achieves 76.1% loss reduction
  • Supports: OpenVLA-7B, RDT2-FM, SmolVLA, BitVLA, Pi0
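The routing idea can be sketched in a few lines: a softmax over per-teacher logits produces the teacher weights, and `alpha_task` mixes the ground-truth task loss with the router-weighted distillation loss. This composition (and every number in it) is an assumption for illustration; the actual FORGE router and loss formula may differ.

```python
import math

def router_weights(logits):
    """Softmax over per-teacher router logits; weights sum to 1."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def total_loss(task_loss, teacher_losses, logits, alpha_task=0.3):
    """alpha_task mixes the ground-truth task loss with the router-weighted
    distillation losses (alpha_task=0.3 matches the 'balanced' config above).
    The exact mixing rule is an assumption, not the released implementation."""
    w = router_weights(logits)
    distill = sum(wi * li for wi, li in zip(w, teacher_losses))
    return alpha_task * task_loss + (1 - alpha_task) * distill

# 5 teachers, matching the bullet list; losses and logits are made-up values
logits = [0.2, -0.5, 1.1, 0.0, 0.3]
teacher_losses = [0.8, 1.4, 0.6, 1.0, 0.9]
loss = total_loss(task_loss=0.5, teacher_losses=teacher_losses, logits=logits)
```

Because the logits are learned scalars (one per teacher), the router adds almost no parameters, which is consistent with it converging in under 50 steps.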

Pruning Results

| Pruning Ratio | Layers Kept | Params | FP32 fps | Speedup |
|---|---|---|---|---|
| No prune | 24 | 967.9M | 7.9 | 1.0x |
| 90% keep | 18 | 880.8M | 9.1 | 1.15x |
| 75% keep | 15 | 830.8M | 10.0 | 1.27x |
| 60% keep | 11 | 774.1M | 12.7 | 1.61x |
| 50% keep | 9 | 739.3M | 14.1 | 1.78x |
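Note that speedup grows sub-linearly with depth removed: dropping 24 layers down to 9 gives only 1.78x, because the frozen SigLIP encoder and action head are untouched. Treating pruning through Amdahl's law, we can back out how much of the runtime the prunable layers must account for; this is an interpretation of the table, not a measurement from the repo.

```python
def amdahl_speedup(p, frac_remaining):
    """Expected end-to-end speedup when only a fraction p of runtime
    (the prunable LLM layers) is reduced to frac_remaining of its cost."""
    return 1.0 / ((1.0 - p) + p * frac_remaining)

def solve_p(observed_speedup, frac_remaining):
    """Invert Amdahl's law: what fraction of runtime must the pruned
    layers account for to explain the observed speedup?"""
    return (1.0 - 1.0 / observed_speedup) / (1.0 - frac_remaining)

# Keeping 9 of 24 layers gave a 1.78x speedup (last table row)
p = solve_p(1.78, 9 / 24)
print(round(p, 2))  # 0.70
```

Under this model, roughly 70% of FP32 inference time sits in the prunable LLM stack, so further layer pruning alone cannot push speedup much past ~3.3x; the vision encoder becomes the bottleneck.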

Recommended Configurations

Production (Edge Deployment)

variant: nano
action_head: flow
lora_rank: 64
prune_ratio: 0.60
target_bits: 4
# Result: 774M params, FP16 14.1 fps, <600MB INT4
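The <600MB figure is easy to sanity-check with back-of-envelope arithmetic: at 4 bits per weight, most of the 774M parameters cost half a byte each, plus a small fraction kept at 16 bits. The 5% FP16 fraction below is a guess, not a number from the release.

```python
def int4_size_mb(n_params, fp16_fraction=0.05):
    """Rough checkpoint size: most weights at 4 bits (0.5 bytes), with a
    small assumed fraction (embeddings, norms, action head) kept at 16 bits."""
    bytes_total = n_params * (1 - fp16_fraction) * 0.5 + n_params * fp16_fraction * 2.0
    return bytes_total / 1e6

size = int4_size_mb(774.1e6)
print(round(size))  # 445 -> consistent with the <600MB target above
```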

Quality-First

variant: nano
action_head: diffusion
lora_rank: 32
prune_ratio: 0.75
target_bits: 8
# Result: 830M params, 92.3% loss reduction

Key Findings

  1. Flow matching head is 15% faster than diffusion at FP16 inference (12.6 vs 11.0 fps)
  2. LoRA rank=64 trains 10% better than rank=32 (76.9% vs 67.0% loss reduction) with negligible speed cost
  3. Aggressive pruning works: 50% layer removal still produces a functional model at 14.1 fps
  4. FP16 autocast gives 1.4-1.5x speedup for free — always use it in production
  5. Multi-teacher routing converges fast: Router learns to weight teachers optimally in <50 steps
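Findings 1 and 4 can be reproduced directly from the student-variant table; the short check below recomputes the FP16 speedup column (variant names are shorthand for the table rows, not official identifiers).

```python
# (FP32 fps, FP16 fps) pairs from the student-variant benchmark table
pairs = {
    "nano-diff-lora32": (7.9, 11.0),
    "nano-diff-lora64": (7.9, 10.8),
    "nano-flow-lora32": (8.2, 12.6),
}
speedups = {name: round(fp16 / fp32, 2) for name, (fp32, fp16) in pairs.items()}
print(speedups)  # {'nano-diff-lora32': 1.39, 'nano-diff-lora64': 1.37, 'nano-flow-lora32': 1.54}
```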

Supported Teachers

| Teacher | Type | Params | Chunk Size |
|---|---|---|---|
| OpenVLA-7B | Token-AR | 7.6B | H=1 |
| RDT2-FM | Diffusion | 1.2B | H=8 |
| SmolVLA | Parallel | 0.5B | H=1 |
| BitVLA | Quantized | 5.9B | H=1 |
| Pi0 | Flow | 3.0B | H=4 |

Supported Robots

| Robot | DoF | Action Head | Horizon | Control Rate |
|---|---|---|---|---|
| Franka Panda | 7 | Flow | H=8 | 20 Hz |
| ALOHA (bimanual) | 14 | Chunk | H=16 | 50 Hz |
| xArm | 6 | Flow | H=4 | 100 Hz |
| UR5e | 6 | Flow | H=4 | 125 Hz |
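The horizon and control rate together determine how much motion each inference call covers: a chunk of H actions executed at the robot's control rate spans H / rate seconds. A small helper makes the per-robot budgets explicit, using the values from the table above.

```python
def chunk_duration_s(horizon, control_rate_hz):
    """Seconds of motion covered by one action chunk of length `horizon`
    executed at the robot's control rate."""
    return horizon / control_rate_hz

print(chunk_duration_s(8, 20))    # 0.4   (Franka Panda)
print(chunk_duration_s(16, 50))   # 0.32  (ALOHA)
print(chunk_duration_s(4, 100))   # 0.04  (xArm)
```

Longer chunks buy the policy more time per call at the cost of reacting later to new observations, which is why the high-rate arms use a short H=4 horizon.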

Pipeline

Teacher Labels -> Knowledge Distillation -> Layer Pruning -> Quantization -> Edge Export
  (HDF5)           (LoRA + Bridge)        (Chunk-aware)    (INT4/INT8)     (TRT/ONNX/MLX)

Usage

pip install anima-forge

# Full pipeline
forge pipeline --device cuda --variant nano --steps 5000

# Auto-detect model dimensions
forge autosense --model-dir /path/to/models

# Benchmark
forge benchmark run --device cuda

# Deploy
forge serve --port 8000

Citation

@software{forge2026,
  title={FORGE: Fast Optimized Robot Generation Engine},
  author={Robot Flow Labs},
  year={2026},
  url={https://github.com/RobotFlow-Labs/anima-forge-distillation-pipeline}
}

License

Apache 2.0
