---
license: apache-2.0
tags:
- robotics
- vla
- knowledge-distillation
- model-compression
- edge-deployment
- action-chunking
- multi-teacher
datasets:
- lerobot/pusht
- lerobot/libero
language:
- en
library_name: forge
pipeline_tag: robotics
---

# FORGE-Nano: Compressed VLA for Real-Time Robot Control
7B VLA teacher → <1B student → 14.1 fps on edge GPU
## What is FORGE?

**FORGE** (Fast Optimized Robot Generation Engine) is a model distillation pipeline that takes any 7B+ Vision-Language-Action (VLA) model and compresses it to **under 2 GB** for real-time edge deployment on NVIDIA Jetson and Apple Silicon. It is part of the **ANIMA** agentic robotics AI stack by [Robot Flow Labs](https://robotflowlabs.com).

## Architecture

```
Teacher (7B VLA)
       |
       v
[SigLIP-SO400M] ---> [Bridge Attention] ---> [Qwen2.5-0.5B + LoRA] ---> [Action Head]
    (frozen)          (64 queries, 4L)           (rank=32/64)          (diffusion/flow)
 472.3M params         39.7M params              ~494M params            ~1.7M params
```

**Total: 967.9M params** (495.6M trainable, 472.3M frozen)

## Benchmark Results (4x NVIDIA L4 24GB)

### Student Variants

| Variant | Params | FP32 fps | FP16 fps | FP16 Speedup | Training Loss Reduction |
|---------|--------|----------|----------|--------------|-------------------------|
| Nano (diffusion, LoRA=32) | 967.9M | 7.9 | **11.0** | 1.39x | 67.0% |
| Nano (diffusion, LoRA=64) | 972.3M | 7.9 | 10.8 | 1.37x | **76.9%** |
| Nano (flow, LoRA=32) | 967.9M | **8.2** | **12.6** | **1.54x** | 85.8% |
| Small (diffusion) | 2097.7M | 6.2 | 9.9 | -- | -- |
| Small (flow) | 2097.7M | 6.1 | **11.3** | -- | -- |

### Full Pipeline: Build -> Train -> Prune -> Deploy

| Configuration | Post-Prune Params | FP32 fps | FP16 fps | Loss Reduction |
|---------------|-------------------|----------|----------|----------------|
| Diffusion + p75 + INT4 | 830.8M | 10.0 | 12.0 | 41.4% |
| Flow + p50 + INT4 | **739.3M** | **14.1** | 7.8 | 76.3% |
| LoRA-64 + p90 + INT4 | 880.8M | 9.1 | 11.2 | **86.3%** |
| **Flow + LoRA-64 + p60** | **774.1M** | **12.7** | **14.1** | 75.7% |
| No prune + INT8 | 922.2M | 8.1 | 11.0 | 59.4% |

### Multi-GPU Scaling

| GPUs | FP32 (batch=16) | FP16 (batch=32) |
|------|-----------------|-----------------|
| 1 GPU | 9.3 fps | **33.6 fps** |
| 2 GPU | 13.5 fps | -- |
| 4 GPU | **13.6 fps** | 31.6 fps |

### Multi-Teacher Distillation

- **5 teachers** fit across 2 GPUs (22.7 GB total)
- Router learns optimal teacher weighting in <50 steps
- Best config: balanced (alpha_task=0.3) achieves **76.1% loss reduction**
- Supports: OpenVLA-7B, RDT2-FM, SmolVLA, BitVLA, Pi0

### Pruning Results

| Pruning Ratio | Layers | Params | FP32 fps | Speedup |
|---------------|--------|--------|----------|---------|
| No prune | 24 | 967.9M | 7.9 | 1.0x |
| 90% keep | 18 | 880.8M | 9.1 | 1.15x |
| 75% keep | 15 | 830.8M | 10.0 | 1.27x |
| 60% keep | 11 | 774.1M | 12.7 | **1.61x** |
| 50% keep | 9 | 739.3M | **14.1** | **1.78x** |

## Recommended Configurations

### Production (Edge Deployment)

```yaml
variant: nano
action_head: flow
lora_rank: 64
prune_ratio: 0.60
target_bits: 4
# Result: 774M params, FP16 14.1 fps, <600MB INT4
```

### Quality-First

```yaml
variant: nano
action_head: diffusion
lora_rank: 32
prune_ratio: 0.75
target_bits: 8
# Result: 830M params, 92.3% loss reduction
```

## Key Findings

1. **Flow matching head is ~15% faster** than diffusion at FP16 inference (12.6 vs 11.0 fps)
2. **LoRA rank=64 trains better** than rank=32 (76.9% vs 67.0% loss reduction, ~10 points higher) with negligible speed cost
3. **Aggressive pruning works**: removing 50% of the layers still produces a functional model at 14.1 fps
4. **FP16 autocast gives a 1.4-1.5x speedup** essentially for free; always use it in production
5. **Multi-teacher routing converges fast**: the router learns to weight teachers optimally in <50 steps

## Supported Teachers

| Teacher | Type | Params | Chunk Size |
|---------|------|--------|------------|
| OpenVLA-7B | Token-AR | 7.6B | H=1 |
| RDT2-FM | Diffusion | 1.2B | H=8 |
| SmolVLA | Parallel | 0.5B | H=1 |
| BitVLA | Quantized | 5.9B | H=1 |
| Pi0 | Flow | 3.0B | H=4 |

## Supported Robots

| Robot | DoF | Action Head | Horizon | Control Rate |
|-------|-----|-------------|---------|--------------|
| Franka Panda | 7 | Flow | H=8 | 20 Hz |
| ALOHA (bimanual) | 14 | Chunk | H=16 | 50 Hz |
| xArm | 6 | Flow | H=4 | 100 Hz |
| UR5e | 6 | Flow | H=4 | 125 Hz |

## Pipeline

```
Teacher Labels -> Knowledge Distillation -> Layer Pruning -> Quantization -> Edge Export
    (HDF5)          (LoRA + Bridge)        (Chunk-aware)     (INT4/INT8)    (TRT/ONNX/MLX)
```

## Usage

```bash
pip install anima-forge

# Full pipeline
forge pipeline --device cuda --variant nano --steps 5000

# Auto-detect model dimensions
forge autosense --model-dir /path/to/models

# Benchmark
forge benchmark run --device cuda

# Deploy
forge serve --port 8000
```

## Citation

```bibtex
@software{forge2026,
  title={FORGE: Fast Optimized Robot Generation Engine},
  author={Robot Flow Labs},
  year={2026},
  url={https://github.com/RobotFlow-Labs/anima-forge-distillation-pipeline}
}
```

## License

Apache 2.0
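The Bridge Attention stage in the Architecture section maps a variable number of frozen SigLIP vision tokens onto a fixed set of 64 learned queries before the Qwen2.5 backbone sees them. Here is a minimal single-head NumPy sketch of that idea; the embedding width, weight shapes, and 729-token count are illustrative assumptions, not FORGE's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bridge_cross_attention(vision_feats, queries, Wq, Wk, Wv):
    """One cross-attention layer: 64 learned queries attend over frozen vision tokens."""
    q = queries @ Wq          # (64, d)
    k = vision_feats @ Wk     # (T, d)
    v = vision_feats @ Wv     # (T, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (64, T)
    return attn @ v           # (64, d): fixed-size summary, independent of T

rng = np.random.default_rng(0)
d, T = 128, 729                      # hypothetical width; T is the vision token count
vision_feats = rng.normal(size=(T, d))
queries = rng.normal(size=(64, d))   # learnable parameters during distillation
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.02 for _ in range(3))
out = bridge_cross_attention(vision_feats, queries, Wq, Wk, Wv)
print(out.shape)  # (64, 128)
```

Because the output is always 64 tokens, the language model's sequence length (and hence latency) stays constant regardless of image resolution.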
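The Multi-Teacher Distillation section reports that the router learns teacher weights in under 50 steps. One way to picture this is a softmax-gated router doing gradient descent on a weighted mixture of per-teacher distillation losses; the loss values, learning rate, and fixed-loss simplification below are toy assumptions, not FORGE's actual training loop:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical per-teacher distillation losses (e.g. error vs. each teacher's actions).
teacher_losses = np.array([0.9, 0.4, 0.2, 0.7, 0.5])  # 5 teachers

logits = np.zeros(5)  # router starts uniform over teachers
lr = 1.0
for step in range(50):
    w = softmax(logits)
    # Gradient of the mixture loss sum_i w_i * L_i with respect to the logits:
    # the softmax Jacobian gives w_i * (L_i - w . L).
    grad = w * (teacher_losses - w @ teacher_losses)
    logits -= lr * grad

w = softmax(logits)
print(int(np.argmax(w)))  # 2 -- the lowest-loss teacher dominates
```

Teachers whose loss is below the current weighted average gain weight each step, so the gate concentrates quickly, consistent with the fast convergence reported above.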
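The layer pruning behind the Pruning Results table can be sketched as importance-ranked depth pruning: score each decoder layer, keep the top fraction, and preserve the original layer order. The scoring function below is a toy stand-in; FORGE's chunk-aware criterion and its exact kept-layer counts are not reproduced here:

```python
def prune_layers(layers, scores, keep_ratio):
    """Keep the highest-scoring fraction of layers, in their original order."""
    n_keep = max(1, round(len(layers) * keep_ratio))
    top = sorted(range(len(layers)), key=lambda i: -scores[i])[:n_keep]
    return [layers[i] for i in sorted(top)]

layers = list(range(24))                    # 24 decoder layers, as in the unpruned Nano
scores = [abs(12 - i) + 1 for i in layers]  # toy importance scores
kept = prune_layers(layers, scores, keep_ratio=0.50)
print(len(kept))  # 12
```

Keeping layers in order matters: the surviving blocks are re-stacked as-is, and only a short fine-tune is needed to heal the removed connections.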
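The flow action head can be illustrated with the standard flow-matching training target: regress a constant velocity along a straight path from noise to the ground-truth action chunk. This is the generic rectified-flow recipe, sketched under the assumption that FORGE's head follows it; variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_pair(actions, rng):
    """Build one flow-matching training example for an action chunk.

    actions: (H, dof) ground-truth chunk. Returns the interpolated point x_t,
    the sampled time t, and the velocity target the network should regress.
    """
    noise = rng.normal(size=actions.shape)
    t = rng.uniform()
    x_t = (1.0 - t) * noise + t * actions  # point on the straight noise->data path
    v_target = actions - noise             # constant velocity along that path
    return x_t, t, v_target

chunk = rng.normal(size=(8, 7))  # H=8 chunk for a 7-DoF arm (Franka-style row above)
x_t, t, v = flow_matching_pair(chunk, rng)
# At inference, integrate dx/dt = v_theta(x, t, obs) from t=0 (noise) to t=1 (actions).
print(x_t.shape, v.shape)  # (8, 7) (8, 7)
```

A single regression target per sample (no noise schedule) and few integration steps at inference are what make flow heads attractive for the latency numbers reported above.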