---
license: apache-2.0
tags:
- robotics
- vla
- knowledge-distillation
- model-compression
- edge-deployment
- action-chunking
- multi-teacher
datasets:
- lerobot/pusht
- lerobot/libero
language:
- en
library_name: forge
pipeline_tag: robotics
---

# FORGE-Nano: Compressed VLA for Real-Time Robot Control

<p align="center">
<strong>7B VLA teacher → <1B student → 14.1 fps on edge GPU</strong>
</p>

## What is FORGE?

**FORGE** (Fast Optimized Robot Generation Engine) is a model distillation pipeline that takes any 7B+ Vision-Language-Action (VLA) model and compresses it to **<2GB for real-time edge deployment** on NVIDIA Jetson and Apple Silicon.

Part of the **ANIMA** agentic robotics AI stack by [Robot Flow Labs](https://robotflowlabs.com).

## Architecture

```
Teacher (7B VLA)
        |
        v
[SigLIP-SO400M] ---> [Bridge Attention] ---> [Qwen2.5-0.5B + LoRA] ---> [Action Head]
    (frozen)          (64 queries, 4L)           (rank=32/64)           (diffusion/flow)
  472.3M params        39.7M params              ~494M params            ~1.7M params
```

**Total: 967.9M params** (495.6M trainable, 472.3M frozen)
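
The rank-32 vs rank-64 split in the tables below (972.3M vs 967.9M, a gap of ~4.4M parameters) can be roughly sanity-checked with LoRA's parameter formula. This is a hedged sketch: it assumes adapters on the q/k/v/o projections of all 24 Qwen2.5-0.5B blocks, with hidden size 896 and a GQA key/value width of 128; the actual FORGE target modules are not stated in this card.

```python
def lora_params(rank, shapes):
    """LoRA adds rank * (d_in + d_out) parameters per adapted weight matrix."""
    return sum(rank * (d_in + d_out) for d_in, d_out in shapes)

# Assumed target set: q/k/v/o projections in each of 24 blocks
# (hidden 896, GQA key/value width 128 -- an assumption, not from the card).
per_block = [(896, 896), (896, 128), (896, 128), (896, 896)]
shapes = per_block * 24

print(lora_params(32, shapes))  # 4325376 (~4.3M, close to the ~4.4M gap)
print(lora_params(64, shapes) - lora_params(32, shapes))  # 4325376: doubling rank adds the same again
```

Since LoRA's cost is linear in rank, going from rank 32 to rank 64 adds exactly one more rank-32's worth of parameters.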

## Benchmark Results (4x NVIDIA L4 24GB)

### Student Variants

| Variant | Params | FP32 fps | FP16 fps | FP16 Speedup | Training Loss Reduction |
|---------|--------|----------|----------|--------------|-------------------------|
| Nano (diffusion, LoRA=32) | 967.9M | 7.9 | **11.0** | 1.39x | 67.0% |
| Nano (diffusion, LoRA=64) | 972.3M | 7.9 | 10.8 | 1.37x | **76.9%** |
| Nano (flow, LoRA=32) | 967.9M | **8.2** | **12.6** | **1.54x** | 85.8% |
| Small (diffusion) | 2097.7M | 6.2 | 9.9 | -- | -- |
| Small (flow) | 2097.7M | 6.1 | **11.3** | -- | -- |

### Full Pipeline: Build -> Train -> Prune -> Deploy

| Configuration | Post-Prune Params | FP32 fps | FP16 fps | Loss Reduction |
|---------------|-------------------|----------|----------|----------------|
| Diffusion + p75 + INT4 | 830.8M | 10.0 | 12.0 | 41.4% |
| Flow + p50 + INT4 | **739.3M** | **14.1** | 7.8 | 76.3% |
| LoRA-64 + p90 + INT4 | 880.8M | 9.1 | 11.2 | **86.3%** |
| **Flow + LoRA-64 + p60** | **774.1M** | **12.7** | **14.1** | 75.7% |
| No prune + INT8 | 922.2M | 8.1 | 11.0 | 59.4% |

### Multi-GPU Scaling

| GPUs | FP32 b=16 | FP16 b=32 |
|------|-----------|-----------|
| 1 GPU | 9.3 fps | **33.6 fps** |
| 2 GPU | 13.5 fps | -- |
| 4 GPU | **13.6 fps** | 31.6 fps |
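
The FP32 column above implies sub-linear data-parallel scaling. A quick efficiency check (plain arithmetic, not a FORGE API):

```python
def scaling_efficiency(fps_n, fps_1, n_gpus):
    """Measured throughput divided by ideal linear scaling from 1 GPU."""
    return fps_n / (n_gpus * fps_1)

print(round(scaling_efficiency(13.5, 9.3, 2), 2))  # 0.73 -> 2 GPUs run at ~73% efficiency
print(round(scaling_efficiency(13.6, 9.3, 4), 2))  # 0.37 -> 4 GPUs add almost nothing at this batch size
```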

### Multi-Teacher Distillation

- **5 teachers** fit across 2 GPUs (22.7 GB total)
- Router learns optimal teacher weighting in <50 steps
- Best config: balanced (alpha_task=0.3) achieves **76.1% loss reduction**
- Supports: OpenVLA-7B, RDT2-FM, SmolVLA, BitVLA, Pi0
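
The router's role can be sketched in a few lines: softmax a set of learned per-teacher logits into weights, then blend the per-teacher distillation losses with the task loss. This is a minimal illustration, not the FORGE internals (function names and the exact blending are assumptions); `alpha_task=0.3` mirrors the balanced config above.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def combined_loss(teacher_losses, router_logits, task_loss, alpha_task=0.3):
    """Weight per-teacher distillation losses by the router's softmax
    weights, then blend with the task loss."""
    w = softmax(router_logits)
    distill = sum(wi * li for wi, li in zip(w, teacher_losses))
    return alpha_task * task_loss + (1 - alpha_task) * distill
```

With equal logits the router starts from a uniform mixture; training then shifts weight toward the teachers whose labels reduce loss fastest.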

### Pruning Results

| Pruning Ratio | Layers | Params | FP32 fps | Speedup |
|---------------|--------|--------|----------|---------|
| No prune | 24 | 967.9M | 7.9 | 1.0x |
| 90% keep | 18 | 880.8M | 9.1 | 1.15x |
| 75% keep | 15 | 830.8M | 10.0 | 1.27x |
| 60% keep | 11 | 774.1M | 12.7 | **1.61x** |
| 50% keep | 9 | 739.3M | **14.1** | **1.78x** |
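
Layer pruning reduces to a keep-the-most-important-layers selection. The sketch below uses a made-up magnitude-style score; FORGE's chunk-aware criterion, and the exact ratio-to-layer mapping in the table above, are the pipeline's own.

```python
def prune_layers(importance, keep_ratio):
    """Rank transformer layers by an importance score and keep the top
    fraction, preserving their original order in the stack."""
    n_keep = max(1, round(len(importance) * keep_ratio))
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:n_keep])

# 24 layers with illustrative (made-up) importance scores; keep 50%
scores = [abs(12 - i) + 0.1 * i for i in range(24)]
print(prune_layers(scores, 0.5))  # 12 layer indices, in ascending order
```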

## Recommended Configurations

### Production (Edge Deployment)

```yaml
variant: nano
action_head: flow
lora_rank: 64
prune_ratio: 0.60
target_bits: 4
# Result: 774M params, FP16 14.1 fps, <600MB INT4
```
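
The `<600MB INT4` figure is easy to sanity-check: 774.1M parameters at 4 bits each is ~387MB of raw weights, plus quantization metadata (scales, zero-points) and any layers kept at higher precision. The ~15% overhead below is a guess for illustration, not a measured number.

```python
def quantized_size_mb(n_params, bits, overhead=1.15):
    """Rough post-quantization footprint: bits/8 bytes per parameter,
    inflated by an assumed ~15% for scales and unquantized layers."""
    return n_params * bits / 8 / 1e6 * overhead

print(round(quantized_size_mb(774.1e6, 4)))  # 445 -> comfortably under 600MB
```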

### Quality-First

```yaml
variant: nano
action_head: diffusion
lora_rank: 32
prune_ratio: 0.75
target_bits: 8
# Result: 830M params, 92.3% loss reduction
```

## Key Findings

1. **The flow-matching head is ~15% faster** than diffusion at FP16 inference (12.6 vs 11.0 fps)
2. **LoRA rank=64 trains markedly better** than rank=32 (76.9% vs 67.0% loss reduction, ~10 percentage points) at negligible speed cost
3. **Aggressive pruning works**: removing 50% of layers still yields a functional model at 14.1 fps
4. **FP16 autocast gives a 1.4-1.5x speedup** essentially for free; always enable it in production
5. **Multi-teacher routing converges fast**: the router learns to weight teachers optimally in <50 steps

## Supported Teachers

| Teacher | Type | Params | Chunk Size |
|---------|------|--------|------------|
| OpenVLA-7B | Token-AR | 7.6B | H=1 |
| RDT2-FM | Diffusion | 1.2B | H=8 |
| SmolVLA | Parallel | 0.5B | H=1 |
| BitVLA | Quantized | 5.9B | H=1 |
| Pi0 | Flow | 3.0B | H=4 |

## Supported Robots

| Robot | DoF | Action Head | Horizon | Control Rate |
|-------|-----|-------------|---------|--------------|
| Franka Panda | 7 | Flow | H=8 | 20 Hz |
| ALOHA (bimanual) | 14 | Chunk | H=16 | 50 Hz |
| xArm | 6 | Flow | H=4 | 100 Hz |
| UR5e | 6 | Flow | H=4 | 125 Hz |
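
Horizon and control rate jointly set the real-time budget: an H-step action chunk covers H/rate seconds of motion, which is how long the student has to produce the next chunk. A quick check (plain arithmetic, not a FORGE API):

```python
def replan_budget_ms(horizon, control_hz):
    """Time one action chunk covers = the latency budget for the next inference."""
    return horizon * 1000 / control_hz

print(replan_budget_ms(8, 20))   # 400.0 -> Franka: a 14.1 fps student (~71 ms/chunk) fits easily
print(replan_budget_ms(4, 125))  # 32.0  -> UR5e: far tighter; high rates need longer horizons
```

The contrast between the two rows is why high-rate arms pair short horizons with either faster students or action buffering.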

## Pipeline

```
Teacher Labels -> Knowledge Distillation -> Layer Pruning -> Quantization -> Edge Export
    (HDF5)         (LoRA + Bridge)          (Chunk-aware)    (INT4/INT8)    (TRT/ONNX/MLX)
```
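
At the heart of the distillation stage is a loss that pulls the student's action chunks (and, via the bridge, intermediate features) toward the teacher's. A minimal sketch, assuming MSE matching and a made-up feature weight `beta`; this is not the pipeline's actual objective:

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distill_loss(student_actions, teacher_actions,
                 student_feats, teacher_feats, beta=0.5):
    """Imitate the teacher's action chunk, plus a weighted feature-matching term."""
    return mse(student_actions, teacher_actions) + beta * mse(student_feats, teacher_feats)

# Perfect imitation drives the loss to zero
print(distill_loss([0.1, 0.2], [0.1, 0.2], [1.0], [1.0]))  # 0.0
```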

## Usage

```bash
pip install anima-forge

# Full pipeline
forge pipeline --device cuda --variant nano --steps 5000

# Auto-detect model dimensions
forge autosense --model-dir /path/to/models

# Benchmark
forge benchmark run --device cuda

# Deploy
forge serve --port 8000
```

## Citation

```bibtex
@software{forge2026,
  title={FORGE: Fast Optimized Robot Generation Engine},
  author={Robot Flow Labs},
  year={2026},
  url={https://github.com/RobotFlow-Labs/anima-forge-distillation-pipeline}
}
```

## License

Apache 2.0