# FORGE — GPU Performance Report > All benchmarks on NVIDIA L4 24GB, CUDA 13.0, PyTorch 2.10, Python 3.14 --- ## Run: 2026-03-19 03:00 UTC — Phase 1-6 Full Pipeline ### Environment | Property | Value | |----------|-------| | GPU | NVIDIA L4 24GB | | Driver | 580.126.09 | | CUDA | 13.0 | | PyTorch | 2.10.0+cu128 | | Python | 3.14.0 | | OS | Linux 6.17.0-1008-gcp | ### Model: FORGE-Nano (SigLIP-SO400M + Qwen2.5-0.5B) #### Architecture | Component | Details | |-----------|---------| | Vision Encoder | SigLIP-SO400M-patch14-384 (frozen, 472.3M params) | | Bridge Attention | 64 queries, 4 layers, 8 heads (39.7M params) | | Language Backbone | Qwen2.5-0.5B + LoRA rank=32 (494.2M params) | | Action Head | Diffusion, 4 layers, 10 steps (1.7M params) | | **Total** | **967.9M params** | | **Trainable** | **495.6M params** (51.2%) | | **Frozen** | **472.3M params** (48.8%) | #### Inference Latency | Metric | Value | |--------|-------| | Single inference (avg) | **129.0 ms** | | Single inference (min) | 121.3 ms | | Single inference (max) | 135.5 ms | | Batch=8 total | 843.4 ms | | Batch=8 per-sample | **105.4 ms** | | Throughput (single) | **7.8 fps** | | Throughput (batch=8) | **9.5 fps** | | P50 latency | 132.3 ms | | P99 latency | 136.2 ms | #### GPU Memory | Metric | Value | |--------|-------| | Allocated | 3.90 GB | | Reserved | 4.62 GB | | Available | 18.4 GB (headroom) | #### Knowledge Distillation (200 steps) | Metric | Value | |--------|-------| | Training speed | **1.8 steps/s** | | Loss (first 10 avg) | 17.8218 | | Loss (last 10 avg) | 1.0994 | | **Loss reduction** | **93.8%** | | Total time | 110.2s | | Trainable params | 45.7M (bridge + action head + LoRA) | | Optimizer | AdamW (lr=2e-4, wd=0.01) | | Gradient accumulation | 2 steps | | Gradient clipping | max_norm=1.0 | #### Knowledge Distillation (150 steps, demo run) | Metric | Value | |--------|-------| | Training speed | 1.8 steps/s | | Loss start | 4.6994 | | Loss end | 0.9845 | | **Loss reduction** | **79.1%** | | Total time | 83.5s | #### Layer Pruning (Shallow-Pi) | Metric | Value | |--------|-------| | Layers before | 27 | | Layers after | **18** | | Layers removed | 9 (indices: 9-17, middle layers) | | Params before | 967.9M | | Params after | **830.8M** | | **Param reduction** | **14.2%** | | Strategy | U-shaped importance (edges > middle) | | Keep first/last | 2 layers each | #### Quantization | Format | Size | Compression (vs FP32) | |--------|------|----------------------| | FP32 (original) | 3,871.7 MB | 1.0x | | BF16 | 1,935.9 MB | 2.0x | | INT8 | 830.8 MB | 4.7x | | **INT4** | **415.4 MB** | **9.3x** | #### INT4 Inference (post-prune + quantize) | Metric | Value | |--------|-------| | Latency | **103.4 ms** | | Throughput | **9.7 fps** | | Speedup vs FP32 | 1.25x | #### ONNX Export | Metric | Value | |--------|-------| | ONNX file size | 7.3 MB | | Optimized ONNX | **6.7 MB** | | Status | Success | #### TensorRT | Metric | Value | |--------|-------| | Status | Not installed on this machine | | Plan | Install TRT SDK, build FP16 + INT8 engines | --- ## Comparison: OpenVLA-7B vs FORGE-Nano | Metric | OpenVLA-7B | FORGE-Nano | Delta | |--------|-----------|------------|-------| | Parameters | 7,000M | 967.9M | **7.2x ↓** | | Size (bf16) | ~13 GB | 1.8 GB | **7.2x ↓** | | Size (INT4) | ~3.5 GB | 415 MB | **8.4x ↓** | | Latency (L4) | ~2,000 ms | 129 ms | **15.5x ↓** | | Throughput | ~0.5 fps | 7.8 fps | **15.6x ↑** | | GPU Memory | ~14 GB | 3.9 GB | **3.6x ↓** | | Edge deployable | No | Yes | ✓ | | Jetson Orin Nano | No (OOM) | Yes | ✓ | | Apple Silicon | No | Yes (MLX) | ✓ | --- ## Experiment Log > Every experiment run gets appended here with date, config, and key metrics. ### [2026-03-19 03:00] Initial GPU Validation - **Config**: FORGE-Nano, SigLIP+Qwen2.5-0.5B, LoRA r=32 - **Device**: NVIDIA L4 24GB - **Result**: 61/61 tests passing, all phases complete - **Key metrics**: 129ms latency, 93.8% loss reduction (200 steps), 415MB INT4 ### [2026-03-19 03:15] Demo Run (150 steps) - **Config**: Same as above, 150 KD steps - **Result**: Demo command working end-to-end - **Key metrics**: 131.9ms latency, 79.1% loss reduction (150 steps), ONNX 6.7MB --- ## v2 Manual GPU Validation ### [2026-03-19 ~14:00] Step 1: SigLIP-SO400M Vision Encoder - **Config**: google--siglip-so400m-patch14-384, FP32 - **Device**: NVIDIA L4 24GB (22.5GB free) - **Result**: PASS - **Key metrics**: - Vision-only params: 428.2M - GPU VRAM: 1.71 GB - Warm latency: 96.7ms (std=1.9ms), P50=96.4ms, P99=100.7ms - Output shape: [1, 729, 1152] (matches spec: d=1152, 729 tokens) - Load time: ~1.1s CPU, ~8.6s to CUDA - **Issues found**: `SiglipVisionModel.from_pretrained()` fails on full SiglipConfig — must load full model then extract `.vision_model` - **Fix applied**: student.py now tries SiglipVisionModel first, falls back to SiglipModel + extract .vision_model ### [2026-03-19 ~14:30] Step 2: Full FORGEStudent Build - **Config**: FORGE-Nano, SigLIP+Qwen2.5-0.5B, LoRA r=32 - **Device**: NVIDIA L4 24GB - **Result**: PASS - **Key metrics**: - Total params: 967.9M (496M trainable, 472M frozen) - GPU VRAM: 3.9 GB - Build time: 6.2s - Output shapes: actions (1,7), vision_features (1,64,896) ### [2026-03-19 ~15:00] Step 3: KD Training Loop (50 steps) - **Config**: FORGE-Nano, AdamW lr=2e-4, diffusion action head built-in loss - **Device**: NVIDIA L4 24GB - **Result**: PASS - **Key metrics**: - Loss: 9.50 → 2.04 (78.5% reduction in 50 steps) - Speed: 2.2 steps/s (22.9s total) - GPU VRAM: 9.7 GB - **Issues found**: External ForgeDistillationLoss broke gradient chain; used model's built-in loss instead ### [2026-03-19 ~16:00] Step 4: Chunk-Aware Layer Pruning - **Config**: FORGE-Nano (27 Qwen layers), α=0.6 (standard vs temporal) - **Device**: NVIDIA L4 24GB - **Result**: PASS - **Key metrics**: - Layers: 27 → 20 (removed 7: [5, 8, 11, 12, 15, 17, 21]) - Params: 967.9M → 861.3M (89.0% retained) - Importance scoring: 11.0s (3 calibration samples) - Top layer: 24 (0.8000), Bottom: 21 (0.2000) - Pruned model forward pass verified - GPU VRAM: 7.8 GB (pruning deepcopy overhead) ### [2026-03-19 ~16:15] Step 5: Chunk-Aware INT4 Quantization - **Config**: FORGE-Nano, target_bits=4.0, action_head_bits=8 - **Device**: NVIDIA L4 24GB - **Result**: PASS - **Key metrics**: - Calibrated 569 linear modules (1.1s) - FP32 size: 3872 MB → INT4 estimated: 484 MB (8.0x compression) - Action MSE (FP32 vs INT4): 2.161 - Temporal coherence delta: 0.000 - Quantization time: 116.1s - GPU VRAM: 7.8 GB ### [2026-03-19 ~16:30] Step 6: Inference Latency Benchmark - **Config**: FORGE-Nano, FP32 + FP16 autocast - **Device**: NVIDIA L4 24GB - **Result**: PASS - **Key metrics** (FP32, batch=1, 50 iterations): - p50: 134.8 ms, p95: 138.2 ms, p99: 140.3 ms - Mean: 134.6 ms (std 2.7 ms) - Throughput: 7.4 fps - **Key metrics** (FP32, batch=4, 20 iterations): - p50: 455.2 ms, p95: 467.0 ms - Throughput: 8.8 fps - **Key metrics** (FP16 autocast, batch=1, 30 iterations): - p50: 88.6 ms, p95: 91.6 ms - Throughput: 11.3 fps - Speedup vs FP32: 1.52x - GPU VRAM: 4.6 GB - **Issue**: `.half()` fails due to LoRA dtype mismatch; use `torch.autocast` instead ### [2026-03-19 ~16:45] Step 7: AutoSense Model Detection - **Config**: All models in /home/datai/development/forge/datasets/ - **Result**: PASS - **Key metrics**: - SigLIP-SO400M: d_output=1152, image_size=384, patch_size=14, n_tokens=729 - Qwen2.5-0.5B: d_model=896, vocab_size=151936, n_layers=24, n_heads=14 - Qwen2.5-1.5B: d_model=1536, vocab_size=151936, n_layers=28, n_heads=12 - apply_autosense correctly updates bridge_d_model from 896 to 1536 for 1.5B variant ### [2026-03-19 ~17:00] Step 8: Cross-Embodiment Transfer (manual) - **Config**: FORGE-Nano actions → UR5e/ALOHA via linear/learned/joint_name - **Device**: NVIDIA L4 24GB - **Result**: PASS - **Key metrics**: - Franka (7D) → UR5e (6D) linear: action range [-16.2, 25.0] (with joint limit scaling) - Franka (7D) → ALOHA (14D) mirror pad: action range [-8.1, 12.5] - Learned adapter: 5062 params MLP, output shape correct - Joint-name mapping: 0 matches (expected — j1-j7 vs shoulder/wrist names have low trigram overlap) --- ## Automated Benchmark Suite (8 benchmarks) Results in `benchmarks/results/*.json` — run via `uv run python benchmarks/run_all.py` ### [2026-03-19 11:20] Bench 01: Vision Encoder (100 iterations) - **FP32 b=1**: p50=101.0ms, 9.9 fps - **FP16 b=1**: p50=28.7ms, 32.3 fps (3.26x speedup) - **FP32 b=8**: p50=619.4ms, 12.8 fps - **GPU mem**: 2.05 GB ### [2026-03-19 11:20] Bench 02: Full Student Inference (50 iterations) - **FP32 b=1**: p50=135.4ms, 7.4 fps - **FP16 b=1**: p50=87.3ms, 11.5 fps (1.56x speedup) - **Batch scaling**: b1=7.4fps → b2=8.6fps → b4=8.8fps - **GPU mem**: 4.65 GB ### [2026-03-19 11:20] Bench 03: KD Training (3 runs) - **Run 1** (lr=2e-4, 50 steps): 2.63→1.67 (36.5%), 2.7 steps/s, 9.65 GB - **Run 2** (lr=5e-4, 50 steps): 9.27→1.56 (83.1%), 2.8 steps/s, 14.97 GB - **Run 3** (lr=2e-4, 100 steps): 3.72→3.15 (15.3%), 2.8 steps/s, 20.29 GB ### [2026-03-19 11:20] Bench 04: Pruning (4 ratios) | Keep % | Layers | Params (M) | Latency p50 | FPS | |--------|--------|-----------|-------------|-----| | 90% | 27→24 | 922.2 | 121.4 ms | 8.2 | | 75% | 27→20 | 861.3 | 105.7 ms | 9.5 | | 60% | 27→16 | 800.3 | 91.1 ms | 11.0 | | 50% | 27→13 | 754.6 | 80.1 ms | 12.5 | ### [2026-03-19 11:20] Bench 05: Quantization (4 configs) | Config | Compression | Action MSE | Latency p50 | FPS | |--------|-------------|-----------|-------------|-----| | INT8/AH8 | 4.0x | 2.477 | 139.2 ms | 7.2 | | INT4/AH8 | 8.0x | 3.221 | 136.5 ms | 7.3 | | INT4/AH4 | 8.0x | 2.989 | 136.8 ms | 7.3 | | INT3/AH8 | 10.7x | 5.135 | 138.1 ms | 7.2 | ### [2026-03-19 11:20] Bench 06: AutoSense - 9 vision encoders detected, 5 language models detected - Sub-millisecond detection per model (<0.2ms) - Qwen-1.5B auto-updates bridge_d_model from 896→1536 ### [2026-03-19 11:20] Bench 07: Cross-Embodiment Transfer (6 pairs × 3 strategies) - **Linear mapping**: ~12-14 μs/action (70-84k maps/s) - **Joint-name mapping**: ~1.7 μs/action (585-601k maps/s) - **Learned adapter**: ~64 μs/action (15.4-15.6k maps/s) ### [2026-03-19 11:20] Bench 08: E2E Pipeline - **Total pipeline**: 167s (build → train → prune → quantize → benchmark) - **Build**: 6.0s, 967.9M params - **Train**: 30 steps, 5.32→1.88 loss, 2.6 steps/s - **Prune**: 27→20 layers, 861.3M params - **Quantize**: INT4, 3445→431 MB (8.0x) - **Inference**: FP32=109.6ms, FP16=84.7ms (11.8 fps) --- ## Multi-GPU Benchmarks (4x NVIDIA L4 24GB) ### [2026-03-19 12:21] Bench 09: Multi-GPU DataParallel #### Inference Scaling (FORGE-Nano FP32) | GPUs | Batch=1 | Batch=4 | Batch=8 | Batch=16 | |------|---------|---------|---------|----------| | 1 GPU | 7.8 fps | **9.3 fps** | 9.3 fps | 9.3 fps | | 2 GPU | 6.1 fps | 6.5 fps | **10.0 fps** | 13.5 fps | | 4 GPU | 6.0 fps | 4.4 fps | 8.0 fps | **13.6 fps** | - **Optimal**: 2-4 GPUs at batch≥16 for 1.46x throughput over single GPU - **Key insight**: DataParallel overhead dominates at small batches — single GPU is faster at batch=1-4 #### FP16 Multi-GPU Inference | GPUs | Batch=4 | Batch=8 | Batch=16 | Batch=32 | |------|---------|---------|----------|----------| | 1 GPU | 32.7 fps | 34.2 fps | 32.9 fps | **33.6 fps** | | 4 GPU | 4.4 fps | 8.8 fps | 17.5 fps | **31.6 fps** | - **FP16 1-GPU**: 33.6 fps at batch=32 (4.3x faster than FP32!) - **FP16 4-GPU**: Matches 1-GPU throughput at batch=32 #### Training Scaling | GPUs | Batch | Steps/s | Loss Reduction | |------|-------|---------|----------------| | 1 GPU | 2 | **2.31** | 56.3% | | 2 GPU (DP) | 4 | 0.79 | -12.9% | | 4 GPU (DP) | 8 | 0.50 | **82.6%** | - **Training**: Single GPU is faster per-step; 4-GPU benefits at larger effective batch - **VRAM**: 1 GPU=9.0 GB, 4 GPU=14.6 GB primary + 4.1 GB per replica ### [2026-03-19 12:10] Bench 10: Multi-Teacher Distillation #### GPU Placement Planning | Teachers | Total VRAM | Placement | |----------|-----------|-----------| | 2 (smolvla + rdt2) | 3.5 GB | All GPU:0 | | 3 (+openvla) | 18.7 GB | All GPU:0 | | 4 (+bitvla) | 19.5 GB | All GPU:0 | | 5 (+pi0) | **22.7 GB** | GPU:0 + overflow to GPU:1 | #### Multi-Teacher Training (mock teachers, 50 steps) | Teachers | Loss Start | Loss End | Reduction | Speed | Peak VRAM | |----------|-----------|---------|-----------|-------|-----------| | 1 teacher | 0.181 | 0.124 | **31.6%** | 0.72 s/s | 5.68 GB | | 2 teachers | 0.444 | 0.227 | **48.9%** | 1.14 s/s | 6.87 GB | | 3 teachers | 0.259 | 0.277 | -7.1% | 1.23 s/s | 6.86 GB | - **Router entropy**: Converges from 0.69→0.0002 (2 teachers), 1.08→0.0001 (3 teachers) - **Key insight**: Router learns to prefer most accurate teacher within 50 steps #### Universal Distillation (3 configs × 40 steps) | Config | α_task | α_div | α_con | Loss↓ | Diversity↑ | Router Weights | |--------|--------|-------|-------|-------|------------|----------------| | balanced | 0.30 | 0.05 | 0.10 | **76.1%** | 0.01→0.42 | [0.69, 0.28, 0.04] | | kd_heavy | 0.10 | 0.05 | 0.05 | 9.2% | 0.00→0.48 | [0.80, 0.12, 0.09] | | diverse | 0.20 | 0.15 | 0.10 | -568% | 0.00→0.12 | [0.63, 0.27, 0.11] | - **Best config**: `balanced` (α_task=0.3, α_div=0.05, α_con=0.1) — 76.1% loss reduction - **Worst config**: `diverse` — high diversity weight destabilizes training ### [2026-03-19 12:35] Bench 11: Student Variants Comparison | Variant | Params | FP32 fps | FP16 fps | FP16 Speedup | Train Steps/s | Loss↓ | Train VRAM | |---------|--------|----------|----------|--------------|---------------|-------|-----------| | **nano_baseline** (LoRA=32, diffusion) | 967.9M | 7.9 | **11.0** | 1.39x | 1.64 | 67.0% | 9.0 GB | | **nano_lora64** (LoRA=64, diffusion) | 972.3M | 7.9 | 10.8 | 1.37x | 1.62 | **76.9%** | 9.1 GB | | **nano_flow** (LoRA=32, flow) | 967.9M | **8.2** | **12.6** | **1.54x** | 1.58 | 85.8% | 9.0 GB | | small_baseline (LoRA=32, diffusion) | 2097.7M | 6.2 | 9.9 | — | OOM | — | >22 GB | | small_flow (LoRA=32, flow) | 2097.7M | 6.1 | **11.3** | — | OOM | — | >22 GB | **Key findings:** - **Flow matching is 15% faster** than diffusion at FP16 (12.6 vs 11.0 fps) - **Flow has best FP16 speedup**: 1.54x vs 1.39x for diffusion - **LoRA=64 trains better** (76.9% vs 67.0% loss reduction) with negligible speed cost - **Small (2.1B)** fits inference on single L4 but needs multi-GPU for training ### [2026-03-19 12:50] Bench 12: Full Pipeline Combinations (build→train→prune→infer) | Pipeline | Head | LoRA | Prune | Layers | Params Post-Prune | FP32 fps | FP16 fps | Loss↓ | Time | |----------|------|------|-------|--------|-------------------|----------|----------|-------|------| | nano_diff_p75_q4 | diffusion | 32 | 75% | 24→15 | 830.8M | 10.0 | 12.0 | 41.4% | 171s | | nano_flow_p50_q4 | flow | 32 | 50% | 24→9 | **739.3M** | **14.1** | 7.8 | 76.3% | 166s | | nano_lora64_p90_q4 | diffusion | 64 | 90% | 24→18 | 880.8M | 9.1 | 11.2 | **86.3%** | 176s | | nano_diff_p75_q8 | diffusion | 32 | 75% | 24→15 | 830.8M | 10.0 | 11.3 | **92.3%** | 172s | | **nano_flow_lora64_p60** | **flow** | **64** | **60%** | **24→11** | **774.1M** | **12.7** | **14.1** | 75.7% | 168s | | nano_diff_noprune_q8 | diffusion | 32 | ~100% | 24→21 | 922.2M | 8.1 | 11.0 | 59.4% | 167s | **Optimal configurations:** - **Fastest inference**: `nano_flow_lora64_p60` — **14.1 fps FP16**, 12.7 fps FP32 - **Best loss reduction**: `nano_diff_p75_q8` — **92.3%** in 30 steps - **Most compressed**: `nano_flow_p50_q4` — 967.9M → **739.3M** (24% reduction) - **Best balanced**: `nano_flow_lora64_p60` — fast inference + good compression + strong training **Pruning impact on speed (FP32):** - 24→21 layers: 8.1 fps (baseline) - 24→18 layers: 9.1 fps (+12%) - 24→15 layers: 10.0 fps (+23%) - 24→11 layers: 12.7 fps (+57%) - 24→9 layers: **14.1 fps** (+74%) --- ## Recommended Configurations ### Production (Edge Deployment) ``` variant: nano action_head: flow lora_rank: 64 prune_ratio: 0.60 quant_bits: 4 → 774.1M params, FP16: 14.1 fps, ~<600MB INT4 ``` ### Quality (Best Training) ``` variant: nano action_head: diffusion lora_rank: 32 prune_ratio: 0.75 quant_bits: 8 → 830.8M params, FP16: 11.3 fps, 92.3% loss reduction ``` ### Minimum Size (IoT/Embedded) ``` variant: nano action_head: flow lora_rank: 32 prune_ratio: 0.50 quant_bits: 4 → 739.3M params, FP32: 14.1 fps, ~<500MB INT4 ```