# FORGE GPU Performance Report
> All benchmarks on NVIDIA L4 24GB, CUDA 13.0, PyTorch 2.10, Python 3.14
---
## Run: 2026-03-19 03:00 UTC (Phases 1-6, Full Pipeline)
### Environment
| Property | Value |
|----------|-------|
| GPU | NVIDIA L4 24GB |
| Driver | 580.126.09 |
| CUDA | 13.0 |
| PyTorch | 2.10.0+cu128 |
| Python | 3.14.0 |
| OS | Linux 6.17.0-1008-gcp |
### Model: FORGE-Nano (SigLIP-SO400M + Qwen2.5-0.5B)
#### Architecture
| Component | Details |
|-----------|---------|
| Vision Encoder | SigLIP-SO400M-patch14-384 (frozen, 472.3M params) |
| Bridge Attention | 64 queries, 4 layers, 8 heads (39.7M params) |
| Language Backbone | Qwen2.5-0.5B + LoRA rank=32 (494.2M params) |
| Action Head | Diffusion, 4 layers, 10 steps (1.7M params) |
| **Total** | **967.9M params** |
| **Trainable** | **495.6M params** (51.2%) |
| **Frozen** | **472.3M params** (48.8%) |
#### Inference Latency
| Metric | Value |
|--------|-------|
| Single inference (avg) | **129.0 ms** |
| Single inference (min) | 121.3 ms |
| Single inference (max) | 135.5 ms |
| Batch=8 total | 843.4 ms |
| Batch=8 per-sample | **105.4 ms** |
| Throughput (single) | **7.8 fps** |
| Throughput (batch=8) | **9.5 fps** |
| P50 latency | 132.3 ms |
| P99 latency | 136.2 ms |
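The latencies above come from repeated timed forward passes. A minimal, framework-agnostic harness for collecting mean/p50/p99 looks like the sketch below; this is illustrative, not the benchmark script itself, and on GPU you would add `torch.cuda.synchronize()` before each timer reading.

```python
import time
import statistics

def benchmark(fn, warmup=5, iters=50):
    """Run fn() repeatedly and report latency percentiles in ms.
    On GPU, call torch.cuda.synchronize() before each reading so the
    timer sees full kernel time (omitted to stay framework-free)."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    mean = statistics.mean(samples)
    return {
        "mean_ms": mean,
        "p50_ms": samples[len(samples) // 2],
        "p99_ms": samples[min(len(samples) - 1, int(len(samples) * 0.99))],
        "fps": 1000.0 / mean,
    }
```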
#### GPU Memory
| Metric | Value |
|--------|-------|
| Allocated | 3.90 GB |
| Reserved | 4.62 GB |
| Available | 18.4 GB (headroom) |
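Allocated vs reserved here are the standard CUDA caching-allocator counters; a sketch of how they are read (returns zeros off-GPU):

```python
import torch

def gpu_memory_gb():
    """Read the CUDA caching allocator's counters: 'allocated' is live
    tensor memory, 'reserved' also includes the allocator's cached blocks."""
    if not torch.cuda.is_available():
        return {"allocated_gb": 0.0, "reserved_gb": 0.0}
    return {
        "allocated_gb": torch.cuda.memory_allocated() / 1024**3,
        "reserved_gb": torch.cuda.memory_reserved() / 1024**3,
    }
```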
#### Knowledge Distillation (200 steps)
| Metric | Value |
|--------|-------|
| Training speed | **1.8 steps/s** |
| Loss (first 10 avg) | 17.8218 |
| Loss (last 10 avg) | 1.0994 |
| **Loss reduction** | **93.8%** |
| Total time | 110.2s |
| Trainable params | 45.7M (bridge + action head + LoRA) |
| Optimizer | AdamW (lr=2e-4, wd=0.01) |
| Gradient accumulation | 2 steps |
| Gradient clipping | max_norm=1.0 |
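A minimal sketch of a KD step loop matching the reported optimizer settings (AdamW lr=2e-4 wd=0.01, accumulation over 2 steps, clipping at max_norm=1.0); `student`, `teacher`, and the MSE objective are stand-ins for the real FORGE modules and distillation loss.

```python
import torch
from torch import nn

def distill(student, teacher, batches, accum=2, lr=2e-4):
    """KD loop with the reported settings: AdamW(lr, wd=0.01),
    gradient accumulation over `accum` steps, clipping at max_norm=1.0.
    MSE against teacher outputs stands in for the real FORGE loss."""
    opt = torch.optim.AdamW(student.parameters(), lr=lr, weight_decay=0.01)
    losses = []
    opt.zero_grad()
    for i, x in enumerate(batches):
        with torch.no_grad():
            target = teacher(x)
        loss = nn.functional.mse_loss(student(x), target)
        (loss / accum).backward()  # scale so accumulated grads average out
        if (i + 1) % accum == 0:
            torch.nn.utils.clip_grad_norm_(student.parameters(), max_norm=1.0)
            opt.step()
            opt.zero_grad()
        losses.append(loss.item())
    return losses
```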
#### Knowledge Distillation (150 steps, demo run)
| Metric | Value |
|--------|-------|
| Training speed | 1.8 steps/s |
| Loss start | 4.6994 |
| Loss end | 0.9845 |
| **Loss reduction** | **79.1%** |
| Total time | 83.5s |
#### Layer Pruning (Shallow-Pi)
| Metric | Value |
|--------|-------|
| Layers before | 27 |
| Layers after | **18** |
| Layers removed | 9 (indices: 9-17, middle layers) |
| Params before | 967.9M |
| Params after | **830.8M** |
| **Param reduction** | **14.2%** |
| Strategy | U-shaped importance (edges > middle) |
| Keep first/last | 2 layers each |
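The removed indices follow directly from the U-shaped strategy: keep the edges, drop a contiguous middle block. A sketch of that selection (hypothetical helper, but it reproduces the reported 27 → 18 with indices 9-17):

```python
def middle_layers_to_prune(n_layers, n_remove, keep_edges=2):
    """U-shaped importance: early and late layers matter most, so remove
    a contiguous middle block while always keeping `keep_edges` layers
    at each end. (Hypothetical helper reproducing the reported result.)"""
    assert n_remove <= n_layers - 2 * keep_edges
    mid = n_layers // 2
    start = mid - n_remove // 2
    return list(range(start, start + n_remove))
```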
#### Quantization
| Format | Size | Compression (vs FP32) |
|--------|------|----------------------|
| FP32 (original) | 3,871.7 MB | 1.0x |
| BF16 | 1,935.9 MB | 2.0x |
| INT8 | 830.8 MB | 4.7x |
| **INT4** | **415.4 MB** | **9.3x** |
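These sizes are plain weight-only arithmetic: params × bits / 8. Notably, the INT8/INT4 rows line up exactly with the pruned 830.8M-param count (1 and 0.5 bytes/param), which suggests they were measured post-prune while FP32/BF16 reflect the full 967.9M model.

```python
def model_size_mb(n_params, bits_per_param):
    """Weight-only storage estimate (1 MB = 1e6 bytes); real checkpoints
    add a small overhead for quantization scales and zero-points."""
    return n_params * bits_per_param / 8 / 1e6
```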
#### INT4 Inference (post-prune + quantize)
| Metric | Value |
|--------|-------|
| Latency | **103.4 ms** |
| Throughput | **9.7 fps** |
| Speedup vs FP32 | 1.25x |
#### ONNX Export
| Metric | Value |
|--------|-------|
| ONNX file size | 7.3 MB |
| Optimized ONNX | **6.7 MB** |
| Status | Success |
#### TensorRT
| Metric | Value |
|--------|-------|
| Status | Not installed on this machine |
| Plan | Install TRT SDK, build FP16 + INT8 engines |
---
## Comparison: OpenVLA-7B vs FORGE-Nano
| Metric | OpenVLA-7B | FORGE-Nano | Delta |
|--------|-----------|------------|-------|
| Parameters | 7,000M | 967.9M | **7.2x smaller** |
| Size (bf16) | ~13 GB | 1.8 GB | **7.2x smaller** |
| Size (INT4) | ~3.5 GB | 415 MB | **8.4x smaller** |
| Latency (L4) | ~2,000 ms | 129 ms | **15.5x faster** |
| Throughput | ~0.5 fps | 7.8 fps | **15.6x higher** |
| GPU Memory | ~14 GB | 3.9 GB | **3.6x lower** |
| Edge deployable | No | Yes | ✓ |
| Jetson Orin Nano | No (OOM) | Yes | ✓ |
| Apple Silicon | No | Yes (MLX) | ✓ |
---
## Experiment Log
> Every experiment run gets appended here with date, config, and key metrics.
### [2026-03-19 03:00] Initial GPU Validation
- **Config**: FORGE-Nano, SigLIP+Qwen2.5-0.5B, LoRA r=32
- **Device**: NVIDIA L4 24GB
- **Result**: 61/61 tests passing, all phases complete
- **Key metrics**: 129ms latency, 93.8% loss reduction (200 steps), 415MB INT4
### [2026-03-19 03:15] Demo Run (150 steps)
- **Config**: Same as above, 150 KD steps
- **Result**: Demo command working end-to-end
- **Key metrics**: 131.9ms latency, 79.1% loss reduction (150 steps), ONNX 6.7MB
---
## v2 Manual GPU Validation
### [2026-03-19 ~14:00] Step 1: SigLIP-SO400M Vision Encoder
- **Config**: google--siglip-so400m-patch14-384, FP32
- **Device**: NVIDIA L4 24GB (22.5GB free)
- **Result**: PASS
- **Key metrics**:
- Vision-only params: 428.2M
- GPU VRAM: 1.71 GB
- Warm latency: 96.7ms (std=1.9ms), P50=96.4ms, P99=100.7ms
- Output shape: [1, 729, 1152] (matches spec: d=1152, 729 tokens)
- Load time: ~1.1s CPU, ~8.6s to CUDA
- **Issues found**: `SiglipVisionModel.from_pretrained()` fails on a full SiglipConfig; the full model must be loaded and `.vision_model` extracted
- **Fix applied**: student.py now tries SiglipVisionModel first, then falls back to loading SiglipModel and extracting `.vision_model`
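The fallback described above can be sketched as follows (hypothetical helper name; the actual logic lives in student.py):

```python
def load_siglip_vision(path):
    """Load only the SigLIP vision tower. SiglipVisionModel.from_pretrained()
    can fail when the checkpoint ships a full SiglipConfig, so fall back to
    loading the full model and extracting .vision_model."""
    from transformers import SiglipModel, SiglipVisionModel
    try:
        return SiglipVisionModel.from_pretrained(path)
    except Exception:
        return SiglipModel.from_pretrained(path).vision_model
```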
### [2026-03-19 ~14:30] Step 2: Full FORGEStudent Build
- **Config**: FORGE-Nano, SigLIP+Qwen2.5-0.5B, LoRA r=32
- **Device**: NVIDIA L4 24GB
- **Result**: PASS
- **Key metrics**:
- Total params: 967.9M (496M trainable, 472M frozen)
- GPU VRAM: 3.9 GB
- Build time: 6.2s
- Output shapes: actions (1,7), vision_features (1,64,896)
### [2026-03-19 ~15:00] Step 3: KD Training Loop (50 steps)
- **Config**: FORGE-Nano, AdamW lr=2e-4, diffusion action head built-in loss
- **Device**: NVIDIA L4 24GB
- **Result**: PASS
- **Key metrics**:
  - Loss: 9.50 → 2.04 (78.5% reduction in 50 steps)
- Speed: 2.2 steps/s (22.9s total)
- GPU VRAM: 9.7 GB
- **Issues found**: External ForgeDistillationLoss broke gradient chain; used model's built-in loss instead
### [2026-03-19 ~16:00] Step 4: Chunk-Aware Layer Pruning
- **Config**: FORGE-Nano (27 Qwen layers), α=0.6 (standard vs temporal)
- **Device**: NVIDIA L4 24GB
- **Result**: PASS
- **Key metrics**:
- Layers: 27 β 20 (removed 7: [5, 8, 11, 12, 15, 17, 21])
- Params: 967.9M β 861.3M (89.0% retained)
- Importance scoring: 11.0s (3 calibration samples)
- Top layer: 24 (0.8000), Bottom: 21 (0.2000)
- Pruned model forward pass verified
- GPU VRAM: 7.8 GB (pruning deepcopy overhead)
### [2026-03-19 ~16:15] Step 5: Chunk-Aware INT4 Quantization
- **Config**: FORGE-Nano, target_bits=4.0, action_head_bits=8
- **Device**: NVIDIA L4 24GB
- **Result**: PASS
- **Key metrics**:
- Calibrated 569 linear modules (1.1s)
- FP32 size: 3872 MB β INT4 estimated: 484 MB (8.0x compression)
- Action MSE (FP32 vs INT4): 2.161
- Temporal coherence delta: 0.000
- Quantization time: 116.1s
- GPU VRAM: 7.8 GB
### [2026-03-19 ~16:30] Step 6: Inference Latency Benchmark
- **Config**: FORGE-Nano, FP32 + FP16 autocast
- **Device**: NVIDIA L4 24GB
- **Result**: PASS
- **Key metrics** (FP32, batch=1, 50 iterations):
- p50: 134.8 ms, p95: 138.2 ms, p99: 140.3 ms
- Mean: 134.6 ms (std 2.7 ms)
- Throughput: 7.4 fps
- **Key metrics** (FP32, batch=4, 20 iterations):
- p50: 455.2 ms, p95: 467.0 ms
- Throughput: 8.8 fps
- **Key metrics** (FP16 autocast, batch=1, 30 iterations):
- p50: 88.6 ms, p95: 91.6 ms
- Throughput: 11.3 fps
- Speedup vs FP32: 1.52x
- GPU VRAM: 4.6 GB
- **Issue**: `.half()` fails due to LoRA dtype mismatch; use `torch.autocast` instead
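The workaround keeps weights in FP32 and lets autocast run matmuls in reduced precision. A sketch (generic wrapper, not the project's API):

```python
import torch

def infer_autocast(model, batch, device_type="cuda", dtype=torch.float16):
    """Run inference under autocast: weights stay FP32 (avoiding the LoRA
    dtype mismatch that breaks .half()), matmuls run in reduced precision."""
    with torch.inference_mode(), torch.autocast(device_type=device_type, dtype=dtype):
        return model(batch)
```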
### [2026-03-19 ~16:45] Step 7: AutoSense Model Detection
- **Config**: All models in /home/datai/development/forge/datasets/
- **Result**: PASS
- **Key metrics**:
- SigLIP-SO400M: d_output=1152, image_size=384, patch_size=14, n_tokens=729
- Qwen2.5-0.5B: d_model=896, vocab_size=151936, n_layers=24, n_heads=14
- Qwen2.5-1.5B: d_model=1536, vocab_size=151936, n_layers=28, n_heads=12
- apply_autosense correctly updates bridge_d_model from 896 to 1536 for 1.5B variant
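AutoSense-style detection essentially reduces to reading each checkpoint's `config.json`. A sketch using Qwen2-style field names (an assumption; the real detector also handles vision encoders):

```python
import json

def autosense(config_path):
    """Pull the dimensions the bridge needs from a HF config.json.
    Field names follow Qwen2-style configs (hidden_size, etc.)."""
    with open(config_path) as f:
        cfg = json.load(f)
    return {
        "d_model": cfg.get("hidden_size"),
        "n_layers": cfg.get("num_hidden_layers"),
        "n_heads": cfg.get("num_attention_heads"),
        "vocab_size": cfg.get("vocab_size"),
    }
```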
### [2026-03-19 ~17:00] Step 8: Cross-Embodiment Transfer (manual)
- **Config**: FORGE-Nano actions β UR5e/ALOHA via linear/learned/joint_name
- **Device**: NVIDIA L4 24GB
- **Result**: PASS
- **Key metrics**:
- Franka (7D) β UR5e (6D) linear: action range [-16.2, 25.0] (with joint limit scaling)
- Franka (7D) β ALOHA (14D) mirror pad: action range [-8.1, 12.5]
- Learned adapter: 5062 params MLP, output shape correct
  - Joint-name mapping: 0 matches (expected: j1-j7 vs. shoulder/wrist names have low trigram overlap)
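The zero-match result is expected from trigram-based name matching: two-character names like `j1` produce no trigrams at all. An illustrative similarity function (assumed, not the mapper's actual code):

```python
def trigram_similarity(a, b):
    """Jaccard overlap of character trigrams. Short names like 'j1'
    produce no trigrams at all, hence zero matches against descriptive
    names such as 'shoulder_pan'."""
    tri = lambda s: {s[i:i + 3] for i in range(len(s) - 2)}
    ta, tb = tri(a.lower()), tri(b.lower())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```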
---
## Automated Benchmark Suite (8 benchmarks)
Results in `benchmarks/results/*.json`; run via `uv run python benchmarks/run_all.py`
### [2026-03-19 11:20] Bench 01: Vision Encoder (100 iterations)
- **FP32 b=1**: p50=101.0ms, 9.9 fps
- **FP16 b=1**: p50=28.7ms, 32.3 fps (3.26x speedup)
- **FP32 b=8**: p50=619.4ms, 12.8 fps
- **GPU mem**: 2.05 GB
### [2026-03-19 11:20] Bench 02: Full Student Inference (50 iterations)
- **FP32 b=1**: p50=135.4ms, 7.4 fps
- **FP16 b=1**: p50=87.3ms, 11.5 fps (1.56x speedup)
- **Batch scaling**: b1=7.4 fps → b2=8.6 fps → b4=8.8 fps
- **GPU mem**: 4.65 GB
### [2026-03-19 11:20] Bench 03: KD Training (3 runs)
- **Run 1** (lr=2e-4, 50 steps): 2.63 → 1.67 (36.5%), 2.7 steps/s, 9.65 GB
- **Run 2** (lr=5e-4, 50 steps): 9.27 → 1.56 (83.1%), 2.8 steps/s, 14.97 GB
- **Run 3** (lr=2e-4, 100 steps): 3.72 → 3.15 (15.3%), 2.8 steps/s, 20.29 GB
### [2026-03-19 11:20] Bench 04: Pruning (4 ratios)
| Keep % | Layers | Params (M) | Latency p50 | FPS |
|--------|--------|-----------|-------------|-----|
| 90% | 27 → 24 | 922.2 | 121.4 ms | 8.2 |
| 75% | 27 → 20 | 861.3 | 105.7 ms | 9.5 |
| 60% | 27 → 16 | 800.3 | 91.1 ms | 11.0 |
| 50% | 27 → 13 | 754.6 | 80.1 ms | 12.5 |
### [2026-03-19 11:20] Bench 05: Quantization (4 configs)
| Config | Compression | Action MSE | Latency p50 | FPS |
|--------|-------------|-----------|-------------|-----|
| INT8/AH8 | 4.0x | 2.477 | 139.2 ms | 7.2 |
| INT4/AH8 | 8.0x | 3.221 | 136.5 ms | 7.3 |
| INT4/AH4 | 8.0x | 2.989 | 136.8 ms | 7.3 |
| INT3/AH8 | 10.7x | 5.135 | 138.1 ms | 7.2 |
### [2026-03-19 11:20] Bench 06: AutoSense
- 9 vision encoders detected, 5 language models detected
- Sub-millisecond detection per model (<0.2ms)
- Qwen-1.5B auto-updates bridge_d_model from 896 → 1536
### [2026-03-19 11:20] Bench 07: Cross-Embodiment Transfer (6 pairs Γ 3 strategies)
- **Linear mapping**: ~12-14 µs/action (70-84k maps/s)
- **Joint-name mapping**: ~1.7 µs/action (585-601k maps/s)
- **Learned adapter**: ~64 µs/action (15.4-15.6k maps/s)
### [2026-03-19 11:20] Bench 08: E2E Pipeline
- **Total pipeline**: 167s (build → train → prune → quantize → benchmark)
- **Build**: 6.0s, 967.9M params
- **Train**: 30 steps, loss 5.32 → 1.88, 2.6 steps/s
- **Prune**: 27 → 20 layers, 861.3M params
- **Quantize**: INT4, 3445 → 431 MB (8.0x)
- **Inference**: FP32=109.6ms, FP16=84.7ms (11.8 fps)
---
## Multi-GPU Benchmarks (4x NVIDIA L4 24GB)
### [2026-03-19 12:21] Bench 09: Multi-GPU DataParallel
#### Inference Scaling (FORGE-Nano FP32)
| GPUs | Batch=1 | Batch=4 | Batch=8 | Batch=16 |
|------|---------|---------|---------|----------|
| 1 GPU | 7.8 fps | **9.3 fps** | 9.3 fps | 9.3 fps |
| 2 GPU | 6.1 fps | 6.5 fps | **10.0 fps** | 13.5 fps |
| 4 GPU | 6.0 fps | 4.4 fps | 8.0 fps | **13.6 fps** |
- **Optimal**: 2-4 GPUs at batch ≥ 16 for 1.46x throughput over a single GPU
- **Key insight**: DataParallel overhead dominates at small batches; a single GPU is faster at batch=1-4
#### FP16 Multi-GPU Inference
| GPUs | Batch=4 | Batch=8 | Batch=16 | Batch=32 |
|------|---------|---------|----------|----------|
| 1 GPU | 32.7 fps | 34.2 fps | 32.9 fps | **33.6 fps** |
| 4 GPU | 4.4 fps | 8.8 fps | 17.5 fps | **31.6 fps** |
- **FP16 1-GPU**: 33.6 fps at batch=32 (4.3x faster than FP32!)
- **FP16 4-GPU**: Matches 1-GPU throughput at batch=32
#### Training Scaling
| GPUs | Batch | Steps/s | Loss Reduction |
|------|-------|---------|----------------|
| 1 GPU | 2 | **2.31** | 56.3% |
| 2 GPU (DP) | 4 | 0.79 | -12.9% |
| 4 GPU (DP) | 8 | 0.50 | **82.6%** |
- **Training**: Single GPU is faster per-step; 4-GPU benefits at larger effective batch
- **VRAM**: 1 GPU=9.0 GB, 4 GPU=14.6 GB primary + 4.1 GB per replica
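Given these numbers, a reasonable policy is to wrap the model in `DataParallel` only when several GPUs are present and batches are large; a sketch (hypothetical helper):

```python
import torch
from torch import nn

def maybe_data_parallel(model):
    """Wrap in DataParallel only when multiple GPUs are visible; per the
    benchmark, small batches should stay on one GPU anyway because the
    per-call scatter/gather overhead dominates below batch~8."""
    if torch.cuda.device_count() > 1:
        return nn.DataParallel(model)
    return model
```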
### [2026-03-19 12:10] Bench 10: Multi-Teacher Distillation
#### GPU Placement Planning
| Teachers | Total VRAM | Placement |
|----------|-----------|-----------|
| 2 (smolvla + rdt2) | 3.5 GB | All GPU:0 |
| 3 (+openvla) | 18.7 GB | All GPU:0 |
| 4 (+bitvla) | 19.5 GB | All GPU:0 |
| 5 (+pi0) | **22.7 GB** | GPU:0 + overflow to GPU:1 |
#### Multi-Teacher Training (mock teachers, 50 steps)
| Teachers | Loss Start | Loss End | Reduction | Speed | Peak VRAM |
|----------|-----------|---------|-----------|-------|-----------|
| 1 teacher | 0.181 | 0.124 | **31.6%** | 0.72 s/s | 5.68 GB |
| 2 teachers | 0.444 | 0.227 | **48.9%** | 1.14 s/s | 6.87 GB |
| 3 teachers | 0.259 | 0.277 | -7.1% | 1.23 s/s | 6.86 GB |
- **Router entropy**: Converges from 0.69 → 0.0002 (2 teachers), 1.08 → 0.0001 (3 teachers)
- **Key insight**: Router learns to prefer most accurate teacher within 50 steps
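The router behavior (entropy collapsing toward zero) is consistent with a softmax gate over teacher outputs trained jointly with the student. An illustrative module (assumed structure, not the FORGE implementation):

```python
import torch
from torch import nn

class TeacherRouter(nn.Module):
    """Softmax gate over teacher action predictions. Trained jointly with
    the student, such a gate can collapse onto the most accurate teacher,
    matching the observed entropy decay toward ~0."""
    def __init__(self, n_teachers):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_teachers))

    def forward(self, teacher_actions):               # [T, B, D]
        weights = torch.softmax(self.logits, dim=0)   # [T], sums to 1
        blended = torch.einsum("t,tbd->bd", weights, teacher_actions)
        return blended, weights
```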
#### Universal Distillation (3 configs Γ 40 steps)
| Config | α_task | α_div | α_con | Loss Red. | Diversity (start→end) | Router Weights |
|--------|--------|-------|-------|-----------|------------------------|----------------|
| balanced | 0.30 | 0.05 | 0.10 | **76.1%** | 0.01 → 0.42 | [0.69, 0.28, 0.04] |
| kd_heavy | 0.10 | 0.05 | 0.05 | 9.2% | 0.00 → 0.48 | [0.80, 0.12, 0.09] |
| diverse | 0.20 | 0.15 | 0.10 | -568% | 0.00 → 0.12 | [0.63, 0.27, 0.11] |
- **Best config**: `balanced` (α_task=0.3, α_div=0.05, α_con=0.1) with 76.1% loss reduction
- **Worst config**: `diverse`; the high diversity weight destabilizes training
### [2026-03-19 12:35] Bench 11: Student Variants Comparison
| Variant | Params | FP32 fps | FP16 fps | FP16 Speedup | Train Steps/s | Loss Red. | Train VRAM |
|---------|--------|----------|----------|--------------|---------------|-------|-----------|
| **nano_baseline** (LoRA=32, diffusion) | 967.9M | 7.9 | **11.0** | 1.39x | 1.64 | 67.0% | 9.0 GB |
| **nano_lora64** (LoRA=64, diffusion) | 972.3M | 7.9 | 10.8 | 1.37x | 1.62 | **76.9%** | 9.1 GB |
| **nano_flow** (LoRA=32, flow) | 967.9M | **8.2** | **12.6** | **1.54x** | 1.58 | 85.8% | 9.0 GB |
| small_baseline (LoRA=32, diffusion) | 2097.7M | 6.2 | 9.9 | n/a | OOM | n/a | >22 GB |
| small_flow (LoRA=32, flow) | 2097.7M | 6.1 | **11.3** | n/a | OOM | n/a | >22 GB |
**Key findings:**
- **Flow matching is 15% faster** than diffusion at FP16 (12.6 vs 11.0 fps)
- **Flow has best FP16 speedup**: 1.54x vs 1.39x for diffusion
- **LoRA=64 trains better** (76.9% vs 67.0% loss reduction) with negligible speed cost
- **Small (2.1B)** fits inference on single L4 but needs multi-GPU for training
### [2026-03-19 12:50] Bench 12: Full Pipeline Combinations (build → train → prune → infer)
| Pipeline | Head | LoRA | Prune | Layers | Params Post-Prune | FP32 fps | FP16 fps | Loss Red. | Time |
|----------|------|------|-------|--------|-------------------|----------|----------|-----------|------|
| nano_diff_p75_q4 | diffusion | 32 | 75% | 24 → 15 | 830.8M | 10.0 | 12.0 | 41.4% | 171s |
| nano_flow_p50_q4 | flow | 32 | 50% | 24 → 9 | **739.3M** | **14.1** | 7.8 | 76.3% | 166s |
| nano_lora64_p90_q4 | diffusion | 64 | 90% | 24 → 18 | 880.8M | 9.1 | 11.2 | **86.3%** | 176s |
| nano_diff_p75_q8 | diffusion | 32 | 75% | 24 → 15 | 830.8M | 10.0 | 11.3 | **92.3%** | 172s |
| **nano_flow_lora64_p60** | **flow** | **64** | **60%** | **24 → 11** | **774.1M** | **12.7** | **14.1** | 75.7% | 168s |
| nano_diff_noprune_q8 | diffusion | 32 | ~100% | 24 → 21 | 922.2M | 8.1 | 11.0 | 59.4% | 167s |
**Optimal configurations:**
- **Fastest inference**: `nano_flow_lora64_p60` at **14.1 fps FP16**, 12.7 fps FP32
- **Best loss reduction**: `nano_diff_p75_q8` at **92.3%** in 30 steps
- **Most compressed**: `nano_flow_p50_q4`, 967.9M → **739.3M** (24% reduction)
- **Best balanced**: `nano_flow_lora64_p60` (fast inference + good compression + strong training)
**Pruning impact on speed (FP32):**
- 24 → 21 layers: 8.1 fps (baseline)
- 24 → 18 layers: 9.1 fps (+12%)
- 24 → 15 layers: 10.0 fps (+23%)
- 24 → 11 layers: 12.7 fps (+57%)
- 24 → 9 layers: **14.1 fps** (+74%)
---
## Recommended Configurations
### Production (Edge Deployment)
```
variant: nano
action_head: flow
lora_rank: 64
prune_ratio: 0.60
quant_bits: 4
→ 774.1M params, FP16: 14.1 fps, <600 MB INT4 (estimated)
```
### Quality (Best Training)
```
variant: nano
action_head: diffusion
lora_rank: 32
prune_ratio: 0.75
quant_bits: 8
→ 830.8M params, FP16: 11.3 fps, 92.3% loss reduction
```
### Minimum Size (IoT/Embedded)
```
variant: nano
action_head: flow
lora_rank: 32
prune_ratio: 0.50
quant_bits: 4
→ 739.3M params, FP32: 14.1 fps, <500 MB INT4 (estimated)
```