# FORGE GPU Performance Report
> All benchmarks on NVIDIA L4 24GB, CUDA 13.0, PyTorch 2.10, Python 3.14
---
## Run: 2026-03-19 03:00 UTC (Phases 1-6, Full Pipeline)
### Environment
| Property | Value |
|----------|-------|
| GPU | NVIDIA L4 24GB |
| Driver | 580.126.09 |
| CUDA | 13.0 |
| PyTorch | 2.10.0+cu128 |
| Python | 3.14.0 |
| OS | Linux 6.17.0-1008-gcp |
### Model: FORGE-Nano (SigLIP-SO400M + Qwen2.5-0.5B)
#### Architecture
| Component | Details |
|-----------|---------|
| Vision Encoder | SigLIP-SO400M-patch14-384 (frozen, 472.3M params) |
| Bridge Attention | 64 queries, 4 layers, 8 heads (39.7M params) |
| Language Backbone | Qwen2.5-0.5B + LoRA rank=32 (494.2M params) |
| Action Head | Diffusion, 4 layers, 10 steps (1.7M params) |
| **Total** | **967.9M params** |
| **Trainable** | **495.6M params** (51.2%) |
| **Frozen** | **472.3M params** (48.8%) |
#### Inference Latency
| Metric | Value |
|--------|-------|
| Single inference (avg) | **129.0 ms** |
| Single inference (min) | 121.3 ms |
| Single inference (max) | 135.5 ms |
| Batch=8 total | 843.4 ms |
| Batch=8 per-sample | **105.4 ms** |
| Throughput (single) | **7.8 fps** |
| Throughput (batch=8) | **9.5 fps** |
| P50 latency | 132.3 ms |
| P99 latency | 136.2 ms |
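The latencies above come from repeated timed forward passes. A minimal, framework-agnostic harness for collecting mean/p50/p99 looks like the sketch below; this is illustrative, not the benchmark script itself, and on GPU you would add `torch.cuda.synchronize()` before each timer reading.

```python
import time
import statistics

def benchmark(fn, warmup=5, iters=50):
    """Run fn() repeatedly and report latency percentiles in ms.
    On GPU, call torch.cuda.synchronize() before each reading so the
    timer sees full kernel time (omitted to stay framework-free)."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    mean = statistics.mean(samples)
    return {
        "mean_ms": mean,
        "p50_ms": samples[len(samples) // 2],
        "p99_ms": samples[min(len(samples) - 1, int(len(samples) * 0.99))],
        "fps": 1000.0 / mean,
    }
```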
#### GPU Memory
| Metric | Value |
|--------|-------|
| Allocated | 3.90 GB |
| Reserved | 4.62 GB |
| Available | 18.4 GB (headroom) |
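Allocated vs reserved here are the standard CUDA caching-allocator counters; a sketch of how they are read (returns zeros off-GPU):

```python
import torch

def gpu_memory_gb():
    """Read the CUDA caching allocator's counters: 'allocated' is live
    tensor memory, 'reserved' also includes the allocator's cached blocks."""
    if not torch.cuda.is_available():
        return {"allocated_gb": 0.0, "reserved_gb": 0.0}
    return {
        "allocated_gb": torch.cuda.memory_allocated() / 1024**3,
        "reserved_gb": torch.cuda.memory_reserved() / 1024**3,
    }
```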
#### Knowledge Distillation (200 steps)
| Metric | Value |
|--------|-------|
| Training speed | **1.8 steps/s** |
| Loss (first 10 avg) | 17.8218 |
| Loss (last 10 avg) | 1.0994 |
| **Loss reduction** | **93.8%** |
| Total time | 110.2s |
| Trainable params | 45.7M (bridge + action head + LoRA) |
| Optimizer | AdamW (lr=2e-4, wd=0.01) |
| Gradient accumulation | 2 steps |
| Gradient clipping | max_norm=1.0 |
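A minimal sketch of a KD step loop matching the reported optimizer settings (AdamW lr=2e-4 wd=0.01, accumulation over 2 steps, clipping at max_norm=1.0); `student`, `teacher`, and the MSE objective are stand-ins for the real FORGE modules and distillation loss.

```python
import torch
from torch import nn

def distill(student, teacher, batches, accum=2, lr=2e-4):
    """KD loop with the reported settings: AdamW(lr, wd=0.01),
    gradient accumulation over `accum` steps, clipping at max_norm=1.0.
    MSE against teacher outputs stands in for the real FORGE loss."""
    opt = torch.optim.AdamW(student.parameters(), lr=lr, weight_decay=0.01)
    losses = []
    opt.zero_grad()
    for i, x in enumerate(batches):
        with torch.no_grad():
            target = teacher(x)
        loss = nn.functional.mse_loss(student(x), target)
        (loss / accum).backward()  # scale so accumulated grads average out
        if (i + 1) % accum == 0:
            torch.nn.utils.clip_grad_norm_(student.parameters(), max_norm=1.0)
            opt.step()
            opt.zero_grad()
        losses.append(loss.item())
    return losses
```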
#### Knowledge Distillation (150 steps, demo run)
| Metric | Value |
|--------|-------|
| Training speed | 1.8 steps/s |
| Loss start | 4.6994 |
| Loss end | 0.9845 |
| **Loss reduction** | **79.1%** |
| Total time | 83.5s |
#### Layer Pruning (Shallow-Pi)
| Metric | Value |
|--------|-------|
| Layers before | 27 |
| Layers after | **18** |
| Layers removed | 9 (indices: 9-17, middle layers) |
| Params before | 967.9M |
| Params after | **830.8M** |
| **Param reduction** | **14.2%** |
| Strategy | U-shaped importance (edges > middle) |
| Keep first/last | 2 layers each |
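The removed indices follow directly from the U-shaped strategy: keep the edges, drop a contiguous middle block. A sketch of that selection (hypothetical helper, but it reproduces the reported 27 → 18 with indices 9-17):

```python
def middle_layers_to_prune(n_layers, n_remove, keep_edges=2):
    """U-shaped importance: early and late layers matter most, so remove
    a contiguous middle block while always keeping `keep_edges` layers
    at each end. (Hypothetical helper reproducing the reported result.)"""
    assert n_remove <= n_layers - 2 * keep_edges
    mid = n_layers // 2
    start = mid - n_remove // 2
    return list(range(start, start + n_remove))
```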
#### Quantization
| Format | Size | Compression (vs FP32) |
|--------|------|----------------------|
| FP32 (original) | 3,871.7 MB | 1.0x |
| BF16 | 1,935.9 MB | 2.0x |
| INT8 | 830.8 MB | 4.7x |
| **INT4** | **415.4 MB** | **9.3x** |
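These sizes are plain weight-only arithmetic: params × bits / 8. Notably, the INT8/INT4 rows line up exactly with the pruned 830.8M-param count (1 and 0.5 bytes/param), which suggests they were measured post-prune while FP32/BF16 reflect the full 967.9M model.

```python
def model_size_mb(n_params, bits_per_param):
    """Weight-only storage estimate (1 MB = 1e6 bytes); real checkpoints
    add a small overhead for quantization scales and zero-points."""
    return n_params * bits_per_param / 8 / 1e6
```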
#### INT4 Inference (post-prune + quantize)
| Metric | Value |
|--------|-------|
| Latency | **103.4 ms** |
| Throughput | **9.7 fps** |
| Speedup vs FP32 | 1.25x |
#### ONNX Export
| Metric | Value |
|--------|-------|
| ONNX file size | 7.3 MB |
| Optimized ONNX | **6.7 MB** |
| Status | Success |
#### TensorRT
| Metric | Value |
|--------|-------|
| Status | Not installed on this machine |
| Plan | Install TRT SDK, build FP16 + INT8 engines |
---
## Comparison: OpenVLA-7B vs FORGE-Nano
| Metric | OpenVLA-7B | FORGE-Nano | Delta |
|--------|-----------|------------|-------|
| Parameters | 7,000M | 967.9M | **7.2x smaller** |
| Size (bf16) | ~13 GB | 1.8 GB | **7.2x smaller** |
| Size (INT4) | ~3.5 GB | 415 MB | **8.4x smaller** |
| Latency (L4) | ~2,000 ms | 129 ms | **15.5x faster** |
| Throughput | ~0.5 fps | 7.8 fps | **15.6x higher** |
| GPU Memory | ~14 GB | 3.9 GB | **3.6x lower** |
| Edge deployable | No | Yes | ✓ |
| Jetson Orin Nano | No (OOM) | Yes | ✓ |
| Apple Silicon | No | Yes (MLX) | ✓ |
---
## Experiment Log
> Every experiment run gets appended here with date, config, and key metrics.
### [2026-03-19 03:00] Initial GPU Validation
- **Config**: FORGE-Nano, SigLIP+Qwen2.5-0.5B, LoRA r=32
- **Device**: NVIDIA L4 24GB
- **Result**: 61/61 tests passing, all phases complete
- **Key metrics**: 129ms latency, 93.8% loss reduction (200 steps), 415MB INT4
### [2026-03-19 03:15] Demo Run (150 steps)
- **Config**: Same as above, 150 KD steps
- **Result**: Demo command working end-to-end
- **Key metrics**: 131.9ms latency, 79.1% loss reduction (150 steps), ONNX 6.7MB
---
## v2 Manual GPU Validation
### [2026-03-19 ~14:00] Step 1: SigLIP-SO400M Vision Encoder
- **Config**: google--siglip-so400m-patch14-384, FP32
- **Device**: NVIDIA L4 24GB (22.5GB free)
- **Result**: PASS
- **Key metrics**:
- Vision-only params: 428.2M
- GPU VRAM: 1.71 GB
- Warm latency: 96.7ms (std=1.9ms), P50=96.4ms, P99=100.7ms
- Output shape: [1, 729, 1152] (matches spec: d=1152, 729 tokens)
- Load time: ~1.1s CPU, ~8.6s to CUDA
- **Issues found**: `SiglipVisionModel.from_pretrained()` fails on a full SiglipConfig; the full model must be loaded and `.vision_model` extracted
- **Fix applied**: student.py now tries SiglipVisionModel first, then falls back to loading SiglipModel and extracting `.vision_model`
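The fallback described above can be sketched as follows (hypothetical helper name; the actual logic lives in student.py):

```python
def load_siglip_vision(path):
    """Load only the SigLIP vision tower. SiglipVisionModel.from_pretrained()
    can fail when the checkpoint ships a full SiglipConfig, so fall back to
    loading the full model and extracting .vision_model."""
    from transformers import SiglipModel, SiglipVisionModel
    try:
        return SiglipVisionModel.from_pretrained(path)
    except Exception:
        return SiglipModel.from_pretrained(path).vision_model
```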
### [2026-03-19 ~14:30] Step 2: Full FORGEStudent Build
- **Config**: FORGE-Nano, SigLIP+Qwen2.5-0.5B, LoRA r=32
- **Device**: NVIDIA L4 24GB
- **Result**: PASS
- **Key metrics**:
- Total params: 967.9M (496M trainable, 472M frozen)
- GPU VRAM: 3.9 GB
- Build time: 6.2s
- Output shapes: actions (1,7), vision_features (1,64,896)
### [2026-03-19 ~15:00] Step 3: KD Training Loop (50 steps)
- **Config**: FORGE-Nano, AdamW lr=2e-4, diffusion action head built-in loss
- **Device**: NVIDIA L4 24GB
- **Result**: PASS
- **Key metrics**:
  - Loss: 9.50 → 2.04 (78.5% reduction in 50 steps)
- Speed: 2.2 steps/s (22.9s total)
- GPU VRAM: 9.7 GB
- **Issues found**: External ForgeDistillationLoss broke gradient chain; used model's built-in loss instead
### [2026-03-19 ~16:00] Step 4: Chunk-Aware Layer Pruning
- **Config**: FORGE-Nano (27 Qwen layers), α=0.6 (standard vs temporal)
- **Device**: NVIDIA L4 24GB
- **Result**: PASS
- **Key metrics**:
- Layers: 27 β 20 (removed 7: [5, 8, 11, 12, 15, 17, 21])
- Params: 967.9M β 861.3M (89.0% retained)
- Importance scoring: 11.0s (3 calibration samples)
- Top layer: 24 (0.8000), Bottom: 21 (0.2000)
- Pruned model forward pass verified
- GPU VRAM: 7.8 GB (pruning deepcopy overhead)
### [2026-03-19 ~16:15] Step 5: Chunk-Aware INT4 Quantization
- **Config**: FORGE-Nano, target_bits=4.0, action_head_bits=8
- **Device**: NVIDIA L4 24GB
- **Result**: PASS
- **Key metrics**:
- Calibrated 569 linear modules (1.1s)
- FP32 size: 3872 MB β INT4 estimated: 484 MB (8.0x compression)
- Action MSE (FP32 vs INT4): 2.161
- Temporal coherence delta: 0.000
- Quantization time: 116.1s
- GPU VRAM: 7.8 GB
### [2026-03-19 ~16:30] Step 6: Inference Latency Benchmark
- **Config**: FORGE-Nano, FP32 + FP16 autocast
- **Device**: NVIDIA L4 24GB
- **Result**: PASS
- **Key metrics** (FP32, batch=1, 50 iterations):
- p50: 134.8 ms, p95: 138.2 ms, p99: 140.3 ms
- Mean: 134.6 ms (std 2.7 ms)
- Throughput: 7.4 fps
- **Key metrics** (FP32, batch=4, 20 iterations):
- p50: 455.2 ms, p95: 467.0 ms
- Throughput: 8.8 fps
- **Key metrics** (FP16 autocast, batch=1, 30 iterations):
- p50: 88.6 ms, p95: 91.6 ms
- Throughput: 11.3 fps
- Speedup vs FP32: 1.52x
- GPU VRAM: 4.6 GB
- **Issue**: `.half()` fails due to LoRA dtype mismatch; use `torch.autocast` instead
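The workaround keeps weights in FP32 and lets autocast run matmuls in reduced precision. A sketch (generic wrapper, not the project's API):

```python
import torch

def infer_autocast(model, batch, device_type="cuda", dtype=torch.float16):
    """Run inference under autocast: weights stay FP32 (avoiding the LoRA
    dtype mismatch that breaks .half()), matmuls run in reduced precision."""
    with torch.inference_mode(), torch.autocast(device_type=device_type, dtype=dtype):
        return model(batch)
```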
### [2026-03-19 ~16:45] Step 7: AutoSense Model Detection
- **Config**: All models in /home/datai/development/forge/datasets/
- **Result**: PASS
- **Key metrics**:
- SigLIP-SO400M: d_output=1152, image_size=384, patch_size=14, n_tokens=729
- Qwen2.5-0.5B: d_model=896, vocab_size=151936, n_layers=24, n_heads=14
- Qwen2.5-1.5B: d_model=1536, vocab_size=151936, n_layers=28, n_heads=12
- apply_autosense correctly updates bridge_d_model from 896 to 1536 for 1.5B variant
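AutoSense-style detection essentially reduces to reading each checkpoint's `config.json`. A sketch using Qwen2-style field names (an assumption; the real detector also handles vision encoders):

```python
import json

def autosense(config_path):
    """Pull the dimensions the bridge needs from a HF config.json.
    Field names follow Qwen2-style configs (hidden_size, etc.)."""
    with open(config_path) as f:
        cfg = json.load(f)
    return {
        "d_model": cfg.get("hidden_size"),
        "n_layers": cfg.get("num_hidden_layers"),
        "n_heads": cfg.get("num_attention_heads"),
        "vocab_size": cfg.get("vocab_size"),
    }
```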
### [2026-03-19 ~17:00] Step 8: Cross-Embodiment Transfer (manual)
- **Config**: FORGE-Nano actions β UR5e/ALOHA via linear/learned/joint_name
- **Device**: NVIDIA L4 24GB
- **Result**: PASS
- **Key metrics**:
- Franka (7D) β UR5e (6D) linear: action range [-16.2, 25.0] (with joint limit scaling)
- Franka (7D) β ALOHA (14D) mirror pad: action range [-8.1, 12.5]
- Learned adapter: 5062 params MLP, output shape correct
  - Joint-name mapping: 0 matches (expected: j1-j7 vs. shoulder/wrist names have low trigram overlap)
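The zero-match result is expected from trigram-based name matching: two-character names like `j1` produce no trigrams at all. An illustrative similarity function (assumed, not the mapper's actual code):

```python
def trigram_similarity(a, b):
    """Jaccard overlap of character trigrams. Short names like 'j1'
    produce no trigrams at all, hence zero matches against descriptive
    names such as 'shoulder_pan'."""
    tri = lambda s: {s[i:i + 3] for i in range(len(s) - 2)}
    ta, tb = tri(a.lower()), tri(b.lower())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```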
---
## Automated Benchmark Suite (8 benchmarks)
Results in `benchmarks/results/*.json`; run via `uv run python benchmarks/run_all.py`
### [2026-03-19 11:20] Bench 01: Vision Encoder (100 iterations)
- **FP32 b=1**: p50=101.0ms, 9.9 fps
- **FP16 b=1**: p50=28.7ms, 32.3 fps (3.26x speedup)
- **FP32 b=8**: p50=619.4ms, 12.8 fps
- **GPU mem**: 2.05 GB
### [2026-03-19 11:20] Bench 02: Full Student Inference (50 iterations)
- **FP32 b=1**: p50=135.4ms, 7.4 fps
- **FP16 b=1**: p50=87.3ms, 11.5 fps (1.56x speedup)
- **Batch scaling**: b1=7.4 fps → b2=8.6 fps → b4=8.8 fps
- **GPU mem**: 4.65 GB
### [2026-03-19 11:20] Bench 03: KD Training (3 runs)
- **Run 1** (lr=2e-4, 50 steps): 2.63 → 1.67 (36.5%), 2.7 steps/s, 9.65 GB
- **Run 2** (lr=5e-4, 50 steps): 9.27 → 1.56 (83.1%), 2.8 steps/s, 14.97 GB
- **Run 3** (lr=2e-4, 100 steps): 3.72 → 3.15 (15.3%), 2.8 steps/s, 20.29 GB
### [2026-03-19 11:20] Bench 04: Pruning (4 ratios)
| Keep % | Layers | Params (M) | Latency p50 | FPS |
|--------|--------|-----------|-------------|-----|
| 90% | 27 → 24 | 922.2 | 121.4 ms | 8.2 |
| 75% | 27 → 20 | 861.3 | 105.7 ms | 9.5 |
| 60% | 27 → 16 | 800.3 | 91.1 ms | 11.0 |
| 50% | 27 → 13 | 754.6 | 80.1 ms | 12.5 |
### [2026-03-19 11:20] Bench 05: Quantization (4 configs)
| Config | Compression | Action MSE | Latency p50 | FPS |
|--------|-------------|-----------|-------------|-----|
| INT8/AH8 | 4.0x | 2.477 | 139.2 ms | 7.2 |
| INT4/AH8 | 8.0x | 3.221 | 136.5 ms | 7.3 |
| INT4/AH4 | 8.0x | 2.989 | 136.8 ms | 7.3 |
| INT3/AH8 | 10.7x | 5.135 | 138.1 ms | 7.2 |
### [2026-03-19 11:20] Bench 06: AutoSense
- 9 vision encoders detected, 5 language models detected
- Sub-millisecond detection per model (<0.2ms)
- Qwen-1.5B auto-updates bridge_d_model from 896 → 1536
### [2026-03-19 11:20] Bench 07: Cross-Embodiment Transfer (6 pairs Γ 3 strategies)
- **Linear mapping**: ~12-14 µs/action (70-84k maps/s)
- **Joint-name mapping**: ~1.7 µs/action (585-601k maps/s)
- **Learned adapter**: ~64 µs/action (15.4-15.6k maps/s)
### [2026-03-19 11:20] Bench 08: E2E Pipeline
- **Total pipeline**: 167s (build → train → prune → quantize → benchmark)
- **Build**: 6.0s, 967.9M params
- **Train**: 30 steps, loss 5.32 → 1.88, 2.6 steps/s
- **Prune**: 27 → 20 layers, 861.3M params
- **Quantize**: INT4, 3445 → 431 MB (8.0x)
- **Inference**: FP32=109.6ms, FP16=84.7ms (11.8 fps)
---
## Multi-GPU Benchmarks (4x NVIDIA L4 24GB)
### [2026-03-19 12:21] Bench 09: Multi-GPU DataParallel
#### Inference Scaling (FORGE-Nano FP32)
| GPUs | Batch=1 | Batch=4 | Batch=8 | Batch=16 |
|------|---------|---------|---------|----------|
| 1 GPU | 7.8 fps | **9.3 fps** | 9.3 fps | 9.3 fps |
| 2 GPU | 6.1 fps | 6.5 fps | **10.0 fps** | 13.5 fps |
| 4 GPU | 6.0 fps | 4.4 fps | 8.0 fps | **13.6 fps** |
- **Optimal**: 2-4 GPUs at batch ≥ 16 for 1.46x throughput over a single GPU
- **Key insight**: DataParallel overhead dominates at small batches; a single GPU is faster at batch=1-4
#### FP16 Multi-GPU Inference
| GPUs | Batch=4 | Batch=8 | Batch=16 | Batch=32 |
|------|---------|---------|----------|----------|
| 1 GPU | 32.7 fps | 34.2 fps | 32.9 fps | **33.6 fps** |
| 4 GPU | 4.4 fps | 8.8 fps | 17.5 fps | **31.6 fps** |
- **FP16 1-GPU**: 33.6 fps at batch=32 (4.3x faster than FP32!)
- **FP16 4-GPU**: Matches 1-GPU throughput at batch=32
#### Training Scaling
| GPUs | Batch | Steps/s | Loss Reduction |
|------|-------|---------|----------------|
| 1 GPU | 2 | **2.31** | 56.3% |
| 2 GPU (DP) | 4 | 0.79 | -12.9% |
| 4 GPU (DP) | 8 | 0.50 | **82.6%** |
- **Training**: Single GPU is faster per-step; 4-GPU benefits at larger effective batch
- **VRAM**: 1 GPU=9.0 GB, 4 GPU=14.6 GB primary + 4.1 GB per replica
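Given these numbers, a reasonable policy is to wrap the model in `DataParallel` only when several GPUs are present and batches are large; a sketch (hypothetical helper):

```python
import torch
from torch import nn

def maybe_data_parallel(model):
    """Wrap in DataParallel only when multiple GPUs are visible; per the
    benchmark, small batches should stay on one GPU anyway because the
    per-call scatter/gather overhead dominates below batch~8."""
    if torch.cuda.device_count() > 1:
        return nn.DataParallel(model)
    return model
```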
### [2026-03-19 12:10] Bench 10: Multi-Teacher Distillation
#### GPU Placement Planning
| Teachers | Total VRAM | Placement |
|----------|-----------|-----------|
| 2 (smolvla + rdt2) | 3.5 GB | All GPU:0 |
| 3 (+openvla) | 18.7 GB | All GPU:0 |
| 4 (+bitvla) | 19.5 GB | All GPU:0 |
| 5 (+pi0) | **22.7 GB** | GPU:0 + overflow to GPU:1 |
#### Multi-Teacher Training (mock teachers, 50 steps)
| Teachers | Loss Start | Loss End | Reduction | Speed | Peak VRAM |
|----------|-----------|---------|-----------|-------|-----------|
| 1 teacher | 0.181 | 0.124 | **31.6%** | 0.72 s/s | 5.68 GB |
| 2 teachers | 0.444 | 0.227 | **48.9%** | 1.14 s/s | 6.87 GB |
| 3 teachers | 0.259 | 0.277 | -7.1% | 1.23 s/s | 6.86 GB |
- **Router entropy**: Converges from 0.69 → 0.0002 (2 teachers), 1.08 → 0.0001 (3 teachers)
- **Key insight**: Router learns to prefer most accurate teacher within 50 steps
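The router behavior (entropy collapsing toward zero) is consistent with a softmax gate over teacher outputs trained jointly with the student. An illustrative module (assumed structure, not the FORGE implementation):

```python
import torch
from torch import nn

class TeacherRouter(nn.Module):
    """Softmax gate over teacher action predictions. Trained jointly with
    the student, such a gate can collapse onto the most accurate teacher,
    matching the observed entropy decay toward ~0."""
    def __init__(self, n_teachers):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_teachers))

    def forward(self, teacher_actions):               # [T, B, D]
        weights = torch.softmax(self.logits, dim=0)   # [T], sums to 1
        blended = torch.einsum("t,tbd->bd", weights, teacher_actions)
        return blended, weights
```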
#### Universal Distillation (3 configs Γ 40 steps)
| Config | α_task | α_div | α_con | Loss Red. | Diversity (start→end) | Router Weights |
|--------|--------|-------|-------|-----------|------------------------|----------------|
| balanced | 0.30 | 0.05 | 0.10 | **76.1%** | 0.01 → 0.42 | [0.69, 0.28, 0.04] |
| kd_heavy | 0.10 | 0.05 | 0.05 | 9.2% | 0.00 → 0.48 | [0.80, 0.12, 0.09] |
| diverse | 0.20 | 0.15 | 0.10 | -568% | 0.00 → 0.12 | [0.63, 0.27, 0.11] |
- **Best config**: `balanced` (α_task=0.3, α_div=0.05, α_con=0.1) with 76.1% loss reduction
- **Worst config**: `diverse`; the high diversity weight destabilizes training
### [2026-03-19 12:35] Bench 11: Student Variants Comparison
| Variant | Params | FP32 fps | FP16 fps | FP16 Speedup | Train Steps/s | Loss Red. | Train VRAM |
|---------|--------|----------|----------|--------------|---------------|-------|-----------|
| **nano_baseline** (LoRA=32, diffusion) | 967.9M | 7.9 | **11.0** | 1.39x | 1.64 | 67.0% | 9.0 GB |
| **nano_lora64** (LoRA=64, diffusion) | 972.3M | 7.9 | 10.8 | 1.37x | 1.62 | **76.9%** | 9.1 GB |
| **nano_flow** (LoRA=32, flow) | 967.9M | **8.2** | **12.6** | **1.54x** | 1.58 | 85.8% | 9.0 GB |
| small_baseline (LoRA=32, diffusion) | 2097.7M | 6.2 | 9.9 | n/a | OOM | n/a | >22 GB |
| small_flow (LoRA=32, flow) | 2097.7M | 6.1 | **11.3** | n/a | OOM | n/a | >22 GB |
**Key findings:**
- **Flow matching is 15% faster** than diffusion at FP16 (12.6 vs 11.0 fps)
- **Flow has best FP16 speedup**: 1.54x vs 1.39x for diffusion
- **LoRA=64 trains better** (76.9% vs 67.0% loss reduction) with negligible speed cost
- **Small (2.1B)** fits inference on single L4 but needs multi-GPU for training
### [2026-03-19 12:50] Bench 12: Full Pipeline Combinations (build → train → prune → infer)
| Pipeline | Head | LoRA | Prune | Layers | Params Post-Prune | FP32 fps | FP16 fps | Loss Red. | Time |
|----------|------|------|-------|--------|-------------------|----------|----------|-----------|------|
| nano_diff_p75_q4 | diffusion | 32 | 75% | 24 → 15 | 830.8M | 10.0 | 12.0 | 41.4% | 171s |
| nano_flow_p50_q4 | flow | 32 | 50% | 24 → 9 | **739.3M** | **14.1** | 7.8 | 76.3% | 166s |
| nano_lora64_p90_q4 | diffusion | 64 | 90% | 24 → 18 | 880.8M | 9.1 | 11.2 | **86.3%** | 176s |
| nano_diff_p75_q8 | diffusion | 32 | 75% | 24 → 15 | 830.8M | 10.0 | 11.3 | **92.3%** | 172s |
| **nano_flow_lora64_p60** | **flow** | **64** | **60%** | **24 → 11** | **774.1M** | **12.7** | **14.1** | 75.7% | 168s |
| nano_diff_noprune_q8 | diffusion | 32 | ~100% | 24 → 21 | 922.2M | 8.1 | 11.0 | 59.4% | 167s |
**Optimal configurations:**
- **Fastest inference**: `nano_flow_lora64_p60` at **14.1 fps FP16**, 12.7 fps FP32
- **Best loss reduction**: `nano_diff_p75_q8` at **92.3%** in 30 steps
- **Most compressed**: `nano_flow_p50_q4`, 967.9M → **739.3M** (24% reduction)
- **Best balanced**: `nano_flow_lora64_p60` (fast inference + good compression + strong training)
**Pruning impact on speed (FP32):**
- 24 → 21 layers: 8.1 fps (baseline)
- 24 → 18 layers: 9.1 fps (+12%)
- 24 → 15 layers: 10.0 fps (+23%)
- 24 → 11 layers: 12.7 fps (+57%)
- 24 → 9 layers: **14.1 fps** (+74%)
---
## Recommended Configurations
### Production (Edge Deployment)
```
variant: nano
action_head: flow
lora_rank: 64
prune_ratio: 0.60
quant_bits: 4
→ 774.1M params, FP16: 14.1 fps, <600 MB INT4 (estimated)
```
### Quality (Best Training)
```
variant: nano
action_head: diffusion
lora_rank: 32
prune_ratio: 0.75
quant_bits: 8
→ 830.8M params, FP16: 11.3 fps, 92.3% loss reduction
```
### Minimum Size (IoT/Embedded)
```
variant: nano
action_head: flow
lora_rank: 32
prune_ratio: 0.50
quant_bits: 4
→ 739.3M params, FP32: 14.1 fps, <500 MB INT4 (estimated)
```