
FORGE — GPU Performance Report

All benchmarks on NVIDIA L4 24GB, CUDA 13.0, PyTorch 2.10, Python 3.14


Run: 2026-03-19 03:00 UTC — Phase 1-6 Full Pipeline

Environment

| Property | Value |
|---|---|
| GPU | NVIDIA L4 24GB |
| Driver | 580.126.09 |
| CUDA | 13.0 |
| PyTorch | 2.10.0+cu128 |
| Python | 3.14.0 |
| OS | Linux 6.17.0-1008-gcp |

Model: FORGE-Nano (SigLIP-SO400M + Qwen2.5-0.5B)

Architecture

| Component | Details |
|---|---|
| Vision Encoder | SigLIP-SO400M-patch14-384 (frozen, 472.3M params) |
| Bridge Attention | 64 queries, 4 layers, 8 heads (39.7M params) |
| Language Backbone | Qwen2.5-0.5B + LoRA rank=32 (494.2M params) |
| Action Head | Diffusion, 4 layers, 10 steps (1.7M params) |
| Total | 967.9M params |
| Trainable | 495.6M params (51.2%) |
| Frozen | 472.3M params (48.8%) |

Inference Latency

| Metric | Value |
|---|---|
| Single inference (avg) | 129.0 ms |
| Single inference (min) | 121.3 ms |
| Single inference (max) | 135.5 ms |
| Batch=8 total | 843.4 ms |
| Batch=8 per-sample | 105.4 ms |
| Throughput (single) | 7.8 fps |
| Throughput (batch=8) | 9.5 fps |
| P50 latency | 132.3 ms |
| P99 latency | 136.2 ms |
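
Tail-latency figures like these are typically collected with a warmup phase and a device synchronize around each timed call; a minimal sketch (`model` and `batch` are placeholders, not the FORGE API):

```python
import time
import torch

# Warm up, synchronize the GPU around each timed call (so we measure
# kernel completion, not just launch), then take percentiles.
def bench(model, batch, iters=50, warmup=10):
    for _ in range(warmup):
        model(batch)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    times_ms = []
    for _ in range(iters):
        t0 = time.perf_counter()
        model(batch)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # wait for the work, not the launch
        times_ms.append((time.perf_counter() - t0) * 1000)
    times_ms.sort()
    return {
        "p50": times_ms[len(times_ms) // 2],
        "p99": times_ms[min(iters - 1, int(iters * 0.99))],
        "mean": sum(times_ms) / len(times_ms),
    }
```

Without the synchronize, CUDA's asynchronous execution makes each call appear to return in microseconds.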

GPU Memory

| Metric | Value |
|---|---|
| Allocated | 3.90 GB |
| Reserved | 4.62 GB |
| Available | 18.4 GB (headroom) |

Knowledge Distillation (200 steps)

| Metric | Value |
|---|---|
| Training speed | 1.8 steps/s |
| Loss (first 10 avg) | 17.8218 |
| Loss (last 10 avg) | 1.0994 |
| Loss reduction | 93.8% |
| Total time | 110.2s |
| Trainable params | 45.7M (bridge + action head + LoRA) |
| Optimizer | AdamW (lr=2e-4, wd=0.01) |
| Gradient accumulation | 2 steps |
| Gradient clipping | max_norm=1.0 |
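
The optimizer setup above (AdamW, accumulation over 2 steps, clipping at max_norm=1.0) can be sketched as follows; `model` stands in for the FORGE student, and its forward is assumed here to return the scalar KD loss directly:

```python
import torch

# AdamW + gradient accumulation + clipping, matching the hyperparameters
# in the table above. The loss computation is a placeholder.
def train_steps(model, batches, accum=2, lr=2e-4, wd=0.01, max_norm=1.0):
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=wd)
    opt.zero_grad(set_to_none=True)
    for i, batch in enumerate(batches):
        loss = model(batch)            # stand-in: returns scalar KD loss
        (loss / accum).backward()      # scale so accumulated grads average
        if (i + 1) % accum == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
            opt.step()
            opt.zero_grad(set_to_none=True)
```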

Knowledge Distillation (150 steps, demo run)

| Metric | Value |
|---|---|
| Training speed | 1.8 steps/s |
| Loss start | 4.6994 |
| Loss end | 0.9845 |
| Loss reduction | 79.1% |
| Total time | 83.5s |

Layer Pruning (Shallow-Pi)

| Metric | Value |
|---|---|
| Layers before | 27 |
| Layers after | 18 |
| Layers removed | 9 (indices 9-17, middle layers) |
| Params before | 967.9M |
| Params after | 830.8M |
| Param reduction | 14.2% |
| Strategy | U-shaped importance (edges > middle); keep first/last 2 layers each |
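
The U-shaped selection can be sketched as below. This is pure illustration of the strategy named in the table: the real pipeline scores layers with calibration data, not depth alone.

```python
# U-shaped layer selection: edges rank above the middle, the first and
# last `protect` layers are always kept, and the lowest-scoring middle
# layers are dropped.
def select_layers(n_layers: int, n_remove: int, protect: int = 2):
    mid = (n_layers - 1) / 2
    scores = {i: abs(i - mid) for i in range(n_layers)}  # distance = importance
    protected = set(range(protect)) | set(range(n_layers - protect, n_layers))
    candidates = sorted((i for i in scores if i not in protected),
                        key=lambda i: scores[i])
    removed = set(candidates[:n_remove])
    return [i for i in range(n_layers) if i not in removed]
```

With n_layers=27 and n_remove=9 this keeps 18 layers and drops indices 9-17, matching the table.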

Quantization

| Format | Size | Compression (vs FP32) |
|---|---|---|
| FP32 (original) | 3,871.7 MB | 1.0x |
| BF16 | 1,935.9 MB | 2.0x |
| INT8 | 830.8 MB | 4.7x |
| INT4 | 415.4 MB | 9.3x |

Note: the INT8/INT4 sizes appear to be computed on the pruned 830.8M-param model (830.8M × 1 B = 830.8 MB; × 0.5 B = 415.4 MB), so those compression ratios fold in the 14.2% pruning gain and exceed the nominal 4x/8x.

INT4 Inference (post-prune + quantize)

| Metric | Value |
|---|---|
| Latency | 103.4 ms |
| Throughput | 9.7 fps |
| Speedup vs FP32 | 1.25x |

ONNX Export

| Metric | Value |
|---|---|
| ONNX file size | 7.3 MB |
| Optimized ONNX | 6.7 MB |
| Status | Success |

TensorRT

| Metric | Value |
|---|---|
| Status | Not installed on this machine |
| Plan | Install TRT SDK, build FP16 + INT8 engines |

Comparison: OpenVLA-7B vs FORGE-Nano

| Metric | OpenVLA-7B | FORGE-Nano | Delta |
|---|---|---|---|
| Parameters | 7,000M | 967.9M | 7.2x ↓ |
| Size (bf16) | ~13 GB | 1.8 GB | 7.2x ↓ |
| Size (INT4) | ~3.5 GB | 415 MB | 8.4x ↓ |
| Latency (L4) | ~2,000 ms | 129 ms | 15.5x ↓ |
| Throughput | ~0.5 fps | 7.8 fps | 15.6x ↑ |
| GPU Memory | ~14 GB | 3.9 GB | 3.6x ↓ |
| Edge deployable | No | Yes | |
| Jetson Orin Nano | No (OOM) | Yes | |
| Apple Silicon | No | Yes (MLX) | |

Experiment Log

Every experiment run gets appended here with date, config, and key metrics.

[2026-03-19 03:00] Initial GPU Validation

  • Config: FORGE-Nano, SigLIP+Qwen2.5-0.5B, LoRA r=32
  • Device: NVIDIA L4 24GB
  • Result: 61/61 tests passing, all phases complete
  • Key metrics: 129ms latency, 93.8% loss reduction (200 steps), 415MB INT4

[2026-03-19 03:15] Demo Run (150 steps)

  • Config: Same as above, 150 KD steps
  • Result: Demo command working end-to-end
  • Key metrics: 131.9ms latency, 79.1% loss reduction (150 steps), ONNX 6.7MB

v2 Manual GPU Validation

[2026-03-19 ~14:00] Step 1: SigLIP-SO400M Vision Encoder

  • Config: google--siglip-so400m-patch14-384, FP32
  • Device: NVIDIA L4 24GB (22.5GB free)
  • Result: PASS
  • Key metrics:
    • Vision-only params: 428.2M
    • GPU VRAM: 1.71 GB
    • Warm latency: 96.7ms (std=1.9ms), P50=96.4ms, P99=100.7ms
    • Output shape: [1, 729, 1152] (matches spec: d=1152, 729 tokens)
    • Load time: ~1.1s CPU, ~8.6s to CUDA
  • Issues found: SiglipVisionModel.from_pretrained() fails on full SiglipConfig — must load full model then extract .vision_model
  • Fix applied: student.py now tries SiglipVisionModel first, falls back to SiglipModel + extract .vision_model
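
The fallback described above can be sketched as follows. `SiglipVisionModel` and `SiglipModel` are the real transformers classes; the function name and the exact exception types caught are assumptions:

```python
def load_vision_encoder(model_id: str):
    """Load only the SigLIP vision tower, tolerating full-model configs."""
    from transformers import SiglipModel, SiglipVisionModel
    try:
        # Works when the checkpoint ships a vision-only config.
        return SiglipVisionModel.from_pretrained(model_id)
    except (ValueError, OSError):
        # Checkpoints with a full SiglipConfig reject the vision-only
        # class: load the complete model and keep just the vision tower.
        return SiglipModel.from_pretrained(model_id).vision_model
```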

[2026-03-19 ~14:30] Step 2: Full FORGEStudent Build

  • Config: FORGE-Nano, SigLIP+Qwen2.5-0.5B, LoRA r=32
  • Device: NVIDIA L4 24GB
  • Result: PASS
  • Key metrics:
    • Total params: 967.9M (496M trainable, 472M frozen)
    • GPU VRAM: 3.9 GB
    • Build time: 6.2s
    • Output shapes: actions (1,7), vision_features (1,64,896)

[2026-03-19 ~15:00] Step 3: KD Training Loop (50 steps)

  • Config: FORGE-Nano, AdamW lr=2e-4, diffusion action head built-in loss
  • Device: NVIDIA L4 24GB
  • Result: PASS
  • Key metrics:
    • Loss: 9.50 → 2.04 (78.5% reduction in 50 steps)
    • Speed: 2.2 steps/s (22.9s total)
    • GPU VRAM: 9.7 GB
  • Issues found: External ForgeDistillationLoss broke gradient chain; used model's built-in loss instead

[2026-03-19 ~16:00] Step 4: Chunk-Aware Layer Pruning

  • Config: FORGE-Nano (27 Qwen layers), α=0.6 (standard vs temporal)
  • Device: NVIDIA L4 24GB
  • Result: PASS
  • Key metrics:
    • Layers: 27 → 20 (removed 7: [5, 8, 11, 12, 15, 17, 21])
    • Params: 967.9M → 861.3M (89.0% retained)
    • Importance scoring: 11.0s (3 calibration samples)
    • Top layer: 24 (0.8000), Bottom: 21 (0.2000)
    • Pruned model forward pass verified
    • GPU VRAM: 7.8 GB (pruning deepcopy overhead)

[2026-03-19 ~16:15] Step 5: Chunk-Aware INT4 Quantization

  • Config: FORGE-Nano, target_bits=4.0, action_head_bits=8
  • Device: NVIDIA L4 24GB
  • Result: PASS
  • Key metrics:
    • Calibrated 569 linear modules (1.1s)
    • FP32 size: 3872 MB → INT4 estimated: 484 MB (8.0x compression)
    • Action MSE (FP32 vs INT4): 2.161
    • Temporal coherence delta: 0.000
    • Quantization time: 116.1s
    • GPU VRAM: 7.8 GB

[2026-03-19 ~16:30] Step 6: Inference Latency Benchmark

  • Config: FORGE-Nano, FP32 + FP16 autocast
  • Device: NVIDIA L4 24GB
  • Result: PASS
  • Key metrics (FP32, batch=1, 50 iterations):
    • p50: 134.8 ms, p95: 138.2 ms, p99: 140.3 ms
    • Mean: 134.6 ms (std 2.7 ms)
    • Throughput: 7.4 fps
  • Key metrics (FP32, batch=4, 20 iterations):
    • p50: 455.2 ms, p95: 467.0 ms
    • Throughput: 8.8 fps
  • Key metrics (FP16 autocast, batch=1, 30 iterations):
    • p50: 88.6 ms, p95: 91.6 ms
    • Throughput: 11.3 fps
    • Speedup vs FP32: 1.52x
    • GPU VRAM: 4.6 GB
  • Issue: .half() fails due to LoRA dtype mismatch; use torch.autocast instead
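
The autocast workaround keeps weights in FP32 and lets autocast downcast activations per-op, instead of converting the whole model with `.half()` (which breaks on mixed-dtype LoRA adapters). A minimal sketch, with `model` and `inputs` as placeholders:

```python
import torch

# Run a forward pass under autocast; weights stay FP32, activations are
# downcast where safe. Defaults match the GPU benchmark setup above.
@torch.inference_mode()
def autocast_forward(model, inputs, device_type="cuda", dtype=torch.float16):
    with torch.autocast(device_type=device_type, dtype=dtype):
        return model(inputs)
```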

[2026-03-19 ~16:45] Step 7: AutoSense Model Detection

  • Config: All models in /home/datai/development/forge/datasets/
  • Result: PASS
  • Key metrics:
    • SigLIP-SO400M: d_output=1152, image_size=384, patch_size=14, n_tokens=729
    • Qwen2.5-0.5B: d_model=896, vocab_size=151936, n_layers=24, n_heads=14
    • Qwen2.5-1.5B: d_model=1536, vocab_size=151936, n_layers=28, n_heads=12
    • apply_autosense correctly updates bridge_d_model from 896 to 1536 for 1.5B variant

[2026-03-19 ~17:00] Step 8: Cross-Embodiment Transfer (manual)

  • Config: FORGE-Nano actions → UR5e/ALOHA via linear/learned/joint_name
  • Device: NVIDIA L4 24GB
  • Result: PASS
  • Key metrics:
    • Franka (7D) → UR5e (6D) linear: action range [-16.2, 25.0] (with joint limit scaling)
    • Franka (7D) → ALOHA (14D) mirror pad: action range [-8.1, 12.5]
    • Learned adapter: 5062 params MLP, output shape correct
    • Joint-name mapping: 0 matches (expected — j1-j7 vs shoulder/wrist names have low trigram overlap)
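
An illustrative trigram-overlap matcher shows why `j1`..`j7` produce zero matches against descriptive names: two-character names have no trigrams at all, and even longer generic names share none with `shoulder_*`/`wrist_*`. The function names and the Jaccard formulation are assumptions, not the FORGE implementation:

```python
# Character-trigram Jaccard similarity between joint names.
def trigrams(s: str) -> set:
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def name_similarity(a: str, b: str) -> float:
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0  # names shorter than 3 chars can never match
    return len(ta & tb) / len(ta | tb)
```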

Automated Benchmark Suite (8 benchmarks)

Results in `benchmarks/results/*.json` — run via `uv run python benchmarks/run_all.py`

[2026-03-19 11:20] Bench 01: Vision Encoder (100 iterations)

  • FP32 b=1: p50=101.0ms, 9.9 fps
  • FP16 b=1: p50=28.7ms, 32.3 fps (3.26x speedup)
  • FP32 b=8: p50=619.4ms, 12.8 fps
  • GPU mem: 2.05 GB

[2026-03-19 11:20] Bench 02: Full Student Inference (50 iterations)

  • FP32 b=1: p50=135.4ms, 7.4 fps
  • FP16 b=1: p50=87.3ms, 11.5 fps (1.56x speedup)
  • Batch scaling: b1=7.4fps → b2=8.6fps → b4=8.8fps
  • GPU mem: 4.65 GB

[2026-03-19 11:20] Bench 03: KD Training (3 runs)

  • Run 1 (lr=2e-4, 50 steps): 2.63→1.67 (36.5%), 2.7 steps/s, 9.65 GB
  • Run 2 (lr=5e-4, 50 steps): 9.27→1.56 (83.1%), 2.8 steps/s, 14.97 GB
  • Run 3 (lr=2e-4, 100 steps): 3.72→3.15 (15.3%), 2.8 steps/s, 20.29 GB

[2026-03-19 11:20] Bench 04: Pruning (4 ratios)

| Keep % | Layers | Params (M) | Latency p50 | FPS |
|---|---|---|---|---|
| 90% | 27→24 | 922.2 | 121.4 ms | 8.2 |
| 75% | 27→20 | 861.3 | 105.7 ms | 9.5 |
| 60% | 27→16 | 800.3 | 91.1 ms | 11.0 |
| 50% | 27→13 | 754.6 | 80.1 ms | 12.5 |

[2026-03-19 11:20] Bench 05: Quantization (4 configs)

| Config | Compression | Action MSE | Latency p50 | FPS |
|---|---|---|---|---|
| INT8/AH8 | 4.0x | 2.477 | 139.2 ms | 7.2 |
| INT4/AH8 | 8.0x | 3.221 | 136.5 ms | 7.3 |
| INT4/AH4 | 8.0x | 2.989 | 136.8 ms | 7.3 |
| INT3/AH8 | 10.7x | 5.135 | 138.1 ms | 7.2 |

[2026-03-19 11:20] Bench 06: AutoSense

  • 9 vision encoders detected, 5 language models detected
  • Sub-millisecond detection per model (<0.2ms)
  • Qwen-1.5B auto-updates bridge_d_model from 896→1536

[2026-03-19 11:20] Bench 07: Cross-Embodiment Transfer (6 pairs × 3 strategies)

  • Linear mapping: ~12-14 μs/action (70-84k maps/s)
  • Joint-name mapping: ~1.7 μs/action (585-601k maps/s)
  • Learned adapter: ~64 μs/action (15.4-15.6k maps/s)

[2026-03-19 11:20] Bench 08: E2E Pipeline

  • Total pipeline: 167s (build → train → prune → quantize → benchmark)
  • Build: 6.0s, 967.9M params
  • Train: 30 steps, 5.32→1.88 loss, 2.6 steps/s
  • Prune: 27→20 layers, 861.3M params
  • Quantize: INT4, 3445→431 MB (8.0x)
  • Inference: FP32=109.6ms, FP16=84.7ms (11.8 fps)

Multi-GPU Benchmarks (4x NVIDIA L4 24GB)

[2026-03-19 12:21] Bench 09: Multi-GPU DataParallel

Inference Scaling (FORGE-Nano FP32)

| GPUs | Batch=1 | Batch=4 | Batch=8 | Batch=16 |
|---|---|---|---|---|
| 1 GPU | 7.8 fps | 9.3 fps | 9.3 fps | 9.3 fps |
| 2 GPU | 6.1 fps | 6.5 fps | 10.0 fps | 13.5 fps |
| 4 GPU | 6.0 fps | 4.4 fps | 8.0 fps | 13.6 fps |

  • Optimal: 2-4 GPUs at batch≥16 for 1.46x throughput over single GPU
  • Key insight: DataParallel overhead dominates at small batches — single GPU is faster at batch=1-4

FP16 Multi-GPU Inference

| GPUs | Batch=4 | Batch=8 | Batch=16 | Batch=32 |
|---|---|---|---|---|
| 1 GPU | 32.7 fps | 34.2 fps | 32.9 fps | 33.6 fps |
| 4 GPU | 4.4 fps | 8.8 fps | 17.5 fps | 31.6 fps |

  • FP16 1-GPU: 33.6 fps at batch=32 (4.3x faster than FP32!)
  • FP16 4-GPU: Matches 1-GPU throughput at batch=32

Training Scaling

| GPUs | Batch | Steps/s | Loss Reduction |
|---|---|---|---|
| 1 GPU | 2 | 2.31 | 56.3% |
| 2 GPU (DP) | 4 | 0.79 | -12.9% |
| 4 GPU (DP) | 8 | 0.50 | 82.6% |

  • Training: Single GPU is faster per-step; 4-GPU benefits at larger effective batch
  • VRAM: 1 GPU=9.0 GB, 4 GPU=14.6 GB primary + 4.1 GB per replica
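
The DataParallel setup benchmarked above amounts to a one-line wrapper; DP replicates the model to every GPU on each forward pass, which is why small batches run slower than on a single GPU. `model` is a placeholder:

```python
import torch

# Wrap in DataParallel only when multiple GPUs are visible; DP splits the
# batch across devices and gathers outputs on GPU:0 each forward.
def wrap_dp(model):
    if torch.cuda.device_count() > 1:
        return torch.nn.DataParallel(model)
    return model
```

For training at scale, `DistributedDataParallel` avoids the per-step replication cost, at the price of a launcher script.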

[2026-03-19 12:10] Bench 10: Multi-Teacher Distillation

GPU Placement Planning

| Teachers | Total VRAM | Placement |
|---|---|---|
| 2 (smolvla + rdt2) | 3.5 GB | All GPU:0 |
| 3 (+openvla) | 18.7 GB | All GPU:0 |
| 4 (+bitvla) | 19.5 GB | All GPU:0 |
| 5 (+pi0) | 22.7 GB | GPU:0 + overflow to GPU:1 |

Multi-Teacher Training (mock teachers, 50 steps)

| Teachers | Loss Start | Loss End | Reduction | Speed | Peak VRAM |
|---|---|---|---|---|---|
| 1 teacher | 0.181 | 0.124 | 31.6% | 0.72 s/s | 5.68 GB |
| 2 teachers | 0.444 | 0.227 | 48.9% | 1.14 s/s | 6.87 GB |
| 3 teachers | 0.259 | 0.277 | -7.1% | 1.23 s/s | 6.86 GB |

  • Router entropy: Converges from 0.69→0.0002 (2 teachers), 1.08→0.0001 (3 teachers)
  • Key insight: Router learns to prefer most accurate teacher within 50 steps

Universal Distillation (3 configs × 40 steps)

| Config | α_task | α_div | α_con | Loss↓ | Diversity↑ | Router Weights |
|---|---|---|---|---|---|---|
| balanced | 0.30 | 0.05 | 0.10 | 76.1% | 0.01→0.42 | [0.69, 0.28, 0.04] |
| kd_heavy | 0.10 | 0.05 | 0.05 | 9.2% | 0.00→0.48 | [0.80, 0.12, 0.09] |
| diverse | 0.20 | 0.15 | 0.10 | -568% | 0.00→0.12 | [0.63, 0.27, 0.11] |

  • Best config: balanced (α_task=0.3, α_div=0.05, α_con=0.1) — 76.1% loss reduction
  • Worst config: diverse — high diversity weight destabilizes training

[2026-03-19 12:35] Bench 11: Student Variants Comparison

| Variant | Params | FP32 fps | FP16 fps | FP16 Speedup | Train Steps/s | Loss↓ | Train VRAM |
|---|---|---|---|---|---|---|---|
| nano_baseline (LoRA=32, diffusion) | 967.9M | 7.9 | 11.0 | 1.39x | 1.64 | 67.0% | 9.0 GB |
| nano_lora64 (LoRA=64, diffusion) | 972.3M | 7.9 | 10.8 | 1.37x | 1.62 | 76.9% | 9.1 GB |
| nano_flow (LoRA=32, flow) | 967.9M | 8.2 | 12.6 | 1.54x | 1.58 | 85.8% | 9.0 GB |
| small_baseline (LoRA=32, diffusion) | 2097.7M | 6.2 | 9.9 | n/a | OOM | n/a | >22 GB |
| small_flow (LoRA=32, flow) | 2097.7M | 6.1 | 11.3 | n/a | OOM | n/a | >22 GB |

Key findings:

  • Flow matching is 15% faster than diffusion at FP16 (12.6 vs 11.0 fps)
  • Flow has best FP16 speedup: 1.54x vs 1.39x for diffusion
  • LoRA=64 trains better (76.9% vs 67.0% loss reduction) with negligible speed cost
  • Small (2.1B) fits inference on single L4 but needs multi-GPU for training

[2026-03-19 12:50] Bench 12: Full Pipeline Combinations (build→train→prune→infer)

| Pipeline | Head | LoRA | Prune | Layers | Params Post-Prune | FP32 fps | FP16 fps | Loss↓ | Time |
|---|---|---|---|---|---|---|---|---|---|
| nano_diff_p75_q4 | diffusion | 32 | 75% | 24→15 | 830.8M | 10.0 | 12.0 | 41.4% | 171s |
| nano_flow_p50_q4 | flow | 32 | 50% | 24→9 | 739.3M | 14.1 | 7.8 | 76.3% | 166s |
| nano_lora64_p90_q4 | diffusion | 64 | 90% | 24→18 | 880.8M | 9.1 | 11.2 | 86.3% | 176s |
| nano_diff_p75_q8 | diffusion | 32 | 75% | 24→15 | 830.8M | 10.0 | 11.3 | 92.3% | 172s |
| nano_flow_lora64_p60 | flow | 64 | 60% | 24→11 | 774.1M | 12.7 | 14.1 | 75.7% | 168s |
| nano_diff_noprune_q8 | diffusion | 32 | ~100% | 24→21 | 922.2M | 8.1 | 11.0 | 59.4% | 167s |

Optimal configurations:

  • Fastest inference: nano_flow_lora64_p60 — 14.1 fps FP16, 12.7 fps FP32
  • Best loss reduction: nano_diff_p75_q8 — 92.3% in 30 steps
  • Most compressed: nano_flow_p50_q4 — 967.9M → 739.3M (24% reduction)
  • Best balanced: nano_flow_lora64_p60 — fast inference + good compression + strong training

Pruning impact on speed (FP32):

  • 24→21 layers: 8.1 fps (baseline)
  • 24→18 layers: 9.1 fps (+12%)
  • 24→15 layers: 10.0 fps (+23%)
  • 24→11 layers: 12.7 fps (+57%)
  • 24→9 layers: 14.1 fps (+74%)

Recommended Configurations

Production (Edge Deployment)

```yaml
variant: nano
action_head: flow
lora_rank: 64
prune_ratio: 0.60
quant_bits: 4
```

→ 774.1M params, FP16: 14.1 fps, under ~600 MB INT4

Quality (Best Training)

```yaml
variant: nano
action_head: diffusion
lora_rank: 32
prune_ratio: 0.75
quant_bits: 8
```

→ 830.8M params, FP16: 11.3 fps, 92.3% loss reduction

Minimum Size (IoT/Embedded)

```yaml
variant: nano
action_head: flow
lora_rank: 32
prune_ratio: 0.50
quant_bits: 4
```

→ 739.3M params, FP32: 14.1 fps, under ~500 MB INT4