# Collected Results

All measurements below come from actual runs; the GPU, settings, and seed are noted for each table.

---

## Table 1: Original Results (MPS, provided by the author)

Config: 6 layers, chunk_size=64, B=8, T=256, 10% active, 2000 steps

| d_model | Run | Time (s) | ms/step | Val Loss |
|---------|-----|----------|---------|----------|
| 512 | dense_baseline | 74.77 | 99.70 | 5.3142 |
| 512 | sparse_full_dX | 91.04 | 121.38 | 5.4141 |
| 512 | sparse_sparse_dX | 93.33 | 124.44 | 5.5467 |
| 2048 | dense_baseline | 1035.84 | 591.91 | 6.0264 |
| 2048 | sparse_full_dX | 875.51 | 500.29 | 5.9807 |
| 2048 | sparse_sparse_dX | 847.22 | 484.13 | 6.0231 |

**Observation**: Sparse is slower at d=512 (1.22x overhead) but faster at d=2048 (1.18x speedup for full_dX, 1.22x for sparse_dX). Quality is comparable at d=2048 and worse at d=512.

---

## Table 2: Isolated Matmul Microbenchmark (T4, per single FFN layer)

Config: B=8, T=256 (M=2048), chunk_size=64, 10% active, fp32, 100 iterations

| d_model | FFN dim | Params | Fwd (ms) | dX (ms) | dW_dense (ms) | dW_sparse (ms) | Total_dense (ms) | Total_sparse_full_dX (ms) | Speedup |
|---------|---------|--------|----------|---------|---------------|----------------|------------------|---------------------------|---------|
| 256 | 1024 | 0.3M | 0.27 | 0.21 | 0.27 | 0.26 | 0.75 | 0.74 | 1.02x |
| 384 | 1536 | 0.6M | 0.52 | 0.69 | 0.61 | 0.18 | 1.82 | 1.39 | 1.31x |
| 512 | 2048 | 1.0M | 1.00 | 1.01 | 0.97 | 0.26 | 2.99 | 2.28 | 1.31x |
| 768 | 3072 | 2.4M | 2.16 | 2.25 | 2.05 | 0.40 | 6.46 | 4.81 | 1.34x |
| 1024 | 4096 | 4.2M | 3.69 | 3.90 | 3.35 | 0.59 | 10.95 | 8.18 | 1.34x |
| 1536 | 6144 | 9.4M | 10.33 | 9.03 | 8.14 | 1.30 | 27.50 | 20.66 | 1.33x |
| 2048 | 8192 | 16.8M | 14.76 | 15.57 | 13.19 | 1.93 | 43.51 | 32.26 | 1.35x |

Amdahl ceiling (if dW were free): ~1.42–1.48x. Crossover point: d_model ≈ 384.
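
Since only the dW matmul is sparsified here, the achievable speedup is bounded by the forward and dX terms. A minimal sketch of how the Speedup column and this ceiling follow from the measured components, using the d_model=2048 row (small differences vs the table are rounding):

```python
# Derive Table 2's Speedup column and the Amdahl ceiling from per-op timings.
fwd, dx, dw_dense, dw_sparse = 14.76, 15.57, 13.19, 1.93  # ms, d_model = 2048

total_dense = fwd + dx + dw_dense      # ~43.5 ms
total_sparse = fwd + dx + dw_sparse    # ~32.3 ms
speedup = total_dense / total_sparse   # ~1.35x, matching the table

# Ceiling: even if dW cost nothing, fwd and dX still run dense.
ceiling = total_dense / (fwd + dx)     # ~1.43x
print(f"speedup={speedup:.2f}x, ceiling={ceiling:.2f}x")
```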

---

## Table 3: Triton Kernel Correctness (T4)

| d_in | d_out | chunk_size | dW max_err | dBias max_err | dX max_err | Status |
|------|-------|------------|------------|---------------|------------|--------|
| 512 | 2048 | 64 | 0.000320 | 0.000023 | 0.000042 | ✓ |
| 1024 | 4096 | 64 | 0.000443 | 0.000021 | 0.000092 | ✓ |
| 256 | 1024 | 32 | 0.000275 | 0.000038 | 0.000019 | ✓ |
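
The errors summarize a comparison of the kernels' gradients against a reference. A minimal sketch of that style of check for dW and dBias, with a gather-based PyTorch computation standing in for the Triton kernel under test (names and shapes are illustrative; sparsity matches the 10% used in the benchmarks):

```python
import torch

torch.manual_seed(0)
M, d_in, d_out, chunk = 2048, 512, 2048, 64
X, dY = torch.randn(M, d_in), torch.randn(M, d_out)

n_chunks = M // chunk
active = torch.randperm(n_chunks)[: n_chunks // 10]   # 10% active chunks

# Dense reference: mask out inactive token rows, then one big matmul.
mask = torch.zeros(M, 1)
for c in active:
    mask[c * chunk : (c + 1) * chunk] = 1.0
dW_ref = (X * mask).T @ dY
dB_ref = (dY * mask).sum(0)

# Stand-in for the kernel: gather active rows and reduce only those.
rows = torch.cat([torch.arange(c * chunk, (c + 1) * chunk) for c in active])
dW = X[rows].T @ dY[rows]
dB = dY[rows].sum(0)

print((dW - dW_ref).abs().max(), (dB - dB_ref).abs().max())
```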

---

## Table 4: Triton vs PyLoop vs Dense, Isolated Backward (T4)

Config: M=2048, chunk_size=64, 10% active, full_dX mode (dW sparse, dX dense), 50 iterations after warmup

| d_model | FFN dim | Active chunks | Dense (ms) | PyLoop (ms) | Triton (ms) | Triton speedup vs Dense | Triton speedup vs PyLoop |
|---------|---------|---------------|------------|-------------|-------------|-------------------------|--------------------------|
| 256 | 1024 | 1 | 0.39 | 0.40 | 0.46 | 0.85x | 0.88x |
| 512 | 2048 | 3 | 1.96 | 1.30 | 1.16 | 1.69x | 1.12x |
| 768 | 3072 | 4 | 4.29 | 2.52 | 2.51 | 1.70x | 1.00x |
| 1024 | 4096 | 6 | 7.29 | 4.37 | 4.30 | 1.70x | 1.02x |
| 1536 | 6144 | 9 | 17.32 | 10.04 | 9.78 | 1.77x | 1.03x |
| 2048 | 8192 | 12 | 29.14 | 17.20 | 16.89 | 1.73x | 1.02x |

Triton with both dW and dX sparse:

| d_model | Dense (ms) | Triton_all (ms) | Speedup |
|---------|------------|-----------------|---------|
| 512 | 1.96 | 0.41 | 4.83x |
| 1024 | 7.06 | 1.07 | 6.58x |
| 2048 | 29.00 | 3.71 | 7.81x |
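
All backward timings above are averaged over 50 iterations after warmup. A minimal sketch of that kind of harness, where `backward_fn` is a hypothetical stand-in for whichever backend (dense, PyLoop, or Triton) is being measured:

```python
import time
import torch

def bench_ms(backward_fn, iters=50, warmup=10):
    """Mean wall-clock ms per call, synchronizing around the timed region."""
    for _ in range(warmup):
        backward_fn()             # autotune/compilation happens here, not in timing
    torch.cuda.synchronize()      # drain queued warmup kernels
    t0 = time.perf_counter()
    for _ in range(iters):
        backward_fn()
    torch.cuda.synchronize()      # wait for all timed kernels to finish
    return (time.perf_counter() - t0) / iters * 1e3
```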

---

## Table 5: End-to-End Training (T4, 100 steps)

Config: 6 layers, 8 heads, B=8, T=256, chunk_size=64, 10% active, seed=42, AdamW lr=5e-4, full_dX mode

| d_model | Mode | ms/step | Speedup vs Dense | Val Loss |
|---------|------|---------|------------------|----------|
| 512 | dense | 184.6 | 1.00x | 5.6954 |
| 512 | pyloop | 179.0 | 1.03x | 5.8683 |
| 512 | triton | 196.0 | 0.94x | 5.8683 |
| 1024 | dense | 451.5 | 1.00x | 5.5300 |
| 1024 | pyloop | 435.6 | 1.04x | 5.4803 |
| 1024 | triton | 441.0 | 1.02x | 5.4800 |

d=2048 does not fit on a T4 (16 GB); A10G results are pending (job 69f3af45d2c8bd8662bd419d).

Note: Triton autotune overhead hurts at small scale. At d=512, with only one active chunk per layer, the fused kernels lose to PyTorch's already-optimized single-kernel launches.

---

## Table 6: EMA Predictor Overlap (T4, 350 steps, seed=42)

Config: d=512, 6 layers, chunk_size=64, 10% active, measured every 25 steps after annealing (step ≥ 250)

| Step | Jaccard | Recall |
|------|---------|--------|
| 250 | 0.6000 | 0.7500 |
| 275 | 0.6552 | 0.7917 |
| 300 | 0.7778 | 0.8750 |
| 325 | 0.6000 | 0.7500 |
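
Jaccard here is intersection over union of the predicted and actual active-chunk sets; recall is the fraction of actually-active chunks the predictor selected. A minimal sketch (the example set sizes in the comment are one combination consistent with the step-250 row, not logged values):

```python
def overlap_metrics(predicted: set, actual: set) -> tuple[float, float]:
    inter = predicted & actual
    jaccard = len(inter) / len(predicted | actual)  # overlap / union
    recall = len(inter) / len(actual)               # truly active chunks recovered
    return jaccard, recall

# E.g., 24 predicted and 24 actual chunks sharing 18 members gives
# jaccard = 18/30 = 0.60 and recall = 18/24 = 0.75, matching step 250.
```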

Single seed only; full 3-seed results at 2000 steps are pending from the A10G job.

---

## Table 7: Chunk-Size vs Speed (T4, 50 steps, timing only)

Config: d=512, 6 layers, 10% active, seed=42. Loss is identical across chunk sizes (only 50 steps, all within warmup).

| Chunk Size | ms/step |
|------------|---------|
| 16 | 601.4 |
| 32 | 453.0 |
| 64 | 321.5 |
| 128 | 251.3 |
| 256 | 219.8 |

Larger chunks mean fewer Python loop iterations and therefore less overhead. This is the PyLoop backend; Triton would show a different curve.
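
A minimal sketch of the loop structure that drives this trend (hypothetical names and shapes; at fixed sparsity, the number of active chunks, and hence Python-side iterations and kernel launches, scales as 1/chunk_size):

```python
import torch

def pyloop_dW(X, dY, active_chunks, chunk_size):
    """Accumulate dW over active chunks, one small matmul per chunk."""
    dW = torch.zeros(X.shape[1], dY.shape[1], device=X.device, dtype=X.dtype)
    for c in active_chunks:            # iteration count ∝ 1 / chunk_size
        s = slice(c * chunk_size, (c + 1) * chunk_size)
        dW += X[s].T @ dY[s]           # each launch carries fixed overhead
    return dW
```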

---

## Pending Results (A10G jobs running)

| Job ID | Experiment | Status |
|--------|------------|--------|
| 69f38371d70108f37ace1cae | Full 7-experiment suite (2000 steps, 3 seeds, all ablations) | Running |
| 69f395b3d70108f37ace1cee | Model-size scaling study (d=256→2048, 2000 steps, 2 seeds) | Running |
| 69f3af45d2c8bd8662bd419d | E2E training with Triton (d=512, 1024, 2048; 500 steps) | Running |

These will provide:

- Table 3 full: all 8 baselines with 3 seeds at 2000 steps (Dense, Random, EMA, EMA+sparse_dX, RigL, SET, TopK-SGD, Oracle)
- Compute-matched dense (same FLOPs) vs sparse
- Chunk-size ablation with loss numbers at 2000 steps
- Epsilon-greedy exploration sweep
- Attention sparsification results
- Sparsity level sweep (5%–100%)
- d=2048 end-to-end training with Triton