JayLuci4 committed · Commit adbcafd (verified) · 1 Parent(s): 28b2f40

Chronos PoC: PTX transform selection via RLVR (DA-GRPO)

README.md ADDED
@@ -0,0 +1,156 @@
+ # Chronos PoC: PTX Transform Selection via RLVR
+
+ Proof-of-concept RL agent that selects PTX assembly transforms to optimize GPU kernel performance on NVIDIA L4 (sm_89).
+
+ ## What This Is
+
+ An MLP policy trained with DA-GRPO (Demonstration-Anchored Group Relative Policy Optimization) to select sequences of PTX-level transforms that reduce GPU kernel execution cycles. Trained on 64 gemm_tile kernel variants, validated on 33 diverse Triton kernels.
+
+ ## Results
+
+ | Metric | Value |
+ |--------|-------|
+ | Mean cycle reduction (gemm_tile) | **-29.2%** |
+ | Best single kernel | **-53.8%** (gemm_tile 4,6,8: 1839 -> 849 cycles) |
+ | Generalization to Triton kernels | 17/33 kernels improve |
+ | Best Triton improvement | **-47.0%** (attention_d64_kv64 with maxnreg_255) |
+ | Training time | ~6 hours on single NVIDIA L4 |
+ | Model parameters | ~28K (27,925 weights and biases in the three Linear layers) |
+
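The percentages in the table are relative cycle changes. A minimal sketch of the arithmetic, assuming the convention `(after - before) / before`; that convention appears consistent with the `delta_pct` values in `measure_triton_results.json`. The helper name `delta_pct` is illustrative and not part of the repository.

```python
def delta_pct(cycles_before: int, cycles_after: int, ndigits: int = 2) -> float:
    """Relative change in cycles, in percent; negative means faster."""
    return round((cycles_after - cycles_before) / cycles_before * 100.0, ndigits)


# Best gemm_tile kernel from the table: 1839 -> 849 cycles
print(delta_pct(1839, 849, ndigits=1))  # -53.8

# triton_vector_add_256 with cache_cs: 944 -> 941 cycles
print(delta_pct(944, 941))  # -0.32
```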
+ ## Model Architecture
+
+ ```
+ Input: 25 kernel features + 21 action mask + 21 action history = 67 dims
+ Hidden: Linear(67, 128) -> ReLU -> Dropout(0.1) -> Linear(128, 128) -> ReLU -> Dropout(0.1)
+ Output: Linear(128, 21) -> mask -> softmax
+ ```
+
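The parameter count follows directly from the three Linear layers listed above. A small sketch of the sum, assuming each layer has a bias term (the default for `nn.Linear`); ReLU and Dropout contribute no parameters.

```python
# (in_features, out_features) for the three Linear layers listed above
LAYERS = [(67, 128), (128, 128), (128, 21)]


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


per_layer = [linear_params(n_in, n_out) for n_in, n_out in LAYERS]
total = sum(per_layer)

print(per_layer)  # [8704, 16512, 2709]
print(total)      # 27925
```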
+ **25 features**: instruction counts (loads, stores, FMA, branches), vectorization ratios, cache hint coverage, register state, instruction mix ratios.
+
+ **21 actions**: 20 PTX transforms + stop. Transforms include vectorize loads/stores, load cache hints (cs/cg/ca/cv), store cache hints (cs/wt/wb), register budget limits (maxnreg 32/64/128/255), instruction reorder (critical_path/interleave/loads_first/stores_last), prefetch (L1/L2), and split vector loads.
+
+ ## Files
+
+ ```
+ checkpoint_best.pt            # Best checkpoint (epoch 250, -29.2% mean)
+ checkpoint_latest.pt          # Final checkpoint (epoch 500)
+ inference.py                  # Self-contained inference script
+ training_result.json          # Per-kernel results (64 gemm_tile kernels)
+ bc_stats.json                 # Behavior cloning warm-start statistics
+ measure_triton_results.json   # Triton kernel measurement results (33 kernels)
+ ```
+
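As an illustration of how `measure_triton_results.json` might be read, the sketch below ranks transforms by mean cycle change. It relies only on keys visible in this commit (`transform_stats`, `mean_delta`, `errors`); the function name `rank_transforms` is hypothetical, and the sample holds three entries copied from the file.

```python
import json


def rank_transforms(results: dict) -> list[tuple[str, float, int]]:
    """Sort transforms by mean cycle change, most negative (best) first."""
    rows = [
        (name, stats["mean_delta"], stats["errors"])
        for name, stats in results["transform_stats"].items()
    ]
    return sorted(rows, key=lambda row: row[1])


# Three entries copied from measure_triton_results.json
sample = {
    "transform_stats": {
        "cache_cg": {"n": 29, "mean_delta": 9.63, "errors": 0},
        "maxnreg_255": {"n": 33, "mean_delta": -2.26, "errors": 0},
        "reorder_il": {"n": 16, "mean_delta": -2.31, "errors": 14},
    }
}

for name, mean_delta, errors in rank_transforms(sample):
    print(f"{name:12s} mean {mean_delta:+6.2f}  errors={errors}")

# With the real file:
# with open("measure_triton_results.json") as f:
#     ranking = rank_transforms(json.load(f))
```

A low mean can hide a high error count (the reorder transforms report 14 errors each), so it is worth reading both columns together.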
+ ## Inference
+
+ ### Requirements
+
+ ```
+ pip install torch
+ ```
+
+ Inference needs no GPU (the model runs on CPU). A GPU is needed only to apply the transforms and measure cycles.
+
+ ### Quick Start
+
+ ```python
+ import torch
+ from inference import load_model, predict_transforms
+
+ # Load model
+ model = load_model("checkpoint_best.pt")
+
+ # Predict transforms for a PTX kernel
+ with open("your_kernel.ptx") as f:
+     ptx = f.read()
+ actions = predict_transforms(model, ptx)
+ # e.g. ['maxnreg_128', 'vec_ld', 'vec_st']
+ ```
+
+ ### Command Line
+
+ ```bash
+ # Demo with synthetic features
+ python inference.py --checkpoint checkpoint_best.pt
+
+ # Run on a PTX file
+ python inference.py --checkpoint checkpoint_best.pt --ptx path/to/kernel.ptx
+ ```
+
+ ### Programmatic Usage
+
+ ```python
+ import torch
+ from inference import TransformPolicy, extract_features_from_ptx, get_action_mask, get_action_history, ACTION_NAMES
+
+ # Load
+ model = TransformPolicy(hidden=128)
+ ckpt = torch.load("checkpoint_best.pt", map_location="cpu", weights_only=False)
+ model.load_state_dict(ckpt["policy"])
+ model.eval()
+
+ # Extract features from PTX
+ ptx_source = open("kernel.ptx").read()
+ features = extract_features_from_ptx(ptx_source)
+
+ # Predict step by step (features stay fixed; only mask and history change)
+ applied = set()
+ for step in range(6):
+     feat_t = torch.tensor(features, dtype=torch.float32)
+     mask_t = torch.tensor(get_action_mask(applied), dtype=torch.float32)
+     hist_t = torch.tensor(get_action_history(applied), dtype=torch.float32)
+
+     action_id = model.get_greedy_action(feat_t, mask_t, hist_t)
+     action = ACTION_NAMES[action_id]
+     if action == "stop":
+         break
+     print(f"Step {step+1}: apply {action}")
+     applied.add(action)
+ ```
+
+ ## Training Details
+
+ ### Algorithm: DA-GRPO
+
+ 1. **BC warm-start** (50 epochs): Clone greedy search trajectories. Best accuracy: 64.5%.
+ 2. **GRPO training** (450 epochs): Hardware-in-the-loop RL with SM clock() cycle measurement.
+    - Group size: 8 rollouts per kernel (1 anchor from reference policy + 7 with forced diverse first actions)
+    - Advantage: MC-GRPO (median baseline per kernel, global z-normalization)
+    - Reward: log(cycles_before / cycles_after) (outcome-only, terminal)
+    - KL penalty: beta=0.01 against BC reference policy
+    - Clipped surrogate: epsilon=0.2
+
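The reward, advantage, and clipping steps listed above can be sketched as follows. This is a reading of the bullet points, not the training code: the exact form of the "global z-normalization" is not given, so dividing by the batch-wide standard deviation without re-centering is an assumption, the rollout cycle counts are hypothetical, and the KL penalty (beta=0.01) is omitted because its estimator is not specified.

```python
import math
import statistics


def reward(cycles_before: float, cycles_after: float) -> float:
    """Terminal reward: log speedup, positive when the kernel got faster."""
    return math.log(cycles_before / cycles_after)


def median_centered(group_rewards: list[float]) -> list[float]:
    """Subtract the per-kernel median reward (median baseline)."""
    med = statistics.median(group_rewards)
    return [r - med for r in group_rewards]


def z_normalize(values: list[float], eps: float = 1e-8) -> list[float]:
    """Scale by the std over all rollouts in the batch (assumed form)."""
    std = statistics.pstdev(values)
    return [v / (std + eps) for v in values]


def clipped_term(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """PPO-style clipped surrogate for one action (to be maximized)."""
    clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    return min(ratio * advantage, clipped * advantage)


# Hypothetical group of 8 rollouts for one kernel with a 1839-cycle baseline
baseline = 1839
rollout_cycles = [849, 1839, 1700, 1900, 1500, 2000, 1839, 1200]

rewards = [reward(baseline, c) for c in rollout_cycles]
advantages = z_normalize(median_centered(rewards))  # batch = this one kernel

print(round(reward(1839, 849), 3))  # 0.773
print([round(a, 2) for a in advantages])
```

With a median baseline, a single very fast rollout does not shift the reference point for the rest of the group, which is presumably why the median is used here instead of the mean.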
+ ### Action Space
+
+ 20 PTX transforms: 17 fall into 5 conflict groups (at most one transform per group may be applied), and 3 are independent:
+ - **Cache hints** (load): cs, cg, ca, cv
+ - **Store cache hints**: cs, wt, wb
+ - **Register budget**: maxnreg 32, 64, 128, 255
+ - **Instruction reorder**: critical_path, interleave, loads_first, stores_last
+ - **Prefetch**: L1, L2
+ - **Vectorize**: loads, stores (independent)
+ - **Split**: vector loads (independent)
+
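A compact sketch of how the conflict groups restrict the next action. It mirrors the behavior of `get_action_mask` in `inference.py` and reuses that file's action names; the helper `allowed_actions` is illustrative and not part of the repository.

```python
ACTION_NAMES = [
    "vec_ld", "vec_st",
    "cache_cs", "cache_cg", "cache_ca", "cache_cv",
    "st_cache_cs", "st_cache_wt", "st_cache_wb",
    "maxnreg_32", "maxnreg_64", "maxnreg_128", "maxnreg_255",
    "reorder_cp", "reorder_il", "reorder_lf", "reorder_sl",
    "prefetch_L1", "prefetch_L2",
    "split_ld",
    "stop",
]

CONFLICT_GROUPS = {
    "cache_hints": {"cache_cs", "cache_cg", "cache_ca", "cache_cv"},
    "store_cache_hints": {"st_cache_cs", "st_cache_wt", "st_cache_wb"},
    "register_budget": {"maxnreg_32", "maxnreg_64", "maxnreg_128", "maxnreg_255"},
    "prefetch": {"prefetch_L1", "prefetch_L2"},
    "reorder": {"reorder_cp", "reorder_il", "reorder_lf", "reorder_sl"},
}


def allowed_actions(applied: set[str]) -> list[str]:
    """Actions still selectable: 'stop' always, never a repeat or a group rival."""
    blocked = set(applied)
    for members in CONFLICT_GROUPS.values():
        if applied & members:
            blocked |= members  # one pick closes the whole group
    return [a for a in ACTION_NAMES if a == "stop" or a not in blocked]


grouped = sum(len(members) for members in CONFLICT_GROUPS.values())
print(grouped, len(ACTION_NAMES) - 1 - grouped)  # 17 3

remaining = allowed_actions({"cache_cg", "vec_ld"})
print(len(remaining))  # 16
```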
+ ### Hardware
+
+ - NVIDIA L4 GPU (sm_89, Ada Lovelace)
+ - SM clock() cycle counter (1-cycle std dev, 200 samples per measurement)
+ - pip-installed CUDA 12.9 ptxas
+
+ ## Limitations
+
+ - Trained on gemm_tile kernels only (64 variants, m,n,k in {2,4,6,8})
+ - Mode collapse: 52/64 kernels get the same sequence (vec_st + vec_ld + maxnreg_128)
+ - MLP can't read PTX code; it relies on 25 scalar features
+ - 3% measurement error rate from cudaErrorMisalignedAddress on gemm_tile(4,6,4)
+ - Reorder transforms deadlock on kernels with bar.sync barriers
+
+ ## References
+
+ - [CuAsmRL (CGO 2025)](https://arxiv.org/abs/2501.08071): PPO on SASS scheduling
+ - [Dr. Kernel (2026)](https://arxiv.org/abs/2602.05885): REINFORCE for Triton kernels
+ - [DeepSeek-R1 (2025)](https://arxiv.org/abs/2501.12948): GRPO algorithm
+ - [MC-GRPO (2025)](https://arxiv.org/abs/2601.22582): Median-centered baseline
+
+ ## License
+
+ Research prototype. Contact for usage terms.
bc_stats.json ADDED
@@ -0,0 +1,107 @@
+ {
+   "bc_loss": [
+     2.0958820213222134,
+     1.7892995409983925,
+     1.7074809244700841,
+     1.5575190510989156,
+     1.441180571626052,
+     1.413911653761698,
+     1.4011170574136682,
+     1.364859262028256,
+     1.342358095765574,
+     1.338929759949791,
+     1.3145962235550162,
+     1.2938860436203856,
+     1.2686388336093268,
+     1.2415368708864603,
+     1.2314557232911982,
+     1.1912865268217552,
+     1.2434098955286976,
+     1.223619041755853,
+     1.192927201742371,
+     1.2041826229758243,
+     1.209502972230948,
+     1.178917215137408,
+     1.1955577655188365,
+     1.178613586315317,
+     1.1684499366863355,
+     1.1616653519247488,
+     1.1482060889940005,
+     1.1699149070099053,
+     1.1195348450576017,
+     1.1293887990782159,
+     1.1804475940792718,
+     1.112368142052507,
+     1.0938795485091486,
+     1.1037644244529106,
+     1.0885746230489959,
+     1.0920597482832242,
+     1.0800647707979651,
+     1.049601698474074,
+     1.0656030951319515,
+     1.1000592331168275,
+     1.0485037210825328,
+     1.0458013574128906,
+     1.0802548773960718,
+     1.0371340227863504,
+     1.0327755164455723,
+     1.0279540154637057,
+     1.0051751578636612,
+     1.0235914099515635,
+     1.0102946024143558,
+     0.9888515271045066
+   ],
+   "bc_accuracy": [
+     0.3938223938223938,
+     0.47104247104247104,
+     0.5752895752895753,
+     0.5752895752895753,
+     0.5444015444015444,
+     0.5598455598455598,
+     0.5752895752895753,
+     0.5752895752895753,
+     0.5598455598455598,
+     0.5598455598455598,
+     0.5714285714285714,
+     0.5791505791505791,
+     0.5714285714285714,
+     0.5752895752895753,
+     0.5598455598455598,
+     0.5945945945945946,
+     0.5637065637065637,
+     0.5868725868725869,
+     0.6023166023166023,
+     0.583011583011583,
+     0.6061776061776062,
+     0.6061776061776062,
+     0.5675675675675675,
+     0.5868725868725869,
+     0.5945945945945946,
+     0.6138996138996139,
+     0.5984555984555985,
+     0.5984555984555985,
+     0.6254826254826255,
+     0.5907335907335908,
+     0.5868725868725869,
+     0.6023166023166023,
+     0.6138996138996139,
+     0.5945945945945946,
+     0.6332046332046332,
+     0.6216216216216216,
+     0.6138996138996139,
+     0.6177606177606177,
+     0.6447876447876448,
+     0.6254826254826255,
+     0.6216216216216216,
+     0.6061776061776062,
+     0.5907335907335908,
+     0.6061776061776062,
+     0.6293436293436293,
+     0.6216216216216216,
+     0.6447876447876448,
+     0.6254826254826255,
+     0.6254826254826255,
+     0.6447876447876448
+   ],
+   "best_accuracy": 0.6447876447876448
+ }
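A short sketch of how these statistics might be summarized, assuming one list entry per warm-start epoch. The helper `summarize_bc` is hypothetical; the sample uses the first three epochs copied from the file above.

```python
import json


def summarize_bc(stats: dict) -> dict:
    """Report epoch count, first epoch with peak accuracy, and final loss."""
    acc = stats["bc_accuracy"]
    loss = stats["bc_loss"]
    best_idx = max(range(len(acc)), key=acc.__getitem__)  # first maximum wins
    return {
        "epochs": len(acc),
        "best_epoch": best_idx + 1,  # 1-indexed
        "best_accuracy": acc[best_idx],
        "final_loss": loss[-1],
    }


# First three epochs copied from bc_stats.json
sample = {
    "bc_loss": [2.0958820213222134, 1.7892995409983925, 1.7074809244700841],
    "bc_accuracy": [0.3938223938223938, 0.47104247104247104, 0.5752895752895753],
}
print(summarize_bc(sample))

# With the real file:
# with open("bc_stats.json") as f:
#     print(summarize_bc(json.load(f)))
```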
checkpoint_best.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4601dcda039965c4a57461e7fa32d71259302f334be34f3fe8ac99cc08f6f937
+ size 469964
checkpoint_latest.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bf261b22976e88acde2df07bc5f886bae7c218c03270e1ded086c31fad4bcfc8
+ size 476816
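The two checkpoint entries are Git LFS pointer files: each records the SHA-256 digest and byte size of the real checkpoint. A minimal sketch for checking a downloaded file against its pointer; the helper names `parse_lfs_pointer` and `matches_pointer` are illustrative.

```python
import hashlib


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"algo": algo, "digest": digest, "size": int(fields["size"])}


def matches_pointer(path: str, pointer: dict) -> bool:
    """True if the file at `path` has the size and digest named by the pointer."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


# Pointer for checkpoint_best.pt, copied from the diff above
POINTER_BEST = """\
version https://git-lfs.github.com/spec/v1
oid sha256:4601dcda039965c4a57461e7fa32d71259302f334be34f3fe8ac99cc08f6f937
size 469964
"""

print(parse_lfs_pointer(POINTER_BEST))
# After downloading: matches_pointer("checkpoint_best.pt", parse_lfs_pointer(POINTER_BEST))
```

Verifying the digest before loading is a reasonable precaution here, since the loading code calls `torch.load` with `weights_only=False`, which unpickles arbitrary objects.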
inference.py ADDED
@@ -0,0 +1,421 @@
+ """Chronos PoC: PTX transform selection inference.
+
+ Loads a trained TransformPolicy checkpoint and predicts the optimal
+ sequence of PTX transforms for a given kernel.
+
+ Usage:
+     python inference.py --checkpoint checkpoint_best.pt --kernel gemm_tile --m 4 --n 6 --k 8  # demo, fixed synthetic features
+     python inference.py --checkpoint checkpoint_best.pt --ptx path/to/kernel.ptx
+ """
+
+ import argparse
+ import sys
+ import os
+ import json
+
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from torch.distributions import Categorical
+
+
+ # ---------------------------------------------------------------------------
+ # Model definition (self-contained, no external dependencies for inference)
+ # ---------------------------------------------------------------------------
+
+ N_FEATURES = 25  # Model was trained with 25 scalar features
+ N_ACTIONS = 21
+
+ ACTION_NAMES = [
+     "vec_ld", "vec_st",
+     "cache_cs", "cache_cg", "cache_ca", "cache_cv",
+     "st_cache_cs", "st_cache_wt", "st_cache_wb",
+     "maxnreg_32", "maxnreg_64", "maxnreg_128", "maxnreg_255",
+     "reorder_cp", "reorder_il", "reorder_lf", "reorder_sl",
+     "prefetch_L1", "prefetch_L2",
+     "split_ld",
+     "stop",
+ ]
+
+ FEATURE_NAMES = [
+     "n_instructions", "n_ld_global", "n_st_global", "n_fma",
+     "n_ld_param", "n_prefetch", "n_branch",
+     "n_ld_global_vec", "n_st_global_vec", "vec_ld_ratio", "vec_st_ratio",
+     "n_cache_hint_ld", "n_cache_hint_st", "hint_ld_ratio", "hint_st_ratio",
+     "load_ratio", "store_ratio", "fma_ratio", "compute_ratio",
+     "mem_ratio", "compute_to_mem",
+     "total_regs", "n_f32_regs", "n_b64_regs", "maxnreg",
+ ]
+
+ CONFLICT_GROUPS = {
+     "cache_hints": {"cache_cs", "cache_cg", "cache_ca", "cache_cv"},
+     "store_cache_hints": {"st_cache_cs", "st_cache_wt", "st_cache_wb"},
+     "register_budget": {"maxnreg_32", "maxnreg_64", "maxnreg_128", "maxnreg_255"},
+     "prefetch": {"prefetch_L1", "prefetch_L2"},
+     "reorder": {"reorder_cp", "reorder_il", "reorder_lf", "reorder_sl"},
+ }
+
+
+ class TransformPolicy(nn.Module):
+     """MLP policy for PTX transform selection.
+
+     Input: 25 features + 21 action mask + 21 action history = 67 dims
+     Output: 21 logits (masked before softmax)
+     """
+
+     def __init__(self, hidden=128):
+         super().__init__()
+         input_dim = N_FEATURES + N_ACTIONS + N_ACTIONS  # 67
+         self.net = nn.Sequential(
+             nn.Linear(input_dim, hidden),
+             nn.ReLU(),
+             nn.Dropout(0.1),
+             nn.Linear(hidden, hidden),
+             nn.ReLU(),
+             nn.Dropout(0.1),
+             nn.Linear(hidden, N_ACTIONS),
+         )
+
+     def forward(self, features, action_mask, action_history):
+         x = torch.cat([features, action_mask, action_history], dim=-1)
+         logits = self.net(x)
+         logits = logits.masked_fill(action_mask == 0, float('-inf'))
+         return logits
+
+     @torch.no_grad()
+     def get_greedy_action(self, features, action_mask, action_history):
+         logits = self.forward(
+             features.unsqueeze(0), action_mask.unsqueeze(0),
+             action_history.unsqueeze(0),
+         )
+         return logits.argmax(dim=-1).item()
+
+     @torch.no_grad()
+     def get_action_probs(self, features, action_mask, action_history):
+         logits = self.forward(
+             features.unsqueeze(0), action_mask.unsqueeze(0),
+             action_history.unsqueeze(0),
+         )
+         probs = F.softmax(logits, dim=-1)
+         return probs.squeeze(0)
+
+
+ # ---------------------------------------------------------------------------
+ # Feature extraction (self-contained, regex-based)
+ # ---------------------------------------------------------------------------
+
+ import re
+
+ _LD_GLOBAL = re.compile(r'ld\.global')
+ _LD_GLOBAL_VEC = re.compile(r'ld\.global(?:\.\w+)*\.v[24]')
+ _ST_GLOBAL = re.compile(r'st\.global')
+ _ST_GLOBAL_VEC = re.compile(r'st\.global(?:\.\w+)*\.v[24]')
+ _FMA = re.compile(r'\bfma\.')
+ _MUL = re.compile(r'\bmul\.')
+ _ADD = re.compile(r'\badd\.')
+ _LD_PARAM = re.compile(r'ld\.param')
+ _PREFETCH = re.compile(r'prefetch\.global')
+ _CACHE_HINT_LD = re.compile(r'ld\.global\.(?:cs|cg|ca|cv)')
+ _CACHE_HINT_ST = re.compile(r'st\.global\.(?:wb|wt|cs)')
+ _MAXNREG = re.compile(r'\.maxnreg\s+(\d+)')
+
+
+ def extract_features_from_ptx(ptx_source):
+     """Extract 25 scalar features from PTX source text."""
+     n_instr = 0
+     n_ld_global = 0
+     n_ld_global_vec = 0
+     n_st_global = 0
+     n_st_global_vec = 0
+     n_fma = 0
+     n_mul = 0
+     n_add = 0
+     n_ld_param = 0
+     n_prefetch = 0
+     n_cache_hint_ld = 0
+     n_cache_hint_st = 0
+     n_branch = 0
+
+     # Parse register declarations
+     reg_decls = {}
+     for line in ptx_source.split('\n'):
+         m = re.search(r'\.reg\s+(\.\w+)\s+%\w+<(\d+)>\s*;', line)
+         if m:
+             reg_decls[m.group(1)] = int(m.group(2))
+
+     # Count instructions (lines between { and })
+     in_body = False
+     for line in ptx_source.split('\n'):
+         stripped = line.strip()
+
+         if stripped == '{':
+             in_body = True
+             continue
+         if stripped == '}':
+             in_body = False
+             continue
+         if not in_body:
+             continue
+
+         # Skip non-instructions
+         if not stripped or stripped.startswith('//') or stripped.startswith('.'):
+             continue
+         if stripped.endswith(':'):  # label
+             continue
+         if stripped in ('ret;', 'exit;', ')', ','):
+             continue
+
+         # Check for branch
+         if 'bra ' in stripped or 'bra\t' in stripped:
+             n_branch += 1
+             continue
+
+         n_instr += 1
+
+         if _LD_GLOBAL.search(line):
+             n_ld_global += 1
+             if _LD_GLOBAL_VEC.search(line):
+                 n_ld_global_vec += 1
+             if _CACHE_HINT_LD.search(line):
+                 n_cache_hint_ld += 1
+         if _ST_GLOBAL.search(line):
+             n_st_global += 1
+             if _ST_GLOBAL_VEC.search(line):
+                 n_st_global_vec += 1
+             if _CACHE_HINT_ST.search(line):
+                 n_cache_hint_st += 1
+         if _FMA.search(line):
+             n_fma += 1
+         if _MUL.search(line):
+             n_mul += 1
+         if _ADD.search(line):
+             n_add += 1
+         if _LD_PARAM.search(line):
+             n_ld_param += 1
+         if _PREFETCH.search(line):
+             n_prefetch += 1
+
+     maxnreg = 0
+     m = _MAXNREG.search(ptx_source)
+     if m:
+         maxnreg = int(m.group(1))
+
+     total_regs = sum(reg_decls.values())
+     n_f32_regs = reg_decls.get('.f32', 0)
+     n_b64_regs = reg_decls.get('.b64', 0)
+     n_total = max(n_instr, 1)
+     n_compute = n_fma + n_mul + n_add
+     n_mem = n_ld_global + n_st_global
+
+     return [
+         n_instr,
+         n_ld_global,
+         n_st_global,
+         n_fma,
+         n_ld_param,
+         n_prefetch,
+         n_branch,
+         n_ld_global_vec,
+         n_st_global_vec,
+         round(n_ld_global_vec / max(n_ld_global, 1), 4),  # vec_ld_ratio
+         round(n_st_global_vec / max(n_st_global, 1), 4),  # vec_st_ratio
+         n_cache_hint_ld,
+         n_cache_hint_st,
+         round(n_cache_hint_ld / max(n_ld_global, 1), 4),  # hint_ld_ratio
+         round(n_cache_hint_st / max(n_st_global, 1), 4),  # hint_st_ratio
+         round(n_ld_global / n_total, 4),  # load_ratio
+         round(n_st_global / n_total, 4),  # store_ratio
+         round(n_fma / n_total, 4),  # fma_ratio
+         round(n_compute / n_total, 4),  # compute_ratio
+         round(n_mem / n_total, 4),  # mem_ratio
+         round(n_compute / max(n_mem, 1), 4),  # compute_to_mem
+         total_regs,
+         n_f32_regs,
+         n_b64_regs,
+         maxnreg,
+     ]
+
+
+ # ---------------------------------------------------------------------------
+ # Action mask and history
+ # ---------------------------------------------------------------------------
+
+ def get_action_mask(applied_set):
+     mask = []
+     for label in ACTION_NAMES:
+         if label == "stop":
+             mask.append(1)
+             continue
+         if label in applied_set:
+             mask.append(0)
+             continue
+         conflict = False
+         for group_labels in CONFLICT_GROUPS.values():
+             if label in group_labels and applied_set & group_labels:
+                 conflict = True
+                 break
+         mask.append(0 if conflict else 1)
+     return mask
+
+
+ def get_action_history(applied_set):
+     return [1 if name in applied_set else 0 for name in ACTION_NAMES]
+
+
+ # ---------------------------------------------------------------------------
+ # Inference
+ # ---------------------------------------------------------------------------
+
+ def load_model(checkpoint_path, device="cpu"):
+     """Load trained TransformPolicy from checkpoint."""
+     ckpt = torch.load(checkpoint_path, map_location=device, weights_only=False)
+     model = TransformPolicy(hidden=128)
+     model.load_state_dict(ckpt["policy"])
+     model.eval()
+     model.to(device)
+     epoch = ckpt.get("epoch", "unknown")
+     print(f"Loaded checkpoint from epoch {epoch}")
+     if "eval_result" in ckpt:
+         mean_imp = ckpt["eval_result"].get("mean_improvement", 0)
+         print(f"  Eval mean improvement: {mean_imp*100:.1f}%")
+     return model
+
+
+ def predict_transforms(model, ptx_source, max_steps=6, verbose=True):
+     """Predict optimal transform sequence for a PTX kernel.
+
+     Returns list of transform labels (excluding 'stop').
+     """
+     features = extract_features_from_ptx(ptx_source)
+     applied = set()
+     actions = []
+
+     if verbose:
+         print(f"\nKernel: {features[0]} instructions, "
+               f"{features[1]} global loads, {features[2]} global stores, "
+               f"{features[3]} FMA, {features[21]} total regs")
+
+     for step in range(max_steps):
+         feat_t = torch.tensor(features, dtype=torch.float32)
+         mask = get_action_mask(applied)
+         mask_t = torch.tensor(mask, dtype=torch.float32)
+         hist = get_action_history(applied)
+         hist_t = torch.tensor(hist, dtype=torch.float32)
+
+         action_id = model.get_greedy_action(feat_t, mask_t, hist_t)
+         action_label = ACTION_NAMES[action_id]
+
+         if verbose:
+             probs = model.get_action_probs(feat_t, mask_t, hist_t)
+             top5 = torch.topk(probs, min(5, probs.size(0)))
+             top5_str = ", ".join(
+                 f"{ACTION_NAMES[i]}={p:.2f}"
+                 for p, i in zip(top5.values.tolist(), top5.indices.tolist())
+             )
+             print(f"  Step {step+1}: {action_label} (top5: {top5_str})")
+
+         if action_label == "stop":
+             break
+
+         actions.append(action_label)
+         applied.add(action_label)
+
+     if verbose:
+         print(f"\nPredicted sequence: {' -> '.join(actions) if actions else '(no transforms)'}")
+
+     return actions
+
+
+ # ---------------------------------------------------------------------------
+ # Main
+ # ---------------------------------------------------------------------------
+
+ def main():
+     parser = argparse.ArgumentParser(description="Chronos PoC: PTX transform inference")
+     parser.add_argument("--checkpoint", required=True, help="Path to .pt checkpoint")
+     parser.add_argument("--ptx", help="Path to PTX file")
+     parser.add_argument("--kernel", default="gemm_tile",
+                         help="Kernel label for the demo (currently unused; no PTX is generated)")
+     parser.add_argument("--m", type=int, default=4)
+     parser.add_argument("--n", type=int, default=6)
+     parser.add_argument("--k", type=int, default=8)
+     args = parser.parse_args()
+
+     model = load_model(args.checkpoint)
+
+     if args.ptx:
+         with open(args.ptx) as f:
+             ptx_source = f.read()
+         print(f"\nLoaded PTX from: {args.ptx}")
+     else:
+         print(f"\nTo run on a specific kernel, use: --ptx path/to/kernel.ptx")
+         print("Showing demo with a sample feature vector...")
+
+         # Demo: create a synthetic feature vector matching gemm_tile(4,6,8)
+         # (the best kernel from training: -53.8% improvement)
+         demo_features = [
+             170,    # n_instructions
+             16,     # n_ld_global
+             8,      # n_st_global
+             48,     # n_fma
+             12,     # n_ld_param
+             0,      # n_prefetch
+             2,      # n_branch
+             0,      # n_ld_global_vec
+             0,      # n_st_global_vec
+             0.0,    # vec_ld_ratio
+             0.0,    # vec_st_ratio
+             0,      # n_cache_hint_ld
+             0,      # n_cache_hint_st
+             0.0,    # hint_ld_ratio
+             0.0,    # hint_st_ratio
+             0.094,  # load_ratio
+             0.047,  # store_ratio
+             0.282,  # fma_ratio
+             0.388,  # compute_ratio
+             0.141,  # mem_ratio
+             2.75,   # compute_to_mem
+             95,     # total_regs
+             48,     # n_f32_regs
+             16,     # n_b64_regs
+             0,      # maxnreg
+         ]
+
+         applied = set()
+         actions = []
+         print(f"\nDemo: gemm_tile({args.m},{args.n},{args.k})-like features")
+         print(f"Features: {len(demo_features)} dims")
+
+         for step in range(6):
+             feat_t = torch.tensor(demo_features, dtype=torch.float32)
+             mask = get_action_mask(applied)
+             mask_t = torch.tensor(mask, dtype=torch.float32)
+             hist = get_action_history(applied)
+             hist_t = torch.tensor(hist, dtype=torch.float32)
+
+             action_id = model.get_greedy_action(feat_t, mask_t, hist_t)
+             action_label = ACTION_NAMES[action_id]
+
+             probs = model.get_action_probs(feat_t, mask_t, hist_t)
+             top3 = torch.topk(probs, min(3, probs.size(0)))
+             top3_str = ", ".join(
+                 f"{ACTION_NAMES[i]}={p:.2f}"
+                 for p, i in zip(top3.values.tolist(), top3.indices.tolist())
+             )
+             print(f"  Step {step+1}: {action_label} (probs: {top3_str})")
+
+             if action_label == "stop":
+                 break
+             actions.append(action_label)
+             applied.add(action_label)
+
+         print(f"\nPredicted: {' -> '.join(actions)}")
+         print(f"Expected for gemm_tile(4,6,8): maxnreg_128 -> vec_ld -> vec_st -> stop")
+         return
+
+     actions = predict_transforms(model, ptx_source)
+     print(f"\nTo apply these transforms, use the Chronos transform pipeline.")
+
+
+ if __name__ == "__main__":
+     main()
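To make the regex-based feature extraction in `inference.py` concrete, the sketch below applies three of its load patterns to a few PTX lines. The patterns are copied from the file; the PTX lines and the helper `classify_load` are hypothetical examples, not taken from the training kernels.

```python
import re

# Patterns copied from inference.py
LD_GLOBAL = re.compile(r'ld\.global')
LD_GLOBAL_VEC = re.compile(r'ld\.global(?:\.\w+)*\.v[24]')
CACHE_HINT_LD = re.compile(r'ld\.global\.(?:cs|cg|ca|cv)')


def classify_load(line: str) -> dict:
    """Report which load-related counters a single PTX line would increment."""
    return {
        "global_load": bool(LD_GLOBAL.search(line)),
        "vectorized": bool(LD_GLOBAL_VEC.search(line)),
        "cache_hint": bool(CACHE_HINT_LD.search(line)),
    }


# Hypothetical PTX lines
samples = [
    "ld.global.f32 %f1, [%rd1];",
    "ld.global.cs.v4.f32 {%f1, %f2, %f3, %f4}, [%rd1];",
    "ld.global.nc.v2.f32 {%f5, %f6}, [%rd2];",
    "st.global.f32 [%rd3], %f7;",
]

for line in samples:
    print(classify_load(line), "<-", line)
```

One consequence of these patterns is that a hint counts only when it directly follows `.global` and is one of cs/cg/ca/cv, so a load such as `ld.global.nc` registers as a global load without a cache hint.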
measure_triton_results.json ADDED
@@ -0,0 +1,2733 @@
+ {
+ "n_kernels": 33,
+ "n_improved": 17,
+ "elapsed_s": 83.0,
+ "transform_stats": {
+ "cache_cs": {
+ "n": 29,
+ "mean_delta": -1.16,
+ "median_delta": -0.12,
+ "best": -13.13,
+ "worst": 1.09,
+ "improved_count": 6,
+ "degraded_count": 1,
+ "errors": 0
+ },
+ "cache_cg": {
+ "n": 29,
+ "mean_delta": 9.63,
+ "median_delta": 0.26,
+ "best": -3.28,
+ "worst": 96.43,
+ "improved_count": 4,
+ "degraded_count": 11,
+ "errors": 0
+ },
+ "cache_ca": {
+ "n": 29,
+ "mean_delta": -0.92,
+ "median_delta": -0.12,
+ "best": -10.58,
+ "worst": 1.15,
+ "improved_count": 5,
+ "degraded_count": 1,
+ "errors": 0
+ },
+ "cache_cv": {
+ "n": 29,
+ "mean_delta": 9.67,
+ "median_delta": 0.26,
+ "best": -3.37,
+ "worst": 96.17,
+ "improved_count": 4,
+ "degraded_count": 10,
+ "errors": 0
+ },
+ "st_cache_cs": {
+ "n": 33,
+ "mean_delta": -0.75,
+ "median_delta": -0.12,
+ "best": -10.99,
+ "worst": 1.2,
+ "improved_count": 5,
+ "degraded_count": 1,
+ "errors": 0
+ },
+ "st_cache_wt": {
+ "n": 33,
+ "mean_delta": 0.08,
+ "median_delta": 0.0,
+ "best": -2.98,
+ "worst": 5.72,
+ "improved_count": 1,
+ "degraded_count": 1,
+ "errors": 0
+ },
+ "st_cache_wb": {
+ "n": 33,
+ "mean_delta": -0.81,
+ "median_delta": -0.03,
+ "best": -11.5,
+ "worst": 1.15,
+ "improved_count": 5,
+ "degraded_count": 1,
+ "errors": 0
+ },
+ "maxnreg_32": {
+ "n": 33,
+ "mean_delta": 36.02,
+ "median_delta": 0.38,
+ "best": -4.26,
+ "worst": 366.65,
+ "improved_count": 5,
+ "degraded_count": 12,
+ "errors": 0
+ },
+ "maxnreg_64": {
+ "n": 33,
+ "mean_delta": 12.76,
+ "median_delta": 0.0,
+ "best": -7.28,
+ "worst": 191.22,
+ "improved_count": 5,
+ "degraded_count": 8,
+ "errors": 0
+ },
+ "maxnreg_128": {
+ "n": 33,
+ "mean_delta": 3.72,
+ "median_delta": 0.04,
+ "best": -3.88,
+ "worst": 89.88,
+ "improved_count": 2,
+ "degraded_count": 5,
+ "errors": 0
+ },
+ "maxnreg_255": {
+ "n": 33,
+ "mean_delta": -2.26,
+ "median_delta": -0.21,
+ "best": -46.95,
+ "worst": 1.07,
+ "improved_count": 7,
+ "degraded_count": 1,
+ "errors": 0
+ },
+ "reorder_cp": {
+ "n": 16,
+ "mean_delta": -0.06,
+ "median_delta": -0.04,
+ "best": -7.65,
+ "worst": 5.3,
+ "improved_count": 2,
+ "degraded_count": 3,
+ "errors": 14
+ },
+ "reorder_il": {
+ "n": 16,
+ "mean_delta": -2.31,
+ "median_delta": -0.15,
+ "best": -19.02,
+ "worst": 2.27,
+ "improved_count": 5,
+ "degraded_count": 1,
+ "errors": 14
+ },
+ "reorder_lf": {
+ "n": 16,
+ "mean_delta": -0.37,
+ "median_delta": 0.0,
+ "best": -9.98,
+ "worst": 3.42,
+ "improved_count": 2,
+ "degraded_count": 3,
+ "errors": 14
+ },
+ "reorder_sl": {
+ "n": 16,
+ "mean_delta": -0.89,
+ "median_delta": -0.23,
+ "best": -9.72,
+ "worst": 9.94,
+ "improved_count": 4,
+ "degraded_count": 2,
+ "errors": 14
+ }
+ },
+ "kernel_results": {
+ "triton_vector_add_256": {
+ "source": "triton_kernels",
+ "baseline": 944,
+ "baseline_std": 29.49187981462016,
+ "baseline_error": null,
+ "transforms": {
+ "cache_cs": {
+ "cycles": 941,
+ "std": 27.570924902875493,
+ "delta_pct": -0.32
+ },
+ "cache_cg": {
+ "cycles": 934,
+ "std": 27.46245436955699,
+ "delta_pct": -1.06
+ },
+ "cache_ca": {
+ "cycles": 937,
+ "std": 26.23790197405273,
+ "delta_pct": -0.74
+ },
+ "cache_cv": {
+ "cycles": 933,
+ "std": 28.072007409517404,
+ "delta_pct": -1.17
+ },
+ "st_cache_cs": {
+ "cycles": 939,
+ "std": 25.358580401907357,
+ "delta_pct": -0.53
+ },
+ "st_cache_wt": {
+ "cycles": 941,
+ "std": 27.90693417414389,
+ "delta_pct": -0.32
+ },
+ "st_cache_wb": {
+ "cycles": 937,
+ "std": 26.86858946800148,
+ "delta_pct": -0.74
+ },
+ "maxnreg_32": {
+ "cycles": 939,
+ "std": 28.326063969425757,
+ "delta_pct": -0.53
+ },
+ "maxnreg_64": {
+ "cycles": 940,
+ "std": 27.253641499806957,
+ "delta_pct": -0.42
+ },
+ "maxnreg_128": {
+ "cycles": 941,
+ "std": 23.985612354075933,
+ "delta_pct": -0.32
+ },
+ "maxnreg_255": {
+ "cycles": 938,
+ "std": 28.52321160037908,
+ "delta_pct": -0.64
+ },
+ "reorder_cp": {
+ "cycles": 941,
+ "std": 28.676645549994163,
+ "delta_pct": -0.32
+ },
+ "reorder_il": {
+ "cycles": 938,
+ "std": 25.202558897857973,
+ "delta_pct": -0.64
+ },
+ "reorder_lf": {
+ "cycles": 935,
+ "std": 27.466561033372926,
+ "delta_pct": -0.95
+ },
+ "reorder_sl": {
+ "cycles": 937,
+ "std": 30.11105735440056,
+ "delta_pct": -0.74
+ }
+ }
+ },
+ "triton_vector_add_512": {
+ "source": "triton_kernels",
+ "baseline": 1068,
+ "baseline_std": 60.74178442390378,
+ "baseline_error": null,
+ "transforms": {
+ "cache_cs": {
+ "cycles": 1074,
+ "std": 63.366669274942964,
+ "delta_pct": 0.56
+ },
+ "cache_cg": {
+ "cycles": 1033,
+ "std": 43.91783208447339,
+ "delta_pct": -3.28
+ },
+ "cache_ca": {
+ "cycles": 1062,
+ "std": 58.1985736509066,
+ "delta_pct": -0.56
+ },
+ "cache_cv": {
+ "cycles": 1032,
+ "std": 46.53583135606368,
+ "delta_pct": -3.37
+ },
+ "st_cache_cs": {
+ "cycles": 1076,
+ "std": 58.655327976237594,
+ "delta_pct": 0.75
+ },
+ "st_cache_wt": {
+ "cycles": 1070,
+ "std": 55.24308983936362,
+ "delta_pct": 0.19
+ },
+ "st_cache_wb": {
+ "cycles": 1072,
+ "std": 61.67726465238224,
+ "delta_pct": 0.37
+ },
+ "maxnreg_32": {
+ "cycles": 1065,
+ "std": 58.49837412270532,
+ "delta_pct": -0.28
+ },
+ "maxnreg_64": {
+ "cycles": 1071,
+ "std": 58.60743894762848,
+ "delta_pct": 0.28
+ },
+ "maxnreg_128": {
+ "cycles": 1075,
+ "std": 59.72745578877439,
+ "delta_pct": 0.66
+ },
+ "maxnreg_255": {
+ "cycles": 1066,
+ "std": 56.25154908978063,
+ "delta_pct": -0.19
+ },
+ "reorder_cp": {
+ "cycles": 1069,
+ "std": 61.26006019422442,
+ "delta_pct": 0.09
+ },
+ "reorder_il": {
+ "cycles": 1069,
+ "std": 61.40023697022675,
+ "delta_pct": 0.09
+ },
+ "reorder_lf": {
+ "cycles": 1070,
+ "std": 61.17932473475006,
+ "delta_pct": 0.19
+ },
+ "reorder_sl": {
+ "cycles": 1063,
+ "std": 59.396992979442985,
+ "delta_pct": -0.47
+ }
+ }
+ },
+ "triton_vector_add_1024": {
+ "source": "triton_kernels",
+ "baseline": 1219,
+ "baseline_std": 42.60126171840454,
+ "baseline_error": null,
+ "transforms": {
+ "cache_cs": {
+ "cycles": 1217,
+ "std": 55.49619356316251,
+ "delta_pct": -0.16
+ },
+ "cache_cg": {
+ "cycles": 1233,
+ "std": 38.63293898993448,
+ "delta_pct": 1.15
+ },
+ "cache_ca": {
+ "cycles": 1217,
+ "std": 37.32690048745007,
+ "delta_pct": -0.16
+ },
+ "cache_cv": {
+ "cycles": 1231,
+ "std": 55.58792584725571,
+ "delta_pct": 0.98
+ },
+ "st_cache_cs": {
+ "cycles": 1220,
+ "std": 29.24901323121859,
+ "delta_pct": 0.08
+ },
+ "st_cache_wt": {
+ "cycles": 1219,
+ "std": 51.917569039776886,
+ "delta_pct": 0.0
+ },
+ "st_cache_wb": {
+ "cycles": 1218,
+ "std": 55.40694879705974,
+ "delta_pct": -0.08
+ },
+ "maxnreg_32": {
+ "cycles": 1219,
+ "std": 41.19195036654613,
+ "delta_pct": 0.0
+ },
+ "maxnreg_64": {
+ "cycles": 1219,
+ "std": 56.34494121036954,
+ "delta_pct": 0.0
+ },
+ "maxnreg_128": {
+ "cycles": 1215,
+ "std": 37.623588345611054,
+ "delta_pct": -0.33
+ },
+ "maxnreg_255": {
+ "cycles": 1218,
+ "std": 42.37277309782781,
+ "delta_pct": -0.08
+ },
+ "reorder_cp": {
+ "cycles": 1217,
+ "std": 47.77099302924318,
+ "delta_pct": -0.16
+ },
+ "reorder_il": {
+ "cycles": 1217,
+ "std": 35.28370693393766,
+ "delta_pct": -0.16
+ },
+ "reorder_lf": {
+ "cycles": 1220,
+ "std": 56.07920091263784,
+ "delta_pct": 0.08
+ },
+ "reorder_sl": {
+ "cycles": 1220,
+ "std": 46.88487922560961,
+ "delta_pct": 0.08
+ }
+ }
+ },
+ "triton_softmax_1024": {
+ "source": "triton_kernels",
+ "baseline": 3838,
+ "baseline_std": 84.44705974751281,
+ "baseline_error": null,
+ "transforms": {
+ "cache_cs": {
+ "cycles": 3880,
+ "std": 76.23343361019494,
+ "delta_pct": 1.09
+ },
+ "cache_cg": {
+ "cycles": 3843,
+ "std": 62.285206710743125,
+ "delta_pct": 0.13
+ },
+ "cache_ca": {
+ "cycles": 3882,
+ "std": 79.20490499331466,
+ "delta_pct": 1.15
+ },
+ "cache_cv": {
+ "cycles": 3850,
+ "std": 58.920025246091,
+ "delta_pct": 0.31
+ },
+ "st_cache_cs": {
+ "cycles": 3884,
+ "std": 72.98933466610036,
+ "delta_pct": 1.2
+ },
+ "st_cache_wt": {
+ "cycles": 3842,
+ "std": 85.59785496728291,
+ "delta_pct": 0.1
+ },
+ "st_cache_wb": {
+ "cycles": 3882,
+ "std": 74.53467045610385,
+ "delta_pct": 1.15
+ },
+ "maxnreg_32": {
+ "cycles": 3802,
+ "std": 83.56720035396663,
+ "delta_pct": -0.94
+ },
+ "maxnreg_64": {
+ "cycles": 3889,
+ "std": 72.41214262815319,
+ "delta_pct": 1.33
+ },
+ "maxnreg_128": {
+ "cycles": 3848,
+ "std": 86.11081000664201,
+ "delta_pct": 0.26
+ },
+ "maxnreg_255": {
+ "cycles": 3879,
+ "std": 72.32222600971295,
+ "delta_pct": 1.07
+ },
+ "reorder_cp": {
+ "error": "skipped_barrier"
+ },
+ "reorder_il": {
+ "error": "skipped_barrier"
+ },
+ "reorder_lf": {
+ "error": "skipped_barrier"
+ },
+ "reorder_sl": {
+ "error": "skipped_barrier"
+ }
+ }
+ },
+ "triton_layernorm_1024": {
+ "source": "triton_kernels",
+ "baseline": 4556,
+ "baseline_std": 112.65114469014507,
+ "baseline_error": null,
+ "transforms": {
+ "cache_cs": {
+ "cycles": 4523,
+ "std": 100.84496008725473,
+ "delta_pct": -0.72
+ },
+ "cache_cg": {
+ "cycles": 6911,
+ "std": 457.00405643713924,
+ "delta_pct": 51.69
+ },
+ "cache_ca": {
+ "cycles": 4529,
+ "std": 121.65385772757064,
+ "delta_pct": -0.59
+ },
+ "cache_cv": {
+ "cycles": 6876,
+ "std": 449.7087875503435,
+ "delta_pct": 50.92
+ },
+ "st_cache_cs": {
+ "cycles": 4507,
+ "std": 91.85497863480236,
+ "delta_pct": -1.08
+ },
+ "st_cache_wt": {
+ "cycles": 4562,
+ "std": 106.59615893642697,
+ "delta_pct": 0.13
+ },
+ "st_cache_wb": {
+ "cycles": 4534,
+ "std": 115.7763961263262,
+ "delta_pct": -0.48
+ },
+ "maxnreg_32": {
+ "cycles": 4567,
+ "std": 110.52709520746485,
+ "delta_pct": 0.24
+ },
+ "maxnreg_64": {
+ "cycles": 4523,
+ "std": 111.35946019535116,
+ "delta_pct": -0.72
+ },
+ "maxnreg_128": {
+ "cycles": 4565,
+ "std": 125.39940580002762,
+ "delta_pct": 0.2
+ },
+ "maxnreg_255": {
+ "cycles": 4527,
+ "std": 115.90962761996951,
+ "delta_pct": -0.64
+ },
+ "reorder_cp": {
+ "error": "skipped_barrier"
+ },
+ "reorder_il": {
+ "error": "skipped_barrier"
+ },
+ "reorder_lf": {
+ "error": "skipped_barrier"
+ },
+ "reorder_sl": {
+ "error": "skipped_barrier"
+ }
+ }
+ },
+ "triton_matmul_64x64x32_w4": {
+ "source": "triton_kernels",
+ "baseline": 171410,
+ "baseline_std": 501.0587783633772,
+ "baseline_error": null,
+ "transforms": {
+ "st_cache_cs": {
+ "cycles": 171382,
+ "std": 432.3299073335084,
+ "delta_pct": -0.02
+ },
+ "st_cache_wt": {
+ "cycles": 171423,
+ "std": 470.07234483109085,
+ "delta_pct": 0.01
+ },
+ "st_cache_wb": {
+ "cycles": 171367,
+ "std": 460.0975798404073,
+ "delta_pct": -0.03
+ },
+ "maxnreg_32": {
+ "cycles": 425115,
+ "std": 4083.747927991516,
+ "delta_pct": 148.01
+ },
+ "maxnreg_64": {
+ "cycles": 245521,
+ "std": 4866.736257041262,
+ "delta_pct": 43.24
+ },
+ "maxnreg_128": {
+ "cycles": 194120,
+ "std": 4007.7541684059165,
+ "delta_pct": 13.25
+ },
+ "maxnreg_255": {
+ "cycles": 171367,
+ "std": 462.19168531573564,
+ "delta_pct": -0.03
+ },
+ "reorder_cp": {
+ "error": "skipped_barrier"
+ },
+ "reorder_il": {
+ "error": "skipped_barrier"
+ },
+ "reorder_lf": {
+ "error": "skipped_barrier"
+ },
+ "reorder_sl": {
+ "error": "skipped_barrier"
+ }
+ }
+ },
+ "triton_matmul_64x64x32_w8": {
+ "source": "triton_kernels",
+ "baseline": 186277,
+ "baseline_std": 555.2674566368895,
+ "baseline_error": null,
+ "transforms": {
+ "st_cache_cs": {
+ "cycles": 186202,
+ "std": 563.2344040450654,
+ "delta_pct": -0.04
+ },
+ "st_cache_wt": {
+ "cycles": 186236,
+ "std": 542.4974524364146,
+ "delta_pct": -0.02
+ },
+ "st_cache_wb": {
+ "cycles": 186132,
+ "std": 537.6838005510301,
+ "delta_pct": -0.08
+ },
+ "maxnreg_32": {
+ "cycles": 319430,
+ "std": 4796.146295076392,
+ "delta_pct": 71.48
+ },
+ "maxnreg_64": {
+ "cycles": 208291,
+ "std": 4566.849550376605,
+ "delta_pct": 11.82
+ },
+ "maxnreg_128": {
+ "cycles": 186231,
+ "std": 542.7691460464569,
+ "delta_pct": -0.02
+ },
+ "maxnreg_255": {
+ "cycles": 185762,
+ "std": 473.6743501605296,
+ "delta_pct": -0.28
+ },
+ "reorder_cp": {
+ "error": "skipped_barrier"
+ },
+ "reorder_il": {
+ "error": "skipped_barrier"
+ },
+ "reorder_lf": {
+ "error": "skipped_barrier"
+ },
+ "reorder_sl": {
+ "error": "skipped_barrier"
+ }
+ }
+ },
+ "triton_matmul_128x128x32_w4": {
+ "source": "triton_kernels",
+ "baseline": 240682,
+ "baseline_std": 342.63614286878726,
+ "baseline_error": null,
+ "transforms": {
+ "st_cache_cs": {
+ "cycles": 240657,
+ "std": 354.66443206924487,
+ "delta_pct": -0.01
+ },
+ "st_cache_wt": {
+ "cycles": 240660,
+ "std": 354.1097540311479,
+ "delta_pct": -0.01
+ },
+ "st_cache_wb": {
+ "cycles": 240703,
+ "std": 353.70178395789867,
+ "delta_pct": 0.01
+ },
+ "maxnreg_32": {
+ "cycles": 1123144,
+ "std": 1845.049125064154,
+ "delta_pct": 366.65
+ },
+ "maxnreg_64": {
+ "cycles": 700917,
+ "std": 1328.5777667867244,
+ "delta_pct": 191.22
+ },
+ "maxnreg_128": {
+ "cycles": 457011,
+ "std": 1192.7186457731766,
+ "delta_pct": 89.88
+ },
+ "maxnreg_255": {
+ "cycles": 238426,
+ "std": 471.0170941430894,
+ "delta_pct": -0.94
+ }
+ }
+ },
+ "triton_matmul_128x128x32_w8": {
+ "source": "triton_kernels",
+ "baseline": 211438,
+ "baseline_std": 210.94903596603612,
+ "baseline_error": null,
+ "transforms": {
+ "st_cache_cs": {
+ "cycles": 211484,
+ "std": 207.6386823306293,
+ "delta_pct": 0.02
+ },
+ "st_cache_wt": {
+ "cycles": 211417,
+ "std": 192.11210763509936,
+ "delta_pct": -0.01
+ },
+ "st_cache_wb": {
+ "cycles": 211456,
+ "std": 205.09111999304113,
+ "delta_pct": 0.01
+ },
+ "maxnreg_32": {
+ "cycles": 710161,
+ "std": 1178.3205980440978,
+ "delta_pct": 235.87
+ },
+ "maxnreg_64": {
+ "cycles": 425841,
+ "std": 724.849326946642,
+ "delta_pct": 101.4
+ },
+ "maxnreg_128": {
+ "cycles": 244024,
+ "std": 483.0836308290729,
+ "delta_pct": 15.41
+ },
+ "maxnreg_255": {
+ "cycles": 211445,
+ "std": 217.40122119022237,
+ "delta_pct": 0.0
+ },
+ "reorder_cp": {
+ "error": "skipped_barrier"
+ },
+ "reorder_il": {
+ "error": "skipped_barrier"
+ },
+ "reorder_lf": {
+ "error": "skipped_barrier"
+ },
+ "reorder_sl": {
+ "error": "skipped_barrier"
+ }
+ }
+ },
+ "triton_fused_add_mul_256": {
+ "source": "triton_kernels",
+ "baseline": 1332,
+ "baseline_std": 21.35080267811962,
+ "baseline_error": null,
+ "transforms": {
+ "cache_cs": {
+ "cycles": 1331,
+ "std": 34.96618938346013,
+ "delta_pct": -0.08
+ },
+ "cache_cg": {
+ "cycles": 1332,
+ "std": 23.14809711401782,
+ "delta_pct": 0.0
+ },
+ "cache_ca": {
+ "cycles": 1328,
+ "std": 33.876983336773066,
+ "delta_pct": -0.3
+ },
+ "cache_cv": {
+ "cycles": 1329,
+ "std": 23.705819960507586,
+ "delta_pct": -0.23
+ },
+ "st_cache_cs": {
+ "cycles": 1326,
+ "std": 32.04271524075324,
+ "delta_pct": -0.45
+ },
+ "st_cache_wt": {
+ "cycles": 1332,
+ "std": 23.455223298873108,
+ "delta_pct": 0.0
+ },
+ "st_cache_wb": {
+ "cycles": 1332,
+ "std": 37.35344662812255,
+ "delta_pct": 0.0
+ },
+ "maxnreg_32": {
+ "cycles": 1337,
+ "std": 22.76026746327907,
+ "delta_pct": 0.38
+ },
+ "maxnreg_64": {
+ "cycles": 1326,
+ "std": 34.47232186842076,
+ "delta_pct": -0.45
+ },
+ "maxnreg_128": {
+ "cycles": 1334,
+ "std": 24.19508162829793,
+ "delta_pct": 0.15
+ },
+ "maxnreg_255": {
+ "cycles": 1330,
+ "std": 35.5125879090781,
+ "delta_pct": -0.15
+ },
+ "reorder_cp": {
+ "cycles": 1336,
+ "std": 20.016243403795826,
+ "delta_pct": 0.3
+ },
+ "reorder_il": {
+ "cycles": 1330,
+ "std": 35.21678435064735,
+ "delta_pct": -0.15
+ },
+ "reorder_lf": {
+ "cycles": 1335,
+ "std": 20.668061713668266,
+ "delta_pct": 0.23
+ },
+ "reorder_sl": {
+ "cycles": 1330,
+ "std": 33.829409912086845,
+ "delta_pct": -0.15
+ }
+ }
+ },
+ "triton_fused_add_mul_512": {
+ "source": "triton_kernels",
+ "baseline": 1379,
+ "baseline_std": 59.30477552440444,
+ "baseline_error": null,
+ "transforms": {
+ "cache_cs": {
+ "cycles": 1363,
+ "std": 50.025993243512914,
+ "delta_pct": -1.16
+ },
+ "cache_cg": {
+ "cycles": 1356,
+ "std": 58.66862449384679,
+ "delta_pct": -1.67
+ },
+ "cache_ca": {
+ "cycles": 1356,
+ "std": 51.74146886202595,
+ "delta_pct": -1.67
+ },
+ "cache_cv": {
+ "cycles": 1355,
+ "std": 57.26626493844347,
+ "delta_pct": -1.74
+ },
+ "st_cache_cs": {
+ "cycles": 1354,
+ "std": 50.8162766739162,
+ "delta_pct": -1.81
+ },
+ "st_cache_wt": {
+ "cycles": 1380,
+ "std": 60.92277386166851,
+ "delta_pct": 0.07
+ },
+ "st_cache_wb": {
+ "cycles": 1359,
+ "std": 48.36108326950504,
+ "delta_pct": -1.45
+ },
+ "maxnreg_32": {
+ "cycles": 1376,
+ "std": 59.94648425887877,
+ "delta_pct": -0.22
+ },
+ "maxnreg_64": {
+ "cycles": 1355,
+ "std": 48.759710571331325,
+ "delta_pct": -1.74
+ },
+ "maxnreg_128": {
+ "cycles": 1375,
+ "std": 62.22318780004766,
+ "delta_pct": -0.29
+ },
+ "maxnreg_255": {
+ "cycles": 1361,
+ "std": 51.13456365316908,
+ "delta_pct": -1.31
+ },
+ "reorder_cp": {
+ "cycles": 1373,
+ "std": 60.520011359879966,
+ "delta_pct": -0.44
+ },
+ "reorder_il": {
+ "cycles": 1356,
+ "std": 48.52316946573049,
+ "delta_pct": -1.67
+ },
+ "reorder_lf": {
+ "cycles": 1374,
+ "std": 61.745087051521764,
+ "delta_pct": -0.36
+ },
+ "reorder_sl": {
+ "cycles": 1357,
+ "std": 51.92411674742287,
+ "delta_pct": -1.6
+ }
+ }
+ },
+ "triton_fused_add_mul_1024": {
+ "source": "triton_kernels",
+ "baseline": 1647,
+ "baseline_std": 126.03052715513016,
+ "baseline_error": null,
+ "transforms": {
+ "cache_cs": {
+ "cycles": 1620,
+ "std": 163.05165676864493,
+ "delta_pct": -1.64
+ },
+ "cache_cg": {
+ "cycles": 1652,
+ "std": 136.04815728263281,
+ "delta_pct": 0.3
+ },
+ "cache_ca": {
+ "cycles": 1636,
+ "std": 157.5926254619803,
+ "delta_pct": -0.67
+ },
+ "cache_cv": {
+ "cycles": 1651,
+ "std": 137.32488157650093,
+ "delta_pct": 0.24
+ },
+ "st_cache_cs": {
+ "cycles": 1638,
+ "std": 164.70073398439973,
+ "delta_pct": -0.55
+ },
+ "st_cache_wt": {
+ "cycles": 1645,
+ "std": 132.36343112808765,
+ "delta_pct": -0.12
+ },
+ "st_cache_wb": {
+ "cycles": 1636,
+ "std": 158.1481551425751,
+ "delta_pct": -0.67
+ },
+ "maxnreg_32": {
+ "cycles": 1645,
+ "std": 131.74864363628188,
+ "delta_pct": -0.12
+ },
+ "maxnreg_64": {
+ "cycles": 1633,
+ "std": 164.37551247068404,
+ "delta_pct": -0.85
+ },
+ "maxnreg_128": {
+ "cycles": 1644,
+ "std": 136.379041553312,
+ "delta_pct": -0.18
+ },
+ "maxnreg_255": {
+ "cycles": 1637,
+ "std": 164.06369182424245,
+ "delta_pct": -0.61
+ },
+ "reorder_cp": {
+ "cycles": 1645,
+ "std": 142.6154956517699,
+ "delta_pct": -0.12
+ },
+ "reorder_il": {
+ "cycles": 1638,
+ "std": 166.29074658561132,
+ "delta_pct": -0.55
+ },
+ "reorder_lf": {
+ "cycles": 1644,
+ "std": 132.60286233713055,
+ "delta_pct": -0.18
+ },
+ "reorder_sl": {
+ "cycles": 1634,
+ "std": 162.15513983528243,
+ "delta_pct": -0.79
+ }
+ }
+ },
+ "reduction_sum_1024": {
+ "source": "diverse",
+ "baseline": 2283,
+ "baseline_std": 28.91735810892828,
+ "baseline_error": null,
+ "transforms": {
+ "cache_cs": {
+ "cycles": 2279,
+ "std": 29.059206802664107,
+ "delta_pct": -0.18
+ },
+ "cache_cg": {
+ "cycles": 2261,
+ "std": 34.178703530122384,
+ "delta_pct": -0.96
+ },
+ "cache_ca": {
+ "cycles": 2280,
+ "std": 30.294712409923946,
+ "delta_pct": -0.13
+ },
+ "cache_cv": {
+ "cycles": 2257,
+ "std": 36.78278265710739,
+ "delta_pct": -1.14
+ },
+ "st_cache_cs": {
+ "cycles": 2280,
+ "std": 29.946478924908686,
+ "delta_pct": -0.13
+ },
+ "st_cache_wt": {
+ "cycles": 2287,
+ "std": 28.96892775026373,
+ "delta_pct": 0.18
+ },
+ "st_cache_wb": {
+ "cycles": 2287,
+ "std": 31.985745262538437,
+ "delta_pct": 0.18
+ },
+ "maxnreg_32": {
+ "cycles": 2518,
+ "std": 45.40396898950575,
+ "delta_pct": 10.29
+ },
+ "maxnreg_64": {
+ "cycles": 2283,
+ "std": 29.030967948037834,
+ "delta_pct": 0.0
+ },
+ "maxnreg_128": {
+ "cycles": 2284,
+ "std": 27.902578733873323,
+ "delta_pct": 0.04
+ },
+ "maxnreg_255": {
+ "cycles": 2286,
+ "std": 30.912740092072067,
+ "delta_pct": 0.13
+ },
+ "reorder_cp": {
+ "error": "skipped_barrier"
+ },
+ "reorder_il": {
+ "error": "skipped_barrier"
+ },
+ "reorder_lf": {
+ "error": "skipped_barrier"
+ },
+ "reorder_sl": {
+ "error": "skipped_barrier"
+ }
+ }
+ },
+ "reduction_sum_512": {
+ "source": "diverse",
+ "baseline": 1442,
+ "baseline_std": 19.446850644770223,
+ "baseline_error": null,
+ "transforms": {
+ "cache_cs": {
+ "cycles": 1436,
+ "std": 21.160753767292885,
+ "delta_pct": -0.42
+ },
+ "cache_cg": {
+ "cycles": 1440,
+ "std": 17.872425688753054,
+ "delta_pct": -0.14
+ },
+ "cache_ca": {
+ "cycles": 1437,
+ "std": 18.555635262636525,
+ "delta_pct": -0.35
+ },
+ "cache_cv": {
+ "cycles": 1440,
+ "std": 22.142944135773813,
+ "delta_pct": -0.14
+ },
+ "st_cache_cs": {
+ "cycles": 1438,
+ "std": 20.739546161861885,
+ "delta_pct": -0.28
+ },
+ "st_cache_wt": {
+ "cycles": 1441,
+ "std": 19.672681566070246,
+ "delta_pct": -0.07
+ },
+ "st_cache_wb": {
+ "cycles": 1440,
+ "std": 20.03422072355199,
+ "delta_pct": -0.14
+ },
+ "maxnreg_32": {
+ "cycles": 1513,
+ "std": 17.36521738994361,
+ "delta_pct": 4.92
+ },
+ "maxnreg_64": {
+ "cycles": 1438,
+ "std": 18.31485667429587,
+ "delta_pct": -0.28
+ },
+ "maxnreg_128": {
+ "cycles": 1442,
+ "std": 23.397187865211496,
+ "delta_pct": 0.0
+ },
+ "maxnreg_255": {
+ "cycles": 1439,
+ "std": 19.986604889275217,
+ "delta_pct": -0.21
+ },
+ "reorder_cp": {
+ "error": "skipped_barrier"
+ },
+ "reorder_il": {
+ "error": "skipped_barrier"
+ },
+ "reorder_lf": {
+ "error": "skipped_barrier"
+ },
+ "reorder_sl": {
+ "error": "skipped_barrier"
+ }
+ }
+ },
+ "reduction_max_1024": {
+ "source": "diverse",
+ "baseline": 2287,
+ "baseline_std": 29.404587652269498,
+ "baseline_error": null,
+ "transforms": {
+ "cache_cs": {
+ "cycles": 2288,
+ "std": 29.20560049031692,
+ "delta_pct": 0.04
+ },
+ "cache_cg": {
+ "cycles": 2261,
+ "std": 38.18233099222728,
+ "delta_pct": -1.14
+ },
+ "cache_ca": {
+ "cycles": 2290,
+ "std": 31.04470325192367,
+ "delta_pct": 0.13
+ },
+ "cache_cv": {
+ "cycles": 2265,
+ "std": 36.98072335690582,
+ "delta_pct": -0.96
+ },
+ "st_cache_cs": {
+ "cycles": 2283,
+ "std": 33.27393837525098,
+ "delta_pct": -0.17
+ },
+ "st_cache_wt": {
+ "cycles": 2285,
+ "std": 30.970121084684184,
+ "delta_pct": -0.09
+ },
+ "st_cache_wb": {
+ "cycles": 2294,
+ "std": 31.33937459490856,
+ "delta_pct": 0.31
+ },
+ "maxnreg_32": {
+ "cycles": 2525,
+ "std": 47.04580321346421,
+ "delta_pct": 10.41
+ },
+ "maxnreg_64": {
+ "cycles": 2285,
+ "std": 28.196409345872393,
+ "delta_pct": -0.09
+ },
+ "maxnreg_128": {
+ "cycles": 2289,
+ "std": 29.040254819818646,
+ "delta_pct": 0.09
+ },
+ "maxnreg_255": {
+ "cycles": 2285,
+ "std": 31.75385960792798,
+ "delta_pct": -0.09
+ },
+ "reorder_cp": {
+ "error": "skipped_barrier"
+ },
+ "reorder_il": {
+ "error": "skipped_barrier"
+ },
+ "reorder_lf": {
+ "error": "skipped_barrier"
+ },
+ "reorder_sl": {
+ "error": "skipped_barrier"
+ }
+ }
+ },
+ "reduction_max_512": {
+ "source": "diverse",
+ "baseline": 1443,
+ "baseline_std": 21.411211432331427,
+ "baseline_error": null,
+ "transforms": {
+ "cache_cs": {
+ "cycles": 1440,
+ "std": 19.682082587978336,
+ "delta_pct": -0.21
+ },
+ "cache_cg": {
+ "cycles": 1444,
+ "std": 20.720013875477978,
+ "delta_pct": 0.07
+ },
+ "cache_ca": {
+ "cycles": 1442,
+ "std": 19.579233897167683,
+ "delta_pct": -0.07
+ },
+ "cache_cv": {
+ "cycles": 1441,
+ "std": 19.73320488415402,
+ "delta_pct": -0.14
+ },
+ "st_cache_cs": {
+ "cycles": 1440,
+ "std": 18.96362570818144,
+ "delta_pct": -0.21
+ },
+ "st_cache_wt": {
+ "cycles": 1444,
+ "std": 23.070044971781048,
+ "delta_pct": 0.07
+ },
+ "st_cache_wb": {
+ "cycles": 1443,
+ "std": 19.77969666097031,
+ "delta_pct": 0.0
+ },
+ "maxnreg_32": {
+ "cycles": 1514,
+ "std": 18.103860776088617,
+ "delta_pct": 4.92
+ },
+ "maxnreg_64": {
+ "cycles": 1442,
+ "std": 19.005459741874176,
+ "delta_pct": -0.07
+ },
+ "maxnreg_128": {
+ "cycles": 1442,
+ "std": 21.203138918565806,
+ "delta_pct": -0.07
+ },
+ "maxnreg_255": {
+ "cycles": 1440,
+ "std": 19.25943340288078,
+ "delta_pct": -0.21
+ },
+ "reorder_cp": {
+ "error": "skipped_barrier"
+ },
+ "reorder_il": {
+ "error": "skipped_barrier"
+ },
+ "reorder_lf": {
+ "error": "skipped_barrier"
+ },
+ "reorder_sl": {
+ "error": "skipped_barrier"
+ }
+ }
+ },
+ "prefix_scan_1024": {
+ "source": "diverse",
+ "baseline": 1914,
+ "baseline_std": 18.85006896539108,
+ "baseline_error": null,
+ "transforms": {
+ "cache_cs": {
+ "cycles": 1913,
+ "std": 19.857842279563002,
+ "delta_pct": -0.05
+ },
+ "cache_cg": {
+ "cycles": 1951,
+ "std": 26.21342404189121,
+ "delta_pct": 1.93
+ },
+ "cache_ca": {
+ "cycles": 1909,
+ "std": 19.845399970774082,
+ "delta_pct": -0.26
+ },
+ "cache_cv": {
+ "cycles": 1950,
+ "std": 25.577528711742268,
+ "delta_pct": 1.88
+ },
+ "st_cache_cs": {
+ "cycles": 1910,
+ "std": 18.901454309126585,
+ "delta_pct": -0.21
+ },
+ "st_cache_wt": {
+ "cycles": 1913,
+ "std": 18.244280610646175,
+ "delta_pct": -0.05
+ },
+ "st_cache_wb": {
+ "cycles": 1909,
+ "std": 20.177750122350115,
+ "delta_pct": -0.26
+ },
+ "maxnreg_32": {
+ "cycles": 1919,
+ "std": 21.419439768584052,
+ "delta_pct": 0.26
+ },
+ "maxnreg_64": {
+ "cycles": 1914,
+ "std": 24.30736462473873,
+ "delta_pct": 0.0
+ },
+ "maxnreg_128": {
+ "cycles": 1918,
+ "std": 22.96361469803916,
+ "delta_pct": 0.21
+ },
+ "maxnreg_255": {
+ "cycles": 1916,
+ "std": 23.2384331442548,
+ "delta_pct": 0.1
+ },
+ "reorder_cp": {
+ "error": "skipped_barrier"
+ },
+ "reorder_il": {
+ "error": "skipped_barrier"
+ },
+ "reorder_lf": {
+ "error": "skipped_barrier"
+ },
+ "reorder_sl": {
+ "error": "skipped_barrier"
+ }
+ }
+ },
+ "prefix_scan_512": {
+ "source": "diverse",
+ "baseline": 1470,
+ "baseline_std": 24.63637097869733,
+ "baseline_error": null,
+ "transforms": {
+ "cache_cs": {
+ "cycles": 1467,
+ "std": 29.877576876313114,
+ "delta_pct": -0.2
+ },
+ "cache_cg": {
+ "cycles": 1473,
+ "std": 36.70606353179267,
+ "delta_pct": 0.2
+ },
+ "cache_ca": {
+ "cycles": 1474,
+ "std": 33.85005022152848,
+ "delta_pct": 0.27
+ },
+ "cache_cv": {
+ "cycles": 1467,
+ "std": 29.68908722072809,
+ "delta_pct": -0.2
+ },
+ "st_cache_cs": {
+ "cycles": 1468,
+ "std": 29.363446238478208,
+ "delta_pct": -0.14
+ },
+ "st_cache_wt": {
+ "cycles": 1470,
+ "std": 21.9029472902621,
+ "delta_pct": 0.0
+ },
+ "st_cache_wb": {
+ "cycles": 1474,
+ "std": 24.60645240582234,
+ "delta_pct": 0.27
+ },
+ "maxnreg_32": {
+ "cycles": 1478,
+ "std": 20.00399960007998,
+ "delta_pct": 0.54
+ },
+ "maxnreg_64": {
+ "cycles": 1470,
+ "std": 26.488219551340173,
+ "delta_pct": 0.0
+ },
+ "maxnreg_128": {
+ "cycles": 1474,
+ "std": 30.10612852892248,
+ "delta_pct": 0.27
+ },
+ "maxnreg_255": {
+ "cycles": 1474,
+ "std": 30.13604320411026,
+ "delta_pct": 0.27
+ },
+ "reorder_cp": {
+ "error": "skipped_barrier"
+ },
+ "reorder_il": {
+ "error": "skipped_barrier"
+ },
+ "reorder_lf": {
+ "error": "skipped_barrier"
+ },
+ "reorder_sl": {
+ "error": "skipped_barrier"
+ }
+ }
+ },
+ "attention_d64_kv64": {
+ "source": "diverse",
+ "baseline": 53610,
+ "baseline_std": 93.32397320624536,
+ "baseline_error": null,
+ "transforms": {
+ "cache_cs": {
+ "cycles": 53613,
+ "std": 102.94325123581439,
+ "delta_pct": 0.01
+ },
+ "cache_cg": {
+ "cycles": 58146,
+ "std": 188.2566184759516,
+ "delta_pct": 8.46
+ },
+ "cache_ca": {
+ "cycles": 53613,
+ "std": 101.8963491740504,
+ "delta_pct": 0.01
+ },
+ "cache_cv": {
+ "cycles": 58174,
+ "std": 181.6218515900551,
+ "delta_pct": 8.51
+ },
+ "st_cache_cs": {
+ "cycles": 53615,
+ "std": 101.64258150499721,
+ "delta_pct": 0.01
+ },
+ "st_cache_wt": {
+ "cycles": 53611,
+ "std": 101.52303925710656,
+ "delta_pct": 0.0
+ },
+ "st_cache_wb": {
+ "cycles": 53589,
+ "std": 97.34385239962512,
+ "delta_pct": -0.04
+ },
+ "maxnreg_32": {
+ "cycles": 116641,
+ "std": 835.3467224901286,
+ "delta_pct": 117.57
+ },
+ "maxnreg_64": {
+ "cycles": 61921,
+ "std": 213.7253255349024,
+ "delta_pct": 15.5
+ },
+ "maxnreg_128": {
+ "cycles": 53249,
+ "std": 162.7886285954888,
+ "delta_pct": -0.67
+ },
+ "maxnreg_255": {
+ "cycles": 28440,
+ "std": 22.207642265670618,
+ "delta_pct": -46.95
+ }
+ }
+ },
1527
+ "attention_d64_kv32": {
1528
+ "source": "diverse",
1529
+ "baseline": 56910,
1530
+ "baseline_std": 182.19544992946447,
1531
+ "baseline_error": null,
1532
+ "transforms": {
1533
+ "cache_cs": {
1534
+ "cycles": 56888,
1535
+ "std": 181.49333321089236,
1536
+ "delta_pct": -0.04
1537
+ },
1538
+ "cache_cg": {
1539
+ "cycles": 59204,
1540
+ "std": 185.004154277681,
1541
+ "delta_pct": 4.03
1542
+ },
1543
+ "cache_ca": {
1544
+ "cycles": 56881,
1545
+ "std": 194.19240452448184,
1546
+ "delta_pct": -0.05
1547
+ },
1548
+ "cache_cv": {
1549
+ "cycles": 59155,
1550
+ "std": 197.05218845524146,
1551
+ "delta_pct": 3.94
1552
+ },
1553
+ "st_cache_cs": {
1554
+ "cycles": 56877,
1555
+ "std": 198.4987042778869,
1556
+ "delta_pct": -0.06
1557
+ },
1558
+ "st_cache_wt": {
1559
+ "cycles": 56894,
1560
+ "std": 179.84295864725982,
1561
+ "delta_pct": -0.03
1562
+ },
1563
+ "st_cache_wb": {
1564
+ "cycles": 56919,
1565
+ "std": 187.93011227581385,
1566
+ "delta_pct": 0.02
1567
+ },
1568
+ "maxnreg_32": {
1569
+ "cycles": 72623,
1570
+ "std": 203.6142676729703,
1571
+ "delta_pct": 27.61
1572
+ },
1573
+ "maxnreg_64": {
1574
+ "cycles": 55899,
1575
+ "std": 111.39286512160463,
1576
+ "delta_pct": -1.78
1577
+ },
1578
+ "maxnreg_128": {
1579
+ "cycles": 56444,
1580
+ "std": 115.63091109214697,
1581
+ "delta_pct": -0.82
1582
+ },
1583
+ "maxnreg_255": {
1584
+ "cycles": 56451,
1585
+ "std": 98.92903504532934,
1586
+ "delta_pct": -0.81
1587
+ },
1588
+ "reorder_cp": {
1589
+ "error": "skipped_barrier"
1590
+ },
1591
+ "reorder_il": {
1592
+ "error": "skipped_barrier"
1593
+ },
1594
+ "reorder_lf": {
1595
+ "error": "skipped_barrier"
1596
+ },
1597
+ "reorder_sl": {
1598
+ "error": "skipped_barrier"
1599
+ }
1600
+ }
1601
+ },
1602
+ "attention_d128_kv64": {
1603
+ "source": "diverse",
1604
+ "baseline": 90703,
1605
+ "baseline_std": 331.9657663073107,
1606
+ "baseline_error": null,
1607
+ "transforms": {
1608
+ "cache_cs": {
1609
+ "cycles": 90984,
1610
+ "std": 289.66283153867016,
1611
+ "delta_pct": 0.31
1612
+ },
1613
+ "cache_cg": {
1614
+ "cycles": 96840,
1615
+ "std": 153.33229592946165,
1616
+ "delta_pct": 6.77
1617
+ },
1618
+ "cache_ca": {
1619
+ "cycles": 90723,
1620
+ "std": 357.56033952327545,
1621
+ "delta_pct": 0.02
1622
+ },
1623
+ "cache_cv": {
1624
+ "cycles": 96850,
1625
+ "std": 155.450086442562,
1626
+ "delta_pct": 6.78
1627
+ },
1628
+ "st_cache_cs": {
1629
+ "cycles": 90717,
1630
+ "std": 396.2587030905441,
1631
+ "delta_pct": 0.02
1632
+ },
1633
+ "st_cache_wt": {
1634
+ "cycles": 90690,
1635
+ "std": 369.9314664907542,
1636
+ "delta_pct": -0.01
1637
+ },
1638
+ "st_cache_wb": {
1639
+ "cycles": 90692,
1640
+ "std": 395.4016678960775,
1641
+ "delta_pct": -0.01
1642
+ },
1643
+ "maxnreg_32": {
1644
+ "cycles": 274793,
1645
+ "std": 5722.283842660638,
1646
+ "delta_pct": 202.96
1647
+ },
1648
+ "maxnreg_64": {
1649
+ "cycles": 146437,
1650
+ "std": 401.55479713234655,
1651
+ "delta_pct": 61.45
1652
+ },
1653
+ "maxnreg_128": {
1654
+ "cycles": 94028,
1655
+ "std": 815.9024439079711,
1656
+ "delta_pct": 3.67
1657
+ },
1658
+ "maxnreg_255": {
1659
+ "cycles": 90439,
1660
+ "std": 214.42241714662205,
1661
+ "delta_pct": -0.29
1662
+ }
1663
+ }
1664
+ },
1665
+ "relu_1024": {
1666
+ "source": "diverse",
1667
+ "baseline": 853,
1668
+ "baseline_std": 10.05211296195979,
1669
+ "baseline_error": null,
1670
+ "transforms": {
1671
+ "cache_cs": {
1672
+ "cycles": 852,
1673
+ "std": 8.13938572620809,
1674
+ "delta_pct": -0.12
1675
+ },
1676
+ "cache_cg": {
1677
+ "cycles": 859,
1678
+ "std": 37.65465409481276,
1679
+ "delta_pct": 0.7
1680
+ },
1681
+ "cache_ca": {
1682
+ "cycles": 852,
1683
+ "std": 8.371676952678,
1684
+ "delta_pct": -0.12
1685
+ },
1686
+ "cache_cv": {
1687
+ "cycles": 860,
1688
+ "std": 37.299769101162006,
1689
+ "delta_pct": 0.82
1690
+ },
1691
+ "st_cache_cs": {
1692
+ "cycles": 852,
1693
+ "std": 9.529657653871938,
1694
+ "delta_pct": -0.12
1695
+ },
1696
+ "st_cache_wt": {
1697
+ "cycles": 852,
1698
+ "std": 7.353903725233286,
1699
+ "delta_pct": -0.12
1700
+ },
1701
+ "st_cache_wb": {
1702
+ "cycles": 853,
1703
+ "std": 7.97044383958635,
1704
+ "delta_pct": 0.0
1705
+ },
1706
+ "maxnreg_32": {
1707
+ "cycles": 860,
1708
+ "std": 9.282078161704954,
1709
+ "delta_pct": 0.82
1710
+ },
1711
+ "maxnreg_64": {
1712
+ "cycles": 852,
1713
+ "std": 7.753444395879808,
1714
+ "delta_pct": -0.12
1715
+ },
1716
+ "maxnreg_128": {
1717
+ "cycles": 853,
1718
+ "std": 8.314323484204833,
1719
+ "delta_pct": 0.0
1720
+ },
1721
+ "maxnreg_255": {
1722
+ "cycles": 853,
1723
+ "std": 8.720344029910747,
1724
+ "delta_pct": 0.0
1725
+ },
1726
+ "reorder_cp": {
1727
+ "cycles": 852,
1728
+ "std": 8.963810573634406,
1729
+ "delta_pct": -0.12
1730
+ },
1731
+ "reorder_il": {
1732
+ "cycles": 853,
1733
+ "std": 8.406293773120233,
1734
+ "delta_pct": 0.0
1735
+ },
1736
+ "reorder_lf": {
1737
+ "cycles": 852,
1738
+ "std": 9.209710907514959,
1739
+ "delta_pct": -0.12
1740
+ },
1741
+ "reorder_sl": {
1742
+ "cycles": 853,
1743
+ "std": 6.763798858629668,
1744
+ "delta_pct": 0.0
1745
+ }
1746
+ }
1747
+ },
1748
+ "relu_512": {
1749
+ "source": "diverse",
1750
+ "baseline": 764,
1751
+ "baseline_std": 16.43397699888861,
1752
+ "baseline_error": null,
1753
+ "transforms": {
1754
+ "cache_cs": {
1755
+ "cycles": 764,
1756
+ "std": 17.24999710144903,
1757
+ "delta_pct": 0.0
1758
+ },
1759
+ "cache_cg": {
1760
+ "cycles": 766,
1761
+ "std": 16.810089232362806,
1762
+ "delta_pct": 0.26
1763
+ },
1764
+ "cache_ca": {
1765
+ "cycles": 764,
1766
+ "std": 16.147148355050188,
1767
+ "delta_pct": 0.0
1768
+ },
1769
+ "cache_cv": {
1770
+ "cycles": 766,
1771
+ "std": 16.470385544971315,
1772
+ "delta_pct": 0.26
1773
+ },
1774
+ "st_cache_cs": {
1775
+ "cycles": 764,
1776
+ "std": 17.5565877948991,
1777
+ "delta_pct": 0.0
1778
+ },
1779
+ "st_cache_wt": {
1780
+ "cycles": 764,
1781
+ "std": 17.564737401965335,
1782
+ "delta_pct": 0.0
1783
+ },
1784
+ "st_cache_wb": {
1785
+ "cycles": 766,
1786
+ "std": 15.71966841253339,
1787
+ "delta_pct": 0.26
1788
+ },
1789
+ "maxnreg_32": {
1790
+ "cycles": 764,
1791
+ "std": 17.01982300142983,
1792
+ "delta_pct": 0.0
1793
+ },
1794
+ "maxnreg_64": {
1795
+ "cycles": 766,
1796
+ "std": 15.90412210717712,
1797
+ "delta_pct": 0.26
1798
+ },
1799
+ "maxnreg_128": {
1800
+ "cycles": 766,
1801
+ "std": 15.540193048993954,
1802
+ "delta_pct": 0.26
1803
+ },
1804
+ "maxnreg_255": {
1805
+ "cycles": 765,
1806
+ "std": 15.783535725559087,
1807
+ "delta_pct": 0.13
1808
+ },
1809
+ "reorder_cp": {
1810
+ "cycles": 764,
1811
+ "std": 16.173768113831727,
1812
+ "delta_pct": 0.0
1813
+ },
1814
+ "reorder_il": {
1815
+ "cycles": 767,
1816
+ "std": 16.0684909061181,
1817
+ "delta_pct": 0.39
1818
+ },
1819
+ "reorder_lf": {
1820
+ "cycles": 764,
1821
+ "std": 16.771750057760816,
1822
+ "delta_pct": 0.0
1823
+ },
1824
+ "reorder_sl": {
1825
+ "cycles": 764,
1826
+ "std": 17.16017482428428,
1827
+ "delta_pct": 0.0
1828
+ }
1829
+ }
1830
+ },
1831
+ "gelu_1024": {
1832
+ "source": "diverse",
1833
+ "baseline": 1371,
1834
+ "baseline_std": 11.377819430804832,
1835
+ "baseline_error": null,
1836
+ "transforms": {
1837
+ "cache_cs": {
1838
+ "cycles": 1372,
1839
+ "std": 12.677648638450272,
1840
+ "delta_pct": 0.07
1841
+ },
1842
+ "cache_cg": {
1843
+ "cycles": 1363,
1844
+ "std": 7.351979325324575,
1845
+ "delta_pct": -0.58
1846
+ },
1847
+ "cache_ca": {
1848
+ "cycles": 1371,
1849
+ "std": 11.97738598359425,
1850
+ "delta_pct": 0.0
1851
+ },
1852
+ "cache_cv": {
1853
+ "cycles": 1364,
1854
+ "std": 11.346672419700852,
1855
+ "delta_pct": -0.51
1856
+ },
1857
+ "st_cache_cs": {
1858
+ "cycles": 1371,
1859
+ "std": 11.756273856966756,
1860
+ "delta_pct": 0.0
1861
+ },
1862
+ "st_cache_wt": {
1863
+ "cycles": 1371,
1864
+ "std": 11.591272578970782,
1865
+ "delta_pct": 0.0
1866
+ },
1867
+ "st_cache_wb": {
1868
+ "cycles": 1370,
1869
+ "std": 8.280940465913277,
1870
+ "delta_pct": -0.07
1871
+ },
1872
+ "maxnreg_32": {
1873
+ "cycles": 1326,
1874
+ "std": 9.086781608468424,
1875
+ "delta_pct": -3.28
1876
+ },
1877
+ "maxnreg_64": {
1878
+ "cycles": 1371,
1879
+ "std": 8.118768379501907,
1880
+ "delta_pct": 0.0
1881
+ },
1882
+ "maxnreg_128": {
1883
+ "cycles": 1371,
1884
+ "std": 12.319496742968035,
1885
+ "delta_pct": 0.0
1886
+ },
1887
+ "maxnreg_255": {
1888
+ "cycles": 1370,
1889
+ "std": 11.341242215912681,
1890
+ "delta_pct": -0.07
1891
+ },
1892
+ "reorder_cp": {
1893
+ "cycles": 1370,
1894
+ "std": 9.817228478547294,
1895
+ "delta_pct": -0.07
1896
+ },
1897
+ "reorder_il": {
1898
+ "cycles": 1370,
1899
+ "std": 12.119792696246913,
1900
+ "delta_pct": -0.07
1901
+ },
1902
+ "reorder_lf": {
1903
+ "cycles": 1371,
1904
+ "std": 12.60591527815414,
1905
+ "delta_pct": 0.0
1906
+ },
1907
+ "reorder_sl": {
1908
+ "cycles": 1371,
1909
+ "std": 11.214205277236545,
1910
+ "delta_pct": 0.0
1911
+ }
1912
+ }
1913
+ },
1914
+ "gelu_512": {
1915
+ "source": "diverse",
1916
+ "baseline": 958,
1917
+ "baseline_std": 17.80862431520189,
1918
+ "baseline_error": null,
1919
+ "transforms": {
1920
+ "cache_cs": {
1921
+ "cycles": 955,
1922
+ "std": 20.503441174593107,
1923
+ "delta_pct": -0.31
1924
+ },
1925
+ "cache_cg": {
1926
+ "cycles": 962,
1927
+ "std": 26.21575432826605,
1928
+ "delta_pct": 0.42
1929
+ },
1930
+ "cache_ca": {
1931
+ "cycles": 957,
1932
+ "std": 19.55654366190508,
1933
+ "delta_pct": -0.1
1934
+ },
1935
+ "cache_cv": {
1936
+ "cycles": 963,
1937
+ "std": 26.990672462908364,
1938
+ "delta_pct": 0.52
1939
+ },
1940
+ "st_cache_cs": {
1941
+ "cycles": 955,
1942
+ "std": 23.808225469362473,
1943
+ "delta_pct": -0.31
1944
+ },
1945
+ "st_cache_wt": {
1946
+ "cycles": 957,
1947
+ "std": 20.593517790800096,
1948
+ "delta_pct": -0.1
1949
+ },
1950
+ "st_cache_wb": {
1951
+ "cycles": 957,
1952
+ "std": 17.687107592820258,
1953
+ "delta_pct": -0.1
1954
+ },
1955
+ "maxnreg_32": {
1956
+ "cycles": 956,
1957
+ "std": 18.6074118296984,
1958
+ "delta_pct": -0.21
1959
+ },
1960
+ "maxnreg_64": {
1961
+ "cycles": 957,
1962
+ "std": 25.47200374921455,
1963
+ "delta_pct": -0.1
1964
+ },
1965
+ "maxnreg_128": {
1966
+ "cycles": 957,
1967
+ "std": 20.030021842224738,
1968
+ "delta_pct": -0.1
1969
+ },
1970
+ "maxnreg_255": {
1971
+ "cycles": 955,
1972
+ "std": 21.280082236683203,
1973
+ "delta_pct": -0.31
1974
+ },
1975
+ "reorder_cp": {
1976
+ "cycles": 959,
1977
+ "std": 19.8063903829042,
1978
+ "delta_pct": 0.1
1979
+ },
1980
+ "reorder_il": {
1981
+ "cycles": 957,
1982
+ "std": 22.347339886438384,
1983
+ "delta_pct": -0.1
1984
+ },
1985
+ "reorder_lf": {
1986
+ "cycles": 957,
1987
+ "std": 22.097415572867337,
1988
+ "delta_pct": -0.1
1989
+ },
1990
+ "reorder_sl": {
1991
+ "cycles": 955,
1992
+ "std": 21.986995588301735,
1993
+ "delta_pct": -0.31
1994
+ }
1995
+ }
1996
+ },
1997
+ "dropout_1024": {
1998
+ "source": "diverse",
1999
+ "baseline": 2137,
2000
+ "baseline_std": 0.0,
2001
+ "baseline_error": null,
2002
+ "transforms": {
2003
+ "cache_cs": {
2004
+ "cycles": 2137,
2005
+ "std": 0.09949874371066199,
2006
+ "delta_pct": 0.0
2007
+ },
2008
+ "cache_cg": {
2009
+ "cycles": 2137,
2010
+ "std": 0.0,
2011
+ "delta_pct": 0.0
2012
+ },
2013
+ "cache_ca": {
2014
+ "cycles": 2137,
2015
+ "std": 0.0,
2016
+ "delta_pct": 0.0
2017
+ },
2018
+ "cache_cv": {
2019
+ "cycles": 2137,
2020
+ "std": 0.0,
2021
+ "delta_pct": 0.0
2022
+ },
2023
+ "st_cache_cs": {
2024
+ "cycles": 2137,
2025
+ "std": 0.07053367989832941,
2026
+ "delta_pct": 0.0
2027
+ },
2028
+ "st_cache_wt": {
2029
+ "cycles": 2137,
2030
+ "std": 0.07053367989832941,
2031
+ "delta_pct": 0.0
2032
+ },
2033
+ "st_cache_wb": {
2034
+ "cycles": 2137,
2035
+ "std": 0.0,
2036
+ "delta_pct": 0.0
2037
+ },
2038
+ "maxnreg_32": {
2039
+ "cycles": 2046,
2040
+ "std": 0.0,
2041
+ "delta_pct": -4.26
2042
+ },
2043
+ "maxnreg_64": {
2044
+ "cycles": 2120,
2045
+ "std": 0.0,
2046
+ "delta_pct": -0.8
2047
+ },
2048
+ "maxnreg_128": {
2049
+ "cycles": 2149,
2050
+ "std": 0.0,
2051
+ "delta_pct": 0.56
2052
+ },
2053
+ "maxnreg_255": {
2054
+ "cycles": 2149,
2055
+ "std": 0.21160103969498834,
2056
+ "delta_pct": 0.56
2057
+ },
2058
+ "reorder_cp": {
2059
+ "cycles": 2210,
2060
+ "std": 5.169020700287435,
2061
+ "delta_pct": 3.42
2062
+ },
2063
+ "reorder_il": {
2064
+ "cycles": 2014,
2065
+ "std": 1.7173744495595593,
2066
+ "delta_pct": -5.76
2067
+ },
2068
+ "reorder_lf": {
2069
+ "cycles": 2210,
2070
+ "std": 5.540794166904236,
2071
+ "delta_pct": 3.42
2072
+ },
2073
+ "reorder_sl": {
2074
+ "cycles": 2210,
2075
+ "std": 8.832144416844644,
2076
+ "delta_pct": 3.42
2077
+ }
2078
+ }
2079
+ },
2080
+ "dropout_512": {
2081
+ "source": "diverse",
2082
+ "baseline": 1625,
2083
+ "baseline_std": 3.360621222333752,
2084
+ "baseline_error": null,
2085
+ "transforms": {
2086
+ "cache_cs": {
2087
+ "cycles": 1625,
2088
+ "std": 3.2819354046050324,
2089
+ "delta_pct": 0.0
2090
+ },
2091
+ "cache_cg": {
2092
+ "cycles": 1623,
2093
+ "std": 3.996845631244719,
2094
+ "delta_pct": -0.12
2095
+ },
2096
+ "cache_ca": {
2097
+ "cycles": 1625,
2098
+ "std": 3.205854020382089,
2099
+ "delta_pct": 0.0
2100
+ },
2101
+ "cache_cv": {
2102
+ "cycles": 1623,
2103
+ "std": 3.873744958047703,
2104
+ "delta_pct": -0.12
2105
+ },
2106
+ "st_cache_cs": {
2107
+ "cycles": 1625,
2108
+ "std": 3.0862234202986665,
2109
+ "delta_pct": 0.0
2110
+ },
2111
+ "st_cache_wt": {
2112
+ "cycles": 1625,
2113
+ "std": 3.241774205585578,
2114
+ "delta_pct": 0.0
2115
+ },
2116
+ "st_cache_wb": {
2117
+ "cycles": 1625,
2118
+ "std": 3.2674760901956117,
2119
+ "delta_pct": 0.0
2120
+ },
2121
+ "maxnreg_32": {
2122
+ "cycles": 1601,
2123
+ "std": 3.7460779490021294,
2124
+ "delta_pct": -1.48
2125
+ },
2126
+ "maxnreg_64": {
2127
+ "cycles": 1562,
2128
+ "std": 4.086682639990534,
2129
+ "delta_pct": -3.88
2130
+ },
2131
+ "maxnreg_128": {
2132
+ "cycles": 1562,
2133
+ "std": 3.956008088970497,
2134
+ "delta_pct": -3.88
2135
+ },
2136
+ "maxnreg_255": {
2137
+ "cycles": 1561,
2138
+ "std": 4.047144054762568,
2139
+ "delta_pct": -3.94
2140
+ },
2141
+ "reorder_cp": {
2142
+ "cycles": 1628,
2143
+ "std": 2.705087798944796,
2144
+ "delta_pct": 0.18
2145
+ },
2146
+ "reorder_il": {
2147
+ "cycles": 1627,
2148
+ "std": 2.4992798962901293,
2149
+ "delta_pct": 0.12
2150
+ },
2151
+ "reorder_lf": {
2152
+ "cycles": 1627,
2153
+ "std": 2.6220173531080984,
2154
+ "delta_pct": 0.12
2155
+ },
2156
+ "reorder_sl": {
2157
+ "cycles": 1627,
2158
+ "std": 2.5333574560255014,
2159
+ "delta_pct": 0.12
2160
+ }
2161
+ }
2162
+ },
2163
+ "cross_entropy_1024": {
2164
+ "source": "diverse",
2165
+ "baseline": 4663,
2166
+ "baseline_std": 74.4845136588808,
2167
+ "baseline_error": null,
2168
+ "transforms": {
2169
+ "cache_cs": {
2170
+ "cycles": 4683,
2171
+ "std": 71.8598217016992,
2172
+ "delta_pct": 0.43
2173
+ },
2174
+ "cache_cg": {
2175
+ "cycles": 4620,
2176
+ "std": 68.62513952626982,
2177
+ "delta_pct": -0.92
2178
+ },
2179
+ "cache_ca": {
2180
+ "cycles": 4683,
2181
+ "std": 71.5853749029786,
2182
+ "delta_pct": 0.43
2183
+ },
2184
+ "cache_cv": {
2185
+ "cycles": 4630,
2186
+ "std": 69.17326777737192,
2187
+ "delta_pct": -0.71
2188
+ },
2189
+ "st_cache_cs": {
2190
+ "cycles": 4672,
2191
+ "std": 64.96739162841618,
2192
+ "delta_pct": 0.19
2193
+ },
2194
+ "st_cache_wt": {
2195
+ "cycles": 4656,
2196
+ "std": 67.25985411075466,
2197
+ "delta_pct": -0.15
2198
+ },
2199
+ "st_cache_wb": {
2200
+ "cycles": 4675,
2201
+ "std": 77.72003843925967,
2202
+ "delta_pct": 0.26
2203
+ },
2204
+ "maxnreg_32": {
2205
+ "cycles": 4686,
2206
+ "std": 53.9437818474011,
2207
+ "delta_pct": 0.49
2208
+ },
2209
+ "maxnreg_64": {
2210
+ "cycles": 4666,
2211
+ "std": 63.665802241077586,
2212
+ "delta_pct": 0.06
2213
+ },
2214
+ "maxnreg_128": {
2215
+ "cycles": 4671,
2216
+ "std": 70.02021136786149,
2217
+ "delta_pct": 0.17
2218
+ },
2219
+ "maxnreg_255": {
2220
+ "cycles": 4679,
2221
+ "std": 68.15793405759891,
2222
+ "delta_pct": 0.34
2223
+ },
2224
+ "reorder_cp": {
2225
+ "error": "skipped_barrier"
2226
+ },
2227
+ "reorder_il": {
2228
+ "error": "skipped_barrier"
2229
+ },
2230
+ "reorder_lf": {
2231
+ "error": "skipped_barrier"
2232
+ },
2233
+ "reorder_sl": {
2234
+ "error": "skipped_barrier"
2235
+ }
2236
+ }
2237
+ },
2238
+ "cross_entropy_512": {
2239
+ "source": "diverse",
2240
+ "baseline": 2768,
2241
+ "baseline_std": 27.59124815951609,
2242
+ "baseline_error": null,
2243
+ "transforms": {
2244
+ "cache_cs": {
2245
+ "cycles": 2767,
2246
+ "std": 27.64353405409663,
2247
+ "delta_pct": -0.04
2248
+ },
2249
+ "cache_cg": {
2250
+ "cycles": 2850,
2251
+ "std": 24.742711553101852,
2252
+ "delta_pct": 2.96
2253
+ },
2254
+ "cache_ca": {
2255
+ "cycles": 2765,
2256
+ "std": 24.116241415278623,
2257
+ "delta_pct": -0.11
2258
+ },
2259
+ "cache_cv": {
2260
+ "cycles": 2849,
2261
+ "std": 24.128014837528596,
2262
+ "delta_pct": 2.93
2263
+ },
2264
+ "st_cache_cs": {
2265
+ "cycles": 2771,
2266
+ "std": 28.70365133567505,
2267
+ "delta_pct": 0.11
2268
+ },
2269
+ "st_cache_wt": {
2270
+ "cycles": 2762,
2271
+ "std": 27.35715217269517,
2272
+ "delta_pct": -0.22
2273
+ },
2274
+ "st_cache_wb": {
2275
+ "cycles": 2764,
2276
+ "std": 31.47713416116531,
2277
+ "delta_pct": -0.14
2278
+ },
2279
+ "maxnreg_32": {
2280
+ "cycles": 2694,
2281
+ "std": 29.4571616928719,
2282
+ "delta_pct": -2.67
2283
+ },
2284
+ "maxnreg_64": {
2285
+ "cycles": 2764,
2286
+ "std": 28.905667264396442,
2287
+ "delta_pct": -0.14
2288
+ },
2289
+ "maxnreg_128": {
2290
+ "cycles": 2763,
2291
+ "std": 28.315727078780796,
2292
+ "delta_pct": -0.18
2293
+ },
2294
+ "maxnreg_255": {
2295
+ "cycles": 2762,
2296
+ "std": 28.16327218204589,
2297
+ "delta_pct": -0.22
2298
+ },
2299
+ "reorder_cp": {
2300
+ "error": "skipped_barrier"
2301
+ },
2302
+ "reorder_il": {
2303
+ "error": "skipped_barrier"
2304
+ },
2305
+ "reorder_lf": {
2306
+ "error": "skipped_barrier"
2307
+ },
2308
+ "reorder_sl": {
2309
+ "error": "skipped_barrier"
2310
+ }
2311
+ }
2312
+ },
2313
+ "batch_norm_1024": {
2314
+ "source": "diverse",
2315
+ "baseline": 3818,
2316
+ "baseline_std": 305.4150782132408,
2317
+ "baseline_error": null,
2318
+ "transforms": {
2319
+ "cache_cs": {
2320
+ "cycles": 3493,
2321
+ "std": 265.58175346774107,
2322
+ "delta_pct": -8.51
2323
+ },
2324
+ "cache_cg": {
2325
+ "cycles": 6958,
2326
+ "std": 271.8136611725025,
2327
+ "delta_pct": 82.24
2328
+ },
2329
+ "cache_ca": {
2330
+ "cycles": 3567,
2331
+ "std": 268.9106727149371,
2332
+ "delta_pct": -6.57
2333
+ },
2334
+ "cache_cv": {
2335
+ "cycles": 6788,
2336
+ "std": 266.913621982843,
2337
+ "delta_pct": 77.79
2338
+ },
2339
+ "st_cache_cs": {
2340
+ "cycles": 3532,
2341
+ "std": 293.2227704323114,
2342
+ "delta_pct": -7.49
2343
+ },
2344
+ "st_cache_wt": {
2345
+ "cycles": 3843,
2346
+ "std": 244.83664023180845,
2347
+ "delta_pct": 0.65
2348
+ },
2349
+ "st_cache_wb": {
2350
+ "cycles": 3549,
2351
+ "std": 286.46786276997983,
2352
+ "delta_pct": -7.05
2353
+ },
2354
+ "maxnreg_32": {
2355
+ "cycles": 3678,
2356
+ "std": 269.71796264802236,
2357
+ "delta_pct": -3.67
2358
+ },
2359
+ "maxnreg_64": {
2360
+ "cycles": 3540,
2361
+ "std": 278.07100527563097,
2362
+ "delta_pct": -7.28
2363
+ },
2364
+ "maxnreg_128": {
2365
+ "cycles": 3818,
2366
+ "std": 268.53938775345415,
2367
+ "delta_pct": 0.0
2368
+ },
2369
+ "maxnreg_255": {
2370
+ "cycles": 3557,
2371
+ "std": 297.45760097028955,
2372
+ "delta_pct": -6.84
2373
+ },
2374
+ "reorder_cp": {
2375
+ "cycles": 3526,
2376
+ "std": 234.53006012023278,
2377
+ "delta_pct": -7.65
2378
+ },
2379
+ "reorder_il": {
2380
+ "cycles": 3092,
2381
+ "std": 262.68993314362086,
2382
+ "delta_pct": -19.02
2383
+ },
2384
+ "reorder_lf": {
2385
+ "cycles": 3437,
2386
+ "std": 303.6824212479214,
2387
+ "delta_pct": -9.98
2388
+ },
2389
+ "reorder_sl": {
2390
+ "cycles": 3447,
2391
+ "std": 274.7649700653269,
2392
+ "delta_pct": -9.72
2393
+ }
2394
+ }
2395
+ },
2396
+ "batch_norm_512": {
2397
+ "source": "diverse",
2398
+ "baseline": 1960,
2399
+ "baseline_std": 24.425166836687115,
2400
+ "baseline_error": null,
2401
+ "transforms": {
2402
+ "cache_cs": {
2403
+ "cycles": 1912,
2404
+ "std": 47.514229184529555,
2405
+ "delta_pct": -2.45
2406
+ },
2407
+ "cache_cg": {
2408
+ "cycles": 3850,
2409
+ "std": 65.87921599412064,
2410
+ "delta_pct": 96.43
2411
+ },
2412
+ "cache_ca": {
2413
+ "cycles": 1914,
2414
+ "std": 51.70893902411845,
2415
+ "delta_pct": -2.35
2416
+ },
2417
+ "cache_cv": {
2418
+ "cycles": 3845,
2419
+ "std": 78.12604799297095,
2420
+ "delta_pct": 96.17
2421
+ },
2422
+ "st_cache_cs": {
2423
+ "cycles": 1917,
2424
+ "std": 48.36344073781351,
2425
+ "delta_pct": -2.19
2426
+ },
2427
+ "st_cache_wt": {
2428
+ "cycles": 1959,
2429
+ "std": 27.981314747523925,
2430
+ "delta_pct": -0.05
2431
+ },
2432
+ "st_cache_wb": {
2433
+ "cycles": 1916,
2434
+ "std": 45.44963806236525,
2435
+ "delta_pct": -2.24
2436
+ },
2437
+ "maxnreg_32": {
2438
+ "cycles": 2007,
2439
+ "std": 37.455332063672856,
2440
+ "delta_pct": 2.4
2441
+ },
2442
+ "maxnreg_64": {
2443
+ "cycles": 1928,
2444
+ "std": 44.08847808668383,
2445
+ "delta_pct": -1.63
2446
+ },
2447
+ "maxnreg_128": {
2448
+ "cycles": 1963,
2449
+ "std": 24.53813155071103,
2450
+ "delta_pct": 0.15
2451
+ },
2452
+ "maxnreg_255": {
2453
+ "cycles": 1920,
2454
+ "std": 48.55923805003534,
2455
+ "delta_pct": -2.04
2456
+ },
2457
+ "reorder_cp": {
2458
+ "cycles": 1851,
2459
+ "std": 35.36270775831511,
2460
+ "delta_pct": -5.56
2461
+ },
2462
+ "reorder_il": {
2463
+ "cycles": 1892,
2464
+ "std": 55.189219961872986,
2465
+ "delta_pct": -3.47
2466
+ },
2467
+ "reorder_lf": {
2468
+ "cycles": 1892,
2469
+ "std": 45.385106587954596,
2470
+ "delta_pct": -3.47
2471
+ },
2472
+ "reorder_sl": {
2473
+ "cycles": 1847,
2474
+ "std": 44.671076772336704,
2475
+ "delta_pct": -5.77
2476
+ }
2477
+ }
2478
+ },
2479
+ "embedding_lookup_256": {
2480
+ "source": "diverse",
2481
+ "baseline": 2583,
2482
+ "baseline_std": 431.5732507639926,
2483
+ "baseline_error": null,
2484
+ "transforms": {
2485
+ "cache_cs": {
2486
+ "cycles": 2415,
2487
+ "std": 374.6212551030707,
2488
+ "delta_pct": -6.5
2489
+ },
2490
+ "cache_cg": {
2491
+ "cycles": 2663,
2492
+ "std": 439.39535952829544,
2493
+ "delta_pct": 3.1
2494
+ },
2495
+ "cache_ca": {
2496
+ "cycles": 2498,
2497
+ "std": 385.7563738941976,
2498
+ "delta_pct": -3.29
2499
+ },
2500
+ "cache_cv": {
2501
+ "cycles": 3023,
2502
+ "std": 315.9150834876359,
2503
+ "delta_pct": 17.03
2504
+ },
2505
+ "st_cache_cs": {
2506
+ "cycles": 2299,
2507
+ "std": 398.473873045649,
2508
+ "delta_pct": -10.99
2509
+ },
2510
+ "st_cache_wt": {
2511
+ "cycles": 2506,
2512
+ "std": 330.7936619329337,
2513
+ "delta_pct": -2.98
2514
+ },
2515
+ "st_cache_wb": {
2516
+ "cycles": 2286,
2517
+ "std": 390.9568873609979,
2518
+ "delta_pct": -11.5
2519
+ },
2520
+ "maxnreg_32": {
2521
+ "cycles": 2602,
2522
+ "std": 392.489609543998,
2523
+ "delta_pct": 0.74
2524
+ },
2525
+ "maxnreg_64": {
2526
+ "cycles": 2592,
2527
+ "std": 382.27569619320553,
2528
+ "delta_pct": 0.35
2529
+ },
2530
+ "maxnreg_128": {
2531
+ "cycles": 2556,
2532
+ "std": 583.0537709122547,
2533
+ "delta_pct": -1.05
2534
+ },
2535
+ "maxnreg_255": {
2536
+ "cycles": 2401,
2537
+ "std": 406.8704358883796,
2538
+ "delta_pct": -7.05
2539
+ },
2540
+ "reorder_cp": {
2541
+ "cycles": 2688,
2542
+ "std": 396.48086851700674,
2543
+ "delta_pct": 4.07
2544
+ },
2545
+ "reorder_il": {
2546
+ "cycles": 2372,
2547
+ "std": 383.4321869314051,
2548
+ "delta_pct": -8.17
2549
+ },
2550
+ "reorder_lf": {
2551
+ "cycles": 2630,
2552
+ "std": 488.5738096746488,
2553
+ "delta_pct": 1.82
2554
+ },
2555
+ "reorder_sl": {
2556
+ "cycles": 2368,
2557
+ "std": 343.44249791631785,
2558
+ "delta_pct": -8.32
2559
+ }
2560
+ }
2561
+ },
2562
+ "embedding_lookup_512": {
2563
+ "source": "diverse",
2564
+ "baseline": 3130,
2565
+ "baseline_std": 522.3538258450875,
2566
+ "baseline_error": null,
2567
+ "transforms": {
2568
+ "cache_cs": {
2569
+ "cycles": 2719,
2570
+ "std": 470.5250943095384,
2571
+ "delta_pct": -13.13
2572
+ },
2573
+ "cache_cg": {
2574
+ "cycles": 4012,
2575
+ "std": 490.27805896246264,
2576
+ "delta_pct": 28.18
2577
+ },
2578
+ "cache_ca": {
2579
+ "cycles": 2799,
2580
+ "std": 481.8130207611662,
2581
+ "delta_pct": -10.58
2582
+ },
2583
+ "cache_cv": {
2584
+ "cycles": 3815,
2585
+ "std": 484.65131342027746,
2586
+ "delta_pct": 21.88
2587
+ },
2588
+ "st_cache_cs": {
2589
+ "cycles": 3121,
2590
+ "std": 460.9627250657042,
2591
+ "delta_pct": -0.29
2592
+ },
2593
+ "st_cache_wt": {
2594
+ "cycles": 3309,
2595
+ "std": 564.0860440571101,
2596
+ "delta_pct": 5.72
2597
+ },
2598
+ "st_cache_wb": {
2599
+ "cycles": 2993,
2600
+ "std": 544.558885130892,
2601
+ "delta_pct": -4.38
2602
+ },
2603
+ "maxnreg_32": {
2604
+ "cycles": 3118,
2605
+ "std": 572.5250244312471,
2606
+ "delta_pct": -0.38
2607
+ },
2608
+ "maxnreg_64": {
2609
+ "cycles": 3586,
2610
+ "std": 367.3977695903991,
2611
+ "delta_pct": 14.57
2612
+ },
2613
+ "maxnreg_128": {
2614
+ "cycles": 3302,
2615
+ "std": 497.0922078447821,
2616
+ "delta_pct": 5.5
2617
+ },
2618
+ "maxnreg_255": {
2619
+ "cycles": 3032,
2620
+ "std": 532.9646478895199,
2621
+ "delta_pct": -3.13
2622
+ },
2623
+ "reorder_cp": {
2624
+ "cycles": 3296,
2625
+ "std": 559.147051297778,
2626
+ "delta_pct": 5.3
2627
+ },
2628
+ "reorder_il": {
2629
+ "cycles": 3201,
2630
+ "std": 469.6084406183517,
2631
+ "delta_pct": 2.27
2632
+ },
2633
+ "reorder_lf": {
2634
+ "cycles": 3234,
2635
+ "std": 527.5769300301142,
2636
+ "delta_pct": 3.32
2637
+ },
2638
+ "reorder_sl": {
2639
+ "cycles": 3441,
2640
+ "std": 550.5740747392671,
2641
+ "delta_pct": 9.94
2642
+ }
2643
+ }
2644
+ }
2645
+ },
2646
+ "best_improvements": [
2647
+ {
2648
+ "kernel": "attention_d64_kv32",
2649
+ "transform": "maxnreg_64",
2650
+ "delta_pct": -1.78
2651
+ },
2652
+ {
2653
+ "kernel": "attention_d64_kv64",
2654
+ "transform": "maxnreg_255",
2655
+ "delta_pct": -46.95
2656
+ },
2657
+ {
2658
+ "kernel": "batch_norm_1024",
2659
+ "transform": "reorder_il",
2660
+ "delta_pct": -19.02
2661
+ },
2662
+ {
2663
+ "kernel": "batch_norm_512",
2664
+ "transform": "reorder_sl",
2665
+ "delta_pct": -5.77
2666
+ },
2667
+ {
2668
+ "kernel": "cross_entropy_512",
2669
+ "transform": "maxnreg_32",
2670
+ "delta_pct": -2.67
2671
+ },
2672
+ {
2673
+ "kernel": "dropout_1024",
2674
+ "transform": "reorder_il",
2675
+ "delta_pct": -5.76
2676
+ },
2677
+ {
2678
+ "kernel": "dropout_512",
2679
+ "transform": "maxnreg_255",
2680
+ "delta_pct": -3.94
2681
+ },
2682
+ {
2683
+ "kernel": "embedding_lookup_256",
2684
+ "transform": "st_cache_wb",
2685
+ "delta_pct": -11.5
2686
+ },
2687
+ {
2688
+ "kernel": "embedding_lookup_512",
2689
+ "transform": "cache_cs",
2690
+ "delta_pct": -13.13
2691
+ },
2692
+ {
2693
+ "kernel": "gelu_1024",
2694
+ "transform": "maxnreg_32",
2695
+ "delta_pct": -3.28
2696
+ },
2697
+ {
2698
+ "kernel": "reduction_max_1024",
2699
+ "transform": "cache_cg",
2700
+ "delta_pct": -1.14
2701
+ },
2702
+ {
2703
+ "kernel": "reduction_sum_1024",
2704
+ "transform": "cache_cv",
2705
+ "delta_pct": -1.14
2706
+ },
2707
+ {
2708
+ "kernel": "triton_fused_add_mul_1024",
2709
+ "transform": "cache_cs",
2710
+ "delta_pct": -1.64
2711
+ },
2712
+ {
2713
+ "kernel": "triton_fused_add_mul_512",
2714
+ "transform": "st_cache_cs",
2715
+ "delta_pct": -1.81
2716
+ },
2717
+ {
2718
+ "kernel": "triton_layernorm_1024",
2719
+ "transform": "st_cache_cs",
2720
+ "delta_pct": -1.08
2721
+ },
2722
+ {
2723
+ "kernel": "triton_vector_add_256",
2724
+ "transform": "cache_cv",
2725
+ "delta_pct": -1.17
2726
+ },
2727
+ {
2728
+ "kernel": "triton_vector_add_512",
2729
+ "transform": "cache_cv",
2730
+ "delta_pct": -3.37
2731
+ }
2732
+ ]
2733
+ }
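The `delta_pct` values in the file above are consistent with the percent change in cycles relative to each kernel's `baseline`, rounded to two decimals, with negative values meaning fewer cycles. The following is a minimal sketch of that arithmetic and of picking the best measured transform per kernel. The helper names `delta_pct` and `best_transform` are hypothetical and are not taken from the repository; the numbers are copied from the `attention_d64_kv32` entry. The `best_improvements` list appears to keep only kernels whose best transform improves by more than roughly 1%, but that cutoff is an inference from the entries shown, not something the file states.

```python
def delta_pct(baseline, cycles):
    """Percent change in cycles relative to baseline; negative means faster."""
    return round((cycles - baseline) / baseline * 100, 2)


def best_transform(transforms):
    """Return (name, delta_pct) for the lowest delta, skipping errored entries."""
    measured = {
        name: entry["delta_pct"]
        for name, entry in transforms.items()
        if "error" not in entry  # e.g. {"error": "skipped_barrier"}
    }
    if not measured:
        return None
    name = min(measured, key=measured.get)
    return name, measured[name]


# Subset of the attention_d64_kv32 entry (baseline 56910 cycles).
baseline = 56910
transforms = {
    "maxnreg_32": {"cycles": 72623, "delta_pct": 27.61},
    "maxnreg_64": {"cycles": 55899, "delta_pct": -1.78},
    "reorder_cp": {"error": "skipped_barrier"},
}

# Recompute each stored delta from the raw cycle counts.
for name, entry in transforms.items():
    if "error" not in entry:
        assert delta_pct(baseline, entry["cycles"]) == entry["delta_pct"]

print(best_transform(transforms))
```

Skipping entries that carry an `error` key keeps transforms that were never measured (such as the `skipped_barrier` reorders) from being treated as a zero-change result.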
training_result.json ADDED
@@ -0,0 +1,725 @@
1
+ {
2
+ "mean_improvement": -0.29151748216617807,
3
+ "n_kernels": 64,
4
+ "per_kernel": {
5
+ "gemm_tile(2,2,2)": {
6
+ "improvement": -0.0191,
7
+ "actions": [
8
+ "vec_st",
9
+ "vec_ld",
10
+ "stop"
11
+ ],
12
+ "baseline_cycles": 419,
13
+ "final_cycles": 411
14
+ },
15
+ "gemm_tile(2,2,4)": {
16
+ "improvement": -0.0363,
17
+ "actions": [
18
+ "vec_ld",
19
+ "vec_st",
20
+ "stop"
21
+ ],
22
+ "baseline_cycles": 441,
23
+ "final_cycles": 425
24
+ },
25
+ "gemm_tile(2,2,6)": {
26
+ "improvement": -0.1475,
27
+ "actions": [
28
+ "vec_ld",
29
+ "vec_st",
30
+ "maxnreg_128",
31
+ "stop"
32
+ ],
33
+ "baseline_cycles": 522,
34
+ "final_cycles": 445
35
+ },
36
+ "gemm_tile(2,2,8)": {
37
+ "improvement": -0.1373,
38
+ "actions": [
39
+ "cache_cs",
40
+ "vec_ld",
41
+ "vec_st",
42
+ "maxnreg_128",
43
+ "stop"
44
+ ],
45
+ "baseline_cycles": 539,
46
+ "final_cycles": 465
47
+ },
48
+ "gemm_tile(2,4,2)": {
49
+ "improvement": -0.1237,
50
+ "actions": [
51
+ "vec_st",
52
+ "vec_ld",
53
+ "stop"
54
+ ],
55
+ "baseline_cycles": 477,
56
+ "final_cycles": 418
57
+ },
58
+ "gemm_tile(2,4,4)": {
59
+ "improvement": -0.1654,
60
+ "actions": [
61
+ "vec_ld",
62
+ "vec_st",
63
+ "maxnreg_128",
64
+ "stop"
65
+ ],
66
+ "baseline_cycles": 544,
67
+ "final_cycles": 454
68
+ },
69
+ "gemm_tile(2,4,6)": {
70
+ "improvement": -0.1932,
71
+ "actions": [
72
+ "cache_cs",
73
+ "vec_ld",
74
+ "maxnreg_128",
75
+ "vec_st",
76
+ "stop"
77
+ ],
78
+ "baseline_cycles": 647,
79
+ "final_cycles": 522
80
+ },
81
+ "gemm_tile(2,4,8)": {
82
+ "improvement": -0.361,
83
+ "actions": [
84
+ "maxnreg_128",
85
+ "vec_ld",
86
+ "vec_st",
87
+ "stop"
88
+ ],
89
+ "baseline_cycles": 856,
90
+ "final_cycles": 547
91
+ },
+ "gemm_tile(2,6,2)": {
+ "improvement": -0.1899,
+ "actions": [
+ "vec_st",
+ "vec_ld",
+ "stop"
+ ],
+ "baseline_cycles": 537,
+ "final_cycles": 435
+ },
+ "gemm_tile(2,6,4)": {
+ "improvement": -0.1651,
+ "actions": [
+ "cache_cs",
+ "vec_ld",
+ "maxnreg_128",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 630,
+ "final_cycles": 526
+ },
+ "gemm_tile(2,6,6)": {
+ "improvement": -0.3322,
+ "actions": [
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 891,
+ "final_cycles": 595
+ },
+ "gemm_tile(2,6,8)": {
+ "improvement": -0.493,
+ "actions": [
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 1278,
+ "final_cycles": 648
+ },
+ "gemm_tile(2,8,2)": {
+ "improvement": -0.1929,
+ "actions": [
+ "vec_st",
+ "vec_ld",
+ "maxnreg_128",
+ "stop"
+ ],
+ "baseline_cycles": 565,
+ "final_cycles": 456
+ },
+ "gemm_tile(2,8,4)": {
+ "improvement": -0.4123,
+ "actions": [
+ "prefetch_L1",
+ "st_cache_cs",
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 912,
+ "final_cycles": 536
+ },
+ "gemm_tile(2,8,6)": {
+ "improvement": -0.2716,
+ "actions": [
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 902,
+ "final_cycles": 657
+ },
+ "gemm_tile(2,8,8)": {
+ "improvement": -0.2637,
+ "actions": [
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 1005,
+ "final_cycles": 740
+ },
+ "gemm_tile(4,2,2)": {
+ "improvement": -0.0695,
+ "actions": [
+ "vec_st",
+ "vec_ld",
+ "stop"
+ ],
+ "baseline_cycles": 475,
+ "final_cycles": 442
+ },
+ "gemm_tile(4,2,4)": {
+ "improvement": -0.1343,
+ "actions": [
+ "vec_ld",
+ "vec_st",
+ "maxnreg_128",
+ "stop"
+ ],
+ "baseline_cycles": 536,
+ "final_cycles": 464
+ },
+ "gemm_tile(4,2,6)": {
+ "improvement": -0.1936,
+ "actions": [
+ "cache_cs",
+ "vec_ld",
+ "maxnreg_128",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 625,
+ "final_cycles": 504
+ },
+ "gemm_tile(4,2,8)": {
+ "improvement": -0.2722,
+ "actions": [
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 753,
+ "final_cycles": 548
+ },
+ "gemm_tile(4,4,2)": {
+ "improvement": -0.1865,
+ "actions": [
+ "vec_st",
+ "vec_ld",
+ "maxnreg_128",
+ "stop"
+ ],
+ "baseline_cycles": 547,
+ "final_cycles": 445
+ },
+ "gemm_tile(4,4,4)": {
+ "improvement": -0.1693,
+ "actions": [
+ "prefetch_L1",
+ "vec_st",
+ "maxnreg_128",
+ "vec_ld",
+ "stop"
+ ],
+ "baseline_cycles": 691,
+ "final_cycles": 574
+ },
+ "gemm_tile(4,4,6)": {
+ "improvement": -0.3213,
+ "actions": [
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 940,
+ "final_cycles": 638
+ },
+ "gemm_tile(4,4,8)": {
+ "improvement": -0.4308,
+ "actions": [
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 1258,
+ "final_cycles": 716
+ },
+ "gemm_tile(4,6,2)": {
+ "improvement": -0.2639,
+ "actions": [
+ "vec_st",
+ "vec_ld",
+ "maxnreg_128",
+ "stop"
+ ],
+ "baseline_cycles": 648,
+ "final_cycles": 477
+ },
+ "gemm_tile(4,6,4)": {
+ "improvement": -0.238,
+ "actions": [
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 832,
+ "final_cycles": 634
+ },
+ "gemm_tile(4,6,6)": {
+ "improvement": -0.502,
+ "actions": [
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 1502,
+ "final_cycles": 748
+ },
+ "gemm_tile(4,6,8)": {
+ "improvement": -0.5383,
+ "actions": [
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 1839,
+ "final_cycles": 849
+ },
+ "gemm_tile(4,8,2)": {
+ "improvement": -0.2437,
+ "actions": [
+ "vec_st",
+ "maxnreg_128",
+ "vec_ld",
+ "stop"
+ ],
+ "baseline_cycles": 714,
+ "final_cycles": 540
+ },
+ "gemm_tile(4,8,4)": {
+ "improvement": -0.2336,
+ "actions": [
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 959,
+ "final_cycles": 735
+ },
+ "gemm_tile(4,8,6)": {
+ "improvement": -0.2232,
+ "actions": [
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 1102,
+ "final_cycles": 856
+ },
+ "gemm_tile(4,8,8)": {
+ "improvement": -0.3196,
+ "actions": [
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 1430,
+ "final_cycles": 973
+ },
+ "gemm_tile(6,2,2)": {
+ "improvement": -0.1468,
+ "actions": [
+ "vec_st",
+ "vec_ld",
+ "stop"
+ ],
+ "baseline_cycles": 545,
+ "final_cycles": 465
+ },
+ "gemm_tile(6,2,4)": {
+ "improvement": -0.1749,
+ "actions": [
+ "cache_cs",
+ "vec_ld",
+ "maxnreg_128",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 606,
+ "final_cycles": 500
+ },
+ "gemm_tile(6,2,6)": {
+ "improvement": -0.336,
+ "actions": [
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 866,
+ "final_cycles": 575
+ },
+ "gemm_tile(6,2,8)": {
+ "improvement": -0.36,
+ "actions": [
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 989,
+ "final_cycles": 633
+ },
+ "gemm_tile(6,4,2)": {
+ "improvement": -0.2897,
+ "actions": [
+ "vec_st",
+ "vec_ld",
+ "maxnreg_128",
+ "stop"
+ ],
+ "baseline_cycles": 673,
+ "final_cycles": 478
+ },
+ "gemm_tile(6,4,4)": {
+ "improvement": -0.2296,
+ "actions": [
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 832,
+ "final_cycles": 641
+ },
+ "gemm_tile(6,4,6)": {
+ "improvement": -0.4646,
+ "actions": [
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 1386,
+ "final_cycles": 742
+ },
+ "gemm_tile(6,4,8)": {
+ "improvement": -0.4832,
+ "actions": [
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 1546,
+ "final_cycles": 799
+ },
+ "gemm_tile(6,6,2)": {
+ "improvement": -0.2968,
+ "actions": [
+ "vec_st",
+ "maxnreg_128",
+ "vec_ld",
+ "stop"
+ ],
+ "baseline_cycles": 775,
+ "final_cycles": 545
+ },
+ "gemm_tile(6,6,4)": {
+ "improvement": -0.3052,
+ "actions": [
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 983,
+ "final_cycles": 683
+ },
+ "gemm_tile(6,6,6)": {
+ "improvement": -0.4544,
+ "actions": [
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 1635,
+ "final_cycles": 892
+ },
+ "gemm_tile(6,6,8)": {
+ "improvement": -0.4978,
+ "actions": [
+ "maxnreg_255",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 2059,
+ "final_cycles": 1034
+ },
+ "gemm_tile(6,8,2)": {
+ "improvement": -0.2758,
+ "actions": [
+ "vec_st",
+ "maxnreg_128",
+ "vec_ld",
+ "stop"
+ ],
+ "baseline_cycles": 881,
+ "final_cycles": 638
+ },
+ "gemm_tile(6,8,4)": {
+ "improvement": -0.264,
+ "actions": [
+ "vec_st",
+ "maxnreg_128",
+ "vec_ld",
+ "stop"
+ ],
+ "baseline_cycles": 1197,
+ "final_cycles": 881
+ },
+ "gemm_tile(6,8,6)": {
+ "improvement": -0.3264,
+ "actions": [
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 1578,
+ "final_cycles": 1063
+ },
+ "gemm_tile(6,8,8)": {
+ "improvement": -0.3011,
+ "actions": [
+ "maxnreg_255",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 1737,
+ "final_cycles": 1214
+ },
+ "gemm_tile(8,2,2)": {
+ "improvement": -0.1706,
+ "actions": [
+ "vec_st",
+ "vec_ld",
+ "maxnreg_128",
+ "stop"
+ ],
+ "baseline_cycles": 592,
+ "final_cycles": 491
+ },
+ "gemm_tile(8,2,4)": {
+ "improvement": -0.2521,
+ "actions": [
+ "prefetch_L1",
+ "st_cache_cs",
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 722,
+ "final_cycles": 540
+ },
+ "gemm_tile(8,2,6)": {
+ "improvement": -0.4012,
+ "actions": [
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 1042,
+ "final_cycles": 624
+ },
+ "gemm_tile(8,2,8)": {
+ "improvement": -0.3817,
+ "actions": [
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 1158,
+ "final_cycles": 716
+ },
+ "gemm_tile(8,4,2)": {
+ "improvement": -0.297,
+ "actions": [
+ "vec_st",
+ "maxnreg_128",
+ "vec_ld",
+ "stop"
+ ],
+ "baseline_cycles": 744,
+ "final_cycles": 523
+ },
+ "gemm_tile(8,4,4)": {
+ "improvement": -0.3152,
+ "actions": [
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 1050,
+ "final_cycles": 719
+ },
+ "gemm_tile(8,4,6)": {
+ "improvement": -0.4074,
+ "actions": [
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 1436,
+ "final_cycles": 851
+ },
+ "gemm_tile(8,4,8)": {
+ "improvement": -0.4238,
+ "actions": [
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 1647,
+ "final_cycles": 949
+ },
+ "gemm_tile(8,6,2)": {
+ "improvement": -0.2839,
+ "actions": [
+ "vec_st",
+ "maxnreg_128",
+ "vec_ld",
+ "stop"
+ ],
+ "baseline_cycles": 863,
+ "final_cycles": 618
+ },
+ "gemm_tile(8,6,4)": {
+ "improvement": -0.4151,
+ "actions": [
+ "vec_st",
+ "maxnreg_128",
+ "vec_ld",
+ "stop"
+ ],
+ "baseline_cycles": 1349,
+ "final_cycles": 789
+ },
+ "gemm_tile(8,6,6)": {
+ "improvement": -0.4326,
+ "actions": [
+ "maxnreg_128",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 1789,
+ "final_cycles": 1015
+ },
+ "gemm_tile(8,6,8)": {
+ "improvement": -0.4535,
+ "actions": [
+ "maxnreg_255",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 2192,
+ "final_cycles": 1198
+ },
+ "gemm_tile(8,8,2)": {
+ "improvement": -0.2404,
+ "actions": [
+ "vec_st",
+ "maxnreg_128",
+ "vec_ld",
+ "stop"
+ ],
+ "baseline_cycles": 965,
+ "final_cycles": 733
+ },
+ "gemm_tile(8,8,4)": {
+ "improvement": -0.3331,
+ "actions": [
+ "vec_st",
+ "maxnreg_128",
+ "vec_ld",
+ "stop"
+ ],
+ "baseline_cycles": 1540,
+ "final_cycles": 1027
+ },
+ "gemm_tile(8,8,6)": {
+ "improvement": -0.3909,
+ "actions": [
+ "vec_st",
+ "maxnreg_255",
+ "vec_ld",
+ "stop"
+ ],
+ "baseline_cycles": 2062,
+ "final_cycles": 1256
+ },
+ "gemm_tile(8,8,8)": {
+ "improvement": -0.4082,
+ "actions": [
+ "maxnreg_255",
+ "vec_ld",
+ "vec_st",
+ "stop"
+ ],
+ "baseline_cycles": 2452,
+ "final_cycles": 1451
+ }
+ },
+ "action_distribution": {
+ "vec_st": 0.246,
+ "vec_ld": 0.246,
+ "stop": 0.246,
+ "maxnreg_128": 0.204,
+ "cache_cs": 0.019,
+ "maxnreg_255": 0.019,
+ "prefetch_L1": 0.012,
+ "st_cache_cs": 0.008
+ },
+ "unique_sequences": 12
+ }