This view is limited to 50 files because it contains too many changes. See the raw diff here.
Files changed (50) hide show
  1. .gitattributes +35 -0
  2. .gitignore +0 -21
  3. BENCHMARK_REPORT.md +0 -439
  4. HACKATHON_SUMMARY.md +0 -179
  5. README.md +12 -129
  6. backend/main.py +0 -177
  7. backend/requirements.txt +0 -3
  8. backend/terminal_manager.py +0 -342
  9. client.py +0 -77
  10. frontend/.gitignore +0 -24
  11. frontend/README.md +0 -16
  12. frontend/eslint.config.js +0 -29
  13. frontend/index.html +0 -13
  14. frontend/package-lock.json +0 -0
  15. frontend/package.json +0 -27
  16. frontend/public/vite.svg +0 -1
  17. frontend/src/App.jsx +0 -195
  18. frontend/src/api/terminal.js +0 -57
  19. frontend/src/assets/react.svg +0 -1
  20. frontend/src/components/TerminalPane.jsx +0 -147
  21. frontend/src/hooks/useTerminalSession.js +0 -166
  22. frontend/src/index.css +0 -486
  23. frontend/src/main.jsx +0 -10
  24. frontend/vite.config.js +0 -7
  25. index.html +317 -0
  26. models.py +0 -27
  27. openenv.yaml +0 -15
  28. requirements.txt +0 -8
  29. scripts/__init__.py +0 -1
  30. scripts/benchmark_qwen_05b_runtime.py +0 -158
  31. scripts/benchmark_runtime_references.py +0 -222
  32. scripts/benchmark_surrogate_vs_triton_autotune.py +0 -361
  33. scripts/build_benchmark_splits.py +0 -81
  34. scripts/build_qwen_05b_splits.py +0 -56
  35. scripts/check_torch_cuda.py +0 -18
  36. scripts/collect_measurements.py +0 -398
  37. scripts/collect_multifamily_measurements.py +0 -659
  38. scripts/collect_qwen_05b_measurements.py +0 -449
  39. scripts/demo_compare.py +0 -200
  40. scripts/eval_generalization.py +0 -147
  41. scripts/qwen_05b_spec.py +0 -115
  42. scripts/report_task_hardness.py +0 -42
  43. scripts/run_full_pipeline.sh +0 -84
  44. scripts/run_qwen_05b_pipeline.sh +0 -90
  45. scripts/run_random_baseline.py +0 -187
  46. scripts/run_surrogate_baseline.py +0 -217
  47. scripts/smoke_test_client.py +0 -28
  48. server/__init__.py +0 -1
  49. server/app.py +0 -72
  50. server/softmax_surrogate_environment.py +0 -570
.gitattributes ADDED
@@ -0,0 +1,35 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
.gitignore DELETED
@@ -1,21 +0,0 @@
1
- __pycache__/
2
- *.pyc
3
- *.pyo
4
- *.log
5
-
6
-
7
- outputs/
8
- data/softmax_measurements.csv
9
- backend/build/
10
- backend/dist/
11
- backend/venv/
12
- backend/__pycache__/
13
- backend/*.pyc
14
- backend/*.pyo
15
- backend/*.log
16
- backend/outputs/
17
- backend/data/
18
- backend/data/softmax_measurements.csv
19
- backend/backend/build/
20
- backend/backend/dist/
21
- backend/backend/__pycache__/
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
BENCHMARK_REPORT.md DELETED
@@ -1,439 +0,0 @@
1
- # GPU Autotuning Benchmark Report
2
-
3
- ## Project Summary
4
-
5
- This project implements a discrete-action, measured-oracle autotuning benchmark for GPU kernels on an H200-class machine. The optimizer is a surrogate-guided search policy that selects kernel configurations under a short evaluation budget. The benchmark is intentionally structured to answer two different questions:
6
-
7
- 1. How much more search-efficient is a surrogate-guided policy than a random baseline?
8
- 2. Do the kernel configurations found by the search process translate into materially better runtime than strong PyTorch baselines such as eager execution and `torch.compile`?
9
-
10
- The current benchmark answers both questions with real measurements.
11
-
12
- ## What Was Built
13
-
14
- The current system includes:
15
-
16
- - A shared multi-family measurement cache:
17
- - `softmax`
18
- - `layernorm`
19
- - `grouped_gemm`
20
- - A discrete-action environment with:
21
- - hidden oracle table
22
- - short-budget search episodes
23
- - regret-based metrics
24
- - train-task priors for cross-task generalization
25
- - Split-based evaluation:
26
- - held-out shapes
27
- - held-out family
28
- - Runtime reference benchmarking against:
29
- - eager PyTorch
30
- - `torch.compile`
31
- - live reruns of the best Triton configs found by `random`
32
- - live reruns of the best Triton configs found by `surrogate`
33
-
34
- The key design choice is that search evaluation and runtime evaluation are separated:
35
-
36
- - Search benchmark: measures how quickly each method finds good configs
37
- - Runtime benchmark: measures how fast the final chosen kernels actually run
38
-
39
- ## Benchmark Structure
40
-
41
- ### Families
42
-
43
- - Family A:
44
- - `softmax`
45
- - `layernorm`
46
- - Family B:
47
- - `grouped_gemm`
48
-
49
- ### Shapes
50
-
51
- Current task shapes use:
52
-
53
- - fixed `M = 4096`
54
- - `N ∈ {256, 512, 1024, 1536, 2048, 3072, 4096, 6144, 8192}`
55
-
56
- This yields:
57
-
58
- - `9` softmax tasks
59
- - `9` layernorm tasks
60
- - `9` grouped GEMM tasks
61
- - `27` tasks total
62
-
63
- ### Search Space
64
-
65
- Current kernel config axes:
66
-
67
- - `block_size`
68
- - `num_warps`
69
- - `num_stages`
70
-
71
- Search is over discrete `config_id`s, not projected continuous actions. That makes the random baseline a true uniform baseline over legal configs.
72
-
73
- ### Evaluation Splits
74
-
75
- The benchmark currently uses:
76
-
77
- - `shape_generalization`
78
- - train on lower/intermediate shapes from each family
79
- - test on held-out larger shapes within the same families
80
- - `family_holdout`
81
- - train on `softmax` + `layernorm`
82
- - test on all `grouped_gemm` tasks
83
-
84
- At the time of the main run:
85
-
86
- - shape-generalization test tasks: `6`
87
- - family-holdout test tasks: `9`
88
- - unique held-out runtime-reference tasks: `13`
89
-
90
- ## Metrics
91
-
92
- ### Search Metrics
93
-
94
- The main optimization metrics are:
95
-
96
- - `regret@k`
97
- - best-so-far latency relative to oracle best after `k` search steps
98
- - best-so-far AUC
99
- - average regret across the whole search trajectory
100
- - lower is better
101
- - final oracle hit rate
102
- - how often the optimizer reaches the oracle-best config by the end of the episode
103
-
104
- These metrics intentionally emphasize search efficiency, not just the final endpoint.
105
-
106
- ### Runtime Metrics
107
-
108
- The runtime-reference benchmark records:
109
-
110
- - eager PyTorch steady-state latency
111
- - `torch.compile` compile-plus-first-call time
112
- - `torch.compile` steady-state latency
113
- - live rerun latency of best Triton config found by `random`
114
- - live rerun latency of best Triton config found by `surrogate`
115
- - Triton speedups vs eager and vs compiled
116
-
117
- ## Main Search Results
118
-
119
- The main search results came from the held-out-shape / held-out-family evaluation in `run13.log`.
120
-
121
- ### Held-out Shapes
122
-
123
- Random:
124
-
125
- - `mean_regret_at`:
126
- - `@1 = 0.31341859698295593`
127
- - `@3 = 0.13305269181728363`
128
- - `@5 = 0.1070360466837883`
129
- - `@6 = 0.06344400346279144`
130
- - `mean_best_so_far_auc = 0.1483089178800583`
131
- - `mean_oracle_hit_rate_final = 0.6749999523162842`
132
-
133
- Surrogate:
134
-
135
- - `mean_regret_at`:
136
- - `@1 = 0.002321675419807434`
137
- - `@3 = 0.002293013734742999`
138
- - `@5 = 0.002293013734742999`
139
- - `@6 = 0.002293013734742999`
140
- - `mean_best_so_far_auc = 0.0023013732861727476`
141
- - `mean_oracle_hit_rate_final = 0.8333333134651184`
142
-
143
- Interpretation:
144
-
145
- - Surrogate reduced best-so-far AUC by about `98.4%` versus random.
146
- - Surrogate reduced final `regret@6` by about `96.4%`.
147
- - Surrogate improved final oracle-hit rate by about `15.8` percentage points.
148
-
149
- This is a very strong within-family / held-out-shape result.
150
-
151
- ### Held-out Family (`grouped_gemm`)
152
-
153
- Random:
154
-
155
- - `mean_regret_at`:
156
- - `@1 = 2.341181755065918`
157
- - `@3 = 0.8532703518867493`
158
- - `@5 = 0.3116174638271332`
159
- - `@6 = 0.21012252569198608`
160
- - `mean_best_so_far_auc = 0.9102223515510559`
161
- - `mean_oracle_hit_rate_final = 0.17777778208255768`
162
-
163
- Surrogate:
164
-
165
- - `mean_regret_at`:
166
- - `@1 = 0.4722703695297241`
167
- - `@3 = 0.29785311222076416`
168
- - `@5 = 0.20862582325935364`
169
- - `@6 = 0.17804712057113647`
170
- - `mean_best_so_far_auc = 0.3014116585254669`
171
- - `mean_oracle_hit_rate_final = 0.3444444239139557`
172
-
173
- Interpretation:
174
-
175
- - Surrogate reduced best-so-far AUC by about `66.9%` versus random.
176
- - Surrogate reduced final `regret@6` by about `15.3%`.
177
- - Surrogate improved final oracle-hit rate by about `16.7` percentage points.
178
-
179
- This is a good cross-family transfer result. The gap is smaller than in the held-out-shape setting, which is expected.
180
-
181
- ## Main Runtime Results
182
-
183
- The runtime-reference benchmark compares eager PyTorch, `torch.compile`, and the live reruns of the best Triton configs found by `random` and `surrogate`.
184
-
185
- ### Summary by Family
186
-
187
- #### Softmax
188
-
189
- Average eager latency:
190
-
191
- - `0.1103919968008995 ms`
192
-
193
- Average `torch.compile` steady-state latency:
194
-
195
- - `0.11152799427509308 ms`
196
-
197
- Average compile-plus-first-call time:
198
-
199
- - `529.9687260048813 ms`
200
-
201
- Average Triton speedup vs eager:
202
-
203
- - random-best: `3.362561387683493x`
204
- - surrogate-best: `3.286588301595338x`
205
-
206
- Average Triton speedup vs compiled:
207
-
208
- - random-best: `3.3985671575178635x`
209
- - surrogate-best: `3.321742054891467x`
210
-
211
- Interpretation:
212
-
213
- - `torch.compile` is effectively flat vs eager on this softmax set.
214
- - Tuned Triton is substantially faster than both.
215
- - Surrogate-best and random-best final kernels are very close in absolute runtime.
216
-
217
- #### LayerNorm
218
-
219
- Average eager latency:
220
-
221
- - `0.05939200147986412 ms`
222
-
223
- Average `torch.compile` steady-state latency:
224
-
225
- - `0.09503999352455139 ms`
226
-
227
- Average compile-plus-first-call time:
228
-
229
- - `440.1235789991915 ms`
230
-
231
- Average Triton speedup vs eager:
232
-
233
- - random-best: `1.8776593781360051x`
234
- - surrogate-best: `1.8364378273209185x`
235
-
236
- Average Triton speedup vs compiled:
237
-
238
- - random-best: `2.927484944635789x`
239
- - surrogate-best: `2.862647103483093x`
240
-
241
- Interpretation:
242
-
243
- - `torch.compile` is slower than eager on this LayerNorm set.
244
- - Tuned Triton is materially faster than both eager and compiled.
245
- - Again, surrogate-best and random-best final kernels are close in endpoint runtime.
246
-
247
- #### Grouped GEMM
248
-
249
- Average eager latency:
250
-
251
- - `0.05589688859052128 ms`
252
-
253
- Average `torch.compile` steady-state latency:
254
-
255
- - `0.101806221736802 ms`
256
-
257
- Average compile-plus-first-call time:
258
-
259
- - `102.45987688863858 ms`
260
-
261
- Average Triton speedup vs eager:
262
-
263
- - random-best: `1.2771213149737215x`
264
- - surrogate-best: `1.2644549628354071x`
265
-
266
- Average Triton speedup vs compiled:
267
-
268
- - random-best: `2.4414293463407355x`
269
- - surrogate-best: `2.4156697207038382x`
270
-
271
- Interpretation:
272
-
273
- - `torch.compile` is materially slower than eager on this grouped-GEMM set.
274
- - Tuned Triton is faster than both eager and compiled.
275
- - Endpoint difference between surrogate-best and random-best remains small.
276
-
277
- ## Triton Autotune Comparison
278
-
279
- We also compared the surrogate search workflow directly against `triton.autotune` on three large softmax tasks:
280
-
281
- - `softmax_m4096_n4096`
282
- - `softmax_m4096_n6144`
283
- - `softmax_m4096_n8192`
284
-
285
- This comparison measures two things:
286
-
287
- - search plus compile cost to the first usable tuned kernel
288
- - steady-state runtime of the resulting tuned kernel
289
-
290
- ### Per-task Results
291
-
292
- #### `softmax_m4096_n4096`
293
-
294
- - oracle best: `0.02127999998629093 ms`
295
- - surrogate:
296
- - decision time: `33.06370500649791 ms`
297
- - compile plus first call: `294.734695009538 ms`
298
- - steady-state: `0.02127999998629093 ms`
299
- - regret vs oracle: `0.0`
300
- - `triton.autotune`:
301
- - first call: `8970.702438004082 ms`
302
- - steady-state: `0.021856000646948814 ms`
303
- - regret vs oracle: `0.0270677002363231`
304
-
305
- #### `softmax_m4096_n6144`
306
-
307
- - oracle best: `0.030719999223947525 ms`
308
- - surrogate:
309
- - decision time: `15.47088599181734 ms`
310
- - compile plus first call: `0.9627069957787171 ms`
311
- - steady-state: `0.031007999554276466 ms`
312
- - regret vs oracle: `0.009375010989727928`
313
- - `triton.autotune`:
314
- - first call: `1990.3547260037158 ms`
315
- - steady-state: `0.031039999797940254 ms`
316
- - regret vs oracle: `0.010416685614473398`
317
-
318
- #### `softmax_m4096_n8192`
319
-
320
- - oracle best: `0.03747199848294258 ms`
321
- - surrogate:
322
- - decision time: `15.47144899086561 ms`
323
- - compile plus first call: `0.09679699724074453 ms`
324
- - steady-state: `0.03798399865627289 ms`
325
- - regret vs oracle: `0.013663540618560122`
326
- - `triton.autotune`:
327
- - first call: `1988.913670007605 ms`
328
- - steady-state: `0.03747199848294258 ms`
329
- - regret vs oracle: `0.0`
330
-
331
- ### Summary
332
-
333
- - mean surrogate compile plus first call: `98.59806633418582 ms`
334
- - mean surrogate steady-state latency: `0.03009066606561343 ms`
335
- - mean `triton.autotune` first call: `4316.656944671801 ms`
336
- - mean `triton.autotune` steady-state latency: `0.030122666309277218 ms`
337
-
338
- Interpretation:
339
-
340
- - The surrogate reaches a competitive tuned kernel far faster than `triton.autotune` on these tasks.
341
- - Steady-state performance is effectively the same on average:
342
- - surrogate mean steady-state: `0.0300907 ms`
343
- - `triton.autotune` mean steady-state: `0.0301227 ms`
344
- - On one task the surrogate exactly matched the oracle best.
345
- - On the other two tasks the surrogate was slightly off the oracle, but still close.
346
- - `triton.autotune` won one task in steady-state quality, but paid a much larger first-call search cost.
347
-
348
- This is an important result because it shows the surrogate is not only better than a random search baseline. It is also competitive with Triton's built-in autotuning in final kernel quality while being dramatically cheaper in tuning-time-to-first-good-kernel on these tested shapes.
349
-
350
- ## What The Results Mean
351
-
352
- The results support the following conclusions:
353
-
354
- 1. The surrogate optimizer is genuinely useful as a search policy.
355
- - It is substantially more sample-efficient than random.
356
- - It reaches good kernels much earlier in the budget.
357
- - This effect is very strong on held-out shapes and still meaningful on held-out family transfer.
358
-
359
- 2. The resulting Triton kernels are genuinely useful as runtime implementations.
360
- - They are faster than eager PyTorch.
361
- - They are faster than `torch.compile`.
362
- - The advantage is strongest on `softmax`, then `layernorm`, then `grouped_gemm`.
363
-
364
- 3. The surrogate is also competitive with `triton.autotune` on final steady-state runtime while being much cheaper in first-call tuning cost on the tested softmax shapes.
365
-
366
- 4. The main value of the surrogate is search efficiency, not necessarily a dramatically better final endpoint than a lucky random search.
367
- - By the end of the short search budget, random and surrogate can still land on very similar endpoint kernels.
368
- - This is visible in the runtime benchmark, where random-best and surrogate-best final kernels are often close in ms.
369
- - The surrogate still wins decisively on regret and AUC.
370
-
371
- This is a coherent and valuable result. The optimizer is improving how quickly good kernels are found, and the kernels it finds are fast in absolute runtime.
372
-
373
- ## Caveats and Professional Notes
374
-
375
- This benchmark is strong for a hackathon project, but it should not be oversold.
376
-
377
- Important caveats:
378
-
379
- - The benchmark is not publication-grade yet.
380
- - Only one `M` value is used.
381
- - Family/task distributions are still narrow.
382
- - Grouped GEMM was added recently and should be validated more deeply.
383
- - `torch.compile` compile time measurement should be interpreted carefully.
384
- - The measured compile-plus-first-call times vary across tasks.
385
- - A more publication-grade measurement would reset compiler state more aggressively and isolate cold-start behavior more carefully.
386
- - The runtime benchmark uses live reruns.
387
- - That means small run-to-run variation is expected.
388
- - The endpoint runtime gap between surrogate-best and random-best is small.
389
- - This is not a failure of the benchmark.
390
- - It means the benchmark currently demonstrates search-efficiency gains more strongly than final-endpoint gains.
391
-
392
- These caveats do not invalidate the results. They define the proper scope of the claims.
393
-
394
- ## Recommended Project Framing
395
-
396
- The most honest and compelling project framing is:
397
-
398
- > We built a measured-oracle GPU autotuning benchmark with held-out-shape and held-out-family evaluation. Our surrogate-guided optimizer substantially outperforms a random baseline on short-budget search efficiency, and the Triton kernels it finds are materially faster than both eager PyTorch and `torch.compile`.
399
-
400
- That statement matches the data.
401
-
402
- ## Recommended Next Steps
403
-
404
- Highest-value next steps:
405
-
406
- 1. Expand the runtime benchmark:
407
- - more held-out tasks
408
- - more end-to-end summaries
409
-
410
- 2. Improve search without changing the measurement cache:
411
- - stronger acquisition strategies
412
- - family-aware priors
413
- - feature engineering before simply increasing parameter count
414
-
415
- 3. Make the collector more production-friendly:
416
- - incremental writes
417
- - progress logging
418
- - resumable measurement collection
419
-
420
- 4. If needed, increase benchmark rigor:
421
- - broader shape sets
422
- - more families
423
- - more careful cold-start `torch.compile` accounting
424
-
425
- ## Deliverable Status
426
-
427
- At the current stage, this project is:
428
-
429
- - hackathon-ready
430
- - technically credible
431
- - professionally explainable
432
- - useful as a platform for next-stage kernel autotuning work
433
-
434
- It is not yet:
435
-
436
- - a finished research benchmark
437
- - a final systems paper artifact
438
-
439
- That is the correct level of rigor for the current results.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
HACKATHON_SUMMARY.md DELETED
@@ -1,179 +0,0 @@
1
- # GPU Kernel Autotuning: Hackathon Summary
2
-
3
- ## What We Built
4
-
5
- We built a measured-oracle GPU autotuning benchmark for Triton kernels on an H200-class machine.
6
-
7
- The system has two layers:
8
-
9
- - Search benchmark
10
- - compares `surrogate` vs `random`
11
- - measures how quickly each method finds good kernel configs
12
- - Runtime benchmark
13
- - compares the selected Triton kernels against eager PyTorch and `torch.compile`
14
- - measures actual execution latency in milliseconds
15
-
16
- The benchmark is discrete-action, uses real measured kernel runtimes, and supports held-out-shape and held-out-family evaluation.
17
-
18
- ## Kernel Families
19
-
20
- Current benchmark families:
21
-
22
- - Family A:
23
- - `softmax`
24
- - `layernorm`
25
- - Family B:
26
- - `grouped_gemm`
27
-
28
- Current shape set:
29
-
30
- - `M = 4096`
31
- - `N ∈ {256, 512, 1024, 1536, 2048, 3072, 4096, 6144, 8192}`
32
-
33
- Total tasks:
34
-
35
- - `27` tasks
36
-
37
- ## Why This Matters
38
-
39
- There are two distinct questions in GPU autotuning:
40
-
41
- 1. Can an optimizer find good kernels quickly under a short tuning budget?
42
- 2. Are the kernels it finds actually fast in real runtime?
43
-
44
- This project answers both.
45
-
46
- ## Search Results
47
-
48
- ### Held-out Shapes
49
-
50
- Against the `random` baseline, the `surrogate` optimizer achieved:
51
-
52
- - `98.4%` lower best-so-far regret AUC
53
- - `96.4%` lower `regret@6`
54
- - `+15.8` percentage points higher final oracle-hit rate
55
-
56
- Interpretation:
57
-
58
- - On held-out shapes, the surrogate is dramatically more sample-efficient than random.
59
- - It finds near-optimal kernels almost immediately.
60
-
61
- ### Held-out Family (`grouped_gemm`)
62
-
63
- Against the `random` baseline, the `surrogate` optimizer achieved:
64
-
65
- - `66.9%` lower best-so-far regret AUC
66
- - `15.3%` lower `regret@6`
67
- - `+16.7` percentage points higher final oracle-hit rate
68
-
69
- Interpretation:
70
-
71
- - The surrogate also transfers across kernel families.
72
- - The cross-family problem is harder, but the optimizer still wins clearly.
73
-
74
- ## Runtime Results
75
-
76
- We then reran the selected Triton kernels live and compared them to:
77
-
78
- - eager PyTorch
79
- - `torch.compile`
80
- - `triton.autotune` on selected large softmax tasks
81
-
82
- ### Softmax
83
-
84
- Average surrogate-best Triton runtime:
85
-
86
- - about `0.0336 ms`
87
-
88
- Average `torch.compile` runtime:
89
-
90
- - about `0.1115 ms`
91
-
92
- Result:
93
-
94
- - surrogate-selected Triton is about `3.32x` faster than `torch.compile`
95
-
96
- ### LayerNorm
97
-
98
- Average surrogate-best Triton runtime:
99
-
100
- - about `0.0323 ms`
101
-
102
- Average `torch.compile` runtime:
103
-
104
- - about `0.0950 ms`
105
-
106
- Result:
107
-
108
- - surrogate-selected Triton is about `2.86x` faster than `torch.compile`
109
-
110
- ### Grouped GEMM
111
-
112
- Average surrogate-best Triton runtime:
113
-
114
- - about `0.0442 ms`
115
-
116
- Average `torch.compile` runtime:
117
-
118
- - about `0.1018 ms`
119
-
120
- Result:
121
-
122
- - surrogate-selected Triton is about `2.42x` faster than `torch.compile`
123
-
124
- ### Triton Autotune on Large Softmax
125
-
126
- We also compared the surrogate directly against `triton.autotune` on:
127
-
128
- - `softmax_m4096_n4096`
129
- - `softmax_m4096_n6144`
130
- - `softmax_m4096_n8192`
131
-
132
- Result:
133
-
134
- - mean surrogate compile plus first call: `98.6 ms`
135
- - mean `triton.autotune` first call: `4316.7 ms`
136
- - mean steady-state latency was effectively the same:
137
- - surrogate: `0.03009 ms`
138
- - `triton.autotune`: `0.03012 ms`
139
-
140
- Interpretation:
141
-
142
- - The surrogate reaches a strong tuned kernel far faster than `triton.autotune` on these tested softmax shapes.
143
- - Final steady-state kernel quality is essentially matched on average.
144
-
145
- ## What The Results Mean
146
-
147
- The current system shows two real advantages:
148
-
149
- - The surrogate is much better than random at short-budget tuning.
150
- - The resulting Triton kernels are materially faster than `torch.compile`.
151
- - On selected large softmax tasks, the surrogate also reaches competitive tuned kernels far faster than `triton.autotune`.
152
-
153
- The subtle point is that the surrogate's biggest gain is in search efficiency, not necessarily in a huge final-endpoint gap over a lucky random search. That is why regret/AUC is the right optimization metric here.
154
-
155
- ## What Is Strong
156
-
157
- - Real GPU measurements, not synthetic rewards
158
- - Hidden-oracle benchmark protocol
159
- - Discrete action space
160
- - Held-out-shape evaluation
161
- - Held-out-family evaluation
162
- - Runtime comparison against eager PyTorch and `torch.compile`
163
-
164
- ## What Is Not Final Yet
165
-
166
- - This is not publication-grade yet
167
- - Only one `M` dimension is used
168
- - Grouped GEMM was added recently and should be stress-validated further
169
- - `torch.compile` cold-start accounting could be measured even more rigorously
170
-
171
- ## Bottom Line
172
-
173
- This project demonstrates a credible autotuning workflow:
174
-
175
- - surrogate-guided search beats random on search quality
176
- - tuned Triton kernels beat `torch.compile` on runtime
177
- - the benchmark already shows cross-family transfer
178
-
179
- That is a strong hackathon result with professional-grade measurement discipline.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
README.md CHANGED
@@ -1,129 +1,12 @@
1
- # Multi-family GPU autotuning benchmark
2
-
3
- ## TLDR
4
-
5
- ```bash
6
- python3 scripts/check_torch_cuda.py
7
- ./scripts/run_full_pipeline.sh
8
- ```
9
-
10
- The pipeline now:
11
- - collects a shared measurement cache for `softmax`, `layernorm`, and `grouped_gemm`
12
- - builds split manifests for shape holdout and family holdout
13
- - smoke-tests the local OpenEnv-style environment
14
- - evaluates `random` vs `surrogate` using `regret@k` and best-so-far AUC
15
- - benchmarks eager PyTorch and `torch.compile` against best Triton configs
16
-
17
- ## Measurement cache
18
-
19
- Main collector:
20
-
21
- ```bash
22
- python3 scripts/collect_multifamily_measurements.py \
23
- --output data/autotune_measurements.csv \
24
- --families softmax layernorm grouped_gemm \
25
- --n-cols 256 512 1024 1536 2048 3072 4096 6144 8192 \
26
- --m 4096 \
27
- --repeats 200 \
28
- --warmup 25 \
29
- --seed 0 \
30
- --append
31
- ```
32
-
33
- Current implemented families:
34
- - `softmax`
35
- - `layernorm`
36
- - `grouped_gemm`
37
-
38
- The shared CSV schema includes:
39
- - `family_group`
40
- - `family`
41
- - `task_id`
42
- - `m`, `n`
43
- - `config_id`
44
- - `block_size`, `num_warps`, `num_stages`
45
- - `shape_json`, `config_json`
46
- - `median_ms`, `effective_gbps`, `score`, `validation_error`
47
-
48
- ## Splits and eval
49
-
50
- Build split manifests:
51
-
52
- ```bash
53
- python3 scripts/build_benchmark_splits.py \
54
- --measurement-path data/autotune_measurements.csv \
55
- --output data/benchmark_splits.json \
56
- --heldout-family grouped_gemm
57
- ```
58
-
59
- Run split-based evaluation:
60
-
61
- ```bash
62
- python3 scripts/eval_generalization.py \
63
- --measurement-path data/autotune_measurements.csv \
64
- --splits data/benchmark_splits.json \
65
- --episodes 20 \
66
- --budget 6 \
67
- --seed 2 \
68
- --acquisition ucb \
69
- --beta 2.0
70
- ```
71
-
72
- Benchmark absolute runtime references:
73
-
74
- ```bash
75
- python3 scripts/benchmark_runtime_references.py \
76
- --generalization-results outputs/generalization_eval.json \
77
- --repeats 100 \
78
- --warmup 10 \
79
- --seed 123
80
- ```
81
-
82
- Metrics:
83
- - `mean_regret_at`
84
- - `median_regret_at`
85
- - `mean_best_so_far_auc`
86
- - `mean_oracle_hit_rate_final`
87
- - `eager_latency_ms`
88
- - `compile_plus_first_call_ms`
89
- - `compiled_latency_ms`
90
- - Triton speedups vs eager / compiled
91
-
92
- ## Environment
93
-
94
- OpenEnv metadata is in:
95
- - `openenv.yaml`
96
-
97
- The environment server still uses:
98
- - `server/app.py`
99
- - `server/softmax_surrogate_environment.py`
100
-
101
- Despite the filename, the env is now multi-family and supports a train-task prior for held-out-shape / held-out-family evaluation.
102
-
103
- ## Qwen2.5-0.5B exact-kernel pipeline
104
-
105
- This repo now also includes a model-specific benchmark pipeline for the exact inference kernel roles needed by `Qwen/Qwen2.5-0.5B`.
106
-
107
- Kernel roles covered:
108
- - `rmsnorm`
109
- - attention `softmax`
110
- - `q_proj`, `k_proj`, `v_proj`, `o_proj`
111
- - `gate_proj`, `up_proj`, `down_proj`
112
-
113
- Run it with:
114
-
115
- ```bash
116
- ./scripts/run_qwen_05b_pipeline.sh
117
- ```
118
-
119
- Key files:
120
- - `scripts/qwen_05b_spec.py`
121
- - `scripts/collect_qwen_05b_measurements.py`
122
- - `scripts/build_qwen_05b_splits.py`
123
- - `scripts/benchmark_qwen_05b_runtime.py`
124
-
125
- Outputs:
126
- - `data/qwen_05b_measurements.csv`
127
- - `data/qwen_05b_splits.json`
128
- - `outputs/qwen_05b_generalization_eval.json`
129
- - `outputs/qwen_05b_runtime_references.json`
 
1
+ ---
2
+ title: RL Surrogate ENV
3
+ emoji: 👀
4
+ colorFrom: gray
5
+ colorTo: blue
6
+ sdk: static
7
+ pinned: false
8
+ license: apache-2.0
9
+ short_description: This is a surrogate discovery environment to improve performance
10
+ ---
11
+
12
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
backend/main.py DELETED
@@ -1,177 +0,0 @@
1
- from __future__ import annotations
2
-
3
- import sys
4
- from pathlib import Path
5
- from typing import Any, Dict, List, Optional
6
-
7
- ROOT = Path(__file__).resolve().parents[1]
8
- if str(ROOT) not in sys.path:
9
- sys.path.insert(0, str(ROOT))
10
-
11
- from fastapi import FastAPI, HTTPException, WebSocket, WebSocketDisconnect
12
- from fastapi.middleware.cors import CORSMiddleware
13
- from pydantic import BaseModel
14
-
15
- from backend.terminal_manager import ALLOWED_JOBS, TerminalManager
16
-
17
- app = FastAPI(
18
- title="RL Autotuning Backend",
19
- description="Backend API for the multi-family GPU autotuning benchmark",
20
- version="0.1.0",
21
- )
22
-
23
- app.add_middleware(
24
- CORSMiddleware,
25
- allow_origins=[
26
- "http://localhost:5173",
27
- "http://127.0.0.1:5173",
28
- "http://localhost:4173",
29
- "http://127.0.0.1:4173",
30
- ],
31
- allow_credentials=False,
32
- allow_methods=["*"],
33
- allow_headers=["*"],
34
- )
35
-
36
- env = None
37
- terminal_manager = TerminalManager()
38
-
39
-
40
- def _get_env():
41
- global env
42
- if env is None:
43
- try:
44
- from server.softmax_surrogate_environment import SoftmaxSurrogateEnvironment
45
- env = SoftmaxSurrogateEnvironment()
46
- except ImportError as exc:
47
- raise HTTPException(
48
- status_code=503,
49
- detail=f"Environment unavailable – missing dependency: {exc.name}",
50
- )
51
- return env
52
-
53
-
54
- class ResetRequest(BaseModel):
55
- task: Optional[str] = None
56
- seed: Optional[int] = None
57
-
58
-
59
- class StepRequest(BaseModel):
60
- config_id: Optional[int] = None
61
- x: Optional[List[float]] = None
62
-
63
-
64
- class SessionRequest(BaseModel):
65
- job_id: str
66
- restart: bool = False
67
-
68
-
69
- class SessionInputRequest(BaseModel):
70
- data: str
71
- append_newline: bool = True
72
-
73
-
74
- class SessionResizeRequest(BaseModel):
75
- cols: int
76
- rows: int
77
-
78
-
79
- @app.get("/health")
80
- def health() -> Dict[str, str]:
81
- return {"status": "ok"}
82
-
83
-
84
- @app.post("/reset")
85
- def reset(payload: ResetRequest) -> Dict[str, Any]:
86
- return _get_env().reset(task=payload.task, seed=payload.seed)
87
-
88
-
89
- @app.post("/step")
90
- def step(payload: StepRequest) -> Dict[str, Any]:
91
- e = _get_env()
92
- if payload.config_id is not None:
93
- return e.step({"config_id": payload.config_id})
94
- if payload.x is not None:
95
- return e.step({"x": payload.x})
96
- raise HTTPException(status_code=400, detail="Provide config_id or x.")
97
-
98
-
99
- @app.get("/state")
100
- def state() -> Dict[str, Any]:
101
- return _get_env().state()
102
-
103
-
104
- @app.get("/terminal/jobs")
105
- def terminal_jobs() -> Dict[str, Any]:
106
- return {"jobs": terminal_manager.list_jobs()}
107
-
108
-
109
- @app.post("/terminal/sessions")
110
- async def create_terminal_session(payload: SessionRequest) -> Dict[str, Any]:
111
- if payload.job_id not in ALLOWED_JOBS:
112
- raise HTTPException(status_code=404, detail=f"Unknown job_id: {payload.job_id}")
113
- session = await terminal_manager.ensure_session(payload.job_id, restart=payload.restart)
114
- return session.snapshot()
115
-
116
-
117
- @app.get("/terminal/sessions/{session_id}")
118
- def terminal_session_snapshot(session_id: str) -> Dict[str, Any]:
119
- session = terminal_manager.get_session(session_id)
120
- if session is None:
121
- raise HTTPException(status_code=404, detail="Session not found")
122
- return session.snapshot()
123
-
124
-
125
- @app.post("/terminal/sessions/{session_id}/input")
126
- def terminal_session_input(session_id: str, payload: SessionInputRequest) -> Dict[str, Any]:
127
- session = terminal_manager.get_session(session_id)
128
- if session is None:
129
- raise HTTPException(status_code=404, detail="Session not found")
130
- if not session.is_active:
131
- raise HTTPException(status_code=409, detail="Session is not running")
132
- session.write(payload.data, append_newline=payload.append_newline)
133
- return {"ok": True}
134
-
135
-
136
- @app.post("/terminal/sessions/{session_id}/resize")
137
- def terminal_session_resize(session_id: str, payload: SessionResizeRequest) -> Dict[str, Any]:
138
- session = terminal_manager.get_session(session_id)
139
- if session is None:
140
- raise HTTPException(status_code=404, detail="Session not found")
141
- session.resize(payload.cols, payload.rows)
142
- return {"ok": True}
143
-
144
-
145
- @app.post("/terminal/sessions/{session_id}/stop")
146
- def terminal_session_stop(session_id: str) -> Dict[str, Any]:
147
- session = terminal_manager.get_session(session_id)
148
- if session is None:
149
- raise HTTPException(status_code=404, detail="Session not found")
150
- session.interrupt()
151
- return {"ok": True}
152
-
153
-
154
- @app.websocket("/terminal/sessions/{session_id}/stream")
155
- async def terminal_session_stream(websocket: WebSocket, session_id: str) -> None:
156
- session = terminal_manager.get_session(session_id)
157
- if session is None:
158
- await websocket.close(code=4404)
159
- return
160
-
161
- await websocket.accept()
162
- queue = await session.subscribe()
163
- try:
164
- await websocket.send_json(session.snapshot())
165
- while True:
166
- event = await queue.get()
167
- await websocket.send_json(event)
168
- except WebSocketDisconnect:
169
- pass
170
- finally:
171
- session.unsubscribe(queue)
172
-
173
-
174
- if __name__ == "__main__":
175
- import uvicorn
176
-
177
- uvicorn.run("main:app", host="0.0.0.0", port=8000, reload=True)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
backend/requirements.txt DELETED
@@ -1,3 +0,0 @@
1
- fastapi
2
- uvicorn[standard]
3
- pydantic
 
 
 
 
backend/terminal_manager.py DELETED
@@ -1,342 +0,0 @@
1
- from __future__ import annotations
2
-
3
- import asyncio
4
- import importlib.util
5
- import json
6
- import os
7
- import pty
8
- import signal
9
- import struct
10
- import subprocess
11
- import termios
12
- import threading
13
- import time
14
- import uuid
15
- from dataclasses import dataclass
16
- from pathlib import Path
17
- from typing import Any
18
-
19
-
20
- REPO_ROOT = Path(__file__).resolve().parents[1]
21
- BUFFER_LIMIT = 160_000
22
- DEFAULT_COLS = 120
23
- DEFAULT_ROWS = 36
24
- PYTHON_CANDIDATES = (
25
- "/usr/local/bin/python3",
26
- "/opt/homebrew/bin/python3",
27
- "/Users/amannindra/miniconda3/bin/python3",
28
- )
29
-
30
-
31
- @dataclass(frozen=True)
32
- class AllowedJob:
33
- job_id: str
34
- label: str
35
- description: str
36
- command: tuple[str, ...]
37
- cwd: Path
38
-
39
- def as_dict(self) -> dict[str, Any]:
40
- return {
41
- "job_id": self.job_id,
42
- "label": self.label,
43
- "description": self.description,
44
- "command": list(self.command),
45
- "cwd": str(self.cwd),
46
- }
47
-
48
-
49
- ALLOWED_JOBS: dict[str, AllowedJob] = {
50
- "qwen": AllowedJob(
51
- job_id="qwen",
52
- label="Qwen Baseline",
53
- description="Runs the exact-kernel Qwen2.5-0.5B benchmark pipeline.",
54
- command=("bash", "scripts/run_qwen_05b_pipeline.sh"),
55
- cwd=REPO_ROOT,
56
- ),
57
- "rl-agent": AllowedJob(
58
- job_id="rl-agent",
59
- label="RL Agent",
60
- description="Runs the multi-family surrogate and runtime benchmark pipeline.",
61
- command=("bash", "scripts/run_full_pipeline.sh"),
62
- cwd=REPO_ROOT,
63
- ),
64
- }
65
-
66
-
67
- def _probe_python(path: str) -> dict[str, Any] | None:
68
- if not Path(path).exists():
69
- return None
70
-
71
- script = (
72
- "import importlib.util, json, sys; "
73
- "print(json.dumps({"
74
- "'executable': sys.executable, "
75
- "'torch': bool(importlib.util.find_spec('torch')), "
76
- "'triton': bool(importlib.util.find_spec('triton'))"
77
- "}))"
78
- )
79
- try:
80
- result = subprocess.run(
81
- [path, "-c", script],
82
- check=True,
83
- capture_output=True,
84
- text=True,
85
- )
86
- except (OSError, subprocess.CalledProcessError):
87
- return None
88
-
89
- try:
90
- payload = json.loads(result.stdout.strip())
91
- except json.JSONDecodeError:
92
- return None
93
- payload["path"] = path
94
- return payload
95
-
96
-
97
- def _best_python_runtime() -> dict[str, Any] | None:
98
- explicit = os.environ.get("TERMINAL_PYTHON_BIN")
99
- if explicit:
100
- probe = _probe_python(explicit)
101
- if probe is not None:
102
- probe["score"] = int(probe["torch"]) + int(probe["triton"])
103
- probe["explicit"] = True
104
- return probe
105
-
106
- best: dict[str, Any] | None = None
107
- for candidate in PYTHON_CANDIDATES:
108
- probe = _probe_python(candidate)
109
- if probe is None:
110
- continue
111
- score = int(probe["torch"]) + int(probe["triton"])
112
- probe["score"] = score
113
- if best is None or score > best["score"]:
114
- best = probe
115
- return best
116
-
117
-
118
- class TerminalSession:
119
- def __init__(self, job: AllowedJob, loop: asyncio.AbstractEventLoop) -> None:
120
- self.id = uuid.uuid4().hex
121
- self.job = job
122
- self.loop = loop
123
- self.created_at = time.time()
124
- self.started_at: float | None = None
125
- self.finished_at: float | None = None
126
- self.exit_code: int | None = None
127
- self.status = "starting"
128
- self.cols = DEFAULT_COLS
129
- self.rows = DEFAULT_ROWS
130
- self.python_runtime = _best_python_runtime()
131
-
132
- self._buffer = ""
133
- self._buffer_lock = threading.Lock()
134
- self._subscribers: set[asyncio.Queue[dict[str, Any]]] = set()
135
- self._subscriber_lock = threading.Lock()
136
-
137
- self._master_fd, slave_fd = pty.openpty()
138
- self._resize_fd(self.cols, self.rows)
139
-
140
- env = os.environ.copy()
141
- env.setdefault("TERM", "xterm-256color")
142
- env.setdefault("PYTHONUNBUFFERED", "1")
143
- env.setdefault("FORCE_COLOR", "1")
144
- if self.python_runtime is not None:
145
- python_dir = str(Path(self.python_runtime["path"]).parent)
146
- env["PATH"] = f"{python_dir}:{env.get('PATH', '')}"
147
- env["PYTHON_BIN"] = self.python_runtime["path"]
148
-
149
- self._append_buffer(self._launcher_banner())
150
-
151
- self.process = subprocess.Popen(
152
- self.job.command,
153
- cwd=str(self.job.cwd),
154
- stdin=slave_fd,
155
- stdout=slave_fd,
156
- stderr=slave_fd,
157
- env=env,
158
- preexec_fn=os.setsid,
159
- close_fds=True,
160
- )
161
- os.close(slave_fd)
162
-
163
- self.started_at = time.time()
164
- self.status = "running"
165
-
166
- self._reader_thread = threading.Thread(target=self._reader_loop, daemon=True)
167
- self._waiter_thread = threading.Thread(target=self._wait_loop, daemon=True)
168
- self._reader_thread.start()
169
- self._waiter_thread.start()
170
-
171
- @property
172
- def command_display(self) -> str:
173
- return " ".join(self.job.command)
174
-
175
- @property
176
- def is_active(self) -> bool:
177
- return self.process.poll() is None
178
-
179
- def snapshot(self) -> dict[str, Any]:
180
- with self._buffer_lock:
181
- buffer = self._buffer
182
- return {
183
- "type": "snapshot",
184
- "session": {
185
- "id": self.id,
186
- "job_id": self.job.job_id,
187
- "label": self.job.label,
188
- "description": self.job.description,
189
- "cwd": str(self.job.cwd),
190
- "command": self.command_display,
191
- "status": self.status,
192
- "created_at": self.created_at,
193
- "started_at": self.started_at,
194
- "finished_at": self.finished_at,
195
- "exit_code": self.exit_code,
196
- "cols": self.cols,
197
- "rows": self.rows,
198
- },
199
- "buffer": buffer,
200
- }
201
-
202
- async def subscribe(self) -> asyncio.Queue[dict[str, Any]]:
203
- queue: asyncio.Queue[dict[str, Any]] = asyncio.Queue()
204
- with self._subscriber_lock:
205
- self._subscribers.add(queue)
206
- return queue
207
-
208
- def unsubscribe(self, queue: asyncio.Queue[dict[str, Any]]) -> None:
209
- with self._subscriber_lock:
210
- self._subscribers.discard(queue)
211
-
212
- def write(self, data: str, append_newline: bool = True) -> None:
213
- if not data:
214
- return
215
- payload = data + ("\n" if append_newline else "")
216
- os.write(self._master_fd, payload.encode("utf-8", errors="replace"))
217
-
218
- def resize(self, cols: int, rows: int) -> None:
219
- self.cols = max(20, cols)
220
- self.rows = max(8, rows)
221
- self._resize_fd(self.cols, self.rows)
222
-
223
- def interrupt(self) -> None:
224
- if self.process.poll() is None:
225
- os.killpg(os.getpgid(self.process.pid), signal.SIGINT)
226
-
227
- def terminate(self) -> None:
228
- if self.process.poll() is None:
229
- os.killpg(os.getpgid(self.process.pid), signal.SIGTERM)
230
-
231
- def _resize_fd(self, cols: int, rows: int) -> None:
232
- winsize = struct.pack("HHHH", rows, cols, 0, 0)
233
- try:
234
- termios.tcsetwinsize(self._master_fd, (rows, cols))
235
- except AttributeError:
236
- pass
237
- try:
238
- import fcntl
239
-
240
- fcntl.ioctl(self._master_fd, termios.TIOCSWINSZ, winsize)
241
- except OSError:
242
- pass
243
-
244
- def _append_buffer(self, chunk: str) -> None:
245
- with self._buffer_lock:
246
- self._buffer = (self._buffer + chunk)[-BUFFER_LIMIT:]
247
-
248
- def _launcher_banner(self) -> str:
249
- lines = [
250
- f"[launcher] job: {self.job.label}",
251
- f"[launcher] cwd: {self.job.cwd}",
252
- f"[launcher] command: {self.command_display}",
253
- ]
254
- if self.python_runtime is not None:
255
- modules = []
256
- modules.append(f"torch={'yes' if self.python_runtime['torch'] else 'no'}")
257
- modules.append(f"triton={'yes' if self.python_runtime['triton'] else 'no'}")
258
- lines.append(f"[launcher] python3: {self.python_runtime['path']} ({', '.join(modules)})")
259
- if self.python_runtime.get("explicit"):
260
- lines.append("[launcher] python3 source: TERMINAL_PYTHON_BIN")
261
- if not self.python_runtime["triton"]:
262
- lines.append("[launcher] warning: Triton is not installed in the selected Python runtime.")
263
- else:
264
- lines.append("[launcher] warning: no preferred Python runtime detected; falling back to PATH lookup.")
265
- return "\n".join(lines) + "\n\n"
266
-
267
- def _publish(self, event: dict[str, Any]) -> None:
268
- with self._subscriber_lock:
269
- subscribers = tuple(self._subscribers)
270
- for queue in subscribers:
271
- self.loop.call_soon_threadsafe(self._safe_put, queue, event)
272
-
273
- @staticmethod
274
- def _safe_put(queue: asyncio.Queue[dict[str, Any]], event: dict[str, Any]) -> None:
275
- try:
276
- queue.put_nowait(event)
277
- except asyncio.QueueFull:
278
- pass
279
-
280
- def _reader_loop(self) -> None:
281
- while True:
282
- try:
283
- data = os.read(self._master_fd, 4096)
284
- except OSError:
285
- break
286
- if not data:
287
- break
288
- text = data.decode("utf-8", errors="replace")
289
- self._append_buffer(text)
290
- self._publish({"type": "output", "data": text})
291
-
292
- def _wait_loop(self) -> None:
293
- exit_code = self.process.wait()
294
- self.exit_code = exit_code
295
- self.finished_at = time.time()
296
- self.status = "exited" if exit_code == 0 else "failed"
297
- self._publish(
298
- {
299
- "type": "exit",
300
- "exit_code": exit_code,
301
- "status": self.status,
302
- "finished_at": self.finished_at,
303
- }
304
- )
305
- try:
306
- os.close(self._master_fd)
307
- except OSError:
308
- pass
309
-
310
-
311
- class TerminalManager:
312
- def __init__(self) -> None:
313
- self._sessions: dict[str, TerminalSession] = {}
314
- self._latest_by_job: dict[str, str] = {}
315
- self._lock = threading.Lock()
316
-
317
- def list_jobs(self) -> list[dict[str, Any]]:
318
- return [job.as_dict() for job in ALLOWED_JOBS.values()]
319
-
320
- def get_session(self, session_id: str) -> TerminalSession | None:
321
- with self._lock:
322
- return self._sessions.get(session_id)
323
-
324
- async def ensure_session(self, job_id: str, restart: bool = False) -> TerminalSession:
325
- if job_id not in ALLOWED_JOBS:
326
- raise KeyError(job_id)
327
-
328
- with self._lock:
329
- existing_id = self._latest_by_job.get(job_id)
330
- existing = self._sessions.get(existing_id) if existing_id else None
331
-
332
- if existing and existing.is_active and not restart:
333
- return existing
334
-
335
- if existing and restart:
336
- existing.interrupt()
337
-
338
- session = TerminalSession(ALLOWED_JOBS[job_id], asyncio.get_running_loop())
339
- with self._lock:
340
- self._sessions[session.id] = session
341
- self._latest_by_job[job_id] = session.id
342
- return session
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
client.py DELETED
@@ -1,77 +0,0 @@
1
- from __future__ import annotations
2
-
3
- import argparse
4
- import sys
5
- from pathlib import Path
6
- from typing import Any, Optional
7
-
8
- import requests
9
-
10
- ROOT = Path(__file__).resolve().parents[1]
11
- if str(ROOT) not in sys.path:
12
- sys.path.append(str(ROOT))
13
-
14
- from server.softmax_surrogate_environment import DEFAULT_BUDGET, SoftmaxSurrogateEnvironment
15
-
16
-
17
- class SoftmaxSurrogateEnvClient:
18
- def __init__(
19
- self,
20
- base_url: Optional[str] = None,
21
- measurement_path: str = "data/autotune_measurements.csv",
22
- budget: int = DEFAULT_BUDGET,
23
- seed: int = 0,
24
- ) -> None:
25
- self.base_url = base_url
26
- self._local_env = None
27
- if base_url is None:
28
- self._local_env = SoftmaxSurrogateEnvironment(
29
- measurement_path=measurement_path,
30
- budget=budget,
31
- seed=seed,
32
- )
33
-
34
- def reset(self, task: Optional[str] = None, seed: Optional[int] = None) -> dict:
35
- if self._local_env is not None:
36
- return self._local_env.reset(task=task, seed=seed)
37
- payload = {}
38
- if task is not None:
39
- payload["task"] = task
40
- if seed is not None:
41
- payload["seed"] = seed
42
- resp = requests.post(f"{self.base_url}/reset", json=payload, timeout=60)
43
- resp.raise_for_status()
44
- return resp.json()
45
-
46
- def step(self, action: Any) -> dict:
47
- if self._local_env is not None:
48
- return self._local_env.step(action)
49
- payload = action if isinstance(action, dict) else {"x": action}
50
- resp = requests.post(f"{self.base_url}/step", json=payload, timeout=120)
51
- resp.raise_for_status()
52
- return resp.json()
53
-
54
- def state(self) -> dict:
55
- if self._local_env is not None:
56
- return self._local_env.state()
57
- resp = requests.get(f"{self.base_url}/state", timeout=60)
58
- resp.raise_for_status()
59
- return resp.json()
60
-
61
-
62
- def parse_args() -> argparse.Namespace:
63
- parser = argparse.ArgumentParser()
64
- parser.add_argument("--remote", default=None, help="Optional base URL (e.g. http://127.0.0.1:8000)")
65
- parser.add_argument("--task", default=None)
66
- parser.add_argument("--seed", type=int, default=0)
67
- return parser.parse_args()
68
-
69
-
70
- def main() -> None:
71
- args = parse_args()
72
- client = SoftmaxSurrogateEnvClient(base_url=args.remote, seed=args.seed)
73
- print(client.reset(task=args.task))
74
-
75
-
76
- if __name__ == "__main__":
77
- main()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
frontend/.gitignore DELETED
@@ -1,24 +0,0 @@
1
- # Logs
2
- logs
3
- *.log
4
- npm-debug.log*
5
- yarn-debug.log*
6
- yarn-error.log*
7
- pnpm-debug.log*
8
- lerna-debug.log*
9
-
10
- node_modules
11
- dist
12
- dist-ssr
13
- *.local
14
-
15
- # Editor directories and files
16
- .vscode/*
17
- !.vscode/extensions.json
18
- .idea
19
- .DS_Store
20
- *.suo
21
- *.ntvs*
22
- *.njsproj
23
- *.sln
24
- *.sw?
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
frontend/README.md DELETED
@@ -1,16 +0,0 @@
1
- # React + Vite
2
-
3
- This template provides a minimal setup to get React working in Vite with HMR and some ESLint rules.
4
-
5
- Currently, two official plugins are available:
6
-
7
- - [@vitejs/plugin-react](https://github.com/vitejs/vite-plugin-react/blob/main/packages/plugin-react) uses [Babel](https://babeljs.io/) (or [oxc](https://oxc.rs) when used in [rolldown-vite](https://vite.dev/guide/rolldown)) for Fast Refresh
8
- - [@vitejs/plugin-react-swc](https://github.com/vitejs/vite-plugin-react/blob/main/packages/plugin-react-swc) uses [SWC](https://swc.rs/) for Fast Refresh
9
-
10
- ## React Compiler
11
-
12
- The React Compiler is not enabled on this template because of its impact on dev & build performances. To add it, see [this documentation](https://react.dev/learn/react-compiler/installation).
13
-
14
- ## Expanding the ESLint configuration
15
-
16
- If you are developing a production application, we recommend using TypeScript with type-aware lint rules enabled. Check out the [TS template](https://github.com/vitejs/vite/tree/main/packages/create-vite/template-react-ts) for information on how to integrate TypeScript and [`typescript-eslint`](https://typescript-eslint.io) in your project.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
frontend/eslint.config.js DELETED
@@ -1,29 +0,0 @@
1
- import js from '@eslint/js'
2
- import globals from 'globals'
3
- import reactHooks from 'eslint-plugin-react-hooks'
4
- import reactRefresh from 'eslint-plugin-react-refresh'
5
- import { defineConfig, globalIgnores } from 'eslint/config'
6
-
7
- export default defineConfig([
8
- globalIgnores(['dist']),
9
- {
10
- files: ['**/*.{js,jsx}'],
11
- extends: [
12
- js.configs.recommended,
13
- reactHooks.configs.flat.recommended,
14
- reactRefresh.configs.vite,
15
- ],
16
- languageOptions: {
17
- ecmaVersion: 2020,
18
- globals: globals.browser,
19
- parserOptions: {
20
- ecmaVersion: 'latest',
21
- ecmaFeatures: { jsx: true },
22
- sourceType: 'module',
23
- },
24
- },
25
- rules: {
26
- 'no-unused-vars': ['error', { varsIgnorePattern: '^[A-Z_]' }],
27
- },
28
- },
29
- ])
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
frontend/index.html DELETED
@@ -1,13 +0,0 @@
1
- <!doctype html>
2
- <html lang="en">
3
- <head>
4
- <meta charset="UTF-8" />
5
- <link rel="icon" type="image/svg+xml" href="/vite.svg" />
6
- <meta name="viewport" content="width=device-width, initial-scale=1.0" />
7
- <title>frontend</title>
8
- </head>
9
- <body>
10
- <div id="root"></div>
11
- <script type="module" src="/src/main.jsx"></script>
12
- </body>
13
- </html>
 
 
 
 
 
 
 
 
 
 
 
 
 
 
frontend/package-lock.json DELETED
The diff for this file is too large to render. See raw diff
 
frontend/package.json DELETED
@@ -1,27 +0,0 @@
1
- {
2
- "name": "frontend",
3
- "private": true,
4
- "version": "0.0.0",
5
- "type": "module",
6
- "scripts": {
7
- "dev": "vite",
8
- "build": "vite build",
9
- "lint": "eslint .",
10
- "preview": "vite preview"
11
- },
12
- "dependencies": {
13
- "react": "^19.2.0",
14
- "react-dom": "^19.2.0"
15
- },
16
- "devDependencies": {
17
- "@eslint/js": "^9.39.1",
18
- "@types/react": "^19.2.7",
19
- "@types/react-dom": "^19.2.3",
20
- "@vitejs/plugin-react": "^5.1.1",
21
- "eslint": "^9.39.1",
22
- "eslint-plugin-react-hooks": "^7.0.1",
23
- "eslint-plugin-react-refresh": "^0.4.24",
24
- "globals": "^16.5.0",
25
- "vite": "^7.3.1"
26
- }
27
- }
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
frontend/public/vite.svg DELETED
frontend/src/App.jsx DELETED
@@ -1,195 +0,0 @@
1
- import { useEffect, useRef, useState } from 'react'
2
- import TerminalPane from './components/TerminalPane'
3
-
4
- const panes = [
5
- { jobId: 'qwen', title: 'Qwen Baseline', tone: 'cyan' },
6
- { jobId: 'rl-agent', title: 'RL Agent', tone: 'green' },
7
- ]
8
-
9
- function formatMs(value) {
10
- if (value == null) {
11
- return '--'
12
- }
13
- if (value < 1000) {
14
- return `${Math.round(value)} ms`
15
- }
16
- return `${(value / 1000).toFixed(2)} s`
17
- }
18
-
19
- function buildRunStats(telemetry, issuedAt) {
20
- if (!telemetry || !issuedAt) {
21
- return {
22
- responseMs: null,
23
- completionMs: null,
24
- waiting: true,
25
- }
26
- }
27
-
28
- const responseMs = telemetry.lastOutputAt && telemetry.lastOutputAt >= issuedAt ? telemetry.lastOutputAt - issuedAt : null
29
- const finishedAt = telemetry.session?.finished_at ? telemetry.session.finished_at * 1000 : null
30
- const completionMs = finishedAt && finishedAt >= issuedAt ? finishedAt - issuedAt : null
31
-
32
- return {
33
- responseMs,
34
- completionMs,
35
- waiting: responseMs == null && completionMs == null,
36
- }
37
- }
38
-
39
- function App() {
40
- const [split, setSplit] = useState(50)
41
- const [dragging, setDragging] = useState(false)
42
- const [command, setCommand] = useState('')
43
- const [comparisonRun, setComparisonRun] = useState(null)
44
- const [telemetry, setTelemetry] = useState({
45
- qwen: null,
46
- 'rl-agent': null,
47
- })
48
- const workspaceRef = useRef(null)
49
- const leftPaneRef = useRef(null)
50
- const rightPaneRef = useRef(null)
51
-
52
- useEffect(() => {
53
- if (!dragging) {
54
- return undefined
55
- }
56
-
57
- const handlePointerMove = (event) => {
58
- const bounds = workspaceRef.current?.getBoundingClientRect()
59
- if (!bounds) {
60
- return
61
- }
62
-
63
- const next = ((event.clientX - bounds.left) / bounds.width) * 100
64
- const clamped = Math.min(75, Math.max(25, next))
65
- setSplit(clamped)
66
- }
67
-
68
- const handlePointerUp = () => {
69
- setDragging(false)
70
- }
71
-
72
- window.addEventListener('pointermove', handlePointerMove)
73
- window.addEventListener('pointerup', handlePointerUp)
74
-
75
- return () => {
76
- window.removeEventListener('pointermove', handlePointerMove)
77
- window.removeEventListener('pointerup', handlePointerUp)
78
- }
79
- }, [dragging])
80
-
81
- const handleBroadcast = async (event) => {
82
- event.preventDefault()
83
- const value = command.trim()
84
- if (!value) {
85
- return
86
- }
87
-
88
- const issuedAt = Date.now()
89
- setComparisonRun({
90
- command: value,
91
- issuedAt,
92
- })
93
-
94
- await Promise.allSettled([
95
- leftPaneRef.current?.submit(value),
96
- rightPaneRef.current?.submit(value),
97
- ])
98
- setCommand('')
99
- }
100
-
101
- const handleTelemetryChange = (payload) => {
102
- setTelemetry((previous) => ({
103
- ...previous,
104
- [payload.jobId]: payload,
105
- }))
106
- }
107
-
108
- const qwenStats = buildRunStats(telemetry.qwen, comparisonRun?.issuedAt)
109
- const agentStats = buildRunStats(telemetry['rl-agent'], comparisonRun?.issuedAt)
110
-
111
- let comparisonHeadline = 'Send a shared command to compare runtime.'
112
- if (comparisonRun) {
113
- if (qwenStats.completionMs != null && agentStats.completionMs != null) {
114
- const fasterJob = qwenStats.completionMs <= agentStats.completionMs ? panes[0].title : panes[1].title
115
- const delta = Math.abs(qwenStats.completionMs - agentStats.completionMs)
116
- comparisonHeadline = `${fasterJob} finished ${formatMs(delta)} faster.`
117
- } else if (qwenStats.responseMs != null && agentStats.responseMs != null) {
118
- const fasterJob = qwenStats.responseMs <= agentStats.responseMs ? panes[0].title : panes[1].title
119
- const delta = Math.abs(qwenStats.responseMs - agentStats.responseMs)
120
- comparisonHeadline = `${fasterJob} responded ${formatMs(delta)} faster.`
121
- } else {
122
- comparisonHeadline = `Running shared command: ${comparisonRun.command}`
123
- }
124
- }
125
-
126
- return (
127
- <main className="desktop">
128
- <div className="desktop__glow" />
129
-
130
- <section className="comparison-bar">
131
- <div className="comparison-bar__copy">
132
- <span className="comparison-bar__eyebrow">Runtime compare</span>
133
- <strong>{comparisonHeadline}</strong>
134
- <small>{comparisonRun ? `Command: ${comparisonRun.command}` : 'Broadcast one command to both panes.'}</small>
135
- </div>
136
-
137
- <div className="comparison-bar__stats">
138
- <article className="comparison-card comparison-card--cyan">
139
- <span>{panes[0].title}</span>
140
- <strong>{formatMs(qwenStats.completionMs ?? qwenStats.responseMs)}</strong>
141
- <small>{qwenStats.completionMs != null ? 'completion time' : 'first output latency'}</small>
142
- </article>
143
-
144
- <article className="comparison-card comparison-card--green">
145
- <span>{panes[1].title}</span>
146
- <strong>{formatMs(agentStats.completionMs ?? agentStats.responseMs)}</strong>
147
- <small>{agentStats.completionMs != null ? 'completion time' : 'first output latency'}</small>
148
- </article>
149
- </div>
150
- </section>
151
-
152
- <section ref={workspaceRef} className="workspace">
153
- <div className="workspace__pane" style={{ width: `${split}%` }}>
154
- <TerminalPane ref={leftPaneRef} {...panes[0]} onTelemetryChange={handleTelemetryChange} />
155
- </div>
156
-
157
- <button
158
- type="button"
159
- className={dragging ? 'workspace__divider is-dragging' : 'workspace__divider'}
160
- onPointerDown={() => setDragging(true)}
161
- aria-label="Resize terminal panes"
162
- aria-valuemin={25}
163
- aria-valuemax={75}
164
- aria-valuenow={Math.round(split)}
165
- aria-orientation="vertical"
166
- >
167
- <span />
168
- </button>
169
-
170
- <div className="workspace__pane" style={{ width: `${100 - split}%` }}>
171
- <TerminalPane ref={rightPaneRef} {...panes[1]} onTelemetryChange={handleTelemetryChange} />
172
- </div>
173
- </section>
174
-
175
- <form className="broadcast-bar" onSubmit={handleBroadcast}>
176
- <label className="broadcast-bar__label" htmlFor="broadcast-input">
177
- Shared input
178
- </label>
179
- <div className="broadcast-bar__field">
180
- <span className="broadcast-bar__prompt">$</span>
181
- <input
182
- id="broadcast-input"
183
- value={command}
184
- onChange={(event) => setCommand(event.target.value)}
185
- placeholder="Send the same command to both terminals"
186
- spellCheck="false"
187
- />
188
- <button type="submit">Send to both</button>
189
- </div>
190
- </form>
191
- </main>
192
- )
193
- }
194
-
195
- export default App
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
frontend/src/api/terminal.js DELETED
@@ -1,57 +0,0 @@
1
- const RAW_API_BASE = import.meta.env.VITE_API_BASE_URL || 'http://127.0.0.1:8000'
2
- const API_BASE = RAW_API_BASE.replace(/\/+$/, '')
3
- const WS_BASE = API_BASE.replace(/^http/, 'ws')
4
-
5
- async function request(path, options = {}) {
6
- const response = await fetch(`${API_BASE}${path}`, {
7
- headers: {
8
- 'Content-Type': 'application/json',
9
- ...(options.headers || {}),
10
- },
11
- ...options,
12
- })
13
-
14
- if (!response.ok) {
15
- let message = `Request failed with status ${response.status}`
16
- try {
17
- const payload = await response.json()
18
- message = payload.detail || message
19
- } catch {
20
- // Keep default error message when the payload is not JSON.
21
- }
22
- throw new Error(message)
23
- }
24
-
25
- return response.json()
26
- }
27
-
28
- export function createOrAttachSession(jobId, { restart = false } = {}) {
29
- return request('/terminal/sessions', {
30
- method: 'POST',
31
- body: JSON.stringify({ job_id: jobId, restart }),
32
- })
33
- }
34
-
35
- export function stopTerminalSession(sessionId) {
36
- return request(`/terminal/sessions/${sessionId}/stop`, {
37
- method: 'POST',
38
- })
39
- }
40
-
41
- export function sendTerminalInput(sessionId, data, appendNewline = true) {
42
- return request(`/terminal/sessions/${sessionId}/input`, {
43
- method: 'POST',
44
- body: JSON.stringify({ data, append_newline: appendNewline }),
45
- })
46
- }
47
-
48
- export function resizeTerminalSession(sessionId, cols, rows) {
49
- return request(`/terminal/sessions/${sessionId}/resize`, {
50
- method: 'POST',
51
- body: JSON.stringify({ cols, rows }),
52
- })
53
- }
54
-
55
- export function openTerminalSocket(sessionId) {
56
- return new WebSocket(`${WS_BASE}/terminal/sessions/${sessionId}/stream`)
57
- }
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
frontend/src/assets/react.svg DELETED
frontend/src/components/TerminalPane.jsx DELETED
@@ -1,147 +0,0 @@
1
- import { forwardRef, useEffect, useImperativeHandle, useMemo, useRef } from 'react'
2
- import { useTerminalSession } from '../hooks/useTerminalSession'
3
-
4
- function formatTime(timestamp) {
5
- if (!timestamp) {
6
- return 'Idle'
7
- }
8
- return new Date(timestamp * 1000).toLocaleTimeString([], {
9
- hour: '2-digit',
10
- minute: '2-digit',
11
- second: '2-digit',
12
- })
13
- }
14
-
15
- function statusLabel(status) {
16
- if (status === 'running') {
17
- return 'Running'
18
- }
19
- if (status === 'failed') {
20
- return 'Failed'
21
- }
22
- if (status === 'exited') {
23
- return 'Completed'
24
- }
25
- return 'Starting'
26
- }
27
-
28
- const TerminalPane = forwardRef(function TerminalPane({ jobId, title, tone, onTelemetryChange }, ref) {
29
- const { session, buffer, connectionState, error, lastOutputAt, restart, resize, sendInput, start, stop } =
30
- useTerminalSession(jobId)
31
- const viewportRef = useRef(null)
32
- const scrollRef = useRef(null)
33
-
34
- useImperativeHandle(
35
- ref,
36
- () => ({
37
- submit: async (value) => {
38
- await sendInput(value, true)
39
- },
40
- }),
41
- [sendInput],
42
- )
43
-
44
- useEffect(() => {
45
- onTelemetryChange?.({
46
- jobId,
47
- session,
48
- connectionState,
49
- error,
50
- lastOutputAt,
51
- })
52
- }, [connectionState, error, jobId, lastOutputAt, onTelemetryChange, session])
53
-
54
- useEffect(() => {
55
- const container = scrollRef.current
56
- if (container) {
57
- container.scrollTop = container.scrollHeight
58
- }
59
- }, [buffer])
60
-
61
- useEffect(() => {
62
- const element = viewportRef.current
63
- if (!element) {
64
- return undefined
65
- }
66
-
67
- let frameId = 0
68
- const measure = () => {
69
- cancelAnimationFrame(frameId)
70
- frameId = requestAnimationFrame(() => {
71
- const style = getComputedStyle(element)
72
- const fontSize = parseFloat(style.fontSize) || 15
73
- const lineHeight = parseFloat(style.lineHeight) || 24
74
- const cols = Math.max(48, Math.floor(element.clientWidth / (fontSize * 0.61)))
75
- const rows = Math.max(14, Math.floor(element.clientHeight / lineHeight))
76
- resize(cols, rows)
77
- })
78
- }
79
-
80
- measure()
81
- const observer = new ResizeObserver(measure)
82
- observer.observe(element)
83
-
84
- return () => {
85
- cancelAnimationFrame(frameId)
86
- observer.disconnect()
87
- }
88
- }, [resize])
89
-
90
- const footerMeta = useMemo(
91
- () => [
92
- session?.status ? statusLabel(session.status) : 'Connecting',
93
- session?.started_at ? `Started ${formatTime(session.started_at)}` : null,
94
- session?.exit_code != null ? `Exit ${session.exit_code}` : null,
95
- connectionState === 'connected' ? 'WS live' : connectionState,
96
- ].filter(Boolean),
97
- [connectionState, session],
98
- )
99
-
100
- return (
101
- <article className={`terminal-pane terminal-pane--${tone}`}>
102
- <header className="terminal-pane__header">
103
- <div className="terminal-pane__heading">
104
- <div className="terminal-pane__title-row">
105
- <span className="terminal-pane__dot" />
106
- <h2>{title}</h2>
107
- <span className={`status-chip status-chip--${session?.status || 'starting'}`}>
108
- {statusLabel(session?.status)}
109
- </span>
110
- </div>
111
- <p>{session?.command || 'Waiting for backend session...'}</p>
112
- <small>{session?.cwd || 'No working directory available yet.'}</small>
113
- </div>
114
-
115
- <div className="terminal-pane__actions">
116
- <button type="button" onClick={start}>
117
- Attach
118
- </button>
119
- <button type="button" onClick={restart}>
120
- Restart
121
- </button>
122
- <button type="button" onClick={stop}>
123
- Stop
124
- </button>
125
- </div>
126
- </header>
127
-
128
- <div ref={viewportRef} className="terminal-pane__viewport">
129
- <div ref={scrollRef} className="terminal-pane__scroll">
130
- <pre className="terminal-pane__buffer">{buffer || 'Starting session...\n'}</pre>
131
- {session?.status === 'running' ? <span className="terminal-pane__cursor" aria-hidden="true" /> : null}
132
- </div>
133
- </div>
134
-
135
- <footer className="terminal-pane__footer">
136
- <div className="terminal-pane__meta">
137
- {footerMeta.map((item) => (
138
- <span key={item}>{item}</span>
139
- ))}
140
- {error ? <span className="terminal-pane__error">{error}</span> : null}
141
- </div>
142
- </footer>
143
- </article>
144
- )
145
- })
146
-
147
- export default TerminalPane
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
frontend/src/hooks/useTerminalSession.js DELETED
@@ -1,166 +0,0 @@
1
- import { useCallback, useEffect, useRef, useState } from 'react'
2
- import {
3
- createOrAttachSession,
4
- openTerminalSocket,
5
- resizeTerminalSession,
6
- sendTerminalInput,
7
- stopTerminalSession,
8
- } from '../api/terminal'
9
-
10
- const BUFFER_LIMIT = 160000
11
-
12
- function trimBuffer(text) {
13
- return text.length > BUFFER_LIMIT ? text.slice(-BUFFER_LIMIT) : text
14
- }
15
-
16
- export function useTerminalSession(jobId) {
17
- const [session, setSession] = useState(null)
18
- const [buffer, setBuffer] = useState('')
19
- const [connectionState, setConnectionState] = useState('connecting')
20
- const [error, setError] = useState('')
21
- const [lastOutputAt, setLastOutputAt] = useState(null)
22
-
23
- const socketRef = useRef(null)
24
- const resizeRef = useRef({ cols: null, rows: null })
25
-
26
- const attachSocket = useCallback((sessionId) => {
27
- if (socketRef.current) {
28
- socketRef.current.close()
29
- }
30
-
31
- const socket = openTerminalSocket(sessionId)
32
- socketRef.current = socket
33
- setConnectionState('connecting')
34
-
35
- socket.addEventListener('open', () => {
36
- setConnectionState('connected')
37
- })
38
-
39
- socket.addEventListener('message', (event) => {
40
- const payload = JSON.parse(event.data)
41
-
42
- if (payload.type === 'snapshot') {
43
- setSession(payload.session)
44
- setBuffer(payload.buffer || '')
45
- return
46
- }
47
-
48
- if (payload.type === 'output') {
49
- setLastOutputAt(Date.now())
50
- setBuffer((previous) => trimBuffer(previous + payload.data))
51
- return
52
- }
53
-
54
- if (payload.type === 'exit') {
55
- setSession((previous) =>
56
- previous
57
- ? {
58
- ...previous,
59
- status: payload.status,
60
- exit_code: payload.exit_code,
61
- finished_at: payload.finished_at,
62
- }
63
- : previous,
64
- )
65
- }
66
- })
67
-
68
- socket.addEventListener('close', () => {
69
- setConnectionState('disconnected')
70
- })
71
-
72
- socket.addEventListener('error', () => {
73
- setConnectionState('error')
74
- })
75
- }, [])
76
-
77
- const bootSession = useCallback(
78
- async (restart = false) => {
79
- try {
80
- setError('')
81
- const payload = await createOrAttachSession(jobId, { restart })
82
- setSession(payload.session)
83
- setBuffer(payload.buffer || '')
84
- attachSocket(payload.session.id)
85
- } catch (caughtError) {
86
- setError(caughtError.message)
87
- setConnectionState('error')
88
- }
89
- },
90
- [attachSocket, jobId],
91
- )
92
-
93
- useEffect(() => {
94
- const timeoutId = window.setTimeout(() => {
95
- void bootSession(false)
96
- }, 0)
97
-
98
- return () => {
99
- window.clearTimeout(timeoutId)
100
- if (socketRef.current) {
101
- socketRef.current.close()
102
- }
103
- }
104
- }, [bootSession])
105
-
106
- const restart = useCallback(() => bootSession(true), [bootSession])
107
-
108
- const stop = useCallback(async () => {
109
- if (!session?.id) {
110
- return
111
- }
112
- try {
113
- await stopTerminalSession(session.id)
114
- } catch (caughtError) {
115
- setError(caughtError.message)
116
- }
117
- }, [session])
118
-
119
- const sendInput = useCallback(
120
- async (value, appendNewline = true) => {
121
- if (!session?.id || !value.trim()) {
122
- return
123
- }
124
- try {
125
- await sendTerminalInput(session.id, value, appendNewline)
126
- } catch (caughtError) {
127
- setError(caughtError.message)
128
- }
129
- },
130
- [session],
131
- )
132
-
133
- const resize = useCallback(
134
- async (cols, rows) => {
135
- if (!session?.id) {
136
- return
137
- }
138
-
139
- const previous = resizeRef.current
140
- if (previous.cols === cols && previous.rows === rows) {
141
- return
142
- }
143
- resizeRef.current = { cols, rows }
144
-
145
- try {
146
- await resizeTerminalSession(session.id, cols, rows)
147
- } catch {
148
- // Ignore resize errors so rendering stays responsive.
149
- }
150
- },
151
- [session],
152
- )
153
-
154
- return {
155
- buffer,
156
- connectionState,
157
- error,
158
- lastOutputAt,
159
- restart,
160
- resize,
161
- sendInput,
162
- session,
163
- start: () => bootSession(false),
164
- stop,
165
- }
166
- }
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
frontend/src/index.css DELETED
@@ -1,486 +0,0 @@
1
- :root {
2
- color: #ebeff5;
3
- background:
4
- radial-gradient(circle at top, rgba(88, 108, 146, 0.16), transparent 24%),
5
- linear-gradient(180deg, #0d1015 0%, #090b0f 100%);
6
- font-synthesis: none;
7
- text-rendering: optimizeLegibility;
8
- -webkit-font-smoothing: antialiased;
9
- -moz-osx-font-smoothing: grayscale;
10
- --page: #090b0f;
11
- --pane: rgba(14, 17, 22, 0.98);
12
- --pane-border: rgba(255, 255, 255, 0.07);
13
- --pane-soft: rgba(255, 255, 255, 0.03);
14
- --text: #ebeff5;
15
- --muted: #8f97a6;
16
- --cyan: #8fd1ff;
17
- --green: #93e09f;
18
- --red: #ff7f8c;
19
- --amber: #e6c171;
20
- }
21
-
22
- * {
23
- box-sizing: border-box;
24
- }
25
-
26
- html,
27
- body,
28
- #root {
29
- min-height: 100%;
30
- }
31
-
32
- body {
33
- margin: 0;
34
- min-width: 320px;
35
- background: var(--page);
36
- color: var(--text);
37
- font-family: "IBM Plex Sans", "SF Pro Display", "Segoe UI", sans-serif;
38
- }
39
-
40
- button,
41
- input {
42
- font: inherit;
43
- }
44
-
45
- button {
46
- cursor: pointer;
47
- }
48
-
49
- .desktop {
50
- position: relative;
51
- min-height: 100vh;
52
- overflow: hidden;
53
- }
54
-
55
- .desktop__glow {
56
- position: absolute;
57
- inset: 0;
58
- background:
59
- radial-gradient(circle at 30% 0%, rgba(109, 132, 180, 0.14), transparent 20%),
60
- radial-gradient(circle at 100% 0%, rgba(70, 110, 168, 0.12), transparent 16%);
61
- pointer-events: none;
62
- }
63
-
64
- .workspace {
65
- position: relative;
66
- z-index: 1;
67
- display: flex;
68
- width: 100vw;
69
- height: calc(100vh - 194px);
70
- background: rgba(7, 10, 13, 0.98);
71
- }
72
-
73
- .comparison-bar {
74
- position: relative;
75
- z-index: 1;
76
- display: flex;
77
- align-items: stretch;
78
- justify-content: space-between;
79
- gap: 18px;
80
- padding: 14px 18px;
81
- border-bottom: 1px solid var(--pane-border);
82
- background:
83
- linear-gradient(180deg, rgba(16, 19, 25, 0.98), rgba(11, 13, 18, 0.98));
84
- }
85
-
86
- .comparison-bar__copy {
87
- display: grid;
88
- gap: 5px;
89
- min-width: 0;
90
- }
91
-
92
- .comparison-bar__copy strong,
93
- .comparison-bar__copy small {
94
- overflow: hidden;
95
- text-overflow: ellipsis;
96
- white-space: nowrap;
97
- }
98
-
99
- .comparison-bar__eyebrow {
100
- color: var(--muted);
101
- font:
102
- 500 0.74rem/1 "JetBrains Mono", "SFMono-Regular", Consolas, monospace;
103
- text-transform: uppercase;
104
- letter-spacing: 0.14em;
105
- }
106
-
107
- .comparison-bar__copy strong {
108
- font-size: 1rem;
109
- font-weight: 600;
110
- }
111
-
112
- .comparison-bar__copy small {
113
- color: var(--muted);
114
- }
115
-
116
- .comparison-bar__stats {
117
- display: flex;
118
- gap: 12px;
119
- }
120
-
121
- .comparison-card {
122
- display: grid;
123
- gap: 4px;
124
- min-width: 170px;
125
- padding: 12px 14px;
126
- border: 1px solid var(--pane-border);
127
- border-radius: 14px;
128
- background: rgba(255, 255, 255, 0.03);
129
- }
130
-
131
- .comparison-card span,
132
- .comparison-card small {
133
- color: var(--muted);
134
- }
135
-
136
- .comparison-card strong {
137
- font:
138
- 600 1.2rem/1.1 "JetBrains Mono", "SFMono-Regular", Consolas, monospace;
139
- }
140
-
141
- .comparison-card--cyan strong {
142
- color: var(--cyan);
143
- }
144
-
145
- .comparison-card--green strong {
146
- color: var(--green);
147
- }
148
-
149
- .workspace__pane {
150
- min-width: 0;
151
- height: 100%;
152
- }
153
-
154
- .workspace__divider {
155
- position: relative;
156
- flex: 0 0 12px;
157
- border: 0;
158
- padding: 0;
159
- background: linear-gradient(180deg, rgba(255, 255, 255, 0.04), rgba(255, 255, 255, 0.015));
160
- cursor: col-resize;
161
- }
162
-
163
- .workspace__divider span {
164
- position: absolute;
165
- top: 50%;
166
- left: 50%;
167
- width: 3px;
168
- height: 72px;
169
- border-radius: 999px;
170
- background: rgba(255, 255, 255, 0.22);
171
- transform: translate(-50%, -50%);
172
- }
173
-
174
- .workspace__divider:hover span,
175
- .workspace__divider.is-dragging span {
176
- background: rgba(143, 209, 255, 0.8);
177
- }
178
-
179
- .workspace__divider:focus-visible {
180
- outline: none;
181
- box-shadow: inset 0 0 0 1px rgba(143, 209, 255, 0.5);
182
- }
183
-
184
- .terminal-pane {
185
- display: grid;
186
- grid-template-rows: auto minmax(0, 1fr) auto;
187
- width: 100%;
188
- height: 100%;
189
- background:
190
- linear-gradient(180deg, rgba(255, 255, 255, 0.015), transparent 10%),
191
- var(--pane);
192
- }
193
-
194
- .terminal-pane + .terminal-pane {
195
- border-left: 1px solid var(--pane-border);
196
- }
197
-
198
- .terminal-pane__header {
199
- display: flex;
200
- align-items: flex-start;
201
- justify-content: space-between;
202
- gap: 18px;
203
- padding: 18px 18px 14px;
204
- border-bottom: 1px solid var(--pane-border);
205
- background: rgba(255, 255, 255, 0.02);
206
- }
207
-
208
- .terminal-pane__heading {
209
- min-width: 0;
210
- }
211
-
212
- .terminal-pane__title-row {
213
- display: flex;
214
- align-items: center;
215
- gap: 10px;
216
- margin-bottom: 6px;
217
- }
218
-
219
- .terminal-pane__title-row h2 {
220
- margin: 0;
221
- font-size: 1rem;
222
- font-weight: 600;
223
- letter-spacing: 0.01em;
224
- }
225
-
226
- .terminal-pane__heading p,
227
- .terminal-pane__heading small {
228
- display: block;
229
- margin: 0;
230
- overflow: hidden;
231
- text-overflow: ellipsis;
232
- white-space: nowrap;
233
- }
234
-
235
- .terminal-pane__heading p {
236
- color: var(--text);
237
- font:
238
- 500 0.84rem/1.4 "JetBrains Mono", "SFMono-Regular", Consolas, monospace;
239
- }
240
-
241
- .terminal-pane__heading small {
242
- margin-top: 4px;
243
- color: var(--muted);
244
- }
245
-
246
- .terminal-pane__dot {
247
- width: 9px;
248
- height: 9px;
249
- border-radius: 999px;
250
- background: var(--cyan);
251
- box-shadow: 0 0 24px rgba(143, 209, 255, 0.35);
252
- }
253
-
254
- .terminal-pane--green .terminal-pane__dot {
255
- background: var(--green);
256
- box-shadow: 0 0 24px rgba(147, 224, 159, 0.35);
257
- }
258
-
259
- .terminal-pane__actions {
260
- display: flex;
261
- gap: 8px;
262
- }
263
-
264
- .terminal-pane__actions button {
265
- padding: 8px 10px;
266
- border: 1px solid var(--pane-border);
267
- border-radius: 10px;
268
- background: rgba(255, 255, 255, 0.03);
269
- color: var(--text);
270
- }
271
-
272
- .terminal-pane__actions button:hover {
273
- background: rgba(255, 255, 255, 0.06);
274
- }
275
-
276
- .status-chip {
277
- padding: 4px 8px;
278
- border-radius: 999px;
279
- font-size: 0.7rem;
280
- text-transform: uppercase;
281
- letter-spacing: 0.12em;
282
- border: 1px solid transparent;
283
- }
284
-
285
- .status-chip--running {
286
- color: var(--green);
287
- border-color: rgba(147, 224, 159, 0.24);
288
- background: rgba(147, 224, 159, 0.08);
289
- }
290
-
291
- .status-chip--failed {
292
- color: var(--red);
293
- border-color: rgba(255, 127, 140, 0.25);
294
- background: rgba(255, 127, 140, 0.08);
295
- }
296
-
297
- .status-chip--exited {
298
- color: var(--amber);
299
- border-color: rgba(230, 193, 113, 0.24);
300
- background: rgba(230, 193, 113, 0.08);
301
- }
302
-
303
- .status-chip--starting {
304
- color: var(--cyan);
305
- border-color: rgba(143, 209, 255, 0.24);
306
- background: rgba(143, 209, 255, 0.08);
307
- }
308
-
309
- .terminal-pane__viewport {
310
- min-height: 0;
311
- overflow: hidden;
312
- font:
313
- 500 0.95rem/1.72 "JetBrains Mono", "SFMono-Regular", Consolas, monospace;
314
- }
315
-
316
- .terminal-pane__scroll {
317
- height: 100%;
318
- overflow: auto;
319
- padding: 18px 18px 20px;
320
- }
321
-
322
- .terminal-pane__buffer {
323
- margin: 0;
324
- white-space: pre-wrap;
325
- word-break: break-word;
326
- color: #edf2fa;
327
- }
328
-
329
- .terminal-pane__cursor {
330
- display: inline-block;
331
- width: 0.62em;
332
- height: 1.05em;
333
- margin-top: 4px;
334
- background: rgba(237, 242, 250, 0.9);
335
- animation: blink 1s steps(1) infinite;
336
- }
337
-
338
- .terminal-pane__footer {
339
- display: grid;
340
- padding: 14px 18px 16px;
341
- border-top: 1px solid var(--pane-border);
342
- background: rgba(255, 255, 255, 0.02);
343
- }
344
-
345
- .terminal-pane__meta {
346
- display: flex;
347
- flex-wrap: wrap;
348
- gap: 10px 14px;
349
- color: var(--muted);
350
- font:
351
- 500 0.76rem/1.3 "JetBrains Mono", "SFMono-Regular", Consolas, monospace;
352
- }
353
-
354
- .terminal-pane__error {
355
- color: var(--red);
356
- }
357
-
358
- .broadcast-bar {
359
- position: relative;
360
- z-index: 1;
361
- display: grid;
362
- gap: 8px;
363
- padding: 14px 18px 18px;
364
- border-top: 1px solid var(--pane-border);
365
- background:
366
- linear-gradient(180deg, rgba(18, 21, 27, 0.98), rgba(10, 12, 16, 0.98));
367
- }
368
-
369
- .broadcast-bar__label {
370
- color: var(--muted);
371
- font:
372
- 500 0.78rem/1 "JetBrains Mono", "SFMono-Regular", Consolas, monospace;
373
- text-transform: uppercase;
374
- letter-spacing: 0.12em;
375
- }
376
-
377
- .broadcast-bar__field {
378
- display: flex;
379
- align-items: center;
380
- gap: 12px;
381
- padding: 14px 16px;
382
- border: 1px solid var(--pane-border);
383
- border-radius: 14px;
384
- background: rgba(255, 255, 255, 0.03);
385
- box-shadow: inset 0 1px 0 rgba(255, 255, 255, 0.02);
386
- }
387
-
388
- .broadcast-bar__prompt {
389
- color: var(--cyan);
390
- font:
391
- 600 0.92rem/1 "JetBrains Mono", "SFMono-Regular", Consolas, monospace;
392
- }
393
-
394
- .broadcast-bar__field input {
395
- flex: 1;
396
- min-width: 0;
397
- border: 0;
398
- background: transparent;
399
- color: var(--text);
400
- outline: none;
401
- font:
402
- 500 0.95rem/1.2 "JetBrains Mono", "SFMono-Regular", Consolas, monospace;
403
- }
404
-
405
- .broadcast-bar__field input::placeholder {
406
- color: var(--muted);
407
- }
408
-
409
- .broadcast-bar__field button {
410
- padding: 10px 14px;
411
- border: 1px solid rgba(143, 209, 255, 0.18);
412
- border-radius: 10px;
413
- background: rgba(143, 209, 255, 0.08);
414
- color: var(--text);
415
- }
416
-
417
- .broadcast-bar__field button:hover {
418
- background: rgba(143, 209, 255, 0.14);
419
- }
420
-
421
- @keyframes blink {
422
- 50% {
423
- opacity: 0;
424
- }
425
- }
426
-
427
- @media (max-width: 980px) {
428
- .comparison-bar {
429
- flex-direction: column;
430
- }
431
-
432
- .comparison-bar__stats {
433
- width: 100%;
434
- }
435
-
436
- .comparison-card {
437
- flex: 1;
438
- min-width: 0;
439
- }
440
-
441
- .terminal-pane__header {
442
- flex-direction: column;
443
- }
444
-
445
- .terminal-pane__actions {
446
- width: 100%;
447
- }
448
-
449
- .terminal-pane__actions button {
450
- flex: 1;
451
- }
452
- }
453
-
454
- @media (max-width: 780px) {
455
- .workspace {
456
- height: calc(100vh - 244px);
457
- }
458
-
459
- .workspace__divider {
460
- flex-basis: 10px;
461
- }
462
-
463
- .terminal-pane__scroll {
464
- padding: 16px;
465
- }
466
-
467
- .terminal-pane__viewport {
468
- font-size: 0.88rem;
469
- }
470
-
471
- .terminal-pane__footer {
472
- padding: 12px 14px 14px;
473
- }
474
-
475
- .broadcast-bar {
476
- padding: 12px 14px 14px;
477
- }
478
-
479
- .broadcast-bar__field {
480
- padding: 12px 14px;
481
- }
482
-
483
- .comparison-bar {
484
- padding: 12px 14px;
485
- }
486
- }
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
frontend/src/main.jsx DELETED
@@ -1,10 +0,0 @@
1
- import { StrictMode } from 'react'
2
- import { createRoot } from 'react-dom/client'
3
- import './index.css'
4
- import App from './App.jsx'
5
-
6
- createRoot(document.getElementById('root')).render(
7
- <StrictMode>
8
- <App />
9
- </StrictMode>,
10
- )
 
 
 
 
 
 
 
 
 
 
 
frontend/vite.config.js DELETED
@@ -1,7 +0,0 @@
1
- import { defineConfig } from 'vite'
2
- import react from '@vitejs/plugin-react'
3
-
4
- // https://vite.dev/config/
5
- export default defineConfig({
6
- plugins: [react()],
7
- })
 
 
 
 
 
 
 
 
index.html ADDED
@@ -0,0 +1,317 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+ <head>
4
+ <meta charset="UTF-8">
5
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
6
+ <title>RL Surrogate Kernel Autotuning</title>
7
+ <link rel="stylesheet" href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700&family=JetBrains+Mono:wght@400;500&display=swap">
8
+ <style>
9
+ :root {
10
+ --primary: #2563eb;
11
+ --primary-dark: #1d4ed8;
12
+ --secondary: #10b981;
13
+ --accent: #f59e0b;
14
+ --bg: #0f172a;
15
+ --bg-card: #1e293b;
16
+ --text: #f1f5f9;
17
+ --text-muted: #94a3b8;
18
+ --border: #334155;
19
+ }
20
+ * {
21
+ margin: 0;
22
+ padding: 0;
23
+ box-sizing: border-box;
24
+ }
25
+ body {
26
+ font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
27
+ background: var(--bg);
28
+ color: var(--text);
29
+ line-height: 1.6;
30
+ min-height: 100vh;
31
+ }
32
+ .container {
33
+ max-width: 1200px;
34
+ margin: 0 auto;
35
+ padding: 2rem;
36
+ }
37
+ header {
38
+ text-align: center;
39
+ padding: 4rem 0;
40
+ position: relative;
41
+ }
42
+ header::before {
43
+ content: '';
44
+ position: absolute;
45
+ top: 0;
46
+ left: 50%;
47
+ transform: translateX(-50%);
48
+ width: 600px;
49
+ height: 600px;
50
+ background: radial-gradient(circle, rgba(37, 99, 235, 0.15) 0%, transparent 70%);
51
+ pointer-events: none;
52
+ }
53
+ h1 {
54
+ font-size: 3rem;
55
+ font-weight: 700;
56
+ margin-bottom: 1rem;
57
+ background: linear-gradient(135deg, #60a5fa 0%, #a78bfa 100%);
58
+ -webkit-background-clip: text;
59
+ -webkit-text-fill-color: transparent;
60
+ background-clip: text;
61
+ }
62
+ .subtitle {
63
+ font-size: 1.25rem;
64
+ color: var(--text-muted);
65
+ max-width: 600px;
66
+ margin: 0 auto;
67
+ }
68
+ .badge {
69
+ display: inline-block;
70
+ padding: 0.5rem 1rem;
71
+ background: var(--primary);
72
+ border-radius: 9999px;
73
+ font-size: 0.875rem;
74
+ font-weight: 500;
75
+ margin-top: 1.5rem;
76
+ }
77
+ .grid {
78
+ display: grid;
79
+ grid-template-columns: repeat(auto-fit, minmax(300px, 1fr));
80
+ gap: 1.5rem;
81
+ margin: 3rem 0;
82
+ }
83
+ .card {
84
+ background: var(--bg-card);
85
+ border: 1px solid var(--border);
86
+ border-radius: 1rem;
87
+ padding: 1.5rem;
88
+ transition: transform 0.2s, box-shadow 0.2s;
89
+ }
90
+ .card:hover {
91
+ transform: translateY(-2px);
92
+ box-shadow: 0 10px 40px rgba(0, 0, 0, 0.3);
93
+ }
94
+ .card-icon {
95
+ width: 48px;
96
+ height: 48px;
97
+ border-radius: 0.75rem;
98
+ display: flex;
99
+ align-items: center;
100
+ justify-content: center;
101
+ font-size: 1.5rem;
102
+ margin-bottom: 1rem;
103
+ }
104
+ .card-icon.blue { background: rgba(37, 99, 235, 0.2); }
105
+ .card-icon.green { background: rgba(16, 185, 129, 0.2); }
106
+ .card-icon.amber { background: rgba(245, 158, 11, 0.2); }
107
+ .card-icon.purple { background: rgba(139, 92, 246, 0.2); }
108
+ .card h3 {
109
+ font-size: 1.125rem;
110
+ font-weight: 600;
111
+ margin-bottom: 0.5rem;
112
+ }
113
+ .card p {
114
+ color: var(--text-muted);
115
+ font-size: 0.875rem;
116
+ }
117
+ .features {
118
+ margin: 3rem 0;
119
+ }
120
+ .features h2 {
121
+ font-size: 1.75rem;
122
+ margin-bottom: 1.5rem;
123
+ text-align: center;
124
+ }
125
+ .feature-list {
126
+ display: grid;
127
+ grid-template-columns: repeat(auto-fit, minmax(280px, 1fr));
128
+ gap: 1rem;
129
+ }
130
+ .feature-item {
131
+ display: flex;
132
+ align-items: center;
133
+ gap: 0.75rem;
134
+ padding: 1rem;
135
+ background: var(--bg-card);
136
+ border-radius: 0.75rem;
137
+ border: 1px solid var(--border);
138
+ }
139
+ .feature-item span {
140
+ color: var(--secondary);
141
+ font-size: 1.25rem;
142
+ }
143
+ .code-block {
144
+ background: #0d1117;
145
+ border: 1px solid var(--border);
146
+ border-radius: 0.75rem;
147
+ padding: 1.5rem;
148
+ margin: 2rem 0;
149
+ overflow-x: auto;
150
+ }
151
+ .code-block pre {
152
+ font-family: 'JetBrains Mono', monospace;
153
+ font-size: 0.875rem;
154
+ color: #e2e8f0;
155
+ white-space: pre;
156
+ }
157
+ .code-block .comment { color: #6b7280; }
158
+ .code-block .keyword { color: #c084fc; }
159
+ .code-block .string { color: #34d399; }
160
+ .code-block .number { color: #f472b6; }
161
+ .cta {
162
+ text-align: center;
163
+ padding: 3rem;
164
+ background: var(--bg-card);
165
+ border-radius: 1rem;
166
+ border: 1px solid var(--border);
167
+ margin-top: 3rem;
168
+ }
169
+ .cta h2 {
170
+ font-size: 1.5rem;
171
+ margin-bottom: 1rem;
172
+ }
173
+ .cta p {
174
+ color: var(--text-muted);
175
+ margin-bottom: 1.5rem;
176
+ }
177
+ .btn {
178
+ display: inline-block;
179
+ padding: 0.75rem 1.5rem;
180
+ background: var(--primary);
181
+ color: white;
182
+ text-decoration: none;
183
+ border-radius: 0.5rem;
184
+ font-weight: 500;
185
+ transition: background 0.2s;
186
+ }
187
+ .btn:hover {
188
+ background: var(--primary-dark);
189
+ }
190
+ footer {
191
+ text-align: center;
192
+ padding: 2rem;
193
+ color: var(--text-muted);
194
+ font-size: 0.875rem;
195
+ border-top: 1px solid var(--border);
196
+ margin-top: 4rem;
197
+ }
198
+ .kernel-families {
199
+ display: flex;
200
+ justify-content: center;
201
+ gap: 1rem;
202
+ margin-top: 2rem;
203
+ flex-wrap: wrap;
204
+ }
205
+ .kernel-tag {
206
+ padding: 0.5rem 1rem;
207
+ background: rgba(139, 92, 246, 0.2);
208
+ border: 1px solid rgba(139, 92, 246, 0.3);
209
+ border-radius: 0.5rem;
210
+ font-family: 'JetBrains Mono', monospace;
211
+ font-size: 0.875rem;
212
+ }
213
+ </style>
214
+ </head>
215
+ <body>
216
+ <div class="container">
217
+ <header>
218
+ <h1>RL Surrogate Kernel Autotuning</h1>
219
+ <p class="subtitle">
220
+ A reinforcement learning environment for optimizing GPU kernels using surrogate models and self-improving AI agents
221
+ </p>
222
+ <span class="badge">🚀 HuggingFace Space</span>
223
+ </header>
224
+ <div class="grid">
225
+ <div class="card">
226
+ <div class="card-icon blue">🎯</div>
227
+ <h3>Multi-Family Support</h3>
228
+ <p>Optimizes three kernel families: Softmax, LayerNorm, and Grouped GEMM for NVIDIA GPUs</p>
229
+ </div>
230
+ <div class="card">
231
+ <div class="card-icon green">📊</div>
232
+ <h3>Surrogate Modeling</h3>
233
+ <p>Uses Gaussian Process surrogates to predict kernel performance and guide exploration</p>
234
+ </div>
235
+ <div class="card">
236
+ <div class="card-icon amber">🤖</div>
237
+ <h3>Self-Improving AI</h3>
238
+ <p>Employs student-teacher LLM architecture for autonomous kernel optimization</p>
239
+ </div>
240
+ <div class="card">
241
+ <div class="card-icon purple">⚡</div>
242
+ <h3>Triton Backend</h3>
243
+ <p>Generates optimized Triton kernels with configurable block sizes, warps, and stages</p>
244
+ </div>
245
+ </div>
246
+ <div class="features">
247
+ <h2>Supported Kernel Families</h2>
248
+ <div class="kernel-families">
249
+ <span class="kernel-tag">softmax</span>
250
+ <span class="kernel-tag">layernorm</span>
251
+ <span class="kernel-tag">grouped_gemm</span>
252
+ </div>
253
+ </div>
254
+ <div class="features">
255
+ <h2>How It Works</h2>
256
+ <div class="feature-list">
257
+ <div class="feature-item">
258
+ <span>1.</span>
259
+ <div><strong>Reset</strong> - Initialize environment with a kernel task (e.g., softmax_m4096_n256)</div>
260
+ </div>
261
+ <div class="feature-item">
262
+ <span>2.</span>
263
+ <div><strong>Configure</strong> - Select Triton config (block_size, num_warps, num_stages)</div>
264
+ </div>
265
+ <div class="feature-item">
266
+ <span>3.</span>
267
+ <div><strong>Step</strong> - Benchmark configuration and receive reward based on latency</div>
268
+ </div>
269
+ <div class="feature-item">
270
+ <span>4.</span>
271
+ <div><strong>Optimize</strong> - Use LLM to generate better kernel variants automatically</div>
272
+ </div>
273
+ <div class="feature-item">
274
+ <span>5.</span>
275
+ <div><strong>Learn</strong> - Surrogate model improves with more observations</div>
276
+ </div>
277
+ <div class="feature-item">
278
+ <span>6.</span>
279
+ <div><strong>Compare</strong> - Student vs Teacher LLM competition for best optimization</div>
280
+ </div>
281
+ </div>
282
+ </div>
283
+ <div class="code-block">
284
+ <pre><span class="comment"># Example: Using the RL Environment</span>
285
+ <span class="comment"># 1. Reset with a task</span>
286
+ POST /reset
287
+ {
288
+ <span class="string">"task"</span>: <span class="string">"softmax_m4096_n256"</span>,
289
+ <span class="string">"seed"</span>: <span class="number">0</span>
290
+ }
291
+ <span class="comment"># 2. Take a step with a config</span>
292
+ POST /step
293
+ {
294
+ <span class="string">"config_id"</span>: <span class="number">0</span>,
295
+ <span class="string">"source"</span>: <span class="string">"def benchmark_generated_kernel..."</span> <span class="comment"># optional kernel code</span>
296
+ }
297
+ <span class="comment"># Response includes reward based on latency improvement</span>
298
+ {
299
+ <span class="string">"reward"</span>: <span class="number">0.093</span>,
300
+ <span class="string">"observation"</span>: {
301
+ <span class="string">"best_so_far_ms"</span>: <span class="number">0.41</span>,
302
+ <span class="string">"oracle_best_ms"</span>: <span class="number">0.38</span>
303
+ }
304
+ }</pre>
305
+ </div>
306
+ <div class="cta">
307
+ <h2>Try It Now</h2>
308
+ <p>Use the interactive Gradio interface above to benchmark kernel configurations and explore the RL environment.</p>
309
+ <a href="/" class="btn">Launch Interface →</a>
310
+ </div>
311
+ <footer>
312
+ <p>RL Surrogate Kernel Autotuning Environment</p>
313
+ <p>Built with FastAPI, Gradio, Triton, and Transformers</p>
314
+ </footer>
315
+ </div>
316
+ </body>
317
+ </html>
models.py DELETED
@@ -1,27 +0,0 @@
1
- from __future__ import annotations
2
-
3
- from dataclasses import dataclass
4
- from typing import Any, Dict, List, Sequence
5
-
6
-
7
- @dataclass
8
- class Action:
9
- config_id: int
10
-
11
-
12
- @dataclass
13
- class StepResult:
14
- observation: Dict[str, Any]
15
- reward: float
16
- done: bool
17
- state: Dict[str, Any]
18
- info: Dict[str, Any]
19
-
20
-
21
- @dataclass
22
- class ResetResult:
23
- observation: Dict[str, Any]
24
- reward: float
25
- done: bool
26
- state: Dict[str, Any]
27
- info: Dict[str, Any]
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
openenv.yaml DELETED
@@ -1,15 +0,0 @@
1
- name: softmax_surrogate_env
2
- version: 0.1.0
3
- description: >
4
- OpenEnv scaffold for discrete-action multi-family GPU autotuning with real
5
- H100 measurements and held-out-shape / held-out-family evaluation.
6
- python:
7
- package: .
8
- server:
9
- entrypoint: server.app:app
10
- environment:
11
- class: server.softmax_surrogate_environment.SoftmaxSurrogateEnvironment
12
- init_kwargs:
13
- measurement_path: data/autotune_measurements.csv
14
- budget: 6
15
- seed: 0
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
requirements.txt DELETED
@@ -1,8 +0,0 @@
1
- torch>=2.4
2
- triton>=2.1
3
- numpy>=1.23
4
- fastapi>=0.111
5
- uvicorn>=0.30
6
- pydantic>=2.8
7
- requests>=2.32
8
-
 
 
 
 
 
 
 
 
 
scripts/__init__.py DELETED
@@ -1 +0,0 @@
1
- # Package marker for direct imports from the scripts folder.
 
 
scripts/benchmark_qwen_05b_runtime.py DELETED
@@ -1,158 +0,0 @@
1
- #!/usr/bin/env python3
2
- from __future__ import annotations
3
-
4
- import argparse
5
- import json
6
- import sys
7
- import time
8
- from pathlib import Path
9
- from typing import Any, Dict, Tuple
10
-
11
- import numpy as np
12
- import torch
13
-
14
- ROOT = Path(__file__).resolve().parents[1]
15
- if str(ROOT) not in sys.path:
16
- sys.path.append(str(ROOT))
17
-
18
- from scripts.collect_qwen_05b_measurements import EPS, benchmark_qwen_task
19
- from scripts.qwen_05b_spec import QwenKernelTask, qwen_05b_tasks
20
-
21
-
22
- TASK_BY_ID = {task.task_id: task for task in qwen_05b_tasks()}
23
-
24
-
25
- def _bench_callable(fn, args: Tuple[Any, ...], repeats: int, warmup: int) -> float:
26
- for _ in range(max(1, warmup)):
27
- fn(*args)
28
- torch.cuda.synchronize()
29
- start = torch.cuda.Event(enable_timing=True)
30
- end = torch.cuda.Event(enable_timing=True)
31
- durations = []
32
- for _ in range(max(1, repeats)):
33
- torch.cuda.synchronize()
34
- start.record()
35
- fn(*args)
36
- end.record()
37
- end.synchronize()
38
- durations.append(start.elapsed_time(end))
39
- return float(np.median(np.asarray(durations, dtype=np.float32)))
40
-
41
-
42
- def _build_qwen_callable(task: QwenKernelTask, seed: int):
43
- torch.manual_seed(seed)
44
- if task.family == "softmax":
45
- x = torch.randn((task.m, task.n), device="cuda", dtype=torch.float16)
46
-
47
- def fn(inp: torch.Tensor):
48
- return torch.softmax(inp, dim=-1)
49
-
50
- return fn, (x,)
51
-
52
- if task.family == "rmsnorm":
53
- x = torch.randn((task.m, task.n), device="cuda", dtype=torch.float16)
54
-
55
- def fn(inp: torch.Tensor):
56
- return inp.float() * torch.rsqrt(inp.float().pow(2).mean(dim=-1, keepdim=True) + EPS)
57
-
58
- return fn, (x,)
59
-
60
- if task.family == "gemm":
61
- a = torch.randn((task.m, task.k), device="cuda", dtype=torch.float16)
62
- b = torch.randn((task.k, task.n), device="cuda", dtype=torch.float16)
63
-
64
- def fn(lhs: torch.Tensor, rhs: torch.Tensor):
65
- return torch.matmul(lhs, rhs)
66
-
67
- return fn, (a, b)
68
-
69
- raise ValueError(f"Unsupported family: {task.family}")
70
-
71
-
72
- def _benchmark_torch(task: QwenKernelTask, seed: int, repeats: int, warmup: int) -> Dict[str, float]:
73
- eager_fn, args = _build_qwen_callable(task, seed)
74
- eager_latency_ms = _bench_callable(eager_fn, args, repeats=repeats, warmup=warmup)
75
-
76
- compiled_fn = torch.compile(eager_fn)
77
- torch.cuda.synchronize()
78
- start = time.perf_counter()
79
- compiled_fn(*args)
80
- torch.cuda.synchronize()
81
- compile_plus_first_call_ms = float((time.perf_counter() - start) * 1000.0)
82
- compiled_latency_ms = _bench_callable(compiled_fn, args, repeats=repeats, warmup=warmup)
83
- return {
84
- "eager_latency_ms": eager_latency_ms,
85
- "compile_plus_first_call_ms": compile_plus_first_call_ms,
86
- "compiled_latency_ms": compiled_latency_ms,
87
- }
88
-
89
-
90
- def _task_best_configs(eval_results: Dict[str, Any]) -> Dict[str, Dict[str, Dict[str, Any]]]:
91
- task_map: Dict[str, Dict[str, Dict[str, Any]]] = {}
92
- for section in eval_results["results"].values():
93
- for method in ("random", "surrogate"):
94
- for run in section["task_runs"][method]:
95
- task_map.setdefault(run["task"], {})[method] = run["best_overall"]["config"]
96
- return task_map
97
-
98
-
99
- def main() -> None:
100
- parser = argparse.ArgumentParser(description="Benchmark eager/torch.compile and best Triton configs for Qwen2.5-0.5B exact kernels.")
101
- parser.add_argument("--generalization-results", type=Path, default=Path("outputs/qwen_05b_generalization_eval.json"))
102
- parser.add_argument("--repeats", type=int, default=100)
103
- parser.add_argument("--warmup", type=int, default=10)
104
- parser.add_argument("--seed", type=int, default=123)
105
- parser.add_argument("--output", type=Path, default=Path("outputs/qwen_05b_runtime_references.json"))
106
- args = parser.parse_args()
107
-
108
- generalization_results = json.loads(args.generalization_results.read_text(encoding="utf-8"))
109
- task_configs = _task_best_configs(generalization_results)
110
-
111
- results = {}
112
- for idx, task_id in enumerate(sorted(task_configs.keys())):
113
- task = TASK_BY_ID[task_id]
114
- seed = args.seed + idx
115
- torch_metrics = _benchmark_torch(task, seed=seed, repeats=args.repeats, warmup=args.warmup)
116
- triton_results = {
117
- method: benchmark_qwen_task(
118
- task=task,
119
- block_size=int(config["block_size"]),
120
- num_warps=int(config["num_warps"]),
121
- num_stages=int(config["num_stages"]),
122
- repeats=args.repeats,
123
- warmup=args.warmup,
124
- seed=seed,
125
- ).__dict__
126
- for method, config in task_configs[task_id].items()
127
- }
128
- results[task_id] = {
129
- "family": task.family,
130
- "role": task.role,
131
- "mode": task.mode,
132
- "torch": torch_metrics,
133
- "triton": triton_results,
134
- "speedups": {
135
- method: {
136
- "vs_eager": float(torch_metrics["eager_latency_ms"] / row["median_ms"]),
137
- "vs_compiled": float(torch_metrics["compiled_latency_ms"] / row["median_ms"]),
138
- }
139
- for method, row in triton_results.items()
140
- },
141
- }
142
-
143
- summary = {
144
- "generalization_results": str(args.generalization_results),
145
- "repeats": args.repeats,
146
- "warmup": args.warmup,
147
- "seed": args.seed,
148
- "task_count": len(results),
149
- "results": results,
150
- }
151
- args.output.parent.mkdir(parents=True, exist_ok=True)
152
- with args.output.open("w", encoding="utf-8") as handle:
153
- json.dump(summary, handle, indent=2)
154
- print(json.dumps(summary, indent=2))
155
-
156
-
157
- if __name__ == "__main__":
158
- main()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
scripts/benchmark_runtime_references.py DELETED
@@ -1,222 +0,0 @@
1
- #!/usr/bin/env python3
2
- from __future__ import annotations
3
-
4
- import argparse
5
- import json
6
- import re
7
- import sys
8
- import time
9
- from pathlib import Path
10
- from typing import Any, Dict, Iterable, List, Tuple
11
-
12
- import numpy as np
13
- import torch
14
- import torch.nn.functional as F
15
-
16
- ROOT = Path(__file__).resolve().parents[1]
17
- if str(ROOT) not in sys.path:
18
- sys.path.append(str(ROOT))
19
-
20
- from scripts.collect_measurements import benchmark_single_config as benchmark_softmax_config
21
- from scripts.collect_multifamily_measurements import (
22
- EPS,
23
- GROUPED_GEMM_GROUP_COUNT,
24
- GROUPED_GEMM_K,
25
- benchmark_grouped_gemm_config,
26
- benchmark_layernorm_config,
27
- )
28
-
29
-
30
- TASK_RE = re.compile(
31
- r"^(?P<family>[a-z_]+?)(?:_g(?P<g>\d+)_k(?P<k>\d+))?_m(?P<m>\d+)_n(?P<n>\d+)$"
32
- )
33
-
34
-
35
- def _parse_task(task_id: str) -> Dict[str, int | str | None]:
36
- match = TASK_RE.match(task_id)
37
- if not match:
38
- raise ValueError(f"Cannot parse task id: {task_id}")
39
- data = match.groupdict()
40
- return {
41
- "family": data["family"],
42
- "m": int(data["m"]),
43
- "n": int(data["n"]),
44
- "g": int(data["g"]) if data["g"] is not None else None,
45
- "k": int(data["k"]) if data["k"] is not None else None,
46
- }
47
-
48
-
49
- def _bench_callable(fn, args: Tuple[Any, ...], repeats: int, warmup: int) -> float:
50
- for _ in range(max(1, warmup)):
51
- fn(*args)
52
- torch.cuda.synchronize()
53
-
54
- start = torch.cuda.Event(enable_timing=True)
55
- end = torch.cuda.Event(enable_timing=True)
56
- latencies_ms: List[float] = []
57
- for _ in range(max(1, repeats)):
58
- torch.cuda.synchronize()
59
- start.record()
60
- fn(*args)
61
- end.record()
62
- end.synchronize()
63
- latencies_ms.append(start.elapsed_time(end))
64
- return float(np.median(np.asarray(latencies_ms, dtype=np.float32)))
65
-
66
-
67
- def _build_family_callable(task_meta: Dict[str, Any], seed: int) -> Tuple[Any, Tuple[Any, ...]]:
68
- family = str(task_meta["family"])
69
- m = int(task_meta["m"])
70
- n = int(task_meta["n"])
71
- torch.manual_seed(seed)
72
-
73
- if family == "softmax":
74
- x = torch.randn((m, n), device="cuda", dtype=torch.float16)
75
-
76
- def fn(inp: torch.Tensor) -> torch.Tensor:
77
- return torch.softmax(inp, dim=-1)
78
-
79
- return fn, (x,)
80
-
81
- if family == "layernorm":
82
- x = torch.randn((m, n), device="cuda", dtype=torch.float16)
83
-
84
- def fn(inp: torch.Tensor) -> torch.Tensor:
85
- return F.layer_norm(inp, (inp.shape[-1],), eps=EPS)
86
-
87
- return fn, (x,)
88
-
89
- if family == "grouped_gemm":
90
- group_count = int(task_meta.get("g") or GROUPED_GEMM_GROUP_COUNT)
91
- k_dim = int(task_meta.get("k") or GROUPED_GEMM_K)
92
- group_m = max(64, m // group_count)
93
- a_groups = [torch.randn((group_m, k_dim), device="cuda", dtype=torch.float16) for _ in range(group_count)]
94
- b_groups = [torch.randn((k_dim, n), device="cuda", dtype=torch.float16) for _ in range(group_count)]
95
-
96
- def fn(*inputs: torch.Tensor) -> Tuple[torch.Tensor, ...]:
97
- mid = len(inputs) // 2
98
- a_list = inputs[:mid]
99
- b_list = inputs[mid:]
100
- return tuple(torch.matmul(a, b) for a, b in zip(a_list, b_list))
101
-
102
- return fn, tuple(a_groups + b_groups)
103
-
104
- raise ValueError(f"Unsupported family: {family}")
105
-
106
-
107
- def _benchmark_torch_compile(task_meta: Dict[str, Any], seed: int, repeats: int, warmup: int) -> Dict[str, float]:
108
- eager_fn, args = _build_family_callable(task_meta, seed=seed)
109
- eager_latency_ms = _bench_callable(eager_fn, args, repeats=repeats, warmup=warmup)
110
-
111
- compiled_fn = torch.compile(eager_fn)
112
- torch.cuda.synchronize()
113
- start = time.perf_counter()
114
- compiled_fn(*args)
115
- torch.cuda.synchronize()
116
- compile_plus_first_call_ms = float((time.perf_counter() - start) * 1000.0)
117
- compiled_latency_ms = _bench_callable(compiled_fn, args, repeats=repeats, warmup=warmup)
118
- return {
119
- "eager_latency_ms": eager_latency_ms,
120
- "compile_plus_first_call_ms": compile_plus_first_call_ms,
121
- "compiled_latency_ms": compiled_latency_ms,
122
- }
123
-
124
-
125
- def _benchmark_triton_best(task_id: str, config: Dict[str, Any], repeats: int, warmup: int, seed: int) -> Dict[str, Any]:
126
- task = _parse_task(task_id)
127
- family = str(task["family"])
128
- common = {
129
- "n": int(task["n"]),
130
- "block_size": int(config["block_size"]),
131
- "num_warps": int(config["num_warps"]),
132
- "num_stages": int(config["num_stages"]),
133
- "m": int(task["m"]),
134
- "repeats": repeats,
135
- "warmup": warmup,
136
- "seed": seed,
137
- }
138
- if family == "softmax":
139
- row = benchmark_softmax_config(**common)
140
- return row.__dict__
141
- if family == "layernorm":
142
- row = benchmark_layernorm_config(**common)
143
- return row.__dict__
144
- if family == "grouped_gemm":
145
- row = benchmark_grouped_gemm_config(**common)
146
- return row.__dict__
147
- raise ValueError(f"Unsupported family: {family}")
148
-
149
-
150
- def _collect_task_best_configs(generalization_results: Dict[str, Any]) -> Dict[str, Dict[str, Dict[str, Any]]]:
151
- task_map: Dict[str, Dict[str, Dict[str, Any]]] = {}
152
- for section in generalization_results["results"].values():
153
- for method in ("random", "surrogate"):
154
- for run in section["task_runs"][method]:
155
- task_map.setdefault(run["task"], {})[method] = run["best_overall"]["config"]
156
- return task_map
157
-
158
-
159
- def main() -> None:
160
- parser = argparse.ArgumentParser(description="Benchmark eager/torch.compile and live Triton configs on held-out tasks.")
161
- parser.add_argument(
162
- "--generalization-results",
163
- type=Path,
164
- default=Path("outputs/generalization_eval.json"),
165
- )
166
- parser.add_argument("--repeats", type=int, default=100)
167
- parser.add_argument("--warmup", type=int, default=10)
168
- parser.add_argument("--seed", type=int, default=123)
169
- parser.add_argument(
170
- "--output",
171
- type=Path,
172
- default=Path("outputs/runtime_references.json"),
173
- )
174
- args = parser.parse_args()
175
-
176
- generalization_results = json.loads(args.generalization_results.read_text(encoding="utf-8"))
177
- task_best_configs = _collect_task_best_configs(generalization_results)
178
-
179
- results: Dict[str, Any] = {}
180
- for idx, task_id in enumerate(sorted(task_best_configs.keys())):
181
- task_seed = args.seed + idx
182
- task_meta = _parse_task(task_id)
183
- torch_metrics = _benchmark_torch_compile(task_meta, seed=task_seed, repeats=args.repeats, warmup=args.warmup)
184
- method_results = {
185
- method: _benchmark_triton_best(
186
- task_id=task_id,
187
- config=config,
188
- repeats=args.repeats,
189
- warmup=args.warmup,
190
- seed=task_seed,
191
- )
192
- for method, config in task_best_configs[task_id].items()
193
- }
194
- results[task_id] = {
195
- "family": task_meta["family"],
196
- "torch": torch_metrics,
197
- "triton": method_results,
198
- "speedups": {
199
- method: {
200
- "vs_eager": float(torch_metrics["eager_latency_ms"] / row["median_ms"]),
201
- "vs_compiled": float(torch_metrics["compiled_latency_ms"] / row["median_ms"]),
202
- }
203
- for method, row in method_results.items()
204
- },
205
- }
206
-
207
- summary = {
208
- "generalization_results": str(args.generalization_results),
209
- "repeats": args.repeats,
210
- "warmup": args.warmup,
211
- "seed": args.seed,
212
- "task_count": len(results),
213
- "results": results,
214
- }
215
- args.output.parent.mkdir(parents=True, exist_ok=True)
216
- with args.output.open("w", encoding="utf-8") as handle:
217
- json.dump(summary, handle, indent=2)
218
- print(json.dumps(summary, indent=2))
219
-
220
-
221
- if __name__ == "__main__":
222
- main()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
scripts/benchmark_surrogate_vs_triton_autotune.py DELETED
@@ -1,361 +0,0 @@
1
- #!/usr/bin/env python3
2
- from __future__ import annotations
3
-
4
- import argparse
5
- import json
6
- import sys
7
- import time
8
- from pathlib import Path
9
- from typing import Any, Dict, List
10
-
11
- import numpy as np
12
- import torch
13
-
14
- ROOT = Path(__file__).resolve().parents[1]
15
- if str(ROOT) not in sys.path:
16
- sys.path.append(str(ROOT))
17
-
18
- import triton
19
- import triton.language as tl
20
- import triton.testing as ttesting
21
-
22
- from scripts.collect_measurements import BLOCK_SIZES, NUM_STAGES, NUM_WARPS, fused_rowwise_softmax_kernel
23
- from server.softmax_surrogate_environment import SoftmaxSurrogateEnvironment
24
-
25
-
26
- DEFAULT_TASKS = [
27
- "softmax_m4096_n4096",
28
- "softmax_m4096_n6144",
29
- "softmax_m4096_n8192",
30
- ]
31
-
32
-
33
- def _choose_surrogate_config_id(env: SoftmaxSurrogateEnvironment, acquisition: str, beta: float, xi: float) -> int:
34
- seen = set(env.seen_config_ids())
35
- best_config_id = -1
36
- best_score = float("-inf")
37
- for config_id in env.available_config_ids():
38
- if config_id in seen and len(seen) < len(env.available_config_ids()):
39
- continue
40
- score = env.acquisition_score(config_id, strategy=acquisition, beta=beta, xi=xi)
41
- if score > best_score:
42
- best_score = score
43
- best_config_id = config_id
44
- if best_config_id < 0:
45
- raise RuntimeError("Failed to select surrogate config.")
46
- return best_config_id
47
-
48
-
49
- def _valid_autotune_configs(n_cols: int) -> List[triton.Config]:
50
- configs = []
51
- for block_size in BLOCK_SIZES:
52
- if block_size < n_cols:
53
- continue
54
- for num_warps in NUM_WARPS:
55
- for num_stages in NUM_STAGES:
56
- configs.append(
57
- triton.Config(
58
- {"BLOCK_SIZE": block_size},
59
- num_warps=num_warps,
60
- num_stages=num_stages,
61
- )
62
- )
63
- return configs
64
-
65
-
66
- def _compile_plus_first_call_fixed(sample: torch.Tensor, config: Dict[str, int]) -> float:
67
- output = torch.empty_like(sample)
68
- grid = (sample.shape[0],)
69
- torch.cuda.synchronize()
70
- start = time.perf_counter()
71
- fused_rowwise_softmax_kernel[grid](
72
- sample,
73
- output,
74
- sample.stride(0),
75
- sample.stride(1),
76
- output.stride(0),
77
- output.stride(1),
78
- sample.shape[1],
79
- BLOCK_SIZE=int(config["block_size"]),
80
- num_warps=int(config["num_warps"]),
81
- num_stages=int(config["num_stages"]),
82
- )
83
- torch.cuda.synchronize()
84
- return float((time.perf_counter() - start) * 1000.0)
85
-
86
-
87
- def _steady_state_fixed(sample: torch.Tensor, config: Dict[str, int], repeats: int, warmup: int) -> float:
88
- output = torch.empty_like(sample)
89
- grid = (sample.shape[0],)
90
-
91
- def launch() -> None:
92
- fused_rowwise_softmax_kernel[grid](
93
- sample,
94
- output,
95
- sample.stride(0),
96
- sample.stride(1),
97
- output.stride(0),
98
- output.stride(1),
99
- sample.shape[1],
100
- BLOCK_SIZE=int(config["block_size"]),
101
- num_warps=int(config["num_warps"]),
102
- num_stages=int(config["num_stages"]),
103
- )
104
-
105
- return float(
106
- ttesting.do_bench(
107
- launch,
108
- warmup=warmup,
109
- rep=repeats,
110
- quantiles=[0.5],
111
- return_mode="median",
112
- )
113
- )
114
-
115
-
116
- def _benchmark_triton_autotune(sample: torch.Tensor, repeats: int, warmup: int) -> Dict[str, float]:
117
- output = torch.empty_like(sample)
118
- n_cols = sample.shape[1]
119
- configs = _valid_autotune_configs(n_cols)
120
-
121
- @triton.autotune(configs=configs, key=["n_cols"])
122
- @triton.jit
123
- def autotuned_softmax_kernel(
124
- X_ptr,
125
- Y_ptr,
126
- stride_xm,
127
- stride_xn,
128
- stride_ym,
129
- stride_yn,
130
- n_cols,
131
- BLOCK_SIZE: tl.constexpr,
132
- ):
133
- row_idx = tl.program_id(0)
134
- col_offsets = tl.arange(0, BLOCK_SIZE)
135
- x_ptr = X_ptr + row_idx * stride_xm + col_offsets
136
- y_ptr = Y_ptr + row_idx * stride_ym + col_offsets
137
- mask = col_offsets < n_cols
138
-
139
- x = tl.load(x_ptr, mask=mask, other=-float("inf"))
140
- x = x - tl.max(x, axis=0)
141
- numerator = tl.exp(x)
142
- denominator = tl.sum(numerator, axis=0)
143
- y = numerator / denominator
144
- tl.store(y_ptr, y, mask=mask)
145
-
146
- grid = (sample.shape[0],)
147
- torch.cuda.synchronize()
148
- start = time.perf_counter()
149
- autotuned_softmax_kernel[grid](
150
- sample,
151
- output,
152
- sample.stride(0),
153
- sample.stride(1),
154
- output.stride(0),
155
- output.stride(1),
156
- n_cols,
157
- )
158
- torch.cuda.synchronize()
159
- first_call_ms = float((time.perf_counter() - start) * 1000.0)
160
-
161
- def launch() -> None:
162
- autotuned_softmax_kernel[grid](
163
- sample,
164
- output,
165
- sample.stride(0),
166
- sample.stride(1),
167
- output.stride(0),
168
- output.stride(1),
169
- n_cols,
170
- )
171
-
172
- steady_ms = float(
173
- ttesting.do_bench(
174
- launch,
175
- warmup=warmup,
176
- rep=repeats,
177
- quantiles=[0.5],
178
- return_mode="median",
179
- )
180
- )
181
- return {
182
- "autotune_first_call_ms": first_call_ms,
183
- "autotune_steady_ms": steady_ms,
184
- }
185
-
186
-
187
- def _build_summary(
188
- measurement_path: str,
189
- tasks: List[str],
190
- acquisition: str,
191
- beta: float,
192
- xi: float,
193
- results: Dict[str, Any],
194
- ) -> Dict[str, Any]:
195
- if results:
196
- surrogate_first = [task["surrogate"]["compile_plus_first_call_ms"] for task in results.values()]
197
- surrogate_steady = [task["surrogate"]["steady_ms"] for task in results.values()]
198
- autotune_first = [task["triton_autotune"]["autotune_first_call_ms"] for task in results.values()]
199
- autotune_steady = [task["triton_autotune"]["autotune_steady_ms"] for task in results.values()]
200
- surrogate_speedup = [
201
- task["triton_autotune"]["autotune_first_call_ms"] / max(task["surrogate"]["compile_plus_first_call_ms"], 1e-9)
202
- for task in results.values()
203
- ]
204
- summary = {
205
- "mean_surrogate_compile_plus_first_call_ms": float(np.mean(surrogate_first)),
206
- "mean_surrogate_steady_ms": float(np.mean(surrogate_steady)),
207
- "mean_autotune_first_call_ms": float(np.mean(autotune_first)),
208
- "mean_autotune_steady_ms": float(np.mean(autotune_steady)),
209
- "mean_search_time_speedup_surrogate_vs_autotune": float(np.mean(surrogate_speedup)),
210
- "completed_task_count": len(results),
211
- }
212
- else:
213
- summary = {
214
- "mean_surrogate_compile_plus_first_call_ms": None,
215
- "mean_surrogate_steady_ms": None,
216
- "mean_autotune_first_call_ms": None,
217
- "mean_autotune_steady_ms": None,
218
- "mean_search_time_speedup_surrogate_vs_autotune": None,
219
- "completed_task_count": 0,
220
- }
221
- return {
222
- "measurement_path": measurement_path,
223
- "tasks": tasks,
224
- "acquisition": acquisition,
225
- "beta": beta,
226
- "xi": xi,
227
- "results": results,
228
- "summary": summary,
229
- }
230
-
231
-
232
- def run_benchmark(
233
- measurement_path: str,
234
- tasks: List[str],
235
- repeats: int,
236
- warmup: int,
237
- seed: int,
238
- acquisition: str,
239
- beta: float,
240
- xi: float,
241
- output_path: Path,
242
- resume: bool,
243
- ) -> Dict[str, Any]:
244
- env_probe = SoftmaxSurrogateEnvironment(measurement_path=measurement_path, budget=1, seed=seed)
245
- available_tasks = [task for task in env_probe.available_tasks() if task.startswith("softmax_m4096_n")]
246
- train_task_ids = [task for task in available_tasks if task not in tasks]
247
-
248
- results: Dict[str, Any] = {}
249
- if resume and output_path.exists():
250
- try:
251
- existing = json.loads(output_path.read_text(encoding="utf-8"))
252
- results = dict(existing.get("results", {}))
253
- except Exception:
254
- results = {}
255
-
256
- for index, task in enumerate(tasks):
257
- if task in results:
258
- print(f"[progress] skipping completed task {index + 1}/{len(tasks)}: {task}")
259
- continue
260
- task_seed = seed + index
261
- print(f"[progress] starting task {index + 1}/{len(tasks)}: {task}")
262
- env = SoftmaxSurrogateEnvironment(
263
- measurement_path=measurement_path,
264
- budget=6,
265
- seed=task_seed,
266
- train_task_ids=train_task_ids,
267
- )
268
- reset_out = env.reset(task=task, seed=task_seed)
269
- decision_start = time.perf_counter()
270
- surrogate_config_id = _choose_surrogate_config_id(env, acquisition=acquisition, beta=beta, xi=xi)
271
- decision_ms = float((time.perf_counter() - decision_start) * 1000.0)
272
- surrogate_config = env.config_info(surrogate_config_id)
273
- sample = torch.randn((env._task_rows[0].m, env._task_rows[0].n), device="cuda", dtype=torch.float16)
274
-
275
- surrogate_first_call_ms = _compile_plus_first_call_fixed(sample, surrogate_config)
276
- surrogate_steady_ms = _steady_state_fixed(sample, surrogate_config, repeats=repeats, warmup=warmup)
277
- autotune_metrics = _benchmark_triton_autotune(sample, repeats=repeats, warmup=warmup)
278
- oracle_best = env.oracle_best()
279
-
280
- results[task] = {
281
- "seeded_config_ids": reset_out["observation"]["tried_config_ids"],
282
- "train_task_count": len(train_task_ids),
283
- "oracle_best_ms": oracle_best["median_ms"],
284
- "surrogate": {
285
- "config": surrogate_config,
286
- "decision_ms": decision_ms,
287
- "compile_plus_first_call_ms": surrogate_first_call_ms,
288
- "steady_ms": surrogate_steady_ms,
289
- "regret_vs_oracle": float(surrogate_steady_ms / oracle_best["median_ms"] - 1.0),
290
- },
291
- "triton_autotune": {
292
- **autotune_metrics,
293
- "regret_vs_oracle": float(autotune_metrics["autotune_steady_ms"] / oracle_best["median_ms"] - 1.0),
294
- },
295
- }
296
- print(
297
- "[progress] finished"
298
- f" task={task}"
299
- f" surrogate_first_ms={results[task]['surrogate']['compile_plus_first_call_ms']:.3f}"
300
- f" surrogate_steady_ms={results[task]['surrogate']['steady_ms']:.3f}"
301
- f" autotune_first_ms={results[task]['triton_autotune']['autotune_first_call_ms']:.3f}"
302
- f" autotune_steady_ms={results[task]['triton_autotune']['autotune_steady_ms']:.3f}"
303
- )
304
- snapshot = _build_summary(
305
- measurement_path=measurement_path,
306
- tasks=tasks,
307
- acquisition=acquisition,
308
- beta=beta,
309
- xi=xi,
310
- results=results,
311
- )
312
- output_path.parent.mkdir(parents=True, exist_ok=True)
313
- output_path.write_text(json.dumps(snapshot, indent=2), encoding="utf-8")
314
-
315
- return _build_summary(
316
- measurement_path=measurement_path,
317
- tasks=tasks,
318
- acquisition=acquisition,
319
- beta=beta,
320
- xi=xi,
321
- results=results,
322
- )
323
-
324
-
325
- def parse_args() -> argparse.Namespace:
326
- parser = argparse.ArgumentParser(description="Quick benchmark: surrogate-guided softmax config choice vs Triton autotune.")
327
- parser.add_argument("--measurement-path", default="data/autotune_measurements.csv")
328
- parser.add_argument("--tasks", nargs="+", default=DEFAULT_TASKS)
329
- parser.add_argument("--repeats", type=int, default=50)
330
- parser.add_argument("--warmup", type=int, default=10)
331
- parser.add_argument("--seed", type=int, default=7)
332
- parser.add_argument("--acquisition", choices=("mean", "ucb", "ei"), default="ucb")
333
- parser.add_argument("--beta", type=float, default=2.0)
334
- parser.add_argument("--xi", type=float, default=0.0)
335
- parser.add_argument("--output", type=Path, default=Path("outputs/surrogate_vs_triton_autotune.json"))
336
- parser.add_argument("--no-resume", action="store_true")
337
- return parser.parse_args()
338
-
339
-
340
- def main() -> None:
341
- args = parse_args()
342
- summary = run_benchmark(
343
- measurement_path=args.measurement_path,
344
- tasks=args.tasks,
345
- repeats=args.repeats,
346
- warmup=args.warmup,
347
- seed=args.seed,
348
- acquisition=args.acquisition,
349
- beta=args.beta,
350
- xi=args.xi,
351
- output_path=args.output,
352
- resume=not args.no_resume,
353
- )
354
- args.output.parent.mkdir(parents=True, exist_ok=True)
355
- with args.output.open("w", encoding="utf-8") as handle:
356
- json.dump(summary, handle, indent=2)
357
- print(json.dumps(summary, indent=2))
358
-
359
-
360
- if __name__ == "__main__":
361
- main()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
scripts/build_benchmark_splits.py DELETED
@@ -1,81 +0,0 @@
1
- #!/usr/bin/env python3
2
- from __future__ import annotations
3
-
4
- import argparse
5
- import csv
6
- import json
7
- from collections import defaultdict
8
- from pathlib import Path
9
- from typing import Dict, List
10
-
11
-
12
- def _load_tasks(path: Path) -> Dict[str, List[dict]]:
13
- grouped: Dict[str, List[dict]] = defaultdict(list)
14
- with path.open("r", newline="", encoding="utf-8") as handle:
15
- for row in csv.DictReader(handle):
16
- grouped[row["task_id"]].append(row)
17
- return grouped
18
-
19
-
20
- def _task_n(task_id: str) -> int:
21
- return int(task_id.split("_n")[-1])
22
-
23
-
24
- def build_splits(grouped: Dict[str, List[dict]], heldout_family: str | None) -> Dict[str, object]:
25
- tasks_by_family: Dict[str, List[str]] = defaultdict(list)
26
- for task_id, rows in grouped.items():
27
- tasks_by_family[rows[0]["family"]].append(task_id)
28
-
29
- for family in tasks_by_family:
30
- tasks_by_family[family].sort(key=_task_n)
31
-
32
- families = sorted(tasks_by_family.keys())
33
- if not families:
34
- raise RuntimeError("No tasks found in measurement file.")
35
-
36
- shape_train: List[str] = []
37
- shape_test: List[str] = []
38
- for family, tasks in tasks_by_family.items():
39
- holdout_count = 2 if len(tasks) >= 4 else 1
40
- split_idx = max(1, len(tasks) - holdout_count)
41
- shape_train.extend(tasks[:split_idx])
42
- shape_test.extend(tasks[split_idx:])
43
-
44
- if heldout_family is None:
45
- heldout_family = families[-1]
46
- if heldout_family not in tasks_by_family:
47
- raise ValueError(f"Held-out family {heldout_family} is not present.")
48
-
49
- family_train = [task_id for family, tasks in tasks_by_family.items() if family != heldout_family for task_id in tasks]
50
- family_test = list(tasks_by_family[heldout_family])
51
-
52
- return {
53
- "families_present": families,
54
- "shape_generalization": {
55
- "train_tasks": sorted(shape_train),
56
- "test_tasks": sorted(shape_test),
57
- },
58
- "family_holdout": {
59
- "heldout_family": heldout_family,
60
- "train_tasks": sorted(family_train),
61
- "test_tasks": sorted(family_test),
62
- },
63
- }
64
-
65
-
66
- def main() -> None:
67
- parser = argparse.ArgumentParser(description="Build train/test split manifests for the multi-family benchmark.")
68
- parser.add_argument("--measurement-path", type=Path, default=Path("data/autotune_measurements.csv"))
69
- parser.add_argument("--output", type=Path, default=Path("data/benchmark_splits.json"))
70
- parser.add_argument("--heldout-family", type=str, default=None)
71
- args = parser.parse_args()
72
-
73
- splits = build_splits(_load_tasks(args.measurement_path), args.heldout_family)
74
- args.output.parent.mkdir(parents=True, exist_ok=True)
75
- with args.output.open("w", encoding="utf-8") as handle:
76
- json.dump(splits, handle, indent=2)
77
- print(json.dumps(splits, indent=2))
78
-
79
-
80
- if __name__ == "__main__":
81
- main()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
scripts/build_qwen_05b_splits.py DELETED
@@ -1,56 +0,0 @@
1
- #!/usr/bin/env python3
2
- from __future__ import annotations
3
-
4
- import argparse
5
- import json
6
- from pathlib import Path
7
-
8
- from scripts.qwen_05b_spec import DECODE_CTX_LENS, PREFILL_SEQ_LENS, qwen_05b_tasks
9
-
10
-
11
- def build_splits() -> dict:
12
- tasks = qwen_05b_tasks()
13
- long_prefill = max(PREFILL_SEQ_LENS)
14
- long_decode = max(DECODE_CTX_LENS)
15
-
16
- shape_train = []
17
- shape_test = []
18
- for task in tasks:
19
- if task.mode == "prefill" and task.seq_len == long_prefill:
20
- shape_test.append(task.task_id)
21
- elif task.mode == "decode" and task.ctx_len == long_decode:
22
- shape_test.append(task.task_id)
23
- else:
24
- shape_train.append(task.task_id)
25
-
26
- family_holdout_train = [task.task_id for task in tasks if task.family != "gemm"]
27
- family_holdout_test = [task.task_id for task in tasks if task.family == "gemm"]
28
-
29
- return {
30
- "model_id": "Qwen/Qwen2.5-0.5B",
31
- "shape_generalization": {
32
- "train_tasks": sorted(shape_train),
33
- "test_tasks": sorted(shape_test),
34
- },
35
- "family_holdout": {
36
- "heldout_family": "gemm",
37
- "train_tasks": sorted(family_holdout_train),
38
- "test_tasks": sorted(family_holdout_test),
39
- },
40
- }
41
-
42
-
43
- def main() -> None:
44
- parser = argparse.ArgumentParser(description="Build train/test split manifests for Qwen2.5-0.5B kernel tuning.")
45
- parser.add_argument("--output", type=Path, default=Path("data/qwen_05b_splits.json"))
46
- args = parser.parse_args()
47
-
48
- splits = build_splits()
49
- args.output.parent.mkdir(parents=True, exist_ok=True)
50
- with args.output.open("w", encoding="utf-8") as handle:
51
- json.dump(splits, handle, indent=2)
52
- print(json.dumps(splits, indent=2))
53
-
54
-
55
- if __name__ == "__main__":
56
- main()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
scripts/check_torch_cuda.py DELETED
@@ -1,18 +0,0 @@
1
- #!/usr/bin/env python3
2
- from __future__ import annotations
3
-
4
- import torch
5
-
6
-
7
- def main() -> None:
8
- print(f"python: {__import__('sys').executable}")
9
- print(f"torch: {torch.__version__}")
10
- print(f"cuda_available: {torch.cuda.is_available()}")
11
- if torch.cuda.is_available():
12
- print(f"cuda_device_name: {torch.cuda.get_device_name(0)}")
13
- print(f"cuda_capability: {torch.cuda.get_device_capability(0)}")
14
-
15
-
16
- if __name__ == "__main__":
17
- main()
18
-
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
scripts/collect_measurements.py DELETED
@@ -1,398 +0,0 @@
1
- #!/usr/bin/env python3
2
- from __future__ import annotations
3
-
4
- import argparse
5
- import csv
6
- import math
7
- import statistics
8
- from dataclasses import dataclass
9
- from pathlib import Path
10
- from typing import Iterable, List, Dict, Optional
11
-
12
- import numpy as np
13
- import torch
14
-
15
- try:
16
- import triton
17
- import triton.language as tl
18
- import triton.testing as ttesting
19
- except Exception as err: # pragma: no cover
20
- raise RuntimeError(
21
- "Triton is required for this measurement collector. Install Triton and rerun."
22
- ) from err
23
-
24
-
25
- BLOCK_SIZES = (256, 512, 1024, 2048, 4096, 8192)
26
- NUM_WARPS = (1, 2, 4, 8)
27
- NUM_STAGES = (1, 2, 3, 4)
28
- N_VALUES = (256, 512, 1024, 1536, 2048, 3072, 4096, 6144, 8192)
29
-
30
-
31
- @dataclass(frozen=True)
32
- class MeasurementRow:
33
- task_id: str
34
- m: int
35
- n: int
36
- block_size: int
37
- num_warps: int
38
- num_stages: int
39
- median_ms: float
40
- effective_gbps: float
41
- score: float
42
- validation_error: float
43
-
44
-
45
- @triton.jit
46
- def fused_rowwise_softmax_kernel(
47
- X_ptr,
48
- Y_ptr,
49
- stride_xm,
50
- stride_xn,
51
- stride_ym,
52
- stride_yn,
53
- n_cols,
54
- BLOCK_SIZE: tl.constexpr,
55
- ):
56
- row_idx = tl.program_id(0)
57
- col_offsets = tl.arange(0, BLOCK_SIZE)
58
- x_ptr = X_ptr + row_idx * stride_xm + col_offsets
59
- y_ptr = Y_ptr + row_idx * stride_ym + col_offsets
60
- mask = col_offsets < n_cols
61
-
62
- x = tl.load(x_ptr, mask=mask, other=-float("inf"))
63
- x = x - tl.max(x, axis=0)
64
- numerator = tl.exp(x)
65
- denominator = tl.sum(numerator, axis=0)
66
- y = numerator / denominator
67
- tl.store(y_ptr, y, mask=mask)
68
-
69
-
70
- def _task_id(m: int, n: int) -> str:
71
- return f"softmax_m{m}_n{n}"
72
-
73
-
74
- def _effective_gbps(sample: torch.Tensor, median_ms: float) -> float:
75
- bytes_processed = float(sample.numel() * sample.element_size() * 2)
76
- if median_ms <= 0:
77
- return 0.0
78
- return bytes_processed / (median_ms / 1000.0) / 1e9
79
-
80
-
81
- def _benchmark_config(
82
- sample: torch.Tensor,
83
- block_size: int,
84
- num_warps: int,
85
- num_stages: int,
86
- repeats: int,
87
- warmup: int,
88
- ) -> float:
89
- output = torch.empty_like(sample)
90
- m, n = sample.shape
91
- grid = (m,)
92
-
93
- def launch() -> None:
94
- fused_rowwise_softmax_kernel[grid](
95
- sample,
96
- output,
97
- sample.stride(0),
98
- sample.stride(1),
99
- output.stride(0),
100
- output.stride(1),
101
- n,
102
- BLOCK_SIZE=block_size,
103
- num_warps=num_warps,
104
- num_stages=num_stages,
105
- )
106
-
107
- # warmup to compile and stabilize caches / clocks.
108
- for _ in range(max(1, warmup)):
109
- launch()
110
- torch.cuda.synchronize()
111
-
112
- if ttesting is not None:
113
- result = ttesting.do_bench(
114
- launch,
115
- warmup=0,
116
- rep=repeats,
117
- quantiles=[0.5],
118
- return_mode="median",
119
- )
120
- if isinstance(result, dict):
121
- return float(result.get("median", 0.0))
122
- return float(result)
123
-
124
- start = torch.cuda.Event(enable_timing=True)
125
- end = torch.cuda.Event(enable_timing=True)
126
- durations_ms: List[float] = []
127
- for _ in range(max(1, repeats)):
128
- torch.cuda.synchronize()
129
- start.record()
130
- launch()
131
- end.record()
132
- end.synchronize()
133
- durations_ms.append(start.elapsed_time(end))
134
- return float(statistics.median(durations_ms))
135
-
136
-
137
- def _validate_correctness(sample: torch.Tensor, block_size: int, num_warps: int, num_stages: int) -> float:
138
- ref = torch.softmax(sample.float(), dim=-1).to(sample.dtype)
139
- out = torch.empty_like(sample)
140
-
141
- fused_rowwise_softmax_kernel[(sample.shape[0],)](
142
- sample,
143
- out,
144
- sample.stride(0),
145
- sample.stride(1),
146
- out.stride(0),
147
- out.stride(1),
148
- sample.shape[1],
149
- BLOCK_SIZE=block_size,
150
- num_warps=num_warps,
151
- num_stages=num_stages,
152
- )
153
- torch.cuda.synchronize()
154
-
155
- err = (out.to(torch.float32) - ref).abs().max().item()
156
- return float(err)
157
-
158
-
159
- def benchmark_single_config(
160
- n: int,
161
- block_size: int,
162
- num_warps: int,
163
- num_stages: int,
164
- m: int = 4096,
165
- repeats: int = 200,
166
- warmup: int = 25,
167
- seed: int = 0,
168
- ) -> MeasurementRow:
169
- if block_size < n:
170
- raise ValueError(f"Invalid config: BLOCK_SIZE {block_size} < N {n}")
171
-
172
- torch.manual_seed(seed)
173
- sample = torch.randn((m, n), device="cuda", dtype=torch.float16)
174
-
175
- if num_warps not in NUM_WARPS:
176
- raise ValueError(f"Unsupported num_warps={num_warps}")
177
- if num_stages not in NUM_STAGES:
178
- raise ValueError(f"Unsupported num_stages={num_stages}")
179
- if block_size not in BLOCK_SIZES:
180
- raise ValueError(f"Unsupported BLOCK_SIZE={block_size}")
181
-
182
- median_ms = _benchmark_config(
183
- sample=sample,
184
- block_size=block_size,
185
- num_warps=num_warps,
186
- num_stages=num_stages,
187
- repeats=repeats,
188
- warmup=warmup,
189
- )
190
-
191
- val_err = _validate_correctness(
192
- sample=sample,
193
- block_size=block_size,
194
- num_warps=num_warps,
195
- num_stages=num_stages,
196
- )
197
-
198
- gbps = _effective_gbps(sample, median_ms)
199
- score = -math.log(max(median_ms, np.finfo(float).tiny))
200
- return MeasurementRow(
201
- task_id=_task_id(m, n),
202
- m=m,
203
- n=n,
204
- block_size=block_size,
205
- num_warps=num_warps,
206
- num_stages=num_stages,
207
- median_ms=float(median_ms),
208
- effective_gbps=float(gbps),
209
- score=float(score),
210
- validation_error=float(val_err),
211
- )
212
-
213
-
214
- def collect_measurements(
215
- output_path: Path,
216
- n_values: Iterable[int],
217
- repeats: int,
218
- warmup: int,
219
- seed: int,
220
- m: int = 4096,
221
- append: bool = True,
222
- ) -> List[MeasurementRow]:
223
- if not torch.cuda.is_available():
224
- raise RuntimeError("CUDA is required to run measurements on H100.")
225
- if not torch.cuda.get_device_name(0):
226
- raise RuntimeError("No CUDA device found.")
227
-
228
- output_path.parent.mkdir(parents=True, exist_ok=True)
229
-
230
- existing: Dict[str, MeasurementRow] = {}
231
- if output_path.exists():
232
- with output_path.open("r", newline="", encoding="utf-8") as f:
233
- for row in csv.DictReader(f):
234
- key = _row_key(
235
- int(row["m"]),
236
- int(row["n"]),
237
- int(row["block_size"]),
238
- int(row["num_warps"]),
239
- int(row["num_stages"]),
240
- )
241
- existing[key] = MeasurementRow(
242
- task_id=row["task_id"],
243
- m=int(row["m"]),
244
- n=int(row["n"]),
245
- block_size=int(row["block_size"]),
246
- num_warps=int(row["num_warps"]),
247
- num_stages=int(row["num_stages"]),
248
- median_ms=float(row["median_ms"]),
249
- effective_gbps=float(row["effective_gbps"]),
250
- score=float(row["score"]),
251
- validation_error=float(row["validation_error"]),
252
- )
253
-
254
- results: List[MeasurementRow] = []
255
- for n in n_values:
256
- if n < 0:
257
- raise ValueError(f"Invalid n value: {n}")
258
- for block in BLOCK_SIZES:
259
- if block < n:
260
- continue
261
- for num_warps in NUM_WARPS:
262
- for num_stages in NUM_STAGES:
263
- key = _row_key(m, n, block, num_warps, num_stages)
264
- if append and key in existing:
265
- continue
266
-
267
- torch.cuda.synchronize()
268
- row = benchmark_single_config(
269
- n=n,
270
- block_size=block,
271
- num_warps=num_warps,
272
- num_stages=num_stages,
273
- m=m,
274
- repeats=repeats,
275
- warmup=warmup,
276
- seed=seed,
277
- )
278
- results.append(row)
279
-
280
- if results:
281
- write_mode = "a" if output_path.exists() and append else "w"
282
- with output_path.open(write_mode, newline="", encoding="utf-8") as f:
283
- writer = csv.writer(f)
284
- if write_mode == "w" or (output_path.stat().st_size == 0):
285
- writer.writerow(
286
- [
287
- "task_id",
288
- "m",
289
- "n",
290
- "block_size",
291
- "num_warps",
292
- "num_stages",
293
- "median_ms",
294
- "effective_gbps",
295
- "score",
296
- "validation_error",
297
- ]
298
- )
299
- for r in results:
300
- writer.writerow(
301
- [
302
- r.task_id,
303
- r.m,
304
- r.n,
305
- r.block_size,
306
- r.num_warps,
307
- r.num_stages,
308
- r.median_ms,
309
- r.effective_gbps,
310
- r.score,
311
- r.validation_error,
312
- ]
313
- )
314
- return results
315
-
316
-
317
- def _row_key(m: int, n: int, block_size: int, num_warps: int, num_stages: int) -> str:
318
- return f"{m}|{n}|{block_size}|{num_warps}|{num_stages}"
319
-
320
-
321
- def parse_args() -> argparse.Namespace:
322
- parser = argparse.ArgumentParser(
323
- description="Collect real H100 measurements for Triton row-wise fused softmax."
324
- )
325
- parser.add_argument(
326
- "--output",
327
- type=Path,
328
- default=Path("data/softmax_measurements.csv"),
329
- help="CSV output path.",
330
- )
331
- parser.add_argument(
332
- "--n-cols",
333
- type=int,
334
- nargs="+",
335
- default=N_VALUES,
336
- help="Softmax inner dimension N values to benchmark.",
337
- )
338
- parser.add_argument("--m", type=int, default=4096, help="Outer dimension M.")
339
- parser.add_argument("--repeats", type=int, default=200, help="Benchmark repeats.")
340
- parser.add_argument("--warmup", type=int, default=25, help="Benchmark warmup runs.")
341
- parser.add_argument("--seed", type=int, default=0, help="Torch/random seed.")
342
- parser.add_argument(
343
- "--single-run",
344
- action="store_true",
345
- help="Run one specific config and print JSON-like output.",
346
- )
347
- parser.add_argument("--block-size", type=int, default=1024)
348
- parser.add_argument("--num-warps", type=int, default=4)
349
- parser.add_argument("--num-stages", type=int, default=2)
350
- parser.add_argument(
351
- "--append",
352
- action="store_true",
353
- help="Append to existing CSV file (default). If false, overwrite.",
354
- )
355
- parser.add_argument(
356
- "--overwrite",
357
- action="store_true",
358
- help="Overwrite existing CSV data instead of appending.",
359
- )
360
- return parser.parse_args()
361
-
362
-
363
- def main() -> None:
364
- args = parse_args()
365
-
366
- if not torch.cuda.is_available():
367
- raise RuntimeError("CUDA is not available.")
368
- if not torch.cuda.get_device_name(0):
369
- raise RuntimeError("No CUDA device was found.")
370
-
371
- append = args.append and not args.overwrite
372
- if args.single_run:
373
- row = benchmark_single_config(
374
- n=args.n_cols[0],
375
- block_size=args.block_size,
376
- num_warps=args.num_warps,
377
- num_stages=args.num_stages,
378
- m=args.m,
379
- repeats=args.repeats,
380
- warmup=args.warmup,
381
- seed=args.seed,
382
- )
383
- print(row.__dict__)
384
- return
385
-
386
- collect_measurements(
387
- output_path=args.output,
388
- n_values=args.n_cols,
389
- repeats=args.repeats,
390
- warmup=args.warmup,
391
- seed=args.seed,
392
- m=args.m,
393
- append=append,
394
- )
395
-
396
-
397
- if __name__ == "__main__":
398
- main()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
scripts/collect_multifamily_measurements.py DELETED
@@ -1,659 +0,0 @@
1
- #!/usr/bin/env python3
2
- from __future__ import annotations
3
-
4
- import argparse
5
- import csv
6
- import json
7
- import math
8
- import sys
9
- from dataclasses import dataclass
10
- from pathlib import Path
11
- from typing import Dict, Iterable, List, Sequence, Tuple
12
-
13
- import numpy as np
14
- import torch
15
-
16
- ROOT = Path(__file__).resolve().parents[1]
17
- if str(ROOT) not in sys.path:
18
- sys.path.append(str(ROOT))
19
-
20
- try:
21
- import triton
22
- import triton.language as tl
23
- import triton.testing as ttesting
24
- except Exception as err: # pragma: no cover
25
- raise RuntimeError(
26
- "Triton is required for the multi-family measurement collector."
27
- ) from err
28
-
29
- from scripts.collect_measurements import (
30
- BLOCK_SIZES,
31
- NUM_STAGES,
32
- NUM_WARPS,
33
- N_VALUES,
34
- benchmark_single_config as benchmark_softmax_config,
35
- )
36
-
37
-
38
- EPS = 1e-5
39
- GROUPED_GEMM_GROUP_COUNT = 4
40
- GROUPED_GEMM_K = 512
41
-
42
-
43
- @dataclass(frozen=True)
44
- class MultiFamilyMeasurementRow:
45
- family_group: str
46
- family: str
47
- task_id: str
48
- m: int
49
- n: int
50
- config_id: int
51
- block_size: int
52
- num_warps: int
53
- num_stages: int
54
- shape_json: str
55
- config_json: str
56
- median_ms: float
57
- effective_gbps: float
58
- score: float
59
- validation_error: float
60
-
61
-
62
- @triton.jit
63
- def fused_rowwise_layernorm_kernel(
64
- X_ptr,
65
- Y_ptr,
66
- stride_xm,
67
- stride_xn,
68
- stride_ym,
69
- stride_yn,
70
- n_cols,
71
- eps,
72
- BLOCK_SIZE: tl.constexpr,
73
- ):
74
- row_idx = tl.program_id(0)
75
- col_offsets = tl.arange(0, BLOCK_SIZE)
76
- x_ptr = X_ptr + row_idx * stride_xm + col_offsets
77
- y_ptr = Y_ptr + row_idx * stride_ym + col_offsets
78
- mask = col_offsets < n_cols
79
-
80
- x = tl.load(x_ptr, mask=mask, other=0.0).to(tl.float32)
81
- mean = tl.sum(x, axis=0) / n_cols
82
- x_centered = x - mean
83
- var = tl.sum(x_centered * x_centered, axis=0) / n_cols
84
- inv_std = tl.rsqrt(var + eps)
85
- y = x_centered * inv_std
86
- tl.store(y_ptr, y.to(tl.float16), mask=mask)
87
-
88
-
89
- @triton.jit
90
- def matmul_kernel(
91
- A_ptr,
92
- B_ptr,
93
- C_ptr,
94
- M,
95
- N,
96
- K,
97
- stride_am,
98
- stride_ak,
99
- stride_bk,
100
- stride_bn,
101
- stride_cm,
102
- stride_cn,
103
- BLOCK_M: tl.constexpr,
104
- BLOCK_N: tl.constexpr,
105
- BLOCK_K: tl.constexpr,
106
- ):
107
- pid = tl.program_id(axis=0)
108
- num_pid_m = tl.cdiv(M, BLOCK_M)
109
- num_pid_n = tl.cdiv(N, BLOCK_N)
110
- pid_m = pid // num_pid_n
111
- pid_n = pid % num_pid_n
112
-
113
- offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
114
- offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
115
- offs_k = tl.arange(0, BLOCK_K)
116
-
117
- a_ptrs = A_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
118
- b_ptrs = B_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
119
- acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
120
- for _ in range(0, tl.cdiv(K, BLOCK_K)):
121
- a = tl.load(a_ptrs, mask=(offs_m[:, None] < M) & (offs_k[None, :] < K), other=0.0)
122
- b = tl.load(b_ptrs, mask=(offs_k[:, None] < K) & (offs_n[None, :] < N), other=0.0)
123
- acc += tl.dot(a, b)
124
- a_ptrs += BLOCK_K * stride_ak
125
- b_ptrs += BLOCK_K * stride_bk
126
- offs_k += BLOCK_K
127
-
128
- c = acc.to(tl.float16)
129
- c_ptrs = C_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
130
- c_mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
131
- tl.store(c_ptrs, c, mask=c_mask)
132
-
133
-
134
- def _task_id(family: str, m: int, n: int) -> str:
135
- if family == "grouped_gemm":
136
- return f"{family}_g{GROUPED_GEMM_GROUP_COUNT}_k{GROUPED_GEMM_K}_m{m}_n{n}"
137
- return f"{family}_m{m}_n{n}"
138
-
139
-
140
- def _valid_configs(family: str, n: int) -> List[Tuple[int, int, int]]:
141
- configs: List[Tuple[int, int, int]] = []
142
- if family == "grouped_gemm":
143
- candidate_blocks = tuple(block for block in BLOCK_SIZES if block <= 1024)
144
- else:
145
- candidate_blocks = BLOCK_SIZES
146
- for block_size in candidate_blocks:
147
- if family != "grouped_gemm" and block_size < n:
148
- continue
149
- for num_warps in NUM_WARPS:
150
- for num_stages in NUM_STAGES:
151
- configs.append((block_size, num_warps, num_stages))
152
- return configs
153
-
154
-
155
- def _effective_gbps(sample: torch.Tensor, median_ms: float) -> float:
156
- bytes_processed = float(sample.numel() * sample.element_size() * 2)
157
- if median_ms <= 0:
158
- return 0.0
159
- return bytes_processed / (median_ms / 1000.0) / 1e9
160
-
161
-
162
- def _benchmark_layernorm_config(
163
- sample: torch.Tensor,
164
- block_size: int,
165
- num_warps: int,
166
- num_stages: int,
167
- repeats: int,
168
- warmup: int,
169
- ) -> float:
170
- output = torch.empty_like(sample)
171
- grid = (sample.shape[0],)
172
-
173
- def launch() -> None:
174
- fused_rowwise_layernorm_kernel[grid](
175
- sample,
176
- output,
177
- sample.stride(0),
178
- sample.stride(1),
179
- output.stride(0),
180
- output.stride(1),
181
- sample.shape[1],
182
- EPS,
183
- BLOCK_SIZE=block_size,
184
- num_warps=num_warps,
185
- num_stages=num_stages,
186
- )
187
-
188
- for _ in range(max(1, warmup)):
189
- launch()
190
- torch.cuda.synchronize()
191
-
192
- if ttesting is not None:
193
- result = ttesting.do_bench(
194
- launch,
195
- warmup=0,
196
- rep=repeats,
197
- quantiles=[0.5],
198
- return_mode="median",
199
- )
200
- if isinstance(result, dict):
201
- return float(result.get("median", 0.0))
202
- return float(result)
203
-
204
- events: List[float] = []
205
- start = torch.cuda.Event(enable_timing=True)
206
- end = torch.cuda.Event(enable_timing=True)
207
- for _ in range(max(1, repeats)):
208
- torch.cuda.synchronize()
209
- start.record()
210
- launch()
211
- end.record()
212
- end.synchronize()
213
- events.append(start.elapsed_time(end))
214
- return float(np.median(np.asarray(events, dtype=np.float32)))
215
-
216
-
217
- def _validate_layernorm(sample: torch.Tensor, block_size: int, num_warps: int, num_stages: int) -> float:
218
- ref = torch.nn.functional.layer_norm(sample.float(), (sample.shape[1],), eps=EPS).to(sample.dtype)
219
- out = torch.empty_like(sample)
220
- fused_rowwise_layernorm_kernel[(sample.shape[0],)](
221
- sample,
222
- out,
223
- sample.stride(0),
224
- sample.stride(1),
225
- out.stride(0),
226
- out.stride(1),
227
- sample.shape[1],
228
- EPS,
229
- BLOCK_SIZE=block_size,
230
- num_warps=num_warps,
231
- num_stages=num_stages,
232
- )
233
- torch.cuda.synchronize()
234
- return float((out.to(torch.float32) - ref).abs().max().item())
235
-
236
-
237
- def benchmark_layernorm_config(
238
- n: int,
239
- block_size: int,
240
- num_warps: int,
241
- num_stages: int,
242
- m: int = 4096,
243
- repeats: int = 200,
244
- warmup: int = 25,
245
- seed: int = 0,
246
- ) -> MultiFamilyMeasurementRow:
247
- if block_size < n:
248
- raise ValueError(f"Invalid config: BLOCK_SIZE {block_size} < N {n}")
249
-
250
- torch.manual_seed(seed)
251
- sample = torch.randn((m, n), device="cuda", dtype=torch.float16)
252
- median_ms = _benchmark_layernorm_config(
253
- sample=sample,
254
- block_size=block_size,
255
- num_warps=num_warps,
256
- num_stages=num_stages,
257
- repeats=repeats,
258
- warmup=warmup,
259
- )
260
- val_err = _validate_layernorm(
261
- sample=sample,
262
- block_size=block_size,
263
- num_warps=num_warps,
264
- num_stages=num_stages,
265
- )
266
- gbps = _effective_gbps(sample, median_ms)
267
- return _make_row(
268
- family="layernorm",
269
- m=m,
270
- n=n,
271
- block_size=block_size,
272
- num_warps=num_warps,
273
- num_stages=num_stages,
274
- median_ms=median_ms,
275
- effective_gbps=gbps,
276
- validation_error=val_err,
277
- )
278
-
279
-
280
- def _grouped_gemm_shapes(m: int, n: int) -> Tuple[int, int, int]:
281
- group_m = max(64, m // GROUPED_GEMM_GROUP_COUNT)
282
- return GROUPED_GEMM_GROUP_COUNT, group_m, GROUPED_GEMM_K
283
-
284
-
285
- def _matmul_meta_from_block(block_size: int) -> Tuple[int, int, int]:
286
- block_m = max(32, min(block_size // 4, 256))
287
- block_n = max(32, min(block_size // 4, 256))
288
- block_k = 32
289
- return block_m, block_n, block_k
290
-
291
-
292
- def _benchmark_grouped_gemm_config(
293
- a_groups: Sequence[torch.Tensor],
294
- b_groups: Sequence[torch.Tensor],
295
- block_size: int,
296
- num_warps: int,
297
- num_stages: int,
298
- repeats: int,
299
- warmup: int,
300
- ) -> float:
301
- c_groups = [torch.empty((a.shape[0], b.shape[1]), device=a.device, dtype=a.dtype) for a, b in zip(a_groups, b_groups)]
302
- block_m, block_n, block_k = _matmul_meta_from_block(block_size)
303
-
304
- def launch() -> None:
305
- for a, b, c in zip(a_groups, b_groups, c_groups):
306
- grid = (triton.cdiv(a.shape[0], block_m) * triton.cdiv(b.shape[1], block_n),)
307
- matmul_kernel[grid](
308
- a,
309
- b,
310
- c,
311
- a.shape[0],
312
- b.shape[1],
313
- a.shape[1],
314
- a.stride(0),
315
- a.stride(1),
316
- b.stride(0),
317
- b.stride(1),
318
- c.stride(0),
319
- c.stride(1),
320
- BLOCK_M=block_m,
321
- BLOCK_N=block_n,
322
- BLOCK_K=block_k,
323
- num_warps=num_warps,
324
- num_stages=num_stages,
325
- )
326
-
327
- for _ in range(max(1, warmup)):
328
- launch()
329
- torch.cuda.synchronize()
330
-
331
- if ttesting is not None:
332
- result = ttesting.do_bench(
333
- launch,
334
- warmup=0,
335
- rep=repeats,
336
- quantiles=[0.5],
337
- return_mode="median",
338
- )
339
- if isinstance(result, dict):
340
- return float(result.get("median", 0.0))
341
- return float(result)
342
-
343
- durations_ms: List[float] = []
344
- start = torch.cuda.Event(enable_timing=True)
345
- end = torch.cuda.Event(enable_timing=True)
346
- for _ in range(max(1, repeats)):
347
- torch.cuda.synchronize()
348
- start.record()
349
- launch()
350
- end.record()
351
- end.synchronize()
352
- durations_ms.append(start.elapsed_time(end))
353
- return float(np.median(np.asarray(durations_ms, dtype=np.float32)))
354
-
355
-
356
- def _validate_grouped_gemm(
357
- a_groups: Sequence[torch.Tensor],
358
- b_groups: Sequence[torch.Tensor],
359
- block_size: int,
360
- num_warps: int,
361
- num_stages: int,
362
- ) -> float:
363
- c_groups = [torch.empty((a.shape[0], b.shape[1]), device=a.device, dtype=a.dtype) for a, b in zip(a_groups, b_groups)]
364
- block_m, block_n, block_k = _matmul_meta_from_block(block_size)
365
- for a, b, c in zip(a_groups, b_groups, c_groups):
366
- grid = (triton.cdiv(a.shape[0], block_m) * triton.cdiv(b.shape[1], block_n),)
367
- matmul_kernel[grid](
368
- a,
369
- b,
370
- c,
371
- a.shape[0],
372
- b.shape[1],
373
- a.shape[1],
374
- a.stride(0),
375
- a.stride(1),
376
- b.stride(0),
377
- b.stride(1),
378
- c.stride(0),
379
- c.stride(1),
380
- BLOCK_M=block_m,
381
- BLOCK_N=block_n,
382
- BLOCK_K=block_k,
383
- num_warps=num_warps,
384
- num_stages=num_stages,
385
- )
386
- torch.cuda.synchronize()
387
-
388
- max_err = 0.0
389
- for a, b, c in zip(a_groups, b_groups, c_groups):
390
- ref = torch.matmul(a.float(), b.float()).to(torch.float16)
391
- max_err = max(max_err, float((c.float() - ref.float()).abs().max().item()))
392
- return max_err
393
-
394
-
395
- def benchmark_grouped_gemm_config(
396
- n: int,
397
- block_size: int,
398
- num_warps: int,
399
- num_stages: int,
400
- m: int = 4096,
401
- repeats: int = 200,
402
- warmup: int = 25,
403
- seed: int = 0,
404
- ) -> MultiFamilyMeasurementRow:
405
- group_count, group_m, k_dim = _grouped_gemm_shapes(m, n)
406
- torch.manual_seed(seed)
407
- a_groups = [torch.randn((group_m, k_dim), device="cuda", dtype=torch.float16) for _ in range(group_count)]
408
- b_groups = [torch.randn((k_dim, n), device="cuda", dtype=torch.float16) for _ in range(group_count)]
409
-
410
- median_ms = _benchmark_grouped_gemm_config(
411
- a_groups=a_groups,
412
- b_groups=b_groups,
413
- block_size=block_size,
414
- num_warps=num_warps,
415
- num_stages=num_stages,
416
- repeats=repeats,
417
- warmup=warmup,
418
- )
419
- val_err = _validate_grouped_gemm(
420
- a_groups=a_groups,
421
- b_groups=b_groups,
422
- block_size=block_size,
423
- num_warps=num_warps,
424
- num_stages=num_stages,
425
- )
426
- bytes_processed = group_count * (
427
- a_groups[0].numel() * a_groups[0].element_size()
428
- + b_groups[0].numel() * b_groups[0].element_size()
429
- + group_m * n * a_groups[0].element_size()
430
- )
431
- effective_gbps = bytes_processed / (median_ms / 1000.0) / 1e9 if median_ms > 0 else 0.0
432
- return _make_row(
433
- family="grouped_gemm",
434
- m=m,
435
- n=n,
436
- block_size=block_size,
437
- num_warps=num_warps,
438
- num_stages=num_stages,
439
- median_ms=median_ms,
440
- effective_gbps=effective_gbps,
441
- validation_error=val_err,
442
- )
443
-
444
-
445
- def _make_row(
446
- family: str,
447
- m: int,
448
- n: int,
449
- block_size: int,
450
- num_warps: int,
451
- num_stages: int,
452
- median_ms: float,
453
- effective_gbps: float,
454
- validation_error: float,
455
- ) -> MultiFamilyMeasurementRow:
456
- configs = _valid_configs(family, n)
457
- config_id = configs.index((block_size, num_warps, num_stages))
458
- return MultiFamilyMeasurementRow(
459
- family_group="A" if family in {"softmax", "layernorm"} else "B",
460
- family=family,
461
- task_id=_task_id(family, m, n),
462
- m=m,
463
- n=n,
464
- config_id=config_id,
465
- block_size=block_size,
466
- num_warps=num_warps,
467
- num_stages=num_stages,
468
- shape_json=json.dumps(
469
- {
470
- "family": family,
471
- "m": m,
472
- "n": n,
473
- "group_count": GROUPED_GEMM_GROUP_COUNT if family == "grouped_gemm" else None,
474
- "k": GROUPED_GEMM_K if family == "grouped_gemm" else None,
475
- },
476
- sort_keys=True,
477
- ),
478
- config_json=json.dumps(
479
- {
480
- "block_size": block_size,
481
- "num_warps": num_warps,
482
- "num_stages": num_stages,
483
- },
484
- sort_keys=True,
485
- ),
486
- median_ms=float(median_ms),
487
- effective_gbps=float(effective_gbps),
488
- score=float(-math.log(max(median_ms, np.finfo(float).tiny))),
489
- validation_error=float(validation_error),
490
- )
491
-
492
-
493
- def _softmax_row_to_multi(row: object) -> MultiFamilyMeasurementRow:
494
- return _make_row(
495
- family="softmax",
496
- m=int(row.m),
497
- n=int(row.n),
498
- block_size=int(row.block_size),
499
- num_warps=int(row.num_warps),
500
- num_stages=int(row.num_stages),
501
- median_ms=float(row.median_ms),
502
- effective_gbps=float(row.effective_gbps),
503
- validation_error=float(row.validation_error),
504
- )
505
-
506
-
507
- def collect_multifamily_measurements(
508
- output_path: Path,
509
- families: Sequence[str],
510
- n_values: Iterable[int],
511
- repeats: int,
512
- warmup: int,
513
- seed: int,
514
- m: int = 4096,
515
- append: bool = True,
516
- ) -> List[MultiFamilyMeasurementRow]:
517
- if not torch.cuda.is_available():
518
- raise RuntimeError("CUDA is required to run measurements on GPU.")
519
-
520
- output_path.parent.mkdir(parents=True, exist_ok=True)
521
- existing = set()
522
- if output_path.exists():
523
- with output_path.open("r", newline="", encoding="utf-8") as handle:
524
- for row in csv.DictReader(handle):
525
- existing.add(_row_key(row["family"], int(row["m"]), int(row["n"]), int(row["config_id"])))
526
-
527
- results: List[MultiFamilyMeasurementRow] = []
528
- for family in families:
529
- if family not in {"softmax", "layernorm", "grouped_gemm"}:
530
- raise ValueError(f"Unsupported family: {family}")
531
- for n in n_values:
532
- for config_id, (block_size, num_warps, num_stages) in enumerate(_valid_configs(family, n)):
533
- key = _row_key(family, m, n, config_id)
534
- if append and key in existing:
535
- continue
536
- if family == "softmax":
537
- row = _softmax_row_to_multi(
538
- benchmark_softmax_config(
539
- n=n,
540
- block_size=block_size,
541
- num_warps=num_warps,
542
- num_stages=num_stages,
543
- m=m,
544
- repeats=repeats,
545
- warmup=warmup,
546
- seed=seed,
547
- )
548
- )
549
- elif family == "layernorm":
550
- row = benchmark_layernorm_config(
551
- n=n,
552
- block_size=block_size,
553
- num_warps=num_warps,
554
- num_stages=num_stages,
555
- m=m,
556
- repeats=repeats,
557
- warmup=warmup,
558
- seed=seed,
559
- )
560
- else:
561
- row = benchmark_grouped_gemm_config(
562
- n=n,
563
- block_size=block_size,
564
- num_warps=num_warps,
565
- num_stages=num_stages,
566
- m=m,
567
- repeats=repeats,
568
- warmup=warmup,
569
- seed=seed,
570
- )
571
- results.append(row)
572
-
573
- if results:
574
- write_mode = "a" if output_path.exists() and append else "w"
575
- with output_path.open(write_mode, newline="", encoding="utf-8") as handle:
576
- writer = csv.writer(handle)
577
- if write_mode == "w" or output_path.stat().st_size == 0:
578
- writer.writerow(
579
- [
580
- "family_group",
581
- "family",
582
- "task_id",
583
- "m",
584
- "n",
585
- "config_id",
586
- "block_size",
587
- "num_warps",
588
- "num_stages",
589
- "shape_json",
590
- "config_json",
591
- "median_ms",
592
- "effective_gbps",
593
- "score",
594
- "validation_error",
595
- ]
596
- )
597
- for row in results:
598
- writer.writerow(
599
- [
600
- row.family_group,
601
- row.family,
602
- row.task_id,
603
- row.m,
604
- row.n,
605
- row.config_id,
606
- row.block_size,
607
- row.num_warps,
608
- row.num_stages,
609
- row.shape_json,
610
- row.config_json,
611
- row.median_ms,
612
- row.effective_gbps,
613
- row.score,
614
- row.validation_error,
615
- ]
616
- )
617
- return results
618
-
619
-
620
- def _row_key(family: str, m: int, n: int, config_id: int) -> str:
621
- return f"{family}|{m}|{n}|{config_id}"
622
-
623
-
624
- def parse_args() -> argparse.Namespace:
625
- parser = argparse.ArgumentParser(description="Collect multi-family GPU measurements.")
626
- parser.add_argument("--output", type=Path, default=Path("data/autotune_measurements.csv"))
627
- parser.add_argument(
628
- "--families",
629
- nargs="+",
630
- default=("softmax", "layernorm", "grouped_gemm"),
631
- choices=("softmax", "layernorm", "grouped_gemm"),
632
- )
633
- parser.add_argument("--n-cols", type=int, nargs="+", default=N_VALUES)
634
- parser.add_argument("--m", type=int, default=4096)
635
- parser.add_argument("--repeats", type=int, default=200)
636
- parser.add_argument("--warmup", type=int, default=25)
637
- parser.add_argument("--seed", type=int, default=0)
638
- parser.add_argument("--append", action="store_true")
639
- parser.add_argument("--overwrite", action="store_true")
640
- return parser.parse_args()
641
-
642
-
643
- def main() -> None:
644
- args = parse_args()
645
- append = args.append and not args.overwrite
646
- collect_multifamily_measurements(
647
- output_path=args.output,
648
- families=args.families,
649
- n_values=args.n_cols,
650
- repeats=args.repeats,
651
- warmup=args.warmup,
652
- seed=args.seed,
653
- m=args.m,
654
- append=append,
655
- )
656
-
657
-
658
- if __name__ == "__main__":
659
- main()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
scripts/collect_qwen_05b_measurements.py DELETED
@@ -1,449 +0,0 @@
1
- #!/usr/bin/env python3
2
- from __future__ import annotations
3
-
4
- import argparse
5
- import csv
6
- import json
7
- import math
8
- import sys
9
- from dataclasses import dataclass
10
- from pathlib import Path
11
- from typing import Dict, List, Sequence
12
-
13
- import numpy as np
14
- import torch
15
-
16
- ROOT = Path(__file__).resolve().parents[1]
17
- if str(ROOT) not in sys.path:
18
- sys.path.append(str(ROOT))
19
-
20
- try:
21
- import triton
22
- import triton.language as tl
23
- import triton.testing as ttesting
24
- except Exception as err: # pragma: no cover
25
- raise RuntimeError("Triton is required for Qwen kernel measurement.") from err
26
-
27
- from scripts.collect_measurements import BLOCK_SIZES, NUM_STAGES, NUM_WARPS, benchmark_single_config
28
- from scripts.qwen_05b_spec import QwenKernelTask, qwen_05b_tasks
29
-
30
-
31
- EPS = 1e-5
32
-
33
-
34
- @dataclass(frozen=True)
35
- class QwenMeasurementRow:
36
- family_group: str
37
- family: str
38
- task_id: str
39
- m: int
40
- n: int
41
- k: int
42
- config_id: int
43
- block_size: int
44
- num_warps: int
45
- num_stages: int
46
- shape_json: str
47
- config_json: str
48
- median_ms: float
49
- effective_gbps: float
50
- score: float
51
- validation_error: float
52
-
53
-
54
- @triton.jit
55
- def fused_rowwise_rmsnorm_kernel(
56
- X_ptr,
57
- Y_ptr,
58
- stride_xm,
59
- stride_xn,
60
- stride_ym,
61
- stride_yn,
62
- n_cols,
63
- eps,
64
- BLOCK_SIZE: tl.constexpr,
65
- ):
66
- row_idx = tl.program_id(0)
67
- col_offsets = tl.arange(0, BLOCK_SIZE)
68
- mask = col_offsets < n_cols
69
- x_ptr = X_ptr + row_idx * stride_xm + col_offsets
70
- y_ptr = Y_ptr + row_idx * stride_ym + col_offsets
71
-
72
- x = tl.load(x_ptr, mask=mask, other=0.0).to(tl.float32)
73
- mean_sq = tl.sum(x * x, axis=0) / n_cols
74
- inv_rms = tl.rsqrt(mean_sq + eps)
75
- y = x * inv_rms
76
- tl.store(y_ptr, y.to(tl.float16), mask=mask)
77
-
78
-
79
- @triton.jit
80
- def matmul_kernel(
81
- A_ptr,
82
- B_ptr,
83
- C_ptr,
84
- M,
85
- N,
86
- K,
87
- stride_am,
88
- stride_ak,
89
- stride_bk,
90
- stride_bn,
91
- stride_cm,
92
- stride_cn,
93
- BLOCK_M: tl.constexpr,
94
- BLOCK_N: tl.constexpr,
95
- BLOCK_K: tl.constexpr,
96
- ):
97
- pid = tl.program_id(axis=0)
98
- num_pid_m = tl.cdiv(M, BLOCK_M)
99
- num_pid_n = tl.cdiv(N, BLOCK_N)
100
- pid_m = pid // num_pid_n
101
- pid_n = pid % num_pid_n
102
-
103
- offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
104
- offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
105
- offs_k = tl.arange(0, BLOCK_K)
106
-
107
- a_ptrs = A_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
108
- b_ptrs = B_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
109
- acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
110
- for _ in range(0, tl.cdiv(K, BLOCK_K)):
111
- a = tl.load(a_ptrs, mask=(offs_m[:, None] < M) & (offs_k[None, :] < K), other=0.0)
112
- b = tl.load(b_ptrs, mask=(offs_k[:, None] < K) & (offs_n[None, :] < N), other=0.0)
113
- acc += tl.dot(a, b)
114
- a_ptrs += BLOCK_K * stride_ak
115
- b_ptrs += BLOCK_K * stride_bk
116
- offs_k += BLOCK_K
117
-
118
- c = acc.to(tl.float16)
119
- c_ptrs = C_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
120
- c_mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
121
- tl.store(c_ptrs, c, mask=c_mask)
122
-
123
-
124
- def _gemm_blocks(block_size: int) -> tuple[int, int, int]:
125
- block_m = max(16, min(block_size // 4, 128))
126
- block_n = max(16, min(block_size // 4, 128))
127
- block_k = 32
128
- return block_m, block_n, block_k
129
-
130
-
131
- def _effective_gbps(bytes_processed: int, median_ms: float) -> float:
132
- if median_ms <= 0:
133
- return 0.0
134
- return float(bytes_processed) / (median_ms / 1000.0) / 1e9
135
-
136
-
137
- def _score(ms: float) -> float:
138
- return float(-math.log(max(ms, np.finfo(float).tiny)))
139
-
140
-
141
- def _config_json(block_size: int, num_warps: int, num_stages: int) -> str:
142
- return json.dumps(
143
- {"block_size": block_size, "num_warps": num_warps, "num_stages": num_stages},
144
- sort_keys=True,
145
- )
146
-
147
-
148
- def _valid_configs(task: QwenKernelTask) -> List[tuple[int, int, int]]:
149
- configs: List[tuple[int, int, int]] = []
150
- for block_size in BLOCK_SIZES:
151
- if task.family in {"softmax", "rmsnorm"} and block_size < task.n:
152
- continue
153
- if task.family == "gemm" and block_size > 1024:
154
- continue
155
- for num_warps in NUM_WARPS:
156
- for num_stages in NUM_STAGES:
157
- configs.append((block_size, num_warps, num_stages))
158
- return configs
159
-
160
-
161
- def _benchmark_rmsnorm(task: QwenKernelTask, block_size: int, num_warps: int, num_stages: int, repeats: int, warmup: int, seed: int) -> QwenMeasurementRow:
162
- torch.manual_seed(seed)
163
- sample = torch.randn((task.m, task.n), device="cuda", dtype=torch.float16)
164
- output = torch.empty_like(sample)
165
- grid = (sample.shape[0],)
166
-
167
- def launch() -> None:
168
- fused_rowwise_rmsnorm_kernel[grid](
169
- sample,
170
- output,
171
- sample.stride(0),
172
- sample.stride(1),
173
- output.stride(0),
174
- output.stride(1),
175
- sample.shape[1],
176
- EPS,
177
- BLOCK_SIZE=block_size,
178
- num_warps=num_warps,
179
- num_stages=num_stages,
180
- )
181
-
182
- for _ in range(max(1, warmup)):
183
- launch()
184
- torch.cuda.synchronize()
185
- if ttesting is not None:
186
- result = ttesting.do_bench(launch, warmup=0, rep=repeats, quantiles=[0.5], return_mode="median")
187
- median_ms = float(result.get("median", 0.0) if isinstance(result, dict) else result)
188
- else:
189
- start = torch.cuda.Event(enable_timing=True)
190
- end = torch.cuda.Event(enable_timing=True)
191
- durations: List[float] = []
192
- for _ in range(max(1, repeats)):
193
- torch.cuda.synchronize()
194
- start.record()
195
- launch()
196
- end.record()
197
- end.synchronize()
198
- durations.append(start.elapsed_time(end))
199
- median_ms = float(np.median(np.asarray(durations, dtype=np.float32)))
200
-
201
- ref = sample.float() * torch.rsqrt(sample.float().pow(2).mean(dim=-1, keepdim=True) + EPS)
202
- fused_rowwise_rmsnorm_kernel[grid](
203
- sample,
204
- output,
205
- sample.stride(0),
206
- sample.stride(1),
207
- output.stride(0),
208
- output.stride(1),
209
- sample.shape[1],
210
- EPS,
211
- BLOCK_SIZE=block_size,
212
- num_warps=num_warps,
213
- num_stages=num_stages,
214
- )
215
- torch.cuda.synchronize()
216
- max_err = float((output.float() - ref.float()).abs().max().item())
217
- config_id = _valid_configs(task).index((block_size, num_warps, num_stages))
218
- return QwenMeasurementRow(
219
- family_group=task.family_group,
220
- family=task.family,
221
- task_id=task.task_id,
222
- m=task.m,
223
- n=task.n,
224
- k=task.k,
225
- config_id=config_id,
226
- block_size=block_size,
227
- num_warps=num_warps,
228
- num_stages=num_stages,
229
- shape_json=json.dumps(task.shape_fields(), sort_keys=True),
230
- config_json=_config_json(block_size, num_warps, num_stages),
231
- median_ms=median_ms,
232
- effective_gbps=_effective_gbps(sample.numel() * sample.element_size() * 2, median_ms),
233
- score=_score(median_ms),
234
- validation_error=max_err,
235
- )
236
-
237
-
238
- def _benchmark_gemm(task: QwenKernelTask, block_size: int, num_warps: int, num_stages: int, repeats: int, warmup: int, seed: int) -> QwenMeasurementRow:
239
- torch.manual_seed(seed)
240
- a = torch.randn((task.m, task.k), device="cuda", dtype=torch.float16)
241
- b = torch.randn((task.k, task.n), device="cuda", dtype=torch.float16)
242
- c = torch.empty((task.m, task.n), device="cuda", dtype=torch.float16)
243
- block_m, block_n, block_k = _gemm_blocks(block_size)
244
-
245
- def launch() -> None:
246
- grid = (triton.cdiv(task.m, block_m) * triton.cdiv(task.n, block_n),)
247
- matmul_kernel[grid](
248
- a,
249
- b,
250
- c,
251
- task.m,
252
- task.n,
253
- task.k,
254
- a.stride(0),
255
- a.stride(1),
256
- b.stride(0),
257
- b.stride(1),
258
- c.stride(0),
259
- c.stride(1),
260
- BLOCK_M=block_m,
261
- BLOCK_N=block_n,
262
- BLOCK_K=block_k,
263
- num_warps=num_warps,
264
- num_stages=num_stages,
265
- )
266
-
267
- for _ in range(max(1, warmup)):
268
- launch()
269
- torch.cuda.synchronize()
270
- if ttesting is not None:
271
- result = ttesting.do_bench(launch, warmup=0, rep=repeats, quantiles=[0.5], return_mode="median")
272
- median_ms = float(result.get("median", 0.0) if isinstance(result, dict) else result)
273
- else:
274
- start = torch.cuda.Event(enable_timing=True)
275
- end = torch.cuda.Event(enable_timing=True)
276
- durations: List[float] = []
277
- for _ in range(max(1, repeats)):
278
- torch.cuda.synchronize()
279
- start.record()
280
- launch()
281
- end.record()
282
- end.synchronize()
283
- durations.append(start.elapsed_time(end))
284
- median_ms = float(np.median(np.asarray(durations, dtype=np.float32)))
285
-
286
- ref = torch.matmul(a.float(), b.float()).to(torch.float16)
287
- launch()
288
- torch.cuda.synchronize()
289
- max_err = float((c.float() - ref.float()).abs().max().item())
290
- config_id = _valid_configs(task).index((block_size, num_warps, num_stages))
291
- bytes_processed = a.numel() * a.element_size() + b.numel() * b.element_size() + c.numel() * c.element_size()
292
- return QwenMeasurementRow(
293
- family_group=task.family_group,
294
- family=task.family,
295
- task_id=task.task_id,
296
- m=task.m,
297
- n=task.n,
298
- k=task.k,
299
- config_id=config_id,
300
- block_size=block_size,
301
- num_warps=num_warps,
302
- num_stages=num_stages,
303
- shape_json=json.dumps(task.shape_fields(), sort_keys=True),
304
- config_json=_config_json(block_size, num_warps, num_stages),
305
- median_ms=median_ms,
306
- effective_gbps=_effective_gbps(bytes_processed, median_ms),
307
- score=_score(median_ms),
308
- validation_error=max_err,
309
- )
310
-
311
-
312
- def benchmark_qwen_task(task: QwenKernelTask, block_size: int, num_warps: int, num_stages: int, repeats: int, warmup: int, seed: int) -> QwenMeasurementRow:
313
- if task.family == "softmax":
314
- row = benchmark_single_config(
315
- n=task.n,
316
- block_size=block_size,
317
- num_warps=num_warps,
318
- num_stages=num_stages,
319
- m=task.m,
320
- repeats=repeats,
321
- warmup=warmup,
322
- seed=seed,
323
- )
324
- config_id = _valid_configs(task).index((block_size, num_warps, num_stages))
325
- return QwenMeasurementRow(
326
- family_group=task.family_group,
327
- family=task.family,
328
- task_id=task.task_id,
329
- m=task.m,
330
- n=task.n,
331
- k=0,
332
- config_id=config_id,
333
- block_size=block_size,
334
- num_warps=num_warps,
335
- num_stages=num_stages,
336
- shape_json=json.dumps(task.shape_fields(), sort_keys=True),
337
- config_json=_config_json(block_size, num_warps, num_stages),
338
- median_ms=float(row.median_ms),
339
- effective_gbps=float(row.effective_gbps),
340
- score=float(row.score),
341
- validation_error=float(row.validation_error),
342
- )
343
- if task.family == "rmsnorm":
344
- return _benchmark_rmsnorm(task, block_size, num_warps, num_stages, repeats, warmup, seed)
345
- if task.family == "gemm":
346
- return _benchmark_gemm(task, block_size, num_warps, num_stages, repeats, warmup, seed)
347
- raise ValueError(f"Unsupported family: {task.family}")
348
-
349
-
350
- def collect_qwen_measurements(output_path: Path, repeats: int, warmup: int, seed: int, append: bool = True) -> List[QwenMeasurementRow]:
351
- output_path.parent.mkdir(parents=True, exist_ok=True)
352
- existing = set()
353
- if output_path.exists():
354
- with output_path.open("r", newline="", encoding="utf-8") as handle:
355
- for row in csv.DictReader(handle):
356
- existing.add((row["task_id"], int(row["config_id"])))
357
-
358
- results: List[QwenMeasurementRow] = []
359
- for task in qwen_05b_tasks():
360
- for config_id, (block_size, num_warps, num_stages) in enumerate(_valid_configs(task)):
361
- key = (task.task_id, config_id)
362
- if append and key in existing:
363
- continue
364
- results.append(
365
- benchmark_qwen_task(
366
- task=task,
367
- block_size=block_size,
368
- num_warps=num_warps,
369
- num_stages=num_stages,
370
- repeats=repeats,
371
- warmup=warmup,
372
- seed=seed,
373
- )
374
- )
375
-
376
- if results:
377
- write_mode = "a" if output_path.exists() and append else "w"
378
- with output_path.open(write_mode, newline="", encoding="utf-8") as handle:
379
- writer = csv.writer(handle)
380
- if write_mode == "w" or output_path.stat().st_size == 0:
381
- writer.writerow(
382
- [
383
- "family_group",
384
- "family",
385
- "task_id",
386
- "m",
387
- "n",
388
- "k",
389
- "config_id",
390
- "block_size",
391
- "num_warps",
392
- "num_stages",
393
- "shape_json",
394
- "config_json",
395
- "median_ms",
396
- "effective_gbps",
397
- "score",
398
- "validation_error",
399
- ]
400
- )
401
- for row in results:
402
- writer.writerow(
403
- [
404
- row.family_group,
405
- row.family,
406
- row.task_id,
407
- row.m,
408
- row.n,
409
- row.k,
410
- row.config_id,
411
- row.block_size,
412
- row.num_warps,
413
- row.num_stages,
414
- row.shape_json,
415
- row.config_json,
416
- row.median_ms,
417
- row.effective_gbps,
418
- row.score,
419
- row.validation_error,
420
- ]
421
- )
422
- return results
423
-
424
-
425
- def parse_args() -> argparse.Namespace:
426
- parser = argparse.ArgumentParser(description="Collect exact-kernel measurements for Qwen2.5-0.5B.")
427
- parser.add_argument("--output", type=Path, default=Path("data/qwen_05b_measurements.csv"))
428
- parser.add_argument("--repeats", type=int, default=200)
429
- parser.add_argument("--warmup", type=int, default=25)
430
- parser.add_argument("--seed", type=int, default=0)
431
- parser.add_argument("--append", action="store_true")
432
- parser.add_argument("--overwrite", action="store_true")
433
- return parser.parse_args()
434
-
435
-
436
- def main() -> None:
437
- args = parse_args()
438
- append = args.append and not args.overwrite
439
- collect_qwen_measurements(
440
- output_path=args.output,
441
- repeats=args.repeats,
442
- warmup=args.warmup,
443
- seed=args.seed,
444
- append=append,
445
- )
446
-
447
-
448
- if __name__ == "__main__":
449
- main()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
scripts/demo_compare.py DELETED
@@ -1,200 +0,0 @@
1
- #!/usr/bin/env python3
2
- from __future__ import annotations
3
-
4
- import argparse
5
- import json
6
- from dataclasses import dataclass
7
- import sys
8
- from pathlib import Path
9
- from typing import Any, Dict, List
10
-
11
- ROOT = Path(__file__).resolve().parents[1]
12
- if str(ROOT) not in sys.path:
13
- sys.path.append(str(ROOT))
14
-
15
- from server.softmax_surrogate_environment import SoftmaxSurrogateEnvironment
16
- from scripts.collect_measurements import benchmark_single_config
17
- from scripts.run_random_baseline import run_random_baseline
18
- from scripts.run_surrogate_baseline import run_surrogate_baseline
19
-
20
-
21
- @dataclass
22
- class BaselineResult:
23
- method: str
24
- latency_ms: float
25
- config: Dict[str, int]
26
- regret: float
27
-
28
-
29
- def _search_metric_key(summary: Dict[str, Any], oracle_best_ms: float) -> tuple[float, float, float]:
30
- metrics = summary.get("aggregate_metrics", {})
31
- mean_auc_regret = float(metrics.get("mean_auc_regret", float("inf")))
32
- oracle_hit_rate_final = float(metrics.get("oracle_hit_rate_final", 0.0))
33
- best_latency_ms = float(summary["best_overall"]["latency_ms"])
34
- latency_regret = best_latency_ms / oracle_best_ms - 1.0
35
- return (mean_auc_regret, -oracle_hit_rate_final, latency_regret)
36
-
37
-
38
- def _heuristic_for_task(task_id: str, task_rows: List[Dict[str, Any]], env: SoftmaxSurrogateEnvironment) -> BaselineResult:
39
- n = int(task_id.split("_n")[-1])
40
- block = min(row["block_size"] for row in task_rows if row["block_size"] >= n)
41
- warp = 4 if 4 in {row["num_warps"] for row in task_rows} else 2
42
- stage = 2 if 2 in {row["num_stages"] for row in task_rows} else 1
43
-
44
- candidate = None
45
- for row in task_rows:
46
- if row["block_size"] == block and row["num_warps"] == warp and row["num_stages"] == stage:
47
- candidate = row
48
- break
49
- if candidate is None:
50
- candidate = min(
51
- task_rows,
52
- key=lambda row: abs(row["block_size"] - block) + 10 * abs(row["num_warps"] - warp),
53
- )
54
-
55
- latency_ms = env.measured_latency_ms(candidate["config_id"])
56
- oracle_best_ms = env.oracle_best()["median_ms"]
57
- return BaselineResult(
58
- method="heuristic",
59
- latency_ms=float(latency_ms),
60
- config=candidate,
61
- regret=float(latency_ms / oracle_best_ms - 1.0),
62
- )
63
-
64
-
65
- def _pick_task(task_arg: str | None, measurement_path: str, budget: int) -> str:
66
- env = SoftmaxSurrogateEnvironment(measurement_path=measurement_path, budget=budget, seed=0)
67
- if task_arg:
68
- env.reset(task=task_arg)
69
- else:
70
- env.reset()
71
- return env.state()["task_id"]
72
-
73
-
74
- def _run_all(
75
- task: str,
76
- budget: int,
77
- episodes: int,
78
- seed: int,
79
- measurement_path: str,
80
- acquisition: str,
81
- beta: float,
82
- xi: float,
83
- ) -> Dict[str, Any]:
84
- env = SoftmaxSurrogateEnvironment(measurement_path=measurement_path, budget=budget, seed=seed)
85
- env.reset(task=task)
86
- task_rows = env.available_configs()
87
- oracle_best = env.oracle_best()
88
-
89
- heuristic = _heuristic_for_task(task, task_rows, env)
90
- random_summary = run_random_baseline(task, episodes=episodes, budget=budget, seed=seed, measurement_path=measurement_path)
91
- surrogate_summary = run_surrogate_baseline(
92
- task,
93
- episodes=episodes,
94
- budget=budget,
95
- seed=seed,
96
- measurement_path=measurement_path,
97
- acquisition=acquisition,
98
- beta=beta,
99
- xi=xi,
100
- )
101
-
102
- search_summaries = {
103
- "random": random_summary,
104
- "surrogate": surrogate_summary,
105
- }
106
- winner_method, winner_summary = min(
107
- search_summaries.items(),
108
- key=lambda item: _search_metric_key(item[1], oracle_best["median_ms"]),
109
- )
110
- winner_cfg = winner_summary["best_overall"]["config"]
111
- winner_regret = float(winner_summary["best_overall"]["latency_ms"] / oracle_best["median_ms"] - 1.0)
112
- n = int(task.split("_n")[-1])
113
-
114
- live = benchmark_single_config(
115
- n=n,
116
- block_size=winner_cfg["block_size"],
117
- num_warps=winner_cfg["num_warps"],
118
- num_stages=winner_cfg["num_stages"],
119
- repeats=max(200, budget * 20),
120
- warmup=25,
121
- seed=seed + 999,
122
- )
123
-
124
- return {
125
- "task": task,
126
- "seed": seed,
127
- "budget": budget,
128
- "episodes": episodes,
129
- "acquisition": acquisition,
130
- "beta": beta,
131
- "xi": xi,
132
- "oracle_best": oracle_best,
133
- "heuristic": heuristic.__dict__,
134
- "random": random_summary["best_overall"],
135
- "random_aggregate_metrics": random_summary.get("aggregate_metrics", {}),
136
- "surrogate": surrogate_summary["best_overall"],
137
- "surrogate_aggregate_metrics": surrogate_summary.get("aggregate_metrics", {}),
138
- "winner": {
139
- "method": winner_method,
140
- "selection_metric": "min(mean_auc_regret), tie-break max(oracle_hit_rate_final), then best latency",
141
- "latency_ms": winner_summary["best_overall"]["latency_ms"],
142
- "config": winner_cfg,
143
- "regret": winner_regret,
144
- "live_rerun": live.__dict__,
145
- },
146
- }
147
-
148
-
149
- def parse_args() -> argparse.Namespace:
150
- parser = argparse.ArgumentParser(description="Compare heuristic/random/surrogate baselines.")
151
- parser.add_argument(
152
- "--task",
153
- default="softmax_m4096_n2048",
154
- help="Task ID (e.g., softmax_m4096_n2048)",
155
- )
156
- parser.add_argument("--budget", type=int, default=6)
157
- parser.add_argument("--episodes", type=int, default=20)
158
- parser.add_argument("--seed", type=int, default=2)
159
- parser.add_argument(
160
- "--acquisition",
161
- type=str,
162
- choices=("mean", "ucb", "ei"),
163
- default="ucb",
164
- )
165
- parser.add_argument("--beta", type=float, default=2.0)
166
- parser.add_argument("--xi", type=float, default=0.0)
167
- parser.add_argument(
168
- "--measurement-path",
169
- type=str,
170
- default="data/autotune_measurements.csv",
171
- )
172
- parser.add_argument(
173
- "--output",
174
- type=Path,
175
- default=Path("outputs/demo_compare.json"),
176
- )
177
- return parser.parse_args()
178
-
179
-
180
- def main() -> None:
181
- args = parse_args()
182
- task = _pick_task(args.task, args.measurement_path, args.budget)
183
- summary = _run_all(
184
- task=task,
185
- budget=args.budget,
186
- episodes=args.episodes,
187
- seed=args.seed,
188
- measurement_path=args.measurement_path,
189
- acquisition=args.acquisition,
190
- beta=args.beta,
191
- xi=args.xi,
192
- )
193
- args.output.parent.mkdir(parents=True, exist_ok=True)
194
- with args.output.open("w", encoding="utf-8") as f:
195
- json.dump(summary, f, indent=2)
196
- print(json.dumps(summary, indent=2))
197
-
198
-
199
- if __name__ == "__main__":
200
- main()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
scripts/eval_generalization.py DELETED
@@ -1,147 +0,0 @@
1
- #!/usr/bin/env python3
2
- from __future__ import annotations
3
-
4
- import argparse
5
- import json
6
- import sys
7
- from pathlib import Path
8
- from typing import Any, Dict, List
9
-
10
- import numpy as np
11
-
12
- ROOT = Path(__file__).resolve().parents[1]
13
- if str(ROOT) not in sys.path:
14
- sys.path.append(str(ROOT))
15
-
16
- from scripts.run_random_baseline import run_random_baseline
17
- from scripts.run_surrogate_baseline import run_surrogate_baseline
18
-
19
-
20
- def _average_metric_dict(records: List[Dict[str, float]]) -> Dict[str, float]:
21
- if not records:
22
- return {}
23
- keys = sorted({key for record in records for key in record.keys()}, key=lambda value: int(value))
24
- return {
25
- key: float(np.mean(np.asarray([record[key] for record in records if key in record], dtype=np.float32)))
26
- for key in keys
27
- }
28
-
29
-
30
- def _summarize_runs(runs: List[Dict[str, Any]]) -> Dict[str, Any]:
31
- mean_regret_records = [run["aggregate_metrics"].get("mean_regret_at", {}) for run in runs]
32
- median_regret_records = [run["aggregate_metrics"].get("median_regret_at", {}) for run in runs]
33
- auc_values = [run["aggregate_metrics"].get("mean_auc_regret") for run in runs]
34
- oracle_hit_values = [run["aggregate_metrics"].get("oracle_hit_rate_final") for run in runs]
35
- return {
36
- "mean_regret_at": _average_metric_dict(mean_regret_records),
37
- "median_regret_at": _average_metric_dict(median_regret_records),
38
- "mean_best_so_far_auc": float(np.mean(np.asarray(auc_values, dtype=np.float32))) if auc_values else None,
39
- "mean_oracle_hit_rate_final": float(np.mean(np.asarray(oracle_hit_values, dtype=np.float32))) if oracle_hit_values else None,
40
- }
41
-
42
-
43
- def _evaluate_section(
44
- section_name: str,
45
- split: Dict[str, Any],
46
- measurement_path: str,
47
- episodes: int,
48
- budget: int,
49
- seed: int,
50
- acquisition: str,
51
- beta: float,
52
- xi: float,
53
- ) -> Dict[str, Any]:
54
- train_tasks = split["train_tasks"]
55
- test_tasks = split["test_tasks"]
56
- random_runs: List[Dict[str, Any]] = []
57
- surrogate_runs: List[Dict[str, Any]] = []
58
-
59
- for idx, task in enumerate(test_tasks):
60
- task_seed = seed + idx * 1000
61
- random_runs.append(
62
- run_random_baseline(
63
- task=task,
64
- episodes=episodes,
65
- budget=budget,
66
- seed=task_seed,
67
- measurement_path=measurement_path,
68
- )
69
- )
70
- surrogate_runs.append(
71
- run_surrogate_baseline(
72
- task=task,
73
- episodes=episodes,
74
- budget=budget,
75
- seed=task_seed,
76
- measurement_path=measurement_path,
77
- train_task_ids=train_tasks,
78
- acquisition=acquisition,
79
- beta=beta,
80
- xi=xi,
81
- )
82
- )
83
-
84
- return {
85
- "section": section_name,
86
- "train_tasks": train_tasks,
87
- "test_tasks": test_tasks,
88
- "random_summary": _summarize_runs(random_runs),
89
- "surrogate_summary": _summarize_runs(surrogate_runs),
90
- "task_runs": {
91
- "random": random_runs,
92
- "surrogate": surrogate_runs,
93
- },
94
- }
95
-
96
-
97
- def main() -> None:
98
- parser = argparse.ArgumentParser(description="Evaluate random vs surrogate on shape and family holdout splits.")
99
- parser.add_argument("--measurement-path", type=str, default="data/autotune_measurements.csv")
100
- parser.add_argument("--splits", type=Path, default=Path("data/benchmark_splits.json"))
101
- parser.add_argument("--episodes", type=int, default=20)
102
- parser.add_argument("--budget", type=int, default=6)
103
- parser.add_argument("--seed", type=int, default=2)
104
- parser.add_argument("--acquisition", choices=("mean", "ucb", "ei"), default="ucb")
105
- parser.add_argument("--beta", type=float, default=2.0)
106
- parser.add_argument("--xi", type=float, default=0.0)
107
- parser.add_argument("--output", type=Path, default=Path("outputs/generalization_eval.json"))
108
- args = parser.parse_args()
109
-
110
- splits = json.loads(args.splits.read_text(encoding="utf-8"))
111
- sections = {
112
- "shape_generalization": splits["shape_generalization"],
113
- "family_holdout": splits["family_holdout"],
114
- }
115
- results = {
116
- name: _evaluate_section(
117
- section_name=name,
118
- split=section,
119
- measurement_path=args.measurement_path,
120
- episodes=args.episodes,
121
- budget=args.budget,
122
- seed=args.seed,
123
- acquisition=args.acquisition,
124
- beta=args.beta,
125
- xi=args.xi,
126
- )
127
- for name, section in sections.items()
128
- }
129
-
130
- summary = {
131
- "measurement_path": args.measurement_path,
132
- "splits_path": str(args.splits),
133
- "episodes": args.episodes,
134
- "budget": args.budget,
135
- "acquisition": args.acquisition,
136
- "beta": args.beta,
137
- "xi": args.xi,
138
- "results": results,
139
- }
140
- args.output.parent.mkdir(parents=True, exist_ok=True)
141
- with args.output.open("w", encoding="utf-8") as handle:
142
- json.dump(summary, handle, indent=2)
143
- print(json.dumps(summary, indent=2))
144
-
145
-
146
- if __name__ == "__main__":
147
- main()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
scripts/qwen_05b_spec.py DELETED
@@ -1,115 +0,0 @@
1
- #!/usr/bin/env python3
2
- from __future__ import annotations
3
-
4
- from dataclasses import dataclass
5
- from typing import Dict, List
6
-
7
-
8
- MODEL_ID = "Qwen/Qwen2.5-0.5B"
9
- HIDDEN_SIZE = 896
10
- INTERMEDIATE_SIZE = 4864
11
- NUM_HIDDEN_LAYERS = 24
12
- NUM_ATTENTION_HEADS = 14
13
- NUM_KEY_VALUE_HEADS = 2
14
- HEAD_DIM = HIDDEN_SIZE // NUM_ATTENTION_HEADS
15
- KV_PROJ_SIZE = NUM_KEY_VALUE_HEADS * HEAD_DIM
16
- DTYPE = "bfloat16"
17
- BATCH_SIZE = 1
18
-
19
- PREFILL_SEQ_LENS = (128, 512, 2048)
20
- DECODE_CTX_LENS = (128, 512, 2048, 8192)
21
-
22
- MODE_IDS = {
23
- "prefill": 1,
24
- "decode": 2,
25
- }
26
-
27
- ROLE_IDS = {
28
- "rmsnorm": 1,
29
- "attn_softmax": 2,
30
- "q_proj": 3,
31
- "k_proj": 4,
32
- "v_proj": 5,
33
- "o_proj": 6,
34
- "gate_proj": 7,
35
- "up_proj": 8,
36
- "down_proj": 9,
37
- }
38
-
39
-
40
- @dataclass(frozen=True)
41
- class QwenKernelTask:
42
- family_group: str
43
- family: str
44
- task_id: str
45
- role: str
46
- mode: str
47
- m: int
48
- n: int
49
- k: int
50
- seq_len: int
51
- ctx_len: int
52
-
53
- def shape_fields(self) -> Dict[str, int | str]:
54
- return {
55
- "family_group": self.family_group,
56
- "family": self.family,
57
- "role": self.role,
58
- "mode": self.mode,
59
- "role_id": ROLE_IDS[self.role],
60
- "mode_id": MODE_IDS[self.mode],
61
- "m": self.m,
62
- "n": self.n,
63
- "k": self.k,
64
- "seq_len": self.seq_len,
65
- "ctx_len": self.ctx_len,
66
- "hidden_size": HIDDEN_SIZE,
67
- "intermediate_size": INTERMEDIATE_SIZE,
68
- "num_attention_heads": NUM_ATTENTION_HEADS,
69
- "num_key_value_heads": NUM_KEY_VALUE_HEADS,
70
- "head_dim": HEAD_DIM,
71
- "dtype": DTYPE,
72
- "model_id": MODEL_ID,
73
- }
74
-
75
-
76
- def _task_id(role: str, mode: str, m: int, n: int, k: int, seq_len: int, ctx_len: int) -> str:
77
- extra = f"_k{k}" if k > 0 else ""
78
- ctx = f"_ctx{ctx_len}" if ctx_len > 0 else ""
79
- return f"qwen05b_{role}_{mode}_m{m}_n{n}{extra}_seq{seq_len}{ctx}"
80
-
81
-
82
- def qwen_05b_tasks() -> List[QwenKernelTask]:
83
- tasks: List[QwenKernelTask] = []
84
-
85
- for seq_len in PREFILL_SEQ_LENS:
86
- tasks.extend(
87
- [
88
- QwenKernelTask("QWEN", "rmsnorm", _task_id("rmsnorm", "prefill", seq_len, HIDDEN_SIZE, 0, seq_len, seq_len), "rmsnorm", "prefill", seq_len, HIDDEN_SIZE, 0, seq_len, seq_len),
89
- QwenKernelTask("QWEN", "softmax", _task_id("attn_softmax", "prefill", NUM_ATTENTION_HEADS * seq_len, seq_len, 0, seq_len, seq_len), "attn_softmax", "prefill", NUM_ATTENTION_HEADS * seq_len, seq_len, 0, seq_len, seq_len),
90
- QwenKernelTask("QWEN", "gemm", _task_id("q_proj", "prefill", seq_len, HIDDEN_SIZE, HIDDEN_SIZE, seq_len, seq_len), "q_proj", "prefill", seq_len, HIDDEN_SIZE, HIDDEN_SIZE, seq_len, seq_len),
91
- QwenKernelTask("QWEN", "gemm", _task_id("k_proj", "prefill", seq_len, KV_PROJ_SIZE, HIDDEN_SIZE, seq_len, seq_len), "k_proj", "prefill", seq_len, KV_PROJ_SIZE, HIDDEN_SIZE, seq_len, seq_len),
92
- QwenKernelTask("QWEN", "gemm", _task_id("v_proj", "prefill", seq_len, KV_PROJ_SIZE, HIDDEN_SIZE, seq_len, seq_len), "v_proj", "prefill", seq_len, KV_PROJ_SIZE, HIDDEN_SIZE, seq_len, seq_len),
93
- QwenKernelTask("QWEN", "gemm", _task_id("o_proj", "prefill", seq_len, HIDDEN_SIZE, HIDDEN_SIZE, seq_len, seq_len), "o_proj", "prefill", seq_len, HIDDEN_SIZE, HIDDEN_SIZE, seq_len, seq_len),
94
- QwenKernelTask("QWEN", "gemm", _task_id("gate_proj", "prefill", seq_len, INTERMEDIATE_SIZE, HIDDEN_SIZE, seq_len, seq_len), "gate_proj", "prefill", seq_len, INTERMEDIATE_SIZE, HIDDEN_SIZE, seq_len, seq_len),
95
- QwenKernelTask("QWEN", "gemm", _task_id("up_proj", "prefill", seq_len, INTERMEDIATE_SIZE, HIDDEN_SIZE, seq_len, seq_len), "up_proj", "prefill", seq_len, INTERMEDIATE_SIZE, HIDDEN_SIZE, seq_len, seq_len),
96
- QwenKernelTask("QWEN", "gemm", _task_id("down_proj", "prefill", seq_len, HIDDEN_SIZE, INTERMEDIATE_SIZE, seq_len, seq_len), "down_proj", "prefill", seq_len, HIDDEN_SIZE, INTERMEDIATE_SIZE, seq_len, seq_len),
97
- ]
98
- )
99
-
100
- for ctx_len in DECODE_CTX_LENS:
101
- tasks.extend(
102
- [
103
- QwenKernelTask("QWEN", "rmsnorm", _task_id("rmsnorm", "decode", 1, HIDDEN_SIZE, 0, 1, ctx_len), "rmsnorm", "decode", 1, HIDDEN_SIZE, 0, 1, ctx_len),
104
- QwenKernelTask("QWEN", "softmax", _task_id("attn_softmax", "decode", NUM_ATTENTION_HEADS, ctx_len, 0, 1, ctx_len), "attn_softmax", "decode", NUM_ATTENTION_HEADS, ctx_len, 0, 1, ctx_len),
105
- QwenKernelTask("QWEN", "gemm", _task_id("q_proj", "decode", 1, HIDDEN_SIZE, HIDDEN_SIZE, 1, ctx_len), "q_proj", "decode", 1, HIDDEN_SIZE, HIDDEN_SIZE, 1, ctx_len),
106
- QwenKernelTask("QWEN", "gemm", _task_id("k_proj", "decode", 1, KV_PROJ_SIZE, HIDDEN_SIZE, 1, ctx_len), "k_proj", "decode", 1, KV_PROJ_SIZE, HIDDEN_SIZE, 1, ctx_len),
107
- QwenKernelTask("QWEN", "gemm", _task_id("v_proj", "decode", 1, KV_PROJ_SIZE, HIDDEN_SIZE, 1, ctx_len), "v_proj", "decode", 1, KV_PROJ_SIZE, HIDDEN_SIZE, 1, ctx_len),
108
- QwenKernelTask("QWEN", "gemm", _task_id("o_proj", "decode", 1, HIDDEN_SIZE, HIDDEN_SIZE, 1, ctx_len), "o_proj", "decode", 1, HIDDEN_SIZE, HIDDEN_SIZE, 1, ctx_len),
109
- QwenKernelTask("QWEN", "gemm", _task_id("gate_proj", "decode", 1, INTERMEDIATE_SIZE, HIDDEN_SIZE, 1, ctx_len), "gate_proj", "decode", 1, INTERMEDIATE_SIZE, HIDDEN_SIZE, 1, ctx_len),
110
- QwenKernelTask("QWEN", "gemm", _task_id("up_proj", "decode", 1, INTERMEDIATE_SIZE, HIDDEN_SIZE, 1, ctx_len), "up_proj", "decode", 1, INTERMEDIATE_SIZE, HIDDEN_SIZE, 1, ctx_len),
111
- QwenKernelTask("QWEN", "gemm", _task_id("down_proj", "decode", 1, HIDDEN_SIZE, INTERMEDIATE_SIZE, 1, ctx_len), "down_proj", "decode", 1, HIDDEN_SIZE, INTERMEDIATE_SIZE, 1, ctx_len),
112
- ]
113
- )
114
-
115
- return tasks
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
scripts/report_task_hardness.py DELETED
@@ -1,42 +0,0 @@
1
- #!/usr/bin/env python3
2
- from __future__ import annotations
3
-
4
- import argparse
5
- import csv
6
- from collections import defaultdict
7
- from pathlib import Path
8
- from typing import Dict, List
9
-
10
-
11
- def _load_rows(path: Path) -> Dict[str, List[float]]:
12
- grouped: Dict[str, List[float]] = defaultdict(list)
13
- with path.open("r", newline="", encoding="utf-8") as f:
14
- for row in csv.DictReader(f):
15
- grouped[row["task_id"]].append(float(row["median_ms"]))
16
- return grouped
17
-
18
-
19
- def main() -> None:
20
- parser = argparse.ArgumentParser(description="Report task hardness from measured latency table.")
21
- parser.add_argument("--measurement-path", type=Path, default=Path("data/autotune_measurements.csv"))
22
- parser.add_argument("--budget", type=int, default=6)
23
- args = parser.parse_args()
24
-
25
- grouped = _load_rows(args.measurement_path)
26
- for task_id, vals in sorted(grouped.items()):
27
- vals = sorted(vals)
28
- best = vals[0]
29
- ncfg = len(vals)
30
- within1 = sum(v <= best * 1.01 for v in vals)
31
- within2 = sum(v <= best * 1.02 for v in vals)
32
- within5 = sum(v <= best * 1.05 for v in vals)
33
- hit_best = 1.0 - (1.0 - 1.0 / ncfg) ** args.budget
34
- print(
35
- f"{task_id} ncfg={ncfg} best_ms={best:.9f} "
36
- f"within1={within1} within2={within2} within5={within5} "
37
- f"random_hit_best@{args.budget}={hit_best:.4f}"
38
- )
39
-
40
-
41
- if __name__ == "__main__":
42
- main()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
scripts/run_full_pipeline.sh DELETED
@@ -1,84 +0,0 @@
1
- #!/usr/bin/env bash
2
- set -euo pipefail
3
-
4
- SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
5
- REPO_ROOT="$(cd "${SCRIPT_DIR}/.." && pwd)"
6
- cd "${REPO_ROOT}"
7
-
8
- MEASUREMENTS="${REPO_ROOT}/data/autotune_measurements.csv"
9
- SPLITS="${REPO_ROOT}/data/benchmark_splits.json"
10
- OUTPUT_DIR="${REPO_ROOT}/outputs"
11
- mkdir -p "${OUTPUT_DIR}"
12
- EPISODES=20
13
- BUDGET=6
14
- BETA=2.0
15
-
16
- echo "[step] 0) checking workspace"
17
- python3 - <<'PY'
18
- import torch
19
-
20
- print(f"python={__import__('sys').executable}")
21
- print(f"torch={getattr(__import__('torch'), '__version__', 'missing')}")
22
- print(f"torch.cuda.is_available={torch.cuda.is_available()}")
23
- if not torch.cuda.is_available():
24
- raise SystemExit("CUDA is not available. Run this pipeline on a GPU machine.")
25
- print("ready")
26
- PY
27
-
28
- echo "[step] 1) collect multi-family measurements"
29
- python3 scripts/collect_multifamily_measurements.py \
30
- --output "${MEASUREMENTS}" \
31
- --families softmax layernorm grouped_gemm \
32
- --n-cols 256 512 1024 1536 2048 3072 4096 6144 8192 \
33
- --m 4096 \
34
- --repeats 200 \
35
- --warmup 25 \
36
- --seed 0 \
37
- --append
38
-
39
- echo "[step] 2) build train/test splits"
40
- python3 scripts/build_benchmark_splits.py \
41
- --measurement-path "${MEASUREMENTS}" \
42
- --output "${SPLITS}" \
43
- --heldout-family grouped_gemm
44
-
45
- echo "[step] 3) local smoke test"
46
- python3 scripts/smoke_test_client.py | tee "${OUTPUT_DIR}/smoke_test_client.json"
47
-
48
- echo "[step] 4) evaluate held-out shapes and held-out family"
49
- python3 scripts/eval_generalization.py \
50
- --measurement-path "${MEASUREMENTS}" \
51
- --splits "${SPLITS}" \
52
- --episodes "${EPISODES}" \
53
- --budget "${BUDGET}" \
54
- --seed 2 \
55
- --acquisition ucb \
56
- --beta "${BETA}" \
57
- --output "${OUTPUT_DIR}/generalization_eval.json"
58
-
59
- echo "[step] 5) benchmark eager vs torch.compile vs best Triton configs"
60
- python3 scripts/benchmark_runtime_references.py \
61
- --generalization-results "${OUTPUT_DIR}/generalization_eval.json" \
62
- --repeats 100 \
63
- --warmup 10 \
64
- --seed 123 \
65
- --output "${OUTPUT_DIR}/runtime_references.json"
66
-
67
- python3 - <<'PY'
68
- import json
69
- from pathlib import Path
70
-
71
- summary = json.loads(Path("outputs/generalization_eval.json").read_text(encoding="utf-8"))
72
- for section_name, section in summary["results"].items():
73
- print(section_name)
74
- print(" random:", section["random_summary"])
75
- print(" surrogate:", section["surrogate_summary"])
76
-
77
- runtime = json.loads(Path("outputs/runtime_references.json").read_text(encoding="utf-8"))
78
- for task_id, task in runtime["results"].items():
79
- print(task_id)
80
- print(" torch:", task["torch"])
81
- print(" speedups:", task["speedups"])
82
- PY
83
-
84
- echo "[done] outputs in outputs/"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
scripts/run_qwen_05b_pipeline.sh DELETED
@@ -1,90 +0,0 @@
1
- #!/usr/bin/env bash
2
- set -euo pipefail
3
-
4
- SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
5
- REPO_ROOT="$(cd "${SCRIPT_DIR}/.." && pwd)"
6
- cd "${REPO_ROOT}"
7
-
8
- MEASUREMENTS="${REPO_ROOT}/data/qwen_05b_measurements.csv"
9
- SPLITS="${REPO_ROOT}/data/qwen_05b_splits.json"
10
- OUTPUT_DIR="${REPO_ROOT}/outputs"
11
- mkdir -p "${OUTPUT_DIR}"
12
- EPISODES=20
13
- BUDGET=6
14
- BETA=2.0
15
-
16
- echo "[step] 0) checking workspace"
17
- python3 - <<'PY'
18
- import torch
19
-
20
- print(f"python={__import__('sys').executable}")
21
- print(f"torch={getattr(__import__('torch'), '__version__', 'missing')}")
22
- print(f"torch.cuda.is_available={torch.cuda.is_available()}")
23
- if not torch.cuda.is_available():
24
- raise SystemExit("CUDA is not available. Run this pipeline on a GPU machine.")
25
- print("ready")
26
- PY
27
-
28
- echo "[step] 1) collect exact Qwen2.5-0.5B kernel measurements"
29
- python3 scripts/collect_qwen_05b_measurements.py \
30
- --output "${MEASUREMENTS}" \
31
- --repeats 200 \
32
- --warmup 25 \
33
- --seed 0 \
34
- --append
35
-
36
- echo "[step] 2) build Qwen splits"
37
- python3 scripts/build_qwen_05b_splits.py --output "${SPLITS}"
38
-
39
- echo "[step] 3) local smoke test against Qwen measurement cache"
40
- python3 - <<'PY'
41
- import json
42
- from pathlib import Path
43
- from client import SoftmaxSurrogateEnvClient
44
-
45
- client = SoftmaxSurrogateEnvClient(measurement_path="data/qwen_05b_measurements.csv")
46
- reset_out = client.reset()
47
- first_config = reset_out["observation"]["tried_config_ids"][0]
48
- step_out = client.step({"config_id": first_config})
49
- summary = {"reset": reset_out, "step": step_out}
50
- Path("outputs/qwen_05b_smoke_test.json").write_text(json.dumps(summary, indent=2), encoding="utf-8")
51
- print(json.dumps(summary, indent=2))
52
- PY
53
-
54
- echo "[step] 4) evaluate held-out shapes and held-out family"
55
- python3 scripts/eval_generalization.py \
56
- --measurement-path "${MEASUREMENTS}" \
57
- --splits "${SPLITS}" \
58
- --episodes "${EPISODES}" \
59
- --budget "${BUDGET}" \
60
- --seed 2 \
61
- --acquisition ucb \
62
- --beta "${BETA}" \
63
- --output "${OUTPUT_DIR}/qwen_05b_generalization_eval.json"
64
-
65
- echo "[step] 5) benchmark eager vs torch.compile vs best Triton configs"
66
- python3 scripts/benchmark_qwen_05b_runtime.py \
67
- --generalization-results "${OUTPUT_DIR}/qwen_05b_generalization_eval.json" \
68
- --repeats 100 \
69
- --warmup 10 \
70
- --seed 123 \
71
- --output "${OUTPUT_DIR}/qwen_05b_runtime_references.json"
72
-
73
- python3 - <<'PY'
74
- import json
75
- from pathlib import Path
76
-
77
- eval_summary = json.loads(Path("outputs/qwen_05b_generalization_eval.json").read_text(encoding="utf-8"))
78
- for section_name, section in eval_summary["results"].items():
79
- print(section_name)
80
- print(" random:", section["random_summary"])
81
- print(" surrogate:", section["surrogate_summary"])
82
-
83
- runtime = json.loads(Path("outputs/qwen_05b_runtime_references.json").read_text(encoding="utf-8"))
84
- for task_id, task in runtime["results"].items():
85
- print(task_id)
86
- print(" torch:", task["torch"])
87
- print(" speedups:", task["speedups"])
88
- PY
89
-
90
- echo "[done] Qwen outputs in outputs/"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
scripts/run_random_baseline.py DELETED
@@ -1,187 +0,0 @@
1
- #!/usr/bin/env python3
2
- from __future__ import annotations
3
-
4
- import argparse
5
- import json
6
- from dataclasses import dataclass
7
- import sys
8
- from pathlib import Path
9
- from typing import Any, Dict, List
10
-
11
- import numpy as np
12
-
13
- ROOT = Path(__file__).resolve().parents[1]
14
- if str(ROOT) not in sys.path:
15
- sys.path.append(str(ROOT))
16
-
17
- from server.softmax_surrogate_environment import SoftmaxSurrogateEnvironment
18
-
19
-
20
- @dataclass
21
- class RunRecord:
22
- task_id: str
23
- episode: int
24
- best_latency_ms: float
25
- best_config: Dict[str, int]
26
- final_validation_mse: float
27
- final_state: Dict[str, Any]
28
- final_regret: float
29
- history: List[Dict[str, Any]]
30
-
31
-
32
- def _aggregate_metrics(episode_records: List[Dict[str, Any]], budget: int) -> Dict[str, Any]:
33
- ks = sorted(set(k for k in (1, 3, 5, budget) if k <= budget))
34
- regrets_by_k: Dict[int, List[float]] = {k: [] for k in ks}
35
- auc_regrets: List[float] = []
36
-
37
- for episode in episode_records:
38
- regrets = [float(step["regret"]) for step in episode["history"]]
39
- if regrets:
40
- auc_regrets.append(float(sum(regrets) / len(regrets)))
41
- for k in ks:
42
- if len(regrets) >= k:
43
- regrets_by_k[k].append(regrets[k - 1])
44
-
45
- return {
46
- "mean_regret_at": {
47
- str(k): float(sum(vals) / len(vals)) for k, vals in regrets_by_k.items() if vals
48
- },
49
- "median_regret_at": {
50
- str(k): float(np.median(np.asarray(vals, dtype=np.float32))) for k, vals in regrets_by_k.items() if vals
51
- },
52
- "mean_auc_regret": float(sum(auc_regrets) / len(auc_regrets)) if auc_regrets else None,
53
- "oracle_hit_rate_final": float(
54
- sum(1 for episode in episode_records if float(episode["final_regret"]) == 0.0) / len(episode_records)
55
- ) if episode_records else None,
56
- }
57
-
58
-
59
- def _pick_task_from_input(args: argparse.Namespace) -> str:
60
- if args.task:
61
- return args.task
62
- env = SoftmaxSurrogateEnvironment(
63
- measurement_path=args.measurement_path,
64
- budget=args.budget,
65
- seed=args.seed,
66
- )
67
- return env.reset()["observation"]["task_id"]
68
-
69
-
70
- def run_random_baseline(
71
- task: str,
72
- episodes: int,
73
- budget: int,
74
- seed: int,
75
- measurement_path: str,
76
- ) -> Dict[str, Any]:
77
- rng = np.random.default_rng(seed)
78
- best_overall: Dict[str, Any] = {"latency_ms": float("inf"), "config": None, "task_id": task}
79
- episode_records: List[Dict[str, Any]] = []
80
-
81
- env = SoftmaxSurrogateEnvironment(
82
- measurement_path=measurement_path,
83
- budget=budget,
84
- seed=seed,
85
- )
86
-
87
- for episode in range(episodes):
88
- env.reset(task=task, seed=seed + episode)
89
- done = False
90
- episode_best = float("inf")
91
- episode_best_cfg: Dict[str, int] | None = None
92
- history: List[Dict[str, Any]] = []
93
-
94
- while not done:
95
- unseen = [config_id for config_id in env.available_config_ids() if config_id not in env.seen_config_ids()]
96
- choice_pool = unseen if unseen else env.available_config_ids()
97
- config_id = int(rng.choice(choice_pool))
98
- step_out = env.step({"config_id": config_id})
99
- obs = step_out["observation"]
100
- trial = obs["last_trial"]
101
- history.append(
102
- {
103
- "config_id": config_id,
104
- "latency_ms": trial["latency_ms"],
105
- "config": trial["config"],
106
- "reward": step_out["reward"],
107
- "regret": step_out["info"]["current_regret"],
108
- "validation_mse": step_out["info"]["validation_mse"],
109
- }
110
- )
111
- if obs["best_so_far_ms"] < episode_best:
112
- episode_best = obs["best_so_far_ms"]
113
- best_id = env.seen_config_ids()[int(np.argmin([env.measured_latency_ms(cid) for cid in env.seen_config_ids()]))]
114
- episode_best_cfg = env.config_info(best_id)
115
- done = bool(step_out["done"])
116
-
117
- if episode_best < best_overall["latency_ms"]:
118
- best_overall = {
119
- "latency_ms": float(episode_best),
120
- "config": episode_best_cfg,
121
- "task_id": task,
122
- }
123
-
124
- diagnostics = env.diagnostics()
125
- episode_records.append(
126
- RunRecord(
127
- task_id=task,
128
- episode=episode,
129
- best_latency_ms=float(episode_best),
130
- best_config=episode_best_cfg or {},
131
- final_validation_mse=float(diagnostics["validation_mse"]),
132
- final_state=env.state(),
133
- final_regret=float(diagnostics["current_regret"]),
134
- history=history,
135
- ).__dict__
136
- )
137
-
138
- return {
139
- "task": task,
140
- "method": "random",
141
- "episodes": episodes,
142
- "budget": budget,
143
- "seed": seed,
144
- "oracle_best_ms": env.oracle_best()["median_ms"],
145
- "best_overall": best_overall,
146
- "aggregate_metrics": _aggregate_metrics(episode_records, budget),
147
- "episodes_summary": episode_records,
148
- }
149
-
150
-
151
- def parse_args() -> argparse.Namespace:
152
- parser = argparse.ArgumentParser(description="Random baseline for surrogate environment.")
153
- parser.add_argument("--task", default=None, help="Task ID (e.g., softmax_m4096_n2048)")
154
- parser.add_argument("--episodes", type=int, default=20)
155
- parser.add_argument("--budget", type=int, default=6)
156
- parser.add_argument("--seed", type=int, default=0)
157
- parser.add_argument(
158
- "--measurement-path",
159
- type=str,
160
- default="data/autotune_measurements.csv",
161
- )
162
- parser.add_argument(
163
- "--output",
164
- type=Path,
165
- default=Path("outputs/random_baseline.json"),
166
- )
167
- return parser.parse_args()
168
-
169
-
170
- def main() -> None:
171
- args = parse_args()
172
- task = _pick_task_from_input(args)
173
- summary = run_random_baseline(
174
- task=task,
175
- episodes=args.episodes,
176
- budget=args.budget,
177
- seed=args.seed,
178
- measurement_path=args.measurement_path,
179
- )
180
- args.output.parent.mkdir(parents=True, exist_ok=True)
181
- with args.output.open("w", encoding="utf-8") as f:
182
- json.dump(summary, f, indent=2)
183
- print(json.dumps(summary, indent=2))
184
-
185
-
186
- if __name__ == "__main__":
187
- main()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
scripts/run_surrogate_baseline.py DELETED
@@ -1,217 +0,0 @@
1
- #!/usr/bin/env python3
2
- from __future__ import annotations
3
-
4
- import argparse
5
- import json
6
- import sys
7
- from pathlib import Path
8
- from typing import Any, Dict, List
9
-
10
- import numpy as np
11
-
12
- ROOT = Path(__file__).resolve().parents[1]
13
- if str(ROOT) not in sys.path:
14
- sys.path.append(str(ROOT))
15
-
16
- from server.softmax_surrogate_environment import SoftmaxSurrogateEnvironment
17
-
18
-
19
- def _choose_surrogate_action(
20
- env: SoftmaxSurrogateEnvironment,
21
- acquisition: str,
22
- beta: float,
23
- xi: float,
24
- ) -> int:
25
- seen = set(env.seen_config_ids())
26
- best_config_id = -1
27
- best_score = float("-inf")
28
-
29
- for config_id in env.available_config_ids():
30
- if config_id in seen and len(seen) < len(env.available_config_ids()):
31
- continue
32
- score = env.acquisition_score(config_id, strategy=acquisition, beta=beta, xi=xi)
33
- if score > best_score:
34
- best_score = score
35
- best_config_id = config_id
36
-
37
- if best_config_id < 0:
38
- raise RuntimeError("Failed to choose a surrogate action.")
39
- return best_config_id
40
-
41
-
42
- def _aggregate_metrics(episode_records: List[Dict[str, Any]], budget: int) -> Dict[str, Any]:
43
- ks = sorted(set(k for k in (1, 3, 5, budget) if k <= budget))
44
- regrets_by_k: Dict[int, List[float]] = {k: [] for k in ks}
45
- auc_regrets: List[float] = []
46
-
47
- for episode in episode_records:
48
- regrets = [float(step["regret"]) for step in episode["history"]]
49
- if regrets:
50
- auc_regrets.append(float(sum(regrets) / len(regrets)))
51
- for k in ks:
52
- if len(regrets) >= k:
53
- regrets_by_k[k].append(regrets[k - 1])
54
-
55
- return {
56
- "mean_regret_at": {
57
- str(k): float(sum(vals) / len(vals)) for k, vals in regrets_by_k.items() if vals
58
- },
59
- "median_regret_at": {
60
- str(k): float(np.median(np.asarray(vals, dtype=np.float32))) for k, vals in regrets_by_k.items() if vals
61
- },
62
- "mean_auc_regret": float(sum(auc_regrets) / len(auc_regrets)) if auc_regrets else None,
63
- "oracle_hit_rate_final": float(
64
- sum(1 for episode in episode_records if float(episode["final_regret"]) == 0.0) / len(episode_records)
65
- ) if episode_records else None,
66
- }
67
-
68
-
69
- def run_surrogate_baseline(
70
- task: str,
71
- episodes: int,
72
- budget: int,
73
- seed: int,
74
- measurement_path: str,
75
- train_task_ids: List[str] | None = None,
76
- acquisition: str = "ucb",
77
- beta: float = 1.5,
78
- xi: float = 0.0,
79
- ) -> Dict[str, Any]:
80
- env = SoftmaxSurrogateEnvironment(
81
- measurement_path=measurement_path,
82
- budget=budget,
83
- seed=seed,
84
- train_task_ids=train_task_ids,
85
- )
86
-
87
- best_overall = {"latency_ms": float("inf"), "config": None, "task_id": task}
88
- episode_records: List[Dict[str, Any]] = []
89
-
90
- for episode in range(episodes):
91
- env.reset(task=task, seed=seed + episode)
92
- done = False
93
- episode_best = float("inf")
94
- episode_best_cfg: Dict[str, int] | None = None
95
- history: List[Dict[str, Any]] = []
96
-
97
- while not done:
98
- config_id = _choose_surrogate_action(env, acquisition=acquisition, beta=beta, xi=xi)
99
- out = env.step({"config_id": config_id})
100
- obs = out["observation"]
101
- trial = obs["last_trial"]
102
- history.append(
103
- {
104
- "config_id": config_id,
105
- "latency_ms": trial["latency_ms"],
106
- "config": trial["config"],
107
- "reward": out["reward"],
108
- "regret": out["info"]["current_regret"],
109
- "validation_mse": out["info"]["validation_mse"],
110
- }
111
- )
112
- if obs["best_so_far_ms"] < episode_best:
113
- episode_best = obs["best_so_far_ms"]
114
- best_seen = min(env.seen_config_ids(), key=env.measured_latency_ms)
115
- episode_best_cfg = env.config_info(best_seen)
116
- done = bool(out["done"])
117
-
118
- if episode_best < best_overall["latency_ms"]:
119
- best_overall = {
120
- "latency_ms": float(episode_best),
121
- "config": episode_best_cfg,
122
- "task_id": task,
123
- }
124
-
125
- diagnostics = env.diagnostics()
126
- episode_records.append(
127
- {
128
- "task_id": task,
129
- "episode": episode,
130
- "best_latency_ms": episode_best,
131
- "best_config": episode_best_cfg or {},
132
- "final_validation_mse": diagnostics["validation_mse"],
133
- "final_regret": diagnostics["current_regret"],
134
- "history": history,
135
- }
136
- )
137
-
138
- return {
139
- "task": task,
140
- "method": "surrogate",
141
- "episodes": episodes,
142
- "budget": budget,
143
- "seed": seed,
144
- "train_task_ids": list(train_task_ids or []),
145
- "acquisition": acquisition,
146
- "beta": beta,
147
- "xi": xi,
148
- "oracle_best_ms": env.oracle_best()["median_ms"],
149
- "best_overall": best_overall,
150
- "aggregate_metrics": _aggregate_metrics(episode_records, budget),
151
- "episodes_summary": episode_records,
152
- }
153
-
154
-
155
- def parse_args() -> argparse.Namespace:
156
- parser = argparse.ArgumentParser(description="Surrogate-guided baseline.")
157
- parser.add_argument("--task", default=None, help="Task ID (e.g., softmax_m4096_n2048)")
158
- parser.add_argument("--episodes", type=int, default=20)
159
- parser.add_argument("--budget", type=int, default=6)
160
- parser.add_argument("--seed", type=int, default=0)
161
- parser.add_argument(
162
- "--acquisition",
163
- type=str,
164
- choices=("mean", "ucb", "ei"),
165
- default="ucb",
166
- help="Candidate selection mode: mean, ucb, or ei.",
167
- )
168
- parser.add_argument("--beta", type=float, default=1.5, help="UCB exploration strength.")
169
- parser.add_argument("--xi", type=float, default=0.0, help="Expected-improvement margin.")
170
- parser.add_argument(
171
- "--train-tasks-file",
172
- type=Path,
173
- default=None,
174
- help="Optional JSON file containing a list of train task ids.",
175
- )
176
- parser.add_argument(
177
- "--measurement-path",
178
- type=str,
179
- default="data/autotune_measurements.csv",
180
- )
181
- parser.add_argument(
182
- "--output",
183
- type=Path,
184
- default=Path("outputs/surrogate_baseline.json"),
185
- )
186
- return parser.parse_args()
187
-
188
-
189
- def main() -> None:
190
- args = parse_args()
191
- if not args.task:
192
- env = SoftmaxSurrogateEnvironment(measurement_path=args.measurement_path, budget=args.budget, seed=args.seed)
193
- args.task = env.reset()["observation"]["task_id"]
194
-
195
- train_task_ids = None
196
- if args.train_tasks_file is not None:
197
- train_task_ids = json.loads(args.train_tasks_file.read_text(encoding="utf-8"))
198
-
199
- summary = run_surrogate_baseline(
200
- task=args.task,
201
- episodes=args.episodes,
202
- budget=args.budget,
203
- seed=args.seed,
204
- measurement_path=args.measurement_path,
205
- train_task_ids=train_task_ids,
206
- acquisition=args.acquisition,
207
- beta=args.beta,
208
- xi=args.xi,
209
- )
210
- args.output.parent.mkdir(parents=True, exist_ok=True)
211
- with args.output.open("w", encoding="utf-8") as f:
212
- json.dump(summary, f, indent=2)
213
- print(json.dumps(summary, indent=2))
214
-
215
-
216
- if __name__ == "__main__":
217
- main()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
scripts/smoke_test_client.py DELETED
@@ -1,28 +0,0 @@
1
- #!/usr/bin/env python3
2
- from __future__ import annotations
3
-
4
- import json
5
- import sys
6
- from pathlib import Path
7
-
8
- ROOT = Path(__file__).resolve().parents[1]
9
- if str(ROOT) not in sys.path:
10
- sys.path.append(str(ROOT))
11
-
12
- from client import SoftmaxSurrogateEnvClient
13
-
14
-
15
- def main() -> None:
16
- client = SoftmaxSurrogateEnvClient()
17
- reset_out = client.reset()
18
- step_out = client.step({"config_id": 0})
19
- summary = {"reset": reset_out, "step": step_out}
20
- out = Path("outputs/smoke_test_client.json")
21
- out.parent.mkdir(parents=True, exist_ok=True)
22
- with out.open("w", encoding="utf-8") as f:
23
- json.dump(summary, f, indent=2)
24
- print(json.dumps(summary, indent=2))
25
-
26
-
27
- if __name__ == "__main__":
28
- main()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
server/__init__.py DELETED
@@ -1 +0,0 @@
1
- # Package marker for importing server modules directly from workspace scripts.
 
 
server/app.py DELETED
@@ -1,72 +0,0 @@
1
- from __future__ import annotations
2
-
3
- import argparse
4
- import sys
5
- from pathlib import Path
6
- from typing import Any, Dict, Optional
7
-
8
- ROOT = Path(__file__).resolve().parents[1]
9
- if str(ROOT) not in sys.path:
10
- sys.path.append(str(ROOT))
11
-
12
- from fastapi import FastAPI
13
- from fastapi import HTTPException
14
- from pydantic import BaseModel
15
-
16
- from models import ResetResult, StepResult
17
- from server.softmax_surrogate_environment import SoftmaxSurrogateEnvironment
18
-
19
-
20
- app = FastAPI(title="Autotune Benchmark OpenEnv Server")
21
- env = SoftmaxSurrogateEnvironment()
22
-
23
-
24
- class ResetRequest(BaseModel):
25
- task: Optional[str] = None
26
- seed: Optional[int] = None
27
-
28
-
29
- class StepRequest(BaseModel):
30
- config_id: Optional[int] = None
31
- x: Optional[list[float]] = None
32
-
33
-
34
- @app.get("/health")
35
- def health() -> Dict[str, str]:
36
- return {"ok": "true"}
37
-
38
-
39
- @app.post("/reset")
40
- def reset(payload: ResetRequest) -> Dict[str, Any]:
41
- result = env.reset(task=payload.task, seed=payload.seed)
42
- return result
43
-
44
-
45
- @app.post("/step")
46
- def step(payload: StepRequest) -> Dict[str, Any]:
47
- if payload.config_id is not None:
48
- result = env.step({"config_id": payload.config_id})
49
- return result
50
- if payload.x is not None:
51
- result = env.step({"x": payload.x})
52
- return result
53
- raise HTTPException(status_code=400, detail="Missing config_id.")
54
- return result
55
-
56
-
57
- @app.get("/state")
58
- def state() -> Dict[str, Any]:
59
- return env.state()
60
-
61
-
62
- if __name__ == "__main__":
63
- parser = argparse.ArgumentParser(description="Run softmax surrogate environment server.")
64
- parser.add_argument("--host", default="0.0.0.0")
65
- parser.add_argument("--port", type=int, default=8000)
66
- args = parser.parse_args()
67
- try:
68
- import uvicorn
69
-
70
- uvicorn.run("app:app", host=args.host, port=args.port, reload=False)
71
- except Exception as err: # pragma: no cover
72
- raise RuntimeError("uvicorn not available") from err
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
server/softmax_surrogate_environment.py DELETED
@@ -1,570 +0,0 @@
1
- from __future__ import annotations
2
-
3
- import csv
4
- import json
5
- import math
6
- import random
7
- import sys
8
- from dataclasses import dataclass
9
- from pathlib import Path
10
- from typing import Any, Dict, List, Optional, Sequence, Tuple
11
-
12
- import numpy as np
13
-
14
- ROOT = Path(__file__).resolve().parents[1]
15
- if str(ROOT) not in sys.path:
16
- sys.path.append(str(ROOT))
17
-
18
- from scripts.collect_measurements import BLOCK_SIZES, NUM_STAGES, NUM_WARPS
19
-
20
-
21
- DEFAULT_MEASUREMENT_PATH = "data/autotune_measurements.csv"
22
- DEFAULT_BUDGET = 6
23
- INITIAL_DATASET_SIZE = 2
24
- DUPLICATE_PENALTY = -1e-4
25
- FAMILY_INDEX = {
26
- "softmax": 0,
27
- "layernorm": 1,
28
- "grouped_gemm": 2,
29
- "rmsnorm": 3,
30
- "gemm": 4,
31
- }
32
-
33
-
34
- @dataclass(frozen=True)
35
- class Measurement:
36
- family_group: str
37
- family: str
38
- task_id: str
39
- m: int
40
- n: int
41
- k: int
42
- config_id: int
43
- block_size: int
44
- num_warps: int
45
- num_stages: int
46
- shape_json: str
47
- config_json: str
48
- median_ms: float
49
- effective_gbps: float
50
- score: float
51
- validation_error: float
52
-
53
-
54
- def _normalize_discrete(values: Sequence[int], value: int) -> float:
55
- idx = list(values).index(int(value))
56
- if len(values) == 1:
57
- return 0.0
58
- return 2.0 * (idx / (len(values) - 1)) - 1.0
59
-
60
-
61
- class SoftmaxSurrogateEnvironment:
62
- """
63
- Generic discrete-action autotuning environment backed by measured GPU data.
64
-
65
- The class name is kept for compatibility with the existing local server and
66
- baseline scripts, but the task space is now multi-family.
67
- """
68
-
69
- def __init__(
70
- self,
71
- measurement_path: str = DEFAULT_MEASUREMENT_PATH,
72
- budget: int = DEFAULT_BUDGET,
73
- seed: int = 0,
74
- initial_samples: int = INITIAL_DATASET_SIZE,
75
- train_task_ids: Optional[Sequence[str]] = None,
76
- ) -> None:
77
- self.measurement_path = Path(measurement_path)
78
- self.budget = int(budget)
79
- self.seed = int(seed)
80
- self.initial_samples = max(1, int(initial_samples))
81
- self.train_task_ids = set(train_task_ids or [])
82
-
83
- self._measurements = self._load_measurements()
84
- self._task_ids = sorted(self._measurements.keys())
85
- if not self._task_ids:
86
- raise RuntimeError(
87
- "No measurement data found. Run the measurement collectors first."
88
- )
89
-
90
- self._rng = random.Random(self.seed)
91
- self._episode_counter = 0
92
-
93
- self._task_id: Optional[str] = None
94
- self._family: Optional[str] = None
95
- self._episode_id: Optional[str] = None
96
- self._task_rows: List[Measurement] = []
97
- self._prior_rows: List[Measurement] = []
98
- self._config_by_id: Dict[int, Measurement] = {}
99
- self._observed_ids: List[int] = []
100
- self._observed_id_set = set()
101
- self._observed_rows: List[Measurement] = []
102
- self._observed_latencies: List[float] = []
103
- self._steps_taken = 0
104
- self._steps_remaining = 0
105
- self._best_latency_ms = float("inf")
106
- self._best_config_id: Optional[int] = None
107
- self._validation_mse = float("inf")
108
-
109
- self._surrogate_version = 0
110
- self._surrogate_fitted_version = -1
111
- self._surrogate_x: Optional[np.ndarray] = None
112
- self._surrogate_y: Optional[np.ndarray] = None
113
- self._surrogate_alpha: Optional[np.ndarray] = None
114
- self._surrogate_k: Optional[np.ndarray] = None
115
- self._surrogate_length_scale: float = 0.5
116
-
117
- def reset(self, task: Optional[str] = None, seed: Optional[int] = None) -> Dict[str, Any]:
118
- if seed is not None:
119
- self._rng = random.Random(int(seed))
120
-
121
- if task is None:
122
- task = self._rng.choice(self._task_ids)
123
- if task not in self._measurements:
124
- raise ValueError(f"Unknown task: {task}")
125
-
126
- rows = self._measurements[task]
127
- self._task_id = task
128
- self._family = rows[0].family
129
- self._task_rows = rows
130
- self._config_by_id = {row.config_id: row for row in rows}
131
- self._prior_rows = self._build_prior_rows(task)
132
- self._observed_ids = []
133
- self._observed_id_set = set()
134
- self._observed_rows = []
135
- self._observed_latencies = []
136
- self._steps_taken = 0
137
- self._steps_remaining = self.budget
138
- self._best_latency_ms = float("inf")
139
- self._best_config_id = None
140
- self._episode_counter += 1
141
- self._episode_id = f"{task}:{self.seed}:{self._episode_counter}"
142
-
143
- sample_count = min(self.initial_samples, len(rows))
144
- for config_id in self._rng.sample(list(self._config_by_id.keys()), k=sample_count):
145
- self._observe_config(config_id)
146
-
147
- self._invalidate_surrogate()
148
- self._validation_mse = self._compute_validation_mse()
149
-
150
- return self._format_step_output(
151
- observation=self._observation_payload(kind="reset"),
152
- reward=0.0,
153
- done=False,
154
- info=self.diagnostics(),
155
- )
156
-
157
- def step(self, action: Any) -> Dict[str, Any]:
158
- if self._task_id is None:
159
- raise RuntimeError("Call reset() before step().")
160
- if self._steps_remaining <= 0:
161
- return self._format_step_output(
162
- observation=self._observation_payload(kind="done"),
163
- reward=0.0,
164
- done=True,
165
- info=self.diagnostics(),
166
- )
167
-
168
- config_id = self._extract_config_id(action)
169
- row = self._row_for_id(config_id)
170
- prev_best = self._best_latency_ms
171
- duplicate = config_id in self._observed_id_set
172
-
173
- if not duplicate:
174
- self._observe_config(config_id)
175
- self._surrogate_version += 1
176
-
177
- self._steps_taken += 1
178
- self._steps_remaining -= 1
179
- self._validation_mse = self._compute_validation_mse()
180
-
181
- reward = DUPLICATE_PENALTY if duplicate else max(0.0, math.log(prev_best) - math.log(self._best_latency_ms))
182
- observation = self._observation_payload(
183
- kind="step",
184
- last_trial={
185
- "config_id": config_id,
186
- "config": self.config_info(config_id),
187
- "latency_ms": row.median_ms,
188
- "score": row.score,
189
- "duplicate": duplicate,
190
- },
191
- )
192
- return self._format_step_output(
193
- observation=observation,
194
- reward=reward,
195
- done=self._steps_remaining <= 0,
196
- info=self.diagnostics(),
197
- )
198
-
199
- def state(self) -> Dict[str, Any]:
200
- if self._task_id is None:
201
- return {"status": "uninitialized"}
202
- return {
203
- "episode_id": self._episode_id,
204
- "step_count": self._steps_taken,
205
- "task_id": self._task_id,
206
- "family": self._family,
207
- "tried_config_ids": list(self._observed_ids),
208
- }
209
-
210
- def diagnostics(self) -> Dict[str, Any]:
211
- if self._task_id is None:
212
- return {"status": "uninitialized"}
213
- oracle_best_ms = self.oracle_best()["median_ms"]
214
- regret = self._best_latency_ms / oracle_best_ms - 1.0
215
- return {
216
- "validation_mse": self._validation_mse,
217
- "best_so_far_ms": self._best_latency_ms,
218
- "oracle_best_ms": oracle_best_ms,
219
- "current_regret": regret,
220
- "observed_count": len(self._observed_ids),
221
- "prior_count": len(self._prior_rows),
222
- }
223
-
224
- def available_tasks(self) -> List[str]:
225
- return list(self._task_ids)
226
-
227
- def available_config_ids(self) -> List[int]:
228
- if self._task_id is None:
229
- raise RuntimeError("Call reset() before accessing config ids.")
230
- return sorted(self._config_by_id.keys())
231
-
232
- def available_configs(self) -> List[Dict[str, Any]]:
233
- return [self.config_info(config_id) for config_id in self.available_config_ids()]
234
-
235
- def config_info(self, config_id: int) -> Dict[str, Any]:
236
- row = self._row_for_id(config_id)
237
- return {
238
- "config_id": int(config_id),
239
- "family": row.family,
240
- "task_id": row.task_id,
241
- "block_size": row.block_size,
242
- "num_warps": row.num_warps,
243
- "num_stages": row.num_stages,
244
- }
245
-
246
- def measured_latency_ms(self, config_id: int) -> float:
247
- return self._row_for_id(config_id).median_ms
248
-
249
- def oracle_best(self) -> Dict[str, Any]:
250
- if self._task_id is None:
251
- raise RuntimeError("Call reset() before querying oracle_best().")
252
- best = min(self._task_rows, key=lambda row: row.median_ms)
253
- return {
254
- "config_id": best.config_id,
255
- "family": best.family,
256
- "task_id": best.task_id,
257
- "block_size": best.block_size,
258
- "num_warps": best.num_warps,
259
- "num_stages": best.num_stages,
260
- "median_ms": best.median_ms,
261
- "score": best.score,
262
- }
263
-
264
- def predict_score(self, config_id: int) -> float:
265
- return float(self._predict_with_uncertainty(config_id)[0])
266
-
267
- def acquisition_score(
268
- self,
269
- config_id: int,
270
- strategy: str = "ucb",
271
- beta: float = 1.0,
272
- xi: float = 0.0,
273
- ) -> float:
274
- mean, sigma = self._predict_with_uncertainty(config_id)
275
- if strategy == "mean":
276
- return float(mean)
277
- if strategy == "ucb":
278
- return float(mean + float(beta) * sigma)
279
- if strategy == "ei":
280
- best_observed = max(row.score for row in self._observed_rows) if self._observed_rows else mean
281
- delta = mean - best_observed - float(xi)
282
- if sigma <= 0.0:
283
- return float(max(delta, 0.0))
284
- z = delta / sigma
285
- return float(max(delta * _normal_cdf(z) + sigma * _normal_pdf(z), 0.0))
286
- raise ValueError(f"Unknown acquisition strategy: {strategy}")
287
-
288
- def seen_config_ids(self) -> List[int]:
289
- return list(self._observed_ids)
290
-
291
- def _build_prior_rows(self, current_task: str) -> List[Measurement]:
292
- if not self.train_task_ids:
293
- return []
294
- prior_rows: List[Measurement] = []
295
- for task_id in sorted(self.train_task_ids):
296
- if task_id == current_task or task_id not in self._measurements:
297
- continue
298
- prior_rows.extend(self._measurements[task_id])
299
- return prior_rows
300
-
301
- def _predict_with_uncertainty(self, config_id: int) -> Tuple[float, float]:
302
- if not self._observed_rows and not self._prior_rows:
303
- raise RuntimeError("No surrogate data available.")
304
- self._fit_surrogate()
305
- if self._surrogate_x is None or self._surrogate_y is None:
306
- raise RuntimeError("Surrogate model unavailable.")
307
- if self._surrogate_x.shape[0] == 1:
308
- return float(self._surrogate_y[0]), 0.0
309
-
310
- cfg = _config_to_vector(self._row_for_id(config_id)).reshape(1, -1)
311
- if self._surrogate_k is None or self._surrogate_alpha is None:
312
- raise RuntimeError("Surrogate model unavailable.")
313
- k = _rbf_kernel(self._surrogate_x, cfg, self._surrogate_length_scale).reshape(-1)
314
- pred = float(k @ self._surrogate_alpha)
315
- solve = np.linalg.solve(self._surrogate_k, k)
316
- var = max(0.0, float(1.0 - k @ solve))
317
- return pred, float(math.sqrt(max(var, 1e-12)))
318
-
319
- def _fit_surrogate(self) -> None:
320
- if self._surrogate_fitted_version == self._surrogate_version:
321
- return
322
- rows = self._prior_rows + self._observed_rows
323
- if not rows:
324
- self._surrogate_x = None
325
- self._surrogate_y = None
326
- self._surrogate_alpha = None
327
- self._surrogate_k = None
328
- self._surrogate_fitted_version = self._surrogate_version
329
- return
330
-
331
- self._surrogate_x = np.array([_config_to_vector(row) for row in rows], dtype=np.float32)
332
- self._surrogate_y = np.array([row.score for row in rows], dtype=np.float32)
333
- if self._surrogate_x.shape[0] == 1:
334
- self._surrogate_alpha = self._surrogate_y.copy()
335
- self._surrogate_k = None
336
- self._surrogate_fitted_version = self._surrogate_version
337
- return
338
-
339
- pairwise = _pairwise_sq_dists(self._surrogate_x)
340
- triu = pairwise[np.triu_indices(self._surrogate_x.shape[0], k=1)]
341
- med_dist = float(np.median(np.sqrt(triu))) if triu.size else 0.5
342
- self._surrogate_length_scale = max(0.15, med_dist)
343
- k = _rbf_kernel(self._surrogate_x, self._surrogate_x, self._surrogate_length_scale)
344
- k[np.diag_indices_from(k)] += 1e-3
345
- self._surrogate_k = k
346
- self._surrogate_alpha = np.linalg.solve(k, self._surrogate_y)
347
- self._surrogate_fitted_version = self._surrogate_version
348
-
349
- def _compute_validation_mse(self) -> float:
350
- if not self._task_rows:
351
- return float("inf")
352
- preds = np.array(
353
- [self._predict_with_uncertainty(config_id)[0] for config_id in self.available_config_ids()],
354
- dtype=np.float32,
355
- )
356
- target = np.array([self._row_for_id(config_id).score for config_id in self.available_config_ids()], dtype=np.float32)
357
- return float(np.mean((preds - target) ** 2))
358
-
359
- def _observe_config(self, config_id: int) -> None:
360
- row = self._row_for_id(config_id)
361
- self._observed_ids.append(config_id)
362
- self._observed_id_set.add(config_id)
363
- self._observed_rows.append(row)
364
- self._observed_latencies.append(row.median_ms)
365
- if row.median_ms < self._best_latency_ms:
366
- self._best_latency_ms = row.median_ms
367
- self._best_config_id = config_id
368
-
369
- def _observation_payload(
370
- self,
371
- kind: str,
372
- last_trial: Optional[Dict[str, Any]] = None,
373
- ) -> Dict[str, Any]:
374
- payload = {
375
- "type": kind,
376
- "task_id": self._task_id,
377
- "family": self._family,
378
- "M": self._task_rows[0].m if self._task_rows else None,
379
- "N": self._task_rows[0].n if self._task_rows else None,
380
- "dtype": "fp16",
381
- "tried_config_ids": list(self._observed_ids),
382
- "tried_latencies_ms": list(self._observed_latencies),
383
- "best_so_far_ms": self._best_latency_ms,
384
- "steps_remaining": self._steps_remaining,
385
- }
386
- if last_trial is not None:
387
- payload["last_trial"] = last_trial
388
- return payload
389
-
390
- def _extract_config_id(self, action: Any) -> int:
391
- if isinstance(action, (str, bytes)):
392
- action = json.loads(action)
393
- if isinstance(action, dict):
394
- if "config_id" in action:
395
- return int(action["config_id"])
396
- if "x" in action:
397
- normalized = self._extract_legacy_action(action["x"])
398
- config = self._map_legacy_action_to_config(normalized)
399
- return config
400
- if isinstance(action, (int, np.integer)):
401
- return int(action)
402
- raise TypeError("Action must be an int config_id or dict with config_id.")
403
-
404
- def _extract_legacy_action(self, action: Any) -> List[float]:
405
- arr = np.clip(np.asarray(action, dtype=float), -1.0, 1.0)
406
- if arr.shape != (3,):
407
- raise ValueError("Legacy action vector must have 3 values.")
408
- return arr.tolist()
409
-
410
- def _map_legacy_action_to_config(self, action: Sequence[float]) -> int:
411
- base = (
412
- _de_norm(float(action[0]), BLOCK_SIZES),
413
- _de_norm(float(action[1]), NUM_WARPS),
414
- _de_norm(float(action[2]), NUM_STAGES),
415
- )
416
- best_id = min(
417
- self.available_config_ids(),
418
- key=lambda config_id: (
419
- self._row_for_id(config_id).block_size - base[0]
420
- ) ** 2
421
- + (self._row_for_id(config_id).num_warps - base[1]) ** 2
422
- + (self._row_for_id(config_id).num_stages - base[2]) ** 2,
423
- )
424
- return int(best_id)
425
-
426
- def _row_for_id(self, config_id: int) -> Measurement:
427
- if config_id not in self._config_by_id:
428
- raise ValueError(f"Unknown config_id={config_id}")
429
- return self._config_by_id[int(config_id)]
430
-
431
- def _invalidate_surrogate(self) -> None:
432
- self._surrogate_version += 1
433
- self._surrogate_fitted_version = -1
434
- self._surrogate_x = None
435
- self._surrogate_y = None
436
- self._surrogate_alpha = None
437
- self._surrogate_k = None
438
-
439
- def _format_step_output(
440
- self,
441
- observation: Dict[str, Any],
442
- reward: float,
443
- done: bool,
444
- info: Optional[Dict[str, Any]] = None,
445
- ) -> Dict[str, Any]:
446
- return {
447
- "observation": observation,
448
- "reward": float(reward),
449
- "done": bool(done),
450
- "state": self.state(),
451
- "info": info or {},
452
- }
453
-
454
- def _load_measurements(self) -> Dict[str, List[Measurement]]:
455
- if not self.measurement_path.exists():
456
- raise FileNotFoundError(
457
- f"Missing measurement file at {self.measurement_path}. "
458
- "Run the measurement collectors first."
459
- )
460
-
461
- grouped: Dict[str, List[Measurement]] = {}
462
- with self.measurement_path.open("r", newline="", encoding="utf-8") as handle:
463
- reader = csv.DictReader(handle)
464
- fallback_config_ids: Dict[str, int] = {}
465
- for row in reader:
466
- family = row.get("family", "softmax")
467
- family_group = row.get("family_group", "A" if family in {"softmax", "layernorm"} else "B")
468
- task_id = row["task_id"]
469
- block_size = int(row["block_size"])
470
- num_warps = int(row["num_warps"])
471
- num_stages = int(row["num_stages"])
472
- config_id_raw = row.get("config_id")
473
- if config_id_raw in (None, ""):
474
- key = f"{task_id}|{block_size}|{num_warps}|{num_stages}"
475
- if key not in fallback_config_ids:
476
- fallback_config_ids[key] = len([k for k in fallback_config_ids if k.startswith(f"{task_id}|")])
477
- config_id = fallback_config_ids[key]
478
- else:
479
- config_id = int(config_id_raw)
480
-
481
- measurement = Measurement(
482
- family_group=family_group,
483
- family=family,
484
- task_id=task_id,
485
- m=int(row["m"]),
486
- n=int(row["n"]),
487
- k=int(row.get("k", 0) or 0),
488
- config_id=config_id,
489
- block_size=block_size,
490
- num_warps=num_warps,
491
- num_stages=num_stages,
492
- shape_json=row.get("shape_json", "{}"),
493
- config_json=row.get("config_json", "{}"),
494
- median_ms=float(row["median_ms"]),
495
- effective_gbps=float(row["effective_gbps"]),
496
- score=float(row["score"]),
497
- validation_error=float(row["validation_error"]),
498
- )
499
- grouped.setdefault(task_id, []).append(measurement)
500
-
501
- for task_id in grouped:
502
- grouped[task_id].sort(key=lambda row: row.config_id)
503
- return grouped
504
-
505
-
506
- def _config_to_vector(row: Measurement) -> np.ndarray:
507
- family_vec = np.zeros(len(FAMILY_INDEX), dtype=np.float32)
508
- if row.family in FAMILY_INDEX:
509
- family_vec[FAMILY_INDEX[row.family]] = 1.0
510
- shape_fields = {}
511
- try:
512
- shape_fields = json.loads(row.shape_json) if row.shape_json else {}
513
- except Exception:
514
- shape_fields = {}
515
- mode_val = _shape_scalar(shape_fields.get("mode_id"))
516
- role_val = _shape_scalar(shape_fields.get("role_id"))
517
- seq_val = _shape_scalar(shape_fields.get("seq_len"))
518
- ctx_val = _shape_scalar(shape_fields.get("ctx_len"))
519
- group_val = _shape_scalar(shape_fields.get("group_count"))
520
- numeric = np.array(
521
- [
522
- math.log2(max(row.m, 1)) / 16.0,
523
- math.log2(max(row.n, 1)) / 16.0,
524
- math.log2(max(row.k, 1)) / 16.0 if row.k > 0 else 0.0,
525
- math.log2(max(seq_val, 1.0)) / 16.0 if seq_val > 0 else 0.0,
526
- math.log2(max(ctx_val, 1.0)) / 16.0 if ctx_val > 0 else 0.0,
527
- math.log2(max(group_val, 1.0)) / 8.0 if group_val > 0 else 0.0,
528
- mode_val / 8.0,
529
- role_val / 16.0,
530
- _normalize_discrete(BLOCK_SIZES, row.block_size),
531
- _normalize_discrete(NUM_WARPS, row.num_warps),
532
- _normalize_discrete(NUM_STAGES, row.num_stages),
533
- ],
534
- dtype=np.float32,
535
- )
536
- return np.concatenate([family_vec, numeric], axis=0)
537
-
538
-
539
- def _pairwise_sq_dists(X: np.ndarray) -> np.ndarray:
540
- diff = X[:, None, :] - X[None, :, :]
541
- return np.sum(diff * diff, axis=2)
542
-
543
-
544
- def _rbf_kernel(X: np.ndarray, Y: np.ndarray, length_scale: float) -> np.ndarray:
545
- sigma2 = float(length_scale * length_scale)
546
- if sigma2 <= 0:
547
- sigma2 = 1e-6
548
- xy = X @ Y.T
549
- x2 = np.sum(X * X, axis=1)[:, None]
550
- y2 = np.sum(Y * Y, axis=1)[None, :]
551
- d2 = np.maximum(x2 - 2.0 * xy + y2, 0.0)
552
- return np.exp(-0.5 * d2 / sigma2).astype(np.float32)
553
-
554
-
555
- def _normal_pdf(z: float) -> float:
556
- inv_sqrt_2pi = 1.0 / math.sqrt(2.0 * math.pi)
557
- return float(inv_sqrt_2pi * math.exp(-0.5 * z * z))
558
-
559
-
560
- def _normal_cdf(z: float) -> float:
561
- return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
562
-
563
-
564
- def _shape_scalar(value: Any) -> float:
565
- if value is None:
566
- return 0.0
567
- try:
568
- return float(value)
569
- except (TypeError, ValueError):
570
- return 0.0