Final Kernel versions

by Pramodith - opened May 8

base: refs/heads/main ← from: refs/pr/1

-90968

This view is limited to 50 files because it contains too many changes.

Files changed (50)

README.md +3 -326
benchmark_results/bnpo_loss_compiled/bnpo_loss_compiled_dark_animation.svg +0 -123
benchmark_results/bnpo_loss_compiled/bnpo_loss_compiled_dark_latency.svg +0 -0
benchmark_results/bnpo_loss_compiled/bnpo_loss_compiled_dark_throughput.svg +0 -0
benchmark_results/bnpo_loss_compiled/bnpo_loss_compiled_light_animation.svg +0 -123
benchmark_results/bnpo_loss_compiled/bnpo_loss_compiled_light_latency.svg +0 -0
benchmark_results/bnpo_loss_compiled/bnpo_loss_compiled_light_throughput.svg +0 -0
benchmark_results/bnpo_loss_compiled/results.json +0 -206
benchmark_results/bnpo_loss_eager/bnpo_loss_eager_dark_animation.svg +0 -123
benchmark_results/bnpo_loss_eager/bnpo_loss_eager_dark_latency.svg +0 -0
benchmark_results/bnpo_loss_eager/bnpo_loss_eager_dark_throughput.svg +0 -0
benchmark_results/bnpo_loss_eager/bnpo_loss_eager_light_animation.svg +0 -123
benchmark_results/bnpo_loss_eager/bnpo_loss_eager_light_latency.svg +0 -0
benchmark_results/bnpo_loss_eager/bnpo_loss_eager_light_throughput.svg +0 -0
benchmark_results/bnpo_loss_eager/results.json +0 -206
benchmark_results/grpo_loss_compiled/grpo_loss_compiled_dark_animation.svg +0 -105
benchmark_results/grpo_loss_compiled/grpo_loss_compiled_dark_latency.svg +0 -0
benchmark_results/grpo_loss_compiled/grpo_loss_compiled_dark_throughput.svg +0 -0
benchmark_results/grpo_loss_compiled/grpo_loss_compiled_light_animation.svg +0 -105
benchmark_results/grpo_loss_compiled/grpo_loss_compiled_light_latency.svg +0 -0
benchmark_results/grpo_loss_compiled/grpo_loss_compiled_light_throughput.svg +0 -0
benchmark_results/grpo_loss_compiled/results.json +0 -174
benchmark_results/grpo_loss_eager/grpo_loss_eager_dark_animation.svg +0 -105
benchmark_results/grpo_loss_eager/grpo_loss_eager_dark_latency.svg +0 -0
benchmark_results/grpo_loss_eager/grpo_loss_eager_dark_throughput.svg +0 -0
benchmark_results/grpo_loss_eager/grpo_loss_eager_light_animation.svg +0 -105
benchmark_results/grpo_loss_eager/grpo_loss_eager_light_latency.svg +0 -0
benchmark_results/grpo_loss_eager/grpo_loss_eager_light_throughput.svg +0 -0
benchmark_results/grpo_loss_eager/results.json +0 -174
benchmark_results/reverse_kl_compiled/results.json +0 -206
benchmark_results/reverse_kl_compiled/reverse_kl_compiled_dark_animation.svg +0 -123
benchmark_results/reverse_kl_compiled/reverse_kl_compiled_dark_latency.svg +0 -0
benchmark_results/reverse_kl_compiled/reverse_kl_compiled_dark_throughput.svg +0 -0
benchmark_results/reverse_kl_compiled/reverse_kl_compiled_light_animation.svg +0 -123
benchmark_results/reverse_kl_compiled/reverse_kl_compiled_light_latency.svg +0 -0
benchmark_results/reverse_kl_compiled/reverse_kl_compiled_light_throughput.svg +0 -0
benchmark_results/reverse_kl_eager/results.json +0 -206
benchmark_results/reverse_kl_eager/reverse_kl_eager_dark_animation.svg +0 -123
benchmark_results/reverse_kl_eager/reverse_kl_eager_dark_latency.svg +0 -0
benchmark_results/reverse_kl_eager/reverse_kl_eager_dark_throughput.svg +0 -0
benchmark_results/reverse_kl_eager/reverse_kl_eager_light_animation.svg +0 -123
benchmark_results/reverse_kl_eager/reverse_kl_eager_light_latency.svg +0 -0
benchmark_results/reverse_kl_eager/reverse_kl_eager_light_throughput.svg +0 -0
build/torch-cuda/__init__.py +0 -69
build/torch-cuda/_ops.py +0 -38
build/torch-cuda/bnpo_loss/__init__.py +0 -196
build/torch-cuda/bnpo_loss/_torch_ref.py +0 -56
build/torch-cuda/bnpo_loss/autograd.py +0 -149
build/torch-cuda/bnpo_loss/cute_bnpo_loss.py +0 -1081
build/torch-cuda/geometric_ai_kernels/__init__.py +0 -26

README.md CHANGED

@@ -1,326 +1,3 @@
----
-library_name: kernels
-license: apache-2.0
-tags:
-- cuda
-- cutlass
-- cute-dsl
-- rl
-- distillation
-- trl
-- grpo
-- bnpo
-- kl-divergence
----
-# Geometric-AI Kernels
-Fused **CuteDSL** kernels for the loss functions that dominate post-training
-workloads: PPO-family policy losses (BNPO, GRPO) and reverse-KL
-self-distillation.
-Each kernel ships a **single-launch fused forward +
-backward** path that returns `(loss, grad_logprobs)` directly. No `torch.autograd.Function` wrapper, no extra `grad_output * dpolicy` backward
-kernel, and no host-side syncs in the hot path.
-Background and benchmarks: see the
-[release post](https://geometric.so/blog/2026/05/08/hf-kernel-hub).
-- **Backend**: CUDA (NVIDIA CUTLASS DSL).
-- **Min GPU**: SM80 (Ampere) - required by `nvidia-cutlass-dsl`. Tested on H100 (SM90). Should work on SM80 (Ampere), SM86 (RTX 3090, A40), SM89 (RTX 4090, L40S), SM90a (H100 SXM), and SM100 (Blackwell B200/GB200).
-- **Min CUDA**: 12.8.
-- **Dtypes**: `float32`, `float16`, `bfloat16`.
-- **Dynamic shapes**: a single compile handles arbitrary batch size and
-  sequence length, no recompiles when shapes change between calls (common
-  in post-training rollouts).
-## Kernels
-| Kernel family | Direct (no autograd) | Autograd-aware | Forward-only |
-| --- | --- | --- | --- |
-| BNPO loss | `bnpo_loss` | `bnpo_loss_autograd` | `bnpo_loss_fwd` |
-| GRPO loss | `grpo_loss` | `grpo_loss_autograd` | `grpo_loss_fwd` |
-| Reverse KL | `reverse_kl` | `reverse_kl_autograd` | `reverse_kl_fwd` |
-### Entry points
-Each kernel family exposes three entry points with the same underlying CuteDSL kernel:
-- **`<name>(...)`** - fused fwd+bwd, returns `(loss, grad)` from one `@cute.jit`
-  dispatch. Lowest-overhead path; the caller chains the gradient into the upstream
-  model with `policy_logprobs.backward(grad)`. Use this in custom training loops
-  where you control gradient flow.
-- **`<name>_autograd(...)`** - same kernel, registered via
-  `torch.library.custom_op` + `register_autograd`. `loss.backward()` works
-  and composes with `torch.compile(fullgraph=True)`. There is a noticeable
-  per-call dispatcher overhead vs. the direct path.
-- **`<name>_fwd(...)`** - forward-only, returns scalar `loss` and skips
-  the gradient buffer entirely. Use for inference / validation /
-  reward-model scoring.
-## Loading the kernels
-```
-pip install apache-tvm-ffi nvidia-cutlass-dsl
-```
-```python
-from kernels import get_kernel
-km = get_kernel("Geometric-AI/geometric-ai-kernels", version=0)
-```
----
-## BNPO Loss
-**Batch-Normalized Policy Optimization** sums per-token policy and KL terms
-across the **entire batch** and divides by the global valid-token count:
-```
-loss = ((per_token_loss + β·kl) · mask).sum() / max(mask.sum(), 1)
-```
-where `per_token_loss` is the PPO-clipped ratio loss:
-```
-ratio      = exp(policy_logprobs - old_policy_logprobs)
-clipped    = clip(ratio, 1−ε, 1+ε_high)
-per_token  = −advantages · min(ratio, clipped)
-kl         = exp(ref_logprobs − policy_logprobs) − (ref_logprobs − policy_logprobs) − 1
-```
-The global denominator is computed entirely on-GPU via cross-CTA atomics -
-no host-side `mask.sum()` sync. When `beta=0` the KL branch is dead-coded
-at compile time.
-**Inputs**:
-- `policy_logprobs`, `old_policy_logprobs`, `ref_logprobs`: `(bs, seq_len)`, fp32/fp16/bf16
-- `advantages`: `(bs,)`
-- `completions_mask`: `(bs, seq_len)`, bool or int8
-**Returns**: `(loss, grad_policy_logprobs)` from `bnpo_loss`; scalar `loss` from `bnpo_loss_fwd`.
-```python
-import torch
-from kernels import get_kernel
-km = get_kernel("Geometric-AI/geometric-ai-kernels", version=0)
-device = torch.device("cuda")
-bs, seq_len = 16, 1024
-policy_logprobs     = torch.randn(bs, seq_len, dtype=torch.bfloat16, device=device, requires_grad=True)
-old_policy_logprobs = torch.randn(bs, seq_len, dtype=torch.bfloat16, device=device)
-ref_logprobs        = torch.randn(bs, seq_len, dtype=torch.bfloat16, device=device)
-advantages          = torch.randn(bs, dtype=torch.bfloat16, device=device)
-completions_mask    = (torch.rand(bs, seq_len, device=device) > 0.2).to(torch.int8)
-# 1) Direct (loss, grad) - lowest overhead training path
-loss, grad = km.bnpo_loss(
-    policy_logprobs, old_policy_logprobs, ref_logprobs,
-    advantages, completions_mask,
-    epsilon=0.2, epsilon_high=0.2, beta=0.1,
-)
-policy_logprobs.backward(grad)
-# 2) Autograd-aware - works with loss.backward() and torch.compile
-loss = km.bnpo_loss_autograd(
-    policy_logprobs.requires_grad_(),
-    old_policy_logprobs, ref_logprobs,
-    advantages, completions_mask,
-    epsilon=0.2, epsilon_high=0.2, beta=0.1,
-)
-loss.backward()
-# 3) Forward-only - inference / reward scoring, no gradient buffer
-loss = km.bnpo_loss_fwd(
-    policy_logprobs, old_policy_logprobs, ref_logprobs,
-    advantages, completions_mask,
-    epsilon=0.2, epsilon_high=0.2, beta=0.1,
-)
-```
----
-## GRPO Loss
-**Group Relative Policy Optimization** implements TRL's default
-**per-response normalization** variant - each response is normalized by its
-own valid-token count before averaging across the batch:
-```
-loss = mean_r( ((per_token_loss + β·kl) · mask).sum(-1) / max(mask.sum(-1), 1) )
-```
-`per_token_loss` and `kl` are the same clipped-ratio and KL expressions as BNPO.
-`completions_mask` is **required** because the per-response denominator is
-mask-derived. The kernel uses one CTA per row so the per-row mask sum is
-reduced inside the block - no cross-CTA atomics on the scaling pass.
-**Inputs**:
-- `policy_logprobs`, `old_policy_logprobs`, `ref_logprobs`: `(bs, seq_len)`, fp32/fp16/bf16
-- `advantages`: `(bs,)`
-- `completions_mask`: `(bs, seq_len)`, bool or int8 - **required**
-**Returns**: `(loss, grad_policy_logprobs)` from `grpo_loss`; scalar `loss` from `grpo_loss_fwd`.
-```python
-import torch
-from kernels import get_kernel
-km = get_kernel("Geometric-AI/geometric-ai-kernels", version=0)
-device = torch.device("cuda")
-bs, seq_len = 16, 1024
-policy_logprobs     = torch.randn(bs, seq_len, dtype=torch.bfloat16, device=device, requires_grad=True)
-old_policy_logprobs = torch.randn(bs, seq_len, dtype=torch.bfloat16, device=device)
-ref_logprobs        = torch.randn(bs, seq_len, dtype=torch.bfloat16, device=device)
-advantages          = torch.randn(bs, dtype=torch.bfloat16, device=device)
-completions_mask    = (torch.rand(bs, seq_len, device=device) > 0.2).to(torch.int8)
-# 1) Direct (loss, grad) - lowest overhead training path
-loss, grad = km.grpo_loss(
-    policy_logprobs, old_policy_logprobs, ref_logprobs,
-    advantages, completions_mask,
-    epsilon=0.2, epsilon_high=0.2, beta=0.1,
-)
-policy_logprobs.backward(grad)
-# 2) Autograd-aware - works with loss.backward() and torch.compile
-loss = km.grpo_loss_autograd(
-    policy_logprobs.requires_grad_(),
-    old_policy_logprobs, ref_logprobs,
-    advantages, completions_mask,
-    epsilon=0.2, epsilon_high=0.2, beta=0.1,
-)
-loss.backward()
-# 3) Forward-only - inference / reward scoring, no gradient buffer
-loss = km.grpo_loss_fwd(
-    policy_logprobs, old_policy_logprobs, ref_logprobs,
-    advantages, completions_mask,
-    epsilon=0.2, epsilon_high=0.2, beta=0.1,
-)
-```
----
-## Reverse KL
-**Reverse-KL self-distillation** computes `KL(student ‖ teacher)` over a
-`(num_tokens, vocab)` slab using an online normalization algorithm that reads
-each logit row exactly once on the forward-only path:
-```
-p = softmax(student_logits)
-q = softmax(teacher_logits)
-kl_per_row = Σ_v  p_v · (log p_v − log q_v)
-loss = (mask · kl_per_row).sum() / mask.sum()
-```
-The gradient through the softmax Jacobian is analytical:
-```
-grad_student_v = scale · p_v · (log p_v − log q_v − kl_per_row)
-```
-where `scale = mask[r] · inv_n_valid`.
-**Inputs**:
-- `student_logits`, `teacher_logits`: `(*, V)` - arbitrary leading dims (typically `(bs, seq_len, vocab)`); both must share shape and dtype
-- `completions_mask`: shape matching `student_logits.shape[:-1]`
-> ⚠️ **Fully-masked batches**: `inv_n_valid = 1 / mask.sum()` is not clamped, so a batch where every token is masked produces inf/NaN. Guard upstream if that case is reachable.
-**Returns**: `(loss, grad_student_logits)` from `reverse_kl`; scalar `loss` from `reverse_kl_fwd`.
-```python
-import torch
-from kernels import get_kernel
-km = get_kernel("Geometric-AI/geometric-ai-kernels", version=0)
-device = torch.device("cuda")
-# Qwen3.5-style vocab; arbitrary leading dims supported
-bs, seq_len, vocab = 4, 256, 248320
-student_logits  = torch.randn(bs, seq_len, vocab, dtype=torch.bfloat16, device=device, requires_grad=True)
-teacher_logits  = torch.randn(bs, seq_len, vocab, dtype=torch.bfloat16, device=device)
-completions_mask = (torch.rand(bs, seq_len, device=device) > 0.2)
-# 1) Direct (loss, grad) - lowest overhead training path
-loss, grad = km.reverse_kl(student_logits, teacher_logits, completions_mask)
-student_logits.backward(grad)
-# 2) Autograd-aware - works with loss.backward() and torch.compile
-loss = km.reverse_kl_autograd(
-    student_logits.requires_grad_(), teacher_logits, completions_mask
-)
-loss.backward()
-# 3) Forward-only - inference / KL monitoring, no gradient buffer
-loss = km.reverse_kl_fwd(student_logits, teacher_logits, completions_mask)
-```
----
-## Performance
-All numbers are geometric-mean speedups measured on an H100 SXM (SM90a). Full methodology
-and per-shape plots in the [release post](https://geometric.so/blog/2026/05/08/hf-kernel-hub).
-### `kernels` CLI benchmark
-Timed with `time.perf_counter` + `cuda.synchronize()`, mean over 100 iterations.
-| Kernel | vs eager | vs `torch.compile` |
-| --- | --- | --- |
-| `grpo_loss_fwd` | 5.68×  | 2.45× |
-| `grpo_loss`     | 20.79× | 1.98× |
-| `bnpo_loss_fwd` | 5.29×  | 2.52× |
-| `bnpo_loss`     | 16.81× | 2.27× |
-| `reverse_kl_fwd`| 6.88×  | 2.45× |
-| `reverse_kl`    | 7.03×  | 2.61× |
----
-## Benchmark animations
-### BNPO Loss vs eager PyTorch
-<picture>
-  <source media="(prefers-color-scheme: dark)" srcset="benchmark_results/bnpo_loss_eager/bnpo_loss_eager_dark_animation.svg">
-  <img width="90%" src="benchmark_results/bnpo_loss_eager/bnpo_loss_eager_light_animation.svg" alt="BNPO loss latency vs eager PyTorch">
-</picture>
-### BNPO Loss vs torch.compile
-<picture>
-  <source media="(prefers-color-scheme: dark)" srcset="benchmark_results/bnpo_loss_compiled/bnpo_loss_compiled_dark_animation.svg">
-  <img width="90%" src="benchmark_results/bnpo_loss_compiled/bnpo_loss_compiled_light_animation.svg" alt="BNPO loss latency vs torch.compile">
-</picture>
-### GRPO Loss vs eager PyTorch
-<picture>
-  <source media="(prefers-color-scheme: dark)" srcset="benchmark_results/grpo_loss_eager/grpo_loss_eager_dark_animation.svg">
-  <img width="90%" src="benchmark_results/grpo_loss_eager/grpo_loss_eager_light_animation.svg" alt="GRPO loss latency vs eager PyTorch">
-</picture>
-### GRPO Loss vs torch.compile
-<picture>
-  <source media="(prefers-color-scheme: dark)" srcset="benchmark_results/grpo_loss_compiled/grpo_loss_compiled_dark_animation.svg">
-  <img width="90%" src="benchmark_results/grpo_loss_compiled/grpo_loss_compiled_light_animation.svg" alt="GRPO loss latency vs torch.compile">
-</picture>
-### Reverse KL vs eager PyTorch
-<picture>
-  <source media="(prefers-color-scheme: dark)" srcset="benchmark_results/reverse_kl_eager/reverse_kl_eager_dark_animation.svg">
-  <img width="90%" src="benchmark_results/reverse_kl_eager/reverse_kl_eager_light_animation.svg" alt="Reverse KL latency vs eager PyTorch">
-</picture>
-### Reverse KL vs torch.compile
-<picture>
-  <source media="(prefers-color-scheme: dark)" srcset="benchmark_results/reverse_kl_compiled/reverse_kl_compiled_dark_animation.svg">
-  <img width="90%" src="benchmark_results/reverse_kl_compiled/reverse_kl_compiled_light_animation.svg" alt="Reverse KL latency vs torch.compile">
-</picture>

+---
+license: apache-2.0
+---
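
For readers comparing the two normalizations described in the removed README, here is a minimal pure-Python sketch of the BNPO and GRPO reductions. It is not the CuteDSL kernel. It transcribes the README's per-token expressions literally, and the function names (`per_token_term`, `bnpo_loss_ref`, `grpo_loss_ref`) are hypothetical, introduced only for this illustration. The README writes the clipped term as `−advantages · min(ratio, clipped)`, whereas PPO-style objectives are usually stated with the minimum taken after multiplying by the advantage. The two forms differ for negative advantages, and the sketch follows the README as written.

```python
import math


def per_token_term(policy, old, ref, advantage, eps, eps_high, beta):
    """Policy loss plus beta * KL for one token, as written in the README."""
    ratio = math.exp(policy - old)
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps_high)
    per_token = -advantage * min(ratio, clipped)
    delta = ref - policy
    kl = math.exp(delta) - delta - 1.0
    return per_token + beta * kl


def _row_sums(policy, old, ref, advantages, mask, eps, eps_high, beta):
    """Return a list of (masked_sum, valid_token_count), one pair per response."""
    rows = []
    for p_row, o_row, r_row, adv, m_row in zip(policy, old, ref, advantages, mask):
        total, count = 0.0, 0
        for p, o, r, m in zip(p_row, o_row, r_row, m_row):
            if m:  # masked-out tokens contribute nothing
                total += per_token_term(p, o, r, adv, eps, eps_high, beta)
                count += 1
        rows.append((total, count))
    return rows


def bnpo_loss_ref(policy, old, ref, advantages, mask, eps=0.2, eps_high=0.2, beta=0.1):
    """BNPO: one global sum divided by the global valid-token count."""
    rows = _row_sums(policy, old, ref, advantages, mask, eps, eps_high, beta)
    return sum(t for t, _ in rows) / max(sum(c for _, c in rows), 1)


def grpo_loss_ref(policy, old, ref, advantages, mask, eps=0.2, eps_high=0.2, beta=0.1):
    """GRPO: normalize each response by its own token count, then average."""
    rows = _row_sums(policy, old, ref, advantages, mask, eps, eps_high, beta)
    return sum(t / max(c, 1) for t, c in rows) / len(rows)


# Two responses of different valid length; all log-probs are equal, so
# ratio == 1 and kl == 0, which isolates the effect of the normalization.
zeros = [[0.0] * 4, [0.0] * 4]
advantages = [1.0, -1.0]
mask = [[1, 1, 1, 1], [1, 0, 0, 0]]

bnpo = bnpo_loss_ref(zeros, zeros, zeros, advantages, mask)
grpo = grpo_loss_ref(zeros, zeros, zeros, advantages, mask)
print(bnpo, grpo)
```

In this toy input the long response dominates the BNPO result because every valid token carries equal weight, while GRPO weights the two responses equally regardless of length.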

benchmark_results/bnpo_loss_compiled/bnpo_loss_compiled_dark_animation.svg DELETED

benchmark_results/bnpo_loss_compiled/bnpo_loss_compiled_dark_latency.svg DELETED

benchmark_results/bnpo_loss_compiled/bnpo_loss_compiled_dark_throughput.svg DELETED

benchmark_results/bnpo_loss_compiled/bnpo_loss_compiled_light_animation.svg DELETED

benchmark_results/bnpo_loss_compiled/bnpo_loss_compiled_light_latency.svg DELETED

benchmark_results/bnpo_loss_compiled/bnpo_loss_compiled_light_throughput.svg DELETED

benchmark_results/bnpo_loss_compiled/results.json DELETED

@@ -1,206 +0,0 @@
-{
-  "results": [
-    {
-      "workload": "bnpoLossBenchmark.bnpo_loss_batch128_seqlen02781_compiled",
-      "timingResults": {
-        "mean_ms": 0.0359,
-        "std_ms": 0.0038,
-        "min_ms": 0.0332,
-        "max_ms": 0.0701,
-        "q1_ms": 0.0344,
-        "q3_ms": 0.0357,
-        "iqr_ms": 0.0013,
-        "outliers": 20,
-        "iterations": 200,
-        "refMeanMs": 0.0771
-      },
-      "verified": true
-    },
-    {
-      "workload": "bnpoLossBenchmark.bnpo_loss_batch128_seqlen08192_compiled",
-      "timingResults": {
-        "mean_ms": 0.0351,
-        "std_ms": 0.0033,
-        "min_ms": 0.0327,
-        "max_ms": 0.0557,
-        "q1_ms": 0.0336,
-        "q3_ms": 0.035,
-        "iqr_ms": 0.0014,
-        "outliers": 14,
-        "iterations": 200,
-        "refMeanMs": 0.0771
-      },
-      "verified": true
-    },
-    {
-      "workload": "bnpoLossBenchmark.bnpo_loss_batch16_seqlen01024_compiled",
-      "timingResults": {
-        "mean_ms": 0.0355,
-        "std_ms": 0.0042,
-        "min_ms": 0.0331,
-        "max_ms": 0.0706,
-        "q1_ms": 0.034,
-        "q3_ms": 0.0351,
-        "iqr_ms": 0.0011,
-        "outliers": 21,
-        "iterations": 200,
-        "refMeanMs": 0.0811
-      },
-      "verified": true
-    },
-    {
-      "workload": "bnpoLossBenchmark.bnpo_loss_batch16_seqlen02781_compiled",
-      "timingResults": {
-        "mean_ms": 0.0355,
-        "std_ms": 0.004,
-        "min_ms": 0.0319,
-        "max_ms": 0.0591,
-        "q1_ms": 0.0338,
-        "q3_ms": 0.0352,
-        "iqr_ms": 0.0014,
-        "outliers": 24,
-        "iterations": 200,
-        "refMeanMs": 0.0709
-      },
-      "verified": true
-    },
-    {
-      "workload": "bnpoLossBenchmark.bnpo_loss_batch32_seqlen02048_compiled",
-      "timingResults": {
-        "mean_ms": 0.0358,
-        "std_ms": 0.0042,
-        "min_ms": 0.032,
-        "max_ms": 0.0569,
-        "q1_ms": 0.0338,
-        "q3_ms": 0.0355,
-        "iqr_ms": 0.0017,
-        "outliers": 27,
-        "iterations": 200,
-        "refMeanMs": 0.0763
-      },
-      "verified": true
-    },
-    {
-      "workload": "bnpoLossBenchmark.bnpo_loss_batch64_seqlen04096_compiled",
-      "timingResults": {
-        "mean_ms": 0.0344,
-        "std_ms": 0.0031,
-        "min_ms": 0.032,
-        "max_ms": 0.0557,
-        "q1_ms": 0.0331,
-        "q3_ms": 0.0341,
-        "iqr_ms": 0.001,
-        "outliers": 32,
-        "iterations": 200,
-        "refMeanMs": 0.0739
-      },
-      "verified": true
-    },
-    {
-      "workload": "bnpoLossBenchmark.bnpo_loss_fwd_batch128_seqlen02781_compiled",
-      "timingResults": {
-        "mean_ms": 0.0323,
-        "std_ms": 0.0034,
-        "min_ms": 0.03,
-        "max_ms": 0.053,
-        "q1_ms": 0.0311,
-        "q3_ms": 0.0318,
-        "iqr_ms": 0.0007,
-        "outliers": 25,
-        "iterations": 200,
-        "refMeanMs": 0.0808
-      },
-      "verified": true
-    },
-    {
-      "workload": "bnpoLossBenchmark.bnpo_loss_fwd_batch128_seqlen08192_compiled",
-      "timingResults": {
-        "mean_ms": 0.0318,
-        "std_ms": 0.0032,
-        "min_ms": 0.0293,
-        "max_ms": 0.0502,
-        "q1_ms": 0.0304,
-        "q3_ms": 0.0317,
-        "iqr_ms": 0.0013,
-        "outliers": 17,
-        "iterations": 200,
-        "refMeanMs": 0.0845
-      },
-      "verified": true
-    },
-    {
-      "workload": "bnpoLossBenchmark.bnpo_loss_fwd_batch16_seqlen01024_compiled",
-      "timingResults": {
-        "mean_ms": 0.0317,
-        "std_ms": 0.0031,
-        "min_ms": 0.0293,
-        "max_ms": 0.0593,
-        "q1_ms": 0.0304,
-        "q3_ms": 0.0317,
-        "iqr_ms": 0.0013,
-        "outliers": 17,
-        "iterations": 200,
-        "refMeanMs": 0.079
-      },
-      "verified": true
-    },
-    {
-      "workload": "bnpoLossBenchmark.bnpo_loss_fwd_batch16_seqlen02781_compiled",
-      "timingResults": {
-        "mean_ms": 0.0306,
-        "std_ms": 0.0035,
-        "min_ms": 0.0279,
-        "max_ms": 0.0534,
-        "q1_ms": 0.0289,
-        "q3_ms": 0.0306,
-        "iqr_ms": 0.0017,
-        "outliers": 20,
-        "iterations": 200,
-        "refMeanMs": 0.084
-      },
-      "verified": true
-    },
-    {
-      "workload": "bnpoLossBenchmark.bnpo_loss_fwd_batch32_seqlen02048_compiled",
-      "timingResults": {
-        "mean_ms": 0.0305,
-        "std_ms": 0.0035,
-        "min_ms": 0.0279,
-        "max_ms": 0.051,
-        "q1_ms": 0.0288,
-        "q3_ms": 0.0308,
-        "iqr_ms": 0.002,
-        "outliers": 15,
-        "iterations": 200,
-        "refMeanMs": 0.0764
-      },
-      "verified": true
-    },
-    {
-      "workload": "bnpoLossBenchmark.bnpo_loss_fwd_batch64_seqlen04096_compiled",
-      "timingResults": {
-        "mean_ms": 0.0315,
-        "std_ms": 0.0033,
-        "min_ms": 0.0293,
-        "max_ms": 0.0543,
-        "q1_ms": 0.0302,
-        "q3_ms": 0.0311,
-        "iqr_ms": 0.0009,
-        "outliers": 21,
-        "iterations": 200,
-        "refMeanMs": 0.0739
-      },
-      "verified": true
-    }
-  ],
-  "machineInfo": {
-    "gpu": "NVIDIA H100 80GB HBM3",
-    "backend": "CUDA 13.0",
-    "pytorchVersion": "2.11.0+cu130",
-    "os": "Linux 6.11.0-1016-nvidia",
-    "cpu": "x86_64"
-  },
-  "kernelCommitSha": "7972ab0e834be24d",
-  "benchmarkScriptPath": "benchmarks",
-  "benchmarkScriptSha": "68426064f76adff2066ad365f6c97be3fe279bd6b20d025b3dc5614f9b2da449"
-}
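
As a rough cross-check of a file like the one above, the sketch below computes a geometric-mean speedup from the six fused `bnpo_loss` entries. It assumes that `refMeanMs` is the mean latency of the reference implementation (here the `torch.compile` baseline) in milliseconds, which the file itself does not state. The helper name `geomean_speedup` is hypothetical.

```python
import math

# (workload suffix, mean_ms, refMeanMs) copied from the six fused
# bnpo_loss entries in the deleted results.json shown above.
ROWS = [
    ("batch128_seqlen02781", 0.0359, 0.0771),
    ("batch128_seqlen08192", 0.0351, 0.0771),
    ("batch16_seqlen01024", 0.0355, 0.0811),
    ("batch16_seqlen02781", 0.0355, 0.0709),
    ("batch32_seqlen02048", 0.0358, 0.0763),
    ("batch64_seqlen04096", 0.0344, 0.0739),
]


def geomean_speedup(rows):
    """Geometric mean of refMeanMs / mean_ms across workloads.

    Assumes refMeanMs is the baseline latency, so a ratio > 1 means
    the kernel is faster than the reference.
    """
    logs = [math.log(ref_ms / mean_ms) for _, mean_ms, ref_ms in rows]
    return math.exp(sum(logs) / len(logs))


speedup = geomean_speedup(ROWS)
print(f"bnpo_loss vs reference: {speedup:.2f}x")
```

The removed README reports its table as timed with `time.perf_counter` over 100 iterations, whereas this file records 200 iterations, so the figure computed here need not match the README's table exactly.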

benchmark_results/bnpo_loss_eager/bnpo_loss_eager_dark_animation.svg DELETED

benchmark_results/bnpo_loss_eager/bnpo_loss_eager_dark_latency.svg DELETED

benchmark_results/bnpo_loss_eager/bnpo_loss_eager_dark_throughput.svg DELETED

benchmark_results/bnpo_loss_eager/bnpo_loss_eager_light_animation.svg DELETED

benchmark_results/bnpo_loss_eager/bnpo_loss_eager_light_latency.svg DELETED

benchmark_results/bnpo_loss_eager/bnpo_loss_eager_light_throughput.svg DELETED

benchmark_results/bnpo_loss_eager/results.json DELETED

@@ -1,206 +0,0 @@
-{
-  "results": [
-    {
-      "workload": "bnpoLossBenchmark.bnpo_loss_batch128_seqlen02781_eager",
-      "timingResults": {
-        "mean_ms": 0.0358,
-        "std_ms": 0.0035,
-        "min_ms": 0.0323,
-        "max_ms": 0.0536,
-        "q1_ms": 0.0342,
-        "q3_ms": 0.0358,
-        "iqr_ms": 0.0017,
-        "outliers": 17,
-        "iterations": 200,
-        "refMeanMs": 0.5552
-      },
-      "verified": true
-    },
-    {
-      "workload": "bnpoLossBenchmark.bnpo_loss_batch128_seqlen08192_eager",
-      "timingResults": {
-        "mean_ms": 0.0344,
-        "std_ms": 0.0031,
-        "min_ms": 0.0314,
-        "max_ms": 0.0537,
-        "q1_ms": 0.0329,
-        "q3_ms": 0.0345,
-        "iqr_ms": 0.0015,
-        "outliers": 20,
-        "iterations": 200,
-        "refMeanMs": 0.6466
-      },
-      "verified": true
-    },
-    {
-      "workload": "bnpoLossBenchmark.bnpo_loss_batch16_seqlen01024_eager",
-      "timingResults": {
-        "mean_ms": 0.0345,
-        "std_ms": 0.0171,
-        "min_ms": 0.0305,
-        "max_ms": 0.2718,
-        "q1_ms": 0.0319,
-        "q3_ms": 0.033,
-        "iqr_ms": 0.0011,
-        "outliers": 23,
-        "iterations": 200,
-        "refMeanMs": 0.5868
-      },
-      "verified": true
-    },
-    {
-      "workload": "bnpoLossBenchmark.bnpo_loss_batch16_seqlen02781_eager",
-      "timingResults": {
-        "mean_ms": 0.0324,
-        "std_ms": 0.0027,
-        "min_ms": 0.0301,
-        "max_ms": 0.0508,
-        "q1_ms": 0.0312,
-        "q3_ms": 0.0324,
-        "iqr_ms": 0.0012,
-        "outliers": 17,
-        "iterations": 200,
-        "refMeanMs": 0.5832
-      },
-      "verified": true
-    },
-    {
-      "workload": "bnpoLossBenchmark.bnpo_loss_batch32_seqlen02048_eager",
-      "timingResults": {
-        "mean_ms": 0.0343,
-        "std_ms": 0.0033,
-        "min_ms": 0.031,
-        "max_ms": 0.0513,
-        "q1_ms": 0.0325,
-        "q3_ms": 0.0346,
-        "iqr_ms": 0.0021,
-        "outliers": 19,
-        "iterations": 200,
-        "refMeanMs": 0.6265
-      },
-      "verified": true
-    },
-    {
-      "workload": "bnpoLossBenchmark.bnpo_loss_batch64_seqlen04096_eager",
-      "timingResults": {
-        "mean_ms": 0.0328,
-        "std_ms": 0.0029,
-        "min_ms": 0.0306,
-        "max_ms": 0.0499,
-        "q1_ms": 0.0317,
-        "q3_ms": 0.0326,
-        "iqr_ms": 0.0009,
-        "outliers": 20,
-        "iterations": 200,
-        "refMeanMs": 0.5698
-      },
-      "verified": true
-    },
-    {
-      "workload": "bnpoLossBenchmark.bnpo_loss_fwd_batch128_seqlen02781_eager",
-      "timingResults": {
-        "mean_ms": 0.0317,
-        "std_ms": 0.0034,
-        "min_ms": 0.0285,
-        "max_ms": 0.052,
-        "q1_ms": 0.0305,
-        "q3_ms": 0.0314,
-        "iqr_ms": 0.0009,
-        "outliers": 22,
-        "iterations": 200,
-        "refMeanMs": 0.1858
-      },
-      "verified": true
-    },
-    {
-      "workload": "bnpoLossBenchmark.bnpo_loss_fwd_batch128_seqlen08192_eager",
-      "timingResults": {
-        "mean_ms": 0.0292,
-        "std_ms": 0.0028,
-        "min_ms": 0.0273,
-        "max_ms": 0.0455,
-        "q1_ms": 0.0281,
-        "q3_ms": 0.0289,
-        "iqr_ms": 0.0008,
-        "outliers": 23,
-        "iterations": 200,
-        "refMeanMs": 0.1633
-      },
-      "verified": true
-    },
-    {
-      "workload": "bnpoLossBenchmark.bnpo_loss_fwd_batch16_seqlen01024_eager",
-      "timingResults": {
-        "mean_ms": 0.0311,
-        "std_ms": 0.0267,
-        "min_ms": 0.0256,
-        "max_ms": 0.4049,
-        "q1_ms": 0.0276,
-        "q3_ms": 0.0295,
-        "iqr_ms": 0.0018,
-        "outliers": 18,
-        "iterations": 200,
-        "refMeanMs": 0.1761
-      },
-      "verified": true
-    },
-    {
-      "workload": "bnpoLossBenchmark.bnpo_loss_fwd_batch16_seqlen02781_eager",
-      "timingResults": {
-        "mean_ms": 0.0288,
-        "std_ms": 0.003,
-        "min_ms": 0.027,
-        "max_ms": 0.0554,
-        "q1_ms": 0.0278,
-        "q3_ms": 0.0284,
-        "iqr_ms": 0.0006,
-        "outliers": 22,
-        "iterations": 200,
-        "refMeanMs": 0.1755
-      },
-      "verified": true
-    },
-    {
-      "workload": "bnpoLossBenchmark.bnpo_loss_fwd_batch32_seqlen02048_eager",
-      "timingResults": {
-        "mean_ms": 0.031,
-        "std_ms": 0.0034,
-        "min_ms": 0.0281,
-        "max_ms": 0.0484,
-        "q1_ms": 0.0296,
-        "q3_ms": 0.0306,
-        "iqr_ms": 0.0009,
-        "outliers": 27,
-        "iterations": 200,
-        "refMeanMs": 0.1533
-      },
-      "verified": true
-    },
-    {
-      "workload": "bnpoLossBenchmark.bnpo_loss_fwd_batch64_seqlen04096_eager",
-      "timingResults": {
-        "mean_ms": 0.031,
-        "std_ms": 0.0041,
-        "min_ms": 0.0286,
-        "max_ms": 0.0625,
-        "q1_ms": 0.0294,
-        "q3_ms": 0.0305,
-        "iqr_ms": 0.0011,
-        "outliers": 22,
-        "iterations": 200,
-        "refMeanMs": 0.1678
-      },
-      "verified": true
-    }
-  ],
-  "machineInfo": {
-    "gpu": "NVIDIA H100 80GB HBM3",
-    "backend": "CUDA 13.0",
-    "pytorchVersion": "2.11.0+cu130",
-    "os": "Linux 6.11.0-1016-nvidia",
-    "cpu": "x86_64"
-  },
-  "kernelCommitSha": "84e79b2f3ee3088a",
-  "benchmarkScriptPath": "benchmarks",
-  "benchmarkScriptSha": "68426064f76adff2066ad365f6c97be3fe279bd6b20d025b3dc5614f9b2da449"
-}

benchmark_results/grpo_loss_compiled/grpo_loss_compiled_dark_animation.svg DELETED

benchmark_results/grpo_loss_compiled/grpo_loss_compiled_dark_latency.svg DELETED

benchmark_results/grpo_loss_compiled/grpo_loss_compiled_dark_throughput.svg DELETED

benchmark_results/grpo_loss_compiled/grpo_loss_compiled_light_animation.svg DELETED

benchmark_results/grpo_loss_compiled/grpo_loss_compiled_light_latency.svg DELETED

benchmark_results/grpo_loss_compiled/grpo_loss_compiled_light_throughput.svg DELETED

benchmark_results/grpo_loss_compiled/results.json DELETED

@@ -1,174 +0,0 @@
-{
-  "results": [
-    {
-      "workload": "GrpoLossBenchmark.grpo_loss_batch128_seqlen02781_compiled",
-      "timingResults": {
-        "mean_ms": 0.0329,
-        "std_ms": 0.0042,
-        "min_ms": 0.0301,
-        "max_ms": 0.0632,
-        "q1_ms": 0.031,
-        "q3_ms": 0.0326,
-        "iqr_ms": 0.0016,
-        "outliers": 22,
-        "iterations": 200,
-        "refMeanMs": 0.0874
-      },
-      "verified": true
-    },
-    {
-      "workload": "GrpoLossBenchmark.grpo_loss_batch128_seqlen08192_compiled",
-      "timingResults": {
-        "mean_ms": 0.0337,
-        "std_ms": 0.0045,
-        "min_ms": 0.0305,
-        "max_ms": 0.065,
-        "q1_ms": 0.0318,
-        "q3_ms": 0.0333,
-        "iqr_ms": 0.0015,
-        "outliers": 23,
-        "iterations": 200,
-        "refMeanMs": 0.0824
-      },
-      "verified": true
-    },
-    {
-      "workload": "GrpoLossBenchmark.grpo_loss_batch16_seqlen01024_compiled",
-      "timingResults": {
-        "mean_ms": 0.0323,
-        "std_ms": 0.0045,
-        "min_ms": 0.0286,
-        "max_ms": 0.0621,
-        "q1_ms": 0.0306,
-        "q3_ms": 0.0321,
-        "iqr_ms": 0.0015,
-        "outliers": 24,
-        "iterations": 200,
-        "refMeanMs": 0.0626
-      },
-      "verified": true
-    },
-    {
-      "workload": "GrpoLossBenchmark.grpo_loss_batch32_seqlen02048_compiled",
-      "timingResults": {
-        "mean_ms": 0.0324,
-        "std_ms": 0.0046,
-        "min_ms": 0.0286,
-        "max_ms": 0.0688,
-        "q1_ms": 0.0305,
-        "q3_ms": 0.0321,
-        "iqr_ms": 0.0016,
-        "outliers": 22,
-        "iterations": 200,
-        "refMeanMs": 0.0633
-      },
-      "verified": true
-    },
-    {
-      "workload": "GrpoLossBenchmark.grpo_loss_batch64_seqlen04096_compiled",
-      "timingResults": {
-        "mean_ms": 0.0349,
-        "std_ms": 0.0058,
-        "min_ms": 0.0315,
-        "max_ms": 0.0814,
-        "q1_ms": 0.0325,
-        "q3_ms": 0.0341,
-        "iqr_ms": 0.0016,
-        "outliers": 26,
-        "iterations": 200,
-        "refMeanMs": 0.0869
-      },
-      "verified": true
-    },
-    {
-      "workload": "GrpoLossBenchmark.grpo_loss_fwd_batch128_seqlen02781_compiled",
-      "timingResults": {
-        "mean_ms": 0.033,
-        "std_ms": 0.0038,
-        "min_ms": 0.0295,
-        "max_ms": 0.0543,
-        "q1_ms": 0.0313,
-        "q3_ms": 0.0333,
-        "iqr_ms": 0.0019,
-        "outliers": 16,
-        "iterations": 200,
-        "refMeanMs": 0.0772
-      },
-      "verified": true
-    },
-    {
-      "workload": "GrpoLossBenchmark.grpo_loss_fwd_batch128_seqlen08192_compiled",
-      "timingResults": {
-        "mean_ms": 0.0331,
-        "std_ms": 0.0032,
-        "min_ms": 0.0295,
-        "max_ms": 0.0535,
-        "q1_ms": 0.0316,
-        "q3_ms": 0.0331,
-        "iqr_ms": 0.0015,
-        "outliers": 19,
-        "iterations": 200,
-        "refMeanMs": 0.0767
-      },
-      "verified": true
-    },
-    {
-      "workload": "GrpoLossBenchmark.grpo_loss_fwd_batch16_seqlen01024_compiled",
-      "timingResults": {
-        "mean_ms": 0.033,
-        "std_ms": 0.0032,
-        "min_ms": 0.029,
-        "max_ms": 0.051,
-        "q1_ms": 0.0315,
-        "q3_ms": 0.0332,
-        "iqr_ms": 0.0016,
-        "outliers": 17,
-        "iterations": 200,
-        "refMeanMs": 0.0845
-      },
-      "verified": true
-    },
-    {
-      "workload": "GrpoLossBenchmark.grpo_loss_fwd_batch32_seqlen02048_compiled",
-      "timingResults": {
-        "mean_ms": 0.0339,
-        "std_ms": 0.006,
-        "min_ms": 0.03,
-        "max_ms": 0.0674,
-        "q1_ms": 0.0314,
-        "q3_ms": 0.0331,
-        "iqr_ms": 0.0017,
-        "outliers": 23,
-        "iterations": 200,
-        "refMeanMs": 0.1052
-      },
-      "verified": true
-    },
-    {
-      "workload": "GrpoLossBenchmark.grpo_loss_fwd_batch64_seqlen04096_compiled",
-      "timingResults": {
-        "mean_ms": 0.034,
-        "std_ms": 0.004,
-        "min_ms": 0.031,
-        "max_ms": 0.0623,
-        "q1_ms": 0.0323,
-        "q3_ms": 0.0339,
-        "iqr_ms": 0.0016,
-        "outliers": 20,
-        "iterations": 200,
-        "refMeanMs": 0.0796
-      },
-      "verified": true
-    }
-  ],
-  "machineInfo": {
-    "gpu": "NVIDIA H100 80GB HBM3",
-    "backend": "CUDA 13.0",
-    "pytorchVersion": "2.11.0+cu130",
-    "os": "Linux 6.11.0-1016-nvidia",
-    "cpu": "x86_64"
-  },
-  "kernelCommitSha": "ad285d68b8c8c0ff",
-  "benchmarkScriptPath": "benchmarks",
-  "benchmarkScriptSha": "ff35d63fbca37cfcbf5c94f067c930adc2bd0043ce6788f286dbad5a4f9b9d4a"
-}
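
Review note: each entry in the deleted results.json pairs the kernel's `mean_ms` with a `refMeanMs`. The sketch below assumes `refMeanMs` is the mean latency of the PyTorch baseline for the same workload (the diff does not define the field) and derives a speedup ratio under that assumption. The `speedups` helper is hypothetical and not part of the repository; the two entries are copied from the file above.

```python
import json

# Two entries copied from the deleted grpo_loss_compiled/results.json.
RESULTS_JSON = """
{
  "results": [
    {"workload": "GrpoLossBenchmark.grpo_loss_fwd_batch32_seqlen02048_compiled",
     "timingResults": {"mean_ms": 0.0339, "refMeanMs": 0.1052}},
    {"workload": "GrpoLossBenchmark.grpo_loss_fwd_batch64_seqlen04096_compiled",
     "timingResults": {"mean_ms": 0.034, "refMeanMs": 0.0796}}
  ]
}
"""


def speedups(payload):
    """Map workload name -> refMeanMs / mean_ms.

    Assumption: refMeanMs is the reference (baseline) mean latency,
    so a ratio above 1.0 means the kernel is faster than the baseline.
    """
    out = {}
    for entry in payload["results"]:
        timing = entry["timingResults"]
        out[entry["workload"]] = timing["refMeanMs"] / timing["mean_ms"]
    return out


grpo_speedups = speedups(json.loads(RESULTS_JSON))
for name, ratio in grpo_speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

The same ratio can be computed for the eager results, where the reference times are several times larger for the fwd+bwd workloads.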

benchmark_results/grpo_loss_eager/grpo_loss_eager_dark_animation.svg DELETED

benchmark_results/grpo_loss_eager/grpo_loss_eager_dark_latency.svg DELETED

benchmark_results/grpo_loss_eager/grpo_loss_eager_dark_throughput.svg DELETED

benchmark_results/grpo_loss_eager/grpo_loss_eager_light_animation.svg DELETED

benchmark_results/grpo_loss_eager/grpo_loss_eager_light_latency.svg DELETED

benchmark_results/grpo_loss_eager/grpo_loss_eager_light_throughput.svg DELETED

benchmark_results/grpo_loss_eager/results.json DELETED

@@ -1,174 +0,0 @@
-{
-  "results": [
-    {
-      "workload": "GrpoLossBenchmark.grpo_loss_batch128_seqlen02781_eager",
-      "timingResults": {
-        "mean_ms": 0.0313,
-        "std_ms": 0.0029,
-        "min_ms": 0.0281,
-        "max_ms": 0.0482,
-        "q1_ms": 0.03,
-        "q3_ms": 0.0314,
-        "iqr_ms": 0.0013,
-        "outliers": 16,
-        "iterations": 200,
-        "refMeanMs": 0.6643
-      },
-      "verified": true
-    },
-    {
-      "workload": "GrpoLossBenchmark.grpo_loss_batch128_seqlen08192_eager",
-      "timingResults": {
-        "mean_ms": 0.0309,
-        "std_ms": 0.0031,
-        "min_ms": 0.0285,
-        "max_ms": 0.0477,
-        "q1_ms": 0.0298,
-        "q3_ms": 0.0306,
-        "iqr_ms": 0.0008,
-        "outliers": 19,
-        "iterations": 200,
-        "refMeanMs": 0.5961
-      },
-      "verified": true
-    },
-    {
-      "workload": "GrpoLossBenchmark.grpo_loss_batch16_seqlen01024_eager",
-      "timingResults": {
-        "mean_ms": 0.0315,
-        "std_ms": 0.0033,
-        "min_ms": 0.0293,
-        "max_ms": 0.0507,
-        "q1_ms": 0.0302,
-        "q3_ms": 0.0311,
-        "iqr_ms": 0.0009,
-        "outliers": 23,
-        "iterations": 200,
-        "refMeanMs": 0.6132
-      },
-      "verified": true
-    },
-    {
-      "workload": "GrpoLossBenchmark.grpo_loss_batch32_seqlen02048_eager",
-      "timingResults": {
-        "mean_ms": 0.0302,
-        "std_ms": 0.0029,
-        "min_ms": 0.028,
-        "max_ms": 0.0467,
-        "q1_ms": 0.029,
-        "q3_ms": 0.0299,
-        "iqr_ms": 0.0008,
-        "outliers": 20,
-        "iterations": 200,
-        "refMeanMs": 0.6043
-      },
-      "verified": true
-    },
-    {
-      "workload": "GrpoLossBenchmark.grpo_loss_batch64_seqlen04096_eager",
-      "timingResults": {
-        "mean_ms": 0.0295,
-        "std_ms": 0.003,
-        "min_ms": 0.0268,
-        "max_ms": 0.0465,
-        "q1_ms": 0.0279,
-        "q3_ms": 0.03,
-        "iqr_ms": 0.002,
-        "outliers": 12,
-        "iterations": 200,
-        "refMeanMs": 0.5798
-      },
-      "verified": true
-    },
-    {
-      "workload": "GrpoLossBenchmark.grpo_loss_fwd_batch128_seqlen02781_eager",
-      "timingResults": {
-        "mean_ms": 0.0306,
-        "std_ms": 0.0032,
-        "min_ms": 0.0281,
-        "max_ms": 0.0513,
-        "q1_ms": 0.0293,
-        "q3_ms": 0.0302,
-        "iqr_ms": 0.0009,
-        "outliers": 24,
-        "iterations": 200,
-        "refMeanMs": 0.1716
-      },
-      "verified": true
-    },
-    {
-      "workload": "GrpoLossBenchmark.grpo_loss_fwd_batch128_seqlen08192_eager",
-      "timingResults": {
-        "mean_ms": 0.0302,
-        "std_ms": 0.0031,
-        "min_ms": 0.0284,
-        "max_ms": 0.0594,
-        "q1_ms": 0.0291,
-        "q3_ms": 0.0299,
-        "iqr_ms": 0.0008,
-        "outliers": 21,
-        "iterations": 200,
-        "refMeanMs": 0.1701
-      },
-      "verified": true
-    },
-    {
-      "workload": "GrpoLossBenchmark.grpo_loss_fwd_batch16_seqlen01024_eager",
-      "timingResults": {
-        "mean_ms": 0.0306,
-        "std_ms": 0.0027,
-        "min_ms": 0.0286,
-        "max_ms": 0.0455,
-        "q1_ms": 0.0294,
-        "q3_ms": 0.0304,
-        "iqr_ms": 0.001,
-        "outliers": 16,
-        "iterations": 200,
-        "refMeanMs": 0.1741
-      },
-      "verified": true
-    },
-    {
-      "workload": "GrpoLossBenchmark.grpo_loss_fwd_batch32_seqlen02048_eager",
-      "timingResults": {
-        "mean_ms": 0.0299,
-        "std_ms": 0.0029,
-        "min_ms": 0.0269,
-        "max_ms": 0.0488,
-        "q1_ms": 0.0287,
-        "q3_ms": 0.0301,
-        "iqr_ms": 0.0015,
-        "outliers": 14,
-        "iterations": 200,
-        "refMeanMs": 0.1647
-      },
-      "verified": true
-    },
-    {
-      "workload": "GrpoLossBenchmark.grpo_loss_fwd_batch64_seqlen04096_eager",
-      "timingResults": {
-        "mean_ms": 0.0314,
-        "std_ms": 0.0028,
-        "min_ms": 0.0289,
-        "max_ms": 0.0465,
-        "q1_ms": 0.0301,
-        "q3_ms": 0.0312,
-        "iqr_ms": 0.0011,
-        "outliers": 22,
-        "iterations": 200,
-        "refMeanMs": 0.1751
-      },
-      "verified": true
-    }
-  ],
-  "machineInfo": {
-    "gpu": "NVIDIA H100 80GB HBM3",
-    "backend": "CUDA 13.0",
-    "pytorchVersion": "2.11.0+cu130",
-    "os": "Linux 6.11.0-1016-nvidia",
-    "cpu": "x86_64"
-  },
-  "kernelCommitSha": "87ec9b61421d0121",
-  "benchmarkScriptPath": "benchmarks",
-  "benchmarkScriptSha": "ff35d63fbca37cfcbf5c94f067c930adc2bd0043ce6788f286dbad5a4f9b9d4a"
-}
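
Review note: every workload above reports a nonzero `outliers` count alongside `q1_ms`, `q3_ms` and `iqr_ms`. The diff does not say how outliers are counted; the sketch below assumes the common Tukey rule (points beyond 1.5 × IQR from the quartiles) and only checks that the reported `min_ms` / `max_ms` are consistent with that assumption for one entry. `tukey_fences` is a hypothetical helper.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences for the given quartiles.

    Assumption: the benchmark harness uses the standard 1.5 * IQR rule;
    the diff itself does not confirm this.
    """
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Values copied from grpo_loss_batch64_seqlen04096_eager above.
entry = {"min_ms": 0.0268, "max_ms": 0.0465, "q1_ms": 0.0279, "q3_ms": 0.03}

lower, upper = tukey_fences(entry["q1_ms"], entry["q3_ms"])
has_high_outlier = entry["max_ms"] > upper  # slowest sample lies above the upper fence
has_low_outlier = entry["min_ms"] < lower   # fastest sample lies below the lower fence
print(f"fences=({lower:.5f}, {upper:.5f}) high={has_high_outlier} low={has_low_outlier}")
```

Under this assumption the outliers for that workload are all on the slow side, which is what one would expect from occasional launch or scheduling jitter at roughly 30 µs per call.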

benchmark_results/reverse_kl_compiled/results.json DELETED

@@ -1,206 +0,0 @@
-{
-  "results": [
-    {
-      "workload": "ReverseKLBenchmark.reverse_kl_batch01_seqlen064_vocab248320_compiled",
-      "timingResults": {
-        "mean_ms": 0.1039,
-        "std_ms": 0.0035,
-        "min_ms": 0.1,
-        "max_ms": 0.1229,
-        "q1_ms": 0.1018,
-        "q3_ms": 0.104,
-        "iqr_ms": 0.0022,
-        "outliers": 28,
-        "iterations": 200,
-        "refMeanMs": 0.2322
-      },
-      "verified": true
-    },
-    {
-      "workload": "ReverseKLBenchmark.reverse_kl_batch02_seqlen128_vocab248320_compiled",
-      "timingResults": {
-        "mean_ms": 0.2483,
-        "std_ms": 0.0035,
-        "min_ms": 0.2418,
-        "max_ms": 0.2612,
-        "q1_ms": 0.2457,
-        "q3_ms": 0.2513,
-        "iqr_ms": 0.0057,
-        "outliers": 2,
-        "iterations": 200,
-        "refMeanMs": 0.6455
-      },
-      "verified": true
-    },
-    {
-      "workload": "ReverseKLBenchmark.reverse_kl_batch04_seqlen256_vocab248320_compiled",
-      "timingResults": {
-        "mean_ms": 0.8322,
-        "std_ms": 0.0044,
-        "min_ms": 0.8232,
-        "max_ms": 0.8623,
-        "q1_ms": 0.8303,
-        "q3_ms": 0.8335,
-        "iqr_ms": 0.0032,
-        "outliers": 18,
-        "iterations": 200,
-        "refMeanMs": 2.2082
-      },
-      "verified": true
-    },
-    {
-      "workload": "ReverseKLBenchmark.reverse_kl_batch08_seqlen1024_vocab248320_compiled",
-      "timingResults": {
-        "mean_ms": 6.1083,
-        "std_ms": 0.0054,
-        "min_ms": 6.097,
-        "max_ms": 6.1513,
-        "q1_ms": 6.1054,
-        "q3_ms": 6.11,
-        "iqr_ms": 0.0046,
-        "outliers": 13,
-        "iterations": 200,
-        "refMeanMs": 16.4779
-      },
-      "verified": true
-    },
-    {
-      "workload": "ReverseKLBenchmark.reverse_kl_batch08_seqlen512_vocab248320_compiled",
-      "timingResults": {
-        "mean_ms": 3.0861,
-        "std_ms": 0.0045,
-        "min_ms": 3.0769,
-        "max_ms": 3.1155,
-        "q1_ms": 3.0832,
-        "q3_ms": 3.0883,
-        "iqr_ms": 0.0051,
-        "outliers": 5,
-        "iterations": 200,
-        "refMeanMs": 8.3849
-      },
-      "verified": true
-    },
-    {
-      "workload": "ReverseKLBenchmark.reverse_kl_batch08_seqlen981_vocab248320_compiled",
-      "timingResults": {
-        "mean_ms": 5.8622,
-        "std_ms": 0.0044,
-        "min_ms": 5.8544,
-        "max_ms": 5.8821,
-        "q1_ms": 5.859,
-        "q3_ms": 5.8646,
-        "iqr_ms": 0.0056,
-        "outliers": 6,
-        "iterations": 200,
-        "refMeanMs": 15.8101
-      },
-      "verified": true
-    },
-    {
-      "workload": "ReverseKLBenchmark.reverse_kl_fwd_batch01_seqlen064_vocab248320_compiled",
-      "timingResults": {
-        "mean_ms": 0.0657,
-        "std_ms": 0.0041,
-        "min_ms": 0.0619,
-        "max_ms": 0.093,
-        "q1_ms": 0.0635,
-        "q3_ms": 0.0656,
-        "iqr_ms": 0.0021,
-        "outliers": 24,
-        "iterations": 200,
-        "refMeanMs": 0.1434
-      },
-      "verified": true
-    },
-    {
-      "workload": "ReverseKLBenchmark.reverse_kl_fwd_batch02_seqlen128_vocab248320_compiled",
-      "timingResults": {
-        "mean_ms": 0.1234,
-        "std_ms": 0.0041,
-        "min_ms": 0.1187,
-        "max_ms": 0.1464,
-        "q1_ms": 0.1208,
-        "q3_ms": 0.1244,
-        "iqr_ms": 0.0036,
-        "outliers": 16,
-        "iterations": 200,
-        "refMeanMs": 0.3277
-      },
-      "verified": true
-    },
-    {
-      "workload": "ReverseKLBenchmark.reverse_kl_fwd_batch04_seqlen256_vocab248320_compiled",
-      "timingResults": {
-        "mean_ms": 0.3764,
-        "std_ms": 0.0037,
-        "min_ms": 0.3699,
-        "max_ms": 0.3926,
-        "q1_ms": 0.3733,
-        "q3_ms": 0.3787,
-        "iqr_ms": 0.0054,
-        "outliers": 2,
-        "iterations": 200,
-        "refMeanMs": 0.9228
-      },
-      "verified": true
-    },
-    {
-      "workload": "ReverseKLBenchmark.reverse_kl_fwd_batch08_seqlen1024_vocab248320_compiled",
-      "timingResults": {
-        "mean_ms": 2.658,
-        "std_ms": 0.0089,
-        "min_ms": 2.6359,
-        "max_ms": 2.6859,
-        "q1_ms": 2.6524,
-        "q3_ms": 2.663,
-        "iqr_ms": 0.0106,
-        "outliers": 4,
-        "iterations": 200,
-        "refMeanMs": 6.6033
-      },
-      "verified": true
-    },
-    {
-      "workload": "ReverseKLBenchmark.reverse_kl_fwd_batch08_seqlen512_vocab248320_compiled",
-      "timingResults": {
-        "mean_ms": 1.38,
-        "std_ms": 0.0035,
-        "min_ms": 1.37,
-        "max_ms": 1.3924,
-        "q1_ms": 1.3776,
-        "q3_ms": 1.3818,
-        "iqr_ms": 0.0042,
-        "outliers": 6,
-        "iterations": 200,
-        "refMeanMs": 3.3854
-      },
-      "verified": true
-    },
-    {
-      "workload": "ReverseKLBenchmark.reverse_kl_fwd_batch08_seqlen981_vocab248320_compiled",
-      "timingResults": {
-        "mean_ms": 2.5422,
-        "std_ms": 0.0091,
-        "min_ms": 2.5286,
-        "max_ms": 2.5773,
-        "q1_ms": 2.5356,
-        "q3_ms": 2.5455,
-        "iqr_ms": 0.0099,
-        "outliers": 9,
-        "iterations": 200,
-        "refMeanMs": 6.2191
-      },
-      "verified": true
-    }
-  ],
-  "machineInfo": {
-    "gpu": "NVIDIA H100 80GB HBM3",
-    "backend": "CUDA 13.0",
-    "pytorchVersion": "2.11.0+cu130",
-    "os": "Linux 6.11.0-1016-nvidia",
-    "cpu": "x86_64"
-  },
-  "kernelCommitSha": "ca5cbc20b4d2c7d8",
-  "benchmarkScriptPath": "benchmarks",
-  "benchmarkScriptSha": "690eea1f54f31bef1ad248380201005fd667d4b9c535f92f06eb6a5a33380d22"
-}

benchmark_results/reverse_kl_compiled/reverse_kl_compiled_dark_animation.svg DELETED

benchmark_results/reverse_kl_compiled/reverse_kl_compiled_dark_latency.svg DELETED

benchmark_results/reverse_kl_compiled/reverse_kl_compiled_dark_throughput.svg DELETED

benchmark_results/reverse_kl_compiled/reverse_kl_compiled_light_animation.svg DELETED

benchmark_results/reverse_kl_compiled/reverse_kl_compiled_light_latency.svg DELETED

benchmark_results/reverse_kl_compiled/reverse_kl_compiled_light_throughput.svg DELETED

benchmark_results/reverse_kl_eager/results.json DELETED

@@ -1,206 +0,0 @@
-{
-  "results": [
-    {
-      "workload": "ReverseKLBenchmark.reverse_kl_batch01_seqlen064_vocab248320_eager",
-      "timingResults": {
-        "mean_ms": 0.1029,
-        "std_ms": 0.0032,
-        "min_ms": 0.0982,
-        "max_ms": 0.1129,
-        "q1_ms": 0.101,
-        "q3_ms": 0.1036,
-        "iqr_ms": 0.0026,
-        "outliers": 27,
-        "iterations": 200,
-        "refMeanMs": 0.5293
-      },
-      "verified": true
-    },
-    {
-      "workload": "ReverseKLBenchmark.reverse_kl_batch02_seqlen128_vocab248320_eager",
-      "timingResults": {
-        "mean_ms": 0.248,
-        "std_ms": 0.0037,
-        "min_ms": 0.2417,
-        "max_ms": 0.2592,
-        "q1_ms": 0.2451,
-        "q3_ms": 0.251,
-        "iqr_ms": 0.0058,
-        "outliers": 0,
-        "iterations": 200,
-        "refMeanMs": 1.624
-      },
-      "verified": true
-    },
-    {
-      "workload": "ReverseKLBenchmark.reverse_kl_batch04_seqlen256_vocab248320_eager",
-      "timingResults": {
-        "mean_ms": 0.8321,
-        "std_ms": 0.0035,
-        "min_ms": 0.8234,
-        "max_ms": 0.854,
-        "q1_ms": 0.8306,
-        "q3_ms": 0.8335,
-        "iqr_ms": 0.003,
-        "outliers": 20,
-        "iterations": 200,
-        "refMeanMs": 6.174
-      },
-      "verified": true
-    },
-    {
-      "workload": "ReverseKLBenchmark.reverse_kl_batch08_seqlen1024_vocab248320_eager",
-      "timingResults": {
-        "mean_ms": 6.1046,
-        "std_ms": 0.0041,
-        "min_ms": 6.0961,
-        "max_ms": 6.1376,
-        "q1_ms": 6.1023,
-        "q3_ms": 6.106,
-        "iqr_ms": 0.0037,
-        "outliers": 9,
-        "iterations": 200,
-        "refMeanMs": 48.4051
-      },
-      "verified": true
-    },
-    {
-      "workload": "ReverseKLBenchmark.reverse_kl_batch08_seqlen512_vocab248320_eager",
-      "timingResults": {
-        "mean_ms": 3.0816,
-        "std_ms": 0.0035,
-        "min_ms": 3.0743,
-        "max_ms": 3.0939,
-        "q1_ms": 3.0794,
-        "q3_ms": 3.0832,
-        "iqr_ms": 0.0038,
-        "outliers": 8,
-        "iterations": 200,
-        "refMeanMs": 24.3385
-      },
-      "verified": true
-    },
-    {
-      "workload": "ReverseKLBenchmark.reverse_kl_batch08_seqlen981_vocab248320_eager",
-      "timingResults": {
-        "mean_ms": 5.8549,
-        "std_ms": 0.0045,
-        "min_ms": 5.8459,
-        "max_ms": 5.8819,
-        "q1_ms": 5.8524,
-        "q3_ms": 5.8561,
-        "iqr_ms": 0.0037,
-        "outliers": 14,
-        "iterations": 200,
-        "refMeanMs": 46.4274
-      },
-      "verified": true
-    },
-    {
-      "workload": "ReverseKLBenchmark.reverse_kl_fwd_batch01_seqlen064_vocab248320_eager",
-      "timingResults": {
-        "mean_ms": 0.0638,
-        "std_ms": 0.0027,
-        "min_ms": 0.0604,
-        "max_ms": 0.0787,
-        "q1_ms": 0.0624,
-        "q3_ms": 0.064,
-        "iqr_ms": 0.0016,
-        "outliers": 20,
-        "iterations": 200,
-        "refMeanMs": 0.2532
-      },
-      "verified": true
-    },
-    {
-      "workload": "ReverseKLBenchmark.reverse_kl_fwd_batch02_seqlen128_vocab248320_eager",
-      "timingResults": {
-        "mean_ms": 0.1217,
-        "std_ms": 0.0038,
-        "min_ms": 0.1166,
-        "max_ms": 0.1428,
-        "q1_ms": 0.1193,
-        "q3_ms": 0.1227,
-        "iqr_ms": 0.0034,
-        "outliers": 19,
-        "iterations": 200,
-        "refMeanMs": 0.7671
-      },
-      "verified": true
-    },
-    {
-      "workload": "ReverseKLBenchmark.reverse_kl_fwd_batch04_seqlen256_vocab248320_eager",
-      "timingResults": {
-        "mean_ms": 0.3753,
-        "std_ms": 0.0033,
-        "min_ms": 0.3695,
-        "max_ms": 0.3843,
-        "q1_ms": 0.3726,
-        "q3_ms": 0.3779,
-        "iqr_ms": 0.0053,
-        "outliers": 0,
-        "iterations": 200,
-        "refMeanMs": 2.869
-      },
-      "verified": true
-    },
-    {
-      "workload": "ReverseKLBenchmark.reverse_kl_fwd_batch08_seqlen1024_vocab248320_eager",
-      "timingResults": {
-        "mean_ms": 2.6484,
-        "std_ms": 0.0065,
-        "min_ms": 2.6364,
-        "max_ms": 2.7044,
-        "q1_ms": 2.6449,
-        "q3_ms": 2.6515,
-        "iqr_ms": 0.0067,
-        "outliers": 3,
-        "iterations": 200,
-        "refMeanMs": 22.3336
-      },
-      "verified": true
-    },
-    {
-      "workload": "ReverseKLBenchmark.reverse_kl_fwd_batch08_seqlen512_vocab248320_eager",
-      "timingResults": {
-        "mean_ms": 1.365,
-        "std_ms": 0.0046,
-        "min_ms": 1.3548,
-        "max_ms": 1.3865,
-        "q1_ms": 1.3618,
-        "q3_ms": 1.3675,
-        "iqr_ms": 0.0057,
-        "outliers": 4,
-        "iterations": 200,
-        "refMeanMs": 11.2401
-      },
-      "verified": true
-    },
-    {
-      "workload": "ReverseKLBenchmark.reverse_kl_fwd_batch08_seqlen981_vocab248320_eager",
-      "timingResults": {
-        "mean_ms": 2.5316,
-        "std_ms": 0.0059,
-        "min_ms": 2.5203,
-        "max_ms": 2.5523,
-        "q1_ms": 2.5272,
-        "q3_ms": 2.5355,
-        "iqr_ms": 0.0083,
-        "outliers": 3,
-        "iterations": 200,
-        "refMeanMs": 21.4099
-      },
-      "verified": true
-    }
-  ],
-  "machineInfo": {
-    "gpu": "NVIDIA H100 80GB HBM3",
-    "backend": "CUDA 13.0",
-    "pytorchVersion": "2.11.0+cu130",
-    "os": "Linux 6.11.0-1016-nvidia",
-    "cpu": "x86_64"
-  },
-  "kernelCommitSha": "3e023eb5121761b8",
-  "benchmarkScriptPath": "benchmarks",
-  "benchmarkScriptSha": "690eea1f54f31bef1ad248380201005fd667d4b9c535f92f06eb6a5a33380d22"
-}
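
Review note: the reverse-KL workload names encode batch size, sequence length and vocabulary size, so the latencies above can be normalized per token. The sketch below parses three fwd+bwd names copied from the eager results and divides `mean_ms` by `batch * seqlen`. The regular expression and the helper name are assumptions made for illustration; they are not taken from the benchmark scripts.

```python
import re

# (workload, mean_ms) pairs copied from the deleted reverse_kl_eager/results.json.
ENTRIES = [
    ("ReverseKLBenchmark.reverse_kl_batch08_seqlen512_vocab248320_eager", 3.0816),
    ("ReverseKLBenchmark.reverse_kl_batch08_seqlen981_vocab248320_eager", 5.8549),
    ("ReverseKLBenchmark.reverse_kl_batch08_seqlen1024_vocab248320_eager", 6.1046),
]

# Assumed naming scheme: ..._batch<B>_seqlen<S>_vocab<V>_...
SHAPE_RE = re.compile(r"batch(\d+)_seqlen(\d+)_vocab(\d+)")


def ns_per_token(workload, mean_ms):
    """Convert a mean latency in milliseconds to nanoseconds per token."""
    match = SHAPE_RE.search(workload)
    if match is None:
        raise ValueError(f"unrecognized workload name: {workload}")
    batch, seqlen, _vocab = (int(group) for group in match.groups())
    return mean_ms * 1e6 / (batch * seqlen)


per_token = [ns_per_token(name, ms) for name, ms in ENTRIES]
for (name, _), value in zip(ENTRIES, per_token):
    print(f"{name}: {value:.1f} ns/token")
```

For these three shapes the per-token cost stays within a few nanoseconds of 750 ns, which suggests the kernel time scales close to linearly with the number of tokens at this vocabulary size.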

benchmark_results/reverse_kl_eager/reverse_kl_eager_dark_animation.svg DELETED

benchmark_results/reverse_kl_eager/reverse_kl_eager_dark_latency.svg DELETED

benchmark_results/reverse_kl_eager/reverse_kl_eager_dark_throughput.svg DELETED

benchmark_results/reverse_kl_eager/reverse_kl_eager_light_animation.svg DELETED

benchmark_results/reverse_kl_eager/reverse_kl_eager_light_latency.svg DELETED

benchmark_results/reverse_kl_eager/reverse_kl_eager_light_throughput.svg DELETED

build/torch-cuda/__init__.py DELETED

@@ -1,69 +0,0 @@
-"""Geometric-AI CuteDSL kernels for RL / distillation training.
-Public surface:
-    * ``bnpo_loss`` / ``bnpo_loss_autograd`` / ``bnpo_loss_fwd`` —
-      fused fwd+bwd BNPO loss with three entry points (direct
-      ``(loss, grad)``, autograd-wrapped, forward-only).
-    * ``grpo_loss`` / ``grpo_loss_autograd`` / ``grpo_loss_fwd`` —
-      fused fwd+bwd GRPO loss (TRL's per-response normalization
-      variant). Same three-entry-point shape as BNPO. Requires
-      ``completions_mask``.
-    * ``reverse_kl`` / ``reverse_kl_autograd`` /
-      ``reverse_kl_fwd`` — fused fwd+bwd reverse-KL
-      self-distillation loss with the same three-entry-point shape.
-HF Kernels integration: :mod:`geometric_ai_kernels.layers` exposes
-``nn.Module`` adapters per kernel (``bnpoLoss`` / ``bnpoLossInference``,
-``grpoLoss`` / ``grpoLossInference``, ``ReverseKL`` /
-``ReverseKLInference``) for use with the ``kernels``
-library's ``kernelize()`` flow.
-"""
-from __future__ import annotations
-import torch._dynamo
-from .bnpo_loss import bnpo_loss, bnpo_loss_autograd, bnpo_loss_fwd
-from .grpo_loss import grpo_loss, grpo_loss_autograd, grpo_loss_fwd
-from .layers import (
-    ReverseKL,
-    ReverseKLInference,
-    bnpoLoss,
-    bnpoLossInference,
-    grpoLoss,
-    grpoLossInference,
-)
-from .reverse_kl import (
-    reverse_kl,
-    reverse_kl_autograd,
-    reverse_kl_fwd,
-)
-# Required so ``torch.compile(fullgraph=True)`` can trace through
-# ``torch.autograd.grad`` calls — without it Dynamo graph-breaks at the
-# autograd.grad call site even when AOTAutograd has already derived the
-# joint fwd+bwd graph. Set at package import so any consumer (benches,
-# user training loops, ``kernelize`` flows) gets it for free. Guarded
-# because ``trace_autograd_ops`` was added in torch 2.10 and the
-# Nix-pinned build environment may be on an older torch (2.9 today);
-# the underlying ``Config.__setattr__`` raises on unknown keys.
-if hasattr(torch._dynamo.config, "trace_autograd_ops"):
-    torch._dynamo.config.trace_autograd_ops = True  # ty: ignore[invalid-assignment]
-__all__ = [
-    "ReverseKL",
-    "ReverseKLInference",
-    "bnpoLoss",
-    "bnpoLossInference",
-    "bnpo_loss",
-    "bnpo_loss_autograd",
-    "bnpo_loss_fwd",
-    "grpoLoss",
-    "grpoLossInference",
-    "grpo_loss",
-    "grpo_loss_autograd",
-    "grpo_loss_fwd",
-    "reverse_kl",
-    "reverse_kl_autograd",
-    "reverse_kl_fwd",
-]
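
Review note: the deleted `__init__.py` guards the `trace_autograd_ops` assignment with `hasattr` because, per its comment, the config object raises on unknown keys. The sketch below reproduces that pattern with a stand-in class so it runs without PyTorch. `StrictConfig` and `enable_if_supported` are hypothetical names for illustration only and do not mirror the real `torch._dynamo.config` implementation.

```python
class StrictConfig:
    """Hypothetical stand-in for a config object that rejects unknown keys."""

    def __init__(self, **known):
        # Bypass our own __setattr__ while registering the known keys.
        object.__setattr__(self, "_known", set(known))
        for key, value in known.items():
            object.__setattr__(self, key, value)

    def __setattr__(self, name, value):
        if name not in self._known:
            raise AttributeError(f"unknown config key: {name}")
        object.__setattr__(self, name, value)


def enable_if_supported(config, key):
    """Set config.<key> = True only when the key exists; report whether it was set."""
    if hasattr(config, key):
        setattr(config, key, True)
        return True
    return False


old_config = StrictConfig(verbose=False)                            # older build: flag absent
new_config = StrictConfig(verbose=False, trace_autograd_ops=False)  # newer build: flag present

old_applied = enable_if_supported(old_config, "trace_autograd_ops")
new_applied = enable_if_supported(new_config, "trace_autograd_ops")
print(old_applied, new_applied, new_config.trace_autograd_ops)
```

The guard keeps package import working on builds that lack the flag, at the cost of silently skipping the setting there.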

build/torch-cuda/_ops.py DELETED

@@ -1,38 +0,0 @@
-import torch
-def get_backend() -> str:
-    """Detect the backend by inspecting torch."""
-    import torch
-    if hasattr(torch, "neuron"):
-        # Needs to be sorted before specific Torch builds, since Neuron
-        # extension can be loaded into e.g. CUDA Torch builds.
-        return "neuron"
-    elif torch.version.cuda is not None:
-        return "cuda"
-    elif torch.version.hip is not None:
-        return "rocm"
-    elif torch.backends.mps.is_available():
-        return "metal"
-    elif hasattr(torch.version, "xpu") and torch.version.xpu is not None:
-        return "xpu"
-    else:
-        return "cpu"
-def _find_ops_name() -> str:
-    kernel_name = "geometric_ai_kernels"
-    unique_id = "a766fbd_dirty"
-    backend = get_backend()
-    return f"_{kernel_name}_{backend}_{unique_id}"
-_OPS_NAME = _find_ops_name()
-ops = getattr(torch.ops, _OPS_NAME)
-def add_op_namespace_prefix(op_name: str) -> str:
-    """
-    Prefix op by namespace.
-    """
-    return f"{_OPS_NAME}::{op_name}"
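
Review note: the deleted `_ops.py` builds the op namespace from the kernel name, the detected backend and a build identifier, then prefixes op names with it. The sketch below restates that string construction as pure functions with the backend passed in explicitly, so the result can be inspected without importing torch. The function names and the op name `example_op` are hypothetical.

```python
def ops_namespace(kernel_name, backend, unique_id):
    """Build the namespace string in the same shape as _find_ops_name above."""
    return f"_{kernel_name}_{backend}_{unique_id}"


def with_namespace(namespace, op_name):
    """Prefix an op name with its namespace, as add_op_namespace_prefix does."""
    return f"{namespace}::{op_name}"


# Values copied from the deleted file; "cuda" is what get_backend() would
# return on a CUDA build of torch that has no Neuron extension loaded.
namespace = ops_namespace("geometric_ai_kernels", "cuda", "a766fbd_dirty")
qualified = with_namespace(namespace, "example_op")  # hypothetical op name
print(namespace)
print(qualified)
```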

build/torch-cuda/bnpo_loss/__init__.py DELETED

@@ -1,196 +0,0 @@
-"""bnpo loss with CuteDSL fused fwd+bwd.
-Two public APIs route to two compiled kernels:
-* :func:`bnpo_loss` — primary training entry point. Returns
-  ``(loss, grad_policy_logprobs)`` from a single fused fwd+bwd kernel
-  launch. Inputs do **not** need ``requires_grad=True`` and there is no
-  ``torch.autograd.Function`` wrapper — chain the gradient into the
-  upstream model with ``policy_logprobs.backward(grad)`` (or, more
-  commonly, by passing ``grad`` to whatever step does the next leg of
-  backprop).
-* :func:`bnpo_loss_fwd` — inference / validation path. Returns the
-  scalar ``loss`` from a forward-only kernel that computes the masked
-  mean denominator on-GPU via a last-block trick (no host
-  ``completions_mask.sum()``).
-The two share the same compiled-kernel cache; per-call output and
-gradient buffers are allocated inside the runner, and cross-CTA scratch
-(atomic accumulators + counters) is owned by the compiled-kernel
-closure and self-resets each launch — callers don't manage scratch.
-Why no autograd wrapper here? bnpo's gradient is closed-form — the
-kernel already writes ``dL/d(policy_logprobs)`` in the same launch as
-the loss. Wrapping in ``torch.autograd.Function`` would cost an extra
-``grad_output * dpolicy`` kernel launch on backward (typically a
-no-op multiply by ``1.0``), plus per-call autograd graph bookkeeping.
-The autograd-aware sibling :func:`bnpo_loss_autograd` uses
-``torch.library.custom_op`` instead, which composes with
-``torch.compile``.
-"""
-from __future__ import annotations
-from functools import lru_cache
-from typing import TYPE_CHECKING, cast
-import torch
-from .cute_bnpo_loss import (
-    create_compiled_bnpo_loss,
-    create_compiled_bnpo_loss_with_backward,
-)
-if TYPE_CHECKING:
-    from collections.abc import Callable
-__all__ = ["bnpo_loss", "bnpo_loss_autograd", "bnpo_loss_fwd"]
-@lru_cache(maxsize=32)
-def _get_compiled_fwd(
-    dtype: torch.dtype,
-    epsilon: float,
-    epsilon_high: float,
-    beta: float,
-) -> Callable[..., torch.Tensor]:
-    return cast(
-        "Callable[..., torch.Tensor]",
-        create_compiled_bnpo_loss(
-            policy_dtype=dtype,
-            epsilon=epsilon,
-            epsilon_high=epsilon_high,
-            beta=beta,
-            compute_backward=False,
-        ),
-    )
-@lru_cache(maxsize=32)
-def _get_compiled_fwd_bwd(
-    dtype: torch.dtype,
-    epsilon: float,
-    epsilon_high: float,
-    beta: float,
-) -> Callable[..., tuple[torch.Tensor, torch.Tensor]]:
-    return create_compiled_bnpo_loss_with_backward(
-        policy_dtype=dtype,
-        epsilon=epsilon,
-        epsilon_high=epsilon_high,
-        beta=beta,
-    )
-def bnpo_loss_fwd(
-    policy_logprobs: torch.Tensor,
-    old_policy_logprobs: torch.Tensor,
-    ref_logprobs: torch.Tensor,
-    advantages: torch.Tensor,
-    completions_mask: torch.Tensor,
-    epsilon: float = 0.2,
-    epsilon_high: float = 0.2,
-    beta: float = 0.1,
-) -> torch.Tensor:
-    """Forward-only bnpo loss. Returns the scalar ``loss``.
-    Use for inference / validation. The masked mean denominator is
-    computed on-GPU by an atomic accumulator + last-block trick — no
-    host ``completions_mask.sum()`` syncs.
-    Args:
-        policy_logprobs, old_policy_logprobs, ref_logprobs: ``(bs, seq_len)``.
-        advantages: ``(bs,)``.
-        completions_mask: bool/int8 mask ``(bs, seq_len)``; truthy = valid token.
-        epsilon, epsilon_high: PPO-style clipping bounds.
-        beta: KL-penalty coefficient. ``0.0`` compiles away the KL branch.
-    Returns:
-        Scalar tensor (0-dim) with the same dtype as ``policy_logprobs``.
-    """
-    run = _get_compiled_fwd(
-        policy_logprobs.dtype,
-        float(epsilon),
-        float(epsilon_high),
-        float(beta),
-    )
-    mask_arg = (
-        completions_mask
-        if completions_mask.dtype == torch.int8
-        else completions_mask.to(torch.int8)
-    )
-    return run(
-        policy_logprobs,
-        old_policy_logprobs,
-        ref_logprobs,
-        advantages,
-        mask_arg,
-    )
-def bnpo_loss(
-    policy_logprobs: torch.Tensor,
-    old_policy_logprobs: torch.Tensor,
-    ref_logprobs: torch.Tensor,
-    advantages: torch.Tensor,
-    completions_mask: torch.Tensor,
-    epsilon: float = 0.2,
-    epsilon_high: float = 0.2,
-    beta: float = 0.1,
-) -> tuple[torch.Tensor, torch.Tensor]:
-    """Fused fwd+bwd bnpo loss. Returns ``(loss, grad_policy_logprobs)``.
-    Single-launch training entry point. The kernel writes both the
-    scalar loss and the scaled ``dL/d(policy_logprobs)`` tensor in one
-    ``@cute.jit`` dispatch — a bundled mask-sum kernel runs inside the
-    same launch so ``inv_total`` is populated on-GPU without a host-side
-    ``torch.sum`` round trip.
-    Inputs do **not** need ``requires_grad=True``. To chain ``grad``
-    into the upstream model that produced ``policy_logprobs``::
-        loss, grad = bnpo_loss(policy_logprobs, ..., completions_mask=mask)
-        policy_logprobs.backward(grad)
-        optimizer.step()
-    Args:
-        policy_logprobs, old_policy_logprobs, ref_logprobs: ``(bs, seq_len)``.
-        advantages: ``(bs,)``.
-        completions_mask: bool/int8 mask ``(bs, seq_len)``.
-        epsilon, epsilon_high: PPO-style clipping bounds.
-        beta: KL-penalty coefficient. ``0.0`` compiles away the KL branch.
-    Returns:
-        ``(loss, grad_policy_logprobs)`` — ``loss`` is a 0-dim tensor in
-        ``policy_logprobs.dtype``; ``grad_policy_logprobs`` has shape
-        ``(bs, seq_len)`` and is already scaled by ``1 / n_valid``. The
-        gradient tensor is freshly allocated per call (no shared cache),
-        so callers may keep it around freely.
-    For inference / validation where you only need the loss, use
-    :func:`bnpo_loss_fwd` — it skips the dpolicy write entirely and
-    computes the mean denominator with the on-GPU last-block trick.
-    """
-    run = _get_compiled_fwd_bwd(
-        policy_logprobs.dtype,
-        float(epsilon),
-        float(epsilon_high),
-        float(beta),
-    )
-    mask_arg = (
-        completions_mask
-        if completions_mask.dtype == torch.int8
-        else completions_mask.to(torch.int8)
-    )
-    return run(
-        policy_logprobs,
-        old_policy_logprobs,
-        ref_logprobs,
-        advantages,
-        mask_arg,
-    )
-# Imported at the bottom: ``autograd.py`` imports ``bnpo_loss`` from this
-# module, so the function must be fully defined before its import runs.
-from .autograd import bnpo_loss_autograd  # noqa: E402
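
Review note: `_get_compiled_fwd` and `_get_compiled_fwd_bwd` above are `lru_cache`-wrapped factories keyed on `(dtype, epsilon, epsilon_high, beta)`, so each distinct hyperparameter combination compiles once and later calls reuse the runner. The sketch below shows that behavior with a placeholder in place of the real CuteDSL compilation. `get_runner` and `compile_calls` are hypothetical, and the string `"float16"` stands in for a `torch.dtype`.

```python
from functools import lru_cache

compile_calls = []  # records every simulated compilation


@lru_cache(maxsize=32)
def get_runner(dtype, epsilon, epsilon_high, beta):
    """Simulate an expensive kernel compilation, cached per argument tuple."""
    compile_calls.append((dtype, epsilon, epsilon_high, beta))

    def run(values):
        # Placeholder body; the real runner would launch the compiled kernel.
        return values

    return run


first = get_runner("float16", 0.2, 0.2, 0.1)
second = get_runner("float16", 0.2, 0.2, 0.1)  # cache hit: same runner object
no_kl = get_runner("float16", 0.2, 0.2, 0.0)   # different beta: new specialization

print(first is second, no_kl is first, len(compile_calls))
```

Because the hyperparameters are part of the cache key, sweeping `epsilon` or `beta` over many values would compile one kernel per value and, past 32 distinct combinations, start evicting older entries.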

build/torch-cuda/bnpo_loss/_torch_ref.py DELETED

@@ -1,56 +0,0 @@
-"""Plain-PyTorch bnpo reference shared between the bench and the tests.
-This module is intentionally minimal — every op is a vanilla torch op so
-``AOTAutograd`` can derive the joint fwd+bwd graph and Inductor can fuse
-both passes (used by ``benchmarks/benchmark_bnpo_loss.py``'s compiled
-baseline). The same function is imported by ``tests/test_bnpo_loss.py``
-as the correctness reference, so both paths agree on what "the eager
-torch implementation of bnpo loss" means.
-Underscore-prefixed module name signals "shared internal", not a public
-API surface — there's no re-export from the package's top-level
-``__init__.py``.
-"""
-from __future__ import annotations
-import torch
-def torch_bnpo_loss(
-    policy_logprobs: torch.Tensor,
-    old_policy_logprobs: torch.Tensor,
-    ref_logprobs: torch.Tensor,
-    advantages: torch.Tensor,
-    completions_mask: torch.Tensor,
-    epsilon: float = 0.2,
-    epsilon_high: float = 0.2,
-    beta: float = 0.1,
-) -> torch.Tensor:
-    """Plain-Python bnpo reference traceable by AOTAutograd / Inductor.
-    Operates in the input dtype throughout (no internal fp32 cast),
-    which is what real torch users would write — and what
-    ``torch.compile`` competes against in the bench.
-    """
-    ratio = torch.exp(policy_logprobs - old_policy_logprobs)
-    adv = advantages.unsqueeze(1)
-    surrogate = ratio * adv
-    surrogate_clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon_high) * adv
-    policy_loss = -torch.min(surrogate, surrogate_clipped)
-    log_ratio_ref = ref_logprobs - policy_logprobs
-    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0
-    # Cast n_valid to fp32: int64 → fp16 overflows when n_valid > 65504.
-    # ``clamp(min=1.0)`` matches TRL's ``mask.sum().clamp(min=1)``: a
-    # fully-masked batch produces ``loss=0`` instead of inf/NaN. Mirrors
-    # the cute kernel's ``cute.arch.fmax(..., 1.0)`` before ``rcp_approx``
-    # in ``cute_bnpo_loss.py``.
-    n_valid = completions_mask.sum().to(torch.float32).clamp(min=1.0)
-    policy_loss = (policy_loss * completions_mask).sum() / n_valid
-    kl = (kl * completions_mask).sum() / n_valid
-    loss = policy_loss + beta * kl
-    return loss.to(policy_logprobs.dtype)
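
Review note: the arithmetic in `torch_bnpo_loss` can be checked by hand on a tiny input. The sketch below re-implements the same formula over nested Python lists, in double precision and without torch, so it is a reading aid rather than a replacement for the reference. `bnpo_loss_scalar` is a hypothetical name, and skipping masked tokens is assumed equivalent to multiplying by a zero mask, which holds for finite inputs.

```python
import math


def bnpo_loss_scalar(policy, old, ref, advantages, mask,
                     epsilon=0.2, epsilon_high=0.2, beta=0.1):
    """List-based restatement of torch_bnpo_loss for small hand-checked inputs."""
    policy_sum = 0.0
    kl_sum = 0.0
    n_valid = 0
    for row, adv in enumerate(advantages):
        for col, keep in enumerate(mask[row]):
            if not keep:
                continue  # masked tokens contribute nothing to either sum
            ratio = math.exp(policy[row][col] - old[row][col])
            clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon_high)
            policy_sum += -min(ratio * adv, clipped * adv)
            log_ratio_ref = ref[row][col] - policy[row][col]
            kl_sum += math.exp(log_ratio_ref) - log_ratio_ref - 1.0
            n_valid += 1
    denom = max(float(n_valid), 1.0)  # mirrors clamp(min=1.0) in the reference
    return policy_sum / denom + beta * kl_sum / denom


logp = [[-1.0, -1.0, -1.0], [-1.0, -1.0, -1.0]]
adv = [1.0, -0.5]
mask = [[1, 1, 0], [1, 1, 1]]

# Identical policy/old/ref log-probs: ratio is 1 and the KL term is 0,
# so the loss reduces to minus the mean advantage over valid tokens.
loss_equal = bnpo_loss_scalar(logp, logp, logp, adv, mask)
# Fully masked batch: the clamp on the denominator yields 0 instead of NaN.
loss_masked = bnpo_loss_scalar(logp, logp, logp, adv, [[0, 0, 0], [0, 0, 0]])
print(loss_equal, loss_masked)
```

With two valid tokens at advantage 1.0 and three at -0.5, the masked sum is -0.5 over five tokens, so the first call evaluates to about -0.1.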

build/torch-cuda/bnpo_loss/autograd.py DELETED

@@ -1,149 +0,0 @@
-"""Autograd-aware wrapper for bnpo loss via ``torch.library.custom_op``.
-The fused cute kernel writes both the scalar loss and the closed-form
-``dL/d(policy_logprobs)`` in one launch. This module wraps that into an
-autograd-compatible op so callers can write::
-    loss = bnpo_loss_autograd(policy, old, ref, adv, completions_mask)
-    loss.backward()  # propagates through to the upstream model
-instead of the manual ``policy.backward(grad)`` chain. The cost is
-~12µs of autograd dispatcher overhead per call (vs the direct
-``bnpo_loss`` ``(loss, grad)`` tuple); for ergonomic / kernelize() flows
-that's cheap, but for tight microbenches use the direct path.
-Implementation notes:
-- The registered op returns ``(loss, dpolicy)`` so ``setup_context`` can
-  ``save_for_backward(dpolicy)``. The public ``bnpo_loss_autograd``
-  wrapper hides the second output.
-- ``dpolicy`` is allocated fresh by the runner on every call (no shared
-  cache), so ``ctx.save_for_backward(dpolicy)`` keeps a stable reference
-  across subsequent calls without any extra copy.
-- Backward returns ``grad_loss * dpolicy``. Under ``torch.compile``,
-  when ``loss`` is consumed by ``.backward()`` directly, ``grad_loss``
-  is the constant 1.0 and Inductor can fold the multiply away — that's
-  the main reason this path uses ``custom_op`` instead of a plain
-  ``autograd.Function``.
-- ``register_fake`` provides the meta kernel for ``torch.compile`` shape
-  propagation; the real cute kernel never runs under ``FakeTensorMode``.
-"""
-from __future__ import annotations
-import torch
-from . import bnpo_loss as _bnpo_loss_fwd_bwd
-__all__ = ["bnpo_loss_autograd"]
-@torch.library.custom_op(
-    "geometric_ai_kernels::_bnpo_loss_with_grad",
-    mutates_args=(),
-)
-def _bnpo_loss_with_grad(
-    policy_logprobs: torch.Tensor,
-    old_policy_logprobs: torch.Tensor,
-    ref_logprobs: torch.Tensor,
-    advantages: torch.Tensor,
-    completions_mask: torch.Tensor,
-    epsilon: float,
-    epsilon_high: float,
-    beta: float,
-) -> tuple[torch.Tensor, torch.Tensor]:
-    loss, dpolicy = _bnpo_loss_fwd_bwd(
-        policy_logprobs,
-        old_policy_logprobs,
-        ref_logprobs,
-        advantages,
-        completions_mask,
-        epsilon=epsilon,
-        epsilon_high=epsilon_high,
-        beta=beta,
-    )
-    return loss, dpolicy
-@_bnpo_loss_with_grad.register_fake
-def _(
-    policy_logprobs: torch.Tensor,
-    old_policy_logprobs: torch.Tensor,
-    ref_logprobs: torch.Tensor,
-    advantages: torch.Tensor,
-    completions_mask: torch.Tensor,
-    epsilon: float,
-    epsilon_high: float,
-    beta: float,
-) -> tuple[torch.Tensor, torch.Tensor]:
-    # Signature must mirror the op; only ``policy_logprobs`` shapes the outputs.
-    del old_policy_logprobs, ref_logprobs, advantages, completions_mask
-    del epsilon, epsilon_high, beta
-    loss = policy_logprobs.new_empty(())
-    dpolicy = torch.empty_like(policy_logprobs)
-    return loss, dpolicy
-def _setup_context(ctx, inputs, output) -> None:  # type: ignore[no-untyped-def]
-    del inputs  # only ``output`` carries what we need to save.
-    _, dpolicy = output
-    ctx.save_for_backward(dpolicy)
-def _backward(ctx, grad_loss, grad_dpolicy):  # type: ignore[no-untyped-def]
-    # ``grad_dpolicy`` is unused — ``dpolicy`` is an internal intermediate
-    # exposed only so ``setup_context`` can save it. Under typical usage
-    # (``loss.backward()``) it arrives as ``None`` or a zero tensor.
-    del grad_dpolicy
-    (dpolicy,) = ctx.saved_tensors
-    grad_policy = grad_loss * dpolicy
-    # One return per input to the op (8): policy_logprobs gets the grad,
-    # everything else gets None (no autograd flow).
-    return grad_policy, None, None, None, None, None, None, None
-torch.library.register_autograd(
-    "geometric_ai_kernels::_bnpo_loss_with_grad",
-    _backward,
-    setup_context=_setup_context,
-)
-def bnpo_loss_autograd(
-    policy_logprobs: torch.Tensor,
-    old_policy_logprobs: torch.Tensor,
-    ref_logprobs: torch.Tensor,
-    advantages: torch.Tensor,
-    completions_mask: torch.Tensor,
-    epsilon: float = 0.2,
-    epsilon_high: float = 0.2,
-    beta: float = 0.1,
-) -> torch.Tensor:
-    """Autograd-aware bnpo loss. Returns scalar ``loss``.
-    Same numerics as :func:`bnpo_loss` but registered as a
-    ``torch.library`` custom op with autograd, so::
-        loss = bnpo_loss_autograd(policy, ..., completions_mask)
-        loss.backward()
-    propagates through to whatever produced ``policy_logprobs``. For
-    direct ``(loss, grad)`` access without the autograd dispatcher
-    overhead, use :func:`bnpo_loss` and chain the gradient manually
-    via ``policy_logprobs.backward(grad)``.
-    Composes with ``torch.compile``: the op is opaque to Inductor but
-    has a fake/meta kernel registered, so models containing this layer
-    can be compiled end-to-end without graph breaks.
-    """
-    loss, _ = _bnpo_loss_with_grad(
-        policy_logprobs,
-        old_policy_logprobs,
-        ref_logprobs,
-        advantages,
-        completions_mask,
-        float(epsilon),
-        float(epsilon_high),
-        float(beta),
-    )
-    return loss
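
The contract described in the deleted module above is that one pass yields both the scalar loss and `dL/d(policy_logprobs)`, and backward only scales the saved gradient by `grad_loss`. It can be checked numerically without torch or CUDA. The following is a pure-Python sketch for a single row, assuming the loss definition from the reference implementation; `fused_loss_and_grad` and `backward` are hypothetical names, and the closed-form derivative is worked out here from that definition rather than copied from the kernel.

```python
import math


def fused_loss_and_grad(policy, old, ref, adv, mask, eps=0.2, eps_high=0.2, beta=0.1):
    """One pass over a single row: return (loss, dpolicy)."""
    n_valid = max(float(sum(mask)), 1.0)
    loss = 0.0
    dpolicy = []
    for p, o, r, m in zip(policy, old, ref, mask):
        ratio = math.exp(p - o)
        surrogate = ratio * adv
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps_high) * adv
        ratio_ref = math.exp(r - p)
        kl = ratio_ref - (r - p) - 1.0
        loss += m * (-min(surrogate, clipped) + beta * kl) / n_valid
        # d(-min)/dp is -surrogate when the unclipped branch is selected, else 0;
        # d(kl)/dp is 1 - ratio_ref.
        surrogate_grad = -surrogate if surrogate <= clipped else 0.0
        dpolicy.append(m * (surrogate_grad + beta * (1.0 - ratio_ref)) / n_valid)
    return loss, dpolicy


def backward(saved_dpolicy, grad_loss=1.0):
    """Mirror of the backward contract: scale the saved gradient by grad_loss."""
    return [grad_loss * g for g in saved_dpolicy]


def finite_difference(args, index, h=1e-6):
    """Central difference of the loss with respect to policy[index]."""
    policy, old, ref, adv, mask = args
    up = list(policy)
    down = list(policy)
    up[index] += h
    down[index] -= h
    loss_up, _ = fused_loss_and_grad(up, old, ref, adv, mask)
    loss_down, _ = fused_loss_and_grad(down, old, ref, adv, mask)
    return (loss_up - loss_down) / (2.0 * h)


# Token 0 is inside the clip range, token 1 is clipped, token 2 is masked out.
example = ([-1.0, -0.5, -2.0], [-1.1, -1.0, -2.0], [-0.9, -0.7, -1.5], 1.5, [1, 1, 0])
loss, dpolicy = fused_loss_and_grad(*example)
grads = backward(dpolicy)
errors = [abs(grads[i] - finite_difference(example, i)) for i in range(3)]
```

Each entry of `errors` should be at rounding-noise level. The example inputs avoid the clip boundaries, where the loss has a kink and a central difference would not match either one-sided derivative.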

build/torch-cuda/bnpo_loss/cute_bnpo_loss.py DELETED

@@ -1,1081 +0,0 @@
-"""CuteDSL kernel for bnpo loss.
-Computes (element-wise over ``(bs, seq_len)`` logprob tensors, reduced to a
-scalar):
-    ratio          = exp(policy - old_policy)
-    surrogate      = ratio * adv
-    clipped        = clip(ratio, 1 - eps, 1 + eps_high) * adv
-    policy_loss    = -min(surrogate, clipped)
-    log_ratio_ref  = ref - policy
-    kl             = exp(log_ratio_ref) - log_ratio_ref - 1
-    L_bnpo         = (policy_loss * mask).sum() / n_valid
-                     + beta * (kl * mask).sum() / n_valid
-where ``n_valid = max(completions_mask.sum(), 1)``. The mean denominator is
-computed entirely on-GPU — the forward-only path uses an atomic accumulator
-+ last-block trick on ``valid_acc``; the fused fwd+bwd path bundles a small
-companion mask-sum kernel into the same ``@cute.jit`` launch that writes
-``1 / n_valid`` into the ``inv_total`` GMEM scalar before the
-main kernel reads it. Every block needs ``inv_total`` mid-loop to scale its
-``dpolicy`` slab, so the fwd-only last-block trick doesn't compose with
-backward; bundling the mask-sum keeps both paths host-sync-free and CUDA-graph
-compatible.
-When ``beta=0`` the KL term is skipped at compile time (no ``ref`` tensor
-access, no ``kl_acc`` atomic add).
-Sequence lengths that are **not** a multiple of ``TILE_N`` are handled
-natively: the grid launches ``ceil(seq_len / TILE_N)`` column tiles; full tiles
-use the vectorized ``LDG.128`` path and the tail tile uses predicated vector
-loads with neutral prefill.
-Two compiled-kernel flavors are exposed:
-* :func:`create_compiled_bnpo_loss` — forward-only.
-* :func:`create_compiled_bnpo_loss_with_backward` — fused fwd+bwd. Returns
-  ``(loss, dpolicy)`` directly — no ``torch.autograd.Function`` wrapper. The
-  autograd-aware sibling lives in ``autograd.py`` and uses
-  ``torch.library.custom_op`` instead.
-Per-call output (``loss``, ``dpolicy``, ``inv_total``) is allocated inside the
-runner. Cross-CTA scratch (atomic accumulators + counters) is allocated lazily
-on first call inside the compiled-kernel closure and self-resets each launch
-via the kernel's last-block epilogue + ``atom.inc.u32`` wrap-around — callers
-don't manage scratch state.
-"""
-from __future__ import annotations
-import math
-import operator
-from typing import TYPE_CHECKING, Any
-from typing import cast as _typing_cast
-import cutlass
-import cutlass.utils
-import torch
-from cutlass import cute
-from cutlass._mlir.dialects import llvm
-from cutlass.base_dsl.typing import cast
-from cutlass.cutlass_dsl import T, dsl_user_op
-if TYPE_CHECKING:
-    from collections.abc import Callable
-TILE_N: int = 512
-NUM_WARPS: int = 4
-# ``VEC=4`` (fp32) emits 128-bit ``LDG.128``. Pairs with ``NUM_WARPS=4`` so
-# each block processes ``block_size * VEC = 512 = TILE_N`` elements per iter.
-VEC: int = 4
-# Large-tile variant: at very long ``seq_len`` the small-TILE_N grid
-# explodes (e.g. 8192/512 = 16 col-tiles per row → thousands of CTAs),
-# inflating last-block-detection latency and atomic contention. A second
-# compiled variant with this larger tile is dispatched when
-# ``seq_len >= TILE_N_LARGE_THRESHOLD``.
-TILE_N_LARGE: int = 4096
-TILE_N_LARGE_THRESHOLD: int = 2048
-_LOG2_E: float = math.log2(math.e)
-_TORCH_TO_CUTLASS_DTYPE: dict[torch.dtype, Any] = {
-    torch.float32: cutlass.Float32,
-    torch.float16: cutlass.Float16,
-    torch.bfloat16: cutlass.BFloat16,
-}
-@dsl_user_op
-def _atomic_add_f32_gmem(
-    ptr_i64: Any,
-    val: cutlass.Float32,
-    *,
-    loc: Any = None,
-    ip: Any = None,
-) -> None:
-    """Emit ``atom.global.add.f32`` to a 64-bit GMEM address."""
-    llvm.inline_asm(
-        T.f32(),
-        [ptr_i64, cutlass.Float32(val).ir_value(loc=loc, ip=ip)],
-        "atom.global.add.f32 $0, [$1], $2;",
-        "=f,l,f",
-        has_side_effects=True,
-        is_align_stack=False,
-        asm_dialect=llvm.AsmDialect.AD_ATT,
-    )
-@dsl_user_op
-def _atomic_add_s32_gmem(
-    ptr_i64: Any,
-    val: cutlass.Int32,
-    *,
-    loc: Any = None,
-    ip: Any = None,
-) -> None:
-    """Emit ``atom.global.add.s32`` to a 64-bit GMEM address."""
-    llvm.inline_asm(
-        T.i32(),
-        [ptr_i64, cutlass.Int32(val).ir_value(loc=loc, ip=ip)],
-        "atom.global.add.s32 $0, [$1], $2;",
-        "=r,l,r",
-        has_side_effects=True,
-        is_align_stack=False,
-        asm_dialect=llvm.AsmDialect.AD_ATT,
-    )
-@dsl_user_op
-def _dp4a_u32_acc_s32(
-    packed_a: cutlass.Uint32,
-    packed_b: cutlass.Uint32,
-    acc: cutlass.Int32,
-    *,
-    loc: Any = None,
-    ip: Any = None,
-) -> cutlass.Int32:
-    """``dp4a.u32.u32`` — sum 4 packed u8 products into an s32 acc.
-    Computes ``a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3] + acc`` in
-    one ``IDP4A.U8.S32`` instruction (full-rate on Hopper/Blackwell).
-    For mask summation, pass ``packed_b = 0x01010101`` so the products
-    reduce to ``sum(a_bytes) + acc`` — 4× fewer ALU ops than 4 separate
-    int8→int32 widens + adds.
-    """
-    return cutlass.Int32(
-        llvm.inline_asm(
-            T.i32(),
-            [
-                cutlass.Uint32(packed_a).ir_value(loc=loc, ip=ip),
-                cutlass.Uint32(packed_b).ir_value(loc=loc, ip=ip),
-                cutlass.Int32(acc).ir_value(loc=loc, ip=ip),
-            ],
-            "dp4a.u32.u32 $0, $1, $2, $3;",
-            "=r,r,r,r",
-            has_side_effects=False,
-            is_align_stack=False,
-            asm_dialect=llvm.AsmDialect.AD_ATT,
-        )
-    )
-@dsl_user_op
-def _atomic_inc_u32_gmem(
-    ptr_i64: Any,
-    threshold: cutlass.Int32,
-    *,
-    loc: Any = None,
-    ip: Any = None,
-) -> cutlass.Int32:
-    """``atom.global.inc.u32`` — returns old value; wraps to 0 at threshold."""
-    return cutlass.Int32(
-        llvm.inline_asm(
-            T.i32(),
-            [ptr_i64, cutlass.Int32(threshold).ir_value(loc=loc, ip=ip)],
-            "atom.global.inc.u32 $0, [$1], $2;",
-            "=r,l,r",
-            has_side_effects=True,
-            is_align_stack=False,
-            asm_dialect=llvm.AsmDialect.AD_ATT,
-        )
-    )
-# ---------------------------------------------------------------------------
-# Mask-sum kernel — replaces ``torch.sum(completions_mask)`` on the fwd+bwd
-# path. Bundled into the same ``@cute.jit`` launch as the main kernel so the
-# whole step is one tvm-ffi dispatch (no extra Python/torch dispatcher round
-# trip). The kernel writes ``1 / max(completions_mask.sum(), 1)`` directly into
-# ``inv_total_tensor`` so the main kernel reads it as a pre-inverted scalar.
-# ---------------------------------------------------------------------------
-def _make_mask_sum_kernel(tile_n: int) -> Callable[..., None]:
-    """Return a ``@cute.kernel`` that reduces ``completions_mask`` and writes 1/sum.
-    Grid mirrors the main kernel — ``(bs, num_col_tiles)`` — so the mask is
-    read once with the same vectorised LDG pattern as the main compute.
-    Each block:
-    1. Loads its ``tile_n`` int8 slab of ``completions_mask`` (predicated tail).
-    2. Reduces to a per-block ``int32`` scalar (bit-exact, no per-element
-       i8→f32 cast — IADD throughput equals FADD on Hopper/Blackwell).
-    3. Atomically adds it to ``valid_acc`` (global int32 accumulator).
-    4. Increments ``mask_counter``; the last block reads ``valid_acc``,
-       casts to fp32, computes ``rcp_approx`` and writes
-       ``inv_total_tensor[0]``, then resets ``valid_acc`` to ``0`` so
-       the next call starts fresh. The counter self-resets via
-       ``atom.inc.u32`` wrap-around.
-    A separate ``mask_counter`` tensor (not the main kernel's ``counter``)
-    is required because the two kernels run in series within the same
-    ``@cute.jit`` and both rely on a wrap-around for self-reset; sharing
-    one counter would race.
-    """
-    @cute.kernel
-    def _mask_sum_kernel(
-        completions_mask: cute.Tensor,  # (bs, seq_len) int8
-        inv_total_tensor: cute.Tensor,  # (1,) fp32 — output
-        valid_acc: cute.Tensor,  # (1,) int32 — accumulator
-        mask_counter: cute.Tensor,  # (1,) i32 — last-block detection
-        total_blocks: cutlass.Int32,
-        num_full_tiles: cutlass.Int32,
-        tail_len: cutlass.Int32,
-    ) -> None:
-        block_size = NUM_WARPS * 32
-        iters = tile_n // (block_size * VEC)
-        _no_alloc = cute.nvgpu.CacheEvictionPriority.NO_ALLOCATE
-        g2r_op = cute.nvgpu.CopyUniversalOp()
-        g2r_mask_atom = cute.make_copy_atom(
-            g2r_op,
-            completions_mask.element_type,
-            num_bits_per_copy=0,
-            l1c_evict_priority=_no_alloc,
-        )
-        row = cute.arch.block_idx()[0]
-        col_block = cute.arch.block_idx()[1]
-        tid = cute.arch.thread_idx()[0]
-        local_valid_sum = cutlass.Int32(0)
-        mask_row = cute.slice_(completions_mask, (row, None))
-        # ``dp4a.u32.u32`` consumes a packed-u8x4 register. With VEC=4 each
-        # thread loads 4 contiguous int8 bytes per iteration, so we recast
-        # the fragment as a single ``Uint32`` view and feed it directly
-        # into dp4a — one instruction sums all 4 bytes, vs the previous
-        # cast+reduce which emitted 4 widens + 3 adds per iteration.
-        ones_packed = cutlass.Uint32(0x01010101)
-        if col_block < num_full_tiles:
-            mask_slab = cute.local_tile(mask_row, (tile_n,), (col_block,))
-            for k in cutlass.range(iters, unroll_full=True):
-                sub_idx = tid + k * block_size
-                mask_src = cute.local_tile(mask_slab, (VEC,), (sub_idx,))
-                mask_frag = cute.make_fragment_like(mask_src)
-                cute.copy(g2r_mask_atom, mask_src, mask_frag)
-                packed = cute.recast_tensor(mask_frag, cutlass.Uint32)[0]
-                local_valid_sum = _dp4a_u32_acc_s32(packed, ones_packed, local_valid_sum)
-        else:
-            mask_slab = cute.local_tile(mask_row, (tile_n,), (col_block,))
-            for k in cutlass.range(iters, unroll_full=True):
-                sub_idx = tid + k * block_size
-                chunk_base = sub_idx * VEC
-                if chunk_base < tail_len:
-                    mask_src = cute.local_tile(mask_slab, (VEC,), (sub_idx,))
-                    pred = cute.make_rmem_tensor(mask_src.shape, cutlass.Boolean)
-                    for v in cutlass.range(VEC, unroll_full=True):
-                        pred[v] = cute.elem_less(chunk_base + v, tail_len)
-                    mask_frag = cute.make_fragment_like(mask_src)
-                    mask_frag.fill(0)
-                    cute.copy(g2r_mask_atom, mask_src, mask_frag, pred=pred)
-                    packed = cute.recast_tensor(mask_frag, cutlass.Uint32)[0]
-                    local_valid_sum = _dp4a_u32_acc_s32(packed, ones_packed, local_valid_sum)
-        # Warp + cross-warp reduction (same pattern as main kernel).
-        warp_valid = cute.arch.warp_reduction(local_valid_sum, operator.add)
-        smem = cutlass.utils.SmemAllocator()
-        buf_valid = smem.allocate_tensor(cutlass.Int32, cute.make_layout(NUM_WARPS))
-        lane_idx = cute.arch.lane_idx()
-        warp_idx = cute.arch.warp_idx()
-        if lane_idx == 0:
-            buf_valid[warp_idx] = warp_valid
-        cute.arch.barrier()
-        if warp_idx == 0:
-            val_v = cutlass.Int32(0)
-            if lane_idx < NUM_WARPS:
-                val_v = buf_valid[lane_idx]
-            block_valid = cute.arch.warp_reduction(val_v, operator.add, threads_in_group=NUM_WARPS)
-            if lane_idx == 0:
-                valid_ptr = valid_acc.iterator.toint().ir_value()  # ty: ignore[unresolved-attribute]
-                counter_ptr = mask_counter.iterator.toint().ir_value()  # ty: ignore[unresolved-attribute]
-                _atomic_add_s32_gmem(valid_ptr, block_valid)
-                cute.arch.fence_acq_rel_gpu()
-                old = _atomic_inc_u32_gmem(counter_ptr, total_blocks - 1)
-                if old == total_blocks - 1:
-                    # Clamp to >=1.0 so a fully-masked batch (n_valid=0)
-                    # produces ``loss=0`` instead of inf/NaN — matches
-                    # TRL's ``mask.sum().clamp(min=1)`` semantics.
-                    n_valid = cute.arch.fmax(cutlass.Float32(valid_acc[0]), cutlass.Float32(1.0))
-                    inv_total_tensor[0] = cute.arch.rcp_approx(n_valid)
-                    valid_acc[0] = cutlass.Int32(0)
-    return _mask_sum_kernel
-def _make_bnpo_kernel(
-    compute_kl: bool,
-    compute_backward: bool,
-    tile_n: int,
-) -> Callable[..., None]:
-    """Return a ``@cute.kernel`` specialised on compile-time flags.
-    The returned kernel captures *compute_kl*, *compute_backward*, and
-    *tile_n* in its closure. ``cutlass.const_expr`` evaluates the booleans
-    at trace time so dead branches are eliminated from the compiled PTX.
-    ``tile_n`` is a Python ``int`` captured at trace time, so the same
-    factory can emit two specialised kernels (small / large tile) — see
-    :func:`create_compiled_bnpo_loss` for dispatch.
-    When *compute_backward* is True the kernel additionally writes
-    ``dpolicy = dL/d(policy_logprobs)`` to GMEM in the same inner loop —
-    no extra HBM reads of the inputs. Because every block must scale
-    ``dpolicy`` by ``inv_total`` mid-loop, the on-GPU last-block computation
-    of ``inv_total`` from the masked accumulator does **not** compose with
-    backward; the bundled mask-sum kernel populates ``inv_total_tensor``
-    before the main kernel runs.
-    When *compute_backward* is False the kernel accumulates the
-    mask-element count via ``valid_acc`` and computes
-    ``inv_total = 1 / n_valid`` on-GPU in the last-block path — no
-    host-side ``completions_mask.sum()`` required.
-    """
-    @cute.kernel
-    def _bnpo_loss_kernel(
-        policy: cute.Tensor,
-        old_policy: cute.Tensor,
-        ref: cute.Tensor,
-        advantages: cute.Tensor,
-        completions_mask: cute.Tensor,
-        dpolicy: cute.Tensor,  # (bs, seq_len) when compute_backward; (bs, 1) dummy otherwise
-        inv_total_tensor: cute.Tensor,  # (1,) fp32 — caller-populated 1/n_valid
-        policy_acc: cute.Tensor,
-        kl_acc: cute.Tensor,
-        valid_acc: cute.Tensor,  # (1,) int32 — mask-element count accumulator
-        counter: cute.Tensor,
-        output: cute.Tensor,
-        epsilon: cutlass.Float32,
-        epsilon_high: cutlass.Float32,
-        beta: cutlass.Float32,
-        total_blocks: cutlass.Int32,
-        num_full_tiles: cutlass.Int32,
-        tail_len: cutlass.Int32,
-    ) -> None:
-        block_size = NUM_WARPS * 32
-        iters = tile_n // (block_size * VEC)
-        # Read inv_total from GMEM once per block (hoisted, single load).
-        # Skipped on the fwd-only path which uses an on-GPU last-block
-        # computation from the valid_acc accumulator instead. On the
-        # compute_backward path the bundled mask-sum kernel writes
-        # ``1 / max(completions_mask.sum(), 1)`` into ``inv_total_tensor`` before
-        # this kernel runs, so the load returns the pre-inverted scalar.
-        accumulate_valid = not compute_backward
-        if cutlass.const_expr(not accumulate_valid):
-            inv_total = cast(inv_total_tensor[0], cutlass.Float32)
-        _no_alloc = cute.nvgpu.CacheEvictionPriority.NO_ALLOCATE
-        g2r_op = cute.nvgpu.CopyUniversalOp()
-        g2r_atom = cute.make_copy_atom(
-            g2r_op,
-            policy.element_type,
-            num_bits_per_copy=0,
-            l1c_evict_priority=_no_alloc,
-        )
-        g2r_mask_atom = cute.make_copy_atom(
-            g2r_op,
-            completions_mask.element_type,
-            num_bits_per_copy=0,
-            l1c_evict_priority=_no_alloc,
-        )
-        if cutlass.const_expr(compute_backward):
-            r2g_atom = cute.make_copy_atom(
-                g2r_op,
-                dpolicy.element_type,
-                num_bits_per_copy=0,
-            )
-        row = cute.arch.block_idx()[0]
-        col_block = cute.arch.block_idx()[1]
-        tid = cute.arch.thread_idx()[0]
-        adv = cast(advantages[row], cutlass.Float32)
-        lo = cutlass.Float32(1.0) - epsilon
-        hi = cutlass.Float32(1.0) + epsilon_high
-        local_policy_sum = cutlass.Float32(0.0)
-        local_kl_sum = cutlass.Float32(0.0)
-        # mask_vec is already cast to fp32 for loss/kl multiplications, so
-        # accumulate valid in fp32 too (avoids a separate i8→i32 reduction).
-        # Cast to int32 only at the atomic boundary so the shared
-        # ``valid_acc`` global can remain int32 — see ``_atomic_add_s32_gmem``.
-        local_valid_sum = cutlass.Float32(0.0)
-        pol_row = cute.slice_(policy, (row, None))
-        old_row = cute.slice_(old_policy, (row, None))
-        if cutlass.const_expr(compute_kl):
-            ref_row = cute.slice_(ref, (row, None))
-        mask_row = cute.slice_(completions_mask, (row, None))
-        if cutlass.const_expr(compute_backward):
-            dp_row = cute.slice_(dpolicy, (row, None))
-        # ---- Full-tile vectorised path (LDG.128) ----
-        if col_block < num_full_tiles:
-            pol_slab = cute.local_tile(pol_row, (tile_n,), (col_block,))
-            old_slab = cute.local_tile(old_row, (tile_n,), (col_block,))
-            if cutlass.const_expr(compute_kl):
-                ref_slab = cute.local_tile(ref_row, (tile_n,), (col_block,))
-            mask_slab = cute.local_tile(mask_row, (tile_n,), (col_block,))
-            if cutlass.const_expr(compute_backward):
-                dp_slab = cute.local_tile(dp_row, (tile_n,), (col_block,))
-            for k in cutlass.range(iters, unroll_full=True):
-                sub_idx = tid + k * block_size
-                pol_src = cute.local_tile(pol_slab, (VEC,), (sub_idx,))
-                old_src = cute.local_tile(old_slab, (VEC,), (sub_idx,))
-                pol_frag = cute.make_fragment_like(pol_src)
-                old_frag = cute.make_fragment_like(old_src)
-                cute.copy(g2r_atom, pol_src, pol_frag)
-                cute.copy(g2r_atom, old_src, old_frag)
-                if cutlass.const_expr(compute_kl):
-                    ref_src = cute.local_tile(ref_slab, (VEC,), (sub_idx,))
-                    ref_frag = cute.make_fragment_like(ref_src)
-                    cute.copy(g2r_atom, ref_src, ref_frag)
-                mask_src = cute.local_tile(mask_slab, (VEC,), (sub_idx,))
-                mask_frag = cute.make_fragment_like(mask_src)
-                cute.copy(g2r_mask_atom, mask_src, mask_frag)
-                pol_vec = pol_frag.load().to(cutlass.Float32)
-                old_vec = old_frag.load().to(cutlass.Float32)
-                log_ratio = pol_vec - old_vec
-                ratio = cute.math.exp2(log_ratio * _LOG2_E, fastmath=True)
-                surrogate = ratio * adv
-                clipped_ratio = cute.where(
-                    ratio < lo,
-                    lo,
-                    cute.where(ratio > hi, hi, ratio),
-                )
-                clipped = clipped_ratio * adv
-                policy_loss = -cute.where(surrogate < clipped, surrogate, clipped)
-                if cutlass.const_expr(compute_kl):
-                    ref_vec = ref_frag.load().to(cutlass.Float32)
-                    log_ratio_ref = ref_vec - pol_vec
-                    ratio_ref = cute.math.exp2(log_ratio_ref * _LOG2_E, fastmath=True)
-                    # FFMA-friendly rearrangement: ``(ratio_ref - 1) - log_ratio_ref``
-                    # exposes a ``ratio_ref + (-1)`` pair that ptxas folds with
-                    # the subsequent subtract — same arithmetic, fewer FADDs
-                    # surviving SASS than the original 3-term ``a - b - c``.
-                    kl_val = (ratio_ref - cutlass.Float32(1.0)) - log_ratio_ref
-                mask_vec = mask_frag.load().to(cutlass.Float32)
-                local_policy_sum += (policy_loss * mask_vec).reduce(
-                    cute.ReductionOp.ADD,
-                    cutlass.Float32(0.0),
-                    reduction_profile=0,
-                )
-                if cutlass.const_expr(not compute_backward):
-                    local_valid_sum += mask_vec.reduce(
-                        cute.ReductionOp.ADD,
-                        cutlass.Float32(0.0),
-                        reduction_profile=0,
-                    )
-                if cutlass.const_expr(compute_kl):
-                    local_kl_sum += (kl_val * mask_vec).reduce(
-                        cute.ReductionOp.ADD,
-                        cutlass.Float32(0.0),
-                        reduction_profile=0,
-                    )
-                # ---- Backward: write scaled dpolicy slab in same loop ----
-                # use_unclipped = (surrogate <= clipped) — matches torch's
-                # convention. d/d(policy) of -min(surrogate, clipped) is
-                # -adv*ratio when use_unclipped, else 0 (clamp grad = 0).
-                # ``-(adv * ratio)`` is just ``-surrogate`` (already in
-                # scope) — saves one FMUL per element.
-                # KL term: d/d(policy) of (ratio_ref - log_ratio_ref - 1)
-                #          = -(ratio_ref - 1) = 1 - ratio_ref.
-                if cutlass.const_expr(compute_backward):
-                    neg_surrogate_grad = cute.where(
-                        surrogate <= clipped,
-                        -surrogate,
-                        cutlass.Float32(0.0),
-                    )
-                    if cutlass.const_expr(compute_kl):
-                        # ``beta - beta*ratio_ref`` instead of ``beta*(1 - ratio_ref)``
-                        # gives ptxas an obvious FFMA pattern (``FFMA -beta,
-                        # ratio_ref, beta``) — saves one FMUL per element vs
-                        # the (1 - ratio_ref) intermediate.
-                        kl_grad = beta - beta * ratio_ref
-                        dpolicy_vec = neg_surrogate_grad + kl_grad
-                    else:
-                        dpolicy_vec = neg_surrogate_grad
-                    dpolicy_vec = dpolicy_vec * mask_vec
-                    dpolicy_vec = dpolicy_vec * inv_total
-                    dp_dst = cute.local_tile(dp_slab, (VEC,), (sub_idx,))
-                    dp_frag = cute.make_fragment_like(dp_dst)
-                    dp_frag.store(dpolicy_vec.to(dpolicy.element_type))
-                    cute.copy(r2g_atom, dp_frag, dp_dst)
-        else:
-            # ---- Predicated vector tail path (< tile_n valid elements) ----
-            pol_slab = cute.local_tile(pol_row, (tile_n,), (col_block,))
-            old_slab = cute.local_tile(old_row, (tile_n,), (col_block,))
-            if cutlass.const_expr(compute_kl):
-                ref_slab = cute.local_tile(ref_row, (tile_n,), (col_block,))
-            mask_slab = cute.local_tile(mask_row, (tile_n,), (col_block,))
-            if cutlass.const_expr(compute_backward):
-                dp_slab = cute.local_tile(dp_row, (tile_n,), (col_block,))
-            for k in cutlass.range(iters, unroll_full=True):
-                sub_idx = tid + k * block_size
-                chunk_base = sub_idx * VEC
-                if chunk_base < tail_len:
-                    pol_src = cute.local_tile(pol_slab, (VEC,), (sub_idx,))
-                    old_src = cute.local_tile(old_slab, (VEC,), (sub_idx,))
-                    pred = cute.make_rmem_tensor(pol_src.shape, cutlass.Boolean)
-                    for v in cutlass.range(VEC, unroll_full=True):
-                        pred[v] = cute.elem_less(chunk_base + v, tail_len)
-                    pol_frag = cute.make_fragment_like(pol_src)
-                    old_frag = cute.make_fragment_like(old_src)
-                    pol_frag.fill(0.0)
-                    old_frag.fill(0.0)
-                    cute.copy(g2r_atom, pol_src, pol_frag, pred=pred)
-                    cute.copy(g2r_atom, old_src, old_frag, pred=pred)
-                    if cutlass.const_expr(compute_kl):
-                        ref_src = cute.local_tile(ref_slab, (VEC,), (sub_idx,))
-                        ref_frag = cute.make_fragment_like(ref_src)
-                        ref_frag.fill(0.0)
-                        cute.copy(g2r_atom, ref_src, ref_frag, pred=pred)
-                    mask_src = cute.local_tile(mask_slab, (VEC,), (sub_idx,))
-                    mask_frag = cute.make_fragment_like(mask_src)
-                    mask_frag.fill(0)
-                    cute.copy(g2r_mask_atom, mask_src, mask_frag, pred=pred)
-                    pol_vec = pol_frag.load().to(cutlass.Float32)
-                    old_vec = old_frag.load().to(cutlass.Float32)
-                    valid_vec = cute.where(
-                        pred.load(),
-                        cute.full_like(pol_vec, cutlass.Float32(1.0)),
-                        cute.zeros_like(pol_vec, dtype=cutlass.Float32),
-                    )
-                    log_ratio = pol_vec - old_vec
-                    ratio = cute.math.exp2(log_ratio * _LOG2_E, fastmath=True)
-                    surrogate = ratio * adv
-                    clipped_ratio = cute.where(
-                        ratio < lo,
-                        lo,
-                        cute.where(ratio > hi, hi, ratio),
-                    )
-                    clipped = clipped_ratio * adv
-                    policy_loss = -cute.where(surrogate < clipped, surrogate, clipped)
-                    if cutlass.const_expr(compute_kl):
-                        ref_vec = ref_frag.load().to(cutlass.Float32)
-                        log_ratio_ref = ref_vec - pol_vec
-                        ratio_ref = cute.math.exp2(log_ratio_ref * _LOG2_E, fastmath=True)
-                        # FFMA-friendly rearrangement — see full-tile path.
-                        kl_val = (ratio_ref - cutlass.Float32(1.0)) - log_ratio_ref
-                    mask_vec = mask_frag.load().to(cutlass.Float32) * valid_vec
-                    local_policy_sum += (policy_loss * mask_vec).reduce(
-                        cute.ReductionOp.ADD,
-                        cutlass.Float32(0.0),
-                        reduction_profile=0,
-                    )
-                    if cutlass.const_expr(not compute_backward):
-                        local_valid_sum += mask_vec.reduce(
-                            cute.ReductionOp.ADD,
-                            cutlass.Float32(0.0),
-                            reduction_profile=0,
-                        )
-                    if cutlass.const_expr(compute_kl):
-                        local_kl_sum += (kl_val * mask_vec).reduce(
-                            cute.ReductionOp.ADD,
-                            cutlass.Float32(0.0),
-                            reduction_profile=0,
-                        )
-                    # ---- Backward: predicated dpolicy slab write ----
-                    # Same gradient math as the full-tile path. ``valid_vec``
-                    # already encodes the in-bounds predicate (1.0 inside,
-                    # 0.0 outside) and is folded into ``mask_vec``, so
-                    # multiplying by it zeros out the padded positions.
-                    if cutlass.const_expr(compute_backward):
-                        neg_surrogate_grad = cute.where(
-                            surrogate <= clipped,
-                            -surrogate,
-                            cutlass.Float32(0.0),
-                        )
-                        if cutlass.const_expr(compute_kl):
-                            kl_grad = beta - beta * ratio_ref
-                            dpolicy_vec = neg_surrogate_grad + kl_grad
-                        else:
-                            dpolicy_vec = neg_surrogate_grad
-                        dpolicy_vec = dpolicy_vec * mask_vec
-                        dpolicy_vec = dpolicy_vec * inv_total
-                        dp_dst = cute.local_tile(dp_slab, (VEC,), (sub_idx,))
-                        dp_frag = cute.make_fragment_like(dp_dst)
-                        dp_frag.store(dpolicy_vec.to(dpolicy.element_type))
-                        cute.copy(r2g_atom, dp_frag, dp_dst, pred=pred)
-        # ---- Stage 1: Intra-warp reduction (butterfly XOR shuffles) ----
-        warp_policy = cute.arch.warp_reduction(local_policy_sum, operator.add)
-        if cutlass.const_expr(compute_kl):
-            warp_kl = cute.arch.warp_reduction(local_kl_sum, operator.add)
-        smem = cutlass.utils.SmemAllocator()
-        buf_policy = smem.allocate_tensor(cutlass.Float32, cute.make_layout(NUM_WARPS))
-        if cutlass.const_expr(compute_kl):
-            buf_kl = smem.allocate_tensor(cutlass.Float32, cute.make_layout(NUM_WARPS))
-        lane_idx = cute.arch.lane_idx()
-        warp_idx = cute.arch.warp_idx()
-        # When compute_backward is True the bundled mask-sum kernel populates
-        # inv_total_tensor before this kernel runs, so on-GPU mask-element
-        # accumulation is dead code.
-        if cutlass.const_expr(accumulate_valid):
-            warp_valid = cute.arch.warp_reduction(local_valid_sum, operator.add)
-            buf_valid = smem.allocate_tensor(cutlass.Float32, cute.make_layout(NUM_WARPS))
-        # ---- Stage 2: Cross-warp reduction via SMEM ----
-        if lane_idx == 0:
-            buf_policy[warp_idx] = warp_policy
-            if cutlass.const_expr(compute_kl):
-                buf_kl[warp_idx] = warp_kl
-            if cutlass.const_expr(accumulate_valid):
-                buf_valid[warp_idx] = warp_valid
-        cute.arch.barrier()
-        if warp_idx == 0:
-            val_p = cutlass.Float32(0.0)
-            if lane_idx < NUM_WARPS:
-                val_p = buf_policy[lane_idx]
-            block_policy = cute.arch.warp_reduction(val_p, operator.add, threads_in_group=NUM_WARPS)
-            if cutlass.const_expr(compute_kl):
-                val_k = cutlass.Float32(0.0)
-                if lane_idx < NUM_WARPS:
-                    val_k = buf_kl[lane_idx]
-                block_kl = cute.arch.warp_reduction(val_k, operator.add, threads_in_group=NUM_WARPS)
-            if cutlass.const_expr(accumulate_valid):
-                val_v = cutlass.Float32(0.0)
-                if lane_idx < NUM_WARPS:
-                    val_v = buf_valid[lane_idx]
-                block_valid = cute.arch.warp_reduction(
-                    val_v, operator.add, threads_in_group=NUM_WARPS
-                )
-            # ---- Stage 3: Cross-CTA atomic accumulation ----
-            if lane_idx == 0:
-                policy_ptr = policy_acc.iterator.toint().ir_value()  # ty: ignore[unresolved-attribute]
-                counter_ptr = counter.iterator.toint().ir_value()  # ty: ignore[unresolved-attribute]
-                _atomic_add_f32_gmem(policy_ptr, block_policy)
-                if cutlass.const_expr(compute_kl):
-                    kl_ptr = kl_acc.iterator.toint().ir_value()  # ty: ignore[unresolved-attribute]
-                    _atomic_add_f32_gmem(kl_ptr, block_kl)
-                if cutlass.const_expr(accumulate_valid):
-                    valid_ptr = valid_acc.iterator.toint().ir_value()  # ty: ignore[unresolved-attribute]
-                    # valid_acc is int32. Per-block sums of int8 0/1 values
-                    # fit exactly in fp32 (≤ tile_n ≤ 4096 ≪ 2²⁴) so the
-                    # cast is bit-exact.
-                    _atomic_add_s32_gmem(valid_ptr, cutlass.Int32(block_valid))
-                cute.arch.fence_acq_rel_gpu()
-                old = _atomic_inc_u32_gmem(counter_ptr, total_blocks - 1)
-                if old == total_blocks - 1:
-                    pol_sum = policy_acc[0]
-                    if cutlass.const_expr(accumulate_valid):
-                        # Clamp to >=1.0 so a fully-masked batch (n_valid=0)
-                        # produces ``loss=0`` instead of inf/NaN — matches
-                        # TRL's ``mask.sum().clamp(min=1)`` semantics.
-                        n_valid = cute.arch.fmax(
-                            cutlass.Float32(valid_acc[0]), cutlass.Float32(1.0)
-                        )
-                        inv_total_computed = cute.arch.rcp_approx(n_valid)
-                    else:
-                        # compute_backward path: bundled mask-sum kernel
-                        # already wrote the inverse so forward and backward
-                        # share the same scalar.
-                        inv_total_computed = inv_total
-                    if cutlass.const_expr(compute_kl):
-                        kl_sum = kl_acc[0]
-                        loss = (pol_sum + beta * kl_sum) * inv_total_computed
-                    else:
-                        loss = pol_sum * inv_total_computed
-                    output[0] = cast(loss, output.element_type)  # ty: ignore[invalid-argument-type]
-                    # Reset accumulators for the next invocation.
-                    # Counter self-resets via atom.inc wrap-around.
-                    policy_acc[0] = cutlass.Float32(0.0)
-                    if cutlass.const_expr(compute_kl):
-                        kl_acc[0] = cutlass.Float32(0.0)
-                    if cutlass.const_expr(accumulate_valid):
-                        valid_acc[0] = cutlass.Int32(0)
-    return _bnpo_loss_kernel
-def create_compiled_bnpo_loss(
-    policy_dtype: torch.dtype,
-    epsilon: float,
-    epsilon_high: float,
-    beta: float,
-    compute_backward: bool = False,
-) -> Callable[..., torch.Tensor | tuple[torch.Tensor, torch.Tensor]]:
-    """Compile the bnpo loss kernel for a given dtype/KL/backward configuration.
-    The runner allocates per-call scratch (``output``, ``inv_total``, and on
-    the fwd+bwd path ``dpolicy``) inside ``_run`` itself; cross-CTA scratch
-    (atomic accumulators + counters) is allocated lazily on first call from
-    the input device and self-resets each launch via the kernel's last-block
-    epilogue + ``atom.inc.u32`` wrap-around.
-    """
-    compute_kl = beta != 0.0
-    if policy_dtype not in _TORCH_TO_CUTLASS_DTYPE:
-        raise ValueError(f"Unsupported dtype for bnpo kernel: {policy_dtype}")
-    tile_n_small = TILE_N
-    tile_n_large = TILE_N_LARGE
-    seq_len_threshold = TILE_N_LARGE_THRESHOLD
-    block_size = NUM_WARPS * 32
-    if tile_n_small % (block_size * VEC) != 0:
-        raise ValueError(
-            f"TILE_N={tile_n_small} must be a multiple of BLOCK_SIZE*VEC={block_size * VEC}"
-        )
-    if tile_n_large % (block_size * VEC) != 0:
-        raise ValueError(
-            f"TILE_N_LARGE={tile_n_large} must be a multiple of BLOCK_SIZE*VEC={block_size * VEC}"
-        )
-    bs_sym = cute.sym_int()
-    seq_len_sym = cute.sym_int()
-    cute_dtype = _TORCH_TO_CUTLASS_DTYPE[policy_dtype]
-    def _fake2d(dt: Any, cols: Any) -> Any:
-        return cute.runtime.make_fake_compact_tensor(
-            dt,
-            (bs_sym, cols),
-            stride_order=(1, 0),
-            assumed_align=16,
-        )
-    fake_pol = _fake2d(cute_dtype, seq_len_sym)
-    fake_old = _fake2d(cute_dtype, seq_len_sym)
-    fake_ref = _fake2d(cute_dtype, seq_len_sym)
-    fake_adv = cute.runtime.make_fake_compact_tensor(
-        cute_dtype,
-        (bs_sym,),
-        assumed_align=16,
-    )
-    fake_mask = cute.runtime.make_fake_compact_tensor(
-        cutlass.Int8,
-        (bs_sym, seq_len_sym),
-        stride_order=(1, 0),
-        assumed_align=16,
-    )
-    dpolicy_cols = seq_len_sym if compute_backward else 1
-    fake_dpolicy = cute.runtime.make_fake_compact_tensor(
-        cute_dtype,
-        (bs_sym, dpolicy_cols),
-        stride_order=(1, 0),
-        assumed_align=16,
-    )
-    fake_scalar_f32 = cute.runtime.make_fake_compact_tensor(
-        cutlass.Float32,
-        (1,),
-        assumed_align=16,
-    )
-    fake_valid_acc = cute.runtime.make_fake_compact_tensor(
-        cutlass.Int32,
-        (1,),
-        assumed_align=16,
-    )
-    fake_counter = cute.runtime.make_fake_compact_tensor(
-        cutlass.Int32,
-        (1,),
-        assumed_align=16,
-    )
-    fake_mask_counter = cute.runtime.make_fake_compact_tensor(
-        cutlass.Int32,
-        (1,),
-        assumed_align=16,
-    )
-    fake_output = cute.runtime.make_fake_compact_tensor(
-        cute_dtype,
-        (1,),
-        assumed_align=16,
-    )
-    def _build_launch(tile_n_v: int) -> Callable[..., None]:
-        """Build a ``@cute.jit`` ``_launch`` for a given ``tile_n``.
-        Captures *tile_n_v* via closure; both the main kernel and the
-        (optional) mask-sum kernel are specialised to this tile size.
-        One ``_launch`` per tier; the runner dispatches at call time.
-        """
-        specialized_kernel = _make_bnpo_kernel(compute_kl, compute_backward, tile_n_v)
-        if compute_backward:
-            mask_sum_kernel = _make_mask_sum_kernel(tile_n_v)
-        @cute.jit
-        def _launch(
-            pol_ct: cute.Tensor,
-            old_ct: cute.Tensor,
-            ref_ct: cute.Tensor,
-            adv_ct: cute.Tensor,
-            mask_ct: cute.Tensor,
-            dpolicy_ct: cute.Tensor,
-            inv_total_ct: cute.Tensor,
-            policy_acc_ct: cute.Tensor,
-            kl_acc_ct: cute.Tensor,
-            valid_acc_ct: cute.Tensor,
-            counter_ct: cute.Tensor,
-            mask_counter_ct: cute.Tensor,
-            output_ct: cute.Tensor,
-            epsilon_v: cutlass.Float32,
-            epsilon_high_v: cutlass.Float32,
-            beta_v: cutlass.Float32,
-            total_blocks_v: cutlass.Int32,
-            num_full_tiles_v: cutlass.Int32,
-            tail_len_v: cutlass.Int32,
-            num_col_tiles_v: cutlass.Int32,
-        ) -> None:
-            bs_v = pol_ct.shape[0]  # ty: ignore[not-subscriptable]
-            # Bundled mask-sum (compute_backward only) — writes
-            # ``1 / completions_mask.sum()`` into ``inv_total_ct`` before the
-            # main kernel reads it. Launching both kernels in one tvm-ffi
-            # dispatch eliminates the per-call ``torch.sum`` + reciprocal round trip.
-            if cutlass.const_expr(compute_backward):
-                mask_sum_kernel(  # ty: ignore[unresolved-attribute]
-                    mask_ct,
-                    inv_total_ct,
-                    valid_acc_ct,
-                    mask_counter_ct,
-                    total_blocks_v,
-                    num_full_tiles_v,
-                    tail_len_v,
-                ).launch(
-                    grid=(bs_v, num_col_tiles_v, 1),
-                    block=(NUM_WARPS * 32, 1, 1),
-                )
-            specialized_kernel(  # ty: ignore[unresolved-attribute]
-                pol_ct,
-                old_ct,
-                ref_ct,
-                adv_ct,
-                mask_ct,
-                dpolicy_ct,
-                inv_total_ct,
-                policy_acc_ct,
-                kl_acc_ct,
-                valid_acc_ct,
-                counter_ct,
-                output_ct,
-                epsilon_v,
-                epsilon_high_v,
-                beta_v,
-                total_blocks_v,
-                num_full_tiles_v,
-                tail_len_v,
-            ).launch(
-                grid=(bs_v, num_col_tiles_v, 1),
-                block=(NUM_WARPS * 32, 1, 1),
-            )
-        return _launch
-    def _compile_launch(launch_fn: Callable[..., None]) -> Callable[..., None]:
-        return cute.compile(
-            launch_fn,
-            fake_pol,
-            fake_old,
-            fake_ref,
-            fake_adv,
-            fake_mask,
-            fake_dpolicy,
-            fake_scalar_f32,
-            fake_scalar_f32,
-            fake_scalar_f32,
-            fake_valid_acc,
-            fake_counter,
-            fake_mask_counter,
-            fake_output,
-            cutlass.Float32(epsilon),
-            cutlass.Float32(epsilon_high),
-            cutlass.Float32(beta),
-            cutlass.Int32(1),
-            cutlass.Int32(1),
-            cutlass.Int32(0),
-            cutlass.Int32(1),
-            options="--enable-tvm-ffi",
-        )
-    compiled_small = _compile_launch(_build_launch(tile_n_small))
-    if tile_n_large == tile_n_small:
-        compiled_large = compiled_small
-    else:
-        compiled_large = _compile_launch(_build_launch(tile_n_large))
-    eps_const = cutlass.Float32(epsilon)
-    eps_high_const = cutlass.Float32(epsilon_high)
-    beta_const = cutlass.Float32(beta)
-    # Cross-CTA scratch slab — one int32 buffer with stride-4 (16-byte) slices
-    # so each slot is individually 16-byte aligned (``assumed_align=16`` at
-    # compile time). Bit-pattern of int32 0 equals fp32 0.0, so a single
-    # ``zeros`` factory legitimately initialises both the int32 counters and
-    # the fp32 accumulators. The kernel's last block self-resets accumulators
-    # in its epilogue and the counters self-reset via ``atom.inc.u32``
-    # wrap-around, so the up-front ``torch.zeros`` only matters for the very
-    # first call.
-    _scratch: list[torch.Tensor | None] = [None]
-    def _ensure_scratch(device: torch.device) -> tuple[torch.Tensor, ...]:
-        s = _scratch[0]
-        if s is None or s.device != device:
-            s = torch.zeros(20, dtype=torch.int32, device=device)
-            _scratch[0] = s
-        return (
-            s[0:1],  # counter (int32)
-            s[4:5],  # mask_counter (int32)
-            s[8:9],  # valid_acc (int32)
-            s[12:13].view(torch.float32),  # policy_acc (fp32)
-            s[16:17].view(torch.float32),  # kl_acc (fp32)
-        )
-    def _run(
-        policy_logprobs_r: torch.Tensor,
-        old_policy_logprobs_r: torch.Tensor,
-        ref_logprobs_r: torch.Tensor,
-        advantages_r: torch.Tensor,
-        completions_mask_r: torch.Tensor,
-    ) -> torch.Tensor | tuple[torch.Tensor, torch.Tensor]:
-        bs, seq_len = policy_logprobs_r.shape
-        device = policy_logprobs_r.device
-        dtype = policy_logprobs_r.dtype
-        # Tier dispatch: long sequences pay too much last-block-detection
-        # latency under the small-tile grid, so swap to the large-tile
-        # compiled variant.
-        if seq_len >= seq_len_threshold:
-            tile_n_active = tile_n_large
-            compiled_active = compiled_large
-        else:
-            tile_n_active = tile_n_small
-            compiled_active = compiled_small
-        num_full_tiles = seq_len // tile_n_active
-        tail_len = seq_len % tile_n_active
-        num_col_tiles = num_full_tiles + (1 if tail_len > 0 else 0)
-        total_blocks = bs * num_col_tiles
-        # Per-call write-only buffers — ``empty`` is enough (Liger / TE
-        # pattern). ``inv_total`` is populated by the bundled mask-sum
-        # kernel (compute_backward path) or by the main kernel's last-block
-        # trick (fwd-only path); the runner never reads it.
-        output_r = torch.empty(1, dtype=dtype, device=device)
-        inv_total_r = torch.empty(1, dtype=torch.float32, device=device)
-        if compute_backward:
-            dpolicy_r = torch.empty_like(policy_logprobs_r)
-        else:
-            dpolicy_r = torch.empty(bs, 1, dtype=dtype, device=device)
-        counter_r, mask_counter_r, valid_acc_r, policy_acc_r, kl_acc_r = _ensure_scratch(device)
-        compiled_active(
-            policy_logprobs_r,
-            old_policy_logprobs_r,
-            ref_logprobs_r,
-            advantages_r,
-            completions_mask_r,
-            dpolicy_r,
-            inv_total_r,
-            policy_acc_r,
-            kl_acc_r,
-            valid_acc_r,
-            counter_r,
-            mask_counter_r,
-            output_r,
-            eps_const,
-            eps_high_const,
-            beta_const,
-            total_blocks,
-            num_full_tiles,
-            tail_len,
-            num_col_tiles,
-        )
-        out_view = output_r.view(())
-        if compute_backward:
-            return out_view, dpolicy_r
-        return out_view
-    return _run
-# ---------------------------------------------------------------------------
-# Fused forward + backward — direct (loss, grad) runner, no autograd
-# ---------------------------------------------------------------------------
-def create_compiled_bnpo_loss_with_backward(
-    policy_dtype: torch.dtype,
-    epsilon: float,
-    epsilon_high: float,
-    beta: float,
-) -> Callable[..., tuple[torch.Tensor, torch.Tensor]]:
-    """Compile the fused fwd+bwd bnpo kernel and return a tuple-returning runner.
-    The returned callable runs one training-step worth of work: a single
-    ``@cute.jit`` dispatch produces both the scalar loss and the scaled
-    ``dL/d(policy_logprobs)`` tensor. It returns ``(loss, dpolicy)`` directly
-    — no ``torch.autograd.Function`` wrapper, no extra ``grad_output * dpolicy``
-    backward kernel. Callers that need autograd integration (so
-    ``loss.backward()`` works) wrap this themselves at the public-API layer;
-    callers that control gradient flow manually (benchmarks, custom training
-    loops) can use it as-is for zero overhead.
-    ``inv_total`` is computed entirely on-GPU by a bundled mask-sum kernel
-    that runs in series with the main kernel inside the same ``@cute.jit``
-    launch — no host sync, no extra ``torch.sum`` dispatch, CUDA-graph
-    compatible.
-    """
-    return _typing_cast(
-        "Callable[..., tuple[torch.Tensor, torch.Tensor]]",
-        create_compiled_bnpo_loss(
-            policy_dtype=policy_dtype,
-            epsilon=epsilon,
-            epsilon_high=epsilon_high,
-            beta=beta,
-            compute_backward=True,
-        ),
-    )
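
Reviewer note on the deleted runner above: the tier arithmetic in `_run` and the self-resetting block counter can be checked on the host. The sketch below is plain Python, not the kernel's code. `atom_inc_u32` emulates the wrap-around rule of PTX `atom.inc.u32` (store 0 when the old value is >= the bound, otherwise old + 1). All helper names are hypothetical, and the tile size of 2048 is an assumed value because `TILE_N` is not shown in this hunk.

```python
def tile_grid(bs: int, seq_len: int, tile_n: int) -> tuple[int, int, int, int]:
    """Mirror the runner's grid arithmetic for one tile size."""
    num_full_tiles = seq_len // tile_n
    tail_len = seq_len % tile_n
    # A partial tile at the end of the row needs one extra column block.
    num_col_tiles = num_full_tiles + (1 if tail_len > 0 else 0)
    total_blocks = bs * num_col_tiles
    return num_full_tiles, tail_len, num_col_tiles, total_blocks


def atom_inc_u32(counter: list[int], bound: int) -> int:
    """Emulate atom.inc.u32 on a one-element list and return the old value."""
    old = counter[0]
    counter[0] = 0 if old >= bound else old + 1
    return old


def simulate_launch(counter: list[int], total_blocks: int) -> list[int]:
    """Return the blocks that observed themselves as the last arrival.

    Blocks are visited sequentially here; on the GPU the arrival order is
    arbitrary, but exactly one block still sees old == total_blocks - 1.
    """
    last = []
    for block in range(total_blocks):
        old = atom_inc_u32(counter, total_blocks - 1)
        if old == total_blocks - 1:
            last.append(block)
    return last


if __name__ == "__main__":
    # Assumed shapes: batch of 4, sequence length 5000, tile of 2048 columns.
    _, _, _, total = tile_grid(4, 5000, 2048)
    counter = [0]
    print(simulate_launch(counter, total), counter)
    # Second launch reuses the same counter without any host-side reset.
    print(simulate_launch(counter, total), counter)
```

Because the final increment wraps the counter back to 0, a second launch needs no host-side reset, which is the property the scratch-slab comment relies on.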

build/torch-cuda/geometric_ai_kernels/__init__.py DELETED

@@ -1,26 +0,0 @@
-import ctypes
-import importlib.util
-import sys
-from pathlib import Path
-from types import ModuleType
-def _import_from_path(file_path: Path) -> ModuleType:
-    # We cannot use the module name as-is: once it is added to `sys.modules`,
-    # it would also be used for other imports. Instead, we make the module
-    # name unique by deriving it from the hex-encoded hash of the file's
-    # path.
-    path_hash = "{:x}".format(ctypes.c_size_t(hash(file_path.absolute())).value)
-    module_name = path_hash
-    spec = importlib.util.spec_from_file_location(module_name, file_path)
-    if spec is None:
-        raise ImportError(f"Cannot load spec for {module_name} from {file_path}")
-    module = importlib.util.module_from_spec(spec)
-    if module is None:
-        raise ImportError(f"Cannot load module {module_name} from spec")
-    sys.modules[module_name] = module
-    spec.loader.exec_module(module)  # type: ignore
-    return module
-globals().update(vars(_import_from_path(Path(__file__).parent.parent / "__init__.py")))
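
Reviewer note on the deleted shim above: the path-hashed module name can be exercised in isolation. The sketch below is a minimal standalone version under stated assumptions: `import_from_path` and `demo` are hypothetical names, the temporary `__init__.py` and its `ANSWER` constant exist only for illustration, and the extra `spec.loader is None` check is an addition that the original does not make.

```python
import ctypes
import importlib.util
import sys
import tempfile
from pathlib import Path
from types import ModuleType


def import_from_path(file_path: Path) -> ModuleType:
    # Reinterpret the (possibly negative) hash as an unsigned size_t and
    # hex-encode it, so each file path maps to its own module name.
    module_name = "{:x}".format(ctypes.c_size_t(hash(file_path.absolute())).value)
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load spec for {module_name} from {file_path}")
    module = importlib.util.module_from_spec(spec)
    # Register before executing so the module is visible during its own import.
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module


def demo() -> ModuleType:
    # Write a throwaway package file and load it under its hashed name.
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "__init__.py"
        target.write_text("ANSWER = 42\n")
        return import_from_path(target)


if __name__ == "__main__":
    mod = demo()
    print(mod.ANSWER, mod.__name__ in sys.modules)
```

The generated name is only stable within one interpreter process, since Python randomizes string hashes per process by default; that is sufficient here because the name is used solely as a `sys.modules` key.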