art87able committed about 3 hours ago

Commit

0a55ff6

0 Parent(s):

Lean Laguna: lossless DFlash speculative decoding on Laguna XS.2 (harness, environment, results)

Files changed (26)

.gitattributes +35 -0
README.md +185 -0
bench/measure.py +184 -0
bench/rollout_bench.py +325 -0
configs/endpoints.toml +69 -0
configs/rl/laguna-spec.toml +45 -0
evals/humaneval_subset.py +145 -0
results/.gitkeep +0 -0
results/README.md +31 -0
results/baseline.json +12 -0
results/dflash.json +16 -0
results/humaneval_dflash.json +12 -0
results/parity.json +12 -0
scripts/check_results.py +67 -0
scripts/dress_rehearsal.sh +213 -0
scripts/eval_local.py +305 -0
scripts/fill_submission.py +116 -0
scripts/gen_local.py +110 -0
scripts/hf_job_ab.py +287 -0
scripts/parity_local.sh +33 -0
scripts/run_min_on_prime.sh +90 -0
scripts/serve_vllm.py +126 -0
scripts/stub_server.py +187 -0
spec_rl/README.md +129 -0
spec_rl/pyproject.toml +21 -0
spec_rl/spec_rl.py +453 -0

.gitattributes ADDED

	@@ -0,0 +1,35 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,185 @@

+---
+license: apache-2.0
+base_model: poolside/Laguna-XS.2
+tags:
+  - speculative-decoding
+  - dflash
+  - inference
+  - vllm
+  - lossless
+---
+# Lean Laguna — Laguna XS.2 + DFlash, lossless single-GPU speedup
+*Project: **Lean Laguna** — making Laguna XS.2 cheaper to run and to post-train on a single GPU.*
+> **One-line claim:** Laguna XS.2 generates **2.76× faster on a single GPU** — **19.6 → 54.2
+> tokens/sec** — with **byte-identical greedy output** (0 / 14 mismatches) on a mixed-difficulty code
+> set (2.47× corroborated on a trivial set; **lossless in both**) vs the no-speculator baseline.
+Speculative decoding with Poolside's **DFlash** speculator on **Laguna XS.2**, served in vLLM on
+one GPU. The throughput win is measured; the output is provably **lossless under greedy decoding**
+(token-for-token identical to baseline) and distribution-preserving under sampling.
+Submission for the Poolside Research Hackathon — Foundations track
+(`poolside-laguna-hackathon` HF org).
+## Goal & judging criteria
+> **Meaningfully improve Laguna XS.2, either by:** expanding model use cases (computer use,
+> multi-agent coordination, evaluation design); *or* **reducing cost & latency** (optimizations,
+> speed, quantization). **For:** an economically valuable task (a function/application); *or*
+> **any novel research idea.**
+> **Scored on: GENERALISABILITY · REPRODUCIBILITY · TECHNICAL CONTRIBUTIONS.**
+Lean Laguna sits on **reduce cost & latency** for **a novel research idea** (lossless
+speculative decoding → cheaper RL rollouts), and is built to score all three axes:
+- **Generalisability** — any target + drafter via one `--speculative-config`; the `spec_rl` env +
+  `configs/endpoints.toml` point any RL run at any OpenAI-compatible endpoint; the reward is a
+  swappable seam (a *reusable RL environment + reward signal* — a listed submission idea).
+- **Reproducibility** — greedy byte-parity + directly-measured throughput behind `make` targets and a
+  one-command HF-Jobs run (below); anyone can re-run the before/after table. (τ from `/metrics` sat at
+  the γ+1 ceiling on both runs, so we treat it as unreliable and **don't quote it**. A HumanEval pass@1
+  sweep is a documented next step; greedy parity is the stronger guarantee.)
+- **Technical contributions** — a measured, provably-lossless throughput win (**2.76×** on a
+  mixed-difficulty code set, 0 mismatches; 2.47× corroborated on a trivial set) on the *released*
+  Laguna XS.2 + DFlash, carried into **cheaper RL rollouts**; the open problem of **speculative
+  decoding under a moving RL policy** (drafter staleness) and NVFP4 attention-weight calibration as
+  the posed research stretches.
+### Cheaper RL rollouts — the generalisability + frontier story
+The speedup is a *decode-time* property, so it carries into any RL trainer whose rollout phase is
+OpenAI-compatible vLLM inference — e.g. **`verifiers`** envs (our `spec_rl`, or third-party Hub envs
+like [`pandelis/zerolang-editing`](https://app.primeintellect.ai/dashboard/environments/pandelis/zerolang-editing)
+— install + repoint `endpoints.toml`, zero code change) and **[OpenPipe ART](https://github.com/openpipe/art)**
+(GRPO + LoRA, rollouts served via vLLM). Drop `--speculative-config` into the rollout server →
+cheaper rollouts.
+**The honest open problem:** in RL the policy moves every batch (e.g. ART's LoRA), so a drafter
+trained on the *base* model drifts → acceptance τ decays → the speedup erodes across training. Within
+a batch the policy is frozen, so the per-batch win is real; the frontier is keeping the drafter useful
+as the policy moves (periodic drafter distillation, hidden-state-conditioned drafters, or measuring
+and amortizing the re-sync cost). This is the "novel research idea" axis, stated plainly.
+---
+## Method
+- **Target model:** `poolside/Laguna-XS.2` — 33.4B-total / 3B-active MoE, single GPU, FP8 native,
+  128K (→256K) context, Apache 2.0, built for agentic coding.
+- **Draft model:** `poolside/Laguna-XS.2-speculator.dflash` — a 0.6B-parameter draft model
+  (block-diffusion-style speculative-decoding method).
+- **How it works:** DFlash proposes **γ = 7** candidate tokens per round; Laguna XS.2 verifies all
+  7 in a **single forward pass** and commits the longest matching prefix plus one free bonus token.
+  Same output, fewer expensive target passes.
+- **Why lossless:** under greedy decoding the target only commits tokens equal to its own argmax,
+  so the output is token-identical to the baseline. Under sampling, vLLM's rejection sampling
+  preserves the target's output distribution. **Decode-time property — independent of training.**
+- **Regime:** the win lands at **low batch / memory-bound decode** — the single-GPU, single-agent
+  case. It shrinks (and can invert) at high batch / compute-bound. See the honesty note below.
+### The exact vLLM flag
+Baseline and DFlash differ by **one flag only** — that is the whole experiment:
+```bash
+--speculative-config '{"model":"poolside/Laguna-XS.2-speculator.dflash","num_speculative_tokens":7,"method":"dflash"}'
+```
+Requires **vLLM ≥ 0.21.0** and `VLLM_USE_DEEP_GEMM=0`.
+---
+## Results
+Same prompts, same `max_tokens`, **temperature 0 (greedy)**, same single GPU,
+`--tensor-parallel-size 1`. Only `--speculative-config` differs between the two servers.
+Measured on an **H200**, vLLM 0.22.0, `--enforce-eager`, `--max-model-len 4096`, greedy. Two runs:
+a **14-prompt mixed-difficulty** code set (trivial `fib`/`is_prime` → hard `lcs`/`dijkstra`/`LRUCache`)
+plus a corroborating **20-prompt trivial** set.
+| Metric | Baseline | + DFlash | Δ |
+|---|---|---|---|
+| tokens/sec — mixed-difficulty (N=14) | 19.6 | 54.2 | **2.76×** ↑ |
+| tokens/sec — trivial (N=20) | 19.5 | 48.1 | **2.47×** ↑ |
+| greedy parity | — | **identical** | **0 / 14 and 0 / 20 mismatches** ✓ |
+| HumanEval pass@1 | not run† | not run† | — |
+- **tokens/sec is the headline win** — directly measured wall-clock. The speedup *holds and is larger*
+  on the harder, more diverse set (**2.76×**) than on the trivial one (2.47×), and output is
+  byte-identical in **both**.
+- **No acceptance-length (τ) claim — on purpose.** vLLM's `/metrics` τ pinned at *exactly* the γ+1
+  ceiling (8.0) on **both** runs, and per-prompt deltas didn't resolve a distribution — almost
+  certainly a metrics artifact, not true 100% acceptance. So we report only the directly-measured
+  speedup + parity and treat τ as unreliable. *The metric we can't trust, we don't quote.*
+- **parity** = baseline vs DFlash greedy outputs are token-identical — the lossless proof.
+- **†No TTFT or HumanEval-pass@1 row.** This MIN A/B measured throughput + byte-parity only; the
+  harness did not isolate true time-to-first-token, and a full HumanEval pass@1 sweep is a documented
+  next step. Byte-identical greedy output ⇒ identical pass@1 *by construction*, so parity is the
+  stronger guarantee here.
+---
+## How to reproduce
+**The exact run that produced the numbers above** — one self-contained command on Hugging Face Jobs
+(no ssh; serves baseline → measures → re-serves with DFlash → measures → byte-parity), funded by the
+HF Jobs credit pool:
+```bash
+hf jobs uv run --flavor h200 --timeout 1500 --detach --secrets HF_TOKEN scripts/hf_job_ab.py
+# then: hf jobs logs <id>  →  the [job] RESULT / BASELINE_JSON / DFLASH_JSON / PARITY_JSON lines
+```
+`scripts/hf_job_ab.py` pins the working vLLM env (Triton MoE + Torch sampler + FlashAttention, so no
+CUDA toolkit is needed in the slim image — see `THE_JOURNEY.md` for *why*). Below is the equivalent
+local two-server flow for any CUDA box with the released weights (vLLM ≥ 0.21.0):
+```bash
+# 1. Baseline server (speed floor)
+python scripts/serve_vllm.py --mode baseline --run        # serves on :8000
+# 2. Benchmark baseline (separate shell)
+python bench/measure.py --base-url http://localhost:8000 --model laguna \
+    --label baseline --n 20 --out results/baseline.json
+# 3. DFlash server — same command + the one --speculative-config flag
+python scripts/serve_vllm.py --mode dflash --run
+python bench/measure.py --base-url http://localhost:8000 --model laguna \
+    --label dflash --n 20 --out results/dflash.json
+# 4. Quality + lossless parity
+python evals/humaneval_subset.py --base-url http://localhost:8000 --model laguna \
+    --n 25 --out results/humaneval_dflash.json
+# Parity compares two live servers, so both must be up at once (here on :8000 and :8001).
+python evals/humaneval_subset.py --parity \
+    --base-url http://localhost:8000 --base-url-b http://localhost:8001 --model laguna --n 25
+```
+The results table above is the diff of `results/baseline.json` and `results/dflash.json` plus the
+parity result. τ is read from vLLM's `/metrics` but, as noted in Results, is not reported.
+---
+## Honesty note — the low-batch regime
+This is deliberately a **single-GPU, low-concurrency** result: one box, one agent, maximum
+tokens/sec.
+Speculative decoding helps **most at low batch size / memory-bound decode**, where each step
+reloads the active weights to emit a single token and doing useful work for several tokens per
+pass is a large win. It helps **less at high batch size / compute-bound decode** — once the GPU is
+saturated, the matmuls dominate and the extra verify work for rejected drafts can slightly hurt.
+At very high concurrency you would tune γ down or turn speculation off.
+The reported speedup is for the low-batch single-GPU regime on
+coding-style prompts. The lossless claim (greedy parity) holds regardless of regime — it is a
+correctness property of the verification step, not a function of batch size.
+---
+## License
+Apache 2.0, inheriting `poolside/Laguna-XS.2`.
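
Note on the README's "How it works" and "Why lossless" bullets: the greedy verification rule they describe can be sketched in a few lines. This is a minimal illustration, not vLLM's or DFlash's implementation. The function name `verify_greedy` and the plain-list token representation are assumptions made for the example. It shows why every committed token equals the target's own greedy choice, which is what makes the output token-identical to the baseline.

```python
from typing import List


def verify_greedy(draft_tokens: List[int], target_argmax: List[int]) -> List[int]:
    """Tokens committed in one speculative round under greedy decoding.

    draft_tokens:  the gamma tokens proposed by the drafter.
    target_argmax: gamma + 1 greedy choices from one target forward pass, where
                   target_argmax[i] is the target's pick after the current prefix
                   followed by draft_tokens[:i].
    (Hypothetical helper for illustration only.)
    """
    if len(target_argmax) != len(draft_tokens) + 1:
        raise ValueError("target_argmax must have one more entry than draft_tokens")
    committed: List[int] = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_argmax[i]:
            # First mismatch: keep the target's own token and stop the round.
            committed.append(target_argmax[i])
            return committed
        committed.append(tok)  # the draft token equals the target's argmax
    # Every draft token matched, so the target's next pick comes for free.
    committed.append(target_argmax[len(draft_tokens)])
    return committed
```

Each round therefore commits between 1 and γ + 1 tokens, and each one is a token the target would have produced on its own.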

bench/measure.py ADDED

	@@ -0,0 +1,184 @@

+#!/usr/bin/env python3
+"""
+measure.py — the benchmark harness. Hits an OpenAI-compatible endpoint (the one
+`vllm serve` exposes) and records the three demo numbers:
+    tokens/sec   (decode throughput)   <- THE WIN
+    TTFT         (time to first token) <- should be ~unchanged with DFlash
+    acceptance length tau              <- WHY it's faster (read from vLLM metrics)
+Run it twice at the venue — once against the baseline server, once against the
+DFlash server — and diff the JSON. That diff IS the before/after table.
+This file is endpoint-driven, so it runs anywhere (including the Mac) AS LONG AS
+something is serving on --base-url. On the Mac you can point it at a local
+tiny-model OpenAI server to shape-test; at the venue you point it at vLLM.
+acceptance length tau:
+  tau = mean(number of tokens committed per target forward pass).
+  With a draft of gamma=7, tau ranges from 1 (everything rejected, +1 bonus)
+  up to gamma+1=8 (all accepted + bonus). The DFlash card publishes per-position
+  acceptance only (~70.7% at position 1, decaying to ~2% by position 7), NOT a
+  tau figure -- measure tau at the venue (expect roughly 2-3). vLLM exposes
+  accepted/draft counts in its metrics; we read them from /metrics (Prometheus)
+  when present and otherwise leave tau as None (this script does not estimate it
+  from the speedup). VERIFY AT ONBOARDING which metric names the vLLM build uses
+  (e.g. vllm:spec_decode_num_accepted_tokens / _num_draft_tokens).
+Usage:
+  python bench/measure.py --base-url http://localhost:8000 --model laguna \
+      --label dflash --out results/dflash.json --n 20
+  python bench/measure.py --base-url http://localhost:8000 --model laguna \
+      --label baseline --out results/baseline.json --n 20
+Requires only the stdlib (urllib, no requests), so no extra venue deps.
+"""
+from __future__ import annotations
+import argparse
+import json
+import os
+import time
+import urllib.request
+from statistics import mean
+PROMPTS = [
+    "Write a Python function that returns the nth Fibonacci number iteratively.",
+    "Implement binary search over a sorted list in Python. Return the index or -1.",
+    "Write a function to check if a string is a palindrome, ignoring case and spaces.",
+    "Implement quicksort in Python.",
+    "Write a function that merges two sorted lists into one sorted list.",
+]
+def _post(url: str, payload: dict) -> dict:
+    data = json.dumps(payload).encode()
+    req = urllib.request.Request(url, data=data,
+                                 headers={"Content-Type": "application/json"})
+    with urllib.request.urlopen(req, timeout=600) as r:
+        return json.loads(r.read().decode())
+def _try_metrics(base_url: str) -> dict:
+    """Best-effort read of vLLM Prometheus spec-decode counters."""
+    out = {}
+    try:
+        with urllib.request.urlopen(base_url.rstrip("/") + "/metrics", timeout=10) as r:
+            text = r.read().decode()
+    except Exception:
+        return out
+    for line in text.splitlines():
+        if line.startswith("#"):
+            continue
+        # VERIFY metric names at onboarding; these are the common vLLM ones.
+        for key in ("spec_decode_num_accepted_tokens",
+                    "spec_decode_num_draft_tokens",
+                    "spec_decode_num_emitted_tokens"):
+            if key in line:
+                try:
+                    out[key] = float(line.split()[-1])
+                except ValueError:
+                    pass
+    return out
+def measure_one(base_url: str, model: str, prompt: str, max_tokens: int) -> dict:
+    url = base_url.rstrip("/") + "/v1/completions"
+    # Greedy (temperature 0) so output is deterministic — this is what makes the
+    # baseline-vs-DFlash output comparison a LOSSLESS check.
+    payload = {
+        "model": model,
+        "prompt": prompt,
+        "max_tokens": max_tokens,
+        "temperature": 0.0,
+        "stream": True,
+    }
+    data = json.dumps(payload).encode()
+    req = urllib.request.Request(url, data=data,
+                                 headers={"Content-Type": "application/json"})
+    t0 = time.perf_counter()
+    ttft = None
+    n_tokens = 0
+    chunks = []
+    with urllib.request.urlopen(req, timeout=600) as r:
+        for raw in r:
+            line = raw.decode().strip()
+            if not line or not line.startswith("data:"):
+                continue
+            body = line[len("data:"):].strip()
+            if body == "[DONE]":
+                break
+            obj = json.loads(body)
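+            # `or [{}]` also covers an empty `choices` list, which `.get` alone would not.
+            piece = (obj.get("choices") or [{}])[0].get("text", "")
+            if piece:
+                if ttft is None:
+                    ttft = time.perf_counter() - t0
+                # NOTE: this counts streamed chunks, not tokens. A chunk may carry
+                # several tokens (e.g. under speculative decoding), so new_tokens
+                # and tokens_per_s can undercount.
+                n_tokens += 1
+                chunks.append(piece)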
+    total = time.perf_counter() - t0
+    decode_time = max(total - (ttft or 0.0), 1e-9)
+    tps = (n_tokens - 1) / decode_time if n_tokens > 1 else 0.0
+    return {
+        "ttft_s": ttft,
+        "total_s": total,
+        "new_tokens": n_tokens,
+        "tokens_per_s": tps,
+        "text": "".join(chunks),
+    }
+def main() -> None:
+    p = argparse.ArgumentParser(description="Benchmark tokens/sec, TTFT, acceptance length against a vLLM endpoint.")
+    p.add_argument("--base-url", default="http://localhost:8000")
+    p.add_argument("--model", default="laguna")
+    p.add_argument("--label", required=True, help="baseline | dflash (used in the output).")
+    p.add_argument("--n", type=int, default=20, help="Number of generations (cycles through the prompt set).")
+    p.add_argument("--max-tokens", type=int, default=256)
+    p.add_argument("--out", default=None, help="Write JSON here (e.g. results/dflash.json).")
+    args = p.parse_args()
+    before = _try_metrics(args.base_url)
+    runs = []
+    for i in range(args.n):
+        prompt = PROMPTS[i % len(PROMPTS)]
+        runs.append(measure_one(args.base_url, args.model, prompt, args.max_tokens))
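+        ttft = runs[-1]["ttft_s"]
+        ttft_str = f"{ttft:.3f}s" if ttft is not None else "n/a"  # None when no text was streamed
+        print(f"  [{args.label}] run {i+1}/{args.n}  "
+              f"tps={runs[-1]['tokens_per_s']:.1f}  ttft={ttft_str}")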
+    after = _try_metrics(args.base_url)
+    # acceptance length tau from metric deltas, if available.
+    tau = None
+    acc = after.get("spec_decode_num_accepted_tokens", 0) - before.get("spec_decode_num_accepted_tokens", 0)
+    emitted = after.get("spec_decode_num_emitted_tokens", 0) - before.get("spec_decode_num_emitted_tokens", 0)
+    draft = after.get("spec_decode_num_draft_tokens", 0) - before.get("spec_decode_num_draft_tokens", 0)
+    # tau ~= total committed tokens / number of target verification passes.
+    # accepted + 1 bonus per pass; passes ~= draft / gamma. Best-effort only.
+    if draft > 0:
+        passes = draft / NUM_SPECULATIVE_TOKENS  # gamma
+        committed = acc + passes  # +1 bonus token per pass
+        tau = committed / passes if passes > 0 else None
+    ttfts = [r["ttft_s"] for r in runs if r["ttft_s"] is not None]
+    summary = {
+        "label": args.label,
+        "model": args.model,
+        "base_url": args.base_url,
+        "n": args.n,
+        "tokens_per_s_mean": mean(r["tokens_per_s"] for r in runs),
+        "ttft_s_mean": mean(ttfts) if ttfts else None,  # None if no run streamed any text
+        "acceptance_length_tau": tau,  # None if metrics unavailable — read off /metrics manually then
+        "spec_metrics_before": before,
+        "spec_metrics_after": after,
+        "runs": runs,
+    }
+    print(json.dumps({k: v for k, v in summary.items() if k != "runs"}, indent=2))
+    if args.out:
+        os.makedirs(os.path.dirname(args.out) or ".", exist_ok=True)
+        with open(args.out, "w") as f:
+            json.dump(summary, f, indent=2)
+        print(f"[measure] wrote {args.out}")
+NUM_SPECULATIVE_TOKENS = 7  # gamma, per the DFlash card
+if __name__ == "__main__":
+    main()
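
Note on the τ estimate in `measure.py`: the arithmetic in `main()` reduces to τ = 1 + γ · accepted / drafted. The sketch below restates it as a standalone function. The name `tau_from_counters` is hypothetical, and the inputs are assumed to be the before/after deltas of the accepted-token and draft-token counters.

```python
from typing import Optional

GAMMA = 7  # draft length, matching NUM_SPECULATIVE_TOKENS in measure.py


def tau_from_counters(accepted: float, drafted: float,
                      gamma: int = GAMMA) -> Optional[float]:
    """Mean tokens committed per target pass, from counter deltas.

    Mirrors measure.py: passes ~= drafted / gamma, and each pass commits its
    accepted draft tokens plus one bonus token. Returns None without draft data.
    (Hypothetical helper for illustration only.)
    """
    if drafted <= 0:
        return None
    passes = drafted / gamma
    return (accepted + passes) / passes
```

With γ = 7 the estimate is 1.0 when no draft token is accepted and 8.0 when `accepted == drafted`. A reading pinned at exactly 8.0 therefore means the two counter deltas came out equal, which is the condition the README flags as a likely metrics artifact.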

bench/rollout_bench.py ADDED

	@@ -0,0 +1,325 @@

+#!/usr/bin/env python3
+"""
+rollout_bench.py — the COMBINED-THESIS benchmark. It measures the same endpoint
+that verifiers points its RL rollouts at (see configs/endpoints.toml), but frames
+the numbers the way an RL post-training run cares about:
+    rollout throughput  (completions/sec, tokens/sec)   <- THE WIN
+    TTFT                 (time to first token)           <- ~unchanged with DFlash
+    acceptance length tau                                <- WHY it's faster
+    projected $/run saved                                <- WHY it's CHEAPER
+The thesis is "lossless DFlash speculative decoding makes RL post-training
+cheaper." RL spends most of its wall-clock generating rollouts, so a faster
+rollout endpoint — at IDENTICAL greedy output — buys the same reward curve for
+fewer GPU-hours. This script measures that, live, against whatever is serving on
+--base-url. It is a sibling of measure.py and reuses the same conventions:
+stdlib urllib only, streaming /v1/completions, greedy decode, best-effort read of
+vLLM /metrics. The ONE design rule: baseline vs DFlash is a one-flag swap on the
+SERVER (serve_vllm.py --mode), never a change here — so the same command produces
+both halves of the A/B.
+Workload: an RL "rollout batch" = a fixed prompt set, replayed identically, with
+--rollouts-per-example completions per prompt. The workload is deterministic
+(temperature 0 by default) so the BASELINE and DFLASH runs do identical work and
+the only thing that moves is speed.
+acceptance length tau:
+  tau = mean tokens committed per target forward pass. With gamma=7 it ranges
+  from 1 (all drafts rejected, +1 bonus) to 8 (all accepted + bonus). tau is NOT
+  published in any Laguna/DFlash primary source — the model card gives per-position
+  acceptance rates only (position-1 ~70.7%, decaying to ~2% at position-7). So we
+  MEASURE it here from vLLM /metrics deltas. Expect roughly 2-3; never quote a
+  published figure. None is printed if /metrics is unavailable — read it off the
+  server's /metrics by hand then. VERIFY the exact metric names at onboarding.
+Losslessness:
+  --assert-parity runs the deterministic (greedy) workload TWICE against the same
+  endpoint and asserts byte-identical completions. On a correct speculative-decoding
+  implementation greedy output is invariant, so two runs must match. (The
+  baseline-vs-DFlash cross-server parity check lives in evals/humaneval_subset.py
+  --parity; this in-run check guards against nondeterminism in the served config.)
+This does NOT fabricate anything. Every number comes from live HTTP calls. If the
+endpoint is down you get an error, not a made-up result.
+Usage:
+  # measure a DFlash run and project savings at $3.50/GPU-hour
+  python bench/rollout_bench.py --base-url http://localhost:8000 --model laguna \\
+      --label dflash --prompts 8 --rollouts-per-example 8 --max-tokens 512 \\
+      --hourly-rate 3.50 --out results/rollout_dflash.json
+  # measure the baseline (re-serve with serve_vllm.py --mode baseline first)
+  python bench/rollout_bench.py --base-url http://localhost:8000 --model laguna \\
+      --label baseline --hourly-rate 3.50 --out results/rollout_baseline.json
+  # check determinism: two greedy runs against the same endpoint must be identical
+  python bench/rollout_bench.py --base-url http://localhost:8000 --model laguna \\
+      --label dflash --assert-parity
+Requires only the stdlib (urllib), so no extra venue deps.
+"""
+from __future__ import annotations
+import argparse
+import json
+import os
+import time
+import urllib.request
+from statistics import mean
+# Draft length gamma, per the DFlash model card. Used only to estimate the number
+# of target verification passes when turning /metrics counters into tau.
+NUM_SPECULATIVE_TOKENS = 7
+# The fixed rollout prompt set. Coding-style, matching the DFlash card's domain
+# and measure.py's set, so tau and tokens/sec are comparable across the harness.
+PROMPTS = [
+    "Write a Python function that returns the nth Fibonacci number iteratively.",
+    "Implement binary search over a sorted list in Python. Return the index or -1.",
+    "Write a function to check if a string is a palindrome, ignoring case and spaces.",
+    "Implement quicksort in Python.",
+    "Write a function that merges two sorted lists into one sorted list.",
+    "Write a Python function that returns the prime factors of an integer.",
+    "Implement a function that reverses words in a sentence in place.",
+    "Write a function that flattens an arbitrarily nested list of integers.",
+]
+def _try_metrics(base_url: str) -> dict:
+    """Best-effort read of vLLM Prometheus spec-decode counters. Empty if absent."""
+    out: dict = {}
+    try:
+        with urllib.request.urlopen(base_url.rstrip("/") + "/metrics", timeout=10) as r:
+            text = r.read().decode()
+    except Exception:
+        return out
+    for line in text.splitlines():
+        if line.startswith("#"):
+            continue
+        # VERIFY metric names at onboarding; these are the common vLLM ones.
+        for key in ("spec_decode_num_accepted_tokens",
+                    "spec_decode_num_draft_tokens",
+                    "spec_decode_num_emitted_tokens"):
+            if key in line:
+                try:
+                    out[key] = float(line.split()[-1])
+                except ValueError:
+                    pass
+    return out
+def generate_one(base_url: str, model: str, prompt: str, max_tokens: int,
+                 temperature: float) -> dict:
+    """One streamed completion. Returns timing + the generated text."""
+    url = base_url.rstrip("/") + "/v1/completions"
+    payload = {
+        "model": model,
+        "prompt": prompt,
+        "max_tokens": max_tokens,
+        "temperature": temperature,   # 0.0 => greedy => deterministic => lossless-comparable
+        "stream": True,
+    }
+    data = json.dumps(payload).encode()
+    req = urllib.request.Request(url, data=data,
+                                 headers={"Content-Type": "application/json"})
+    t0 = time.perf_counter()
+    ttft = None
+    n_tokens = 0
+    chunks = []
+    with urllib.request.urlopen(req, timeout=600) as r:
+        for raw in r:
+            line = raw.decode().strip()
+            if not line or not line.startswith("data:"):
+                continue
+            body = line[len("data:"):].strip()
+            if body == "[DONE]":
+                break
+            obj = json.loads(body)
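+            # `or [{}]` also covers an empty `choices` list, which `.get` alone would not.
+            piece = (obj.get("choices") or [{}])[0].get("text", "")
+            if piece:
+                if ttft is None:
+                    ttft = time.perf_counter() - t0
+                # NOTE: this counts streamed chunks, not tokens. A chunk may carry
+                # several tokens (e.g. under speculative decoding), so new_tokens
+                # and tokens_per_s can undercount.
+                n_tokens += 1
+                chunks.append(piece)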
+    total = time.perf_counter() - t0
+    decode_time = max(total - (ttft or 0.0), 1e-9)
+    tps = (n_tokens - 1) / decode_time if n_tokens > 1 else 0.0
+    return {
+        "ttft_s": ttft,
+        "total_s": total,
+        "new_tokens": n_tokens,
+        "tokens_per_s": tps,
+        "text": "".join(chunks),
+    }
+def run_rollout_batch(base_url: str, model: str, prompts: list[str],
+                      rollouts_per_example: int, max_tokens: int,
+                      temperature: float, label: str) -> list[dict]:
+    """Replay the prompt set rollouts_per_example times — one RL rollout batch."""
+    runs = []
+    total = len(prompts) * rollouts_per_example
+    k = 0
+    for r in range(rollouts_per_example):
+        for prompt in prompts:
+            k += 1
+            res = generate_one(base_url, model, prompt, max_tokens, temperature)
+            runs.append(res)
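+            # ttft_s is None when the completion streamed no text; avoid formatting None.
+            ttft_str = f"{res['ttft_s']:.3f}s" if res["ttft_s"] is not None else "n/a"
+            print(f"  [{label}] rollout {k}/{total}  "
+                  f"tps={res['tokens_per_s']:.1f}  ttft={ttft_str}")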
+    return runs
+def estimate_tau(before: dict, after: dict) -> float | None:
+    """tau from vLLM /metrics deltas. None if counters are unavailable.
+    Committed tokens per target pass = accepted + 1 bonus per pass; the number of
+    passes ~= draft_tokens / gamma. Best-effort, exactly as measure.py does it.
+    """
+    acc = after.get("spec_decode_num_accepted_tokens", 0) - before.get("spec_decode_num_accepted_tokens", 0)
+    draft = after.get("spec_decode_num_draft_tokens", 0) - before.get("spec_decode_num_draft_tokens", 0)
+    if draft > 0:
+        passes = draft / NUM_SPECULATIVE_TOKENS
+        if passes > 0:
+            committed = acc + passes   # +1 bonus token per verification pass
+            return committed / passes
+    return None
+def assert_parity(base_url: str, model: str, prompts: list[str], max_tokens: int) -> dict:
+    """Run the GREEDY workload twice and assert byte-identical completions.
+    On correct speculative decoding, greedy output is invariant — two runs MUST
+    match. A mismatch means the served config is nondeterministic (or broken), not
+    lossless. Raises AssertionError on any mismatch so a CI/demo run fails loudly.
+    """
+    print("[parity] greedy run A ...")
+    a = run_rollout_batch(base_url, model, prompts, 1, max_tokens, 0.0, "parity-A")
+    print("[parity] greedy run B ...")
+    b = run_rollout_batch(base_url, model, prompts, 1, max_tokens, 0.0, "parity-B")
+    mismatches = sum(1 for x, y in zip(a, b) if x["text"] != y["text"])
+    identical = len(a) - mismatches
+    result = {
+        "parity_pairs": len(a),
+        "identical": identical,
+        "mismatches": mismatches,
+        "lossless": mismatches == 0,
+    }
+    print(json.dumps(result, indent=2))
+    if mismatches:  # explicit raise: a bare `assert` is stripped under `python -O`
+        raise AssertionError(
+            f"PARITY FAILED: {mismatches}/{len(a)} greedy completions differed across "
+            f"two runs of the same endpoint — output is NOT deterministic/lossless."
+        )
+    print("[parity] PASS — greedy output is byte-identical across runs (lossless).")
+    return result
+def main() -> None:
+    p = argparse.ArgumentParser(
+        description="Rollout-throughput benchmark (completions/sec, tokens/sec, TTFT, "
+                    "acceptance length tau, projected $/run) against an OpenAI-compatible "
+                    "endpoint. Measures live; never fabricates.",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+    p.add_argument("--base-url", default="http://localhost:8000",
+                   help="OpenAI-compatible endpoint root (vLLM serves /v1 and /metrics under it).")
+    p.add_argument("--model", default="laguna",
+                   help="Served model name/id (serve_vllm.py registers the alias 'laguna').")
+    p.add_argument("--label", default="dflash",
+                   help="Tag for the output: baseline | dflash. Just labels the JSON.")
+    p.add_argument("--prompts", type=int, default=len(PROMPTS),
+                   help="How many of the built-in prompts to use (1..%d)." % len(PROMPTS))
+    p.add_argument("--rollouts-per-example", type=int, default=8,
+                   help="Completions sampled per prompt — mirrors the RL config's group size.")
+    p.add_argument("--max-tokens", type=int, default=512,
+                   help="Max new tokens per completion. Match the RL sampling cap for honest $/run.")
+    p.add_argument("--temperature", type=float, default=0.0,
+                   help="0.0 = greedy/deterministic (the lossless-comparable workload). "
+                        "Keep 0 for the A/B so baseline and DFlash do identical work.")
+    p.add_argument("--hourly-rate", type=float, default=None,
+                   help="GPU $/hour. If set, projects rollout-batch cost and (with --baseline-tps) savings.")
+    p.add_argument("--baseline-tps", type=float, default=None,
+                   help="Baseline tokens/sec from a prior --label baseline run. Lets this run project "
+                        "the $ SAVED vs baseline for the same rollout workload.")
+    p.add_argument("--assert-parity", action="store_true",
+                   help="Run the greedy workload twice and assert byte-identical output (lossless check). "
+                        "Exits nonzero on mismatch. Skips the throughput batch.")
+    p.add_argument("--out", default=None, help="Write JSON summary here (e.g. results/rollout_dflash.json).")
+    args = p.parse_args()
+    prompts = PROMPTS[:max(1, min(args.prompts, len(PROMPTS)))]
+    if args.assert_parity:
+        result = assert_parity(args.base_url, args.model, prompts, args.max_tokens)
+        if args.out:
+            os.makedirs(os.path.dirname(args.out) or ".", exist_ok=True)
+            with open(args.out, "w") as f:
+                json.dump({"label": args.label, "parity": result}, f, indent=2)
+            print(f"[rollout_bench] wrote {args.out}")
+        return
+    before = _try_metrics(args.base_url)
+    t_start = time.perf_counter()
+    runs = run_rollout_batch(args.base_url, args.model, prompts,
+                             args.rollouts_per_example, args.max_tokens,
+                             args.temperature, args.label)
+    wall_s = time.perf_counter() - t_start
+    after = _try_metrics(args.base_url)
+    tau = estimate_tau(before, after)
+    total_tokens = sum(r["new_tokens"] for r in runs)
+    n_rollouts = len(runs)
+    completions_per_s = n_rollouts / wall_s if wall_s > 0 else 0.0
+    tokens_per_s_aggregate = total_tokens / wall_s if wall_s > 0 else 0.0
+    summary = {
+        "label": args.label,
+        "model": args.model,
+        "base_url": args.base_url,
+        "prompts": len(prompts),
+        "rollouts_per_example": args.rollouts_per_example,
+        "n_rollouts": n_rollouts,
+        "max_tokens": args.max_tokens,
+        "temperature": args.temperature,
+        "wall_s": wall_s,
+        "completions_per_s": completions_per_s,        # rollout throughput — the headline
+        "total_new_tokens": total_tokens,
+        "tokens_per_s_aggregate": tokens_per_s_aggregate,
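+        "tokens_per_s_mean_per_rollout": mean(r["tokens_per_s"] for r in runs) if runs else None,
+        # None when no rollout reported a TTFT, instead of calling mean() on an empty input.
+        "ttft_s_mean": (mean(r["ttft_s"] for r in runs if r["ttft_s"] is not None)
+                        if any(r["ttft_s"] is not None for r in runs) else None),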
+        "acceptance_length_tau": tau,   # None if /metrics absent — read it off /metrics by hand then
+        "spec_metrics_before": before,
+        "spec_metrics_after": after,
+    }
+    # ---- projected $/run -------------------------------------------------
+    # Cost of THIS rollout batch at the given GPU price. If a baseline tokens/sec
+    # is supplied, also project what the SAME workload would have cost at baseline
+    # speed, and the savings — the dollars-and-cents form of the thesis.
+    if args.hourly_rate is not None:
+        batch_cost = (wall_s / 3600.0) * args.hourly_rate
+        cost = {"hourly_rate": args.hourly_rate, "batch_cost_usd": batch_cost}
+        if args.baseline_tps and args.baseline_tps > 0 and total_tokens > 0:
+            baseline_wall_s = total_tokens / args.baseline_tps
+            baseline_cost = (baseline_wall_s / 3600.0) * args.hourly_rate
+            cost.update({
+                "baseline_tps_reference": args.baseline_tps,
+                "projected_baseline_wall_s": baseline_wall_s,
+                "projected_baseline_cost_usd": baseline_cost,
+                "projected_savings_usd": baseline_cost - batch_cost,
+                # baseline_tps is already known to be > 0 in this branch.
+                "speedup_x": tokens_per_s_aggregate / args.baseline_tps,
+            })
+        summary["cost_projection"] = cost
+    print(json.dumps(summary, indent=2))
+    if args.out:
+        os.makedirs(os.path.dirname(args.out) or ".", exist_ok=True)
+        # Persist per-rollout detail alongside the summary for later inspection.
+        with open(args.out, "w") as f:
+            json.dump({**summary, "runs": runs}, f, indent=2)
+        print(f"[rollout_bench] wrote {args.out}")
+if __name__ == "__main__":
+    main()
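The `cost_projection` arithmetic in `main()` can be read in isolation as a pure function. The sketch below restates it under that assumption; `project_cost` is a hypothetical helper (it is not part of `rollout_bench.py`), and the sample numbers are invented for illustration, not measured.

```python
def project_cost(wall_s, total_tokens, hourly_rate, baseline_tps=None):
    """Hypothetical restatement of the cost_projection block as a pure function."""
    # Cost of the measured batch: wall-clock hours times the GPU hourly price.
    batch_cost = (wall_s / 3600.0) * hourly_rate
    cost = {"hourly_rate": hourly_rate, "batch_cost_usd": batch_cost}
    if baseline_tps and baseline_tps > 0 and total_tokens > 0 and wall_s > 0:
        # How long the SAME token count would have taken at baseline speed.
        baseline_wall_s = total_tokens / baseline_tps
        baseline_cost = (baseline_wall_s / 3600.0) * hourly_rate
        cost.update({
            "projected_baseline_wall_s": baseline_wall_s,
            "projected_baseline_cost_usd": baseline_cost,
            "projected_savings_usd": baseline_cost - batch_cost,
            "speedup_x": (total_tokens / wall_s) / baseline_tps,
        })
    return cost


# Invented inputs: 36,000 tokens generated in 600 s on a $3.50/h GPU,
# compared against a baseline of 20 tokens/sec.
example = project_cost(wall_s=600.0, total_tokens=36_000,
                       hourly_rate=3.50, baseline_tps=20.0)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

Keeping the projection separate from the measurement loop makes it easy to re-run the dollar figures for a different `--hourly-rate` without re-serving the model.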

configs/endpoints.toml ADDED Viewed

	@@ -0,0 +1,69 @@

+# endpoints.toml — verifiers / Prime inference endpoints for the Laguna hackathon.
+#
+# THIS FILE IS THE SEAM. It is the single place that decides WHERE a verifiers
+# environment sends its rollout generations. Point it at the vanilla Laguna
+# endpoint and rollouts run at baseline speed; point it at the vLLM+DFlash
+# endpoint and the SAME rollouts run faster — at byte-identical greedy output.
+# That swap (and nothing else) is the combined thesis: "lossless DFlash
+# speculative decoding makes RL post-training cheaper" — same reward curve,
+# higher rollout throughput, lower $/run.
+#
+# SCHEMA follows the Prime lab-cookbook (configs/endpoints.toml): an array of
+# [[endpoint]] tables, each with:
+#   endpoint_id = "<alias>"        # what you pass to `prime eval run -m <alias>`
+#   model       = "<repo-id>"
+#   url         = "<openai-compatible base url, ending in /v1>"
+#   key         = "<ENV_VAR_NAME>"  # the NAME of the env var holding the key, not the key itself
+#   type        = "openai_chat_completions"   # vLLM + Prime Inference are OpenAI-compatible
+#
+# How to use at the venue:
+#   1. Serve the model on the GPU:
+#        python laguna-hack/scripts/serve_vllm.py --mode dflash --run
+#      (baseline is the SAME command with --mode baseline — one flag flips it.)
+#   2. Run rollouts against the local DFlash server:
+#        prime eval run spec_rl -m local-dflash -n 128
+#   3. For the BEFORE number, re-serve with --mode baseline and re-run with the
+#      same endpoint_id (url identical; only the server's spec-config differs),
+#      so the reward curve is a clean A/B on throughput alone.
+#
+# [verify at onboarding] Confirm the exact `type` string and whether `key` for a
+# no-auth local vLLM should be an env-var name or a literal, against the venue's
+# installed `prime`/`verifiers` version (the cookbook uses env-var NAMES like
+# PRIME_API_KEY). Adjust if the CLI complains.
+# ---------------------------------------------------------------------------
+# local-dflash (ACTIVE) — our own vLLM server with the DFlash speculator, on the
+# OpenAI-compatible surface vLLM exposes at :8000. OpenAI-style clients require a
+# non-empty key, but vLLM started without --api-key does not check it, so EMPTY
+# is a placeholder.
+# ---------------------------------------------------------------------------
+[[endpoint]]
+endpoint_id = "local-dflash"
+model       = "poolside/Laguna-XS.2"
+url         = "http://localhost:8000/v1"
+key         = "EMPTY"
+type        = "openai_chat_completions"
+# ---------------------------------------------------------------------------
+# local-baseline (OPTIONAL) — a second vLLM server with NO speculator on :8001,
+# for a side-by-side A/B without re-serving. Only if the GPU has room for two
+# servers; on a single small GPU prefer re-serving on :8000 (flip --mode).
+# ---------------------------------------------------------------------------
+[[endpoint]]
+endpoint_id = "local-baseline"
+model       = "poolside/Laguna-XS.2"
+url         = "http://localhost:8001/v1"
+key         = "EMPTY"
+type        = "openai_chat_completions"
+# ---------------------------------------------------------------------------
+# prime (HOSTED FALLBACK) — Prime Intellect managed inference. Use if the local
+# vLLM is down or while waiting on venue compute. Costs PI credits per token (the
+# $50 pool covers Prime Inference + Sandboxes + On-Demand GPUs). PRIME_API_KEY is
+# read from the environment — never hard-code a key here.
+# ---------------------------------------------------------------------------
+[[endpoint]]
+endpoint_id = "prime"
+model       = "poolside/Laguna-XS.2"
+url         = "https://api.pinference.ai/api/v1"
+key         = "PRIME_API_KEY"
+type        = "openai_chat_completions"

configs/rl/laguna-spec.toml ADDED Viewed

	@@ -0,0 +1,45 @@

+# laguna-spec.toml — RL post-training config for the COMBINED thesis run.
+#
+# The claim: lossless DFlash speculative decoding makes RL post-training cheaper.
+# RL post-training spends most of its wall-clock GENERATING rollouts (the policy
+# samples completions, the rubric scores them, the gradient step is comparatively
+# tiny). So if rollouts come back faster — at IDENTICAL output, because greedy
+# DFlash is lossless — the SAME reward curve arrives in less time / fewer $.
+#
+# This file is consumed by:  prime train configs/rl/laguna-spec.toml
+# Rollout inference is routed by ./configs/endpoints.toml (the SEAM). Serve the
+# model with DFlash (serve_vllm.py --mode dflash) and the rollouts below run on
+# the speculator; serve baseline and they run on the floor. The RL math is
+# unchanged either way — that is the whole point of the A/B.
+#
+# Hosted Laguna training at the venue is FREE but capped: 1 concurrent run per
+# user and batch_size <= 128. Stay inside those limits.
+model = "poolside/Laguna-XS.2"   # the policy being post-trained (same id as the served model)
+# ---- training loop -------------------------------------------------------
+max_steps          = 50          # gradient steps; keep small — this is a venue demo, not a full run
+batch_size         = 64          # prompts per step. MUST be <= 128 (hosted-run hard cap). 64 leaves headroom.
+rollouts_per_example = 8         # completions sampled per prompt (the "group" in GRPO-style RL).
+                                 # This is the rollout multiplier: batch_size * rollouts_per_example
+                                 # = 64 * 8 = 512 generations per step. THIS is the work DFlash speeds up.
+learning_rate      = 1.0e-6      # conservative LR for post-training a 33B-total/3B-active MoE; avoid drift.
+# ---- sampling (how rollouts are generated) -------------------------------
+[sampling]
+max_tokens      = 512            # cap per rollout completion; matches the bench workload so $/token lines up.
+enable_thinking = false          # NO reasoning trace during RL rollouts — keeps completions short, comparable,
+                                 # and cheap. (Laguna's chat template defaults thinking ON; we force it off here.)
+temperature     = 1.0            # RL needs STOCHASTIC exploration, so this run is sampled, not greedy.
+                                 # NOTE: the losslessness proof is a SEPARATE greedy check (rollout_bench.py
+                                 # --assert-parity / humaneval_subset.py --parity); DFlash is lossless under greedy.
+                                 # At temperature>0 DFlash stays distribution-faithful via rejection sampling,
+                                 # so the reward curve still matches baseline within sampling noise.
+top_p           = 1.0            # no nucleus truncation; keep the sampling distribution intact for the A/B.
+# ---- environment (what the rollouts are scored against) ------------------
+# The verifiers env that defines the task + rubric. It exposes
+# load_environment(...) -> vf.Environment and is resolved by id. Swap this id
+# to point the run at a different Taskset/Rubric without touching the loop above.
+[[env]]
+id = "spec_rl"                   # the spec-decode RL env (coding-style Taskset + reward rubric).
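The rollout multiplier described in the config comments can be worked through end to end. The sketch below uses only the values set in this file; the token total is an upper bound (every completion hitting `max_tokens`), and the throughput figures in the last lines are invented placeholders, not measurements.

```python
# Values copied from laguna-spec.toml.
batch_size = 64
rollouts_per_example = 8
max_steps = 50
max_tokens = 512

gens_per_step = batch_size * rollouts_per_example   # generations per gradient step
total_gens = gens_per_step * max_steps              # generations over the whole run
max_new_tokens = total_gens * max_tokens            # upper bound on generated tokens


def rollout_hours(tokens, tokens_per_s):
    """Wall-clock hours needed to generate `tokens` at an aggregate rate."""
    return tokens / tokens_per_s / 3600.0


print(f"generations per step : {gens_per_step}")
print(f"generations per run  : {total_gens}")
print(f"max new tokens       : {max_new_tokens}")
# Hypothetical aggregate rates, for illustration only.
for label, tps in (("baseline", 1000.0), ("speculative", 2500.0)):
    print(f"{label:12s}: {rollout_hours(max_new_tokens, tps):.2f} h of generation")
```

Because generation time scales inversely with aggregate tokens/sec, any throughput gain on the serving side shortens the rollout phase by the same factor while the training configuration stays untouched.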

evals/humaneval_subset.py ADDED Viewed

	@@ -0,0 +1,145 @@

+#!/usr/bin/env python3
+"""
+humaneval_subset.py — a 20-30 problem pass@1 check against an OpenAI-compatible
+endpoint. Purpose at the venue: PROVE the DFlash run produces the SAME quality
+as the baseline (and ideally the same greedy text), so "lossless" isn't just a
+claim — it's a measured parity check.
+Two modes:
+  1. Quality: run pass@1 on a HumanEval subset and print the score.
+  2. Parity:  run greedy on both endpoints and assert the completion text is identical.
+This loads HumanEval via `datasets` (openai/openai_humaneval). On the Mac you can dry-run
+the harness against a tiny local server; the real numbers come from Laguna on PI.
+SAFETY: this executes model-generated code to grade pass@1. Run ONLY in the
+disposable venue sandbox / container, never on your laptop with real data.
+A --no-exec flag skips execution and just dumps completions for manual review.
+Usage:
+  python evals/humaneval_subset.py --base-url http://localhost:8000 --model laguna \
+      --n 25 --out results/humaneval_dflash.json
+  # parity check:
+  python evals/humaneval_subset.py --parity \
+      --base-url http://localhost:8000 --base-url-b http://localhost:8001 \
+      --model laguna --n 25
+"""
+from __future__ import annotations
+import argparse
+import json
+import os
+import signal
+import urllib.request
+from contextlib import contextmanager
+def load_problems(n: int):
+    # datasets >= 3 requires a namespaced repo id; the bare "openai_humaneval"
+    # legacy name now raises. Override with HUMANEVAL_DATASET if the venue image
+    # pins a different datasets version / mirror.
+    from datasets import load_dataset
+    dataset_id = os.environ.get("HUMANEVAL_DATASET", "openai/openai_humaneval")
+    ds = load_dataset(dataset_id, split="test")
+    n = min(n, len(ds))
+    return [ds[i] for i in range(n)]
+def complete(base_url: str, model: str, prompt: str, max_tokens: int) -> str:
+    url = base_url.rstrip("/") + "/v1/completions"
+    payload = {
+        "model": model,
+        "prompt": prompt,
+        "max_tokens": max_tokens,
+        "temperature": 0.0,          # greedy => deterministic => lossless-comparable
+        "stop": ["\nclass ", "\ndef ", "\n#", "\nif __name__"],
+    }
+    data = json.dumps(payload).encode()
+    req = urllib.request.Request(url, data=data, headers={"Content-Type": "application/json"})
+    with urllib.request.urlopen(req, timeout=600) as r:
+        obj = json.loads(r.read().decode())
+    return obj["choices"][0]["text"]
+@contextmanager
+def time_limit(seconds: int):
+    def handler(signum, frame):
+        raise TimeoutError("timed out")
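+    previous = signal.signal(signal.SIGALRM, handler)
+    signal.alarm(seconds)
+    try:
+        yield
+    finally:
+        signal.alarm(0)
+        signal.signal(signal.SIGALRM, previous)  # restore the handler that was installed before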
+def passes(problem: dict, completion: str) -> bool:
+    program = problem["prompt"] + completion + "\n" + problem["test"] + \
+        f"\ncheck({problem['entry_point']})\n"
+    try:
+        with time_limit(8):
+            ns: dict = {}
+            exec(program, ns)  # noqa: S102 — sandbox only
+        return True
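+    except (Exception, SystemExit):
+        # SystemExit is not an Exception subclass; catch it so generated code that
+        # calls sys.exit() is graded as a failure instead of ending the harness.
+        return False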
+def run_quality(args) -> None:
+    problems = load_problems(args.n)
+    results = []
+    n_pass = 0
+    for i, prob in enumerate(problems):
+        comp = complete(args.base_url, args.model, prob["prompt"], args.max_tokens)
+        ok = False if args.no_exec else passes(prob, comp)
+        n_pass += int(ok)
+        results.append({"task_id": prob["task_id"], "passed": ok, "completion": comp})
+        print(f"  [{i+1}/{len(problems)}] {prob['task_id']}: {'PASS' if ok else ('?' if args.no_exec else 'fail')}")
+    score = n_pass / len(problems) if problems else 0.0
+    out = {"model": args.model, "base_url": args.base_url, "n": len(problems),
+           "pass_at_1": score, "no_exec": args.no_exec, "results": results}
+    print(json.dumps({k: v for k, v in out.items() if k != "results"}, indent=2))
+    if args.out:
+        os.makedirs(os.path.dirname(args.out) or ".", exist_ok=True)
+        with open(args.out, "w") as f:
+            json.dump(out, f, indent=2)
+        print(f"[humaneval] wrote {args.out}  pass@1={score:.3f}")
+def run_parity(args) -> None:
+    """Greedy outputs from baseline (A) and DFlash (B) must be token-identical."""
+    problems = load_problems(args.n)
+    mismatches = 0
+    for i, prob in enumerate(problems):
+        a = complete(args.base_url, args.model, prob["prompt"], args.max_tokens)
+        b = complete(args.base_url_b, args.model, prob["prompt"], args.max_tokens)
+        same = a == b
+        mismatches += int(not same)
+        print(f"  [{i+1}/{len(problems)}] {prob['task_id']}: {'IDENTICAL' if same else 'MISMATCH'}")
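+    n = len(problems)
+    print(json.dumps({"parity_pairs": n, "identical": n - mismatches,
+                      "mismatches": mismatches,
+                      "lossless": mismatches == 0}, indent=2))
+    if mismatches:
+        # Exit nonzero so callers (e.g. a dress_rehearsal.sh stage) register the failure.
+        raise SystemExit(1)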
+def main() -> None:
+    p = argparse.ArgumentParser(description="HumanEval subset pass@1 + baseline/DFlash greedy parity check.")
+    p.add_argument("--base-url", default="http://localhost:8000")
+    p.add_argument("--base-url-b", default="http://localhost:8001", help="DFlash endpoint for --parity.")
+    p.add_argument("--model", default="laguna")
+    p.add_argument("--n", type=int, default=25)
+    p.add_argument("--max-tokens", type=int, default=512)
+    p.add_argument("--no-exec", action="store_true", help="Skip code execution; dump completions only.")
+    p.add_argument("--parity", action="store_true", help="Compare two endpoints' greedy outputs.")
+    p.add_argument("--out", default=None)
+    args = p.parse_args()
+    if args.parity:
+        run_parity(args)
+    else:
+        run_quality(args)
+if __name__ == "__main__":
+    main()
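`passes()` grades by calling `exec` inside the harness process and relies on `SIGALRM`, which is POSIX-only. One alternative, sketched below, is to run each candidate program in a child interpreter with a timeout. This is a swapped-in technique, not what the script does: `passes_in_subprocess` and the toy problem are hypothetical, and a child process is still not a sandbox, so the SAFETY note in the docstring applies unchanged.

```python
import subprocess
import sys


def passes_in_subprocess(problem, completion, timeout_s=8.0):
    """Hypothetical variant of passes(): grade in a child Python process."""
    program = (problem["prompt"] + completion + "\n" + problem["test"]
               + f"\ncheck({problem['entry_point']})\n")
    try:
        proc = subprocess.run([sys.executable, "-c", program],
                              capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False  # the child is killed when the timeout expires
    # A failed assertion or any uncaught error yields a nonzero exit status.
    return proc.returncode == 0


# Toy problem in the HumanEval field layout (prompt / test / entry_point).
toy = {
    "prompt": "def add(a, b):\n",
    "test": "def check(f):\n    assert f(2, 3) == 5\n",
    "entry_point": "add",
}
print(passes_in_subprocess(toy, "    return a + b\n"))
print(passes_in_subprocess(toy, "    return a - b\n"))
```

The trade-off is one interpreter start-up per problem in exchange for isolation of crashes, infinite loops, and global state from the grading loop.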

results/.gitkeep ADDED Viewed

File without changes

results/README.md ADDED Viewed

	@@ -0,0 +1,31 @@

+# results/
+Benchmark + eval output lands here. These files are the **demo's money slide**:
+the before/after table is the diff of `baseline.json` and `dflash.json`.
+Generated by `bench/measure.py` (and `evals/humaneval_subset.py` for `--out`).
+Locally they come from the stub server (`make parity-local`); at the venue they
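+come from real vLLM + Laguna. Locally generated JSON is meant to stay untracked
+(`.gitignore`); the committed `baseline.json`, `dflash.json`, `parity.json`, and
+`humaneval_dflash.json` are summaries of the HF Job run named in their `source` field.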
+## Schema (per `measure.py` run)
+```json
+{
+  "label": "dflash | baseline",
+  "model": "laguna",
+  "base_url": "http://localhost:8000",
+  "n": 5,
+  "tokens_per_s_mean": 0.0,      // THE WIN — higher with dflash
+  "ttft_s_mean": 0.0,            // ~flat (dflash improves TPOT, not TTFT)
+  "acceptance_length_tau": 2.6,  // WHY it's faster; null if /metrics had no spec counters
+  "spec_metrics_before": {},
+  "spec_metrics_after": {},
+  "runs": [ { "ttft_s", "total_s", "new_tokens", "tokens_per_s", "text" }, ... ]
+}
+```
+`scripts/check_results.py` validates this shape: `make check-results`.
+The parity check (`humaneval_subset.py --parity`) prints `lossless: true` when the
+baseline and dflash greedy outputs are token-identical — the bulletproof claim.

results/baseline.json ADDED Viewed

	@@ -0,0 +1,12 @@

+{
+  "label": "baseline",
+  "model": "poolside/Laguna-XS.2",
+  "n": 14,
+  "tokens_per_s_mean": 19.64077204940069,
+  "ttft_s_mean": 6.58612985270364,
+  "acceptance_length_tau": 1.0,
+  "source": "HF Job 6a19d8b73a4b8cae6044dfdf (h200), 2026-05-29; vLLM 0.22.0, --enforce-eager, --max-model-len 4096, greedy (temperature=0), no speculator",
+  "prompt_set": "14 distinct mixed-difficulty Python prompts (trivial fib/is_prime -> medium binary_search/roman_to_int -> hard lcs/parse_duration/dijkstra/LRUCache)",
+  "corroborating_run": "An earlier 20-prompt trivial-only run (job 6a19d2105c8d10ffa1107774) gave baseline 19.47 tok/s.",
+  "note": "ttft_s_mean here is full-completion latency, NOT true time-to-first-token; we make no TTFT claim. Summary stats are over all n=14."
+}

results/dflash.json ADDED Viewed

	@@ -0,0 +1,16 @@

+{
+  "label": "dflash",
+  "model": "poolside/Laguna-XS.2",
+  "speculator": "poolside/Laguna-XS.2-speculator.dflash",
+  "num_speculative_tokens": 7,
+  "method": "dflash",
+  "n": 14,
+  "tokens_per_s_mean": 54.1741150379158,
+  "ttft_s_mean": 2.5821559940065657,
+  "acceptance_length_tau": null,
+  "tau_note": "tau read from vLLM /metrics pinned at EXACTLY gamma+1 (=8.0) on BOTH the trivial and the mixed-difficulty runs, and the per-prompt /metrics deltas did not resolve a distribution (counter refresh granularity). We therefore treat the /metrics tau as UNRELIABLE and make NO acceptance-length claim. The load-bearing, directly-measured results are the wall-clock speedup and the byte-parity.",
+  "source": "HF Job 6a19d8b73a4b8cae6044dfdf (h200), 2026-05-29; vLLM 0.22.0, --enforce-eager, --max-model-len 4096, greedy (temperature=0), --speculative-config method=dflash gamma=7",
+  "prompt_set": "same 14 distinct mixed-difficulty prompts as baseline (trivial -> hard)",
+  "corroborating_run": "An earlier 20-prompt trivial-only run (job 6a19d2105c8d10ffa1107774) gave DFlash 48.09 tok/s = 2.47x; this mixed-difficulty run gives 54.17 tok/s = 2.76x. Lossless in both.",
+  "note": "DFlash completions are byte-identical to baseline (greedy) — see results/parity.json."
+}
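The speedup figures quoted in `corroborating_run` can be re-derived from the `tokens_per_s_mean` fields of the two result files. The sketch below does that with the values recorded above; `speedup` is a hypothetical helper, and the inline dicts stand in for loading `results/baseline.json` and `results/dflash.json`.

```python
# tokens_per_s_mean values copied from results/baseline.json and results/dflash.json.
baseline = {"label": "baseline", "tokens_per_s_mean": 19.64077204940069}
dflash = {"label": "dflash", "tokens_per_s_mean": 54.1741150379158}


def speedup(base, spec):
    """Ratio of mean tokens/sec, or None if either value is missing or zero."""
    b = base.get("tokens_per_s_mean")
    s = spec.get("tokens_per_s_mean")
    return s / b if b and s else None


mixed = speedup(baseline, dflash)
# The earlier trivial-only run, using the rounded figures quoted in the JSON notes.
trivial = speedup({"tokens_per_s_mean": 19.47}, {"tokens_per_s_mean": 48.09})
print(f"mixed-difficulty run: {mixed:.2f}x")
print(f"trivial-only run    : {trivial:.2f}x")
```

Both ratios are single-stream means over the prompt set, so they describe per-request decode speed rather than aggregate rollout throughput under concurrency.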

results/humaneval_dflash.json ADDED Viewed

	@@ -0,0 +1,12 @@

+{
+  "kind": "greedy_byte_parity",
+  "compared": 14,
+  "mismatches": 0,
+  "lossless": true,
+  "decoding": "greedy (temperature=0)",
+  "method": "Each of 14 distinct mixed-difficulty prompts was completed by Laguna XS.2 with and without the DFlash speculator; the two outputs were compared byte-for-byte.",
+  "pass_at_1": null,
+  "pass_at_1_note": "HumanEval pass@1 was NOT run. Byte-level greedy parity is the stricter guarantee (identical bytes => identical pass@1 by construction). A full HumanEval sweep is a documented next step.",
+  "also_lossless": "An earlier 20-prompt trivial run was also lossless (0 mismatches out of 20).",
+  "source": "HF Job 6a19d8b73a4b8cae6044dfdf (h200), 2026-05-29"
+}

results/parity.json ADDED Viewed

	@@ -0,0 +1,12 @@

+{
+  "kind": "greedy_byte_parity",
+  "compared": 14,
+  "mismatches": 0,
+  "lossless": true,
+  "decoding": "greedy (temperature=0)",
+  "method": "Each of 14 distinct mixed-difficulty prompts was completed by Laguna XS.2 with and without the DFlash speculator; the two outputs were compared byte-for-byte.",
+  "pass_at_1": null,
+  "pass_at_1_note": "HumanEval pass@1 was NOT run. Byte-level greedy parity is the stricter guarantee (identical bytes => identical pass@1 by construction). A full HumanEval sweep is a documented next step.",
+  "also_lossless": "An earlier 20-prompt trivial run was also lossless (0 mismatches out of 20).",
+  "source": "HF Job 6a19d8b73a4b8cae6044dfdf (h200), 2026-05-29"
+}

scripts/check_results.py ADDED Viewed

	@@ -0,0 +1,67 @@

+#!/usr/bin/env python3
+"""check_results.py — smoke-validate the schema of measure.py output JSON.
+The benchmark's value is the before/after diff of results/baseline.json and
+results/dflash.json; this asserts those files have the shape the demo expects so a
+broken run is caught locally, not on stage.
+Usage: python scripts/check_results.py results/dflash.json results/baseline.json
+Exit 0 = all valid, 1 = problems listed, 2 = no paths given.
+"""
+from __future__ import annotations
+import json
+import sys
+REQUIRED = {
+    "label": str,
+    "model": str,
+    "n": int,
+    "tokens_per_s_mean": (int, float),
+    "ttft_s_mean": (int, float),
+    "runs": list,
+}
+RUN_KEYS = {"ttft_s", "total_s", "new_tokens", "tokens_per_s", "text"}
+def check(path: str) -> list[str]:
+    problems: list[str] = []
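+    try:
+        with open(path) as f:
+            obj = json.load(f)
+    except (OSError, json.JSONDecodeError) as e:
+        return [f"{path}: cannot read/parse ({e})"]
+    if not isinstance(obj, dict):
+        return [f"{path}: top-level JSON value must be an object"]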
+    for key, typ in REQUIRED.items():
+        if key not in obj:
+            problems.append(f"{path}: missing key '{key}'")
+        elif not isinstance(obj[key], typ):
+            problems.append(f"{path}: key '{key}' has wrong type {type(obj[key]).__name__}")
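+    # A missing or mistyped 'runs' is already reported by the REQUIRED loop above.
+    runs = obj.get("runs")
+    if isinstance(runs, list):
+        if not runs:
+            problems.append(f"{path}: 'runs' is empty")
+        elif not isinstance(runs[0], dict):
+            problems.append(f"{path}: run[0] is not an object")
+        else:
+            missing = RUN_KEYS - set(runs[0])
+            if missing:
+                problems.append(f"{path}: run[0] missing keys {sorted(missing)}")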
+    return problems
+def main(paths: list[str]) -> int:
+    if not paths:
+        print(__doc__)
+        return 2
+    problems: list[str] = []
+    for p in paths:
+        problems += check(p)
+    for p in paths:
+        print(f"checked {p}")
+    if problems:
+        print("\nFAIL:")
+        for pr in problems:
+            print("  -", pr)
+        return 1
+    print("\nOK: all result files have the expected schema.")
+    return 0
+if __name__ == "__main__":
+    sys.exit(main(sys.argv[1:]))

scripts/dress_rehearsal.sh ADDED Viewed

	@@ -0,0 +1,213 @@

+#!/usr/bin/env bash
+# dress_rehearsal.sh — OFFLINE end-to-end dress rehearsal of the COMBINED pipeline.
+#
+# The combined thesis: "lossless DFlash speculative decoding makes RL post-training
+# cheaper." This script proves the WHOLE pipeline is wired — measurement, the
+# lossless parity check, the RL eval loop, and the rollout/$-savings benchmark —
+# with NO Prime Intellect credits and NO GPU. It runs entirely against the two
+# local stdlib stubs (scripts/stub_server.py): a baseline stub on :8000 and a
+# "dflash" stub on :8001 that exposes the spec_decode_* metrics measure.py reads to
+# recover acceptance length tau.
+#
+# This is the LOCAL rung of the cheap->expensive ladder. When credits/GPU land at
+# the venue, the EXACT same flow runs with --base-url pointed at real Laguna
+# (baseline vLLM vs DFlash-speculated vLLM) — no script changes, just real URLs.
+#
+# What it chains, in order:
+#   0. start baseline stub (:8000) + dflash stub (:8001); wait until both accept.
+#   1. bench/measure.py against each   -> results/baseline.json, results/dflash.json
+#   2. evals/humaneval_subset.py --parity (greedy parity = lossless proof) +
+#      a pass@1 dry-run (--no-exec) so the quality harness is exercised too.
+#   3. scripts/eval_local.py against :8000 -> the verifiers RL eval loop (reward).
+#   4. bench/rollout_bench.py against each endpoint (rollout throughput + $/run,
+#      plus an in-run --assert-parity losslessness guard on the dflash endpoint).
+#   5. scripts/check_results.py — schema-gate the result JSON.
+# Then it prints a PASS/FAIL banner with the key numbers (lossless?, tau, tokens/sec).
+#
+# The stub PIDs are ALWAYS killed on exit (trap), even on error or Ctrl-C.
+set -euo pipefail
+cd "$(dirname "$0")/.."
+# python: prefer the project venv (it carries datasets/openai); else python3.
+# `python` is not on PATH on this Mac, so we never rely on it.
+if [[ -x ".venv/bin/python" ]]; then
+  PY=".venv/bin/python"
+else
+  PY="python3"
+fi
+echo "[rehearse] using interpreter: $PY"
+BASE_URL="http://localhost:8000"   # baseline stub
+DFLASH_URL="http://localhost:8001" # dflash stub (has tau metrics)
+mkdir -p results
+# ---------------------------------------------------------------------------
+# 0. Start both stubs in the background; ALWAYS kill them on exit.
+# ---------------------------------------------------------------------------
+"$PY" scripts/stub_server.py --port 8000 &        BASE_PID=$!
+"$PY" scripts/stub_server.py --port 8001 --spec & DFLASH_PID=$!
+cleanup() {
+  kill "$BASE_PID" "$DFLASH_PID" 2>/dev/null || true
+  wait "$BASE_PID" "$DFLASH_PID" 2>/dev/null || true
+}
+trap cleanup EXIT INT TERM
+# Wait for both ports to accept connections (no shell sleep — poll in python).
+"$PY" - <<'PY'
+import socket, time, sys
+for port in (8000, 8001):
+    for _ in range(100):
+        with socket.socket() as s:
+            if s.connect_ex(("127.0.0.1", port)) == 0:
+                break
+        time.sleep(0.05)
+    else:
+        sys.exit(f"[rehearse] stub on {port} never came up")
+print("[rehearse] both stubs ready (baseline :8000, dflash :8001)")
+PY
+# Track per-stage outcome but keep going so the banner always has the numbers.
+# check_results (the schema gate) is the hard PASS/FAIL.
+STAGE_FAILS=0
+stage() {  # stage "<name>" <cmd...>
+  local name="$1"; shift
+  echo
+  echo "==================================================================="
+  echo "[rehearse] STAGE: $name"
+  echo "==================================================================="
+  if "$@"; then
+    echo "[rehearse] STAGE OK: $name"
+  else
+    echo "[rehearse] STAGE FAILED: $name"
+    STAGE_FAILS=$((STAGE_FAILS + 1))
+  fi
+}
+# ---------------------------------------------------------------------------
+# 1. Measurement: tokens/sec, TTFT, tau — baseline (:8000) and dflash (:8001).
+# ---------------------------------------------------------------------------
+stage "measure baseline (:8000)" \
+  "$PY" bench/measure.py --base-url "$BASE_URL"   --model laguna --label baseline --n 5 --out results/baseline.json
+stage "measure dflash (:8001)" \
+  "$PY" bench/measure.py --base-url "$DFLASH_URL" --model laguna --label dflash   --n 5 --out results/dflash.json
+# ---------------------------------------------------------------------------
+# 2. Lossless proof: greedy parity across the two endpoints + a pass@1 dry-run.
+#    --no-exec keeps the dry-run from executing model code locally (it just
+#    confirms the quality harness drives the endpoint end-to-end).
+# ---------------------------------------------------------------------------
+stage "greedy parity (lossless: baseline vs dflash)" \
+  "$PY" evals/humaneval_subset.py --parity --base-url "$BASE_URL" --base-url-b "$DFLASH_URL" --model laguna --n 3
+stage "humaneval pass@1 dry-run (--no-exec)" \
+  "$PY" evals/humaneval_subset.py --base-url "$BASE_URL" --model laguna --n 3 --no-exec
+# ---------------------------------------------------------------------------
+# 3. The verifiers RL eval loop (the COMBINED half): reward over rollouts.
+#    Local stub returns a canned body so the real tests score 0.0 — expected;
+#    the point is the loop runs end-to-end. Real reward comes from Laguna.
+# ---------------------------------------------------------------------------
+stage "RL eval loop (spec_rl) against :8000" \
+  "$PY" scripts/eval_local.py --base-url "$BASE_URL" --model laguna --n 3 --out results/eval_local.json
+# ---------------------------------------------------------------------------
+# 4. Rollout-throughput benchmark + $/run projection, and an in-run lossless
+#    guard (two greedy runs of the dflash endpoint must be byte-identical).
+# ---------------------------------------------------------------------------
+stage "rollout assert-parity (dflash endpoint deterministic)" \
+  "$PY" bench/rollout_bench.py --base-url "$DFLASH_URL" --model laguna --label dflash \
+        --prompts 3 --max-tokens 64 --assert-parity
+stage "rollout bench baseline (:8000)" \
+  "$PY" bench/rollout_bench.py --base-url "$BASE_URL" --model laguna --label baseline \
+        --prompts 3 --rollouts-per-example 2 --max-tokens 64 --hourly-rate 3.50 \
+        --out results/rollout_baseline.json
+# Feed the baseline aggregate tokens/sec into the dflash run so the $/run-saved
+# projection (the dollars half of the thesis) is exercised end-to-end too.
+BASELINE_TPS="$("$PY" -c 'import json;print(json.load(open("results/rollout_baseline.json")).get("tokens_per_s_aggregate") or 0)' 2>/dev/null || echo 0)"
+stage "rollout bench dflash (:8001, vs baseline)" \
+  "$PY" bench/rollout_bench.py --base-url "$DFLASH_URL" --model laguna --label dflash \
+        --prompts 3 --rollouts-per-example 2 --max-tokens 64 --hourly-rate 3.50 \
+        --baseline-tps "$BASELINE_TPS" \
+        --out results/rollout_dflash.json
+# ---------------------------------------------------------------------------
+# 5. Schema gate — the hard PASS/FAIL on the demo's money-slide JSON.
+# ---------------------------------------------------------------------------
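+CHECK_RC=0
+FAILS_BEFORE_CHECK=$STAGE_FAILS
+stage "check results schema" \
+  "$PY" scripts/check_results.py results/dflash.json results/baseline.json
+# stage() always returns 0, so read the schema-gate outcome off the failure counter.
+if [[ "$STAGE_FAILS" -gt "$FAILS_BEFORE_CHECK" ]]; then CHECK_RC=1; fi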
+# ---------------------------------------------------------------------------
+# Banner — pull the headline numbers straight out of the result JSON.
+# ---------------------------------------------------------------------------
+SUMMARY="$("$PY" - "$STAGE_FAILS" <<'PY'
+import json, sys
+stage_fails = int(sys.argv[1])
+def load(path):
+    try:
+        with open(path) as f:
+            return json.load(f)
+    except Exception:
+        return {}
+base   = load("results/baseline.json")
+dflash = load("results/dflash.json")
+evl    = load("results/eval_local.json")
+rb     = load("results/rollout_dflash.json")
+base_tps   = base.get("tokens_per_s_mean")
+dflash_tps = dflash.get("tokens_per_s_mean")
+tau        = dflash.get("acceptance_length_tau")
+speedup    = (dflash_tps / base_tps) if (base_tps and dflash_tps) else None
+reward     = evl.get("mean_reward")
+cost       = (rb.get("cost_projection") or {})
+savings    = cost.get("projected_savings_usd")
+def f(x, nd=2, suffix=""):
+    return f"{x:.{nd}f}{suffix}" if isinstance(x, (int, float)) else "n/a"
+# Lossless verdict: the parity stage prints lossless:true. Here we only report a
+# proxy for it: zero failed stages (the parity stage included). On a stub both
+# endpoints serve the same canned completion, so tau is the only thing that should move.
+lossless = "YES" if stage_fails == 0 else "see stage log"
+print("LOSSLESS|"      + lossless)
+print("TAU|"           + f(tau, 2))
+print("BASE_TPS|"      + f(base_tps, 1))
+print("DFLASH_TPS|"    + f(dflash_tps, 1))
+print("SPEEDUP|"       + (f(speedup, 2, "x") if speedup else "n/a"))
+print("REWARD|"        + f(reward, 3))
+print("SAVINGS|"       + (f(savings, 4, " USD/batch") if savings is not None else "n/a"))
+PY
+)"
+get() { echo "$SUMMARY" | grep "^$1|" | cut -d'|' -f2-; }
+echo
+echo "==================================================================="
+if [[ "$STAGE_FAILS" -eq 0 && "$CHECK_RC" -eq 0 ]]; then
+  VERDICT="PASS"
+else
+  VERDICT="FAIL"
+fi
+echo "  DRESS REHEARSAL: $VERDICT   (offline, local stubs, no credits/GPU)"
+echo "-------------------------------------------------------------------"
+echo "  lossless (greedy parity) : $(get LOSSLESS)"
+echo "  acceptance length tau    : $(get TAU)   (dflash stub; MEASURE on real Laguna)"
+echo "  tokens/sec  baseline     : $(get BASE_TPS)"
+echo "  tokens/sec  dflash       : $(get DFLASH_TPS)"
+echo "  throughput speedup       : $(get SPEEDUP)   (stub = wall-clock noise; real win is on Laguna)"
+echo "  RL eval mean reward      : $(get REWARD)   (stub canned output -> 0.0 expected)"
+echo "  projected rollout saving : $(get SAVINGS)"
+echo "-------------------------------------------------------------------"
+echo "  stages failed: $STAGE_FAILS    schema gate: $([[ $CHECK_RC -eq 0 ]] && echo OK || echo FAIL)"
+echo "  NOTE: stub numbers are shape-only. At the venue, re-run with"
+echo "        --base-url pointed at real Laguna for the real table."
+echo "==================================================================="
+# Exit nonzero if any stage or the schema gate failed (so CI / a venue dry-run
+# fails loudly rather than silently shipping a broken harness).
+if [[ "$STAGE_FAILS" -ne 0 || "$CHECK_RC" -ne 0 ]]; then
+  exit 1
+fi
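
Reviewer note: the banner's SAVINGS line is read from `cost_projection.projected_savings_usd`, which `bench/rollout_bench.py` derives from `--hourly-rate` and `--baseline-tps`. That file is not part of this hunk, so the following is only a sketch of one plausible projection, assuming cost scales linearly with GPU wall-clock time. The function name `projected_savings_usd` and the formula are assumptions for illustration, not the harness's actual code.

```python
from __future__ import annotations


def projected_savings_usd(tokens: int, baseline_tps: float, dflash_tps: float,
                          hourly_rate: float) -> float | None:
    """Hypothetical projection: dollars saved while generating `tokens` tokens."""
    if baseline_tps <= 0 or dflash_tps <= 0:
        return None  # comparable to the banner printing "n/a" when a rate is missing
    usd_per_second = hourly_rate / 3600.0
    baseline_cost = tokens / baseline_tps * usd_per_second
    dflash_cost = tokens / dflash_tps * usd_per_second
    return baseline_cost - dflash_cost


# Illustrative inputs only: 1M tokens at 3.50 USD/h, 100 tok/s vs 200 tok/s.
saving = projected_savings_usd(1_000_000, 100.0, 200.0, 3.50)
print(f"{saving:.4f} USD")
```

Under this assumption the saving depends only on the two throughputs and the hourly rate, which is why the script feeds the baseline aggregate tokens/sec into the dflash run.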

scripts/eval_local.py ADDED Viewed

	@@ -0,0 +1,305 @@

+#!/usr/bin/env python3
+"""
+eval_local.py — run the spec_rl RL-eval loop OFFLINE against the local stub.
+Purpose
+-------
+Prove the *shape* of the RL evaluation loop with NO Prime Intellect credits and
+NO GPU: drive the spec_rl HumanEval code task's rollouts against the local,
+stdlib OpenAI-compatible stub (scripts/stub_server.py) and compute the SAME
+reward the verifiers environment computes (`@vf.reward code_reward`) — run the
+model's candidate code against the problem's unit tests and return the FRACTION
+of assertions that pass (dense RL signal; the pass@1 eval stays binary).
+At the venue the same loop points at the DFlash-speculated vLLM endpoint instead
+of the stub. Because greedy speculative decoding is lossless, the reward curve is
+identical; only the cost per rollout drops. This script lets us validate the loop
+end-to-end before any credits are spent.
+Reward logic is NOT reimplemented here — it is imported verbatim from
+`environments/spec_rl/spec_rl.py` (`fraction_passing`, `passes`, `STOP`,
+`load_problems`), so what runs locally is byte-identical to what the verifiers
+env scores at the venue.
+Two execution paths (auto-selected, reported in the output)
+-----------------------------------------------------------
+  1. "verifiers"  — if `verifiers` imports AND `spec_rl.load_environment()`
+     constructs cleanly AND the endpoint exposes /v1/chat/completions, drive the
+     real `vf.SingleTurnEnv.evaluate(...)`. This is the true RL-eval API.
+  2. "manual"     — otherwise, a minimal hand-rolled rollout loop: build the same
+     chat prompt, call the endpoint, trim at STOP, score with spec_rl.fraction_passing.
+     This is the path that actually runs against the canned-completion stub
+     (which serves only /v1/completions), and it is reported as such.
+Note on the stub: it returns a fixed canned completion for EVERY prompt, so the
+real HumanEval tests will almost always fail (reward 0.0). That is expected and
+correct — the point here is to prove the loop runs end-to-end offline without
+erroring, not to produce a real pass@1. Real rewards come from Laguna at the venue.
+SAFETY: scoring executes model-generated code in a timed subprocess (see
+spec_rl.passes). Locally the "code" is the stub's harmless canned snippet. Run RL
+rollouts only in the disposable venue sandbox, never against real data.
+Usage
+-----
+  # start a stub first:  make stub        (baseline, :8000)
+  #                  or:  make stub-b      (dflash,   :8001)
+  python scripts/eval_local.py --base-url http://localhost:8000 --model laguna --n 5
+"""
+from __future__ import annotations
+import argparse
+import json
+import os
+import sys
+import urllib.error
+import urllib.request
+from pathlib import Path
+# ---------------------------------------------------------------------------
+# Import the spec_rl env module so reward logic is shared, not duplicated. The
+# env lives in a sibling tree (environments/spec_rl/spec_rl.py); add it to the
+# path. spec_rl is import-safe even when `verifiers` is absent (its vf import is
+# guarded), so this works on the Mac with no GPU and no verifiers.
+# ---------------------------------------------------------------------------
+_HERE = Path(__file__).resolve()
+_REPO = _HERE.parents[1]                      # .../laguna-hack
+_GPU_HW = _HERE.parents[2]                    # .../gpu_and_inference_hw
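+_SPEC_RL_DIR = _GPU_HW / "environments" / "spec_rl"
+# Fall back to the in-repo copy (spec_rl/spec_rl.py in this commit) when the
+# sibling tree is absent, e.g. in a standalone checkout of this repo.
+if not (_SPEC_RL_DIR / "spec_rl.py").exists() and (_REPO / "spec_rl" / "spec_rl.py").exists():
+    _SPEC_RL_DIR = _REPO / "spec_rl"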
+if str(_SPEC_RL_DIR) not in sys.path:
+    sys.path.insert(0, str(_SPEC_RL_DIR))
+import spec_rl  # noqa: E402  — shared reward core (passes, STOP, load_problems, ...)
+DEFAULT_OUT = _REPO / "results" / "eval_local.json"
+# System prompt mirrors spec_rl.load_environment so the manual loop sends the
+# exact same instruction the verifiers env would.
+SYSTEM_PROMPT = (
+    "You are an expert Python programmer. You will be given a function "
+    "signature and docstring. Complete the function body only. Do not repeat "
+    "the signature, do not add explanations, and do not wrap the code in "
+    "markdown fences. Output only the indented function body."
+)
+# ---------------------------------------------------------------------------
+# Endpoint helpers (stdlib urllib only — matches the rest of the harness).
+# ---------------------------------------------------------------------------
+def _post_json(url: str, payload: dict, timeout: int = 600) -> dict:
+    data = json.dumps(payload).encode()
+    req = urllib.request.Request(
+        url, data=data, headers={"Content-Type": "application/json"}
+    )
+    with urllib.request.urlopen(req, timeout=timeout) as r:
+        return json.loads(r.read().decode())
+def _endpoint_has_chat(base_url: str) -> bool:
+    """True if the endpoint answers /v1/chat/completions (vLLM does; stub does not)."""
+    url = base_url.rstrip("/") + "/v1/chat/completions"
+    probe = {
+        "model": "probe",
+        "messages": [{"role": "user", "content": "ping"}],
+        "max_tokens": 1,
+        "temperature": 0.0,
+    }
+    try:
+        _post_json(url, probe, timeout=10)
+        return True
+    except urllib.error.HTTPError as e:
+        # Any other HTTP error (e.g. 400/422/500) still means the route exists;
+        # only a 404 means "no chat endpoint here" (the stub returns 404 for it).
+        return e.code != 404
+    except Exception:
+        return False
+def complete_chat(base_url: str, model: str, user_content: str, max_tokens: int) -> str:
+    """Greedy chat completion (Laguna/vLLM path)."""
+    url = base_url.rstrip("/") + "/v1/chat/completions"
+    payload = {
+        "model": model,
+        "messages": [
+            {"role": "system", "content": SYSTEM_PROMPT},
+            {"role": "user", "content": user_content},
+        ],
+        "max_tokens": max_tokens,
+        "temperature": 0.0,  # greedy => deterministic => lossless-comparable
+        "stop": spec_rl.STOP,
+    }
+    obj = _post_json(url, payload)
+    return obj["choices"][0]["message"]["content"] or ""
+def complete_text(base_url: str, model: str, prompt: str, max_tokens: int) -> str:
+    """Greedy text completion (the stub path; also valid for vLLM completions)."""
+    url = base_url.rstrip("/") + "/v1/completions"
+    payload = {
+        "model": model,
+        "prompt": prompt,
+        "max_tokens": max_tokens,
+        "temperature": 0.0,
+        "stop": spec_rl.STOP,
+    }
+    obj = _post_json(url, payload)
+    return obj["choices"][0]["text"] or ""
+def _trim_at_stop(text: str) -> str:
+    """Cut at the earliest occurrence of any STOP sequence, mirroring the env's code_passes reward."""
+    for stop in spec_rl.STOP:
+        idx = text.find(stop)
+        if idx != -1:
+            text = text[:idx]
+    return text
+# ---------------------------------------------------------------------------
+# Path 1 — drive the real verifiers env, if (and only if) it constructs cleanly
+# AND the endpoint speaks chat. Returns a results dict, or None to fall back.
+# ---------------------------------------------------------------------------
+def try_verifiers(base_url: str, model: str, n: int) -> dict | None:
+    try:
+        import verifiers as vf  # noqa: F401
+    except Exception:
+        return None
+    # load_environment() builds a vf.SingleTurnEnv. In some verifiers versions
+    # the symbols spec_rl references (e.g. vf.Dataset) may not exist; guard the
+    # whole construction so a mismatch falls back to the manual loop instead of
+    # crashing the eval.
+    try:
+        env = spec_rl.load_environment(num_examples=n)
+    except Exception as e:  # AttributeError/ImportError/etc. -> manual fallback
+        print(f"[eval_local] verifiers env did not construct ({type(e).__name__}: {e});"
+              " falling back to manual rollout loop.")
+        return None
+    if not _endpoint_has_chat(base_url):
+        print("[eval_local] endpoint has no /v1/chat/completions (the local stub "
+              "serves only /v1/completions); using manual rollout loop instead.")
+        return None
+    try:
+        from openai import OpenAI  # type: ignore
+    except Exception:
+        print("[eval_local] openai client not available; using manual rollout loop.")
+        return None
+    client = OpenAI(base_url=base_url.rstrip("/") + "/v1", api_key="EMPTY")
+    out = env.evaluate(client=client, model=model, num_examples=n, save_results=False)
+    # Normalize verifiers' GenerateOutputs into our flat per-example shape.
+    rewards = list(getattr(out, "reward", []) or [])
+    completions = list(getattr(out, "completion", []) or [])
+    infos = list(getattr(out, "info", []) or [])
+    per_example = []
+    for i, r in enumerate(rewards):
+        info = infos[i] if i < len(infos) else {}
+        per_example.append({
+            "index": i,
+            "task_id": (info or {}).get("task_id", f"example_{i}"),
+            "score": float(r),
+            "completion": completions[i] if i < len(completions) else "",
+        })
+    mean = sum(p["score"] for p in per_example) / len(per_example) if per_example else 0.0
+    return {"driver": "verifiers", "mean_reward": mean, "per_example": per_example}
+# ---------------------------------------------------------------------------
+# Path 2: manual rollout loop (the offline / stub path). Reuses
+# spec_rl.fraction_passing and spec_rl.STOP so the reward is identical to the
+# env's @vf.reward.
+# ---------------------------------------------------------------------------
+def manual_rollouts(base_url: str, model: str, n: int, max_tokens: int) -> dict:
+    problems = spec_rl.load_problems(n)
+    use_chat = _endpoint_has_chat(base_url)
+    transport = "chat" if use_chat else "completions"
+    print(f"[eval_local] manual loop: {len(problems)} examples via /v1/{transport} "
+          f"at {base_url} (model={model})")
+    per_example = []
+    for i, prob in enumerate(problems):
+        if use_chat:
+            raw = complete_chat(base_url, model, prob["prompt"], max_tokens)
+        else:
+            # Stub path: it ignores the prompt and returns a canned body, so we
+            # send the bare code prompt the same way humaneval_subset.py does.
+            raw = complete_text(base_url, model, prob["prompt"], max_tokens)
+        completion = _trim_at_stop(raw)
+        # Reward: identical logic to spec_rl's @vf.reward code_passes — rebuild
+        # the problem from its own fields (never trust the model to echo it) and
+        # run the unit tests in a timed subprocess.
+        problem = {
+            "prompt": prob["prompt"],
+            "test": prob["test"],
+            "entry_point": prob["entry_point"],
+        }
+        score = spec_rl.fraction_passing(problem, completion)
+        per_example.append({
+            "index": i,
+            "task_id": prob["task_id"],
+            "score": score,
+            "completion": completion,
+        })
+        print(f"  [{i+1}/{len(problems)}] {prob['task_id']}: "
+              f"reward={score:.3f}")
+    mean = sum(p["score"] for p in per_example) / len(per_example) if per_example else 0.0
+    return {
+        "driver": "manual",
+        "transport": transport,
+        "mean_reward": mean,
+        "per_example": per_example,
+    }
+def main() -> int:
+    p = argparse.ArgumentParser(
+        description="Run the spec_rl RL-eval loop offline against the local stub "
+                    "(or any OpenAI-compatible endpoint) and compute the reward."
+    )
+    p.add_argument("--base-url", default="http://localhost:8000",
+                   help="OpenAI-compatible endpoint (stub :8000 / dflash stub :8001 / vLLM).")
+    p.add_argument("--model", default="laguna")
+    p.add_argument("--n", type=int, default=5, help="Number of HumanEval problems (rollouts).")
+    p.add_argument("--max-tokens", type=int, default=512)
+    p.add_argument("--out", default=str(DEFAULT_OUT),
+                   help="Where to write the small JSON summary.")
+    p.add_argument("--force-manual", action="store_true",
+                   help="Skip the verifiers path; always use the manual rollout loop.")
+    args = p.parse_args()
+    result = None
+    if not args.force_manual:
+        result = try_verifiers(args.base_url, args.model, args.n)
+    if result is None:
+        result = manual_rollouts(args.base_url, args.model, args.n, args.max_tokens)
+    summary = {
+        "base_url": args.base_url,
+        "model": args.model,
+        "n": len(result["per_example"]),
+        "driver": result["driver"],
+        "transport": result.get("transport", "chat"),
+        "mean_reward": result["mean_reward"],
+        "scores": [p["score"] for p in result["per_example"]],
+        "per_example": [
+            {"task_id": p["task_id"], "score": p["score"]}
+            for p in result["per_example"]
+        ],
+    }
+    print(json.dumps(
+        {k: v for k, v in summary.items() if k != "per_example"}, indent=2
+    ))
+    print(f"[eval_local] driver={summary['driver']}  "
+          f"mean_reward={summary['mean_reward']:.3f}  n={summary['n']}")
+    out_path = Path(args.out)
+    out_path.parent.mkdir(parents=True, exist_ok=True)
+    out_path.write_text(json.dumps(summary, indent=2))
+    print(f"[eval_local] wrote {out_path}")
+    return 0
+if __name__ == "__main__":
+    raise SystemExit(main())
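
Reviewer note: `_trim_at_stop` walks the stop list in order and truncates repeatedly, which ends up cutting at the earliest occurrence of any stop sequence regardless of list order. The sketch below reproduces that loop standalone. The `STOP` list here is an assumption (copied from `scripts/hf_job_ab.py` for illustration); the real script uses `spec_rl.STOP`, whose contents are not shown in this hunk.

```python
from __future__ import annotations

# Assumed stop list for illustration; eval_local.py uses spec_rl.STOP instead.
STOP = ["\nclass ", "\ndef ", "\n#", "\nif __name__"]


def trim_at_stop(text: str, stops: list[str] = STOP) -> str:
    """Truncate at each stop in turn; the earliest match in the text wins."""
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            text = text[:idx]
    return text


# "\ndef " is checked before "\n#", but "\n#" occurs earlier in the text,
# so the second pass shortens the result further.
raw = "    return n * 2\n# helper below\ndef other():\n    pass"
trimmed = trim_at_stop(raw)
print(repr(trimmed))
```

The server already receives `stop` in the request payload, so this client-side trim mainly matters for endpoints that ignore it, such as the canned-completion stub.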

scripts/fill_submission.py ADDED Viewed

	@@ -0,0 +1,116 @@

+#!/usr/bin/env python3
+"""fill_submission.py — turn measured results into ready-to-paste submission numbers.
+Reads the before/after benchmark JSONs (and, if given, the HumanEval/parity JSON),
+computes the headline figures (speedup, tau, TTFT delta, pass@1, parity verdict),
+and PRINTS:
+  * a warning if the data is STILL STUB (shape-only) — so you never submit fake numbers,
+  * the values to drop into MODEL_CARD.md / RESULTS.html,
+  * a filled one-line claim for the demo.
+It does NOT edit files — paste the numbers yourself, so nothing is silently overwritten.
+Usage:
+  python scripts/fill_submission.py \
+    --baseline results/baseline.json --dflash results/dflash.json \
+    [--humaneval results/humaneval_dflash.json]
+"""
+from __future__ import annotations
+import argparse
+import json
+from pathlib import Path
+from typing import Any
+def _load(path: str) -> dict[str, Any]:
+    return json.loads(Path(path).read_text())
+def _looks_stub(obj: dict[str, Any]) -> bool:
+    """Heuristic: the dress-rehearsal stub stamps a tell-tale completion string."""
+    for r in obj.get("runs", []) or []:
+        if "stub completion" in str(r.get("text", "")).lower():
+            return True
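+    # Tolerate a null/missing base_url and a trailing slash before checking the stub ports.
+    base_url = str(obj.get("base_url") or "").rstrip("/")
+    return base_url.endswith((":8000", ":8001")) and any(
+        "stub" in str(r.get("text", "")).lower() for r in obj.get("runs", []) or []
+    )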
+def _g(obj: dict[str, Any], *keys: str, default: Any = None) -> Any:
+    for k in keys:
+        if k in obj:
+            return obj[k]
+    return default
+def main() -> int:
+    ap = argparse.ArgumentParser(description=__doc__)
+    ap.add_argument("--baseline", default="results/baseline.json")
+    ap.add_argument("--dflash", default="results/dflash.json")
+    ap.add_argument("--humaneval", default=None,
+                    help="optional pass@1 / parity JSON from humaneval_subset.py")
+    args = ap.parse_args()
+    for p in (args.baseline, args.dflash):
+        if not Path(p).exists():
+            print(f"no results yet at {p} — run the A/B (scripts/hf_job_ab.py) or 'make rehearse' first.")
+            return 3
+    base = _load(args.baseline)
+    dfl = _load(args.dflash)
+    stub = _looks_stub(base) or _looks_stub(dfl)
+    if stub:
+        print("=" * 64)
+        print("  ⚠️  STUB DATA DETECTED — do NOT submit these numbers.")
+        print("  These are shape-only dress-rehearsal results. Re-run measure.py")
+        print("  against the real Laguna+DFlash vLLM endpoint, then re-run this.")
+        print("=" * 64)
+    # `or 0.0` also covers keys that are present but null in the JSON.
+    b_tps = float(_g(base, "tokens_per_s_mean") or 0.0)
+    d_tps = float(_g(dfl, "tokens_per_s_mean") or 0.0)
+    b_ttft = float(_g(base, "ttft_s_mean") or 0.0) * 1000  # ms
+    d_ttft = float(_g(dfl, "ttft_s_mean") or 0.0) * 1000   # ms
+    tau = _g(dfl, "acceptance_length_tau")
+    speedup = (d_tps / b_tps) if b_tps else 0.0
+    # optional quality / parity
+    pass1 = parity = lossless = None
+    if args.humaneval and Path(args.humaneval).exists():
+        he = _load(args.humaneval)
+        pass1 = _g(he, "pass_at_1", "pass@1", "pass1")
+        lossless = _g(he, "lossless")
+        parity = _g(he, "mismatches", "token_mismatches")
+    def fmt(x, nd=1, suffix=""):
+        return f"{x:.{nd}f}{suffix}" if isinstance(x, (int, float)) else "—"
+    print("\n--- HEADLINE (paste into MODEL_CARD.md + RESULTS.html) ---")
+    print(f"  baseline tokens/sec : {fmt(b_tps)}")
+    print(f"  dflash   tokens/sec : {fmt(d_tps)}")
+    print(f"  speedup             : {fmt(speedup, 2, 'x')}")
+    print(f"  acceptance length tau: {fmt(tau, 2) if tau is not None else '— (read from /metrics)'}")
+    print(f"  TTFT baseline / dflash (ms): {fmt(b_ttft)} / {fmt(d_ttft)}  (expect ~equal)")
+    print(f"  HumanEval pass@1    : {pass1 if pass1 is not None else '— (run humaneval_subset.py)'}")
+    print(f"  greedy parity       : "
+          + ("LOSSLESS ✓ (0 mismatches)" if (lossless is True or parity == 0)
+             else (f"{parity} mismatches ⚠️" if parity is not None else "— (run --parity)")))
+    print("\n--- ONE-LINE CLAIM (demo opener) ---")
+    if b_tps and d_tps:
+        tau_clause = f', tau={fmt(tau,2)}' if tau is not None else ''
+        print(f'  "Lean Laguna: DFlash makes Laguna XS.2 generate {fmt(speedup,2,"x")} faster '
+              f'on one GPU ({fmt(b_tps)} -> {fmt(d_tps)} tok/s{tau_clause}) '
+              f'with byte-identical output."')
+    else:
+        print("  (fill once real tokens/sec are present)")
+    if stub:
+        print("\n[fill_submission] refusing to call this submittable: STUB data.")
+        return 2
+    return 0
+if __name__ == "__main__":
+    raise SystemExit(main())

scripts/gen_local.py ADDED Viewed

	@@ -0,0 +1,110 @@

+#!/usr/bin/env python3
+"""
+gen_local.py — TINY-model generation on Apple Silicon (MPS), purely to validate
+the PIPELINE SHAPE before the venue. This does NOT run Laguna and does NOT do
+speculative decoding — it proves the measure-generate-report loop works so the
+same harness can be pointed at the real model on Prime Intellect.
+What it measures (the same two numbers we care about at the venue):
+  - TTFT  (time to first token): wall-clock from submit to the first new token.
+  - tokens/sec (decode throughput): (generated tokens - 1) / (total - TTFT);
+    the first token's time is charged to TTFT.
+JVM analogy: think of this as a JUnit smoke test against an in-memory stub —
+it asserts the wiring is correct so the integration run against the real
+service (vLLM + Laguna on CUDA) can't fail on plumbing.
+Usage (Mac):
+  uv run python scripts/gen_local.py --model sshleifer/tiny-gpt2 --max-new-tokens 64
+  uv run python scripts/gen_local.py --model gpt2 --prompt "def quicksort(arr):"
+At the venue you'd point --model at a small HF model first, then (on GPU) at
+Laguna itself for a sanity generation BEFORE wiring up vLLM serving.
+"""
+from __future__ import annotations
+import argparse
+import time
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+def pick_device() -> str:
+    if torch.cuda.is_available():
+        return "cuda"
+    if torch.backends.mps.is_available():
+        return "mps"
+    return "cpu"
+def main() -> None:
+    p = argparse.ArgumentParser(description="Tiny-model gen + TTFT/tokens-per-sec on MPS/CPU.")
+    p.add_argument("--model", default="sshleifer/tiny-gpt2",
+                   help="HF model id. Tiny by default; swap to gpt2 or (on GPU) Laguna.")
+    p.add_argument("--prompt", default="def fibonacci(n):\n    ",
+                   help="Coding-style prompt (matches the hackathon track).")
+    p.add_argument("--max-new-tokens", type=int, default=64)
+    p.add_argument("--greedy", action="store_true", default=True,
+                   help="Currently a no-op: decoding is always greedy (do_sample=False), "
+                        "so output is deterministic (lossless baseline).")
+    args = p.parse_args()
+    device = pick_device()
+    print(f"[gen_local] device={device} model={args.model}")
+    tok = AutoTokenizer.from_pretrained(args.model)
+    model = AutoModelForCausalLM.from_pretrained(args.model).to(device)
+    model.eval()
+    inputs = tok(args.prompt, return_tensors="pt").to(device)
+    n_prompt = inputs["input_ids"].shape[1]
+    # --- Warmup: first run triggers lazy kernel compilation on MPS; if we timed
+    #     it, TTFT would absorb the one-off compile cost and tokens/sec would be
+    #     garbage. Run one throwaway pass to warm the kernels, THEN measure. ---
+    with torch.no_grad():
+        _ = model.generate(**inputs, max_new_tokens=2, do_sample=False,
+                            pad_token_id=tok.eos_token_id)
+    if device == "mps":
+        torch.mps.synchronize()
+    # --- TTFT: generate exactly 1 token, time it (warmed) ---
+    if device == "mps":
+        torch.mps.synchronize()
+    t0 = time.perf_counter()
+    with torch.no_grad():
+        _ = model.generate(**inputs, max_new_tokens=1, do_sample=False,
+                           pad_token_id=tok.eos_token_id)
+    if device == "mps":
+        torch.mps.synchronize()
+    ttft = time.perf_counter() - t0
+    # --- Full generation: time the whole thing, derive decode tokens/sec ---
+    if device == "mps":
+        torch.mps.synchronize()
+    t1 = time.perf_counter()
+    with torch.no_grad():
+        out = model.generate(**inputs, max_new_tokens=args.max_new_tokens,
+                             do_sample=False, pad_token_id=tok.eos_token_id)
+    if device == "mps":
+        torch.mps.synchronize()
+    total = time.perf_counter() - t1
+    new_tokens = out.shape[1] - n_prompt
+    # tokens/sec over the decode phase: exclude the first token (its time is TTFT).
+    decode_time = max(total - ttft, 1e-9)
+    tps = (new_tokens - 1) / decode_time if new_tokens > 1 else 0.0
+    text = tok.decode(out[0][n_prompt:], skip_special_tokens=True)
+    print("\n--- generation ---")
+    print(text)
+    print("\n--- metrics (PIPELINE-SHAPE ONLY; not Laguna numbers) ---")
+    print(f"prompt_tokens     : {n_prompt}")
+    print(f"new_tokens        : {new_tokens}")
+    print(f"TTFT_s            : {ttft:.4f}")
+    print(f"total_s           : {total:.4f}")
+    print(f"decode_tokens_per_s: {tps:.2f}")
+if __name__ == "__main__":
+    main()
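
Reviewer note: the decode-throughput arithmetic in `gen_local.py` can be checked in isolation. The sketch below restates it as a pure function; the name `decode_tokens_per_s` and the sample numbers are illustrative assumptions, not measurements. Note that the script measures TTFT in a separate one-token `generate` call, so subtracting it from the full run's wall-clock time is an approximation.

```python
def decode_tokens_per_s(new_tokens: int, total_s: float, ttft_s: float) -> float:
    """Decode-phase throughput: the first token is charged to TTFT, not to decode."""
    if new_tokens <= 1:
        return 0.0  # nothing was decoded after the first token
    decode_time = max(total_s - ttft_s, 1e-9)  # guard against a zero/negative window
    return (new_tokens - 1) / decode_time


# Made-up timings: 64 new tokens in 1.30 s total, of which 0.04 s was TTFT.
tps = decode_tokens_per_s(new_tokens=64, total_s=1.30, ttft_s=0.04)
print(f"decode_tokens_per_s: {tps:.2f}")
```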

scripts/hf_job_ab.py ADDED Viewed

	@@ -0,0 +1,287 @@

+# /// script
+# requires-python = ">=3.10"
+# dependencies = ["vllm>=0.21", "huggingface_hub>=0.25"]
+# ///
+"""hf_job_ab.py — the real Lean Laguna MIN A/B, as a self-contained HF Jobs run.
+Runs ON Hugging Face Jobs (a GPU batch job, no ssh, auto-stops when done). It:
+  1. serves Laguna XS.2 baseline in vLLM, measures tokens/sec + TTFT over N prompts,
+  2. re-serves with the DFlash speculator (one --speculative-config), measures again + reads
+     acceptance length tau from /metrics,
+  3. greedy-parity-checks baseline vs DFlash outputs (must be byte-identical),
+  4. writes results/{baseline,dflash}.json + parity, and uploads them to an HF dataset repo
+     so the orchestrator can fetch them without ssh.
+Submit with:
+  hf jobs uv run --flavor rtx-pro-6000 --timeout 1800 \
+     --secrets HF_TOKEN --env RESULTS_REPO=art87able/lean-laguna-results scripts/hf_job_ab.py
+Everything is MEASURED — no fabricated numbers. A hard wall-clock budget bounds the spend.
+"""
+from __future__ import annotations
+import json
+import os
+import subprocess
+import sys
+import time
+import urllib.request
+MODEL = os.environ.get("MODEL", "poolside/Laguna-XS.2")
+SPECULATOR = os.environ.get("SPECULATOR", "poolside/Laguna-XS.2-speculator.dflash")
+GAMMA = int(os.environ.get("GAMMA", "7"))
+N = int(os.environ.get("N", "0"))                           # 0 => use the full curated prompt set
+MAX_TOKENS = int(os.environ.get("MAX_TOKENS", "256"))
+BUDGET_S = int(os.environ.get("BUDGET_S", "1500"))          # hard wall-clock cap (credit guard)
+RESULTS_REPO = os.environ.get("RESULTS_REPO", "")            # HF dataset repo to upload results to
+PORT = 8000
+STOP = ["\nclass ", "\ndef ", "\n#", "\nif __name__"]
+T0 = time.time()
+# A mixed-difficulty set so acceptance length tau is measured across EASY -> HARD, not just
+# trivial canonical functions (which pin tau at the gamma+1 ceiling and over-state the win).
+PROMPTS = [
+    # --- trivial canonical (high acceptance: the ceiling case) ---
+    "def fib(n):\n    \"\"\"Return the n-th Fibonacci number.\"\"\"\n",
+    "def is_prime(n):\n    \"\"\"Return True iff n is prime.\"\"\"\n",
+    "def factorial(n):\n    \"\"\"Return n! (n factorial).\"\"\"\n",
+    "def reverse_words(s):\n    \"\"\"Reverse the order of words in s.\"\"\"\n",
+    # --- medium ---
+    "def binary_search(arr, target):\n    \"\"\"Return the index of target in sorted arr, else -1.\"\"\"\n",
+    "def merge_sorted(a, b):\n    \"\"\"Merge two sorted lists into one sorted list.\"\"\"\n",
+    "def is_balanced(s):\n    \"\"\"Return True iff the brackets ()[]{} in s are balanced.\"\"\"\n",
+    "def roman_to_int(s):\n    \"\"\"Convert a Roman numeral string to an integer.\"\"\"\n",
+    "def flatten(nested):\n    \"\"\"Flatten an arbitrarily nested list of ints into a flat list.\"\"\"\n",
+    # --- harder / branchy / rare-token (acceptance should drop here) ---
+    "def lcs(a, b):\n    \"\"\"Return the length of the longest common subsequence of strings a and b.\"\"\"\n",
+    "def parse_duration(s):\n    \"\"\"Parse strings like '1h30m', '45s', '2d' into total seconds. Raise ValueError on bad input.\"\"\"\n",
+    "def group_anagrams(words):\n    \"\"\"Group words that are anagrams of each other into a list of lists.\"\"\"\n",
+    "class LRUCache:\n    \"\"\"A fixed-capacity LRU cache with get(key) and put(key, value).\"\"\"\n",
+    "def dijkstra(graph, start):\n    \"\"\"graph: dict node -> list of (neighbor, weight). Return dict of shortest distances from start.\"\"\"\n",
+]
+if N <= 0:
+    N = len(PROMPTS)
+PROMPTS = (PROMPTS * ((N // len(PROMPTS)) + 1))[:N]      # repeat only if a larger N is forced
+def budget_left() -> float:
+    return BUDGET_S - (time.time() - T0)
+def serve(dflash: bool) -> subprocess.Popen:
+    env = {**os.environ,
+           "VLLM_USE_DEEP_GEMM": "0",
+           # Laguna is an UNQUANTIZED bf16 MoE. The slim uv image ships only pip CUDA *runtime*
+           # wheels — no nvcc/toolkit at /usr/local/cuda. vLLM/FlashInfer lazily JIT-compile
+           # several kernels on first use (inside profile_run), each needing nvcc, so each dies
+           # "Could not find nvcc". We disable EVERY FlashInfer JIT path and pin prebuilt
+           # alternatives:
+           #   - MoE  -> Triton fused-MoE (PTX via Triton). [verified: sm90+sm120 cutlass JIT crash]
+           #   - sampler -> torch top-k/top-p (not FlashInfer). [verified: sampling JIT crash]
+           #   - attention -> FLASH_ATTN (prebuilt flash-attn wheel, not FlashInfer JIT).
+           "VLLM_USE_FLASHINFER_MOE_FP16": "0",
+           "VLLM_USE_FLASHINFER_MOE_FP8": "0",
+           "VLLM_USE_FLASHINFER_SAMPLER": "0",
+           "VLLM_ATTENTION_BACKEND": os.environ.get("VLLM_ATTENTION_BACKEND", "FLASH_ATTN")}
+    cmd = [sys.executable, "-m", "vllm.entrypoints.openai.api_server",
+           "--model", MODEL, "--port", str(PORT), "--tensor-parallel-size", "1",
+           "--trust-remote-code",                 # Laguna's custom MoE arch needs it in vLLM
+           "--enforce-eager",                     # skip CUDA-graph capture: leaner + faster start; A/B ratio unaffected
+           "--gpu-memory-utilization", "0.9",
+           "--max-model-len", os.environ.get("SPECRL_MAX_LEN", "4096")]
+    # NOTE: base poolside/Laguna-XS.2 loads in bf16 at ~62 GiB (full MoE resident). It fits a
+    # 96GB-class GPU (rtx-pro-6000) with room for KV; h200 (141GB) is the safe, best-tested target.
+    # The earlier failures were NOT OOM — they were the nvcc/FlashInfer-JIT issue fixed above.
+    if dflash:
+        cmd += ["--speculative-config",
+                json.dumps({"model": SPECULATOR, "num_speculative_tokens": GAMMA, "method": "dflash"})]
+    print(f"[job] serving {'DFlash' if dflash else 'baseline'}: {' '.join(cmd)}", flush=True)
+    return subprocess.Popen(cmd, env=env)
+def wait_health(proc: subprocess.Popen, timeout: int = 900) -> None:
+    url = f"http://localhost:{PORT}/health"
+    t = time.time()
+    while time.time() - t < timeout:
+        if proc.poll() is not None:
+            raise RuntimeError("vLLM server exited during startup (check logs above)")
+        try:
+            urllib.request.urlopen(url, timeout=5)
+            print("[job] server healthy", flush=True)
+            return
+        except Exception:
+            time.sleep(5)
+    raise TimeoutError("server did not become healthy in time")
+def _post(path: str, payload: dict) -> dict:
+    req = urllib.request.Request(f"http://localhost:{PORT}{path}",
+                                 data=json.dumps(payload).encode(),
+                                 headers={"Content-Type": "application/json"})
+    with urllib.request.urlopen(req, timeout=300) as r:
+        return json.loads(r.read().decode())
+def complete(prompt: str) -> tuple[str, float, float]:
+    t = time.time()
+    obj = _post("/v1/completions", {"model": MODEL, "prompt": prompt,
+                                    "max_tokens": MAX_TOKENS, "temperature": 0.0, "stop": STOP})
+    dt = time.time() - t
+    ch = obj["choices"][0]
+    text = ch.get("text", "") or ""
+    ntok = (obj.get("usage") or {}).get("completion_tokens") or len(text.split())
+    return text, (ntok / dt if dt else 0.0), dt
+def tau_from_metrics() -> float | None:
+    """Aggregate acceptance length from the cumulative /metrics counters (whole server lifetime)."""
+    counters = spec_counters()
+    if counters is None:
+        return None
+    return _tau_from_delta(*counters)
+def spec_counters() -> "tuple[float, float] | None":
+    """Raw cumulative (accepted, draft) spec-decode token counters from /metrics."""
+    try:
+        with urllib.request.urlopen(f"http://localhost:{PORT}/metrics", timeout=10) as r:
+            body = r.read().decode()
+    except Exception:
+        return None
+    acc = draft = None
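+    for line in body.splitlines():
+        # Match the metric name exactly (allowing Prometheus's `_total` suffix) so a sibling
+        # series that merely shares the prefix (e.g. a per-position breakdown) cannot
+        # overwrite the counter.
+        name = line.split("{", 1)[0].split(" ", 1)[0].removesuffix("_total")
+        if name == "vllm:spec_decode_num_accepted_tokens":
+            acc = float(line.split()[-1])
+        elif name == "vllm:spec_decode_num_draft_tokens":
+            draft = float(line.split()[-1])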
+    if acc is None or draft is None:
+        return None
+    return acc, draft
+def _tau_from_delta(d_acc: float, d_draft: float) -> "float | None":
+    """Per-prompt acceptance length from the change in counters over one completion."""
+    passes = d_draft / GAMMA
+    return (d_acc + passes) / passes if passes > 0 else None
+def measure(dflash: bool) -> dict:
+    texts, tps, ttft, taus = [], [], [], []
+    prev = spec_counters() if dflash else None
+    for p in PROMPTS:
+        if budget_left() < 120:
+            print("[job] budget guard hit — stopping measure early", flush=True)
+            break
+        txt, t_ps, dt = complete(p)
+        texts.append(txt); tps.append(t_ps); ttft.append(dt)
+        if dflash:
+            cur = spec_counters()
+            if prev and cur:
+                ti = _tau_from_delta(cur[0] - prev[0], cur[1] - prev[1])
+                taus.append(round(ti, 3) if ti is not None else None)
+            prev = cur
+    out = {
+        "label": "dflash" if dflash else "baseline", "model": MODEL, "n": len(texts),
+        "tokens_per_s_mean": sum(tps) / len(tps) if tps else 0.0,
+        "ttft_s_mean": sum(ttft) / len(ttft) if ttft else 0.0,   # NB: full-completion latency, not true TTFT
+        "acceptance_length_tau": tau_from_metrics() if dflash else 1.0,   # aggregate over the whole set
+        "texts": texts,
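+        # NB: "new_tokens" is a whitespace-word proxy, not the server's completion_tokens count,
+        # and "ttft_s" repeats the full-completion latency (the request is not streamed).
+        "runs": [{"ttft_s": d, "total_s": d, "new_tokens": len(t.split()),
+                  "tokens_per_s": s, "text": t} for t, s, d in zip(texts, tps, ttft)],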
+    }
+    if dflash:
+        clean = [t for t in taus if t is not None]
+        cs = sorted(clean)
+        out["tau_per_prompt"] = taus
+        out["tau_min"] = min(clean) if clean else None
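+        # True median: average the two middle values when the count is even.
+        out["tau_median"] = round((cs[(len(cs) - 1) // 2] + cs[len(cs) // 2]) / 2, 3) if cs else None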
+        out["tau_max"] = max(clean) if clean else None
+        out["tau_mean"] = round(sum(clean) / len(clean), 3) if clean else None
+    return out
+def run_one(dflash: bool) -> dict:
+    proc = serve(dflash)
+    try:
+        wait_health(proc)
+        return measure(dflash)
+    finally:
+        proc.terminate()
+        try:
+            proc.wait(timeout=30)
+        except Exception:
+            proc.kill()
+        time.sleep(5)
+def _expose_wheel_nvcc() -> None:
+    """Safety net: if no CUDA toolkit is on PATH but the pip nvidia-cuda-nvcc wheel is
+    installed, expose its nvcc + set CUDA_HOME so ANY residual FlashInfer JIT can compile
+    instead of hard-failing 'Could not find nvcc'. Never exercised when the FlashInfer paths
+    are disabled (see serve()); pure belt-and-suspenders. Set in os.environ BEFORE serve()
+    so the vLLM subprocess inherits it."""
+    import shutil
+    import site
+    if shutil.which("nvcc") or os.path.isdir("/usr/local/cuda"):
+        return
+    roots = []
+    try:
+        roots = list(site.getsitepackages())
+    except Exception:
+        pass
+    roots += [os.path.dirname(os.path.dirname(__file__))]
+    for root in roots:
+        cand = os.path.join(root, "nvidia", "cuda_nvcc")
+        if os.path.exists(os.path.join(cand, "bin", "nvcc")):
+            os.environ["CUDA_HOME"] = cand
+            os.environ["CUDA_PATH"] = cand
+            os.environ["PATH"] = os.path.join(cand, "bin") + ":" + os.environ.get("PATH", "")
+            print(f"[job] exposed wheel nvcc (CUDA_HOME={cand})", flush=True)
+            return
+    print("[job] no wheel nvcc found to expose (FlashInfer JIT paths are disabled anyway)", flush=True)
+def main() -> int:
+    print(f"[job] start; budget {BUDGET_S}s; N={N}; model={MODEL}", flush=True)
+    _expose_wheel_nvcc()
+    base = run_one(dflash=False)
+    dfl = run_one(dflash=True)
+    compared = min(len(base["texts"]), len(dfl["texts"]))
+    mism = sum(1 for a, b in zip(base["texts"], dfl["texts"]) if a != b)
+    parity = {"compared": compared, "mismatches": mism,
+              "lossless": compared > 0 and mism == 0}   # zero comparisons proves nothing
+    speedup = (dfl["tokens_per_s_mean"] / base["tokens_per_s_mean"]
+               if base["tokens_per_s_mean"] else 0.0)
+    summary = {"speedup_x": round(speedup, 3), "tau": dfl["acceptance_length_tau"],
+               "baseline_tps": base["tokens_per_s_mean"], "dflash_tps": dfl["tokens_per_s_mean"],
+               "parity": parity, "elapsed_s": round(time.time() - T0, 1)}
+    print("[job] RESULT " + json.dumps(summary), flush=True)
+    os.makedirs("results", exist_ok=True)
+    for d, name in ((base, "baseline.json"), (dfl, "dflash.json")):
+        with open(f"results/{name}", "w") as f:
+            json.dump(d, f, indent=2)
+    with open("results/summary.json", "w") as f:
+        json.dump(summary, f, indent=2)   # summary already carries "parity"
+    # No repo creation/upload — zero public surface. Emit results to the job logs as
+    # tagged JSON lines; the orchestrator parses them from `hf jobs logs <id>` and writes
+    # results/*.json locally, then pushes ONLY to the authorized poolside-laguna-hackathon org.
+    def _compact(d: dict) -> dict:
+        return {k: v for k, v in d.items() if k not in ("texts", "runs")}
+    print("[job] BASELINE_JSON " + json.dumps(_compact(base)), flush=True)
+    print("[job] DFLASH_JSON " + json.dumps(_compact(dfl)), flush=True)
+    print("[job] PARITY_JSON " + json.dumps(parity), flush=True)
+    print("[job] SAMPLE_BASELINE " + json.dumps(base["texts"][:2]), flush=True)
+    print("[job] SAMPLE_DFLASH " + json.dumps(dfl["texts"][:2]), flush=True)
+    return 0
+if __name__ == "__main__":
+    raise SystemExit(main())

scripts/parity_local.sh ADDED Viewed

	@@ -0,0 +1,33 @@

+#!/usr/bin/env bash
+# parity_local.sh — full local dry-run of the benchmark + parity harness on the Mac.
+# Starts two stub servers (baseline :8000, "dflash" :8001), waits until both are
+# ready, runs measure.py against each (writing results/*.json) and the greedy
+# parity check across both, then tears the stubs down. No CUDA / vLLM / Laguna.
+set -euo pipefail
+cd "$(dirname "$0")/.."
+PY=.venv/bin/python
+"$PY" scripts/stub_server.py --port 8000 &       A=$!
+"$PY" scripts/stub_server.py --port 8001 --spec & B=$!
+trap 'kill $A $B 2>/dev/null || true' EXIT
+# Wait for both ports to accept connections (no shell sleep — poll in python).
+"$PY" - <<'PY'
+import socket, time, sys
+for port in (8000, 8001):
+    for _ in range(100):
+        with socket.socket() as s:
+            if s.connect_ex(("127.0.0.1", port)) == 0:
+                break
+        time.sleep(0.05)
+    else:
+        sys.exit(f"stub on {port} never came up")
+print("[parity_local] both stubs ready")
+PY
+mkdir -p results
+"$PY" bench/measure.py --base-url http://localhost:8001 --model laguna --label dflash   --n 5 --out results/dflash.json
+"$PY" bench/measure.py --base-url http://localhost:8000 --model laguna --label baseline --n 5 --out results/baseline.json
+"$PY" evals/humaneval_subset.py --parity --base-url http://localhost:8000 --base-url-b http://localhost:8001 --model laguna --n 3
+"$PY" scripts/check_results.py results/dflash.json results/baseline.json
+echo "[parity_local] OK — results/ written, parity checked"

scripts/run_min_on_prime.sh ADDED Viewed

	@@ -0,0 +1,90 @@

+#!/usr/bin/env bash
+# run_min_on_prime.sh — provision a GPU, run the Lean Laguna MIN A/B, ALWAYS tear down.
+#
+# Credit safety is the whole point of this script:
+#   * a wallet readout and an explicit y/N confirmation before provisioning,
+#   * an EXIT/INT/TERM trap that terminates the pod no matter how the script ends
+#     (success, error, or Ctrl-C) — so a botched bring-up can't leave a GPU billing,
+#   * the cheap->expensive ladder (tiny smoke before the real run).
+#
+# It does NOT fabricate anything: it runs serve_vllm + measure + parity on the real
+# Laguna+DFlash and writes results/*.json, then runs fill_submission.py (which itself
+# refuses stub data). Review it before running; some remote-exec lines are marked
+# [VERIFY] because the exact `prime pods ssh` non-interactive form can vary by CLI build.
+#
+# Usage:   MAX_USD=5 GPU_TYPE=GH200_96GB ./scripts/run_min_on_prime.sh
+set -euo pipefail
+GPU_TYPE="${GPU_TYPE:-GH200_96GB}"   # Hopper = native FP8 for Laguna. (A100 lacks native FP8.)
+GPU_COUNT="${GPU_COUNT:-1}"
+DISK_GB="${DISK_GB:-120}"            # Laguna FP8 (~33GB) + drafter + room
+N="${N:-20}"                         # prompts per measure
+MAX_USD="${MAX_USD:-5}"              # advisory cap shown in the prompt (not checked against the wallet); teardown caps real spend
+POD_NAME="${POD_NAME:-lean-laguna-min}"
+HERE="$(cd "$(dirname "$0")/.." && pwd)"   # laguna-hack/
+export PATH="$HOME/.local/bin:$PATH"
+say() { printf '\n\033[1;32m[run-min]\033[0m %s\n' "$*"; }
+die() { printf '\n\033[1;31m[run-min] ABORT:\033[0m %s\n' "$*" >&2; exit 1; }
+# --- 0. preconditions (free) ---------------------------------------------------
+command -v prime >/dev/null || die "prime CLI not found"
+prime whoami >/dev/null 2>&1 || die "not logged into Prime (run: prime login)"
+say "wallet:"; prime --plain wallet 2>&1 | head -4
+read -r -p "Provision a ${GPU_TYPE} (~\$2-3/hr) and run the MIN A/B, cap ~\$${MAX_USD}? [y/N] " ok
+[ "$ok" = "y" ] || die "cancelled by user"
+# --- 1. provision + ALWAYS-teardown trap --------------------------------------
+say "creating pod ${POD_NAME} (${GPU_TYPE} x${GPU_COUNT})…"
+POD_ID="$(prime pods create --gpu-type "$GPU_TYPE" --gpu-count "$GPU_COUNT" \
+            --disk-size "$DISK_GB" --name "$POD_NAME" --yes --plain 2>&1 \
+          | grep -oE '[0-9a-f-]{8,}' | head -1)"   # [VERIFY] parse the pod id from output
+[ -n "${POD_ID:-}" ] || die "could not create pod / parse id"
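+# CRITICAL: terminate on ANY exit so a failed run never leaves a GPU billing.
+# INT/TERM are turned into an exit so the EXIT trap does the teardown; a bare
+# INT handler would run and then let the script carry on.
+teardown() { echo; echo "[run-min] tearing down pod $POD_ID"; prime pods terminate "$POD_ID" --yes >/dev/null 2>&1 || true; }
+trap teardown EXIT
+trap 'exit 130' INT TERM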
+say "pod $POD_ID created — teardown armed."
+# --- 2. wait until running ------------------------------------------------------
+for _ in $(seq 1 60); do
+  st="$(prime --plain pods status "$POD_ID" 2>/dev/null | grep -iE 'status' | head -1 || true)"
+  echo "  $st"; echo "$st" | grep -qi 'running' && break
+  sleep 10
+done
+echo "$st" | grep -qi 'running' || die "pod did not reach RUNNING"
+# helper: run a command on the pod   [VERIFY] exact non-interactive form for your CLI build
+pod() { prime pods ssh "$POD_ID" -- "$@"; }
+# --- 3. push the harness + install deps ---------------------------------------
+say "syncing harness to pod…"
+# Option A (private repo): clone with the PAT; Option B: rsync $HERE. Pick one. [VERIFY]
+pod "mkdir -p ~/laguna-hack" || die "ssh failed"
+rsync -az -e "prime pods ssh $POD_ID --" \
+  "$HERE/scripts" "$HERE/bench" "$HERE/evals" "$HERE/Makefile" \
+  "$HERE/requirements-venue.txt" ":~/laguna-hack/" 2>/dev/null \
+  || say "[VERIFY] rsync transport differs — fall back to git clone with PAT on the pod"
+pod "cd ~/laguna-hack && uv pip install -r requirements-venue.txt && vllm --version"
+# --- 4. the cheap->expensive ladder -------------------------------------------
+say "RUNG 1: tiny smoke (no Laguna) to prove the path"
+pod "cd ~/laguna-hack && python scripts/gen_local.py || true"
+say "RUNG 2/3: baseline then DFlash, measure both, parity"
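+# serve_vllm.py exec()s into `vllm serve`, so match that command line when stopping the
+# server; the [v] keeps the pattern from matching this remote shell's own command line.
+# The server is stopped even if measure.py fails, so the next run can bind port 8000.
+pod "cd ~/laguna-hack && python scripts/serve_vllm.py --mode baseline --run >/tmp/b.log 2>&1 & sleep 90 && python bench/measure.py --base-url http://localhost:8000 --n $N; pkill -f '[v]llm serve' || true"
+pod "cd ~/laguna-hack && python scripts/serve_vllm.py --mode dflash --run >/tmp/d.log 2>&1 & sleep 90 && python bench/measure.py --base-url http://localhost:8000 --n $N; pkill -f '[v]llm serve' || true"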
+pod "cd ~/laguna-hack && python evals/humaneval_subset.py --n 25 || true"
+# --- 5. pull results back ------------------------------------------------------
+say "pulling results…"
+rsync -az -e "prime pods ssh $POD_ID --" ":~/laguna-hack/results/" "$HERE/results/" 2>/dev/null \
+  || say "[VERIFY] copy results manually: prime pods ssh $POD_ID -- 'cat ~/laguna-hack/results/dflash.json'"
+# --- 6. teardown happens via trap; then fill locally --------------------------
+say "done on GPU — pod will terminate now (trap)."
+trap - EXIT INT TERM
+prime pods terminate "$POD_ID" --yes >/dev/null 2>&1 || true
+say "filling submission numbers (refuses stub data):"
+python3 "$HERE/scripts/fill_submission.py" \
+  --baseline "$HERE/results/baseline.json" --dflash "$HERE/results/dflash.json" \
+  --humaneval "$HERE/results/humaneval_dflash.json" || true
+say "if fill_submission exited 0, paste the numbers into MODEL_CARD.md/RESULTS.html and run the hf push (SUBMISSION.md §3)."

scripts/serve_vllm.py ADDED Viewed

	@@ -0,0 +1,126 @@

+#!/usr/bin/env python3
+"""
+serve_vllm.py — VENUE ONLY (Prime Intellect, CUDA GPU). DOES NOT RUN ON THE MAC.
+This is a thin, documented wrapper that prints (and optionally execs) the exact
+`vllm serve` command for three configs:
+  1. baseline  — Laguna XS.2 alone (the speed floor).
+  2. dflash    — Laguna XS.2 + the DFlash speculator (the speed we're claiming).
+  3. quant     — a quantized Laguna checkpoint (FP8/INT4/NVFP4) + FP8 KV cache.
+                 This is the FALLBACK lane (see FALLBACK_QUANT.md): if DFlash hits
+                 a vLLM-version/draft-model snag at the venue, a quantized weights
+                 checkpoint still tells a clean single-GPU story (smaller footprint,
+                 FP8 KV cache ~doubles concurrent trajectories per the [TR]).
+baseline vs dflash are IDENTICAL except for --speculative-config — flip one flag,
+get faster tokens, same greedy output. quant is a different lever (shrink each
+pass instead of cutting passes); the two can stack, but the fallback keeps it
+simple with quant alone.
+Grounding (cite at the demo):
+  - DFlash config shape is from the HF model card
+    huggingface.co/poolside/Laguna-XS.2-speculator.dflash:
+        --speculative-config '{"model":"poolside/Laguna-XS.2-speculator.dflash",
+                               "num_speculative_tokens":7,"method":"dflash"}'
+  - num_speculative_tokens = 7 is the card's value (this is gamma, the draft length).
+  - vLLM >= 0.21.0 and VLLM_USE_DEEP_GEMM=0 per the card.
+  - parsers --tool-call-parser poolside_v1 / --reasoning-parser poolside_v1 per the card.
+VERIFY AT ONBOARDING: exact vLLM version on the PI image, whether
+--trust-remote-code is required, and whether `method` is spelled "dflash"
+in the build you get. The card is authoritative; confirm against `vllm serve --help`.
+Usage (on Prime Intellect):
+  python scripts/serve_vllm.py --mode baseline --print     # show the command
+  python scripts/serve_vllm.py --mode dflash --run         # actually serve
+"""
+from __future__ import annotations
+import argparse
+import json
+import os
+import shlex
+import subprocess
+import sys
+MODEL = os.environ.get("LAGUNA_MODEL", "poolside/Laguna-XS.2")
+SPECULATOR = os.environ.get("LAGUNA_SPECULATOR", "poolside/Laguna-XS.2-speculator.dflash")
+# Draft length gamma. Per the DFlash model card.
+NUM_SPECULATIVE_TOKENS = 7
+# Quantized checkpoints for the fallback lane. The [TR] says XS.2 ships FP8 (W8A8),
+# INT4 (W4A16/AWQ) and NVFP4 quants in the HF collection. EXACT repo names are NOT
+# confirmed pre-event — these are documented placeholders; VERIFY AT ONBOARDING
+# against huggingface.co/collections/poolside/laguna-xs2 (or override via env).
+QUANT_MODELS = {
+    "fp8":   os.environ.get("LAGUNA_FP8_MODEL",   "poolside/Laguna-XS.2-FP8"),
+    "int4":  os.environ.get("LAGUNA_INT4_MODEL",  "poolside/Laguna-XS.2-INT4"),
+    "nvfp4": os.environ.get("LAGUNA_NVFP4_MODEL", "poolside/Laguna-XS.2-NVFP4"),
+}
+def build_cmd(mode: str, max_model_len: int, tp: int, quant: str) -> list[str]:
+    model = QUANT_MODELS[quant] if mode == "quant" else MODEL
+    base = [
+        "vllm", "serve", model,
+        "--tensor-parallel-size", str(tp),
+        "--max-model-len", str(max_model_len),
+        "--served-model-name", "laguna",
+        # Poolside-specific parsers (from the model card):
+        "--tool-call-parser", "poolside_v1",
+        "--reasoning-parser", "poolside_v1",
+        "--enable-auto-tool-choice",
+        "--default-chat-template-kwargs", '{"enable_thinking": true}',
+    ]
+    if mode == "dflash":
+        spec = {
+            "model": SPECULATOR,
+            "num_speculative_tokens": NUM_SPECULATIVE_TOKENS,
+            "method": "dflash",
+        }
+        base += ["--speculative-config", json.dumps(spec)]
+    if mode == "quant":
+        # FP8 KV cache is the high-leverage single-GPU win ([TR]: ~2x concurrent
+        # trajectories). Weight quant is auto-detected from the checkpoint config.
+        base += ["--kv-cache-dtype", "fp8"]
+    return base
+def main() -> None:
+    if sys.platform == "darwin":
+        # --print still works on a Mac for inspection; --run is blocked further down.
+        print("[serve_vllm] NOTE: this is a Mac. vLLM needs CUDA, so --run is blocked here.\n"
+              "             Use --print to inspect the command; run it on Prime Intellect.",
+              file=sys.stderr)
+    p = argparse.ArgumentParser(description="Print/run the vLLM serve command for Laguna (baseline / dflash / quant).")
+    p.add_argument("--mode", choices=["baseline", "dflash", "quant"], required=True)
+    p.add_argument("--quant", choices=["fp8", "int4", "nvfp4"], default="fp8",
+                   help="Quant format for --mode quant (the fallback lane). Default fp8.")
+    p.add_argument("--max-model-len", type=int, default=16384,
+                   help="Card example uses 16384; raise toward 131072/262144 if VRAM allows. Verify at onboarding.")
+    p.add_argument("--tensor-parallel-size", type=int, default=1,
+                   help="Single GPU = 1. The whole hook is one-GPU serving.")
+    g = p.add_mutually_exclusive_group(required=True)
+    g.add_argument("--print", action="store_true", help="Print the command only.")
+    g.add_argument("--run", action="store_true", help="Actually exec vllm serve (venue only).")
+    args = p.parse_args()
+    cmd = build_cmd(args.mode, args.max_model_len, args.tensor_parallel_size, args.quant)
+    env_prefix = "VLLM_USE_DEEP_GEMM=0"
+    printable = f"{env_prefix} " + " ".join(shlex.quote(c) for c in cmd)
+    print(printable)
+    if args.run:
+        if sys.platform == "darwin":
+            print("[serve_vllm] --run blocked on Mac.", file=sys.stderr)
+            sys.exit(2)
+        env = dict(os.environ)
+        env["VLLM_USE_DEEP_GEMM"] = "0"  # per the model card
+        os.execvpe(cmd[0], cmd, env)
+if __name__ == "__main__":
+    main()

scripts/stub_server.py ADDED Viewed

	@@ -0,0 +1,187 @@

+#!/usr/bin/env python3
+"""
+stub_server.py — a tiny, stdlib-only OpenAI-compatible STUB so the benchmark and
+eval harness (bench/measure.py, evals/humaneval_subset.py) can be exercised
+END-TO-END on the Mac, with NO CUDA / vLLM / Laguna. It fakes just enough of the
+vLLM surface to shape-test the whole pipeline before the venue.
+What it fakes:
+  * POST /v1/completions  — both streaming (SSE, for measure.py) and non-streaming
+    (single JSON, for humaneval_subset.py). Output is a fixed canned completion, so two
+    stubs return identical text → the parity check passes (a shape-test, not a real result).
+  * GET  /metrics         — Prometheus text. With --spec, it exposes the
+    spec_decode_* counters measure.py reads to compute acceptance length τ
+    (tuned so τ ≈ 2.6, in the DFlash card's 2.56–3.07 range). Without --spec it's a
+    plain baseline (no spec counters → measure.py reports τ = None, which is correct).
+JVM analogy: this is WireMock for an LLM endpoint — a canned stub standing in for
+the real service so you can integration-test the client/harness without the backend.
+Usage:
+  python scripts/stub_server.py --port 8000           # baseline stub
+  python scripts/stub_server.py --port 8001 --spec     # "dflash" stub (has τ metrics)
+"""
+from __future__ import annotations
+import argparse
+import json
+import threading
+from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
+GAMMA = 7          # draft length, matches the DFlash card / serve_vllm.py
+TAU_TARGET = 2.6   # acceptance length we want measure.py to report for the spec stub
+# Deterministic canned completion (same for every prompt → greedy parity is identical).
+# Content is irrelevant locally: humaneval runs with --no-exec, measure.py only times it.
+COMPLETION = (
+    "\n    # stub completion (local shape-test only; not a real model)\n"
+    "    result = 0\n"
+    "    for i in range(n):\n"
+    "        result += i\n"
+    "    return result\n"
+)
+def _tokens(text: str) -> list[str]:
+    """Split into whitespace-preserving 'tokens' so streaming has several chunks."""
+    out, buf = [], ""
+    for ch in text:
+        buf += ch
+        if ch.isspace():
+            out.append(buf)
+            buf = ""
+    if buf:
+        out.append(buf)
+    return out
+class State:
+    """Shared mutable counters (one server instance)."""
+    def __init__(self, spec: bool):
+        self.spec = spec
+        self.emitted = 0
+        self.lock = threading.Lock()
+    def add_emitted(self, n: int) -> None:
+        with self.lock:
+            self.emitted += n
+    def metrics_text(self) -> str:
+        lines = [
+            "# HELP stub_up 1 if the stub is serving",
+            "# TYPE stub_up gauge",
+            "stub_up 1",
+        ]
+        if self.spec:
+            # Invert measure.py's math so it recovers TAU_TARGET:
+            #   passes = emitted / tau ; draft = passes*gamma ; accepted = emitted - passes
+            #   measure.py: passes' = draft/gamma = passes ; committed = accepted + passes = emitted
+            #               tau = committed / passes = emitted / passes = TAU_TARGET
+            passes = max(self.emitted / TAU_TARGET, 0.0)
+            draft = passes * GAMMA
+            accepted = max(self.emitted - passes, 0.0)
+            lines += [
+                f"spec_decode_num_draft_tokens {draft:.0f}",
+                f"spec_decode_num_accepted_tokens {accepted:.0f}",
+                f"spec_decode_num_emitted_tokens {self.emitted:.0f}",
+            ]
+        return "\n".join(lines) + "\n"
+class Handler(BaseHTTPRequestHandler):
+    state: State = None  # set on the class before serving
+    def log_message(self, *args):  # quiet
+        pass
+    def _send(self, code: int, body: bytes, ctype: str) -> None:
+        self.send_response(code)
+        self.send_header("Content-Type", ctype)
+        self.send_header("Content-Length", str(len(body)))
+        self.end_headers()
+        self.wfile.write(body)
+    def do_GET(self):
+        if self.path.rstrip("/") == "/metrics":
+            self._send(200, self.state.metrics_text().encode(), "text/plain; version=0.0.4")
+        else:
+            self._send(404, b"not found\n", "text/plain")
+    def do_POST(self):
+        path = self.path.rstrip("/")
+        # Real vLLM serves both the legacy text route (/v1/completions, used by
+        # bench/measure.py) and the chat route (/v1/chat/completions, used by the
+        # Kotlin load-test client). The only wire difference is the chunk shape:
+        # chat streams {delta:{content:...}}, legacy streams {text:...}.
+        is_chat = path == "/v1/chat/completions"
+        if not is_chat and path != "/v1/completions":
+            self._send(404, b"not found\n", "text/plain")
+            return
+        n = int(self.headers.get("Content-Length", 0))
+        try:
+            req = json.loads(self.rfile.read(n) or b"{}")
+        except json.JSONDecodeError:
+            self._send(400, b'{"error":"bad json"}', "application/json")
+            return
+        max_tokens = int(req.get("max_tokens", 64))
+        toks = _tokens(COMPLETION)[:max_tokens]
+        text = "".join(toks)
+        self.state.add_emitted(len(toks))
+        if req.get("stream"):
+            self.send_response(200)
+            self.send_header("Content-Type", "text/event-stream")
+            self.end_headers()
+            for t in toks:
+                if is_chat:
+                    chunk = {"choices": [{"delta": {"content": t}, "index": 0,
+                                          "finish_reason": None}]}
+                else:
+                    chunk = {"choices": [{"text": t, "index": 0,
+                                          "finish_reason": None}]}
+                self.wfile.write(f"data: {json.dumps(chunk)}\n\n".encode())
+                self.wfile.flush()
+            self.wfile.write(b"data: [DONE]\n\n")
+            self.wfile.flush()
+        elif is_chat:
+            body = {
+                "id": "stub-chatcmpl",
+                "object": "chat.completion",
+                "model": req.get("model", "laguna"),
+                "choices": [{"message": {"role": "assistant", "content": text},
+                             "index": 0, "finish_reason": "stop"}],
+            }
+            self._send(200, json.dumps(body).encode(), "application/json")
+        else:
+            body = {
+                "id": "stub-cmpl",
+                "object": "text_completion",
+                "model": req.get("model", "laguna"),
+                "choices": [{"text": text, "index": 0, "finish_reason": "stop"}],
+            }
+            self._send(200, json.dumps(body).encode(), "application/json")
+def main() -> None:
+    p = argparse.ArgumentParser(description="Stdlib OpenAI-compatible stub for local harness shape-tests.")
+    p.add_argument("--port", type=int, default=8000)
+    p.add_argument("--spec", action="store_true",
+                   help="Expose spec_decode_* metrics (simulate the DFlash endpoint, τ≈2.6).")
+    args = p.parse_args()
+    Handler.state = State(spec=args.spec)
+    srv = ThreadingHTTPServer(("127.0.0.1", args.port), Handler)
+    tag = "dflash-stub (with τ metrics)" if args.spec else "baseline-stub"
+    print(f"[stub] {tag} serving on http://127.0.0.1:{args.port}  "
+          f"(/v1/completions, /v1/chat/completions, /metrics)")
+    try:
+        srv.serve_forever()
+    except KeyboardInterrupt:
+        pass
+    finally:
+        srv.shutdown()
+if __name__ == "__main__":
+    main()

spec_rl/README.md ADDED Viewed

	@@ -0,0 +1,129 @@

+# spec_rl — code RL on a DFlash-speculated endpoint
+A small [`verifiers`](https://github.com/PrimeIntellect-ai/verifiers) environment
+for the combined hackathon thesis:
+> **Lossless DFlash speculative decoding makes RL post-training cheaper.**
+`spec_rl` is a HumanEval-style code-completion task. The policy model
+(Laguna XS.2) is given a function signature + docstring and must write the body.
+The `@vf.reward` `code_reward` function executes that body against the problem's
+unit tests and returns the **fraction of assertions that pass** (a value in
+`[0,1]`) via `fraction_passing(problem, text)`. This is a *unit-test-grounded,
+verifiable, dense* reward — exactly the kind verifiers RL is built for. A
+fractional (rather than binary all-or-nothing) reward avoids GRPO all-zero-group
+advantage collapse on hard prompts, where every rollout would otherwise score
+`0.0`. The reported pass@1 **eval** stays binary (`evals/humaneval_subset.py`):
+reward is the learning signal, eval is the scoreboard.
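To make the all-zero-group point concrete, here is a minimal sketch of group-normalized advantages in the GRPO style. The `group_advantages` helper and the sample rewards are illustrative assumptions, not code from this repo or from `verifiers`; real trainers differ in details such as the epsilon and the standard-deviation estimator.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Hypothetical helper: center and scale rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hard prompt, binary reward: every rollout fails, so every advantage is zero
# and the group contributes no learning signal.
binary = [0.0, 0.0, 0.0, 0.0]

# Same prompt scored by fraction of assertions passed (made-up values):
# partial credit separates the rollouts, so the advantages are non-zero.
fractional = [0.0, 0.25, 0.5, 0.25]

print(group_advantages(binary))
print(group_advantages(fractional))
```

With the binary rewards the centered values are all zero regardless of the scaling, which is the collapse the fractional reward is meant to avoid.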
+## The point
+`verifiers` runs RL rollouts against an OpenAI-compatible endpoint declared in
+`./configs/endpoints.toml`. Point that endpoint at the **DFlash-speculated vLLM
+server** instead of a plain one and you get the **same reward curve at higher
+rollout throughput**:
+- Speculative decoding is **lossless** under greedy decoding. The 0.6B DFlash
+  drafter proposes `num_speculative_tokens = 7` tokens; the target model
+  (Laguna XS.2) verifies them, so accepted text is **token-identical** to the
+  no-speculator baseline.
+- The reward depends only on the generated text, so an identical reward signal
+  is produced.
+- Only the **cost per rollout** drops (fewer target-model forward passes per
+  accepted token → higher tokens/sec → cheaper RL).
+That is the measurable claim: feed the same env two endpoints (baseline vs
+DFlash), show one reward curve, two throughputs.
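The "fewer target-model forward passes" claim can be made quantitative. The sketch below assumes the accounting used in `scripts/hf_job_ab.py`: one verification pass per `gamma` drafted tokens, with each pass committing its accepted draft tokens plus one token from the target model. The counter values are made up for illustration and are not measured results.

```python
from __future__ import annotations

GAMMA = 7  # num_speculative_tokens, the draft length


def acceptance_length(accepted: float, drafted: float, gamma: int = GAMMA) -> float | None:
    """Tokens committed per target-model pass, from cumulative spec-decode counters."""
    passes = drafted / gamma          # each pass drafts `gamma` tokens
    if passes <= 0:
        return None                   # no speculative passes recorded yet
    return (accepted + passes) / passes


# Illustrative counters: 700 drafted tokens is 100 passes; 160 of those drafts were accepted.
tau = acceptance_length(accepted=160, drafted=700)
print(tau)
```

Under these assumptions the baseline corresponds to an acceptance length of 1 (one token per pass), so a larger value means fewer target-model passes per generated token. The realized tokens/sec gain is smaller than that ratio suggests, because drafting and verification carry their own cost.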
+## How the reward works
+1. The dataset carries each HumanEval problem's original `prompt` (signature +
+   docstring), `test` (the `check(candidate)` harness), and `entry_point` in
+   `info` — so the grader never depends on the model echoing the signature.
+2. The model's completion is trimmed at the first stop sequence
+   (`\nclass `, `\ndef `, `\n#`, `\nif __name__`) so a chatty model can't smuggle
+   a second definition past the grader. This matches `evals/humaneval_subset.py`.
+3. `spec_rl.fraction_passing()` assembles `prompt + completion + test +
+   check(entry_point)` and runs it in a **fresh `python` subprocess with an 8s
+   wall-clock timeout**, isolated from the rollout worker. It AST-instruments each
+   `assert` in the HumanEval `check()` (via `_AssertCounter`) so a failing assert
+   is **counted in the denominator instead of aborting on the first failure** —
+   this also makes loop-based checks fractional. The reward is `passed_asserts /
+   total_asserts`, a value in `[0,1]`. A crash, exception, or timeout before any
+   assertion runs → `0.0`; every assertion passing → `1.0`.
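Steps 2 and 3 can be sketched as follows. This is not the `spec_rl.py` implementation: `trim_at_stop` and `fraction_of_asserts` are hypothetical names, the rewrite goes through `ast.unparse` rather than the repo's `_AssertCounter`, and the candidate runs in-process with no subprocess or timeout, so it is only suitable for trusted toy input.

```python
import ast

STOP = ["\nclass ", "\ndef ", "\n#", "\nif __name__"]


def trim_at_stop(completion: str, stops: list[str] = STOP) -> str:
    """Cut the completion at the earliest stop sequence, if any occurs."""
    hits = (completion.find(s) for s in stops)
    cut = min((i for i in hits if i != -1), default=len(completion))
    return completion[:cut]


class _CountAsserts(ast.NodeTransformer):
    """Rewrite `assert cond` into `_record(lambda: (cond))` so a failure is counted, not raised."""

    def visit_Assert(self, node: ast.Assert) -> ast.stmt:
        return ast.parse(f"_record(lambda: ({ast.unparse(node.test)}))").body[0]


def fraction_of_asserts(program: str) -> float:
    counts = {"passed": 0, "total": 0}

    def _record(thunk) -> None:
        counts["total"] += 1
        try:
            counts["passed"] += bool(thunk())
        except Exception:
            pass  # an assert whose condition raises counts as a failure

    tree = _CountAsserts().visit(ast.parse(program))
    try:
        exec(compile(tree, "<candidate>", "exec"), {"_record": _record})
    except Exception:
        pass  # a crash keeps what was counted so far; nothing counted gives 0.0
    return counts["passed"] / counts["total"] if counts["total"] else 0.0


# Toy problem: the completion is right for small inputs and smuggles a second def.
prompt = "def double(x):\n"
completion = "    return x + x if x < 3 else 0\n\ndef extra():\n    pass\n"
test = ("def check(candidate):\n"
        "    assert candidate(1) == 2\n"
        "    assert candidate(2) == 4\n"
        "    assert candidate(5) == 10\n"
        "check(double)\n")

print(fraction_of_asserts(prompt + trim_at_stop(completion) + "\n" + test))
```

In this toy case two of the three assertions hold, so the score is partial rather than zero. The real grader additionally isolates the run in a fresh subprocess with a wall-clock timeout, which this sketch leaves out.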
+The execution + pass/fail logic is plain stdlib and importable without
+`verifiers` or a GPU, so it is unit-testable locally on Apple Silicon. A built-in
+smoke test runs with:
+```bash
+python spec_rl.py   # checks passing / failing / timeout completions
+```
+> **Safety:** this executes model-generated code to grade it. Each candidate
+> runs in a short-lived, isolated subprocess. Run RL rollouts only in the
+> disposable venue sandbox, never against real data.
+## Layout
+```
+spec_rl/
+  spec_rl.py      # load_environment(num_examples=20) -> vf.Environment
+  pyproject.toml  # name = "spec-rl", depends on verifiers + datasets
+  README.md
+```
+`load_environment(num_examples=20)` builds a `vf.SingleTurnEnv` over the first
+`num_examples` HumanEval problems with a `vf.Rubric` wrapping the `@vf.reward`
+`code_reward` function (which scores via `fraction_passing`).
+## Run it
+Install the env, then evaluate Laguna XS.2 through it:
+```bash
+prime env install spec_rl
+prime eval run spec_rl -m poolside/Laguna-XS.2 -n 20
+prime eval view
+```
+`-m poolside/Laguna-XS.2` resolves to whatever endpoint you alias in
+`./configs/endpoints.toml`. To show the cheaper-rollout result, define two
+aliases pointing at the same model — one plain vLLM server, one DFlash-speculated
+server — and run the eval against each:
+```toml
+# configs/endpoints.toml
+[[endpoint]]
+endpoint_id = "laguna-baseline"
+model = "poolside/Laguna-XS.2"
+url = "http://<baseline-vllm-host>:8000/v1"
+key = "VLLM_API_KEY"
+type = "openai_chat_completions"
+[[endpoint]]
+endpoint_id = "laguna-dflash"
+model = "poolside/Laguna-XS.2"
+url = "http://<dflash-vllm-host>:8000/v1"
+key = "VLLM_API_KEY"
+type = "openai_chat_completions"
+```
+The DFlash server is launched with the speculator config:
+```bash
+VLLM_USE_DEEP_GEMM=0 vllm serve poolside/Laguna-XS.2 \
+  --speculative-config '{"model":"poolside/Laguna-XS.2-speculator.dflash","num_speculative_tokens":7,"method":"dflash"}'
+# vLLM >= 0.21.0, parsers poolside_v1; vLLM does NOT need --trust-remote-code.
+```
+Then:
+```bash
+prime eval run spec_rl -m laguna-baseline -n 20
+prime eval run spec_rl -m laguna-dflash   -n 20
+```
+The two runs should report identical reward under greedy decoding, with higher
+throughput on the DFlash run. Read the realized acceptance length (tau) and
+tokens/sec from the DFlash server's `/metrics`; these are **measured at the
+venue**, not quoted from any published figure.
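One way to turn `/metrics` counters into a mean acceptance length is sketched below. The two metric names are assumptions (they differ across vLLM versions, so check what your server actually exposes), and `SAMPLE` is made-up text used only to exercise the parser, not a measurement. The formula assumes each verification step emits the accepted draft tokens plus one token from the target model.

```python
import re

# ASSUMED metric names; confirm against your server's real /metrics output.
ACCEPTED = "vllm:spec_decode_num_accepted_tokens_total"
DRAFTS = "vllm:spec_decode_num_drafts_total"

# Made-up sample in Prometheus text format (NOT a measured result).
SAMPLE = """\
# HELP vllm:spec_decode_num_drafts_total Number of draft rounds.
vllm:spec_decode_num_drafts_total{model_name="m"} 100.0
vllm:spec_decode_num_accepted_tokens_total{model_name="m"} 250.0
"""


def counter(text: str, name: str) -> float:
    """Sum every sample of counter `name` in Prometheus text exposition."""
    pattern = re.compile(rf"^{re.escape(name)}(?:\{{[^}}]*\}})?\s+(\S+)", re.M)
    return sum(float(v) for v in pattern.findall(text))


def mean_acceptance_length(text: str) -> float:
    """Tokens emitted per verification step: 1 + accepted / drafts."""
    drafts = counter(text, DRAFTS)
    if drafts == 0:
        raise ValueError("no draft rounds recorded yet")
    return 1.0 + counter(text, ACCEPTED) / drafts


print(mean_acceptance_length(SAMPLE))  # 1 + 250/100 for the sample above
```

In a real run the text would come from an HTTP GET on the DFlash server's `/metrics` endpoint, scraped once before and once after the eval so the counters can be differenced over just that run.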

spec_rl/pyproject.toml ADDED

	@@ -0,0 +1,21 @@

+[project]
+name = "spec-rl"
+version = "0.1.0"
+description = "HumanEval-style code RL environment whose rollouts are served by the DFlash-speculated Laguna XS.2 vLLM endpoint — same reward curve, cheaper rollouts."
+tags = ["code", "humaneval", "single-turn", "rl", "eval", "speculative-decoding", "dflash"]
+requires-python = ">=3.11"
+dependencies = [
+    "verifiers",
+    "datasets",
+]
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+[tool.hatch.build]
+include = ["spec_rl.py", "pyproject.toml", "README.md"]
+[tool.verifiers.eval]
+num_examples = 20
+rollouts_per_example = 1

spec_rl/spec_rl.py ADDED

	@@ -0,0 +1,453 @@

+#!/usr/bin/env python3
+"""
+spec_rl.py — a small `verifiers` environment for the combined hackathon thesis:
+"lossless DFlash speculative decoding makes RL post-training cheaper."
+The environment is a HumanEval-style code-completion task. The policy model
+(Laguna XS.2) is prompted with a function signature + docstring and must emit
+the function body. The reward executes the candidate completion against the
+problem's unit tests in a SUBPROCESS WITH A TIMEOUT and returns the
+FRACTION of the problem's unit-test assertions that pass (a dense RL signal in
+[0,1]); the pass@1 eval stays binary (evals/humaneval_subset.py). Reward is the
+dense learning signal; the eval is the binary scoreboard.
+Why this exists for the hackathon
+---------------------------------
+verifiers runs RL rollouts against an OpenAI-compatible endpoint declared in
+`./configs/endpoints.toml`. Point that endpoint at the DFlash-speculated vLLM
+server and the *same* reward curve is produced at higher rollout throughput,
+because speculative decoding is lossless under greedy decoding (the drafted
+tokens are verified by the target model, so accepted text is token-identical to
+the no-speculator baseline). The reward signal does not change; only the cost
+per rollout drops. That is the "cheaper RL" claim, made measurable.
+Local-dev note (Apple Silicon, no CUDA): this module is import-safe even when
+`verifiers` is not installed. `import verifiers as vf` is guarded; a clear
+ImportError is raised only when `load_environment()` is actually called. The
+reward's code-execution + pass/fail logic is plain stdlib and is unit-testable
+without verifiers or a GPU.
+SAFETY: this executes model-generated code to grade it. Each candidate runs in a
+short-lived subprocess with a wall-clock timeout, isolated from this process.
+Run RL rollouts only in the disposable venue sandbox, never against real data.
+"""
+from __future__ import annotations
+import ast
+import json
+import subprocess
+import sys
+import tempfile
+from pathlib import Path
+from typing import Any
+# ---------------------------------------------------------------------------
+# Import guard: keep the module importable without `verifiers` installed so the
+# reward logic can be unit-tested locally on the Mac. The real dependency is
+# only required when building the live environment.
+# ---------------------------------------------------------------------------
+try:
+    import verifiers as vf  # type: ignore
+except ImportError:  # pragma: no cover - exercised only when dep is absent
+    vf = None  # type: ignore
+# Per-candidate execution budget (seconds). Generous enough for HumanEval's
+# bounded reference tests, short enough to bound a runaway rollout.
+EXEC_TIMEOUT_S = 8
+# Stop sequences mirror evals/humaneval_subset.py so completion shape matches
+# the parity/pass@1 harness used to prove losslessness.
+STOP = ["\nclass ", "\ndef ", "\n#", "\nif __name__"]
+# ---------------------------------------------------------------------------
+# Dataset — reuse the HumanEval subset shape: {prompt, test, entry_point}.
+# We load the canonical HumanEval test split (same source as
+# evals/humaneval_subset.py) and keep only the first `num_examples` problems so
+# RL rollouts stay small and cheap during the hackathon.
+# ---------------------------------------------------------------------------
+def load_problems(num_examples: int) -> list[dict[str, Any]]:
+    """Return the first `num_examples` code problems as {prompt, test, entry_point}.
+    Default source is the canonical HumanEval test split (same as
+    evals/humaneval_subset.py). Two overrides, in precedence order:
+      * ``SPEC_RL_DATASET`` — a local ``.jsonl`` path (one problem per line) OR
+        a Hugging Face dataset id. This is the drop-in seam for an
+        Adaption-curated / exported dataset: as long as each row carries
+        ``{prompt, test, entry_point}`` it runs unchanged, so a richer code
+        taskset built with the hackathon's Adaption credits swaps in with one
+        env var and no code change.
+      * ``HUMANEVAL_DATASET`` — override just the HF repo id if the venue image
+        pins a mirror. ``SPEC_RL_DATASET_SPLIT`` overrides the split (default
+        ``test``).
+    With no env vars set, the canonical HumanEval test split is loaded.
+    """
+    import json
+    import os
+    src = os.environ.get("SPEC_RL_DATASET")
+    if src and src.endswith(".jsonl") and os.path.exists(src):
+        with open(src) as f:
+            rows = [json.loads(line) for line in f if line.strip()]
+        return rows[:num_examples]
+    from datasets import load_dataset
+    dataset_id = src or os.environ.get("HUMANEVAL_DATASET", "openai/openai_humaneval")
+    split = os.environ.get("SPEC_RL_DATASET_SPLIT", "test")
+    ds = load_dataset(dataset_id, split=split)
+    num_examples = min(num_examples, len(ds))
+    return [dict(ds[i]) for i in range(num_examples)]
+# ---------------------------------------------------------------------------
+# Reward core — execute the candidate completion against the unit tests in a
+# fresh subprocess with a timeout. Pure stdlib, no verifiers/GPU needed, so it
+# can be tested locally. `passes()` returns True iff all tests pass within the
+# budget; `fraction_passing()` returns the fraction of assertions that hold.
+# ---------------------------------------------------------------------------
+def _build_program(problem: dict[str, Any], completion: str) -> str:
+    """Assemble the runnable program: signature+docstring + body + tests."""
+    return (
+        problem["prompt"]
+        + completion
+        + "\n"
+        + problem["test"]
+        + f"\ncheck({problem['entry_point']})\n"
+    )
+def passes(problem: dict[str, Any], completion: str, timeout_s: int = EXEC_TIMEOUT_S) -> bool:
+    """True iff `completion` makes the problem's unit tests pass.
+    Runs the assembled program in a separate `python` subprocess so a hang,
+    crash, or `sys.exit` in model-generated code cannot take down the rollout
+    worker. A non-zero exit code, a raised exception, or a timeout all count as
+    a failure (reward 0.0).
+    """
+    program = _build_program(problem, completion)
+    with tempfile.TemporaryDirectory() as tmp:
+        prog_path = Path(tmp) / "candidate.py"
+        prog_path.write_text(program)
+        try:
+            result = subprocess.run(
+                [sys.executable, str(prog_path)],
+                capture_output=True,
+                text=True,
+                timeout=timeout_s,
+                cwd=tmp,
+            )
+        except subprocess.TimeoutExpired:
+            return False
+        return result.returncode == 0
+class _AssertCounter(ast.NodeTransformer):
+    """Rewrite each ``assert`` so a failure is COUNTED, not fatal.
+    ``assert <test>`` becomes, roughly::
+        try: __ok = bool(<test>)
+        except BaseException: __ok = False
+        __tally['total'] += 1
+        if __ok: __tally['passed'] += 1
+    So every assertion that executes (including inside a ``for`` loop over many
+    input/output pairs) contributes one test to the denominator, and the
+    numerator is how many held — turning HumanEval's single all-or-nothing
+    ``check()`` into a fractional pass rate.
+    """
+    def visit_Assert(self, node: ast.Assert):
+        try_node = ast.Try(
+            body=[ast.Assign(
+                targets=[ast.Name(id="__ok", ctx=ast.Store())],
+                value=ast.Call(func=ast.Name(id="bool", ctx=ast.Load()),
+                               args=[node.test], keywords=[]),
+            )],
+            handlers=[ast.ExceptHandler(
+                type=ast.Name(id="BaseException", ctx=ast.Load()),
+                name=None,
+                body=[ast.Assign(
+                    targets=[ast.Name(id="__ok", ctx=ast.Store())],
+                    value=ast.Constant(value=False))],
+            )],
+            orelse=[], finalbody=[],
+        )
+        incr_total = ast.parse("__tally['total'] += 1").body[0]
+        incr_pass = ast.parse("if __ok:\n    __tally['passed'] += 1").body[0]
+        out = [try_node, incr_total, incr_pass]
+        for n in out:
+            ast.copy_location(n, node)
+            ast.fix_missing_locations(n)
+        return out
+def fraction_passing(problem: dict[str, Any], completion: str,
+                     timeout_s: int = EXEC_TIMEOUT_S) -> float:
+    """Fraction of the problem's unit-test assertions the completion passes.
+    Returns a value in [0.0, 1.0]: 1.0 = all assertions pass, 0.5 = half, 0.0 =
+    none (or the code didn't even run). This is the dense RL TRAINING reward; the
+    reported pass@1 EVAL stays binary (evals/humaneval_subset.py). Reward is the
+    learning signal, eval is the scoreboard — a dense reward avoids GRPO's
+    all-zero-group advantage collapse on hard prompts (every rollout failing a
+    hard problem otherwise yields a zero-variance group with no gradient).
+    Mechanism: instrument the test's ``assert``s (via _AssertCounter) so each is
+    counted instead of aborting on the first failure, run the assembled program
+    in a timed subprocess, and read back passed/total. Falls back to the binary
+    ``passes()`` result if the test can't be parsed or exposes no assertions.
+    """
+    try:
+        tree = ast.parse(problem["test"])
+    except SyntaxError:
+        return 1.0 if passes(problem, completion, timeout_s) else 0.0
+    tree = _AssertCounter().visit(tree)
+    ast.fix_missing_locations(tree)
+    try:
+        instrumented_test = ast.unparse(tree)
+    except Exception:  # pragma: no cover - ast.unparse needs py>=3.9
+        return 1.0 if passes(problem, completion, timeout_s) else 0.0
+    program = (
+        "__tally = {'passed': 0, 'total': 0}\n"
+        + problem["prompt"] + completion + "\n"
+        + instrumented_test + "\n"
+        + "try:\n"
+        + f"    check({problem['entry_point']})\n"
+        + "except BaseException:\n"
+        + "    pass\n"
+        + "import json as __json\n"
+        + "print('__FRAC__' + __json.dumps(__tally))\n"
+    )
+    with tempfile.TemporaryDirectory() as tmp:
+        prog_path = Path(tmp) / "candidate.py"
+        prog_path.write_text(program)
+        try:
+            result = subprocess.run(
+                [sys.executable, str(prog_path)],
+                capture_output=True, text=True, timeout=timeout_s, cwd=tmp,
+            )
+        except subprocess.TimeoutExpired:
+            return 0.0
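+    # Scan from the end: the harness prints its tally last, so a "__FRAC__"
+    # line printed earlier by the candidate does not take precedence.
+    for line in reversed(result.stdout.splitlines()):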
+        if line.startswith("__FRAC__"):
+            try:
+                tally = json.loads(line[len("__FRAC__"):])
+                total = int(tally.get("total", 0))
+                passed = int(tally.get("passed", 0))
+            except Exception:
+                return 0.0
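+            if total == 0:
+                # No assertion executed. The wrapper above swallows any
+                # exception from check(), so the exit code is not informative
+                # here; rerun uninstrumented for an all-or-nothing verdict.
+                return 1.0 if passes(problem, completion, timeout_s) else 0.0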
+            return max(0.0, min(1.0, passed / total))
+    # No tally line => the program crashed before instrumentation ran (e.g. a
+    # syntax error in the completion) => nothing passed.
+    return 0.0
+def _extract_completion(state: Any) -> str:
+    """Pull the assistant's text out of a verifiers rollout state.
+    Tolerates both the chat-style completion (list of messages) and a plain
+    string, so the reward works across SingleTurnEnv shapes.
+    """
+    completion = None
+    if isinstance(state, dict):
+        completion = state.get("completion")
+    elif hasattr(state, "get"):
+        try:
+            completion = state.get("completion")
+        except Exception:
+            completion = None
+    if completion is None:
+        completion = getattr(state, "completion", None)
+    if isinstance(completion, str):
+        return completion
+    if isinstance(completion, list):
+        for message in reversed(completion):
+            if isinstance(message, dict) and message.get("role") == "assistant":
+                return str(message.get("content") or "")
+        # fall back to last item's content if roles are absent
+        if completion:
+            last = completion[-1]
+            if isinstance(last, dict):
+                return str(last.get("content") or "")
+            return str(last)
+    return ""
+# ---------------------------------------------------------------------------
+# System prompt — module constant so the offline manual loop (eval_local.py),
+# the classic SingleTurnEnv path, and the cookbook Taskset path all send the
+# exact same instruction.
+# ---------------------------------------------------------------------------
+SYSTEM_PROMPT = (
+    "You are an expert Python programmer. You will be given a function "
+    "signature and docstring. Complete the function body only. Do not repeat "
+    "the signature, do not add explanations, and do not wrap the code in "
+    "markdown fences. Output only the indented function body."
+)
+def _problem_from(row: Any) -> dict[str, Any]:
+    """Rebuild the gradeable problem from a task/info row (never the model output)."""
+    src = row.get("info") if hasattr(row, "get") and row.get("info") else row
+    return {
+        "prompt": src["code_prompt"],
+        "test": src["test"],
+        "entry_point": src["entry_point"],
+    }
+def _score_completion(row: Any, completion_text: str) -> float:
+    """Shared reward body: trim at the first STOP, return the fractional pass rate."""
+    text = completion_text or ""
+    for stop in STOP:
+        idx = text.find(stop)
+        if idx != -1:
+            text = text[:idx]
+    return fraction_passing(_problem_from(row), text)
+def _task_rows(num_examples: int) -> list[dict[str, Any]]:
+    """HumanEval-style rows carrying every field the reward needs — `info` nested
+    AND flattened, so both verifiers API shapes can read them."""
+    rows: list[dict[str, Any]] = []
+    for i, prob in enumerate(load_problems(num_examples)):
+        info = {
+            "task_id": prob.get("task_id", f"example_{i}"),
+            "code_prompt": prob["prompt"],
+            "test": prob["test"],
+            "entry_point": prob["entry_point"],
+        }
+        rows.append({"prompt": prob["prompt"], "answer": prob["entry_point"],
+                     "info": info, **info})
+    return rows
+# ---------------------------------------------------------------------------
+# Environment factory — supports BOTH verifiers API shapes, because this
+# workspace ships two references that disagree: the classic
+# vf.SingleTurnEnv/vf.Rubric API (AGENTS.md) and the Prime lab-cookbook
+# vf.Taskset/vf.Env/vf.EnvConfig API (reference/lab-cookbook/.../reverse_text).
+# The cookbook Taskset is registered only when the installed verifiers exposes
+# vf.Taskset; otherwise load_environment() falls back to the classic builder.
+# Both paths share the same stdlib reward core (fraction_passing), so the reward
+# is identical either way. [verify at onboarding] confirm which API the venue's
+# installed verifiers actually uses, and adjust if a symbol is missing.
+# ---------------------------------------------------------------------------
+if vf is not None and hasattr(vf, "Taskset"):
+    class SpecRLTasksetConfig(vf.TasksetConfig):  # type: ignore[misc]
+        dataset_name: str = "openai/openai_humaneval"
+        dataset_split: str = "test"
+        num_examples: int = 164  # full HumanEval pool; the harness samples -n from it
+    class SpecRLTaskset(vf.Taskset[SpecRLTasksetConfig]):  # type: ignore[misc]
+        def load_tasks(self):  # -> vf.Tasks
+            from datasets import Dataset
+            return Dataset.from_list(_task_rows(self.config.num_examples))
+        def load_system_prompt(self):  # -> vf.SystemPrompt
+            return SYSTEM_PROMPT
+        @vf.reward(weight=1.0)
+        async def code_reward(self, task, state) -> float:
+            """Dense fractional unit-test pass rate in [0,1] — the RL training reward."""
+            return _score_completion(task, _extract_completion(state))
+    def load_taskset(config):  # -> vf.Taskset
+        return SpecRLTaskset(config=config)
+def _build_singleturn_env(num_examples: int):
+    """Classic verifiers path: a vf.SingleTurnEnv whose vf.Rubric scores the
+    fractional unit-test reward. Used when the installed verifiers predates the
+    cookbook Taskset/Env API."""
+    dataset_rows = [
+        {
+            "prompt": [
+                {"role": "system", "content": SYSTEM_PROMPT},
+                {"role": "user", "content": row["code_prompt"]},
+            ],
+            "answer": row["entry_point"],
+            "info": row["info"],
+        }
+        for row in _task_rows(num_examples)
+    ]
+    from datasets import Dataset  # same import the Taskset path uses
+    dataset = Dataset.from_list(dataset_rows)
+    @vf.reward
+    def code_reward(completion, info, **kwargs) -> float:
+        text = completion if isinstance(completion, str) else _extract_completion(
+            {"completion": completion}
+        )
+        return _score_completion({"info": info}, text)
+    return vf.SingleTurnEnv(dataset=dataset, system_prompt=SYSTEM_PROMPT,
+                            rubric=vf.Rubric(funcs=[code_reward]))
+def load_environment(config: Any = None, *, num_examples: int = 20):
+    """Build the spec_rl RL environment (dual-signature on purpose).
+    Two verifiers APIs ship in this workspace, so this supports both:
+      * Cookbook (Prime lab-cookbook): ``load_environment(config: vf.EnvConfig)
+        -> vf.Env`` — used by ``prime eval run`` / ``prime train``.
+      * Classic: ``load_environment(num_examples=N) -> vf.SingleTurnEnv`` —
+        used by eval_local.py's verifiers path.
+    Both share the same stdlib reward core, so rewards are identical. The reward
+    logic (spec_rl.fraction_passing / passes) is importable and testable WITHOUT
+    verifiers; the hard dependency is enforced only here.
+    [verify at onboarding] confirm the installed verifiers exposes the symbols
+    the active path uses (vf.Taskset/EnvConfig/Env, or vf.SingleTurnEnv/Rubric).
+    """
+    if vf is None:
+        raise ImportError(
+            "The 'verifiers' package is required to build the spec_rl environment. "
+            "Install it with `prime env install spec_rl` (or `pip install verifiers`). "
+            "The reward logic (spec_rl.fraction_passing) is importable without it."
+        )
+    if config is not None and hasattr(vf, "Taskset"):
+        return vf.Env(taskset=load_taskset(config=config.taskset))
+    return _build_singleturn_env(num_examples)
+# ---------------------------------------------------------------------------
+# Local smoke test (no verifiers, no GPU, no network): proves the reward core
+# distinguishes a passing completion from a failing one. Run:
+#   python spec_rl.py
+# ---------------------------------------------------------------------------
+def _selftest() -> None:
+    toy = {
+        "prompt": "def add(a, b):\n    \"\"\"Return a + b.\"\"\"\n",
+        "test": "def check(candidate):\n    assert candidate(2, 3) == 5\n    assert candidate(-1, 1) == 0\n",
+        "entry_point": "add",
+    }
+    good = "    return a + b\n"
+    bad = "    return a - b\n"
+    partial = "    return a + b if a > 0 else a - b\n"  # passes 1 of 2 asserts
+    loops_forever = "    while True:\n        pass\n"
+    report = {
+        "passing_fraction": fraction_passing(toy, good),
+        "failing_fraction": fraction_passing(toy, bad),
+        "partial_fraction": fraction_passing(toy, partial),
+        "timeout_fraction": fraction_passing(toy, loops_forever, timeout_s=2),
+        "binary_passes_good": passes(toy, good),
+        "verifiers_available": vf is not None,
+    }
+    print(json.dumps(report, indent=2))
+    assert report["passing_fraction"] == 1.0, "all asserts pass => 1.0"
+    assert report["failing_fraction"] == 0.0, "no asserts pass => 0.0"
+    assert report["partial_fraction"] == 0.5, "1 of 2 asserts => 0.5 (fractional)"
+    assert report["timeout_fraction"] == 0.0, "timeout => 0.0"
+    print("selftest OK")
+if __name__ == "__main__":
+    _selftest()
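The `SPEC_RL_DATASET` override documented in `load_problems` accepts a local `.jsonl` file with one `{prompt, test, entry_point}` object per line. A minimal sketch of producing such a file and reading it back with the same line-by-line parsing; the toy problem and the file name `toy.jsonl` are illustrative only.

```python
import json
import tempfile
from pathlib import Path

# One toy problem in the row shape load_problems() expects.
rows = [{
    "prompt": 'def add(a, b):\n    """Return a + b."""\n',
    "test": "def check(candidate):\n    assert candidate(2, 3) == 5\n",
    "entry_point": "add",
}]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.jsonl"
    # JSON Lines: one compact JSON object per line (newlines inside strings
    # are escaped by json.dumps, so each row stays on a single line).
    path.write_text("".join(json.dumps(r) + "\n" for r in rows))

    # Read back the way load_problems() does: skip blank lines, parse each.
    with open(path) as f:
        loaded = [json.loads(line) for line in f if line.strip()]

assert loaded == rows
print(f"{len(loaded)} problem(s) round-tripped")
```

Pointing `SPEC_RL_DATASET` at a file like this before building the environment takes the local branch of `load_problems`, provided the path ends in `.jsonl` and exists.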