spec_rl — code RL on a DFlash-speculated endpoint

A small verifiers environment for the combined hackathon thesis:

Lossless DFlash speculative decoding makes RL post-training cheaper.

spec_rl is a HumanEval-style code-completion task. The policy model (Laguna XS.2) is given a function signature and docstring and must write the body. The @vf.reward code_reward function calls fraction_passing(problem, text), which executes that body against the problem's unit tests and returns the fraction of assertions that pass, a value in [0,1]. This is a unit-test-grounded, verifiable, dense reward, exactly the kind verifiers RL is built for. A fractional reward, rather than a binary all-or-nothing one, avoids GRPO's all-zero-group advantage collapse on hard prompts: when every rollout in a group scores 0.0, the group-normalized advantages are all zero and the prompt contributes no gradient. The reported pass@1 eval stays binary (evals/humaneval_subset.py): reward is the learning signal, eval is the scoreboard.
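
The collapse argument can be checked with a few lines of arithmetic. The sketch below assumes the common GRPO-style normalization, (reward - group mean) / (group std + eps); the function name group_advantages and the reward values are illustrative and are not taken from this repo or from verifiers.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hard prompt: no rollout passes every assertion, so a binary reward is all zeros.
binary = [0.0, 0.0, 0.0, 0.0]
# The same four rollouts scored by fraction of assertions passed (made-up values).
fractional = [0.0, 0.25, 0.5, 0.25]

print(group_advantages(binary))      # every advantage is 0.0: no learning signal
print(group_advantages(fractional))  # nonzero advantages that rank the rollouts
```

With the binary reward the whole group is discarded as far as the gradient is concerned; with the fractional reward the rollout that passes half the assertions is pushed up and the one that passes none is pushed down.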

The point

verifiers runs RL rollouts against an OpenAI-compatible endpoint declared in ./configs/endpoints.toml. Point that endpoint at the DFlash-speculated vLLM server instead of a plain one and you get the same reward curve at higher rollout throughput:

Speculative decoding is lossless under greedy decoding. The 0.6B DFlash drafter proposes num_speculative_tokens = 7 tokens per step; the target model (Laguna XS.2) verifies them and keeps only the prefix it would have generated itself, so the output is token-identical to the no-speculator baseline.
The reward depends only on the generated text, so the reward signal is identical too.
Only the cost per rollout drops (fewer target-model forward passes per accepted token → higher tokens/sec → cheaper RL).
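
The three points above can be illustrated with a toy greedy verifier. This is a minimal sketch of generic greedy speculative decoding, not DFlash's or vLLM's implementation: the drafter is abstracted as a callable that returns k proposed tokens, and toy_target, toy_draft, plain_greedy and speculative_greedy are hypothetical names invented for the example.

```python
from typing import Callable, List, Tuple

Token = int
TargetFn = Callable[[List[Token]], Token]            # greedy next token
DraftFn = Callable[[List[Token], int], List[Token]]  # k proposed tokens


def plain_greedy(target: TargetFn, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_greedy(target: TargetFn, draft: DraftFn, prompt: List[Token],
                       max_new: int, k: int = 7) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, number of verify steps)."""
    out = list(prompt)
    verify_steps = 0
    while len(out) - len(prompt) < max_new:
        proposal = draft(out, k)
        # A real server scores all k positions in one batched target forward
        # pass; the per-position calls here are only for readability.
        verify_steps += 1
        accepted = []
        for i in range(k):
            t = target(out + proposal[:i])
            accepted.append(t)       # the target's own token is what is kept
            if t != proposal[i]:
                break                # first mismatch: drop the rest of the draft
        else:
            accepted.append(target(out + proposal))  # bonus token, all k matched
        out.extend(accepted)
    return out[:len(prompt) + max_new], verify_steps


def toy_target(ctx: List[Token]) -> Token:
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft(ctx: List[Token], k: int) -> List[Token]:
    """Agrees with the target except at every 4th position (deliberately wrong)."""
    proposal: List[Token] = []
    for _ in range(k):
        cur = ctx + proposal
        guess = toy_target(cur)
        if len(cur) % 4 == 0:
            guess = (guess + 1) % 50
        proposal.append(guess)
    return proposal


prompt = [1, 2, 3]
baseline = plain_greedy(toy_target, prompt, max_new=20)
spec, verify_steps = speculative_greedy(toy_target, toy_draft, prompt, max_new=20)
print(spec == baseline, verify_steps)
```

Every kept token is the target's own greedy choice given a prefix it has already confirmed, which is why the output cannot differ from the baseline; the drafter only changes how many tokens each verify step yields.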

That is the measurable claim: feed the same env two endpoints (baseline vs DFlash), show one reward curve, two throughputs.

How the reward works

The dataset carries each HumanEval problem's original prompt (signature + docstring), test (the check(candidate) harness), and entry_point in info — so the grader never depends on the model echoing the signature.
The model's completion is trimmed at the first stop sequence (\nclass , \ndef , \n#, \nif __name__) so a chatty model can't smuggle a second definition past the grader. This matches evals/humaneval_subset.py.
spec_rl.fraction_passing() assembles prompt + completion + test + check(entry_point) and runs it in a fresh python subprocess with an 8s wall-clock timeout, isolated from the rollout worker. It AST-instruments each assert in the HumanEval check() (via _AssertCounter) so that a failing assert is tallied as a failure and execution continues, rather than the first failure aborting the run; this also makes loop-based checks fractional. The reward is passed_asserts / total_asserts, a value in [0,1]. A crash, exception, or timeout before any assertion runs → 0.0; every assertion passing → 1.0.

The execution + pass/fail logic is plain stdlib and importable without verifiers or a GPU, so it is unit-testable locally on Apple Silicon. A built-in smoke test runs with:

python spec_rl.py   # checks passing / failing / timeout completions

Safety: this executes model-generated code to grade it. Each candidate runs in a short-lived, isolated subprocess. Run RL rollouts only in the disposable venue sandbox, never against real data.
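
A short-lived subprocess with a wall-clock limit can be sketched with the standard library alone. The helper below is hypothetical (run_isolated is not a name from this repo) and only shows the mechanism. Process isolation of this kind limits hangs and stray state; it is not a security sandbox, which is why the venue sandbox is still required.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_isolated(program: str, timeout_s: float = 8.0) -> tuple[bool, str]:
    """Run `program` in a fresh interpreter; return (exited_cleanly, stdout)."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(program)
        try:
            proc = subprocess.run(
                # -I: isolated mode, ignores PYTHON* env vars and the user site dir
                [sys.executable, "-I", str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,  # the child is killed when this expires
            )
        except subprocess.TimeoutExpired:
            return False, ""  # hung or too slow: treated as a failure
        return proc.returncode == 0, proc.stdout


print(run_isolated("print('ok')"))                     # clean exit
print(run_isolated("while True: pass", timeout_s=0.5))  # killed by the timeout
```

The temporary directory is removed when the call returns, so a candidate that writes files leaves nothing behind in the rollout worker's tree.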

Layout

spec_rl/
  spec_rl.py      # load_environment(num_examples=20) -> vf.Environment
  pyproject.toml  # name = "spec-rl", depends on verifiers + datasets
  README.md

load_environment(num_examples=20) builds a vf.SingleTurnEnv over the first num_examples HumanEval problems with a vf.Rubric wrapping the @vf.reward code_reward function (which scores via fraction_passing).

Run it

Install the env, then evaluate Laguna XS.2 through it:

prime env install spec_rl
prime eval run spec_rl -m poolside/Laguna-XS.2 -n 20
prime eval view

-m poolside/Laguna-XS.2 resolves to whatever endpoint you alias in ./configs/endpoints.toml. To show the cheaper-rollout result, define two aliases pointing at the same model — one plain vLLM server, one DFlash-speculated server — and run the eval against each:

# configs/endpoints.toml
[[endpoint]]
endpoint_id = "laguna-baseline"
model = "poolside/Laguna-XS.2"
url = "http://<baseline-vllm-host>:8000/v1"
key = "VLLM_API_KEY"
type = "openai_chat_completions"

[[endpoint]]
endpoint_id = "laguna-dflash"
model = "poolside/Laguna-XS.2"
url = "http://<dflash-vllm-host>:8000/v1"
key = "VLLM_API_KEY"
type = "openai_chat_completions"

The DFlash server is launched with the speculator config:

VLLM_USE_DEEP_GEMM=0 vllm serve poolside/Laguna-XS.2 \
  --speculative-config '{"model":"poolside/Laguna-XS.2-speculator.dflash","num_speculative_tokens":7,"method":"dflash"}'
# vLLM >= 0.21.0, parsers poolside_v1; vLLM does NOT need --trust-remote-code.

Then:

prime eval run spec_rl -m laguna-baseline -n 20
prime eval run spec_rl -m laguna-dflash   -n 20

Identical reward, higher throughput on the DFlash run. Read realized acceptance length (tau) and tokens/sec from the DFlash server's /metrics — these are measured at the venue, not quoted from any published figure.
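
One way to turn the /metrics output into an acceptance length is sketched below. Two things here are assumptions to check against the server's actual output: the counter names (the defaults are a guess at vLLM's speculative-decoding counters and may differ by version) and the definition tau = 1 + accepted draft tokens / drafts. The sample text uses made-up numbers purely to exercise the parser.

```python
import re


def read_counter(metrics_text: str, name: str) -> float:
    """Sum all samples of one metric in Prometheus text exposition format."""
    pattern = re.compile("^" + re.escape(name) + r"(?:\{[^}]*\})?\s+(\S+)")
    total = 0.0
    for line in metrics_text.splitlines():
        m = pattern.match(line)  # comment lines start with '#', so they never match
        if m:
            total += float(m.group(1))
    return total


def mean_acceptance_length(
    metrics_text: str,
    drafts_name: str = "vllm:spec_decode_num_drafts_total",             # assumed name
    accepted_name: str = "vllm:spec_decode_num_accepted_tokens_total",  # assumed name
) -> float:
    """tau = 1 target token per verify step + accepted draft tokens per step."""
    drafts = read_counter(metrics_text, drafts_name)
    accepted = read_counter(metrics_text, accepted_name)
    return 1.0 + accepted / drafts if drafts else float("nan")


# Illustrative sample only; these are not measured values.
sample = """\
# HELP vllm:spec_decode_num_drafts_total Number of drafts.
# TYPE vllm:spec_decode_num_drafts_total counter
vllm:spec_decode_num_drafts_total{model_name="m"} 100.0
vllm:spec_decode_num_accepted_tokens_total{model_name="m"} 350.0
"""
print(mean_acceptance_length(sample))  # 4.5
```

To use it against a live server, fetch the text with urllib.request.urlopen on the DFlash server's /metrics URL and pass the decoded body in; take a snapshot before and after the eval run and difference the counters so earlier traffic does not skew the figure.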