Lean Laguna: lossless DFlash speculative decoding on Laguna XS.2 (harness, environment, results)


	# spec_rl — code RL on a DFlash-speculated endpoint

	A small [`verifiers`](https://github.com/PrimeIntellect-ai/verifiers) environment
	for the combined hackathon thesis:

	> Lossless DFlash speculative decoding makes RL post-training cheaper.

	`spec_rl` is a HumanEval-style code-completion task. The policy model
	(Laguna XS.2) is given a function signature + docstring and must write the body.
	The `@vf.reward` function `code_reward` executes that body against the problem's
	unit tests and returns the fraction of assertions that pass, a value in
	`[0,1]` computed by `fraction_passing(problem, text)`. This is a *unit-test-grounded,
	verifiable, dense* reward, exactly the kind of reward `verifiers` RL is built for. A
	fractional (rather than binary all-or-nothing) reward avoids GRPO's all-zero-group
	advantage collapse on hard prompts: if every rollout in a group scores `0.0`,
	the group-relative advantages are all zero and the group contributes no gradient.
	The reported pass@1 eval stays binary (`evals/humaneval_subset.py`):
	reward is the learning signal, eval is the scoreboard.
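
The collapse is easy to see numerically. The sketch below assumes the commonly described GRPO normalization, `(r - mean) / (std + eps)` within a group of rollouts for one prompt; the trainer's exact formula may differ, and `group_advantages` is a name made up for this illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages (assumed form): (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hard prompt, binary reward: no rollout passes every test.
binary_group = [0.0, 0.0, 0.0, 0.0]
# Same prompt, fractional reward: partial credit separates the rollouts.
fractional_group = [0.0, 0.25, 0.5, 0.25]

print(group_advantages(binary_group))      # every advantage is 0.0
print(group_advantages(fractional_group))  # non-zero spread around the mean
```

With the binary group every advantage is zero, so the policy update for that prompt carries no signal; the fractional group still ranks the rollouts.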

	## The point

	`verifiers` runs RL rollouts against an OpenAI-compatible endpoint declared in
	`./configs/endpoints.toml`. Point that endpoint at the **DFlash-speculated vLLM
	server instead of a plain one and you get the same reward curve at higher
	rollout throughput**:

	- Speculative decoding is lossless under greedy decoding. The 0.6B DFlash
	drafter proposes `num_speculative_tokens = 7` tokens; the target model
	(Laguna XS.2) verifies them and keeps only those it would have generated itself,
	so the output is token-identical to the no-speculator baseline.
	- The reward depends only on the generated text, so an identical reward signal
	is produced.
	- Only the cost per rollout drops: fewer target-model forward passes per
	generated token means higher tokens/sec and cheaper RL.

	That is the measurable claim: feed the same env two endpoints (baseline vs
	DFlash), show one reward curve, two throughputs.
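
The lossless property can be sketched with toy models. Everything here is a stand-in: `target_next` and `draft_next` are hypothetical deterministic functions, not Laguna XS.2 or the DFlash drafter, and how a real drafter produces its proposals does not matter for the verification rule shown. The only point is that greedy verification keeps a drafted token exactly when the target would have emitted it.

```python
VOCAB = 50


def target_next(ctx):
    """Toy stand-in for the target model's greedy (argmax) next token."""
    return (31 * sum(ctx) + len(ctx)) % VOCAB


def draft_next(ctx):
    """Toy drafter: agrees with the target unless len(ctx) is a multiple of 4."""
    tok = target_next(ctx)
    return tok if len(ctx) % 4 else (tok + 1) % VOCAB


def greedy_decode(prompt, n_new):
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_decode(prompt, n_new, k=7):
    """Greedy speculative decoding; returns (tokens, number of verify passes)."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_new:
        # Drafter proposes k tokens.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # One verify pass checks all k positions (simulated position by position).
        target_passes += 1
        for i in range(k):
            expected = target_next(out + draft[:i])
            if draft[i] != expected:
                # Keep the matching prefix plus the target's own token.
                out.extend(draft[:i] + [expected])
                break
        else:
            # All k drafts accepted: the same pass also yields one bonus token.
            out.extend(draft + [target_next(out + draft)])
    return out[len(prompt):len(prompt) + n_new], target_passes


baseline = greedy_decode([1, 2, 3], 20)
speculated, passes = speculative_decode([1, 2, 3], 20)
print(speculated == baseline, passes)
```

The speculated sequence equals the baseline token for token, while the number of verify passes is smaller than the number of generated tokens. That gap is where the throughput gain comes from.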

	## How the reward works

	1. The dataset carries each HumanEval problem's original `prompt` (signature +
	docstring), `test` (the `check(candidate)` harness), and `entry_point` in
	`info` — so the grader never depends on the model echoing the signature.
	2. The model's completion is trimmed at the first stop sequence
	(`\nclass `, `\ndef `, `\n#`, `\nif __name__`) so a chatty model can't smuggle
	a second definition past the grader. This matches `evals/humaneval_subset.py`.
	3. `spec_rl.fraction_passing()` assembles `prompt + completion + test +
	check(entry_point)` and runs it in a **fresh `python` subprocess with an 8s
	wall-clock timeout**, isolated from the rollout worker. It AST-instruments each
	`assert` in the HumanEval `check()` (via `_AssertCounter`) so that a failing assert
	is counted in the denominator instead of aborting the run at the first failure,
	which also makes loop-based checks fractional. The reward is `passed_asserts /
	total_asserts`, a value in `[0,1]`. A crash, exception, or timeout before any
	assertion runs yields `0.0`; every assertion passing yields `1.0`.
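
Steps 2 and 3 can be sketched as follows. This is not the code in `spec_rl.py`: `AssertCounter`, `__record__`, `trim_completion`, and `fraction_of_asserts` are illustrative names, the rewrite of each `assert` into a recorded call is one possible instrumentation, and the program is `exec`'d in-process only to keep the example short (the real grader uses a subprocess with a timeout).

```python
import ast

STOP_SEQUENCES = ("\nclass ", "\ndef ", "\n#", "\nif __name__")


def trim_completion(text):
    """Cut the completion at the earliest stop sequence, if any."""
    cut = len(text)
    for stop in STOP_SEQUENCES:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


class AssertCounter(ast.NodeTransformer):
    """Rewrite `assert X` into `__record__(lambda: X)` so failures don't abort."""

    def visit_Assert(self, node):
        no_args = ast.arguments(posonlyargs=[], args=[], vararg=None,
                                kwonlyargs=[], kw_defaults=[], kwarg=None,
                                defaults=[])
        call = ast.Call(func=ast.Name(id="__record__", ctx=ast.Load()),
                        args=[ast.Lambda(args=no_args, body=node.test)],
                        keywords=[])
        return ast.copy_location(ast.Expr(value=call), node)


def fraction_of_asserts(program):
    """Return passed_asserts / total_asserts, or 0.0 if no assert ever ran."""
    tree = AssertCounter().visit(ast.parse(program))
    ast.fix_missing_locations(tree)
    counts = {"passed": 0, "total": 0}

    def record(thunk):
        counts["total"] += 1
        try:
            if thunk():
                counts["passed"] += 1
        except Exception:
            pass  # an assert whose condition raises counts as failed

    try:
        exec(compile(tree, "<candidate>", "exec"), {"__record__": record})
    except Exception:
        pass  # crash outside an assert: keep whatever was counted so far
    return counts["passed"] / counts["total"] if counts["total"] else 0.0


PROGRAM = '''
def add(a, b):
    return abs(a) + b   # buggy on negative a

def check(candidate):
    assert candidate(1, 2) == 3
    assert candidate(0, 5) == 5
    assert candidate(-1, 1) == 0
    assert candidate(-2, -3) == -5

check(add)
'''

print(fraction_of_asserts(PROGRAM))  # 0.5: two of four asserts pass
```

How a crash between two asserts should be scored is a design choice; the sketch keeps the counts gathered so far, which may not match what `fraction_passing` does.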

	The execution + pass/fail logic is plain stdlib and importable without
	`verifiers` or a GPU, so it is unit-testable locally on Apple Silicon. A built-in
	smoke test runs with:

	```bash
	python spec_rl.py # checks passing / failing / timeout completions
	```

	> Safety: this executes model-generated code to grade it. Each candidate
	> runs in a short-lived, isolated subprocess. Run RL rollouts only in the
	> disposable venue sandbox, never against real data.
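
A minimal sketch of the subprocess isolation described above, using only the standard library. `run_isolated` is a hypothetical helper, not the function in `spec_rl.py`; the 8-second default mirrors the timeout quoted earlier. A child process with a timeout limits runaway code but is not a security boundary, which is why the sandbox advice still applies.

```python
import os
import subprocess
import sys
import tempfile


def run_isolated(program, timeout_s=8.0):
    """Run `program` in a fresh interpreter.

    Returns (exit_code, stdout), or (None, "") if the wall-clock timeout hits.
    """
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "candidate.py")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(program)
        try:
            proc = subprocess.run(
                [sys.executable, "-I", path],  # -I: ignore env vars and user site
                capture_output=True,
                text=True,
                timeout=timeout_s,             # child is killed when this expires
                cwd=workdir,                   # scratch dir, deleted afterwards
            )
        except subprocess.TimeoutExpired:
            return None, ""
    return proc.returncode, proc.stdout


print(run_isolated("print('ok')"))
print(run_isolated("while True:\n    pass", timeout_s=1.0))
```

The second call shows the timeout path: an infinite loop in the candidate returns `(None, "")` instead of hanging the rollout worker.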

	## Layout

	```
	spec_rl/
	  spec_rl.py      # load_environment(num_examples=20) -> vf.Environment
	  pyproject.toml  # name = "spec-rl", depends on verifiers + datasets
	  README.md
	```

	`load_environment(num_examples=20)` builds a `vf.SingleTurnEnv` over the first
	`num_examples` HumanEval problems with a `vf.Rubric` wrapping the `@vf.reward`
	`code_reward` function (which scores via `fraction_passing`).

	## Run it

	Install the env, then evaluate Laguna XS.2 through it:

	```bash
	prime env install spec_rl
	prime eval run spec_rl -m poolside/Laguna-XS.2 -n 20
	prime eval view
	```

	`-m poolside/Laguna-XS.2` resolves to whatever endpoint you alias in
	`./configs/endpoints.toml`. To show the cheaper-rollout result, define two
	aliases pointing at the same model — one plain vLLM server, one DFlash-speculated
	server — and run the eval against each:

	```toml
	# configs/endpoints.toml
	[[endpoint]]
	endpoint_id = "laguna-baseline"
	model = "poolside/Laguna-XS.2"
	url = "http://<baseline-vllm-host>:8000/v1"
	key = "VLLM_API_KEY"
	type = "openai_chat_completions"

	[[endpoint]]
	endpoint_id = "laguna-dflash"
	model = "poolside/Laguna-XS.2"
	url = "http://<dflash-vllm-host>:8000/v1"
	key = "VLLM_API_KEY"
	type = "openai_chat_completions"
	```
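
Before comparing the two runs, it can help to confirm that the aliases really differ only in the server they hit. The check below is a sketch: `check_ab_endpoints` is a made-up helper, and it assumes the file keeps the `[[endpoint]]` layout and key names shown above.

```python
try:
    import tomllib  # Python 3.11+
except ModuleNotFoundError:  # older interpreters need `pip install tomli`
    import tomli as tomllib


def check_ab_endpoints(toml_text, baseline_id="laguna-baseline",
                       dflash_id="laguna-dflash"):
    """Return a list of problems that would make the A/B comparison unfair."""
    entries = tomllib.loads(toml_text).get("endpoint", [])
    by_id = {entry["endpoint_id"]: entry for entry in entries}
    problems = []
    for alias in (baseline_id, dflash_id):
        if alias not in by_id:
            problems.append(f"missing alias: {alias}")
    if problems:
        return problems
    base, spec = by_id[baseline_id], by_id[dflash_id]
    if base["model"] != spec["model"]:
        problems.append("aliases point at different models")
    if base["url"] == spec["url"]:
        problems.append("aliases point at the same server")
    if base["type"] != spec["type"]:
        problems.append("aliases use different API types")
    return problems


EXAMPLE = """
[[endpoint]]
endpoint_id = "laguna-baseline"
model = "poolside/Laguna-XS.2"
url = "http://<baseline-vllm-host>:8000/v1"
key = "VLLM_API_KEY"
type = "openai_chat_completions"

[[endpoint]]
endpoint_id = "laguna-dflash"
model = "poolside/Laguna-XS.2"
url = "http://<dflash-vllm-host>:8000/v1"
key = "VLLM_API_KEY"
type = "openai_chat_completions"
"""

print(check_ab_endpoints(EXAMPLE))  # [] when the comparison is controlled
```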

	The DFlash server is launched with the speculator config:

	```bash
	VLLM_USE_DEEP_GEMM=0 vllm serve poolside/Laguna-XS.2 \
	  --speculative-config '{"model":"poolside/Laguna-XS.2-speculator.dflash","num_speculative_tokens":7,"method":"dflash"}'
	# vLLM >= 0.21.0, parsers poolside_v1; vLLM does NOT need --trust-remote-code.
	```

	Then:

	```bash
	prime eval run spec_rl -m laguna-baseline -n 20
	prime eval run spec_rl -m laguna-dflash -n 20
	```

	With greedy decoding, the DFlash run should show identical reward at higher
	throughput. Read the realized acceptance length (tau) and tokens/sec from the
	DFlash server's `/metrics` endpoint; these are measured at the venue, not quoted
	from any published figure.
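
One way to turn a `/metrics` scrape into those two numbers is sketched below. The counter names are assumptions about what the server exposes and should be checked against its actual output. The sample text uses made-up values purely to show the arithmetic; they are not measurements. The sketch takes tau as `1 + accepted_tokens / drafts`: each verify step emits the accepted draft tokens plus one token from the target.

```python
def parse_counters(metrics_text):
    """Sum Prometheus text-format samples by metric name, ignoring labels.

    Assumes `name{labels} value` lines without trailing timestamps.
    """
    totals = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        try:
            totals[name] = totals.get(name, 0.0) + float(value)
        except ValueError:
            continue
    return totals


def mean_acceptance_length(
    totals,
    accepted="vllm:spec_decode_num_accepted_tokens_total",  # assumed name
    drafts="vllm:spec_decode_num_drafts_total",             # assumed name
):
    """tau = 1 + accepted draft tokens per verify step; None if no drafts yet."""
    n_drafts = totals.get(drafts, 0.0)
    if n_drafts == 0:
        return None
    return 1.0 + totals.get(accepted, 0.0) / n_drafts


def tokens_per_second(before, after, seconds,
                      key="vllm:generation_tokens_total"):  # assumed name
    """Throughput from two scrapes taken `seconds` apart."""
    return (after.get(key, 0.0) - before.get(key, 0.0)) / seconds


# Synthetic scrape with invented values, for illustration only.
SAMPLE = """\
# TYPE vllm:spec_decode_num_drafts_total counter
vllm:spec_decode_num_drafts_total{model_name="poolside/Laguna-XS.2"} 100.0
vllm:spec_decode_num_accepted_tokens_total{model_name="poolside/Laguna-XS.2"} 350.0
vllm:generation_tokens_total{model_name="poolside/Laguna-XS.2"} 450.0
"""

totals = parse_counters(SAMPLE)
print(mean_acceptance_length(totals))  # 4.5 for the synthetic sample
```

Scraping once before and once after each `prime eval run` and passing the two parsed dictionaries to `tokens_per_second` gives a per-run throughput for the baseline and DFlash servers.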