# spec_rl: code RL on a DFlash-speculated endpoint

A small [`verifiers`](https://github.com/PrimeIntellect-ai/verifiers) environment
for the combined hackathon thesis:

> **Lossless DFlash speculative decoding makes RL post-training cheaper.**

`spec_rl` is a HumanEval-style code-completion task. The policy model
(Laguna XS.2) is given a function signature + docstring and must write the body.
The `@vf.reward` `code_reward` function executes that body against the problem's
unit tests and returns the **fraction of assertions that pass** (a value in
`[0,1]`) via `fraction_passing(problem, text)`. This is a *unit-test-grounded,
verifiable, dense* reward, exactly the kind verifiers RL is built for. A
fractional (rather than binary all-or-nothing) reward avoids GRPO all-zero-group
advantage collapse on hard prompts, where every rollout would otherwise score
`0.0`. The reported pass@1 **eval** stays binary (`evals/humaneval_subset.py`):
reward is the learning signal, eval is the scoreboard.

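The collapse that a fractional reward avoids can be illustrated with a minimal sketch. It assumes the common GRPO-style group normalization `(r - mean) / (std + eps)`; the exact formula differs between trainers, and `group_advantages` is a hypothetical helper written for this README, not part of `verifiers`.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages (assumed form: (r - mean) / (std + eps))."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hard prompt, binary reward: every rollout in the group fails outright.
binary_group = [0.0, 0.0, 0.0, 0.0]

# Same prompt, fractional reward: partial credit separates the rollouts.
fractional_group = [0.0, 0.25, 0.5, 0.25]

print(group_advantages(binary_group))      # every advantage is 0.0
print(group_advantages(fractional_group))  # non-zero advantages
```

With the binary group every advantage is zero, so that prompt contributes no gradient; the fractional group still ranks its rollouts.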
## The point

`verifiers` runs RL rollouts against an OpenAI-compatible endpoint declared in
`./configs/endpoints.toml`. Point that endpoint at the **DFlash-speculated vLLM
server** instead of a plain one and you get the **same reward curve at higher
rollout throughput**:

- Speculative decoding is **lossless** under greedy decoding. The 0.6B DFlash
  drafter proposes `num_speculative_tokens = 7` tokens; the target model
  (Laguna XS.2) verifies them, so the emitted text is **token-identical** to the
  no-speculator baseline.
- The reward depends only on the generated text, so both endpoints produce an
  identical reward signal.
- Only the **cost per rollout** drops (fewer target-model forward passes per
  accepted token → higher tokens/sec → cheaper RL).

That is the measurable claim: feed the same env two endpoints (baseline vs
DFlash), show one reward curve, two throughputs.

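The losslessness argument above can be sketched with toy models. This is not DFlash or vLLM code: `target`, `drafter`, and both decode functions are hypothetical stand-ins, the drafter here proposes tokens one at a time for simplicity (how the drafter produces its proposals does not matter for the argument), and each `rounds` increment stands for one batched target-model verification pass.

```python
def greedy_decode(target, prompt, n_new):
    """Baseline: one target-model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq[len(prompt):]


def speculative_decode(target, drafter, prompt, n_new, k=7):
    """Greedy speculative decoding: keep only what the target agrees with."""
    seq = list(prompt)
    rounds = 0  # stands in for the number of target verification passes
    while len(seq) - len(prompt) < n_new:
        rounds += 1
        draft = []
        for _ in range(k):
            draft.append(drafter(seq + draft))
        accepted = []
        for tok in draft:
            expected = target(seq + accepted)  # target's greedy choice here
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch, stop
                break
        else:
            accepted.append(target(seq + accepted))  # bonus token, all k matched
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + n_new], rounds


def target(seq):
    """Toy deterministic 'model': next token is a function of the context."""
    return (sum(seq) * 31 + len(seq)) % 50


def drafter(seq):
    """Toy drafter: agrees with the target except at every fifth position."""
    tok = target(seq)
    return tok if len(seq) % 5 else (tok + 1) % 50


prompt = [3, 1, 4]
baseline = greedy_decode(target, prompt, 20)
speculated, rounds = speculative_decode(target, drafter, prompt, 20)
print(baseline == speculated, rounds)
```

Every emitted token is either a draft token the target would have chosen anyway or the target's own correction, so the output matches the baseline while the number of verification rounds stays below the number of generated tokens.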
## How the reward works

1. The dataset carries each HumanEval problem's original `prompt` (signature +
   docstring), `test` (the `check(candidate)` harness), and `entry_point` in
   `info`, so the grader never depends on the model echoing the signature.
2. The model's completion is trimmed at the first stop sequence
   (`\nclass `, `\ndef `, `\n#`, `\nif __name__`) so a chatty model can't smuggle
   a second definition past the grader. This matches `evals/humaneval_subset.py`.
3. `spec_rl.fraction_passing()` assembles `prompt + completion + test +
   check(entry_point)` and runs it in a **fresh `python` subprocess with an 8s
   wall-clock timeout**, isolated from the rollout worker. It AST-instruments each
   `assert` in the HumanEval `check()` (via `_AssertCounter`) so a failing assert
   is **counted in the denominator instead of aborting on the first failure**;
   this also makes loop-based checks fractional. The reward is `passed_asserts /
   total_asserts`, a value in `[0,1]`. A crash, exception, or timeout before any
   assertion runs → `0.0`; every assertion passing → `1.0`.

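Steps 2 and 3 can be approximated with the sketch below. It is an illustration under stated assumptions, not the code in `spec_rl.py`: `trim_completion`, `AssertCounterSketch`, and `fraction_passing_sketch` are hypothetical names, the signature differs from `fraction_passing(problem, text)`, and the sketch runs in-process with no subprocess or timeout, so it should only be fed trusted strings.

```python
import ast

# Stop sequences quoted from step 2.
STOP_SEQUENCES = ("\nclass ", "\ndef ", "\n#", "\nif __name__")


def trim_completion(text: str) -> str:
    """Cut the completion at the earliest stop sequence, if any."""
    cut = len(text)
    for stop in STOP_SEQUENCES:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


class AssertCounterSketch(ast.NodeTransformer):
    """Rewrite `assert cond` into `_record(lambda: cond)` (hypothetical)."""

    def visit_Assert(self, node: ast.Assert) -> ast.stmt:
        stmt = ast.parse("_record(lambda: None)").body[0]
        stmt.value.args[0].body = node.test  # condition is evaluated lazily
        return ast.copy_location(stmt, node)


def fraction_passing_sketch(prompt, completion, test, entry_point):
    counts = {"passed": 0, "total": 0}

    def _record(thunk):
        counts["total"] += 1
        try:
            counts["passed"] += bool(thunk())
        except Exception:
            pass  # an exception inside the condition counts as a failed assert

    try:
        solution = ast.parse(prompt + trim_completion(completion))
        harness = AssertCounterSketch().visit(
            ast.parse(f"{test}\n\ncheck({entry_point})\n")
        )
        ast.fix_missing_locations(harness)
        namespace = {"_record": _record}
        exec(compile(solution, "<solution>", "exec"), namespace)
        exec(compile(harness, "<harness>", "exec"), namespace)
    except Exception:
        pass  # syntax error or crash: keep whatever was counted so far
    total = counts["total"]
    return counts["passed"] / total if total else 0.0
```

Only the harness is instrumented, so an `assert` inside the model's own body keeps its normal semantics. Wrapping each condition in a lambda lets the recorder catch exceptions raised while evaluating it, which is one way to keep a single bad call from aborting the rest of `check()`.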
The execution + pass/fail logic is plain stdlib and importable without
`verifiers` or a GPU, so it is unit-testable locally on Apple Silicon. A built-in
smoke test runs with:

```bash
python spec_rl.py # checks passing / failing / timeout completions
```

> **Safety:** this executes model-generated code to grade it. Each candidate
> runs in a short-lived, isolated subprocess. Run RL rollouts only in the
> disposable venue sandbox, never against real data.

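The short-lived subprocess with a wall-clock timeout might look like the following sketch. `run_isolated` is a hypothetical helper, and the temporary directory and the `-I` (isolated mode) interpreter flag are choices made for this example rather than details taken from `spec_rl.py`; the 8 second default mirrors the timeout described above.

```python
import os
import subprocess
import sys
import tempfile


def run_isolated(source: str, timeout_s: float = 8.0):
    """Run `source` in a fresh interpreter.

    Returns (returncode, stdout), or None if the wall-clock timeout expired.
    """
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "candidate.py")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(source)
        try:
            proc = subprocess.run(
                [sys.executable, "-I", path],  # -I: ignore user site and env vars
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,  # the child is killed when this expires
            )
        except subprocess.TimeoutExpired:
            return None
        return proc.returncode, proc.stdout


print(run_isolated("print(2 + 2)"))
print(run_isolated("while True:\n    pass\n", timeout_s=1.0))
```

A separate process protects the rollout worker from crashes and hangs, but it is not a security boundary by itself, which is why the note above restricts rollouts to the disposable sandbox.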
## Layout

```
spec_rl/
  spec_rl.py      # load_environment(num_examples=20) -> vf.Environment
  pyproject.toml  # name = "spec-rl", depends on verifiers + datasets
  README.md
```

`load_environment(num_examples=20)` builds a `vf.SingleTurnEnv` over the first
`num_examples` HumanEval problems with a `vf.Rubric` wrapping the `@vf.reward`
`code_reward` function (which scores via `fraction_passing`).

## Run it

Install the env, then evaluate Laguna XS.2 through it:

```bash
prime env install spec_rl
prime eval run spec_rl -m poolside/Laguna-XS.2 -n 20
prime eval view
```

`-m poolside/Laguna-XS.2` resolves to whatever endpoint you alias in
`./configs/endpoints.toml`. To show the cheaper-rollout result, define two
aliases pointing at the same model (one plain vLLM server, one DFlash-speculated
server) and run the eval against each:

```toml
# configs/endpoints.toml
[[endpoint]]
endpoint_id = "laguna-baseline"
model = "poolside/Laguna-XS.2"
url = "http://<baseline-vllm-host>:8000/v1"
key = "VLLM_API_KEY"
type = "openai_chat_completions"

[[endpoint]]
endpoint_id = "laguna-dflash"
model = "poolside/Laguna-XS.2"
url = "http://<dflash-vllm-host>:8000/v1"
key = "VLLM_API_KEY"
type = "openai_chat_completions"
```

The DFlash server is launched with the speculator config:

```bash
VLLM_USE_DEEP_GEMM=0 vllm serve poolside/Laguna-XS.2 \
  --speculative-config '{"model":"poolside/Laguna-XS.2-speculator.dflash","num_speculative_tokens":7,"method":"dflash"}'
# vLLM >= 0.21.0, parsers poolside_v1; vLLM does NOT need --trust-remote-code.
```

Then:

```bash
prime eval run spec_rl -m laguna-baseline -n 20
prime eval run spec_rl -m laguna-dflash -n 20
```

Identical reward, higher throughput on the DFlash run. Read realized acceptance
length (tau) and tokens/sec from the DFlash server's `/metrics`: these are
**measured at the venue**, not quoted from any published figure.

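One way to turn a `/metrics` scrape into an acceptance length is sketched below. The counter names are an assumption about what the vLLM server exposes and should be checked against the actual scrape; the relation `tau = 1 + accepted_tokens / drafts` is likewise an assumed definition of mean acceptance length, and the sample text uses synthetic numbers for illustration only, not measurements.

```python
# Assumed counter names; verify them against the server's real /metrics output.
ACCEPTED = "vllm:spec_decode_num_accepted_tokens_total"
DRAFTS = "vllm:spec_decode_num_drafts_total"


def sum_metric(metrics_text: str, name: str) -> float:
    """Sum a counter across all label sets in Prometheus text format.

    Assumes each sample line is `name{labels} value` with no trailing timestamp.
    """
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        metric, _, value = line.rpartition(" ")
        if metric.split("{", 1)[0] == name:
            total += float(value)
    return total


def mean_acceptance_length(metrics_text: str) -> float:
    """tau = 1 + accepted draft tokens per draft round (assumed definition)."""
    drafts = sum_metric(metrics_text, DRAFTS)
    if drafts == 0:
        return 1.0  # no drafts yet: one target token per step
    return 1.0 + sum_metric(metrics_text, ACCEPTED) / drafts


# Synthetic scrape, for illustration only.
SAMPLE_METRICS = """
# TYPE vllm:spec_decode_num_drafts_total counter
vllm:spec_decode_num_drafts_total{model_name="poolside/Laguna-XS.2"} 100.0
# TYPE vllm:spec_decode_num_accepted_tokens_total counter
vllm:spec_decode_num_accepted_tokens_total{model_name="poolside/Laguna-XS.2"} 350.0
"""

print(mean_acceptance_length(SAMPLE_METRICS))
```

Because the counters are cumulative, scraping once before and once after a run and differencing the two readings isolates that run's acceptance length.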