# spec_rl: code RL on a DFlash-speculated endpoint

A small [`verifiers`](https://github.com/PrimeIntellect-ai/verifiers) environment
for the combined hackathon thesis:

> **Lossless DFlash speculative decoding makes RL post-training cheaper.**

`spec_rl` is a HumanEval-style code-completion task. The policy model
(Laguna XS.2) is given a function signature + docstring and must write the body.
The `@vf.reward` `code_reward` function executes that body against the problem's
unit tests and returns the **fraction of assertions that pass** (a value in
`[0,1]`) via `fraction_passing(problem, text)`. This is a *unit-test-grounded,
verifiable, dense* reward, exactly the kind verifiers RL is built for. A
fractional (rather than binary all-or-nothing) reward avoids GRPO all-zero-group
advantage collapse on hard prompts, where every rollout would otherwise score
`0.0`. The reported pass@1 **eval** stays binary (`evals/humaneval_subset.py`):
reward is the learning signal, eval is the scoreboard.
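
To make the all-zero-group argument concrete, here is a minimal sketch of GRPO-style group normalization. The `group_advantages` helper, the `eps` value, and the reward values are illustrative assumptions; a given trainer may normalize differently.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps). Hypothetical helper."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hard prompt: no rollout passes every assertion, so binary rewards are all 0.
binary = [0.0, 0.0, 0.0, 0.0]
# The same rollouts scored by fraction of assertions passed (made-up values).
fractional = [0.0, 0.25, 0.5, 0.25]

print(group_advantages(binary))      # every advantage is zero: no learning signal
print(group_advantages(fractional))  # non-zero advantages: rollouts are ranked
```

With binary rewards the group has zero variance, so every rollout gets the same (zero) advantage. The fractional reward keeps the rollouts distinguishable even when none of them fully solves the problem.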

## The point

`verifiers` runs RL rollouts against an OpenAI-compatible endpoint declared in
`./configs/endpoints.toml`. Point that endpoint at the **DFlash-speculated vLLM
server** instead of a plain one and you get the **same reward curve at higher
rollout throughput**:

- Speculative decoding is **lossless** under greedy decoding. The 0.6B DFlash
  drafter proposes `num_speculative_tokens = 7` tokens; the target model
  (Laguna XS.2) verifies them, so accepted text is **token-identical** to the
  no-speculator baseline.
- The reward depends only on the generated text, so an identical reward signal
  is produced.
- Only the **cost per rollout** drops (fewer target-model forward passes per
  accepted token → higher tokens/sec → cheaper RL).

That is the measurable claim: feed the same env two endpoints (baseline vs
DFlash), show one reward curve, two throughputs.
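
The losslessness argument can be illustrated with a toy greedy speculative decoder. This is a sketch under stated assumptions, not DFlash or vLLM code: `target_next` and `draft_next` are hypothetical stand-ins for the two models, and the verification loop simulates position by position what a real target does in a single forward pass.

```python
def target_next(ctx):
    """Toy greedy target model: a deterministic next token for a context."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx):
    """Toy drafter: agrees with the target except at every 4th position."""
    tok = target_next(ctx)
    return tok if len(ctx) % 4 else (tok + 1) % 50


def greedy_decode(prompt, n):
    """Baseline: one target pass per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out[len(prompt):], n


def speculative_decode(prompt, n, k=7):
    """Greedy speculative decoding: draft k tokens, keep the verified prefix."""
    out, passes = list(prompt), 0
    while len(out) - len(prompt) < n:
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        passes += 1  # a real target scores all k positions in one forward pass
        emitted = []
        for tok in draft:
            want = target_next(out + emitted)
            emitted.append(want)  # only the target's own token is ever emitted
            if tok != want:       # first mismatch: keep the correction, drop the rest
                break
        else:
            emitted.append(target_next(out + emitted))  # bonus token when all k match
        out.extend(emitted)
    return out[len(prompt):len(prompt) + n], passes


baseline, base_passes = greedy_decode([1, 2, 3], 20)
speculated, spec_passes = speculative_decode([1, 2, 3], 20)
print(speculated == baseline, base_passes, spec_passes)
```

Every emitted token is the target's own greedy choice, so the output matches the baseline no matter how good the drafter is. Drafter quality only changes how many target passes are needed.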

## How the reward works

1. The dataset carries each HumanEval problem's original `prompt` (signature +
   docstring), `test` (the `check(candidate)` harness), and `entry_point` in
   `info`, so the grader never depends on the model echoing the signature.
2. The model's completion is trimmed at the first stop sequence
   (`\nclass `, `\ndef `, `\n#`, `\nif __name__`) so a chatty model can't smuggle
   a second definition past the grader. This matches `evals/humaneval_subset.py`.
3. `spec_rl.fraction_passing()` assembles `prompt + completion + test +
   check(entry_point)` and runs it in a **fresh `python` subprocess with an 8s
   wall-clock timeout**, isolated from the rollout worker. It AST-instruments each
   `assert` in the HumanEval `check()` (via `_AssertCounter`) so a failing assert
   is **counted in the denominator instead of aborting on the first failure**;
   this also makes loop-based checks fractional. The reward is `passed_asserts /
   total_asserts`, a value in `[0,1]`. A crash, exception, or timeout before any
   assertion runs → `0.0`; every assertion passing → `1.0`.
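
Steps 2 and 3 can be sketched with the standard library alone. This is not the `_AssertCounter` or `fraction_passing` code from `spec_rl.py`: `trim_completion`, `AssertTally`, `_tally`, and `fraction_passing_sketch` are hypothetical names, and rewriting each `assert` into a counted lambda call is just one possible way to instrument them (it needs Python 3.9+ for `ast.unparse`).

```python
import ast
import json
import subprocess
import sys

STOP_SEQUENCES = ("\nclass ", "\ndef ", "\n#", "\nif __name__")

# Injected ahead of the candidate: counts each assertion instead of raising.
PRELUDE = '''
import json as _json
_counts = {"passed": 0, "total": 0}
def _tally(check):
    _counts["total"] += 1
    try:
        _counts["passed"] += bool(check())
    except Exception:
        pass
'''


def trim_completion(text):
    """Cut the completion at the earliest stop sequence, if any occurs."""
    hits = [i for i in (text.find(s) for s in STOP_SEQUENCES) if i != -1]
    return text[:min(hits, default=len(text))]


class AssertTally(ast.NodeTransformer):
    """Rewrite `assert X` into `_tally(lambda: X)` so a failure is counted."""

    def visit_Assert(self, node):
        stmt = ast.parse("_tally(lambda: None)").body[0]
        stmt.value.args[0].body = node.test
        return ast.copy_location(stmt, node)


def fraction_passing_sketch(prompt, completion, test, entry_point, timeout=8.0):
    tree = ast.fix_missing_locations(AssertTally().visit(ast.parse(test)))
    runner = (
        f"try:\n    check({entry_point})\n"
        "finally:\n    print(_json.dumps(_counts))\n"
    )
    program = "\n".join(
        [PRELUDE, prompt + trim_completion(completion), ast.unparse(tree), runner]
    )
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True, text=True, timeout=timeout,
        )
        counts = json.loads(proc.stdout.strip().splitlines()[-1])
        return counts["passed"] / counts["total"] if counts["total"] else 0.0
    except (subprocess.TimeoutExpired, IndexError, ValueError, KeyError, TypeError):
        return 0.0  # timeout, crash before any output, or unparsable output


prompt = 'def add(a, b):\n    """Return a + b."""\n'
test = (
    "def check(candidate):\n"
    "    assert candidate(1, 2) == 3\n"
    "    assert candidate(-1, 1) == 0\n"
    "    assert candidate(-1, 2) == 1\n"
    "    assert candidate(0, 0) == 0\n"
)
buggy = "    return a + b if a >= 0 else 0\n"
print(fraction_passing_sketch(prompt, buggy, test, "add"))  # 3 of 4 asserts pass
```

The sketch trusts the last line of stdout, which a candidate could spoof by printing its own JSON. A hardened grader would pass the counts back over a channel the candidate cannot write to.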

The execution + pass/fail logic is plain stdlib and importable without
`verifiers` or a GPU, so it is unit-testable locally on Apple Silicon. A built-in
smoke test runs with:

```bash
python spec_rl.py   # checks passing / failing / timeout completions
```

> **Safety:** this executes model-generated code to grade it. Each candidate
> runs in a short-lived, isolated subprocess. Run RL rollouts only in a
> disposable, isolated sandbox, never against real data.

## Layout

```
spec_rl/
  spec_rl.py      # load_environment(num_examples=20) -> vf.Environment
  pyproject.toml  # name = "spec-rl", depends on verifiers + datasets
  README.md
```

`load_environment(num_examples=20)` builds a `vf.SingleTurnEnv` over the first
`num_examples` HumanEval problems with a `vf.Rubric` wrapping the `@vf.reward`
`code_reward` function (which scores via `fraction_passing`).

## Run it

Install the env, then evaluate Laguna XS.2 through it:

```bash
prime env install spec_rl
prime eval run spec_rl -m poolside/Laguna-XS.2 -n 20
prime eval view
```

`-m poolside/Laguna-XS.2` resolves to whatever endpoint you alias in
`./configs/endpoints.toml`. To show the cheaper-rollout result, define two
aliases pointing at the same model (one plain vLLM server, one DFlash-speculated
server) and run the eval against each:

```toml
# configs/endpoints.toml
[[endpoint]]
endpoint_id = "laguna-baseline"
model = "poolside/Laguna-XS.2"
url = "http://<baseline-vllm-host>:8000/v1"
key = "VLLM_API_KEY"
type = "openai_chat_completions"

[[endpoint]]
endpoint_id = "laguna-dflash"
model = "poolside/Laguna-XS.2"
url = "http://<dflash-vllm-host>:8000/v1"
key = "VLLM_API_KEY"
type = "openai_chat_completions"
```

The DFlash server is launched with the speculator config:

```bash
VLLM_USE_DEEP_GEMM=0 vllm serve poolside/Laguna-XS.2 \
  --speculative-config '{"model":"poolside/Laguna-XS.2-speculator.dflash","num_speculative_tokens":7,"method":"dflash"}'
# vLLM >= 0.21.0, parsers poolside_v1; vLLM does NOT need --trust-remote-code.
```

Then:

```bash
prime eval run spec_rl -m laguna-baseline -n 20
prime eval run spec_rl -m laguna-dflash   -n 20
```

Identical reward, higher throughput on the DFlash run. Read realized acceptance
length (tau) and tokens/sec from the DFlash server's `/metrics`; these are
**measured directly**, not quoted from any published figure.
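
One way to turn a `/metrics` scrape into an acceptance length is sketched below. The counter names (`vllm:spec_decode_num_drafts`, `vllm:spec_decode_num_accepted_tokens`) are assumptions about what the server exports and should be checked against the actual scrape. The sketch also assumes tau means tokens emitted per verification step, i.e. one target token plus the accepted draft tokens. The sample text is made up.

```python
import re


def counter_sum(metrics_text, name):
    """Sum a Prometheus counter over all label sets, tolerating a `_total` suffix."""
    pattern = re.compile(
        r"^" + re.escape(name) + r"(?:_total)?(?:\{[^}]*\})?[ \t]+(\S+)[ \t]*$",
        re.M,
    )
    return sum(float(value) for value in pattern.findall(metrics_text))


def mean_acceptance_length(metrics_text):
    """1 + accepted draft tokens per draft step; None if no drafts were recorded."""
    drafts = counter_sum(metrics_text, "vllm:spec_decode_num_drafts")  # assumed name
    accepted = counter_sum(metrics_text, "vllm:spec_decode_num_accepted_tokens")  # assumed name
    return 1 + accepted / drafts if drafts else None


# Made-up scrape in Prometheus text exposition format.
sample = """\
# TYPE vllm:spec_decode_num_drafts_total counter
vllm:spec_decode_num_drafts_total{model_name="poolside/Laguna-XS.2"} 100.0
# TYPE vllm:spec_decode_num_accepted_tokens_total counter
vllm:spec_decode_num_accepted_tokens_total{model_name="poolside/Laguna-XS.2"} 450.0
"""
print(mean_acceptance_length(sample))
```

Counters are cumulative, so to attribute the figure to one eval run, scrape before and after the run and difference the two readings.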