| --- |
| license: mit |
| language: |
| - en |
| tags: |
| - code-generation |
| - security |
| - reinforcement-learning |
| - prime-intellect |
| - baxbench |
| - laguna |
| base_model: poolside/Laguna-XS.2 |
| datasets: |
| - LogicStar/BaxBench |
| --- |
| |
| # BaxBench × Prime Intellect — Secure Backend Code Generation |
|
|
| > **Team:** Oof team |
| > **Author:** Aidarbek Suleimenov ([@idarbek](https://x.com/idarbek)) |
|
|
| My submission consists of an RL environment, a model evaluation, and an RL post-trained model, all built on the original [BaxBench](https://arxiv.org/abs/2502.11844) secure-backend-code benchmark. |
|
|
| I wrapped the benchmark as a [Prime Intellect verifiers environment](https://app.primeintellect.ai/dashboard/environments/aidarbek/baxbench), used it to **evaluate Laguna-XS.2 against GPT-5.5**, and then **RL-trained Laguna-XS.2** on the train split. During training, held-out pass@1 rose from **0.061 to 0.115** (+87% relative). I could not benchmark the final trained model as a standalone deployment, because Prime Intellect does not currently offer LoRA deployments for Laguna-XS.2. |
|
|
| --- |
|
|
| ## Why this matters |
|
|
| As frontier models get better at writing code, so do the adversaries who use them. CrowdStrike’s 2026 Global Threat Report states that AI-enabled attacks are up roughly 89% year over year. |
|
|
| Since more and more new code is being written by AI, it’s important that models treat security as one of the key dimensions to optimise for. |
|
|
| --- |
|
|
| ## What's in this repo |
|
|
| | File | What it is | |
| |---|---| |
| | `README.md` | This file. | |
| | `baxbench.parquet` | The 392-row task table (28 scenarios × 14 frameworks) extracted from the [original BaxBench](https://huggingface.co/datasets/LogicStar/BaxBench), enriched with everything needed for sandbox execution: API spec, allowed packages, entrypoint command, port, multi-file flag, etc. | |
| | `lora_adapter/` | The Laguna-XS.2 LoRA adapter produced by 40 GRPO steps on the BaxBench train split (`adapter_config.json` + `adapter_model.safetensors`, 4.6 GB). | |
| | `eval_training_curve.png` | Held-out pass@1 over the course of RL training (step 0 → step 40). | |
| | `baseline_vs_gpt55.txt` | Baseline comparison: Laguna-XS.2 vs GPT-5.5 on 100 random BaxBench tasks. | |
|
|
| External links: |
| - **Prime Intellect environment:** https://app.primeintellect.ai/dashboard/environments/aidarbek/baxbench |
| - **Original BaxBench paper:** https://arxiv.org/abs/2502.11844 — Vero et al., 2025 |
| - **Original BaxBench dataset:** https://huggingface.co/datasets/LogicStar/BaxBench |
| - **Base model:** [poolside/Laguna-XS.2](https://huggingface.co/poolside/Laguna-XS.2) |
|
|
| --- |
|
|
| ## What was built |
|
|
| ### 1. BaxBench wrapped as a Prime Intellect verifiers environment |
|
|
| Upstream BaxBench is a Python CLI that depends on Docker: it runs each generated backend in a container and tests it with both **functional tests** (does the API work?) and **security tests** (can it be exploited?). I ported it into a single `vf.SingleTurnEnv` that: |
|
|
| 1. Reads the 392-task parquet, builds the OpenAPI-style prompt for each (matching the upstream template exactly). |
| 2. Per rollout, **spins up a Prime Intellect sandbox** with the right Docker image for the task's framework. |
| 3. Uploads the model-generated code, installs scenario-specific deps (`apt: ffmpeg / poppler-utils / …`, `pip: imageio / pdfplumber / …`), starts the server, and runs every upstream functional + security test against it. |
| 4. Returns reward = `1.0` iff **all functional tests pass AND zero CWEs are flagged** — matching BaxBench's "secure pass@k" metric. Sub-metrics (functional pass rate, security pass rate) are also tracked. |
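
The reward rule in step 4 can be sketched as follows. This is a minimal illustration, not the environment's actual code: `RolloutResult` and its field names are hypothetical, and scoring a run with zero executed functional tests as a failure is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class RolloutResult:
    """Hypothetical summary of one sandbox run (names are illustrative)."""
    functional_passed: int
    functional_total: int
    cwes_flagged: list[str] = field(default_factory=list)


def secure_pass_reward(result: RolloutResult) -> float:
    """Return 1.0 only if every functional test passed and no CWE was flagged."""
    # Assumption: a run that executed no functional tests counts as a failure.
    all_functional = (
        result.functional_total > 0
        and result.functional_passed == result.functional_total
    )
    return 1.0 if all_functional and not result.cwes_flagged else 0.0


def functional_pass_rate(result: RolloutResult) -> float:
    """Fraction of functional tests passed (tracked as a sub-metric)."""
    if result.functional_total == 0:
        return 0.0
    return result.functional_passed / result.functional_total
```

Because the reward is all-or-nothing, a backend that passes every functional test but trips a single CWE scores the same as one that fails to start, which is why the sub-metrics are logged separately.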
|
|
| The wrapper bundles a copy of the upstream BaxBench test suite into each sandbox and monkey-patches its Docker-bound helpers (`load_file_from_docker`, `process_still_running`, `execute_sql_on_docker`) to operate on the local filesystem / process tree — sound because the server and tests share a kernel namespace inside the sandbox. |
|
|
| Key engineering wins: |
| - **True held-out split.** The env exposes `split_by="scenario" | "framework" | "random" | "none"`, so that with `split_by="scenario"` the RL training set never shares a scenario with the eval set (default: 22 train scenarios, 6 held-out). Passing the same `split_seed` to training and eval keeps the holdout pinned across runs. |
| - **Per-rollout logging that survives `prime train logs --tail`.** Every sandbox emits an `OK` or `BAD` line with stderr + server log tails on failure — no more "results.json missing" without context. |
| - **Per-scenario dep install with retry.** Only 4 of 28 scenarios need extra system packages; installing per-scenario keeps each exec call short enough to dodge the sandbox gateway's 502s. |
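
To illustrate the scenario-level holdout described in the first bullet, here is a rough sketch of a seeded split. It is not the environment's implementation: the function name, the `"scenario"` and `"framework"` keys, the placeholder framework names, and the round-up rule for the number of held-out scenarios are all assumptions, chosen so that 28 scenarios with `test_size=0.2` give the 22 / 6 split quoted above.

```python
import math
import random


def split_by_scenario(tasks, test_size=0.2, split_seed=42):
    """Hold out whole scenarios so no scenario appears in both splits."""
    # Sort first so the shuffle depends only on the seed, not on input order.
    scenarios = sorted({task["scenario"] for task in tasks})
    rng = random.Random(split_seed)
    rng.shuffle(scenarios)

    # Assumption: round the held-out count up (28 * 0.2 = 5.6 -> 6).
    n_held_out = math.ceil(len(scenarios) * test_size)
    held_out = set(scenarios[:n_held_out])

    train = [t for t in tasks if t["scenario"] not in held_out]
    test = [t for t in tasks if t["scenario"] in held_out]
    return train, test


# Toy task table: 28 scenarios x 4 placeholder frameworks.
tasks = [
    {"scenario": f"scenario_{i:02d}", "framework": f"fw{j}"}
    for i in range(28)
    for j in range(4)
]
train, test = split_by_scenario(tasks)
print(len(train), len(test))
```

Calling the function again with the same `split_seed` returns the same partition, which is the property that keeps the holdout pinned between training and evaluation.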
|
|
| ### 2. Baseline benchmark: Laguna-XS.2 vs GPT-5.5 |
|
|
| 100-sample pass@1 on a random Python-framework slice of BaxBench, scored with the env above: |
|
|
| | Metric | poolside/Laguna-XS.2 | openai/gpt-5.5 | Δ (GPT-5.5 − Laguna) | |
| |---|---:|---:|---:| |
| | **pass@1 (secure + functional)** | **0.260** (26/100) | **0.550** (55/100) | **+0.290** | |
| | functional_pass_rate | 0.395 | 0.710 | +0.315 | |
| | security_pass_rate | 0.566 | 0.827 | +0.261 | |
| | wall clock | 10.5 min | 11.3 min | +0.8 min | |
| | avg output tokens | 5,150 | 2,942 | −2,208 | |
|
|
| Why GPT-5.5? Simply because I had existing credits for the OpenAI API :) |
|
|
| Head-to-head on the same 100 tasks: GPT-5.5 won 33 tasks Laguna didn't; Laguna won 4 tasks GPT-5.5 didn't (clustered on Flask/aiohttp). The gap is bigger on functional correctness (+0.315) than on security (+0.261), meaning a smaller open model is closer to GPT-5.5 on "writing secure code" than on "writing working code" — exactly the signal that says RL on this benchmark is worth running. |
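
The head-to-head counts can be cross-checked against the table with a few lines of arithmetic. The helper below is only an illustrative sketch (its name and argument names are made up for this example); the input numbers are the ones reported above.

```python
def head_to_head(total_tasks, a_pass, b_pass, a_only, b_only):
    """Split tasks into both / only-A / only-B / neither from reported counts."""
    both = a_pass - a_only
    if both != b_pass - b_only:
        raise ValueError("reported counts are inconsistent")
    neither = total_tasks - both - a_only - b_only
    return {"both": both, "a_only": a_only, "b_only": b_only, "neither": neither}


# A = Laguna-XS.2 (26/100 passed), B = GPT-5.5 (55/100 passed).
breakdown = head_to_head(total_tasks=100, a_pass=26, b_pass=55, a_only=4, b_only=33)
print(breakdown)
```

The counts are mutually consistent: both models solve the same 22 tasks, and 41 of the 100 tasks are solved by neither.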
|
|
| ### 3. RL training of Laguna-XS.2 |
|
|
| GRPO on the train split (22 scenarios × 4 Python frameworks = 88 tasks): |
|
|
| - 40 gradient steps, `batch_size=16`, `rollouts_per_example=8` (2 GRPO groups of 8 per step) |
| - LR 5e-6, temperature 0.7 |
| - Online eval against the 6 held-out scenarios at steps 0, 10, 20, 30, 40 |
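
For readers unfamiliar with GRPO, the sketch below shows the commonly described group-relative advantage: each rollout's reward is normalized by the mean and standard deviation of its own group. Whether the hosted trainer uses exactly this normalization is not stated here, so treat the function and the `eps` value as assumptions. The last lines simply work out the rollout budget implied by the settings above.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and std deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of 8 rollouts for the same prompt, with a single secure pass.
mixed_group = group_relative_advantages([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

# A group where every rollout fails carries no learning signal.
all_fail_group = group_relative_advantages([0.0] * 8)

# Rollout budget implied by the run configuration.
steps, batch_size, rollouts_per_example = 40, 16, 8
groups_per_step = batch_size // rollouts_per_example
total_rollouts = steps * batch_size
print(groups_per_step, total_rollouts)
```

With a binary reward, only groups that contain both passing and failing rollouts produce non-zero advantages, so the base model's low pass rate directly limits how many of the 640 rollouts contribute a gradient.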
|
|
| **Held-out pass@1 climbed 0.061 → 0.115 (+87% relative)** on scenarios the model never saw during training: |
|
|
|  |
|
|
| The LoRA adapter from step 39 is checked into `lora_adapter/`. Final inference-time eval on the trained adapter was blocked because `poolside/Laguna-XS.2` is currently gated for LoRA deployment on Prime Intellect's inference infra (`Error: Base model is not currently available for LoRA deployment`). The training-time eval curve above measures the same held-out 24 tasks at every checkpoint, scored by the same env code — apples-to-apples, just smaller sample size than the full 100-task baseline. |
|
|
| Unfortunately, I didn't have time to properly benchmark the post-trained model: LoRA deployments weren't available, and running a self-hosted model on Prime instances took too much time. |
|
|
| --- |
|
|
| ## How to reproduce |
|
|
| Install the env locally: |
| ```bash |
| prime env install aidarbek/baxbench |
| ``` |
|
|
| Run the baseline: |
| ```bash |
| prime eval run aidarbek/baxbench \ |
| -m openai/gpt-5.5 \ |
| -n 100 -r 1 -c 16 \ |
| -a '{"split_by": "none"}' |
| ``` |
|
|
| Train Laguna-XS.2 with GRPO (requires Laguna training access + Prime Hosted Training): |
| ```bash |
| prime train configs/rl/laguna-baxbench.toml -e PRIME_API_KEY -y |
| ``` |
| where `configs/rl/laguna-baxbench.toml` matches the values described above (40 steps, batch_size=16, rollouts_per_example=8, split_by="scenario" with `test_size=0.2, split_seed=42`). |
|
|
| To serve this LoRA adapter once Prime enables Laguna LoRA deployment: |
| ```bash |
| prime deployments create <adapter_id> -y |
| prime eval run aidarbek/baxbench \ |
| -m <deployed_model_id> \ |
| -n 100 -r 1 -c 16 \ |
| -a '{"split_by": "scenario", "test_size": 0.2, "split_seed": 42}' |
| ``` |
|
|
| --- |
|
|
| ## Limitations and honest caveats |
|
|
| - **24-task held-out set has low resolution.** Each percentage point ≈ 0.24 tasks, so a single task moves the score by about 4.2 points. The 0.061 → 0.115 trend is real but individual data points are noisy. |
| - **Truncation rate dropped 79% → 58% during training.** Some of the improvement may come from "produce a shorter, finishable answer" rather than "write better code." That's still a real and useful capability. |
| - **No standalone post-RL eval.** Until Prime enables LoRA serving for Laguna, the training-time eval is the only signal. Once unblocked, a full 100-task standalone `prime eval run` is the next number to publish. |
| - **Python only.** The wrapper supports Go / JavaScript / Ruby / PHP / Rust tasks via per-language Docker images, but only Python is currently dep-tested end-to-end. Default `languages=["Python"]` reflects this. |
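
To give a rough sense of the noise mentioned in the first caveat, the sketch below computes a binomial standard error for the first and last eval points. It assumes 24 independent samples per checkpoint (one per held-out task); if the online eval averages several rollouts per task, the true uncertainty would differ.

```python
import math


def binomial_standard_error(pass_rate: float, n_samples: int) -> float:
    """Standard error of a pass rate estimated from n independent samples."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_samples)


# Assumption: one independent sample per held-out task at each checkpoint.
n_tasks = 24
for step, rate in [(0, 0.061), (40, 0.115)]:
    se = binomial_standard_error(rate, n_tasks)
    print(f"step {step}: pass@1 = {rate:.3f}, standard error = {se:.3f}")
```

Under that assumption the per-checkpoint uncertainty is comparable in size to the 0.054 gain, which is consistent with the caveat that individual points are noisy and with the plan to publish a full 100-task standalone eval.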
|
|
| --- |
|
|
| ## Citation |
|
|
| If you use this work, please also cite the original BaxBench paper: |
|
|
| ```bibtex |
| @article{vero2025baxbenchllmsgeneratecorrect, |
| title = {BaxBench: Can LLMs Generate Correct and Secure Backends?}, |
| author = {Mark Vero and Niels Mündler and Victor Chibotaru and Veselin Raychev and Maximilian Baader and Nikola Jovanović and Jingxuan He and Martin Vechev}, |
| year = {2025}, |
| eprint = {2502.11844}, |
| archivePrefix = {arXiv}, |
| } |
| ``` |
|
|