	---
	license: mit
	language:
	- en
	tags:
	- code-generation
	- security
	- reinforcement-learning
	- prime-intellect
	- baxbench
	- laguna
	base_model: poolside/Laguna-XS.2
	datasets:
	- LogicStar/BaxBench
	---

	# BaxBench × Prime Intellect — Secure Backend Code Generation

	> Team: Oof team
	> Author: Aidarbek Suleimenov ([@idarbek](https://x.com/idarbek))

	My submission is an RL environment, a model evaluation, and an RL post-trained model, all built on the original [BaxBench](https://arxiv.org/abs/2502.11844) secure-backend-code benchmark.

	I wrapped the benchmark as a [Prime Intellect verifiers environment](https://app.primeintellect.ai/dashboard/environments/aidarbek/baxbench), used it to evaluate Laguna-XS.2 against GPT-5.5, and then RL-trained Laguna-XS.2 on the train split. During training, the eval score went from 0.061 → 0.115 (+87% relative), but I couldn’t benchmark the final trained model due to limits with Prime Intellect’s LoRA deployments for Laguna-XS.2.

	---

	## Why this matters

	As frontier models get better at writing code, so do the adversaries who use them. CrowdStrike’s 2026 Global Threat Report states that AI-enabled attacks are up roughly 89% year over year.

	Since more and more new code is being written by AI, it’s important that models treat security as one of the key dimensions to optimise for.

	---

	## What's in this repo

	| File | What it is |
	|---|---|
	| `README.md` | This file. |
	| `baxbench.parquet` | The 392-row task table (28 scenarios × 14 frameworks) extracted from the [original BaxBench](https://huggingface.co/datasets/LogicStar/BaxBench), enriched with everything needed for sandbox execution: API spec, allowed packages, entrypoint command, port, multi-file flag, etc. |
	| `lora_adapter/` | The Laguna-XS.2 LoRA adapter produced by 40 GRPO steps on the BaxBench train split (`adapter_config.json` + `adapter_model.safetensors`, 4.6 GB). |
	| `eval_training_curve.png` | Held-out pass@1 over the course of RL training (step 0 → step 40). |
	| `baseline_vs_gpt55.txt` | Baseline comparison: Laguna-XS.2 vs GPT-5.5 on 100 random BaxBench tasks. |

	External links:
	- Prime Intellect environment: https://app.primeintellect.ai/dashboard/environments/aidarbek/baxbench
	- Original BaxBench paper: https://arxiv.org/abs/2502.11844 — Vero et al., 2025
	- Original BaxBench dataset: https://huggingface.co/datasets/LogicStar/BaxBench
	- Base model: [poolside/Laguna-XS.2](https://huggingface.co/poolside/Laguna-XS.2)

	---

	## What was built

	### 1. BaxBench wrapped as a Prime Intellect verifiers environment

	Upstream BaxBench is a Python CLI that depends on Docker: it runs each generated backend in a container and tests it with both functional tests (does the API work?) and security tests (can it be exploited?). I ported it into a single `vf.SingleTurnEnv` that:

	1. Reads the 392-task parquet and builds the OpenAPI-style prompt for each task (matching the upstream template exactly).
	2. Per rollout, spins up a Prime Intellect sandbox with the right Docker image for the task's framework.
	3. Uploads the model-generated code, installs scenario-specific deps (`apt: ffmpeg / poppler-utils / …`, `pip: imageio / pdfplumber / …`), starts the server, and runs every upstream functional + security test against it.
	4. Returns reward = `1.0` iff all functional tests pass AND zero CWEs are flagged — matching BaxBench's "secure pass@k" metric. Sub-metrics (functional pass rate, security pass rate) are also tracked.
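
The scoring rule in step 4 can be sketched as a small function. This is a minimal illustration, not the env's actual code: the `RolloutResult` container and its field names are hypothetical, and treating a run with zero executed tests as a failure is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class RolloutResult:
    """Hypothetical container for one sandbox run (not the env's real type)."""
    functional_passed: int
    functional_total: int
    cwes: set[str] = field(default_factory=set)  # flagged CWE ids, e.g. {"CWE-89"}


def secure_pass_reward(result: RolloutResult) -> float:
    """Return 1.0 only if every functional test passed and no CWE was flagged."""
    all_functional = (
        result.functional_total > 0  # assumption: no tests run counts as a failure
        and result.functional_passed == result.functional_total
    )
    return 1.0 if all_functional and not result.cwes else 0.0


# A working but exploitable backend earns nothing, same as a broken one.
print(secure_pass_reward(RolloutResult(5, 5)))
print(secure_pass_reward(RolloutResult(5, 5, {"CWE-89"})))
print(secure_pass_reward(RolloutResult(4, 5)))
```

Because the reward is all-or-nothing, partially working or partially secure answers score the same as a complete failure, which is why the separate sub-metrics are useful for diagnosis.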

	The wrapper bundles a copy of the upstream BaxBench test suite into each sandbox and monkey-patches its Docker-bound helpers (`load_file_from_docker`, `process_still_running`, `execute_sql_on_docker`) to operate on the local filesystem / process tree — sound because the server and tests share a kernel namespace inside the sandbox.
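
The monkey-patching idea can be illustrated as follows. Only the helper name `load_file_from_docker` comes from this README; the stand-in `upstream` namespace, the `(container_id, filepath)` signature, and the replacement function are assumptions made for the sketch.

```python
import types
from pathlib import Path

# Stand-in for the upstream module that owns the Docker-bound helper.
# The real module path and the helper's signature are assumptions here.
upstream = types.SimpleNamespace()


def load_file_from_docker(container_id: str, filepath: str) -> bytes:
    """Placeholder for the original helper, which needs a Docker daemon."""
    raise RuntimeError("Docker is not available inside the sandbox")


upstream.load_file_from_docker = load_file_from_docker


def load_file_local(container_id: str, filepath: str) -> bytes:
    """Same signature, but reads directly from the sandbox filesystem."""
    return Path(filepath).read_bytes()  # container_id is ignored on purpose


# The monkey-patch: callers that resolve the helper through the module
# at call time now read local files instead of talking to Docker.
upstream.load_file_from_docker = load_file_local
```

One design caveat with this pattern in Python: code that did `from module import load_file_from_docker` holds its own reference, so the name has to be patched in the importing module as well.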

	Key engineering wins:
	- True held-out split. The env exposes `split_by="scenario" | "framework" | "random" | "none"` so the RL training set never overlaps with the eval set at the scenario level (default: 22 train scenarios, 6 held-out). Same `split_seed` on training and eval keeps the holdout pinned across runs.
	- Per-rollout logging that survives `prime train logs --tail`. Every sandbox emits an `OK` or `BAD` line with stderr + server log tails on failure — no more "results.json missing" without context.
	- Per-scenario dep install with retry. Only 4 of 28 scenarios need extra system packages; installing per-scenario keeps each exec call short enough to dodge the sandbox gateway's 502s.
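
A scenario-level split with a pinned seed might look like the sketch below. The function name, the task dictionary keys, and the choice to round the held-out count up are assumptions; only the parameter names `test_size` and `split_seed` and the 22/6 outcome come from this README.

```python
import math
import random


def split_by_scenario(tasks, test_size=0.2, split_seed=42):
    """Split tasks so that no scenario appears in both train and eval."""
    scenarios = sorted({t["scenario"] for t in tasks})  # sort for determinism
    rng = random.Random(split_seed)  # private RNG: same seed, same holdout
    rng.shuffle(scenarios)
    n_test = math.ceil(test_size * len(scenarios))  # assumption: round up
    held_out = set(scenarios[:n_test])
    train = [t for t in tasks if t["scenario"] not in held_out]
    test = [t for t in tasks if t["scenario"] in held_out]
    return train, test


# Synthetic task table: 28 scenarios x 4 frameworks (placeholder names).
tasks = [
    {"scenario": f"scenario_{s:02d}", "framework": f"framework_{f}"}
    for s in range(28)
    for f in range(4)
]
train, test = split_by_scenario(tasks)
print(len(train), len(test))  # 88 24
```

Splitting on scenario names rather than on rows is what prevents the same API spec from leaking into both sets under a different framework.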

	### 2. Baseline benchmark: Laguna-XS.2 vs GPT-5.5

	100-sample pass@1 on a random Python-framework slice of BaxBench, scored with the env above:

	| Metric | poolside/Laguna-XS.2 | openai/gpt-5.5 | Δ (GPT-5.5 − Laguna) |
	|---|---:|---:|---:|
	| pass@1 (secure + functional) | 0.260 (26/100) | 0.550 (55/100) | +0.290 |
	| functional_pass_rate | 0.395 | 0.710 | +0.315 |
	| security_pass_rate | 0.566 | 0.827 | +0.261 |
	| wall clock | 10.5 min | 11.3 min | +0.8 min |
	| avg output tokens | 5,150 | 2,942 | −2,208 |

	Why GPT-5.5? Simply because I had existing credits for the OpenAI API :)

	Head-to-head on the same 100 tasks: GPT-5.5 won 33 tasks Laguna didn't; Laguna won 4 tasks GPT-5.5 didn't (clustered on Flask/aiohttp). The gap is bigger on functional correctness (+0.315) than on security (+0.261), meaning a smaller open model is closer to GPT-5.5 on "writing secure code" than on "writing working code" — exactly the signal that says RL on this benchmark is worth running.
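
The head-to-head counts follow from set arithmetic on the per-task results. The sketch below uses synthetic task IDs (not the real ones) arranged to reproduce the reported totals, and shows that the figures are mutually consistent: 26 − 4 = 55 − 33 = 22 tasks solved by both models.

```python
def head_to_head(passed_a: set[str], passed_b: set[str]) -> dict[str, int]:
    """Count tasks solved by both models and by exactly one of them."""
    return {
        "both": len(passed_a & passed_b),
        "only_a": len(passed_a - passed_b),
        "only_b": len(passed_b - passed_a),
    }


# Synthetic IDs: tasks 0-21 solved by both, 22-25 only by Laguna,
# 26-58 only by GPT-5.5. The real task IDs are not reproduced here.
laguna = {f"task_{i}" for i in range(26)}
gpt = {f"task_{i}" for i in range(22)} | {f"task_{i}" for i in range(26, 59)}

print(len(laguna), len(gpt))      # 26 55
print(head_to_head(laguna, gpt))  # {'both': 22, 'only_a': 4, 'only_b': 33}
```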

	### 3. RL training of Laguna-XS.2

	GRPO on the train split (22 scenarios × 4 Python frameworks = 88 tasks):

	- 40 gradient steps, `batch_size=16`, `rollouts_per_example=8` (2 GRPO groups of 8 per step)
	- LR 5e-6, temperature 0.7
	- Online eval against the 6 held-out scenarios at steps 0, 10, 20, 30, 40
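
For readers unfamiliar with GRPO, the core idea is that each rollout is scored relative to the other rollouts for the same prompt. The sketch below shows the commonly described mean/std normalisation; whether Prime's trainer uses exactly this form (including the `eps` term) is an assumption.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: centre on the group mean, scale by group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


batch_size = 16
rollouts_per_example = 8
groups_per_step = batch_size // rollouts_per_example  # 2 prompts per step

# One success among 8 rollouts: the success gets a large positive advantage.
one_success = group_advantages([1.0] + [0.0] * 7)
# All 8 rollouts fail: every advantage is zero, so the group gives no signal.
all_failed = group_advantages([0.0] * 8)

print(groups_per_step, one_success[0] > 0, all_failed)
```

With a binary reward and a low baseline pass rate, many groups fall into the all-failed case, which is one reason progress over only 40 steps is expected to be modest.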

	Held-out pass@1 climbed 0.061 → 0.115 (+87% relative) on scenarios the model never saw during training:

	![Training eval curve](eval_training_curve.png)

	The LoRA adapter from step 39 is checked into `lora_adapter/`. Final inference-time eval on the trained adapter was blocked because `poolside/Laguna-XS.2` is currently gated for LoRA deployment on Prime Intellect's inference infra (`Error: Base model is not currently available for LoRA deployment`). The training-time eval curve above measures the same 24 held-out tasks at every checkpoint, scored by the same env code, so the checkpoints are compared apples-to-apples, just with a smaller sample than the full 100-task baseline.

	Unfortunately, I didn't have time to properly benchmark the post-trained model: LoRA deployments weren't available, and self-hosting the model on Prime instances took too much time.

	---

	## How to reproduce

	Install the env locally:
	```bash
	prime env install aidarbek/baxbench
	```

	Run the baseline:
	```bash
	prime eval run aidarbek/baxbench \
	-m openai/gpt-5.5 \
	-n 100 -r 1 -c 16 \
	-a '{"split_by": "none"}'
	```

	Train Laguna-XS.2 with GRPO (requires Laguna training access + Prime Hosted Training):
	```bash
	prime train configs/rl/laguna-baxbench.toml -e PRIME_API_KEY -y
	```
	where `configs/rl/laguna-baxbench.toml` matches the values described above (40 steps, batch_size=16, rollouts_per_example=8, split_by="scenario" with `test_size=0.2, split_seed=42`).

	To serve this LoRA adapter once Prime enables Laguna LoRA deployment:
	```bash
	prime deployments create <adapter_id> -y
	prime eval run aidarbek/baxbench \
	-m <deployed_model_id> \
	-n 100 -r 1 -c 16 \
	-a '{"split_by": "scenario", "test_size": 0.2, "split_seed": 42}'
	```

	---

	## Limitations and honest caveats

	- The 24-task held-out set has low resolution. Each task is worth about 4.2 percentage points (1/24), so one percentage point ≈ 0.24 tasks. The 0.061 → 0.115 trend is real but individual data points are noisy.
	- Truncation rate dropped 79% → 58% during training. Some of the improvement may come from "produce a shorter, finishable answer" rather than "write better code." That's still a real and useful capability.
	- No standalone post-RL eval. Until Prime enables LoRA serving for Laguna, the training-time eval is the only signal. Once unblocked, a full 100-task standalone `prime eval run` is the next number to publish.
	- Python only. The wrapper supports Go / JavaScript / Ruby / PHP / Rust tasks via per-language Docker images, but only Python is currently dep-tested end-to-end. Default `languages=["Python"]` reflects this.
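
To put a rough number on the noise in the first caveat, the sketch below computes a binomial standard error for the two reported scores. It assumes each of the 24 tasks contributes one independent pass/fail outcome, which is a simplification: the actual eval may average several rollouts per task.

```python
import math


def proportion_std_error(p: float, n: int) -> float:
    """Standard error of a pass rate p estimated from n independent trials."""
    return math.sqrt(p * (1 - p) / n)


n_tasks = 24  # held-out tasks: 6 scenarios x 4 Python frameworks
for label, p in [("start", 0.061), ("end", 0.115)]:
    se = proportion_std_error(p, n_tasks)
    print(f"{label}: pass@1 = {p:.3f} +/- {se:.3f} (1 s.e., n = {n_tasks})")
```

Under this simplified model both standard errors are of the same order as the gain itself, which is consistent with treating individual checkpoints as noisy and waiting for the 100-task standalone eval.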

	---

	## Citation

	If you use this work, please also cite the original BaxBench paper:

	```bibtex
	@article{vero2025baxbenchllmsgeneratecorrect,
	title = {BaxBench: Can LLMs Generate Correct and Secure Backends?},
	author = {Mark Vero and Niels Mündler and Victor Chibotaru and Veselin Raychev and Maximilian Baader and Nikola Jovanović and Jingxuan He and Martin Vechev},
	year = {2025},
	eprint = {2502.11844},
	archivePrefix = {arXiv},
	}
	```