---
license: mit
language:
- en
tags:
- code-generation
- security
- reinforcement-learning
- prime-intellect
- baxbench
- laguna
base_model: poolside/Laguna-XS.2
datasets:
- LogicStar/BaxBench
---
# BaxBench × Prime Intellect — Secure Backend Code Generation
> **Team:** Oof team
> **Author:** Aidarbek Suleimenov ([@idarbek](https://x.com/idarbek))
My submission is an RL environment, a model evaluation, and an RL post-trained model, all built on the original [BaxBench](https://arxiv.org/abs/2502.11844) secure-backend-code benchmark.
I wrapped the benchmark as a [Prime Intellect verifiers environment](https://app.primeintellect.ai/dashboard/environments/aidarbek/baxbench), used it to **evaluate Laguna-XS.2 against GPT-5.5**, and then **RL-trained Laguna-XS.2** on the train split. During training, the eval score went from **0.061 → 0.115** (+87% relative), but I couldn’t benchmark the final trained model due to limits with Prime Intellect’s LoRA deployments for Laguna-XS.2.
---
## Why this matters
As frontier models get better at writing code, so do the adversaries who use them. CrowdStrike’s 2026 Global Threat Report states that AI-enabled attacks are up roughly 89% year over year.
Since more and more new code is being written by AI, it’s important that models treat security as one of the key dimensions to optimise for.
---
## What's in this repo
| File | What it is |
|---|---|
| `README.md` | This file. |
| `baxbench.parquet` | The 392-row task table (28 scenarios × 14 frameworks) extracted from the [original BaxBench](https://huggingface.co/datasets/LogicStar/BaxBench), enriched with everything needed for sandbox execution: API spec, allowed packages, entrypoint command, port, multi-file flag, etc. |
| `lora_adapter/` | The Laguna-XS.2 LoRA adapter produced by 40 GRPO steps on the BaxBench train split (`adapter_config.json` + `adapter_model.safetensors`, 4.6 GB). |
| `eval_training_curve.png` | Held-out pass@1 over the course of RL training (step 0 → step 40). |
| `baseline_vs_gpt55.txt` | Baseline comparison: Laguna-XS.2 vs GPT-5.5 on 100 random BaxBench tasks. |
External links:
- **Prime Intellect environment:** https://app.primeintellect.ai/dashboard/environments/aidarbek/baxbench
- **Original BaxBench paper:** https://arxiv.org/abs/2502.11844 — Vero et al., 2025
- **Original BaxBench dataset:** https://huggingface.co/datasets/LogicStar/BaxBench
- **Base model:** [poolside/Laguna-XS.2](https://huggingface.co/poolside/Laguna-XS.2)
---
## What was built
### 1. BaxBench wrapped as a Prime Intellect verifiers environment
Upstream BaxBench is a Python CLI that depends on Docker: it runs each generated backend in a container and tests it with both **functional tests** (does the API work?) and **security tests** (can it be exploited?). I ported it into a single `vf.SingleTurnEnv` that:
1. Reads the 392-task parquet and builds the OpenAPI-style prompt for each task (matching the upstream template exactly).
2. Per rollout, **spins up a Prime Intellect sandbox** with the right Docker image for the task's framework.
3. Uploads the model-generated code, installs scenario-specific deps (`apt: ffmpeg / poppler-utils / …`, `pip: imageio / pdfplumber / …`), starts the server, and runs every upstream functional + security test against it.
4. Returns reward = `1.0` iff **all functional tests pass AND zero CWEs are flagged** — matching BaxBench's "secure pass@k" metric. Sub-metrics (functional pass rate, security pass rate) are also tracked.
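
The reward rule in step 4 can be sketched as follows. This is a minimal illustration, not the environment's actual code: `RolloutResult` and its field names are hypothetical, and treating a run with zero functional tests as a failure is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class RolloutResult:
    """Hypothetical container for one sandbox run's test outcome."""
    functional_passed: int
    functional_total: int
    cwes_flagged: list[str] = field(default_factory=list)


def secure_pass_reward(result: RolloutResult) -> float:
    """Return 1.0 only if every functional test passed and no CWE was flagged."""
    all_functional = (
        result.functional_total > 0  # assumption: no tests run counts as a failure
        and result.functional_passed == result.functional_total
    )
    return 1.0 if all_functional and not result.cwes_flagged else 0.0
```

A binary reward like this mirrors "secure pass@k" directly, at the cost of giving no partial credit for code that works but is exploitable.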
The wrapper bundles a copy of the upstream BaxBench test suite into each sandbox and monkey-patches its Docker-bound helpers (`load_file_from_docker`, `process_still_running`, `execute_sql_on_docker`) to operate on the local filesystem / process tree — sound because the server and tests share a kernel namespace inside the sandbox.
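
To illustrate the monkey-patching idea, here is a small sketch. The `upstream` namespace stands in for the bundled BaxBench module, and the `(container_id, filepath)` signature is an assumption; only the helper name `load_file_from_docker` comes from the text above.

```python
import types
from pathlib import Path

# Stand-in for the bundled upstream module (hypothetical; the real module
# path and helper signature are not shown in this README).
upstream = types.SimpleNamespace()


def load_file_from_docker(container_id: str, filepath: str) -> bytes:
    """Placeholder for the original helper, which needs a Docker daemon."""
    raise RuntimeError("requires a Docker daemon")


upstream.load_file_from_docker = load_file_from_docker


def load_file_local(container_id: str, filepath: str) -> bytes:
    """Ignore the container id and read directly from the local filesystem."""
    return Path(filepath).read_bytes()


# The monkey-patch: any caller that looks the helper up on the module at
# call time now gets the local-filesystem version.
upstream.load_file_from_docker = load_file_local
```

Keeping the replacement's signature identical to the original means the upstream tests do not need to change.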
Key engineering wins:
- **True held-out split.** The env exposes `split_by="scenario" | "framework" | "random" | "none"` so the RL training set never overlaps with the eval set at the scenario level (default: 22 train scenarios, 6 held-out). Same `split_seed` on training and eval keeps the holdout pinned across runs.
- **Per-rollout logging that survives `prime train logs --tail`.** Every sandbox emits an `OK` or `BAD` line with stderr + server log tails on failure — no more "results.json missing" without context.
- **Per-scenario dep install with retry.** Only 4 of 28 scenarios need extra system packages; installing per-scenario keeps each exec call short enough to dodge the sandbox gateway's 502s.
### 2. Baseline benchmark: Laguna-XS.2 vs GPT-5.5
100-sample pass@1 on a random Python-framework slice of BaxBench, scored with the env above:
| Metric | poolside/Laguna-XS.2 | openai/gpt-5.5 | Δ |
|---|---:|---:|---:|
| **pass@1 (secure + functional)** | **0.260** (26/100) | **0.550** (55/100) | **+0.290** |
| functional_pass_rate | 0.395 | 0.710 | +0.315 |
| security_pass_rate | 0.566 | 0.827 | +0.261 |
| wall clock | 10.5 min | 11.3 min | +0.8 min |
| avg output tokens | 5,150 | 2,942 | −2,208 |
Why GPT-5.5? Simply because I had existing credits for the OpenAI API :)
Head-to-head on the same 100 tasks: GPT-5.5 won 33 tasks Laguna didn't; Laguna won 4 tasks GPT-5.5 didn't (clustered on Flask/aiohttp). The gap is bigger on functional correctness (+0.315) than on security (+0.261), meaning a smaller open model is closer to GPT-5.5 on "writing secure code" than on "writing working code" — exactly the signal that says RL on this benchmark is worth running.
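
The head-to-head counts can be derived from per-task pass flags as sketched below. The function name and the toy data are illustrative, not the real per-task results; the final assertion only checks that the reported numbers are internally consistent.

```python
def head_to_head(a_pass: dict[str, bool], b_pass: dict[str, bool]) -> dict[str, int]:
    """Count tasks solved only by A, only by B, by both, or by neither."""
    if a_pass.keys() != b_pass.keys():
        raise ValueError("both runs must cover the same task ids")
    counts = {"only_a": 0, "only_b": 0, "both": 0, "neither": 0}
    for task_id, a_ok in a_pass.items():
        b_ok = b_pass[task_id]
        if a_ok and b_ok:
            counts["both"] += 1
        elif a_ok:
            counts["only_a"] += 1
        elif b_ok:
            counts["only_b"] += 1
        else:
            counts["neither"] += 1
    return counts


# Toy data for illustration only.
laguna = {"t1": True, "t2": False, "t3": False, "t4": True}
gpt = {"t1": True, "t2": True, "t3": False, "t4": False}
counts = head_to_head(laguna, gpt)

# Consistency check on the reported figures: the net gap in passes (55 - 26)
# must equal the difference in exclusive wins (33 - 4).
assert 55 - 26 == 33 - 4
```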
### 3. RL training of Laguna-XS.2
GRPO on the train split (22 scenarios × 4 Python frameworks = 88 tasks):
- 40 gradient steps, `batch_size=16`, `rollouts_per_example=8` (2 GRPO groups of 8 per step)
- LR 5e-6, temperature 0.7
- Online eval against the 6 held-out scenarios at steps 0, 10, 20, 30, 40
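
For readers new to GRPO, the group-relative advantage is commonly described as each rollout's reward standardized against the other rollouts of the same prompt. The sketch below follows that common formulation; whether Prime's hosted trainer normalizes in exactly this way is an assumption.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one GRPO group (rollouts of one prompt)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)      # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# One group of 8 rollouts with a binary reward: a single secure pass.
one_success = group_advantages([1.0] + [0.0] * 7)

# A group where every rollout fails yields all-zero advantages,
# so it contributes no learning signal.
all_fail = group_advantages([0.0] * 8)
```

With a binary reward and a low starting pass rate, many groups are all-fail, which is one reason more rollouts per example help.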
**Held-out pass@1 climbed 0.061 → 0.115 (+87% relative)** on scenarios the model never saw during training:
![Training eval curve](eval_training_curve.png)
The LoRA adapter from step 39 is checked into `lora_adapter/`. Final inference-time eval on the trained adapter was blocked because `poolside/Laguna-XS.2` is currently gated for LoRA deployment on Prime Intellect's inference infra (`Error: Base model is not currently available for LoRA deployment`). The training-time eval curve above measures the same held-out 24 tasks at every checkpoint, scored by the same env code — apples-to-apples, just smaller sample size than the full 100-task baseline.
Unfortunately, I didn't have time to properly benchmark the post-trained model: LoRA deployments weren't available, and self-hosting the model on Prime instances took too long.
---
## How to reproduce
Install the env locally:
```bash
prime env install aidarbek/baxbench
```
Run the baseline:
```bash
prime eval run aidarbek/baxbench \
-m openai/gpt-5.5 \
-n 100 -r 1 -c 16 \
-a '{"split_by": "none"}'
```
Train Laguna-XS.2 with GRPO (requires Laguna training access + Prime Hosted Training):
```bash
prime train configs/rl/laguna-baxbench.toml -e PRIME_API_KEY -y
```
where `configs/rl/laguna-baxbench.toml` matches the values described above (40 steps, batch_size=16, rollouts_per_example=8, split_by="scenario" with `test_size=0.2, split_seed=42`).
To serve this LoRA adapter once Prime enables Laguna LoRA deployment:
```bash
prime deployments create <adapter_id> -y
prime eval run aidarbek/baxbench \
-m <deployed_model_id> \
-n 100 -r 1 -c 16 \
-a '{"split_by": "scenario", "test_size": 0.2, "split_seed": 42}'
```
---
## Limitations and honest caveats
- **24-task held-out set has low resolution.** Each percentage point ≈ 0.24 tasks (one task ≈ 4.2 points). The 0.061 → 0.115 trend is real but individual data points are noisy.
- **Truncation rate dropped 79% → 58% during training.** Some of the improvement may come from "produce a shorter, finishable answer" rather than "write better code." That's still a real and useful capability.
- **No standalone post-RL eval.** Until Prime enables LoRA serving for Laguna, the training-time eval is the only signal. Once unblocked, a full 100-task standalone `prime eval run` is the next number to publish.
- **Python only.** The wrapper supports Go / JavaScript / Ruby / PHP / Rust tasks via per-language Docker images, but only Python is currently dep-tested end-to-end. Default `languages=["Python"]` reflects this.
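
To put a rough number on the noise in the first caveat, the sketch below computes a binomial standard error. It assumes each eval point is 24 independent pass/fail trials, which is a simplification: the actual number of rollouts per task at eval time is not stated here. Under that assumption the standard error is about 0.05 to 0.07, the same order as the 0.054 improvement.

```python
import math


def binomial_se(p: float, n: int) -> float:
    """Standard error of a pass rate p estimated from n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)


n_tasks = 24                       # held-out tasks (6 scenarios x 4 frameworks)
step_size = 1 / n_tasks            # score change from a single task flipping
se_start = binomial_se(0.061, n_tasks)
se_end = binomial_se(0.115, n_tasks)
```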
---
## Citation
If you use this work, please also cite the original BaxBench paper:
```bibtex
@article{vero2025baxbenchllmsgeneratecorrect,
title = {BaxBench: Can LLMs Generate Correct and Secure Backends?},
author = {Mark Vero and Niels Mündler and Victor Chibotaru and Veselin Raychev and Maximilian Baader and Nikola Jovanović and Jingxuan He and Martin Vechev},
year = {2025},
eprint = {2502.11844},
archivePrefix = {arXiv},
}
```