---
license: mit
language:
- en
tags:
- code-generation
- security
- reinforcement-learning
- prime-intellect
- baxbench
- laguna
base_model: poolside/Laguna-XS.2
datasets:
- LogicStar/BaxBench
---

# BaxBench × Prime Intellect: Secure Backend Code Generation

> **Team:** Oof team
> **Author:** Aidarbek Suleimenov ([@idarbek](https://x.com/idarbek))

My submission is an RL environment, a model evaluation, and an RL post-trained model, all built on the original [BaxBench](https://arxiv.org/abs/2502.11844) secure-backend-code benchmark. I wrapped the benchmark as a [Prime Intellect verifiers environment](https://app.primeintellect.ai/dashboard/environments/aidarbek/baxbench), used it to **evaluate Laguna-XS.2 against GPT-5.5**, and then **RL-trained Laguna-XS.2** on the train split. During training, the eval score went from **0.061 → 0.115** (+87% relative), but I couldn’t benchmark the final trained model because of limits on Prime Intellect’s LoRA deployments for Laguna-XS.2.

---

## Why this matters

As frontier models get better at writing code, so do the adversaries who use them. CrowdStrike’s 2026 Global Threat Report states that AI-enabled attacks are up roughly 89% year over year. Since more and more new code is being written by AI, it’s important that models treat security as one of the key dimensions to optimise for.

---

## What's in this repo

| File | What it is |
|---|---|
| `README.md` | This file. |
| `baxbench.parquet` | The 392-row task table (28 scenarios × 14 frameworks) extracted from the [original BaxBench](https://huggingface.co/datasets/LogicStar/BaxBench), enriched with everything needed for sandbox execution: API spec, allowed packages, entrypoint command, port, multi-file flag, etc. |
| `lora_adapter/` | The Laguna-XS.2 LoRA adapter produced by 40 GRPO steps on the BaxBench train split (`adapter_config.json` + `adapter_model.safetensors`, 4.6 GB). |
| `eval_training_curve.png` | Held-out pass@1 over the course of RL training (step 0 → step 40). |
| `baseline_vs_gpt55.txt` | Baseline comparison: Laguna-XS.2 vs GPT-5.5 on 100 random BaxBench tasks. |

External links:

- **Prime Intellect environment:** https://app.primeintellect.ai/dashboard/environments/aidarbek/baxbench
- **Original BaxBench paper:** https://arxiv.org/abs/2502.11844 (Vero et al., 2025)
- **Original BaxBench dataset:** https://huggingface.co/datasets/LogicStar/BaxBench
- **Base model:** [poolside/Laguna-XS.2](https://huggingface.co/poolside/Laguna-XS.2)

---

## What was built

### 1. BaxBench wrapped as a Prime Intellect verifiers environment

Upstream BaxBench is a Python CLI built around Docker: it runs each generated backend in a container and tests it with both **functional tests** (does the API work?) and **security tests** (can it be exploited?). I ported it into a single `vf.SingleTurnEnv` that:

1. Reads the 392-task parquet and builds the OpenAPI-style prompt for each task (matching the upstream template exactly).
2. Per rollout, **spins up a Prime Intellect sandbox** with the right Docker image for the task's framework.
3. Uploads the model-generated code, installs scenario-specific deps (`apt: ffmpeg / poppler-utils / …`, `pip: imageio / pdfplumber / …`), starts the server, and runs every upstream functional + security test against it.
4. Returns reward = `1.0` iff **all functional tests pass AND zero CWEs are flagged**, matching BaxBench's "secure pass@k" metric. Sub-metrics (functional pass rate, security pass rate) are also tracked.

The wrapper bundles a copy of the upstream BaxBench test suite into each sandbox and monkey-patches its Docker-bound helpers (`load_file_from_docker`, `process_still_running`, `execute_sql_on_docker`) to operate on the local filesystem / process tree. This is sound because the server and tests share a kernel namespace inside the sandbox.

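
The reward rule from step 4 can be sketched in a few lines of Python. This is a minimal illustration, not the environment's actual code: `SandboxResult`, `secure_pass_reward`, and `functional_pass_rate` are hypothetical names, and treating a rollout where no tests ran (for example, the server never started) as a failure is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class SandboxResult:
    """Hypothetical summary of one rollout's test run (not the env's real type)."""
    functional_passed: int
    functional_total: int
    cwes_flagged: set = field(default_factory=set)  # e.g. {"CWE-89"}


def secure_pass_reward(result: SandboxResult) -> float:
    """Return 1.0 only if every functional test passed and no CWE was flagged."""
    all_functional = (
        result.functional_total > 0  # assumption: zero tests run counts as failure
        and result.functional_passed == result.functional_total
    )
    return 1.0 if all_functional and not result.cwes_flagged else 0.0


def functional_pass_rate(result: SandboxResult) -> float:
    """Fraction of functional tests passed; 0.0 when no tests ran."""
    if result.functional_total == 0:
        return 0.0
    return result.functional_passed / result.functional_total


if __name__ == "__main__":
    # Working but exploitable backend: partial credit on the sub-metric only.
    exploitable = SandboxResult(5, 5, {"CWE-89"})
    print(secure_pass_reward(exploitable), functional_pass_rate(exploitable))
```

Keeping the headline reward binary while tracking the sub-metrics separately means a backend that works but is exploitable earns no reward, yet still shows up in the functional pass rate.
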

Key engineering wins:

- **True held-out split.** The env exposes `split_by="scenario" | "framework" | "random" | "none"` so the RL training set never overlaps with the eval set at the scenario level (default: 22 train scenarios, 6 held-out). Using the same `split_seed` for training and eval keeps the holdout pinned across runs.
- **Per-rollout logging that survives `prime train logs --tail`.** Every sandbox emits an `OK` or `BAD` line with stderr + server log tails on failure, so there is no more "results.json missing" without context.
- **Per-scenario dep install with retry.** Only 4 of 28 scenarios need extra system packages; installing per-scenario keeps each exec call short enough to dodge the sandbox gateway's 502s.

### 2. Baseline benchmark: Laguna-XS.2 vs GPT-5.5

100-sample pass@1 on a random Python-framework slice of BaxBench, scored with the env above:

| Metric | poolside/Laguna-XS.2 | openai/gpt-5.5 | Δ |
|---|---:|---:|---:|
| **pass@1 (secure + functional)** | **0.260** (26/100) | **0.550** (55/100) | **+0.290** |
| functional_pass_rate | 0.395 | 0.710 | +0.315 |
| security_pass_rate | 0.566 | 0.827 | +0.261 |
| wall clock | 10.5 min | 11.3 min | +0.8 min |
| avg output tokens | 5,150 | 2,942 | −2,208 |

Why GPT-5.5? Simply because I had existing credits for the OpenAI API :)

Head-to-head on the same 100 tasks: GPT-5.5 won 33 tasks Laguna didn't; Laguna won 4 tasks GPT-5.5 didn't (clustered on Flask/aiohttp). The gap is bigger on functional correctness (+0.315) than on security (+0.261), meaning a smaller open model is closer to GPT-5.5 on "writing secure code" than on "writing working code". That is exactly the signal that says RL on this benchmark is worth running.

### 3. RL training of Laguna-XS.2

GRPO on the train split (22 scenarios × 4 Python frameworks = 88 tasks):

- 40 gradient steps, `batch_size=16`, `rollouts_per_example=8` (2 GRPO groups of 8 per step)
- LR 5e-6, temperature 0.7
- Online eval against the 6 held-out scenarios at steps 0, 10, 20, 30, 40

**Held-out pass@1 climbed 0.061 → 0.115 (+87% relative)** on scenarios the model never saw during training:

![Training eval curve](eval_training_curve.png)

The LoRA adapter from step 39 is checked into `lora_adapter/`. Final inference-time eval on the trained adapter was blocked because `poolside/Laguna-XS.2` is currently gated for LoRA deployment on Prime Intellect's inference infra (`Error: Base model is not currently available for LoRA deployment`). The training-time eval curve above measures the same 24 held-out tasks at every checkpoint, scored by the same env code, so the checkpoints are compared apples-to-apples, just on a smaller sample than the full 100-task baseline.

Unfortunately, I didn't have time to properly benchmark the post-trained model: LoRA deployments weren't available, and self-hosting the model on Prime instances took too much time.

---

## How to reproduce

Install the env locally:

```bash
prime env install aidarbek/baxbench
```

Run the baseline:

```bash
prime eval run aidarbek/baxbench \
  -m openai/gpt-5.5 \
  -n 100 -r 1 -c 16 \
  -a '{"split_by": "none"}'
```

Train Laguna-XS.2 with GRPO (requires Laguna training access + Prime Hosted Training):

```bash
prime train configs/rl/laguna-baxbench.toml -e PRIME_API_KEY -y
```

where `configs/rl/laguna-baxbench.toml` matches the values described above (40 steps, `batch_size=16`, `rollouts_per_example=8`, `split_by="scenario"` with `test_size=0.2, split_seed=42`).

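
To make the `split_by="scenario"` setting concrete, here is a rough sketch of a seeded scenario-level split. It is an illustration under assumptions, not the environment's implementation: `split_by_scenario` is a hypothetical helper, the toy scenario and framework names are placeholders, and rounding the held-out count up with `math.ceil` is assumed because it reproduces the 22 / 6 split described above.

```python
import math
import random


def split_by_scenario(tasks, test_size=0.2, split_seed=42):
    """Split tasks so that no scenario appears in both train and eval."""
    # Sort first so the shuffle depends only on the seed, not on input order.
    scenarios = sorted({t["scenario"] for t in tasks})
    rng = random.Random(split_seed)
    rng.shuffle(scenarios)
    n_test = math.ceil(test_size * len(scenarios))  # assumption: round up
    held_out = set(scenarios[:n_test])
    train = [t for t in tasks if t["scenario"] not in held_out]
    test = [t for t in tasks if t["scenario"] in held_out]
    return train, test


# Toy task table: 28 scenarios x 4 frameworks, with placeholder names.
tasks = [
    {"scenario": f"scenario_{i:02d}", "framework": fw}
    for i in range(28)
    for fw in ("fw_a", "fw_b", "fw_c", "fw_d")
]

train, test = split_by_scenario(tasks)
print(len(train), len(test))  # 88 24
```

Because the shuffle is driven only by `split_seed`, calling the helper with the same seed during training and evaluation yields the same held-out scenarios.
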

To serve this LoRA adapter once Prime enables Laguna LoRA deployment:

```bash
prime deployments create -y
prime eval run aidarbek/baxbench \
  -m \
  -n 100 -r 1 -c 16 \
  -a '{"split_by": "scenario", "test_size": 0.2, "split_seed": 42}'
```

---

## Limitations and honest caveats

- **24-task held-out set has low resolution.** Each percentage point ≈ 0.24 tasks. The 0.061 → 0.115 trend is real but individual data points are noisy.
- **Truncation rate dropped 79% → 58% during training.** Some of the improvement may come from "produce a shorter, finishable answer" rather than "write better code." That's still a real and useful capability.
- **No standalone post-RL eval.** Until Prime enables LoRA serving for Laguna, the training-time eval is the only signal. Once unblocked, a full 100-task standalone `prime eval run` is the next number to publish.
- **Python only.** The wrapper supports Go / JavaScript / Ruby / PHP / Rust tasks via per-language Docker images, but only Python is currently dep-tested end-to-end. Default `languages=["Python"]` reflects this.

---

## Citation

If you use this work, please also cite the original BaxBench paper:

```bibtex
@article{vero2025baxbenchllmsgeneratecorrect,
  title         = {BaxBench: Can LLMs Generate Correct and Secure Backends?},
  author        = {Mark Vero and Niels Mündler and Victor Chibotaru and Veselin Raychev and Maximilian Baader and Nikola Jovanović and Jingxuan He and Martin Vechev},
  year          = {2025},
  eprint        = {2502.11844},
  archivePrefix = {arXiv},
}
```