Where agents learn to break APIs.

An OpenEnv reinforcement learning environment for API security testing. A live REST API with thirteen planted vulnerabilities, a verifiable reward function mapped to the OWASP API Security Top 10, and an episode that ends with a structured bug report.

01 The premise

What this is.

A Gradio playground for an OpenEnv RL environment that trains AI agents to test REST APIs the way a security engineer would. Behind the UI is a Task Management API with 13 deliberately planted bugs covering 6 categories from the OWASP API Security Top 10.

The agent connects, sends HTTP requests, earns rewards for finding bugs and covering endpoints, and generates a bug bounty report when the episode ends.

Real API. Real bugs. Real OWASP categories — verifiable end to end.

02 The gap

Why bother.

Every team ships APIs, and every API has bugs. The usual tools (Postman, Schemathesis, OWASP ZAP) either need humans writing tests by hand or fall back to brute-force fuzzing.

This environment is the benchmark for a third option: an agent that plans and runs the tests itself.

The agent doesn't get a written test plan. It reads the API spec, plans a campaign, runs it, and reports what broke. The reward function is verifiable — no LLM judge, no soft heuristics — and every signal maps to a real OWASP category, so episodes can be scored deterministically.

03 How reward works

Five signals, one episode.

The reward function is verifiable: no LLM judge, no soft heuristics. Each step's reward accumulates from five components, and when the episode ends the task grader adds a terminal score in [0, 1].

Bug discovery

+0.10 / +0.15 / +0.25

Finding a planted bug, scaled by severity. Easy bugs (status codes, missing fields) are worth 0.10. Medium (validation, auth) gets 0.15. Hard (BOLA, injection, broken auth chains) gets 0.25.

Coverage

+0.20

Hitting endpoints, methods, and status codes the agent hasn't tried yet.

Validity

+0.18

Well-formed requests, plus chaining IDs from previous responses.

Exploration

+0.05

Trying action patterns the agent hasn't used before.

Penalty

−0.08

Sending an exact request the agent has already sent: anti-spam, anti-loop.

When the episode ends, the task grader adds a terminal score based on its own criteria — CRUD coverage, dependency chaining, security probing, that kind of thing.
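As a rough illustration of how the five signals could combine on a single step, here is a minimal sketch. The magnitudes come from the list above; everything else is an assumption for illustration rather than the environment's actual code. That covers the `RewardTracker` class, its fields, treating each value as a flat per-step amount, and letting a repeated request earn only the penalty.

```python
from dataclasses import dataclass, field

# Magnitudes come from the list above; treating each as a flat per-step
# amount, and the way they combine, are assumptions made for this sketch.
BUG_REWARD = {"easy": 0.10, "medium": 0.15, "hard": 0.25}
COVERAGE_BONUS = 0.20
VALIDITY_BONUS = 0.18
EXPLORATION_BONUS = 0.05
REPEAT_PENALTY = 0.08


@dataclass
class RewardTracker:
    """Hypothetical per-episode bookkeeping for the five signals."""
    seen_requests: set = field(default_factory=set)  # exact requests already sent
    seen_coverage: set = field(default_factory=set)  # (method, endpoint, status)
    seen_patterns: set = field(default_factory=set)  # (method, endpoint)
    found_bugs: set = field(default_factory=set)     # bug ids already credited

    def step_reward(self, method, endpoint, body, status, well_formed, bug=None):
        """Score one request; `bug` is a (bug_id, severity) pair or None."""
        request = (method, endpoint, body)
        if request in self.seen_requests:
            # Assumption: an exact repeat earns only the penalty.
            return -REPEAT_PENALTY
        self.seen_requests.add(request)

        reward = 0.0
        if bug is not None and bug[0] not in self.found_bugs:
            self.found_bugs.add(bug[0])
            reward += BUG_REWARD[bug[1]]
        if (method, endpoint, status) not in self.seen_coverage:
            self.seen_coverage.add((method, endpoint, status))
            reward += COVERAGE_BONUS
        if well_formed:
            reward += VALIDITY_BONUS
        if (method, endpoint) not in self.seen_patterns:
            self.seen_patterns.add((method, endpoint))
            reward += EXPLORATION_BONUS
        return reward
```

Because every branch depends only on sets of what has already been seen, the same request sequence always yields the same reward sequence, which is the property a verifiable reward needs.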

04 How to use this

Five steps to a verdict.

Pick a task

Three difficulty tiers in the dropdown on the left, from a CRUD smoke-test to a full BOLA + injection chain.

basic_validation · edge_cases · security_workflows

Reset the environment

Every reset spins up a fresh database with new users, new tasks, and randomized ownership, so the agent can't memorize answers between episodes.
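A seeded reset of this kind might look like the sketch below. The table layout, column names, and the `reset_database` and `ownership` helpers are hypothetical; the only idea taken from the text is that a reset builds a fresh database whose users, tasks, and ownership are randomized.

```python
import random
import sqlite3


def reset_database(seed, n_users=3, n_tasks=6):
    """Build a fresh in-memory database whose contents depend only on `seed`."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
    conn.execute(
        "CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT, owner_id INTEGER)"
    )
    for user_id in range(1, n_users + 1):
        username = f"user_{rng.randrange(10_000):04d}"
        conn.execute("INSERT INTO users VALUES (?, ?)", (user_id, username))
    for task_id in range(1, n_tasks + 1):
        title = f"task_{rng.randrange(10_000):04d}"
        owner_id = rng.randint(1, n_users)  # randomized ownership
        conn.execute("INSERT INTO tasks VALUES (?, ?, ?)", (task_id, title, owner_id))
    conn.commit()
    return conn


def ownership(conn):
    """Return (task_id, owner_id) pairs, ordered by task id."""
    return conn.execute("SELECT id, owner_id FROM tasks ORDER BY id").fetchall()
```

Passing the same seed reproduces an episode exactly for debugging, while a new seed per reset changes which user owns which task.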

Run a baseline

The Run Baseline Agent tab is open by default. Pick a strategy and watch it test the API step by step.

random · sequential · smart

Or test manually

Switch to Manual Testing. Quick Actions give one-click bug hunts, or craft your own request from scratch — method, endpoint, headers, body, expected status.

Watch the panel

Discovered Bugs and the Activity Log update live as the agent works. When the episode ends, expand the Bug Report (OWASP) drawer for the full structured findings, severities, and fix recommendations.
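To make "structured findings" concrete, here is one possible shape for such a report. This is a sketch under assumptions: the `Finding` fields, the `build_report` helper, the bug ids, and the recommendation text are all invented for illustration. Only the two category names are real, taken from the OWASP API Security Top 10 (2023).

```python
from collections import defaultdict
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Finding:
    """One discovered bug (hypothetical field names)."""
    bug_id: str
    owasp_category: str
    severity: str          # "easy", "medium", or "hard", as in the reward tiers
    recommendation: str


SEVERITY_RANK = {"hard": 0, "medium": 1, "easy": 2}


def build_report(findings):
    """Group findings by OWASP category, most severe first."""
    by_category = defaultdict(list)
    for finding in sorted(findings, key=lambda f: SEVERITY_RANK[f.severity]):
        by_category[finding.owasp_category].append(asdict(finding))
    return {"total_findings": len(findings), "by_category": dict(by_category)}


# Example findings; ids and recommendations are made up for this sketch.
report = build_report([
    Finding("example-1", "API2:2023 Broken Authentication", "medium",
            "Reject expired or malformed tokens."),
    Finding("example-2", "API1:2023 Broken Object Level Authorization", "hard",
            "Check task ownership on every read and write."),
])
```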

05 Under the hood

Three layers.

Self-contained, reproducible, and runs on a free-tier HuggingFace Space.

L1 · ENVIRONMENT

FastAPI + SQLite

A buggy Task Management API wrapped in OpenEnv's step() / reset() / state() contract. Runs in-process or as a Docker image, with seed-randomized data on every reset so episodes can't be memorized.
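The step() / reset() / state() contract can be pictured with a toy stand-in. The sketch below does not use the OpenEnv package. The three method names mirror the contract named above, but the `ToyApiEnv` class, its return shapes, the placeholder scoring, and the `/tasks` endpoint are assumptions for illustration.

```python
class ToyApiEnv:
    """Toy environment exposing reset(), step(), and state()."""

    def __init__(self, max_steps=3):
        self.max_steps = max_steps
        self.steps = 0
        self.total_reward = 0.0

    def reset(self):
        """Start a new episode and return the first observation."""
        self.steps = 0
        self.total_reward = 0.0
        return {"message": "environment ready", "reward": 0.0, "done": False}

    def step(self, action):
        """Apply one action (a request description) and return the result."""
        self.steps += 1
        # Placeholder scoring: reward requests to a hypothetical /tasks endpoint.
        reward = 0.1 if action.get("endpoint") == "/tasks" else 0.0
        self.total_reward += reward
        done = self.steps >= self.max_steps
        return {"message": "request handled", "reward": reward, "done": done}

    def state(self):
        """Report episode-level bookkeeping without advancing the episode."""
        return {"steps": self.steps, "total_reward": self.total_reward}


# A minimal episode loop against the toy environment.
env = ToyApiEnv(max_steps=3)
observation = env.reset()
while not observation["done"]:
    observation = env.step({"method": "GET", "endpoint": "/tasks"})
```

The point of the split is that `step()` advances the episode while `state()` only reads it, so a UI or grader can inspect progress without affecting the run.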

L2 · INFERENCE

OpenAI-compatible client

inference.py talks to any HuggingFace-hosted model through the OpenAI SDK and structured JSON output. Plug in any model that follows the protocol — no environment-specific glue.
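Structured JSON output only helps if the reply is validated before it becomes a request. The sketch below shows one way that could work; it is not taken from inference.py. The `parse_action` helper, the JSON key names (modeled on the manual-testing fields: method, endpoint, headers, body, expected status), and the allowed-method set are assumptions.

```python
import json

# Assumption: the methods a REST-testing agent would be allowed to send.
ALLOWED_METHODS = {"GET", "POST", "PUT", "PATCH", "DELETE"}


def parse_action(model_output):
    """Turn a model's JSON reply into a request dict, or None if unusable."""
    try:
        data = json.loads(model_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None

    method = str(data.get("method", "")).upper()
    endpoint = data.get("endpoint")
    if method not in ALLOWED_METHODS:
        return None
    if not isinstance(endpoint, str) or not endpoint.startswith("/"):
        return None

    return {
        "method": method,
        "endpoint": endpoint,
        "headers": data.get("headers") or {},
        "body": data.get("body"),
        "expected_status": data.get("expected_status"),
    }
```

Returning `None` for malformed replies lets the calling loop decide whether to retry, skip the step, or fall back to a default request.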

L3 · DEPLOY

Docker + HF Spaces

Containerized on top of the official openenv-base image and deployed as a public HuggingFace Space, so judges can hit it with a single HTTP call.

06 The artifacts

Everything reproducible.

Source code, deployed environment, framework. Open and inspectable.

https://github.com/Mayankpratapsingh022/API-Testing-RL
https://meta-pytorch.org/OpenEnv/