DebugZero Implementation Guide
What This Repository Is
DebugZero is an OpenEnv-style debugging environment built around a two-role loop:
- a `proposer` takes correct Python code and tries to inject a realistic logical bug
- a `solver` receives buggy code and tries to repair it
The repository is designed to support three things at once:
- a runnable environment server
- a GRPO-style training pipeline
- an API-based evaluation harness for smoke-testing the environment before training
The current setup is intentionally small and deterministic so it is easy to verify, train on, and explain in a hackathon setting.
High-Level Environment Workflow
At a high level, one episode works like this:
- The environment resets onto one curated seed task from the seed bank.
- The `proposer` sees the clean reference implementation and tries to make one small logical mistake.
- The environment executes the proposed code against the task tests.
- If the code now fails tests in a valid way, the proposer gets rewarded and the environment hands the buggy code to the `solver`.
- The `solver` tries to repair the code.
- The environment executes the solver output against the same tests.
- If the fix passes, the solver gets rewarded and the episode ends.
This gives a clean adversarial/self-improvement loop:
- the proposer learns to create realistic, test-breaking bugs
- the solver learns to reverse those bugs
- the reward functions are role-specific
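The loop above can be sketched end to end with stand-in models and a simplified executor. Everything here (the helper names, the toy seed, the lambda "models") is illustrative and not the repo's actual API:

```python
def run_tests(code: str, test: str) -> bool:
    """Exec a candidate plus the seed's test harness; True means tests pass."""
    ns: dict = {}
    try:
        exec(code, ns)
        exec(test, ns)
        return True
    except Exception:
        return False

def episode(seed, propose, solve):
    buggy = propose(seed["canonical_solution"])      # proposer turn
    if run_tests(buggy, seed["test"]):               # bug must actually break the tests
        return {"proposer_reward": 0.0, "solver_reward": None}
    fixed = solve(buggy)                             # solver turn on the buggy code
    return {
        "proposer_reward": 1.0,
        "solver_reward": 1.0 if run_tests(fixed, seed["test"]) else 0.0,
    }

# Toy seed: the proposer injects an off-by-one, the solver reverses it.
seed = {
    "canonical_solution": "def sum_to_n(n):\n    return sum(range(1, n + 1))\n",
    "test": "assert sum_to_n(3) == 6",
}
result = episode(
    seed,
    propose=lambda code: code.replace("n + 1", "n"),
    solve=lambda code: code.replace("range(1, n)", "range(1, n + 1)"),
)
print(result)
```

A real run replaces the two lambdas with sampled model completions; the episode shape stays the same.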
Project Architecture
The project is split into a few clear layers.
1. Task and Bug Data
- `seed_bank.py` defines the curated clean tasks
- `bug_bank.py` builds verified buggy variants from those tasks
2. Environment Runtime
- `server/debugZero_environment.py` is the main environment state machine
- `server/executor.py` runs generated code safely against tests
- `server/bug_injector.py` provides AST-based bug mutations
- `server/plausibility.py` scores whether a proposer bug looks realistic
- `server/app.py` exposes the environment through a FastAPI/OpenEnv server
3. Shared Interface
- `models.py` defines the request/response data models shared across client and server
- `client.py` provides a small client wrapper for interacting with the environment
4. Training
- `training/dual_role_sampler.py` builds role-specific prompts
- `training/rewards.py` computes proposer and solver rewards
- `training/grpo_train.py` builds the dataset, runs evaluation, and launches GRPO training
5. Evaluation
- `eval/api_baseline.py` is the main smoke-test harness
It first runs deterministic controls, then optionally runs a live API probe using a model served through an OpenAI-compatible endpoint.
Data Design
The current data setup is not Faker-style random generation. It is a curated-and-verified pipeline.
Seed Bank
The environment uses a fixed curated set of six Python tasks:
- HumanEval/0 -> `has_close_elements`
- DebugZero/1 -> `sum_to_n`
- DebugZero/2 -> `middle_slice`
- DebugZero/3 -> `is_non_decreasing`
- DebugZero/4 -> `count_nonempty`
- DebugZero/5 -> `running_max`
Each seed contains:
- `seed_id`
- `entrypoint`
- `prompt`
- `canonical_solution`
- `test`
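The shape of a seed record can be illustrated as a plain dict. The five field names come from the guide; the prompt/solution/test text below is a toy reconstruction, not the repo's actual DebugZero/1 entry:

```python
# Illustrative seed record. Field names match the guide; contents are a
# toy reconstruction, not the real seed bank entry.
seed = {
    "seed_id": "DebugZero/1",
    "entrypoint": "sum_to_n",
    "prompt": 'def sum_to_n(n):\n    """Return 1 + 2 + ... + n."""\n',
    "canonical_solution": "def sum_to_n(n):\n    return sum(range(1, n + 1))\n",
    "test": "assert sum_to_n(5) == 15\nassert sum_to_n(1) == 1\n",
}

# Sanity check: the canonical solution passes its own test harness.
ns: dict = {}
exec(seed["canonical_solution"], ns)
exec(seed["test"], ns)
print("canonical solution passes its tests")
```

Storing the solution and tests as source strings is what lets the executor run arbitrary candidate code against a fixed harness.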
Bug Bank
The bug bank is generated programmatically from the seed bank using AST mutation operators. Only bugs that meet all of the following conditions are kept:
- the code actually changed
- the mutated code still parses
- the code is considered safe enough to execute
- the mutated code fails the seed's tests
The current bug bank split is:
- 18 training bug samples
- 6 evaluation holdout bug samples
This gives the training loop a deterministic but nontrivial set of bug/fix tasks.
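The keep/discard filter described above can be sketched as one predicate. The safety check and test runner here are simplified stand-ins for `server/executor.py`, and the token blocklist is invented for the sketch:

```python
import ast

# Illustrative blocklist; the real safety check is more involved.
UNSAFE_TOKENS = ("import os", "import sys", "open(", "eval(", "exec(")

def fails_tests(code: str, test: str) -> bool:
    """True if executing the code against the seed tests raises anything."""
    ns: dict = {}
    try:
        exec(code, ns)
        exec(test, ns)
        return False          # everything passed -> not a valid bug
    except Exception:
        return True

def keep_mutation(original: str, mutated: str, test: str) -> bool:
    if mutated == original:                              # code actually changed
        return False
    try:
        ast.parse(mutated)                               # still parses
    except SyntaxError:
        return False
    if any(tok in mutated for tok in UNSAFE_TOKENS):     # safe enough to run
        return False
    return fails_tests(mutated, test)                    # must fail the seed's tests

clean = "def sum_to_n(n):\n    return sum(range(1, n + 1))\n"
buggy = "def sum_to_n(n):\n    return sum(range(1, n))\n"
test = "assert sum_to_n(4) == 10"
print(keep_mutation(clean, buggy, test))   # off-by-one: kept
print(keep_mutation(clean, clean, test))   # unchanged: rejected
```

Only mutations that clear all four gates land in the bug bank, which is why every stored bug is guaranteed to be a runnable, test-failing sample.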
Reward Design
The reward functions are intentionally role-specific.
Proposer Reward
The proposer is rewarded for creating realistic failing bugs, not just for changing code.
Current behavior:
- syntax error or unsafe output -> `-0.5`
- unchanged code -> `0.0`
- changed code that still passes tests -> `0.0`
- valid failing bug -> `1.0 + plausibility_bonus + learnability_bonus`
Where:
- `plausibility_bonus` rewards bugs that look more like realistic programmer mistakes
- `learnability_bonus` favors bugs that are neither trivial nor impossible for the solver
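The proposer table above can be written as a small function. The outcome labels are illustrative, and the two bonus terms are passed in precomputed (their actual computation lives in `server/plausibility.py` and the learnability heuristic, not reproduced here):

```python
# Sketch of the proposer reward table; outcome labels are illustrative.
def proposer_reward(outcome: str,
                    plausibility_bonus: float = 0.0,
                    learnability_bonus: float = 0.0) -> float:
    if outcome in ("syntax_error", "unsafe"):
        return -0.5                      # penalize broken or dangerous output
    if outcome in ("unchanged", "changed_but_passing"):
        return 0.0                       # no valid bug, no reward
    if outcome == "valid_failing_bug":
        return 1.0 + plausibility_bonus + learnability_bonus
    raise ValueError(f"unknown outcome: {outcome}")

print(proposer_reward("syntax_error"))                    # -0.5
print(proposer_reward("changed_but_passing"))             # 0.0
print(proposer_reward("valid_failing_bug", 0.25, 0.25))   # 1.5
```

The flat base reward plus shaping bonuses keeps the signal simple: any valid bug clears 1.0, and realism/learnability only add on top.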
Solver Reward
The solver reward is intentionally simpler:
- syntax error -> `-0.5`
- failed fix -> `0.0`
- passing fix -> `1.0`
This makes the solver side easier to optimize and easier to explain.
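The solver table is small enough to state in a few lines (outcome labels again illustrative):

```python
# Sketch of the solver reward table; outcome labels are illustrative.
def solver_reward(outcome: str) -> float:
    if outcome == "syntax_error":
        return -0.5
    return 1.0 if outcome == "passing_fix" else 0.0

print(solver_reward("syntax_error"))   # -0.5
print(solver_reward("failed_fix"))     # 0.0
print(solver_reward("passing_fix"))    # 1.0
```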
Training Workflow
The training path is centered on GRPO-style optimization.
What Gets Trained
The training dataset is mixed-role:
- solver-heavy by design
- still includes proposer rows so both roles are represented
The current mixed-role build uses:
- 18 solver rows
- 9 proposer rows
- 27 rows total
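The mixed-role build amounts to concatenating one solver row per training bug with a batch of proposer rows. The row schemas and helper name below are illustrative; the real builder lives in `training/grpo_train.py`:

```python
# Sketch of the mixed-role dataset build; row schemas are illustrative.
def build_mixed_dataset(train_bugs: list, proposer_seeds: list) -> list:
    rows = [{"role": "solver", "buggy_code": b} for b in train_bugs]      # one row per bug
    rows += [{"role": "proposer", "seed_id": s} for s in proposer_seeds]  # proposer rows
    return rows

# Counts from the guide: 18 solver rows + 9 proposer rows = 27 total.
dataset = build_mixed_dataset(
    train_bugs=[f"bug_{i}" for i in range(18)],
    proposer_seeds=[f"DebugZero/{i % 6}" for i in range(9)],
)
print(len(dataset))  # 27
```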
Training Loop Shape
`training/grpo_train.py` does the following:
- builds the verified bug bank
- constructs the mixed-role dataset
- formats prompts through the dual-role sampler
- evaluates pre-training behavior on a fixed holdout set
- runs GRPO training
- evaluates post-training behavior
- saves a small results plot
There is also a `--dry_run` path for quick local smoke testing.
Recommended Models
Best default for this environment:
`unsloth/Qwen2.5-Coder-3B-Instruct`
Reasonable alternatives:
- 1B to 2B coder models for faster cheap runs
- 7B to 8B coder/instruct models for stronger evaluation if compute allows
In practice:
- 1B to 3B is the most sensible training range for this repo's size and task complexity
- 7B to 8B is useful as a stronger API smoke-test model or a higher-end final experiment
Evaluation Workflow
The main evaluation script is `eval/api_baseline.py`.
It has two phases.
1. Deterministic Controls
Before any live model call, it verifies that:
- canonical code passes
- verified buggy code fails
- obvious syntax errors are detected
This is the fast check that the environment has real signal.
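These three controls can be sketched against a simplified executor. The `run_tests` stand-in and the toy task below are illustrative, not the harness's actual code:

```python
# Sketch of the deterministic controls; the executor is a stand-in
# for server/executor.py.
def run_tests(code: str, test: str):
    """Return (ran, passed); ran is False when the code does not compile."""
    ns: dict = {}
    try:
        compile(code, "<candidate>", "exec")
    except SyntaxError:
        return False, False
    try:
        exec(code, ns)
        exec(test, ns)
        return True, True
    except Exception:
        return True, False

canonical = "def middle_slice(xs):\n    return xs[1:-1]\n"
buggy     = "def middle_slice(xs):\n    return xs[1:]\n"
broken    = "def middle_slice(xs)\n    return xs[1:-1]\n"   # missing colon
test      = "assert middle_slice([1, 2, 3, 4]) == [2, 3]"

assert run_tests(canonical, test) == (True, True)    # canonical code passes
assert run_tests(buggy, test) == (True, False)       # verified buggy code fails
assert run_tests(broken, test) == (False, False)     # syntax error is detected
print("deterministic controls OK")
```

If any of these assertions fails, the environment's pass/fail signal is broken and no model call is worth making.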
2. Live API Probe
If `OPENAI_API_KEY` and `OPENAI_MODEL` are present, it then runs a multi-episode proposer/solver loop over the seed bank.
It reports:
- proposer success rate
- solver success rate
- proposer valid bug rate
- proposer unchanged rate
- proposer changed-but-passing rate
- proposer syntax rate
- solver syntax rate
- average proposer reward
- average solver reward
- one representative success
- one representative failure
This is the main pre-training sanity check.
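Aggregating those rates from per-episode records is straightforward. The field names and numbers below are illustrative, not the harness's actual output schema:

```python
from statistics import mean

# Toy per-episode records; field names are illustrative.
episodes = [
    {"proposer_valid_bug": True,  "solver_fixed": True,  "proposer_reward": 1.5,  "solver_reward": 1.0},
    {"proposer_valid_bug": True,  "solver_fixed": False, "proposer_reward": 1.0,  "solver_reward": 0.0},
    {"proposer_valid_bug": False, "solver_fixed": False, "proposer_reward": 0.0,  "solver_reward": 0.0},
    {"proposer_valid_bug": False, "solver_fixed": False, "proposer_reward": -0.5, "solver_reward": 0.0},
]

report = {
    "proposer_valid_bug_rate": mean(e["proposer_valid_bug"] for e in episodes),
    "solver_success_rate": mean(e["solver_fixed"] for e in episodes),
    "avg_proposer_reward": mean(e["proposer_reward"] for e in episodes),
    "avg_solver_reward": mean(e["solver_reward"] for e in episodes),
}
print(report)
```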
End-to-End Runtime Flow
If we trace one full path through the system, it looks like this:
- `seed_bank.py` provides a clean seed task.
- `server/debugZero_environment.py` resets onto that seed.
- A proposer model generates code from a proposer prompt built by `training/dual_role_sampler.py` or `eval/api_baseline.py`.
- `server/executor.py` runs the candidate code against the seed tests.
- `training/rewards.py` computes proposer reward.
- If the proposer created a valid failing bug, the solver gets a repair prompt.
- The solver generates repaired code.
- `server/executor.py` runs the repair candidate.
- `training/rewards.py` computes solver reward.
- `training/grpo_train.py` uses these rewards during GRPO training or fixed evaluation.
Tracked Python Files
Below is what each currently tracked `.py` file is doing.
Live Runtime And Training Files
`__init__.py`
Marks the repository root package so imports can work cleanly in package-style execution.
seed_bank.py
Defines the curated seed task bank. Each seed includes the prompt, canonical solution, test harness, and function entrypoint. This is the base dataset for the whole environment.
bug_bank.py
Builds and stores verified buggy samples from the seed bank. It filters mutations down to samples that are syntactically valid, meaningfully changed, safe to run, and test-failing. It also splits them into training and evaluation holdouts.
models.py
Defines the shared Pydantic models used across client/server communication. This includes the environment action and observation structures.
client.py
Provides a small client interface for talking to the environment server using the shared models. It is the clean consumer-side entrypoint for external interaction.
eval/api_baseline.py
Runs the main smoke-test and evaluation workflow. It verifies deterministic controls first, then runs a live proposer/solver API probe across the seed bank when API credentials are available.
`server/__init__.py`
Marks the server package.
server/app.py
Creates the FastAPI/OpenEnv application and wires the environment into the server layer so external tools can call it.
server/bug_injector.py
Contains the AST mutation logic used to generate plausible logical bugs from clean code. This is the mutation engine behind the verified bug bank.
server/debugZero_environment.py
Implements the main environment state machine. It handles reset, proposer steps, solver steps, execution feedback, seed progression, and observation construction.
This is the heart of the repo.
server/executor.py
Runs model-generated Python code in a constrained execution path with safety checks and test execution. This is the file that turns raw code into pass/fail execution signals.
server/plausibility.py
Scores how realistic a proposer bug looks. That plausibility signal is used as part of proposer reward shaping.
training/dual_role_sampler.py
Builds the role-specific prompts used during training. It formats proposer prompts for bug injection and solver prompts for bug repair, including the concise solver mode used for smaller models.
training/grpo_train.py
The main training entrypoint. It builds datasets, prepares the trainer, evaluates before and after training, and saves training artifacts such as the results plot.
training/rewards.py
Defines the reward logic for proposer and solver outputs. It is the main source of learning signal for GRPO.
Tracked Template / Scaffolding Files
These files are tracked in git, but they are not part of the live DebugZero runtime. They are template assets checked into the repo under .claude/... and appear to be scaffolding for generating OpenEnv environments.
.claude/skills/generate-openenv-env/assets/openenv_env_template/__init__.py
Template package marker for generated environments.
.claude/skills/generate-openenv-env/assets/openenv_env_template/client.py
Template client file used when scaffolding a new OpenEnv environment.
.claude/skills/generate-openenv-env/assets/openenv_env_template/models.py
Template shared models file used by the environment generator.
.claude/skills/generate-openenv-env/assets/openenv_env_template/server/__ENV_NAME___environment.py
Template environment implementation stub for newly generated environments.
.claude/skills/generate-openenv-env/assets/openenv_env_template/server/__init__.py
Template server package marker for generated environments.
.claude/skills/generate-openenv-env/assets/openenv_env_template/server/app.py
Template FastAPI/OpenEnv app file for generated environments.
What This Repo Is Optimized For
This repository is optimized for:
- explaining the environment clearly
- proving that the environment has real reward signal
- running a small but defensible GRPO training loop
- showing before/after improvement, especially on the solver side
It is not optimized for:
- huge-scale dataset diversity
- deployment polish
- extremely large benchmark coverage
That tradeoff is deliberate. The current design favors clarity, determinism, and hackathon-speed iteration.
In One Sentence
DebugZero is a compact self-play debugging environment where one model learns to inject realistic bugs, another learns to fix them, and the repo is structured so that this loop can be tested, trained, and explained cleanly.