
DebugZero Implementation Guide

What This Repository Is

DebugZero is an OpenEnv-style debugging environment built around a two-role loop:

  • a proposer takes correct Python code and tries to inject a realistic logical bug
  • a solver receives buggy code and tries to repair it

The repository is designed to support three things at once:

  1. a runnable environment server
  2. a GRPO-style training pipeline
  3. an API-based evaluation harness for smoke-testing the environment before training

The current setup is intentionally small and deterministic so it is easy to verify, train on, and explain in a hackathon setting.

High-Level Environment Workflow

At a high level, one episode works like this:

  1. The environment resets onto one curated seed task from the seed bank.
  2. The proposer sees the clean reference implementation and tries to make one small logical mistake.
  3. The environment executes the proposed code against the task tests.
  4. If the code now fails tests in a valid way, the proposer gets rewarded and the environment hands the buggy code to the solver.
  5. The solver tries to repair the code.
  6. The environment executes the solver output against the same tests.
  7. If the fix passes, the solver gets rewarded and the episode ends.
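
The seven steps above can be sketched as a single episode loop. This is a minimal sketch: `proposer`, `solver`, and `run_tests` are hypothetical callables, not the repo's actual interfaces.

```python
# Minimal sketch of one DebugZero episode. `proposer`, `solver`, and
# `run_tests` are hypothetical stand-ins, not the repo's actual API.
def run_episode(seed, proposer, solver, run_tests):
    buggy = proposer(seed["canonical_solution"])   # step 2: inject a bug
    if run_tests(buggy, seed["test"]):             # steps 3-4: still passes?
        # No valid bug was created; the episode ends without a solver turn.
        return {"proposer_reward": 0.0, "solver_reward": None}
    fixed = solver(buggy)                          # step 5: attempt a repair
    solved = run_tests(fixed, seed["test"])        # step 6: re-run the tests
    return {"proposer_reward": 1.0,                # step 7: role rewards
            "solver_reward": 1.0 if solved else 0.0}
```
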

This gives a clean adversarial/self-improvement loop:

  • the proposer learns to create realistic, test-breaking bugs
  • the solver learns to reverse those bugs
  • the reward functions are role-specific

Project Architecture

The project is split into a few clear layers.

1. Task and Bug Data

  • seed_bank.py defines the curated clean tasks
  • bug_bank.py builds verified buggy variants from those tasks

2. Environment Runtime

  • server/debugZero_environment.py is the main environment state machine
  • server/executor.py runs generated code safely against tests
  • server/bug_injector.py provides AST-based bug mutations
  • server/plausibility.py scores whether a proposer bug looks realistic
  • server/app.py exposes the environment through a FastAPI/OpenEnv server

3. Shared Interface

  • models.py defines the request/response data models shared across client and server
  • client.py provides a small client wrapper for interacting with the environment

4. Training

  • training/dual_role_sampler.py builds role-specific prompts
  • training/rewards.py computes proposer and solver rewards
  • training/grpo_train.py builds the dataset, runs evaluation, and launches GRPO training

5. Evaluation

  • eval/api_baseline.py is the main smoke-test harness

It first runs deterministic controls, then optionally runs a live API probe using a model served through an OpenAI-compatible endpoint.

Data Design

The current data setup is not Faker-style random generation. It is a curated-and-verified pipeline.

Seed Bank

The environment uses a fixed curated set of six Python tasks:

  • HumanEval/0 -> has_close_elements
  • DebugZero/1 -> sum_to_n
  • DebugZero/2 -> middle_slice
  • DebugZero/3 -> is_non_decreasing
  • DebugZero/4 -> count_nonempty
  • DebugZero/5 -> running_max

Each seed contains:

  • seed_id
  • entrypoint
  • prompt
  • canonical_solution
  • test
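
For illustration, a seed entry for the sum_to_n task might look like this. The field names come from the list above; the exact contents here are assumptions.

```python
# Illustrative seed entry mirroring the DebugZero/1 task. Field names are
# from the guide; the prompt, solution, and test bodies are assumed.
seed = {
    "seed_id": "DebugZero/1",
    "entrypoint": "sum_to_n",
    "prompt": 'def sum_to_n(n):\n    """Return the sum of 1..n."""\n',
    "canonical_solution": "    return n * (n + 1) // 2\n",
    "test": "def check(f):\n    assert f(1) == 1\n    assert f(5) == 15\n",
}
```
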

Bug Bank

The bug bank is generated programmatically from the seed bank using AST mutation operators. Only bugs that meet all of the following conditions are kept:

  • the code actually changed
  • the mutated code still parses
  • the code is considered safe enough to execute
  • the mutated code fails the seed's tests
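
The four filter conditions can be sketched as a single predicate. `run_tests` and `is_safe` are hypothetical helpers standing in for the repo's executor and safety screen.

```python
import ast

def is_valid_bug(original, mutated, test, run_tests, is_safe):
    """Keep a mutation only if it satisfies all four filter conditions.
    `run_tests` and `is_safe` are hypothetical helpers, not the repo's API."""
    if mutated == original:          # the code must actually change
        return False
    try:
        ast.parse(mutated)           # the mutated code must still parse
    except SyntaxError:
        return False
    if not is_safe(mutated):         # it must pass the safety screen
        return False
    return not run_tests(mutated, test)  # and it must fail the seed's tests
```
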

The current bug bank split is:

  • 18 training bug samples
  • 6 evaluation holdout bug samples

This gives the training loop a deterministic but nontrivial set of bug/fix tasks.

Reward Design

The reward functions are intentionally role-specific.

Proposer Reward

The proposer is rewarded for creating realistic failing bugs, not just for changing code.

Current behavior:

  • syntax error or unsafe output -> -0.5
  • unchanged code -> 0.0
  • changed code that still passes tests -> 0.0
  • valid failing bug -> 1.0 + plausibility_bonus + learnability_bonus

Where:

  • plausibility_bonus rewards bugs that look more like realistic programmer mistakes
  • learnability_bonus favors bugs that are neither trivial nor impossible for the solver
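
The proposer reward table above can be written as a small function. The outcome labels here are illustrative strings, not the repo's actual enum.

```python
def proposer_reward(outcome, plausibility_bonus=0.0, learnability_bonus=0.0):
    """Proposer reward table from the guide; the bonus terms apply only
    to valid failing bugs. Outcome labels are illustrative assumptions."""
    if outcome in ("syntax_error", "unsafe"):
        return -0.5
    if outcome in ("unchanged", "changed_but_passing"):
        return 0.0
    if outcome == "valid_failing_bug":
        return 1.0 + plausibility_bonus + learnability_bonus
    raise ValueError(f"unknown outcome: {outcome}")
```
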

Solver Reward

The solver reward is intentionally simpler:

  • syntax error -> -0.5
  • failed fix -> 0.0
  • passing fix -> 1.0

This makes the solver side easier to optimize and easier to explain.
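
The solver table is small enough to write as a lookup. As above, the outcome labels are illustrative, not the repo's actual API.

```python
def solver_reward(outcome):
    """Solver reward table from the guide. Outcome labels are
    illustrative strings, not the repo's actual enum."""
    table = {"syntax_error": -0.5, "failed_fix": 0.0, "passing_fix": 1.0}
    return table[outcome]
```
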

Training Workflow

The training path is centered on GRPO-style optimization.

What Gets Trained

The training dataset is mixed-role:

  • solver-heavy by design
  • still includes proposer rows so both roles are represented

The current mixed-role build uses:

  • 18 solver rows
  • 9 proposer rows
  • 27 rows total

Training Loop Shape

training/grpo_train.py does the following:

  1. builds the verified bug bank
  2. constructs the mixed-role dataset
  3. formats prompts through the dual-role sampler
  4. evaluates pre-training behavior on a fixed holdout set
  5. runs GRPO training
  6. evaluates post-training behavior
  7. saves a small results plot

There is also a --dry_run path for quick local smoke testing.

Recommended Models

Best default for this environment:

  • unsloth/Qwen2.5-Coder-3B-Instruct

Reasonable alternatives:

  • 1B to 2B coder models for faster, cheaper runs
  • 7B to 8B coder/instruct models for stronger evaluation when compute allows

In practice:

  • 1B to 3B is the most sensible training range for this repo's size and task complexity
  • 7B to 8B is useful as a stronger API smoke-test model or a higher-end final experiment

Evaluation Workflow

The main evaluation script is eval/api_baseline.py.

It has two phases.

1. Deterministic Controls

Before any live model call, it verifies that:

  • canonical code passes
  • verified buggy code fails
  • obvious syntax errors are detected

This is the fast check that the environment has real signal.
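
The three controls can be sketched as one check. The `execute` helper is a stand-in for server/executor.py, and the status strings are assumptions.

```python
# Sketch of the deterministic controls. `execute(code, test)` is a
# hypothetical stand-in for the repo's executor; the status strings
# ("pass", "fail", "syntax_error") are assumptions.
def run_controls(seed, buggy_sample, execute):
    checks = {
        "canonical_passes": execute(seed["canonical"], seed["test"]) == "pass",
        "buggy_fails": execute(buggy_sample, seed["test"]) == "fail",
        "syntax_detected": execute("def broken(:", seed["test"]) == "syntax_error",
    }
    return all(checks.values()), checks
```
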

2. Live API Probe

If OPENAI_API_KEY and OPENAI_MODEL are present, it then runs a multi-episode proposer/solver loop over the seed bank.

It reports:

  • proposer success rate
  • solver success rate
  • proposer valid bug rate
  • proposer unchanged rate
  • proposer changed-but-passing rate
  • proposer syntax-error rate
  • solver syntax-error rate
  • average proposer reward
  • average solver reward
  • one representative success
  • one representative failure

This is the main pre-training sanity check.
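
Aggregating those rates from per-episode records is straightforward. This sketch covers an illustrative subset of the metrics; the record field names are assumptions.

```python
def summarize(episodes):
    """Aggregate probe metrics from per-episode records (illustrative
    subset; field names are assumptions, not the repo's actual schema)."""
    n = len(episodes)
    return {
        "proposer_success_rate": sum(e["proposer_ok"] for e in episodes) / n,
        "solver_success_rate": sum(e["solver_ok"] for e in episodes) / n,
        "avg_proposer_reward": sum(e["proposer_reward"] for e in episodes) / n,
    }
```
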

End-to-End Runtime Flow

If we trace one full path through the system, it looks like this:

  1. seed_bank.py provides a clean seed task.
  2. server/debugZero_environment.py resets onto that seed.
  3. A proposer model generates code from a proposer prompt built by training/dual_role_sampler.py or eval/api_baseline.py.
  4. server/executor.py runs the candidate code against the seed tests.
  5. training/rewards.py computes proposer reward.
  6. If the proposer created a valid failing bug, the solver gets a repair prompt.
  7. The solver generates repaired code.
  8. server/executor.py runs the repair candidate.
  9. training/rewards.py computes solver reward.
  10. training/grpo_train.py uses these rewards during GRPO training or fixed evaluation.

Tracked Python Files

Below is what each currently tracked .py file does.

Live Runtime And Training Files

__init__.py

Marks the repository root package so imports can work cleanly in package-style execution.

seed_bank.py

Defines the curated seed task bank. Each seed includes the prompt, canonical solution, test harness, and function entrypoint. This is the base dataset for the whole environment.

bug_bank.py

Builds and stores verified buggy samples from the seed bank. It filters mutations down to samples that are syntactically valid, meaningfully changed, safe to run, and test-failing. It also splits them into training and evaluation holdouts.

models.py

Defines the shared Pydantic models used across client/server communication. This includes the environment action and observation structures.

client.py

Provides a small client interface for talking to the environment server using the shared models. It is the clean consumer-side entrypoint for external interaction.

eval/api_baseline.py

Runs the main smoke-test and evaluation workflow. It verifies deterministic controls first, then runs a live proposer/solver API probe across the seed bank when API credentials are available.

server/__init__.py

Marks the server package.

server/app.py

Creates the FastAPI/OpenEnv application and wires the environment into the server layer so external tools can call it.

server/bug_injector.py

Contains the AST mutation logic used to generate plausible logical bugs from clean code. This is the mutation engine behind the verified bug bank.

server/debugZero_environment.py

Implements the main environment state machine. It handles reset, proposer steps, solver steps, execution feedback, seed progression, and observation construction.

This is the heart of the repo.

server/executor.py

Runs model-generated Python code in a constrained execution path with safety checks and test execution. This is the file that turns raw code into pass/fail execution signals.
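
One common way to turn raw code into pass/fail signals is to run the candidate plus its tests in a fresh subprocess with a timeout. This is a simplified stand-in, not the repo's actual executor, which also applies safety checks before execution.

```python
import os
import subprocess
import sys
import tempfile

def run_candidate(code, test, timeout=5.0):
    """Run candidate code plus its test harness in a fresh subprocess with
    a timeout. A simplified stand-in for server/executor.py."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + test + "\n")
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout)
        # A nonzero exit code means a failed assertion or a crash.
        return "pass" if proc.returncode == 0 else "fail"
    except subprocess.TimeoutExpired:
        return "timeout"
    finally:
        os.unlink(path)
```

Isolating execution in a subprocess also contains infinite loops and crashes that an in-process `exec` would not survive.
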

server/plausibility.py

Scores how realistic a proposer bug looks. That plausibility signal is used as part of proposer reward shaping.

training/dual_role_sampler.py

Builds the role-specific prompts used during training. It formats proposer prompts for bug injection and solver prompts for bug repair, including the concise solver mode used for smaller models.
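
A role-specific prompt builder might look like the sketch below. The template wording is an assumption; only the two roles and their inputs come from the guide.

```python
def build_prompt(role, seed, buggy_code=None):
    """Role-specific prompt formatting sketch. The instruction text is
    assumed, not the repo's actual templates."""
    if role == "proposer":
        return ("Introduce one small logical bug into this function. "
                "Return only the modified code.\n\n"
                + seed["canonical_solution"])
    if role == "solver":
        return ("This function fails its tests. Fix the bug and return "
                "only the corrected code.\n\n" + buggy_code)
    raise ValueError(f"unknown role: {role}")
```
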

training/grpo_train.py

The main training entrypoint. It builds datasets, prepares the trainer, evaluates before and after training, and saves training artifacts such as the results plot.

training/rewards.py

Defines the reward logic for proposer and solver outputs. It is the main source of learning signal for GRPO.

Tracked Template / Scaffolding Files

These files are tracked in git, but they are not part of the live DebugZero runtime. They are template assets checked into the repo under .claude/... and appear to be scaffolding for generating OpenEnv environments.

.claude/skills/generate-openenv-env/assets/openenv_env_template/__init__.py

Template package marker for generated environments.

.claude/skills/generate-openenv-env/assets/openenv_env_template/client.py

Template client file used when scaffolding a new OpenEnv environment.

.claude/skills/generate-openenv-env/assets/openenv_env_template/models.py

Template shared models file used by the environment generator.

.claude/skills/generate-openenv-env/assets/openenv_env_template/server/__ENV_NAME___environment.py

Template environment implementation stub for newly generated environments.

.claude/skills/generate-openenv-env/assets/openenv_env_template/server/__init__.py

Template server package marker for generated environments.

.claude/skills/generate-openenv-env/assets/openenv_env_template/server/app.py

Template FastAPI/OpenEnv app file for generated environments.

What This Repo Is Optimized For

This repository is optimized for:

  • explaining the environment clearly
  • proving that the environment has real reward signal
  • running a small but defensible GRPO training loop
  • showing before/after improvement, especially on the solver side

It is not optimized for:

  • huge-scale dataset diversity
  • deployment polish
  • extremely large benchmark coverage

That tradeoff is deliberate. The current design favors clarity, determinism, and hackathon-speed iteration.

In One Sentence

DebugZero is a compact self-play debugging environment where one model learns to inject realistic bugs, another learns to fix them, and the repo is structured so that this loop can be tested, trained, and explained cleanly.