---
title: CodeSensei Environment
emoji: 🧠
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
license: mit
short_description: RL environment for teaching LLMs to debug Python code
---

# CodeSensei

An RL environment built on OpenEnv that trains LLMs to fix buggy Python code. The model gets a broken function, proposes a fix, runs tests, and learns from the results. This is basically the same loop a developer goes through when debugging, but automated with reinforcement learning.

## How it works

1. The environment picks a buggy Python function from the dataset
2. The LLM reads the code + failing test output
3. It proposes a corrected version
4. We run the tests in a sandboxed subprocess
5. A multi-signal reward tells the model what went well (or didn't)
6. Repeat for up to 6 attempts per bug

The reward isn't just pass/fail: it accounts for partial progress, syntax validity, code variety, and whether the model is actually improving or just submitting the same thing over and over.

## Reward breakdown

| Signal | When | Value |
|---|---|---|
| All tests pass | Bug fully fixed | +2.0 |
| More tests pass than before | Making progress | +0.5 |
| No improvement over previous best | Stuck | -0.3 |
| Code crashes at runtime | Regression | -0.5 |
| Syntax error | Invalid Python | -1.0 |
| Duplicate submission | Same fix as before | -0.5 |

## Project layout

```
├── inference.py              # main inference script (OpenEnv submission)
├── openenv.yaml              # environment spec
├── Dockerfile
├── requirements.txt
├── env/
│   ├── client.py             # async client with from_docker_image()
│   ├── models.py             # Action, Observation, State dataclasses
│   ├── data/
│   │   └── bug_dataset.json  # 10 bugs with test suites
│   └── server/
│       ├── app.py            # FastAPI: /reset, /step, /health, /ws
│       ├── environment.py    # core logic (reset/step/state)
│       ├── sandbox.py        # restricted code execution
│       └── test_runner.py    # runs tests against proposed fixes
├── server/
│   └── app.py                # entry point for openenv validate
├── training/
│   └── colab_train.py        # GRPO training (Colab T4)
└── demo/
    └── app.py                # Gradio demo
```

## Running locally

```bash
pip install -r requirements.txt
uvicorn env.server.app:app --host 0.0.0.0 --port 7860
```

Then hit `POST /reset` with `{}` to start an episode, and `POST /step` with your fix to iterate.

## Inference

The inference script uses the OpenAI-compatible client pointed at HuggingFace's inference router. It connects to the environment via `from_docker_image()`, runs the debug loop, and logs everything in the required `[START]`/`[STEP]`/`[END]` format.

```bash
export HF_TOKEN="your_token"
python inference.py
```

Default model is `Qwen/Qwen2.5-Coder-32B-Instruct` (free via HF router). You can swap it by setting `MODEL_NAME`.

## Training

Open `training/colab_train.py` in Google Colab with a T4 runtime. It uses GRPO from HuggingFace TRL with QLoRA (4-bit quantization + LoRA adapters) so the whole thing fits in 15 GB VRAM. Checkpoints get pushed to HF Hub automatically.

## API endpoints

| Method | Path | What it does |
|---|---|---|
| POST | `/reset` | Start a new debugging episode |
| POST | `/step` | Submit a proposed fix |
| GET | `/state?session_id=X` | Get current episode state |
| GET | `/health` | Health check |
| WS | `/ws` | WebSocket interface |

## Tech used

- **Environment:** FastAPI + OpenEnv protocol
- **Training:** TRL GRPO + QLoRA on Qwen2.5-Coder-32B-Instruct
- **Inference:** OpenAI Python client → HuggingFace router (free tier)
- **Deployment:** Docker on HF Spaces
- **Security:** Code execution in sandboxed subprocesses with restricted builtins

## License

MIT
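
## Reward sketch

The reward table under "Reward breakdown" can be read as a small scoring function. The sketch below is one possible reading, not the code in `env/server/environment.py`. The names `score_attempt` and `has_valid_syntax`, the parameters, and the precedence among signals are all assumptions. In particular, it assumes exactly one signal fires per attempt, whereas the real environment may combine several.

```python
from typing import Iterable

# Values copied from the reward table; everything else here is assumed.
REWARD_ALL_PASS = 2.0
REWARD_PROGRESS = 0.5
PENALTY_STUCK = -0.3
PENALTY_RUNTIME_CRASH = -0.5
PENALTY_SYNTAX_ERROR = -1.0
PENALTY_DUPLICATE = -0.5


def has_valid_syntax(code: str) -> bool:
    """Return True if the source compiles; nothing is executed."""
    try:
        compile(code, "<proposed_fix>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def score_attempt(
    code: str,
    previous_submissions: Iterable[str],
    passed: int,
    total: int,
    best_passed: int,
    crashed: bool,
) -> float:
    """Map one attempt to a single reward (hypothetical precedence order)."""
    if not has_valid_syntax(code):
        return PENALTY_SYNTAX_ERROR
    # Duplicate check ignores leading/trailing whitespace (an assumption).
    if code.strip() in {prev.strip() for prev in previous_submissions}:
        return PENALTY_DUPLICATE
    if crashed:
        return PENALTY_RUNTIME_CRASH
    if total > 0 and passed == total:
        return REWARD_ALL_PASS
    if passed > best_passed:
        return REWARD_PROGRESS
    return PENALTY_STUCK


# A new, valid attempt that passes 1 of 3 tests after a best of 0.
print(score_attempt("def add(a, b): return a - b", [], 1, 3, 0, False))  # 0.5
```

Checking syntax and duplicates first means those cases can be scored without running the tests in the sandbox at all. Whether the actual environment orders its checks this way is not stated here.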