---
title: CodeSensei Environment
emoji: 🧠
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
license: mit
short_description: RL environment for teaching LLMs to debug Python code
---

# CodeSensei

An RL environment built on OpenEnv that trains LLMs to fix buggy Python code. The model gets a broken function, proposes a fix, runs tests, and learns from the results. It is the same loop a developer goes through when debugging, automated with reinforcement learning.

## How it works

1. The environment picks a buggy Python function from the dataset.
2. The LLM reads the code and the failing test output.
3. It proposes a corrected version.
4. We run the tests in a sandboxed subprocess.
5. A multi-signal reward tells the model what went well (or didn't).
6. The loop repeats for up to 6 attempts per bug.
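
The steps above can be sketched as a small loop. This is a minimal illustration, not the code in `env/server/environment.py`: `AttemptResult`, `run_episode`, and the two callables `propose_fix` and `run_tests` are hypothetical names standing in for the model and the test runner.

```python
from dataclasses import dataclass
from typing import Callable, List

MAX_ATTEMPTS = 6  # per-bug attempt limit stated in the list above


@dataclass
class AttemptResult:
    passed: int   # number of tests that passed
    total: int    # number of tests in the suite
    output: str   # captured test output, fed back to the model


def run_episode(
    buggy_code: str,
    propose_fix: Callable[[str, str], str],      # hypothetical: (code, feedback) -> new code
    run_tests: Callable[[str], AttemptResult],   # hypothetical: code -> test results
) -> List[AttemptResult]:
    """Alternate propose -> test until every test passes or attempts run out."""
    history: List[AttemptResult] = []
    code = buggy_code
    feedback = run_tests(code).output  # failing output for the original bug
    for _ in range(MAX_ATTEMPTS):
        code = propose_fix(code, feedback)
        result = run_tests(code)
        history.append(result)
        if result.passed == result.total:
            break  # bug fully fixed, episode ends early
        feedback = result.output
    return history
```

Passing the model and the test runner in as callables keeps the loop easy to exercise with stubs before a real model or sandbox is wired in.
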

The reward isn't just pass/fail: it also accounts for partial progress, syntax validity, code variety, and whether the model is actually improving or just resubmitting the same fix.

## Reward breakdown

| Signal | When | Value |
|---|---|---|
| All tests pass | Bug fully fixed | +2.0 |
| More tests pass than before | Making progress | +0.5 |
| No improvement over previous best | Stuck | -0.3 |
| Code crashes at runtime | Regression | -0.5 |
| Syntax error | Invalid Python | -1.0 |
| Duplicate submission | Same fix as before | -0.5 |
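
The table does not say how the signals combine, so the following is only a sketch under stated assumptions: the signals are treated as mutually exclusive and checked in a fixed order (duplicate, syntax, crash, solved, progress, stuck), and duplicates are detected by comparing whitespace-stripped source. `score_attempt` and its parameters are hypothetical names; the actual environment may add signals together or order them differently.

```python
import ast
from typing import Set

# Values copied from the reward table above.
REWARDS = {
    "solved": 2.0,
    "progress": 0.5,
    "stuck": -0.3,
    "crash": -0.5,
    "syntax": -1.0,
    "duplicate": -0.5,
}


def is_valid_python(code: str) -> bool:
    """Return True if the source parses as Python."""
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        return False
    return True


def score_attempt(
    code: str,
    passed: int,
    total: int,
    best_passed: int,
    crashed: bool,
    seen: Set[str],
) -> float:
    """ASSUMPTION: one signal per attempt, first matching rule wins."""
    key = code.strip()
    if key in seen:
        return REWARDS["duplicate"]
    seen.add(key)  # remember this submission for later duplicate checks
    if not is_valid_python(code):
        return REWARDS["syntax"]
    if crashed:
        return REWARDS["crash"]
    if total > 0 and passed == total:
        return REWARDS["solved"]
    if passed > best_passed:
        return REWARDS["progress"]
    return REWARDS["stuck"]
```

The caller owns `seen` and `best_passed`, so one set and one counter per episode are enough to track duplicates and progress.
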

## Project layout

```
├── inference.py               # main inference script (OpenEnv submission)
├── openenv.yaml               # environment spec
├── Dockerfile
├── requirements.txt
├── env/
│   ├── client.py              # async client with from_docker_image()
│   ├── models.py              # Action, Observation, State dataclasses
│   ├── data/
│   │   └── bug_dataset.json   # 10 bugs with test suites
│   └── server/
│       ├── app.py             # FastAPI: /reset, /step, /health, /ws
│       ├── environment.py     # core logic (reset/step/state)
│       ├── sandbox.py         # restricted code execution
│       └── test_runner.py     # runs tests against proposed fixes
├── server/
│   └── app.py                 # entry point for openenv validate
├── training/
│   └── colab_train.py         # GRPO training (Colab T4)
└── demo/
    └── app.py                 # Gradio demo
```

## Running locally

```bash
pip install -r requirements.txt
uvicorn env.server.app:app --host 0.0.0.0 --port 7860
```

Then send `POST /reset` with `{}` to start an episode, and `POST /step` with your fix to iterate.
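
As a rough sketch of those two calls using only the standard library: the `/reset` body of `{}` comes from the text above, but the `/step` field names (`session_id`, `code`) are assumptions made for illustration. Check `env/models.py` for the real `Action` schema before relying on them. `build_post` and `call` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # host and port from the uvicorn command above


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Start an episode with an empty JSON body, as described above.
reset_request = build_post("/reset", {})

# ASSUMPTION: "session_id" and "code" are placeholder field names,
# not the documented schema.
step_request = build_post(
    "/step",
    {"session_id": "<id returned by /reset>", "code": "def add(a, b):\n    return a + b"},
)

# With the server running: observation = call(reset_request)
```
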

## Inference

The inference script uses the OpenAI-compatible client pointed at HuggingFace's inference router. It connects to the environment via `from_docker_image()`, runs the debug loop, and logs everything in the required `[START]`/`[STEP]`/`[END]` format.

```bash
export HF_TOKEN="your_token"
python inference.py
```

The default model is `Qwen/Qwen2.5-Coder-32B-Instruct` (free via the HF router). You can swap it by setting `MODEL_NAME`.
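
One way a script could read those two environment variables is sketched below. `InferenceSettings` and `load_settings` are hypothetical names, and failing fast when `HF_TOKEN` is missing is an assumption about desirable behavior, not a description of what `inference.py` does.

```python
import os
from typing import Mapping, NamedTuple

DEFAULT_MODEL = "Qwen/Qwen2.5-Coder-32B-Instruct"  # default named above


class InferenceSettings(NamedTuple):
    hf_token: str
    model_name: str


def load_settings(env: Mapping[str, str] = os.environ) -> InferenceSettings:
    """Read HF_TOKEN (required) and MODEL_NAME (optional) from the environment."""
    token = env.get("HF_TOKEN", "")
    if not token:
        # ASSUMPTION: a missing token is treated as a hard error.
        raise RuntimeError("HF_TOKEN is not set; export it before running inference.py")
    return InferenceSettings(
        hf_token=token,
        model_name=env.get("MODEL_NAME") or DEFAULT_MODEL,
    )
```

Taking the environment as a parameter makes the function testable with a plain dict instead of the real process environment.
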

## Training

Open `training/colab_train.py` in Google Colab with a T4 runtime. It uses GRPO from HuggingFace TRL with QLoRA (4-bit quantization + LoRA adapters) so the whole thing fits in 15 GB of VRAM. Checkpoints get pushed to HF Hub automatically.
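
TRL's GRPO trainer accepts custom reward functions that receive a batch of sampled completions and return one float per completion. A minimal sketch of such a function for the syntax-validity signal follows. It assumes completions arrive as plain strings containing only Python source (no Markdown fences or chat formatting); the name `syntax_reward` and the neutral 0.0 score for valid code are illustrative choices, not taken from `colab_train.py`.

```python
import ast
from typing import List


def syntax_reward(completions: List[str], **kwargs) -> List[float]:
    """Return -1.0 for completions that fail to parse, 0.0 otherwise.

    Extra keyword arguments (prompts, dataset columns) are accepted and ignored.
    """
    rewards: List[float] = []
    for completion in completions:
        try:
            ast.parse(completion)
            rewards.append(0.0)   # ASSUMPTION: valid syntax is neutral, not rewarded
        except (SyntaxError, ValueError):
            rewards.append(-1.0)  # matches the syntax-error penalty in the reward table
    return rewards
```

A function of this shape could be passed to the trainer alongside a test-based reward, so that unparseable output is penalized without running any tests.
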

## API endpoints

| Method | Path | What it does |
|---|---|---|
| POST | `/reset` | Start a new debugging episode |
| POST | `/step` | Submit a proposed fix |
| GET | `/state?session_id=X` | Get current episode state |
| GET | `/health` | Health check |
| WS | `/ws` | WebSocket interface |

## Tech used

- **Environment:** FastAPI + OpenEnv protocol
- **Training:** TRL GRPO + QLoRA on Qwen2.5-Coder-32B-Instruct
- **Inference:** OpenAI Python client → HuggingFace router (free tier)
- **Deployment:** Docker on HF Spaces
- **Security:** Code execution in sandboxed subprocesses with restricted builtins
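
To illustrate the subprocess part of that last point, here is a minimal sketch that runs untrusted code in a separate interpreter with a timeout. It is not the project's `sandbox.py`: `run_in_subprocess` is a hypothetical name, restricted builtins are not shown, and a real sandbox would also need OS-level limits on memory, files, and network.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_subprocess(code: str, timeout_s: float = 5.0) -> subprocess.CompletedProcess:
    """Run code in a fresh interpreter; raises subprocess.TimeoutExpired on timeout."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        return subprocess.run(
            # -I: isolated mode, ignores PYTHON* env vars and user site-packages
            [sys.executable, "-I", str(script)],
            cwd=workdir,          # keep relative file access inside the temp dir
            capture_output=True,  # collect stdout/stderr for the test report
            text=True,
            timeout=timeout_s,    # the child is killed if it runs too long
        )
```

A non-zero `returncode` with a traceback in `stderr` signals a runtime crash, while a `TimeoutExpired` exception catches infinite loops.
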

## License

MIT