Spaces:
Sleeping
Sleeping
File size: 3,912 Bytes
64c9646 f3f5cb0 64c9646 f3f5cb0 c47c81c f3f5cb0 c47c81c f3f5cb0 c47c81c f3f5cb0 c47c81c f3f5cb0 c47c81c f3f5cb0 c47c81c f3f5cb0 c47c81c f3f5cb0 c47c81c f3f5cb0 c47c81c f3f5cb0 c47c81c f3f5cb0 c47c81c f3f5cb0 c47c81c f3f5cb0 c47c81c f3f5cb0 c47c81c f3f5cb0 c47c81c f3f5cb0 c47c81c f3f5cb0 c47c81c f3f5cb0 c47c81c f3f5cb0 c47c81c f3f5cb0 c47c81c f3f5cb0 c47c81c f3f5cb0 c47c81c f3f5cb0 c47c81c f3f5cb0 c47c81c f3f5cb0 c47c81c f3f5cb0 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 | ---
title: CodeSensei Environment
emoji: π§
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
license: mit
short_description: RL environment for teaching LLMs to debug Python code
---
# CodeSensei
An RL environment built on OpenEnv that trains LLMs to fix buggy Python code. The model gets a broken function, proposes a fix, runs tests, and learns from the results β basically the same loop a developer goes through when debugging, but automated with reinforcement learning.
## How it works
1. The environment picks a buggy Python function from the dataset
2. The LLM reads the code + failing test output
3. It proposes a corrected version
4. We run the tests in a sandboxed subprocess
5. A multi-signal reward tells the model what went well (or didn't)
6. Repeat for up to 6 attempts per bug
The reward isn't just pass/fail β it accounts for partial progress, syntax validity, code variety, and whether the model is actually improving or just submitting the same thing over and over.
## Reward breakdown
| Signal | When | Value |
|---|---|---|
| All tests pass | Bug fully fixed | +2.0 |
| More tests pass than before | Making progress | +0.5 |
| No improvement over previous best | Stuck | -0.3 |
| Code crashes at runtime | Regression | -0.5 |
| Syntax error | Invalid Python | -1.0 |
| Duplicate submission | Same fix as before | -0.5 |
## Project layout
```
βββ inference.py # main inference script (OpenEnv submission)
βββ openenv.yaml # environment spec
βββ Dockerfile
βββ requirements.txt
βββ env/
β βββ client.py # async client with from_docker_image()
β βββ models.py # Action, Observation, State dataclasses
β βββ data/
β β βββ bug_dataset.json # 10 bugs with test suites
β βββ server/
β βββ app.py # FastAPI β /reset, /step, /health, /ws
β βββ environment.py # core logic (reset/step/state)
β βββ sandbox.py # restricted code execution
β βββ test_runner.py # runs tests against proposed fixes
βββ server/
β βββ app.py # entry point for openenv validate
βββ training/
β βββ colab_train.py # GRPO training (Colab T4)
βββ demo/
βββ app.py # Gradio demo
```
## Running locally
```bash
pip install -r requirements.txt
uvicorn env.server.app:app --host 0.0.0.0 --port 7860
```
Then hit `POST /reset` with `{}` to start an episode, and `POST /step` with your fix to iterate.
## Inference
The inference script uses the OpenAI-compatible client pointed at HuggingFace's inference router. It connects to the environment via `from_docker_image()`, runs the debug loop, and logs everything in the required `[START]`/`[STEP]`/`[END]` format.
```bash
export HF_TOKEN="your_token"
python inference.py
```
Default model is `Qwen/Qwen2.5-Coder-32B-Instruct` (free via HF router). You can swap it by setting `MODEL_NAME`.
## Training
Open `training/colab_train.py` in Google Colab with a T4 runtime. It uses GRPO from HuggingFace TRL with QLoRA (4-bit quantization + LoRA adapters) so the whole thing fits in 15GB VRAM. Checkpoints get pushed to HF Hub automatically.
## API endpoints
| Method | Path | What it does |
|---|---|---|
| POST | `/reset` | Start a new debugging episode |
| POST | `/step` | Submit a proposed fix |
| GET | `/state?session_id=X` | Get current episode state |
| GET | `/health` | Health check |
| WS | `/ws` | WebSocket interface |
## Tech used
- **Environment:** FastAPI + OpenEnv protocol
- **Training:** TRL GRPO + QLoRA on Qwen2.5-Coder-32B-Instruct
- **Inference:** OpenAI Python client β HuggingFace router (free tier)
- **Deployment:** Docker on HF Spaces
- **Security:** Code execution in sandboxed subprocesses with restricted builtins
## License
MIT
|