---
title: CodeSensei Environment
emoji: 🧠
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
license: mit
short_description: RL environment for teaching LLMs to debug Python code
---
# CodeSensei
An RL environment built on OpenEnv that trains LLMs to fix buggy Python code. The model gets a broken function, proposes a fix, runs tests, and learns from the results. It's basically the same loop a developer goes through when debugging, but automated with reinforcement learning.
## How it works
1. The environment picks a buggy Python function from the dataset
2. The LLM reads the code + failing test output
3. It proposes a corrected version
4. We run the tests in a sandboxed subprocess
5. A multi-signal reward tells the model what went well (or didn't)
6. Repeat for up to 6 attempts per bug

The reward isn't just pass/fail: it accounts for partial progress, syntax validity, code variety, and whether the model is actually improving or just submitting the same thing over and over.
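
The steps above can be sketched as a small loop. This is a toy stand-in, not the code in `env/server/environment.py`: `run_episode`, `toy_run_tests`, and `toy_propose_fix` are hypothetical names, the "LLM" is a string replacement, and the tests run with a plain `exec` instead of the sandbox.

```python
from typing import Callable

MAX_ATTEMPTS = 6  # from the steps above: up to 6 attempts per bug


def run_episode(
    buggy_code: str,
    propose_fix: Callable[[str, str], str],
    run_tests: Callable[[str], tuple[int, int, str]],
) -> dict:
    """Run one debugging episode and return a short summary.

    propose_fix(code, test_output) -> new code        (stands in for the LLM)
    run_tests(code) -> (passed, total, test_output)   (stands in for the sandbox)
    """
    code = buggy_code
    passed, total, output = run_tests(code)
    attempts = 0
    while attempts < MAX_ATTEMPTS and passed < total:
        code = propose_fix(code, output)
        passed, total, output = run_tests(code)
        attempts += 1
    return {"attempts": attempts, "solved": passed == total, "code": code}


def toy_run_tests(code: str) -> tuple[int, int, str]:
    # Hypothetical test suite for an `add` function; NOT sandboxed.
    namespace: dict = {}
    exec(code, namespace)
    add = namespace["add"]
    checks = [add(2, 3) == 5, add(0, 0) == 0]
    passed = sum(checks)
    return passed, len(checks), f"{passed}/{len(checks)} tests passed"


def toy_propose_fix(code: str, test_output: str) -> str:
    # Stand-in for the model: ignores the test output and patches the bug.
    return code.replace("a - b", "a + b")


result = run_episode(
    "def add(a, b):\n    return a - b\n", toy_propose_fix, toy_run_tests
)
```

In the real environment the reward from each step would feed back into training. Here the loop only records whether the bug was fixed and how many attempts it took.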
## Reward breakdown
| Signal | When | Value |
|---|---|---|
| All tests pass | Bug fully fixed | +2.0 |
| More tests pass than before | Making progress | +0.5 |
| No improvement over previous best | Stuck | -0.3 |
| Code crashes at runtime | Regression | -0.5 |
| Syntax error | Invalid Python | -1.0 |
| Duplicate submission | Same fix as before | -0.5 |
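
As a rough illustration, the table can be read as a function from one attempt to a scalar. The values come from the table; everything else is an assumption: that only one signal applies per step, the order in which the checks run, and that duplicates are detected by exact string match. `score_attempt` and its parameters are hypothetical names, and the actual logic in `env/server/environment.py` may combine signals differently.

```python
import ast


def score_attempt(
    code: str,
    passed: int,
    total: int,
    best_passed: int,
    crashed: bool,
    seen: set[str],
) -> float:
    """Map one attempt to a reward using the values from the table.

    ASSUMPTION: signals are mutually exclusive and checked in this order.
    """
    if code.strip() in seen:          # duplicate submission
        return -0.5
    try:
        ast.parse(code)               # syntax check without executing anything
    except SyntaxError:
        return -1.0
    if crashed:                       # runtime crash while running the tests
        return -0.5
    if total > 0 and passed == total: # bug fully fixed
        return 2.0
    if passed > best_passed:          # more tests pass than the previous best
        return 0.5
    return -0.3                       # stuck: no improvement


fixed = "def add(a, b):\n    return a + b\n"
reward = score_attempt(fixed, passed=3, total=3, best_passed=1, crashed=False, seen=set())
```

Checking for duplicates first means a repeated syntax error is scored as a duplicate, not as a syntax error. That choice is arbitrary in this sketch.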
## Project layout
```
├── inference.py              # main inference script (OpenEnv submission)
├── openenv.yaml              # environment spec
├── Dockerfile
├── requirements.txt
├── env/
│   ├── client.py             # async client with from_docker_image()
│   ├── models.py             # Action, Observation, State dataclasses
│   ├── data/
│   │   └── bug_dataset.json  # 10 bugs with test suites
│   └── server/
│       ├── app.py            # FastAPI: /reset, /step, /health, /ws
│       ├── environment.py    # core logic (reset/step/state)
│       ├── sandbox.py        # restricted code execution
│       └── test_runner.py    # runs tests against proposed fixes
├── server/
│   └── app.py                # entry point for openenv validate
├── training/
│   └── colab_train.py        # GRPO training (Colab T4)
└── demo/
    └── app.py                # Gradio demo
```
## Running locally
```bash
pip install -r requirements.txt
uvicorn env.server.app:app --host 0.0.0.0 --port 7860
```
Then send `POST /reset` with an empty JSON body (`{}`) to start an episode, and `POST /step` with your proposed fix to iterate.
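
A minimal sketch of those two requests using only the standard library. The base URL matches the `uvicorn` command above. The `/step` payload shape is an assumption: `proposed_fix` is a hypothetical field name, so check the `Action` dataclass in `env/models.py` for the real schema. The requests are built but not sent, so the snippet runs without a server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # host/port from the uvicorn command


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the environment."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Start an episode with an empty JSON body.
reset_req = build_post("/reset", {})

# ASSUMPTION: "proposed_fix" is a hypothetical field name, not the real schema.
step_req = build_post(
    "/step", {"proposed_fix": "def add(a, b):\n    return a + b\n"}
)

# To actually send a request (requires the server to be running):
# with urllib.request.urlopen(reset_req) as resp:
#     observation = json.load(resp)
```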
## Inference
The inference script uses the OpenAI-compatible client pointed at HuggingFace's inference router. It connects to the environment via `from_docker_image()`, runs the debug loop, and logs everything in the required `[START]`/`[STEP]`/`[END]` format.
```bash
export HF_TOKEN="your_token"
python inference.py
```
The default model is `Qwen/Qwen2.5-Coder-32B-Instruct` (free via the HF router). You can swap it by setting `MODEL_NAME`.
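
One way the script's settings could be resolved is sketched below. `load_settings` is a hypothetical helper, not a function from `inference.py`. The router URL is an assumption about the endpoint the script targets, and the sketch also assumes `MODEL_NAME` is read from the environment in the same way as `HF_TOKEN`.

```python
import os
from typing import Mapping

DEFAULT_MODEL = "Qwen/Qwen2.5-Coder-32B-Instruct"
# ASSUMPTION: OpenAI-compatible endpoint of the Hugging Face inference router.
ROUTER_URL = "https://router.huggingface.co/v1"


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the values an OpenAI-compatible client would need."""
    token = env.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set")
    return {
        "api_key": token,
        "base_url": ROUTER_URL,
        "model": env.get("MODEL_NAME", DEFAULT_MODEL),
    }


settings = load_settings({"HF_TOKEN": "your_token", "MODEL_NAME": "my-org/my-model"})
```

The `api_key` and `base_url` values can be passed to `openai.OpenAI(api_key=..., base_url=...)`, and `model` to the chat completion call.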
## Training
Open `training/colab_train.py` in Google Colab with a T4 runtime. It uses GRPO from HuggingFace TRL with QLoRA (4-bit quantization + LoRA adapters) so the whole thing fits in 15 GB of VRAM. Checkpoints are pushed to the HF Hub automatically.
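
For intuition about what GRPO does with the environment's rewards: it samples several completions for the same prompt and scores each one relative to the rest of its group. The sketch below shows only that normalization step in plain Python. It is not the TRL implementation; `group_advantages` is a hypothetical name, and libraries may differ in details such as the standard deviation estimator or the epsilon.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so fixes that
    beat the group average are reinforced and the rest are pushed down.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled fixes for one bug, scored with values from the reward table:
# solved, progress, stuck, syntax error.
advantages = group_advantages([2.0, 0.5, -0.3, -1.0])
```

If every completion in a group gets the same reward, all advantages are zero and that group contributes no learning signal. This is one reason a graded reward is more useful than plain pass/fail.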
## API endpoints
| Method | Path | What it does |
|---|---|---|
| POST | `/reset` | Start a new debugging episode |
| POST | `/step` | Submit a proposed fix |
| GET | `/state?session_id=X` | Get current episode state |
| GET | `/health` | Health check |
| WS | `/ws` | WebSocket interface |
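
The `/state` endpoint takes the session as a query parameter. A small sketch of building that URL safely is below. `state_url` is a hypothetical helper, and where the `session_id` value comes from (presumably the `/reset` response) is an assumption.

```python
from urllib.parse import urlencode


def state_url(base_url: str, session_id: str) -> str:
    """Build the GET /state URL, URL-encoding the session id."""
    query = urlencode({"session_id": session_id})
    return f"{base_url.rstrip('/')}/state?{query}"


url = state_url("http://localhost:7860", "abc123")
```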
## Tech used
- **Environment:** FastAPI + OpenEnv protocol
- **Training:** TRL GRPO + QLoRA on Qwen2.5-Coder-32B-Instruct
- **Inference:** OpenAI Python client → HuggingFace router (free tier)
- **Deployment:** Docker on HF Spaces
- **Security:** Code execution in sandboxed subprocesses with restricted builtins
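
To make the last bullet concrete, here is a minimal sketch of the general technique: run the submission in a separate Python process with a timeout, and `exec` it against a small whitelist of builtins. This is not the code in `env/server/sandbox.py`. `run_sandboxed`, the whitelist contents, and the 5-second default are all assumptions.

```python
import subprocess
import sys

# Program run in the child process: reads untrusted source from stdin and
# executes it with a whitelist in place of the normal builtins.
_RUNNER = r"""
import sys
SAFE = {"len": len, "range": range, "print": print, "int": int, "str": str,
        "list": list, "dict": dict, "sum": sum, "min": min, "max": max,
        "abs": abs, "enumerate": enumerate, "sorted": sorted,
        "AssertionError": AssertionError, "Exception": Exception}
source = sys.stdin.read()
exec(compile(source, "<submission>", "exec"), {"__builtins__": SAFE})
"""


def run_sandboxed(source: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Run `source` in a child interpreter; return (succeeded, stderr text)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", _RUNNER],  # -I: isolated mode
            input=source,
            capture_output=True,
            text=True,
            timeout=timeout,  # the child is killed if it runs too long
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    return proc.returncode == 0, proc.stderr.strip()


ok, _ = run_sandboxed("def add(a, b):\n    return a + b\nassert add(2, 3) == 5\n")
blocked, err = run_sandboxed("import os\n")
```

Because `__import__` is not in the whitelist, `import os` fails inside the child with an `ImportError`. Restricted builtins alone are not a strong security boundary, which is why the subprocess and timeout (and the Docker container around them) matter.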
## License
MIT