---
title: CodeSensei Environment
emoji: 🧠
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
license: mit
short_description: RL environment for teaching LLMs to debug Python code
---

# CodeSensei

An RL environment built on OpenEnv that trains LLMs to fix buggy Python code. The model gets a broken function, proposes a fix, runs tests, and learns from the results — basically the same loop a developer goes through when debugging, but automated with reinforcement learning.

## How it works

1. The environment picks a buggy Python function from the dataset
2. The LLM reads the code + failing test output
3. It proposes a corrected version
4. We run the tests in a sandboxed subprocess
5. A multi-signal reward tells the model what went well (or didn't)
6. Repeat for up to 6 attempts per bug

The reward isn't just pass/fail — it accounts for partial progress, syntax validity, code variety, and whether the model is actually improving or just submitting the same thing over and over.

## Reward breakdown

| Signal | When | Value |
|---|---|---|
| All tests pass | Bug fully fixed | +2.0 |
| More tests pass than before | Making progress | +0.5 |
| No improvement over previous best | Stuck | -0.3 |
| Code crashes at runtime | Regression | -0.5 |
| Syntax error | Invalid Python | -1.0 |
| Duplicate submission | Same fix as before | -0.5 |

## Project layout

```
├── inference.py             # main inference script (OpenEnv submission)
├── openenv.yaml             # environment spec
├── Dockerfile
├── requirements.txt
├── env/
│   ├── client.py            # async client with from_docker_image()
│   ├── models.py            # Action, Observation, State dataclasses
│   ├── data/
│   │   └── bug_dataset.json # 10 bugs with test suites
│   └── server/
│       ├── app.py           # FastAPI — /reset, /step, /health, /ws
│       ├── environment.py   # core logic (reset/step/state)
│       ├── sandbox.py       # restricted code execution
│       └── test_runner.py   # runs tests against proposed fixes
├── server/
│   └── app.py               # entry point for openenv validate
├── training/
│   └── colab_train.py       # GRPO training (Colab T4)
└── demo/
    └── app.py               # Gradio demo
```

## Running locally

```bash
pip install -r requirements.txt
uvicorn env.server.app:app --host 0.0.0.0 --port 7860
```

Then hit `POST /reset` with `{}` to start an episode, and `POST /step` with your fix to iterate.

## Inference

The inference script uses the OpenAI-compatible client pointed at HuggingFace's inference router. It connects to the environment via `from_docker_image()`, runs the debug loop, and logs everything in the required `[START]`/`[STEP]`/`[END]` format.

```bash
export HF_TOKEN="your_token"
python inference.py
```

Default model is `Qwen/Qwen2.5-Coder-32B-Instruct` (free via HF router). You can swap it by setting `MODEL_NAME`.

## Training

Open `training/colab_train.py` in Google Colab with a T4 runtime. It uses GRPO from HuggingFace TRL with QLoRA (4-bit quantization + LoRA adapters) so the whole thing fits in 15GB VRAM. Checkpoints get pushed to HF Hub automatically.

## API endpoints

| Method | Path | What it does |
|---|---|---|
| POST | `/reset` | Start a new debugging episode |
| POST | `/step` | Submit a proposed fix |
| GET | `/state?session_id=X` | Get current episode state |
| GET | `/health` | Health check |
| WS | `/ws` | WebSocket interface |

## Tech used

- **Environment:** FastAPI + OpenEnv protocol
- **Training:** TRL GRPO + QLoRA on Qwen2.5-Coder-32B-Instruct
- **Inference:** OpenAI Python client → HuggingFace router (free tier)
- **Deployment:** Docker on HF Spaces
- **Security:** Code execution in sandboxed subprocesses with restricted builtins

## License

MIT