---
title: CodeSensei Environment
emoji: 🧠
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
license: mit
short_description: RL environment for teaching LLMs to debug Python code
---

# CodeSensei

An RL environment built on OpenEnv that trains LLMs to fix buggy Python code. The model gets a broken function and proposes a fix, the environment runs the tests, and the model learns from the results. It's the same loop a developer goes through when debugging, automated with reinforcement learning.

## How it works

1. The environment picks a buggy Python function from the dataset
2. The LLM reads the code + failing test output
3. It proposes a corrected version
4. We run the tests in a sandboxed subprocess
5. A multi-signal reward tells the model what went well (or didn't)
6. Repeat for up to 6 attempts per bug

The reward isn't just pass/fail — it accounts for partial progress, syntax validity, code variety, and whether the model is actually improving or just submitting the same thing over and over.
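
The loop above can be sketched roughly as follows. This is a minimal illustration, not the code in `environment.py`: `run_episode`, `toy_tests`, and `toy_model` are hypothetical names, the toy callables stand in for the LLM and the sandboxed test runner, and stopping early once all tests pass is an assumption.

```python
from typing import Callable, List, Tuple

MAX_ATTEMPTS = 6  # from the steps above: up to 6 attempts per bug


def run_episode(
    buggy_code: str,
    propose_fix: Callable[[str, str], str],
    run_tests: Callable[[str], Tuple[int, int, str]],
    max_attempts: int = MAX_ATTEMPTS,
) -> List[Tuple[int, int]]:
    """Run one debugging episode and return (passed, total) for each attempt.

    propose_fix(code, feedback) -> new code      (stands in for the LLM)
    run_tests(code) -> (passed, total, output)   (stands in for the sandbox)
    """
    history = []
    code = buggy_code
    _, _, feedback = run_tests(code)  # initial failing test output
    for _ in range(max_attempts):
        code = propose_fix(code, feedback)
        passed, total, feedback = run_tests(code)
        history.append((passed, total))
        if passed == total:  # assumed: the episode ends once the bug is fixed
            break
    return history


# Toy stand-ins so the sketch runs without a model or a sandbox.
def toy_tests(code: str) -> Tuple[int, int, str]:
    namespace = {}
    exec(code, namespace)  # acceptable for a toy; the real env sandboxes this
    add = namespace["add"]
    results = [add(1, 2) == 3, add(0, 0) == 0]
    return sum(results), len(results), f"{sum(results)}/{len(results)} passed"


def toy_model(code: str, feedback: str) -> str:
    # Pretend the model spots the wrong operator.
    return code.replace("a - b", "a + b")


history = run_episode("def add(a, b):\n    return a - b\n", toy_model, toy_tests)
print(history)
```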

## Reward breakdown

| Signal | When | Value |
|---|---|---|
| All tests pass | Bug fully fixed | +2.0 |
| More tests pass than before | Making progress | +0.5 |
| No improvement over previous best | Stuck | -0.3 |
| Code crashes at runtime | Regression | -0.5 |
| Syntax error | Invalid Python | -1.0 |
| Duplicate submission | Same fix as before | -0.5 |
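
One way to read the table is as a single dominant signal per attempt. The sketch below takes that reading; the function name `compute_reward`, its parameters, and the order in which the checks run are assumptions, and the actual environment may combine signals differently. Only the numeric values come from the table.

```python
import ast
from typing import Set


def compute_reward(
    code: str,
    passed: int,
    total: int,
    best_passed: int,
    crashed: bool,
    seen: Set[str],
) -> float:
    """Return one reward value for a submission (precedence is assumed)."""
    try:
        ast.parse(code)  # syntax check without executing anything
    except SyntaxError:
        return -1.0      # invalid Python
    if code.strip() in seen:
        return -0.5      # same fix as a previous attempt
    if crashed:
        return -0.5      # code raised at runtime
    if total > 0 and passed == total:
        return 2.0       # bug fully fixed
    if passed > best_passed:
        return 0.5       # more tests pass than the previous best
    return -0.3          # no improvement


fix = "def add(a, b):\n    return a + b\n"
print(compute_reward(fix, passed=3, total=3, best_passed=1, crashed=False, seen=set()))
```

Checking for duplicates before running anything lets a caller skip the test run entirely for repeated submissions, which is one reason to put that check early.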

## Project layout

```
├── inference.py             # main inference script (OpenEnv submission)
├── openenv.yaml             # environment spec
├── Dockerfile
├── requirements.txt
├── env/
│   ├── client.py            # async client with from_docker_image()
│   ├── models.py            # Action, Observation, State dataclasses
│   ├── data/
│   │   └── bug_dataset.json # 10 bugs with test suites
│   └── server/
│       ├── app.py           # FastAPI: /reset, /step, /state, /health, /ws
│       ├── environment.py   # core logic (reset/step/state)
│       ├── sandbox.py       # restricted code execution
│       └── test_runner.py   # runs tests against proposed fixes
├── server/
│   └── app.py               # entry point for openenv validate
├── training/
│   └── colab_train.py       # GRPO training (Colab T4)
└── demo/
    └── app.py               # Gradio demo
```

## Running locally

```bash
pip install -r requirements.txt
uvicorn env.server.app:app --host 0.0.0.0 --port 7860
```

Then hit `POST /reset` with `{}` to start an episode, and `POST /step` with your fix to iterate.
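
A minimal sketch of those two calls using only the standard library. The JSON field names `session_id` and `proposed_fix` are assumptions for illustration (check `env/models.py` for the real schema); the port and paths are the ones listed in this README.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # port from the uvicorn command above


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


reset_request = build_post("/reset", {})
step_request = build_post(
    "/step",
    {
        "session_id": "<id returned by /reset>",  # hypothetical field name
        "proposed_fix": "def add(a, b):\n    return a + b\n",  # hypothetical field name
    },
)

# With the server running:
# observation = send(reset_request)
# result = send(step_request)
```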

## Inference

The inference script uses the OpenAI-compatible client pointed at HuggingFace's inference router. It connects to the environment via `from_docker_image()`, runs the debug loop, and logs everything in the required `[START]`/`[STEP]`/`[END]` format.

```bash
export HF_TOKEN="your_token"
python inference.py
```

Default model is `Qwen/Qwen2.5-Coder-32B-Instruct` (free via HF router). You can swap it by setting `MODEL_NAME`.
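
As a rough sketch of how those settings might be resolved: the helper `resolve_settings` is hypothetical, reading `MODEL_NAME` from the environment is an assumption, and so is the router URL. The resulting `api_key` and `base_url` are the kind of values an OpenAI-compatible client takes in its constructor.

```python
import os
from typing import Mapping, Optional

DEFAULT_MODEL = "Qwen/Qwen2.5-Coder-32B-Instruct"   # default named in this README
ROUTER_URL = "https://router.huggingface.co/v1"     # assumed router endpoint


def resolve_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect client settings from environment variables."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set")
    return {
        "api_key": token,
        "base_url": ROUTER_URL,
        "model": env.get("MODEL_NAME", DEFAULT_MODEL),
    }


# Passing a plain dict keeps the example independent of the real environment.
settings = resolve_settings({"HF_TOKEN": "your_token"})
print(settings["model"])
```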

## Training

Open `training/colab_train.py` in Google Colab with a T4 runtime. It uses GRPO from HuggingFace TRL with QLoRA (4-bit quantization + LoRA adapters) so the whole thing fits in 15 GB of VRAM. Checkpoints get pushed to HF Hub automatically.

## API endpoints

| Method | Path | What it does |
|---|---|---|
| POST | `/reset` | Start a new debugging episode |
| POST | `/step` | Submit a proposed fix |
| GET | `/state?session_id=X` | Get current episode state |
| GET | `/health` | Health check |
| WS | `/ws` | WebSocket interface |

## Tech used

- **Environment:** FastAPI + OpenEnv protocol
- **Training:** TRL GRPO + QLoRA on Qwen2.5-Coder-32B-Instruct
- **Inference:** OpenAI Python client → HuggingFace router (free tier)
- **Deployment:** Docker on HF Spaces
- **Security:** Code execution in sandboxed subprocesses with restricted builtins
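
To give a feel for the security bullet, here is a minimal sketch of running untrusted code in a child process with a timeout and an allow-list of builtins. It is not the code in `sandbox.py`: `run_sandboxed`, `RUNNER`, and the allow-list are made up for illustration, and restricting builtins this way is a speed bump rather than a real security boundary.

```python
import subprocess
import sys

# Runner executed in the child process: reads untrusted code from stdin and
# executes it with only an allow-list of builtins available.
RUNNER = r"""
import builtins, sys
ALLOWED = ("abs", "all", "any", "bool", "dict", "enumerate", "float", "int",
           "len", "list", "max", "min", "print", "range", "set", "sorted",
           "str", "sum", "tuple", "zip")
safe = {name: getattr(builtins, name) for name in ALLOWED}
exec(sys.stdin.read(), {"__builtins__": safe})
"""


def run_sandboxed(code: str, timeout: float = 5.0) -> dict:
    """Run code in an isolated interpreter and capture its output."""
    try:
        result = subprocess.run(
            [sys.executable, "-I", "-c", RUNNER],  # -I: isolated mode
            input=code,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": "timed out"}
    return {
        "ok": result.returncode == 0,
        "stdout": result.stdout,
        "stderr": result.stderr,
    }


ok_result = run_sandboxed("print(sum(range(4)))")
blocked_result = run_sandboxed("import os")  # fails: __import__ is not allowed
print(ok_result["stdout"].strip(), blocked_result["ok"])
```

Because `__import__` is left out of the allow-list, any `import` statement in the submitted code raises `ImportError` in the child process, and the timeout covers infinite loops.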

## License

MIT