Spaces:
Sleeping
title: CodeSensei Environment
emoji: π§
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
license: mit
short_description: RL environment for teaching LLMs to debug Python code
CodeSensei
An RL environment built on OpenEnv that trains LLMs to fix buggy Python code. The model gets a broken function, proposes a fix, runs tests, and learns from the results β basically the same loop a developer goes through when debugging, but automated with reinforcement learning.
How it works
- The environment picks a buggy Python function from the dataset
- The LLM reads the code + failing test output
- It proposes a corrected version
- We run the tests in a sandboxed subprocess
- A multi-signal reward tells the model what went well (or didn't)
- Repeat for up to 6 attempts per bug
The reward isn't just pass/fail β it accounts for partial progress, syntax validity, code variety, and whether the model is actually improving or just submitting the same thing over and over.
Reward breakdown
| Signal | When | Value |
|---|---|---|
| All tests pass | Bug fully fixed | +2.0 |
| More tests pass than before | Making progress | +0.5 |
| No improvement over previous best | Stuck | -0.3 |
| Code crashes at runtime | Regression | -0.5 |
| Syntax error | Invalid Python | -1.0 |
| Duplicate submission | Same fix as before | -0.5 |
Project layout
βββ inference.py # main inference script (OpenEnv submission)
βββ openenv.yaml # environment spec
βββ Dockerfile
βββ requirements.txt
βββ env/
β βββ client.py # async client with from_docker_image()
β βββ models.py # Action, Observation, State dataclasses
β βββ data/
β β βββ bug_dataset.json # 10 bugs with test suites
β βββ server/
β βββ app.py # FastAPI β /reset, /step, /health, /ws
β βββ environment.py # core logic (reset/step/state)
β βββ sandbox.py # restricted code execution
β βββ test_runner.py # runs tests against proposed fixes
βββ server/
β βββ app.py # entry point for openenv validate
βββ training/
β βββ colab_train.py # GRPO training (Colab T4)
βββ demo/
βββ app.py # Gradio demo
Running locally
pip install -r requirements.txt
uvicorn env.server.app:app --host 0.0.0.0 --port 7860
Then hit POST /reset with {} to start an episode, and POST /step with your fix to iterate.
Inference
The inference script uses the OpenAI-compatible client pointed at HuggingFace's inference router. It connects to the environment via from_docker_image(), runs the debug loop, and logs everything in the required [START]/[STEP]/[END] format.
export HF_TOKEN="your_token"
python inference.py
Default model is Qwen/Qwen2.5-Coder-32B-Instruct (free via HF router). You can swap it by setting MODEL_NAME.
Training
Open training/colab_train.py in Google Colab with a T4 runtime. It uses GRPO from HuggingFace TRL with QLoRA (4-bit quantization + LoRA adapters) so the whole thing fits in 15GB VRAM. Checkpoints get pushed to HF Hub automatically.
API endpoints
| Method | Path | What it does |
|---|---|---|
| POST | /reset |
Start a new debugging episode |
| POST | /step |
Submit a proposed fix |
| GET | /state?session_id=X |
Get current episode state |
| GET | /health |
Health check |
| WS | /ws |
WebSocket interface |
Tech used
- Environment: FastAPI + OpenEnv protocol
- Training: TRL GRPO + QLoRA on Qwen2.5-Coder-32B-Instruct
- Inference: OpenAI Python client β HuggingFace router (free tier)
- Deployment: Docker on HF Spaces
- Security: Code execution in sandboxed subprocesses with restricted builtins
License
MIT