codesensei-env / README.md
vineetshukla.work@gmail.com
docs: rewrite README, clean up repo structure
f3f5cb0
metadata
title: CodeSensei Environment
emoji: 🧠
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
license: mit
short_description: RL environment for teaching LLMs to debug Python code

CodeSensei

An RL environment built on OpenEnv that trains LLMs to fix buggy Python code. The model gets a broken function, proposes a fix, runs tests, and learns from the results β€” basically the same loop a developer goes through when debugging, but automated with reinforcement learning.

How it works

  1. The environment picks a buggy Python function from the dataset
  2. The LLM reads the code + failing test output
  3. It proposes a corrected version
  4. We run the tests in a sandboxed subprocess
  5. A multi-signal reward tells the model what went well (or didn't)
  6. Repeat for up to 6 attempts per bug

The reward isn't just pass/fail β€” it accounts for partial progress, syntax validity, code variety, and whether the model is actually improving or just submitting the same thing over and over.

Reward breakdown

Signal When Value
All tests pass Bug fully fixed +2.0
More tests pass than before Making progress +0.5
No improvement over previous best Stuck -0.3
Code crashes at runtime Regression -0.5
Syntax error Invalid Python -1.0
Duplicate submission Same fix as before -0.5

Project layout

β”œβ”€β”€ inference.py             # main inference script (OpenEnv submission)
β”œβ”€β”€ openenv.yaml             # environment spec
β”œβ”€β”€ Dockerfile
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ env/
β”‚   β”œβ”€β”€ client.py            # async client with from_docker_image()
β”‚   β”œβ”€β”€ models.py            # Action, Observation, State dataclasses
β”‚   β”œβ”€β”€ data/
β”‚   β”‚   └── bug_dataset.json # 10 bugs with test suites
β”‚   └── server/
β”‚       β”œβ”€β”€ app.py           # FastAPI β€” /reset, /step, /health, /ws
β”‚       β”œβ”€β”€ environment.py   # core logic (reset/step/state)
β”‚       β”œβ”€β”€ sandbox.py       # restricted code execution
β”‚       └── test_runner.py   # runs tests against proposed fixes
β”œβ”€β”€ server/
β”‚   └── app.py               # entry point for openenv validate
β”œβ”€β”€ training/
β”‚   └── colab_train.py       # GRPO training (Colab T4)
└── demo/
    └── app.py               # Gradio demo

Running locally

pip install -r requirements.txt
uvicorn env.server.app:app --host 0.0.0.0 --port 7860

Then hit POST /reset with {} to start an episode, and POST /step with your fix to iterate.

Inference

The inference script uses the OpenAI-compatible client pointed at HuggingFace's inference router. It connects to the environment via from_docker_image(), runs the debug loop, and logs everything in the required [START]/[STEP]/[END] format.

export HF_TOKEN="your_token"
python inference.py

Default model is Qwen/Qwen2.5-Coder-32B-Instruct (free via HF router). You can swap it by setting MODEL_NAME.

Training

Open training/colab_train.py in Google Colab with a T4 runtime. It uses GRPO from HuggingFace TRL with QLoRA (4-bit quantization + LoRA adapters) so the whole thing fits in 15GB VRAM. Checkpoints get pushed to HF Hub automatically.

API endpoints

Method Path What it does
POST /reset Start a new debugging episode
POST /step Submit a proposed fix
GET /state?session_id=X Get current episode state
GET /health Health check
WS /ws WebSocket interface

Tech used

  • Environment: FastAPI + OpenEnv protocol
  • Training: TRL GRPO + QLoRA on Qwen2.5-Coder-32B-Instruct
  • Inference: OpenAI Python client β†’ HuggingFace router (free tier)
  • Deployment: Docker on HF Spaces
  • Security: Code execution in sandboxed subprocesses with restricted builtins

License

MIT