vineetshukla.work@gmail.com committed on
Commit · f3f5cb0
1 Parent(s): 1caebb9
docs: rewrite README, clean up repo structure
- .gitignore +1 -0
- README.md +63 -123
.gitignore CHANGED
@@ -9,3 +9,4 @@ build/
 *.log
 .DS_Store
 Thumbs.db
+codesensei_unwanted/
README.md
CHANGED
|
@@ -6,162 +6,102 @@ colorTo: blue
|
|
| 6 |
sdk: docker
|
| 7 |
app_port: 7860
|
| 8 |
license: mit
|
| 9 |
-
short_description:
|
| 10 |
---
|
| 11 |
|
| 12 |
-
#
|
| 13 |
|
| 14 |
-
|
| 15 |
|
| 16 |
-
|
| 17 |
-
[](https://huggingface.co/docs/trl)
|
| 18 |
-
[](https://huggingface.co/spaces)
|
| 19 |
-
[](LICENSE)
|
| 20 |
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
The LLM receives buggy Python functions, proposes fixes, and gets rewarded based on test results, learning to debug through trial and error.
|
| 28 |
|
| 29 |
-
|
| 30 |
|
| 31 |
-
|
| 32 |
-
- **4-Signal Reward System**: Correctness, progress, syntax, repetition
|
| 33 |
-
- **Sandboxed Execution**: LLM-generated code runs in restricted subprocesses
|
| 34 |
-
- **WebSocket First**: Designed for HF Spaces deployment
|
| 35 |
-
- **100% Free**: Colab T4 + HF Spaces free tier
|
| 36 |
-
- **Live Demo**: Gradio app with baseline vs fine-tuned comparison
|
| 37 |
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-│ GRPOTrainer → rollout_func()  ───┼────►│ FastAPI + CodeDebugEnv   │
-│ Qwen3-1.7B + vLLM                │     │ Sandbox + Test Runner    │
-│                                  │     │                          │
-│ ↓ push checkpoint every 5 steps  │     └──────────────────────────┘
-└──────────────┬───────────────────┘
-               │
-               ▼
- HF Hub (model + checkpoints)
-               │
-               ▼
-┌─────────────────────────────┐
-│ HF Space (codesensei-demo)  │
-│ Gradio: baseline vs GRPO    │
-└─────────────────────────────┘
|
| 60 |
-
```
|
| 61 |
-
|
| 62 |
-
---
|
| 63 |
|
| 64 |
-
##
|
| 65 |
|
| 66 |
```
-
-├──
-
-
 │   └── server/
-│       ├──
-│       ├──
-│       ├──
-│       └──
 ├── training/
-│   ├── colab_train.py
-
-
-├── Dockerfile              # HF Spaces deployment
-├── requirements.txt        # Server dependencies
-└── README.md
```
|
| 84 |
|
| 85 |
-
|
| 86 |
-
|
| 87 |
-
## Quick Start
|
| 88 |
-
|
| 89 |
-
### 1. Run Environment Locally
|
| 90 |
|
| 91 |
```bash
|
| 92 |
pip install -r requirements.txt
|
| 93 |
uvicorn env.server.app:app --host 0.0.0.0 --port 7860
|
| 94 |
```
|
| 95 |
|
| 96 |
-
|
| 97 |
|
| 98 |
-
|
| 99 |
-
# Push to HF Spaces (Docker-based)
|
| 100 |
-
huggingface-cli repo create codesensei-env --type space --space-sdk docker
|
| 101 |
-
git remote add hf https://huggingface.co/spaces/YOUR-USERNAME/codesensei-env
|
| 102 |
-
git push hf main
|
| 103 |
-
```
|
| 104 |
-
|
| 105 |
-
### 3. Train on Colab
|
| 106 |
-
|
| 107 |
-
1. Open `training/colab_train.py` in Google Colab
|
| 108 |
-
2. Set GPU runtime to T4
|
| 109 |
-
3. Update `CODESENSEI_ENV_URL` to your HF Space
|
| 110 |
-
4. Run all cells
|
| 111 |
-
5. If the session drops, re-run cell 10; it resumes from the checkpoint
|
| 112 |
|
| 113 |
-
|
| 114 |
|
| 115 |
```bash
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
python app.py
|
| 119 |
```
|
| 120 |
|
| 121 |
-
---
|
| 122 |
-
|
| 123 |
-
## Reward System
|
| 124 |
|
| 125 |
-
|
| 126 |
-
|---|---|---|---|
|
| 127 |
-
| Correctness | All tests pass | +2.0 | Primary goal |
|
| 128 |
-
| Progress | More tests pass than before | +0.5 | Incremental improvement |
|
| 129 |
-
| Stagnation | No improvement | -0.3 | Prevent plateaus |
|
| 130 |
-
| Runtime Error | Code crashes | -0.5 | Penalize regressions |
|
| 131 |
-
| Syntax Error | Invalid Python | -1.0 | Force valid output |
|
| 132 |
-
| Repetition | Same fix submitted | -0.5 | Force exploration |
|
| 133 |
|
| 134 |
-
-
|
| 135 |
|
| 136 |
-
##
|
| 137 |
|
| 138 |
-
|
|
| 139 |
|---|---|---|
|
| 140 |
-
|
|
| 141 |
-
|
|
| 142 |
-
|
|
| 143 |
-
|
|
| 144 |
-
|
|
| 145 |
-
| Demo | Gradio | Free |
|
| 146 |
-
| **Total** | | **$0** |
|
| 147 |
|
| 148 |
-
|
| 149 |
-
|
| 150 |
-
## Training Details
|
| 151 |
|
| 152 |
-
- **
|
| 153 |
-
- **
|
| 154 |
-
- **
|
| 155 |
-
- **
|
| 156 |
-
- **
|
| 157 |
-
- **Session Resilience:** Auto-resume from checkpoint on Colab crash
|
| 158 |
-
|
| 159 |
-
---
|
| 160 |
|
| 161 |
-
##
|
| 162 |
-
|
| 163 |
-
MIT License; see [LICENSE](LICENSE) for details.
|
| 164 |
-
|
| 165 |
-
---
|
| 166 |
|
| 167 |
-
|
|
|
|
| 6 |
sdk: docker
|
| 7 |
app_port: 7860
|
| 8 |
license: mit
|
| 9 |
+
short_description: RL environment for teaching LLMs to debug Python code
|
| 10 |
---
|
| 11 |
|
| 12 |
+
# CodeSensei
|
| 13 |
|
| 14 |
+
An RL environment built on OpenEnv that trains LLMs to fix buggy Python code. The model gets a broken function, proposes a fix, runs tests, and learns from the results. It is the same loop a developer goes through when debugging, automated with reinforcement learning.
|
| 15 |
|
| 16 |
+
## How it works
|
| 17 |
|
| 18 |
+
1. The environment picks a buggy Python function from the dataset
|
| 19 |
+
2. The LLM reads the code + failing test output
|
| 20 |
+
3. It proposes a corrected version
|
| 21 |
+
4. We run the tests in a sandboxed subprocess
|
| 22 |
+
5. A multi-signal reward tells the model what went well (or didn't)
|
| 23 |
+
6. Repeat for up to 6 attempts per bug
|
| 24 |
|
| 25 |
+
The reward isn't just pass/fail: it accounts for partial progress, syntax validity, code variety, and whether the model is actually improving or just resubmitting the same fix.
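The syntax-validity and repetition signals can be checked before any test runs. Here is a minimal sketch, assuming submissions are compared by their parsed AST so that whitespace-only or comment-only edits do not count as a new attempt. The names `fingerprint` and `classify_submission` are hypothetical and not taken from the repo, which may detect duplicates differently.

```python
import ast
import hashlib


def fingerprint(source: str) -> str:
    """Hash the parsed AST so formatting-only edits map to the same value."""
    tree = ast.parse(source)  # raises SyntaxError on invalid Python
    return hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()


def classify_submission(source: str, seen: set) -> str:
    """Return 'syntax_error', 'duplicate', or 'new'; record new fingerprints."""
    try:
        digest = fingerprint(source)
    except SyntaxError:
        return "syntax_error"
    if digest in seen:
        return "duplicate"
    seen.add(digest)
    return "new"
```

Keeping one `seen` set per episode would let the environment apply the duplicate penalty only within the attempts for a single bug.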
|
| 26 |
|
| 27 |
+
## Reward breakdown
|
| 28 |
|
| 29 |
+
| Signal | When | Value |
|
| 30 |
+
|---|---|---|
|
| 31 |
+
| All tests pass | Bug fully fixed | +2.0 |
|
| 32 |
+
| More tests pass than before | Making progress | +0.5 |
|
| 33 |
+
| No improvement over previous best | Stuck | -0.3 |
|
| 34 |
+
| Code crashes at runtime | Regression | -0.5 |
|
| 35 |
+
| Syntax error | Invalid Python | -1.0 |
|
| 36 |
+
| Duplicate submission | Same fix as before | -0.5 |
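The table can be read as a small decision function. The sketch below assumes the signals are mutually exclusive and checked in a fixed order (syntax, duplicate, crash, full pass, progress, stagnation). The README does not say whether the real environment combines them this way or sums several signals, so treat the precedence, `AttemptResult`, and `compute_reward` as illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class AttemptResult:
    syntax_ok: bool  # did the submission parse as Python?
    crashed: bool    # did it fail at runtime before the tests could finish?
    passed: int      # number of tests that passed
    total: int       # number of tests in the suite


def compute_reward(result: AttemptResult, best_passed: int, is_duplicate: bool) -> float:
    """Map one attempt to a scalar reward using the values from the table."""
    if not result.syntax_ok:
        return -1.0  # syntax error
    if is_duplicate:
        return -0.5  # same fix as before
    if result.crashed:
        return -0.5  # code crashes at runtime
    if result.passed == result.total:
        return 2.0   # all tests pass
    if result.passed > best_passed:
        return 0.5   # more tests pass than the previous best
    return -0.3      # no improvement over previous best
```

For example, an attempt that passes 3 of 4 tests after a previous best of 2 would score +0.5 under this ordering.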
|
| 37 |
|
| 38 |
+
## Project layout
|
| 39 |
|
| 40 |
```
+├── inference.py              # main inference script (OpenEnv submission)
+├── openenv.yaml              # environment spec
+├── Dockerfile
+├── requirements.txt
+├── env/
+│   ├── client.py             # async client with from_docker_image()
+│   ├── models.py             # Action, Observation, State dataclasses
+│   ├── data/
+│   │   └── bug_dataset.json  # 10 bugs with test suites
 │   └── server/
+│       ├── app.py            # FastAPI: /reset, /step, /health, /ws
+│       ├── environment.py    # core logic (reset/step/state)
+│       ├── sandbox.py        # restricted code execution
+│       └── test_runner.py    # runs tests against proposed fixes
+├── server/
+│   └── app.py                # entry point for openenv validate
 ├── training/
+│   └── colab_train.py        # GRPO training (Colab T4)
+└── demo/
+    └── app.py                # Gradio demo
```
|
| 62 |
|
| 63 |
+
## Running locally
|
| 64 |
|
| 65 |
```bash
|
| 66 |
pip install -r requirements.txt
|
| 67 |
uvicorn env.server.app:app --host 0.0.0.0 --port 7860
|
| 68 |
```
|
| 69 |
|
| 70 |
+
Then hit `POST /reset` with `{}` to start an episode, and `POST /step` with your fix to iterate.
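A minimal standard-library sketch of those two calls, assuming the server from the command above is listening on `localhost:7860`. The `/step` payload shape is not documented here, so the `"code"` field in the usage comment is a hypothetical placeholder; check `env/models.py` for the real Action fields.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # matches the uvicorn command above


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for the environment server."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_post(path, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example session (requires the server to be running):
#   observation = post_json("/reset", {})
#   result = post_json("/step", {"code": "def add(a, b):\n    return a + b\n"})
#   ("code" is a hypothetical field name, not confirmed by the README)
```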
|
| 71 |
|
| 72 |
+
## Inference
|
| 73 |
|
| 74 |
+
The inference script uses the OpenAI-compatible client pointed at HuggingFace's inference router. It connects to the environment via `from_docker_image()`, runs the debug loop, and logs everything in the required `[START]`/`[STEP]`/`[END]` format.
|
| 75 |
|
| 76 |
```bash
|
| 77 |
+
export HF_TOKEN="your_token"
|
| 78 |
+
python inference.py
|
| 79 |
```
|
| 80 |
|
| 81 |
+
Default model is `Qwen/Qwen2.5-Coder-32B-Instruct` (free via HF router). You can swap it by setting `MODEL_NAME`.
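One way the two environment variables might be read is sketched below. Only `HF_TOKEN`, `MODEL_NAME`, and the default model name come from the README; the helper `load_inference_settings` and its return shape are assumptions for illustration, not the actual contents of `inference.py`.

```python
import os

DEFAULT_MODEL = "Qwen/Qwen2.5-Coder-32B-Instruct"


def load_inference_settings(environ=None) -> dict:
    """Read the token and model name, falling back to the default model."""
    env = os.environ if environ is None else environ
    token = env.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set")
    return {
        "hf_token": token,
        "model_name": env.get("MODEL_NAME") or DEFAULT_MODEL,
    }
```

Passing a plain dict as `environ` makes the function easy to exercise without touching the real process environment.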
|
| 82 |
|
| 83 |
+
## Training
|
| 84 |
|
| 85 |
+
Open `training/colab_train.py` in Google Colab with a T4 runtime. It uses GRPO from HuggingFace TRL with QLoRA (4-bit quantization + LoRA adapters) so the whole thing fits in 15GB VRAM. Checkpoints get pushed to HF Hub automatically.
|
| 86 |
|
| 87 |
+
## API endpoints
|
| 88 |
|
| 89 |
+
| Method | Path | What it does |
|
| 90 |
|---|---|---|
|
| 91 |
+
| POST | `/reset` | Start a new debugging episode |
|
| 92 |
+
| POST | `/step` | Submit a proposed fix |
|
| 93 |
+
| GET | `/state?session_id=X` | Get current episode state |
|
| 94 |
+
| GET | `/health` | Health check |
|
| 95 |
+
| WS | `/ws` | WebSocket interface |
|
| 96 |
|
| 97 |
+
## Tech used
|
| 98 |
|
| 99 |
+
- **Environment:** FastAPI + OpenEnv protocol
|
| 100 |
+
- **Training:** TRL GRPO + QLoRA on Qwen2.5-Coder-32B-Instruct
|
| 101 |
+
- **Inference:** OpenAI Python client → HuggingFace router (free tier)
|
| 102 |
+
- **Deployment:** Docker on HF Spaces
|
| 103 |
+
- **Security:** Code execution in sandboxed subprocesses with restricted builtins
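To illustrate the last item, here is a minimal sketch of running untrusted code in a child process with a timeout and a reduced set of builtins. It is not the repo's `sandbox.py`: the allow-list, the `run_sandboxed` name, and the result shape are assumptions. Restricting builtins alone is not a strong security boundary, so a real deployment would likely pair it with OS-level isolation such as the Docker container.

```python
import subprocess
import sys

# Script executed in the child process: it reads candidate code from stdin
# and runs it with only a small allow-list of builtins available.
RUNNER = r'''
import builtins, sys
ALLOWED = ("abs", "all", "any", "bool", "dict", "enumerate", "float", "int",
           "len", "list", "max", "min", "print", "range", "sorted", "str",
           "sum", "tuple", "zip")
safe = {name: getattr(builtins, name) for name in ALLOWED}
exec(compile(sys.stdin.read(), "<candidate>", "exec"), {"__builtins__": safe})
'''


def run_sandboxed(source: str, timeout_s: float = 5.0) -> dict:
    """Run `source` in an isolated interpreter and capture its output."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", RUNNER],  # -I: isolated mode
            input=source,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": "timeout"}
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}
```

Because `__import__` and `open` are absent from the allow-list, `import os` or `open("x")` in the candidate code fails inside the child process, and an infinite loop is cut off by the timeout.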
|
| 104 |
|
| 105 |
+
## License
|
| 106 |
|
| 107 |
+
MIT
|