vineetshukla.work@gmail.com committed on
Commit f3f5cb0 · 1 Parent(s): 1caebb9

docs: rewrite README, clean up repo structure

Files changed (2)
  1. .gitignore +1 -0
  2. README.md +63 -123
.gitignore CHANGED
@@ -9,3 +9,4 @@ build/
  *.log
  .DS_Store
  Thumbs.db
+ codesensei_unwanted/
README.md CHANGED
@@ -6,162 +6,102 @@ colorTo: blue
  sdk: docker
  app_port: 7860
  license: mit
- short_description: GRPO-trained LLM code debugging environment (OpenEnv)
  ---
 
- # 🧠 CodeSensei — GRPO-Trained Code Debugger
 
- > **Teaching an LLM to think like a debugger through Reinforcement Learning.**
 
- [![OpenEnv](https://img.shields.io/badge/Built%20with-OpenEnv-blue)](https://github.com/meta-pytorch/OpenEnv)
- [![TRL](https://img.shields.io/badge/Training-TRL%20GRPO-green)](https://huggingface.co/docs/trl)
- [![HF Spaces](https://img.shields.io/badge/Deploy-HF%20Spaces-yellow)](https://huggingface.co/spaces)
- [![License](https://img.shields.io/badge/License-MIT-purple)](LICENSE)
 
- ---
-
- ## 🎯 What is CodeSensei?
-
- CodeSensei is a **custom OpenEnv RL environment** that teaches a language model to debug Python code using **GRPO (Group Relative Policy Optimization)** from HuggingFace TRL.
-
- The LLM receives buggy Python functions, proposes fixes, and gets rewarded based on test results — learning to debug through trial and error.
 
- ### ✨ Key Features
 
- - 🏗️ **Custom OpenEnv Integration** — Full 3-method environment (`reset`, `step`, `state`)
- - 🎯 **4-Signal Reward System** — Correctness, progress, syntax, repetition
- - 🔒 **Sandboxed Execution** — LLM-generated code runs in restricted subprocesses
- - 🌐 **WebSocket First** — Designed for HF Spaces deployment
- - 💰 **100% Free** — Colab T4 + HF Spaces free tier
- - 📊 **Live Demo** — Gradio app with baseline vs fine-tuned comparison
 
- ---
-
- ## 🏗️ Architecture
-
- ```
- ┌──────────────────────────────────┐     ┌────────────────────────────┐
- │  Google Colab (Free T4 GPU)      │     │  HF Space (codesensei-env) │
- │                                  │ WS  │                            │
- │  GRPOTrainer → rollout_func() ───┼────►│  FastAPI + CodeDebugEnv    │
- │  Qwen3-1.7B + vLLM               │     │  Sandbox + Test Runner     │
- │                                  │     │                            │
- │  → push checkpoint every 5 steps │     └────────────────────────────┘
- └──────────────┬───────────────────┘
-                │
-                ▼
-      🤗 HF Hub (model + checkpoints)
-                │
-                ▼
- ┌─────────────────────────────┐
- │  HF Space (codesensei-demo) │
- │  Gradio: baseline vs GRPO   │
- └─────────────────────────────┘
- ```
-
- ---
 
- ## 📁 Project Structure
 
  ```
- codesensei/
- ├── env/                        # OpenEnv Environment
- │   ├── models.py               # Typed Action/Observation/State
- │   ├── client.py               # WebSocket client
  │   └── server/
- │       ├── environment.py      # Core reset/step/state logic
- │       ├── sandbox.py          # Restricted Python execution
- │       ├── test_runner.py      # Test evaluation
- │       └── app.py              # FastAPI server
  ├── training/
- │   └── colab_train.py          # GRPO training notebook
- ├── demo/
- │   └── app.py                  # Gradio comparison demo
- ├── Dockerfile                  # HF Spaces deployment
- ├── requirements.txt            # Server dependencies
- └── README.md
  ```
 
- ---
-
- ## 🚀 Quick Start
-
- ### 1. Run Environment Locally
 
  ```bash
  pip install -r requirements.txt
  uvicorn env.server.app:app --host 0.0.0.0 --port 7860
  ```
 
- ### 2. Deploy to HF Spaces
 
- ```bash
- # Push to HF Spaces (Docker-based)
- huggingface-cli repo create codesensei-env --type space --space-sdk docker
- git remote add hf https://huggingface.co/spaces/YOUR-USERNAME/codesensei-env
- git push hf main
- ```
-
- ### 3. Train on Colab
-
- 1. Open `training/colab_train.py` in Google Colab
- 2. Set GPU runtime → T4
- 3. Update `CODESENSEI_ENV_URL` to your HF Space
- 4. Run all cells
- 5. If session drops → re-run cell 10, it resumes from checkpoint
 
- ### 4. Run Demo
 
  ```bash
- cd demo
- pip install -r requirements.txt
- python app.py
  ```
 
- ---
-
- ## 🎯 Reward System
 
- | Signal | Condition | Value | Purpose |
- |---|---|---|---|
- | Correctness | All tests pass | +2.0 | Primary goal |
- | Progress | More tests pass than before | +0.5 | Incremental improvement |
- | Stagnation | No improvement | -0.3 | Prevent plateaus |
- | Runtime Error | Code crashes | -0.5 | Penalize regressions |
- | Syntax Error | Invalid Python | -1.0 | Force valid output |
- | Repetition | Same fix submitted | -0.5 | Force exploration |
 
- ---
 
- ## 🛠️ Tech Stack
 
- | Component | Technology | Cost |
  |---|---|---|
- | Environment | OpenEnv + FastAPI | Free |
- | Training | TRL + GRPO + vLLM | Free |
- | GPU | Google Colab T4 | Free |
- | Model | Qwen3-1.7B | Free |
- | Deployment | HF Spaces | Free |
- | Demo | Gradio | Free |
- | **Total** | | **$0** |
 
- ---
-
- ## 📈 Training Details
 
- - **Model:** Qwen/Qwen3-1.7B
- - **Algorithm:** GRPO (Group Relative Policy Optimization)
- - **Dataset:** 500 buggy Python functions
- - **Max Attempts:** 6 per episode
- - **Checkpoint:** Every 5 steps → pushed to HF Hub
- - **Session Resilience:** Auto-resume from checkpoint on Colab crash
-
- ---
 
- ## 📄 License
-
- MIT License — see [LICENSE](LICENSE) for details.
-
- ---
 
- Built for the **OpenEnv Hackathon** 🏆
 
  sdk: docker
  app_port: 7860
  license: mit
+ short_description: RL environment for teaching LLMs to debug Python code
  ---
 
+ # CodeSensei
 
+ An RL environment built on OpenEnv that trains LLMs to fix buggy Python code. The model gets a broken function, proposes a fix, runs tests, and learns from the results — basically the same loop a developer goes through when debugging, but automated with reinforcement learning.
 
+ ## How it works
 
+ 1. The environment picks a buggy Python function from the dataset
+ 2. The LLM reads the code + failing test output
+ 3. It proposes a corrected version
+ 4. We run the tests in a sandboxed subprocess
+ 5. A multi-signal reward tells the model what went well (or didn't)
+ 6. Repeat for up to 6 attempts per bug
 
+ The reward isn't just pass/fail — it accounts for partial progress, syntax validity, code variety, and whether the model is actually improving or just submitting the same thing over and over.
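
Review note: the six steps in the new "How it works" section describe a bounded propose-and-test loop. The sketch below illustrates only that control flow. The names `run_episode`, `propose_fix`, and `run_tests` are hypothetical and do not come from the repository, and the toy test runner uses a plain `exec` instead of the sandboxed subprocess the README describes.

```python
MAX_ATTEMPTS = 6  # from the README: up to 6 attempts per bug


def run_episode(buggy_code, propose_fix, run_tests, max_attempts=MAX_ATTEMPTS):
    """Loop propose -> test until all tests pass or attempts run out.

    Returns (final_code, attempts_used, solved).
    """
    code = buggy_code
    # Initial test run gives the model the failing output to read (step 2).
    passed, total, feedback = run_tests(code)
    for attempt in range(1, max_attempts + 1):
        code = propose_fix(code, feedback)         # step 3
        passed, total, feedback = run_tests(code)  # step 4
        if passed == total:
            return code, attempt, True
    return code, max_attempts, False


def toy_run_tests(code):
    """Hypothetical stand-in for the test runner. NOT sandboxed."""
    namespace = {}
    try:
        exec(code, namespace)
        results = [namespace["add"](1, 2) == 3, namespace["add"](0, 0) == 0]
    except Exception as exc:
        return 0, 2, f"error: {exc}"
    return sum(results), len(results), f"{sum(results)}/{len(results)} tests passed"


def toy_propose_fix(code, feedback):
    """Hypothetical stand-in for the LLM: a hard-coded one-line patch."""
    return code.replace("a - b", "a + b")


fixed, attempts, solved = run_episode(
    "def add(a, b):\n    return a - b\n", toy_propose_fix, toy_run_tests
)
```

In the real environment the reward from step 5 would be computed inside the loop after each test run; it is left out here to keep the control flow visible.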
 
+ ## Reward breakdown
 
+ | Signal | When | Value |
+ |---|---|---|
+ | All tests pass | Bug fully fixed | +2.0 |
+ | More tests pass than before | Making progress | +0.5 |
+ | No improvement over previous best | Stuck | -0.3 |
+ | Code crashes at runtime | Regression | -0.5 |
+ | Syntax error | Invalid Python | -1.0 |
+ | Duplicate submission | Same fix as before | -0.5 |
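
Review note: the reward table does not say whether signals stack or are mutually exclusive. The sketch below assumes exactly one signal fires per step and that checks run in the order duplicate, syntax, crash, success, progress, stagnation. Both the ordering and the function names are assumptions for illustration, not the logic in `environment.py`.

```python
import ast


def is_valid_python(source: str) -> bool:
    """Return True if the source parses as Python."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def compute_reward(source, *, crashed, passed, total, best_passed, seen):
    """Map one step's outcome to a single reward value from the table.

    seen        -- set of previously submitted sources (duplicate check)
    best_passed -- highest number of passing tests in earlier attempts
    """
    if source in seen:
        return -0.5  # duplicate submission
    if not is_valid_python(source):
        return -1.0  # syntax error
    if crashed:
        return -0.5  # runtime crash
    if total > 0 and passed == total:
        return 2.0   # all tests pass
    if passed > best_passed:
        return 0.5   # more tests pass than before
    return -0.3      # no improvement over previous best


example = compute_reward(
    "def add(a, b):\n    return a + b\n",
    crashed=False, passed=2, total=3, best_passed=1, seen=set(),
)
```

If the actual implementation sums several signals per step, the early returns would become additions to a running total instead.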
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+ ## Project layout
 
  ```
+ ├── inference.py                # main inference script (OpenEnv submission)
+ ├── openenv.yaml                # environment spec
+ ├── Dockerfile
+ ├── requirements.txt
+ ├── env/
+ │   ├── client.py               # async client with from_docker_image()
+ │   ├── models.py               # Action, Observation, State dataclasses
+ │   ├── data/
+ │   │   └── bug_dataset.json    # 10 bugs with test suites
  │   └── server/
+ │       ├── app.py              # FastAPI — /reset, /step, /health, /ws
+ │       ├── environment.py      # core logic (reset/step/state)
+ │       ├── sandbox.py          # restricted code execution
+ │       └── test_runner.py      # runs tests against proposed fixes
+ ├── server/
+ │   └── app.py                  # entry point for openenv validate
  ├── training/
+ │   └── colab_train.py          # GRPO training (Colab T4)
+ └── demo/
+     └── app.py                  # Gradio demo
  ```
 
+ ## Running locally
 
  ```bash
  pip install -r requirements.txt
  uvicorn env.server.app:app --host 0.0.0.0 --port 7860
  ```
 
+ Then hit `POST /reset` with `{}` to start an episode, and `POST /step` with your fix to iterate.
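
Review note: the README does not show the request bodies, so a short client sketch may help readers try the two calls. The base URL follows the `uvicorn` command above. The `/step` payload fields `session_id` and `proposed_fix` are hypothetical placeholders; the real field names would need to be checked against `env/models.py`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # matches the uvicorn command above


def build_post(path, payload, base_url=BASE_URL):
    """Build a JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        base_url + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path, payload, base_url=BASE_URL, timeout=30):
    """Send the request and decode the JSON response (needs a running server)."""
    request = build_post(path, payload, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Requests are only built here, so the sketch runs without a server.
reset_request = build_post("/reset", {})
step_request = build_post(
    "/step",
    {
        "session_id": "hypothetical-id",  # assumed field name
        "proposed_fix": "def add(a, b):\n    return a + b\n",  # assumed field name
    },
)

# With the server running:
#   observation = post_json("/reset", {})
```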
 
+ ## Inference
 
+ The inference script uses the OpenAI-compatible client pointed at HuggingFace's inference router. It connects to the environment via `from_docker_image()`, runs the debug loop, and logs everything in the required `[START]`/`[STEP]`/`[END]` format.
 
  ```bash
+ export HF_TOKEN="your_token"
+ python inference.py
  ```
 
+ Default model is `Qwen/Qwen2.5-Coder-32B-Instruct` (free via HF router). You can swap it by setting `MODEL_NAME`.
 
 
 
+ ## Training
 
+ Open `training/colab_train.py` in Google Colab with a T4 runtime. It uses GRPO from HuggingFace TRL with QLoRA (4-bit quantization + LoRA adapters) so the whole thing fits in 15GB VRAM. Checkpoints get pushed to HF Hub automatically.
 
+ ## API endpoints
 
+ | Method | Path | What it does |
  |---|---|---|
+ | POST | `/reset` | Start a new debugging episode |
+ | POST | `/step` | Submit a proposed fix |
+ | GET | `/state?session_id=X` | Get current episode state |
+ | GET | `/health` | Health check |
+ | WS | `/ws` | WebSocket interface |
 
 
 
+ ## Tech used
 
+ - **Environment:** FastAPI + OpenEnv protocol
+ - **Training:** TRL GRPO + QLoRA on Qwen2.5-Coder-32B-Instruct
+ - **Inference:** OpenAI Python client → HuggingFace router (free tier)
+ - **Deployment:** Docker on HF Spaces
+ - **Security:** Code execution in sandboxed subprocesses with restricted builtins
 
+ ## License
 
+ MIT